Delete duplicate lines in a text file. Can it be done?
A simple question, maybe, but is it simple to do?
Example of content in file:
John Doe
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe
--Get text from selected file
set this_file to (choose file) as text
set this_text to read file this_file
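set new_text to ""
set seen_lines to {}
--Loop through paragraphs of old text
repeat with myPara in paragraphs of this_text
--Work with the paragraph's text rather than a reference to it
set this_line to contents of myPara
--Compare whole lines against a list of lines already kept;
--a "contains" test on new_text itself would also match substrings (e.g. "John" inside "John Doe")
if seen_lines does not contain this_line then
set end of seen_lines to this_line
set new_text to new_text & this_line & return
end if
end repeat
--Remove final return (WriteToFile adds one back when writing)
set new_text to text 1 thru -2 of new_text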
WriteToFile(this_file, new_text, true)
--handler to write text to text file with flag to clear text before
on WriteToFile(theFile, theText, clearFLag)
try
open for access file theFile with write permission
end try
if clearFLag is true then
set eof of file theFile to 0
end if
write theText & return to file theFile starting at eof
try
close access file theFile
end try
end WriteToFile
This looks extremely useful, particularly if set up as a “do shell script” command. The man page for “sort” shows a powerful set of switches too (-n is appealing for numeric sorts and -d for dictionary order, -b to ignore leading white space is useful, -M is neat, -r is great).
Now the questions:
Can in_file.txt be an AppleScript variable, assuming that the contents, whatever they were (a list, a string, the contents of a document), have been coerced into text delimited by returns or line feeds? And which of the two delimiters is required?
Will "set myVar to do shell script … " with no output file specified return the standard output to the script?
I’m interested because I’ve been using this code to refine a list of random numbers and it seems I could do it much more efficiently with a switch of delimiters and a unix sort.
on Uniquify(num_list)
(* Sorts the list, then compares adjacent items for
duplicates. If a duplicate is found, a new number
is generated and tested for duplication in the list.
Once a number is found that does not duplicate any of the
original list elements, it is inserted in place of the duplicate.
Note: only the last duplicate found is replaced, and the
list is not sorted again afterwards. *)
set sortedList to UpSort(num_list)
set theDup to 0
repeat with j from 1 to ((count of sortedList) - 1)
if (item j of sortedList) = (item (j + 1) of sortedList) then
set theDup to j
end if
end repeat
if theDup is not equal to 0 then
set noMatch to false
repeat until noMatch = true
set newNumber to random number from 1 to 49
set noMatch to test_matches(sortedList, pad(newNumber))
end repeat
if noMatch is true then
set idx to theDup as integer
set item idx of sortedList to pad(newNumber)
return sortedList
end if
else
return sortedList
end if
end Uniquify
on UpSort(my_list)
set the index_list to {}
set the sorted_list to {}
repeat (the number of items in my_list) times
set the low_item to ""
repeat with i from 1 to (number of items in my_list)
if i is not in the index_list then
set this_item to item i of my_list
if the low_item is "" then
set the low_item to this_item
set the low_item_index to i
else if this_item comes before the low_item then
set the low_item to this_item
set the low_item_index to i
end if
end if
end repeat
set the end of sorted_list to the low_item
set the end of the index_list to the low_item_index
end repeat
return the sorted_list
end UpSort
on test_matches(theList, theTestItem)
set matchCounter to 0
repeat with k from 1 to number of items in theList
if item k of theList is theTestItem then set matchCounter to matchCounter + 1
end repeat
if matchCounter > 0 then
return false
else
return true
end if
end test_matches
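On the two “do shell script” questions above, here is a minimal sketch rather than a definitive answer. It assumes the data starts out as an AppleScript list, that AppleScript 2.0 or later is available (for the linefeed constant), and that piping through printf is an acceptable way to hand the text to sort; the variable names are made up for illustration. The Unix tools expect line feeds, and “do shell script” hands back the command's standard output as its result.

```applescript
-- Hypothetical sample data; any list of strings would do
set my_lines to {"John Doe", "Jessica Rabbit", "John Doe"}

-- Join the list with line feeds, since sort splits its input on LF, not CR
set AppleScript's text item delimiters to linefeed
set in_text to my_lines as text
set AppleScript's text item delimiters to ""

-- "quoted form of" protects the text from the shell.
-- With no output file given, the command's standard output
-- becomes the result of do shell script.
set out_text to do shell script "printf '%s\\n' " & quoted form of in_text & " | sort -u"

-- By default do shell script converts line endings in the result
-- to returns; "paragraphs" splits on either kind of line ending
set unique_lines to paragraphs of out_text
```

Note that sort -u returns the lines in sorted order, so the original order of first occurrences is lost; if order matters, a pure AppleScript loop is the safer choice.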
The Unix sort isn’t Unicode-aware, so its usefulness in AppleScript is limited in practice. That’s less of an issue for what the OP’s doing; it could fail if the input was un-normalised Unicode data, though that’s probably unlikely. See AppleMods’ List library for general-purpose AS sort commands.
[snip] Not sure what that code’s supposed to do, but it’s a pretty inefficient algorithm whatever it is. Maybe if you explain its purpose a more efficient version could be suggested.
Thanks for the link to the SourceForge mods - a complete eye-opener. I was completely unaware of their existence; they are a well-kept secret in the usual AppleScript literature. I note that the AppleMods FAQ (1.1, second paragraph) gives you all the credit for starting it as an open source resource. I’ll read on to discover how to use them.
Thanks for this too - the definitive word on “do script” construction - I’m not very good at searching out these references yet.
Acknowledged. Those handlers were written (and bits plagiarized) about two weeks after I set about learning AppleScript, and since they actually work, I’ve never changed them. I was making a list of 6 random numbers and then checking that no two were the same, replacing any duplicate with a new random pick. Now, although I don’t need to, I’ll use that problem as a framework for learning to use AppleMods.
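For the task as described (6 random numbers, no two the same), one simpler approach is to reject a duplicate at the moment it is drawn, so no sorting or later replacement is needed. The following is only a sketch: the handler name pick_unique and its parameters are hypothetical, and it returns plain integers rather than the padded strings the original handlers used.

```applescript
-- Sketch: collect distinct random integers by discarding repeats as they are drawn
on pick_unique(how_many, low_num, high_num)
	-- Guard against asking for more distinct values than the range holds,
	-- which would otherwise loop forever
	if how_many > (high_num - low_num + 1) then error "Range too small for that many unique numbers."
	set picks to {}
	repeat until (count of picks) = how_many
		set n to random number from low_num to high_num
		-- Only keep a number that has not been picked already
		if picks does not contain n then set end of picks to n
	end repeat
	return picks
end pick_unique

-- Six distinct numbers between 1 and 49, in the order drawn
pick_unique(6, 1, 49)
```

With only 6 picks out of 49, repeats are rare, so the loop normally finishes in very few extra draws.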
set this_text to "John Doe
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe"
on remove_duplicates(this_text)
--Accept either text or a list; text is split into paragraphs first
set not_list to class of this_text is not list
if not_list then set this_text to paragraphs of this_text
set new_text to {}
--Keep only the first occurrence of each line
repeat with this_line in this_text
if this_line is not in new_text then set end of new_text to (contents of this_line)
end repeat
--If text was passed in, join the unique lines back into return-delimited text
if not_list then
set text item delimiters to return
tell new_text to set new_text to beginning & ({""} & rest)
set text item delimiters to ""
end if
return new_text
end remove_duplicates
get remove_duplicates(this_text)
Should be able to do lists too.
Thanks to kai for his line.