Delete duplicate lines in a text file

Abisilla · August 17, 2005, 12:05pm

Delete duplicate lines in a text file. Can it be done?

A simple question maybe but is it simple to do…?

Example of content in file:
John Doe
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe

The new content:
John Doe
Jessica Rabbit

hhas · August 17, 2005, 1:40pm

If line order doesn’t need to be preserved:

sort -u -o out_file.txt in_file.txt
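
A minimal sketch of both cases, assuming a shortened version of the sample names is saved as in_file.txt in a scratch directory. The other file names and the awk alternative are illustrative assumptions, not part of the suggestion above. sort -u gives the unique lines in sorted order; the awk one-liner keeps the first occurrence of each line in its original order:

```shell
# Work in a throwaway directory so no real files are touched.
cd "$(mktemp -d)"

# Recreate a shortened version of the sample input (illustrative file name).
printf '%s\n' 'John Doe' 'John Doe' 'Jessica Rabbit' 'John Doe' 'Jessica Rabbit' > in_file.txt

# Unique lines, sorted: the order of first appearance is lost.
sort -u -o sorted_unique.txt in_file.txt

# Unique lines, original order kept: print a line only the first time it is seen.
awk '!seen[$0]++' in_file.txt > ordered_unique.txt

cat sorted_unique.txt
cat ordered_unique.txt
```

With this input the sorted result lists Jessica Rabbit first, while the awk result keeps John Doe first, matching the order asked for in the question.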

John_M · August 17, 2005, 1:49pm

Hi Abisilla,

Try

--Get text from selected file
set this_file to (choose file) as text
set this_text to read file this_file
set new_lines to {}

--Loop through paragraphs of old text
repeat with myPara in paragraphs of this_text
	--Compare whole lines against a list of kept lines. Testing against
	--one long string would also match substrings, so "John" would be
	--dropped once "John Doe" had been kept.
	set thisLine to contents of myPara
	if new_lines does not contain thisLine then set end of new_lines to thisLine
end repeat

--Join the unique lines with returns
set AppleScript's text item delimiters to return
set new_text to new_lines as text
set AppleScript's text item delimiters to ""
WriteToFile(this_file, new_text, true)

--handler to write text to text file with flag to clear text before
on WriteToFile(theFile, theText, clearFlag)
	try
		open for access file theFile with write permission
	end try
	if clearFlag is true then
		set eof of file theFile to 0
	end if
	write theText & return to file theFile starting at eof
	try
		close access file theFile
	end try
end WriteToFile

Best wishes

John M

Adam_Bell · August 17, 2005, 2:17pm

This looks extremely useful, particularly if set up as a “do shell script” command. The man page for “sort” shows a powerful set of switches too (-d for dictionary order and -n for numerical sorts are appealing, -b to ignore leading white space is useful, -M is neat, -r is great).

Now the questions:

Can in_file.txt be an AppleScript variable, assuming that the contents, whatever they were (a list, a string, contents of a document), have been coerced into text delimited by returns or line feeds? And which of the two delimiters is required?
Will "set myVar to do shell script …" with no output file specified return the standard output to the script?

I’m interested because I’ve been using this code to refine a list of random numbers and it seems I could do it much more efficiently with a switch of delimiters and a unix sort.

on Uniquify(num_list)
    
    (* Sorts the list, then compares adjacent items for
    duplicates. If a duplicate is found, a new number
    is generated and tested for duplication in the list.
    If the new number is not a duplicate of any of the original
    list elements, the new number is inserted in place of
    the duplicate and the list is returned. *)
    
    set sortedList to UpSort(num_list)
    set theDup to 0
    repeat with j from 1 to ((count of sortedList) - 1)
        if (item j of sortedList) = (item (j + 1) of sortedList) then
            set theDup to j
        end if
    end repeat
    if theDup is not equal to 0 then
        set noMatch to false
        repeat until noMatch = true
            set newNumber to random number from 1 to 49
            set noMatch to test_matches(sortedList, pad(newNumber))
        end repeat
        if noMatch is true then
            set idx to theDup as integer
            set item idx of sortedList to pad(newNumber)
            return sortedList
        end if
    else
        return sortedList
    end if
end Uniquify

on UpSort(my_list)
    set the index_list to {}
    set the sorted_list to {}
    repeat (the number of items in my_list) times
        set the low_item to ""
        repeat with i from 1 to (number of items in my_list)
            if i is not in the index_list then
                set this_item to item i of my_list
                if the low_item is "" then
                    set the low_item to this_item
                    set the low_item_index to i
                else if this_item comes before the low_item then
                    set the low_item to this_item
                    set the low_item_index to i
                end if
            end if
        end repeat
        set the end of sorted_list to the low_item
        set the end of the index_list to the low_item_index
    end repeat
    return the sorted_list
end UpSort

on test_matches(theList, theTestItem)
    set matchCounter to 0
    repeat with k from 1 to number of items in theList
        if item k of theList is theTestItem then set matchCounter to matchCounter + 1
    end repeat
    if matchCounter > 0 then
        return false
    else
        return true
    end if
end test_matches

hhas · August 17, 2005, 4:18pm

The Unix sort isn’t Unicode-aware, so its usefulness in AppleScript is limited in practice. That’s less of an issue for what the OP’s doing; it could fail if the input were un-normalised Unicode data, though that’s probably unlikely. See AppleMods’ List library for general-purpose AS sort commands.

http://developer.apple.com/technotes/tn2002/tn2065.html
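
On the shell side, a small sketch of the two points raised (the sample text and variable names are made up for illustration). sort reads standard input when no file is named and writes to standard output when -o is omitted, and that standard output is what do shell script hands back to the script. sort splits lines on line feeds only, so return-delimited text has to be converted first, here with tr:

```shell
# Text delimited by carriage returns, as classic Mac text often is (made-up sample).
cr_text=$(printf 'pear\rapple\rpear\rfig')

# sort only splits on line feeds, so translate CR to LF before piping the text in.
# With no input file sort reads stdin; with no -o it writes to stdout.
unique_lines=$(printf '%s\n' "$cr_text" | tr '\r' '\n' | sort -u)

printf '%s\n' "$unique_lines"
```

By default do shell script converts the line feeds in its result back to returns; the technote describes the "altering line endings" parameter that controls this.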

[snip] Not sure what that code’s supposed to do, but it’s a pretty inefficient algorithm whatever it is. Maybe if you explain its purpose a more efficient version could be suggested.

Abisilla · August 17, 2005, 5:34pm

Nice! It works really well!

But is it possible to make it work without the dialogs?

Adam_Bell · August 17, 2005, 5:56pm

Thanks for the link to the SourceForge mods - a complete eye-opener. I was completely unaware of their existence - in fact, they’re a well-kept secret in the usual AppleScript literature. I note that the AppleMods FAQ (1.1, second paragraph) gives you all the credit for starting it as an open source resource. I’ll read on to discover how to use them.

Thanks for this too - the definitive word on “do shell script” construction - I’m not very good at searching out these references yet.

Acknowledged. Those handlers were written (and bits plagiarized) about two weeks after I set about learning AppleScript, and since they actually work, I’ve never changed them. I was making a list of 6 random numbers and then checking that no two were the same, replacing any duplicate with a new random pick. Now, although I don’t need to, I’ll use that problem as a framework for learning to use AppleMods.
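
For comparison, a rough sketch of that same task done on the shell side. This is one possible approach, not something taken from the thread; the 1 to 49 range comes from the handler above and the variable names are made up. Each number is checked as it is drawn, so duplicates never enter the list and no repair pass is needed:

```shell
# Draw 6 distinct numbers from 1 to 49, then sort them numerically.
picks=$(awk 'BEGIN {
    srand()                          # seed from the current time
    while (n < 6) {
        x = int(rand() * 49) + 1     # integer in 1..49
        if (!(x in seen)) {          # skip anything already drawn
            seen[x] = 1
            n++
            print x
        }
    }
}' | sort -n)

printf '%s\n' "$picks"
```

The result is LF-delimited text, so from AppleScript it could be split back into a list with paragraphs of the do shell script result.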

John_M · August 17, 2005, 7:11pm

Which dialogs? If you want the file path to be fixed, replace the choose file line with:

set this_file to "Disk:Path:to:file.txt"

The path needs to be colon-delimited, because the script reads it with "read file". You can get the exact path for your file by looking at the result of:

(choose file) as text

Are you getting other dialogs?

John M

Qwerty_Denzel · August 18, 2005, 5:45am

Nothing original here:

set this_text to "John Doe
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe
John Doe
Jessica Rabbit
Jessica Rabbit
Jessica Rabbit
John Doe"

on remove_duplicates(this_text)
	-- Accept either text or a list; remember which so the same kind is returned
	set not_list to class of this_text is not list
	if not_list then set this_text to paragraphs of this_text
	set new_text to {}
	repeat with this_line in this_text
		-- Keep a line only the first time it appears
		if this_line is not in new_text then set end of new_text to (contents of this_line)
	end repeat
	if not_list then
		-- Join the unique lines back into return-delimited text
		set AppleScript's text item delimiters to return
		tell new_text to set new_text to beginning & ({""} & rest)
		set AppleScript's text item delimiters to ""
	end if
	return new_text
end remove_duplicates

get remove_duplicates(this_text)

Should be able to do lists too.
Thanks to kai for his line.

Abisilla · August 18, 2005, 6:44am

Sorry, I meant the choose file dialog.

But your suggestion did the trick!!! Hallelujah!!!
