Extract list of unique lines from text file

Hello guys,

I have a small number of lines of text in text documents.

Each file contains either only email addresses or only order numbers.

What I would like to do is check whether a file contains only email addresses and, if so, sort them alphabetically and remove the duplicates, so that I end up with a list of unique email addresses.

Any advice on how this can be done?

Hi, epaminos.
Using two ready-to-use handlers from Shane Stanley, your task is easy:


use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set theText to "0001 some@gmail.com
12345 kniazidis.rompert@gmail.com
235 someoneOther@gmail.com
403"
-- OR:
-- set textFile to choose file of type "txt"
-- set theText to read textFile as text

-- Extract email addresses
set emailList to paragraphs of (my findEmailAddressesIn:theText)
if emailList is {} then
	display dialog "The text doesn't contain any email address"
	return
end if
-- Sort the email list alphabetically
set sortedAlphabeticalEmailList to my sortListOfStrings:emailList
-- Remove duplicates:
-- the handler findEmailAddressesIn: already eliminates duplicates,
-- so no additional step is needed

on findEmailAddressesIn:theString
	-- Locate all the "links" in the text.
	set theNSDataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theString to current application's NSString's stringWithString:theString
	set theURLsNSArray to theNSDataDetector's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	-- Extract email links
	set emailPredicate to current application's NSPredicate's predicateWithFormat:"self.URL.scheme == 'mailto'"
	set emailURLs to theURLsNSArray's filteredArrayUsingPredicate:emailPredicate
	-- Get just the addresses
	set emailsArray to emailURLs's valueForKeyPath:"URL.resourceSpecifier"
	--eliminate duplicates
	set emailsArray to (current application's NSSet's setWithArray:emailsArray)'s allObjects()
	-- Join the remainder as a single, return-delimited text
	set emailAddresses to emailsArray's componentsJoinedByString:(return)
	-- Return as AppleScript text
	return emailAddresses as text
end findEmailAddressesIn:

on sortListOfStrings:theList
	-- convert list to Cocoa array
	set theArray to current application's NSArray's arrayWithArray:theList
	-- sort the array using a specific function
	set theArray to ¬
		theArray's sortedArrayUsingSelector:"localizedStandardCompare:"
	-- return the sorted array as an AppleScript list
	return theArray as list
end sortListOfStrings:

Hello hacker!

Thank you once more for your instant reply!

Well, that is awesome! My code was much poorer:


	set {aList, aSet} to {input, {}}
	
	ignoring case
		repeat with i from 1 to count aList
			set anItem to item i of aList
			if (anItem is in aSet) then
				-- do nothing
			else
				set end of aSet to anItem
			end if
		end repeat
	end ignoring
	
	set aSet to simple_sort(aSet)
	set end of aSet to "
"
	return aSet
	

on simple_sort(my_list)
	set the index_list to {}
	set the sorted_list to {}
	repeat (the number of items in my_list) times
		set the low_item to ""
		repeat with i from 1 to (number of items in my_list)
			if i is not in the index_list then
				set this_item to item i of my_list as text
				if the low_item is "" then
					set the low_item to this_item
					set the low_item_index to i
				else if this_item comes before the low_item then
					set the low_item to this_item
					set the low_item_index to i
				end if
			end if
		end repeat
		set the end of sorted_list to the low_item
		set the end of the index_list to the low_item_index
	end repeat
	return the sorted_list
end simple_sort

PS: Where I struggled: if the input contains only email addresses, just sort them alphabetically; but if the input is mixed text, with email addresses and other text as well, extract the email addresses first and sort only those.

I don’t see any reason to skip the extraction step. In any case, to determine whether the text contains only emails, you have to extract the emails first.
You could, of course, split theText into text items using the delimiters {space, linefeed} and check each item to see whether it is an email address, but that requires a repeat loop and additional operations. It would only make the script slower, so I personally see no use in that approach. Or am I misunderstanding you, and there is some benefit?

So, for fun, I’ll add a check that determines the file content:


use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

set theText to "0001 some@gmail.com
12345 kniazidis.rompert@gmail.com
235 someoneOther@gmail.com
403"
-- OR:
-- set textFile to choose file of type "txt"
-- set theText to read textFile as text

set ATID to AppleScript's text item delimiters
set AppleScript's text item delimiters to {space, return, linefeed}
set theTextItems to text items of theText
set AppleScript's text item delimiters to ATID

-- Extract email addresses
set emailList to paragraphs of (my findEmailAddressesIn:theText)
if emailList is {} then
	display dialog "The text doesn't contain any email address"
	return
end if

if (count emailList) is (count theTextItems) then
	display dialog "The text has only email addresses"
else
	display dialog "The text has mixed stuff"
end if

-- Sort the email list alphabetically
set sortedAlphabeticalEmailList to my sortListOfStrings:emailList
-- Remove duplicates:
-- the handler findEmailAddressesIn: already eliminates duplicates,
-- so no additional step is needed

on findEmailAddressesIn:theString
	-- Locate all the "links" in the text.
	set theNSDataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theString to current application's NSString's stringWithString:theString
	set theURLsNSArray to theNSDataDetector's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	-- Extract email links
	set emailPredicate to current application's NSPredicate's predicateWithFormat:"self.URL.scheme == 'mailto'"
	set emailURLs to theURLsNSArray's filteredArrayUsingPredicate:emailPredicate
	-- Get just the addresses
	set emailsArray to emailURLs's valueForKeyPath:"URL.resourceSpecifier"
	--eliminate duplicates
	set emailsArray to (current application's NSSet's setWithArray:emailsArray)'s allObjects()
	-- Join the remainder as a single, return-delimited text
	set emailAddresses to emailsArray's componentsJoinedByString:(return)
	-- Return as AppleScript text
	return emailAddresses as text
end findEmailAddressesIn:

on sortListOfStrings:theList
	-- convert list to Cocoa array
	set theArray to current application's NSArray's arrayWithArray:theList
	-- sort the array using a specific function
	set theArray to ¬
		theArray's sortedArrayUsingSelector:"localizedStandardCompare:"
	-- return the sorted array as an AppleScript list
	return theArray as list
end sortListOfStrings:
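
One caveat about the count comparison above: findEmailAddressesIn: removes duplicates, and a trailing line break produces an empty text item, so a text that holds only email addresses can still be reported as mixed. Here is a minimal sketch of an alternative check, assuming the addresses are separated by whitespace. The handler name textHasOnlyEmails and the sample data are hypothetical and not part of Shane Stanley's handlers.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample: only addresses, with a duplicate and a trailing line break
set sampleText to "some@gmail.com
some@gmail.com
someoneOther@gmail.com
"
-- In the real script this list would come from findEmailAddressesIn:
set sampleEmails to {"some@gmail.com", "someoneOther@gmail.com"}
set onlyEmails to my textHasOnlyEmails(sampleText, sampleEmails)

-- Hypothetical helper: true when the set of non-empty, whitespace-separated
-- tokens of theText equals the set of extracted addresses.
-- The comparison is case-sensitive.
on textHasOnlyEmails(theText, emailList)
	set theString to current application's NSString's stringWithString:theText
	set whiteSpace to current application's NSCharacterSet's whitespaceAndNewlineCharacterSet()
	set theTokens to theString's componentsSeparatedByCharactersInSet:whiteSpace
	-- A set ignores duplicates; drop the empty strings left by repeated separators
	set tokenSet to current application's NSMutableSet's setWithArray:theTokens
	tokenSet's removeObject:""
	if (tokenSet's |count|()) is 0 then return false
	set emailSet to current application's NSSet's setWithArray:emailList
	return (tokenSet's isEqualToSet:emailSet) as boolean
end textHasOnlyEmails
```

Comparing sets rather than counts means repeated addresses and blank lines no longer affect the result.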

To be honest, I used Automator’s “Extract email addresses” action together with my aforementioned script.

So although your script works great in Script Editor, and thank you so much for it, I still have to figure out why it gives me an error in Automator.

But in general this is something very specific that will help me a lot at my job, which is why I would like to automate it better.

To be more specific:

  • I have part of a large text (around 300-400 lines) in daily text files (which are created from emails).
  • I open a specific day’s file, copy the text to the clipboard and run my script, which extracts the text of selected paragraphs (those that start with “orders”, plus their contents, which are the 10 lines underneath).
  • Then it extracts the email addresses from these selected paragraphs.
  • Finally it removes the duplicates, leaving me with a list of unique email addresses, one per line and without quotes, that I use to contact the customers about their orders.
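
The paragraph-selection step in the list above could look roughly like this. It is only a sketch under assumptions: the handler name extractOrderBlocks, its linesAfter parameter and the sample text are hypothetical, and the real files may need a stricter test than "starts with".

```applescript
-- Hypothetical sample: an "orders" heading followed by its contents
set sampleText to "header line
orders 17
alice@example.com
some other line"

set orderText to my extractOrderBlocks(sampleText, 10)

-- Hypothetical helper: keeps every paragraph that starts with "orders"
-- (AppleScript ignores case here by default) plus up to linesAfter
-- paragraphs below it. Overlapping blocks repeat some lines, which is
-- harmless when duplicates are removed later.
on extractOrderBlocks(theText, linesAfter)
	set theLines to paragraphs of theText
	set lineCount to count theLines
	set keptLines to {}
	repeat with i from 1 to lineCount
		if (item i of theLines) starts with "orders" then
			set lastIndex to i + linesAfter
			if lastIndex > lineCount then set lastIndex to lineCount
			set keptLines to keptLines & (items i thru lastIndex of theLines)
		end if
	end repeat
	-- Join the kept lines back into linefeed-delimited text
	set ATID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to linefeed
	set resultText to keptLines as text
	set AppleScript's text item delimiters to ATID
	return resultText
end extractOrderBlocks
```

The returned text can then be passed to an email-extraction handler such as findEmailAddressesIn:.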

But on Mondays I have lists of email addresses from customers who placed orders during the weekend.
So I have one list of email addresses from Saturday and a different list from Sunday. Unfortunately, many of the addresses are duplicates.

This is why I was dreaming of having one script…

Open Saturday’s file, copy the whole text to the clipboard, run the script, get the first list.
Open Sunday’s file, copy everything to the clipboard, run the script, get the second list.

Then, if I run the same script with clipboard input that contains only email addresses (the two or more lists combined), it should merge them into one list of unique addresses.

And this list can cover up to a week or a fortnight, so I hope you understand how much time it can save.
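
Merging several days' lists into one sorted list of unique addresses could be sketched as follows. The handler name mergeUniqueSorted: and the sample lists are hypothetical. It reuses the NSSet and localizedStandardCompare: approach from the handlers earlier in the thread, and the de-duplication is case-sensitive.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample data standing in for Saturday's and Sunday's lists
set saturdayList to {"bob@example.com", "alice@example.com"}
set sundayList to {"alice@example.com", "carol@example.com"}

set mergedList to my mergeUniqueSorted:(saturdayList & sundayList)

-- Join the result as one address per line, without quotes
set ATID to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set mergedText to mergedList as text
set AppleScript's text item delimiters to ATID

-- Hypothetical helper: removes duplicates with NSSet, then sorts
on mergeUniqueSorted:theList
	set theSet to current application's NSSet's setWithArray:theList
	set theArray to (theSet's allObjects())'s sortedArrayUsingSelector:"localizedStandardCompare:"
	return theArray as list
end mergeUniqueSorted:
```

Because the combined lists contain only addresses, no extraction step is needed here; feeding the combined text to findEmailAddressesIn: would also work.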

Thank you again for the help and please feel free to contact me through PM if you would like some more clarifications which I cannot post here.

My previous script, as it should look when used in an Automator service (Quick Action):


use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

on run {input, parameters}
	
	set theText to "0001 some@gmail.com
12345 kniazidis.rompert@gmail.com
235 someoneOther@gmail.com
403"
	-- OR:
	-- set textFile to choose file of type "txt"
	-- set theText to read textFile as text
	
	set ATID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to {space, return, linefeed}
	set theTextItems to text items of theText
	set AppleScript's text item delimiters to ATID
	
	-- Extract email addresses
	set emailList to paragraphs of (my findEmailAddressesIn:theText)
	if emailList is {} then
		display dialog "The text doesn't contain any email address"
		return
	end if
	
	if (count emailList) is (count theTextItems) then
		display dialog "The text has only email addresses"
	else
		display dialog "The text has mixed stuff"
	end if
	
	-- Sort the email list alphabetically
	set sortedAlphabeticalEmailList to my sortListOfStrings:emailList
	-- Remove duplicates:
	-- the handler findEmailAddressesIn: already eliminates duplicates,
	-- so no additional step is needed
	
	-- Pass the sorted list on to the next Automator action
	return sortedAlphabeticalEmailList
end run


on findEmailAddressesIn:theString
	-- Locate all the "links" in the text.
	set theNSDataDetector to current application's NSDataDetector's dataDetectorWithTypes:(current application's NSTextCheckingTypeLink) |error|:(missing value)
	set theString to current application's NSString's stringWithString:theString
	set theURLsNSArray to theNSDataDetector's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	-- Extract email links
	set emailPredicate to current application's NSPredicate's predicateWithFormat:"self.URL.scheme == 'mailto'"
	set emailURLs to theURLsNSArray's filteredArrayUsingPredicate:emailPredicate
	-- Get just the addresses
	set emailsArray to emailURLs's valueForKeyPath:"URL.resourceSpecifier"
	--eliminate duplicates
	set emailsArray to (current application's NSSet's setWithArray:emailsArray)'s allObjects()
	-- Join the remainder as a single, return-delimited text
	set emailAddresses to emailsArray's componentsJoinedByString:(return)
	-- Return as AppleScript text
	return emailAddresses as text
end findEmailAddressesIn:

on sortListOfStrings:theList
	-- convert list to Cocoa array
	set theArray to current application's NSArray's arrayWithArray:theList
	-- sort the array using a specific function
	set theArray to ¬
		theArray's sortedArrayUsingSelector:"localizedStandardCompare:"
	-- return the sorted array as an AppleScript list
	return theArray as list
end sortListOfStrings:

NOTE: you can get the text from the clipboard. Change


set theText to "0001 some@gmail.com
12345 kniazidis.rompert@gmail.com
235 someoneOther@gmail.com
403"

to


set theText to (the clipboard as text)

Brilliant! Thank you sir!