AppleScript for extracting email addresses from the content / body

Hi Everyone,

On my gallant search for my holy grail I have found myself here :) I have spent nearly an entire day looking for what I need and can’t seem to find the answer.

I use Mail.app on my iMac running Mountain Lion and need to extract all email addresses in all mailboxes. I am trying to tidy up my database and network to be able to group clients into very specific and targeted fields.

So far, I have a script which allows me to extract email addresses in the ‘to’, ‘cc’, ‘bcc’ fields, but what I need is to add to this AppleScript some scripting which looks through the rest of the email and extracts all other email addresses too. I have various bits of script which will do this, but the results are put into the ‘results’ section of the editor.

I need to have the 2 scripts together and also for all the data to be dropped into a .txt file.

Below is what I currently have.

-- Merge Two Scripts Here

tell application "Mail"
	set selectionMessage to selection -- just select the first message in the folder
	set thisMessage to item 1 of selectionMessage
	set theseMessages to (every message in (mailbox of thisMessage))
	set listOfEmails to {}
	
	-- End of Original set and beginning of new set
	
	repeat with eachMessage in theseMessages
		try
			set theFrom to (extract address from sender of eachMessage)
			if listOfEmails does not contain theFrom then
				copy theFrom to the end of listOfEmails
			end if
			
			-- To field Extract
			
			if (address of to recipient) of eachMessage is not {} then
				repeat with i from 1 to count of to recipient of eachMessage
					set theTo to (address of to recipient i) of eachMessage as string
					if listOfEmails does not contain theTo then
						copy theTo to the end of listOfEmails
					end if
					
				end repeat
			end if
			
			-- BCC Extract
			
			if (address of bcc recipient) of eachMessage is not {} then
				repeat with i from 1 to count of bcc recipient of eachMessage
					set thebcc to (address of bcc recipient i) of eachMessage as string
					if listOfEmails does not contain thebcc then
						copy thebcc to the end of listOfEmails
					end if
					
				end repeat
			end if
			
			-- CC Extract
			
			if (address of cc recipient) of eachMessage is not {} then
				repeat with i from 1 to count of cc recipient of eachMessage
					set theCC to (address of cc recipient i) of eachMessage as string
					if listOfEmails does not contain theCC then
						copy theCC to the end of listOfEmails
					end if
				end repeat
			end if
			
			-- Body Extract
			
			if (address of bcc recipient) of eachMessage is not {} then
				repeat with i from 1 to count of bcc recipient of eachMessage
					set thebcc to (address of bcc recipient i) of eachMessage as string
					if listOfEmails does not contain thebcc then
						copy thebcc to the end of listOfEmails
					end if
					
				end repeat
			end if
			
		end try
	end repeat
end tell


tell application "Finder" to set ptd to path to documents folder as string
set theFile to ptd & "extracted.txt"
set theFileID to open for access theFile with write permission
set SortedListOfEmails to simple_sort(listOfEmails)
repeat with i from 1 to count of SortedListOfEmails
	write item i of SortedListOfEmails & return to theFileID as «class utf8»
end repeat
close access theFileID

on simple_sort(my_list)
	set the index_list to {}
	set the sorted_list to {}
	repeat (the number of items in my_list) times
		set the low_item to ""
		repeat with i from 1 to (number of items in my_list)
			if i is not in the index_list then
				set this_item to item i of my_list as text
				if the low_item is "" then
					set the low_item to this_item
					set the low_item_index to i
				else if this_item comes before the low_item then
					set the low_item to this_item
					set the low_item_index to i
				end if
			end if
		end repeat
		set the end of sorted_list to the low_item
		set the end of the index_list to the low_item_index
	end repeat
	return the sorted_list
end simple_sort

Please can you help? Anyone??

Thanks

John

Hi John. Welcome to MacScripter.

Firstly, a couple of points about your existing script. ‘(address of to recipient)’ should be ‘(address of to recipients)’, i.e. ‘recipients’ in the plural. Similarly with the ‘cc recipients’ and ‘bcc recipients’. These references get you the information you want anyway, so there’s no point in then laboriously counting the message’s recipients and extracting the e-mail address from each in turn.

I’ve knocked together a shell script to extract every e-mail address from a body text. It’s pretty crude, but should work in most cases. However, if the text of the message is very long, it’ll be necessary to use another method to get it into the shell script.

tell application "Mail"
	set selectionMessage to selection -- just select the first message in the folder
	set thisMessage to item 1 of selectionMessage
	set theseMessages to (every message in (mailbox of thisMessage))
	set listOfEmails to {}
	
	-- End of Original set and beginning of new set
	
	repeat with eachMessage in theseMessages
		set theFrom to (extract address from sender of eachMessage)
		if listOfEmails does not contain theFrom then
			set the end of listOfEmails to theFrom
		end if
		
		-- To field Extract
		
		set toAddresses to address of to recipients of eachMessage
		repeat with i from 1 to (count toAddresses) -- The repeat won't happen if toAddresses is empty
			set theTo to item i of toAddresses
			if listOfEmails does not contain theTo then
				set the end of listOfEmails to theTo
			end if
			
		end repeat
		
		-- BCC Extract
		
		set bccAddresses to address of bcc recipients of eachMessage
		repeat with i from 1 to (count bccAddresses)
			set thebcc to item i of bccAddresses
			if listOfEmails does not contain thebcc then
				set the end of listOfEmails to thebcc
			end if
			
		end repeat
		
		-- CC Extract
		
		set ccAddresses to address of cc recipients of eachMessage
		repeat with i from 1 to (count ccAddresses)
			set theCC to item i of ccAddresses
			if listOfEmails does not contain theCC then
				set the end of listOfEmails to theCC
			end if
			
		end repeat
		
		-- Body Extract
		
		set bodyText to content of eachMessage
		try
			set bodyAddresses to paragraphs of (do shell script "<<<" & quoted form of bodyText & " grep -Eo '[[:alnum:]][^[:space:]<>@\":;]+@[^ <>\"]+[][:alpha:]]'")
			repeat with i from 1 to (count bodyAddresses)
				set thisAddress to item i of bodyAddresses
				if (listOfEmails does not contain thisAddress) then
					set end of listOfEmails to thisAddress
				end if
			end repeat
		end try
	end repeat
end tell

set SortedListOfEmails to simple_sort(listOfEmails)
set ptd to path to documents folder as string
set theFile to ptd & "extracted.txt"
set theFileID to open for access theFile with write permission
try
	set eof theFileID to 0 -- Empty the file first in case it's left over from an earlier run
	repeat with i from 1 to count of SortedListOfEmails
		write item i of SortedListOfEmails & return to theFileID as «class utf8»
	end repeat
end try
close access theFileID

on simple_sort(my_list)
	set the index_list to {}
	set the sorted_list to {}
	repeat (the number of items in my_list) times
		set the low_item to ""
		repeat with i from 1 to (number of items in my_list)
			if i is not in the index_list then
				set this_item to item i of my_list as text
				if the low_item is "" then
					set the low_item to this_item
					set the low_item_index to i
				else if this_item comes before the low_item then
					set the low_item to this_item
					set the low_item_index to i
				end if
			end if
		end repeat
		set the end of sorted_list to the low_item
		set the end of the index_list to the low_item_index
	end repeat
	return the sorted_list
end simple_sort
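
For message bodies too long to pass on the shell command line, one possible alternative is to write the text to a temporary file and let grep read that instead. The following is only a rough sketch: the handler name addressesInText and the scratch file name are made up for illustration, and the grep pattern is the same one used in the script above.

```applescript
-- Hypothetical helper: returns a list of the e-mail addresses found in someText.
-- The text reaches grep through a temporary file rather than the command line.
on addressesInText(someText)
	-- Assumed scratch file name in the user's temporary items folder.
	set tempPath to (path to temporary items as text) & "body_for_grep.txt"
	set fileRef to open for access file tempPath with write permission
	try
		set eof fileRef to 0 -- Discard anything left from a previous call.
		write someText to fileRef as «class utf8»
		close access fileRef
	on error errMsg number errNum
		close access fileRef
		error errMsg number errNum
	end try
	try
		-- Same pattern as before, but grep now reads the file.
		return paragraphs of (do shell script "grep -Eo '[[:alnum:]][^[:space:]<>@\":;]+@[^ <>\"]+[][:alpha:]]' " & quoted form of POSIX path of tempPath)
	on error
		-- grep exits with a non-zero status when it finds no match.
		return {}
	end try
end addressesInText
```

Inside the Mail 'tell' block, the 'do shell script' line would then become something like 'set bodyAddresses to my addressesInText(bodyText)'.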

If you have a very full mailbox and need a little extra speed, you could extract all the data from Mail in one go and sort through it by vanilla means:

tell application "Mail"
	set selectionMessage to selection -- just select the first message in the folder
	set thisMessage to item 1 of selectionMessage
	set {allSenders, allTos, allBCCs, allCCs, allBodyTexts} to {sender, address of to recipients, address of bcc recipients, address of cc recipients, content} of every message in mailbox of thisMessage
end tell

set listOfEmails to {}

-- End of Original set and beginning of new set

repeat with i from 1 to (count allSenders)
	
	tell application "Mail" to set theFrom to (extract address from item i of allSenders)
	if listOfEmails does not contain theFrom then
		set end of listOfEmails to theFrom
	end if
	
	-- To field Extract
	
	set toAddresses to item i of allTos
	repeat with j from 1 to (count toAddresses) -- The repeat won't happen if toAddresses is empty
		set theTo to item j of toAddresses
		if listOfEmails does not contain theTo then
			set the end of listOfEmails to theTo
		end if
		
	end repeat
	
	-- BCC Extract
	
	set bccAddresses to item i of allBCCs
	repeat with j from 1 to (count bccAddresses)
		set thebcc to item j of bccAddresses
		if listOfEmails does not contain thebcc then
			set the end of listOfEmails to thebcc
		end if
		
	end repeat
	
	-- CC Extract
	
	set ccAddresses to item i of allCCs
	repeat with j from 1 to (count ccAddresses)
		set theCC to item j of ccAddresses
		if listOfEmails does not contain theCC then
			set the end of listOfEmails to theCC
		end if
		
	end repeat
	
	-- Body Extract
	
	set bodyText to item i of allBodyTexts
	try
		set bodyAddresses to paragraphs of (do shell script "<<<" & quoted form of bodyText & " grep -Eo '[[:alnum:]][^[:space:]<>@\":;]+@[^ <>\"]+[][:alpha:]]'")
		repeat with j from 1 to (count bodyAddresses)
			set thisAddress to item j of bodyAddresses
			if (listOfEmails does not contain thisAddress) then
				set end of listOfEmails to thisAddress
			end if
		end repeat
	end try
end repeat

-- Rest of the script (sorting and writing to the file) as in the previous script.

Hey Nigel,

Awesome script. It is what I need. Now how do I make the following adjustments to it:

  1. I want to save it as a csv file in the following format:

From:
To:
Cc:
Bcc:
Date & Time:
Subject:

  2. If I have a common email address domain, for instance @xyz.gov.in, is there a way to search for just those messages and save all those email addresses in the above format?

Thanks!!!

LOL. Write an entirely new script which:

  1. Extracts not only the e-mail addresses of the messages’ senders and recipients (but probably not any addresses in the body texts), but also the times sent/received, the subjects, and optionally the senders’ and recipients’ display names, where these exist. It’s not clear if these data will be required just for selected messages or for all messages in the same mailbox as the selection, as in the original script.

  2. Doesn’t weed out duplicate e-mail addresses, sort the remainder lexically, and store them one-per-line with returns for line endings; but collates all the extracted data for each message in CSV format, apparently with six records per message, each record having two fields: a header and a value. CSV records are of course separated by linefeeds or return-linefeed pairs.

Compose all the CSV data only for messages where any of the sender/recipient e-mail addresses has a particular domain? It’s undoubtedly doable. You’d have to decide if this was to be the normal modus operandi and whether the domain was to be fixed/default/askable-for.
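
As a starting point, the requirements above might be sketched roughly as follows. Everything here rests on assumptions the thread leaves open: only the selected messages are processed, the domain is fixed in a property, multiple addresses in one field are joined with "; ", the date sent is used for the date and time, and the output goes to a file called "extracted.csv". The handler names are invented for the example.

```applescript
-- Sketch only: the domain, the file name and the handler names are assumptions.
property targetDomain : "@xyz.gov.in"

set csvText to ""
tell application "Mail"
	set theMessages to selection -- Assumption: process only the selected messages.
	repeat with thisMessage in theMessages
		set theFrom to (extract address from sender of thisMessage)
		set toList to address of to recipients of thisMessage
		set ccList to address of cc recipients of thisMessage
		set bccList to address of bcc recipients of thisMessage
		-- Keep the message only if one of its addresses is in the wanted domain.
		if my anyHasDomain({theFrom} & toList & ccList & bccList, targetDomain) then
			set theDate to (date sent of thisMessage) as text
			set theSubject to subject of thisMessage
			set theRows to {{"From:", theFrom}, {"To:", my joinList(toList, "; ")}, {"Cc:", my joinList(ccList, "; ")}, {"Bcc:", my joinList(bccList, "; ")}, {"Date & Time:", theDate}, {"Subject:", theSubject}}
			repeat with aRow in theRows
				set csvText to csvText & my csvRecord(item 1 of aRow, item 2 of aRow)
			end repeat
		end if
	end repeat
end tell

set theFile to (path to documents folder as text) & "extracted.csv"
set theFileID to open for access file theFile with write permission
try
	set eof theFileID to 0 -- Start with an empty file.
	write csvText to theFileID as «class utf8»
end try
close access theFileID

-- True if any address in the list ends with the given domain.
on anyHasDomain(addressList, theDomain)
	repeat with anAddress in addressList
		if (contents of anAddress) ends with theDomain then return true
	end repeat
	return false
end anyHasDomain

-- Joins the items of a list into one text with the given delimiter.
on joinList(theList, theDelimiter)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set joinedText to theList as text
	set AppleScript's text item delimiters to astid
	return joinedText
end joinList

-- Wraps a value in double quotes, doubling any quotes it contains.
on csvField(theText)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "\""
	set theParts to text items of theText
	set AppleScript's text item delimiters to "\"\""
	set escapedText to theParts as text
	set AppleScript's text item delimiters to astid
	return "\"" & escapedText & "\""
end csvField

-- One two-field CSV record (header, value) ending with a linefeed.
on csvRecord(theHeader, theValue)
	return csvField(theHeader) & "," & csvField(theValue) & linefeed
end csvRecord
```

To handle every message in the mailbox rather than just the selection, the 'set theMessages' line could use 'every message in (mailbox of item 1 of selection)' as in the earlier scripts. If the domain should be asked for each time, a 'display dialog' with a default answer could replace the property.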