Speeding up parsing of large tab-delimited files

I’m a relative newbie when it comes to AppleScript, but I am very eager to learn. I have a script that works fine and does what I want, which is simply to extract a small segment of a tab-delimited file.

Some of the files I am working with have about 180 items per record and thousands of records. Out of all that information, the only thing I really care about is the 8th item in each record, but I can’t figure out whether there is a way to ignore the rest of the file and thereby speed the process up.

Any help would be greatly appreciated.

set theFile to choose file with prompt "Choose a tab-delimited text file" of type {"TEXT", "TXT"}
open for access theFile
set theData to paragraphs of (read theFile)
close access theFile

set theBigList to {}

set text item delimiters to tab
repeat with i from 1 to count of theData
	set theLine to text items of item i of theData
	copy theLine to the end of theBigList
end repeat
set text item delimiters to ""

set theTitles to ""
repeat with i from 1 to count of theBigList
	if item 8 of item i of theBigList is not "" then
		set theTitles to theTitles & item 8 of item i of theBigList & return
	end if
end repeat

-- return theTitles -- disabled: returning here would end the script before the file below is written

tell application "Finder"
	set theFilePath to ("Rohan:Users:tmernst:Desktop:" as string) & "missing image" & ".txt" as string
	set theFileReference to open for access theFilePath with write permission
	write ((items of theTitles as string) & return) to theFileReference starting at eof
	close access theFileReference
end tell

The tab-delimited file looks something like this… I care about the bolded field only:

Personally, I would use the command line.

choose file with prompt "Choose a tab-delimited text file" of type {"TEXT", "TXT"} without invisibles

do shell script "/usr/bin/cut -f 8 " & quoted form of POSIX path of result & " >> " & quoted form of POSIX path of ((path to desktop as Unicode text) & "missing image.txt")
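One thing to watch for with this approach: cut splits records on linefeeds, so a file saved with classic Mac carriage returns looks like one long line to it. Here is a rough sketch of a pipeline that converts the line endings first and also drops empty results. The function name extract_field8 and the sample data are made up for illustration, and I am only assuming your files might have CR line endings.

```shell
#!/bin/sh
# Hypothetical helper: print the non-empty 8th tab-separated field of each record.
extract_field8() {
    # tr turns carriage returns into linefeeds so cut sees one record per line.
    # cut -s skips lines that contain no tab at all.
    # grep -v '^$' drops records whose 8th field is empty or missing.
    tr '\r' '\n' < "$1" | cut -s -f 8 | grep -v '^$'
}

# Demo with a throwaway sample file that uses CR line endings (assumed data).
sample=$(mktemp)
printf 'a\tb\tc\td\te\tf\tg\tTitle One\ti\r' > "$sample"
printf 'a\tb\tc\td\te\tf\tg\t\ti\r' >> "$sample"
printf 'a\tb\tc\td\te\tf\tg\tTitle Two\ti\r' >> "$sample"
extract_field8 "$sample"
rm -f "$sample"
```

From AppleScript, the same pipeline could be passed to do shell script with the chosen file's quoted POSIX path in place of "$1".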

This should also work:

choose file with prompt "Choose a tab-delimited text file" of type {"TEXT", "TXT"} without invisibles
set theseLines to paragraphs of (read result)

set titles to {}
set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to tab

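repeat with thisLine in theseLines
	-- Skip lines with fewer than 8 fields (e.g. a trailing empty line)
	if (count text items of thisLine) >= 8 then
		set thisTitle to text item 8 of thisLine
		if thisTitle is not "" then set end of titles to thisTitle
	end if
end repeat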

set AppleScript's text item delimiters to ASCII character 10 -- newline
set titles to "" & titles
set AppleScript's text item delimiters to ASTID

writeFile from titles into ((path to desktop as Unicode text) & "missing image.txt") with append


on writeFile from someData into someFile given append:append
	try
		open for access someFile with write permission
		set fileRef to result
		if not append then set eof of fileRef to 0
		
		write someData to fileRef starting at eof
		close access fileRef
		return true
	on error errMsg number errNum
		try
			close access fileRef
		end try
		
		error errMsg number errNum
	end try
end writeFile

You don’t have to open and close a file if you just need to read it once. (If you do open a file, you should use error handling to try to ensure that the file is closed in the event of an error.)

copy creates an actual copy of the data in memory, which makes it slower than the set command. You should only use AppleScript’s copy command when you want to explicitly avoid data sharing.

When you build up a string this way, the whole of theTitles has to be handled in memory on every concatenation, which gets slower as the string grows.

I prefer to put items into a list (the entire list does not have to be read when referencing the end of it) and then use AppleScript’s text item delimiters to place the newlines.

write, read, and close access are part of StandardAdditions; they don’t need to be inside a Finder tell block.

Bruce

Thanks for all the suggestions. I’ll give them a shot tonight.

Todd

This seems like a good idea, but does it accomplish the whole task, or just the importing of the file? I tried it several different ways and always ended up with blank output.

This one definitely worked for what I was trying to do, and it is definitely faster than my attempt. The one thing I was unable to figure out: when I compiled the script in Script Editor, I had no problems. I then tried to incorporate it into an Automator action as an AppleScript, and it wouldn’t compile because of the “append” statement. Does anyone know why this might be happening?

Thanks again.

Todd

Now that I tried it with some numbers missing, I see what you mean.

It seems that append is a command in Automator… Well, we can avoid that problem by changing my script to use appending instead.

on run {input, parameters}
	choose file with prompt "Choose a tab-delimited text file" of type {"TEXT", "TXT"} without invisibles
	set theseLines to paragraphs of (read result)
	
	set titles to {}
	set ASTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to tab
	
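	repeat with thisLine in theseLines
		-- Skip lines with fewer than 8 fields (e.g. a trailing empty line)
		if (count text items of thisLine) >= 8 then
			set thisTitle to text item 8 of thisLine
			if thisTitle is not "" then set end of titles to thisTitle
		end if
	end repeat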
	
	set AppleScript's text item delimiters to ASCII character 10 -- newline
	set titles to "" & titles
	set AppleScript's text item delimiters to ASTID
	
	writeFile from titles into ((path to desktop as Unicode text) & "missing image.txt") with appending
end run

on writeFile from someData into someFile given appending:appending
	try
		open for access someFile with write permission
		set fileRef to result
		if not appending then set eof of fileRef to 0
		
		write someData to fileRef starting at eof
		close access fileRef
		return true
	on error errMsg number errNum
		try
			close access fileRef
		end try
		
		error errMsg number errNum
	end try
end writeFile

Also, here is a different command line version:

choose file with prompt "Choose a tab-delimited text file" of type {"TEXT", "TXT"} without invisibles

do shell script "/usr/bin/ruby -e '
$stdin.each_line {|line| title = line.chomp.split(\"\\t\")[7]; puts title unless title.nil? || title.empty? }
' < " & quoted form of POSIX path of result & " >> " & quoted form of POSIX path of ((path to desktop as Unicode text) & "missing image.txt")