Parsing words with special characters

Kevin_Bradley · July 6, 2006, 6:47am

I’ve got an odd problem. I have a text file that is a listing (much like a directory listing). The list contains 3 columns of data separated by a variable number of spaces.

I can read the input lines fine using “read using delimiters {ascii character 10}” but when I ask for “words of” the first item in the list, it reads the @ sign as a separate word, the / as a word separator (meaning it disappears from the result) and the words with the - as one word. HUH? I’ve tried “considering/ignoring punctuation” without success.

Here’s the code:

set theFile to "Brain:Users:nitewing:dtext.txt" as alias
set theText to getFromFile(theFile)
set AppleScript's text item delimiters to "*"  -- let's peek at what we have
--first record
set myitem to contents of item 1 of theText
display dialog (words of myitem) as text
--last record
set myitem to contents of item 1 of reverse of theText
display dialog (words of myitem) as text

to getFromFile(theFile)
    try
        -- Does the file exist?
        open for access (theFile as alias)
        copy the result to theFile_ID
    on error
        display dialog "Can't open the input file."
    end try
    
    -- Get the file length
    get eof theFile_ID
    copy the result to theFile_eof
    
    -- get our data
    try
        read theFile_ID as string using delimiter {ASCII character 10}
        copy the result to theFileinput
        --display dialog "Got " & (theFileinput as text)
    on error
        -- theFilename was never defined; report the file we were actually given
        display dialog "Can't read from file " & (theFile as text)
    end try
    
    --close the file
    close access theFile_ID
    return theFileinput
end getFromFile

And here’s a sample of the file:

zope-zopetree<variable number of spaces>@1.3<variable number of spaces>zope/zope-zopetree
zope-zopezen<variable number of spaces>@0.5<variable number of spaces>zope/zope-zopezen

As you can see, there is a variable number of spaces between items, so I figured using “words of” would help dump the extra spaces. In order to keep processing time down, I need a FAST solution: the file is about 200k, and if I have to go through it character by character it will take too long for my purposes.

John_M · July 6, 2006, 7:47am

Hi Kevin,

What about this?

set theFile to alias "Brain:Users:nitewing:dtext.txt"
try
	set theText to paragraphs of (read theFile)
on error
	display dialog "Can't open the input file."
end try

repeat with myText from 1 to count of theText
	set AppleScript's text item delimiters to " "
	set myitems to text items of (theText's item myText)
	set AppleScript's text item delimiters to {""}
	
	set myResult to {}
	repeat with myitem in myitems
		if myitem's contents is not equal to "" then set end of myResult to myitem's contents
	end repeat
	set (theText's item myText) to myResult
end repeat

return theText
-->{{"zope-zopetree", "@1.3", "zope/zope-zopetree"}, {"zope-zopezen", "@0.5", "zope/zope-zopezen"}}

If speed is important you should probably be looking towards a UNIX solution.
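
One possible shape for such a UNIX solution, as a rough sketch only: hand the file to awk through do shell script and let awk collapse the runs of spaces. The awk one-liner and the variable names here are assumptions for illustration, not something tested against the real 200k file, and whether it is actually faster would need measuring.

```applescript
set theFile to alias "Brain:Users:nitewing:dtext.txt"

-- awk splits each line on runs of whitespace; NF skips blank lines.
-- The three fields are printed back separated by single spaces.
set awkCommand to "awk 'NF {print $1, $2, $3}' " & quoted form of POSIX path of theFile
set theOutput to do shell script awkCommand

-- Turn each cleaned line into a list of its three columns.
set theRecords to {}
set AppleScript's text item delimiters to space
repeat with aLine in paragraphs of theOutput
	set end of theRecords to text items of (contents of aLine)
end repeat
set AppleScript's text item delimiters to ""

return theRecords
```

This assumes every non-blank line has exactly three columns and that none of the columns contains a space.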

Best wishes

John M

Kevin_Bradley · July 6, 2006, 8:02am

John,

That works well. I never thought about “paragraphs”–DUH!

I’m curious about your read of the file. I’ve used “open for access” for years before reading a file and even the Standard Additions dictionary says “read: Read data from a file that has been opened for access.”

How long have we been able to do this? Or is this an “undocumented feature”? (You’ll find AS has a lot of them; the AS Language Guide hasn’t been updated since version 1.3.7…)

John_M · July 6, 2006, 8:23am

I don’t know.

I used ‘open for access’ before reading a file until, a few years ago, someone pointed out that you don’t need to. You do need to open access when you are writing to the file. This makes sense to me, as you wouldn’t want two users writing to a file at the same time, but reading it simultaneously would be OK.

Best wishes

John M

Fredo_d_o · July 6, 2006, 8:26am

Hi

Here’s another suggestion:


on run
	-- Example of original text
	set theText to "zope-zopetree                       @1.3                  zope/zope-zopetree
zope-zopezen              @0.5              zope/zope-zopezen"
	-- Paragraphs of original text
	set theParagraphList to paragraphs of theText
	
	-- Initialize the result list
	set theResultList to {}
	
	-- For each paragraph
	repeat with theParagraph in theParagraphList
		-- Collapse each run of space characters in the paragraph to a single space
		set theResultText to my subTextWithoutSpaces(contents of theParagraph)
		
		-- Build the clean text list, without spaces, and store it in the result list
		set text item delimiters of AppleScript to space
		set theResultList to theResultList & (text items of theResultText)
	end repeat
	
	set text item delimiters of AppleScript to "" -- default value
	return theResultList
	--> {"zope-zopetree", "@1.3", "zope/zope-zopetree", "zope-zopezen", "@0.5", "zope/zope-zopezen"}
end run

on subTextWithoutSpaces(theText)
	set theSpaceList to ¬
		{space & space & space & space & space & space, ¬
			space & space & space & space & space, ¬
			space & space & space & space, ¬
			space & space & space, ¬
			space & space, ¬
			space & space}
	repeat with theSpace in theSpaceList
		set theText to my subFindReplaceText(contents of theText, contents of theSpace, space)
	end repeat
	return theText
end subTextWithoutSpaces

on subFindReplaceText(theText, theFindText, theReplaceText)
	set text item delimiters of AppleScript to theFindText
	set theList to text items of (contents of theText)
	set text item delimiters of AppleScript to theReplaceText
	set theText to theList as string
	set text item delimiters of AppleScript to ""
	return theText
end subFindReplaceText

Nigel_Garvey · July 6, 2006, 11:32am

Hi, Kevin.

Basically, open for access is for when you want to do more than one thing while a file’s open. It opens an access to the file and returns a reference number for that access to the script.

The main parameter for the other File Read/Write commands can be either this reference number or a file reference in the form of an alias or file specification. If they’re given the reference number, the commands are hotlined directly to the open access, which is fast and unambiguous.

With an alias or file specification, the commands first have to see if there are any accesses open to the file. If they find one, they use it. Since it’s possible (though not necessarily desirable) to have several accesses open to the file at once, each with its own file position pointer, you can’t always guarantee that the access found this way is the one you want. It’s also slower because there’s more to do. If no open access is found for the file, the commands open an access for themselves (with write permission, if necessary), do their stuff, then close the access again afterwards.

Philosophy, based on safety aspects and the time it takes to do things:

If you want to perform more than one Read/Write action on a file, open it for access first and use the returned reference number as the parameter for the other commands. Use a try block to trap any errors so that the script doesn’t stop before the access is closed again.

If you only want to perform one action (such as reading the file once only), and the file isn’t already open for access, simply use the appropriate command with an alias or file specification.
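
As a rough sketch of those two cases (the variable names are only illustrative, and the path is the one from Kevin's script, so it has to point at a real file before this will compile):

```applescript
set theFile to alias "Brain:Users:nitewing:dtext.txt"

-- One action only: give read the alias and let it open and close its own access.
set wholeText to read theFile

-- More than one action: open an access and pass its reference number to each command.
set fileRef to open for access theFile
try
	set fileLength to get eof fileRef
	set firstLine to read fileRef before (ASCII character 10)
on error errMsg number errNum
	-- Close the access before passing the error on, so it isn't left open.
	close access fileRef
	error errMsg number errNum
end try
close access fileRef
```

The try block is there only so the access gets closed whether or not the reads succeed.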

It so happens I’m struggling to complete an article about File Read/Write for ScriptWire. It’s going to be quite long…

The File Read/Write commands are actually written up in an even older document called “AppleScript Scripting Additions Guide”, which dates back to when the various suites in what is now the StandardAdditions were separate OSAXen! I don’t know if this document’s still available from the Apple site or not.

Kevin_Bradley · July 7, 2006, 8:17am

I think it is still available. I know I have a copy, which you just made me go back and re-read (at least the read/write part). I hadn’t read it in years (literally), which explains a lot.

Apple really really needs to update the AS documentation with a single, all-encompassing document that addresses the language as it is today. I’ve read several of the third-party books but most of them are for beginners and don’t cover any of the advanced topics.

I have to say that AS is the most thoroughly and yet badly documented language I know of. If you know where to look, all the information is available, but it’s strewn across the Apple Developer landscape in bits and pieces. The AppleScript Studio reference is the best AS reference to come out in a long, long time, and even it has flaws.