words of...

Fredo_d_o · July 30, 2008, 3:03pm

Hi all

Does anyone have a source that explains this behavior?

set theNom to "my-first_text document.txt"
set theNomTxt to theNom as string
set theNomUtf to theNom as Unicode text
return {words of theNomTxt, words of theNomUtf}
-- {{"my", "first", "_", "text document.txt"}, {"my", "first_text", "document", "txt"}}

The space between “text” and “document” is a non-breaking space (alt + space)

Thanks

Joy · July 30, 2008, 8:33pm

Hi,
What's the goal of your script? As you can see, there are various ways to retrieve and filter the output data. Personally, I find it much more creative and easy (at least for me) to experiment once you have discovered the fundamental or direct references.

Bruce_Phillips · July 30, 2008, 10:46pm

word:

Fredo_d_o · July 31, 2008, 9:39am

Hi,

Ok, thank you very much for your replies.

I parse various short texts, and I would like to be sure that the separator characters I chose produce the same result with both “string” and “Unicode text”.

If these differences between “string” and “Unicode text” are caused by the user's system language, how can I make sure the separators I use work universally?

So it would be very interesting, for me at least, to know whether you get the same result as I do when you test this code (I work on a French Mac OS X system).

Thanks

PS. Sorry for my bad English

regulus6633 · August 1, 2008, 4:02am

I always use text item delimiters instead of “words” for this reason. Usually I use only the space character to separate words. If you want to use more than the space character it gets a little more complicated, but this would do it. Note that the variable wordSeparationCharacters is a list of the characters you want to treat as word separators.

set theNom to "my-first_text document.txt"
set theNomTxt to theNom as string
set theNomUtf to theNom as Unicode text

set wordSeparationCharacters to {space, "-", "_"}
set dummyWordSeparator to "&^*%" --> this is anything unique, it's only used as an interim value

-- first we replace each found wordSeparationCharacter with the dummyWordSeparator
repeat with aChar in wordSeparationCharacters
	set theNomTxt to findReplace(theNomTxt, aChar, dummyWordSeparator)
	set theNomUtf to findReplace(theNomUtf, aChar, dummyWordSeparator)
end repeat
--> after this step our 2 strings become {"my&^*%first&^*%text&^*%document.txt", "my&^*%first&^*%text&^*%document.txt"}

-- then we use the dummyWordSeparator to break it into the final words
set txtWords to getTextItems(theNomTxt, dummyWordSeparator)
set utfWords to getTextItems(theNomUtf, dummyWordSeparator)
return {txtWords, utfWords}
--> Result: {{"my", "first", "text", "document.txt"}, {"my", "first", "text", "document.txt"}}


(*=========== SUBROUTINES ===========*)
on findReplace(theString, search_string, replacement_string)
	if theString contains search_string then
		set AppleScript's text item delimiters to search_string
		set text_item_list to text items of theString
		set AppleScript's text item delimiters to replacement_string
		set theString to text_item_list as Unicode text
		set AppleScript's text item delimiters to ""
	end if
	return theString
end findReplace

on getTextItems(theString, theDelimiter)
	set AppleScript's text item delimiters to theDelimiter
	set textItems to text items of theString
	set AppleScript's text item delimiters to ""
	return textItems
end getTextItems
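
A possible shorter variant, offered only as a sketch: on AppleScript 2.1 or later (Mac OS X 10.6 and up), text item delimiters can be set to a list of strings, so the interim dummy separator is not needed. The variable names savedDelimiters and theWords are made up for illustration, and the sketch assumes the space in the name is an ordinary space rather than a non-breaking one.

```applescript
-- Sketch only: requires AppleScript 2.1+ (list-valued text item delimiters)
set theNom to "my-first_text document.txt"

-- remember the current delimiters so they can be restored afterwards
set savedDelimiters to AppleScript's text item delimiters

-- every string in this list is treated as a separator
set AppleScript's text item delimiters to {space, "-", "_"}
set theWords to text items of theNom

-- restore the previous delimiters
set AppleScript's text item delimiters to savedDelimiters

return theWords
```

With an ordinary space this should give the same four items as the result above. A non-breaking space would have to be added to the delimiter list explicitly.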

Fredo_d_o · August 1, 2008, 9:29pm

Thanks Regulus for your reply

Of course, I could use the “text item delimiters” method, but I need to use the “word” method to select the parts of the string because it is simpler in my project and, in any case, it is the faster method for strings stored as “Unicode text”.

Anyway, thank you very much for your reply.