Does anyone know of a source that explains this behavior:
set theNom to "my-first_text document.txt"
set theNomTxt to theNom as string
set theNomUtf to theNom as Unicode text
return {words of theNomTxt, words of theNomUtf}
-- {{"my", "first", "_", "text document.txt"}, {"my", "first_text", "document", "txt"}}
Note: the space between “text” and “document” is a non-breaking space (Alt + Space).
Hi,
What’s the goal of your script? As you can see, there are usually various ways to retrieve and filter the output data. Personally, I find it much more creative and easy (at least for me) to experiment once you have discovered the fundamental or direct references.
I parse various short texts, and I would like to be sure that the separator characters I chose produce the same result with “string” and “Unicode text”…
If these differences between “string” and “Unicode text” are caused by the user’s system language, how can I be sure I am using universal separators in this case?
So it would be very interesting, for me at least, to know whether you get the same result as I do when you test this code (I work on a French OS X system).
I always use text item delimiters instead of the term “words” for this reason. Usually I only use the space character to define the separation of words, but if you want to use more than the space character it gets a little more complicated. This would do it. Note that the variable wordSeparationCharacters is a list of the characters you want to treat as word separators.
set theNom to "my-first_text document.txt"
set theNomTxt to theNom as string
set theNomUtf to theNom as Unicode text
set wordSeparationCharacters to {space, "-", "_"}
set dummyWordSeparator to "&^*%" --> this is anything unique, it's only used as an interim value
-- first we replace each found wordSeparationCharacter with the dummyWordSeparator
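repeat with aChar in wordSeparationCharacters
	-- aChar is a reference to the list item, so pass its contents to the handler
	set theNomTxt to findReplace(theNomTxt, contents of aChar, dummyWordSeparator)
	set theNomUtf to findReplace(theNomUtf, contents of aChar, dummyWordSeparator)
end repeat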
--> after this step our 2 strings become {"my&^*%first&^*%text&^*%document.txt", "my&^*%first&^*%text&^*%document.txt"}
-- then we use the dummyWordSeparator to break it into the final words
set txtWords to getTextItems(theNomTxt, dummyWordSeparator)
set utfWords to getTextItems(theNomUtf, dummyWordSeparator)
return {txtWords, utfWords}
--> Result: {{"my", "first", "text", "document.txt"}, {"my", "first", "text", "document.txt"}}
(*=========== SUBROUTINES ===========*)
on findReplace(theString, search_string, replacement_string)
	if theString contains search_string then
		set AppleScript's text item delimiters to search_string
		set text_item_list to text items of theString
		set AppleScript's text item delimiters to replacement_string
		set theString to text_item_list as Unicode text
		set AppleScript's text item delimiters to ""
	end if
	return theString
end findReplace

on getTextItems(theString, theDelimiter)
	set AppleScript's text item delimiters to theDelimiter
	set textItems to text items of theString
	set AppleScript's text item delimiters to ""
	return textItems
end getTextItems
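If the non-breaking space from the original string should also count as a separator, the same idea can be written more compactly. This is only a sketch under two assumptions: the script runs on AppleScript 2.1 or later (Mac OS X 10.6+), where text item delimiters may be a list, and the non-breaking space is built with character id 160. The names nbsp, oldDelims and theWords are made up for the illustration.

```applescript
-- Sketch only: assumes AppleScript 2.1 or later (Mac OS X 10.6+),
-- where text item delimiters may be given as a list.
set nbsp to character id 160 -- non-breaking space (hypothetical variable name)
set theNom to "my-first_text" & nbsp & "document.txt"

set oldDelims to AppleScript's text item delimiters -- remember the current setting
set AppleScript's text item delimiters to {space, nbsp, "-", "_"}
set theWords to text items of theNom
set AppleScript's text item delimiters to oldDelims -- restore it

return theWords
--> {"my", "first", "text", "document.txt"}
```

Because the separators are listed explicitly, the result does not depend on how “words” treats the hyphen, the underscore or the non-breaking space.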
Of course, I can use the “text item delimiters” method, but I need to use the “words” method to select the parts of the string because it is simpler in my project and, in any case, it is the faster method with strings as “Unicode text”.