Text items with ASObjC

Sorry for the beginner’s question but I can’t seem to get this to work.

I’m trying to find the ASObjC equivalent of:

set theString to "abcde"
set theCharacter to text 2 of theString --> "b"

set theString to "abcde"
set theCharacters to text 2 thru 3 of theString --> "bc"

Thanks for the help.

This is one of those situations where vanilla AppleScript is probably clearer than AppleScriptObjC, because it has a simpler syntax for this type of thing.

You have two options with the NSString class. First, you can get the character or characters from the NSString with either the “characterAtIndex:” or “getCharacters:range:” methods.
But then you have to convert the results from “unichar” characters to NSString, which is not as straightforward as you might think.

Or you can use NSString’s “substringWithRange:” method to specify the character location and length via an NSRange. You have to remember, though, that NSString indexes are zero-based: if you want the first character you ask for a location of “0”, and in your case above, the second character has a location of “1”.


use framework "Foundation"

property myApp : a reference to current application

set theString to myApp's NSString's stringWithString:"abcde"

set theCharacter to (theString's substringWithRange:(myApp's NSMakeRange(1, 1))) as text --> "b"

set theCharacters to (theString's substringWithRange:(myApp's NSMakeRange(1, 2))) as text --> "bc"

As previously stated, there are other options, but none of them is as clean and clear as vanilla AppleScript.

One word of caution about the above example: if the range you request is outside the string’s range, an error will occur.
So it’s worth checking the length of the string before trying to access a particular range location and length.
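A minimal sketch of that length check, assuming a wrapper handler (the name safeSubstring is made up for illustration and is not part of Foundation) that returns missing value instead of letting the out-of-range error occur:

```applescript
use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- safeSubstring is a hypothetical helper name, not a Foundation method.
-- theLocation is zero-based and, like theLength, is measured in NSString units.
on safeSubstring(theText, theLocation, theLength)
	set theString to current application's NSString's stringWithString:theText
	set stringLength to (theString's |length|()) as integer
	-- Refuse any range that starts before the string or runs past its end.
	if theLocation < 0 or theLength < 0 or (theLocation + theLength) > stringLength then return missing value
	return (theString's substringWithRange:(current application's NSMakeRange(theLocation, theLength))) as text
end safeSubstring

set inRange to safeSubstring("abcde", 1, 2) --> "bc"
set outOfRange to safeSubstring("abcde", 4, 3) --> missing value
```

Returning missing value is only one possible design; raising your own error with a clearer message would work just as well.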

Regards Mark

Thanks Mark–substringWithRange will do just what I want.

Hi.

Bear in mind that NSStrings are measured in 16-bit code units, not characters. If any of your characters are more than 16 bits long, your assumed locations will go up the spout.
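A small sketch of the mismatch, assuming a test string built with string id so that the character outside the 16-bit range survives being posted:

```applescript
use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- U+10300 (decimal 66304) does not fit in 16 bits, so it occupies two code units.
set theText to string id {66304, 98, 99}
set theString to current application's NSString's stringWithString:theText

set characterCount to count theText --> 3 (AppleScript counts characters)
set codeUnitCount to (theString's |length|()) as integer --> 4 (NSString counts 16-bit code units)

-- "b" is text 2 in AppleScript, but it sits at NSString location 2, not 1.
set secondCharacter to (theString's substringWithRange:(current application's NSMakeRange(2, 1))) as text --> "b"
```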

Thanks Nigel. That’s a good thing to be aware of.

I suppose the relevant character range could be determined using a regex:

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

set theString to string id {66304, 98, 66306, 100, 101} -- Sorry. MacScripter can't display the 32-bit characters.
-- Say we want text 2 thru 3 of the above string.
set ASFirstCharacterIndex to 2
set numberOfCharacters to 2
-- This regex pattern matches the relevant number of characters (including line endings),
-- starting after AS index - 1 characters from the beginning of the string.
set substringSearchPattern to "(?s)(?<=\\A.{" & ASFirstCharacterIndex - 1 & "}).{" & numberOfCharacters & "}"

set |⌘| to current application
set theString to |⌘|'s class "NSString"'s stringWithString:(theString)
set substringRange to theString's rangeOfString:(substringSearchPattern) options:(|⌘|'s NSRegularExpressionSearch)
set substring to theString's substringWithRange:(substringRange)

To help with abnormal characters, you could possibly normalize, fold or transform the string with NSString methods:

https://developer.apple.com/documentation/foundation/nsstring?language=objc

Or encode as UTF-8 ?
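As a rough sketch of what normalizing and folding do, here is an accented character run through two of those NSString methods. The sample character is my own choice. Note that this only affects combining sequences such as accents; a character outside the 16-bit range still takes two code units after normalizing.

```applescript
use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- Start from the single code point U+00E9 (decimal 233), a precomposed e-acute.
set theString to current application's NSString's stringWithString:(string id {233})

-- Decomposing splits it into "e" plus a combining acute accent.
set decomposedString to theString's decomposedStringWithCanonicalMapping()
set decomposedLength to (decomposedString's |length|()) as integer --> 2

-- Normalizing back to the precomposed form merges the pair again.
set composedString to decomposedString's precomposedStringWithCanonicalMapping()
set composedLength to (composedString's |length|()) as integer --> 1

-- Folding with the diacritic-insensitive option strips the accent altogether.
set foldedString to decomposedString's stringByFoldingWithOptions:(current application's NSDiacriticInsensitiveSearch) locale:(missing value)
set foldedText to foldedString as text --> "e"
```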

This general topic became an issue in another thread, so I spent some additional time researching the matter. Just for test purposes, I created a file that contained four emoticons and nothing else. I also read the section of Shane’s ASObjC book entitled “How long is a piece of string”, which succinctly explains the issues involved.

I first modified Mark’s script and it only returned the first two emoticons:

use framework "Foundation"
use scripting additions
set theFile to POSIX path of (choose file)
set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
set theCharacters to (theString's substringWithRange:(current application's NSMakeRange(0, 4)))

I then ran Nigel’s script and it returned all four emoticons.

use framework "Foundation"
use scripting additions
set theFile to POSIX path of (choose file)
set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
set ASFirstCharacterIndex to 1
set numberOfCharacters to 4
-- This regex pattern matches the relevant number of characters (including line endings),
-- starting after AS index - 1 characters from the beginning of the string.
set substringSearchPattern to "(?s)(?<=\\A.{" & ASFirstCharacterIndex - 1 & "}).{" & numberOfCharacters & "}"
set |⌘| to current application
set theString to |⌘|'s class "NSString"'s stringWithString:(theString)
set substringRange to theString's rangeOfString:(substringSearchPattern) options:(|⌘|'s NSRegularExpressionSearch)
set substring to theString's substringWithRange:(substringRange)

Finally, I created a text file with four accented Latin characters and both of the above scripts returned all four characters. I don’t understand why this is the case.

So, I have a basic understanding of the issues involved, which gives rise to a second topic, which is how to count the number of characters in an NSString. I ran the following on my emoticon file and it returned 8. Nigel’s script has an entirely different purpose and cannot be used. I looked at normalizing and folding strings in the NSString documentation, but they are a bit beyond my current knowledge level. I guess the simple answer is to use basic AppleScript.

use framework "Foundation"
use scripting additions
set theFile to POSIX path of (choose file)
set theString to current application's NSString's stringWithContentsOfFile:theFile encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
set stringCount to theString's |length|()

Obviously I need to spend some additional time on this, but a few questions are at the front of my mind right now:

  • Why do the files with emoticons and accented Latin characters return different results?

  • There may be circumstances with known text where I want to use NSString’s length property–an example might be text that contains nothing but numbers. Just in general, what sort of characters are going to return erroneous results?

  • Has anyone written an ASObjC handler that would normalize or fold a string so that it would return usable results?

BTW, I was unsure what character encoding to use when reading the text file, and I used UTF-8 based on posts in the following thread.

https://macscripter.net/viewtopic.php?id=48020

Thanks for reading my post and for any thoughts or comments.

Hi peavine.

If a character’s UTF-16 value can be expressed in 16 bits, its length is 1 in ObjectiveC. Most accented Latin characters fall within this range. Emoticon codes are between 17 and 32 bits long, so their length in ObjectiveC is 2 (ie. 2 * 16 bits). Your “Mark” script reads for a length of 4, so that gets four characters of length 1, two of length 2, or whatever else constitutes the first four code units in the string.

The “Nigel” script uses the regex “(?s).{numberOfCharacters}” to identify a certain number of actual characters in the string, returning a range in 16-bit code units.

I don’t know of any decent way to count the actual characters in an NSString. The alternatives are either to coerce the NSString to text and count the result or to use a character-spotting regex again:

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

set theString to string id {66304, 98, 66306, 100, 101} -- Sorry. MacScripter can't display the 32-bit characters.
set theString to current application's class "NSString"'s stringWithString:(theString)

set theRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:("(?s).") options:(0) |error|:(missing value)
return theRegex's numberOfMatchesInString:(theString) options:(0) range:({0, theString's |length|()})
--> 5

There’s an alternative NSString file reading method which attempts to identify the encoding for itself, returning a numeric code for the encoding it used if you want to know that. Perfect results aren’t guaranteed, but it usually gets it right:

set {theString, theEncoding, theError} to current application's class "NSString"'s stringWithContentsOfFile:(posixPath) usedEncoding:(reference) |error|:(reference)
-- Or:
set {theString, theEncoding} to current application's class "NSString"'s stringWithContentsOfFile:(posixPath) usedEncoding:(reference) |error|:(missing value)
-- Or:
set theString to current application's class "NSString"'s stringWithContentsOfFile:(posixPath) usedEncoding:(missing value) |error|:(missing value)

Peavine,

Nigel has answered part of your query, but it also comes down to what is the definition of a character, which can get complicated in some languages. There’s also the issue of accents – for example, ü can be stored as a single codepoint, or as separate ¨ and u codepoints.

AppleScript essentially treats grapheme clusters as characters. According to the 10.6 release notes, AppleScript counts them using CFStringGetRangeOfComposedCharactersAtIndex:, and the Objective-C equivalent of that is rangeOfComposedCharacterSequenceAtIndex:.

So when you ask AppleScript for the length of a string, it basically runs a repeat loop, getting CFStringGetRangeOfComposedCharactersAtIndex:0, then advancing by the length of that range to check the next grapheme cluster, and so on.

In Objective-C or C this actually happens very quickly, but it’s the sort of thing that doesn’t translate to ASObjC very well at all. Fortunately, it’s rarely actually needed.

If you want to have a play with it, there are a couple of methods in my BridgePlus library that let you deal with the issue in ASObjC. For example (and I hope the emoticon in the string doesn’t get eaten):

use scripting additions
use framework "Foundation"
use script "BridgePlus"
load framework

set aString to "A ???? string." -- "????" stands in for the single emoticon, which the forum did not display
set theResult to current application's SMSForder's rangesOfCharactersOfString:aString
theResult as list
-->	{{location:0, length:1}, {location:1, length:1}, {location:2, length:2}, {location:4, length:1}, {location:5, length:1}, {location:6, length:1}, {location:7, length:1}, {location:8, length:1}, {location:9, length:1}, {location:10, length:1}, {location:11, length:1}}

As I said, you could do the same thing in ASObjC using the method above, but it’s going to be very slow for long strings.

Where you might use the method is when slicing text files and you want to make sure the last/first character of each slice doesn’t result in splitting what shouldn’t be split – that can corrupt text. So you might call rangeOfComposedCharacterSequenceAtIndex: on the proposed last character of a chunk, and if its length is more than 1, increase the chunk size so it grabs the full grapheme cluster.
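A hedged sketch of that chunk adjustment. Rather than calling rangeOfComposedCharacterSequenceAtIndex: on the last index, it uses the related NSString method rangeOfComposedCharacterSequencesForRange:, which widens a whole range to cover every sequence it touches. The handler name wholeCharacterChunk is made up for illustration, and the requested range is assumed to lie within the string.

```applescript
use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- wholeCharacterChunk is a hypothetical helper name, not a Foundation method.
-- theLocation and proposedLength are in 16-bit code units and must lie within the string.
on wholeCharacterChunk(theString, theLocation, proposedLength)
	set proposedRange to current application's NSMakeRange(theLocation, proposedLength)
	-- Widen the range so that it covers every composed sequence it touches.
	set adjustedRange to theString's rangeOfComposedCharacterSequencesForRange:proposedRange
	return theString's substringWithRange:adjustedRange
end wholeCharacterChunk

-- "a", then U+10300 (two code units), then "b".
set theString to current application's NSString's stringWithString:(string id {97, 66304, 98})
set theChunk to wholeCharacterChunk(theString, 0, 2)
set chunkLength to (theChunk's |length|()) as integer --> 3, not the 2 requested
```

The next chunk would then start at the location where this one ended, so nothing is read twice.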

Thanks Nigel and Shane for all the great information. I’m making good progress in understanding this topic, and I’ll continue my study utilizing the information in your posts. It’s good to have a working knowledge of the issues involved.

I thought I would see if I could write a script that accomplishes the above and have included it below. I took a somewhat different approach (because it was easier to implement) but it does seem to work correctly. The timing result as written was 1 millisecond and was 14 milliseconds if the string is increased to 200 characters composed of 100 letters and 100 emoticons. I don’t know if I would ever have reason to use this script but it is helpful for learning purposes. The forum apparently deleted the emoticons in theString, so those will have to be added back for testing.

use framework "Foundation"
use scripting additions

set theString to "a????b????c" -- composed of 3 letters and 2 emoticons
set theString to current application's NSString's stringWithString:theString
set characterCount to 0
set previousLocation to (missing value)
repeat with i from 1 to (theString's |length|())
	set theRange to (theString's rangeOfComposedCharacterSequenceAtIndex:(i - 1))
	set theLocation to (theRange's location())
	if theLocation ≠ previousLocation then set characterCount to characterCount + 1
	set previousLocation to theLocation
end repeat
characterCount --> 5

BTW, I tested Nigel’s character-count script from post 9 with the identical string of 100 letters and 100 emoticons, and the timing result was 1 millisecond. So, Nigel’s script should clearly be preferred for actual use.

That works, but it’s potentially a little inefficient in that you’re checking every index, whereas if a character has a length of more than 1, you can skip ahead to the next character. Something like this:

use framework "Foundation"
use scripting additions

set theString to "a????b????????c" -- composed of 3 letters and 2 emoticons
set theString to current application's NSString's stringWithString:theString
set characterCount to 0
set theIndex to 0
set theLength to theString's |length|()
repeat while theIndex < theLength
	set theRange to theString's rangeOfComposedCharacterSequenceAtIndex:theIndex
	set characterCount to characterCount + 1
	set theIndex to theIndex + (theRange's |length|)
end repeat
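The same skip-ahead loop can be adapted to answer the question that started the thread: an ASObjC counterpart of "text i thru j" that counts characters rather than code units. This is only a sketch, the handler name textByCharacters is hypothetical, and vanilla AppleScript remains simpler in practice.

```applescript
use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- textByCharacters is a hypothetical helper name. firstIndex and lastIndex are
-- 1-based character positions, as in AppleScript's "text i thru j".
on textByCharacters(theText, firstIndex, lastIndex)
	set theString to current application's NSString's stringWithString:theText
	set totalLength to (theString's |length|()) as integer
	set unitIndex to 0 -- position in 16-bit code units
	set characterIndex to 0 -- position in characters
	set startUnit to missing value
	repeat while unitIndex < totalLength
		set characterIndex to characterIndex + 1
		if characterIndex = firstIndex then set startUnit to unitIndex
		set theRange to theString's rangeOfComposedCharacterSequenceAtIndex:unitIndex
		set unitIndex to unitIndex + (theRange's |length|)
		if characterIndex = lastIndex then exit repeat
	end repeat
	-- Give up if either index fell outside the string.
	if startUnit is missing value or characterIndex < lastIndex then return missing value
	return (theString's substringWithRange:(current application's NSMakeRange(startUnit, unitIndex - startUnit))) as text
end textByCharacters

set theText to string id {66304, 98, 66306, 100, 101}
set theResult to textByCharacters(theText, 2, 3) --> the same as text 2 thru 3 of theText
```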

Thanks Shane. I tested that with my test string (100 letters and 100 emoticons) and it reduced the timing result from 14 to 9 milliseconds.

For now, I’ve completed my study of this topic, but I had one last nagging question. The AppleScript Language Guide states as follows:

If a string contains a combining character cluster, will the count command and text’s length property return different results? This should be easily tested for, but I don’t know where to find a combining character cluster. Thanks for the help.

I’m not sure what you’re asking for, but maybe this snippet will help clarify:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:"é"
set a to theString's decomposedStringWithCompatibilityMapping()'s |length|()
set b to theString's precomposedStringWithCompatibilityMapping()'s |length|()
{a, b} --> {2, 1}

Thanks Shane. That answers my question.

FWIW, I wanted to use Nigel’s script from post 9 and couldn’t get it to work until I noticed that NSString is misspelled (3 spots).

Oops! Thanks, peavine. Now corrected.