Peavine,
Nigel has answered part of your query, but it also comes down to how you define a character, which can get complicated in some languages. There’s also the issue of accents – for example, ü can be stored as a single code point, or as a u followed by a separate combining ¨ code point.
AppleScript essentially treats grapheme clusters as characters. According to the 10.6 release notes, AppleScript counts them using the C function CFStringGetRangeOfComposedCharactersAtIndex, and the Objective-C equivalent of that is rangeOfComposedCharacterSequenceAtIndex:.
So when you ask AppleScript for the length of a string, it basically runs a repeat loop: it calls CFStringGetRangeOfComposedCharactersAtIndex for index 0, advances by the length of the returned range to find the next grapheme cluster, and so on until it reaches the end of the string.
In Objective-C or C this actually happens very quickly, but it’s the sort of thing that doesn’t translate to ASObjC very well at all. Fortunately, it’s rarely actually needed.
If you want to have a play with it, there are a couple of methods in my BridgePlus library that let you deal with the issue in ASObjC. For example (and I hope the emoticon in the string doesn’t get eaten):
use scripting additions
use framework "Foundation"
use script "BridgePlus"
load framework
set aString to "A 😀 string."
set theResult to current application's SMSForder's rangesOfCharactersOfString:aString
theResult as list
--> {{location:0, length:1}, {location:1, length:1}, {location:2, length:2}, {location:4, length:1}, {location:5, length:1}, {location:6, length:1}, {location:7, length:1}, {location:8, length:1}, {location:9, length:1}, {location:10, length:1}, {location:11, length:1}}
As I said, you could do the same thing in plain ASObjC by calling rangeOfComposedCharacterSequenceAtIndex: in a repeat loop, but it’s going to be very slow for long strings.
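For anyone curious what that plain ASObjC loop might look like, here is a minimal sketch. The handler name countComposedCharacters is made up for illustration, and it measures each cluster by taking its substring rather than reading the range record directly, which is just one way to do it. It assumes macOS 10.10 or later, where ASObjC works in ordinary scripts.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical handler: counts grapheme clusters by walking the string
-- one composed character sequence at a time
on countComposedCharacters(theText)
	set theString to current application's NSString's stringWithString:theText
	set theLength to (theString's |length|()) as integer -- length in UTF-16 units
	set theCount to 0
	set i to 0
	repeat while i < theLength
		-- Range of the grapheme cluster that starts at index i
		set theRange to theString's rangeOfComposedCharacterSequenceAtIndex:i
		-- Measure the cluster via its substring, then step past it
		set clusterLength to ((theString's substringWithRange:theRange)'s |length|()) as integer
		set i to i + clusterLength
		set theCount to theCount + 1
	end repeat
	return theCount
end countComposedCharacters

countComposedCharacters("A 😀 string.")
```

For a string containing one emoji stored as a surrogate pair, the count should come out one lower than the NSString length, in line with the ranges in the BridgePlus result above. Every pass through the loop crosses the scripting bridge twice, which is why it crawls on long strings.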
Where you might use the method is when slicing text files and you want to make sure the last/first character of each slice doesn’t result in splitting what shouldn’t be split – that can corrupt text. So you might call rangeOfComposedCharacterSequenceAtIndex: on the proposed last character of a chunk, and if the returned range’s length is more than 1, increase the chunk size so it grabs the full grapheme cluster.
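As a rough sketch of that slicing idea, the handler below cuts a string into chunks of about chunkSize UTF-16 units without breaking a cluster. Note that it swaps in a related NSString method, rangeOfComposedCharacterSequencesForRange:, which expands a whole range to cluster boundaries in one call, instead of checking the last character separately. The handler name sliceText is hypothetical, and passing a two-item list where an NSRange is expected relies on the bridge converting it.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical handler: splits theText into chunks of roughly chunkSize
-- UTF-16 units, never cutting through a grapheme cluster
on sliceText(theText, chunkSize)
	if chunkSize < 1 then error "chunkSize must be at least 1"
	set theString to current application's NSString's stringWithString:theText
	set theLength to (theString's |length|()) as integer
	set theChunks to {}
	set chunkStart to 0
	repeat while chunkStart < theLength
		-- Clamp the proposed size so the range never runs past the end
		set proposedSize to chunkSize
		if chunkStart + proposedSize > theLength then set proposedSize to theLength - chunkStart
		-- Expand the proposed range to whole clusters
		-- (assumption: the {location, length} list is bridged to an NSRange)
		set safeRange to theString's rangeOfComposedCharacterSequencesForRange:{chunkStart, proposedSize}
		set theChunk to theString's substringWithRange:safeRange
		set end of theChunks to theChunk as text
		-- The next chunk starts where this one really ended
		set chunkStart to chunkStart + ((theChunk's |length|()) as integer)
	end repeat
	return theChunks
end sliceText

sliceText("A 😀 string.", 3)
```

With a chunk size of 3, the first proposed chunk would end halfway through the emoji’s surrogate pair, so it should be stretched by one unit to take in the whole thing. Because each chunk starts exactly where the previous one ended, the start of every range is already on a cluster boundary and only the end ever needs adjusting.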