And looking back, the code I posted earlier is more complicated than it needs to be. This should do the trick:
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions
-- assumes peeps already holds the text to sort
set theString to current application's NSString's stringWithString:peeps
-- get line break string used
set lineBreakRange to theString's rangeOfString:"[\\r\\n]+" options:(current application's NSRegularExpressionSearch)
set lineBreakString to theString's substringWithRange:lineBreakRange
-- search for first name and rest of name, and insert at the beginning of line followed by commas
set theString to theString's stringByReplacingOccurrencesOfString:"(?m)^([^ ,]+) *([^,]*)" withString:"$2$1,,$0" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
-- split into paragraphs and sort them
set theArray to theString's componentsSeparatedByString:lineBreakString
set theArray to theArray's sortedArrayUsingSelector:"localizedCaseInsensitiveCompare:"
-- rejoin sorted paragraphs
set theString to theArray's componentsJoinedByString:lineBreakString
-- remove sorting strings
set theString to (theString's stringByReplacingOccurrencesOfString:"(?m)^[^,]+,," withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}) as text
Wow, so many solutions and so little time! Thanks for all the responses to my sort problem. As a simple scripter, I'll need a few days to pick through all this advanced coding. I must now do some more research on all the illustrious persons that have lived so close to me in Clacton on Sea.
Here’s an expansion of Shane’s second script which ignores intermediate words not beginning with capitals and deals with “Mc” and “St” surnames in the BT Phone Book manner. The assumption is that SunnyFrinton hasn’t bothered to include people’s middle names in his file and that therefore the second capital initial in each line is the beginning of the sequence on which to sort.
Edits: Handling of “St” improved. French "Ste."s keep their gender, but I don’t know if that’s right for sorting purposes.
Hyphens in the doctored names are now replaced with spaces to eliminate their influence on the sort.
Dehyphenisation code replaced with the improvement suggested by Shane below (post #32). Apostrophes also zapped to exclude them from the sort.
I’ve just updated the script to improve the handling of “St” and to eliminate the influence of hyphens on the sort. Can you tell me how “Saint” and “Sainte” and their abbreviations are sorted in French? Are they treated as the same word or is the “e” significant in the sort?
I’d personally prefer not to make that assumption. SunnyFrinton’s bound to have a friend in Clacton called John-Paul Street-Porter-Lazenby-Smythe-Aardvark.
I saw your suggestion when I came back earlier to post something similar to eliminate all hyphens before the commas in the lines where they occur. But after updating the script, I realised it wasn’t doing what I thought it was, so I’ve now changed it back. :rolleyes: I’m still trying to think of an all-embracing regex for the job though.
-- Now replace any hyphens in the sorting names with spaces.
repeat
	set changesMade to (theString's replaceOccurrencesOfString:"(?m)^([^,-]+)-" withString:"$1 " options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	if changesMade = 0 then exit repeat
end repeat
I suspect that’s going to be quicker than searching every name individually.
We just have to hope that Brett D’Oliveira doesn’t join the rush to Clacton on Sea.
I’m sure house prices are skyrocketing in the Regency Lodge area as a result of this thread.
No D’Oliveiras in my own phone book. But a De’Ath and a D’Souza, as well as names like O’Leary, are listed alphabetically as though the apostrophes weren’t there.
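For anyone wanting to mimic that phone-book behaviour, here is a rough sketch of stripping apostrophes from the sorting part of each line only. It assumes the "sort key,,original line" layout produced by the earlier script and uses a made-up sample line. It is not the code from the updated script above, just one way the idea might look:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample line in the assumed "sort key,,original" layout.
set theString to current application's NSMutableString's stringWithString:"O'LearySean,,Sean O'Leary"
-- Each pass removes one straight or curly apostrophe that occurs before the
-- first comma of a line, so the text after ",," is left untouched.
repeat
	set changesMade to (theString's replaceOccurrencesOfString:"(?m)^([^,'’]*)['’]" withString:"$1" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	if changesMade = 0 then exit repeat
end repeat
return theString as text
```

As with the hyphen loop, the string has to be an NSMutableString, and the loop runs until a pass makes no further changes.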
Hmmm. Looking more closely, I see the BT Phone Book lists names beginning with "de " and "van " as if they began with “D” and “V”. No wonder Beethoven didn’t bother getting a BT land line. My own entry still shows my late mother’s initials nearly three years after she died.
In my dictionary I see both sort orders.
Sorted on the D:
De Bakey, De Bono, De Brosses, De Chirico, De Coster, De Foe, De Graaf, De Haas, De Havilland, De Kooning, De Laval, … Du Bos, Du Bourg, Du Camp, Du Caurroy, …
Hi, Shane. Reading a tip from a poster at Stack Overflow, it seems that there are two discrete Unicode composition forms, and the wrong one is sent to the shell by echo. iconv appears, on initial inspection, to give a passable sort order. I’m still on Mavericks, so I can’t tell if this differs from the output in your method.
Hi, Shane. I also had to convert a test file on my desktop, so I was mistaken to reference echo; unicode handling in the shell via the Mac file system is the problem. Below is an arcane treatise about the differing composed/decomposed forms by user mklement0. This issue would have been nearly impossible to correct without his/her post, as there is no visible difference.
The link above explains it even better. Apple wants C libraries and functions to return names in decomposed UTF-8 form. So why didn’t Apple choose a composed encoding? Well, the VFS fits better on HFS when both file systems use the same type of encoding, even if they differ in size. The VFS would be slower in composed form, which would slow down your computer. So in the end the “lack of support” is actually a good thing, and differences in character encoding are nothing new for someone who uses text interpreters (a Unix user). They’re only new for some AppleScript users.
OK, so it seems the problem with using sort (and other utilities) is not that they can’t handle Unicode, but that they can’t handle composed Unicode characters. But using iconv as you have is only a partial solution, because you’re returning subtly different characters.
You might be able to fix the issue by piping the result back through iconv, although I believe Macs use custom forms of composed/decomposed, and I don’t know if that makes any difference.
But this is the sort of thing that makes using higher-level stuff like AppleScript or Objective-C via AppleScriptObjC appealing…
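To make the composed/decomposed distinction a bit more concrete, here is a minimal AppleScriptObjC sketch. It builds an "é" from a plain "e" plus a combining accent, then asks NSString for the precomposed and decomposed canonical forms. It only illustrates the two forms; it isn't a fix for the shell sort problem, and whether these standard forms match what the file system or iconv produce is not something it tests:

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Build "é" in decomposed form: "e" followed by U+0301 (combining acute accent).
set decomposedText to (character id 101) & (character id 769)
set sourceString to current application's NSString's stringWithString:decomposedText

-- Ask Foundation for the precomposed and decomposed canonical forms.
set composedString to sourceString's precomposedStringWithCanonicalMapping()
set decomposedString to composedString's decomposedStringWithCanonicalMapping()

-- The two forms display identically, but their UTF-16 lengths differ.
set composedLength to (composedString's |length|()) as integer
set decomposedLength to (decomposedString's |length|()) as integer
return {composedLength, decomposedLength}
```

Normalising every string to one form before comparing or sorting should avoid the "no visible difference" trap described above.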