Use shell script sort for csv file

Shane_Stanley · August 20, 2016, 9:26pm

Now add Lenny Bruce or Sergey Bubka, and see how they sort.

sort works by 8-bit byte value, and that just can’t work with Unicode.
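To make the byte-value point concrete, here is a minimal sketch in Python (chosen only so it runs anywhere; it is not the AppleScript used in this thread). It sorts three of the thread's names by surname, first by raw UTF-8 bytes and then by an accent-folded key. The `surname` and `fold` helpers are hypothetical, and the folding is only a rough stand-in for a real locale-aware comparison.

```python
import unicodedata

# "\u00e9" is the precomposed (single code point) form of e-acute.
names = ["Sergey Bubka", "Maurice B\u00e9jart", "Lenny Bruce"]

def surname(name):
    # Last space-separated word; good enough for these three names.
    return name.split(" ")[-1]

def fold(text):
    # Hypothetical helper: decompose, drop combining marks, lowercase.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

# Byte-value order: e-acute encodes as 0xC3 0xA9, above every ASCII letter,
# so the accented surname lands after "Bubka".
by_bytes = sorted(names, key=lambda n: surname(n).encode("utf-8"))

# Accent-folded order: the accented surname is compared as "bejart".
by_folded = sorted(names, key=lambda n: fold(surname(n)))

print(by_bytes)
print(by_folded)
```

In the byte-value sort the accented name ends up last; with the folded key it comes first, ahead of Bruce and Bubka.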

Shane_Stanley · August 21, 2016, 1:43am

And looking back, the code I posted earlier is needlessly more complicated than required. This should do the trick:

set theString to current application's NSString's stringWithString:peeps
-- get line break string used
set lineBreakRange to theString's rangeOfString:"[\\r\\n]+" options:(current application's NSRegularExpressionSearch)
set lineBreakString to theString's substringWithRange:lineBreakRange
-- search for first name and rest of name, and insert at the beginning of line followed by commas
set theString to theString's stringByReplacingOccurrencesOfString:"(?m)^([^ ,]+) *([^,]*)" withString:"$2$1,,$0" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
-- split into paragraphs and sort them
set theArray to theString's componentsSeparatedByString:lineBreakString
set theArray to theArray's sortedArrayUsingSelector:"localizedCaseInsensitiveCompare:"
-- rejoin sorted paragraphs
set theString to theArray's componentsJoinedByString:lineBreakString
-- remove sorting strings
set theString to (theString's stringByReplacingOccurrencesOfString:"(?m)^[^,]+,," withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}) as text

As a bonus, it’s quite a bit quicker.

SunnyFrinton · August 21, 2016, 8:53am

Wow, so many solutions and so little time! Thanks for all the responses to my sort problem. As a simple scripter, it will take me a few days to pick through all this advanced coding. I must now do some more research on all the illustrious persons that have lived so close to me in Clacton on Sea.

Nigel_Garvey · August 21, 2016, 11:12am

Here’s an expansion of Shane’s second script which ignores intermediate words not beginning with capitals and deals with “Mc” and “St” surnames in the BT Phone Book manner. The assumption is that SunnyFrinton hasn’t bothered to include people’s middle names in his file, and that therefore the second capital initial in each line is the beginning of the sequence on which to sort.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set peeps to "Adam de Collins,27 The Mill Apartments  Colchester  CO1 2QT,01206 863601,
Alan Armstrong,Ruby Dene  Clacton Road  Weeley Heath  CO16 9DN,01255 830942,
Alan le Irwin,53 Greenacres  Clacton on Sea  Essex  CO15 6LZ, ,
Alison Lightly,2 Upper Second Avenue  Frinton on Sea  CO13 9LL,01255 677407,
Alison O'Reilly,Small World  Coggeshall Road  Dedham  CO7 6ET,01206 323363,
Amanda Elliot,6 Manor Road  Great Holland  Essex  CO13 0JT,01255 674057,
Andrea Poulter,89 Rainham Way  Frinton on Sea  CO13 9NT,01255 673293,
Andrew Theobald,7 St.Andrews Place  Brightlingsea  CO7 0RH,01206 303000,
Angela Evans,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Paul Loup Sulitzer,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Paul-Loup Sulitzer,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Jean Paul Sartre,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Jean-Paul Sartre,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Bret Easton Ellis,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Linda Joy Singleton,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Frank Lloyd Wright,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Charles de Gaulle,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Antoine de Saint-Exupéry,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885,
Daphne du Maurier,7 Regency Lodge  Clacton on Sea  CO15 2AN,07540 433885
Maurice Béjart,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885,
Maève Ravenwood,8 Regency Lodge Clacton on Sea CO11 2AN,07540 433885,
Maëlys Moongoddess,9 Regency Lodge Clacton on Sea CO12 2AN,07540 433885,
Brièle Ironwood,10 Regency Lodge Clacton on Sea CO13 2AN,07540 433885,
Maurice Bôjart,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Angus MacTavish,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Doctor McCoy,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Norman St. John-Stevas,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Valeska Saab,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Sheila Staefel,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Charles-Augustin Sainte-Beuve,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Brett D'Oliveira,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885
Fred Dexter,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885"

set theString to current application's NSMutableString's stringWithString:peeps
-- get line break string used
set lineBreakRange to theString's rangeOfString:"[\\r\\n]+" options:(current application's NSRegularExpressionSearch)
set lineBreakString to theString's substringWithRange:lineBreakRange
-- search for first name and rest of name, and insert at the beginning of line followed by commas
theString's replaceOccurrencesOfString:"(?m)(^.(?:[^[:upper:]]|(?<! )[[:upper:]])++)([^,]+).++$" withString:"$2 $1,,$0" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
-- Expand any "Mc" now at the beginning of a line to "Mac" if followed by a capital letter.
theString's replaceOccurrencesOfString:"(?m)^Mc(?=[[:upper:]])" withString:"Mac" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
-- Similarly expand "St" or "St." to "Saint".
theString's replaceOccurrencesOfString:"(?m)^St(e)?[. ]? " withString:"Saint$1 " options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
-- Now replace any hyphens in the sorting names with spaces and apostrophes with nothing.
repeat
	set wholeRange to {0, theString's |length|()}
	set changesMade to (theString's replaceOccurrencesOfString:"(?m)^([^,-]+)-" withString:"$1 " options:(current application's NSRegularExpressionSearch) range:(wholeRange)) -- No change to string length.
	set changesMade to changesMade + (theString's replaceOccurrencesOfString:"(?m)^([^,'']+)['']" withString:"$1" options:(current application's NSRegularExpressionSearch) range:(wholeRange)) -- Possible shortening.
	if changesMade = 0 then exit repeat
end repeat
-- split into paragraphs and sort them
set theArray to theString's componentsSeparatedByString:lineBreakString
set theArray to theArray's sortedArrayUsingSelector:"localizedCaseInsensitiveCompare:"
-- rejoin sorted paragraphs
set theString to theArray's componentsJoinedByString:lineBreakString
-- remove sorting strings
set theString to (theString's stringByReplacingOccurrencesOfString:"(?m)^[^,]+,," withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}) as text

Edits: Handling of “St” improved. French "Ste."s keep their gender, but I don’t know if that’s right for sorting purposes.
Hyphens in the doctored names are now replaced with spaces to eliminate their influence on the sort.
Dehyphenisation code replaced with the improvement suggested by Shane below (post #32). Apostrophes also zapped to exclude them from the sort.
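For readers who don't use AppleScriptObjC, the sorting rules described in this post can be sketched roughly in Python. This is an approximation written for illustration, not a port of the script above: `sort_key` and `sort_lines` are hypothetical names, the "second capital initial" test is simplified to "second word starting with a capital", and accented characters are not given any locale-aware treatment.

```python
import re

def sort_key(line):
    """Build a doctored sort name for one CSV line (hypothetical helper)."""
    name = line.split(",", 1)[0]
    words = name.split(" ")
    # The surname is taken to start at the second word with a capital initial,
    # so lowercase particles such as "de" or "le" stay with the given name.
    start = next((i for i, w in enumerate(words) if i > 0 and w[:1].isupper()),
                 len(words))
    surname = " ".join(words[start:])
    given = " ".join(words[:start])
    # "Mc" followed by a capital sorts as "Mac".
    surname = re.sub(r"^Mc(?=[A-Z])", "Mac", surname)
    # "St", "St." or "Ste" followed by a space sorts as "Saint" or "Sainte".
    surname = re.sub(r"^St(e)?[. ]? ",
                     lambda m: "Saint" + (m.group(1) or "") + " ", surname)
    key = (surname + " " + given).strip()
    # Hyphens count as spaces; straight and curly apostrophes are ignored.
    key = key.replace("-", " ").replace("'", "").replace("\u2019", "")
    return key.casefold()

def sort_lines(lines):
    # Sort whole lines by their doctored names; the lines themselves are unchanged.
    return sorted(lines, key=sort_key)

lines = [
    "Sheila Staefel,7 Regency Lodge",
    "Norman St. John-Stevas,7 Regency Lodge",
    "Doctor McCoy,7 Regency Lodge",
    "Charles de Gaulle,7 Regency Lodge",
    "Brett D'Oliveira,7 Regency Lodge",
    "Angus MacTavish,7 Regency Lodge",
    "Fred Dexter,7 Regency Lodge",
]

for line in sort_lines(lines):
    print(line)
```

Unlike the script above, this sketch keeps the key separate from the line instead of prefixing it and stripping it afterwards; the resulting order should be the same for these sample names.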

Yvan_Koenig · August 21, 2016, 11:50am

Hello Nigel

As the official spellings are Paul-Loup Sulitzer and Jean-Paul Sartre, it seems that your result is perfect.

I never imagined that such a result was reachable.

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) Sunday 21 August 2016 13:50:40

Nigel_Garvey · August 21, 2016, 12:56pm

Thanks, Yvan.

I’ve just updated the script to improve the handling of “St” and to eliminate the influence of hyphens on the sort. Can you tell me how “Saint” and “Sainte” and their abbreviations are sorted in French? Are they treated as the same word or is the “e” significant in the sort?

Shane_Stanley · August 21, 2016, 1:17pm

Nigel,

Can we assume a maximum of one hyphen, and thus use “(?m)^[^-,]+-” to eliminate that repeat loop?

Nigel_Garvey · August 21, 2016, 2:08pm

Hi Shane.

I’d personally prefer not to make that assumption. SunnyFrinton’s bound to have a friend in Clacton called John-Paul Street-Porter-Lazenby-Smythe-Aardvark.

I saw your suggestion when I came back earlier to post something similar to eliminate all hyphens before the commas in the lines where they occur. But after updating the script, I realised it wasn’t doing what I thought it was, so I’ve now changed it back. :rolleyes: I’m still trying to think of an all-embracing regex for the job though.

Yvan_Koenig · August 21, 2016, 2:28pm

Hi Shane

A quick search returned:
https://fr.wikipedia.org/wiki/Charles-Augustin_Sainte-Beuve

The full name is:
Charles-Augustin Sainte-Beuve

For him too, I don’t know the address of his grave.

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) Sunday 21 August 2016 16:28:42

Shane_Stanley · August 22, 2016, 12:17am

OK:

-- Now replace any hyphens in the sorting names with spaces.
repeat
	set changesMade to (theString's replaceOccurrencesOfString:"(?m)^([^,-]+)-" withString:"$1 " options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	if changesMade = 0 then exit repeat
end repeat

I suspect that’s going to be quicker than searching every name individually.

We just have to hope that Brett D’Oliveira doesn’t join the rush to Clacton on Sea.

Shane_Stanley · August 22, 2016, 12:19am

A Sainte and two hyphens in one entry! Nigel will be pleased.

Nigel_Garvey · August 22, 2016, 11:20am

Shane Stanley wrote:

-- Now replace any hyphens in the sorting names with spaces.
repeat
	set changesMade to (theString's replaceOccurrencesOfString:"(?m)^([^,-]+)-" withString:"$1 " options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()})
	if changesMade = 0 then exit repeat
end repeat

I suspect that’s going to be quicker than searching every name individually.

It is.

I’m sure house prices are skyrocketing in the Regency Lodge area as a result of this thread.

No D’Oliveiras in my own phone book. But a De’Ath and a D’Souza, as well as names like O’Leary, are listed alphabetically as though the apostrophes weren’t there.

Hmmm. Looking more closely, I see the BT Phone Book lists names beginning with "de " and "van " as if they began with “D” and “V”. No wonder Beethoven didn’t bother getting a BT land line. My own entry still shows my late mother’s initials nearly three years after she died.

Yvan_Koenig · August 22, 2016, 12:28pm

Hello Nigel

In my dictionary I see both sort orders.
Sorted under D:
De Bakey, De Bono, De Brosses, De Chirico, De Coster, De Foe, De Graaf, De Haas, De Havilland, De Kooning, De Laval, ... Du Bos, Du Bourg, Du Camp, Du Caurroy, ...

Sorted under the name itself:
De Bèze, Du Bellay, De Fontenelle, De Gaulle, De Las-Casas, De La Mettrie, De Saint-Exupéry, De Saint-Simon, Du Deffand
And, as this thread is becoming sadistic, I’ll add that Donatien Alphonse François De Sade is sorted under S (when he is not in the “hell” of libraries).

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) Monday 22 August 2016 14:03:07

Marc_Anthony · August 22, 2016, 4:01pm

Hi, Shane. Reading a tip from a poster at Stack Overflow, it seems that there are two discrete Unicode composition forms, and the wrong one is sent to the shell by echo. iconv appears, on initial inspection, to give a passable sort order. I’m still on Mavericks, so I can’t tell if this differs from the output in your method.

set peeps to "Alan le Irwin,53 Greenacres Clacton on Sea Essex CO15 6LZ,
Charles de Gaulle
Maurice Bêjart,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885,
Maurice Bejart,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885,
Maurice Bôjart,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885,
Sergey Bubka
Lenny Bruce
Maurice Béjart,7 Regency Lodge Clacton on Sea CO10 2AN,07540 433885,
? Bezja"



do shell script "echo " & (peeps)'s quoted form & " | iconv -t UTF8-MAC  | sort -dfk 2,50"

EDIT: Yield →

Yvan_Koenig · August 22, 2016, 4:50pm

Why play with matches when ASObjC does a clean job with the encodings?

Yvan KOENIG running El Capitan 10.11.6 in French (VALLAURIS, France) Monday 22 August 2016 18:50:13

Shane_Stanley · August 22, 2016, 11:55pm

The code runs fine on Mavericks; you just need to change the version number to 2.3 and save it as a script library.

Marc_Anthony · August 23, 2016, 1:14pm

Hi, Shane. I also had to convert a test file on my desktop, so I was mistaken to reference echo; Unicode handling in the shell via the Mac file system is the problem. Below is an arcane treatise about the differing composed/decomposed forms by user mklement0. This issue would have been nearly impossible to correct without his/her post, as there is no visible difference.

http://stackoverflow.com/questions/23219482/bash-ps-grep-for-process-with-umlaut-os-x

DJ_Bazzie_Wazzie · August 23, 2016, 2:02pm

https://developer.apple.com/library/mac/qa/qa1173/_index.html

The link above explains it even better. Apple wants C library functions to return names in decomposed UTF-8 form. So why didn’t Apple choose a composed encoding? Well, the VFS fits better on HFS when both file systems use the same type of encoding, even if they differ in size. The VFS would be slower in composed form, and that would slow down your computer. So in the end the “lack of support” is actually a good thing, and a difference in character encoding is nothing new for someone who uses text interpreters (a Unix user). It’s only new for some AppleScript users.

Shane_Stanley · August 23, 2016, 11:24pm

OK, so it seems the problem with using sort (and other utilities) is not that they can’t handle Unicode, but that they can’t handle composed Unicode characters. But using iconv as you have is only a partial solution, because you’re returning subtly different characters.

Let’s say you have some text in a frame in an InDesign file, and you get the text, sort it, and put it back in the frame. Then later you decide to make Maurice Béjart italic throughout the document. You open the Find dialog, type in the name, and search – but it won’t find it in the sorted version because it’s looking for the composed é. (I think you can argue this is a bug in InDesign, but I’ve seen composition problems elsewhere too.)

You might be able to fix the issue by piping the result back through iconv, although I believe Macs use custom forms of composed/decomposed, and I don’t know if that makes any difference.
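The search problem and the round trip can be illustrated with a small Python sketch (used here only because it runs anywhere; the thread itself uses AppleScript and iconv). It relies on the standard Unicode NFC and NFD forms from Python's `unicodedata` module. Apple's UTF8-MAC encoding is a variant of the decomposed form, so this sketch does not settle whether an iconv round trip behaves identically for every character.

```python
import unicodedata

# Composed form: e-acute is the single code point U+00E9.
composed = "Maurice B\u00e9jart"
# Decomposed form: plain "e" followed by the combining acute accent U+0301.
decomposed = unicodedata.normalize("NFD", composed)

# The two strings look the same on screen but are not equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 14 15

# A search for the composed name fails in text that holds the decomposed one.
sorted_text = "Lenny Bruce\n" + decomposed + "\nSergey Bubka"
print(composed in sorted_text)         # False

# Recomposing the sorted text makes the search succeed again.
recomposed = unicodedata.normalize("NFC", sorted_text)
print(composed in recomposed)          # True
```

Normalizing both the search term and the text to the same form before comparing is the general way around this kind of mismatch.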

But this is the sort of thing that makes using higher-level stuff like AppleScript or Objective-C via AppleScriptObjC appealing…

ccstone · August 24, 2016, 12:26am

Well done Marc!

I was sure there was a way to do this, but I hadn’t tried iconv.

-Chris