sorting paragraphs returned by a find shell script

mleonti · May 16, 2009, 12:24am

Hi Nigel,

Spot on. I changed the file count to after the delimiter reset and it worked, thank you.

set AppleScript's text item delimiters to "/" as Unicode text
set flCnt to (count theFiles)

I think I know why I was not able to use the external CustomQsort in the script folder.
I was sending four parameters to the script instead of the three it expected. My third one was -1, which made it a non-starter.

-- considering numeric strings
-- tell lib to CustomQsort(theFiles, 1, -1, byLastTextItem) -- failed
-- end considering
try
	tell lib to CustomQsort(theFiles, 1, flCnt)
	return lib as text
on error
	Qsort(theFiles, 1, flCnt)
end try

Maybe there is another, more complex CustomQsort requiring the four parameters, but not mine :).
The fourth parameter was there for the script of the same name.
Did I not access it at all?

script byLastTextItem -- Handlers for customising the sort.
	on isGreater(a, b)
		(text item -1 of a > text item -1 of b)
	end isGreater
	on isLess(a, b)
		(text item -1 of a < text item -1 of b)
	end isLess
	on swap(a, b)
	end swap
	on shift(a, b)
	end shift
end script

Do I need it? Does it make the sort faster?

I wonder if it makes any difference which method is used. I feel that a handler inside the script would be faster than accessing an external file.
This was a very good way for me to learn how to handle external files, and I feel my grasp of AppleScript is growing stronger with your and others’ help.

I am now testing Qsort for speed.
First test:
“File sorted: 39225 It took: Hrs:0 Mts:11 Sec:51”

Nigel_Garvey · May 16, 2009, 12:31pm

mleonti:

Maybe there is another, more complex CustomQsort requiring the four parameters, but not mine :).
The fourth parameter was there for the script of the same name.
Did I not access it at all?
script byLastTextItem -- Handlers for customising the sort.
	on isGreater(a, b)
		(text item -1 of a > text item -1 of b)
	end isGreater
	on isLess(a, b)
		(text item -1 of a < text item -1 of b)
	end isLess
	on swap(a, b)
	end swap
	on shift(a, b)
	end shift
end script
Do I need it? Does it make the sort faster?

It looks as though you may have saved the wrong handler! CustomQsort (as opposed to Qsort) requires four parameters: the list, the index of the start of the sort range, the index of the end of the sort range, and a script object containing handlers that customise the sort. CustomQsort can take negative indices for the range parameters.

Qsort (as it appears in your script in post #18 above) works straightforwardly by comparing the items in the list and arranging them according to which have the greater or lesser values. In this case, the items in the list are the paths to your files, so it’s the whole paths that are compared.

However, you say you want the paths sorted according to the file names, so you only want to compare the names at the ends of the paths, not the whole paths. CustomQsort, rather than comparing items itself, passes them to either the isGreater() or the isLess() handler in the supplied script object and gets back a verdict (true or false) on whether or not item ‘a’ has a greater or lesser value than item ‘b’. In the script object ‘byLastTextItem’ above, isGreater() and isLess() compare the last text items in each pair of paths passed to them. In the main script, AppleScript’s text item delimiters are set to “/” before CustomQsort is invoked, so the last text item in each path will be whatever comes after the last “/” in that path, i.e. the file name.

(The swap() and shift() handlers are for other ways in which the sort could be customised. They have to exist to match the calls in CustomQsort, but are not used here and are left blank.)

It only takes a minute fraction of a second to ‘load script’ an external script file. The external code is loaded into the main script as a script object (the value of the variable I’ve called ‘lib’), so there’s no further traffic between it and the file until the next time the script’s run.

CustomQsort is (I believe) the fastest “pure AppleScript” way to achieve this kind of sort, but Chris’s shell and Perl scripts are no doubt faster.

Fenton · May 17, 2009, 4:52pm

It seems the only problem with an earlier post was that it had the entire path instead of just the file name. Why not just get the file name? I’m no Unix expert, but this seems to work:


set theFolder to choose folder
set theFiles to (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name '\\.*' -print0 | /usr/bin/xargs -0 /usr/bin/basename | /usr/bin/sort -df")

chrys · May 17, 2009, 9:27pm

Fenton:

It seems the only problem with an earlier post was that it had the entire path instead of just the file name. Why not just get the file name? I’m no Unix expert, but this seems to work:
set theFolder to choose folder
set theFiles to (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name '\\.*' -print0 | /usr/bin/xargs -0 /usr/bin/basename | /usr/bin/sort -df")

That does generate a sorted list of filenames, but it loses the associated pathnames. I think the full pathname is of interest to the original poster.

It seems the original poster’s goal is not really sorting but finding “duplicate files” (which might have been the result of merging the disks from multiple systems into one disk). Such output would tell him which filenames were duplicated (based on repeated lines in the output; uniq -c could help with that analysis), but it would not tell him where those files are in the disk hierarchy.

There are some small, potential issues latent in that shell code.

First, after xargs the benefit of NUL-terminated “lines” is lost. Mostly this means that a filename with embedded newlines will be split into two (or more) lines as far as sort is concerned.

Second, there may be some latent surprises with how basename is being used with xargs. By default, xargs gives its subcommands as many arguments as possible, not one at a time. According to the documentation of my version of basename, this is an under-specified situation. In operation, basename seems to handle it by switching to “-a” mode when it is given more than two arguments. This mostly avoids the potential problem. So, as long as find does not produce exactly two outputs (NUL-terminated “lines” in this case), this usage of basename seems like it will work out OK. If find produces exactly two outputs, the second one will be used as the “suffix” argument to basename and will be stripped from the first if and only if the first ends with that suffix (very unlikely in the case of find output). Effectively, the second output from find (passed through xargs) will be dropped by basename. Using basename -a would completely fix this second problem.

Third, while not harmful in this case, the backslash before the dot in the argument to -name is not needed. Technically it prevents special interpretation of the dot character, but the dot has no special meaning in this context: the argument is a “filename pattern” (like a “glob”, where only []*? are special), not a regular expression (where the dot is special).
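
A minimal sketch of how the second and third points could be addressed, assuming a basename that accepts -a and a find that accepts the "-exec ... {} +" form. The function name list_names is hypothetical and only there for illustration. It does not address the first point: a name with an embedded newline would still reach sort as two lines.

```shell
#!/bin/sh
# list_names DIR (hypothetical helper): print the names of all non-hidden
# regular files below DIR, sorted in case-insensitive dictionary order.
#
# "-exec ... {} +" batches paths much as xargs does. With -a, basename
# treats every argument as a path, so a batch of exactly two paths is not
# misread as "string suffix". The dot in the -name pattern is left unescaped.
list_names() {
    find "$1" -type f ! -name '.*' -exec basename -a {} + | sort -df
}
```

Called as list_names "/path/to/folder", it should behave like the original pipeline in the common case while also handling a folder that contains exactly two files.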

mleonti · May 17, 2009, 11:28pm

Thank you Fenton,

I appreciate your help.
What I am attempting to do is create a list of all duplicate files and their paths on a hard disk with more than 500,000 files on it (just like uniq -d does, but with the paths). If they could be sorted by name so that the duplicates come one under the other, then the job of finding them and discarding the copies I do not need would be manageable.
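
One way this could be sketched in the shell, untested on a disk of that size: put each file name in front of its path, sort so that identical names sit on adjacent lines, then print only the paths whose name occurs more than once. The function name list_duplicates is hypothetical, and the sketch assumes that no file name contains a tab or a newline. Names are compared exactly, so "Report.txt" and "report.txt" count as different.

```shell
#!/bin/sh
# list_duplicates DIR (hypothetical helper): print the full path of every
# non-hidden regular file below DIR whose name occurs more than once,
# with files of the same name on consecutive lines.
#
# Stage 1 (find): list the paths of the files.
# Stage 2 (awk):  prefix each path with its last "/"-separated field (the name).
# Stage 3 (sort): byte-order sort, which places identical names together.
# Stage 4 (awk):  hold back the first path of each name and print it only
#                 when a second path with the same name turns up.
list_duplicates() {
    find "$1" -type f ! -name '.*' |
        awk -F/ '{ print $NF "\t" $0 }' |
        LC_ALL=C sort |
        awk -F'\t' '
            $1 == prev_name {
                if (held != "") { print held; held = "" }
                print $2
                next
            }
            { prev_name = $1; held = $2 }
        '
}
```

The output contains paths only, so it can be read straight into AppleScript as paragraphs of the do shell script result.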

Thank you, chrys, for your explanations.