entire contents is slow, make it faster?

mleonti · May 16, 2009, 7:56am

Hi Chrys
I installed MacPorts and ran this successfully in the Terminal. It created a file at the home level with all the dups in it.

OUTF=rem-duplicates.sh; echo "#! /bin/sh" > $OUTF; find "/Users/ml/Desktop/Quarantined viruses/" "$@" -type f -print0 | xargs -0 -n1 gmd5sum | sort --key=1,32 | guniq -w 32 -d --all-repeated=separate | gsed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF; chmod a+x $OUTF; ls -l $OUTF

Contents of the file:
#! /bin/sh
#rm /Users/ml/Desktop/Quarantined\ viruses//1
#rm /Users/ml/Desktop/Quarantined\ viruses//3
#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 1/1

#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 1/CompareSscpt
#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 2/CompareS.scpt

#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 1/SleepX.cache
#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 2/SleepX.cache

#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 2/.DS_Store
#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 2/untitled\ folder/.DS_Store

#rm /Users/ml/Desktop/Quarantined\ viruses//Folder\ 1/Icon
#rm /Users/ml/Desktop/Quarantined\ viruses//Icon
#rm /Users/ml/Desktop/Quarantined\ viruses//untitled\ folder\ alias

I then tried to run it with do shell script and ran into trouble. Neither the \ character nor the "$@" was accepted.
I tried to fix it like this:

set theFolder to choose folder
set m to (ASCII character 92) as Unicode text
set m1 to (ASCII character 34) as Unicode text
set cmnd to ("OUTF=rem-duplicates.sh; echo " & quote & "#! /bin/sh" & quote & " > $OUTF; find " & quoted form of POSIX path of theFolder & " " & m1 & "$@" & m1 & " -type f -print0 | xargs -0 -n1 gmd5sum | sort --key=1,32 | guniq -w 32 -d --all-repeated=separate | gsed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/" & m & m & m & "1/g;s/(.+)/#rm " & m & "1/' >> $OUTF; chmod a+x $OUTF; ls -l $OUTF")
-- return cmnd
try
	set theFile to (do shell script cmnd password "control" with administrator privileges)
on error e
	return e
end try
theFile -- The script returned: "-rwxr-xr-x  1 root  admin  11 May 16 19:54 rem-duplicates.sh"

I tested the cmnd:
The returned cmnd had double the number of \ and \"$@\" instead of "$@", and the script did not fail.
"OUTF=rem-duplicates.sh; echo \"#! /bin/sh\" > $OUTF; find '/Users/ml/Desktop/Quarantined viruses/' \"$@\" -type f -print0 | xargs -0 -n1 gmd5sum | sort --key=1,32 | guniq -w 32 -d --all-repeated=separate | gsed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\\\\1/g;s/(.+)/#rm \\1/' >> $OUTF; chmod a+x $OUTF; ls -l $OUTF"

I researched it and these extra characters are supposed to be legal virtual characters that do not show in a dialog box, for example. I was not, however, able to make the cmnd work.

chrys · May 16, 2009, 12:39pm

mleonti:

Hi Chrys
I installed MacPorts and ran this successfully in the Terminal. It created a file at the home level with all the dups in it.

OUTF=rem-duplicates.sh; echo "#! /bin/sh" > $OUTF; find "/Users/ml/Desktop/Quarantined viruses/" "$@" -type f -print0 | xargs -0 -n1 gmd5sum | sort --key=1,32 | guniq -w 32 -d --all-repeated=separate | gsed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF; chmod a+x $OUTF; ls -l $OUTF

[...]

I then tried to run it with do shell script and ran into trouble. Neither the \ character nor the "$@" was accepted.
[The generated cmnd string] had double the number of \ and \"$@\" instead of "$@".

I researched it and these extra characters are supposed to be legal virtual characters that do not show in a dialog box, for example. I was not, however, able to make the cmnd work.

I am not sure I understand all the problems you are reporting, but I will try to clear up a couple of the possible misunderstandings that I see reflected in your descriptions.

AppleScript string syntax

While creating a programming language, the designers of the language’s syntax will, at some point, come to the problem of representing arbitrary string values directly in the source code of the language. There are some techniques that have been established for so long that many people may not even think that representing such inline string values is much of a problem. But it does bear some consideration or at least some reflection.

One of the most common of these techniques is to put double quote marks around the characters that are supposed to be part of the string (for the purposes of this description, I will dismiss the latent assumption that the character set used by the programming language source code and the character set used in strings manipulated by that language are identical). So, to represent the sequence comprising the first four uppercase letters of the English alphabet we write "ABCD". We can now represent strings that incorporate almost any character using this simple technique.

The problem comes when we want to represent a double quote character inside the string value. As you demonstrated in your code, you can rely on run-time string concatenation (maybe compile-time if the compiler is smart enough). In AppleScript you end up with a syntax that looks like this: "He said, " & quote & "Huh?" & quote & " with a growing scowl.". While effective, it is now nearly impossible to quickly, visually identify the text represented in that inline string value as a piece of written dialog. What we need is a way to use actual double quotes inside the string value.

Most languages employ the concept of an escape character. In most systems (AppleScript included), the escape character is the backslash, and it gives the character that follows it a different meaning from that which it would normally have. If, in the midst of an inline string value, we put the escape character before a double quote, that double quote character does not end the string value like it normally would. Instead, it introduces a double quote character into the represented string value itself. Effectively, when translating (compiling) the inline string value in the program text to an actual string value in the compiled program, the two-character sequence \" is compiled to the single character ". With this technique, we can write inline string values for dialog (or other string values that need embedded double quotes, like shell code) in a way that is a bit more recognisable: "He said, \"Huh?\" with a growing scowl." to represent He said, "Huh?" with a growing scowl.

So, this is why we use backslashes before double quotes to represent actual double quotes in inline string values in many programming languages that use double quotes themselves to mark the beginning and ending of an inline string value. But now, the escape character itself has a special meaning when inside inline string values. What if we need to represent an actual backslash in a string value? Again, not satisfied with concatenation, we decide to apply the same trick to the escape character itself. We escape the escape character. As with the double quote, this removes the special meaning of the character and allows us to encode the escape character into a string value: "He said, \"Huh?\" with a growing scowl. :-\\" to represent He said, "Huh?" with a growing scowl. :-\ (it is a scowling “smiley”?). And that is why you need doubled backslashes in AppleScript inline string values to represent a single backslash.

There are a few other characters that are affected by the escape character when used in an inline string value. "\n" yields a linefeed (LF; ASCII 10). "\r" yields a carriage return (CR; ASCII 13). "\t" yields a horizontal tab (HT; ASCII 9). The first two are generally nice to have so that you can write an inline string value that has a line-break at the end (or the beginning, or in the middle) without having to put the actual line-break in the program’s source text (inside the double quotes; although that is also allowed: I have recently posted a Perl-program-in-a-string that uses actual line breaks inside an inline string value to make the Perl code readable even though it is inside an AppleScript string). The last one (tab) is nice because horizontal tabs can sometimes be mistaken for one or more normal space characters. Frustratingly, (pre-Leopard) Script Editor rewrites the AppleScript source text to use the actual characters embedded in the inline string values, so all this visual distinction is lost (it also seems to rewrite embedded CRs into embedded LFs if it compiles the source again). Nevertheless, "She laughed and said \"It's a seki.\"\nHe said, \"Huh?\" with a growing scowl. :-\\" represents these two lines of text: She laughed and said "It's a seki." and He said, "Huh?" with a growing scowl. :-\ (assuming your environment treats LF as a line-break).

So, when you see (or write) an inline string value you have to take this escaping mechanism into account. The last bit of the puzzle is that the Result pane in Script Editor shows values as they would need to be written if they were a part of AppleScript code (i.e. inline values). So, when you see a string in the Result pane, you are looking at it in its “inline form” (which requires the escaping mechanism). That is why it looks like the backslashes have been doubled and why the quotes have grown leading backslashes. They have not been doubled (nor spawned buddies). Script Editor is just showing the string to you in the form you would need if you were to copy and paste it into an AppleScript program as an inline value. Since display dialog is a general purpose command, it makes no sense for it to display string values in this same way. It just uses the actual characters stored in the string, not its source code representation.

shell script vs. do shell script

The “OUTF” shell code you have been dealing with lately does not seem to me to have been designed to be typed directly into a shell. The key indicator is the use of the variable $@. That variable is used to refer to command line arguments passed to shell code that has been invoked as a distinct program (a “shell script”).

To create and use the code in a “shell script”, you would usually save the shell code into a file using a text editor (adding the special #!/bin/sh (“shbang”) line to the top), make that file executable (chmod a+x filename), and optionally put it in a directory that is in your usual PATH. The PATH is the list of directories that are named in the PATH environment variable. It is how the shell finds programs that are not specified with a full pathname. Just plain sed usually means /usr/bin/sed. But, that is a bit much to type all the time, thus the PATH. Also, what if Fred wants to run a shell script that uses sed but wants to use a different copy of sed? Fred can, if he changes the PATH env var in his shell.

So sometimes (sense 1) a “shell script” is an executable file that contains shell code (commands). Other times (sense 2) a “shell script” is just a collection of one or more shell commands in a single string (which you could type into a shell running in Terminal). This second meaning is the first-order meaning used in do shell script (although if you have a “shell script” (sense 1) you can invoke it like any other program in a do shell script command). To me, it looks like “OUTF” is a “shell script” in the first sense and you are giving it directly to do shell script, which only deals with the second sense. To be fair, both uses are often very close in behavior. Your script just happened to hit upon one of the differences (the $@ variable usually has no value when used in the second sense).

In the case of your “OUTF” code, you made an almost perfect adjustment when you put the paths at which you wanted to start your search alongside "$@". Slightly better would have been to replace "$@" outright with your paths. That is what the shell would have effectively done if you had stored the code in a “shell script” (sense 1) and invoked it as (e.g.) /path/to/find-dups.sh /path/to/start/the/search.
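A minimal sketch of the two senses, using a throwaway script (show-args.sh is a made-up name for illustration, not part of the “OUTF” code). In sense 1 the arguments on the command line become "$@"; in sense 2, with no arguments supplied, "$@" expands to nothing at all.

```shell
#!/bin/sh
# Sense 1: shell code saved in an executable file and invoked with arguments.
# (show-args.sh is a hypothetical name used only for this illustration.)
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/args-demo.XXXXXX")
cat > "$tmpdir/show-args.sh" <<'EOF'
#!/bin/sh
for arg in "$@"; do
    printf 'arg: %s\n' "$arg"
done
EOF
chmod a+x "$tmpdir/show-args.sh"
sense1=$("$tmpdir/show-args.sh" "/path/one" "/path with space")

# Sense 2: the same commands handed over as a single string, no arguments.
# "$@" expands to nothing, so the loop body never runs.
sense2=$(sh -c 'for arg in "$@"; do printf "arg: %s\n" "$arg"; done')

printf '%s\n' "$sense1"
printf 'sense 2 printed %s characters\n' "${#sense2}"
rm -rf "$tmpdir"
```

Because "$@" contributes nothing in the second case, leaving it in a do shell script string is harmless, which fits the observation that the find command still ran with the path placed next to it.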

Differences between Terminal’s shell and do shell script’s shell

Read all about it in Apple’s TN2065: do shell script in AppleScript.

Your new GNU coreutils and GNU sed (gsed, guniq, gmd5sum) are probably not in the PATH that do shell script uses (the MacPorts installation probably updated the one Terminal’s shells use by changing your .bash_profile file). See “My command works fine in Terminal, but[...]” in TN2065. A quick fix might be to put export PATH="$PATH":/opt/local/bin; at the start of your shell code (assuming you installed MacPorts with the default prefix of /opt/local).

Also, do shell script starts its shell in the root of your startup volume, not your home directory (like Terminal’s shell). Since the output filename (stored in the shell variable named OUTF) does not have a path, it will end up in the shell’s current directory. So when run from do shell script your output file will be in the root of your startup disk (iff you have permission to write there; otherwise you will get “Permission Denied” errors). A quick fix would be to have do shell script’s shell change its current directory to somewhere else before running the rest of the commands. You could do this by adding (e.g.) cd "$HOME"/Desktop; at the start of your shell code (either immediately before or after the PATH adjustment noted above is fine; both need to be before the other code though).
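As a rough sketch, the two quick fixes could sit at the top of the shell code like this. It assumes the default MacPorts prefix of /opt/local; the fallback to $HOME and the loop that reports where each tool was found are additions for illustration only.

```shell
#!/bin/sh
# Assumption: MacPorts was installed with its default prefix, /opt/local.
PATH="$PATH":/opt/local/bin
export PATH

# do shell script starts in the root of the startup volume; move somewhere
# writable first. Falls back to $HOME if there is no Desktop folder.
cd "$HOME"/Desktop 2>/dev/null || cd "$HOME" || echo "could not change directory" >&2

# Report which of the GNU tools the shell can now find.
for tool in gsed guniq gmd5sum; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: not found in PATH"
    fi
done
```

If any tool is reported as not found, the rest of the pipeline would write only the header line to the output file, since the commands after find would fail.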

Using the output file (rem-duplicates.sh)

My understanding of the output file is that you are supposed to load it into a text editor and uncomment (remove the leading hash/pound/grid/square (#) character from) all the lines for the files you want to delete (lines for each group of files with identical content are separated by blank lines). Then, once you have made your edits, save the file and run it. It is a (sense 1) “shell script”. Run it by typing /path/to/rem-duplicates.sh in a Terminal shell (or ./rem-duplicates.sh if the shell’s current directory is the location of the file). Be very sure you only uncomment lines for files you want to permanently delete. rm does not use Finder’s Trash. There is no good way to undelete files deleted with rm. Make a backup beforehand. Make a backup even if you do not delete any files. Everyone should make regular backups. Disks are cheap. Your time to recover lost files is probably not as cheap as an external disk (or even one for each machine; back them all up!).

The “Icon\r” files are not properly handled by that shell code. Do not bother uncommenting lines for them. Running any of those lines will generate an error message. Use Finder to delete them, if you really need to. If they are inside “package” folders that look like a single object in Finder, use the Show Package Contents context menu item to get into them. Or, if you don’t want the entire “package”, then just trash the package folder with Finder. These can be deleted from the shell, but I am done writing for now.
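For the curious, here is one possible way to match such names from the shell. This is only a sketch run against a scratch directory: it lists the files rather than deleting them, and the directory and variable names are invented for the example.

```shell
#!/bin/sh
# Scratch tree so nothing real is touched.
top=$(mktemp -d "${TMPDIR:-/tmp}/icon-demo.XXXXXX")
cr=$(printf '\r')          # a single carriage return (ASCII 13)
: > "$top/Icon$cr"         # stand-in for a Finder custom-icon file
: > "$top/Icon"            # ordinary file that must not match

# Count only names that are exactly "Icon" followed by a carriage return.
matches=$(find "$top" -type f -name "Icon$cr" | wc -l | tr -d ' ')
echo "matching files: $matches"
rm -rf "$top"
```

Only once a listing shows exactly the files you expect would it make sense to replace the counting pipeline with an action such as -exec rm {} +. The backup advice above applies here as well.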

There are probably other things about which I have not thought to write. Make a backup. Good Luck.

mleonti · May 16, 2009, 10:52pm

Dear Chrys,

I think I got the AppleScript string syntax part of your generous answer, pretty reasonably.
I did not have a chance to read doc TN2065 but I am sure I’ll learn a lot from it.

Re: the shell script vs. do shell script:
Yes, I did use the default installation of MacPorts and then I ran these two lines in Terminal.
sudo port install coreutils
sudo port install gsed
I knew that the cmnd ran (and consequently that my cmnd variable was accepted by AppleScript :)) because it returned a one line result:
I was root and file was to be created at the HD level on this date.
The output file (rem-duplicates.sh) was created correctly at the HD level, but empty, with just the header in it. The code itself did not work.
I am about to try your suggestions in the hope this code might work.

mleonti · May 17, 2009, 9:24am

Hi all

I ran

set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*' | perl -e " & quoted form of perlProgram password "control" with administrator privileges without altering line endings)
set flCnt to count theFiles

on my entire HD and I got the following AppleScript error:
find: /dev/fd/7: Not a directory
find: /net/localhost: Operation timed out
find: /net/broadcasthost: Operation timed out
perl(795) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can’t allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!

I am afraid I could not find the offending file. There is however a hidden .Dev alias at the root directory with the blue icon of a (networked?) hard disk. I doubleclicked it and it failed to find its original. I tried it on 3 different computers and I had the same result.

also I corrected

-- Trim last, empty line that "without altering line endings" produces. 
if  last item of theFiles is "" then ¬
	if flCnt > 1 then
		set theFiles to items 1 through -2 of theFiles
	else
		set theFiles to {}
	end if

to

-- Trim last, empty line that "without altering line endings" produces. 
if flCnt > 0 and last item of theFiles is "" then ¬
	if flCnt > 1 then
		set theFiles to items 1 through -2 of theFiles
	else
		set theFiles to {}
	end if

as it would fail on being pointed at an empty directory, which leaves theFiles as {}

chrys · May 17, 2009, 10:36pm

mleonti:

I ran
set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*' | perl -e " & quoted form of perlProgram password "control" with administrator privileges without altering line endings)
set flCnt to count theFiles
on my entire HD and I got the following AppleScript error:
find: /dev/fd/7: Not a directory
find: /net/localhost: Operation timed out
find: /net/broadcasthost: Operation timed out
perl(795) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can’t allocate region
*** set a breakpoint in malloc_error_break to debug
Out of memory!

I am afraid I could not find the offending file. There is however a hidden .Dev alias at the root directory with the blue icon of a (networked?) hard disk. I doubleclicked it and it failed to find its original. I tried it on 3 different computers and I had the same result.

The core problem is that something (perl?) is running out of memory. I have never seen perl run out of memory on my system when trying to use it in similar situations. You probably have many more files than I do (and also a different OS, since I am on Tiger).

I am pretty sure that the error messages from find (“Not a directory” and “Operation timed out”) are “red herrings” here. They have nothing to do with running out of memory or the script as a whole failing.

/dev/fd/ is a “magic” directory that works a bit like a special mirror for processes that look into it (opendir/readdir from it). When a particular instance of find looks into /dev/fd/, it sees the list of file descriptors that it has open at that instant. When a particular instance of ls looks into /dev/fd/, it sees the list of file descriptors that it has open at that instant. These “reflections” not only vary from process to process, but they vary over time as well (just as your reflection in a mirror would change as you don and doff an article of clothing).

It is as if find gazes into the mirror of /dev/fd/ and makes a note of the fact that /dev/fd/7 is a directory (it is wearing a red hat). It goes about its business and in the process closes fd 7 and opens a plain file that happens to land in the fd 7 slot (doffs the red hat and dons a blue hat). Later it tries to examine a directory named /dev/fd/7, which it noted earlier, but finds that the pathname now refers to a plain file instead of a directory (examines its hat and is surprised to find that it is now blue instead of red) and issues an error message (shouts an exclamation).

This behavior is a bit psychotic for a person who readily recognizes mirrors and realizes that their actions will affect not only their reality but also their appearance in a mirror. But find is not so sophisticated. It does not know that it is looking into a type of mirror when it looks into /dev/fd/, and therefore expects its contents to be stable over time when in fact the “reflection” is as dynamic as the interaction of the find program itself with the machine’s filesystem (quite dynamic).

If you are trying to scan a full volume with find, you should really use find -x pathname [predicates/actions]. That keeps it from crossing (represented as “x”) from one device (volume) to another (/dev/, /net/ (Leopard only?), usually all of the stuff under /Volumes/, and several other special locations that are all mounted from different “devices”).

If you really want to start your scan from the root of your boot volume and include all your mounted volumes (and other special filesystems), you can add something like -type d \( -path /dev -o -path /net \) -prune -o (that is shell syntax; use in an AppleScript string will need more backslashes) before your other predicates and actions to entirely skip these few really “magical” directories.
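Here is a small sketch of how that -prune clause behaves, run against a scratch tree rather than the real /dev and /net. The directory names skipme and alsoskip are stand-ins invented for the example.

```shell
#!/bin/sh
# Build a small throwaway tree; "skipme" and "alsoskip" stand in for /dev and /net.
top=$(mktemp -d "${TMPDIR:-/tmp}/prune-demo.XXXXXX")
mkdir -p "$top/keep" "$top/skipme/deeper" "$top/alsoskip"
: > "$top/keep/a.txt"
: > "$top/skipme/b.txt"
: > "$top/skipme/deeper/c.txt"
: > "$top/alsoskip/d.txt"

# Directories matching either -path test are pruned (never descended into);
# everything else falls through the trailing -o to the normal predicates.
found=$(find "$top" -type d \( -path "$top/skipme" -o -path "$top/alsoskip" \) -prune -o -type f -print | sort)
printf '%s\n' "$found"
rm -rf "$top"
```

Only keep/a.txt is printed. Inside an AppleScript inline string each backslash would have to be doubled, so \( becomes \\( and \) becomes \\).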

mleonti:

also I corrected

-- Trim last, empty line that "without altering line endings" produces. 
if  last item of theFiles is "" then ¬
	if flCnt > 1 then
		set theFiles to items 1 through -2 of theFiles
	else
		set theFiles to {}
	end if

to

-- Trim last, empty line that "without altering line endings" produces. 
if flCnt > 0 and last item of theFiles is "" then ¬
	if flCnt > 1 then
		set theFiles to items 1 through -2 of theFiles
	else
		set theFiles to {}
	end if

as it would fail on being pointed at an empty directory, which leaves theFiles as {}

Yes, good catch. I think I was not using “-type f” in my testing so I never had a result with no output (if you do not exclude directories then find will always at least print the directory in which the search was started).

mleonti · May 18, 2009, 12:20am

Hi Chrys,

I changed the code to:

set theFolder to choose folder
--set theFiles to paragraphs of (do shell script "find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*' | perl -e " & quoted form of perlProgram password "control" with administrator privileges without altering line endings)
--find -x pathname [predicates/actions]
set theFiles to paragraphs of (do shell script "find -x " & quoted form of POSIX path of theFolder & " [predicates/actions] | perl -e " & quoted form of perlProgram password "control" with administrator privileges without altering line endings)
set flCnt to count theFiles
--return (last item of theFiles is "") & flCnt & theFiles as text
-- Trim last, empty line that "without altering line endings" produces. 
if flCnt > 0 and last item of theFiles is "" then ¬
	if flCnt > 1 then
		set theFiles to items 1 through -2 of theFiles
		set flCnt to flCnt - 1 -- needed also
	else
		set theFiles to {}
	end if
if flCnt < 2 then -- the chosen folder is empty or contains 1 file only
	set fldrNm to "script name goes here?" -- placeholder for missing variable
	set theFolder to POSIX file (POSIX path of theFolder)
	display dialog fldrNm & " cannot continue" & return & return & "You have chosen folder:" & (theFolder as text) & " which contains:" & flCnt & " file/s" & return & return & fldrNm & " needs at least 2 files to be able to scan for duplicates" buttons {"OK"} with icon 0 default button 1
	return
end if

and I ran it on 1 whole HD. In Activity Monitor I watched perl’s memory climb to 1.46 GB before getting the error:
perl(490) malloc: *** mmap(size=16777216) failed (error code=12) *** error: can’t allocate region *** set a breakpoint in malloc_error_break to debug Out of memory!
System Memory
Free:1.63 GB
Wired: 561.61 MB
Inactive 22.66 MB
Used 2.36 GB
VM Size: 47.48 GB
Page ins: 122
Page outs and Swaps used both 0

My HD has 86.61 GB free and I have 4 GB of RAM installed.

mleonti · May 20, 2009, 5:26am

Hi all
I have just about finished with this.
When I thought I was home and dry I found a folder with a special character after its name that caused the find command to duplicate everything. In a folder called:
TheHideAway

I pasted it in here with the extra character being the white line above so if you copy both the name and the empty line you should be able to duplicate it.

As you see from the code below I am trying to build something really fast (a hybrid of all your kind suggestions).

set theFolder to choose folder
set theFiles to paragraphs of (do shell script "find -x " & quoted form of POSIX path of theFolder & " -type f ! -name '.*'" password "control" with administrator privileges without altering line endings)
set flCnt to (count theFiles)
if flCnt > 0 and last item of theFiles is "" then ¬
	if flCnt > 1 then
		set theFiles to items 1 through -2 of theFiles
		set flCnt to flCnt - 1
	else
		set theFiles to {}
	end if
set theFilesNames to the paragraphs of (do shell script "/usr/bin/find " & quoted form of POSIX path of theFolder & " -type f ! -name '.*' -print0 | /usr/bin/xargs -0 /usr/bin/basename -a")
set nameCheckList to theFilesNames
try
	repeat with i from 1 to flCnt
		set thisName to item i of theFilesNames as text
		set thisPathRef to item i of theFiles as text
	end repeat
on error e
	return flCnt & theFiles & return & return & theFilesNames
end try
(* The result:
{4, "/Users/ml/Desktop/TheHideAway", "//212.30.11.39 - All. clipping", "/Users/ml/Desktop/TheHideAway", "//212.30.11.39 - All. clipping copy", "
", "
", "212.30.11.39 - All. clipping", "212.30.11.39 - All. clipping copy"}*)

As you can see from the result above, find thinks we scanned 4 files instead of only the 2 inside the folder. The second find works correctly and reports only two files.
The loop fails as the numbers of items do not match.
Any idea how to fix this?
Thanks

chrys · May 21, 2009, 6:11am

Why run find twice? You can easily accomplish the functionality of basename in AppleScript with text item delimiters once you have the list of pathnames in AppleScript. Also, two runs of find (likewise, Finder’s entire contents of) are never guaranteed to return the same number of items. Some item could be added, removed or changed in between runs.

To me, it looks like your directory whose pathname starts with “/Users/ml/Desktop/TheHideAway” ends in a linefeed (or possibly a carriage return). If so, you will have to use -print0 (with the first find) and the associated do shell script ... without altering line endings and text item delimiters = ASCII character 0 technique (instead of paragraphs of) to handle these types of pathnames (I think I mentioned this before; such pathnames are the raison d’être of -print0).
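The effect is easy to reproduce in a scratch directory. This sketch (directory and file names invented for the example) counts the same two files once by lines and once by NUL terminators:

```shell
#!/bin/sh
top=$(mktemp -d "${TMPDIR:-/tmp}/print0-demo.XXXXXX")
# A directory whose name ends in a linefeed, holding two files.
odd="$top/TheHideAway
"
mkdir "$odd"
: > "$odd/one"
: > "$odd/two"

# Line-based counting: each pathname is split in two at the embedded linefeed.
by_lines=$(find "$top" -type f | wc -l | tr -d ' ')
# NUL-based counting: -print0 writes exactly one NUL terminator per pathname.
by_nuls=$(find "$top" -type f -print0 | tr -cd '\000' | wc -c | tr -d ' ')

echo "lines: $by_lines, NUL-terminated names: $by_nuls"
rm -rf "$top"
```

This prints lines: 4, NUL-terminated names: 2, the same 4-versus-2 mismatch as in the result above.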

mleonti · May 21, 2009, 8:46am

Hi Chrys,

I ran the two find commands one after another as I need the paths of the files and the names of the files in two separate lists. I found that using a loop in AppleScript to get the names is really slow compared to running another find in a shell script.
The lapse in between the two find commands is 4 seconds at the most (scanning nearly 900,000 files) so I am not too worried about file numbers changing in between. So far the two worked pretty solidly in all my tests.

At present I use the first do shell script find so that I can have a quick count of the files contained in a hard disk (I do not know of any faster way). The basename one gives me the names to check for dups with uniq -d.

I have changed the first find to -print0") without altering line endings) and it read the folder correctly, thanks.