File Comparison Script - help

hi,
i’m trying to write a script that finds duplicate images. the files are spread over a lot of folders and subfolders, and there are over 50 thousand of them (with more coming). i want the script to check for dups regardless of file name, so i thought of using the ‘cmp’ command in the terminal. the thing is that i have no idea how to tell the script to use all the files (the ones in all the folders). and, to optimize, there has to be some way to tell it not to do the same comparison twice (if it checks file A against file B, then it shouldn’t have to check file B against A).
a friend told me that Java has some “node structure” in the file system… but i don’t know Java, and i’m a newbie at AppleScript.
when i searched google and the forums for “file system navigation” i ended up with a lot of ‘tutorials’ on navigating through directories… useless.
any ideas, or pointers?

thank you very much.

marto.

I don’t think I’d use cmp because that would require far too many shell calls. I vote for md5.

When I tried this with this simple example (just comparing the first pic’s md5 signature to the rest of them), it found the duplicate file (which I had renamed).


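-- list the files in the "Pix2" folder on the desktop
tell application "Finder" to set P to files of alias ((path to desktop folder as text) & "Pix2:") as alias list
-- collect the md5 signature of each file (the last word of md5's output)
set M to {}
repeat with aP in P
	set M's end to last word of (do shell script "md5 " & quoted form of POSIX path of aP)
end repeat
-- compare the first file's signature against all the others
set I to {}
set P1 to item 1 of M
set Pa to item 1 of P
repeat with k from 2 to count M
	tell item k of M to if it = P1 then set I's end to item k of P
end repeat

-- report the first match, or say so when there is none (avoids "Can't get item 1 of {}")
if I is {} then
	display dialog "No duplicate of the first file was found."
else
	set N1 to name of (info for Pa)
	set N2 to name of (info for item 1 of I)
	display dialog N1 & " is the same as " & N2
end if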

if i understood you correctly, that script stores the md5 signatures of all the files in M, and then compares the first with the rest, then the second, then the third… and so on. that’s great!

two questions:
how can i make the search for the files that go into M recurse into subdirectories?
and, what’s the maximum size of M? because with over 50,000 files, each with its signature stored in M, that could be a problem…

thanks for everything!

marto.

Marto;

The script bit I showed just compares the first to the rest. It would have to be modified to compare each to all others. With 50,000 files to compare, even storing the list of aliases to all of them could be problematic. On my machine, computing the md5 signatures for 1500 pictures took about 30 seconds. It may well be that a special shell script would be the way to go.

Another approach that occurs to me is to do a Qsort (quick sort algorithm) of the signatures because then duplicates would end up next to each other. At this point, however, I don’t know how to carry the sorting order over into the picture alias file list at the same time with a Qsort so that I’d know which photo was a duplicate when I found a signature that was. Given that I could do that, then I’d just have to compare adjacent signatures or perhaps a few ahead of each one and then remove both the signature and the picture aliases from their respective lists (keeping a list of all the duplicates) and continue.

Because of the huge size of this task, it’s definitely worth mulling over for a while. Let some others think about it too.
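
One way to carry the file along with its signature through the sort is to sort (signature, path) pairs instead of two separate lists, so that duplicates end up next to each other with their paths still attached. The following is only a minimal sketch, written in Python rather than AppleScript; the function name `duplicates_by_sorting` and the sample records are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def duplicates_by_sorting(records):
    """Group paths that share a signature.

    records: iterable of (signature, path) pairs.
    Returns a list of lists; each inner list holds the paths of one set of duplicates.
    """
    # Sorting by signature places identical signatures in adjacent positions.
    ordered = sorted(records, key=itemgetter(0))
    groups = []
    # groupby walks the sorted list once and collects each run of equal signatures.
    for _signature, run in groupby(ordered, key=itemgetter(0)):
        paths = [path for _sig, path in run]
        if len(paths) > 1:
            groups.append(paths)
    return groups


# Hypothetical signatures and file names, for illustration only.
sample = [("b2", "beach.jpg"), ("a1", "cat.jpg"), ("b2", "beach copy.jpg")]
print(duplicates_by_sorting(sample))
```

Because each path travels with its signature, nothing has to be looked up again once a duplicate signature is found.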

mmm yes… i knew the size of the job would be a problem…
i hope someone can solve this; as a newbie, i’m stuck.

but i insist: how can i make it work as it is (with the proper modifications), recursing into directories? it would have to search directories and subdirectories… how can i do that?
( that will also be useful to know, so it’s never bad to ask :wink: )

thank you very much Adam.

Marto.

Hi Marto;

This lists every file in the chosen folder no matter how deep:

tell application "Finder" to set FL to files of entire contents of (choose folder) as alias list

but it took 26 seconds to list 17,000 document files, so it might take about 90 seconds to list your images.

This is much faster, but produces one long string, where each folder is listed and then, under it, its contents by name. It happens in a flash, but then you have the task of parsing out the file locations. Try it on a folder and look at the result.

set F to quoted form of POSIX path of (choose folder)
set L to do shell script "ls -R " & F

ls is the unix “list” command, and -R means recursively, i.e., dig down through the subfolders. By default the entries are sorted by name; making it -RS sorts them by size.
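
For comparison, here is a sketch of the same recursive listing done outside AppleScript. Python's `os.walk` hands back each folder together with the names inside it, so full paths can be built directly and the `ls -R` output never has to be parsed. `list_files` is a hypothetical helper name, not something from the scripts above.

```python
import os


def list_files(root):
    """Yield the full path of every file under root, at any depth."""
    # os.walk visits root and every subfolder, however deeply nested.
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(folder, name)
```

Because this is a generator, the paths are produced one at a time, so a very large tree does not have to be held in memory as one long string.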

wow! that’s great! but i’ve been running into some problems…
with the second script, the one with ‘do shell script’, i wouldn’t know how to parse the result.

NO PROBLEMS WHATSOEVER. sorry for that hasty post.

this script is only going to be run every now and then… so i really don’t care if it takes several minutes to finish, as long as it is accurate and allows me to get rid of the duplicates. i really appreciate your help, i’m learning a lot! thanks!!!

Marto.

Tell me more about the second one - when do you get that message; what did you do?

I have problems with either of them when I try them on very large directories. In the first, I get a memory error, and in the second, I have problems when I try to process the result in an AppleScript - it runs out of memory. A string can only be so long (can’t remember the number).

The answer may be to do this in parts, storing file paths and checksums in an external file and then processing that. Since that will involve a lot of disk access, it will be slooow. After you’ve done your initial run, however, you just have to eliminate duplicates as you add files to the archive - you only have to check that the new ones aren’t already there.
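
The external-file idea could look roughly like the sketch below, again in Python rather than AppleScript. The tab-separated "signature, path" line format and the helper names `md5_of_file`, `load_index` and `add_if_new` are assumptions made for this example. It also assumes that no path contains a tab or newline character.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_index(index_path):
    """Read 'signature<TAB>path' lines into a dict mapping signature -> first path seen."""
    index = {}
    try:
        with open(index_path, encoding="utf-8") as handle:
            for line in handle:
                signature, _tab, path = line.rstrip("\n").partition("\t")
                if signature:
                    index.setdefault(signature, path)
    except FileNotFoundError:
        pass  # no index yet: start with an empty one
    return index


def add_if_new(index_path, index, new_path):
    """Return the already-indexed path if new_path is a duplicate; otherwise record it and return None."""
    signature = md5_of_file(new_path)
    if signature in index:
        return index[signature]
    index[signature] = new_path
    with open(index_path, "a", encoding="utf-8") as handle:
        handle.write(f"{signature}\t{new_path}\n")
    return None
```

After the initial run, each new file costs one checksum and one dictionary lookup, which matches the point that only the new files need checking against what is already in the archive.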

with the second one, i did this:

tell application "Finder" to set FL to files of entire contents of (choose folder) as alias list
display dialog FL

and completely forgot to put ‘as string’. sorry. :stuck_out_tongue:

so, if i store them in a .txt, and then read the lines back, that could work? slowly, but work?

I’m presuming that you have lots of subfolders and not a flat file of 50,000 photos. Given memory problems, I’m suggesting that we do one internal folder at a time, write those results to a file, do the next, etc., until we’ve got a complete file of paths and message digests. The AppleScript “read” instruction includes the ability to read “from a to b” (see Nigel Garvey’s tutorial for everything you’ll ever need to know about reading and writing from AppleScript).

Doing the initial pass (the looong, slooow one) would be to take a section (one internal folder full) and compare its MDs to all the others section by section recording duplicate pairs. [There is probably a good shell method for doing that, but I’m not a shell wizard.] After that initial sort, any new files would be checked before they were entered.

How many files would typically be found in one of your internal folders? What’s the structure of all this?

well, this is as far as i got. the thing is that it does nothing… and there IS a duplicate in the folder i choose.


on run
	try
		tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list
		
		--this will store the md5 signature of every file in 'signature'
		set signature to {}
		repeat with inputFile in inputFolder
			set signature's end to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
		end repeat
		display dialog signature as string
		
		set dupFile to {}
		
		repeat with x from 1 to number of items in signature
			set itemSignature to item x of signature
			set inputItem to item x of inputFolder
			repeat with k from (x + 1) to count signature
				if (item k of signature = itemSignature) then
					set dupFile's end to item k of inputFolder
					set N1 to name of (info for inputItem)
					set N2 to name of (info for item 1 of dupFile)
					my writeDuplicates(N1, N2)
				end if
			end repeat
			set x to (x + 1)
		end repeat
		
	on error the error_message number the error_number
		set the error_text to the error_number & the error_message
		my write_error_log(the error_text)
	end try
end run


on writeDuplicates(file1, file2)
	set the dupLog to "Marto:Users:Marto:Desktop:duplicates log.txt"
	try
		open for access file the dupLog with write permission
		write (file1 & " = " & file2 & return & return) to file the dupLog starting at eof
		close access file the dupLog
	on error
		try
			close access file the dupLog
		end try
	end try
end writeDuplicates

on write_error_log(this_error)
	set the error_log to "Marto:Users:Marto:Desktop:Error Log.txt"
	set the date_stamp to ((the current date) as string)
	try
		open for access file the error_log with write permission
		write (tab & date_stamp & return & this_error & return & return) to file the error_log starting at eof
		close access file the error_log
	on error
		try
			close access file the error_log
		end try
	end try
end write_error_log

i think the problem is with ‘signature’, because when i display it in a dialog as a string, it returns something like:

fh023986h234j206978c2n4589cnv2458967n45698cf7n249687nv222689j9c874598nv4598n267b4672bnv04567bc9824756bc29348756bc2987456bc24893576fh2984756cfh290873456h390f876h340c6237047c57856b29c8b62304mx5235zsk2340jf623405783406

since there is only one string, i think there is something wrong… i really don’t know.

and another thing: i tried your very first script above, changing it to use ‘choose folder’, and it returns an error:
" Can’t make every file of alias (alias “Marto:Users:Marto:Desktop:pics:”) of application “Finder” into type «class alst». "
then i changed the ‘choose folder’ to a fixed path to a folder with the pictures and one duplicate, and it returns another error:
" Can’t get item 1 of {}. "


in reply:
yes, i have lots of subfolders.
i don’t know if you are familiar with phpwebgallery, but that’s what this is all about. some categories have 50 files, others (the majority) have from 500 to 1000 files, and a few have from 1000 to 6000.
this is the typical structure; it would take me forever to write out the whole thing as i have it.

|-- galleries
|   |-- category-1
|   |   |-- category-1.1
|   |   |   |-- category-1.1.1
|   |   |   |   |-- category-1.1.1.1
|   |   |   |   |   |-- pwg_high
|   |   |   |   |   |   +-- wedding.jpg
|   |   |   |   |   |-- thumbnail
|   |   |   |   |   |   +-- TN-wedding.jpg
|   |   |   |   |   +-- wedding.jpg
|   |   |   |   +-- category-1.1.1.2
|   |   |   +-- category-1.1.2
|   |   |-- category-1.2
|   |   |   |-- pookie.jpg
|   |   |   |-- pwg_high
|   |   |   |   +-- pookie.jpg
|   |   |   +-- thumbnail
|   |   |       +-- TN-pookie.jpg
|   |   +-- category-1.3

in the category folder are stored the preview pics, which are mid-size files.
in pwg_high are stored the high resolution files, from 1 MB to 25 MB each.
in the thumbnail folder are the… well, thumbnails.
the final step for the script is to tell it somehow to ignore files in folders named ‘thumbnail’ and ‘pwg_high’, because if it doesn’t, the number of files is 150,000…

thank you.

Marto.

PS: i edited my script to try and make it check every file against every other file without repeating a comparison… i know it looks like C++, but that’s all i’ve ever learned… :stuck_out_tongue:
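
Ignoring the ‘thumbnail’ and ‘pwg_high’ folders can be done while walking the tree, so their contents are never listed at all. This is a sketch in Python rather than AppleScript, assuming the folder names are exactly as given in the post; `list_preview_files` and `SKIPPED_FOLDERS` are hypothetical names.

```python
import os

# Folder names taken from the post above; files inside them are ignored.
SKIPPED_FOLDERS = {"thumbnail", "pwg_high"}


def list_preview_files(root, skipped=SKIPPED_FOLDERS):
    """Yield every file path under root, except files inside folders named in `skipped`."""
    for folder, subfolders, filenames in os.walk(root):
        # Editing the list in place tells os.walk not to descend into those folders.
        subfolders[:] = [name for name in subfolders if name not in skipped]
        for name in filenames:
            yield os.path.join(folder, name)
```

Pruning during the walk means the skipped files are never checksummed, which is where most of the running time would otherwise go.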

display dialog signature as string takes the entire list and presents it as a single string.


tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list
--this will store the md5 signature of every file in 'signature'
set signature to {}
repeat with inputFile in inputFolder
	set signature's end to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
end repeat
-- instead of display dialog signature as string, you need:
set sig to ""
repeat with asig in signature
	set sig to sig & asig & return
end repeat
display dialog sig -- but will run out of gas for more than 30 files or so

This does the full comparison for me:



tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list

--this will store the md5 signature of every file in 'signature'
set signature to {}
repeat with inputFile in inputFolder
	set signature's end to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
end repeat
-- find the duplicates (and I tested with several)
-- starting j at k + 1 checks each pair only once (A against B, never B against A)
set dupes to {}
set c to count inputFolder
repeat with k from 1 to c
	repeat with j from (k + 1) to c
		tell signature to if item k = item j then set end of dupes to {item k of inputFolder, item j of inputFolder}
	end repeat
end repeat
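
Comparing every signature against every other one grows with the square of the number of files, which gets heavy at 50,000. An alternative is to group the files by signature in a single pass, so that each set of duplicates shows up exactly once. Below is a minimal sketch in Python rather than AppleScript; `group_duplicates` is a hypothetical name, and the two parallel lists play the roles of `signature` and `inputFolder` above.

```python
from collections import defaultdict


def group_duplicates(signatures, paths):
    """signatures[i] is the digest of paths[i]; return the groups of paths that share a digest."""
    by_signature = defaultdict(list)
    # One pass: every path lands in the bucket for its signature.
    for signature, path in zip(signatures, paths):
        by_signature[signature].append(path)
    # Only buckets holding more than one path are duplicates.
    return [group for group in by_signature.values() if len(group) > 1]
```

With this approach three identical files appear as one group of three rather than as several separate pairs.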