Using a txt as database - why doesn't it work?

hi,
i’m working with a huge number of files and i need to check for duplicates, so i thought of using a txt file, with proper delimiters, that lists the whole bunch of files, for another script to read and find any matches…
the ‘database’ will be a list of the md5 signatures of all the files. i want to grab a file and compare its md5 signature with every signature in the ‘database’, so that the script reports whether the file is a duplicate.

here is a part of my file:

jd93ks02ls7gj4593jd75ks73nf83idjc7*
j48dn39dng7wn29dxk73jn8djdifnew9*
982674nxusndn8d73hjw98e7883jkd89*


on run
	try
		tell application "Finder" to set inputFile to (choose file) as alias
		set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
		set database to ((((path to desktop folder) as text) & "signature log.txt"))
		set countDB to number of items in (read file database using delimiter "*")
		set c to 0
		
		repeat with x from 1 to countDB
			set k to (item x of (read file database using delimiter "*"))
			
			if (k = signature) then
				set c to 1
			end if
			
			display dialog (k & return & signature)
			
		end repeat
		
		if (c = 1) then
			display dialog "Duplicate found!!!"
		else
			display dialog "Duplicate NOT found"
		end if
		
	on error err
		display dialog err
	end try
	
end run

note: the list is a known md5 list, made with another script of mine. and the file for testing IS in that list.

the script doesn’t find the duplicate… i’ve even set everything to display dialogs and seen the two identical strings… but it still doesn’t find the duplicate.

i really don’t know what’s wrong.

thanks for all your help…

I don’t know if you have written your database file as shown, with each item on a new line or not, but that is what messed it up for me. If I save the file like this:

jd93ks02ls7gj4593jd75ks73nf83idjc7*j48dn39dng7wn29dxk73jn8djdifnew9*982674nxusndn8d73hjw98e7883jkd89*

Then, this snippet evaluates to true:

set aa to "j48dn39dng7wn29dxk73jn8djdifnew9"

set a to choose file
set b to open for access a
set c to read b using delimiter "*"
close access b
c contains aa

-->true

When I wrote the file as you had listed in your post, the same code always evaluated to false, because I believe the newline characters were being evaluated as part of each string.
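One way to make the check tolerant of either layout is to strip the line endings before splitting on the delimiter. Here is a minimal shell sketch of that idea, assuming `*` is the only delimiter in the log; the function name `normalize_log` is made up for illustration and is not from any script in this thread:

```shell
# Turn a "*"-delimited log (with or without line breaks) into one entry per line.
# normalize_log is a hypothetical helper name used only for this sketch.
normalize_log() {
    # Drop carriage returns and linefeeds first, then turn each "*" into a newline.
    tr -d '\r\n' < "$1" | tr '*' '\n'
}
```

Called as `normalize_log "signature log.txt"`, it should print the same entries whether the file was saved on one line or with each item on its own line.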

no, i didn’t store it like that… that’s a hand-written thing just to illustrate… because when i tried to copy and paste the real contents, it messed up my post and deleted it.

i believe that the problem could be in one of two places…
the first is that somehow, when the signature is written to the log and then read back and stored in a variable by the script, it changes somewhere… this occurs to me because when i tried to paste a part of that so-called ‘database’ into a post, it deleted the whole post from there to the end… that’s odd!

the second is that i’m storing the ‘database’ in some form that i shouldn’t, and that’s why i can’t read it properly.

anyway, i did some testing: if i store the md5 signature of the file that i want to check, read that signature back from the txt into a variable, and then compare it with the signatures in the ‘database’, it works…

so the mess is somewhere in the writing or reading of the file…
any ideas?

thanks guys.

Marto.

This was the format of my text file

And this was my script

tell application "Finder" to set inputFile to (choose file) as alias
set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
set database to ((((path to desktop folder) as text) & "signature log.txt"))
set hash_log to paragraphs of (read file database)
if hash_log contains signature then display dialog "Duplicate Found"

Worked for me.

Both forms work for me

3649d139c96761d3adf19369b69c88d1*
3ae642c3ef6c5a66e937edaa684cb018*

or
3649d139c96761d3adf19369b69c88d1*3ae642c3ef6c5a66e937edaa684cb018*

I get the results expected

Edit: Just seen the post above; I used the original script.

And, just to join the fray - I wrote my file with this:

tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list

--this will store the md5 signature of every file in 'signature'
set signature to {}
repeat with inputFile in inputFolder
	set signature's end to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
end repeat
set sigs to ""
repeat with S in signature
	set sigs to sigs & S & return
end repeat
set sl to open for access ((path to desktop folder as text) & "Signature.log") with write permission
set eof of sl to 0
write sigs to sl
close access sl

and checked it with this (successfully finding the duplicate):


tell application "Finder" to set inputFile to (choose file) as alias
set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
set DB to ((((path to desktop folder) as text) & "signature.log")) as alias
set tData to paragraphs of (read DB)
set countDB to count tData
set c to 0
repeat with k from 1 to countDB
	if item k of tData = signature then set c to c + 1
end repeat
display dialog "There were " & c & " duplicate(s)"

Always taking it to the next step Mr. Bell =)

well…
clearly the problem was in my ‘database’ maker script, because i tried every script posted here, with the proper modification for the way it wrote the signatures, and nothing happened.
until mr. Adams came along and i tried his two scripts.
nothing more to say but thanks a lot.

just curious, why didn’t these two work?

‘database’ maker (this one has the Progress Bar by Bruce Phillips):

on run
	try
		set Progress to load script alias (((path to scripts folder) as text) & "Progress.scpt")
		tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list
		
		tell Progress
			initialize()
			changeIcon to POSIX path of ("Marto:Users:Marto:Desktop:AppleScript:Process imagebank:addons:Resources:m.jpg") --icon
			setTitle to "MD5 DataBase"
			barberPole(false)
			setMax to number of items in inputFolder
		end tell
		
		set c to 1
		
		repeat with inputFile in inputFolder
			tell Progress
				setStatusTop to ("Obtaining MD5 of file: " & (name of (info for inputFile)))
				setStatusBottom to ("Files left: " & (number of items in inputFolder) - c)
			end tell
			set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
			my writeSignatures(signature)
			set c to (c + 1)
			tell Progress to increase by 1
		end repeat
		
		tell Progress to quit
		
	on error err
		display dialog err
	end try
end run

on writeSignatures(signature)
	set the sigLog to ((path to desktop folder as text) & "signature log.txt")
	try
		open for access file the sigLog with write permission
		write (signature & return) to file the sigLog starting at eof
		close access file the sigLog
	on error
		try
			close access file the sigLog
		end try
	end try
end writeSignatures

and the duplicate finder:

on run
	try
		tell application "Finder" to set inputFile to (choose file) as alias
		set signature to (do shell script "md5 -q " & quoted form of POSIX path of inputFile)
		set database to (((path to desktop folder) as text) & "signature log.txt")
		set hashLog to paragraphs of (read file database)
		if (hashLog contains signature) then
			display dialog "Duplicate Found"
		else
			display dialog "Duplicate NOT Found"
		end if
	on error err
		display dialog err
	end try
	
end run

again, thanks.

Marto

If you’re already using a shell, then why not save yourself some work?

choose folder with prompt "Generate list of MD5 checksums for these files:"
set sourceFolder to POSIX path of result

do shell script "cd " & quoted form of sourceFolder & "; /sbin/md5 -q * > " & quoted form of POSIX path of ((path to desktop as Unicode text) & "Signature Log.txt")

choose file with prompt "Check signature log for duplicate MD5 checksum of this file:" without invisibles

try
	do shell script "/sbin/md5 -q " & quoted form of POSIX path of result & " | /usr/bin/grep -f - " & quoted form of POSIX path of ((path to desktop as Unicode text) & "Signature Log.txt")
	
	display dialog "Duplicate found"
on error
	display dialog "Duplicate **not** found"
end try
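A small caveat on the pipeline above: `grep -f -` treats the checksum as a pattern that may match anywhere within a line. A stricter variant is sketched below, on the assumption that the log holds one checksum per linefeed-terminated line, as the shell command above produces. A log written from AppleScript with `return` line endings would need converting first. The function names `file_md5` and `is_duplicate` are made up for this sketch, and the `md5sum` fallback is an assumption for systems that lack `md5`:

```shell
# Print only the MD5 hex digest of a file.
# Uses "md5 -q" when present (macOS); otherwise falls back to md5sum (assumed available).
file_md5() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        md5sum < "$1" | cut -d ' ' -f 1
    fi
}

# Exit status 0 if the file's digest appears as a whole line in the log, non-zero otherwise.
# -F: fixed string, -x: match the entire line, -q: no output, status only.
is_duplicate() {
    grep -Fxq -- "$(file_md5 "$1")" "$2"
}
```

From AppleScript this could be wrapped in the same `try` / `on error` block as before, since `do shell script` raises an error on a non-zero exit status.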

I love you Bruce, I completely forgot about -f on grep

I just learned about it. :stuck_out_tongue:

Edit: That caused me to almost forget about nitpicking this.

There is no need to involve the Finder for this line.

I really like the -f option - I had read that before but didn’t think about it here. Very neat. And too, I could have saved myself a “last word of…” if I’d known about the -q (or reread man md5). Thanks for both Bruce.

This sticks them together:


-- get info
set ckFile to choose file with prompt "Check signature log for duplicate MD5 checksum of this file:" without invisibles
set tFiles to choose folder with prompt "Generate list of MD5 checksums for these files:" without invisibles
set whereFile to choose folder with prompt "Choose the container of the signature file:"
set nameFile to text returned of (display dialog "Please name the output file:" default answer "")
set sigFilePath to (whereFile as text) & nameFile
-- do it
getMDs(tFiles, sigFilePath)
checkDuplicates(ckFile, sigFilePath)
-- handlers
to checkDuplicates(aFile, sigFilePath)
	try
		do shell script "/sbin/md5 -q " & quoted form of POSIX path of aFile & " | /usr/bin/grep -f - " & quoted form of POSIX path of sigFilePath
		
		display dialog "Duplicate found"
	on error
		display dialog "Duplicate **not** found"
	end try
end checkDuplicates

to getMDs(aFolder, sigFilePath)
	do shell script "cd " & quoted form of POSIX path of aFolder & "; /sbin/md5 -q * > " & quoted form of POSIX path of sigFilePath
end getMDs


set my_head to POSIX path of this user's neck
set sharpshooters to "AB, JN & BP"
set bullet to item posts of sharpshooters

do fire_at_will(sharpshooters, bullet, my_head)

tell "Surgeon" to reconstruct

--> ouch!


:smiley:

well yes… it’s a lot faster. but it isn’t recursive into every subdirectory of the chosen folder…
if there is a way to search for every folder named X in the chosen folder and repeat the shell script for each one, then it would fit my task (because every file i have is ultimately located in a folder named ‘pwg_high’).
but… the command overwrites everything in ‘signature.txt’, so i would have to store the result in a variable and then write it at eof… or is there a way to write it at eof directly from the shell?

many thanks.

Marto.

I’ll get back to you in a few moments with the proper syntax for finding a folder with X in it, but in the meantime let’s take care of your appending issue with the sig file… change the line to read like this.

   do shell script "cd " & quoted form of POSIX path of aFolder & "; /sbin/md5 -q * >> " & quoted form of POSIX path of sigFilePath
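For anyone unfamiliar with the two redirection operators, the difference can be seen in a few lines of plain shell. This is only an illustrative sketch; the temporary file stands in for the signature log:

```shell
# ">" truncates the target before writing; ">>" appends to whatever is already there.
log=$(mktemp)                  # throwaway file standing in for the signature log
echo "first run"  >  "$log"    # creates the file (or empties it) and writes one line
echo "second run" >  "$log"    # overwrites: the log still holds a single line
echo "third run"  >> "$log"    # appends: the log now holds two lines
line_count=$(wc -l < "$log" | tr -d ' ')
echo "$line_count"
```

Because `>>` never empties the file, running the generator twice on the same folder will record each checksum twice, so the log may need clearing by hand before a fresh run.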

thanks, i never would have guessed that one…

And as promised, here is the code to generate the md5s into the signature file (appending, not overwriting) for all files whose names end in .jpg (change the fileName variable to whatever pattern works for you). It will start at the folder chosen as before, but will dig through all folders inside that one as well.

--Variables--
set fileName to "*.jpg"
--/Variables--

set tFiles to choose folder with prompt "Generate list of MD5 checksums for these files:" without invisibles
set whereFile to choose folder with prompt "Choose the container of the signature file:"
set nameFile to text returned of (display dialog "Please name the output file:" default answer "")
set sigFilePath to (whereFile as text) & nameFile

getMDs(tFiles, sigFilePath, fileName)

to getMDs(aFolder, sigFilePath, fileName)
	do shell script "find " & quoted form of POSIX path of aFolder & " -name " & quoted form of fileName & " -exec md5 -q {} >> " & quoted form of POSIX path of sigFilePath & " \\;"
end getMDs

Back in our first thread on this problem, I think I pointed out that this was a tough problem because of the number of files involved. Like most problems of this sort, a solution evolves through discussion and suggestion. Bruce’s solution is not recursive. ls -Rf is recursive and doesn’t sort the files, but doesn’t give their paths, only their names. What we need is a shell way to list the path to every file in a directory recursively, and there probably is one. We want to keep that in order (unsorted) because eventually, you’ll need the path to a duplicate so you can remove it - simply knowing that two md5 signatures match doesn’t eliminate or identify the duplicate. To do that in the shell is beyond my capabilities, but it’s clear from Bruce’s example that it’s a case where the shell is much faster than an AppleScript at doing this.
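For what it’s worth, `find` with `-type f` lists the path of every regular file below a folder, in the order it encounters them. The sketch below pairs each checksum with its path so that a match can be traced back to a file. It is offered as one possible approach rather than a tested solution for a large image bank. The function names `hash_tree` and `list_duplicates` are invented for this example, `md5 -r` (checksum first, then path) is the macOS form, and the `md5sum` fallback is an assumption for other systems:

```shell
# Print "<checksum> <path>" for every regular file under the given folder, unsorted.
hash_tree() {
    if command -v md5 >/dev/null 2>&1; then
        find "$1" -type f -exec md5 -r {} +
    else
        find "$1" -type f -exec md5sum {} +
    fi
}

# Print only the lines whose checksum has already appeared earlier in the listing,
# i.e. the second and later copies of each file, with their paths.
list_duplicates() {
    hash_tree "$1" | awk 'seen[$1]++'
}
```

Something like `list_duplicates ~/Pictures > ~/Desktop/duplicates.txt` would then leave a list of paths to review before anything is deleted.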

Another way to go would be to use spotlight data: mdfind -onlyin (chosen directory) for the file type because it does return a path. What is common about the images you’re looking at?

I should be able to whip that up; be back in a few.

I edited my comment while you were posting, adding this:

Another way to go would be to use spotlight data: mdfind -onlyin (chosen directory) for the file type because it does return a path. What is common about the images you’re looking at?