Get Duplicate Files

The script included below identifies duplicate files with the specified extensions in a selected folder and its subfolders. The sha1 checksum values are used to identify the duplicate files, and these values are included in the output, which is written to a text file on the user’s desktop. The script is fairly robust, although it can be slow if the number or size of the files being processed is large.

use framework "Foundation"
use scripting additions

getDuplicateFiles()

on getDuplicateFiles()
	set theExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
	set theFolder to POSIX path of (choose folder)
	set theFiles to getFiles(theFolder, theExtensions)
	set theData to current application's NSMutableArray's new()
	repeat with aFile in theFiles
		set aLine to do shell script "sha1 -r " & quoted form of (aFile as text)
		(theData's addObject:aLine)
	end repeat
	(theData's sortUsingSelector:"compare:")
	set theDuplicates to current application's NSMutableOrderedSet's new()
	set oldChecksum to current application's NSString's stringWithString:""
	set oldItem to current application's NSString's stringWithString:""
	repeat with anItem in theData
		set newChecksum to ((anItem's componentsSeparatedByString:space)'s objectAtIndex:0)
		if (newChecksum's isEqualToString:oldChecksum) is false then
			set oldChecksum to newChecksum
		else
			(theDuplicates's addObject:oldItem)
			(theDuplicates's addObject:anItem)
		end if
		set oldItem to anItem
	end repeat
	set theString to ((theDuplicates's array())'s componentsJoinedByString:linefeed)
	writeFile(theString)
end getDuplicateFiles

on getFiles(theFolder, fileExtensions)
	set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
	set fileManager to current application's NSFileManager's defaultManager()
	set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
	set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
	return (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
end getFiles

on writeFile(theText)
	set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
	(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile

The following script is similar to the above, except that a regex is used to identify files that are not duplicates. The regex pattern was provided by Nigel. The timing results for the two scripts are pretty much the same.

use framework "Foundation"
use scripting additions

getDuplicateFiles()

on getDuplicateFiles()
	set theExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
	set theFolder to POSIX path of (choose folder)
	set theFiles to getFiles(theFolder, theExtensions)
	set theData to current application's NSMutableArray's new()
	repeat with aFile in theFiles
		set aLine to do shell script "sha1 -r " & quoted form of (aFile as text)
		(theData's addObject:aLine)
	end repeat
	(theData's sortUsingSelector:"compare:")
	set dataString to (theData's componentsJoinedByString:linefeed)
	set dataNoDuplicates to (dataString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()})
	set dataNoDuplicates to (dataNoDuplicates's componentsSeparatedByString:linefeed)
	theData's removeObjectsInArray:dataNoDuplicates
	set theString to theData's componentsJoinedByString:linefeed
	writeFile(theString)
end getDuplicateFiles

on getFiles(theFolder, fileExtensions)
	set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
	set fileManager to current application's NSFileManager's defaultManager()
	set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
	set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
	return (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
end getFiles

on writeFile(theText)
	set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
	(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile

Thanks, peavine, for the script. I am getting the following error on Sonoma 14.5:
terminal → zsh: command not found: sha1

**error** "sh: sha1: command not found" number 127

One208. Thanks for testing my script. I only have the ability to test scripts on my Sequoia computer.

I’m not certain why you are getting that error message, but I’ll research that. In the meantime, substituting md5 for sha1 should (hopefully) fix the error.

BTW, I used sha1 because it was significantly faster in my testing in calculating checksums for JPG files.

… which is a diplomatic way of saying that peavine worked out the basic rather clever pattern himself and I managed to iron out a minor issue on my third attempt. :smile:

The sha1 utility used in the shell scripts here appears to be a recent addition to macOS. I have it on my Sequoia machine, but not on the Ventura one. (Ah. I see One208’s Sonoma system doesn’t have it either.)

Here’s a slightly faster take on the second script. It finds the files and returns the sha1 text in a single shell script.

use framework "Foundation"
use scripting additions

getNonDuplicateFiles() as text

on getNonDuplicateFiles()
	set theExtensions to {"jpg", "jpeg", "JPG", "JPEG"} --set to desired file extensions
	set theFolder to (choose folder)'s POSIX path
	-- This shell script uses find to find files in theFolder's root level which have the required extensions and aren't hidden and to
	-- feed them to sha1. I can't work out how to include sha1's -r option in this context, so sed is used instead to format the lines.
	set shellText to "find " & theFolder's quoted form & ¬
		" -maxdepth 1  \\( -name '*." & joinText(theExtensions, "' -or -name '*.") & "' \\)  ! -name '.*' -exec sha1 {} \\;  | 
		sed -E 's/SHA1 \\(([^\\)]+)\\) = (.+)/\\2 \\1/' | sort"
	set dataText to (do shell script shellText without altering line endings)
	
	set dataString to current application's class "NSMutableString"'s stringWithString:(dataText)
	dataString's replaceOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()}
	
	set dataSet to current application's class "NSMutableOrderedSet"'s orderedSetWithArray:(dataText's paragraphs)
	set dataNoDuplicates to current application's class "NSSet"'s setWithArray:(dataString's componentsSeparatedByString:(linefeed))
	dataSet's minusSet:(dataNoDuplicates)
	
	set theString to dataSet's array()'s componentsJoinedByString:(linefeed)
	writeFile(theString)
end getNonDuplicateFiles

on joinText(lst, delim)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delim
	set txt to lst as text
	set AppleScript's text item delimiters to astid
	return txt
end joinText

on writeFile(theText)
	set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
	(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
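
For anyone who would rather keep the whole comparison in the shell, here is a minimal sketch of the same adjacent-line idea the scripts above use: sort the "checksum path" lines, then keep every line whose checksum matches a neighbour's. It swaps the regex for a small awk filter, so it is an alternative technique rather than what the scripts above do. The function name duplicates_only is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of the scripts above).
# Reads "checksum path" lines on stdin and prints only the lines whose
# checksum occurs more than once. The input is sorted here, so it need
# not be sorted beforehand.
duplicates_only() {
    sort | awk '
        NF == 0 { next }                 # ignore blank lines
        { hash = $1 "" }                 # force a string comparison
        hash == prev_hash {
            if (!printed) print prev_line   # first member of the group
            print                           # current member
            printed = 1
        }
        hash != prev_hash { printed = 0 }   # a new checksum starts a new group
        { prev_hash = hash; prev_line = $0 }
    '
}
```

It could be appended to the existing pipeline in place of the final sort (for example `... | sed -E '...' | duplicates_only`), in which case the text returned by do shell script would already be the duplicates list.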

I would be surprised if Apple moved the location of /sbin/sha1 between Sonoma and Sequoia. I too am using Sequoia v15.2 and your script with the regex works just fine in Script Debugger and Apple’s Script Editor. I am using the Zsh shell.

[Update: I see Nigel’s comment about sha1 not being on Ventura or Sonoma, which nixes my suggestion to use the full /sbin/sha1 path ]

If /sbin/sha1sum exists on Sonoma, it can be used as a substitute for /sbin/sha1 and produces the same output (on Sequoia v15.2) without needing any switches (e.g. -r ).

Thank you, that explains it all. I will try “shasum” or upgrade to Sequoia. I just checked the HMRC website, and their PAYEE is not supported on Sequoia! :expressionless: