The script included below identifies duplicate files with the specified extensions in a selected folder and its subfolders. Duplicates are identified by their sha1 checksums, and these checksums are included in the output, which is written to a text file on the user’s desktop. The script can be slow if the number or size of the files being processed is large, although it is fairly robust.
use framework "Foundation"
use scripting additions
getDuplicateFiles()
on getDuplicateFiles()
set theExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder, theExtensions)
set theData to current application's NSMutableArray's new()
repeat with aFile in theFiles
set aLine to do shell script "sha1 -r " & quoted form of (aFile as text)
(theData's addObject:aLine)
end repeat
(theData's sortUsingSelector:"compare:")
set theDuplicates to current application's NSMutableOrderedSet's new()
set oldChecksum to current application's NSString's stringWithString:""
set oldItem to current application's NSString's stringWithString:""
repeat with anItem in theData
set newChecksum to ((anItem's componentsSeparatedByString:space)'s objectAtIndex:0)
if (newChecksum's isEqualToString:oldChecksum) is false then
set oldChecksum to newChecksum
else
(theDuplicates's addObject:oldItem)
(theDuplicates's addObject:anItem)
end if
set oldItem to anItem
end repeat
set theString to ((theDuplicates's array())'s componentsJoinedByString:linefeed)
writeFile(theString)
end getDuplicateFiles
on getFiles(theFolder, fileExtensions)
set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
set fileManager to current application's NSFileManager's defaultManager()
set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
return (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
end getFiles
on writeFile(theText)
set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
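For readers who want to follow the logic outside AppleScript, here is a minimal Python sketch of the same idea: build one "checksum path" line per file, sort the lines, and keep every line whose checksum matches its neighbour. It is an illustration only, not a translation of the script above; the function names `sha1_of` and `duplicate_lines` are hypothetical, and Python's `hashlib` stands in for the `sha1 -r` shell call.

```python
import hashlib

def sha1_of(path):
    """Return the hex SHA-1 digest of a file's contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def duplicate_lines(paths):
    """Return sorted 'checksum path' lines for files whose checksum repeats."""
    lines = sorted(f"{sha1_of(p)} {p}" for p in paths)
    duplicates = []
    previous = None
    for line in lines:
        checksum = line.split(" ", 1)[0]
        if previous is not None and checksum == previous.split(" ", 1)[0]:
            # Same checksum as the line before: record both lines, once each.
            if not duplicates or duplicates[-1] != previous:
                duplicates.append(previous)
            duplicates.append(line)
        previous = line
    return duplicates
```

Because the lines are sorted, files with equal checksums always end up next to each other, so a single pass comparing each line with the previous one is enough.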
The following script is similar to the above, except that a regex is used to identify files that are not duplicates. The regex pattern was provided by Nigel. The timing results for the two scripts are pretty much the same.
use framework "Foundation"
use scripting additions
getDuplicateFiles()
on getDuplicateFiles()
set theExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder, theExtensions)
set theData to current application's NSMutableArray's new()
repeat with aFile in theFiles
set aLine to do shell script "sha1 -r " & quoted form of (aFile as text)
(theData's addObject:aLine)
end repeat
(theData's sortUsingSelector:"compare:")
set dataString to (theData's componentsJoinedByString:linefeed)
set dataNoDuplicates to (dataString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()})
set dataNoDuplicates to (dataNoDuplicates's componentsSeparatedByString:linefeed)
theData's removeObjectsInArray:dataNoDuplicates
set theString to theData's componentsJoinedByString:linefeed
writeFile(theString)
end getDuplicateFiles
on getFiles(theFolder, fileExtensions)
set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
set fileManager to current application's NSFileManager's defaultManager()
set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
return (folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path"
end getFiles
on writeFile(theText)
set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
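The regex step can be tried in isolation with a small Python sketch. Assumptions: the pattern is written with ordinary backtracking quantifiers instead of the possessive `++` used in the script (Python's `re` only accepts possessive quantifiers from version 3.11), the input lines are already sorted, and the sample checksums and the name `duplicates_via_regex` are made up for illustration.

```python
import re

# Backtracking form of the thread's pattern: a "checksum " prefix, the rest of
# the line, then one or more following lines that start with the same prefix.
DUPLICATE_RUN = re.compile(r"([^ ]+ ).+\n(?:\1.+(?:\n|$))+")

def duplicates_via_regex(sorted_lines):
    """Return the lines that belong to a run of equal checksums."""
    text = "\n".join(sorted_lines)
    # Deleting every run of duplicates leaves only the unique lines ...
    leftovers = set(DUPLICATE_RUN.sub("", text).split("\n"))
    # ... and removing those from the full list leaves the duplicates.
    return [line for line in sorted_lines if line not in leftovers]

sample = [
    "aaa /x/1.jpg",
    "aaa /x/2.jpg",
    "bbb /x/3.jpg",
    "ccc /x/4.jpg",
    "ccc /x/5.jpg",
]
print(duplicates_via_regex(sample))
```

The design is a double negative: the regex finds what is not unique, the replacement deletes it, and subtracting what survives from the original list yields the duplicates.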
One208. Thanks for testing my script. I only have the ability to test scripts on my Sequoia computer.
I’m not certain why you are getting that error message, but I’ll research that. In the meantime, substituting md5 for sha1 should (hopefully) fix the error.
BTW, the reason I used sha1 was because it was significantly faster in my testing in calculating checksums for JPG files.
… which is a diplomatic way of saying that peavine worked out the basic rather clever pattern himself and I managed to iron out a minor issue on my third attempt.
The sha1 utility used in the shell scripts here appears to be a recent addition to macOS. I have it on my Sequoia machine, but not on the Ventura one. (Ah. I see One208’s Sonoma system doesn’t have it either.)
Here’s a slightly faster take on the second script. It finds the files and returns the sha1 (or md5) text in a single shell script.
use framework "Foundation"
use scripting additions
getDuplicateFiles() as text
on getDuplicateFiles()
set theExtensions to {"jpg", "jpeg", "JPG", "JPEG"} --set to desired file extensions
set theFolder to (choose folder)'s POSIX path
-- Find files at theFolder's root level with the required extensions and names not beginning with ".", and feed them to sha1 or md5. Sort on checksums afterwards.
set prog to "sha1"
considering numeric strings
if ((system info)'s system version < "15.0") then set prog to "md5"
end considering
set shellText to "find " & theFolder's quoted form & " -depth 1 \\( -name '*." & joinText(theExtensions, "' -or -name '*.") & "' \\) ! -name '.*' -exec " & prog & " '-r' {} \\; | sort"
set dataText to (do shell script shellText without altering line endings)
set dataString to current application's class "NSMutableString"'s stringWithString:(dataText)
dataString's replaceOccurrencesOfString:("([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++") withString:("") options:(1024) range:({0, dataString's |length|()})
set dataSet to current application's class "NSMutableOrderedSet"'s orderedSetWithArray:(dataText's paragraphs)
set dataNoDuplicates to current application's class "NSSet"'s setWithArray:(dataString's componentsSeparatedByString:(linefeed))
dataSet's minusSet:(dataNoDuplicates)
set output to dataSet's array()'s componentsJoinedByString:(linefeed)
writeFile(output)
end getDuplicateFiles
on joinText(lst, delim)
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to delim
set txt to lst as text
set AppleScript's text item delimiters to astid
return txt
end joinText
on writeFile(theText)
set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
Edits: Script modified in the light of peavine’s suggestion above to use sha1 in Sequoia or above and md5 otherwise. Also handler name corrected and sed no longer needed in the shell script.
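The version test relies on `considering numeric strings`, which compares "14.6" and "15.0" by their numeric parts rather than character by character. A rough Python sketch of the same decision follows; the helper names `version_tuple` and `checksum_tool` are hypothetical, and the 15.0 threshold is taken from the script above.

```python
def version_tuple(version):
    """Turn a dotted version string such as '14.6.1' into (14, 6, 1)."""
    return tuple(int(part) for part in version.split("."))

def checksum_tool(system_version):
    """Pick 'sha1' on macOS 15.0 or later and 'md5' otherwise."""
    return "sha1" if version_tuple(system_version) >= (15, 0) else "md5"

print(checksum_tool("15.2"))    # sha1
print(checksum_tool("14.6.1"))  # md5
```

A plain string comparison would get some versions wrong: as text, "9.3" sorts after "15.0" because "9" is greater than "1", which is why a numeric comparison is needed.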
I would be surprised if Apple moved the location of /sbin/sha1 between Sonoma and Sequoia. I too am using Sequoia v15.2 and your script with the regex works just fine in Script Debugger and Apple’s Script Editor. I am using the Zsh shell.
[Update: I see Nigel’s comment about sha1 not being on Ventura or Sonoma, which nixes my suggestion to use the full /sbin/sha1 path ]
If /sbin/sha1sum exists on Sonoma, it can be used as a substitute for /sbin/sha1 and produces the same output (on Sequoia v15.2) without needing any switches (e.g. -r ).
I thought I would include another approach FWIW. This script is faster because it runs the do shell script command only once. However, it breaks if the number of files being processed exceeds about 16,000. An earlier version of this script broke if a file path contained a single-quote mark, but I hopefully have that fixed.
use framework "Foundation"
use scripting additions
getDuplicateFiles()
on getDuplicateFiles()
set theFileExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder, theFileExtensions)
if theFiles = "''" then display dialog "No matching files found" buttons {"OK"} cancel button 1 default button 1
set theData to (do shell script "sha1 -r " & theFiles)
set dataString to current application's NSString's stringWithString:theData
set dataArray to ((dataString's componentsSeparatedByString:return)'s sortedArrayUsingSelector:"compare:")'s mutableCopy()
set dataString to (dataArray's componentsJoinedByString:linefeed)
set noDuplicates to (dataString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()}) --option 1024 is regex
set noDuplicates to (noDuplicates's componentsSeparatedByString:linefeed)
dataArray's removeObjectsInArray:noDuplicates
set theString to (dataArray's componentsJoinedByString:linefeed)
writeFile(theString)
end getDuplicateFiles
on getFiles(theFolder, fileExtensions)
set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
set fileManager to current application's NSFileManager's defaultManager()
set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
set theFiles to ((folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path")'s componentsJoinedByString:linefeed
set theFiles to theFiles's stringByReplacingOccurrencesOfString:"'" withString:"'\\''" --escape single quotes in path
set theFiles to (theFiles's stringByReplacingOccurrencesOfString:linefeed withString:"' '") as text
return "'" & theFiles & "'"
end getFiles
on writeFile(theString)
set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
(current application's NSString's stringWithString:theString)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
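The single-quote fix in getFiles uses the usual shell idiom: close the quoted string, add an escaped quote, and reopen it (`'\''`). Here is a small Python sketch of that quoting, with `shlex.split` used only to check that a POSIX shell would read the arguments back unchanged. The names `sh_quote`, `quoted_arguments` and `round_trips` are hypothetical.

```python
import shlex

def sh_quote(path):
    """Wrap a path in single quotes, escaping embedded single quotes as '\\''."""
    return "'" + path.replace("'", "'\\''") + "'"

def quoted_arguments(paths):
    """Join quoted paths with spaces, ready to append to a command string."""
    return " ".join(sh_quote(p) for p in paths)

def round_trips(paths):
    """Check that POSIX-style word splitting recovers the original paths."""
    return shlex.split(quoted_arguments(paths)) == list(paths)

print(quoted_arguments(["/Pictures/it's here.jpg", "/Pictures/plain.jpg"]))
```

Inside single quotes the shell treats every character literally, so the single quote itself is the only character that needs this treatment.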
I ran some timing tests with Script Geek. The test folder had 82 JPG files, each of which contained about 5.5 MB, and there were 18 duplicates. The results:
Peavine One (no regex) - 484 milliseconds
Peavine Two (with regex) - 491 milliseconds
Peavine Two (with regex but substituted md5 for sha1) - 894 milliseconds
Nigel (I deleted depth option to return all files) - 362 milliseconds
Peavine Three (with regex but breaks with large number of files) - 262 milliseconds
BTW, xxHash is a fast non-cryptographic hash algorithm that is designed for tasks such as this. Unfortunately, a compiled version of this is not included with macOS.
I’ve always just used openssl (now Apple’s LibreSSL), for example do shell script "/usr/bin/openssl dgst -sha1 " & theFiles. Since Apple has been aliasing to their own for compatibility, I am wondering if those are any more updated or slower.
I tried other approaches to get checksums–including md5sum, cksum, openssl, and shasum–and all were slower.
The script included below takes a somewhat different approach, which is to first remove from consideration files that have a unique file size. The remaining files are then checked for duplicates based on their checksum.
Using the same test folder as in my earlier post, the timing result was 143 milliseconds, which is an improvement of 45 percent. Also, this script worked when processing over 24,000 PDF files in a folder that contained over 41,000 files.
This script requires Sequoia but can be made to work with earlier versions of macOS by replacing sha1 with md5.
use framework "Foundation"
use scripting additions
on main()
set theExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder, theExtensions)
set theFiles to filterBySize(theFiles)
set theFiles to filterByChecksum(theFiles)
writeFile(theFiles)
end main
on getFiles(theFolder, fileExtensions)
set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
set fileManager to current application's NSFileManager's defaultManager()
set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
return (folderContents's filteredArrayUsingPredicate:thePredicate)
end getFiles
on filterBySize(theFiles)
set theData to current application's NSMutableArray's new()
set theKey to current application's NSURLFileSizeKey
repeat with aFile in theFiles
set {theResult, fileSize} to (aFile's getResourceValue:(reference) forKey:theKey |error|:(missing value))
set anItem to current application's NSString's stringWithFormat_("%@ %@", fileSize, aFile's |path|())
(theData's addObject:anItem)
end repeat
(theData's sortUsingSelector:"localizedStandardCompare:")
set dataString to theData's componentsJoinedByString:linefeed
set noDuplicates to (dataString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()}) --option 1024 is regex
set noDuplicates to (noDuplicates's componentsSeparatedByString:linefeed)
(theData's removeObjectsInArray:noDuplicates)
return theData
end filterBySize
on filterByChecksum(theFiles)
set theData to current application's NSMutableArray's new()
repeat with aFile in theFiles
set aFile to (aFile's stringByReplacingOccurrencesOfString:"(?m)^[^ ]++ (.++)$" withString:"$1" options:1024 range:{0, aFile's |length|()})
set anItem to do shell script "sha1 -r " & quoted form of (aFile as text)
(theData's addObject:anItem)
end repeat
(theData's sortUsingSelector:"localizedStandardCompare:")
set dataString to theData's componentsJoinedByString:linefeed
set noDuplicates to (dataString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()})
set noDuplicates to (noDuplicates's componentsSeparatedByString:linefeed)
theData's removeObjectsInArray:noDuplicates
set theString to (theData's componentsJoinedByString:linefeed)
return theString
end filterByChecksum
on writeFile(theText)
set theFile to (current application's NSHomeDirectory()'s stringByAppendingPathComponent:"Desktop")'s stringByAppendingPathComponent:"Duplicate Files.txt"
(current application's NSString's stringWithString:theText)'s writeToFile:theFile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
main()
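The two-stage filter can be sketched in Python as follows. This is an illustration of the idea only (group by file size first, then hash just the files that share a size), not a translation of the script; `sha1_of` and `find_duplicates` are hypothetical names and `hashlib` stands in for the `sha1 -r` call.

```python
import hashlib
import os
from collections import defaultdict

def sha1_of(path):
    """Return the hex SHA-1 digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()

def find_duplicates(paths):
    """Map checksum -> sorted list of paths, for checksums shared by 2+ files."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_checksum = defaultdict(list)
    for group in by_size.values():
        if len(group) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        for path in group:
            by_checksum[sha1_of(path)].append(path)

    return {c: sorted(g) for c, g in by_checksum.items() if len(g) > 1}
```

Reading a file's size is far cheaper than reading its contents, so the saving grows with the share of files that have a unique size. Files of equal size still need the checksum, since equal size does not imply equal content.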
I believe the limitation is the length of text that can be fed to do shell script rather than anything to do with the shell itself or the scripted utilities. But I could be wrong.
Here’s a version of “Peavine Three” which saves the initial list of files to a text file (the same one later reused for the final result) and has the shell script read it in while executing instead of having it in its source code. Whether or not it solves the “too many files” problem, it has the advantage that xargs can take 0-separated input arguments, which don’t need to be quoted and so nothing needs to be escaped.
use framework "Foundation"
use scripting additions
getDuplicateFiles()
on getDuplicateFiles()
set theFileExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder, theFileExtensions)
if theFiles = "" then display dialog "No matching files found" buttons {"OK"} cancel button 1 default button 1
set textFilePath to (path to desktop)'s POSIX path & "Duplicate Files.txt"
writeFile(theFiles, textFilePath) -- May as well use the same file for both the intermediate and final texts.
set prog to "sha1"
considering numeric strings
if ((system info)'s system version < "15.0") then set prog to "md5"
end considering
set theData to (do shell script "xargs -0 " & prog & " -r <" & (textFilePath's quoted form) & " | sort")
set dataArray to current application's NSMutableArray's arrayWithArray:(theData's paragraphs)
set dataString to (dataArray's componentsJoinedByString:linefeed)
set noDuplicates to (dataString's stringByReplacingOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()}) --option 1024 is regex
set noDuplicates to (noDuplicates's componentsSeparatedByString:linefeed)
dataArray's removeObjectsInArray:noDuplicates
set thestring to (dataArray's componentsJoinedByString:linefeed)
writeFile(thestring, textFilePath)
end getDuplicateFiles
on getFiles(theFolder, fileExtensions)
set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
set fileManager to current application's NSFileManager's defaultManager()
set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
-- Join the lines with characters id 0 instead of linefeeds.
set theFiles to ((folderContents's filteredArrayUsingPredicate:thePredicate)'s valueForKey:"path")'s componentsJoinedByString:(character id 0)
return theFiles
end getFiles
on writeFile(thestring, thefile)
(current application's NSString's stringWithString:thestring)'s writeToFile:thefile atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
Edit: Automatic choice between using sha1 or md5 based on system version.
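The reason NUL-separated input needs no quoting is that a NUL character cannot occur inside a POSIX file path, so it is an unambiguous separator. A minimal Python sketch of the round trip follows; the function names are hypothetical, and the parsing only imitates how `xargs -0` divides its input into arguments.

```python
def join_for_xargs(paths):
    """Join paths with NUL characters, as the script does with character id 0."""
    return "\0".join(paths)

def parse_nul_separated(data):
    """Split NUL-separated text back into individual arguments."""
    return data.split("\0")

awkward = [
    "/Users/me/it's here.jpg",   # single quote and space
    '/Users/me/"quoted".jpg',    # double quotes
    "/Users/me/two\nlines.jpg",  # even a linefeed survives
]
assert parse_nul_separated(join_for_xargs(awkward)) == awkward
```

Spaces, quotes and linefeeds all pass through untouched, which is why no escaping step is needed in this version.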
Nigel. I tested your revision of my script and it worked great.
I ran timing tests comparing my script in post 10 with your script in post 11. The first test folder contained 82 JPG files in a folder with 86 files total. The number of duplicate JPG files was 18. My script took 148 milliseconds to run and your script took 266 milliseconds to run.
My second test folder contained over 24,000 PDF files in a folder that contained over 41,000 files. A distinguishing characteristic of this test folder was that at least 95 percent of the PDF files were duplicates, which eliminated any advantage that accrued to pre-filtering by file size. My script took 145 seconds to run and your script took 76 seconds to run.
Your idea of firstly filtering out files with unique sizes clearly saves a lot of time on unnecessary sha1/md5 operations when the proportion of duplicates is relatively low, but also has a significant overhead when most of the operations have to be performed anyway.
If you’ll forgive yet more modifications from me, the following is basically your post 10 script with faster size filtering and the xargs idea from post 11, if you’d like to give it a go.
Conventional wisdom has it that URL properties of interest should be explicitly fetched with the URLs themselves. But it doesn’t seem to make any difference here, either to the script actually working or to the time it takes. Anyway, I’ve included the relevant keys in the enumerator set-up as a formality.
use framework "Foundation"
use scripting additions
on main()
set theExtensions to {"jpg", "jpeg"} --set to desired lowercase file extensions
set textFilePath to POSIX path of (path to desktop) & "Duplicate Files.txt"
set theFolder to POSIX path of (choose folder)
set theFiles to getFiles(theFolder, theExtensions) --> NSArray of NSURL.
set theFiles to filterBySize(theFiles) --> NSArray of NSString.
set theFiles to filterByChecksum(theFiles, textFilePath) --> NSString.
if (theFiles's |length|() = 0) then set theFiles to "No duplicated files in " & theFolder
writeFile(theFiles, textFilePath)
end main
on getFiles(theFolder, fileExtensions)
set theFolder to current application's |NSURL|'s fileURLWithPath:theFolder
set fileManager to current application's NSFileManager's defaultManager()
set folderContents to (fileManager's enumeratorAtURL:theFolder includingPropertiesForKeys:{current application's NSURLFileSizeKey, current application's NSURLFilePathKey} options:6 errorHandler:(missing value))'s allObjects() --option 6 skips hidden files and package contents
set thePredicate to current application's NSPredicate's predicateWithFormat_("pathExtension.lowercaseString IN %@", fileExtensions)
return (folderContents's filteredArrayUsingPredicate:thePredicate)
end getFiles
on filterBySize(theFiles)
set theData to current application's NSMutableArray's new()
set theKeys to {current application's NSURLFileSizeKey, current application's NSURLPathKey}
set {sizeKey, pathKey} to theKeys
repeat with aFile in theFiles
(theData's addObject:(aFile's resourceValuesForKeys:(theKeys) |error|:(missing value))) -- NSDictionary with the relevant values and keys.
end repeat
set sizeSet to current application's NSCountedSet's alloc()'s initWithArray:(theData's valueForKey:(sizeKey))
sizeSet's minusSet:(current application's NSSet's setWithSet:(sizeSet))
set filter to current application's NSPredicate's predicateWithFormat_("%K in %@", sizeKey, sizeSet)
theData's filterUsingPredicate:(filter)
return theData's valueForKey:(pathKey)
end filterBySize
on filterByChecksum(theFiles, textFilePath)
set theData to theFiles's componentsJoinedByString:(character id 0)
writeFile(theData, textFilePath)
set prog to "sha1"
considering numeric strings
if ((system info)'s system version < "15.0") then set prog to "md5"
end considering
set dataText to (do shell script ("xargs -0 " & prog & " -r <" & (textFilePath's quoted form) & " | sort") without altering line endings)
set dataString to current application's NSMutableString's stringWithString:(dataText)
(dataString's replaceOccurrencesOfString:"([^ ]++ ).++\\n(?:\\1.++(?:\\n|$))++" withString:"" options:1024 range:{0, dataString's |length|()})
set noDuplicates to (dataString's componentsSeparatedByString:linefeed)
set filter to current application's NSPredicate's predicateWithFormat_("!self IN %@", noDuplicates)
set theData to current application's NSMutableArray's arrayWithArray:(dataText's paragraphs)
theData's filterUsingPredicate:(filter)
return (theData's componentsJoinedByString:linefeed)
end filterByChecksum
on writeFile(theText, textFilePath)
(current application's NSString's stringWithString:theText)'s writeToFile:textFilePath atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)
end writeFile
main()
Nigel. I tested your script with Script Geek. I didn’t have my old test folder, so I created a new one that’s similar. I also retested my script from post 10. Your script was faster at 73 milliseconds compared to my script at 141 milliseconds.
BTW, it was always my thought that the final text file should include checksums, so that the user (or the user’s script) could identify which files are duplicates of others. That’s why I included the checksums in the output of my scripts.
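As an example of how a follow-up script might use the checksums, here is a short Python sketch that groups the report's lines by checksum. It assumes each line has the form "checksum path" with a single space after the checksum, as produced by `sha1 -r`; the function name `group_by_checksum` and the sample lines are hypothetical.

```python
from collections import defaultdict

def group_by_checksum(report_text):
    """Map each checksum to the list of paths that share it."""
    groups = defaultdict(list)
    for line in report_text.split("\n"):
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing linefeed
        # Split on the first space only, so spaces in the path are kept.
        checksum, _, path = line.partition(" ")
        groups[checksum].append(path)
    return dict(groups)

report = "aaa /Pictures/one.jpg\naaa /Pictures/copy of one.jpg\nccc /Pictures/two.jpg\nccc /Backup/two.jpg\n"
for checksum, paths in group_by_checksum(report).items():
    print(checksum, paths)
```

Without the checksums, a report listing four paths would not show whether it describes two pairs or one set of four identical files.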
I always enjoy looking at different ways of doing things, and I’ll have a good look at your script later today.
Nigel, I saved your script in post 12, changed the file to “scpt”, and selected a folder in my Documents folder where I keep all things MacScripter. The result was two duplicate files, although I cannot understand how they were selected. The checksum is identical, although the two scripts are completely different, even the file names are different (see screenshot).
I stand corrected. I had assumed that because the file names were different, the scripts were different. But I opened both scripts and checked them, and they are identical. So it would seem that the script posted by Nigel in post 12 is a bit smarter than I am. It disregards the file name, and checks the file’s contents. Slick!
It’s the “sha1” or “md5” utilities, included with macOS, which analyse the files and generate the checksums based on their contents. The files’ names are really just parts of their entries in the disk catalogue rather than being parts of the files themselves.
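A quick way to see this is to hash two differently named files with identical contents. The Python sketch below uses `hashlib` in place of the sha1 utility; the helper names and the file names in the usage note are made up for the example.

```python
import hashlib

def sha1_of(path):
    """Return the hex SHA-1 digest of a file's contents; the name plays no part."""
    with open(path, "rb") as handle:
        return hashlib.sha1(handle.read()).hexdigest()

def same_contents(path_a, path_b):
    """True when both files hash to the same SHA-1 digest."""
    return sha1_of(path_a) == sha1_of(path_b)
```

For instance, `same_contents("Old Script.scpt", "Renamed Copy.scpt")` would be true for a file and its renamed copy, because only the bytes inside the files are fed to the hash.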
It was of course peavine who did the research and wrote the original scripts above which use these utilities. I merely provided a couple of tweaks.