Alright, speed demons: I need to text-process a list with 1.7M items.

Are those file names all unique? What comparison are you going to make? I fear you might be approaching the problem from the wrong end.

(The prospect of using basic AppleScript to compare two lists of 1.7 million records each doesn’t sound very appealing, or practical. Assuming it’s a one-off, I think I’d just buy a new disk.)

:lol:

I agree with Shane that it’s a bit of a task for vanilla AppleScript. If I were obliged to use it, I’d try to optimise the process as much as possible. For instance, by NOT building freeNASdataList by concatenation item by item! I’d try building intermediate lists — say of 4000 items each (or fewer in the last one) — by setting their ends to the individual results, then concatenating each of these to freeNASdataList. Something like the following, although you’d need to experiment to see what works best nowadays:

-- commented out lines to grab the real 86 MB 1.7M line file
-- set freeNASfile to alias "[real data file path]"
-- set freeNASdata to read freeNASfile

local freeNASdata, freeNASdataList, o

set freeNASdata to get_test_data() -- grab small test set

set freeNASdataList to {}

script
	property currentParagraphs : {}
	property subResult : {}
end script
set o to result

tell (current date) to set {dateObject, its day, its year, its month, its time} to {it, 1, 2000, January, 0} -- A known date whose day is 1.
set {delimitHolder, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "^"}
set paragraphCount to (count freeNASdata's paragraphs)
repeat with i from 1 to paragraphCount by 4000
	set j to i + 3999
	if (j > paragraphCount) then set j to paragraphCount
	set o's currentParagraphs to paragraphs i thru j of freeNASdata
	repeat with k from 1 to (j - i + 1)
		set aLine to item k of o's currentParagraphs
		try
			set {thePath, rawDate} to aLine's text items
			if thePath starts with "./" then set thePath to text 3 through end of thePath
			set ignoreRecord to false
		on error
			set ignoreRecord to true
		end try
		if ignoreRecord is false then
			-- Get a copy of the known date and set its properties.
			copy dateObject to ASdate
			-- set dateComponents to rawDate's words -- Or the following three lines to be safe.
			set AppleScript's text item delimiters to {"-", " ", ":"}
			set dateComponents to rawDate's text items
			set AppleScript's text item delimiters to "^"
			set item 1 of dateComponents to 2000 + (beginning of dateComponents) mod 2000 -- Assuming all the dates are between 2000 and 2099.
			tell ASdate to set {its year, its month, its day, its hours, its minutes, its seconds} to dateComponents
			set end of o's subResult to {thePath, ASdate}
		end if
	end repeat

	-- Concatenate the current sub-result list to the output and start another.
	set freeNASdataList to freeNASdataList & o's subResult
	set o's subResult to {}
end repeat
set AppleScript's text item delimiters to delimitHolder
return freeNASdataList

on get_test_data()
	return "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-28 09:19:14
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Alexandra Poole/510295^17-10-23 16:14:25
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/501888^17-10-02 14:14:08
./A/April Howard/482491^17-08-22 16:58:44"
end get_test_data

Presumably, though, you’re only interested in lines which either don’t exist or aren’t the same in both texts, so you don’t actually need to convert all the dates. You just need to find the lines in each text which aren’t in the other and go from there. In ASObjC, you might do something like the following, but I haven’t tried it with 1.7M-line texts!

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

local primary, backup, primarySet, backupSet, inPrimaryButNotInBackup, inBackupButNotInPrimary

set primary to "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-28 09:19:14
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Alexandra Poole/510295^17-10-23 16:14:25
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/501888^17-10-02 14:14:08
./A/April Howard/482491^17-08-22 16:58:44"

-- The same thing with a couple of missing entries and an earlier date.
set backup to "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-15 14:32:57
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/482491^17-08-22 16:58:44"

set primarySet to current application's class "NSSet"'s setWithArray:(primary's paragraphs)
set backupSet to current application's class "NSSet"'s setWithArray:(backup's paragraphs)

set inPrimaryButNotInBackup to primarySet's mutableCopy()
tell inPrimaryButNotInBackup to minusSet:(backupSet)
set inPrimaryButNotInBackup to inPrimaryButNotInBackup's allObjects() as list

set inBackupButNotInPrimary to backupSet's mutableCopy()
tell inBackupButNotInPrimary to minusSet:(primarySet)
set inBackupButNotInPrimary to inBackupButNotInPrimary's allObjects() as list

return {inPrimaryButNotInBackup, inBackupButNotInPrimary}

Thanks so much for the replies.

Shane - this is about 50 TB of data. One copy is local, and I actually want to compare that to two other copies - one’s on Amazon S3, the other’s on Backblaze. So I don’t see a way that purchasing a new disk is going to help here. Unless I threaten to hit the people in charge of backup with the new disk if they don’t get it working right :wink:

The situation is that some people insist that this data is fully backed up and everything’s a-ok, but I keep finding that files are missing on the backup. I felt like it’s time for a thorough audit of exactly how bad this is, so there’s less room for disagreement over whether or not we have a real problem.

Sometimes with this kind of AppleScript processing, I see someone post a problem here without even mentioning speed, someone posts a solution, and then it turns into a speed contest, with several people posting competing rewrites until it ends up 1,000 times faster “just for the fun of it.” So I thought I’d see if this code was one of those “1,000 times faster” situations, but it looks like it’s not.

That does get me into optimizing what I’m doing, rather than just how it’s done, as Nigel’s suggested.

It might be nice to have dates of files that are backed up, to help look for patterns in what is and isn’t there, but it’s not essential. So once I have all three lists, I can generate the set of entries exclusive to each list versus the master, and then only convert the dates for the files that are missing. Which (man, I hope) is a MUCH smaller set.

All file names are unique, and the final part is a six-digit number. So I could strip the lists down to just those numbers to run the comparison, then re-associate the much smaller difference set with the path and date. Not sure if that would save time or add time.
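Since the file names are unique, a rough shell sketch of the strip-and-re-associate idea (hypothetical file names and test data; assumes each record is path^date with the six-digit number as the last path component):

```shell
# Hypothetical test data: one "path^date" record per line.
printf '%s\n' './A/April Howard/481404^17-08-23 10:47:50' \
              './A/April Howard/501888^17-10-02 14:14:08' > primary.txt
printf '%s\n' './A/April Howard/481404^17-08-23 10:47:50' > backup.txt

# Strip each record down to its six-digit key: drop the "^date" part,
# then keep only the last path component. Sort, because comm needs sorted input.
extract_keys() { cut -d'^' -f1 "$1" | awk -F'/' '{print $NF}' | sort; }
extract_keys primary.txt > primary.keys
extract_keys backup.txt > backup.keys

# Keys present in the primary but not in the backup.
comm -23 primary.keys backup.keys > missing.keys

# Re-associate the missing keys with their full path^date records:
# turn each key into the fixed string "/KEY^" and match it literally.
sed 's|^|/|; s|$|^|' missing.keys > missing.pat
grep -F -f missing.pat primary.txt > missing-records.txt
cat missing-records.txt   # -> ./A/April Howard/501888^17-10-02 14:14:08
```

Whether keying on the numbers alone beats comparing whole lines would need measuring; with unique keys both approaches identify the same records.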

Can you guys tell me off the top of your head a fast way to get the exclusive set between two lists? I found this:
https://macscripter.net/viewtopic.php?id=41709
It’s a bit old, not sure if it’s the state-of-the-art.

Maybe I should just use the shell to sort the lists and run diff, and then import the output from that into AppleScript to coerce the dates and do some analysis on what I’ve got.
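On that note, comm on sorted copies is usually simpler than diff for pulling out the lines exclusive to each file. A minimal sketch with hypothetical data (comm requires its inputs to be sorted):

```shell
# Hypothetical record files, one entry per line.
printf 'alpha\nbravo\ncharlie\n' > primary.txt
printf 'bravo\ncharlie\ndelta\n' > backup.txt

# Sort both files first; comm depends on sorted input.
sort primary.txt > primary.sorted
sort backup.txt > backup.sorted

# -23 suppresses columns 2 (only in backup) and 3 (common): lines only in primary.
comm -23 primary.sorted backup.sorted > only-in-primary.txt
# -13 suppresses columns 1 and 3: lines only in backup.
comm -13 primary.sorted backup.sorted > only-in-backup.txt

cat only-in-primary.txt   # -> alpha
cat only-in-backup.txt    # -> delta
```

sort and comm handle million-line files comfortably, and the two (much smaller) output files could then be read into AppleScript for just the date coercion and analysis.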

Thanks again,

tspoon.

Like I said :slight_smile:

Well it sort of is, if you go to ASObjC. The real question is whether 1000 times faster is still going to be unreasonably slow – your extrapolation seemed to me wildly optimistic. IAC, simple AppleScript optimizing isn’t going to cut it.

If you can just treat each entry as a string – perhaps trimming off the initial part of the paths if necessary – it becomes much, much simpler.

That’s what Nigel’s latter code above is doing. The only change I would make is to avoid any direct use of AppleScript values, because of the size: so I’d read the contents from files, and write to files, all as ASObjC values. That might well work fine with your data set. At worst, you might have to make a handful of chunks.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- classes, constants, and enums used
property NSUTF8StringEncoding : a reference to 4

set posixPath1 to POSIX path of (choose file with prompt "Choose a file")
set posixPath2 to POSIX path of (choose file with prompt "Choose a second file")

set primary to current application's NSString's stringWithContentsOfFile:posixPath1 encoding:NSUTF8StringEncoding |error|:(missing value)
set primaryList to primary's componentsSeparatedByString:linefeed -- assuming LFs
set backup to current application's NSString's stringWithContentsOfFile:posixPath2 encoding:NSUTF8StringEncoding |error|:(missing value)
set backupList to backup's componentsSeparatedByString:linefeed -- assuming LFs

set primarySet to current application's class "NSSet"'s setWithArray:primaryList
set backupSet to current application's class "NSSet"'s setWithArray:backupList

set inPrimaryButNotInBackup to primarySet's mutableCopy()
tell inPrimaryButNotInBackup to minusSet:(backupSet)
set inPrimaryButNotInBackupText to inPrimaryButNotInBackup's allObjects()'s componentsJoinedByString:linefeed
inPrimaryButNotInBackupText's writeToFile:(posixPath1 & "-out.txt") atomically:true encoding:NSUTF8StringEncoding |error|:(missing value)

set inBackupButNotInPrimary to backupSet's mutableCopy()
tell inBackupButNotInPrimary to minusSet:(primarySet)
set inBackupButNotInPrimaryText to inBackupButNotInPrimary's allObjects()'s componentsJoinedByString:linefeed
inBackupButNotInPrimaryText's writeToFile:(posixPath2 & "-out.txt") atomically:true encoding:NSUTF8StringEncoding |error|:(missing value)

Well, things are always more complicated than expected.

I got S3 Command installed to try to suck equivalent data off S3… but the command I used to get this data off the FreeNAS was:

-- find ./ * -d 3 -type d -print0 | xargs -0 -P 0 stat -f '%N^%Sm' -t '%y-%m-%d %H:%M:%S' > /mnt/teespool/newtees/Misc/_User\ Folders/name/fileList.txt

S3 CLI has no “find” command.

It does have ls, and I don’t need mod dates here. There’s a recursive argument to ls… but no “depth” argument to limit it. That’s going to turn my current 1.7 million records into more like 20 million, if it lists every single directory and file. Of course that output would contain everything I need as a subset… but then I’m trying to pare down a data file with 20 million records.
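That said, paring a fully recursive listing back down to the depth-3 entries is cheap once it’s in a file. A sketch with hypothetical paths (assumes one ./letter/name/number-style path per line; a real S3 listing puts date and size columns before the key, so the field test would need adjusting to the actual format):

```shell
# Hypothetical recursive listing containing every directory and file.
cat > listing.txt <<'EOF'
./A
./A/April Howard
./A/April Howard/481404
./A/April Howard/481404/photo.jpg
./U/Ulrick Fong
./U/Ulrick Fong/644519
./U/Ulrick Fong/644519/deep/file.txt
EOF

# Keep only depth-3 paths: exactly four "/"-separated fields
# (".", letter folder, name folder, six-digit folder).
awk -F'/' 'NF == 4' listing.txt > depth3.txt
cat depth3.txt
```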

So I thought maybe I can get access to full command line tools to run on this to get an exactly equivalent file by mounting the bucket to my computer. Might be too slow, but worth a shot.

It took a while for me to get S3FS working right… but it turns out S3FS sees certain kinds of folders as files. Including the ones I’m trying to enumerate here.

There’s a fork called S3FS-c that’s supposed to fix this problem. But while S3FS is in Homebrew… S3FS-c is not. I’m not a shell expert; I guess I need to start learning how to compile and install command-line programs.

Thanks for the help, I’ll be back.

This is my fourth attempt to write a script for this which doesn’t simply beachball Script Debugger or Script Editor! It successfully compares a test hierarchy on my iMac’s own hard disk (26 “letter” folders, 7172 “name” folders, and 1,700,000 files) with itself in around 22-25 minutes. I don’t know how this compares with what t.spoon’s been using or even if it works with those other systems! Presumably it would take longer to complete over a network and would need even more time to analyse any differences between different sources. However, it only searches for files in folders common to both hierarchies, which could save time in some cases. It logs any relative paths not common to both sources and the relative path of any matching files whose modification dates are too far apart. The tolerated interval is set in a property at the top of the script.

I noticed when testing the difference-reporting functions (with two much smaller folders!) that nominally equal modification-date NSDates, returned as URL resource values, weren’t recognised as equal when compared. I think that something about copying the files for testing may have added a few nanoseconds to the copies’ modification dates. There are a few ways around this. I’ve gone for extracting the relevant NSDateComponents and using dates reconstituted from these if needed.

(Edit: I’ve now changed the workaround to the one suggested by Shane in post#10 below: keeping the original NSDates and using NSCalendar’s isDate:equalToDate:toUnitGranularity:
method to catch any dates which should be considered equal if they haven’t already been caught by preceding tests. I’ve also corrected a bug and removed a test line which somehow got left in. :rolleyes: )

(Further edit: The script’s now been revamped to switch judiciously between ASObjC and vanilla AppleScript for the things they do fastest. Even with the coercions involved, this has reduced the running time to about fifteen-and-a-half minutes with the test folder on my hard disk. The modification dates are now converted to AppleScript dates, which don’t see nanosecond differences. However, in case they gain this ability in the future, a zero-nanosecond reference date is subtracted from each modification date and the difference is rounded to the nearest second towards zero. It’s these differences which are compared rather than the dates themselves. The extra work only adds a couple of seconds to the running time with 3,400,000 dates.)

As I said, I don’t know if it works for the intended situation. But it’s been an interesting exercise getting it to work at all. :slight_smile:

use AppleScript version "2.5" -- El Capitan (10.11) or later
use framework "Foundation"
use scripting additions

-- Edit these properties as required. The paths must be POSIX paths.
property primaryPath : POSIX path of (path to desktop) & "Primary"
property backupPath : POSIX path of (path to desktop) & "Primary"
property reportPath : "~/Desktop/freeNAS Primary and Backup differences.txt"
property fileLevel : 3 -- Equivalent to "-depth 3" in "find".
property toleratedBackupDelay : 8 * hours -- Report relative paths of corresponding primary and backup files whose modification dates are further apart than this.
property skipHiddenFiles : true -- Ignore hidden files?

main()

on main()
	script mainScript
		-- Preset some potentially often needed values!
		property |⌘| : current application
		property primaryURL : |⌘|'s class "NSURL"'s fileURLWithPath:((|⌘|'s class "NSString"'s stringWithString:(primaryPath))'s stringByExpandingTildeInPath())
		property backupURL : |⌘|'s class "NSURL"'s fileURLWithPath:((|⌘|'s class "NSString"'s stringWithString:(backupPath))'s stringByExpandingTildeInPath())
		property primaryRelativePathOffset : ((primaryURL's |path|() as text)'s length) + 2
		property backupRelativePathOffset : ((backupURL's |path|() as text)'s length) + 2
		
		property fileManager : |⌘|'s class "NSFileManager"'s defaultManager()
		property directoryKeys : |⌘|'s class "NSArray"'s arrayWithArray:({|⌘|'s NSURLIsDirectoryKey, |⌘|'s NSURLIsPackageKey})
		property skipsHiddenFiles : |⌘|'s NSDirectoryEnumerationSkipsHiddenFiles
		property directoryResult : (|⌘|'s class "NSDictionary"'s dictionaryWithObjects:({true, false}) forKeys:(directoryKeys)) as record
		property fileAndModDateKeys : |⌘|'s class "NSArray"'s arrayWithArray:({|⌘|'s NSURLIsRegularFileKey, |⌘|'s NSURLIsPackageKey, |⌘|'s NSURLContentModificationDateKey})
		property noHiddenFiles : (|⌘|'s NSDirectoryEnumerationSkipsHiddenFiles) * (skipHiddenFiles as integer)
		property |{true}| : {true}
		property referenceDate : (current date)
		
		property FinderSort : |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("path") ascending:(true) selector:("localizedStandardCompare:")
		
		property regex : |⌘|'s NSRegularExpressionSearch
		property regexEscapedPrimaryPath : (|⌘|'s class "NSRegularExpression"'s escapedPatternForString:(primaryURL's |path|())) as text
		property regexEscapedBackupPath : (|⌘|'s class "NSRegularExpression"'s escapedPatternForString:(backupURL's |path|())) as text
		property LF : |⌘|'s class "NSString"'s stringWithString:(linefeed)
		property LFLF : |⌘|'s class "NSString"'s stringWithString:(linefeed & linefeed)
		property LFLFLF : |⌘|'s class "NSString"'s stringWithString:(linefeed & linefeed & linefeed)
		property emptyString : |⌘|'s class "NSString"'s new()
		property |path| : |⌘|'s class "NSString"'s stringWithString:("path")
		property |modDateMinusRefDate| : |⌘|'s class "NSString"'s stringWithString:("modDateMinusRefDate")
		property |%@%@%@%@| : |⌘|'s class "NSString"'s stringWithString:("%@%@%@%@")
		property |PRIMARY FILES NOT IN BACKUP| : |⌘|'s class "NSString"'s stringWithString:("PRIMARY FILES NOT IN BACKUP:")
		property |BACKUP FILES NOT IN PRIMARY| : |⌘|'s class "NSString"'s stringWithString:("BACKUP FILES NOT IN PRIMARY:")
		property |BACKUPS WITH MODIFICATION DATES TOO LONG BEFORE THE PRIMARIES'| : |⌘|'s class "NSString"'s stringWithString:("BACKUPS WITH MODIFICATION DATES TOO LONG BEFORE THE PRIMARIES':")
		property |path IN %@| : |⌘|'s class "NSString"'s stringWithString:("path IN %@")
		property report : |⌘|'s class "NSMutableString"'s new()
		
		on main()
			-- Get URLs for the file-containing folders common to both the primary and backup folders, logging any folders NOT common to both in the report string.
			set {primaryFileContainerURLs, backupFileContainerURLs} to checkSubfolders()
			-- Compare the file contents of the two sets of file-containing folders, logging any differences in the report string.
			checkFiles(primaryFileContainerURLs, backupFileContainerURLs)
			-- Write the report to a text file.
			if (report's |length|() is 0) then set report to |⌘|'s class "NSString"'s stringWithString:("The files and modification dates in both folders are the same.")
			set expandedReportPath to (|⌘|'s class "NSString"'s stringWithString:(reportPath))'s stringByExpandingTildeInPath()
			tell report to writeToFile:(expandedReportPath) atomically:(true) encoding:(|⌘|'s NSUTF8StringEncoding) |error|:(missing value)
			
			return
		end main
		
		(* Compare the folders in the primary and backup hierarchies down to the level of the file-containing folders and log any differences. Return URLs for the file-containing folders common to both hierarchies. *)
		on checkSubfolders()
			-- Get the names of the primary file-container URLs.
			set primarySubfolderURLs to getSubfolderURLs(primaryURL) -- Mutable array
			set primarySubfolderNames to primarySubfolderURLs's valueForKey:("lastPathComponent")
			-- Ditto the backup file-container URLs.
			set backupSubfolderURLs to getSubfolderURLs(backupURL) -- Mutable array.
			set backupSubfolderNames to backupSubfolderURLs's valueForKey:("lastPathComponent")
			-- If the two sets of names are not the same, analyse, add to the report, and filter the URLs to leave just those for folders whose names are common to both hierarchies.
			if not (backupSubfolderNames's isEqualToArray:(primarySubfolderNames)) then
				reportOnAndFilterOutSubfolderDifferences(regexEscapedPrimaryPath, primarySubfolderURLs, backupSubfolderNames, "PRIMARY SUBFOLDERS NOT IN BACKUP:")
				reportOnAndFilterOutSubfolderDifferences(regexEscapedBackupPath, backupSubfolderURLs, primarySubfolderNames, "BACKUP SUBFOLDERS NOT IN PRIMARY:")
			end if
			-- Filter further to leave just URLs for the folders at the file-container level.
			filterByPathComponentCount(regexEscapedPrimaryPath, primarySubfolderURLs, fileLevel - 1)
			filterByPathComponentCount(regexEscapedBackupPath, backupSubfolderURLs, fileLevel - 1)
			
			return {primarySubfolderURLs, backupSubfolderURLs}
		end checkSubfolders
		
		(* Recursively find the folders in this particular hierarchy and return URLs for them. *)
		on getSubfolderURLs(topFolderURL)
			script localScript
				property subfolderURLs : {} --|⌘|'s class "NSMutableArray"'s new()
				
				on doRecursiveStuff(folderURL, currentLevel)
					set contentsURLs to (fileManager)'s contentsOfDirectoryAtURL:(folderURL) includingPropertiesForKeys:(directoryKeys) options:(skipsHiddenFiles) |error|:(missing value)
					set nextLevel to currentLevel + 1
					set gettingNextLevel to (nextLevel < fileLevel)
					repeat with thisURL in contentsURLs
						if ((thisURL's resourceValuesForKeys:(directoryKeys) |error|:(missing value)) as record is directoryResult) then
							set end of my subfolderURLs to thisURL
							if (gettingNextLevel) then doRecursiveStuff(thisURL, nextLevel)
						end if
					end repeat
				end doRecursiveStuff
			end script
			
			tell localScript to doRecursiveStuff(topFolderURL, 1)
			set subfolderURLs to |⌘|'s class "NSMutableArray"'s arrayWithArray:(localScript's subfolderURLs)
			tell subfolderURLs to sortUsingDescriptors:({FinderSort})
			return subfolderURLs
		end getSubfolderURLs
		
		(* Log the relative paths of folders which occur in one hierarchy but not the other and filter out the URLs corresponding to those paths. *)
		on reportOnAndFilterOutSubfolderDifferences(regexEscapedTopFolderPath, subfolderURLs, otherSubfolderNames, heading)
			set filter to |⌘|'s class "NSPredicate"'s predicateWithFormat_("NOT (lastPathComponent in %@)", otherSubfolderNames)
			set unmatchedSubfolderURLs to subfolderURLs's filteredArrayUsingPredicate:(filter)
			if ((count unmatchedSubfolderURLs) > 0) then
				addToReport(heading, unmatchedSubfolderURLs's valueForKey:(|path|))
				tell report to replaceOccurrencesOfString:("(?m)^" & regexEscapedTopFolderPath & "/") withString:(emptyString) options:(regex) range:({0, its |length|()})
				set filter to |⌘|'s class "NSPredicate"'s predicateWithFormat_("NOT (self IN %@)", unmatchedSubfolderURLs)
				tell subfolderURLs to filterUsingPredicate:(filter)
			end if
		end reportOnAndFilterOutSubfolderDifferences
		
		(* Append the "path"(s) in a given array to the report text along with a given heading. *)
		on addToReport(heading, anArray)
			tell report to appendFormat_(|%@%@%@%@|, heading, LFLF, anArray's componentsJoinedByString:(LF), LFLFLF)
		end addToReport
		
		(* Filter a hierarchy's folder URLs to leave just those for folders at the file-container level. *)
		on filterByPathComponentCount(regexEscapedTopFolderPath, subfolderURLs, containerLevel)
			set filter to |⌘|'s class "NSPredicate"'s predicateWithFormat:("path MATCHES '^" & regexEscapedTopFolderPath & "(?:/[^/]++){" & containerLevel & "}+$'")
			tell subfolderURLs to filterUsingPredicate:(filter)
		end filterByPathComponentCount
		
		(* Compare the files in each corresponding primary and backup folder and log any differences. *)
		on checkFiles(primaryFileContainerURLs, backupFileContainerURLs)
			-- The modification dates of files modified on APFS systems may lose nanosecond components when the files are copied to HFS disks or are backed up by processes ignorant of date nanoseconds. Since this script converts the file dates to AppleScript dates, such differences won't currently affect their comparison. However, in case AS dates ever gain nanosecond precision in the future, a nanosecondless reference date is subtracted from each date and it's these differences, rounded towards zero to the nearest second, which are compared rather than the dates themselves. The extra work involved only adds a couple of seconds to the running time with 3,400,000 dates. The reference date used doesn't matter (it can be in the future!) so long as its 'time' or 'seconds' is an integer.
			tell referenceDate to set {its day, its year, its month, its time} to {1, 1904, January, 0}
			repeat with i from 1 to (count primaryFileContainerURLs)
				set primaryFileInfo to getFileInfo(item i of primaryFileContainerURLs, primaryRelativePathOffset)
				set backupFileInfo to getFileInfo(item i of backupFileContainerURLs, backupRelativePathOffset)
				if not (backupFileInfo's isEqualToArray:(primaryFileInfo)) then analyseFileDifferences(primaryFileInfo, backupFileInfo)
			end repeat
		end checkFiles
		
		(* Get an array of dictionaries containing the relative paths of the files in a particular folder and the differences in whole seconds between the files' modification dates and the reference date. *)
		on getFileInfo(containerURL, topFolderRelativePathOffset)
			script o
				property fileInfo : {}
			end script
			
			set contentURLs to fileManager's contentsOfDirectoryAtURL:(containerURL) includingPropertiesForKeys:(fileAndModDateKeys) options:(noHiddenFiles) |error|:(missing value)
			repeat with thisURL in contentURLs
				set fileAndModDateValues to (thisURL's resourceValuesForKeys:(fileAndModDateKeys) |error|:(missing value)) as record
				if ((fileAndModDateValues as list) contains |{true}|) then -- The best option if hedging one's bets, otherwise:
					-- if ((fileAndModDateValues's NSURLIsRegularFileKey) or (fileAndModDateValues's NSURLIsPackageKey)) then -- Faster if the files are known to be mostly regular files.
					-- if ((fileAndModDateValues's NSURLIsPackageKey) or (fileAndModDateValues's NSURLIsRegularFileKey)) then -- Faster if the files are known to be mostly packages.
					set relativePath to (thisURL's |path|() as text)'s text topFolderRelativePathOffset thru end
					set modDate to fileAndModDateValues's NSURLContentModificationDateKey
					set end of o's fileInfo to {|path|:relativePath, |modDateMinusRefDate|:(modDate - referenceDate) div 1}
				end if
			end repeat
			-- Convert the list to an NSArray and sort by 'path', Finder-style, for later comparison with an array for the corresponding other folder.
			-- This assumes the files are likely to match in the majority of cases. (Creating, sorting, and comparing NSMutableArrays is faster than the same with NSMutableOrderedSets.) If they're more likely NOT to match, it may be better to set an NSMutableOrderedSet here instead, use 'isEqualToOrderedSet:' instead of 'isEqualToArray:' in the checkFiles() handler above, and cut the first two instructions in the analyseFileDifferences() handler below.
			set fileInfo to |⌘|'s class "NSMutableArray"'s arrayWithArray:(o's fileInfo)
			tell fileInfo to sortUsingDescriptors:({FinderSort})
			
			return fileInfo
		end getFileInfo
		
		(* Knowing that two arrays of dictionaries containing paths and modification date/reference date differences aren't equal, analyse the differences and add to the report. *)
		on analyseFileDifferences(primaryFileInfo, backupFileInfo)
			-- Switching to ordered sets is useful here.
			set primaryFileInfo to |⌘|'s class "NSOrderedSet"'s orderedSetWithArray:(primaryFileInfo)
			set backupFileInfo to |⌘|'s class "NSMutableOrderedSet"'s orderedSetWithArray:(backupFileInfo)
			-- Reduce each set to its dictionaries with no counterpart in the other.
			set inPrimaryButNotInBackup to primaryFileInfo's mutableCopy()
			tell inPrimaryButNotInBackup to minusOrderedSet:(backupFileInfo)
			set inBackupButNotInPrimary to backupFileInfo -- 's mutableCopy()
			tell inBackupButNotInPrimary to minusOrderedSet:(primaryFileInfo)
			-- Get the relative paths from the remaining dictionaries (also as ordered sets).
			set primaryPaths to inPrimaryButNotInBackup's valueForKey:(|path|)
			set backupPaths to inBackupButNotInPrimary's valueForKey:(|path|)
			-- Analyse and report on any paths which don't exist in both sets.
			set pathsOnlyInPrimary to getPathDifferences(primaryPaths, backupPaths)
			if ((count pathsOnlyInPrimary) > 0) then addToReport(|PRIMARY FILES NOT IN BACKUP|, pathsOnlyInPrimary's array())
			set pathsOnlyInBackup to getPathDifferences(backupPaths, primaryPaths)
			if ((count pathsOnlyInBackup) > 0) then addToReport(|BACKUP FILES NOT IN PRIMARY|, pathsOnlyInBackup's array())
			-- Analyse and report on any paths which DO exist in both sets. These belong to matching files with different modification dates (or with modification dates which are nominally equal but actually a few nanoseconds apart, which can happen under some circumstances).
			set pathsWithDifferentModificationDates to getModDateDifferences(primaryPaths, backupPaths, inPrimaryButNotInBackup, inBackupButNotInPrimary)
			if ((count pathsWithDifferentModificationDates) > 0) then addToReport(|BACKUPS WITH MODIFICATION DATES TOO LONG BEFORE THE PRIMARIES'|, pathsWithDifferentModificationDates's array())
		end analyseFileDifferences
		
		(* Return the relative paths in one ordered set which aren't in the other. *)
		on getPathDifferences(orderedSetA, orderedSetB)
			set orderedSetA to orderedSetA's mutableCopy()
			tell orderedSetA to minusOrderedSet:(orderedSetB)
			
			return orderedSetA
		end getPathDifferences
		
		(* Return any relative paths common to two ordered sets if the modification dates of the files to which they point are more than the tolerated interval apart. *)
		on getModDateDifferences(primaryPaths, backupPaths, inPrimaryButNotInBackup, inBackupButNotInPrimary)
			set commonPaths to primaryPaths's mutableCopy()
			tell commonPaths to intersectOrderedSet:(backupPaths)
			set commonPathCount to (count commonPaths)
			if (commonPathCount > 0) then
				-- If there are relative paths in common, get the dictionaries containing those paths …
				set infoFilter to |⌘|'s class "NSPredicate"'s predicateWithFormat_(|path IN %@|, commonPaths)
				tell inPrimaryButNotInBackup to filterUsingPredicate:(infoFilter)
				tell inBackupButNotInPrimary to filterUsingPredicate:(infoFilter)
				-- … and extract their |modDateMinusRefDate| values as lists of AS integers.
				script o
					property primaryModDateDifferences : (inPrimaryButNotInBackup's array()'s valueForKey:(|modDateMinusRefDate|)) as list
					property backupModDateDifferences : (inBackupButNotInPrimary's array()'s valueForKey:(|modDateMinusRefDate|)) as list
				end script
				repeat with i from commonPathCount to 1 by -1
					-- Compare the |modDateMinusRefDate| differences corresponding to the ith relative path in commonPaths.
					set primaryModDateDifference to o's primaryModDateDifferences's item i
					set backupModDateDifference to o's backupModDateDifferences's item i
					-- If the difference between the differences is within the tolerated interval, remove the corresponding relative path from consideration.
					if (primaryModDateDifference - backupModDateDifference is not greater than toleratedBackupDelay) then tell commonPaths to removeObjectAtIndex:(i - 1)
				end repeat
			end if
			
			return commonPaths
		end getModDateDifferences
	end script
	
	mainScript's main()
end main
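For anyone following along in another language, the set arithmetic at the heart of the handlers above (subtract one hierarchy's paths from the other's, intersect to find the common paths, then drop pairs whose modification dates are within the tolerated interval) can be sketched quite compactly. A minimal Python sketch with made-up paths and dates, not the script's actual data:

```python
from datetime import datetime, timedelta

# Hypothetical sample data: relative path -> modification time.
primary = {
    "A/Adams/file1.txt": datetime(2021, 3, 1, 12, 0, 0, 500000),
    "A/Adams/file2.txt": datetime(2021, 3, 2, 9, 30, 0),
    "B/Brown/file3.txt": datetime(2021, 3, 3, 8, 0, 0),
}
backup = {
    "A/Adams/file1.txt": datetime(2021, 3, 1, 12, 0, 0),  # sub-second difference only
    "A/Adams/file2.txt": datetime(2021, 3, 1, 9, 30, 0),  # a day older
}

tolerated = timedelta(seconds=1)

# Equivalent of minusOrderedSet: paths present in one hierarchy but not the other.
only_in_primary = sorted(primary.keys() - backup.keys())
only_in_backup = sorted(backup.keys() - primary.keys())

# Equivalent of intersectOrderedSet plus the per-path date check:
# keep only common paths whose dates differ by more than the tolerance.
stale = sorted(p for p in primary.keys() & backup.keys()
               if primary[p] - backup[p] > tolerated)

print(only_in_primary)  # ['B/Brown/file3.txt']
print(only_in_backup)   # []
print(stale)            # ['A/Adams/file2.txt']
```

Note that file1's half-second discrepancy is swallowed by the tolerance here, which is exactly the nanosecond problem discussed below.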

Awesome effort, Nigel :cool: :cool: :cool:.

Different file systems can store the values to different precision – HFS stores times to the nearest second, for example, whereas APFS stores some crazy number of decimal places. Could that explain what you’re seeing?

Hi Shane. Thanks for the “cools”.

I didn’t know the technicalities, but I did theorise that sub-second time differences were involved. The folders and files were all created and compared on the same Mojave machine, i.e. in APFS. The curious thing is that the “backup” folder in my tests was initialised by simply duplicating the smaller “primary” one I’d created, so the dates of the contained files should have been the same in both. My initial thought was that nanoseconds had been added to the duplicates’ times for some reason, but I suppose it’s more likely that they’d actually been shaved off by the Finder during duplication. I’ll look into it more closely today.

Another workaround which worked was to coerce each NSDate to an AS date before adding the record containing it to the relevant array. But this turned out to be only slightly faster than getting the date components and may not work in the future if AppleScript dates ever get the same nanosecond precision or if nanoseconds occasionally get rounded up.

That said, nanosecond differences only affect the minusSet() operations in the script which quickly narrow down the “path”/“modification date” dictionaries to those needing further attention. Unless ‘toleratedBackupDelay’ was set to 0, they’d be unlikely to make any difference to what was actually written to the log file, just to the time taken to eliminate false differences. It’s a trade-off between the extra time needed to quantise every modification date on receipt and the time saved by being able to batch-eliminate matching pairs later on.
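The quantising trade-off can be seen with plain POSIX timestamps. A small Python illustration with hypothetical values:

```python
import math

# Two "identical" modification times that differ only below the second
# (hypothetical values; APFS stores sub-second precision, HFS+ doesn't).
primary_ts = 1614598200.123456789
backup_ts = 1614598200.0

# Compared at full precision, the pair looks different ...
print(primary_ts == backup_ts)  # False

# ... but quantised to whole seconds on receipt, the pair matches and
# can be batch-eliminated by a set "minus" operation.
print(math.floor(primary_ts) == math.floor(backup_ts))  # True
```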

If I’m following correctly, you could also stick to dates rather than date components, and then use NSCalendar’s compareDate:toDate:toUnitGranularity: or isDate:equalToDate:toUnitGranularity: with NSCalendarUnitSecond.

Those look useful for individual date comparisons. Thanks. My original idea was to use a “minus” set method to bulk-eliminate dictionaries containing identical relative paths and modification dates. Any dictionaries left would have either a path or a modification date not found in the other hierarchy, and only these would then need to be handled individually. The possibility of nanosecond differences between “identical” dates makes this unreliable. I got round the problem above by putting the dates’ relevant components in the dictionaries instead. But I’m beginning to see possible advantages to keeping the original dates and modifying the approach as you suggest. Identical dictionaries would be eliminated as before when the dates really were identical. Otherwise additional, individual date comparisons would be needed to establish identicality. But the additional work would only be one comparison per pair, and only when paths matched, as opposed to component extractions for every date obtained and the reconstitution of dates of interest for the interval calculation. It wouldn’t be worth fixing the calculation itself for when the result was within a few nanoseconds of the limit. Hmm….

I’ve edited the script above to implement Shane’s suggestion in post #10, which takes a minute or two off the time with my large test folder and produces the correct result with my two small ones. The isDate:equalToDate:toUnitGranularity: check is only done if the difference between primary and backup modification dates is more than the tolerated interval, just in case the interval is 0 and the dates are only a few nanoseconds apart. In any other situation, it’s not needed. There’s no check for backups being newer than their primaries!


set primarySet to current application's class "NSSet"'s setWithArray:primaryList
set backupSet to current application's class "NSSet"'s setWithArray:backupList

Since a variable like primarySet or backupSet has to sit in RAM, and your data runs to 50 TB, this operation isn’t realisable. I don’t believe a computer with 50 TB of RAM exists.

The conclusion is that you can’t avoid moving the data into RAM in chunks here, so your code can’t become fast. In general, DISK -> RAM -> DISK -> RAM -> … operations can’t be fast, because the script’s speed is almost entirely limited by how slowly data can be read from the disk. This is not a matter of code efficiency, but of lagging disk technology.

Therefore, the database itself must be organised more efficiently, so that particular logical units of it can be retrieved without exceeding RAM capacity. For example, I see you have some kind of alphabetization (./A, ./U). That’s closer to the logic of things, but this partitioning isn’t enough. I see you have a database of people, and each person has several other properties besides the letter his last name begins with. For example, you wouldn’t look for men among women, or babies among old people.

The disk has 50TB. The array containing the file names and modification dates won’t be anywhere near that. A fair chunk of memory, for sure, but quite manageable.

I’ve been able to confirm this morning that all the files in my “backup” folder have zero nanoseconds in their modification dates, whereas those in the small “primary” folder from which it was derived have nanoseconds > zero. But I haven’t been able to reproduce the effect when deriving new “back up” folders today. The modification dates of the files in today’s folders exactly match those in the original. It remains a mystery. :confused:

The modification dates of the folders in the two hierarchies don’t match at all, which is easily explained by my having dragged things into and out of them during testing. All the folder dates have nanosecond components > zero.

But the main difference between the script above and three previous attempts which beachballed both script editors is that it breaks the array data down into more easily digested chunks:

• An array each of URLs for the folders in each hierarchy (down to the level above that specified for the files). After these are compared, they’re reduced to just the “name” folders common to both collections. They remain in memory while the following come and go.
• An array of URLs for the contents of one “name” folder. This provides data for a (possibly) shorter array containing the data of interest for the folder’s files and is then discarded.
• Two data-of-interest arrays resulting from the preceding point. If these prove to be identical, they’re simply discarded. Otherwise various ordered sets are derived from them for analytical purposes before both the arrays and the sets are discarded. It would conceivably save memory to build the ordered sets directly rather than building arrays and then deriving the sets from them, but the latter process appears to be slightly faster.

This approach works with the large test folder I set up on my machine, but it’s conceivable that the number of “name” folders in t.spoon’s case, or the number of files in any particular one of them, could exceed the “beachball” limit. I’ve no idea what this limit is.

The text file of the complete directory listing is 84.5 MB, so yes, nowhere near 50TB, and much less than the 16 GB of RAM in this machine.
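For what it’s worth, a listing of that size is comfortably streamed one line at a time rather than loaded into a single giant list. A Python sketch, assuming the “path^date” line format used earlier in the thread:

```python
def parse_listing(lines):
    """Stream 'relative/path^moddate' lines as (path, raw_date) pairs,
    one at a time, stripping any leading './' from the path."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        path, _, raw_date = line.partition("^")
        if path.startswith("./"):
            path = path[2:]
        yield path, raw_date

# Hypothetical sample line in the assumed format:
sample = ["./A/Adams/file1.txt^2021-03-01 12:00:00\n"]
print(list(parse_listing(sample)))
# [('A/Adams/file1.txt', '2021-03-01 12:00:00')]
```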

I got stuck on getting the directory listing I’m comparing it to out of S3. I gave up on S3FS-c and realized that Transmit can mount an S3 bucket as a drive. But it didn’t work.

I don’t understand why it doesn’t work to mount S3 as a volume and use a command-line “find” to extract just the directories I need, like I do on the server. The command doesn’t time out or appear to fail, it appears to complete in the shell, and I get my file with the output. The output is correct, except it’s missing the vast majority of the data that should be in it, much of which I’ve confirmed is there in the file system. So I’m at an impasse on this project.

My only option seems to be to list all 20+ million files/directories using S3 Command Line Tools, rather than just the ones I actually need to compare… and if I thought parsing 1.7 million was bad, I don’t look forward to that.

I’d like to investigate the fastest tools for processing huge text datasets. I’ve wondered how it might perform if I simply import all this data to SQL and mess with it there. I’m a beginner at SQL though.
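On the SQL idea: SQLite is a low-friction way to try it, since it needs no server and Python ships with a driver. A hedged sketch with hypothetical table and column names, using tiny in-memory sample rows in place of the real million-line imports:

```python
import sqlite3

# Hypothetical rows standing in for the two parsed directory listings.
server_rows = [("A/Adams/file1.txt", 1614598200),
               ("B/Brown/file2.txt", 1614598300)]
s3_rows = [("A/Adams/file1.txt", 1614598200)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE server (path TEXT PRIMARY KEY, mod_ts INTEGER)")
con.execute("CREATE TABLE s3 (path TEXT PRIMARY KEY, mod_ts INTEGER)")
con.executemany("INSERT INTO server VALUES (?, ?)", server_rows)
con.executemany("INSERT INTO s3 VALUES (?, ?)", s3_rows)

# Server paths with no backup at all: LEFT JOIN, keep the unmatched rows.
missing = [row[0] for row in con.execute(
    "SELECT server.path FROM server "
    "LEFT JOIN s3 ON server.path = s3.path "
    "WHERE s3.path IS NULL")]

# Common paths whose timestamps differ by more than a tolerated interval.
stale = [row[0] for row in con.execute(
    "SELECT server.path FROM server "
    "JOIN s3 ON server.path = s3.path "
    "WHERE server.mod_ts - s3.mod_ts > ?", (1,))]

print(missing)  # ['B/Brown/file2.txt']
print(stale)    # []
```

The PRIMARY KEY on `path` gives each table an index, so the joins stay fast even at millions of rows, and nothing has to be held in an AppleScript list at all.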

Thanks for all the help, and I suspect the techniques posted here will come back to be helpful on future projects,

tspoon.

Is the “depth” the same in both cases?

Interestingly, when I apply “find” to my test folder, it takes between 26 and 27 seconds regardless of whether I set “-depth” to 3 (the 1,724,231 files), 2 (the 7172 “name” folders), or 1 (the 26 “letter” folders), from which I deduce that it scans the entire contents anyway but only returns the items at the specified level. Very much faster for the first two levels is to use “-maxdepth” as well, which presumably limits the depth of the scan. If your disk hierarchies have deeper levels than those which interest you, it may be an idea to do this. Or you could do it anyway, since the time taken for the deepest level is the same in all cases: “-depth 3”, “-maxdepth 3 -depth 3”, or “-maxdepth 3 -mindepth 3”. Perhaps the “-maxdepth 3 -mindepth 3” pairing is very slightly the fastest, but it’s difficult to be sure.

Thanks Nigel, that’s very helpful. I had figured out that the “depth” argument wasn’t trimming any time, that it had to still be scanning everything. But I hadn’t noticed the “maxdepth” and “mindepth” arguments.

I wonder what the point is of including a “depth” argument that doesn’t stop the scan from going down the tree further than it can possibly return results?

I’ll try again with maxdepth and mindepth arguments and see if that helps,

tspoon.

I’ve managed to knock five or six minutes off the running time of the script in post #7 by switching to vanilla AppleScript for the things it does faster (ie. building collections and comparing things) and have edited the post accordingly. To get round the nanoseconds-difference problem with the modification dates, the script now compares the seconds-granulated differences between the modification dates and a reference date rather than comparing the modification dates directly. This isn’t actually necessary with AppleScript dates, but it adds very little to the running time and is a hedge against AS dates gaining nanosecond “granularity” in the future.

As nobody actually posted a code example using maxdepth, I can’t tell whether your attempted statement is being interpreted correctly. However, when used properly the option very clearly limits the traversal depth, not just the returned results; I’ve demonstrated this in previous posts.