Tuesday, November 19, 2019

#1 2019-11-03 09:15:49 pm

t.spoon
Member
From: BFE, Massachusetts
Registered: 2013-01-13
Posts: 448

Alright speed demons, I need to text process a list with 1.7 M items

I'm trying to compare some large sets of files on different volumes - a primary set, and a backup that may or may not be missing some things it shouldn't.

The file list I have of the primary was extracted with a shell "find" and is in this format:

./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-28 09:19:14
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Alexandra Poole/510295^17-10-23 16:14:25
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/501888^17-10-02 14:14:08
./A/April Howard/482491^17-08-22 16:58:44



I chose that date format thinking it could be coerced straight to AppleScript dates. It turns out, at least with my date settings, that it takes a bit of massaging. If that's the main source of the slowdown, I could change my date settings in System Preferences to make it directly coercible, but I thought I'd come here first and see what the knowledgeable have to say about it.

I'll get the date format fixed in the shell command that generates the files for future runs, but I can't get a new list run until next weekend, and I don't know if fixing the date format is the main time sink in this code anyway.

So this is all just to process the input data of one list: after that, I'm still going to have to come up with a way to compare it to the other list. But one step at a time.

Here's my current script:

Applescript:

-- commented out lines to grab the real 86 MB 1.7M line file
-- set freeNASfile to alias "[real data file path]"
-- set freeNASdata to read freeNASfile

set freeNASdata to get_test_data() -- grab small test set

set freeNASdataList to {}

set {delimitHolder, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "^"}
repeat with aLine in paragraphs of freeNASdata
   try
       set thePath to text item 1 of aLine
       if thePath starts with "./" then set thePath to text 3 through end of thePath
       set rawDate to text item 2 of aLine
       set ignoreRecord to false
   on error
       set ignoreRecord to true
   end try
   if ignoreRecord is false then
       set theHour to text 10 through 11 of rawDate as number
       if theHour < 12 then
           set ampm to " AM"
       else
           set theHour to theHour - 12
           set ampm to " PM"
       end if
       set revisedDate to text 4 through 8 of rawDate & "-" & text 1 through 2 of rawDate & " at " & theHour & text 12 through end of rawDate & ampm
       set ASdate to date revisedDate
       set freeNASdataList to freeNASdataList & {{thePath, ASdate}}
   end if
end repeat
set AppleScript's text item delimiters to delimitHolder

on get_test_data()
return "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-28 09:19:14
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Alexandra Poole/510295^17-10-23 16:14:25
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/501888^17-10-02 14:14:08
./A/April Howard/482491^17-08-22 16:58:44"

end get_test_data

I ran this with a larger 5,000 line sample of the data plugged in, and it works fine... except the run time was 2 minutes, which implies 11+ hours on the real dataset, if it scales linearly.

So I thought I'd see what advice you guys can give about the best way to speed it up.

Of course, the date coercion isn't going to work for those of you with different date formats in System Preferences... but I don't necessarily need code; suggestions for the best things to try to speed this up would be fine.

Thanks,

tspoon


Hackintosh built February, 2012 |  Mac OS Sierra
GIGABYTE GA-Z68X-UD3H-B3 | Core i5 2500k | 16 GB DDR3 | GIGABYTE Geforce 1050 TI 4GB
250 GB Samsung 850 EVO | 4 TB RAID
Dell Ultrasharp U3011 | Dell Ultrasharp 2007FPb

Offline

 

#2 2019-11-03 10:53:26 pm

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 6034

Re: Alright speed demons, I need to text process a list with 1.7 M items

Are those file names all unique? What comparison are you going to make? I fear you might be approaching the problem from the wrong end.

(The prospect of using basic AppleScript to compare two lists of 1.7 million records each doesn't sound very appealing, or practical. Assuming it's a one-off, I think I'd just buy a new disk.)


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com

Offline

 

#3 2019-11-04 05:44:42 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

Shane Stanley wrote:

Assuming it's a one-off, I think I'd just buy a new disk.


lol

I agree with Shane it's a bit of a task for vanilla AppleScript. If I was obliged to use it, I'd try to optimise the process as much as possible. For instance, by NOT building freeNASdataList by concatenation item by item! I'd try building intermediate lists — say of 4000 items each (or fewer in the last one) — by setting their ends to the individual results, then concatenating each of these to freeNASdataList. Something like the following, although you'd need to experiment to see what works best nowadays:

Applescript:

-- commented out lines to grab the real 86 MB 1.7M line file
-- set freeNASfile to alias "[real data file path]"
-- set freeNASdata to read freeNASfile

local freeNASdata, freeNASdataList, o

set freeNASdata to get_test_data() -- grab small test set

set freeNASdataList to {}

script
   property currentParagraphs : {}
   property subResult : {}
end script
set o to result

tell (current date) to set {dateObject, its day, its year, its month, its time} to {it, 1, 2000, January, 0} -- A known date whose day is 1.
set {delimitHolder, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "^"}
set paragraphCount to (count freeNASdata's paragraphs)
repeat with i from 1 to paragraphCount by 4000
   set j to i + 3999
   if (j > paragraphCount) then set j to paragraphCount
   set o's currentParagraphs to paragraphs i thru j of freeNASdata
   repeat with k from 1 to (j - i + 1)
       set aLine to item k of o's currentParagraphs
       try
            set {thePath, rawDate} to aLine's text items
            if thePath starts with "./" then set thePath to text 3 through end of thePath
           set ignoreRecord to false
       on error
           set ignoreRecord to true
       end try
       if ignoreRecord is false then
           -- Get a copy of the known date and set its properties.
           copy dateObject to ASdate
           -- set dateComponents to rawDate's words -- Or the following three lines to be safe.
           set AppleScript's text item delimiters to {"-", " ", ":"}
           set dateComponents to rawDate's text items
           set AppleScript's text item delimiters to "^"
           set item 1 of dateComponents to 2000 + (beginning of dateComponents) mod 2000 -- Assuming all the dates are between 2000 and 2999.
           tell ASdate to set {its year, its month, its day, its hours, its minutes, its seconds} to dateComponents
           set end of o's subResult to {thePath, ASdate}
       end if
   end repeat

   -- Concatenate the current sub-result list to the output and start another.
   set freeNASdataList to freeNASdataList & o's subResult
   set o's subResult to {}
end repeat
set AppleScript's text item delimiters to delimitHolder
return freeNASdataList

on get_test_data()
   return "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-28 09:19:14
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Alexandra Poole/510295^17-10-23 16:14:25
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/501888^17-10-02 14:14:08
./A/April Howard/482491^17-08-22 16:58:44"

end get_test_data

Presumably, though, you're only interested in lines which either don't exist or aren't the same in both texts, so you don't actually need to convert all the dates. You just need to find the lines in each text which aren't in the other and go from there. In ASObjC, you might do something like the following, but I haven't tried it with 1.7M-line texts!

Applescript:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

local primary, backup, primarySet, backupSet, inPrimaryButNotInBackup, inBackupButNotInPrimary

set primary to "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-28 09:19:14
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Alexandra Poole/510295^17-10-23 16:14:25
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/501888^17-10-02 14:14:08
./A/April Howard/482491^17-08-22 16:58:44"


-- The same thing with a couple of missing entries and an earlier date.
set backup to "./U/Uzair Siddiq/763681^19-05-28 19:34:11
./U/Ulisa Katoa/451229^17-06-21 16:55:13
./U/Ulyses Pratt/644979^18-08-22 19:09:54
./U/Urooj Siraj/639293^18-08-08 12:04:40
./U/Ubong Edemeka/855399^19-10-21 15:31:21
./U/Ulises Rodriguez/457279^17-06-15 14:32:57
./U/Ulises Rodriguez/429732^17-05-04 10:55:17
./U/Ukamaka Oparanozie/654655^18-09-14 11:48:01
./U/Ulrick Fong/644519^18-08-21 10:20:37
./A/Alexandra Riley/688871^18-12-04 17:14:23
./A/Alexandra Riley/669899^18-10-17 16:06:04
./A/Alexandra Riley/688870^18-12-04 17:14:27
./A/Andy Hu/848254^19-10-09 17:13:46
./A/Andy Hu/848150^19-10-09 15:17:10
./A/Anna Drexler/806971^19-08-07 10:04:04
./A/April Howard/481404^17-08-23 10:47:50
./A/April Howard/482491^17-08-22 16:58:44"


set primarySet to current application's class "NSSet"'s setWithArray:(primary's paragraphs)
set backupSet to current application's class "NSSet"'s setWithArray:(backup's paragraphs)

set inPrimaryButNotInBackup to primarySet's mutableCopy()
tell inPrimaryButNotInBackup to minusSet:(backupSet)
set inPrimaryButNotInBackup to inPrimaryButNotInBackup's allObjects() as list

set inBackupButNotInPrimary to backupSet's mutableCopy()
tell inBackupButNotInPrimary to minusSet:(primarySet)
set inBackupButNotInPrimary to inBackupButNotInPrimary's allObjects() as list

return {inPrimaryButNotInBackup, inBackupButNotInPrimary}

Last edited by Nigel Garvey (2019-11-04 07:38:43 am)


NG

Offline

 

#4 2019-11-04 07:44:01 am

t.spoon
Member
From: BFE, Massachusetts
Registered: 2013-01-13
Posts: 448

Re: Alright speed demons, I need to text process a list with 1.7 M items

Thanks so much for the replies.

Shane - this is about 50 TB of data. One copy is local, and I actually want to compare that to two other copies - one's on Amazon S3, the other's on Backblaze. So I don't see a way that purchasing a new disk is going to help here. Unless I threaten to hit the people in charge of backup with the new disk if they don't get it working right wink

The situation is that some people insist that this data is fully backed up and everything's a-ok, but I keep finding that files are missing on the backup. I felt like it's time for a thorough audit of exactly how bad this is, so there's less room for disagreement over whether or not we have a real problem.

Sometimes with this kind of Applescript processing, I see someone post a problem here not even talking about speed, someone posts a solution, and then it turns into a speed contest with several people posting competing rewrites, and it ends up being 1,000 times faster "just for the fun of it." So I thought I'd see if this code was one of these "1,000 times faster" situations, but it looks like it's not.

That does get me into optimizing what I'm doing, rather than just how it's done, as Nigel's suggested.

It might be nice to have dates of files that are backed up to help look for patterns in what is and isn't there, but it's not important. So once I have all three lists, I can generate the exclusive overlap sets between each list and the master, and then only convert the dates for the files that were missing. Which (man I hope) is a MUCH smaller set.

All the file names are unique - and the final part is a six-digit number. So I could strip it down to just numeric lists to run the comparison, then re-associate the much smaller overlapping set with the path and date. Not sure if that would save time or add time.
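Stripping the lines down to their IDs could itself be a shell one-liner. A minimal sketch (assuming the six-digit number is always the last path component, immediately before the ^; the sample line is taken from the test data above):

```shell
# Split each line on "/" and "^" and print the next-to-last field,
# which is the six-digit file name.
printf './U/Uzair Siddiq/763681^19-05-28 19:34:11\n' |
  awk -F'[/^]' '{print $(NF - 1)}'
# -> 763681
```

In real use the printf would be replaced by the actual list file, e.g. `awk -F'[/^]' '{print $(NF - 1)}' fileList.txt`.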

Can you guys tell me off the top of your head a fast way to get the exclusive set between two lists?  I found this:
https://macscripter.net/viewtopic.php?id=41709
It's a bit old, not sure if it's the state-of-the-art.

Maybe I should just use shell to sort the lists and run diff, and then import the output from that to Applescript to coerce the dates and do some analysis on what I've got.
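A minimal sketch of that shell approach, using `comm` rather than `diff` since it emits the two difference sets directly (it needs sorted input; the inline two-line lists here are just stand-ins for the real 1.7M-line files):

```shell
# Stand-ins for the real primary and backup list files.
printf '%s\n' './A/Andy Hu/848254^19-10-09 17:13:46' './A/Anna Drexler/806971^19-08-07 10:04:04' > primaryList.txt
printf '%s\n' './A/Andy Hu/848254^19-10-09 17:13:46' > backupList.txt

# comm requires both inputs to be sorted the same way.
sort primaryList.txt > primary.sorted
sort backupList.txt > backup.sorted

# -23: suppress columns 2 and 3, leaving only lines unique to the primary.
comm -23 primary.sorted backup.sorted > inPrimaryOnly.txt
# -13: lines unique to the backup.
comm -13 primary.sorted backup.sorted > inBackupOnly.txt

cat inPrimaryOnly.txt
```

The two output files are then small enough to hand back to AppleScript for the date coercion and analysis.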

Thanks again,

tspoon.


Hackintosh built February, 2012 |  Mac OS Sierra
GIGABYTE GA-Z68X-UD3H-B3 | Core i5 2500k | 16 GB DDR3 | GIGABYTE Geforce 1050 TI 4GB
250 GB Samsung 850 EVO | 4 TB RAID
Dell Ultrasharp U3011 | Dell Ultrasharp 2007FPb

Offline

 

#5 2019-11-04 05:10:10 pm

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 6034

Re: Alright speed demons, I need to text process a list with 1.7 M items

t.spoon wrote:

Unless I threaten to hit the people in charge of backup with the new disk if they don't get it working right wink



Like I said smile

Sometimes with this kind of Applescript processing, I see someone post a problem here not even talking about speed, someone posts a solution, and then it turns into a speed contest with several people posting competing rewrites, and it ends up being 1,000 times faster "just for the fun of it." So I thought I'd see if this code was one of these "1,000 times faster" situations, but it looks like it's not.



Well it sort of is, if you go to ASObjC. The real question is whether 1000 times faster is still going to be unreasonably slow -- your extrapolation seemed to me wildly optimistic. IAC, simple AppleScript optimizing isn't going to cut it.

All file names are unique - and the final part is six digit numeric. So I could strip it down to just numeric lists to run the comparison, then re-associate the much smaller overlapping set with the path and date. Not sure if that would save time or add time.



If you can just treat each entry as a string -- perhaps trimming off the initial part of the paths if necessary -- it becomes much, much simpler.

Can you guys tell me off the top of your head a fast way to get the exclusive set between two lists?



That's what Nigel's latter code above is doing. The only change I would make is to avoid any direct use of AppleScript values, because of the size: so I'd read the contents from files, and write to files, keeping everything as ASObjC values. That might well work fine with your data set. At worst, you might have to make a handful of chunks.

Applescript:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- classes, constants, and enums used
property NSUTF8StringEncoding : a reference to 4

set posixPath1 to POSIX path of (choose file with prompt "Choose a file")
set posixPath2 to POSIX path of (choose file with prompt "Choose a second file")

set primary to current application's NSString's stringWithContentsOfFile:posixPath1 encoding:NSUTF8StringEncoding |error|:(missing value)
set primaryList to primary's componentsSeparatedByString:linefeed -- assuming LFs
set backup to current application's NSString's stringWithContentsOfFile:posixPath2 encoding:NSUTF8StringEncoding |error|:(missing value)
set backupList to backup's componentsSeparatedByString:linefeed -- assuming LFs

set primarySet to current application's class "NSSet"'s setWithArray:primaryList
set backupSet to current application's class "NSSet"'s setWithArray:backupList

set inPrimaryButNotInBackup to primarySet's mutableCopy()
tell inPrimaryButNotInBackup to minusSet:(backupSet)
set inPrimaryButNotInBackupText to inPrimaryButNotInBackup's allObjects()'s componentsJoinedByString:linefeed
inPrimaryButNotInBackupText's writeToFile:(posixPath1 & "-out.txt") atomically:true encoding:NSUTF8StringEncoding |error|:(missing value)

set inBackupButNotInPrimary to backupSet's mutableCopy()
tell inBackupButNotInPrimary to minusSet:(primarySet)
set inBackupButNotInPrimaryText to inBackupButNotInPrimary's allObjects()'s componentsJoinedByString:linefeed
inBackupButNotInPrimaryText's writeToFile:(posixPath2 & "-out.txt") atomically:true encoding:NSUTF8StringEncoding |error|:(missing value)


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com

Offline

 

#6 2019-11-04 09:11:31 pm

t.spoon
Member
From: BFE, Massachusetts
Registered: 2013-01-13
Posts: 448

Re: Alright speed demons, I need to text process a list with 1.7 M items

Well, things are always more complicated than expected.

I got S3 Command installed to try to suck equivalent data off S3... but the command I used to get this data off the FreeNAS was:

Applescript:

-- find ./ * -d 3 -type d -print0 | xargs -0 -P 0 stat -f '%N^%Sm' -t '%y-%m-%d %H:%M:%S' > /mnt/teespool/newtees/Misc/_User\ Folders/name/fileList.txt

S3 CLI has no "find" command.

It does have ls, and I don't need mod dates here. There's a recursive argument to ls... but no depth argument to limit it. That's going to turn my current 1.7 million records into more like 20 million, if it lists every single directory and file. Of course that listing would contain everything I need as a subset... but then I'm trying to parse down a datafile with 20 million records.
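If the recursive listing is unavoidable, one workaround might be to filter it back down to the depth-3 paths with awk. A sketch, assuming the listing comes out as one relative path per line (the sample lines and file names are hypothetical); note this keeps anything at that depth, files and directories alike:

```shell
# Hypothetical sample of a fully recursive listing.
printf '%s\n' './A' './A/Andy Hu' './A/Andy Hu/848254' './A/Andy Hu/848254/part.dat' > listing.txt

# Keep only paths exactly three levels below "." -- splitting on "/" gives
# 4 fields for "./X/Name/123456".
awk -F'/' 'NF == 4' listing.txt > depth3.txt
cat depth3.txt
# -> ./A/Andy Hu/848254
```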

So I thought maybe I could get an exactly equivalent file by mounting the bucket on my computer and running the full command-line tools against it. Might be too slow, but worth a shot.

It took a while for me to get S3FS working right... but it turns out S3FS sees certain kinds of folders as files. Including the ones I'm trying to enumerate here.

There's a fork called S3FS-c that's supposed to fix this problem. But while S3FS is in Homebrew... S3FS-c is not. I'm not a shell expert; I guess I need to start learning how to compile and install programs from source.

Thanks for the help, I'll be back.


Hackintosh built February, 2012 |  Mac OS Sierra
GIGABYTE GA-Z68X-UD3H-B3 | Core i5 2500k | 16 GB DDR3 | GIGABYTE Geforce 1050 TI 4GB
250 GB Samsung 850 EVO | 4 TB RAID
Dell Ultrasharp U3011 | Dell Ultrasharp 2007FPb

Offline

 

#7 2019-11-10 03:42:35 pm

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

This is my fourth attempt at a script for this which doesn't simply beachball Script Debugger or Script Editor! It successfully compares a test hierarchy on my iMac's own hard disk (26 "letter" folders, 7172 "name" folders, and 1,700,000 files) with itself in around 22-25 minutes. I don't know how this compares with what t.spoon's been using, or even if it works with those other systems! Presumably it would take longer to complete over a network, and would need even more time to analyse any differences between different sources. However, it only searches for files in folders common to both hierarchies, which could save time in some cases. It logs any relative paths not common to both sources, and the relative path of any matching files whose modification dates are too far apart. The tolerated interval is set in a property at the top of the script.

I noticed when testing the difference-reporting functions (with two much smaller folders!) that nominally equal modification-date NSDates, returned as URL resource values, weren't recognised as equal when compared. I think that something about copying the files for testing may have added a few nanoseconds to the copies' modification dates. There are a few ways around this. I've gone for extracting the relevant NSDateComponents and using dates reconstituted from these if needed. (Edit: I've now changed the workaround to the one suggested by Shane in post #10 below: keeping the original NSDates and using NSCalendar's isDate:equalToDate:toUnitGranularity: method to catch any dates which should be considered equal if they haven't already been caught by preceding tests. I've also corrected a bug and removed a test line which somehow got left in.  roll  )

As I said, I don't know if it works for the intended situation. But it's been an interesting exercise getting it to work at all.  smile

Applescript:

use AppleScript version "2.5" -- El Capitan (10.11) or later
use framework "Foundation"
use scripting additions

-- Edit these properties as required. The paths must be POSIX paths.
property primaryPath : POSIX path of (path to desktop) & "Primary"
property backupPath : POSIX path of (path to desktop) & "Primary"
property reportPath : "~/Desktop/freeNAS Primary and Backup differences.txt"
property fileLevel : 3 -- Equivalent to "-depth 3" in "find".
property toleratedBackupDelay : 8 * hours -- Report relative paths of corresponding primary and backup files whose modification dates are further apart than this.
property skipHiddenFiles : true -- Ignore hidden files?

main()

on main()
   script mainScript
       -- Preset some potentially often used Cocoa values!
       property |⌘| : current application
       property fileManager : |⌘|'s class "NSFileManager"'s defaultManager()
       property primaryURL : |⌘|'s class "NSURL"'s fileURLWithPath:(primaryPath)
       property backupURL : |⌘|'s class "NSURL"'s fileURLWithPath:(backupPath)
       
       property slashLength : (|⌘|'s class "NSString"'s stringWithString:("/"))'s |length|() -- Only used to set the next two properties!
       property primaryPathLength : (primaryURL's |path|()'s |length|()) + slashLength
       property backupPathLength : (backupURL's |path|()'s |length|()) + slashLength
       
       property regex : |⌘|'s NSRegularExpressionSearch
       property regexEscapedPrimaryPath : (|⌘|'s class "NSRegularExpression"'s escapedPatternForString:(primaryURL's |path|())) as text
       property regexEscapedBackupPath : (|⌘|'s class "NSRegularExpression"'s escapedPatternForString:(backupURL's |path|())) as text
       
       property directoryKeys : |⌘|'s class "NSArray"'s arrayWithArray:({|⌘|'s NSURLIsDirectoryKey, |⌘|'s NSURLIsPackageKey})
       property skipsHiddenFiles : |⌘|'s NSDirectoryEnumerationSkipsHiddenFiles
       property directoryResult : |⌘|'s class "NSDictionary"'s dictionaryWithObjects:({true, false}) forKeys:(directoryKeys)
       property modDateKey : |⌘|'s NSURLContentModificationDateKey
       property fileAndModDateKeys : |⌘|'s class "NSArray"'s arrayWithArray:({|⌘|'s NSURLIsRegularFileKey, |⌘|'s NSURLIsPackageKey, modDateKey})
       property noHiddenFiles : (|⌘|'s NSDirectoryEnumerationSkipsHiddenFiles) * (skipHiddenFiles as integer)
       
       property currentCalendar : |⌘|'s class "NSCalendar"'s currentCalendar()
       property calendarUnitSecond : (|⌘|'s NSCalendarUnitSecond)
       
       property FinderSort : |⌘|'s class "NSSortDescriptor"'s sortDescriptorWithKey:("path") ascending:(true) selector:("localizedStandardCompare:")
       
       property LF : |⌘|'s class "NSString"'s stringWithString:(linefeed)
       property LFLF : |⌘|'s class "NSString"'s stringWithString:(linefeed & linefeed)
       property LFLFLF : |⌘|'s class "NSString"'s stringWithString:(linefeed & linefeed & linefeed)
       property emptyString : |⌘|'s class "NSString"'s new()
       property |path| : |⌘|'s class "NSString"'s stringWithString:("path")
       property |modification date| : |⌘|'s class "NSString"'s stringWithString:("modification date")
       property |%@%@%@%@| : |⌘|'s class "NSString"'s stringWithString:("%@%@%@%@")
       property |PRIMARY FILES NOT IN BACKUP| : |⌘|'s class "NSString"'s stringWithString:("PRIMARY FILES NOT IN BACKUP:")
       property |BACKUP FILES NOT IN PRIMARY| : |⌘|'s class "NSString"'s stringWithString:("BACKUP FILES NOT IN PRIMARY:")
       property |BACKUPS WITH MODIFICATION DATES TOO LONG BEFORE THE PRIMARIES'| : |⌘|'s class "NSString"'s stringWithString:("BACKUPS WITH MODIFICATION DATES TOO LONG BEFORE THE PRIMARIES':")
       property |path IN %@| : |⌘|'s class "NSString"'s stringWithString:("path IN %@")
       property report : |⌘|'s class "NSMutableString"'s new()
       
       on main()
           -- Get URLs for the file-containing folders common to both the primary and backup folders, logging any folders NOT common to both in the report string.
           set {primaryFileContainerURLs, backupFileContainerURLs} to checkSubfolders()
           -- Compare the file contents of the two sets of file-containing folders, logging any differences in the report string.
           checkFiles(primaryFileContainerURLs, backupFileContainerURLs)
           -- Write the report to a text file.
           if (report's |length|() is 0) then set report to |⌘|'s class "NSString"'s stringWithString:("The files and modification dates in both folders are the same.")
           set expandedReportPath to (|⌘|'s class "NSString"'s stringWithString:(reportPath))'s stringByExpandingTildeInPath()
           tell report to writeToFile:(expandedReportPath) atomically:(true) encoding:(|⌘|'s NSUTF8StringEncoding) |error|:(missing value)
           
           return
       end main
       
       (* Compare the folders in the primary and backup hierarchies down to the level of the file-containing folders and log any differences. Return URLs for the file-containing folders common to both hierarchies. *)
       on checkSubfolders()
           -- Get the names of the primary file-container URLs.
           set primarySubfolderURLs to getSubfolderURLs(primaryURL) -- Mutable array
           set primarySubfolderNames to primarySubfolderURLs's valueForKey:("lastPathComponent")
           -- Ditto the backup file-container URLs.
           set backupSubfolderURLs to getSubfolderURLs(backupURL) -- Mutable array.
           set backupSubfolderNames to backupSubfolderURLs's valueForKey:("lastPathComponent")
           -- If the two set of names are not the same, analyse, add to the report, and filter the URLs to leave just those for folders whose names are common to both hierarchies.
           if not (backupSubfolderNames's isEqualToArray:(primarySubfolderNames)) then
               reportOnAndFilterOutSubfolderDifferences(regexEscapedPrimaryPath, primarySubfolderURLs, backupSubfolderNames, "PRIMARY SUBFOLDERS NOT IN BACKUP:")
                reportOnAndFilterOutSubfolderDifferences(regexEscapedBackupPath, backupSubfolderURLs, primarySubfolderNames, "BACKUP SUBFOLDERS NOT IN PRIMARY:")
           end if
           -- Filter further to leave just URLs for the folders at the file-container level.
           filterByPathComponentCount(regexEscapedPrimaryPath, primarySubfolderURLs, fileLevel - 1)
           filterByPathComponentCount(regexEscapedBackupPath, backupSubfolderURLs, fileLevel - 1)
           
           return {primarySubfolderURLs, backupSubfolderURLs}
       end checkSubfolders
       
       (* Recursively find the folders in this particular hierarchy and return URLs for them. *)
       on getSubfolderURLs(topFolderURL)
           script localScript
               property subfolderURLs : |⌘|'s class "NSMutableArray"'s new()
               
               on doRecursiveStuff(folderURL, currentLevel)
                   set contentsURLs to (fileManager)'s contentsOfDirectoryAtURL:(folderURL) includingPropertiesForKeys:(directoryKeys) options:(skipsHiddenFiles) |error|:(missing value)
                   set nextLevel to currentLevel + 1
                   set gettingNextLevel to (nextLevel < fileLevel)
                   repeat with thisURL in contentsURLs
                       if ((thisURL's resourceValuesForKeys:(directoryKeys) |error|:(missing value))'s isEqualToDictionary:(directoryResult)) then
                           tell subfolderURLs to addObject:(thisURL)
                           if (gettingNextLevel) then doRecursiveStuff(thisURL, nextLevel)
                       end if
                   end repeat
               end doRecursiveStuff
           end script
           
           tell localScript
               doRecursiveStuff(topFolderURL, 1)
               tell its subfolderURLs to sortUsingDescriptors:({FinderSort})
               return its subfolderURLs
           end tell
       end getSubfolderURLs
       
       (* Log the relative paths of folders which occur in one hierarchy but not the other and filter out the URLs corresponding to those paths. *)
       on reportOnAndFilterOutSubfolderDifferences(regexEscapedTopFolderPath, subfolderURLs, otherSubfolderNames, heading)
           set filter to |⌘|'s class "NSPredicate"'s predicateWithFormat_("NOT (lastPathComponent in %@)", otherSubfolderNames)
           set unmatchedSubfolderURLs to subfolderURLs's filteredArrayUsingPredicate:(filter)
           if ((count unmatchedSubfolderURLs) > 0) then
               addToReport(heading, unmatchedSubfolderURLs's valueForKey:(|path|))
               tell report to replaceOccurrencesOfString:("(?m)^" & regexEscapedTopFolderPath & "/") withString:(emptyString) options:(regex) range:({0, its |length|()})
               set filter to |⌘|'s class "NSPredicate"'s predicateWithFormat_("NOT (self IN %@)", unmatchedSubfolderURLs)
               tell subfolderURLs to filterUsingPredicate:(filter)
           end if
       end reportOnAndFilterOutSubfolderDifferences
       
        (* Append the "path"(s) in a given array to the report text along with a given heading. *)
       on addToReport(heading, anArray)
           tell report to appendFormat_(|%@%@%@%@|, heading, LFLF, anArray's componentsJoinedByString:(LF), LFLFLF)
       end addToReport
       
       (* Filter a hierarchy's folder URLs to leave just those for folders at the file-container level. *)
       on filterByPathComponentCount(regexEscapedTopFolderPath, subfolderURLs, containerLevel)
           set filter to |⌘|'s class "NSPredicate"'s predicateWithFormat:("path MATCHES '^" & regexEscapedTopFolderPath & "(?:/[^/]++){" & containerLevel & "}+$'")
           tell subfolderURLs to filterUsingPredicate:(filter)
       end filterByPathComponentCount
       
       (* Compare the files in each corresponding primary and backup folder and log any differences. *)
       on checkFiles(primaryFileContainerURLs, backupFileContainerURLs)
           repeat with i from 1 to (count primaryFileContainerURLs)
               set primaryFileInfo to getFileInfo(item i of primaryFileContainerURLs, primaryPathLength)
               set backupFileInfo to getFileInfo(item i of backupFileContainerURLs, backupPathLength)
               if not (backupFileInfo's isEqualToArray:(primaryFileInfo)) then analyseFileDifferences(primaryFileInfo, backupFileInfo)
           end repeat
       end checkFiles
       
       (* Get an array of dictionaries containing the relative paths and modification dates of the files in a particular folder. *)
       on getFileInfo(containerURL, topFolderPathLength)
           set contentURLs to fileManager's contentsOfDirectoryAtURL:(containerURL) includingPropertiesForKeys:(fileAndModDateKeys) options:(noHiddenFiles) |error|:(missing value)
           set fileInfo to |⌘|'s class "NSMutableArray"'s new()
           repeat with thisURL in contentURLs
               set fileAndModDateValues to (thisURL's resourceValuesForKeys:(fileAndModDateKeys) |error|:(missing value))
               if (fileAndModDateValues's allValues()'s containsObject:(true)) then
                   set relativePath to (thisURL's |path|()'s substringFromIndex:(topFolderPathLength))
                   set modDate to (fileAndModDateValues's valueForKey:(modDateKey))
                   tell fileInfo to addObject:({|path|:relativePath, |modification date|:modDate})
               end if
           end repeat
           -- Sort the array by path, Finder-style, for later comparison with the array from the corresponding other folder.
           tell fileInfo to sortUsingDescriptors:({FinderSort})
           
           return fileInfo
       end getFileInfo
       
       (* Knowing that two arrays of dictionaries containing paths and modification dates aren't equal, analyse the differences and add to the report. *)
       on analyseFileDifferences(primaryFileInfo, backupFileInfo)
           -- Switching to ordered sets is useful here.
           set primaryFileInfo to |⌘|'s class "NSOrderedSet"'s orderedSetWithArray:(primaryFileInfo)
           set backupFileInfo to |⌘|'s class "NSMutableOrderedSet"'s orderedSetWithArray:(backupFileInfo)
           -- Reduce each set to its dictionaries with no counterpart in the other.
           set inPrimaryButNotInBackup to primaryFileInfo's mutableCopy()
           tell inPrimaryButNotInBackup to minusOrderedSet:(backupFileInfo)
           set inBackupButNotInPrimary to backupFileInfo -- 's mutableCopy()
           tell inBackupButNotInPrimary to minusOrderedSet:(primaryFileInfo)
           -- Get the relative paths from the remaining dictionaries (also as ordered sets).
           set primaryPaths to inPrimaryButNotInBackup's valueForKey:(|path|)
           set backupPaths to inBackupButNotInPrimary's valueForKey:(|path|)
           -- Analyse and report on any paths which don't exist in both sets.
           set pathsOnlyInPrimary to getPathDifferences(primaryPaths, backupPaths)
           if ((count pathsOnlyInPrimary) > 0) then addToReport(|PRIMARY FILES NOT IN BACKUP|, pathsOnlyInPrimary's array())
           set pathsOnlyInBackup to getPathDifferences(backupPaths, primaryPaths)
           if ((count pathsOnlyInBackup) > 0) then addToReport(|BACKUP FILES NOT IN PRIMARY|, pathsOnlyInBackup's array())
           -- Analyse and report on any paths which DO exist in both sets, which must belong to corresponding files with different modification dates (or with modification dates which are nominally equal but actually a few nanoseconds apart, which can happen under some circumstances).
           set pathsWithDifferentModificationDates to getModDateDifferences(primaryPaths, backupPaths, inPrimaryButNotInBackup, inBackupButNotInPrimary)
           if ((count pathsWithDifferentModificationDates) > 0) then addToReport(|BACKUPS WITH MODIFICATION DATES TOO LONG BEFORE THE PRIMARIES'|, pathsWithDifferentModificationDates's array())
       end analyseFileDifferences
       
       (* Return the relative paths in one ordered set which aren't in the other. *)
       on getPathDifferences(orderedSetA, orderedSetB)
           set orderedSetA to orderedSetA's mutableCopy()
           tell orderedSetA to minusOrderedSet:(orderedSetB)
           
           return orderedSetA
       end getPathDifferences
       
       (* Return any relative paths common to both ordered sets if the modification dates of the files to which they point are more than the tolerated interval apart. *)
       on getModDateDifferences(primaryPaths, backupPaths, primaryInfo, backupInfo)
           set commonPaths to primaryPaths's mutableCopy()
           tell commonPaths to intersectOrderedSet:(backupPaths)
           set commonPathCount to (count commonPaths)
           if (commonPathCount > 0) then
               -- If there are relative paths in common, get the corresponding path/modification date dictionaries.
               set infoFilter to |⌘|'s class "NSPredicate"'s predicateWithFormat_(|path IN %@|, commonPaths)
               tell primaryInfo to filterUsingPredicate:(infoFilter)
               tell backupInfo to filterUsingPredicate:(infoFilter)
               repeat with i from commonPathCount to 1 by -1
                    -- Compare the modification dates from the corresponding dictionaries.
                   set primaryModDate to ((item i of primaryInfo)'s valueForKey:(|modification date|))
                   set backupModDate to ((item i of backupInfo)'s valueForKey:(|modification date|))
                   -- If the difference between the dates is within the tolerated interval, or if the tolerated interval is 0 and the dates are nominally the same but actually a few nanoseconds apart, remove the corresponding relative path from consideration.
                   if (((primaryModDate's timeIntervalSinceDate:(backupModDate)) is not greater than toleratedBackupDelay) or (currentCalendar's isDate:(primaryModDate) equalToDate:(backupModDate) toUnitGranularity:(calendarUnitSecond))) then tell commonPaths to removeObjectAtIndex:(i - 1)
               end repeat
           end if
           
           return commonPaths
       end getModDateDifferences
   end script
   
   mainScript's main()
end main

Last edited by Nigel Garvey (2019-11-11 04:24:43 pm)


NG


#8 2019-11-10 04:53:22 pm

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 6034

Re: Alright speed demons, I need to text process a list with 1.7 M items

Awesome effort, Nigel. 😎 😎 😎

I noticed when testing the difference-reporting functions (with two much smaller folders!) that nominally equal modification-date NSDates, returned as URL resource values, weren't recognised as equal when compared.



Different file systems can store the values to different precision -- HFS stores times to the nearest second, for example, whereas APFS stores some crazy number of decimal places. Could that explain what you're seeing?
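To visualise the effect, here's a minimal shell/awk sketch (not part of anyone's script here; `same_second` is an invented name) of comparing two timestamps after truncating any sub-second part, which is effectively what a whole-second file system does:

```shell
# Compare two timestamps that may carry sub-second precision, as APFS stamps do.
# Truncating to whole seconds treats "1559072051.123456789" and "1559072051"
# as equal -- the effect described above. Values are made-up epoch times.
same_second() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(int(a) == int(b)) }'
}

same_second "1559072051.123456789" "1559072051" && echo "equal to the second"
same_second "1559072051.9" "1559072052.1" || echo "different seconds"
```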

Last edited by Shane Stanley (2019-11-10 04:53:51 pm)


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com


#9 2019-11-11 05:14:33 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

Shane Stanley wrote:

HFS stores times to the nearest second, for example, whereas APFS stores some crazy number of decimal places. Could that explain what you're seeing?


Hi Shane. Thanks for the "cools".

I didn't know the technicalities, but I did theorise that sub-second time differences were involved. The folders and files were all created and compared on the same Mojave machine, i.e. in APFS. The curious thing is that the "backup" folder in my tests was initialised by simply duplicating the smaller "primary" one I'd created, so the dates of the contained files should have been the same in both. My initial thought was that nanoseconds had been added to the duplicates' times for some reason, but I suppose it's more likely that they'd actually been shaved off by the Finder during duplication. I'll look into it more closely today.

Another workaround which worked was to coerce each NSDate to an AS date before adding the record containing it to the relevant array. But this turned out to be only slightly faster than getting the date components and may not work in the future if AppleScript dates ever get the same nanosecond precision or if nanoseconds occasionally get rounded up.

That said, nanosecond differences only affect the minusSet() operations in the script which quickly narrow down the "path"/"modification date" dictionaries to those needing further attention. Unless 'toleratedBackupDelay' was set to 0, they'd be unlikely to make any difference to what was actually written to the log file, just to the time taken to eliminate false differences. It's a trade-off between the extra time needed to quantise every modification date on receipt and the time saved by being able to batch-eliminate matching pairs later on.



#10 2019-11-11 06:10:56 am

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 6034

Re: Alright speed demons, I need to text process a list with 1.7 M items

If I'm following correctly, you could also stick to dates rather than date components, and then use NSCalendar's compareDate:toDate:toUnitGranularity: or isDate:equalToDate:toUnitGranularity: with NSCalendarUnitSecond.



#11 2019-11-11 08:33:35 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

Those look useful for individual date comparisons. Thanks. My original idea was to use a "minus" set method to bulk-eliminate dictionaries containing identical relative paths and modification dates. Any dictionaries left would have either a path or a modification date not found in the other hierarchy, and only these would then need to be handled individually. The possibility of nanosecond differences between "identical" dates makes this unreliable. I got round the problem above by putting the dates' relevant components in the dictionaries instead. But I'm beginning to see possible advantages in keeping the original dates and modifying the approach as you suggest. Identical dictionaries would be eliminated as before whenever their dates were genuinely identical. Otherwise additional, individual date comparisons would be needed to establish equality. But the additional work would only be one comparison per pair, and only when paths matched, as opposed to component extractions for every date obtained and the reconstitution of the dates of interest for the interval calculation. It wouldn't be worth fixing the calculation itself for cases where the result was within a few nanoseconds of the limit. Hmm….



#12 2019-11-11 04:39:32 pm

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

I've edited the script above to implement Shane's suggestion in post #10, which takes a minute or two off the time with my large test folder and produces the correct result with my two small ones. The isDate:equalToDate:toUnitGranularity: check is only done if the difference between primary and backup modification dates is more than the tolerated interval, just in case the interval is 0 and the dates are only a few nanoseconds apart. In any other situation, it's not needed. There's no check for backups being newer than their primaries!
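For anyone following along without AppleScript handy, the two-step test can be sketched in shell/awk. `flag_pair` is an invented name, and the timestamps and tolerance are made-up epoch values; a pair is flagged only if the backup lags by more than the tolerated delay AND the two stamps don't fall within the same whole second:

```shell
# Sketch of the two-step check: flag a primary/backup pair only when the
# backup's time is more than `tol` seconds behind the primary's AND the
# two stamps aren't within the same whole second (the nanosecond case).
flag_pair() {  # usage: flag_pair primary_time backup_time tolerance
  awk -v p="$1" -v b="$2" -v tol="$3" \
    'BEGIN { exit !((p - b > tol) && (int(p) != int(b))) }'
}

flag_pair 100.000000002 100.000000001 0 || echo "same second: not flagged"
flag_pair 200 150 30 && echo "50s late, 30s tolerated: flagged"
```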



#13 2019-11-12 03:08:16 am

KniazidisR
Member
Registered: 2019-03-03
Posts: 712

Re: Alright speed demons, I need to text process a list with 1.7 M items

Applescript:


set primarySet to current application's class "NSSet"'s setWithArray:primaryList
set backupSet to current application's class "NSSet"'s setWithArray:backupList

Since a variable like primarySet or backupSet occupies one place in RAM, and your data runs to 50 TB, this operation isn't feasible. I don't believe a computer with 50 TB of RAM exists.

The conclusion is that you can't avoid swapping chunks of data in and out of RAM here, so your code can't become fast. In general, DISK-->RAM-->DISK-->RAM-->DISK-->RAM-->..... operations can't be fast, as the script's speed will be almost entirely limited by the low speed of reading data from the disk. This is not a matter of code efficiency but of lagging disk technology.

Therefore, the database itself must be organised more efficiently, so that specific logical units of it can be retrieved without exceeding the RAM capacity. For example, I see you have some kind of alphabetisation (./A, ./U). That's closer to the logic of things, but this partitioning isn't enough. I see you have a database of people, and each person has several other properties besides the letter their surname begins with. For example, you needn't look for men among women, or babies among old people.

Last edited by KniazidisR (2019-11-12 03:38:32 am)


Model: MacBook Pro
macOS Mojave -- version 10.14.4
Safari -- version 12.1
Firefox -- version 70.0


#14 2019-11-12 05:01:44 am

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 6034

Re: Alright speed demons, I need to text process a list with 1.7 M items

KniazidisR wrote:

Since a variable like primarySet or backupSet occupies one place in RAM



The disk has 50TB. The array containing the file names and modification dates won't be anywhere near that. A fair chunk of memory, for sure, but quite manageable.
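A rough back-of-envelope supports this; the per-entry sizes below are guesses for illustration, not measurements:

```shell
# 1.7 M entries, each holding an average ~60-byte relative path plus a
# date and some per-dictionary overhead (assume ~100 bytes). The result
# is a few hundred megabytes -- nowhere near the 50 TB on disk.
entries=1700000
bytes_per_entry=$((60 + 100))
total_mb=$((entries * bytes_per_entry / 1024 / 1024))
echo "~${total_mb} MB"
```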



#15 2019-11-12 07:49:33 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

I wrote:

The folders and files were all created and compared on the same Mojave machine, i.e. in APFS. The curious thing is that the "backup" folder in my tests was initialised by simply duplicating the smaller "primary" one I'd created, so the dates of the contained files should have been the same in both. My initial thought was that nanoseconds had been added to the duplicates' times for some reason, but I suppose it's more likely that they'd actually been shaved off by the Finder during duplication. I'll look into it more closely today.


I've been able to confirm this morning that all the files in my "backup" folder have zero nanoseconds in their modification dates, whereas those in the small "primary" folder from which it was derived have nanoseconds > zero. But I haven't been able to reproduce the effect when deriving new "backup" folders today. The modification dates of the files in today's folders exactly match those in the original. It remains a mystery.

The modification dates of the folders in the two hierarchies don't match at all, which is easily explained by my having dragged things into and out of them during testing. All the folder dates have nanosecond components > zero.



#16 2019-11-12 07:56:22 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

Shane Stanley wrote:

The disk has 50TB. The array containing the file names and modification dates won't be anywhere near that. A fair chunk of memory, for sure, but quite manageable.


But the main difference between the script above and three previous attempts which beachballed both script editors is that it breaks the array data down into more easily digested chunks:

• An array each of URLs for the folders in each hierarchy (down to the level above that specified for the files). After these are compared, they're reduced to just the "name" folders common to both collections. They remain in memory while the following come and go.
• An array of URLs for the contents of one "name" folder. This provides data for a (possibly) shorter array containing the data of interest for the folder's files and is then discarded.
• Two data-of-interest arrays resulting from the preceding point. If these prove to be identical, they're simply discarded. Otherwise various ordered sets are derived from them for analytical purposes before both the arrays and the sets are discarded. It would conceivably save memory to build the ordered sets directly rather than building arrays and then deriving the sets from them, but the latter process appears to be slightly faster.

This approach works with the large test folder I set up on my machine, but it's conceivable that the number of "name" folders in t.spoon's case, or the number of files in any particular one of them, could exceed the "beachball" limit. I've no idea what this limit is.



#17 2019-11-12 03:16:53 pm

t.spoon
Member
From:: BFE, Massachusetts
Registered: 2013-01-13
Posts: 448

Re: Alright speed demons, I need to text process a list with 1.7 M items

The text file of the complete directory listing is 84.5 MB, so yes, nowhere near 50TB, and much less than the 16 GB of RAM in this machine.

I got stuck on getting the directory listing I'm comparing it to out of S3. I gave up on S3FS-c and realized that Transmit can mount an S3 bucket as a drive. But it didn't work.

I don't understand why it doesn't work to mount S3 as a mounted volume and use a command line "find" to extract just the directories I need, like I do on the server. The command doesn't time out or appear to fail, it appears to complete in the shell, and I get my file with the output. The output is correctly formatted, but it's missing the vast majority of the data that should be in it, much of which I've confirmed is there in the file system. So I'm at an impasse on this project.

My only option seems to be to list all 20+ million files/directories using S3 Command Line Tools, rather than just the ones I actually need to compare... and if I thought parsing 1.7 million was bad, I don't look forward to that.

I'd like to investigate the fastest tools for processing huge text datasets. I've wondered how it might perform if I simply import all this data to SQL and mess with it there. I'm a beginner at SQL though.
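For what it's worth, the classic low-memory shell route for diffing two big listings is sort(1) plus comm(1), which stream rather than hold everything in RAM. The file names below are placeholders, and the sample lines use the `^`-separated format from the first post:

```shell
# Strip the "^date" suffix, sort, and let comm(1) stream the difference.
# primary.txt / backup.txt stand in for the real listing files.
printf '%s\n' './A/Andy Hu/848254^19-10-09 17:13:46' \
              './A/Anna Drexler/806971^19-08-07 10:04:04' > primary.txt
printf '%s\n' './A/Andy Hu/848254^19-10-09 17:13:46' > backup.txt

cut -d'^' -f1 primary.txt | sort > primary.paths
cut -d'^' -f1 backup.txt  | sort > backup.paths
comm -23 primary.paths backup.paths   # paths present only in the primary
```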

Thanks for all the help, and I suspect the techniques posted here will come back to be helpful on future projects,

tspoon.

Last edited by t.spoon (2019-11-12 03:24:43 pm)


Hackintosh built February, 2012 |  Mac OS Sierra
GIGABYTE GA-Z68X-UD3H-B3 | Core i5 2500k | 16 GB DDR3 | GIGABYTE Geforce 1050 TI 4GB
250 GB Samsung 850 EVO | 4 TB RAID
Dell Ultrasharp U3011 | Dell Ultrasharp 2007FPb


#18 2019-11-13 04:52:20 am

Nigel Garvey
Moderator
From:: Warwickshire, England
Registered: 2002-11-20
Posts: 5105

Re: Alright speed demons, I need to text process a list with 1.7 M items

t.spoon wrote:

I don't understand why it doesn't work to mount S3 as a mounted volume and use a command line "find" to extract just the directories I need, like I do on the server.



Is the "depth" the same in both cases?

Interestingly, when I apply "find" to my test folder, it takes between 26 and 27 seconds regardless of whether I set "-depth" to 3 (the 1,724,231 files), 2 (the 7172 "name" folders), or 1 (the 26 "letter" folders), from which I deduce that it scans the entire contents anyway but only returns the items at the specified level. Very much faster for the first two levels is to use "-maxdepth" as well, which presumably limits the depth of the scan. If your disk hierarchies have deeper levels than those which interest you, it may be an idea to do this. Or you could do it anyway, since the time taken for the deepest level is the same in all cases: "-depth 3", "-maxdepth 3 -depth 3", or "-maxdepth 3 -mindepth 3". Perhaps the "-maxdepth 3 -mindepth 3" pairing is very slightly the fastest, but it's difficult to be sure.



#19 2019-11-13 08:07:50 am

t.spoon
Member
From:: BFE, Massachusetts
Registered: 2013-01-13
Posts: 448

Re: Alright speed demons, I need to text process a list with 1.7 M items

Thanks Nigel, that's very helpful. I had figured out that the "depth" argument wasn't trimming any time, that it had to still be scanning everything. But I hadn't noticed the "maxdepth" and "mindepth" arguments.

I wonder what the point is of including a "depth" argument that doesn't stop the scan from going down the tree further than it can possibly return results?

I'll try again with maxdepth and mindepth arguments and see if that helps,

- tspoon


