Tuesday, February 25, 2020

#1 2020-01-21 05:14:09 pm

Fredrik71
Member
Registered: 2019-10-23
Posts: 84

What is the best approach, extract text from webarchive...

Hi.

I'm a big fan of Safari's webarchive format, and I often find it works better than converting to PDF.
I use webarchive as a way to read documents... but also to archive.

I know that QuickLook, Safari and (in a limited way) TextEdit can read this binary plist file.
I also know that Spotlight, mdfind and maybe other tools can search inside this format.

I don't want to have to open the webarchive in Safari just to extract or copy text...

So I was thinking about using Apple's textutil command to convert (or cat) it to txt format and then do string matching.

I also found out that doc, docx and wordml give very good output in TextEdit. That is
very interesting if I need to edit and later... print. These formats are closer to the rtf format.

So my question to all...

If I choose to do it with textutil, everything is done in the background, and that is great.

Here is a fast AppleScript...

Applescript:

set thePath to POSIX path of (path to desktop as alias) & "myArchive.webarchive"
set out to do shell script "textutil -cat txt " & quoted form of thePath & space & "-stdout " & "|" & "pbcopy"
set clip to the clipboard

The result in Script Editor is not the same as when I run the command directly on the command line...

I do understand I have to clean up the code somehow... hmmm

What would be the best approach to search in a webarchive, find matches and extract text from it?

Thanks.


Filed under: webarchive

Offline

 

#2 2020-01-21 07:49:28 pm

Marc Anthony
Member
From:: Dallas, TX
Registered: 2006-04-27
Posts: 919

Re: What is the best approach, extract text from webarchive...

Hi. In your code example, ‘cat’ appears to have been used, where ‘convert’ is indicated. If you want to search text that textutil converted, that can be done with grep. I’m not certain if that constitutes the “best” approach, as you haven’t exemplified your goal.

Applescript:

set searchTerm to (display dialog "Input the search string:" default answer "whatever")'s text returned
do shell script "textutil -convert txt " & (choose file)'s POSIX path's quoted form & space & "-stdout | fgrep -w " & searchTerm's quoted form

Offline

 

#3 2020-01-21 08:58:15 pm

Fredrik71
Member
Registered: 2019-10-23
Posts: 84

Re: What is the best approach, extract text from webarchive...

Thanks, yes, grep is a good choice...

Marc, textutil has a cat function, and if there are multiple files it will concatenate them;
that could be a good idea before doing any search. Also, I use -stdout, which means I do not
have to convert to a file before doing any matching.

I realize textutil is a very powerful tool...

Thanks, Marc... :)

The thing that cost me hours was the text output from textutil: it looked fine in Terminal,
but it was not correct when I used pbcopy to pipe the text back to Script Editor.

The thing was... the Terminal shell uses UTF-8 by default, but the pbcopy command does not.

If I include the line "export __CF_USER_TEXT_ENCODING=0x1F5:0x8000100:0x8000100"
before the textutil command, the text is the same as in Terminal.
So what does that line do? It makes pbcopy use UTF-8 encoding.

This may help anyone who has the same problem when using pbcopy, or who uses pbcopy to pipe output from a bash script to AppleScript:

update your .profile to include export __CF_USER_TEXT_ENCODING=0x1F5:0x8000100:0x8000100

Here is an update of my script; now everything looks okay, in Script Editor at least.

Applescript:

set thePath to POSIX path of (path to desktop as alias) & "myArchive.webarchive"
set o to do shell script "export __CF_USER_TEXT_ENCODING=0x1F5:0x8000100:0x8000100" & ";" & space & "textutil -cat txt " & quoted form of (thePath) & space & "-stdout " & space & "|" & "pbcopy"

set clip to the clipboard

set cleanedText to findAndReplaceInText(clip, "\"", "") -- keep the result; the handler returns a new string

on findAndReplaceInText(theText, theSearchString, theReplacementString)
   set AppleScript's text item delimiters to theSearchString
   set theTextItems to every text item of theText
   set AppleScript's text item delimiters to theReplacementString
   set theText to theTextItems as string
   set AppleScript's text item delimiters to ""
   return theText
end findAndReplaceInText
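
As a side note on the clipboard step: as far as I know, do shell script already hands back whatever the command writes to stdout, and it reads that output as UTF-8, so the pbcopy round trip may not be needed at all. A minimal sketch, assuming the same myArchive.webarchive file on the desktop (theText is just a name chosen for illustration):

```applescript
-- Assumes a file named myArchive.webarchive exists on the desktop.
set thePath to POSIX path of (path to desktop) & "myArchive.webarchive"

-- do shell script returns the command's standard output as text,
-- so the converted text arrives directly without touching the clipboard.
-- "without altering line endings" keeps the linefeeds exactly as textutil wrote them.
set theText to do shell script "textutil -convert txt -stdout " & quoted form of thePath without altering line endings
```

With this form there is no pbcopy involved, so the encoding export should not be needed either.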

Offline

 

#4 2020-01-22 06:19:09 am

Fredrik71
Member
Registered: 2019-10-23
Posts: 84

Re: What is the best approach, extract text from webarchive...

Marc...

If you type...
textutil -cat txt "file1.webarchive" "file2.webarchive" -stdout

it prints a plain-text version of the webarchives to standard output (stdout):
the 2 files are concatenated into 1 and printed to stdout.

If you type...
textutil -cat txt "file1.webarchive" "file2.webarchive"

the output no longer goes to stdout; this time it is written to a file named 'out.txt'.

Let's see how cat works for another format:
textutil -cat docx "file1.webarchive" "file2.webarchive"

This does the same as the previous command but uses the extension .docx instead
--> by default, the extension will be determined from the format

If the length of file1 is 5818 characters and that of file2 is 9404, for a total of 15222,
then textutil -info on the concatenated file out.txt reports the same 15222.

So why is -cat useful?

It's useful because it lets us append files together before the pipe or conversion, so
we have the right data format to work with.

So it's also of interest to know which format is better to use.

Let's say...

we make an AppleScript list of 10 files that we want to do text manipulation, matching or searching on.
We could combine all these files into 1 before we start running a search algorithm. If we prefer to
save the result as a file and pipe it to a function later, we could use the -output option. If we would rather
do it in memory and save only the search result, we do that later in the process.

To get the result back to Script Editor I use pbcopy and the clipboard, so that it can be the input
to a search algorithm (function).

If we want to change a specific style in the document, that is also possible... or strip the metadata...

Or use the wordml format (Word 2003 XML), for which textutil produces a strictly XML file.
I guess it would be possible to search for tags of interest.

I hope this makes sense; let me know if it doesn't...

This is what I have found out so far, so it's new to me.
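
The list-of-files idea could look something like the rough sketch below. The file names and the search term are only placeholders, and grep -F is just one possible way to do the matching; the result comes back from do shell script rather than through the clipboard.

```applescript
-- Hypothetical file names and search term; replace with real ones.
set desktopPath to POSIX path of (path to desktop)
set fileNames to {"file1.webarchive", "file2.webarchive"}
set searchTerm to "whatever"

-- Build one string of shell-quoted paths from the AppleScript list.
set quotedPaths to ""
repeat with aName in fileNames
	set quotedPaths to quotedPaths & space & quoted form of (desktopPath & (contents of aName))
end repeat

-- Concatenate all files to plain text on stdout and keep only the matching lines.
-- "|| true" stops grep's exit status 1 (no match) from raising an AppleScript error.
set matches to do shell script "textutil -cat txt" & quotedPaths & " -stdout | grep -F -- " & quoted form of searchTerm & " || true"
```

If nothing matches, matches is simply an empty string.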

Offline

 

#5 2020-01-23 04:44:38 am

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 6194

Re: What is the best approach, extract text from webarchive...

Fredrik71 wrote:

Here is a fast AppleScript...



And here is a faster one, which also doesn't require disturbing the clipboard:

Applescript:

use framework "Foundation"
use framework "AppKit"
use scripting additions

set thePath to POSIX path of (path to desktop) & "Test.webarchive"
set theURL to current application's NSURL's fileURLWithPath:thePath
set theDict to current application's NSDictionary's dictionary()
set theString to (current application's NSAttributedString's alloc()'s initWithURL:theURL options:theDict documentAttributes:(missing value) |error|:(missing value))'s |string|() as text

Last edited by Shane Stanley (2020-01-23 07:47:02 pm)


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com

Offline

 

#6 2020-01-23 06:35:35 am

Fredrik71
Member
Registered: 2019-10-23
Posts: 84

Re: What is the best approach, extract text from webarchive...

Thanks Shane, will try your code.

The latest test I have done uses this approach...

Set up the command-line command to redirect stdout to /tmp/file...
Open /tmp/file for further commands
Before closing or ending...
I do rm -f /tmp/file...

I have found out that 'Preview' only opens files with the right extension,
whereas 'TextEdit' does not care.

So that means
the /tmp/file that stdout is redirected to also needs to have the correct format extension.

e.g.

textutil -cat docx my_file -stdout | open -f -a "Preview"
-- does not work...

I think there is a bug in open's -f flag... or it is very limited, because if we remove the open command
from the above line, we get the binary docx data on standard output.
But if we direct the output to /tmp/file... and read it from there, we can use open -a "Preview"

So that's why I did my own redirect to /tmp/file...

Applescript:

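-- something like this (the_file is assumed to hold the POSIX path of the source document)
set extension to ".docx"
set tmpFile to "/tmp/myTempFile" & extension
-- "&&" makes sure textutil has finished writing the file before Preview opens it
do shell script "textutil -cat docx " & quoted form of the_file ¬
   & space & "-stdout > " & quoted form of tmpFile & " && " & "open -a \"Preview\"" & space & quoted form of tmpFile & ¬
   "; " & "sleep 1" & "; " & "rm -f " & quoted form of tmpFile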

Last edited by Fredrik71 (2020-01-23 07:25:05 am)

Offline

 

#7 2020-01-23 07:22:20 am

Marc Anthony
Member
From:: Dallas, TX
Registered: 2006-04-27
Posts: 919

Re: What is the best approach, extract text from webarchive...

Hi, Shane. I get the following message with your code:

error "missing value doesn’t understand the “string” message." number -1708 from missing value

Offline

 

#8 2020-01-23 04:52:16 pm

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 6194

Re: What is the best approach, extract text from webarchive...

And you do have a file called Test.webarchive on your desktop?

Try this version and tell me what the error is:

Applescript:

use framework "Foundation"
use framework "AppKit"
use scripting additions

set thePath to POSIX path of (path to desktop) & "Test.webarchive"
set theURL to current application's NSURL's fileURLWithPath:thePath
set theDict to current application's NSDictionary's dictionary()
set {styledText, theError} to current application's NSAttributedString's alloc()'s initWithURL:theURL options:theDict documentAttributes:(missing value) |error|:(reference)
if styledText is missing value then error theError's localizedDescription() as text
set theString to styledText's |string|() as text
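
Once the text is in theString, searching it needs nothing beyond plain AppleScript. A minimal sketch, in which a literal string stands in for the extracted text and searchTerm and matchingLines are names made up for the example:

```applescript
-- A literal stands in for the text extracted from the webarchive.
set theString to "first line" & linefeed & "a line about webarchive" & linefeed & "last line"
set searchTerm to "webarchive"

-- Collect every paragraph (line) that contains the search term.
set matchingLines to {}
repeat with aLine in paragraphs of theString
	if (contents of aLine) contains searchTerm then set end of matchingLines to contents of aLine
end repeat
matchingLines
```

AppleScript's contains ignores case by default; wrap the loop in a considering case block if the match should be case-sensitive.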


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com

Offline

 

#9 2020-01-23 05:26:02 pm

Marc Anthony
Member
From:: Dallas, TX
Registered: 2006-04-27
Posts: 919

Re: What is the best approach, extract text from webarchive...

That works without error on my Sierra system. My test path originally had a forward slash in it, which I've since removed. When retesting the first code, the result is now

Applescript:

«class ocid» id «data optr000000000050DF80AA7F0000»

.

Last edited by Marc Anthony (2020-01-23 05:31:23 pm)

Offline

 

#10 2020-01-23 07:46:17 pm

Shane Stanley
Member
From:: Australia
Registered: 2002-12-07
Posts: 6194

Re: What is the best approach, extract text from webarchive...

Marc Anthony wrote:

When retesting the first code, the result is now

Applescript:

«class ocid» id «data optr000000000050DF80AA7F0000»

.



Yes, it needs an "as text" in there. I'll edit it.


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/
latenightsw.com

Offline

 

#11 2020-01-25 10:45:20 am

Fredrik71
Member
Registered: 2019-10-23
Posts: 84

Re: What is the best approach, extract text from webarchive...

Wow... that was fast.

So useful... thank you so much Shane.

I thought, since you have so much knowledge of AppleScript and AppleScriptObjC,
I would do a Google search for 'Shane Stanley AppleScriptObjc'.

I only get 45 hits... that is very strange. ;)

Thanks again.

Offline

 
