#1 2021-04-22 01:07:59 am

lotr
Member
Registered: 2011-08-22
Posts: 94
Website

Copy contents of specific web page

I originally asked a question here 2 years ago, but that topic covered only one facet of the overall objective. Not wishing to go off the original topic, I am looking at expanding on an answer given by Shane Stanley here https://macscripter.net/viewtopic.php?p … 69#p199269, but I am not sure his approach would work with the specific website desired, even if adjusted.

Objective: (1) Copy the contents of the page http://wireshare.sourceforge.net/bootstrap/. (2) Replace the tabs/spaces with LF (Linux/macOS) line breaks. (3) Output to a new UTF-8 text document.

I’d prefer to do this without using JavaScript. Of the websites from the previous query 2 years ago, one has since gone offline and another was incorrectly updated and is now unusable. The only comparable website that works with either Shane's or KniazidisR's script has outdated data (up to 9 months old), whereas the webpage listed above is constantly updated and reliable.

Note: I am on OS 10.11 with Script Editor 2.8.1 and AppleScript 2.5. As much backward compatibility as possible is preferred. A (bundled) script will be offered free to the public to assist with a P2P protocol.

Model: mp3,1
AppleScript: 2.8.1
Browser: Firefox 78.0
Operating System: macOS 10.11


The shy can light the dull dark room rich with life when their eyes mirror their inner sunrise

#2 2021-04-22 04:25:47 am

KniazidisR
Member
From: Greece
Registered: 2019-03-03
Posts: 1788

Re: Copy contents of specific web page

Using plain AppleScript:

Applescript:


-- open webpage
tell application "Safari" to open location "http://wireshare.sourceforge.net/bootstrap/"

-- wait until the webpage is loaded fully
tell application "System Events" to tell application process "Safari"
   set frontmost to true
   repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists)
       delay 0.1
   end repeat
end tell

-- get text of webpage
tell application "Safari" to set theText to text of document 1

-- replace tabs/spaces with linefeed
set text item delimiters of AppleScript to {space, tab}
set theText to text items of theText
set text item delimiters of AppleScript to linefeed
set theText to theText as text
set text item delimiters of AppleScript to ""

-- choose text file name
tell application "Safari" to set the_file to choose file name default location path to desktop folder

-- write to UTF-8 encoded text file
set file_ID to open for access the_file with write permission
set eof file_ID to 0
write theText to file_ID as «class utf8»
close access file_ID

The same using ASObjC:

Applescript:


use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- get the text of webpage
set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/bootstrap/"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)

-- replace tabs and spaces with linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:space withString:linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:tab withString:linefeed

-- write to UTF-8 encoded text file
set the_file to choose file name default location path to desktop folder
(WebPageText's writeToFile:(POSIX path of the_file) atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value))
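
A possible refinement, offered only as a sketch and not part of the script above: if the page ever contains runs of spaces or tabs, a single regex-based replacement collapses each run into one linefeed instead of leaving blank lines.

Applescript:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/bootstrap/"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)

-- collapse each run of spaces/tabs into a single linefeed with one regex replacement
set WebPageText to (WebPageText's stringByReplacingOccurrencesOfString:"[ \\t]+" withString:linefeed options:(current application's NSRegularExpressionSearch) range:{0, WebPageText's |length|()})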

Last edited by KniazidisR (2021-04-22 05:16:25 am)


Model: MacBook Pro
OS X: Catalina 10.15.4
Web Browser: Safari 14.1
Ram: 4 GB

#3 2021-04-22 05:29:26 am

lotr
Member
Registered: 2011-08-22
Posts: 94
Website

Re: Copy contents of specific web page

Thanks, KniazidisR, for such a fast response. Whilst I'd have preferred not to use a web browser, I'm not going to be fussy.

The first script does not work for me, possibly because I already have multiple tabs open. The new tab loads but the log shows repeatedly:
exists UI element "Reload this page" of group 2 of toolbar 1 of window 1 of application process "Safari"
exists UI element "Reload this page" of group 2 of toolbar 1 of window 1 of application process "Safari"
exists UI element "Reload this page" of group 2 of toolbar 1 of window 1 of application process "Safari"
...

and the script does not finish running.
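
As an aside, a load check that does not depend on the toolbar layout or the number of open tabs might look like the sketch below (untested here; it simply polls until the front document returns some text, assuming document 1 refers to the newly opened tab, as in script #1):

Applescript:

tell application "Safari" to open location "http://wireshare.sourceforge.net/bootstrap/"

-- poll the document text instead of waiting for the "Reload this page" UI element
-- (no timeout is included; add one for production use)
set theText to ""
repeat until theText is not ""
    delay 0.5
    try
        tell application "Safari" to set theText to text of document 1
    end try
end repeat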

The second script works fine. I simply adjusted the final line to output to a specific name. This is exactly what I wanted. Thanks again, you've put together a great script! smile


The shy can light the dull dark room rich with life when their eyes mirror their inner sunrise

#4 2021-04-22 11:55:44 pm

wch1zpink
Member
Registered: 2011-08-20
Posts: 16

Re: Copy contents of specific web page

Here is a different solution, without the necessity of launching a web browser.

Applescript:

property saveToFile : (path to desktop as text) & "webpage_text.txt"
property theURL : "http://wireshare.sourceforge.net/bootstrap/"

do shell script "curl " & quoted form of theURL & ¬
   " > " & quoted form of POSIX path of saveToFile

#5 2021-04-23 12:18:38 am

lotr
Member
Registered: 2011-08-22
Posts: 94
Website

Re: Copy contents of specific web page

wch1zpink wrote:

Here is a different solution without the necessity of launching a web browser



Thanks! An excellent second approach, similar to Shane's script in that no browser is required. The opening of a browser or tab might confuse some people, young or old, as to what's happening, so that's handled well here.

As far as I knew, curl had major limitations, but you've shown an ingenious method. smile


The shy can light the dull dark room rich with life when their eyes mirror their inner sunrise

#6 2021-04-23 06:42:37 am

Fredrik71
Member
Registered: 2019-10-23
Posts: 703

Re: Copy contents of specific web page

The source of the URL is on one line, so here is an example of handling it without a browser.
If you need linefeeds, you could change this:
set AppleScript's text item delimiters to " "
to this:
set AppleScript's text item delimiters to " " & linefeed


Applescript:

set textInput to every paragraph of (do shell script "curl http://wireshare.sourceforge.net/bootstrap/")

set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to " "
set theItem to textInput as text
set AppleScript's text item delimiters to ASTID
return theItem
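
If the joined text then needs to go to a UTF-8 file, as in the original objective, the write-to-file pattern from post #2 could be appended to the script above (a sketch; theItem is the text produced there):

Applescript:

-- run as a continuation of the script above, in place of "return theItem"
set the_file to choose file name default location (path to desktop folder)
set file_ID to open for access the_file with write permission
set eof file_ID to 0
write theItem to file_ID as «class utf8»
close access file_ID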

Last edited by Fredrik71 (2021-04-23 06:43:24 am)


if you are the expert, who will you call if its not your imagination.

#7 2021-04-23 11:28:05 pm

KniazidisR
Member
From: Greece
Registered: 2019-03-03
Posts: 1788

Re: Copy contents of specific web page

When you provide people with a solution, it should work reliably, keep working for a long time, and not work with only one single webpage. The OP specifically requested replacing tabs and spaces with linefeeds, as well as writing the result to a UTF-8 text file.

Your solutions, wch1zpink and Fredrik71, are, in contrast to a general solution, page-specific and incomplete, since they rely on the particular content of this one webpage. For this page they work, but with many other pages they will simply fail.

Last edited by KniazidisR (2021-04-23 11:50:52 pm)


Model: MacBook Pro
OS X: Catalina 10.15.4
Web Browser: Safari 14.1
Ram: 4 GB

#8 2021-04-24 01:47:56 am

lotr
Member
Registered: 2011-08-22
Posts: 94
Website

Re: Copy contents of specific web page

Fredrik71 wrote:

... If you need linefeed you could change this:
set AppleScript's text item delimiters to " " ...


Thanks, Fredrik71. For some reason I was getting either an empty first line or, if not, the first number of the first listing would be missing. It may well be due to my methods of writing to file.

I'm personally tending toward wch1zpink's answer, as it is simple yet effective. It also works at least as far back as OS 10.8.

KniazidisR wrote:

... it works, but with many other pages it will simply fail.

True. But the above approaches work for this specific topic and target.
In the previous query a couple of years ago, I found the approaches difficult to adapt to different variations of these types of website. Thanks, KniazidisR, for your support.

And thanks to everyone for your excellent efforts.

Last edited by lotr (2021-04-24 08:19:03 am)


The shy can light the dull dark room rich with life when their eyes mirror their inner sunrise

#9 2021-04-24 10:21:23 pm

lotr
Member
Registered: 2011-08-22
Posts: 94
Website

Re: Copy contents of specific web page

Second URL

I've changed my mind: as long as this is not stretching things too far, could a solution be found, if possible, for the following website?
http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella
This is one of those I struggled with.

The site lists 100 lines, but I'd prefer only the first 50 (as the rest are a day or two old). As above, only the address listing is desired, stripping all other details in similar fashion to the query of 2 years ago.

Reason for the second URL: 1. I wish to add this to the results of the other website already discussed. 2. Either one of the websites might be down, such as for maintenance purposes (usually no more than a few hours, once or twice a year).

Objective: 1. Copy and process the details of the webpage. 2. (If both are online) add the results of each webpage together. 3. If both websites are online and have been processed, remove duplicates. 4. Output to file.

The increase in output might only be slight, but it offers insurance against one site being temporarily offline.

I totally understand if this second request for assistance within the same thread is stretching things too far, although the topic is basically the same. smile

Model: mp3,1
AppleScript: 2.8.1
Browser: Firefox 78.9
Operating System: macOS 10.11


The shy can light the dull dark room rich with life when their eyes mirror their inner sunrise

#10 2021-04-25 12:23:40 am

Fredrik71
Member
Registered: 2019-10-23
Posts: 703

Re: Copy contents of specific web page

Your second website was not as easy as your first (I had issues with curl), and maybe this is not the solution you are looking for, but the script below gives you the data from the second website: http://wireshare.sourceforge.net/gwc/gw … y=gnutella

To test it out, you need to open the URL in Safari and then run the script below.

When you have 2 files from the 2 sites:
You could use the command line: cat file1 file2 > file1AndFile2.txt
You could then use sort and uniq to remove duplicates (uniq only drops adjacent duplicate lines, so sort first); see the sketch after the script below.

To get the second file, add set the clipboard to theText at the end of the script, then on the command line use: pbpaste > file2

Applescript:

tell application "Safari" to tell window 1
   tell current tab
       set theSource to its text
   end tell
end tell

set theLines to paragraphs 6 thru 55 of theSource

set theList to {}
repeat with i from 1 to count theLines
   set theItem to words 1 thru 2 of (item i of theLines)
   set theResult to (item 1 of theItem) & ":" & (item 2 of theItem)
   set the end of theList to theResult
end repeat

set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to linefeed
set theText to theList as text
set AppleScript's text item delimiters to ASTID
return theText
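
A sketch of that merge-and-dedupe step, wrapped in AppleScript. The file names are placeholders, and sort -u is used so the files do not need to be sorted beforehand; note that it also reorders the lines.

Applescript:

-- hypothetical file names; adjust to wherever the two result files were saved
set file1 to POSIX path of (path to desktop) & "bootstrap.txt"
set file2 to POSIX path of (path to desktop) & "gwc.txt"
set mergedFile to POSIX path of (path to desktop) & "merged.txt"

-- concatenate both files and drop duplicate lines in one pass
do shell script "cat " & quoted form of file1 & " " & quoted form of file2 & ¬
    " | sort -u > " & quoted form of mergedFile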

Last edited by Fredrik71 (2021-04-25 01:01:09 am)


if you are the expert, who will you call if its not your imagination.

#11 2021-04-25 12:29:53 am

KniazidisR
Member
From: Greece
Registered: 2019-03-03
Posts: 1788

Re: Copy contents of specific web page

The second page is XHTML. I will think about a non-browser solution. For now, a full solution using Safari is this:

Applescript:


-- open first webpage, wait for full loading
tell application "Safari" to open location "http://wireshare.sourceforge.net/bootstrap/"
my waitFullLoading()

tell application "Safari" to set goodLinks to text of document 1 -- get text of webpage

-- replace tabs/spaces with linefeed, get links list
set text item delimiters of AppleScript to {space, tab}
set goodLinks to text items of goodLinks
set text item delimiters of AppleScript to linefeed
set goodLinks to goodLinks as text
set text item delimiters of AppleScript to ""

-- open second webpage, wait for full loading
tell application "Safari" to open location "http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella"
my waitFullLoading()

-- execute JavaScript in the XHTML of the webpage to retrieve the links
tell application "Safari" to set theLinks to do JavaScript (my jScript()) in document 1

-- filter the useful stuff
repeat with nextLink in theLinks
   if nextLink contains "gnutella:host:" then
       set dottedLink to text 15 thru -1 of nextLink
       if not (dottedLink is in goodLinks) then set goodLinks to goodLinks & dottedLink & linefeed
   end if
end repeat

my writeToUTF8TextFile(goodLinks) -- write to UTF-8 encoded text file


--================================= HANDLERS ==========================================

on waitFullLoading() -- wait until the webpage is loaded fully
   tell application "System Events" to tell application process "Safari"
       set frontmost to true
       repeat until (UI element "Reload this page" of group 2 of toolbar 1 of window 1 exists) or (UI element "Reload this page" of group 3 of toolbar 1 of window 1 exists)
           delay 0.1
       end repeat
   end tell
end waitFullLoading

on jScript() -- this JavaScript will be executed in Safari to get the links from the XHTML
   "function documentLinks() {
   //
   var arr = [], links = document.links;
   for(var i = 0; i < links.length; i++) {
    arr.push(links[i].href);
   }
   return arr
   }
   //
   documentLinks()
   "

end jScript

on writeToUTF8TextFile(goodLinks)
   -- choose text file name
   tell application "Safari" to set the_file to choose file name default location path to desktop folder
   -- write to UTF-8 encoded text file
   set file_ID to open for access the_file with write permission
   set eof file_ID to 0
   write (text 1 thru -2 of goodLinks) to file_ID as «class utf8»
   close access file_ID
end writeToUTF8TextFile

Last edited by KniazidisR (2021-04-25 01:25:48 am)


Model: MacBook Pro
OS X: Catalina 10.15.4
Web Browser: Safari 14.1
Ram: 4 GB

#12 2021-04-25 03:14:49 am

KniazidisR
Member
From: Greece
Registered: 2019-03-03
Posts: 1788

Re: Copy contents of specific web page

Non-browser solution:

Applescript:


use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- get the text of first webpage
set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/bootstrap/"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)
-- replace tabs and spaces with linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:space withString:linefeed
set WebPageText to WebPageText's stringByReplacingOccurrencesOfString:tab withString:linefeed

set theText to WebPageText as text -- text generated from first webpage

-- get the text of second webpage
set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/gwc/gwc.php?display=gnutella"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)

-- get custom paragraphs
set AppleScript's text item delimiters to {"href=", "\" class=\"table_address\">"}
set theParagraphs to text items of (WebPageText as text)
set AppleScript's text item delimiters to ""

-- add non-duplicate dotted links to theText
repeat with anItem in theParagraphs
   if anItem contains "\"gnutella:host:" then
       set dottedLink to text 16 thru -1 of anItem
       if not (theText contains dottedLink) then set theText to theText & dottedLink & linefeed
   end if
end repeat

-- choose text file name
set the_file to choose file name default location path to desktop folder
-- write to UTF-8 encoded text file
set file_ID to open for access the_file with write permission
set eof file_ID to 0
write theText to file_ID as «class utf8»
close access file_ID
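
Since either site may occasionally be down for maintenance (as noted in post #9), each fetch could be guarded: stringWithContentsOfURL:usedEncoding:error: returns missing value on failure. A minimal sketch of that check, shown only for the first page:

Applescript:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set anNSURL to current application's class "NSURL"'s URLWithString:"http://wireshare.sourceforge.net/bootstrap/"
set WebPageText to current application's class "NSString"'s stringWithContentsOfURL:(anNSURL) usedEncoding:(missing value) |error|:(missing value)

if WebPageText is missing value then
    -- the site is unreachable; fall back to an empty result so the other source can still be used
    set theText to ""
else
    set theText to WebPageText as text
end if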


Model: MacBook Pro
OS X: Catalina 10.15.4
Web Browser: Safari 14.1
Ram: 4 GB
