Can someone help me with this, please? I am trying to get image path details from the following HTML extract. The website in question requires a login, so I can’t include a link to it here, but I hope that the pasted HTML will be sufficient.
There are five .PNG file names to be found: 71377, 71378, 71379, 71380 and 71381.
I have tried using
set theName to do JavaScript "document.getElementsByClassName('" & theItem & "')[" & itemCount & "].innerText"
for the various classes I can see in the HTML but I can’t find any Class that gives me the required file names. Is there another method I could try please?
NB: It may be important that the required text is displayed as a hyperlink in the HTML code (from /get… to .PNG).
Here is a simple script to parse the HTML example you gave.
My script assumes the HTML text is on the clipboard from a copy operation.
It returns a list of the image paths as text.
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions
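on run
    local html, htmlList, tid
    set tid to text item delimiters -- remember the current delimiters
    set html to the clipboard
    -- split on the text that precedes each image path
    set text item delimiters to "<img alt=\"picture\" class=\"block h-full w-full\" src=\""
    set htmlList to text items of html
    set htmlList to rest of htmlList -- drop the text before the first image
    -- each remaining item starts with a path; keep only the text before the closing quote and bracket
    set text item delimiters to "\">"
    repeat with anItem in htmlList
        set contents of anItem to first text item of contents of anItem
    end repeat
    set text item delimiters to tid -- restore the original delimiters
    return htmlList
end run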
I tested Robert’s suggestion and it works great. It’s fast and does exactly what the OP wants.
I’ve been working to learn RegEx and capture groups, and this was a good opportunity to test my skills. As with Robert’s script, my suggestion assumes the HTML is on the clipboard and returns a list of the image paths.
use framework "Foundation"
use scripting additions
set theHTML to the clipboard
set thePaths to getPaths(theHTML)
on getPaths(theHTML)
    set theHTML to current application's NSString's stringWithString:theHTML
    -- capture everything between src=" and the next double quote
    -- (a greedy .* could run on to the last quote if the HTML is all on one line)
    set thePattern to "src=\"([^\"]*)\""
    -- set thePattern to "(?i)src=\"([^\"]*\\.png)\"" -- only paths ending in .png, ignoring case
    set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
    set regexResults to theRegex's matchesInString:theHTML options:0 range:{location:0, |length|:theHTML's |length|()}
    set theMatches to current application's NSMutableArray's new()
    repeat with aMatch in regexResults
        -- range 1 is the first capture group, i.e. the path itself
        set theRange to (aMatch's rangeAtIndex:1)
        (theMatches's addObject:(theHTML's substringWithRange:theRange))
    end repeat
    return theMatches as list
end getPaths
Thanks all for your replies. As usual, I wasn’t clear what my problem actually was!
I couldn’t see any reference to the actual file name in the output from my code:
set theName to do JavaScript "document.getElementsByClassName('" & theItem & "')[" & itemCount & "].innerText"
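But changing ‘.innerText’ to ‘.innerHTML’ has revealed the missing file names. innerText returns only the rendered text, whereas the file names sit in the src attributes of the img tags, which only the markup returned by innerHTML includes.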
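For anyone who would rather not pick the names out of the innerHTML text by hand, here is a minimal JavaScript sketch of one way it might be done. The function name imageFileNames and the sample string are made up for illustration, and it assumes the src values are double-quoted, as they are in the HTML posted in this thread.

```javascript
// Hypothetical helper (not from the original posts): collect the src of
// every <img> tag in an HTML string and keep only the file name part.
function imageFileNames(html) {
  // Assumes double-quoted src attributes, e.g. src="/get/.../71377.PNG"
  const pattern = /<img\b[^>]*?\bsrc="([^"]*)"/gi;
  const names = [];
  for (const match of html.matchAll(pattern)) {
    const path = match[1];
    // Everything after the last slash; the whole value if there is no slash
    names.push(path.slice(path.lastIndexOf("/") + 1));
  }
  return names;
}

// Made-up sample in the same shape as the thread's HTML
const sampleHTML =
  '<a><img alt="picture" class="block h-full w-full" src="/get/app/Images/Pictures/Comp/71377.PNG"></a>' +
  '<a><img alt="picture" class="block h-full w-full" src="/get/app/Images/Pictures/Comp/71378.PNG"></a>';
console.log(imageFileNames(sampleHTML).join(", "));
```

The same function should accept the string that the .innerHTML call returns, though I have only tried it on a pasted sample like the one above, not on the live page.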
theXPath is used as follows:
1) ‘//’ → selects matching nodes at any depth below the root
2) ‘img’ → matches elements named ‘img’
3) ‘/@src’ → selects the ‘src’ attribute node of each matched element
Each matched attribute node’s stringValue then gives theValues,
and theValues coerced to a list gives theValuesList.
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
property theURL : "https://www.google.com"
property theXML : ""
property theXMLDoc : missing value
property theXPath : "//img/@src"
property theMatches : missing value
property theValues : missing value
property theValuesList : {}
set theXML to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<div class=\"container mx-auto py-4\">
<div class=\"-m-4 flex flex-wrap\">
<div class=\"w-1/2 p-4 lg:w-1/4\">
<a class=\"relative block h-48 overflow-hidden rounded\">
<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71377.PNG\">
</img>
<div class=\"mt-4\">
<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
1.32 Some text
</h3>
</div>
</a>
<div class=\"w-1/2 p-4 lg:w-1/4\">
<a class=\"relative block h-48 overflow-hidden rounded\">
<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71378.PNG\">
</img>
<div class=\"mt-4\">
<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
1.32 Some text
</h3>
</div>
</a>
<div class=\"w-1/2 p-4 lg:w-1/4\">
<a class=\"relative block h-48 overflow-hidden rounded\">
<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71379.PNG\">
</img>
<div class=\"mt-4\">
<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
1.32 Some text
</h3>
</div>
</a>
<div class=\"w-1/2 p-4 lg:w-1/4\">
<a class=\"relative block h-48 overflow-hidden rounded\">
<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71380.PNG\">
</img>
<div class=\"mt-4\">
<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
1.32 Some text
</h3>
</div>
</a>
<div class=\"w-1/2 p-4 lg:w-1/4\">
<a class=\"relative block h-48 overflow-hidden rounded\">
<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71381.PNG\">
</img>
<div class=\"mt-4\">
<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
3.32 Some text
</h3>
</div>
</a>
</div>
</div>
<div/>
</div>
</div>
</div>
</div>
</div>
"
-- OPT 1 CREATE theXMLDoc from NSURL link (theURL is text, so it is converted to an NSURL first)
--set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithContentsOfURL:(current application's NSURL's URLWithString:theURL) options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
-- OPT 2 CREATE theXMLDoc from STRING
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theXML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)
set {theMatches, theError} to (theXMLDoc's nodesForXPath:(theXPath) |error|:(reference))
-- {(NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71377.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71378.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71379.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71380.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71381.PNG"}
set theValues to (theMatches's valueForKey:"stringValue")
set theValuesList to theValues as list
-- {"/get/app/Images/Pictures/Comp/71377.PNG", "/get/app/Images/Pictures/Comp/71378.PNG", "/get/app/Images/Pictures/Comp/71379.PNG", "/get/app/Images/Pictures/Comp/71380.PNG", "/get/app/Images/Pictures/Comp/71381.PNG"}