Extract Image detail from HTML

Can someone help me with this, please? I am trying to get image path details from the following HTML extract. The website in question requires a login, so I can’t include a link to it here, but I hope that the pasted HTML will be sufficient.

There are five .PNG file names to be found: 71377, 71378, 71379, 71380 and 71381.

I have tried using

set theName to do JavaScript "document.getElementsByClassName('" & theItem & "')[" & itemCount & "].innerText"

for the various classes I can see in the HTML, but I can’t find any class that gives me the required file names. Is there another method I could try, please?

NB. It may be important that the required text is displayed as a hyperlink in the HTML code (from /get… to .PNG)

HTML as follows:

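[The forum rendered the pasted markup instead of showing its source: five images with alt text “picture”, the first four captioned “1.32 Some text” and the last “3.32 Some text”. The markup appears in the NSXMLDocument script further down.]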

Here is a simple script to parse the HTML example you gave.

My script assumes the HTML text is on the clipboard from a copy operation.
It returns a list of paths as text.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

on run
	local html, htmlList, tid
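	set tid to text item delimiters -- remember the current delimiters
	set html to the clipboard as text
	-- Split on the text that precedes every src value
	set text item delimiters to "<img alt=\"picture\" class=\"block h-full w-full\" src=\""
	set htmlList to text items of html
	set htmlList to rest of htmlList -- drop everything before the first <img>
	-- Each chunk now starts with a path; keep only the part before the closing ">
	set text item delimiters to "\">"
	repeat with anItem in htmlList
		set contents of anItem to first text item of contents of anItem
	end repeat
	set text item delimiters to tid -- restore the delimiters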
	return htmlList
end run

I tested Robert’s suggestion and it works great. It’s fast and does exactly what the OP wants.

I’ve been working to learn RegEx and capture groups, and this was a good opportunity to test my skills. As with Robert’s script, my suggestion assumes the HTML is on the clipboard and returns a list of paths.

use framework "Foundation"
use scripting additions

set theHTML to the clipboard
set thePaths to getPaths(theHTML)

on getPaths(theHTML)
	set theHTML to current application's NSString's stringWithString:theHTML
	set thePattern to "src=\\\"([^\\\"]*)\\\"" -- from src=" up to the next " (a bounded capture also works when several tags share a line)
	-- set thePattern to "(?i)src=\\\"([^\\\"]*\\.png)" -- from src=" to and including .png
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegex's matchesInString:theHTML options:0 range:{location:0, |length|:theHTML's |length|()}
	set theMatches to current application's NSMutableArray's new()
	repeat with aMatch in regexResults
		set theRange to (aMatch's rangeAtIndex:1)
		(theMatches's addObject:(theHTML's substringWithRange:theRange))
	end repeat
	return theMatches as list
end getPaths
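
When capturing the text between `src="` and the closing quote, a greedy `(.*)` and a bounded `([^"]*)` give the same result on the HTML in this thread, because each tag sits on its own line. They differ once several tags share a line. The sketch below illustrates that in JavaScript; the one-line sample string is made up for the illustration, and JavaScript's regex engine is not the one NSRegularExpression uses, although both treat these two patterns the same way.

```javascript
// Hypothetical sample: two <img> tags on ONE line, unlike the multi-line HTML in this thread.
const sampleHtml =
  '<img alt="picture" class="block h-full w-full" src="/get/app/Images/Pictures/Comp/71377.PNG">' +
  '<img alt="picture" class="block h-full w-full" src="/get/app/Images/Pictures/Comp/71378.PNG">';

// Collect capture group 1 of every match of `pattern` in `html`.
function captureAll(html, pattern) {
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

// Greedy: (.*) runs to the LAST "> on the line, so both tags end up in one match.
const greedy = captureAll(sampleHtml, /src="(.*)">/g);

// Bounded: [^"]* stops at the first closing quote, so each tag gives its own match.
const bounded = captureAll(sampleHtml, /src="([^"]*)"/g);

console.log(greedy.length); // 1
console.log(bounded);       // the two paths, one per tag
```

If the page source is ever minified onto a single line, the bounded form is the one that keeps returning one path per image.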

Have a look at this post and my reply about using NSXMLDocument and XPath:

https://macscripter.net/viewtopic.php?id=49139

Thanks all for your replies. As usual, I wasn’t clear what my problem actually was!
I couldn’t see any reference to the actual file name in the output from my code:

set theName to do JavaScript "document.getElementsByClassName('" & theItem & "')[" & itemCount & "].innerText"

But changing ‘.innerText’ to ‘.innerHTML’ has revealed the missing file names.
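
The likely reason is that `innerText` returns only the text a reader sees, while the file name lives in the `src` attribute of the `img` tag, which is markup and so shows up only in `innerHTML`. Another option is to read the attribute directly instead of picking it out of the markup. The following is a minimal sketch of that idea, not something tested against the site in question. The helper name `imageSources` and the stand-in objects are invented for illustration, and the stand-in exists only so the snippet also runs outside a browser.

```javascript
// Return the src attribute of every <img> under `root` (a Document or Element).
function imageSources(root) {
  return Array.from(root.querySelectorAll("img"), (img) => img.getAttribute("src"));
}

// Outside a browser there is no `document`, so this hypothetical stand-in
// mimics just the two DOM methods used above.
const fakeImg = (src) => ({ getAttribute: (name) => (name === "src" ? src : null) });
const fakeRoot = {
  querySelectorAll: (selector) =>
    selector === "img"
      ? [
          fakeImg("/get/app/Images/Pictures/Comp/71377.PNG"),
          fakeImg("/get/app/Images/Pictures/Comp/71378.PNG"),
        ]
      : [],
};

// Use the real page when one exists, otherwise the stand-in.
const root = typeof document !== "undefined" ? document : fakeRoot;
console.log(imageSources(root));
```

`getAttribute("src")` returns the value as written in the HTML (here the path starting with /get), whereas the `src` property would return the fully resolved URL. In Safari the same expression could be passed to `do JavaScript` with `document` as the root; joining the array into one string before returning it is probably the simplest way to get the result back into AppleScript.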

Here’s a script using NSXMLDocument…

theXPath works as follows:
1) ‘//’ → searches at any depth below the document root
2) ‘img’ → matches elements named ‘img’
3) ‘/@src’ → selects the ‘src’ attribute node of each matched element

Next, get the stringValue of each matched node to build theValues,
then coerce theValues to a list to get theValuesList.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

property theURL : "https://www.google.com"
property theXML : ""

property theXMLDoc : missing value
property theXPath : "//img/@src"
property theMatches : missing value
property theValues : missing value
property theValuesList : {}

set theXML to "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<div class=\"container mx-auto py-4\">
	<div class=\"-m-4 flex flex-wrap\">
		<div class=\"w-1/2 p-4 lg:w-1/4\">
			<a class=\"relative block h-48 overflow-hidden rounded\">
				<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71377.PNG\">				
				</img>
				<div class=\"mt-4\">
					<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
						
						1.32 Some text
						
					</h3>
				</div>
			</a>
			<div class=\"w-1/2 p-4 lg:w-1/4\">
				<a class=\"relative block h-48 overflow-hidden rounded\">
					<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71378.PNG\">
						
					</img>
					<div class=\"mt-4\">
						<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
							
							1.32 Some text
							
						</h3>
					</div>
				</a>
				<div class=\"w-1/2 p-4 lg:w-1/4\">
					<a class=\"relative block h-48 overflow-hidden rounded\">
						<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71379.PNG\">
							
						</img>
						<div class=\"mt-4\">
							<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
								
								1.32 Some text
								
							</h3>
						</div>
					</a>
					<div class=\"w-1/2 p-4 lg:w-1/4\">
						<a class=\"relative block h-48 overflow-hidden rounded\">
							<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71380.PNG\">
								
							</img>
							<div class=\"mt-4\">
								<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
									
									1.32 Some text
								</h3>
							</div>
						</a>
						<div class=\"w-1/2 p-4 lg:w-1/4\">
							<a class=\"relative block h-48 overflow-hidden rounded\">
								<img alt=\"picture\" class=\"block h-full w-full\" src=\"/get/app/Images/Pictures/Comp/71381.PNG\">
									
								</img>
								<div class=\"mt-4\">
									<h3 class=\"title-font mb-1 text-xs tracking-widest text-gray-500\">
										
										3.32 Some text 
										
									</h3>
								</div>
							</a>
						</div>
					</div>
					<div/>
				</div>
			</div>
		</div>
	</div>
</div>
"

-- OPT 1 CREATE theXMLDoc from NSURL link (initWithContentsOfURL: expects an NSURL, not a string)
--set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithContentsOfURL:(current application's NSURL's URLWithString:theURL) options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)

-- OPT 2 CREATE theXMLDoc from STRING
set {theXMLDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:theXML options:(current application's NSXMLDocumentTidyHTML) |error|:(reference)

set {theMatches, theError} to (theXMLDoc's nodesForXPath:(theXPath) |error|:(reference))
-- {(NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71377.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71378.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71379.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71380.PNG", (NSXMLNamedNode) src="/get/app/Images/Pictures/Comp/71381.PNG"}

set theValues to (theMatches's valueForKey:"stringValue")
set theValuesList to theValues as list
-- {"/get/app/Images/Pictures/Comp/71377.PNG", "/get/app/Images/Pictures/Comp/71378.PNG", "/get/app/Images/Pictures/Comp/71379.PNG", "/get/app/Images/Pictures/Comp/71380.PNG", "/get/app/Images/Pictures/Comp/71381.PNG"}
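
The scripts in this thread return full paths, while the original question asked for the five file names. Reducing each path to its bare name is a small extra step. The sketch below shows one way to do it in JavaScript; `baseName` is a hypothetical helper name, and the input list is simply the result shown above.

```javascript
// The five paths returned by the scripts in this thread.
const paths = [
  "/get/app/Images/Pictures/Comp/71377.PNG",
  "/get/app/Images/Pictures/Comp/71378.PNG",
  "/get/app/Images/Pictures/Comp/71379.PNG",
  "/get/app/Images/Pictures/Comp/71380.PNG",
  "/get/app/Images/Pictures/Comp/71381.PNG",
];

// Keep the last path segment and drop its extension, if it has one.
function baseName(path) {
  const file = path.slice(path.lastIndexOf("/") + 1);
  const dot = file.lastIndexOf(".");
  return dot > 0 ? file.slice(0, dot) : file;
}

const names = paths.map(baseName);
console.log(names); // [ '71377', '71378', '71379', '71380', '71381' ]
```

The same split could be done in AppleScript with text item delimiters, first on "/" and then on ".".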