Script to read html title tag & save doc with that title

I’m looking for a sample or guidance for a script to batch-process a folder of HTML files. I can use BBEdit (preferred), TextWrangler, Pages, Word, etc.

The script needs to find and read the html title tag, example:


and then simply save the file with the title text as the filename (with .html extension).

I have about 1200 documents to change…

Your help/advice would be sincerely appreciated!

And what would you do if two (or more) titles were alike? :slight_smile:

This little snippet needs to be preceded by a cd to your working directory. You also need to add a test for identical filenames before mv’ing a file to its new name: if there is a collision, increment a counter or the like so that the file gets a truly unique filename.
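The collision test described above could look roughly like this. It is only a sketch: the function name unique_name and the "-1", "-2" suffix scheme are my own assumptions, not part of any script posted in this thread.

```shell
#!/bin/sh
# unique_name DIR BASE EXT (hypothetical helper)
# Prints a path inside DIR that does not exist yet.
# Tries BASE.EXT first, then BASE-1.EXT, BASE-2.EXT, and so on.
unique_name() {
    dir=$1
    base=$2
    ext=$3
    candidate="$dir/$base.$ext"
    n=0
    # Keep incrementing the counter until the name is free
    while [ -e "$candidate" ]; do
        n=$((n + 1))
        candidate="$dir/$base-$n.$ext"
    done
    printf '%s\n' "$candidate"
}
```

You would then rename with something like mv "$f" "$(unique_name . "$title" html)". The check and the mv are two separate steps, so this assumes nothing else is writing to the folder at the same time.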


set HTMLData to do shell script "curl http://macscripter.net" without altering line endings
set HTMLTitle to do shell script "sed -n 's/<title>\\(.*\\)<\\/title>/\\1/p' <<< " & quoted form of HTMLData

Hello.

I made this shell script for you. I have called mine renbytitle. Save it in the folder with the HTML files as UTF-8 without a BOM, then open a Terminal window for that folder, execute chmod u+x renbytitle, then enter ./renbytitle and hit return. (You may of course make a copy of your folder with the files first and perform a dry run, so nothing gets broken.)
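A rough sketch of the dry-run idea, not the renbytitle script itself: it only prints what would be renamed and touches nothing. The function names html_title and dry_run are made up for illustration, and it assumes each file has at most one title element that contains no nested tags.

```shell
#!/bin/sh
# html_title FILE (hypothetical helper)
# Prints the text of the first <title> element, matched case-insensitively.
html_title() {
    # Join all lines first so a title split across lines is still found
    tr '\n\r' '  ' < "$1" |
        grep -io '<title[^>]*>[^<]*' |
        head -n 1 |
        sed -e 's/^<[^>]*>//' -e 's/^ *//' -e 's/ *$//'
}

# dry_run DIR (hypothetical helper)
# Prints the planned rename for every .html file in DIR without renaming anything.
dry_run() {
    for f in "$1"/*.html; do
        [ -f "$f" ] || continue
        title=$(html_title "$f")
        if [ -z "$title" ]; then
            printf 'skip (no title): %s\n' "$f"
            continue
        fi
        printf '%s -> %s.html\n' "$f" "$title"
    done
}
```

Running dry_run . in the folder lists the planned renames, so you can eyeball them for duplicates or odd characters before doing any real mv.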

@DJ Bazzie Wazzie: I allow for some old HTML that is not XHTML 1.0 compliant.

I love how this is an APPLESCRIPT forum … hahah.

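-- Folder that holds the HTML files (example path)
set myFolder to quoted form of "/Users/John/Desktop/URLS"
-- Every regular, non-hidden file in that folder, one path per paragraph
set myFiles to every paragraph of (do shell script "find " & myFolder & " \\! -name \".*\" -type f")

repeat with aFile in myFiles
	-- grep matches the title tag case-insensitively, so sed has to strip it case-insensitively too
	set fileName to (do shell script "grep -io \\<title\\>[^\\<]* " & quoted form of aFile & " | sed 's/<[Tt][Ii][Tt][Ll][Ee]>//'") & ".html"
	-- Rename the file in place
	tell application "System Events" to set file aFile's name to fileName
end repeat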

You may also need to watch for characters which are illegal or discouraged in file names and for HTML circumlocutions, as in the title of this very thread.
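To illustrate that point, here is a minimal sketch of a clean-up step for the title text. The function name sanitize_title, the handful of entities it decodes, and the choice of "-" as the replacement character are all assumptions of mine; a real set of titles may need more entities and more characters handled.

```shell
#!/bin/sh
# sanitize_title TEXT (hypothetical helper)
# Decodes a few common HTML entities, then replaces characters that cause
# trouble in macOS filenames: "/" (path separator) and ":" (Finder separator).
# Leading dots are dropped so the file does not become hidden.
sanitize_title() {
    printf '%s\n' "$1" |
        sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' \
            -e "s/&#39;/'/g" -e 's/&amp;/\&/g' \
            -e 's|[/:]|-|g' -e 's/^\.*//'
}
```

The &amp; entity is decoded last on purpose, so that text like &amp;lt; ends up as the literal &lt; instead of being decoded twice.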

Hello.

I realized I hadn’t posted the correct version; that is now corrected.

That is so true, I’ll be back in a bit with it.

I have updated it to escape slashes, as those can be troublesome in both titles and filenames.
They are not a problem with Safari, but with wget they are a pain, although you can force wget to use strictly Unix filenames.

If there are any other characters that I should really remove, then please inform me. :slight_smile:

I settled for shell scripting, since it was 1200 documents. :stuck_out_tongue:

Thanks ALL for the code and discussion! Very much appreciated!

Let me also say… I’ll try both the AppleScript and the shell script (for learning). I had considered a Perl script, as I have done a bit of that in the past, but not much. I certainly didn’t expect that in this forum! Again, thanks to all.