I am trying to extract the plain text of a PDF document using Automator’s Extract PDF Text action, and then use a regex to remove the paragraph marks at the end of each line of the returned text (except those that follow the period at the end of a paragraph) so that the paragraphs are reflowed.
But how can I call the Extract PDF Text action from a script?
Skim, a free PDF viewer, might be of interest to you. It is scriptable and includes an AppleScript command for extracting the text of specific PDF pages.
Here is an example:
tell application "Skim"
	open (POSIX file "/Users/martin/Desktop/example.pdf")
	tell document 1
		tell page 1
			set pdftext to get text for
		end tell
	end tell
end tell
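For the line-break cleanup mentioned in the question, here is a minimal sketch in plain AppleScript that avoids a regex altogether. The handler name `reflowText` and the sample string are made up for illustration, and the rule is deliberately naive: a line break is kept only when the line ends with a period, so abbreviations or lines with trailing spaces would need extra handling.

```applescript
-- Hypothetical helper (not part of Skim or Automator): joins hard-wrapped
-- lines, keeping a line break only after a line that ends with a period.
on reflowText(rawText)
	set outText to ""
	repeat with aLine in paragraphs of rawText
		set lineText to contents of aLine
		if lineText is not "" then
			if lineText ends with "." then
				-- Treat a trailing period as the end of a paragraph.
				set outText to outText & lineText & linefeed
			else
				-- Otherwise the line was wrapped mid-sentence: join with a space.
				set outText to outText & lineText & space
			end if
		end if
	end repeat
	return outText
end reflowText

-- Sample input standing in for the extracted PDF text.
set sampleText to "This paragraph was" & linefeed & "wrapped by the PDF." & linefeed & "A second paragraph" & linefeed & "follows here."
set cleanText to reflowText(sampleText)
```

The text returned by Skim could be passed to the handler in the same way. It may have to be coerced to plain text first; that is an assumption and has not been checked against Skim's dictionary.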
Thanks, Jacques. It does not work on the version I used earlier, but it does work on the latest version. However, it extracts text (bold and italics) only from the first page of document 1.
Sorry for my previous post. I consulted Skim's dictionary, and it was not difficult to write a script that extracts bold words from all pages.
This is the script that works:
tell application "Skim"
	tell document 1
		set bold_Word_list to words of every page whose its font contains "Bold"
	end tell
end tell
When two or more bold words are consecutive, they appear as separate words in the result I get from running the script. I would like such consecutive words to be shown together as a single item. Any ideas?