Tagging & Parsing an .rtf to an .xml

Hi!
Well, I know this has probably been asked a thousand times, but I couldn’t find anything truly useful.
First of all, I’m trying to find and replace in a .rtf file, because in some cases it is the formatting that tells me something needs to be replaced. Secondly, I’ve seen that the typical approach finds and replaces the whole text of the file, instead of just the word or words you wanted; I believe this wouldn’t be very practical if my file is really long…

Let me tell you what I want to do:
I have exported all the text of one section from a PDF. Now, since I’m going to use that text to fill a database, I need it to be tagged in a specific way. Here is an example of the raw text:

and this is what I should end with:

So… yes, it IS a D&D3.5 spell. (We are trying to make a dynamic database to access them more rapidly, for fun).
So, as you see, except for a couple of tags, all of them replaced some portion of the text and then inserted text at the end of that line… as here:


So I assume that if I could find every "Components: " and replace it with the opening components tag, and then somehow add the closing tag at the end of that same line… that would be half of the work.

As I see it, the most difficult part is the name, school and description tags…
The Name and School each have their own Rich Text Formatting, so if I can search for those, then I know where to insert the tags… but the description… I have no idea how to look for it.

So… any ideas on how to accomplish this?
I just need some ideas on how to begin with this… I think I could code the script myself, but I’ll need your experienced pointers.

Thanks to all!!!

Model: Mac Mini Intel
Browser: Firefox 2.0.0.14
Operating System: Mac OS X (10.4)

I started out making a short script to show you how to do it, but after getting into it, it was easy to just write the whole thing. If the format of the raw text doesn’t change then this should work for you; otherwise you can make adjustments as necessary. Good luck.

set rawText to "Acid Fog 
Conjuration (Creation) [Acid] 
Level: Sor/Wiz 6, Water 7
Components: V, S, M/DF
Casting Time: 1 standard action
Range: Medium (100 ft. + 10 ft./level)
Effect: Fog spreads in 20-ft. radius, 20 ft. high
Duration: 1 round/level
Saving Throw: None
Spell Resistance: No
Acid fog creates a billowing mass of misty vapors similar to that produced by a solid fog spell (page 281). In addition to slowing creatures down and obscuring sight, this spell's vapors are highly acidic. Each round on your turn, starting when you cast the spell, the fog deals 2d6 points of acid damage to each creature and object within it.
Arcane Material Component: A pinch of dried, powdered peas combined with powdered animal hoof."

-- the keyWords are the words to be replaced in the rawText
-- note the format is {(the word to be replaced), (the beginning tag), (the end tag)}
set keyWords to {"Level", "<level>", "</level>", "Components", "<components>", "</components>", "Casting Time", "<castingtime>", "</castingtime>", "Range", "<range>", "</range>", "Effect", "<effect>", "</effect>", "Duration", "<duration>", "</duration>", "Saving Throw", "<savingthrow>", "</savingthrow>", "Spell Resistance", "<spellresistance>", "</spellresistance>"}

-- make the rawText into a list
set rawParagraphs to paragraphs of rawText

-- some of the items in the rawParagraphs list have leading and trailing spaces, so we remove those spaces
set moddedParagraphs to {}
repeat with i from 1 to count of rawParagraphs
	-- trim the leading spaces first, then the trailing spaces
	set end of moddedParagraphs to removeTrailingSpaces(removeLeadingSpaces(item i of rawParagraphs))
end repeat

-- add the spell tag to our final list and fix the first item in the rawText
set finalList to {"<spell>"}
set end of finalList to "<name>" & item 1 of moddedParagraphs & "</name>"

-- fix the second item in the rawText and add it to our finalList
set text item delimiters to " ["
set modItem2 to text items of (item 2 of moddedParagraphs)
set text item delimiters to ""
set end of finalList to "<school>" & item 1 of modItem2 & "</school>"
set end of finalList to "<element>[" & item 2 of modItem2 & "</element>"

-- fix the items in the rawText that have our keyWords
set text item delimiters to ": "
repeat with j from 3 to 10
	set theseWords to text items of item j of moddedParagraphs
	
	repeat with i from 1 to count of keyWords by 3
		if (item 1 of theseWords) is (item i of keyWords) then
			set end of finalList to (item (i + 1) of keyWords) & (item 2 of theseWords) & (item (i + 2) of keyWords)
			exit repeat
		end if
	end repeat
end repeat
set text item delimiters to ""

-- fix the last 2 rawText items and add the ending spell tag
set end of finalList to "<description>" & (item 11 of moddedParagraphs) & "</description>"
set end of finalList to "<materialcomponent>" & item 12 of moddedParagraphs & "</materialcomponent>"
set end of finalList to "</spell>"

-- make the list back into a string
set text item delimiters to return
set finalString to finalList as string
set text item delimiters to ""
finalString





(*===================== SUBROUTINES =====================*)
on removeLeadingSpaces(theString)
	set newname to theString
	repeat with i from 1 to (count of theString)
		if character i of theString is not " " then
			exit repeat
		else if i is (count of theString) then
			-- the string is nothing but spaces
			set newname to ""
		else
			set newname to text (i + 1) thru -1 of theString
		end if
	end repeat
	return newname
end removeLeadingSpaces

on removeTrailingSpaces(theString)
	set newname to theString
	repeat with i from (count of theString) to 1 by -1
		if character i of theString is not " " then
			exit repeat
		else if i is 1 then
			-- the string is nothing but spaces
			set newname to ""
		else
			set newname to text 1 thru (i - 1) of theString
		end if
	end repeat
	return newname
end removeTrailingSpaces
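
If the labelled lines don't always show up in the same positions, the fixed "repeat with j from 3 to 10" range is probably the first thing that will need adjusting. Here is a rough sketch of one way around that, not a tested drop-in replacement: a handler that tags a single line by its label, wherever the line sits. The handler name tagLine is made up for this example; it assumes the same {word, begin tag, end tag} layout as the keyWords list, and it splits on the first ": " only, so a value that itself contains ": " stays in one piece.

```applescript
-- Hypothetical helper (name and behavior are assumptions, not part of the script above).
-- Returns the tagged line, or missing value when the line has no known label.
on tagLine(theLine, keyWords)
	-- find the first ": " so the label can be separated from the value
	set splitAt to offset of ": " in theLine
	if splitAt is less than 2 then return missing value -- no label before ": "
	set theLabel to text 1 thru (splitAt - 1) of theLine
	if (splitAt + 2) > (count of theLine) then
		-- the line ends right after ": ", so the value is empty
		set theValue to ""
	else
		set theValue to text (splitAt + 2) thru -1 of theLine
	end if
	-- look the label up in the flat {word, begin tag, end tag} list
	repeat with i from 1 to (count of keyWords) by 3
		if theLabel is (item i of keyWords) then
			return (item (i + 1) of keyWords) & theValue & (item (i + 2) of keyWords)
		end if
	end repeat
	return missing value -- the label is not one of the keyWords
end tagLine

tagLine("Components: V, S, M/DF", {"Components", "<components>", "</components>"})
-- result: "<components>V, S, M/DF</components>"
```

You could call it on every paragraph and only treat the ones that come back as missing value (name, school, description and so on) as special cases.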

Wow! Thanks Regulus!
I’ll test this tonight when I get home!
Thank you very much for your time!

No problem. Hopefully it will work as you expect. Let me know how it turns out.