I’ve written a script to process TMX (XML) files in TextWrangler. To add a few lines at the start and end of the resulting file I used some rather ‘clumsy’ steps; see the part of the script that begins with the comment “-- Fill these 5 lines with the TMX header” and the placeholder step just before it.
What would be the ‘elegant’ way to do this?
Also, what is the best way to represent a line feed? I currently type “\n”, and this works. However, once the script is saved in the editor, each “\n” shows up as a literal line break inside the string, which gives an ‘ugly’ word wrap (again, see the part around “-- Fill these 5 lines with the TMX header”).
-- Release: 2016-10-20
-- Purpose: Create a compacted version of a TMX file
tell application "TextWrangler"
tell front text document
-- Remove line breaks at segment start
replace "<seg>\\n" using "<seg>" options {search mode:grep, starting at top:true}
-- Remove line breaks at segment ending
replace "\\n</seg>" using "</seg>" options {search mode:grep, starting at top:true}
-- Conversion from TMX to tab-del
-- Insert tab characters between source and target segments
replace "<\\/seg.*?seg>" using "\\t" options {search mode:grep, starting at top:true}
-- Remove segment ending markup
replace "</seg></tuv></tu>" using "" options {starting at top:true}
-- Remove closing body markup
replace "</body>" using "" options {starting at top:true}
-- Remove closing TMX markup
replace "</tmx>" using "" options {starting at top:true}
-- Remove any TMX header
--replace "<\\?[\\w\\W]*<body>\\r" using "" options {search mode:grep, starting at top:true}
-- Remove segment start markup
replace "<tu.*?seg>" using "" options {search mode:grep, starting at top:true}
-- Start cleaning up the tab-del
-- Replace every number with 0
replace "\\d+" using "0" options {search mode:grep, starting at top:true}
-- Remove punctuation characters
replace "[!?ž“”'˜\"]" using "" options {search mode:grep, starting at top:true}
-- Remove HTML entities
replace "&.*?;" using "" options {search mode:grep, starting at top:true}
-- Replace non-breaking space with normal spaces
replace "\\x{A0}" using " " options {search mode:grep, starting at top:true}
-- Reduce space sequences to single spaces
replace "[ ]{2,}" using " " options {search mode:grep, starting at top:true}
-- Remove spaces at segment start
replace "\\r[ ]" using "\\r" options {search mode:grep, starting at top:true}
replace "\\t[ ]" using "\\t" options {search mode:grep, starting at top:true}
-- Remove spaces at segment ending
replace "[ ]\\r" using "\\r" options {search mode:grep, starting at top:true}
replace "[ ]\\t" using "\\t" options {search mode:grep, starting at top:true}
-- Remove duplicate lines
process duplicate lines duplicates options {match mode:leaving_one} output options {deleting duplicates:true}
-- Delete lines where source=target
process lines containing matching string "^(.*?)\\t\\1$" matching with grep true ¬
output options {deleting matched lines:true}
-- Delete lines without any letter
process lines containing matching string "^((?![A-Za-z]).)*$" matching with grep true ¬
output options {deleting matched lines:true}
-- Delete lines without a TAB
process lines containing matching string "^[^\\t]*?$" matching with grep true ¬
output options {deleting matched lines:true}
-- Add one line break at the start of the document
replace "\\A(.)" using "\\n\\1" options {search mode:grep, starting at top:true}
-- Add segment markup step 1 start and endings
replace "\\n" using "<\\/seg><\\/tuv><\\/tu>\\n<tu><tuv xml:lang=\"de-DE\"><seg>" options {search mode:grep, starting at top:true}
-- Add segment markup step 2 tab characters
replace "\\t" using "<\\/seg><\\/tuv><tuv xml:lang=\"nl-NL\"><seg>" options {search mode:grep, starting at top:true}
-- Replace the first line with 5 placeholder lines
set first line to "one" & "
" & "two" & "
" & "three" & "
" & "four" & "
" & "five"
-- Fill these 5 lines with the TMX header
set first line to "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
set second line to "<tmx version=\"1.4\">"
set third line to "<header datatype=\"plaintext\" segtype=\"sentence\" adminlang=\"EN-US\" srclang=\"de-DE\"><note>size=4</note>"
set fourth line to "</header>"
set fifth line to "<body>"
-- Replace the last line of the file with the TMX closing markups
set last line to "</body>" & "
" & "</tmx>"
save to ((path to desktop folder) as text) & "Compacted memory.tmx"
end tell
end tell
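For reference, here is a minimal sketch of one possible way to build the header and footer without literal line breaks in the source. It relies only on AppleScript's built-in `linefeed` constant and text item delimiters. The handler name `joinLines` and the variables `tmxHeader` and `tmxFooter` are hypothetical names introduced for illustration, and I have not verified how TextWrangler treats the result.

```applescript
-- joinLines is a hypothetical helper: it joins a list of strings with line feeds
on joinLines(theLines)
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to linefeed
	set joined to theLines as text
	set AppleScript's text item delimiters to savedDelimiters
	return joined
end joinLines

-- The same five header lines as in the script above, kept as one list
set tmxHeader to joinLines({¬
	"<?xml version=\"1.0\" encoding=\"utf-8\"?>", ¬
	"<tmx version=\"1.4\">", ¬
	"<header datatype=\"plaintext\" segtype=\"sentence\" adminlang=\"EN-US\" srclang=\"de-DE\"><note>size=4</note>", ¬
	"</header>", ¬
	"<body>"})

-- The closing markup, joined the same way
set tmxFooter to joinLines({"</body>", "</tmx>"})

-- Assumption: inside the existing tell block, the placeholder step and the five
-- "set ... line" steps could then shrink to
--     set first line to tmxHeader
--     set last line to tmxFooter
```

Because `linefeed` is a constant rather than an escape sequence inside a string, the editor has nothing to expand when it saves the script, so the strings should stay on one line each.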