Split TMX files

mactranslator · April 19, 2018, 6:20pm

I’m not sure if AppleScript is the best approach for this, but I’m looking for a script to chop up a gigantic (1.5 GB) TMX file (https://en.wikipedia.org/wiki/Translation_Memory_eXchange) into parts of, say, 100,000 units each.

Here are two examples of a TMX file:

With all units on one (separate) line:

With all units divided over several lines:

<?xml version="1.0" encoding="utf-8"?> RecognizeAll Test TM DefaultFlags DefaultFlags ABC\hans 0, 0 TM Translated This is the first line. This is the first line. ABC\hans -4072969220006651662, 2865162695760659201 TM Translated This is the second line. This is the second line. ABC\hans 2875699195589488344, -3087045305273620005 TM Translated This is the third line. This is the third line. ABC\hans -4072600167146536805, 2876603334424219768 TM Translated This is the fourth line. This is the fourth line. ABC\hans 2865323864096399602, -3408680581559371007 TM Translated This is the fifth line. This is the fifth line. ABC\hans -4072969549614067125, 2865152477930779848 TM Translated This is the last line. This is the last line.

Ideally, every chunk would start with a header (like in the first example) and end with </body></tmx>.

At the very least, every chunk should start with <tu … and end with the corresponding </tu>.

Is a task like this possible in AppleScript, with such huge files?

Nigel_Garvey · April 22, 2018, 2:27pm

The ‘read’ command doesn’t seem to like files of that size, but ASObjC may be OK. (The test files I created are only 1.3 GB.)

This can take a minute or more, depending on the file size, the unit size, and the speed of your computer:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

property maxUnitsPerFile : 100000

on splitTMXFile(TMXfile)
	-- Original assumptions: The TMX file is UTF-8 encoded and consists entirely of an initial block (which is to be reproduced in all the smaller files) ending with a "<body>" tag, a large number of "<tu …> … </tu>" entries, and closing "</body>" and "</tmx>" tags.
	-- Modified assumptions: The TMX text may or may not be UTF-8 encoded and "unit" sections may actually begin with straight "<tu>" tags instead of the "<tu …(+ other data)…>" type. The source file is now read using a method which works out the text encoding for itself and the split text is saved with that encoding.
	
	set |⌘| to current application
	-- Read the file.
	set originalPath to |⌘|'s class "NSString"'s stringWithString:(POSIX path of TMXfile)
	-- set UTF8 to |⌘|'s NSUTF8StringEncoding
	set {originalText, originalEncoding} to |⌘|'s class "NSString"'s stringWithContentsOfFile:(originalPath) usedEncoding:(reference) |error|:(missing value)
	
	-- Find where the "units" start.
	set unitsStart to (originalText's rangeOfString:("<tu[ >]") options:(|⌘|'s NSRegularExpressionSearch))'s location()
	-- Get the text up to there and the two end tags.
	set initialBlock to originalText's substringWithRange:({0, unitsStart})
	set endBlock to |⌘|'s class "NSString"'s stringWithString:("</body>" & linefeed & "</tmx>")
	
	-- Set up and use a regex to match the "units" in blocks of up to the maximum number required per file.
	set unitsBlockRegex to |⌘|'s class "NSRegularExpression"'s regularExpressionWithPattern:("(?:<tu[ >](?:[^>]++(?<!</tu)>)++[^>]++(?<=</tu)>[^<]*+){1," & maxUnitsPerFile & "}+") options:(0) |error|:(missing value)
	set unitsBlockMatches to unitsBlockRegex's matchesInString:(originalText) options:(0) range:({unitsStart, (originalText's |length|()) - unitsStart})
	
	-- Get the original file path without the extension to use as the basis for the file paths to be created.
	set rootPath to originalPath's stringByDeletingPathExtension()
	-- Work out how many digits will be required in the numeric suffixes to be included in the file names.
	set newFilesNeeded to (count unitsBlockMatches)
	set suffixLength to (count (newFilesNeeded as text))
	set n to (10 ^ suffixLength) as integer
	
	-- Work through the "unit" block matches.
	repeat with i from 1 to newFilesNeeded
		-- Create a new text consisting of the initial block, the current matched block of "units", and the end tags.
		set thisUnitsBlockMatch to item i of unitsBlockMatches
		set thisUnitsBlock to (originalText's substringWithRange:(thisUnitsBlockMatch's range()))
		set newText to initialBlock's stringByAppendingFormat_("%@%@", thisUnitsBlock, endBlock)
		-- Put together a path with an appropriate numeric suffix before the extension and save the text to it.
		set newPath to (rootPath's stringByAppendingString:(" [" & text 2 thru -1 of (n + i as text) & "].tmx"))
		tell newText to writeToFile:(newPath) atomically:(true) encoding:(originalEncoding) |error|:(missing value)
	end repeat
end splitTMXFile

set TMXfile to (choose file of type {"tmx"} with prompt "Choose a gigantic TMX file …")
splitTMXFile(TMXfile)

Edit: Script modified to make it more flexible in what it recognises.
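
The numeric-suffix arithmetic in the script can be tried in isolation. This is a minimal sketch in plain AppleScript; the handler name paddedSuffix and the figures 3 and 12 are made up for illustration and aren't part of the script above:

```applescript
-- Hypothetical helper: zero-pad index i to the number of digits in fileCount.
on paddedSuffix(i, fileCount)
	-- Number of digits needed for the highest suffix.
	set suffixLength to (count (fileCount as text))
	-- A power of ten with one more digit than that, e.g. 100 for two-digit suffixes.
	set n to (10 ^ suffixLength) as integer
	-- Adding n puts a leading "1" in front of the padded number; "text 2 thru -1" drops it again.
	return text 2 thru -1 of ((n + i) as text)
end paddedSuffix

paddedSuffix(3, 12) --> "03"
```

Padding the suffixes this way keeps the new files in numerical order when they're sorted by name.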

mactranslator · April 23, 2018, 8:39am

Hi Nigel,

Thank you so much. This is great work. I’ve tested it with a couple of TMX files, created by different creation tools. All works fine.

However, there is this one big TMX from the EU that seems to be valid but nevertheless causes this error:

error "missing value doesn’t understand the “rangeOfString_” message." number -1708 from missing value

In line:

set unitsStart to (originalText's rangeOfString:("<tu "))'s location()

In case you want to test, here’s a link to the EU TMX file:

https://www.dropbox.com/s/3mt1uudszmmjsz9/DGT_DE_NL.tmx.zip?dl=0

Thanks,

HL

Nigel_Garvey · April 23, 2018, 10:31am

OK. That particular file differs from the others in that it’s UTF-16LE encoded and has plain “<tu>” tags instead of the “<tu …(+ other data)…>” variety. I’ve modified the script above to allow for these differences. It saves the extracted text with the same encoding as the original, or at least with what’s deduced to be the original encoding by the method now used to read the file.
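
The “missing value doesn’t understand” error reported above is what you get when the read returns missing value instead of a string, which suggests the file's text couldn't be decoded. Here's a minimal sketch of catching that at the point of reading, assuming the same ASObjC setup as the script. The handler name readTextFile and the error wording are made up for illustration:

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical handler: returns {text, encoding}, or raises a readable error if the read fails.
on readTextFile(posixPath)
	-- With two "reference" parameters, the result is a list: the string, the encoding used, and the NSError (or missing value).
	set {theText, theEncoding, theError} to current application's class "NSString"'s stringWithContentsOfFile:(posixPath) usedEncoding:(reference) |error|:(reference)
	if theText is missing value then
		error "Couldn’t read the file: " & (theError's localizedDescription() as text)
	end if
	return {theText, theEncoding}
end readTextFile
```

It could be called as `set {originalText, originalEncoding} to readTextFile(POSIX path of TMXfile)`, so that a failed read stops the script with an explanation instead of failing later at the first method call on missing value.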

Let me know if there are any further problems.

mactranslator · April 23, 2018, 3:54pm

Thanks for providing the Mac-using translator community with such a nice solution. I’ve posted it here: https://www.proz.com/forum/apple_mac_operating_systems/324749-splitting_up_gigantic_tms_applescript_solution.html

Note that there is one closing bracket too many in the last line. Once I removed that, the new version of the script worked fine.

Nigel_Garvey · April 23, 2018, 7:34pm

Ah. Thanks. The last bracket must have been left in when I pasted the new version. Now corrected.