Transliteration from Cyrillic to Latin with only AppleScript

Hi All,

I have a working Perl script that I run as a service through Automator.app. I would like to port it to AppleScript, but so far nothing has worked.

perl script

So far I have only been working on the transliteration part, but I don’t have enough experience to get it right.


set theString to "Брин Сергей Михайлович"
--
set RuUp0 to "Й"
set RuUp1 to "Ё"
set RuUp2 to "АБВГДЕЗИКЛМНОПРСТУФЫЭ"
set RuUp3 to "Ж"
set RuUp4 to "Х"
set RuUp5 to "Ц"
set RuUp6 to "Ч"
set RuUp7 to "Ш"
set RuUp8 to "Щ"
set RuUp9 to "Ю"
set RuUp10 to "Я"
--
set RuDown0 to "й"
set RuDown1 to "ё"
set RuDown2 to "абвгдезиклмнопрстуфыэ"
set RuDown3 to "ж"
set RuDown4 to "х"
set RuDown5 to "ц"
set RuDown6 to "ч"
set RuDown7 to "ш"
set RuDown8 to "щ"
set RuDown9 to "ь"
set RuDown10 to "ъ"
set RuDown11 to "ю"
set RuDown12 to "я"
--
set EnUp0 to "Y"
set EnUp1 to "Ye"
set EnUp2 to "ABVGDEZIKLMNOPRSTUFYE"
set EnUp3 to "Zh"
set EnUp4 to "Kh"
set EnUp5 to "Ts"
set EnUp6 to "Ch"
set EnUp7 to "Sh"
set EnUp8 to "Sch"
set EnUp9 to "Yu"
set EnUp10 to "Ya"
--
set EnDown0 to "y"
set EnDown1 to "ye"
set EnDown2 to "abvgdeziklmnoprstufy"
set EnDown3 to "zh"
set EnDown4 to "kh"
set EnDown5 to "ts"
set EnDown6 to "ch"
set EnDown7 to "sh"
set EnDown8 to "sch"
set EnDown9 to ""
set EnDown10 to ""
set EnDown11 to "yu"
set EnDown12 to "ya"

set sourceList to {RuDown0, RuDown1, RuDown2, RuDown3, RuDown4, RuDown5, ¬
	RuDown6, RuDown7, RuDown8, RuDown9, RuDown10, RuDown11, RuDown12, ¬
	RuUp0, RuUp1, RuUp2, RuUp3, RuUp4, RuUp5, RuUp6, RuUp7, RuUp8, RuUp9, RuUp10} as Unicode text
set targetList to {EnDown0, EnDown1, EnDown2, EnDown3, EnDown4, EnDown5, ¬
	EnDown6, EnDown7, EnDown8, EnDown9, EnDown10, EnDown11, EnDown12, ¬
	EnUp0, EnUp1, EnUp2, EnUp3, EnUp4, EnUp5, EnUp6, EnUp7, EnUp8, EnUp9, EnUp10} as Unicode text

set translit to {}
repeat with theChar in theString
	set sym to offset of theChar in sourceList
	if sym is 0 then
		set end of translit to contents of theChar
	else
		set end of translit to character sym of targetList
	end if
end repeat

on findReplace(findText, replaceText, sourceText)
	set ASTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set sourceText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	set sourceText to "" & sourceText
	set AppleScript's text item delimiters to ASTID
	return sourceText
end findReplace

set translitwithdot to my findReplace({" "}, ".", translit as text)

return translitwithdot as text

This script returns “apzm.rdpvdy.lzheyknbzh”, but it should return “Brin.Sergey.Mikhaylovich”.
How do I replace the characters correctly?
How do I swap the order of the words and drop the part I don’t need?

The final result should be: Sergey.Brin

Do you have any ideas on how to implement it?

Thanks!

This requires macOS 10.11 or later:

use AppleScript version "2.5" -- 10.11 or later
use framework "Foundation"
use scripting additions

set theString to "Брин Сергей Михайлович"
set theString to current application's NSString's stringWithString:theString
set theString to theString's stringByApplyingTransform:(current application's NSStringTransformLatinToCyrillic) |reverse|:true
set theString to theString as string
--> "Brin Sergej Mihajlovič"
return theString's word 2 & "." & theString's word 1
--> "Sergej.Brin"

Hi, Stanley,

Thanks for your example. Reversing the words and joining them with a dot turned out to be easier than I thought.
But the transliteration itself is not what I need.
Your example: Brin Sergej Mihajlovič
Correct: Brin Sergey Mikhaylovich

I need to use a custom transliteration dictionary.
Do you have any ideas for how to do that?

No, it’s not possible to write a custom transformation in AppleScript that my code could use.

What if I don’t use the NSString class and instead use my own translation table, as in my non-working version?
Can you show an example of how to walk through the string and replace each character correctly, so that the output doesn’t get misaligned as it does in my example?
I have seen similar examples, but they only cover replacing one character with exactly one other. My situation is different: one character can be replaced by 0, 2, or 3 characters.
Example:
“ь” → “”
“ч” → “ch”
“Щ” → “Sch”

Could I use the Unicode id?
Example: convert 1065 to 83, 99, 104
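
The code-point idea can be checked directly in AppleScript. This is only a minimal sketch, assuming AppleScript 2.0 or later (where text is Unicode); the variable names are illustrative and not from the original script:

```applescript
-- Look up the Unicode code point of a single Cyrillic character.
set codePoint to id of "Щ"

-- Build the Latin replacement from a list of code points.
set latinText to string id {83, 99, 104}

-- codePoint is 1065 and latinText is "Sch"
```

This only shows the lookup in both directions. Mapping one code point to several would still need a table of some kind, so plain text replacement is probably the simpler route.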

If it’s really impossible, then I’ll think about how to embed the Perl script inside AppleScript.

You could do it like this:

set sourceList to {RuDown0, RuDown1, RuDown2, RuDown3, RuDown4, RuDown5, ¬
	RuDown6, RuDown7, RuDown8, RuDown9, RuDown10, RuDown11, RuDown12, ¬
	RuUp0, RuUp1, RuUp2, RuUp3, RuUp4, RuUp5, RuUp6, RuUp7, RuUp8, RuUp9, RuUp10}
set targetList to {EnDown0, EnDown1, EnDown2, EnDown3, EnDown4, EnDown5, ¬
	EnDown6, EnDown7, EnDown8, EnDown9, EnDown10, EnDown11, EnDown12, ¬
	EnUp0, EnUp1, EnUp2, EnUp3, EnUp4, EnUp5, EnUp6, EnUp7, EnUp8, EnUp9, EnUp10}

set translit to {}
set ASTID to AppleScript's text item delimiters
repeat with i from 1 to count of sourceList
	set theString to findReplace(item i of sourceList, item i of targetList, theString)
end repeat
set AppleScript's text item delimiters to ASTID

return theString

on findReplace(findText, replaceText, sourceText)
	set AppleScript's text item delimiters to findText
	set sourceText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	return sourceText as text
end findReplace

But you can’t use this sort of thing:

set RuDown2 to "абвгдезиклмнопрстуфыэ"

You’d need to either put each of those characters in their own variable, or do a separate repeat, calling the handler for each character in the string.
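
One way to keep the grouped strings is to split them into per-character lists first. The following is a rough sketch under that assumption; the names ruGroup and enGroup are hypothetical, and the sample word is only for illustration:

```applescript
-- Grouped one-to-one letters (21 letters in each string).
set ruGroup to "абвгдезиклмнопрстуфыэ"
set enGroup to "abvgdeziklmnoprstufye"

-- "characters of" turns each string into a list of single characters,
-- so the two lists line up item by item.
set sourceList to characters of ruGroup
set targetList to characters of enGroup

set theString to "спутник"
set ASTID to AppleScript's text item delimiters
considering case
	repeat with i from 1 to count of sourceList
		set theString to my findReplace(item i of sourceList, item i of targetList, theString)
	end repeat
end considering
set AppleScript's text item delimiters to ASTID
-- theString is now "sputnik"

on findReplace(findText, replaceText, sourceText)
	set AppleScript's text item delimiters to findText
	set sourceText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	return sourceText as text
end findReplace
```

Both strings must have the same number of characters, otherwise the letters after the gap shift against each other.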

I don’t know the reason for such stubbornness on the part of Russian sites, but you rarely find Russian (Cyrillic) text there in the more universal encodings such as MacRoman, UTF-8, or UTF-16. What especially kills me is that Russian sites don’t seem to like UTF-8, which was introduced for international convenience on the web.

So the main question is: in what encoding do you receive your Russian input text?

If it’s not MacRoman, UTF-8, or UTF-16, then one important step is needed before replacing the Russian letters (as in Shane Stanley’s last script): transcoding the text, for example from IBM866 (or another encoding) to MacRoman (or another, such as UTF-8 in your case).

Here is one example (I wrote it some time ago) for transcoding IBM866 Cyrillic text to MacRoman Cyrillic text:


use AppleScript version "2.4"
use framework "Foundation"
use framework "CoreFoundation"
use scripting additions
property NSMacRomanStringEncoding : a reference to 30

set str to "熂†´™† èÆ´‚†¢™† (à.䆢†´•‡®§ß•, 1936).srt"

-- get appropriate NSStringEncoding from known kCFStringEncoding (IBM866 here)
set theEncoding to current application's CFString's CFStringConvertEncodingToNSStringEncoding(current application's kCFStringEncodingDOSRussian)
--> 2.147484699E+9

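-- get the raw bytes of the garbled text by encoding it as MacRoman
set theData to ((current application's NSString)'s stringWithString:str)'s dataUsingEncoding:NSMacRomanStringEncoding

-- decode those bytes as IBM866 to recover the readable Cyrillic text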
return (current application's NSString's alloc()'s initWithData:theData encoding:theEncoding) as text
--> "Наталка Полтавка (И.Кавалеридзе, 1936).srt"

Great! Thanks, Stanley, 99% success


set theString to "Брин Сергей Михайлович"

set RULow1 to "а"
set RULow2 to "б"
set RULow3 to "в"
set RULow4 to "г"
set RULow5 to "д"
set RULow6 to "е"
set RULow7 to "ё"
set RULow8 to "ж"
set RULow9 to "з"
set RULow10 to "и"
set RULow11 to "й"
set RULow12 to "к"
set RULow13 to "л"
set RULow14 to "м"
set RULow15 to "н"
set RULow16 to "о"
set RULow17 to "п"
set RULow18 to "р"
set RULow19 to "с"
set RULow20 to "т"
set RULow21 to "у"
set RULow22 to "ф"
set RULow23 to "х"
set RULow24 to "ц"
set RULow25 to "ч"
set RULow26 to "ш"
set RULow27 to "щ"
set RULow28 to "ъ"
set RULow29 to "ы"
set RULow30 to "ь"
set RULow31 to "э"
set RULow32 to "ю"
set RULow33 to "я"
--
set RUCap1 to "А"
set RUCap2 to "Б"
set RUCap3 to "В"
set RUCap4 to "Г"
set RUCap5 to "Д"
set RUCap6 to "Е"
set RUCap7 to "Ё"
set RUCap8 to "Ж"
set RUCap9 to "З"
set RUCap10 to "И"
set RUCap11 to "Й"
set RUCap12 to "К"
set RUCap13 to "Л"
set RUCap14 to "М"
set RUCap15 to "Н"
set RUCap16 to "О"
set RUCap17 to "П"
set RUCap18 to "Р"
set RUCap19 to "С"
set RUCap20 to "Т"
set RUCap21 to "У"
set RUCap22 to "Ф"
set RUCap23 to "Х"
set RUCap24 to "Ц"
set RUCap25 to "Ч"
set RUCap26 to "Ш"
set RUCap27 to "Щ"
set RUCap28 to "Э"
set RUCap29 to "Ю"
set RUCap30 to "Я"
--
set ENLow1 to "a"
set ENLow2 to "b"
set ENLow3 to "v"
set ENLow4 to "g"
set ENLow5 to "d"
set ENLow6 to "e"
set ENLow7 to "ye"
set ENLow8 to "zh"
set ENLow9 to "z"
set ENLow10 to "i"
set ENLow11 to "y"
set ENLow12 to "k"
set ENLow13 to "l"
set ENLow14 to "m"
set ENLow15 to "n"
set ENLow16 to "o"
set ENLow17 to "p"
set ENLow18 to "r"
set ENLow19 to "s"
set ENLow20 to "t"
set ENLow21 to "u"
set ENLow22 to "f"
set ENLow23 to "kh"
set ENLow24 to "ts"
set ENLow25 to "ch"
set ENLow26 to "sh"
set ENLow27 to "sch"
set ENLow28 to ""
set ENLow29 to "y"
set ENLow30 to ""
set ENLow31 to "e"
set ENLow32 to "yu"
set ENLow33 to "ya"
--
set ENCap1 to "A"
set ENCap2 to "B"
set ENCap3 to "V"
set ENCap4 to "G"
set ENCap5 to "D"
set ENCap6 to "E"
set ENCap7 to "Ye"
set ENCap8 to "Zh"
set ENCap9 to "Z"
set ENCap10 to "I"
set ENCap11 to "Y"
set ENCap12 to "K"
set ENCap13 to "L"
set ENCap14 to "M"
set ENCap15 to "N"
set ENCap16 to "O"
set ENCap17 to "P"
set ENCap18 to "R"
set ENCap19 to "S"
set ENCap20 to "T"
set ENCap21 to "U"
set ENCap22 to "F"
set ENCap23 to "Kh"
set ENCap24 to "Ts"
set ENCap25 to "Ch"
set ENCap26 to "Sh"
set ENCap27 to "Sch"
set ENCap28 to "E"
set ENCap29 to "Yu"
set ENCap30 to "Ya"
--

set sourceList to {RULow1, RULow2, RULow3, RULow4, RULow5, RULow6, RULow7, RULow8, ¬
	RULow9, RULow10, RULow11, RULow12, RULow13, RULow14, RULow15, RULow16, RULow17, ¬
	RULow18, RULow19, RULow20, RULow21, RULow22, RULow23, RULow24, RULow25, RULow26, ¬
	RULow27, RULow28, RULow29, RULow30, RULow31, RULow32, RULow33, RUCap1, RUCap2, ¬
	RUCap3, RUCap4, RUCap5, RUCap6, RUCap7, RUCap8, RUCap9, RUCap10, RUCap11, RUCap12, ¬
	RUCap13, RUCap14, RUCap15, RUCap16, RUCap17, RUCap18, RUCap19, RUCap20, RUCap21, ¬
	RUCap22, RUCap23, RUCap24, RUCap25, RUCap26, RUCap27, RUCap28, RUCap29, RUCap30}
set targetList to {ENLow1, ENLow2, ENLow3, ENLow4, ENLow5, ENLow6, ENLow7, ENLow8, ENLow9, ¬
	ENLow10, ENLow11, ENLow12, ENLow13, ENLow14, ENLow15, ENLow16, ENLow17, ENLow18, ENLow19, ¬
	ENLow20, ENLow21, ENLow22, ENLow23, ENLow24, ENLow25, ENLow26, ENLow27, ENLow28, ENLow29, ¬
	ENLow30, ENLow31, ENLow32, ENLow33, ENCap1, ENCap2, ENCap3, ENCap4, ENCap5, ENCap6, ENCap7, ¬
	ENCap8, ENCap9, ENCap10, ENCap11, ENCap12, ENCap13, ENCap14, ENCap15, ENCap16, ENCap17, ¬
	ENCap18, ENCap19, ENCap20, ENCap21, ENCap22, ENCap23, ENCap24, ENCap25, ENCap26, ENCap27, ¬
	ENCap28, ENCap29, ENCap30}

set ASTID to AppleScript's text item delimiters
repeat with i from 1 to count of sourceList
	set theString to findReplace(item i of sourceList, item i of targetList, theString)
end repeat
set AppleScript's text item delimiters to ASTID

return theString's word 2 & "." & theString's word 1

on findReplace(findText, replaceText, sourceText)
	set AppleScript's text item delimiters to findText
	set sourceText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	return sourceText as text
end findReplace
-->"sergey.brin"

For some reason, the capital letters are not translated correctly. Did I do something wrong?

If I swap the Low and Cap variables in the lists, the result comes out in capital letters.
Is it possible to fix this?

Another question: is it possible to work with the selected text in any application using AppleScript, or only through Automator.app?

KniazidisR, I copy the text from the application; it is readable, and no encoding conversion is required. Thanks for the decoding example.

P.S. Have you seen this movie?

AppleScript ignores case by default. Wrap your repeat loop in a considering case … end considering block.
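
To illustrate the difference, here is a small sketch that counts how many pieces a text splits into with and without case sensitivity. The sample text and variable names are made up for the illustration:

```applescript
set sample to "Щука щука"
set ASTID to AppleScript's text item delimiters
set AppleScript's text item delimiters to "щ"

-- Default: case is ignored, so "щ" also matches the capital "Щ".
set looseParts to count of text items of sample

-- With "considering case", only the lowercase "щ" matches.
considering case
	set strictParts to count of text items of sample
end considering

set AppleScript's text item delimiters to ASTID
-- looseParts is 3 (two matches), strictParts is 2 (one match)
```

This is why the lowercase entries, which come first in the lists, also replace the capital letters unless case is considered.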

An app has to support working with the selected text. Some text editors do, including BBEdit and SubEthaEdit. BBEdit has both a paid and a free version, and SubEthaEdit is free to use. Both also offer a command-line tool.

It works!


set theString to "Брин Сергей Михайлович"

set RULow1 to "а"
set RULow2 to "б"
set RULow3 to "в"
set RULow4 to "г"
set RULow5 to "д"
set RULow6 to "е"
set RULow7 to "ё"
set RULow8 to "ж"
set RULow9 to "з"
set RULow10 to "и"
set RULow11 to "й"
set RULow12 to "к"
set RULow13 to "л"
set RULow14 to "м"
set RULow15 to "н"
set RULow16 to "о"
set RULow17 to "п"
set RULow18 to "р"
set RULow19 to "с"
set RULow20 to "т"
set RULow21 to "у"
set RULow22 to "ф"
set RULow23 to "х"
set RULow24 to "ц"
set RULow25 to "ч"
set RULow26 to "ш"
set RULow27 to "щ"
set RULow28 to "ъ"
set RULow29 to "ы"
set RULow30 to "ь"
set RULow31 to "э"
set RULow32 to "ю"
set RULow33 to "я"
--
set RUCap1 to "А"
set RUCap2 to "Б"
set RUCap3 to "В"
set RUCap4 to "Г"
set RUCap5 to "Д"
set RUCap6 to "Е"
set RUCap7 to "Ё"
set RUCap8 to "Ж"
set RUCap9 to "З"
set RUCap10 to "И"
set RUCap11 to "Й"
set RUCap12 to "К"
set RUCap13 to "Л"
set RUCap14 to "М"
set RUCap15 to "Н"
set RUCap16 to "О"
set RUCap17 to "П"
set RUCap18 to "Р"
set RUCap19 to "С"
set RUCap20 to "Т"
set RUCap21 to "У"
set RUCap22 to "Ф"
set RUCap23 to "Х"
set RUCap24 to "Ц"
set RUCap25 to "Ч"
set RUCap26 to "Ш"
set RUCap27 to "Щ"
set RUCap28 to "Э"
set RUCap29 to "Ю"
set RUCap30 to "Я"
--
set ENLow1 to "a"
set ENLow2 to "b"
set ENLow3 to "v"
set ENLow4 to "g"
set ENLow5 to "d"
set ENLow6 to "e"
set ENLow7 to "ye"
set ENLow8 to "zh"
set ENLow9 to "z"
set ENLow10 to "i"
set ENLow11 to "y"
set ENLow12 to "k"
set ENLow13 to "l"
set ENLow14 to "m"
set ENLow15 to "n"
set ENLow16 to "o"
set ENLow17 to "p"
set ENLow18 to "r"
set ENLow19 to "s"
set ENLow20 to "t"
set ENLow21 to "u"
set ENLow22 to "f"
set ENLow23 to "kh"
set ENLow24 to "ts"
set ENLow25 to "ch"
set ENLow26 to "sh"
set ENLow27 to "sch"
set ENLow28 to ""
set ENLow29 to "y"
set ENLow30 to ""
set ENLow31 to "e"
set ENLow32 to "yu"
set ENLow33 to "ya"
--
set ENCap1 to "A"
set ENCap2 to "B"
set ENCap3 to "V"
set ENCap4 to "G"
set ENCap5 to "D"
set ENCap6 to "E"
set ENCap7 to "Ye"
set ENCap8 to "Zh"
set ENCap9 to "Z"
set ENCap10 to "I"
set ENCap11 to "Y"
set ENCap12 to "K"
set ENCap13 to "L"
set ENCap14 to "M"
set ENCap15 to "N"
set ENCap16 to "O"
set ENCap17 to "P"
set ENCap18 to "R"
set ENCap19 to "S"
set ENCap20 to "T"
set ENCap21 to "U"
set ENCap22 to "F"
set ENCap23 to "Kh"
set ENCap24 to "Ts"
set ENCap25 to "Ch"
set ENCap26 to "Sh"
set ENCap27 to "Sch"
set ENCap28 to "E"
set ENCap29 to "Yu"
set ENCap30 to "Ya"
--

set sourceList to {RULow1, RULow2, RULow3, RULow4, RULow5, RULow6, RULow7, RULow8, ¬
	RULow9, RULow10, RULow11, RULow12, RULow13, RULow14, RULow15, RULow16, RULow17, ¬
	RULow18, RULow19, RULow20, RULow21, RULow22, RULow23, RULow24, RULow25, RULow26, ¬
	RULow27, RULow28, RULow29, RULow30, RULow31, RULow32, RULow33, RUCap1, RUCap2, ¬
	RUCap3, RUCap4, RUCap5, RUCap6, RUCap7, RUCap8, RUCap9, RUCap10, RUCap11, RUCap12, ¬
	RUCap13, RUCap14, RUCap15, RUCap16, RUCap17, RUCap18, RUCap19, RUCap20, RUCap21, ¬
	RUCap22, RUCap23, RUCap24, RUCap25, RUCap26, RUCap27, RUCap28, RUCap29, RUCap30}
set targetList to {ENLow1, ENLow2, ENLow3, ENLow4, ENLow5, ENLow6, ENLow7, ENLow8, ENLow9, ¬
	ENLow10, ENLow11, ENLow12, ENLow13, ENLow14, ENLow15, ENLow16, ENLow17, ENLow18, ENLow19, ¬
	ENLow20, ENLow21, ENLow22, ENLow23, ENLow24, ENLow25, ENLow26, ENLow27, ENLow28, ENLow29, ¬
	ENLow30, ENLow31, ENLow32, ENLow33, ENCap1, ENCap2, ENCap3, ENCap4, ENCap5, ENCap6, ENCap7, ¬
	ENCap8, ENCap9, ENCap10, ENCap11, ENCap12, ENCap13, ENCap14, ENCap15, ENCap16, ENCap17, ¬
	ENCap18, ENCap19, ENCap20, ENCap21, ENCap22, ENCap23, ENCap24, ENCap25, ENCap26, ENCap27, ¬
	ENCap28, ENCap29, ENCap30}

set ASTID to AppleScript's text item delimiters
repeat with i from 1 to count of sourceList
	considering case
		set theString to findReplace(item i of sourceList, item i of targetList, theString)
	end considering
end repeat
set AppleScript's text item delimiters to ASTID

return theString's word 2 & "." & theString's word 1

on findReplace(findText, replaceText, sourceText)
	set AppleScript's text item delimiters to findText
	set sourceText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	return sourceText as text
end findReplace
--> "Sergey.Brin"

Thanks, Stanley. Without your support I would still be getting the wrong result.
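
For anyone who wants a shorter script, the same table can probably be written without the individual variables. This is a sketch only, using the same letter mapping as above; the handler name transliterate and the variable names are hypothetical:

```applescript
on transliterate(sourceText)
	-- Parallel lists: item i of cyrillicLetters maps to item i of latinLetters.
	set cyrillicLetters to (characters of "абвгдеёжзийклмнопрстуфхцчшщъыьэюя") & ¬
		(characters of "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ")
	set latinLetters to {"a", "b", "v", "g", "d", "e", "ye", "zh", "z", "i", "y", ¬
		"k", "l", "m", "n", "o", "p", "r", "s", "t", "u", "f", "kh", "ts", "ch", ¬
		"sh", "sch", "", "y", "", "e", "yu", "ya"} & ¬
		{"A", "B", "V", "G", "D", "E", "Ye", "Zh", "Z", "I", "Y", "K", "L", "M", ¬
			"N", "O", "P", "R", "S", "T", "U", "F", "Kh", "Ts", "Ch", "Sh", "Sch", ¬
			"E", "Yu", "Ya"}
	
	set ASTID to AppleScript's text item delimiters
	considering case
		repeat with i from 1 to count of cyrillicLetters
			-- Split on the Cyrillic letter, then rejoin with its Latin replacement.
			set AppleScript's text item delimiters to item i of cyrillicLetters
			set parts to text items of sourceText
			set AppleScript's text item delimiters to item i of latinLetters
			set sourceText to parts as text
		end repeat
	end considering
	set AppleScript's text item delimiters to ASTID
	return sourceText
end transliterate

set latinName to transliterate("Брин Сергей Михайлович")
set shortName to (word 2 of latinName) & "." & (word 1 of latinName)
-- latinName is "Brin Sergey Mikhaylovich", shortName is "Sergey.Brin"
```

The two lists have to stay the same length (63 items here), so adding a letter means adding it to both.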