I am generating internal e-mail addresses from text files with strings.
I would like to “clean” these strings of invalid characters, e.g., spaces or 8-bit characters (such as é). For example, Mac Scriptér should become MacScripter or MacScriptr.
It doesn’t have to be super stable or handle every weird corner case one could imagine. Any ideas on how to implement this, or is there something I could use as is?
I don’t imagine that this would be exhaustive but it should deal with any diacriticals you may encounter. It (as far as I can tell) will replace any accented character with its unaccented counterpart.
For whitespace, you would have to do something additional but you didn’t say how these characters should be treated. You might want to look up the standard for email addresses which will provide a complete list of usable characters.
Update: I guess you did mention that you wanted to delete spaces so I added deletion of spaces and tabs.
set goodList to "abcdefghijklmnopqrstuvwxyz0123456789@._-"
set test to "maç scrïptér 2023@macscripter.com"
set AppleScript's text item delimiters to {space, tab}
set testList to text items of test
set AppleScript's text item delimiters to ""
set test to testList as text
set charList to characters of test as list
--> {"m", "a", "ç", "s", "c", "r", "ï", "p", "t", "é", "r", "2", "0", "2", "3", "@", "m", "a", "c", "s", "c", "r", "i", "p", "t", "e", "r", ".", "c", "o", "m"}
repeat with listpos from 1 to count of test
considering diacriticals
set eachChar to contents of item listpos of test
if eachChar is not in goodList then
ignoring diacriticals
set gc to character (offset of eachChar in goodList) of goodList
set item listpos of charList to gc
end ignoring
end if
end considering
end repeat
charList as text
--> "macscripter2023@macscripter.com"
You could use the tr shell utility. It has many options, and the following is only a simple example. Depending on your requirements, you can either include specified characters, exclude specified characters, or both.
set theString to "Mac Scriptér 1@gmail.com"
set includeCharacters to "A-Za-z0-9@."
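-- quoted form keeps the shell from interpreting spaces or special characters in the input
set cleanedText to do shell script "echo " & quoted form of theString & " | tr -cd " & quoted form of includeCharacters
return cleanedText --> "MacScriptr1@gmail.com"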
set theString to "Mac Scriptér 1@gmail.com"
set excludeCharacters to " é"
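-- quoted form keeps the shell from interpreting spaces or special characters in the input
set cleanedText to do shell script "echo " & quoted form of theString & " | tr -d " & quoted form of excludeCharacters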
return cleanedText --> "MacScriptr1@gmail.com"
But JavaScript has a function for that, and you can run it from AppleScriptObjC.
--------------------------------------------------------
# Auth: Christopher Stone <scriptmeister@thestoneforge.com>
# : building upon work by @ComplexPoint
# dCre: 2023/03/27 20:03
# dMod: 2023/03/27 20:03
# Appl: AppleScriptObjC & JavaScript
# Task: Normalize Diacritical Strings and Remove Spaces.
# Libs: None
# Osax: None
# URLs: https://forum.keyboardmaestro.com/t/using-javascript-for-automation-from-applescript-and-vice-versa/4054?u=ccstone
# Tags: @Applescript, @Script, @ASObjC, @Normalize, @Diacriticals, @Remove, @Spaces
--------------------------------------------------------
use AppleScript version "2.4" --» Yosemite or later
use framework "Foundation"
use framework "OSAKit"
use scripting additions
--------------------------------------------------------
set dataStr to "àáâãäå ÈÉÊË ÀÁÂÃÄÅ"
set jsCmdStr to "
(() => {
let inputStr = '" & dataStr & "';
let outPutStr = inputStr.normalize('NFKD').replace(/[^\\w]/g, '');
return outPutStr;
})();
"
set strEncoded to evalOSA("JavaScript", jsCmdStr)
--------------------------------------------------------
--» HANDLERS
--------------------------------------------------------
# evalOSA :: ("JavaScript" | "AppleScript") -> String -> String
--------------------------------------------------------
on evalOSA(strLang, strCode)
set ca to current application
set oScript to ca's OSAScript's alloc's initWithSource:strCode ¬
language:(ca's OSALanguage's languageForName:(strLang))
set {blnCompiled, oError} to oScript's compileAndReturnError:(reference)
if blnCompiled then
set {oDesc, oError} to oScript's executeAndReturnError:(reference)
if (oError is missing value) then return oDesc's stringValue as text
end if
return oError's NSLocalizedDescription as text
end evalOSA
--------------------------------------------------------
Alt-Title == Run JavaScript Directly from AppleScript
When comparing texts, you can have the script take into account (or ignore) certain types of text, including diacriticals (and also case, hyphens, numeric strings, punctuation and white space).
For example, the script asks whether é = e; while considering diacriticals, they are not equal. The if…then statement then finds the offset of the character in the clean characters while ignoring diacriticals, and uses the character at that offset as the replacement. You can get more details in the Language Guide. If the outer block were ignoring instead of considering, the test would skip over the é because it would treat it as equal to the e.
What the offset line does is find the appropriate clean character (in this case, an ‘e’) and determine its offset (5) in the string of good characters and then get the letter at that offset (e).
Here is a more focused example:
set initialStr to "béd"
set goodStr to "abcde"
considering diacriticals
-- is é in 'abcde'
character 2 of initialStr is in goodStr
end considering
--> false
-- if false then…
ignoring diacriticals
set x to character 2 of initialStr
--> offset of 'e' in 'abcde'
character (offset of x in goodStr) in goodStr
end ignoring
--> character at offset 5 in goodStr
--> e
So, the script substitutes an unadorned ‘e’ for any accented ‘e’, including ‘éëèê’. The letter ‘a’ takes all of the same accents plus a tilde (‘ã’), giving ‘áäàâã’, but you don’t need to list them, as they should all be swapped.
So in the full script, the following test string should return this:
set test to "including -éëèê- and -áäàâã-"
--> "including-eeee-and-aaaaa-"
-- requires macOS 10.11 or later
use framework "Foundation"
use scripting additions
set dataStr to "àáâãäå ÈÉÊË ÀÁÂÃÄÅ"
set theString to current application's NSString's stringWithString:dataStr
set theString to theString's stringByApplyingTransform:(current application's NSStringTransformStripDiacritics) |reverse|:false
return theString as text
Or:
use framework "Foundation"
use scripting additions
set dataStr to "àáâãäå ÈÉÊË ÀÁÂÃÄÅ"
set theString to current application's NSString's stringWithString:dataStr
set theString to theString's stringByFoldingWithOptions:(current application's NSDiacriticInsensitiveSearch) locale:(current application's NSLocale's currentLocale())
return theString as text
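Neither Foundation snippet above removes spaces or other characters that are invalid in an address. A minimal sketch of one way to combine folding and filtering follows. The handler name cleanForEmail and the whitelist are hypothetical, and it assumes that stringByApplyingTransform accepts the ICU transform ID "Latin-ASCII" as a plain string (which should also map letters such as ð and æ to ASCII); that has not been tested here.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical helper, not from the thread: fold to ASCII, then drop anything outside a whitelist
on cleanForEmail(rawText)
	set theString to current application's NSString's stringWithString:rawText
	-- Assumption: "Latin-ASCII" is an ICU transform ID accepted by this method
	set theString to theString's stringByApplyingTransform:"Latin-ASCII" |reverse|:false
	if theString is missing value then error "The transform could not be applied"
	-- Remove every character that is not a letter, digit, @, dot, underscore or hyphen
	set theString to theString's stringByReplacingOccurrencesOfString:"[^A-Za-z0-9@._-]" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
	return theString as text
end cleanForEmail

cleanForEmail("Verkís Verkfræðistofa ™️")
```

Filtering with a whitelist after the transform means anything the transform leaves untouched is simply deleted rather than causing an error.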
I used your example, slightly modified, like this:
set theResponse to display dialog "Name?" default answer name with icon note buttons {"Cancel", "Continue"} default button "Continue"
(input data)
set theResponse to fixString(theResponse)
(call to the method)
on fixString(mystring)
set goodList to "abcdefghijklmnopqrstuvwxyz0123456789@._-"
set test to "maç scrïptér 2023@macscripter.com"
set AppleScript's text item delimiters to {space, tab}
set mystrings to text items of mystring
set AppleScript's text item delimiters to ""
set mystring to mystrings as text
set charList to characters of mystring as list
repeat with listpos from 1 to count of mystring
considering diacriticals
set eachChar to contents of item listpos of mystring
if eachChar is not in goodList then
ignoring diacriticals
set gc to character (offset of eachChar in goodList) of goodList
set item listpos of charList to gc
end ignoring
end if
end considering
end repeat
return charList as text
end fixString
When I later did this:
set arrName to (text returned of theResponse)
The result was really weird. If I enter alice as input arrName is Continuealice and I get an error:
error “Can’t get text returned of "Continuealice".” number -1728 from text returned of “Continuealice”.
If I skip the call to fixString everything works as expected, that is, arrName is a string set to alice.
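A likely cause, offered as a sketch rather than a confirmed diagnosis: display dialog returns a record (button returned plus text returned), and the call above passes that whole record to fixString. The handler appears to flatten the record into the text "Continuealice", and because the result is stored back in theResponse, there is no longer a record to take text returned from. Passing only the typed text avoids both problems. The snippet assumes the fixString handler from this post is present in the same script, and uses an empty default answer for illustration.

```applescript
set theResponse to display dialog "Name?" default answer "" with icon note buttons {"Cancel", "Continue"} default button "Continue"
-- Extract the typed text from the dialog record first, then clean only that text
set arrName to fixString(text returned of theResponse)
```

With this change theResponse stays a record, so it can still be queried for button returned later on.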
Your solution has worked fine for several months, but recently I bumped into two strings that it couldn’t handle. Not sure why.
Verkís Verkfræðistofa
and
™️
I don’t understand the problem with the first one; it looks like quite regular characters. The second one is obviously a “special character”. Any solution works for that: either delete it or convert it to tm.
Hmm… those don’t actually qualify as diacriticals, so yeah, they won’t be handled by my script.
ð is actually an unadorned letter (eth), albeit one of narrow usage. Apparently, it drifted out of the English language over a thousand years ago. Its upper case version is Ð. You’d have to decide what you would like to replace it with, as it’s not a member of what Apple considers to be the diacritical class, at least so far as I can tell.
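The script presumably fails on such letters because offset returns 0 when a character has no counterpart in the good list, even while ignoring diacriticals, and character 0 does not exist. A minimal sketch of a more forgiving variant follows; the handler name fixStringSafely is hypothetical, and dropping unmatched characters is only one possible policy.

```applescript
-- Hypothetical variant of the earlier script: characters with no match in goodList are dropped
-- instead of raising an error. Note that, unlike the original, it also lowercases the result,
-- because every kept character is taken from goodList.
on fixStringSafely(rawText)
	set goodList to "abcdefghijklmnopqrstuvwxyz0123456789@._-"
	set cleanChars to {}
	repeat with charRef in (characters of rawText)
		set eachChar to contents of charRef
		ignoring diacriticals
			set pos to offset of eachChar in goodList
		end ignoring
		-- pos is 0 for spaces and for anything without a plain counterpart in goodList
		if pos > 0 then set end of cleanChars to character pos of goodList
	end repeat
	-- Join the kept characters without disturbing the caller's delimiters
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ""
	set cleanText to cleanChars as text
	set AppleScript's text item delimiters to savedDelimiters
	return cleanText
end fixStringSafely

fixStringSafely("Verkís Verkfræðistofa")
```

If a specific substitute is preferred (for example "d" for ð), those characters would need to be replaced before calling the handler.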
As for the trademark symbol, it can be replaced with any variation of the standard replace methods using text item delimiters (see below), but you should note whether the replacement should include a space or not. Many uses of it have it flush against the preceding text, and it would be odd to just append ‘tm’ to a word.
I’d probably put this inside a handler but this is the basic idea.
set bigText to "and ™️"
set AppleScript's text item delimiters to "™️"
set ti to text items of bigText
set AppleScript's text item delimiters to "tm"
set xt to ti as text
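Wrapped in a handler, the same idea might look like the sketch below. The handler name replaceText and its parameter names are hypothetical; the one addition is that it saves and restores the text item delimiters, so the rest of the script is not affected.

```applescript
-- Hypothetical handler: replaces every occurrence of findText in sourceText
on replaceText(sourceText, findText, replacementText)
	set savedDelimiters to AppleScript's text item delimiters
	-- Split the text wherever findText occurs
	set AppleScript's text item delimiters to findText
	set textParts to text items of sourceText
	-- Rejoin the pieces with the replacement in between
	set AppleScript's text item delimiters to replacementText
	set resultText to textParts as text
	set AppleScript's text item delimiters to savedDelimiters
	return resultText
end replaceText

replaceText("and ™️", "™️", "tm")
```

Passing "" as the replacement deletes the symbol instead of converting it.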
No, in English at least, æ is a ligature, although technically, the ligature is where the letters are joined. On a Mac, you can type it with option-'.
From a quick lookup on Wikipedia, in Icelandic (and a few other languages), it is apparently its own letter. It seems that on an Icelandic keyboard, it has its own key: the semicolon key on an English keyboard (key code 41).
What would you want to do with it?
Also, in addition to Icelandic, ð appears in Old English. There seems to be a fair amount of overlap between Scandinavian languages and Old English. Lots of boots walking around England back in the day.