Hi everyone,
I wrote the following function to replace all non-alphanumeric characters with a space, but when I started using it on somewhat bigger text files (around 16K) it became obvious that it is extremely slow.
It just looks at each character and adds it to a new string if it is allowed; otherwise it replaces it with a space.
So, am I doing something completely wrong? It's a straightforward operation, but processing time seems to grow far faster than the length of the text: a 2K text takes 15 seconds, and for a 16K text I stopped waiting after 20 minutes (not exaggerating). Go figure. When I do this in PHP, for instance, it is done in under 1 second.
-- replaces non-alphanumeric characters in source_text with a space and collapses duplicate spaces
-- note: returns an empty string when the input contains no alphanumeric characters
on fn_normalize(source_text)
    -- check input
    try
        set source_text to source_text as string
    on error
        return false
    end try
    if length of source_text < 1 then return ""
    -- replace every character that is not plain alphanumeric with a space
    set allowed_characters to "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
    -- build result text from allowed characters
    set result_string to ""
    set last_insert_is_space to true
    -- step through input characters
    repeat with single_character in the characters of source_text
        if ((offset of single_character in allowed_characters) > 0) then
            set result_string to result_string & single_character
            set last_insert_is_space to false
        else
            if last_insert_is_space is false then
                set result_string to result_string & space
                set last_insert_is_space to true
            end if
        end if
    end repeat
    -- trim the single trailing space, if any
    -- (result_string is empty when no character was allowed, so check before indexing)
    if result_string is not "" and result_string ends with space then
        set result_string to text 1 thru -2 of result_string
    end if
    return result_string
end fn_normalize
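
For comparison, here is a rough sketch of a variant that might avoid the slowdown. It rests on assumptions I have not confirmed: that the time goes into re-evaluating "characters of source_text" on every pass of the repeat loop and into the repeated "offset" calls, that only plain ASCII letters and digits count as allowed, and that the "id" property of text is available (AppleScript 2.0 or later). The names fn_normalize_fast and holder are made up for this illustration.

```applescript
-- Sketch only: same output rules as fn_normalize, but works on a list of
-- Unicode code points and converts back to text once at the end.
on fn_normalize_fast(source_text)
    try
        set source_text to source_text as string
    on error
        return false
    end try
    if source_text is "" then return ""
    -- "id of" gives a list of code points, or a bare integer for a
    -- one-character string, so coerce the result to a list
    set code_points to (id of source_text) as list
    -- a script object lets the loop reach the lists by reference, which is
    -- usually much faster than indexing a plain local list variable
    script holder
        property codes_in : code_points
        property codes_out : {}
    end script
    set last_insert_is_space to true
    repeat with i from 1 to (count of holder's codes_in)
        set c to item i of holder's codes_in
        -- assumed allowed ranges: 0-9 (48-57), A-Z (65-90), a-z (97-122)
        if (c >= 48 and c <= 57) or (c >= 65 and c <= 90) or (c >= 97 and c <= 122) then
            set end of holder's codes_out to c
            set last_insert_is_space to false
        else if not last_insert_is_space then
            set end of holder's codes_out to 32 -- code point of a space
            set last_insert_is_space to true
        end if
    end repeat
    if (count of holder's codes_out) is 0 then return ""
    -- a trailing space can only follow an allowed character, so the list
    -- has at least two items whenever this branch runs
    if last_insert_is_space then
        set holder's codes_out to items 1 thru -2 of holder's codes_out
    end if
    return string id (holder's codes_out)
end fn_normalize_fast
```

Called as fn_normalize_fast("Hello, world!"), this is meant to return the same text as fn_normalize would for ASCII input. I have not timed it, so treat it as a starting point rather than a measured fix.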