finding words

code_monkey · November 18, 2005, 11:56pm

I want to write this script, but I can't seem to get started. I want the script to find the number of times I use a word starting with certain letters, and then print those words. For example, if I set it to find the words beginning with “fr”, I might get french, free, freckle, depending on what I wrote. Is this possible?

Bruce_Phillips · November 19, 2005, 12:22am

Where are these words coming from?

code_monkey · November 19, 2005, 12:43am

This would be any text document, selected by the user.

Vincent · November 19, 2005, 12:58am

I think it would look something like this:

set thetext to "foobar bazblee foobar"
set allwords to (every word of thetext)
set wordcount to 0
repeat with i in my allwords
	if i starts with "foo" then set wordcount to wordcount + 1
end repeat
wordcount

You could also say:

.
if i contains "foo" then set wordcount to wordcount + 1
.

Or you could say:

.
if i ends with "foo" then set wordcount to wordcount + 1
.

Craig_Smith · November 19, 2005, 1:16am

Starting with an empty list, and then filling it as you go will give the words themselves, which is what I believe you are interested in:

set thetext to "foobar bazblee foobar fobar"
set sel_words to {}
set allwords to (every word of thetext)
set wordcount to 0
repeat with i in my allwords
	if i starts with "fo" then
		set end of sel_words to i as Unicode text -- If the word fits the test, add it to the list
		set wordcount to wordcount + 1 -- Counts the total words that fit the test
	end if
end repeat
{wordcount, sel_words} -- The count first, then the list of matching words

Bruce_Phillips · November 19, 2005, 4:11am

Would something like this work?

tell application "TextEdit"
	activate
	choose file with prompt "Get word info for this file:" without invisibles
	open (result)
	
	set theWords to every word of front document whose first character is "f" and second character is "r"
	
	close front document
	quit
end tell

return {wordCount:(count theWords), wordList:theWords}

hhas · November 19, 2005, 10:51am

Simpler:

	set theWords to every word of front document where it starts with "fr"

Another option is to use regular expressions, e.g. using TextCommands:

tell application "TextCommands"
    search txt for "\\b(fr.*?)\\b" with regex
end tell

BTW, be aware that different tools may follow different rules in determining what constitutes a ‘word’.
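
To make that last point concrete, here is a small Python sketch (not from any of the tools above; the sample sentence and the three splitting rules are just assumptions picked for illustration). It splits the same text three ways and shows that the "fr" words you get back depend on the rule:

```python
import re

# Hypothetical sample text, chosen to contain a hyphen, an apostrophe and a full stop.
text = "Free-range frogs don't fret."

# Rule 1: split on whitespace only, so punctuation stays attached to the word.
by_whitespace = text.split()

# Rule 2: runs of \w (letters, digits, underscore); hyphens and apostrophes split words.
by_word_chars = re.findall(r"\w+", text)

# Rule 3: runs of letters, keeping internal hyphens and apostrophes inside the word.
by_letters = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text)


def starting_with(words, prefix):
    """Return the words that start with prefix, ignoring case."""
    return [w for w in words if w.lower().startswith(prefix.lower())]


for name, words in [("whitespace", by_whitespace),
                    ("word chars", by_word_chars),
                    ("letters", by_letters)]:
    print(name, starting_with(words, "fr"))
```

The first rule reports "fret." with its full stop, the second breaks "Free-range" into two words, and the third keeps the hyphenated word whole, so it is worth checking which rule your tool applies before trusting the counts.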

HTH

Vincent · November 19, 2005, 1:49pm

This UNIX command searches for every word starting with foo and prints a list with the count:

set thetext to "dgdfgd-gfdgh dfgkdsgfdk ksufgsdvb fsdkfusfgsvb fsdkfusgd fbsdfik foobarskddfbksdf vsdfsdklofusdv foobarkdbs asdgsjadvas foobarskddfbksdf"

do shell script "echo " & quoted form of thetext & " | perl -pe 's/[^a-zA-Z]+/\\n/g' | grep '^foo' | sort -f | uniq -c"

CAUTION:
• It’s case sensitive!
• Hyphenated words are split into their parts!

Searching for words containing “foo”:

do shell script "echo " & quoted form of thetext & " | perl -pe 's/[^a-zA-Z]+/\\n/g' | grep 'foo' | sort -f | uniq -c"

Searching for words ending with “foo”:

do shell script "echo " & quoted form of thetext & " | perl -pe 's/[^a-zA-Z]+/\\n/g' | grep 'foo$' | sort -f | uniq -c"
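
For comparison, the same three searches (starts with, contains, ends with) can be sketched in Python. This is only an illustration, not a drop-in replacement for the shell pipeline: the function name `count_words`, its `mode` parameter and the sample text are all made up here, and words are taken to be runs of ASCII letters, so it is case sensitive and splits hyphenated words just like the commands above.

```python
import re
from collections import Counter


def count_words(text, fragment, mode="starts"):
    """Count words matching fragment; mode is "starts", "contains" or "ends"."""
    # Assumption: a word is a run of ASCII letters, so hyphenated words are split.
    words = re.findall(r"[A-Za-z]+", text)
    tests = {
        "starts": lambda w: w.startswith(fragment),
        "contains": lambda w: fragment in w,
        "ends": lambda w: w.endswith(fragment),
    }
    matches = tests[mode]
    return Counter(w for w in words if matches(w))


# Hypothetical sample text.
sample = "foobar bazfoo foobar xfoox bar-foo"

for mode in ("starts", "contains", "ends"):
    print(mode, dict(count_words(sample, "foo", mode)))
```

From AppleScript it could be run through do shell script in the same way as the perl one-liners, passing the text with quoted form of.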

Bruce_Phillips · November 19, 2005, 2:23pm

I knew there had to be a way to do that, but I was so tired I couldn’t think of it. Thanks for the input.

Vincent · November 19, 2005, 5:08pm

Why script a 3rd party app if you can do it the direct way (like the two methods I posted)?
Or does it have to be done in TextEdit?

hhas · November 19, 2005, 5:36pm

Here’s a version that’s not case-sensitive:

set theText to "froo bar baz
Frub fig froo
frub Frub frub"

set wordPattern to "fr.*?" -- careful

do shell script "echo " & quoted form of theText & " | perl -0777 -e " & quoted form of ("
$_ = <STDIN>;
@lst = sort {lc $a cmp lc $b} m/\\b(" & wordPattern & ")\\b/ig;
push @lst, '';
$wrd = shift @lst;
while (@lst) {
	$i = 1;
	$i++ while (lc($nxt = shift @lst) eq lc $wrd);
	print \"$i\\t$wrd\\n\";
	$wrd = $nxt;
}")

(* -->
"2	froo
4	Frub"
*)

Preserves the case of [the first instance of] each word found, which is why it’s a bit longer (uniq’s no good for this). It’s still not unicode-aware though (Perl’s unicode support is awkward and I can’t be bothered to sort it - I’d normally use Python anyway).
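
For what it's worth, a rough Python 3 sketch of the same idea might look like the following. It is an illustration under assumptions, not a tested replacement for the Perl: the function name `count_prefixed` is made up, a word is taken to be the prefix followed by any run of `\w` characters, and the spelling kept for each word is the first one seen in the text. Python 3 patterns on `str` are Unicode-aware by default, which takes care of accented letters.

```python
import re


def count_prefixed(text, prefix):
    """Return (count, word) pairs for words starting with prefix, ignoring case.

    Each word is reported with the spelling of its first occurrence in text.
    """
    # \b keeps the prefix at the start of a word; \w* takes the rest of the word.
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    counts = {}  # casefolded word -> [first spelling seen, count]
    for word in pattern.findall(text):
        entry = counts.setdefault(word.casefold(), [word, 0])
        entry[1] += 1
    # Sort alphabetically, ignoring case.
    return sorted(((n, w) for w, n in counts.values()),
                  key=lambda pair: pair[1].casefold())


the_text = """froo bar baz
Frub fig froo
frub Frub frub"""

for n, word in count_prefixed(the_text, "fr"):
    print(f"{n}\t{word}")
```

On the sample text this prints the same two lines as the Perl version (2 for froo, 4 for Frub).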

Bruce_Phillips · November 19, 2005, 7:13pm

Looping over a list may be slower; also, some may count shell scripts as using 3rd party apps. I was just offering an alternative.

Vincent · November 20, 2005, 2:16am

I was just curious because I always try to avoid “app scripting”.

I know, and my question wasn’t meant to be offensive in any way.