
I want a loop that can find the letter that ends words most frequently in multiple languages and output the data in columns. So far I have

count="./wordlist/french/fr.txt ./wordlist/spanish/es.txt ./wordlist/german/de.$
lang="French Spanish German Portuguese Italian"
(
echo -e "Language Letter Count"
for i in $count
do
    (for j in {a..z}
        do
            echo -e "LANG" $j $(grep -c $j\> $i)
        done
    ) | sort -k3 -rn | head -1
done
) | column -t

I want it to output as shown:


Language  Letter  Count
French     e       196195
Spanish    a       357193
German     e       251892
Portuguese a       217178
Italian    a       216125

Instead I get:


Language  Letter  Count
LANG      z       0
LANG      z       0
LANG      z       0
LANG      z       0
LANG      z       0

The words files have the format: Word Freq(#) where the word and its frequency are delimited by a space.

This means I have two problems. First, the grep command is not handling the argument $j\> to find a character at the end of a word; I have tried grep -E $j\> and grep '$j\>', and neither worked.

The second problem is that I don't know how to output the name of the language (in the variable lang). Nesting another for loop did not work when I tried it like this (or with i and k in the opposite order):


(
for i in $count
do
    for k in $lang
    do
        (for j in {a..z}
        do
             echo -e $k $j $(grep -c $j\> $i)
        done
        ) | sort -k3 -rn | head -1
done
done
) | column -t

This outputs the language name "$k" multiple times, in places where it does not belong.

I know that I can just copy and paste the loop for each language, but I would like to extend this to every language. Thanks in advance!

  • Can you paste a couple of lines from, say, two of the wordlist files to test against? Commented Oct 20, 2016 at 0:43
  • Even if this worked, wouldn't it output the wrong numbers? E.g. if your word-count file has three entries: is 1000; xertz 1; showbiz 1, the result would be z 2 (rather than s 1000). Commented Oct 20, 2016 at 7:22
  • Yeah, Umlaute, it would be z 2, which is what I want, since I want to count the frequency and display the character that most frequently ends a word within the file itself. And, roelofs, a sample of the file is shown here: de 1622928 je 1622619 est 1348809 pas 1128894 le 1093232; so within this file itself, e most commonly ends a word. Sorry for the confusion. Commented Oct 20, 2016 at 10:04

1 Answer


grep word boundaries

To make special delimiters (e.g. \> for word-end) survive the shell when calling egrep, you should put the pattern in quotes:

 count=$(egrep -c "${char}\>" "${file}")

By the way, you really should use double quotes ("), because single quotes prevent variable expansion. (E.g. after j="foo"; k='$j\>', the first character of k's value will be $ rather than f.)
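To see the difference in action, here is a throwaway sketch against a made-up /tmp file (the \> word-boundary operator assumes GNU grep):

```shell
# Tiny stand-in wordlist (invented data, just for illustration)
printf 'le 10\nest 5\npas 3\n' > /tmp/demo.txt
char=e

# Single quotes: grep receives the literal string ${char}\> -- 0 matches
grep -c '${char}\>' /tmp/demo.txt

# Double quotes: the shell expands the pattern to e\> -- matches "le" only
grep -c "${char}\>" /tmp/demo.txt
```

The first command prints 0 (and grep exits non-zero, since nothing matched); the second prints 1.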

Language name display

Getting the right language string is a bit more tricky; here's a few suggestions:

  • Derive the displayed language from the path of the wordlist:

    lang=${file%/*}
    lang=${lang##*/}
    

    With bash (though not with dash and some other shells) you might even do lang=${lang^} to capitalize the string.

  • Look up the proper language name in a dictionary. Bash 4 has associative arrays built in, but you can also use file-based dictionaries:

    $ cat languages.txt
    ./wordlist/french/fr.txt Français 
    ./wordlist/english/en.txt English
    ./wordlist/german/de.txt Deutsch
    
    $ file=./wordlist/french/fr.txt
    $ lang=$(egrep "^${file}\>" languages.txt | awk '{print $2}')
    
  • You can also iterate over file,lang pairs, e.g.

    languages="french/fr,French spanish/es,Español german/de,Deutsch"
    for l in $languages; do
       file=./wordlist/${l%,*}.txt
       lang=${l#*,}
       # ...
    done
    

Taking word frequencies into account

The third problem I see (though I might misunderstand the problem) is that you are not taking the word frequency into account. For example, a word A that is used 1000 times more often than a word B will still only get counted once (just like B).

You can use awk to sum up the word frequencies of matching words:

count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
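Alternatively, awk can do the whole per-letter tally in a single pass over the file, keying a sum array on each word's last character instead of running grep once per letter. A sketch against the sample lines quoted in the comments above:

```shell
# Sample lines from the comments above
printf 'de 1622928\nje 1622619\nest 1348809\npas 1128894\nle 1093232\n' > /tmp/fr-sample.txt

# Add $2 to a sum keyed by the final character of $1, then pick the biggest
awk '{ sum[substr($1, length($1), 1)] += $2 }
     END { for (c in sum) print c, sum[c] }' /tmp/fr-sample.txt |
  sort -k2 -rn | head -1
# -> e 4338779
```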

All Together Now

So a full solution to the problem could look like:

languages="french/fr,French spanish/es,Español german/de,Deutsch"

(
echo -e "Language Letter Count"
for l in ${languages}; do
  file=./wordlist/${l%,*}.txt
  lang=${l#*,}

  for char in {a..z}; do
     #count=$(egrep -c "${char}\>" "${file}")
     count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
     echo ${lang} ${char} ${count}
  done | sort -k3 -rn | head -1
done
) | column -t
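For reference, here is the same loop run end to end against two tiny made-up wordlists under /tmp (the paths and counts are invented for illustration; ${count:-0} is a small tweak so letters with no matching words print 0 instead of an empty field):

```shell
mkdir -p /tmp/wordlist/french /tmp/wordlist/spanish
printf 'de 100\nle 50\npas 10\n'    > /tmp/wordlist/french/fr.txt
printf 'casa 80\nperro 20\nsol 5\n' > /tmp/wordlist/spanish/es.txt

languages="french/fr,French spanish/es,Español"
(
echo "Language Letter Count"
for l in ${languages}; do
  file=/tmp/wordlist/${l%,*}.txt
  lang=${l#*,}
  for char in {a..z}; do
    count=$(grep -E "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
    echo "${lang}" "${char}" "${count:-0}"
  done | sort -k3 -rn | head -1
done
) | column -t
```

With these inputs the French row comes out as e 150 (de 100 + le 50) and the Spanish row as a 80.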

3 Comments

This worked wonderfully and I learned something new, thanks Umlaute!
I do have a question, if you don't mind: can you tell me how you used ${l%,*} and ${l#*,}? I am still confused about the usage of % and # within the script; what exactly do they mean?
@Angelo man bash and searching for ## should give you an explanation better than anything I could say.
