
I want a loop that can find the letter that ends words most frequently in multiple languages and output the data in columns. So far I have

count="./wordlist/french/fr.txt ./wordlist/spanish/es.txt ./wordlist/german/de.$
lang="French Spanish German Portuguese Italian"
(
echo -e "Language Letter Count"
for i in $count
do
    (for j in {a..z}
        do
            echo -e "LANG" $j $(grep -c $j\> $i)
        done
    ) | sort -k3 -rn | head -1
done
) | column -t

I want it to output as shown:


Language  Letter  Count
French     e       196195
Spanish    a       357193
German     e       251892
Portuguese a       217178
Italian    a       216125

Instead I get:


Language  Letter  Count
LANG      z       0
LANG      z       0
LANG      z       0
LANG      z       0
LANG      z       0

The words files have the format: Word Freq(#) where the word and its frequency are delimited by a space.

This means I have two problems. First, the grep command is not handling the argument $j\> to find a character at the end of a word; I have tried grep -E $j\> and grep '$j\>', and neither worked.

The second problem is that I don't know how to output the name of the language (in the variable lang). Nesting another for loop did not work when I tried it like this (or with i and k in the opposite order):


(
for i in $count
do
    for k in $lang
    do
        (for j in {a..z}
        do
             echo -e $k $j $(grep -c $j\> $i)
        done
        ) | sort -k3 -rn | head -1
done
done
) | column -t

This outputs the language name "$k" multiple times, in places where it does not belong.

I know that I can just copy and paste the loop for each language, but I would like to extend this to every language. Thanks in advance!

  • Can you paste a couple of lines from, say, two of the wordlist files to test against? Commented Oct 20, 2016 at 0:43
  • Even if this worked, wouldn't it output the wrong numbers? E.g. if your word-count file has three entries: is 1000; xertz 1; showbiz 1, the result would be z 2 (rather than s 1000). Commented Oct 20, 2016 at 7:22
  • Yeah, Umlaute, it would be z 2, which is what I want, since I want to count the frequency and display the character that most frequently ends a word within the file itself. And, roelofs, a sample of the file is shown here: de 1622928 je 1622619 est 1348809 pas 1128894 le 1093232; so within this file itself, e most commonly ends a word. Sorry for the confusion. Commented Oct 20, 2016 at 10:04

1 Answer


grep word boundaries

To make special delimiters (e.g. \> for word-end) survive the shell when calling egrep, you should put the pattern in quotes:

 count=$(egrep -c "${char}\>" "${file}")

By the way, you really should use double quotes ("), because single quotes prevent variable expansion. (E.g. after j="foo"; k='$j\>', the first character of k's value will be $ rather than f.)
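To see the difference in action, here is a throwaway sketch against a made-up /tmp file (the \> word-boundary operator assumes GNU grep):

```shell
# Tiny stand-in wordlist (invented data, just for illustration)
printf 'le 10\nest 5\npas 3\n' > /tmp/demo.txt
char=e

# Single quotes: grep receives the literal string ${char}\> -- 0 matches
grep -c '${char}\>' /tmp/demo.txt

# Double quotes: the shell expands the pattern to e\> -- matches "le" only
grep -c "${char}\>" /tmp/demo.txt
```

The first command prints 0 (and grep exits non-zero, since nothing matched); the second prints 1.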

Language name display

Getting the right language string is a bit more tricky; here's a few suggestions:

  • Derive the displayed language from the path of the wordlist:

    lang=${file%/*}
    lang=${lang##*/}
    

    With bash (though not with dash and some other shells) you might even do lang=${lang^} to capitalize the string.

  • Look up the proper language name in a dictionary. Bash 4 has associative arrays built in, but you can also use file-based dictionaries:

    $ cat languages.txt
    ./wordlist/french/fr.txt Français 
    ./wordlist/english/en.txt English
    ./wordlist/german/de.txt Deutsch
    
    $ file=./wordlist/french/fr.txt
    $ lang=$(egrep "^${file}\>" languages.txt | awk '{print $2}')
    
  • You can also iterate over file,lang pairs, e.g.

    languages="french/fr,French spanish/es,Español german/de,Deutsch"
    for l in $languages; do
       file=./wordlist/${l%,*}.txt
       lang=${l#*,}
       # ...
    done
    

Taking word frequencies into account

The third problem I see (though I might misunderstand the problem) is that you are not taking the word frequency into account. For example, a word A that is used 1000 times more often than a word B will still only get counted once (just like B).

You can use awk to sum up the word frequencies of matching words:

count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
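Alternatively, awk can do the whole per-letter tally in a single pass over the file, keying a sum array on each word's last character instead of running grep once per letter. A sketch against the sample lines quoted in the comments above:

```shell
# Sample lines from the comments above
printf 'de 1622928\nje 1622619\nest 1348809\npas 1128894\nle 1093232\n' > /tmp/fr-sample.txt

# Add $2 to a sum keyed by the final character of $1, then pick the biggest
awk '{ sum[substr($1, length($1), 1)] += $2 }
     END { for (c in sum) print c, sum[c] }' /tmp/fr-sample.txt |
  sort -k2 -rn | head -1
# -> e 4338779
```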

All Together Now

So a full solution to the problem could look like:

languages="french/fr,French spanish/es,Español german/de,Deutsch"

(
echo -e "Language Letter Count"
for l in ${languages}; do
  file=./wordlist/${l%,*}.txt
  lang=${l#*,}

  for char in {a..z}; do
     #count=$(egrep -c "${char}\>" "${file}")
     count=$(egrep "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
     echo ${lang} ${char} ${count}
  done | sort -k3 -rn | head -1
done
) | column -t
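For reference, here is the same loop run end to end against two tiny made-up wordlists under /tmp (the paths and counts are invented for illustration; ${count:-0} is a small tweak so letters with no matching words print 0 instead of an empty field):

```shell
mkdir -p /tmp/wordlist/french /tmp/wordlist/spanish
printf 'de 100\nle 50\npas 10\n'    > /tmp/wordlist/french/fr.txt
printf 'casa 80\nperro 20\nsol 5\n' > /tmp/wordlist/spanish/es.txt

languages="french/fr,French spanish/es,Español"
(
echo "Language Letter Count"
for l in ${languages}; do
  file=/tmp/wordlist/${l%,*}.txt
  lang=${l#*,}
  for char in {a..z}; do
    count=$(grep -E "${char}\>" "${file}" | awk '{s+=$2} END {print s}')
    echo "${lang}" "${char}" "${count:-0}"
  done | sort -k3 -rn | head -1
done
) | column -t
```

With these inputs the French row comes out as e 150 (de 100 + le 50) and the Spanish row as a 80.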

3 Comments

This worked wonderfully and I learned something new, thanks Umlaute!
I do have a question, if you don't mind: can you tell me how you used ${l%,*} and ${l#*,}? I am still confused about the usage of % and # within the script; what exactly do they mean?
@Angelo man bash and searching for ## should give you an explanation better than anything I could say.
