This code really confuses me. It uses some Stanford libraries for the Vector (array) class. Can anyone tell me the purpose of int index = line [j] - 'a'; and why it subtracts 'a'?

void countLetters(string filename)
{
Vector<int> result;

ifstream in;
in.open(filename.c_str());
if (in.fail()) Error("Couldn't read '" + filename + "'");

for (int i = 0; i < ALPHABETH_SIZE; i++)
{
    result.add(0);  // Must initialize contents of array
}

string line;
while (true)
{
    getLine(in, line);
    // Check that we got a line
    if (in.fail()) break;

    line = ConvertToLowerCase(line);
    for (int j = 0; j < line.length(); j++)
    {
        int index = line [j] - 'a';
        if (index >= 0 && index < ALPHABETH_SIZE)
        {
            int prevTotal = result[index];
            result[index] = prevTotal +1;
        }
    }
}
}

The purpose of the code:

Takes a filename and prints the number of times each letter of the alphabet appears in that file. Because there are 26 numbers to be printed, countLetters needs to create a Vector. For example, if the file is:

  • Presumably it would find how far into the alphabet a letter is, but that doesn't always hold true. Commented Nov 9, 2012 at 5:12
  • The code as a whole is calculating letter frequencies. result['c' - 'a'] would be the number of times the character 'c' appears in the file. Commented Nov 9, 2012 at 5:14

2 Answers

Characters in a string are encoded using a character set... typically ASCII on hardware common in English language systems. You can see the ASCII table at http://en.wikipedia.org/wiki/ASCII

In ASCII (and most other character sets), the numbers representing letters are contiguous. So, this is the natural way to test whether the character at index j in character-array line is a letter:

line[j] >= 'a' && line[j] <= 'z'

Your program is equivalent to that. In an algebraic sense, it subtracts 'a' from both sides of each comparison (knowing that 'a' is the first lowercase letter in the character set):

line[j] - 'a' >= 'a' - 'a' && line[j] - 'a' <= 'z' - 'a'

line[j] - 'a' >= 0 && line[j] - 'a' <= 'z' - 'a'

Replacing "<= 'z' - 'a'" with an equivalent, and writing index for line[j] - 'a':

index >= 0 && index < ALPHABET_SIZE

where ALPHABET_SIZE is 26. This trades a dependency on knowing z is the last character of your character set for knowing how many characters are in your character set - both are a little fragile, but fine if you know you're dealing with a well-known, stable character set encoding.
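As a rough illustration of that mapping, here is a minimal sketch using only the standard library rather than the Stanford Vector class. The names kAlphabetSize, letterIndex and countLowercase are made up for this example, and the code assumes 'a' through 'z' have contiguous codes, as they do in ASCII.

```cpp
#include <array>
#include <string>

// Hypothetical constant standing in for ALPHABET_SIZE.
const int kAlphabetSize = 26;

// Returns the 0-based position of a lowercase letter ('a' -> 0, 'z' -> 25),
// or -1 for any other character.
// Assumption: 'a'..'z' are contiguous in the character set (true for ASCII).
int letterIndex(char c) {
    int index = c - 'a';  // distance from 'a'
    if (index >= 0 && index < kAlphabetSize) return index;
    return -1;
}

// Counts lowercase letters in text; counts[0] is for 'a', counts[25] for 'z'.
std::array<int, 26> countLowercase(const std::string& text) {
    std::array<int, 26> counts{};  // value-initialized to all zeros
    for (char c : text) {
        int index = letterIndex(c);
        if (index >= 0) ++counts[index];
    }
    return counts;
}
```

With this, countLowercase("abca!")[0] would be the number of times 'a' appears, which is the same idea as result['a' - 'a'] in the question.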

A better way to check for a letter is to use the isalpha() predicate: http://www.cplusplus.com/reference/clibrary/cctype/isalpha/
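For example, a sketch of a letter test built on std::isalpha might look like the following. The function name countAlphabetic is hypothetical. Note that isalpha only answers "is this a letter?"; it does not give you the array index, so the subtraction (or some other lookup) is still needed for the counting itself.

```cpp
#include <cctype>
#include <string>

// Counts the alphabetic characters in text using std::isalpha.
// The cast to unsigned char matters: passing a negative char value
// to the <cctype> functions is undefined behaviour.
int countAlphabetic(const std::string& text) {
    int total = 0;
    for (char c : text) {
        if (std::isalpha(static_cast<unsigned char>(c))) ++total;
    }
    return total;
}
```

Which characters count as alphabetic depends on the current C locale; in the default "C" locale it is just A-Z and a-z.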

2 Comments

ALPHABET_SIZE is actually a bad idea because it introduces a second assumption: that the alphabet is contiguous. It's that broken assumption which causes the code above to fail on EBCDIC, where 'j'-'i' != 1. In French/ISO-8859-1, similar errors crop up between c and ç.
@MSalters: the idea that a pair of >= and <= comparisons can identify the alphabet is similarly flawed for non-contiguous alphabets - nothing specific to ALPHABET_SIZE about that issue.
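If contiguity can't be assumed, one possible workaround is to look the character up in an explicit list of letters instead of subtracting 'a'. This is only a sketch of that idea, not something from the original code; letterIndexPortable is a made-up name, and it covers only the 26 unaccented English letters.

```cpp
#include <string>

// Returns the position of c among the 26 lowercase English letters
// ('a' -> 0, 'z' -> 25), or -1 if c is not one of them.
// Makes no assumption about the numeric codes of the letters,
// so it also works where they are not contiguous (e.g. EBCDIC).
int letterIndexPortable(char c) {
    static const std::string letters = "abcdefghijklmnopqrstuvwxyz";
    std::string::size_type pos = letters.find(c);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

The linear search is slower than a subtraction, but over 26 characters that is unlikely to matter for counting letters in a file.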

'a' is the first of the lowercase letters in ASCII.

int index = line [j] - 'a'; if (index >= 0 && index < ALPHABETH_SIZE)

These two lines of code just check whether line[j] is a lowercase letter.

1 Comment

But note that ASCII isn't guaranteed.
