Python algorithm of counting occurrence of specific word in csv

Question

I've just started to learn python. I'm curious about what are the efficient ways to count the occurrence of a specific word in a CSV file, other than simply use for loop to go through line by line and read.

To be more specific, let's say I have a CSV file contain two columns, "Name" and "Grade", with millions of records.

How would one count the occurrence of "A" under "Grade"?

Python code samples would be greatly appreciated!

You have to read the whole file, otherwise your algorithm can be proven to be incorrect. Reading it linearly, line by line, is not a bad approach. — user684934
– user684934, Commented Feb 12, 2012 at 7:44
import csv; count = sum(1 for row in csv.dictreader(open(filename)) if row['Grade'] == 'A') — agf
– agf, Commented Feb 12, 2012 at 7:58
@agf: nice, but when I tried this it was a factor of 6-8 slower than the other answers — steabert
– steabert, Commented Feb 12, 2012 at 8:39
@steabert That speed factor almost certainly doesn't matter. — agf
– agf, Commented Feb 12, 2012 at 9:26

reclosedev · Accepted Answer · 2012-02-12 07:57:30Z

12

Basic example, with using csv and collections.Counter (Python 2.7+) from standard Python libraly:

import csv
import collections

grades = collections.Counter()
with open('file.csv') as input_file:
    for row in csv.reader(input_file, delimiter=';'):
        grades[row[1]] += 1

print 'Number of A grades: %s' % grades['A']
print grades.most_common()

Output (for small dataset):

Number of A grades: 2055
[('A', 2055), ('B', 2034), ('D', 1995), ('E', 1977), ('C', 1939)]

answered Feb 12, 2012 at 7:57

reclosedev

9,53237 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

steabert Over a year ago

OK, but you could apply Counter on a generator expression for the first element of the lines in the file

laotanzhurou Over a year ago

Thanks! I accepted your answer. But I was wondering comparing this with using dictionary, with grade as key and occurrence as value, which way will be more efficient?

reclosedev Over a year ago

@laotanzhurou, Counter is a dict subclass, but it's little slower. If you really need speedup collections.defaultdict(int) or if ... count += 1 probably will be faster. But you always can benchmark it by yourself with timeit, see Johnsyweb's answer

steabert · Accepted Answer · 2012-02-12 08:30:04Z

3

You should of course read all the grades, which in this case also means reading the entire file. You can use the csv module to easily read comma separated value files:

import csv
my_reader = csv.reader(open('my_file.csv'))
ctr = 0
for record in my_reader:
    if record[1] == 'A':
        ctr += 1
print(ctr)

This is pretty fast, and I couldn't do better with the Counter method:

from collections import Counter
grades = [rec[1] for rec in my_reader] # generator expression was actually slower
result = Counter(grades)
print(result)

Last but not least, lists have a count method:

from collections import Counter
grades = [rec[1] for rec in my_reader]
result = grades.count('A')
print(result)

answered Feb 12, 2012 at 8:30

steabert

7,0082 gold badges30 silver badges33 bronze badges

Collectives™ on Stack Overflow

Python algorithm of counting occurrence of specific word in csv

2 Answers 2

3 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related