python count number of unique elements in csv column

Question

I'm trying to get the counts of unique items in a csv column using Python.

Sample CSV file (has no header):

AB,asd
AB,poi
AB,asd
BG,put
BG,asd

I've tried this so far.

import csv
from collections import defaultdict, Counter

# Group the values of column 1 by the key in column 0
data = defaultdict(list)
with open('Results/1_sample.csv') as input_file:
    csv_reader = csv.reader(input_file, delimiter=',')
    for row in csv_reader:
        data[row[0]].append(row[1])

for k, v in data.items():
    print(k)
    print(Counter(v))

This gives output in this format:

AB
Counter({'asd': 2, 'poi': 1})
BG
Counter({'asd': 1, 'put': 1})

But I want my output to be like:

AB:2
BG:2
total_unique_count:3 #unique count of column[1], irrespective of the data in column[0]

@PadraicCunningham AB has two unique values in column[1]: asd and poi.
– pam, Commented Apr 14, 2015 at 18:31
OK, so you want to remove duplicates rather than count the values that appear only once?
– Padraic Cunningham, Commented Apr 14, 2015 at 18:32
@PadraicCunningham Yes, remove duplicates and then get the count.
– pam, Commented Apr 14, 2015 at 18:35

Andy Hayden · Accepted Answer · 2015-04-14 18:19:47Z

5

You're looking for the SeriesGroupBy method nunique:

In [11]: df
Out[11]:
    0    1
0  AB  asd
1  AB  poi
2  AB  asd
3  BG  put
4  BG  asd

In [12]: g = df.groupby(0)

In [13]: g[1].nunique()
Out[13]:
0
AB    2
BG    2
Name: 1, dtype: int64

answered Apr 14, 2015 at 18:19


6 Comments

pam Over a year ago

It looks promising, but I get pandas.hashtable.PyObjectHashTable.get_item KeyError: 0. I'll try to fix that and update.

Andy Hayden Over a year ago

0 and 1 are the column names in the above DataFrame; yours may be different? (This groups by column 0 and then counts the number of unique elements in column 1 for each group.)

pam Over a year ago

They are the same for my data also.

Andy Hayden Over a year ago

@pam Also, to get the total number of unique values in column 1, use len(df[1].unique()). OK, not sure why you get that error, since this has worked for a long time; perhaps the column names are strings such as '0'?

pam Over a year ago

My bad. You're right. I forgot to give header = None and it was considering the first row to be header. It works great! Thank you very much!
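Putting the answer and the comments together, here is a minimal sketch of the whole pipeline. It assumes pandas is installed and uses an in-memory string (csv_text, a stand-in introduced here) in place of the file from the question. Passing header=None is what makes the column names the integers 0 and 1; without it the first row becomes the header and df.groupby(0) fails with the KeyError mentioned above.

```python
import io

import pandas as pd

# Stand-in for the CSV file from the question (it has no header row).
csv_text = "AB,asd\nAB,poi\nAB,asd\nBG,put\nBG,asd\n"

# header=None keeps the first row as data and names the columns 0 and 1.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Number of distinct column-1 values within each column-0 group.
per_key = df.groupby(0)[1].nunique()
for key, count in per_key.items():
    print("{}:{}".format(key, count))

# Number of distinct column-1 values across the whole file.
total_unique_count = df[1].nunique()
print("total_unique_count:{}".format(total_unique_count))
```

To read the real file instead, the io.StringIO(csv_text) argument would be replaced by the path, for example 'Results/1_sample.csv'.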


Celeo · Accepted Answer · 2015-04-14 18:27:40Z

4

Use sets:

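# Rows from the sample CSV as (key, value) pairs
data = (('AB', 'asd'),
        ('AB', 'poi'),
        ('AB', 'asd'),
        ('BG', 'put'),
        ('BG', 'asd'))
# A set drops the duplicate ('AB', 'asd') pair
unique_items = set(data)
# One key per remaining unique pair
keys = [entry[0] for entry in unique_items]
for key in set(keys):
    print("Key '{}' appears {} unique times".format(key, keys.count(key)))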

Key 'BG' appears 2 unique times
Key 'AB' appears 2 unique times
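The same set-based idea can be extended to produce the exact output format requested in the question, including the overall total. This is only a sketch using the standard library; csv_text, values_by_key and all_values are names chosen here for illustration, and the in-memory string stands in for the real file.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the CSV file from the question (it has no header row).
csv_text = "AB,asd\nAB,poi\nAB,asd\nBG,put\nBG,asd\n"

values_by_key = defaultdict(set)  # column-0 key -> distinct column-1 values
all_values = set()                # distinct column-1 values over all rows

for key, value in csv.reader(io.StringIO(csv_text)):
    values_by_key[key].add(value)
    all_values.add(value)

# Sorting the keys makes the print order deterministic.
for key in sorted(values_by_key):
    print("{}:{}".format(key, len(values_by_key[key])))
print("total_unique_count:{}".format(len(all_values)))
```

With a real file, the reader would be built from an open file object, for example inside a with open('Results/1_sample.csv') block, instead of io.StringIO.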

edited Apr 14, 2015 at 18:27

answered Apr 14, 2015 at 18:12


3 Comments

pam Over a year ago

Thank you for your answer. But I need AB count to be only 2, not 3 (since asd is repeated in column[1] for AB)

Celeo Over a year ago

Ah, so you're looking for completely unique entries, and then count by the key?

pam Over a year ago

Yes. Sorry for my bad phrasing.
