5

I'm trying to get the counts of unique items in a csv column using Python.

Sample CSV file (has no header):

AB,asd
AB,poi
AB,asd
BG,put
BG,asd

I've tried this so far.

import csv
from collections import defaultdict, Counter

data = defaultdict(list)
with open('Results/1_sample.csv') as input_file:
    csv_reader = csv.reader(input_file, delimiter=',')
    for row in csv_reader:
        # Group the column[1] values under their column[0] key.
        data[row[0]].append(row[1])

for k, v in data.items():
    print(k)
    print(Counter(v))

This gives output in this format:

AB
Counter({'asd': 2, 'poi': 1})
BG
Counter({'asd': 1, 'put': 1})

But I want my output to be like:

AB:2
BG:2
total_unique_count:3 #unique count of column[1], irrespective of the data in column[0]
3 Comments
  • It has two unique values in column[1], asd and poi. @PadraicCunningham Commented Apr 14, 2015 at 18:31
  • Ok so you want to remove duplicates, not count actually unique values? Commented Apr 14, 2015 at 18:32
  • @PadraicCunningham yes, remove duplicates and then get the count. Commented Apr 14, 2015 at 18:35

2 Answers

5

You're looking for the SeriesGroupBy method nunique:

In [11]: df
Out[11]:
    0    1
0  AB  asd
1  AB  poi
2  AB  asd
3  BG  put
4  BG  asd

In [12]: g = df.groupby(0)

In [13]: g[1].nunique()
Out[13]:
0
AB    2
BG    2
Name: 1, dtype: int64
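
To get all the way to the output format asked for in the question, something like the following might work. It is only a sketch: the DataFrame is built in memory from the sample rows, and the names per_key and total_unique are made up for illustration. The overall count uses Series.nunique on column 1, ignoring column 0.

```python
import pandas as pd

# Sample rows from the question; with no column names given,
# pandas labels the columns 0 and 1, as in the DataFrame above.
df = pd.DataFrame([
    ["AB", "asd"],
    ["AB", "poi"],
    ["AB", "asd"],
    ["BG", "put"],
    ["BG", "asd"],
])

# Distinct column-1 values within each column-0 group.
per_key = df.groupby(0)[1].nunique()

# Distinct column-1 values overall, irrespective of column 0.
total_unique = df[1].nunique()

for key, count in per_key.items():
    print("{}:{}".format(key, count))
print("total_unique_count:{}".format(total_unique))
```

For the sample data this prints AB:2, BG:2 and total_unique_count:3.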

6 Comments

It looks promising, but I get pandas.hashtable.PyObjectHashTable.get_item KeyError: 0. I'll try to fix that and update.
0 and 1 are the column names in the above DataFrame; yours may be different? (This groups by column 0 and then counts the number of unique elements in column 1, for each group.)
They are the same for my data also.
@pam also, to get the total number of unique values in column 1, use len(df[1].unique()). OK, not sure why that is, you've been able to do that since forever; perhaps the column names are strings, '0'?
My bad. You're right. I forgot to give header = None and it was considering the first row to be header. It works great! Thank you very much!
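
For anyone hitting the same KeyError: 0, here is a small sketch of what the last comment describes. It assumes the file is loaded with pandas.read_csv; the sample rows are inlined through io.StringIO so it runs without the actual file, and the variable names are illustrative only.

```python
import io
import pandas as pd

csv_text = "AB,asd\nAB,poi\nAB,asd\nBG,put\nBG,asd\n"

# Default behaviour: the first row is consumed as the header,
# so the column labels are strings taken from that row.
with_header = pd.read_csv(io.StringIO(csv_text))
print(list(with_header.columns))
print(0 in with_header.columns)  # no column labelled 0, hence the KeyError

# header=None: every row is data and the columns are labelled 0 and 1.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))
print(no_header.groupby(0)[1].nunique())
```

Without header=None the frame also has one row fewer, since the first data row has been used up as column labels.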
4

Use sets:

data = (('AB', 'asd'),
        ('AB', 'poi'),
        ('AB', 'asd'),
        ('BG', 'put'),
        ('BG', 'asd'))
# A set drops the duplicate ('AB', 'asd') row.
unique_items = set(data)
# Flat list of the keys that remain after de-duplication.
keys = [entry[0] for entry in unique_items]
for key in set(keys):
    print("Key '{}' appears {} unique times".format(key, keys.count(key)))

Key 'BG' appears 2 unique times
Key 'AB' appears 2 unique times
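
If you want the exact output format from the question without pandas, the same set idea could be applied while reading the CSV. This is a rough sketch rather than a drop-in solution: the rows are inlined through io.StringIO in place of the real file, and values_by_key and all_values are names invented here.

```python
import csv
import io
from collections import defaultdict

csv_text = "AB,asd\nAB,poi\nAB,asd\nBG,put\nBG,asd\n"

values_by_key = defaultdict(set)  # column[0] key -> distinct column[1] values
all_values = set()                # distinct column[1] values across all rows

for key, value in csv.reader(io.StringIO(csv_text)):
    values_by_key[key].add(value)
    all_values.add(value)

# Sorted so the keys print in a stable order.
for key in sorted(values_by_key):
    print("{}:{}".format(key, len(values_by_key[key])))
print("total_unique_count:{}".format(len(all_values)))
```

With a real file, replace io.StringIO(csv_text) with the opened file object. Sets discard repeats on insertion, so no separate de-duplication pass is needed.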

3 Comments

Thank you for your answer. But I need AB count to be only 2, not 3 (since asd is repeated in column[1] for AB)
Ah, so you're looking for completely unique entries, and then count by the key?
Yes. Sorry for my bad phrasing.
