How to count number of unique strings in two columns?

Question

I have a DataFrame with two columns containing strings, like:

col1 --- col2
Ernst --- Jim
Peter --- Ernst
Bill --- NaN
NaN --- Doug
Jim --- Jake

Now I want to create a new DataFrame with a list of unique strings in the first column and in the second column the number of occurrences of each string in the 2 original columns, like:

str --- occurrences
Ernst --- 2
Peter --- 1
Bill --- 1
Jim --- 2
Jake --- 1
Doug --- 1

How do I do that in the most efficient way? Thanks!

TomAugspurger · Accepted Answer · 2014-01-20 17:25:26Z

7

First combine your original two columns into one:

In [127]: s = pd.concat([df.col1, df.col2], ignore_index=True)

In [128]: s
Out[128]: 
0    Ernst
1    Peter
2     Bill
3      NaN
4      Jim
5      Jim
6    Ernst
7      NaN
8     Doug
9     Jake
dtype: object

and then use value_counts:

In [129]: s.value_counts()
Out[129]: 
Ernst    2
Jim      2
Bill     1
Doug     1
Jake     1
Peter    1
dtype: int64
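To get from these counts to the two-column frame the question asks for, something like the following should work. It is a minimal sketch: the sample data is copied from the question, and the output column names `str` and `occurrences` are just the labels used there, not anything pandas requires.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
    "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"],
})

# Combine both columns into one Series, then count each distinct string.
# value_counts() ignores NaN by default (dropna=True).
s = pd.concat([df["col1"], df["col2"]], ignore_index=True)
counts = s.value_counts()

# Build the two-column result; the column names follow the question's layout.
result = pd.DataFrame({"str": counts.index, "occurrences": counts.values})
print(result)
```

The order of rows with equal counts may differ from the listing in the question; sort `result` explicitly if a particular order matters.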

2 Comments

DSM Over a year ago

Alternatively, df.unstack().value_counts(). (If there are more columns than just col1 and col2 in the frame, you'd want to select those first.)

TomAugspurger Over a year ago

Ohh that's nice too. df.stack().value_counts() gives the same.
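A short sketch of both one-liners from the comments, including the column selection DSM mentions. The extra `age` column is hypothetical and only there to show why selecting `col1` and `col2` first matters.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus a hypothetical extra column.
df = pd.DataFrame({
    "col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
    "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"],
    "age": [30, 41, 25, 52, 38],  # hypothetical, not in the question
})

# Select only the string columns, flatten them into one Series, then count.
# value_counts() ignores NaN by default.
cols = df[["col1", "col2"]]
stacked_counts = cols.stack().value_counts()
unstacked_counts = cols.unstack().value_counts()

print(stacked_counts)
```

Both calls flatten the frame into a single Series; they differ only in the order of the flattened values, which does not affect the counts.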

astreal · Accepted Answer · 2014-01-20 17:21:10Z

I'd do it this way (assuming you are reading the data from a file your_file.txt and want to print the result):

from collections import Counter

separator = ' --- '
with open('your_file.txt') as f:
    # split each line on the separator, stripping the trailing newline first
    people = [name for line in f for name in line.strip().split(separator)]

people_count = Counter(people)  # dict-like object with key=name, value=count
for name, val in people_count.items():
    # print the columns the way you want
    print('{name}{separator}{value}'.format(name=name, separator=separator, value=val))

The example uses the Counter object, which lets you efficiently count the elements of an iterable. The rest of the code is only string manipulation.
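For trying the Counter idea without a file, here is a minimal sketch that keeps the rows in a list. Treating the literal text `NaN` as a missing value and skipping it is an assumption based on the desired output in the question.

```python
from collections import Counter

# Rows in the question's layout, held in a list instead of a file.
lines = [
    "Ernst --- Jim",
    "Peter --- Ernst",
    "Bill --- NaN",
    "NaN --- Doug",
    "Jim --- Jake",
]
separator = " --- "

# Split every row into names and skip the "NaN" placeholders
# (assumption: "NaN" marks a missing value, as in the question).
names = [
    name
    for line in lines
    for name in line.strip().split(separator)
    if name != "NaN"
]

people_count = Counter(names)

# most_common() yields (name, count) pairs, highest count first.
for name, count in people_count.most_common():
    print(f"{name}{separator}{count}")
```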

Alvaro Fuentes · Accepted Answer · 2014-01-20 17:27:47Z

0

Try this:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
                   "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"]})
print(df)
# count each name per column, then add the two counts label by label
df1 = df.groupby("col1")["col1"].count()
df2 = df.groupby("col2")["col2"].count()
print(df1.add(df2, fill_value=0))
