1

I have a DataFrame with two columns containing strings, like:

col1 --- col2
Ernst --- Jim
Peter --- Ernst
Bill --- NaN
NaN --- Doug
Jim --- Jake

Now I want to create a new DataFrame whose first column lists the unique strings and whose second column gives the number of times each string occurs across the two original columns, like:

str --- occurrences
Ernst --- 2
Peter --- 1
Bill --- 1
Jim --- 2
Jake --- 1
Doug --- 1

How do I do that in the most efficient way? Thanks!

3 Answers 3

7

First combine your original two columns into one:

In [127]: s = pd.concat([df.col1, df.col2], ignore_index=True)

In [128]: s
Out[128]: 
0    Ernst
1    Peter
2     Bill
3      NaN
4      Jim
5      Jim
6    Ernst
7      NaN
8     Doug
9     Jake
dtype: object

and then use value_counts:

In [129]: s.value_counts()
Out[129]: 
Ernst    2
Jim      2
Bill     1
Doug     1
Jake     1
Peter    1
dtype: int64
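
To get the two-column frame asked for in the question, the counts can be turned back into a DataFrame. This is a minimal sketch, assuming the sample data from the question. The column names `str` and `occurrences` are taken from the desired output, and `result` is just an illustrative name:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
    "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"],
})

# Put both columns into one Series; value_counts() skips NaN by default
s = pd.concat([df["col1"], df["col2"]], ignore_index=True)
counts = s.value_counts()

# Name the index, then move it into a regular column next to the counts
result = counts.rename_axis("str").reset_index(name="occurrences")
print(result)
```

The order of names that share the same count may differ between pandas versions, so sort explicitly if the order matters.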

2 Comments

Alternatively, df.unstack().value_counts(). (If there are more columns than just col1 and col2 in the frame, you'd want to select those first.)
Ohh that's nice too. df.stack().value_counts() gives the same.
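
The two one-liners from the comments can be checked side by side. This is a sketch assuming the question's data plus a hypothetical extra column `age` (not in the original), to show why the name columns are selected first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
    "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"],
    "age": [30, 41, 25, 52, 38],  # hypothetical extra column, for illustration only
})

names = df[["col1", "col2"]]  # keep only the columns that hold names

# Both calls reshape the frame into one long Series before counting;
# value_counts() ignores NaN by default
via_unstack = names.unstack().value_counts()
via_stack = names.stack().value_counts()

print(via_unstack.sort_index())
print(via_stack.sort_index())
```

Without the column selection, the values of `age` would be counted alongside the names.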
0

I'd do it this way (assuming you are reading the data from a file your_file.txt and want to print the result):

from collections import Counter

separator = ' --- '
# assumes the file holds only data rows (no 'col1 --- col2' header line)
with open('your_file.txt') as f:
    # strip trailing newlines so 'Jim\n' and 'Jim' count as the same name
    lines = [line.strip() for line in f if line.strip()]

# split every line on the separator and flatten into one list of names
people = [name for line in lines for name in line.split(separator)]
# Counter is a dict subclass mapping name -> count; skip the 'NaN' placeholders
people_count = Counter(name for name in people if name != 'NaN')
for name, val in people_count.items():
    # print the columns the way you want
    print('{name}{separator}{value}'.format(name=name, separator=separator, value=val))

The example uses the Counter object, which lets you efficiently count elements from an iterable. The rest of the code is only string manipulation.
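
To see Counter on its own, without the file handling, here is a small sketch. It assumes the two columns are already available as plain Python lists (`col1` and `col2` are illustrative names), with `None` standing in for the missing values:

```python
from collections import Counter
from itertools import chain

# The question's columns as plain lists; None marks a missing value
col1 = ["Ernst", "Peter", "Bill", None, "Jim"]
col2 = ["Jim", "Ernst", None, "Doug", "Jake"]

# Walk both lists as one iterable and count every non-missing name
counts = Counter(name for name in chain(col1, col2) if name is not None)

# most_common() yields (name, count) pairs, highest count first
for name, n in counts.most_common():
    print(f"{name} --- {n}")
```

Counter only needs an iterable of hashable items, so the same pattern works for names read from any source.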

0

Try this:

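import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": ["Ernst", "Peter", "Bill", np.nan, "Jim"],
                   "col2": ["Jim", "Ernst", np.nan, "Doug", "Jake"]})
print(df)
# count each name per column; groupby drops NaN keys by default
df1 = df.groupby("col1")["col1"].count()
df2 = df.groupby("col2")["col2"].count()
# add the two counts, treating a name missing from one column as 0
print(df1.add(df2, fill_value=0).astype(int))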
