Distinct string count in dataframe column

Question

I have a .tsv data file. I want to print the count of strings in a certain column. The column looks like this:

column1
A aaa
A, C c
C
D
E ee,F
A aaa, B, C cc
F
E ee

I want a count for each distinct string (A, B, C, A aaa, etc.). However, the column sometimes has a space after the ",", so my code counts "B" and " B" as different strings. This is the code I am currently using:

import pandas as pd

# Import data from file into Pandas DataFrame
data = pd.read_csv("data.tsv", encoding='utf-8', sep="\t")
pd.set_option('display.max_rows', None)
# Splitting on ',' alone leaves the leading space on items such as " B"
out = data['column1'].str.split(',', expand=True).stack().value_counts()
print(out)

Any leads are appreciated.

Quang Hoang · Accepted Answer · 2020-11-25 16:15:03Z

You need to include the space in the separator, i.e. split(', '). Since the space is not always present, split on the regex ',\s*' instead, which matches a comma followed by optional whitespace:

out = df['column1'].str.split(r',\s*', expand=True).stack().value_counts()

Output:

F        2
E ee     2
A aaa    2
C c      1
C        1
A        1
C cc     1
B        1
D        1
dtype: int64
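For anyone who wants to reproduce this without the original file, here is a minimal sketch. It assumes the sample column from the question is the whole file; the in-memory string stands in for data.tsv, and explode() is swapped in for expand=True plus stack(), which is a substitution of mine rather than part of the answer above.

```python
import io

import pandas as pd

# Sample data copied from the question; the StringIO buffer is a stand-in
# for the real "data.tsv" file (an assumption made for illustration).
raw = "column1\nA aaa\nA, C c\nC\nD\nE ee,F\nA aaa, B, C cc\nF\nE ee\n"
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Split each cell on a comma followed by optional whitespace, flatten the
# resulting lists into one item per row, trim stray spaces, then count.
counts = (
    df["column1"]
    .str.split(r",\s*")
    .explode()
    .str.strip()
    .value_counts()
)
print(counts)
```

The extra str.strip() is only a safeguard against leading or trailing spaces at the edges of a cell, which the split pattern alone would not remove.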

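Alternatively, you can normalize ', ' to ',' and use get_dummies. Pass regex=True explicitly, because from pandas 2.0 onward str.replace treats the pattern as a literal string by default:

df['column1'].str.replace(r',\s*', ',', regex=True).str.get_dummies(',').sum()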

Output:

A        1
A aaa    2
B        1
C        1
C c      1
C cc     1
D        1
E ee     2
F        2
dtype: int64
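The two approaches agree on the sample data, but they can differ when the same string appears more than once in a single cell: get_dummies produces 0/1 indicators per row, so its sum counts rows containing the string rather than total occurrences. A small sketch of the difference, using a made-up Series (not data from the question):

```python
import pandas as pd

# Hypothetical data in which "A" appears twice within the first cell.
s = pd.Series(["A, A", "B, A", "B"])

# Counts every occurrence of each string.
occurrences = s.str.split(r",\s*").explode().value_counts()

# Counts the number of rows that contain each string at least once.
rows_containing = (
    s.str.replace(r",\s*", ",", regex=True)
    .str.get_dummies(",")
    .sum()
)

print(occurrences)
print(rows_containing)
```

If repeated strings within a cell cannot occur in your data, either form should give the same totals.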
