I have a .tsv data file. I want to print the count of strings in a certain column. The column looks like this:

column1
A aaa
A, C c
C
D
E ee,F
A aaa, B, C cc
F
E ee

I want a count for each distinct value (A, B, C, A aaa, etc.). However, the column sometimes has a space after the ",", so my code counts "B" and " B" as different values. This is the code I am currently using:

import pandas as pd

# Import data from file into Pandas DataFrame
data = pd.read_csv("data.tsv", encoding="utf-8", delimiter="\t")
pd.set_option("display.max_rows", None)
# Splitting on "," alone keeps the space after a comma, so "B" and " B" are counted separately
out = data["column1"].str.split(",", expand=True).stack().value_counts()
print(out)

Any leads are appreciated.

1 Answer

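The split pattern needs to account for the space. Splitting on ', ' alone would miss commas that have no space after them (as in "E ee,F"), so use the regular expression r',\s*', which matches a comma followed by optional whitespace. Here df is the DataFrame (data in the question):

out = df['column1'].str.split(r',\s*', expand=True).stack().value_counts()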

Output:

F        2
E ee     2
A aaa    2
C c      1
C        1
A        1
C cc     1
B        1
D        1
dtype: int64
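
To try this without the .tsv file, here is a minimal, self-contained sketch. It assumes the column holds exactly the eight sample values from the question; the in-memory DataFrame and the name counts are illustrative stand-ins, not part of the original code.

```python
import pandas as pd

# Sample values copied from the question (stand-in for reading data.tsv)
df = pd.DataFrame({
    "column1": ["A aaa", "A, C c", "C", "D", "E ee,F", "A aaa, B, C cc", "F", "E ee"]
})

# A multi-character pattern is treated as a regular expression by default,
# so r",\s*" splits on a comma plus any whitespace that follows it.
counts = (
    df["column1"]
    .str.split(r",\s*", expand=True)  # one column per split piece, padded with missing values
    .stack()                          # flatten to a single Series, dropping the padding
    .value_counts()
)
print(counts)
```

The order of values with equal counts may differ from the output shown above.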

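Alternatively, you can normalize ', ' to ',' first and then use get_dummies. Pass regex=True, because recent pandas versions treat the str.replace pattern as a literal string by default:

df['column1'].str.replace(r',\s*', ',', regex=True).str.get_dummies(',').sum()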

Output:

A        1
A aaa    2
B        1
C        1
C c      1
C cc     1
D        1
E ee     2
F        2
dtype: int64
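
Another option, not part of the answer above but in the same spirit, is to split on the bare comma and strip whitespace from each piece afterwards. This is only a sketch under the same assumption about the sample data; the name counts_stripped is illustrative. It also tolerates spaces before a comma or at the ends of a cell, which the r',\s*' pattern does not remove.

```python
import pandas as pd

# Sample values copied from the question (stand-in for reading data.tsv)
df = pd.DataFrame({
    "column1": ["A aaa", "A, C c", "C", "D", "E ee,F", "A aaa, B, C cc", "F", "E ee"]
})

counts_stripped = (
    df["column1"]
    .str.split(",")   # list of raw pieces per row, spaces still attached
    .explode()        # one row per piece
    .str.strip()      # drop leading and trailing whitespace from each piece
    .value_counts()
)
print(counts_stripped)
```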