1

The column terms stores a set of a few strings in each row (drawn from a fixed set of ~1000 strings).

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])

Out[1]
           terms
0  {mouse, city}
1        {mouse}
2         {blue}

I want to go over the rows and record which terms each row contains, so I plan to create a 0/1 indicator column for each term found. Something like:

           terms  has_mouse  has_city  has_blue
0  {mouse, city}          1         1         0
1        {mouse}          1         0         0
2         {blue}          0         0         1

I tried this:

def count_terms_in_row(row):
    for term in row['terms']:
        row['has_{}'.format(term)] = 1

df.apply(count_terms_in_row, axis=1)

However, that didn't work as planned. What's the right approach here?

2
  • df.terms.apply(len)? Commented Apr 27, 2020 at 14:15
  • Thank you, please see edit - need to count each term separately. Commented Apr 27, 2020 at 14:30

3 Answers

2

You can do the following:

import pandas as pd
import numpy as np

df = pd.DataFrame([[{'city', 'mouse'}], 
                   [{'mouse'}], 
                   [{'blue'}]], 
                  columns=['terms'])


all_terms = set()
for idx, data in df.iterrows():
  all_terms = all_terms.union(data["terms"])

# find out all new columns
new_columns = []
term2idx = {}
for idx, term in enumerate(all_terms):
  new_columns.append("has_term_{}".format(term))
  term2idx[term] = idx

# add new data per new column
new_data = []
for idx, data in df.iterrows():
  _row = [0] * len(new_columns)
  for term in data["terms"]:
    _row[term2idx[term]] = 1
  new_data.append(_row)

# add new data to existing DataFrame
new_data = np.asarray(new_data)
for idx in range(len(new_columns)):
  df[new_columns[idx]] = new_data[:,idx]

print(df.head())

This results in:

           terms  has_term_city  has_term_blue  has_term_mouse
0  {city, mouse}              1              0               1
1        {mouse}              0              0               1
2         {blue}              0              1               0
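
Since the question mentions a fixed vocabulary of about 1000 strings, the same indicator columns can also be built directly from that list, which keeps the column order stable. A minimal sketch, assuming the vocabulary is available as a list (the name vocabulary and its three entries are placeholders, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({'terms': [{'city', 'mouse'}, {'mouse'}, {'blue'}]})

# Assumption: the fixed vocabulary is known up front; these entries are placeholders.
vocabulary = ['blue', 'city', 'mouse']

# One 0/1 column per vocabulary term: 1 if the row's set contains the term.
indicators = pd.DataFrame(
    {'has_{}'.format(term): [int(term in terms) for terms in df['terms']]
     for term in vocabulary},
    index=df.index,
)

result = df.join(indicators)
print(result)
```

Terms from the vocabulary that never occur simply get an all-zero column, which may or may not be what you want.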

1

This is essentially get_dummies:

df.join(pd.get_dummies(df.terms.apply(list).explode())
          .groupby(level=0).sum()   # .sum(level=0) was removed in pandas 2.0
          .add_prefix('has_')
       )

Output:

           terms  has_blue  has_city  has_mouse
0  {mouse, city}         0         1          1
1        {mouse}         0         0          1
2         {blue}         1         0          0
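
As a side note, the apply attempt in the question most likely has no effect because the function returns nothing and only assigns into the temporary row object, so df itself is never changed. A minimal sketch of a variant that returns the indicators instead (the helper name term_flags is hypothetical, and this is only one of several ways to do it):

```python
import pandas as pd

df = pd.DataFrame({'terms': [{'city', 'mouse'}, {'mouse'}, {'blue'}]})

def term_flags(terms):
    # Hypothetical helper: one entry per term present in this row's set.
    return pd.Series({'has_{}'.format(term): 1 for term in terms})

# Applying a function that returns a Series yields a DataFrame;
# terms missing from a row come back as NaN, so fill them with 0.
flags = df['terms'].apply(term_flags).fillna(0).astype(int)

result = df.join(flags)
print(result)
```

The order of the new columns depends on set iteration order, so select them by name rather than by position.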

0

You can try this:

df['count'] = df['terms'].str.len()
print(df)

           terms  count
0  {mouse, city}      2
1        {mouse}      1
2         {blue}      1

1 Comment

Thank you, please see edit - need to count each term separately.
