Count occurrence of column values in other dataframe column

Question

I have two dataframes and I want to count how often each value in the "classifier" column occurs in the "fullname" column. My problem is that my script counts a word like "carrepair" for only one classifier, whereas I would like it to count for both classifiers. I would also like to add one random coordinate that matches the classifier.

First dataframe:

Second dataframe:

Result so far:

Desired Result:

My script so far:

import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')

fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

pat = '({})'.format('|'.join(clas['classifier'].unique()))

fl['fullname'] = fl['fullname'].str.extract(pat, expand=False)

clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)

Thanks!

Laurent · Accepted Answer · 2021-05-10 19:21:45Z


You could try this:

import pandas as pd

fl = pd.read_excel(r"fullname.xlsx")
clas = pd.read_excel(r"classifier.xlsx")
fl["fullname"] = fl["fullname"].str.lower()
clas["classifier"] = clas["classifier"].str.lower()

# Add one column per classifier to 'fl': a row holds the classifier name
# where 'fullname' contains it and NaN otherwise, so a fullname such as
# 'carrepair' is flagged for both 'car' and 'repair'
for value in clas["classifier"].values:
    fl.loc[fl["fullname"].str.contains(value, case=False), value] = value

# Count the non-NaN entries of each classifier column and create a new dataframe
new_clas = pd.DataFrame(
    {
        "classifier": list(clas["classifier"].values),
        "count": [fl[col].count() for col in clas["classifier"].values],
    }
)

# Merge 'fl' and 'new_clas'; the join is on exact equality, so a coordinate
# is only found when some 'fullname' is exactly the classifier
new_clas = pd.merge(
    left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)

# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])

print(new_clas)
# Output:
#   classifier    count    coordinate
#   repair        3        52.520008, 13.404954
#   car           3        54.520008, 15.404954
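
For readers without the Excel files, the counting step can be tried on inline data. This is only a minimal sketch: the sample rows below are hypothetical stand-ins for fullname.xlsx and classifier.xlsx, and it swaps the per-classifier helper columns for a direct sum over a boolean mask, which is a variation rather than the code above.

```python
import pandas as pd

# Hypothetical inline data standing in for fullname.xlsx and classifier.xlsx
fl = pd.DataFrame({"fullname": ["Car", "carrepair", "Repair", "car wash", "bakery"]})
clas = pd.DataFrame({"classifier": ["car", "repair"]})

names = fl["fullname"].str.lower()

# One boolean mask per classifier; regex=False treats the classifier as a
# literal substring, so 'carrepair' is counted for both 'car' and 'repair'
clas["count"] = [
    int(names.str.contains(c, regex=False).sum())
    for c in clas["classifier"].str.lower()
]

print(clas)
```

Because every classifier is tested against every fullname, a row can contribute to several counts, which is what `str.extract` with a single alternation pattern cannot do (it keeps only the first match per row).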

edited May 10, 2021 at 19:21

answered May 10, 2021 at 17:51


5 Comments

remotesatellite Over a year ago

Thank you for your answer. I ran the script with fullname = ['churchtree', 'church', 'Dorf der Church', 'Apfel', 'Apfelmus', 'Stadt des Apfelmuses', 'Neue Krone', 'Stadt der Krone', 'Dorf der Krone', 'Kindergarten der Krone', 'Kronenstraße', 'Apfelmuschurch', 'kronenapfelmus'] and classifier = ['krone', 'church', 'apfelmus']. The script counts krone = 5, apfelmus = 4 and church = 3, but the correct output should be krone = 6, apfelmus = 4 and church = 4. Do you know what the problem could be? I also get one NaN value for the coordinate :)

Laurent Over a year ago

Sorry, my initial answer wasn't right. See my updated answer, which now gives the correct values (krone = 6, apfelmus = 4 and church = 4). Note that you will get NaN values for the coordinate of 'krone' and 'apfelmus', as they are not present in the 'fullname' dataframe.
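
The expected counts quoted in this exchange can be cross-checked independently of pandas. The sketch below uses the fullname and classifier lists exactly as posted in the comment above and counts plain substring matches after lowercasing; the variable names are only illustrative.

```python
# Lists copied from the comment above
fullnames = [
    "churchtree", "church", "Dorf der Church", "Apfel", "Apfelmus",
    "Stadt des Apfelmuses", "Neue Krone", "Stadt der Krone", "Dorf der Krone",
    "Kindergarten der Krone", "Kronenstraße", "Apfelmuschurch", "kronenapfelmus",
]
classifiers = ["krone", "church", "apfelmus"]

# Count, for each classifier, how many lowercased fullnames contain it
counts = {
    c: sum(c in name.lower() for name in fullnames)
    for c in classifiers
}

print(counts)
```

Names such as 'Apfelmuschurch' and 'kronenapfelmus' each contribute to two classifiers, which accounts for the difference between the first set of counts and the corrected one.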

remotesatellite Over a year ago

Awesome, thank you. Is there a way to also include coordinates from rows where fullname merely contains the classifier, so that I don't get the NaN values? E.g. for krone, which is not contained as a single token in fullname.

Laurent Over a year ago

Sorry, I don't think so (as the non single tokens are repeated multiple times).
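
One possible workaround, if the coordinate of any matching row is acceptable, is to sample a single row among those whose fullname contains the classifier. This is a sketch under that assumption, not part of the accepted answer: the data, the coordinate strings, the helper name `random_coordinate` and its `seed` parameter are all hypothetical.

```python
import pandas as pd

# Hypothetical data: the coordinate strings are made up for illustration
fl = pd.DataFrame(
    {
        "fullname": ["neue krone", "kronenstraße", "church", "apfel"],
        "coordinate": ["52.1, 13.1", "52.2, 13.2", "52.3, 13.3", "52.4, 13.4"],
    }
)
clas = pd.DataFrame({"classifier": ["krone", "church", "apfelmus"]})


def random_coordinate(classifier, seed=0):
    """Return the coordinate of one randomly chosen row whose fullname contains the classifier."""
    matches = fl[fl["fullname"].str.contains(classifier, regex=False)]
    if matches.empty:
        return None  # no fullname contains this classifier
    return matches["coordinate"].sample(n=1, random_state=seed).iloc[0]


clas["coordinate"] = clas["classifier"].map(random_coordinate)
print(clas)
```

The fixed `seed` only makes the pick reproducible; passing a different value (or `None`) gives a different random row. A classifier with no matching fullname still ends up with a missing coordinate.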

Laurent Over a year ago

If this answer has solved your question, please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. In any case, have a nice day.
