I have two dataframes and I want to count how often each "classifier" occurs in "fullname". My problem is that my script counts a name like "carrepair" for only one classifier, whereas I would like it to be counted for both classifiers. I would also like to add one random coordinate that matches each classifier.

First dataframe:

enter image description here

Second dataframe:

enter image description here

Result so far:

enter image description here

Desired Result:

enter image description here

My script so far:

import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')

fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

pat = '({})'.format('|'.join(clas['classifier'].unique()))

fl['fullname'] = fl['fullname'].str.extract(pat, expand=False)

clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)

Thanks!

1 Answer

You could try this:

import pandas as pd

fl = pd.read_excel(r'fullname.xlsx')
clas = pd.read_excel(r'classifier.xlsx')
fl.fullname = fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()

# Add one column per classifier to 'fl'; a row gets the classifier as its
# value when 'fullname' contains it, and NaN otherwise.
# regex=False makes the classifier match as a literal substring.
for value in clas["classifier"].values:
    fl.loc[fl["fullname"].str.contains(value, case=False, regex=False), value] = value

# Count the non-NaN values per classifier column and create a new dataframe
new_clas = pd.DataFrame(
    {
        "classifier": list(clas["classifier"].values),
        "count": [fl[col].count() for col in clas["classifier"].values],
    }
)

# Merge 'fl' and 'new_clas'; only rows whose 'fullname' equals the
# classifier exactly contribute a coordinate
new_clas = pd.merge(
    left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)

# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])

print(new_clas)
# Outputs
# classifier    count    coordinate
# repair        3        52.520008, 13.404954
# car           3        54.520008, 15.404954
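
For context on why the script in the question undercounts: `str.extract` keeps only the first match in each row, so a name such as "carrepair" is credited to a single classifier, while a separate `str.contains` check per classifier credits it to both. The short sketch below illustrates the difference; the three sample names and the variable names are made up for the illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample names; "carrepair" contains both classifiers.
names = pd.Series(["carrepair", "car wash", "repair shop"])
pattern = "(car|repair)"

# str.extract returns only the first match per row,
# so "carrepair" is attributed to "car" alone.
first_only = names.str.extract(pattern, expand=False)
print(first_only.value_counts().to_dict())

# One literal substring test per classifier counts "carrepair" twice,
# once for each classifier it contains.
per_classifier = {
    word: int(names.str.contains(word, regex=False).sum())
    for word in ["car", "repair"]
}
print(per_classifier)
```
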

5 Comments

Thank you for your answer. I ran the script with fullname = ['churchtree', 'church', 'Dorf der Church', 'Apfel', 'Apfelmus', 'Stadt des Apfelmuses', 'Neue Krone', 'Stadt der Krone', 'Dorf der Krone', 'Kindergarten der Krone', 'Kronenstraße', 'Apfelmuschurch', 'kronenapfelmus'] and classifier = ['krone', 'church', 'apfelmus']. The script returns krone = 5, apfelmus = 4 and church = 3, but the correct output should be krone = 6, apfelmus = 4 and church = 4. Do you know what the problem could be? I also get one NaN value for the coordinate :)
Sorry, my initial answer wasn't right; see my updated answer, which now gives the correct values (krone = 6, apfelmus = 4 and church = 4). Note that you will get NaN values for the coordinate for 'krone' and 'apfelmus', as they are not present in the 'fullname' dataframe.
Awesome, thank you. Is there a way to also include the coordinates of rows where fullname merely contains the classifier, so that I don't get the NaN values? E.g. for krone, which does not appear as a single token in fullname.
Sorry, I don't think so (as the non single tokens are repeated multiple times).
If this answer has solved your question, please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. In any case, have a nice day.
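
Following up on the comment thread: one possible way to avoid the NaN coordinates is to sample from every row whose name contains the classifier, rather than merging on exact equality. The sketch below is not the answerer's code. It reuses the sample names from the first comment; the coordinate values, the `summarize` helper and its `seed` parameter are assumptions made up for the illustration.

```python
import pandas as pd

# Sample names from the comment thread; the coordinates are invented placeholders.
names = ['churchtree', 'church', 'Dorf der Church', 'Apfel', 'Apfelmus',
         'Stadt des Apfelmuses', 'Neue Krone', 'Stadt der Krone',
         'Dorf der Krone', 'Kindergarten der Krone', 'Kronenstraße',
         'Apfelmuschurch', 'kronenapfelmus']
fl = pd.DataFrame({
    "fullname": names,
    "coordinate": [f"{50 + i}.0, {10 + i}.0" for i in range(len(names))],
})
clas = pd.DataFrame({"classifier": ["krone", "church", "apfelmus"]})

# Lowercase once so matching is case-insensitive.
lowered = fl["fullname"].str.lower()


def summarize(classifier, seed=0):
    """Count rows containing the classifier and pick one matching coordinate."""
    mask = lowered.str.contains(classifier, regex=False)
    matches = fl.loc[mask, "coordinate"]
    # Sample one coordinate at random; None when no row matches at all.
    coordinate = (
        matches.sample(n=1, random_state=seed).iloc[0]
        if not matches.empty
        else None
    )
    return {
        "classifier": classifier,
        "count": int(mask.sum()),
        "coordinate": coordinate,
    }


result = pd.DataFrame([summarize(c) for c in clas["classifier"]])
print(result)
```

With this data the counts come out as krone = 6, church = 4 and apfelmus = 4, matching the values expected in the comments. Passing a fixed `seed` keeps the sampled coordinate reproducible; using `seed=None` instead would give a different random pick on each run.
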
