How do I compare a string to a string in a dataframe using pandas?

Question

Let's say I have a string stored in text. I want to compare this string with the lists of keywords stored in a dataframe and check whether the text contains words like car, plane, etc. For each keyword found, I want to add 1 to the score of the corresponding topic.

| topic      | keywords                                  |
|------------|-------------------------------------------|
| Vehicles   | [car, plane, motorcycle, bus]             |
| Electronic | [television, radio, computer, smartphone] |
| Fruits     | [apple, orange, grape]                    |

I have written the following code, but I don't really like it. And it doesn't work as intended.

def foo(text, df_lex):

    keyword = []
    score = []
    for lex_list in df_lex['keyword']:
        print(lex_list)
        val = 0
        for lex in lex_list:

            if lex in text:
                val =+ 1
        keyword.append(key)
        score.append(val)
    score_list = pd.DataFrame({
    'keyword':keyword,
    'score':score
    })

Is there a way to do this efficiently? I don't like having too many loops in my program, as they don't seem to be very efficient. I will elaborate more if needed. Thank you.

EDIT: For example my text is like this. I made it simple, just so it's understandable.

I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.

So, my expected output would be something like this.

| topic      | score |
|------------|-------|
| Vehicles   | 2     |
| Electronic | 1     |
| Fruits     | 0     |

EDIT2: I finally found my own solution with some help from @jezrael.

df['keywords'] = df['keywords'].str.strip('[]').str.split(', ')

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

score_list = []
for lex in df['keywords']:
    val = 0
    for w in lex:
        if w in text:
            val +=1
    score_list.append(val)
df['score'] = score_list
print(df)

And it prints exactly what I need.

Your current code thinks scare matches car. Do you want that behavior or not? If not, do you want cars to match car (stemming), or is that not important?
– John Zwinck, Commented Feb 10, 2019 at 7:02
@JohnZwinck No, I don't want that. I only want to match car with car, not cars. My actual dataset isn't in English anyway, and that kind of plural, as in cars, doesn't exist here.
– catris25, Commented Feb 10, 2019 at 7:10
Is there a specific reason why you use pandas rather than vanilla Python for something like this? There are good tools for these things in the standard library that can easily be used.
– ahed87, Commented Feb 10, 2019 at 7:42
@jezrael If you mean keywords like nice car and car, yes, both are possible. What do you mean by text in expected output? My expected output would be the scores for the appearance of those words in the keyword. I hope that's clear enough?
– catris25, Commented Feb 10, 2019 at 7:49
@ahed87 I use pandas because I load the data from csv or txt as a dataframe. Is that a bad idea? I am so used to using pandas to load data from documents like csv. What tools do you recommend?
– catris25, Commented Feb 10, 2019 at 7:50
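
The substring issue raised in the comments above can be seen in a small sketch. The sample sentence and variable names below are made up for illustration and are not from the question; the point is only that `in` and `str.count` match inside longer words, while a regex with `\b` word boundaries does not.

```python
import re

# Hypothetical sample text, chosen so that 'car' appears inside other words.
text = 'The scarecrow did not scare the driver of the car.'
keyword = 'car'

# Substring counting: also matches inside 'scarecrow' and 'scare'.
substring_hits = text.count(keyword)

# Whole-word counting: \b anchors the keyword at word boundaries.
pattern = re.compile(r'\b' + re.escape(keyword) + r'\b')
word_hits = len(pattern.findall(text))

print(substring_hits)  # 3
print(word_hits)       # 1
```

If only whole words should count, as stated in the comments, the second form (or splitting the text into words first) is probably the safer choice.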

ahed87 · Accepted Answer · 2019-02-10 13:03:08Z

Here are two alternatives that use only vanilla Python. First, the data of interest.

kwcsv = """topic, keywords
Vehicles, car, plane, motorcycle, bus
Electronic, television, radio, computer, smartphone
Fruits, apple, orange, grape
"""

test = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'
testr = test
from io import StringIO

StringIO is only used to make the examples runnable; it stands in for reading a file. Next, construct a kwords dictionary to use for counting.

import csv

kwords = dict()
#with open('your_file.csv') as mcsv:
mcsv = StringIO(kwcsv)
reader = csv.reader(mcsv, skipinitialspace=True)
next(reader, None) # skip header
for row in reader:
    kwords[row[0]] = tuple(row[1:])

Now the keywords to count are in a dictionary. The first alternative simply counts occurrences in the text string. Note that str.count matches substrings, so car would also be counted inside scare.

for r in '.,':  # strip punctuation before counting
    testr = testr.replace(r, '')

result = {k: sum((testr.count(w) for w in v)) for k, v in kwords.items()}

The second alternative uses a regex to split the text into words and a Counter to tally them.

import re
from collections import Counter

words = re.findall(r'\w+', StringIO(test).read().lower())
count = Counter(words)

result2 = {k: sum((count[w] for w in v)) for k, v in kwords.items()}

I'm not saying that either of these is better; they are just alternatives that use only vanilla Python. Personally, I would use the re/Counter version.
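
For reuse, the re/Counter version could be wrapped in two small functions. This is only a sketch: the names `load_keywords` and `score_topics` are hypothetical, and it assumes the same "topic, kw1, kw2, ..." row layout as the sample data above.

```python
import csv
import re
from collections import Counter
from io import StringIO


def load_keywords(fileobj):
    """Read 'topic, kw1, kw2, ...' rows into {topic: (kw1, kw2, ...)}."""
    reader = csv.reader(fileobj, skipinitialspace=True)
    next(reader, None)  # skip the header row
    return {row[0]: tuple(row[1:]) for row in reader if row}


def score_topics(text, kwords):
    """Count whole-word, case-insensitive keyword occurrences per topic."""
    counts = Counter(re.findall(r'\w+', text.lower()))
    return {topic: sum(counts[w.lower()] for w in kws)
            for topic, kws in kwords.items()}


kwcsv = """topic, keywords
Vehicles, car, plane, motorcycle, bus
Electronic, television, radio, computer, smartphone
Fruits, apple, orange, grape
"""
text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')

# StringIO stands in for an open file, as in the answer above.
result = score_topics(text, load_keywords(StringIO(kwcsv)))
print(result)  # {'Vehicles': 2, 'Electronic': 1, 'Fruits': 0}
```

With a real file, `load_keywords(open('your_file.csv'))` (inside a `with` block) would take the place of the StringIO call.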

jezrael · Accepted Answer · 2019-02-10 13:08:33Z

Extract the words with re.findall, convert them to lowercase, and put them in a set. Then, in a list comprehension, take the length of the intersection of that set with each keyword list:

import pandas as pd

df = pd.DataFrame({
    'topic': ['Vehicles', 'Electronic', 'Fruits'],
    'keywords': [['car', 'plane', 'motorcycle', 'bus'],
                 ['television', 'radio', 'computer', 'smartphone'],
                 ['apple', 'orange', 'grape']],
})

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

import re
s = set(x.lower() for x in re.findall(r'\b\w+\b', text))
print (s)
{'go', 'motorcycle', 'a', 'car', 'my', 'the', 'got', 
 'message', 'to', 'home', 'went', 'riding', 'checked', 
 'i', 'showroom', 'when', 'buy', 'smartphone', 'today', 'unluckily'}

df['score'] = [len(s & set(x)) for x in df['keywords']]
print (df)
        topic                                   keywords  score
0    Vehicles              [car, plane, motorcycle, bus]      2
1  Electronic  [television, radio, computer, smartphone]      1
2      Fruits                     [apple, orange, grape]      0

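An alternative is to sum the True values in a list comprehension. Test membership against the word set s rather than text.split(), because splitting on whitespace leaves punctuation attached (smartphone, with its trailing comma would not match smartphone):

df['score'] = [sum(z in s for z in x) for x in df['keywords']]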
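
One difference between the set-based approach here and a Counter-based approach is worth a quick sketch: a set counts each keyword at most once, while a Counter counts every occurrence. The sample sentence below is made up for illustration; which behavior is wanted depends on the use case.

```python
import re
from collections import Counter

# Hypothetical text in which one keyword is repeated.
text = 'A car passed a car, then a bus passed a car.'
keywords = ['car', 'plane', 'motorcycle', 'bus']

words = re.findall(r'\w+', text.lower())

# Set intersection: each keyword counts at most once.
distinct = len(set(words) & set(keywords))

# Counter: every occurrence counts.
counts = Counter(words)
occurrences = sum(counts[k] for k in keywords)

print(distinct)     # 2
print(occurrences)  # 4
```

For the sample text in the question both approaches give the same scores, because no keyword appears more than once.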

edited Feb 10, 2019 at 13:08

answered Feb 10, 2019 at 8:37


10 Comments

catris25 Over a year ago

Thanks, but when I run this code I get 1 for Vehicles, 2 for Electronic, and 1 for Fruits, and I don't know why.

jezrael Over a year ago

@AnnaRG I added a sample DataFrame; can you check it? It is really weird that Fruits is 1.

catris25 Over a year ago

Yeah, I have run your code and get Fruits as 1, which doesn't make sense, because none of the keywords of Fruits is found in the text.

jezrael Over a year ago

@AnnaRG So the keywords column holds strings rather than lists. Use df['keywords'] = df['keywords'].str.strip('[]').str.split(', ') or import ast and df['keywords'] = df['keywords'].apply(ast.literal_eval)

jezrael Over a year ago

@AnnaRG So it is like df['score'] = [sum(z in text for z in x) for x in df['keywords']]
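
The odd scores reported in these comments (1, 2, 1) are what you would expect if each keywords cell is a single string rather than a list. The sketch below reproduces that in plain Python, without pandas. The `raw` list is an assumption about how the CSV cells looked, based on the table in the question.

```python
import re

text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')
words = set(w.lower() for w in re.findall(r'\w+', text))

# Assumed shape of the keywords column after loading from CSV:
# one string per row, not a list.
raw = ['[car, plane, motorcycle, bus]',
       '[television, radio, computer, smartphone]',
       '[apple, orange, grape]']

# set() on a string yields its characters, so only the one-letter
# words 'a' and 'i' from the text can match.
char_scores = [len(words & set(r)) for r in raw]

# Parse each string into a real list first (plain-Python equivalent of
# df['keywords'].str.strip('[]').str.split(', ')).
parsed = [r.strip('[]').split(', ') for r in raw]
word_scores = [len(words & set(p)) for p in parsed]

print(char_scores)  # [1, 2, 1]
print(word_scores)  # [2, 1, 0]
```

Note that ast.literal_eval only parses cells whose items are quoted, such as "['car', 'plane']". For unquoted cells like the ones assumed here, the strip/split route is the one that applies.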
