
Let's say I have a string stored in text. I want to compare this string against lists of keywords stored in a dataframe and check whether the text contains words like car, plane, etc. For each keyword found, I want to add 1 to the score of the corresponding topic.

| topic      | keywords                                  |
|------------|-------------------------------------------|
| Vehicles   | [car, plane, motorcycle, bus]             |
| Electronic | [television, radio, computer, smartphone] |
| Fruits     | [apple, orange, grape]                    |

I have written the following code, but I don't really like it. And it doesn't work as intended.

def foo(text, df_lex):

    keyword = []
    score = []
    for lex_list in df_lex['keyword']:
        print(lex_list)
        val = 0
        for lex in lex_list:

            if lex in text:
                val =+ 1
        keyword.append(key)
        score.append(val)
    score_list = pd.DataFrame({
    'keyword':keyword,
    'score':score
    })

Is there a way to do this efficiently? I don't like having too many loops in my program, as they don't seem to be very efficient. I will elaborate if needed. Thank you.

EDIT: For example my text is like this. I made it simple, just so it's understandable.

I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.

So, my expected output would be something like this.

| topic      | score |
|------------|-------|
| Vehicles   | 2     |
| Electronic | 1     |
| Fruits     | 0     |

EDIT2: I finally found my own solution with some help from @jezrael.

df['keywords'] = df['keywords'].str.strip('[]').str.split(', ')

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

score_list = []
for lex in df['keywords']:
    val = 0
    for w in lex:
        if w in text:
            val +=1
    score_list.append(val)
df['score'] = score_list
print(df)

And it prints exactly what I need.

  • Your current code thinks "scare" matches "car". Do you want that behavior or not? If not, do you want "cars" to match "car" (stemming), or is that not important? Commented Feb 10, 2019 at 7:02
  • @JohnZwinck No, I don't want that. I only want to match "car" with "car", not "cars". My actual dataset isn't in English anyway, and that kind of plural doesn't exist in it. Commented Feb 10, 2019 at 7:10
  • Is there a specific reason why you use pandas rather than vanilla Python for something like this? There are good tools for this in the standard library that are easy to use. Commented Feb 10, 2019 at 7:42
  • @jezrael If you mean keywords like "nice car" and "car", then yes, both are possible. What do you mean by text in the expected output? My expected output would be the scores for the appearances of the keywords in the text. I hope that's clear enough. Commented Feb 10, 2019 at 7:49
  • @ahed87 I use pandas because I load the data from csv or txt as a dataframe. Is that a bad idea? I am used to loading data from documents like csv with pandas. What tools do you recommend? Commented Feb 10, 2019 at 7:50

2 Answers


Here are two alternative approaches that use only vanilla Python. First, the data of interest.

kwcsv = """topic, keywords
Vehicles, car, plane, motorcycle, bus
Electronic, television, radio, computer, smartphone
Fruits, apple, orange, grape
"""

test = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'
testr = test
from io import StringIO

StringIO is only used to make the example runnable; it stands in for reading a file. Next, build a kwords dictionary that maps each topic to its keywords.

import csv

kwords = dict()
#with open('your_file.csv') as mcsv:
mcsv = StringIO(kwcsv)
reader = csv.reader(mcsv, skipinitialspace=True)
next(reader, None) # skip header
for row in reader:
    kwords[row[0]] = tuple(row[1:])

Now the dictionary holds what we need to count. The first alternative simply counts occurrences of each keyword in the text string.

for r in '.,':  # strip punctuation from the text before counting
    testr = testr.replace(r, '')

result = {k: sum(testr.count(w) for w in v) for k, v in kwords.items()}

Or another version using regex for splitting strings and Counter.

import re
from collections import Counter

words = re.findall(r'\w+', StringIO(test).read().lower())
count = Counter(words)

result2 = {k: sum((count[w] for w in v)) for k, v in kwords.items()}

I'm not saying that either of these is better; they are just alternatives that use only vanilla Python. Personally, I would use the re/Counter version.
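The two versions do not always agree, which may matter given the "scare"/"car" concern raised in the comments on the question. The sketch below uses a made-up sample sentence and keyword tuple (both are assumptions for illustration, not part of the original data) to show that str.count matches substrings while the Counter version matches whole words only.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen so that "scare" contains the keyword "car".
sample = "The scare made me sell my car. The car was old."
keywords = ("car", "bus")

# Substring counting: "scare" also contributes a hit for "car".
substring_hits = sum(sample.lower().count(w) for w in keywords)

# Token counting: the text is split into words first, so only whole words count.
tokens = Counter(re.findall(r"\w+", sample.lower()))
token_hits = sum(tokens[w] for w in keywords)

print(substring_hits, token_hits)
```

Both versions count every occurrence, so a keyword that appears twice adds 2 to its topic. If each keyword should add at most 1, comparing against a set of tokens instead of a Counter is one way to get that.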



Extract the words with re.findall, convert them to lowercase and collect them in a set. Then, in a list comprehension, take the size of the intersection between that set and each keyword list:

import pandas as pd

df = pd.DataFrame({
    'topic': ['Vehicles', 'Electronic', 'Fruits'],
    'keywords': [['car', 'plane', 'motorcycle', 'bus'],
                 ['television', 'radio', 'computer', 'smartphone'],
                 ['apple', 'orange', 'grape']],
})

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

import re
s = set(x.lower() for x in re.findall(r'\b\w+\b', text))
print(s)
{'go', 'motorcycle', 'a', 'car', 'my', 'the', 'got', 
 'message', 'to', 'home', 'went', 'riding', 'checked', 
 'i', 'showroom', 'when', 'buy', 'smartphone', 'today', 'unluckily'}

df['score'] = [len(s & set(x)) for x in df['keywords']]
print(df)
        topic                                   keywords  score
0    Vehicles              [car, plane, motorcycle, bus]      2
1  Electronic  [television, radio, computer, smartphone]      1
2      Fruits                     [apple, orange, grape]      0

An alternative is to sum the True values in a list comprehension:

df['score'] = [sum(z in text.split() for z in x) for x in df['keywords']]
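One caveat with whitespace splitting is that punctuation stays attached to the words, so 'smartphone,' in the sample text is not equal to 'smartphone'. The following sketch compares whitespace tokens with regex tokens on the same text. The plain dict named keywords is a stand-in for the dataframe column and is an assumption made to keep the example self-contained.

```python
import re

text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')

# Stand-in for df['topic'] / df['keywords'] (assumed structure for illustration).
keywords = {
    'Vehicles': ['car', 'plane', 'motorcycle', 'bus'],
    'Electronic': ['television', 'radio', 'computer', 'smartphone'],
    'Fruits': ['apple', 'orange', 'grape'],
}

# Whitespace splitting keeps punctuation attached: 'smartphone,' != 'smartphone'.
ws_tokens = set(text.split())
ws_scores = {t: sum(k in ws_tokens for k in kws) for t, kws in keywords.items()}

# Regex tokenization drops punctuation; lowercasing makes the match case-insensitive.
re_tokens = set(re.findall(r'\w+', text.lower()))
re_scores = {t: sum(k in re_tokens for k in kws) for t, kws in keywords.items()}

print(ws_scores)
print(re_scores)
```

Building the token set once, outside the comprehension, also avoids splitting the text again for every keyword.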

10 Comments

Thanks. But I don't know why, when I run this code, I get 1 on Vehicles, 2 on Electronic, and 1 on Fruits.
@AnnaRG - I added a sample DataFrame, can you check it? It is really weird that Fruits is 1.
Yeah, I have run your code and get Fruits as 1, which doesn't make sense, because none of the Fruits keywords is found in the text.
@AnnaRG So use df['keywords'] = df['keywords'].str.strip('[]').str.split(', ') or import ast and df['keywords'] = df['keywords'].apply(ast.literal_eval)
@AnnaRG So it is like df['score'] = [sum(z in text for z in x) for x in df['keywords']]
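A plausible explanation for the 1, 2, 1 result discussed above, assuming the keywords column was loaded from CSV as plain strings rather than lists: calling set() on a string yields its characters, so only the one-letter words in the text ('a' and 'i') can match. The sketch below reproduces this with a plain dict named raw, which is a hypothetical stand-in for the string-typed column.

```python
import re

text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')
words = {w.lower() for w in re.findall(r'\b\w+\b', text)}

# Assumption: each keyword list was stored as text, e.g. "[car, plane, ...]".
raw = {
    'Vehicles': '[car, plane, motorcycle, bus]',
    'Electronic': '[television, radio, computer, smartphone]',
    'Fruits': '[apple, orange, grape]',
}

# set() on a string gives single characters, so only 'a' and 'i' can intersect.
char_scores = {t: len(words & set(s)) for t, s in raw.items()}

# Parsing each string into a real list first restores the intended behaviour.
parsed = {t: s.strip('[]').split(', ') for t, s in raw.items()}
list_scores = {t: len(words & set(kws)) for t, kws in parsed.items()}

print(char_scores)
print(list_scores)
```

Of the two fixes suggested in the comments, ast.literal_eval only applies when the stored text is a valid Python literal with quoted items, such as "['car', 'plane']". For unquoted items like the ones assumed here, it raises an exception, and the strip/split approach is the one that works.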
