I have a DataFrame that looks like this:

   done    sentence                        3_tags
0  0       ['What', 'were', 'the', '...]   ['WP', 'VBD', 'DT']
1  0       ['What', 'was', 'the', '...]    ['WP', 'VBD', 'DT']
2  0       ['Why', 'did', 'John', '...]    ['WP', 'VBD', 'NN']
...

For each row, I want to check whether the list in column '3_tags' is in a list temp1, as follows:

import pandas as pd

a = pd.read_csv('sentences.csv')
temp1 = [ ['WP', 'VBD', 'DT'], ['WRB', 'JJ', 'VBZ'], ['WP', 'VBD', 'DT'] ]
q = a['3_tags']
q in temp1

For the first sentence (row 0), the value of '3_tags' is ['WP', 'VBD', 'DT'], which is in temp1, so I expect the result of the above to be:

True

However, I get this error:

ValueError: Arrays were different lengths: 1 vs 3

I suspect that there is some problem with the datatype of q:

print(type(q))
<class 'pandas.core.series.Series'>

Is the problem that q is a Series and temp1 contains lists? What should I do to get the logical result True?

1 Answer

You want those lists to be tuples instead, because lists are not hashable and tuples are.
Then use pd.Series.isin:

temp1 = list(map(tuple, temp1))

q = a['3_tags'].apply(tuple)

q.isin(temp1)

0     True
1     True
2    False
Name: 3_tags, dtype: bool
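
If isin is not an option, the same check can be sketched row by row. This is only an illustration under assumptions, not a replacement for the code above: it rebuilds a tiny frame shaped like the one in the question, and the names allowed, mask and result are made up here.

```python
import pandas as pd

# Small stand-in for the question's data; '3_tags' holds real lists here.
a = pd.DataFrame({
    'done': [0, 0, 0],
    '3_tags': [['WP', 'VBD', 'DT'], ['WP', 'VBD', 'DT'], ['WP', 'VBD', 'NN']],
})
temp1 = [['WP', 'VBD', 'DT'], ['WRB', 'JJ', 'VBZ'], ['WP', 'VBD', 'DT']]

# Tuples are hashable, so they can go in a set (the duplicate entry collapses).
allowed = {tuple(tags) for tags in temp1}

# Test each row's tags for membership in the set.
mask = a['3_tags'].map(lambda tags: tuple(tags) in allowed)
result = mask.tolist()
print(result)
```

The set lookup keeps each row's check cheap, which may matter on a very large DataFrame; mask can be used directly for filtering, as in a[mask].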

However, it appears that the '3_tags' column consists of strings that look like lists. In that case, parse them with ast.literal_eval before converting to tuples:

from ast import literal_eval

temp1 = list(map(tuple, temp1))

q = a['3_tags'].apply(lambda x: tuple(literal_eval(x)))

q.isin(temp1)

0     True
1     True
2    False
Name: 3_tags, dtype: bool
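
To see why the string case needs parsing, here is a minimal sketch using only the standard library. The sample cell value raw is an assumption about what the column contains, and to_tag_tuple is a hypothetical helper name, not part of pandas.

```python
from ast import literal_eval

# Assumed cell content: a string that merely looks like a list.
raw = "['WP', 'VBD', 'DT']"

# tuple() on a string iterates over its characters, not over the tags.
chars = tuple(raw)

def to_tag_tuple(value):
    """Return the tags as a tuple, parsing the value first if it is a string."""
    if isinstance(value, str):
        value = literal_eval(value)
    return tuple(value)

parsed = to_tag_tuple(raw)
print(chars[:3])   # the first three characters of the string
print(parsed)      # the three tags
```

A helper like this could be passed to a['3_tags'].apply so that the same code handles both real lists and their string form.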

Setup 1 ('3_tags' holds real lists)

import pandas as pd

a = pd.DataFrame({
    'done': [0, 0, 0],
    'sentence': list(map(str.split, ('What were the', 'What was the', 'Why did John'))),
    '3_tags': list(map(str.split, ('WP VBD DT', 'WP VBD DT', 'WP VBD NN')))
}, columns='done sentence 3_tags'.split())

temp1 = [['WP', 'VBD', 'DT'], ['WRB', 'JJ', 'VBZ'], ['WP', 'VBD', 'DT']]

Setup 2 ('3_tags' holds strings that look like lists)

import pandas as pd

a = pd.DataFrame({
    'done': [0, 0, 0],
    'sentence': list(map(str.split, ('What were the', 'What was the', 'Why did John'))),
    '3_tags': list(map(str, map(str.split, ('WP VBD DT', 'WP VBD DT', 'WP VBD NN'))))
}, columns='done sentence 3_tags'.split())

temp1 = [['WP', 'VBD', 'DT'], ['WRB', 'JJ', 'VBZ'], ['WP', 'VBD', 'DT']]

11 Comments

In the Setup, I do not understand how to prepare my (very large) DataFrame the way you show. How do I convert it to tuples?
The setup is to produce the variables a and temp1 as you had them. You shouldn't have to do anything. That is for others who may want to test it out. You just need to use the code in the top portion.
Thanks, got it. When I use the top portion it gives another error: ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
When I do q = a['3_tags'].apply(tuple) and then print(q), I get: ([, ', D, T, ', ,, , ', N, N, ', ,, , ', I, ...
That means your data frame is all messed up. In your post it looks like those elements in '3_tags' are lists when they are strings that look like lists. I'll update my post to account for that. In fact, if you are able, you should provide a method to reproduce exactly what your data is.