I have the following pandas DataFrame in Python 3.x, with two numeric columns and two string columns:
import numpy as np
import pandas as pd
# Named dict1 so that the built-in dict is not shadowed
dict1 = {"numericvals": np.repeat(25, 8),
         "numeric": np.repeat(42, 8),
         "first": ["beneficiary, duke", "compose", "herd primary", "stall", "deep", "regular summary classify", "timber", "property"],
         "second": ["abcde", "abcde", "abcde", "abcde", "abcde", "abcde", "abcde", "abcde"]}
df = pd.DataFrame(dict1)
df = df[['numeric', 'numericvals', 'first', 'second']]
print(df)
   numeric  numericvals                     first  second
0       42           25         beneficiary, duke   abcde
1       42           25                   compose   abcde
2       42           25              herd primary   abcde
3       42           25                     stall   abcde
4       42           25                      deep   abcde
5       42           25  regular summary classify   abcde
6       42           25                    timber   abcde
7       42           25                  property   abcde
The column first contains one or more words per row. If there is more than one, they are separated by either a space or a comma.
My goal is to create columns recording the lengths of the words in first that are longer or shorter than the string in second. If a word is the same length as the string in second, it should be ignored.
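For the splitting step, here is a minimal sketch, assuming commas and whitespace are the only separators that occur. The helper name split_words is hypothetical and only used for illustration:

```python
import re

def split_words(text):
    # Split on any run of commas and/or whitespace, then drop empty pieces
    # (an empty piece can appear if the text starts or ends with a separator).
    return [word for word in re.split(r"[,\s]+", text) if word]

print(split_words("beneficiary, duke"))
print(split_words("regular summary classify"))
```

Each row of first then becomes a list of words whose lengths can be compared one by one against the string in second.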
My idea would be to create two lists:
longer = []
shorter = []
If a string in first is longer than the string in second, append its length (via len()) to longer. If it is shorter, append its length to shorter.
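The idea above could be sketched per row roughly as follows. This is only one possible approach, not necessarily the most idiomatic pandas one. The helper names compare_lengths and as_text are hypothetical, and rendering an empty result as "0" is an assumption taken from the desired output:

```python
import re
import pandas as pd

def compare_lengths(first, second):
    """Return (longer, shorter): word lengths in `first` relative to len(second)."""
    ref = len(second)
    # Assumes words are separated by commas and/or whitespace.
    lengths = [len(word) for word in re.split(r"[,\s]+", first) if word]
    longer = [n for n in lengths if n > ref]
    shorter = [n for n in lengths if n < ref]
    return longer, shorter  # words of equal length are ignored

def as_text(lengths):
    # Assumed formatting: "0" when nothing qualifies, otherwise e.g. "7, 7, 8".
    return ", ".join(map(str, lengths)) if lengths else "0"

# Small subset of the data, for illustration only
df = pd.DataFrame({
    "first": ["beneficiary, duke", "stall", "regular summary classify"],
    "second": ["abcde"] * 3,
})

pairs = [compare_lengths(f, s) for f, s in zip(df["first"], df["second"])]
df["longer"] = [as_text(longer) for longer, _ in pairs]
df["shorter"] = [as_text(shorter) for _, shorter in pairs]
print(df)
```

Storing the lengths as comma-separated text matches the desired output, but it makes the new columns strings. Keeping the lists themselves in the columns would be an alternative if further computation is needed.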
Here is what the result should look like (as a pandas DataFrame):
   numericvals  numeric                     first  second   longer  shorter
0           25       42         beneficiary, duke   abcde       11        4
1           25       42                   compose   abcde        7        0
2           25       42              herd primary   abcde        7        4
3           25       42                     stall   abcde        0        0
4           25       42                      deep   abcde        0        4
5           25       42  regular summary classify   abcde  7, 7, 8        0
6           25       42                    timber   abcde        6        0
7           25       42                  property   abcde        8        0
I don't know how to deal with multiple strings in first, especially when there are three. How should one do this comparison in pandas?