Populating pandas dataframe by searching data in substring of string in a list

Question

Suppose a list of strings:

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1','H1 K1 W1']

I want to create a table by searching each string for specific values (if present) and then populating a pandas dataframe with them.

Like so:

          'A'    'B'    'C'    'D'
string1   A1     B1     C1     NaN
string2   A2     B2     NaN    D1
string3   A3     NaN    NaN    NaN
string4   A4     B3     NaN    NaN
string5   NaN    NaN    NaN    NaN

To search within each string, I split each one into a list of tokens, which gives a nested list that I can loop over. My regex skills are not strong, but I suspect this could also be done with a regular expression.

My current code:

import pandas as pd
lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1','H1 K1 W1']
modlst1 = []
for each in lst1:
    modlst1.append(each.split())

rows = range(len(modlst1)) ### rows for each string
cols = ['A','B','C','D']   ### cols for each string
df = pd.DataFrame(index = rows, columns = cols)
df = df.fillna(0)

### Populating values
for each in rows:
    for stuff in modlst1[each]:
        if stuff.startswith('A'):
           df['A'] = stuff
        elif stuff.startswith('B'):
           df['B'] = stuff
        elif stuff.startswith('C'):
           df['C'] = stuff
        elif stuff.startswith('D'):
           df['D'] = stuff

I'm very new to Python, so I am still learning string manipulation and searching. I am sure there is a better way to do this. My solution is not working: the same values keep populating every row when I try to put them in the dataframe. But when I do:

        if stuff.startswith('A'):
           print(stuff)

the loop runs fine and I get the different values for "A", "B", "C" and "D". For example, this is what ends up in the dataframe (I DON'T WANT THIS):

          'A'    'B'    'C'    'D'
string1   A1     B1     C1     NaN
string2   A1     B1     C1     D1
string3   A1     B1     C1     D1
string4   A1     B1     C1     D1
string5   A1     B1     C1     D1

Julien Roullé · Accepted Answer · 2020-05-22 17:12:04Z

Here is a way to do it:

import pandas as pd

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1', 'H1 K1 W1']

cols = ['A', 'B', 'C', 'D']   ### cols for each string

### Populating values: build one dict per string, keyed by first letter
records = []
for elt in lst1:
    new = {}
    for sub_elt in elt.split(" "):
        if sub_elt[0] in cols:
            new[sub_elt[0]] = sub_elt
    records.append(new)

### DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(records, columns=cols)
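
As for why the loop in the question fills every row with the same value: an assignment such as df['A'] = stuff sets the whole column to that one string, so each pass overwrites all rows. A minimal sketch of the same loop writing one cell at a time with DataFrame.at is shown below; the names row, text and token are only illustrative.

```python
import pandas as pd

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1', 'H1 K1 W1']
cols = ['A', 'B', 'C', 'D']

# Start from an all-NaN frame; object dtype so string cells can be stored
df = pd.DataFrame(index=range(len(lst1)), columns=cols, dtype=object)

for row, text in enumerate(lst1):
    # split() with no argument never yields empty tokens
    for token in text.split():
        if token[0] in cols:
            # .at targets a single (row, column) cell;
            # df[token[0]] = token would overwrite the entire column
            df.at[row, token[0]] = token
```

Cells that never receive a value stay NaN, which matches the table requested in the question.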

Feel free to ask if some part is unclear
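
Since the question mentions regular expressions, here is one possible alternative using Series.str.extract. It is only a sketch, and it assumes each wanted value is a whitespace-separated token that starts with the column letter; the pattern itself is my assumption, not something given in the question.

```python
import re

import pandas as pd

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1', 'H1 K1 W1']
cols = ['A', 'B', 'C', 'D']

s = pd.Series(lst1)

# For each column letter, pull out the first token that starts with it.
# \b anchors the match at the start of a token; rows without a match get NaN.
df = pd.DataFrame({
    c: s.str.extract(rf'\b({re.escape(c)}\w*)', expand=False)
    for c in cols
})
```

If a string contains several tokens with the same first letter, str.extract keeps only the first one.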


3 Comments

Adheesh Saxena Over a year ago

Thank you. However, when I apply it to my actual data I get a traceback on the line `if sub_elt[0] in cols:` with `IndexError: string index out of range`. My data is exactly in this format, so I can't figure out the reason behind this error.

Julien Roullé Over a year ago

Hard to tell without the data. Maybe you have a stray space before your first element?

Adheesh Saxena Over a year ago

I thought that would be it, but when I modified the data to test that theory I still got the same error. Can you perhaps explain what `if sub_elt[0] in cols:` is essentially doing?
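
To illustrate the line being asked about: sub_elt[0] is the first character of a token, and `in cols` checks whether that character is one of the column names. Indexing an empty string raises IndexError, and split(" ") produces empty strings wherever there is a leading, trailing or doubled space. The sketch below uses a made-up input string and a hypothetical helper, first_letter_matches, to show the difference; it is a guess at the cause, not a diagnosis of the actual data.

```python
cols = ['A', 'B', 'C', 'D']

# A doubled space makes split(" ") emit an empty string between the spaces
tokens_single = 'A1  B1 C1'.split(" ")   # ['A1', '', 'B1', 'C1']

# split() with no argument collapses runs of whitespace, so no empty tokens
tokens_any = 'A1  B1 C1'.split()         # ['A1', 'B1', 'C1']


def first_letter_matches(token, cols):
    """Return True if the token's first character is one of the column names."""
    # token[:1] is '' for an empty token instead of raising IndexError
    return token[:1] in cols


# Safe even when the token list contains empty strings
matches = [t for t in tokens_single if first_letter_matches(t, cols)]
```

Either change on its own, calling split() without an argument or slicing with [:1], should avoid the IndexError if empty tokens are indeed the cause.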
