Suppose I have a list of strings:

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1','H1 K1 W1']

I want to build a table by searching each string for specific values (where present) and using the matches to populate a pandas DataFrame.

Like so:

          A      B      C      D
string1   A1     B1     C1     NaN
string2   A2     B2     NaN    D1
string3   A3     NaN    NaN    NaN
string4   A4     B3     NaN    NaN
string5   NaN    NaN    NaN    NaN

To search within each string, I split it into a list of tokens, which gives me a nested list that I can loop over. My regex skills are not strong, but I suspect this could also be done with a regular expression.

My current code:

import pandas as pd
lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1','H1 K1 W1']
modlst1 = []
for each in lst1:
    modlst1.append(each.split())

rows = range(len(modlst1)) ### rows for each string
cols = ['A','B','C','D']   ### cols for each string
df = pd.DataFrame(index = rows, columns = cols)
df = df.fillna(0)

### Populating values
for each in rows:
    for stuff in modlst1[each]:
        if stuff.startswith('A'):
           df['A'] = stuff
        elif stuff.startswith('B'):
           df['B'] = stuff
        elif stuff.startswith('C'):
           df['C'] = stuff
        elif stuff.startswith('D'):
           df['D'] = stuff

I'm very new to Python and still learning string manipulation and searching, so I'm sure there is a better way to do this. My solution does not work: when I write the values into the DataFrame, the same values end up in every row. But when I do:

        if stuff.startswith('A'):
           print(stuff)

the loop runs fine and prints the different values for "A", "B", "C" and "D". For example, this is the kind of DataFrame I end up with (which I do NOT want):

          A      B      C      D
string1   A1     B1     C1     NaN
string2   A1     B1     C1     D1
string3   A1     B1     C1     D1
string4   A1     B1     C1     D1
string5   A1     B1     C1     D1

1 Answer

Here is a way to do it:

import pandas as pd

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1', 'H1 K1 W1']

cols = ['A', 'B', 'C', 'D']   # column names, also the prefixes to look for

# Build one dict per string, keeping only tokens whose first letter is a column name
rows = []
for elt in lst1:
    new = {}
    for sub_elt in elt.split(" "):
        if sub_elt[0] in cols:
            new[sub_elt[0]] = sub_elt
    rows.append(new)

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(rows, columns=cols)
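
As for why the loop in the question fills every row with the same values: `df['A'] = stuff` assigns the scalar to the entire column, so each assignment overwrites all rows and every row ends up holding whichever value was assigned last. A minimal sketch of the same nested loop that writes to a single cell with `df.loc[row, col]` instead (the names `row`, `text` and `token` are illustrative, not from the question):

```python
import pandas as pd

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1', 'H1 K1 W1']
cols = ['A', 'B', 'C', 'D']

# Start from an all-NaN frame with one row per input string.
df = pd.DataFrame(index=range(len(lst1)), columns=cols)

for row, text in enumerate(lst1):
    for token in text.split():
        for col in cols:
            if token.startswith(col):
                # .loc[row, col] targets one cell, whereas
                # df[col] = token would overwrite the whole column.
                df.loc[row, col] = token

print(df)
```

Cell-by-cell assignment is fine for a handful of rows; for larger inputs, collecting the rows first and building the frame once is usually the cheaper option.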

Feel free to ask if any part is unclear.
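
Since the question mentions regular expressions, here is a rough sketch of a regex-based alternative using `Series.str.extract`. It assumes each value of interest is a whitespace-delimited token beginning with the column letter, and that only the first such token per string matters; the pattern and the variable name `s` are my own choices, not something taken from the question.

```python
import pandas as pd

lst1 = ['A1 B1 C1', 'A2 B2 D1', 'S1 M1 A3', 'A4 B3 G1', 'H1 K1 W1']
cols = ['A', 'B', 'C', 'D']

s = pd.Series(lst1)

# For each column letter, capture the first token that starts with that
# letter; strings without such a token yield NaN.
df = pd.DataFrame(
    {c: s.str.extract(rf'(?:^|\s)({c}\S*)', expand=False) for c in cols}
)

print(df)
```

The `(?:^|\s)` prefix keeps the match anchored to the start of a token, so a letter in the middle of another token is not picked up.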

3 Comments

Thank you. However, when I apply it to my actual data I get a traceback on the line `if sub_elt[0] in cols:` with `IndexError: string index out of range`. My data is exactly in this format, so I can't figure out the reason for this error.
Hard to tell without the data. Maybe you have a leading space before your first element?
I thought that would be it, but when I altered the data to test that theory I got the same error. Could you explain what `if sub_elt[0] in cols:` is actually doing?
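
To unpack that line: `sub_elt[0]` is the first character of the token, and `in cols` tests whether that character is one of the column names. One way the `IndexError` could arise (an assumption, since the real data was not posted) is an empty token: `split(" ")` produces empty strings wherever there are leading, trailing, or repeated spaces, and indexing an empty string fails, whereas `split()` with no argument discards the extra whitespace. A small illustration with a made-up input string:

```python
cols = ['A', 'B', 'C', 'D']

# Hypothetical messy input: a leading space and a double space.
messy = ' A1  B1 C1'

with_sep = messy.split(' ')   # keeps empty strings between separators
no_sep = messy.split()        # splits on runs of whitespace, no empties

print(with_sep)  # ['', 'A1', '', 'B1', 'C1']
print(no_sep)    # ['A1', 'B1', 'C1']

# Indexing the empty token is what raises the error.
try:
    with_sep[0][0]
    raised = False
except IndexError:
    raised = True

# With split(), every token has at least one character, so [0] is safe.
matched = [tok for tok in no_sep if tok[0] in cols]
print(matched)
```

If the separator has to stay explicit, `tok.startswith(tuple(cols))` is another way to avoid indexing an empty string.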
