Parsing List Elements into Multiple Lists in Python

Question

I have managed to pull a list from a data source. The list elements are formatted like this (note the first number is NOT the index):

0                   cheese    100
1                   cheddar cheese    1100
2                   gorgonzola    1300
3                   smoked cheese    200

etc.

This means when printed, one line contains "0 cheese 100", with all the spaces.

What I would like to do is parse each entry to divide it into two lists. I don't need the first number. Instead, I want the cheese type and the number after.

For instance:

cheese
cheddar cheese
gorgonzola
smoked cheese

and:

100
1100
1300
200

The ultimate goal is to be able to attribute the two lists to columns in a pd.DataFrame so they can be processed in their own individual way.

Any help is much appreciated.

Mark · Accepted Answer · 2022-10-23 00:41:59Z

2

If the goal is a DataFrame, why not just build that rather than the two lists? If you turn your string into a Series, you can use pandas.Series.str.extract() to split it into the columns you want:

import pandas as pd

s = '''0                   cheese    100
1                   cheddar cheese    1100
2                   gorgonzola    1300
3                   smoked cheese    200'''

pd.Series(s.split('\n')).str.extract(r'.*?\s+(?P<type>.*?)\s+(?P<value>\d+)')

This gives a DataFrame:

    type             value
0   cheese           100
1   cheddar cheese   1100
2   gorgonzola       1300
3   smoked cheese    200
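Note that str.extract() returns strings, so the value column is text at this point. A minimal sketch of converting it to integers and, if you still want them, pulling out the two lists (the names cheese_types and values are just illustrative):

```python
import pandas as pd

lines = [
    "0                   cheese    100",
    "1                   cheddar cheese    1100",
    "2                   gorgonzola    1300",
    "3                   smoked cheese    200",
]

# Same pattern as above: skip the leading number, capture the name and the trailing digits.
df = pd.Series(lines).str.extract(r'.*?\s+(?P<type>.*?)\s+(?P<value>\d+)')

# The captured groups are strings; convert the numeric column explicitly.
df['value'] = df['value'].astype(int)

# Plain Python lists, in case they are needed outside the DataFrame.
cheese_types = df['type'].tolist()
values = df['value'].tolist()
```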

edited Oct 23, 2022 at 0:41

answered Oct 22, 2022 at 21:03

Mark


1 Comment

BeRT2me Over a year ago

Also, for a pd.Series.str solution, personally I'd use .str.split(r'\s\s+', expand=True) and drop the first column.
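A rough sketch of what that comment describes, assuming the entries are already in a list (the column names type and value are chosen here for illustration):

```python
import pandas as pd

lines = [
    "0                   cheese    100",
    "1                   cheddar cheese    1100",
    "2                   gorgonzola    1300",
    "3                   smoked cheese    200",
]

# Split on runs of two or more whitespace characters; expand=True yields a DataFrame
# with integer column labels 0, 1, 2.
parts = pd.Series(lines).str.split(r'\s\s+', expand=True)

# Column 0 holds the leading number, which is not needed.
df = parts.drop(columns=0)
df.columns = ['type', 'value']
```

Single spaces inside a name such as "cheddar cheese" are left alone because the pattern requires at least two whitespace characters.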

Tranbi · Accepted Answer · 2022-10-22 21:03:20Z

1

IIUC your strings are elements of a list. You can use re.split to split where two or more spaces are found:

import re
import pandas as pd

your_list = [
  "0                   cheese    100",
  "1                   cheddar cheese    1100",
  "2                   gorgonzola    1300",
  "3                   smoked cheese    200",
]

df = pd.DataFrame([re.split(r'\s{2,}', s)[1:] for s in your_list], columns=["type", "value"])

Output:

             type value
0          cheese   100
1  cheddar cheese  1100
2      gorgonzola  1300
3   smoked cheese   200

answered Oct 22, 2022 at 21:03

Tranbi


Luis · Accepted Answer · 2022-10-22 21:10:59Z

1

I think something along these lines might work:

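import re
import pandas as pd

mylist = ['0 cheese 100', '1 cheddar cheese 200']

numbers = '[0-9]'

# The last whitespace-separated token is the trailing number.
list1 = [i.split()[-1] for i in mylist]
# Removing every digit leaves the name (this also strips any digits inside a name).
list2 = [re.sub(numbers, '', i).strip() for i in mylist]

your_df = pd.DataFrame({'name1': list1, 'name2': list2})
your_df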

edited Oct 22, 2022 at 21:10

answered Oct 22, 2022 at 20:57

Luis

2 Comments

Mark Over a year ago

You conveniently left out data with spaces like cheddar cheese. What happens with those?

Luis Over a year ago

Yeah sorry, I missed those. I have edited my answer now. If the structure is always like that, a regex might help you eliminate the numbers from the string.
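One caveat with deleting every digit is that it would also remove digits that belong to a name. A possible alternative, sketched here with an anchored pattern that only captures the trailing number (the third entry is made-up data to show the difference):

```python
import re

entries = ['0 cheese 100', '1 cheddar cheese 200', '2 pepper jack 7 300']

# Leading number, then the name (lazy), then the final run of digits at the end of the line.
pattern = re.compile(r'^\d+\s+(.*?)\s+(\d+)\s*$')

names, values = [], []
for entry in entries:
    match = pattern.match(entry)
    if match is None:
        continue  # skip lines that do not follow the expected layout
    names.append(match.group(1))
    values.append(match.group(2))
```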

user3435121 · Accepted Answer · 2022-10-22 21:31:18Z

1

May I suggest this simple solution:

lines = [
    "1                   cheddar cheese    1100 ",
    "2                   gorgonzola    1300 ",
    "3                   smoked cheese    200",
]

for line in lines:
    words = line.strip().split()
    print(' '.join(words[1:-1]), words[-1])

Result:

cheddar cheese 1100
gorgonzola 1300
smoked cheese 200
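Since the question asks for two lists rather than printed output, the same idea could be adapted roughly like this (names and values are illustrative variable names; converting to int is an assumption about what you want downstream):

```python
lines = [
    "0                   cheese    100",
    "1                   cheddar cheese    1100 ",
    "2                   gorgonzola    1300 ",
    "3                   smoked cheese    200",
]

names = []
values = []
for line in lines:
    words = line.split()  # split() with no argument ignores leading/trailing whitespace
    names.append(' '.join(words[1:-1]))  # everything between the first and last token
    values.append(int(words[-1]))        # the last token is the number
```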

answered Oct 22, 2022 at 21:31

user3435121


BeRT2me · Accepted Answer · 2022-10-23 00:36:43Z

1

If you have:

text = '''0                   cheese    100
1                   cheddar cheese    1100
2                   gorgonzola    1300
3                   smoked cheese    200'''

# OR

your_list = [
 '0                   cheese    100',
 '1                   cheddar cheese    1100',
 '2                   gorgonzola    1300',
 '3                   smoked cheese    200'
]

text = '\n'.join(your_list)

Doing:

from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO(text), sep=r'\s\s+', names=['col1', 'col2'], engine='python')
print(df)

Output:

             col1  col2
0          cheese   100
1  cheddar cheese  1100
2      gorgonzola  1300
3   smoked cheese   200

This treats the first number as the index; you can discard it and get a fresh index with df = df.reset_index(drop=True) if desired.
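If you would rather not rely on the first field silently becoming the index, one possible variant is to name all three fields and drop the first. This is only a sketch; the column names idx, type and value are assumptions for illustration:

```python
from io import StringIO

import pandas as pd

your_list = [
    '0                   cheese    100',
    '1                   cheddar cheese    1100',
    '2                   gorgonzola    1300',
    '3                   smoked cheese    200',
]
text = '\n'.join(your_list)

# Name every field explicitly so nothing is used as an implicit index.
df = pd.read_csv(
    StringIO(text),
    sep=r'\s\s+',
    names=['idx', 'type', 'value'],
    engine='python',  # regex separators require the python engine
)

# The leading number is not needed.
df = df.drop(columns='idx')
```

Here read_csv infers the value column as integers, so no separate conversion step is needed.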

edited Oct 23, 2022 at 0:36

answered Oct 23, 2022 at 0:14

BeRT2me


Jannes · Accepted Answer · 2022-10-22 21:43:20Z

You could achieve this by using slicing:

inList = ['0                   cheese    100', '1                   cheddar cheese    1100', '2                   gorgonzola    1300', '3                   smoked cheese    200']

cheese = []
prices = []

for i in inList:
    temp = i[:19:-1] #Cuts out first number and all empty spaces until first character and reverses the string
    counter = 0
    counter2 = 0
    for char in temp: #Temp is reversed, meaning the number e.g. '100' for 'cheese' is in front but reversed
        if char.isdigit(): 
            counter += 1
        else:   #If the character is an empty space, we know the number is over
            prices.append((temp[:counter])[::-1]) #We know where the number begins (at position 0) and ends (at position counter), we flip it and store it in prices

            cheeseWithSpace = (temp[counter:]) #Since we cut out the number, the rest has to be the cheese name with some more spaces in front
            for char in cheeseWithSpace:
                if char == ' ': #We count how many spaces are in front
                    counter2 += 1
                else:   #If we reach something other than an empty space, we know the cheese name begins.
                    cheese.append(cheeseWithSpace[counter2:][::-1]) #We know where the cheese name begins (at position counter2) cut everything else out, flip it and store it
                    break
            break

print(prices)
print(cheese)

See the in-code comments to follow the approach. Basically, you reverse each string with a negative-step slice so the trailing number comes first, then strip off one part at a time. Note that the slice i[:19:-1] assumes the cheese name always starts at index 20.
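If the column offset is not guaranteed, a different technique that avoids the fixed index is to split once from each end with str.split and str.rsplit. This is a swapped-in alternative, not the slicing approach above, and split_entry is a hypothetical helper name:

```python
def split_entry(entry):
    """Return (name, number_as_string) for one entry like '0    cheddar cheese    1100'."""
    # Split once from the left to discard the leading number.
    _, rest = entry.strip().split(None, 1)
    # Split once from the right to peel off the trailing number.
    name, value = rest.rsplit(None, 1)
    return name, value


in_list = [
    '0                   cheese    100',
    '1                   cheddar cheese    1100',
    '2                   gorgonzola    1300',
    '3                   smoked cheese    200',
]

cheese = []
prices = []
for entry in in_list:
    name, value = split_entry(entry)
    cheese.append(name)
    prices.append(value)
```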
