0

I have managed to pull a list from a data source. The list elements are formatted like this (note the first number is NOT the index):

0                   cheese    100
1                   cheddar cheese    1100
2                   gorgonzola    1300
3                   smoked cheese    200

etc.

This means when printed, one line contains "0 cheese 100", with all the spaces.

What I would like to do is parse each entry to divide it into two lists. I don't need the first number. Instead, I want the cheese type and the number after.

For instance:

cheese
cheddar cheese
gorgonzola
smoked cheese

and:

100
1100
1300
200

The ultimate goal is to be able to attribute the two lists to columns in a pd.DataFrame so they can be processed in their own individual way.

Any help is much appreciated.

6 Answers 6

2

If the goal is a dataframe, why not just make that rather than the two lists. If you turn your string into a Series, you can us pandas.Series.str.extract() to split it into the columns you want:

import pandas as pd

s = '''0                   cheese    100
1                   cheddar cheese    1100
2                   gorgonzola    1300
3                   smoked cheese    200'''

pd.Series(s.split('\n')).str.extract(r'.*?\s+(?P<type>.*?)\s+(?P<value>\d+)')

This gives a Dataframe:

    type             value
0   cheese           100
1   cheddar cheese   1100
2   gorgonzola       1300
3   smoked cheese    200
Sign up to request clarification or add additional context in comments.

1 Comment

Also, for a pd.Series.str solution, personally I'd use .str.split('\s\s+', expand=True) and drop the first column~
1

IIUC your strings are elements of a list. You can use re.split to split where two or more spaces are found:

import re
import pandas as pd

your_list = [
  "0                   cheese    100",
  "1                   cheddar cheese    1100",
  "2                   gorgonzola    1300",
  "3                   smoked cheese    200",
]

df = pd.DataFrame([re.split(r'\s{2,}', s)[1:] for s in your_list], columns=["type", "value"])

Output:

             type value
0          cheese   100
1  cheddar cheese  1100
2      gorgonzola  1300
3   smoked cheese   200

Comments

1

I think something on these lines might work:

import pandas as pd
import re
mylist=['0 cheese 100','1 cheddar cheese 200']


numbers = '[0-9]'

list1=[i.split()[-1] for i in mylist]
list2=[re.sub(numbers, '', i).strip() for i in mylist]


your_df=pd.DataFrame({'name1':list1,'name2':list2})
your_df

2 Comments

You conveniently left out data with spaces like cheddar cheese. What happens with those?
Yeah sorry, I missed those. I edited my previous answer now. If the structure is always like that, using regex might help you eliminating numbers from the total string.
1

May I suggest this simple solution:

lines = [
         "1                   cheddar cheese    1100 ",
         "2                   gorgonzola    1300 ",
         "3                   smoked cheese    200",
        ]

for line in lines:
  words = line.strip().split()
  print( ' '.join( words[1:-1]), words[-1])

Result:

cheddar cheese 1100
gorgonzola 1300
smoked cheese 200

Comments

1

If you have:

text = '''0                   cheese    100
1                   cheddar cheese    1100
2                   gorgonzola    1300
3                   smoked cheese    200'''

# OR

your_list = [
 '0                   cheese    100',
 '1                   cheddar cheese    1100',
 '2                   gorgonzola    1300',
 '3                   smoked cheese    200'
]

text = '\n'.join(your_list)

Doing:

from io import StringIO

df = pd.read_csv(StringIO(text), sep='\s\s+', names=['col1', 'col2'], engine='python')
print(df)

Output:

             col1  col2
0          cheese   100
1  cheddar cheese  1100
2      gorgonzola  1300
3   smoked cheese   200
  • This is treating that first number as the index, but you can reset it with df=df.reset_index(drop=True) if desired.

Comments

0

You could achieve this by using slicing:

from curses.ascii import isdigit


inList = ['0                   cheese    100', '1                   cheddar cheese    1100', '2                   gorgonzola    1300', '3                   smoked cheese    200']

cheese = []
prices = []

for i in inList:
    temp = i[:19:-1] #Cuts out first number and all empty spaces until first character and reverses the string
    counter = 0
    counter2 = 0
    for char in temp: #Temp is reversed, meaning the number e.g. '100' for 'cheese' is in front but reversed
        if char.isdigit(): 
            counter += 1
        else:   #If the character is an empty space, we know the number is over
            prices.append((temp[:counter])[::-1]) #We know where the number begins (at position 0) and ends (at position counter), we flip it and store it in prices

            cheeseWithSpace = (temp[counter:]) #Since we cut out the number, the rest has to be the cheese name with some more spaces in front
            for char in cheeseWithSpace:
                if char == ' ': #We count how many spaces are in front
                    counter2 += 1
                else:   #If we reach something other than an empty space, we know the cheese name begins.
                    cheese.append(cheeseWithSpace[counter2:][::-1]) #We know where the cheese name begins (at position counter2) cut everything else out, flip it and store it
                    break
            break

print(prices)
print(cheese)

View in-code comments to understand the approach. Basically you flip your strings around using [::-1] to make them easier to process. Then you remove every part one by one.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.