Convert a list of lists containing comma-separated strings into a pandas DataFrame

Question

I have a list of 10 lists, each holding a single string of 12 comma-separated items. I would like to obtain a DataFrame with 12 columns and 10 rows, but pd.DataFrame() treats the twelve comma-separated items as one column. The apostrophes are part of the list and mark each entry as a string, but I suspect the DataFrame function interprets them as column boundaries. They cannot be replaced. How can this be done, and what is causing this behaviour? Here is the sample data:

[['1,er,2,Fado de Padd,1\'18"1,H,6,2600,J. Dekker,17 490 €,A. De Wrede,1,6'],
 ['2,e,7,Elixir Normand,1\'18"2,H,7,2600,S. Schoonhoven,24 755 €,S. Schoonhoven,14'],
 ['3,e,3,Give You All of Me,1\'18"2,H,5,2600,JF. Van Dooyeweerd,17 600 €,JF. Van Dooyeweerd,10'],
 ['4,e,4,Gouritch,1\'18"3,H,5,2600,BJ. Crebas,20 700 €,BJ. Crebas,32'],
 ['5,e,1,Franky du Cap Vert,1\'18"4,H,6,2600,JH. Mieras,15 536 €,N. De Vreede,65'],
 ['6,e,10,Défi Magik,1\'18"0,H,8,2620,F. Verkaik,44 865 €,AW. Bosscha,6,3'],
 ['7,e,9,Fleuron,1\'18"2,H,6,2620,M. Brouwer,44 830 €,D. Brouwer,7,3'],
 ['8,e,8,Dream Gibus,1\'18"6,H,8,2620,R. Ebbinge,33 330 €,Mme A. Lehmann,36'],
 ['9,e,5,Beau Gaillard,1\'19"5,H,10,2600,A. Bakker,20 140 €,N. De Vreede,44'],
 ['0,DAI,6,Bikini de Larcy,H,10,2600,D. Den Dubbelden,21 834 €,N. Rip,52']]

Any help welcome.

Simply because it is part of a bigger chain of actions, and at this point in the sequence I'm not ready to open a file in append mode, write the file, and read it back to get the dataframe. That is 3 lines of code vs. 1, and I need to iterate this step over a range, which would multiply the code. Besides, I wanted to know how to do this; I'm still learning. I already know read_csv ;)
– Zen4ttitude, Commented Jun 13, 2022 at 16:04
read_csv can also read from io.StringIO, i.e. from a string. :)
– ramslök, Commented Jun 13, 2022 at 16:05
StringIO does not like lists: "initial_value must be str or None, not list"
– Zen4ttitude, Commented Jun 13, 2022 at 16:19

blackraven · Accepted Answer · 2022-06-14 05:03:31Z

The apostrophes only indicate that each entry in the list is a string; they are not column boundaries. Each inner list holds a single string, which can be extracted as its first element with my_list[0]. Split that string for each inner list using a list comprehension before building the DataFrame.
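As a minimal sketch of why pd.DataFrame() produces a single column, the snippet below uses two shortened rows in the same shape as the question's data (the shortening is only for illustration). Each inner list has exactly one element, so the DataFrame gets one value per row; the commas only matter once the string is split.

```python
import pandas as pd

# Two shortened rows shaped like the question's data: each inner list
# holds exactly one string, so DataFrame sees one value per row.
data = [
    ['1,er,2,Fado de Padd,1\'18"1,H,6'],
    ['2,e,7,Elixir Normand,1\'18"2,H,7'],
]

print(len(data[0]))              # number of elements in the first inner list
print(pd.DataFrame(data).shape)  # rows x columns before any splitting

# Splitting the single string on commas yields the individual fields.
fields = data[0][0].split(',')
print(len(fields))
```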

The last row of data appears to be missing a value (the time field), so I corrected it by adding 'null'.

import pandas as pd

data = [['1,er,2,Fado de Padd,1\'18"1,H,6,2600,J. Dekker,17 490 €,A. De Wrede,1,6'],
 ['2,e,7,Elixir Normand,1\'18"2,H,7,2600,S. Schoonhoven,24 755 €,S. Schoonhoven,14'],
 ['3,e,3,Give You All of Me,1\'18"2,H,5,2600,JF. Van Dooyeweerd,17 600 €,JF. Van Dooyeweerd,10'],
 ['4,e,4,Gouritch,1\'18"3,H,5,2600,BJ. Crebas,20 700 €,BJ. Crebas,32'],
 ['5,e,1,Franky du Cap Vert,1\'18"4,H,6,2600,JH. Mieras,15 536 €,N. De Vreede,65'],
 ['6,e,10,Défi Magik,1\'18"0,H,8,2620,F. Verkaik,44 865 €,AW. Bosscha,6,3'],
 ['7,e,9,Fleuron,1\'18"2,H,6,2620,M. Brouwer,44 830 €,D. Brouwer,7,3'],
 ['8,e,8,Dream Gibus,1\'18"6,H,8,2620,R. Ebbinge,33 330 €,Mme A. Lehmann,36'],
 ['9,e,5,Beau Gaillard,1\'19"5,H,10,2600,A. Bakker,20 140 €,N. De Vreede,44'],
 ['0,DAI,6,Bikini de Larcy,null,H,10,2600,D. Den Dubbelden,21 834 €,N. Rip,52']]

df = pd.DataFrame([line[0].split(',') for line in data])
print(df)

Output

   0    1   2                   3       4  5   6     7                   8   \
0  1   er   2        Fado de Padd  1'18"1  H   6  2600           J. Dekker   
1  2    e   7      Elixir Normand  1'18"2  H   7  2600      S. Schoonhoven   
2  3    e   3  Give You All of Me  1'18"2  H   5  2600  JF. Van Dooyeweerd   
3  4    e   4            Gouritch  1'18"3  H   5  2600          BJ. Crebas   
4  5    e   1  Franky du Cap Vert  1'18"4  H   6  2600          JH. Mieras   
5  6    e  10          Défi Magik  1'18"0  H   8  2620          F. Verkaik   
6  7    e   9             Fleuron  1'18"2  H   6  2620          M. Brouwer   
7  8    e   8         Dream Gibus  1'18"6  H   8  2620          R. Ebbinge   
8  9    e   5       Beau Gaillard  1'19"5  H  10  2600           A. Bakker   
9  0  DAI   6     Bikini de Larcy    null  H  10  2600    D. Den Dubbelden   

          9                  10  11    12  
0  17 490 €         A. De Wrede   1     6  
1  24 755 €      S. Schoonhoven  14  None  
2  17 600 €  JF. Van Dooyeweerd  10  None  
3  20 700 €          BJ. Crebas  32  None  
4  15 536 €        N. De Vreede  65  None  
5  44 865 €         AW. Bosscha   6     3  
6  44 830 €          D. Brouwer   7     3  
7  33 330 €      Mme A. Lehmann  36  None  
8  20 140 €        N. De Vreede  44  None  
9  21 834 €              N. Rip  52  None
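Instead of editing the data by hand, the gap could be found and filled in code. The sketch below is only an illustration: split_row is a hypothetical helper, and it assumes that a time value always contains an apostrophe (as in 1'18"1) and always sits at position 4. Neither assumption comes from the question, so check them against the real data.

```python
import pandas as pd

# The last two rows from the question, unmodified.
rows = [
    ['9,e,5,Beau Gaillard,1\'19"5,H,10,2600,A. Bakker,20 140 €,N. De Vreede,44'],
    ['0,DAI,6,Bikini de Larcy,H,10,2600,D. Den Dubbelden,21 834 €,N. Rip,52'],
]

# Counting fields per row reveals which rows are shorter than the others.
counts = [len(line[0].split(',')) for line in rows]
print(counts)


def split_row(text, time_index=4):
    """Split one row and insert None where the time field is missing.

    Assumption (hypothetical rule): a time always contains an apostrophe.
    """
    fields = text.split(',')
    if "'" not in fields[time_index]:
        fields.insert(time_index, None)
    return fields


df = pd.DataFrame([split_row(line[0]) for line in rows])
print(df)
```

Using None rather than the string 'null' lets pandas treat the gap as a missing value, so pd.isna() and dropna() recognise it.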

Second method with the same output:

df = pd.DataFrame(data)[0].str.split(',', expand=True)

Third method with similar output:

from io import StringIO

stringdata = StringIO('\n'.join([line[0] for line in data]))
df = pd.read_csv(stringdata, sep=',', header=None)
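The output is only "similar" because read_csv infers a type for each column, whereas splitting keeps every value as a string. The sketch below shows the difference on two shortened rows. The quoting=csv.QUOTE_NONE argument is an addition not in the answer above: the time values contain a double quote, which is read_csv's default quote character, so turning quoting off is a precaution.

```python
import csv
from io import StringIO

import pandas as pd

# Two shortened rows shaped like the question's data.
data = [
    ['1,er,2,Fado de Padd,1\'18"1,H,6'],
    ['2,e,7,Elixir Normand,1\'18"2,H,7'],
]

# StringIO needs one string, so join the single element of each inner list.
text = '\n'.join(line[0] for line in data)

# read_csv parses the text and infers numeric types where it can.
df_csv = pd.read_csv(StringIO(text), header=None, quoting=csv.QUOTE_NONE)

# Plain splitting leaves every value as a string.
df_split = pd.DataFrame([line[0].split(',') for line in data])

print(df_csv.dtypes)
print(df_split.dtypes)
```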

However, please note that the first method (list comprehension) is still the most efficient!
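One way to check this on your own machine is with timeit. This is a rough sketch: the row count and repeat count are arbitrary choices, and the timings will vary with hardware and pandas version, so no expected numbers are shown.

```python
import timeit

import pandas as pd

# One row from the question, repeated to get a measurable workload.
row = ['4,e,4,Gouritch,1\'18"3,H,5,2600,BJ. Crebas,20 700 €,BJ. Crebas,32']
data = [row] * 200


def with_comprehension():
    # First method: split each string in a list comprehension.
    return pd.DataFrame([line[0].split(',') for line in data])


def with_str_split():
    # Second method: build a one-column frame, then split it.
    return pd.DataFrame(data)[0].str.split(',', expand=True)


t_comprehension = timeit.timeit(with_comprehension, number=5)
t_str_split = timeit.timeit(with_str_split, number=5)

print(f"list comprehension: {t_comprehension:.6f} s")
print(f"str.split(expand=True): {t_str_split:.6f} s")
```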

edited Jun 14, 2022 at 5:03

answered Jun 13, 2022 at 10:55

blackraven

7 Comments

Zen4ttitude Over a year ago

So, list comprehension cycles through the lines and split() separates each entry. It would be simpler if you could tell DataFrame() that your records are comma separated like in read_csv because the function already cycles through each line.

blackraven Over a year ago

you could use the alternative: pd.DataFrame(data)[0].str.split(',', expand=True)

Zen4ttitude Over a year ago

hey I like this, it seems more efficient!

jezrael Over a year ago

@Zen4ttitude - hmm, the list comprehension is more efficient here

Zen4ttitude Over a year ago

let's have a race... list comprehension: --- 0.000804901123046875 seconds --- DataFrame with split: --- 0.0012080669403076172 seconds --- you are right @jezrael

jezrael · Accepted Answer · 2022-06-13 10:48:29Z

Using only split works well, but the last row is mismatched because it lacks one value, so all of its values from column 4 onward need to be shifted one position to the right:

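import pandas as pd

# L is the list of lists from the question
df = pd.DataFrame([y.split(',') for x in L for y in x])

# Shift the last row's values from column 4 onward one place to the right
df.iloc[-1, 4:] = df.iloc[-1, 4:].shift()

print(df)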
  0    1   2                   3       4  5   6     7                   8   \
0  1   er   2        Fado de Padd  1'18"1  H   6  2600           J. Dekker   
1  2    e   7      Elixir Normand  1'18"2  H   7  2600      S. Schoonhoven   
2  3    e   3  Give You All of Me  1'18"2  H   5  2600  JF. Van Dooyeweerd   
3  4    e   4            Gouritch  1'18"3  H   5  2600          BJ. Crebas   
4  5    e   1  Franky du Cap Vert  1'18"4  H   6  2600          JH. Mieras   
5  6    e  10          Défi Magik  1'18"0  H   8  2620          F. Verkaik   
6  7    e   9             Fleuron  1'18"2  H   6  2620          M. Brouwer   
7  8    e   8         Dream Gibus  1'18"6  H   8  2620          R. Ebbinge   
8  9    e   5       Beau Gaillard  1'19"5  H  10  2600           A. Bakker   
9  0  DAI   6     Bikini de Larcy     NaN  H  10  2600    D. Den Dubbelden   

         9                   10  11    12  
0  17 490 €         A. De Wrede   1     6  
1  24 755 €      S. Schoonhoven  14  None  
2  17 600 €  JF. Van Dooyeweerd  10  None  
3  20 700 €          BJ. Crebas  32  None  
4  15 536 €        N. De Vreede  65  None  
5  44 865 €         AW. Bosscha   6     3  
6  44 830 €          D. Brouwer   7     3  
7  33 330 €      Mme A. Lehmann  36  None  
8  20 140 €        N. De Vreede  44  None  
9  21 834 €              N. Rip  52  None
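To see what the shift does in isolation, here is a small sketch on two shortened rows (the shortening is only for illustration). The second row lacks the time field, so after splitting its values sit one column too far left and its last column is empty. shift() moves them one position to the right and leaves a missing value in column 4.

```python
import pandas as pd

# Two shortened rows; the second one is missing the time field.
L = [
    ['9,e,5,Beau Gaillard,1\'19"5,H,10'],
    ['0,DAI,6,Bikini de Larcy,H,10'],
]

df = pd.DataFrame([y.split(',') for x in L for y in x])
print(df)  # last row: 'H' sits in column 4, column 6 is empty

# Move the last row's values from column 4 onward one place to the right.
df.iloc[-1, 4:] = df.iloc[-1, 4:].shift()
print(df)  # last row: column 4 is missing, 'H' and '10' are realigned
```

Note that this fix is positional: it only works because the short row is known to be the last one and the gap is known to be at column 4.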
