Split string in columns in Python

Question

I have a list like this:

[[{'contributionScore': 0.841473400592804, 'variable': 'series_2'},
  {'contributionScore': 0.6113986968994141, 'variable': 'series_3'},
  {'contributionScore': 0.5985525250434875, 'variable': 'series_1'},
  {'contributionScore': 0.5641148686408997, 'variable': 'series_4'},
  {'contributionScore': 0.138543963432312, 'variable': 'series_0'}],

 [{'contributionScore': 1.1316605806350708, 'variable': 'series_1'},
  {'contributionScore': 0.5188271403312683, 'variable': 'series_4'},
  {'contributionScore': 0.38711458444595337, 'variable': 'series_3'},
  {'contributionScore': 0.35055238008499146, 'variable': 'series_0'},
  {'contributionScore': 0.06044715642929077, 'variable': 'series_2'}]]

How can I obtain a dataframe with a column for each series?

I'd like to get a dataframe with contributionScore for each series.

Thanks!

confused banana · Accepted Answer · 2021-11-14 20:01:49Z

I am a bit confused by this statement:

How can I obtain a dataframe with a column for each series?

If you meant a single "variable" column holding all the series names, then Celius Stingher's answer should be good enough.

If you meant that each series should be its own individual column, I will extend Celius's answer as follows:

import pandas as pd

# raw_list is the nested list from the question.
# As in Celius's answer: stack every inner list into one long DataFrame.
df = pd.concat([pd.DataFrame(x) for x in raw_list])
# Sorted list of the unique series names.
series_list = sorted(df['variable'].unique())
# Map each series name to the list of its contributionScore values,
# then turn that dictionary into a DataFrame.
series_df = pd.DataFrame({series: list(df[df['variable'] == series]['contributionScore']) for series in series_list})

The output will look like

    series_0    series_1    series_2    series_3    series_4
0   0.138544    0.598553    0.841473    0.611399    0.564115
1   0.350552    1.131661    0.060447    0.387115    0.518827

Note that this works only when every series has the same number of contribution scores (in the data above, each series has 2).
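
To illustrate the caveat, here is a minimal sketch with made-up data in which series_3 has three scores and series_0 has two. The dictionary name `scores` and the extra value 1.2 are assumptions for the demonstration only. Passing lists of unequal length to pd.DataFrame() raises a ValueError:

```python
import pandas as pd

# Hypothetical data: series_3 has three scores, series_0 only two.
scores = {
    "series_0": [0.138544, 0.350552],
    "series_3": [0.611399, 0.387115, 1.2],
}

try:
    pd.DataFrame(scores)
    error_message = None
except ValueError as exc:
    # pandas refuses to build columns of different lengths from plain lists.
    error_message = str(exc)

print(error_message)
```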

If the series have different numbers of contribution scores, replace the third statement with the line below:

# Build a one-column DataFrame per series and concatenate them side by side, so columns of different lengths are allowed.
series_df = pd.concat([pd.DataFrame({series: list(df[df['variable'] == series]['contributionScore'])}) for series in series_list], axis=1)

Example: if series_3 had 3 contribution scores, the output would look like this:

    series_0    series_1    series_2    series_3    series_4
0   0.138544    0.598553    0.841473    0.611399    0.564115
1   0.350552    1.131661    0.060447    0.387115    0.518827
2   NaN         NaN         NaN         1.200000    NaN

Here pd.concat lets us join DataFrames whose columns have different lengths and fills the gaps with NaN, which was not possible with a single pd.DataFrame() call on the dictionary. The "axis=1" parameter tells the function to concatenate the DataFrames in the list side by side, along the columns, instead of stacking them as rows.
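
As a self-contained sketch of that behaviour, the snippet below uses a small long-format table with only two series, where the third series_3 score (1.2) is the hypothetical value from the example output. It swaps the boolean-mask filter for a groupby loop, which is a variation rather than the exact code above. Each column is renumbered from 0 so the rows align before concatenation:

```python
import pandas as pd

# Long-format data; the third series_3 score (1.2) is hypothetical.
df = pd.DataFrame({
    "variable": ["series_0", "series_3", "series_0", "series_3", "series_3"],
    "contributionScore": [0.138544, 0.611399, 0.350552, 0.387115, 1.2],
})

columns = []
for name, group in df.groupby("variable"):
    # Renumber each column from 0 so values line up row by row,
    # and name the Series after its series label.
    columns.append(group["contributionScore"].reset_index(drop=True).rename(name))

# axis=1 places the Series side by side; shorter columns are padded with NaN.
series_df = pd.concat(columns, axis=1)
print(series_df)
```

Without the renumbering step, the original row labels would be kept and the values would not share rows, which is why the answer's code converts each selection to a list first.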

@lucacanonico, I have added a caveat and a workaround for that as well. Please make sure you check them out, and mark the answer as complete if it works for you!

Celius Stingher · Accepted Answer · 2021-11-14 17:45:00Z


You should be able to create a dataframe using pd.DataFrame(). Since each element of the list can be turned into a dataframe itself, you can use a list comprehension and concatenate the results.

Let's say the list is called "raw_list":

df = pd.concat([pd.DataFrame(x) for x in raw_list])

This would output (first five of the ten rows shown):

   contributionScore  variable
0           0.841473  series_2
1           0.611399  series_3
2           0.598553  series_1
3           0.564115  series_4
4           0.138544  series_0

EDIT:

Given OP's comment, we should pivot each table first, so:

df = pd.concat([pd.DataFrame(x).pivot_table(columns='variable') for x in raw_list])

Outputting:

variable           series_0  series_1  series_2  series_3  series_4
contributionScore  0.138544  0.598553  0.841473  0.611399  0.564115
contributionScore  0.350552  1.131661  0.060447  0.387115  0.518827
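
A variation on the same idea, offered as a sketch rather than as part of the original answer: instead of pivoting each inner list separately, tag every record with the position of its inner list and call DataFrame.pivot once. The column name `row` and the variable names `long_df` and `wide_df` are made up for this illustration. The result has a plain 0, 1 row index instead of the repeated contributionScore label:

```python
import pandas as pd

raw_list = [
    [{'contributionScore': 0.841473400592804, 'variable': 'series_2'},
     {'contributionScore': 0.6113986968994141, 'variable': 'series_3'},
     {'contributionScore': 0.5985525250434875, 'variable': 'series_1'},
     {'contributionScore': 0.5641148686408997, 'variable': 'series_4'},
     {'contributionScore': 0.138543963432312, 'variable': 'series_0'}],
    [{'contributionScore': 1.1316605806350708, 'variable': 'series_1'},
     {'contributionScore': 0.5188271403312683, 'variable': 'series_4'},
     {'contributionScore': 0.38711458444595337, 'variable': 'series_3'},
     {'contributionScore': 0.35055238008499146, 'variable': 'series_0'},
     {'contributionScore': 0.06044715642929077, 'variable': 'series_2'}],
]

# Tag each record with the position of its inner list ("row" is a
# hypothetical helper column) so pivot knows which output row it belongs to.
long_df = pd.concat(
    [pd.DataFrame(records).assign(row=i) for i, records in enumerate(raw_list)],
    ignore_index=True,
)

# One output row per inner list, one column per series name.
wide_df = long_df.pivot(index="row", columns="variable", values="contributionScore")
print(wide_df)
```

Note that pivot raises an error if the same series name appears twice within one inner list; pivot_table would aggregate such duplicates (by mean, by default) instead.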

edited Nov 14, 2021 at 17:45

answered Nov 14, 2021 at 16:29

Celius Stingher


2 Comments

luca canonico Over a year ago

Yes, but this way I append rows. I would like to get a dataframe with columns contributionScore, series_0, series_1, series_2, series_3, series_4.

Celius Stingher Over a year ago

Thanks, I didn't understand what your expected output was. Please remember to include it to make it easier for us to understand what's needed. Then it's as easy as pivoting the table. Please check the edit.
