
Suppose I have the following dataset:

  0
0 foo:1 bar:2 baz:3
1 bar:4 baz:5
2 foo:6

Each line is essentially a dict serialized into a string, with key:value pairs separated by spaces. Each row has hundreds of key:value pairs, while the number of unique keys is a few thousand, so the data is sparse.

What I want is a DataFrame where the keys are columns and the values are cells, with missing values replaced by zeros. Like this:

  foo bar baz
0   1   2   3
1   0   4   5
2   6   0   0

I know I can split each string into key:value pairs:

In: frame[0].str.split(' ')
Out:
  0
0 [foo:1, bar:2, baz:3]
1 [bar:4, baz:5]
2 [foo:6]

But what's next?

Edit: I'm running this in the AzureML Studio environment, so efficiency is important.

1 Answer


You can use a list comprehension to build one dict per row, then create a new DataFrame with from_records and replace the missing values with 0 using fillna:

s = df['0'].str.split(' ')

# one dict per row: split each 'key:value' token on the first ':'
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
#[{'baz': '3', 'foo': '1', 'bar': '2'}, {'baz': '5', 'bar': '4'}, {'foo': '6'}]

print(pd.DataFrame.from_records(d).fillna(0))
#  bar baz foo
#0   2   3   1
#1   4   5   0
#2   0   0   6
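
Note that the values above are still strings, so after fillna(0) the columns hold a mix of strings and the integer 0. If you need numbers, one option is to convert while parsing. This is only a minimal sketch on the three sample rows; the names records and wide are mine, and it assumes every value parses as an integer:

```python
import pandas as pd

# Sample data from the question; the column is assumed to be named '0'.
df = pd.DataFrame({'0': ['foo:1 bar:2 baz:3', 'bar:4 baz:5', 'foo:6']})

# Parse each row into a dict and convert the values to int while parsing.
records = [
    {key: int(value) for key, value in (pair.split(':', 1) for pair in row.split())}
    for row in df['0']
]

# Missing keys become NaN, so fill them with 0 and cast back to int.
wide = pd.DataFrame.from_records(records).fillna(0).astype(int)
print(wide)
```

The column order depends on the pandas version, so select columns by name rather than by position.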

EDIT:

You can get better performance if you pass the index and columns parameters to from_records:

print(df)
                               0
0              foo:1 bar:2 baz:3
1                    bar:4 baz:5
2                          foo:6
3  foo:1 bar:2 baz:3 bal:8 adi:5

s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]
print(d)
[{'baz': '3', 'foo': '1', 'bar': '2'}, 
 {'baz': '5', 'bar': '4'}, 
 {'foo': '6'}, 
 {'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]

If the longest dictionary contains all the keys, you can take the full list of columns from it:

# keys of the longest dict; max() avoids sorting the whole list
cols = list(max(d, key=len).keys())
print(cols)
['baz', 'bal', 'foo', 'bar', 'adi']

df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)

print(df)
  baz bal foo bar adi
0   3   0   1   2   0
1   5   0   0   4   0
2   0   0   6   0   0
3   3   8   1   2   5
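
This shortcut only gives the right columns when the longest dictionary really does contain every key. If you are not sure, you can check before relying on it. A rough sketch in plain Python; the names longest, all_keys and missing are just for illustration:

```python
# Parsed rows from the sample above.
d = [
    {'foo': '1', 'bar': '2', 'baz': '3'},
    {'bar': '4', 'baz': '5'},
    {'foo': '6'},
    {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'},
]

longest = max(d, key=len)        # dict with the most keys
all_keys = set().union(*d)       # every key seen in any row
missing = all_keys - set(longest)

if missing:
    print('longest dict lacks keys:', sorted(missing))
else:
    print('longest dict covers every key')
```

If missing is not empty, the longest dictionary is not enough and you need the keys from all dictionaries instead.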

EDIT2: If the longest dictionary doesn't contain all the keys, because some keys appear only in other dictionaries, collect the keys from every dictionary:

list(set( val for dic in d for val in dic.keys()))

Sample:

print(df)
                               0
0            foo1:1 bar:2 baz1:3
1                    bar:4 baz:5
2                          foo:6
3  foo:1 bar:2 baz:3 bal:8 adi:5

s = df['0'].str.split(' ')
d = [dict(w.split(':', 1) for w in x) for x in s]

print(d)
[{'baz1': '3', 'bar': '2', 'foo1': '1'}, 
 {'baz': '5', 'bar': '4'}, 
 {'foo': '6'}, 
 {'baz': '3', 'bal': '8', 'foo': '1', 'bar': '2', 'adi': '5'}]

# union of the keys from every dict (set order is arbitrary)
cols = list(set(val for dic in d for val in dic.keys()))
print(cols)
['bar', 'baz', 'baz1', 'bal', 'foo', 'foo1', 'adi']

df = pd.DataFrame.from_records(d, index=df.index, columns=cols)
df = df.fillna(0)

print(df)
  bar baz baz1 bal foo foo1 adi
0   2   0    3   0   0    1   0
1   4   5    0   0   0    0   0
2   0   0    0   0   6    0   0
3   2   3    0   8   1    0   5
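
A set has no guaranteed order, so the column order can change between runs. If you want the columns in the order the keys first appear, dict.fromkeys is one alternative. A small sketch, assuming Python 3.7+ where dicts keep insertion order:

```python
# Parsed rows from the sample above.
d = [
    {'foo1': '1', 'bar': '2', 'baz1': '3'},
    {'bar': '4', 'baz': '5'},
    {'foo': '6'},
    {'foo': '1', 'bar': '2', 'baz': '3', 'bal': '8', 'adi': '5'},
]

# dict.fromkeys drops duplicates but keeps the first-seen order of the keys.
cols = list(dict.fromkeys(key for dic in d for key in dic))
print(cols)
# ['foo1', 'bar', 'baz1', 'baz', 'foo', 'bal', 'adi']
```

The resulting cols list can be passed to from_records as the columns parameter in the same way as before.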

3 Comments

Thanks for your solution. It looks really promising and straightforward. Unfortunately, I'm running this inside AzureML Studio as a Jupyter notebook and it looks like I'm hitting limits. The kernel either crashes or stalls at the very last step, pd.DataFrame.from_records(d).fillna(0).
index and columns parameters significantly improve performance. Thank you!
Glad I could help! Good luck!
