pandas get nested string values from arrays

Question

Starting from this dataframe df:

node1,node2,lang,w,c1,c2
1,2,it,1,a,a
1,2,en,1,a,a
2,3,es,2,a,b
3,4,it,1,b,b
5,6,it,1,c,c
3,5,tg,1,b,c
1,7,it,1,a,a
7,1,es,1,a,a
3,8,es,1,b,b
8,4,es,1,b,b
1,9,it,1,a,a

I performed a groupby operation like:

g = df.groupby(['c1','c2'])['lang'].unique().reset_index()

which results in:

  c1 c2          lang
0  a  a  [it, en, es]
1  a  b          [es]
2  b  b      [it, es]
3  b  c          [tg]
4  c  c          [it]

Saving to .csv and reading it back:

g.to_csv('myfile.csv')
g = pd.read_csv('myfile.csv')

gives a different format for the lang column:

  c1 c2              lang
0  a  a  ['it' 'en' 'es']
1  a  b            ['es']
2  b  b       ['it' 'es']
3  b  c            ['tg']
4  c  c            ['it']

My goal now is to count the number of items in each row of lang, and to be able to get those values individually. I tried to build a new column with the length of each array of strings:

g['len'] = g['lang'].apply(lambda x: x.size)

obtaining:

AttributeError: 'str' object has no attribute 'size'

Looking up the values of the lang column, I realized that after the groupby that column became a mess:

In [113]: g['lang'].values
Out[113]: array(["['it' 'en' 'es']", "['es']", "['it' 'es']", "['tg']", "['it']"], dtype=object)

How can I obtain the length of each nested string array and then get the value of each string within it? I considered this type of conversion, but my case is a little too complicated.

EDIT: add information about the different format of the lang column before and after writing/reading to/from .csv.

please provide expected output

– MaxU - stand with Ukraine, Commented Mar 2, 2016 at 11:51

EdChum · Accepted Answer · 2016-03-02 11:36:33Z

Just apply len:

In [145]:
g['size'] = g['lang'].apply(len)
g

Out[145]:
  c1 c2          lang  size
0  a  a  [it, en, es]     3
1  a  b          [es]     1
2  b  b      [it, es]     2
3  b  c          [tg]     1
4  c  c          [it]     1
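For the second half of the question (getting each value individually), one possible approach is DataFrame.explode, which is available in pandas 0.25 and later and is not part of the original answer. This is only a sketch: the frame g below is rebuilt by hand from the groupby output shown in the question, with plain Python lists in lang.

```python
import pandas as pd

# Assumption: g is rebuilt by hand from the groupby output in the question,
# with plain Python lists in the lang column.
g = pd.DataFrame({
    'c1': ['a', 'a', 'b', 'b', 'c'],
    'c2': ['a', 'b', 'b', 'c', 'c'],
    'lang': [['it', 'en', 'es'], ['es'], ['it', 'es'], ['tg'], ['it']],
})

# Number of items per row, as in the answer above.
g['size'] = g['lang'].apply(len)

# One row per individual language; the other columns are repeated.
exploded = g.explode('lang').reset_index(drop=True)
print(exploded)
```

A single value can also be read directly from the unexploded frame, for example g['lang'].iloc[0][1] for the second language of the first group.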

4 Comments

Fabio Lamanna Over a year ago

Thanks! Do you know why writing to CSV after the groupby and reading the file back gives me a different format for the lang column? So I can apply your method before saving to file, but not after reading it back?

EdChum Over a year ago

By default the index is written out; my guess is that you are reading it back in, which adds a new column.

B. M. Over a year ago

It doesn't work on my PC after the CSV write/read: g['lang'].apply(len) returns [16, 6, 11, 6, 6], the lengths of the strings. IMHO, using pickle instead of a CSV write/read is the better solution here; or g = pd.read_csv('myfile.csv', converters={'lang': a_very_tricky_function}).
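The converters idea in the comment above could look roughly like the sketch below. parse_lang is a hypothetical helper, not part of pandas, and it assumes every cell has the form ['it' 'en' 'es'] shown in the question, with no quote characters inside the values. The CSV text is inlined through io.StringIO so the example is self-contained.

```python
import io
import re

import pandas as pd

# Assumption: this text mirrors what the question's CSV round trip produced
# (written without the index column).
csv_text = """c1,c2,lang
a,a,['it' 'en' 'es']
a,b,['es']
b,b,['it' 'es']
b,c,['tg']
c,c,['it']
"""


def parse_lang(cell):
    """Hypothetical converter: pull every single-quoted token out of the cell."""
    return re.findall(r"'([^']*)'", cell)


# converters applies parse_lang to each raw string in the lang column.
g = pd.read_csv(io.StringIO(csv_text), converters={'lang': parse_lang})

# lang now holds real lists, so len counts items instead of characters.
g['size'] = g['lang'].apply(len)
print(g)
```

The regular expression only handles this simple quoting style; values containing apostrophes would need a more careful parser.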

Fabio Lamanna Over a year ago

@B.M. I encountered the same issue when writing/reading back.
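The pickle alternative suggested by B. M. might be sketched as follows. The file name myfile.pkl and the temporary directory are illustrative assumptions, and g is rebuilt by hand with plain lists rather than taken from the groupby. Unlike CSV, pickle stores the Python objects themselves, so the lists come back as lists.

```python
import os
import tempfile

import pandas as pd

# Assumption: g is rebuilt by hand from the groupby output in the question.
g = pd.DataFrame({
    'c1': ['a', 'a', 'b', 'b', 'c'],
    'c2': ['a', 'b', 'b', 'c', 'c'],
    'lang': [['it', 'en', 'es'], ['es'], ['it', 'es'], ['tg'], ['it']],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'myfile.pkl')  # illustrative file name
    g.to_pickle(path)                # serialises the objects, not their text form
    restored = pd.read_pickle(path)  # lang cells are lists again

sizes = restored['lang'].apply(len).tolist()
print(sizes)
```

Pickle files should only be loaded from trusted sources, and they are less portable than CSV across tools.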
