
Starting from this dataframe df:

node1,node2,lang,w,c1,c2
1,2,it,1,a,a
1,2,en,1,a,a
2,3,es,2,a,b
3,4,it,1,b,b
5,6,it,1,c,c
3,5,tg,1,b,c
1,7,it,1,a,a
7,1,es,1,a,a
3,8,es,1,b,b
8,4,es,1,b,b
1,9,it,1,a,a

I performed a groupby operation like:

g = df.groupby(['c1','c2'])['lang'].unique().reset_index()

which results in:

  c1 c2          lang
0  a  a  [it, en, es]
1  a  b          [es]
2  b  b      [it, es]
3  b  c          [tg]
4  c  c          [it]

I then saved it to .csv and read it back:

g.to_csv('myfile.csv')
g = pd.read_csv('myfile.csv')

which gives a different format for the lang column:

  c1 c2              lang
0  a  a  ['it' 'en' 'es']
1  a  b            ['es']
2  b  b       ['it' 'es']
3  b  c            ['tg']
4  c  c            ['it']

My goal now is to count the number of items in each row of lang and to access those values individually. I tried to build a new column holding the length of each array of strings:

g['len'] = df['lang'].apply(lambda x: x.size)

which raises:

AttributeError: 'str' object has no attribute 'size'

Looking at the values of the lang column, I realized that after the groupby the column had become a mess:

In [113]: g['lang'].values
Out[113]: array(["['it' 'en' 'es']", "['es']", "['it' 'es']", "['tg']", "['it']"], dtype=object)
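
The output above suggests that each cell is now a single string rather than an array. A minimal sketch that reproduces this on a smaller, made-up frame, assuming an in-memory buffer as a stand-in for myfile.csv (the names `cell_before`, `cell_after` and `g_back` are illustrative only):

```python
import io
import pandas as pd

# Small illustrative frame, not the full data from the question.
df = pd.DataFrame({
    "c1": ["a", "a", "b"],
    "c2": ["a", "a", "b"],
    "lang": ["it", "en", "it"],
})
g = df.groupby(["c1", "c2"])["lang"].unique().reset_index()

# In memory, each cell is an array-like of strings, not a str.
cell_before = g.loc[0, "lang"]

# Round trip through CSV text; the buffer stands in for myfile.csv.
buf = io.StringIO()
g.to_csv(buf, index=False)
buf.seek(0)
g_back = pd.read_csv(buf)

# After reading back, the same cell is one plain string.
cell_after = g_back.loc[0, "lang"]

print(type(cell_before).__name__, len(cell_before))
print(type(cell_after).__name__, len(cell_after))
```

CSV has no notion of a nested array, so the writer stores the text representation of each cell, and that text is what comes back on read.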

How can I obtain the length of each nested string array and then get the values of each string within it? I thought about this type of conversion, but my case is a little too complicated.

EDIT: added information about the different format of the lang column before and after writing to and reading from .csv.

  • Please provide the expected output. Commented Mar 2, 2016 at 11:51

1 Answer

Just apply len:

In [145]:
g['size'] = g['lang'].apply(len)
g

Out[145]:
  c1 c2          lang  size
0  a  a  [it, en, es]     3
1  a  b          [es]     1
2  b  b      [it, es]     2
3  b  c          [tg]     1
4  c  c          [it]     1
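
For the second half of the question (getting the values individually), here is a minimal sketch on a smaller, made-up frame. It assumes the grouped result is still in memory, that is, not yet written to CSV, and a pandas version that provides `DataFrame.explode`. The names `second_of_first` and `long_form` are illustrative only.

```python
import pandas as pd

# Small illustrative frame, not the full data from the question.
df = pd.DataFrame({
    "c1": ["a", "a", "a", "b"],
    "c2": ["a", "a", "b", "b"],
    "lang": ["it", "en", "es", "it"],
})
g = df.groupby(["c1", "c2"])["lang"].unique().reset_index()

# Convert each array to a plain list so indexing and explode behave uniformly.
g["lang"] = g["lang"].apply(list)
g["size"] = g["lang"].apply(len)

# Individual values: index into the list stored in a cell...
second_of_first = g.loc[0, "lang"][1]

# ...or produce one row per (c1, c2, lang) value.
long_form = g.explode("lang")

print(g)
print(long_form)
```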

4 Comments

Thanks! Do you know why writing to csv after the groupby and reading the file back gives me a different format for the lang column? So I can apply your method before saving to file, but not after reading it back?
By default the index will be written out; my guess is that you are reading it back in again, which adds a new column.
It doesn't work on my PC after the csv write/read: gar['lang'].apply(len) returns [16, 6, 11, 6, 6], the lengths of the strings. IMHO, using pickle instead of csv read/write is the good solution here; or g=pd.read_csv('myfile.csv',converters={'lang': a_very_tricky_function}).
@B.M. I encountered the same issue when writing/reading back.
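
The `converters` idea from the comments could look roughly like the sketch below. `parse_lang` is a hypothetical helper, not something pandas provides, and it assumes the cells have exactly the quoted, space-separated form shown in the question and that no value contains a single quote. The CSV text is an inline stand-in for myfile.csv written with `index=False`.

```python
import io
import re
import pandas as pd

def parse_lang(cell):
    # Hypothetical helper: collect every run of characters enclosed in
    # single quotes, so "['it' 'en' 'es']" becomes ['it', 'en', 'es'].
    return re.findall(r"'([^']*)'", cell)

# Stand-in for myfile.csv (assumed to be written without the index column).
csv_text = """c1,c2,lang
a,a,['it' 'en' 'es']
a,b,['es']
b,b,['it' 'es']
"""

# The converter runs on each raw cell of the lang column while parsing.
g = pd.read_csv(io.StringIO(csv_text), converters={"lang": parse_lang})
g["size"] = g["lang"].apply(len)

print(g)
```

If the file only needs to be read back by pandas, `g.to_pickle(...)` and `pd.read_pickle(...)` keep the nested arrays intact and avoid the parsing step altogether.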
