I have a pandas DataFrame whose first column contains lists of strings. I want to iterate over every string in every list so that each string gets its own row, with the values of the other columns repeated alongside it.

For example:

import pandas as pd

tm = pd.DataFrame({
    'author': [['author_a1', 'author_a2', 'author_a3'],
               ['author_b1', 'author_b2'],
               ['author_c1', 'author_c2']],
    'journal': ['journal01', 'journal02', 'journal03'],
    'date': pd.date_range('2015-02-03', periods=3)})
tm

    author                               date         journal
0   [author_a1, author_a2, author_a3]    2015-02-03   journal01
1   [author_b1, author_b2]               2015-02-04   journal02
2   [author_c1, author_c2]               2015-02-05   journal03

I want this:

    author       date          journal
0   author_a1    2015-02-03    journal01
1   author_a2    2015-02-03    journal01
2   author_a3    2015-02-03    journal01
3   author_b1    2015-02-04    journal02
4   author_b2    2015-02-04    journal02
5   author_c1    2015-02-05    journal03
6   author_c2    2015-02-05    journal03

I've solved the problem with the rather complicated method below. Is there a simpler and more efficient way to do this with pandas?

author_use = []
date_use = []
journal_use = []

for i in range(0,len(tm['author'])):    
    for m in range(0,len(tm['author'][i])):
        author_use.append(tm['author'][i][m])
        date_use.append(tm['date'][i])
        journal_use.append(tm['journal'][i])

df_author = pd.DataFrame({'author':author_use,
                         'date':date_use,
                         'journal':journal_use,                        
                         })

df_author

1 Answer

I think you can use numpy.repeat to repeat the values of the other columns according to the list lengths returned by str.len, and flatten the nested lists with chain.from_iterable:

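from itertools import chain

import numpy as np

# number of authors in each row's list
lens = tm.author.str.len()

# keys are listed in the same order as the printed output below
df = pd.DataFrame({
        "author": list(chain.from_iterable(tm.author)),
        "date": np.repeat(tm.date.values, lens),
        "journal": np.repeat(tm.journal.values, lens)})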

print (df)

      author       date    journal
0  author_a1 2015-02-03  journal01
1  author_a2 2015-02-03  journal01
2  author_a3 2015-02-03  journal01
3  author_b1 2015-02-04  journal02
4  author_b2 2015-02-04  journal02
5  author_c1 2015-02-05  journal03
6  author_c2 2015-02-05  journal03
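
To make the two building blocks above easier to follow, here is a minimal sketch on toy data (the names `values`, `counts` and `nested` are made up for illustration and are not part of the original answer). It shows how `np.repeat` with a list of counts and `chain.from_iterable` produce sequences of the same length that line up row by row:

```python
from itertools import chain

import numpy as np

# Hypothetical toy data: one value per original row, and the list length of each row
values = np.array(["journal01", "journal02", "journal03"])
counts = [3, 2, 2]

# np.repeat repeats the i-th value counts[i] times
repeated = np.repeat(values, counts)
print(repeated.tolist())

# chain.from_iterable flattens one level of nesting
nested = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2"]]
flat = list(chain.from_iterable(nested))
print(flat)
```

Because each count equals the length of the corresponding inner list, `repeated` and `flat` have the same length and can be used as columns of one DataFrame.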

Another numpy solution:

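# repeat the (date, journal) rows by list length, then append the flattened authors
lens = list(map(len, tm.author))
repeated = tm[['date', 'journal']].values.repeat(lens, axis=0)
df = pd.DataFrame(np.column_stack((repeated, np.hstack(tm.author))),
                  columns=['date', 'journal', 'author'])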

print (df)
                  date    journal     author
0  2015-02-03 00:00:00  journal01  author_a1
1  2015-02-03 00:00:00  journal01  author_a2
2  2015-02-03 00:00:00  journal01  author_a3
3  2015-02-04 00:00:00  journal02  author_b1
4  2015-02-04 00:00:00  journal02  author_b2
5  2015-02-05 00:00:00  journal03  author_c1
6  2015-02-05 00:00:00  journal03  author_c2

8 Comments

TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'. What's wrong? @jezrael
Does this problem occur with the sample or with your real data?
It occurs with the sample.
What are your versions of Python and pandas?
Python 2.7.12 |Anaconda custom (32-bit), pandas 0.19.1
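
One possible explanation, which the thread does not confirm: on a 32-bit Python, numpy's index type is int32, while str.len returns int64 lengths, so np.repeat may refuse to cast the repeat counts safely. A hedged sketch of a workaround, assuming that is the cause, is to cast the lengths to `np.intp` (the platform index type) first; `df_fixed` is a made-up name:

```python
from itertools import chain

import numpy as np
import pandas as pd

tm = pd.DataFrame({
    "author": [["author_a1", "author_a2", "author_a3"],
               ["author_b1", "author_b2"],
               ["author_c1", "author_c2"]],
    "journal": ["journal01", "journal02", "journal03"],
    "date": pd.date_range("2015-02-03", periods=3),
})

# Assumption: the TypeError comes from int64 repeat counts on a 32-bit build.
# np.intp is int32 on 32-bit platforms and int64 on 64-bit ones.
lens = tm["author"].str.len().values.astype(np.intp)

df_fixed = pd.DataFrame({
    "author": list(chain.from_iterable(tm["author"])),
    "date": np.repeat(tm["date"].values, lens),
    "journal": np.repeat(tm["journal"].values, lens),
})
print(df_fixed)
```

On a 64-bit installation the cast changes nothing, so it should be harmless to keep.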