I have the following df:

code . role    . persons
123 .  Janitor . 3
123 .  Analyst . 2
321 .  Vallet  . 2
321 .  Auditor . 5

The first row means that there are 3 persons with the role Janitor. I need one row per person, so my df should look like this:

df:

code . role    . persons
123 .  Janitor . 3
123 .  Janitor . 3
123 .  Janitor . 3
123 .  Analyst . 2
123 .  Analyst . 2
321 .  Vallet  . 2
321 .  Vallet  . 2
321 .  Auditor . 5
321 .  Auditor . 5
321 .  Auditor . 5
321 .  Auditor . 5
321 .  Auditor . 5

How could I do that using pandas?

4 Answers

66

reindex + repeat

df.reindex(df.index.repeat(df.persons))
Out[951]: 
   code  .     role ..1  persons
0   123  .  Janitor   .        3
0   123  .  Janitor   .        3
0   123  .  Janitor   .        3
1   123  .  Analyst   .        2
1   123  .  Analyst   .        2
2   321  .   Vallet   .        2
2   321  .   Vallet   .        2
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5
3   321  .  Auditor   .        5

PS: you can add .reset_index(drop=True) to replace the repeated labels with a fresh 0..n-1 index.
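For readers who want to run this end to end, here is a minimal sketch. It assumes the frame has only the three real columns (code, role, persons), without the "." separator columns that appear in the output above.

```python
import pandas as pd

# Sample frame from the question, without the "." separator columns.
df = pd.DataFrame({
    "code": [123, 123, 321, 321],
    "role": ["Janitor", "Analyst", "Vallet", "Auditor"],
    "persons": [3, 2, 2, 5],
})

# Repeat each index label `persons` times, then reindex to duplicate the rows.
expanded = df.reindex(df.index.repeat(df["persons"]))

# Optional: replace the repeated labels (0, 0, 0, 1, 1, ...) with 0..n-1.
expanded = expanded.reset_index(drop=True)

print(expanded)
```

The result has one row per person, 12 rows in total for this sample.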

4 Comments

Wonderful, I knew there was a good solution with repeat, but this nailed it.
Yep, this was nice. Maybe a final reset_index() too?
@Wen I love learning new stuff! You wouldn't believe that I didn't know you could reuse index values in reindex. I have always used reindex to shuffle or add indexes, but never to duplicate as you have done here. Beautiful. Nice one. +1
@ScottBoston thanks, man :-) SO is a good place that pushes us to learn from each other (I learned it from coldspeed a long time ago :-) )
18

Wen's solution is really nice and intuitive; however, it will fail for duplicate rows (more precisely, when the index contains duplicate labels) by throwing ValueError: cannot reindex from a duplicate axis.

Here's an alternative which avoids this by calling repeat on df.values.

df

   code     role  persons
0   123  Janitor        3
1   123  Analyst        2
2   321   Vallet        2
3   321  Auditor        5


pd.DataFrame(df.values.repeat(df.persons, axis=0), columns=df.columns)

   code     role persons
0   123  Janitor       3
1   123  Janitor       3
2   123  Janitor       3
3   123  Analyst       2
4   123  Analyst       2
5   321   Vallet       2
6   321   Vallet       2
7   321  Auditor       5
8   321  Auditor       5
9   321  Auditor       5
10  321  Auditor       5
11  321  Auditor       5
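To make the failure mode concrete, here is a small sketch. The frame `dup` and its duplicated index label are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical frame whose index label 0 appears twice.
dup = pd.DataFrame(
    {"code": [123, 321], "role": ["Janitor", "Auditor"], "persons": [2, 1]},
    index=[0, 0],
)

try:
    dup.reindex(dup.index.repeat(dup["persons"]))
    reindex_failed = False
except ValueError:
    # reindex cannot resolve which row a duplicated label refers to.
    reindex_failed = True

# Repeating the underlying array ignores the index entirely.
out = pd.DataFrame(dup.values.repeat(dup["persons"], axis=0), columns=dup.columns)

print(reindex_failed)
print(out)
print(out.dtypes)  # the integer columns are no longer int64
```

The trade-off is visible in the last line: because `.values` on a mixed-type frame is a single object array, the integer columns lose their original dtype.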

4 Comments

When it comes to performance, which is better: .reindex() or .values.repeat()?
@lmiguelvargasf This solution is faster. But Wen's solution requires fewer characters, plus I was nice enough to leave a nice comment under his answer which spurred all the extra upvotes.
The only problem I saw with your solution is that the dtypes are changed to object for every column in the dataframe.
This works when there are duplicate rows, as opposed to BENY's solution, which throws a ValueError (ValueError: cannot reindex from a duplicate axis).
4

Not enough reputation to comment, but building on @cs95's answer and @lmiguelvargasf's comment, one can preserve dtypes with:

pd.DataFrame(
    df.values.repeat(df.persons, axis=0),
    columns=df.columns,
).astype(df.dtypes)
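A quick way to check that the cast restores the original dtypes is sketched below. The `iloc` variant is an alternative that does not come from any of the answers here; it is included only as a comparison, on the assumption that positional indexing suits your case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code": [123, 123, 321, 321],
    "role": ["Janitor", "Analyst", "Vallet", "Auditor"],
    "persons": [3, 2, 2, 5],
})

# Approach from this answer: repeat the raw values, then cast back.
restored = pd.DataFrame(
    df.values.repeat(df["persons"], axis=0),
    columns=df.columns,
).astype(df.dtypes)

# Alternative (not from the answers above): repeat row positions and use iloc.
# This keeps the original dtypes and does not depend on index labels.
positions = np.repeat(np.arange(len(df)), df["persons"])
by_position = df.iloc[positions].reset_index(drop=True)

print(restored.dtypes)
print(by_position.dtypes)
```

Both frames should report the same dtypes as `df`; the positional version simply avoids the round trip through an object array.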

3

You can apply the Series method repeat:

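import numpy as np
import pandas as pd

# col1 holds the number of times each row should be repeated
df = pd.DataFrame({'col1': [2, 3],
                   'col2': ['a', 'b'],
                   'col3': [20, 30]})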

df.apply(lambda x: x.repeat(df['col1']))
# df.apply(pd.Series.repeat, repeats=df['col1'])

or the numpy function repeat:

df.apply(np.repeat, repeats=df['col1'])

Output:

   col1 col2  col3
0     2    a    20
0     2    a    20
1     3    b    30
1     3    b    30
1     3    b    30
