Reshape python dataframe

Question

I have a dataframe like this:

description
Brian
No.22
Tel:+00123456789
email:[email protected]

Sandra
No:43
Tel:+00312456789

Michel
No:593

Kent
No:13
Engineer
Tel:04512367890
email:[email protected]

and I want to reshape it like this:

name	address	designation	telephone	email
Brian	No:22	null	Tel:+00123456789	email:[email protected]
Sandra	No:43	null	Tel:+00312456789	null
Michel	No:593	null	null	null
Kent	No:13	Engineer	Tel:04512367890	email:[email protected]

How can I do this in Python?

Corralien · Accepted Answer · 2021-12-03 07:48:52Z

Use np.select to label each row, then pivot your dataframe.

Step 1. Label each row:

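import numpy as np

# A row is a name when the previous row is blank (or it is the first row).
condlist = [df['description'].shift(fill_value='').eq(''),
            df['description'].str.contains(r'^No[:.]'),
            df['description'].str.startswith('Tel:'),
            df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
# Anything that matches no condition falls back to 'designation'.
df['column'] = np.select(condlist, choicelist, default='designation')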
print(df)

# Output:
              description       column
0                   Brian         name
1                   No.22      address
2        Tel:+00123456789    telephone
3   email:[email protected]        email
4                          designation
5                  Sandra         name
6                   No:43      address
7        Tel:+00312456789    telephone
8                          designation
9                  Michel         name
10                 No:593      address
11                         designation
12                   Kent         name
13                  No:13      address
14               Engineer  designation
15        Tel:04512367890    telephone
16   email:[email protected]        email
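
The answer assumes df already exists. A minimal sketch of how the sample could be rebuilt, assuming the records are separated by empty strings (that is what the blank rows 4, 8 and 11 in the output suggest). The email values are kept in the redacted form shown in the question. Note that np.select takes the first condition that matches, so the order of condlist matters.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; '' separates the records (assumption).
# Email values stay in the redacted form the question shows.
lines = [
    'Brian', 'No.22', 'Tel:+00123456789', 'email:[email protected]', '',
    'Sandra', 'No:43', 'Tel:+00312456789', '',
    'Michel', 'No:593', '',
    'Kent', 'No:13', 'Engineer', 'Tel:04512367890', 'email:[email protected]',
]
df = pd.DataFrame({'description': lines})

desc = df['description']
condlist = [
    desc.shift(fill_value='').eq(''),   # previous row blank, or no previous row
    desc.str.contains(r'^No[:.]'),      # 'No:' or 'No.' prefix
    desc.str.startswith('Tel:'),
    desc.str.startswith('email:'),
]
choicelist = ['name', 'address', 'telephone', 'email']

# The first matching condition wins; unmatched rows become 'designation'.
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)
```

The blank separator rows also end up as designation here, which is why Step 2 filters them out before pivoting.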

Step 2. Now remove empty rows and create an index to allow the pivot:

df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)

# Output:
              description       column  index
0                   Brian         name      1
1                   No.22      address      1
2        Tel:+00123456789    telephone      1
3   email:[email protected]        email      1
5                  Sandra         name      2
6                   No:43      address      2
7        Tel:+00312456789    telephone      2
9                  Michel         name      3
10                 No:593      address      3
12                   Kent         name      4
13                  No:13      address      4
14               Engineer  designation      4
15        Tel:04512367890    telephone      4
16   email:[email protected]        email      4
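
The index column comes from a cumulative sum over a boolean mask: every name row adds 1, so all rows up to the next name share the same number. A small sketch of just that trick, using a made-up sequence of labels:

```python
import pandas as pd

# Illustrative labels only: three records of different lengths.
labels = pd.Series(['name', 'address', 'telephone',
                    'name', 'address',
                    'name'])

# True counts as 1, False as 0, so the running total increases at each name.
group_id = labels.eq('name').cumsum()
print(group_id.tolist())
```

In Step 2 the cumulative sum is computed on the full frame and then aligned by row label inside assign, so filtering out the blank rows first does not shift the group numbers.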

Step 3. Pivot your dataframe:

cols = ['name', 'address', 'designation', 'telephone', 'email']
# Keyword arguments are required by pivot in pandas 2.0 and later.
out = df.pivot(index='index', columns='column', values='description')[cols] \
        .rename_axis(index=None, columns=None)
print(out)

# Output:
     name address designation         telephone                  email
1   Brian   No.22         NaN  Tel:+00123456789  email:[email protected]
2  Sandra   No:43         NaN  Tel:+00312456789                    NaN
3  Michel  No:593         NaN               NaN                    NaN
4    Kent   No:13    Engineer   Tel:04512367890   email:[email protected]
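
One thing to watch with the [cols] selection: it raises a KeyError if one of the five labels never occurs in the data (for example, if nobody has an email). A hedged variant, assuming you would rather get an all-NaN column in that case, uses reindex instead. The rows below are taken from the Step 2 output, limited to two people without emails:

```python
import pandas as pd

# Long-format rows as they look after Step 2 (no 'email' label present).
long_df = pd.DataFrame({
    'description': ['Sandra', 'No:43', 'Tel:+00312456789',
                    'Kent', 'No:13', 'Engineer', 'Tel:04512367890'],
    'column': ['name', 'address', 'telephone',
               'name', 'address', 'designation', 'telephone'],
    'index': [2, 2, 2, 4, 4, 4, 4],
})

cols = ['name', 'address', 'designation', 'telephone', 'email']
out = (long_df
       .pivot(index='index', columns='column', values='description')
       .reindex(columns=cols)   # missing labels become all-NaN columns
       .rename_axis(index=None, columns=None))
print(out)
```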

Edit

"There is an error at the final step: 'ValueError: Index contains duplicate entries, cannot reshape'. How can I overcome this?"

There is no magic fix for this problem because your data is messy. The designation label is the fallback for every row that was not tagged as name, address, telephone or email, so there is a good chance that you have several rows labelled designation for the same person.

At the end of Step 2, check whether you have duplicates (person/label, i.e. index/column) with this command:

df.value_counts(['index', 'column']).loc[lambda x: x > 1]

Hopefully the output shows only the designation label in the column level, unless one person can have several telephone numbers or emails. You can then adjust condlist to catch as many patterns as possible. I don't know anything about your data, so I can't help you much more than that.
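
To make the check concrete, here is a small sketch on a single record with two unrecognized rows. The value 'Team lead' is invented for illustration and does not come from the question. Besides the counts, duplicated(..., keep=False) shows the offending rows themselves, which helps when deciding which pattern to add to condlist.

```python
import pandas as pd

# One person with two rows that both fell back to 'designation'.
# 'Team lead' is a hypothetical value, added only to trigger the duplicate.
long_df = pd.DataFrame({
    'description': ['Kent', 'No:13', 'Engineer', 'Team lead', 'Tel:04512367890'],
    'column': ['name', 'address', 'designation', 'designation', 'telephone'],
    'index': [4, 4, 4, 4, 4],
})

# Count (index, column) pairs and keep those that occur more than once.
dupes = long_df.value_counts(['index', 'column']).loc[lambda x: x > 1]
print(dupes)

# Show every row that belongs to a duplicated (index, column) pair.
offending = long_df[long_df.duplicated(['index', 'column'], keep=False)]
print(offending)
```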

edited Dec 3, 2021 at 7:48

answered Dec 2, 2021 at 21:45

Corralien


5 Comments

Anton Asiri Over a year ago

Corralien, thank you very much, this is a great help and your steps are very clear. But when I apply them to my actual data, there is an error at the final step: "ValueError: Index contains duplicate entries, cannot reshape". How can I overcome this? Please help.

Corralien Over a year ago

The problem with your data is that every entry without a recognizable pattern gets the default label designation. So if, within one group (index), the same label (column) occurs more than once, pivot raises an exception because it doesn't know how to handle the duplicate cell (index/column).

Anton Asiri Over a year ago

Is it OK to drop the duplicate index entries, and if so, how? Thanks.

Corralien Over a year ago

Try:

out = df.drop_duplicates(['index', 'column']).pivot(index='index', columns='column', values='description')[cols].rename_axis(index=None, columns=None)

Anton Asiri Over a year ago

As you instructed, I dropped the duplicates and it worked. Thank you very much.
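
Keep in mind that drop_duplicates keeps only the first row of each (index, column) pair and silently discards the rest. If losing those values is not acceptable, one possible alternative (not part of the accepted answer) is to join them into a single cell with groupby and unstack. This is a sketch under the assumption that concatenating with '; ' makes sense for your data; 'Team lead' is again an invented value.

```python
import pandas as pd

# Same hypothetical record as before: two rows labelled 'designation'.
long_df = pd.DataFrame({
    'description': ['Kent', 'No:13', 'Engineer', 'Team lead', 'Tel:04512367890'],
    'column': ['name', 'address', 'designation', 'designation', 'telephone'],
    'index': [4, 4, 4, 4, 4],
})

cols = ['name', 'address', 'designation', 'telephone', 'email']
out = (long_df
       .groupby(['index', 'column'])['description']
       .agg('; '.join)          # merge duplicate cells instead of dropping them
       .unstack('column')       # labels become columns, one row per person
       .reindex(columns=cols)
       .rename_axis(index=None, columns=None))
print(out)
```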
