Use np.where to label each row then pivot your dataframe.
Step 1.
condlist = [df['description'].shift(fill_value='').eq(''),
df['description'].str.contains('^No[:.]'),
df['description'].str.startswith('Tel:'),
df['description'].str.startswith('email:')]
choicelist = ['name', 'address', 'telephone', 'email']
df['column'] = np.select(condlist, choicelist, default='designation')
print(df)
# Output:
description column
0 Brian name
1 No.22 address
2 Tel:+00123456789 telephone
3 email:[email protected] email
4 designation
5 Sandra name
6 No:43 address
7 Tel:+00312456789 telephone
8 designation
9 Michel name
10 No:593 address
11 designation
12 Kent name
13 No:13 address
14 Engineer designation
15 Tel:04512367890 telephone
16 email:[email protected] email
Step 2. Now remove empty rows and create an index to allow the pivot:
df = df[df['description'].ne('')].assign(index=df['column'].eq('name').cumsum())
print(df)
# Output:
description column index
0 Brian name 1
1 No.22 address 1
2 Tel:+00123456789 telephone 1
3 email:[email protected] email 1
5 Sandra name 2
6 No:43 address 2
7 Tel:+00312456789 telephone 2
9 Michel name 3
10 No:593 address 3
12 Kent name 4
13 No:13 address 4
14 Engineer designation 4
15 Tel:04512367890 telephone 4
16 email:[email protected] email 4
Step 3. Pivot your dataframe:
cols = ['name', 'address', 'designation', 'telephone', 'email']
out = df.pivot('index', 'column', 'description')[cols] \
.rename_axis(index=None, columns=None)
print(out)
# Output:
name address designation telephone email
1 Brian No.22 NaN Tel:+00123456789 email:[email protected]
2 Sandra No:43 NaN Tel:+00312456789 NaN
3 Michel No:593 NaN NaN NaN
4 Kent No:13 Engineer Tel:04512367890 email:[email protected]
Edit
There is an error at final step" ValueError: Index contains duplicate entries, cannot reshape" how can I overcome this.
There is no magic to solve this problem because your data are mess. The designation label is the fallback if the row was not tagged to name, address, telephone and email. So there is a great chance, you have multiple rows labelled designation for a same person.
At then end of this step, check if you have duplicates (person/label -> index/column) with this command:
df.value_counts(['index', 'column']).loc[lambda x: x > 1]
Probably (and I hope for you), the output should indicate only designation label under column column unless one person can have multiple telephone or email. Now you can adjust the condlist to catch a maximum of pattern. I don't know anything about your data so I can't help you much.