I have a .xlsx file that I am opening with this code:

import pandas as pd

df = pd.read_excel(open('file.xlsx','rb'))
df['Description'].head()

and I get the following result, which looks pretty good:

ID     | Description
:----- | :-----------------------------
0      | Some Description with no hash
1      | Text with #one hash
2      | Text with #two #hashes

Now I want to create a new column that keeps only the words starting with #, like this:

ID     | Description                      |  Only_Hash
:----- | :-----------------------------   |  :-----------------
0      | Some Description with no hash    |   NaN
1      | Text with #one hash              |   #one
2      | Text with #two #hashes           |   #two #hashes

I was able to count the rows that contain a #:

descriptionWithHash = df['Description'].str.contains('#').sum()

but now I want to create the column described above. What is the easiest way to do that?

Regards!

PS: it is supposed to show a table format in the question but I can't figure out why it is showing wrong!

2 Answers

You can use str.findall with str.join:

df['new'] = df['Description'].str.findall(r'#\w+').str.join(' ')
print(df)
   ID                    Description           new
0   0  Some Description with no hash              
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes

To get NaN instead of empty strings for rows without a hash:

import numpy as np

df['new'] = df['Description'].str.findall(r'#\w+').str.join(' ').replace('', np.nan)
print(df)
   ID                    Description           new
0   0  Some Description with no hash           NaN
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes
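
For anyone who wants to try this without the Excel file, here is a minimal self-contained sketch of the same findall/join/replace steps. The DataFrame below is a stand-in built from the sample rows in the question (an assumption, since the real file is not shown), and the column name Only_Hash is taken from the desired output.

```python
import numpy as np
import pandas as pd

# Stand-in for the Excel data (assumed; the real file is not shown)
df = pd.DataFrame({
    'ID': [0, 1, 2],
    'Description': [
        'Some Description with no hash',
        'Text with #one hash',
        'Text with #two #hashes',
    ],
})

# findall returns a list of matches per row; join turns each list into one string
hashes = df['Description'].str.findall(r'#\w+').str.join(' ')

# Rows without any hashtag yield '', which is swapped for NaN
df['Only_Hash'] = hashes.replace('', np.nan)
print(df)
```

Note that \w+ only matches letters, digits and underscores, so a tag such as #foo-bar would be cut at the hyphen; the pattern may need adjusting for other tag formats.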

1 Comment

This solution is much more elegant!
Another option is str.extractall, joining the matches found in each row:

In [126]: df.join(df.Description
     ...:           .str.extractall(r'(\#\w+)')
     ...:           .unstack(-1)
     ...:           .T.apply(lambda x: x.str.cat(sep=' ')).T
     ...:           .to_frame(name='Hash'))
Out[126]:
   ID                    Description          Hash
0   0  Some Description with no hash           NaN
1   1            Text with #one hash          #one
2   2         Text with #two #hashes  #two #hashes
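
A possibly simpler variant of the same extractall idea, sketched with groupby in place of the unstack/transpose chain. This is a swapped-in technique, not the code from the answer above, and the DataFrame is again a stand-in built from the sample rows in the question.

```python
import pandas as pd

# Stand-in for the Excel data (assumed; the real file is not shown)
df = pd.DataFrame({
    'ID': [0, 1, 2],
    'Description': [
        'Some Description with no hash',
        'Text with #one hash',
        'Text with #two #hashes',
    ],
})

# extractall gives one row per match, indexed by (original row, match number);
# column 0 holds the text captured by the single group
matches = df['Description'].str.extractall(r'(#\w+)')[0]

# Join the matches that belong to the same original row
joined = matches.groupby(level=0).agg(' '.join)

# Assignment aligns on the index, so rows without a match become NaN
df['Hash'] = joined
print(df)
```

Because rows without a match never appear in the extractall result, no separate replace step is needed to get NaN.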
