I need to extract URLs from a column of a DataFrame that was created from the following values:

creation_date,tweet_id,tweet_text
2020-06-06 03:01:37,1269102116364324865,#Webinar: Sign up for @SumoLogic's June 16 webinar to learn how to navigate your #Kubernetes environment and unders… https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe
2020-06-06 01:29:38,1269078966985461767,"In this #webinar replay, @DisneyStreaming's @rothgar chats with @SumoLogic's @BenoitNewton about how #Kubernetes is… https://stackoverflow.com/questions/46928636/pandas-split-list-into-columns-with-regex

The tweet_text column contains a URL. I am trying the following code:

df["tweet_text"]=df["tweet_text"].astype(str)
pattern = r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

df['links'] = ''
df['links']= df["tweet_text"].str.extract(pattern, expand=True)

print(df)

I am using the regex from the answer to this question, and it matches the URL in both rows. But I am getting NaN as the values of the new column df['links']. I have also tried the solution provided in the first answer to this question, which was

df['links']= df["tweet_text"].str.extract(pattern, expand=False).str.strip()

But I am getting the following error:

AttributeError: 'DataFrame' object has no attribute 'str'

Lastly, in case it's relevant: I created an empty column using df['links'] = '' because I was getting the error ValueError: Wrong number of items passed 2, placement implies 1. Can someone help me out here?

  • Your URL pattern is not quite clean, but the main problem is that it contains capturing groups where you need non-capturing ones. You need to wrap it with a capturing group, pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)' Commented Jun 6, 2020 at 9:56
  • It worked, thank you. Can you move this comment to an answer so I can mark it as accepted? Commented Jun 6, 2020 at 9:58

1 Answer

The main problem is that your URL pattern contains capturing groups where you need non-capturing ones. You need to replace all ( with (?: in the pattern.

However, that alone is not enough: str.extract requires at least one capturing group in the pattern, otherwise it has nothing to return. Thus, you need to wrap the whole pattern in a capturing group.
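To see what the capturing groups in the question's pattern actually capture, here is a minimal sketch using only the standard re module. The pattern is copied from the question unchanged; the sample text is an illustrative assumption built around one of the URLs from the question, not the original tweet.

```python
import re

# Pattern from the question, unchanged: it contains two capturing groups,
# (www\.)? and the trailing path group.
original = r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

# Illustrative text (an assumption, not the real tweet).
text = "Sign up here https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe"

compiled = re.compile(original)
group_count = compiled.groups              # number of capturing groups
captured = compiled.search(text).groups()  # what each group captured

print(group_count)  # 2
print(captured)     # (None, '/questions/42237666/extracting-information-from-pandas-dataframe')
```

The first group is empty because the URL has no www. part, and the second group holds only the path. Since str.extract builds one column per capturing group, this is consistent with both the NaN values and the "Wrong number of items passed 2" error in the question.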

You may use

pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)' 

Note that + does not need to be escaped inside a character class. Also, there is no need to use // inside a character class; one / is enough.
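For reference, a small sketch of how the wrapped pattern behaves with str.extract. The DataFrame below is a made-up two-row stand-in for the question's data (the second row, without a link, is purely hypothetical), so treat it as an illustration rather than a reproduction of the original file.

```python
import pandas as pd

# Hypothetical sample data: one row with a URL, one without.
df = pd.DataFrame({
    "tweet_text": [
        "Sign up here https://stackoverflow.com/questions/42237666/extracting-information-from-pandas-dataframe",
        "A tweet with no link",
    ]
})

# Pattern from this answer: one outer capturing group, inner group non-capturing.
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'

# With a single capturing group and expand=False, str.extract returns a Series,
# so it can be assigned directly to a new column.
df["links"] = df["tweet_text"].str.extract(pattern, expand=False)
links = df["links"].tolist()

print(df["links"])
```

Rows without a URL get NaN in the links column. Because the result is a Series here, chaining .str.strip() after the extract call should also work, unlike with the two-group pattern.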
