Extracting many URLs in a python dataframe

Question

I have a dataframe with a text column that can contain one or more URLs:

user_id          text
  1              blabla... http://amazon.com ...blabla
  1              blabla... http://nasa.com ...blabla
  2              blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
  2              blabla... https://fnac.com ...blabla ...
  3              blabla....

I want to transform this dataframe into a count of URLs per user_id:

 user_id          count_URL
    1               2 
    2               3
    3               0

Is there a simple way to do this in Python?

My code so far:

URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])

for i in range(data.shape[0]) :
  for j in range(0,8):
     URL.iloc[i,j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))

Thank you,

Lionel

DYZ · Accepted Answer · 2018-06-19 22:05:58Z

3

In general, the definition of a URL is much more complex than what you have in your example. Unless you are sure you have very simple URLs, you should look up a good pattern.

import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
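As a small illustration of why this pattern is only a rough approximation: `\S+` keeps matching until the next whitespace, so punctuation that directly follows a URL ends up inside the match. The sample sentence and the trailing-character list below are assumptions made up for this sketch. The number of matches is not affected, only the extracted strings.

```python
import re

SIMPLE = r"https?://\S+"

# Hypothetical sample text with punctuation right after each URL.
text = "See http://example.com/page, or (https://example.org)."

matches = re.findall(SIMPLE, text)
print(matches)  # ['http://example.com/page,', 'https://example.org).']

# One possible clean-up (an assumption, not a complete URL parser):
# strip common trailing punctuation from each match.
cleaned = [m.rstrip(".,;:!?)]}'\"") for m in matches]
print(cleaned)  # ['http://example.com/page', 'https://example.org']
```

If only the count per row matters, as in the question, the simple pattern may be good enough; the clean-up step matters when the URLs themselves are needed.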

First, extract the URLs from each string and count them:

df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

Next, group the counts by user id:

df.groupby('user_id')['urlcount'].sum()  # select the column first so the text column is not summed
#user_id
#1    2
#2    3
#3    0
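The two steps can be put together in a self-contained sketch. The sample rows are copied from the question; `Series.str.count` is swapped in here as a shorter equivalent of `re.findall(...)` followed by `.str.len()`, and the variable name `counts` is chosen for this example.

```python
import pandas as pd

# Sample data reproduced from the question.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "text": [
        "blabla... http://amazon.com ...blabla",
        "blabla... http://nasa.com ...blabla",
        "blabla... https://google.com ...blabla ...https://yahoo.com ...blabla",
        "blabla... https://fnac.com ...blabla ...",
        "blabla....",
    ],
})

URLPATTERN = r"https?://\S+"  # same simple pattern as above

# Count regex matches per row, then sum the counts per user.
df["urlcount"] = df["text"].str.count(URLPATTERN)
counts = df.groupby("user_id")["urlcount"].sum()
print(counts)
```

Users whose text contains no URL (user 3 here) keep a count of 0 instead of disappearing from the result.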


Amine Sehaba · Answer · 2018-06-19 22:41:21Z

0

Here is another way to do it:

# Read the data
import pandas as pd
data = pd.read_csv("data.csv")

# Split the data into text and user_id, each as its own DataFrame
URL = pd.DataFrame(data.loc[:, "text"].values)
user_id = pd.DataFrame(data.loc[:, "user_id"].values)

# Count the occurrences of "http" in each row
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)

# Convert the list to a DataFrame
count_URL = pd.DataFrame(count_URL)

# Concatenate the two DataFrames, then group by user_id and sum the counts (as in @DYZ's answer)
finalDF = pd.concat([user_id, count_URL], axis=1)
finalDF.columns = ["user_id", "urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())


2 Comments

DYZ Over a year ago

What if one of the lines looks like this: 'there is an http and more httphttps in the line'?

Amine Sehaba Over a year ago

In that case the program doesn't work as expected. Thanks for the remark.
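The concern raised in these comments can be checked directly on the example line. This is a minimal sketch using only the standard library; the names `naive` and `with_regex` are chosen for this illustration.

```python
import re

line = "there is an http and more httphttps in the line"

# Substring counting treats every "http" as a URL.
naive = line.count("http")

# The regex requires the "://" scheme separator, so nothing matches here.
with_regex = len(re.findall(r"https?://\S+", line))

print(naive, with_regex)  # 3 0
```

Counting the substring overcounts whenever "http" appears in ordinary text, which is why the regex-based approach is the safer of the two.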
