3

I have a dataframe with a text column in which each row may contain zero, one, or several URLs:

user_id          text
  1              blabla... http://amazon.com ...blabla
  1              blabla... http://nasa.com ...blabla
  2              blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
  2              blabla... https://fnac.com ...blabla ...
  3              blabla....

I want to transform this dataframe into a count of URLs per user_id:

 user_id          count_URL
    1               2 
    2               3
    3               0

Is there a simple way to do this in Python?

My code so far:

URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])

for i in range(data.shape[0]) :
  for j in range(0,8):
     URL.iloc[i,j] = re.findall("(?P<url>https?://[^\s]+)", str(data.iloc[i]))

Thank you

Lionel

2 Answers

3

In general, the definition of a URL is much more complex than what you have in your example. Unless you are sure you have very simple URLs, you should look up a good pattern.

import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...

First, extract the URLs from each string and count them:

df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()

Next, group the counts by user id:

df.groupby('user_id')['urlcount'].sum()  # select the column first so only the counts are summed
#user_id
#1    2
#2    3
#3    0
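
For reference, the two steps above can be put together on the sample data from the question. This is only a minimal sketch: the dataframe is rebuilt by hand, `Series.str.count` is swapped in for the `apply`/`re.findall` call (it counts regex matches per row), and the `count_URL` column name is taken from the desired output in the question.

```python
import pandas as pd

# Same deliberately simple pattern as above; real-world URLs may need a stricter one.
URLPATTERN = r'https?://\S+'

# Sample data rebuilt by hand from the question (an assumption, not the asker's real file).
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'text': [
        'blabla... http://amazon.com ...blabla',
        'blabla... http://nasa.com ...blabla',
        'blabla... https://google.com ...blabla ...https://yahoo.com ...blabla',
        'blabla... https://fnac.com ...blabla ...',
        'blabla....',
    ],
})

# Count regex matches in each row, then sum the per-row counts for each user.
df['urlcount'] = df['text'].str.count(URLPATTERN)
count_url = (
    df.groupby('user_id')['urlcount']
      .sum()
      .reset_index(name='count_URL')
)
print(count_url)
```

Users whose rows contain no URL (user 3 here) stay in the result with a count of 0, because the grouping is done on the full dataframe rather than on the matched rows only.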

0

Here is another way to do it:

# Read the data
import pandas as pd
data = pd.read_csv("data.csv")

# Split the data into the text and user_id columns, each as its own DataFrame
URL = pd.DataFrame(data.loc[:,"text"].values)
user_id = pd.DataFrame(data.loc[:,"user_id"].values)

# Count the occurrences of the substring "http" in each row
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)

# Convert the list to a DataFrame
count_URL = pd.DataFrame(count_URL)

# Concatenate the two DataFrames, then group by user as in @DyZ's answer and sum the counts
finalDF = pd.concat([user_id,count_URL],axis=1)
finalDF.columns=["user_id","urlcount"]
data = finalDF.groupby('user_id')['urlcount'].sum()
print(data.head())

2 Comments

What if one of the lines looks like this: 'there is an http and more httphttps in the line'?
In that case the program does not work as expected. Thanks for the remark.
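
To make the difference concrete, here is a small sketch that runs both counting approaches on the line from the comment above. The regex is the simple `https?://\S+` pattern from the first answer; the variable names are only illustrative.

```python
import re

URLPATTERN = r'https?://\S+'
line = 'there is an http and more httphttps in the line'

# Plain substring counting sees every "http", whether or not a URL follows.
substring_hits = line.count('http')

# The regex also requires "://" and at least one non-space character after it.
regex_hits = len(re.findall(URLPATTERN, line))

print(substring_hits, regex_hits)  # 3 0
```

The substring count reports three URLs where there are none, so the regex-based count is the safer choice whenever the text may mention "http" outside of an actual link.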
