Split multiple URLs using urlparse in Python

Question

I have a string containing multiple URLs extracted using BeautifulSoup, and I want to split each of these URLs to extract the year and date (both are part of the URL path).

print(dat)
http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129
.
.

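I tried the following, but it only retrieves the first URL:

from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
parsed = urlparse(dat)
path = parsed[2]  # the path component, i.e. everything after www.foo.com
pathlist = path.split("/")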

['', '2016', '01', '0124']

So I am only getting a result for the first URL in the string. How can I parse all of the URLs and store the results so I can extract information from them? I would like to know how many links there are for each year and month.

Also, strangely, when I do print(dat) after this, I only get the first element, http://www.foo.com/2016/01/0124. It seems that urlparse does not work on multiple URLs.

Hmm, can't you use a regex? That looks like an appropriate tool here. Also, how are the URLs joined in the first place?
– willeM_ Van Onsem, Commented Jan 28, 2017 at 22:40
@WillemVanOnsem They were extracted using BeautifulSoup and then converted to a string using str().
– Asteroid098, Commented Jan 28, 2017 at 22:42
But are they separated by new lines? In that case you can use a for loop around your code.
– willeM_ Van Onsem, Commented Jan 28, 2017 at 22:42
Can you give the type of dat (what is the result of type(dat))?
– willeM_ Van Onsem, Commented Jan 28, 2017 at 22:44
@WillemVanOnsem Yes, they are separated by new lines. Could you help me with the for loop? I'm new to Python.
– Asteroid098, Commented Jan 28, 2017 at 22:44

willeM_ Van Onsem · Accepted Answer · 2017-01-28 22:47:03Z

2

Based on your question, it looks like you have a list of URLs separated by new lines. In that case you can use a for loop to iterate over them:

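from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

list_pathlist = []
for url in dat.split('\n'):  # one URL per line
    parsed = urlparse(url)
    path = parsed[2]  # the path component, e.g. '/2016/01/0124'
    pathlist = path.split("/")
    list_pathlist.append(pathlist)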

In that case, I expect the result (list_pathlist) to look something like:

[['', '2016', '01', '0124'], ['', '2016', '02', '0122'], ...]

that is, a list of lists, one inner list per URL.

Or you can write it as a one-liner using a list comprehension:

list_pathlist = [urlparse(url)[2].split('/') for url in dat.split('\n')]
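The question also asks how many links there are per year and month. One way to get that is sketched below, assuming Python 3, that dat holds one URL per line, and that every path has the form /year/month/code as in the sample. The sample dat and the name counts are illustrative only and are not part of the original code.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative sample mirroring the question: one URL per line.
dat = """http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129"""

counts = Counter()  # maps (year, month) -> number of links
for url in dat.splitlines():
    url = url.strip()
    if not url:
        continue  # skip blank lines
    # The path '/2016/01/0124' splits into ['', '2016', '01', '0124'].
    parts = urlparse(url).path.split("/")
    if len(parts) >= 3:
        year, month = parts[1], parts[2]
        counts[(year, month)] += 1

for (year, month), n in sorted(counts.items()):
    print(year, month, n)
```

For the four sample URLs this prints one line per (year, month) pair, with a count of 2 for 2016/02 and 1 for each of the other two months. URLs whose paths do not follow the assumed layout would need extra validation.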

answered Jan 28, 2017 at 22:47

willeM_ Van Onsem
