I have a string containing multiple URLs extracted with BeautifulSoup, and I want to split each URL to extract the year and dates (both appear in the URL path).

print(dat)
http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129
.
.

I tried the following, but it only retrieves the first URL:

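from urlparse import urlparse  # Python 2; in Python 3: from urllib.parse import urlparse
parsed = urlparse(dat)
path = parsed[2]  # the path component, i.e. everything after www.foo.com
pathlist = path.split("/")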

['', '2016', '01', '0124']

So I only get a result for the first URL in the string. How can I parse all of the URLs and store the results so I can extract information from them? I would like to know how many links there are for each year and month.

Also, strangely, after doing this, print(dat) shows only the first element, http://www.foo.com/2016/01/0124. It seems that urlparse does not work on multiple URLs.

  • Can't you use a regex? That looks like an appropriate tool here. Also, how are the URLs joined in the first place? Commented Jan 28, 2017 at 22:40
  • @WillemVanOnsem They were extracted using BeautifulSoup and then converted to a string with str(). Commented Jan 28, 2017 at 22:42
  • But they are separated by new lines? In that case you can use a for loop over your script. Commented Jan 28, 2017 at 22:42
  • can you give the type of dat (what is the result of type(dat))? Commented Jan 28, 2017 at 22:44
  • @WillemVanOnsem Yes they are separated by new lines, could you help me with using the for loop? I'm new to python. Commented Jan 28, 2017 at 22:44

1 Answer

Based on your question, it looks like you have a list of URLs separated by new lines. In that case you can use a for loop to iterate over them:

list_pathlist = []
for url in dat.split('\n'):
    parsed = urlparse(url)
    path = parsed[2]  # the path component, i.e. everything after www.foo.com
    pathlist = path.split("/")
    list_pathlist.append(pathlist)

In which case I suspect the result (list_pathlist) will be something like:

[['', '2016', '01', '0124'], ['', '2016', '02', '0122'], ...]

so a list of lists.

Or you can put it into a nice one-liner using a list comprehension:

list_pathlist = [urlparse(url)[2].split('/') for url in dat.split('\n')]
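
To also answer the part about how many links there are per year and month, here is a minimal sketch that builds on the same splitting idea. It assumes Python 3 (hence urllib.parse) and that every path has the form /YYYY/MM/...; the sample dat string and the helper name count_by_year_month are made up for illustration and are not from the question.

```python
from collections import Counter
from urllib.parse import urlparse  # Python 3; Python 2 used `from urlparse import urlparse`

# Sample input, assumed to match the newline-separated string in the question.
dat = """http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129"""


def count_by_year_month(text):
    """Count URLs per (year, month), assuming paths look like /YYYY/MM/..."""
    counts = Counter()
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines, e.g. a trailing newline
        parts = urlparse(url).path.split("/")
        # parts[0] is '' because the path starts with '/'
        if len(parts) >= 3:
            counts[(parts[1], parts[2])] += 1
    return counts


counts = count_by_year_month(dat)
for (year, month), n in sorted(counts.items()):
    print(year, month, n)
```

Using splitlines() and skipping empty lines avoids parsing a blank entry when the string ends with a newline. The Counter keys are (year, month) tuples, so counts[("2016", "02")] gives the number of links for February 2016.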