Extracting links with regex from source code; Python

Question

I have a dataset of links to newspaper articles that I want to do some research on. However, the links in the dataset end with the .ece extension, which is a problem for me because of some API restrictions.

http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece

and

http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html

are links to the same page. Now I need to convert all the .ece links into .html links. The only way I have found is to parse each page and find the original .html link. The problem is that the link is buried inside an HTML meta element, and I can't get to it using tree.xpath.

<meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"

Unfortunately, I am not well acquainted with regex and don't know how to extract a link using it. Basically, every link I need starts with:

<meta content="http://www.telegraaf.nl/

I need the full link (i.e., http://www.telegraaf.nl/THE_REST_OF_THE_LINK). Also, I'm using BeautifulSoup for parsing. Thanks.

Do you at least know how to use regex if the regex string was given to you?
– Nick Humrich, Commented Sep 29, 2014 at 17:28
Well, I know that I'd have to use the re module: re.findall(r"expression", "string")?
– Zlo, Commented Sep 29, 2014 at 17:42
@Padraic Cunningham, I have a file exclusively with .ece links. When I opened the page's source code, I found that they store the .html link in the meta element. What I need is to get that link (.html) from the source code.
– Zlo, Commented Sep 29, 2014 at 17:48
@Zlo I think you can probably adapt my answer then... I've assumed they're both in the same file... but you can tweak it to how you want...
– Jon Clements, Commented Sep 29, 2014 at 17:52

Nick Humrich · Accepted Answer · 2014-09-29 20:52:16Z

1

Here is a really simple regex to get you started.

This one will extract all links:

<meta content="(http://www\.telegraaf\.nl[^"]*)"

This one will match only the .html links:

<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"

To use this with what you have, you can do the following:

import re
import urllib.request

# [^"]* stops at the closing quote, so only the attribute value is captured
pattern = re.compile(r'<meta content="(http://www\.telegraaf\.nl[^"]*\.html)"')

# ece_url_list is your own list of .ece URLs
replacements = {}
for url in ece_url_list:
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8', errors='replace')
    replacements[url] = pattern.findall(html)[0]

Note: This assumes that each page's source always includes an .html link in this meta tag. If no such link is present, the [0] index raises an IndexError; if there are several, only the first is used.
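If some pages might lack the tag, a more forgiving variant can return None instead of raising. The following is only a minimal sketch: the function name `extract_html_link` and the sample string are made up for illustration, and it assumes `content` is the first attribute of the meta tag, as in the snippet from the question.

```python
import re

# [^"]* stops at the first closing quote, so only the attribute value is captured.
META_HTML_LINK = re.compile(
    r'<meta content="(http://www\.telegraaf\.nl/[^"]*\.html)"'
)

def extract_html_link(page_source):
    """Return the first .html link found in a meta tag, or None if there is none."""
    match = META_HTML_LINK.search(page_source)
    return match.group(1) if match else None

# Illustrative input based on the meta tag quoted in the question.
sample = (
    '<meta content="http://www.telegraaf.nl/telesport/voetbal/'
    'buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />'
)

print(extract_html_link(sample))
print(extract_html_link("<html></html>"))  # no matching tag, so this is None
```

Pages that return None can then be logged and checked by hand rather than stopping the whole loop.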

edited Sep 29, 2014 at 20:52

answered Sep 29, 2014 at 17:22

Nick Humrich

Jon Clements · Accepted Answer · 2014-09-29 17:49:18Z

Use BeautifulSoup to find matching content attributes, then replace as such:

from bs4 import BeautifulSoup
import re

html = """
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece" />
    <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html" />
"""

soup = BeautifulSoup(html, 'html.parser')
# reference table mapping each URL prefix to its full .html link
html_links = {
    el['content'].rpartition('/')[0]: el['content']
    for el in soup.find_all('meta', content=re.compile(r'\.html$'))
}
# find all .ece links, strip the end off to match the prefixes above, then
# replace the meta content with the looked-up .html link
for el in soup.find_all('meta', content=re.compile(r'\.ece$')):
    url = re.sub(r'article(\d+)\.ece$', r'\1', el['content'])
    el['content'] = html_links[url]

print(soup)
# both <meta> elements now read:
# <meta content="http://www.telegraaf.nl/telesport/voetbal/buitenlands/22178882/__Wenger_vreest_het_ergste__.html"/>
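To see why the lookup table works, here is a small standard-library sketch using the two example URLs from the question. It shows that dropping the last path segment of the .html link and stripping "article" and ".ece" from the .ece link give the same key. The variable names are illustrative only, and the sketch assumes every .ece link follows the `article<digits>.ece` pattern.

```python
import re

ece_url = "http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece"
html_url = (
    "http://www.telegraaf.nl/telesport/voetbal/buitenlands/"
    "22178882/__Wenger_vreest_het_ergste__.html"
)

# Key from the .html link: everything before the last "/".
html_key = html_url.rpartition("/")[0]

# Key from the .ece link: drop the "article" prefix and ".ece" suffix, keep the digits.
ece_key = re.sub(r"article(\d+)\.ece$", r"\1", ece_url)

print(html_key)
print(html_key == ece_key)
```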

vks · Accepted Answer · 2014-09-29 17:19:06Z

0

(.*?)(http:\/\/.*\/.*?\.)(ece)

Try this. Replace with $2html.

See demo.

http://regex101.com/r/nA6hN9/24

answered Sep 29, 2014 at 17:19

vks

3 Comments

Nick Humrich Over a year ago

I don't think that will work. The OP shows that the .html links are different from the .ece links; they differ by more than the extension. In other words, a simple find and replace on .ece would not work. He needs to replace the whole link with the whole second link.

vks Over a year ago

@Humdinger I'm lost now. Will wait for clarification from the OP.

Zlo Over a year ago

@Humdinger, that's correct. The .html link is a bit different than the .ece one, so a simple replacement doesn't work.
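The point made in these comments can be checked with the two URLs from the question. This is just an illustrative sketch (the variable names are not from the thread): swapping the extension yields a different string from the real .html link, which also contains an extra path segment and the article slug.

```python
import re

ece_url = "http://www.telegraaf.nl/telesport/voetbal/buitenlands/article22178882.ece"
actual_html_url = (
    "http://www.telegraaf.nl/telesport/voetbal/buitenlands/"
    "22178882/__Wenger_vreest_het_ergste__.html"
)

# Naive approach: only swap the extension.
naive = re.sub(r"\.ece$", ".html", ece_url)

print(naive)
print(naive == actual_html_url)  # False: the slug cannot be derived from the .ece link
```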
