I am trying to extract web links from web content with a Python regex. Here's my Python script:

webUrlList = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)
print webUrlList

The resulting webUrlList looks like this:

['/', '.html', '/', '/', '/', '/',...] 

Please help me find out why this script yields the above output.

Sample target URL strings:

<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html"

<a href="/abcabcdef/coffee/su1/" 
  • I'm having trouble reproducing the output you're citing. When using the regex that you supplied, r"(?<=<a href=\").+(.html|/)(?=\")", I only get ['.html'] and no forward-slash characters. Commented Jul 3, 2016 at 17:53
  • Just make the capturing group a noncapturing one. And use lazy dot matching. Commented Jul 3, 2016 at 18:01
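
The suggestion in the comment above can be sketched as follows. When a pattern contains exactly one capturing group, `re.findall` returns only what that group matched, which is why the list holds just '/' and '.html'. Making the group non-capturing and the dot lazy returns the whole URL instead. The `content` string is a made-up sample built from the question's two target strings (closed off with invented link text), and escaping the dot in `\.html` is an extra tweak that the comment does not mention.

```python
import re

# Hypothetical sample built from the two target strings in the question.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html">x</a>\n'
    '<a href="/abcabcdef/coffee/su1/">y</a>'
)

# Original pattern: one capturing group, so findall returns only that group.
original = re.findall(r"(?<=<a href=\").+(.html|/)(?=\")", content)

# Non-capturing group (?:...) plus lazy .+? so findall returns the full match.
fixed = re.findall(r"(?<=<a href=\").+?(?:\.html|/)(?=\")", content)

print(original)  # ['.html', '/']
print(fixed)     # the two complete URLs
```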

2 Answers

2

If you're only parsing for links, and you're familiar with the content you will be parsing, the following regex should help you accomplish what you're after and is pretty safe.

import re
regex = re.compile(r'href="([^"]+)')
results = regex.findall(content)  # content: the HTML text to search
  • href=" consumes but doesn't capture the literal characters href="
  • ([^"]+) consumes and captures any character that isn't a quotation mark

Run a few trials with the content you are scraping and assess whether you need more specificity in the regex or not.
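
As a rough illustration of such a trial, the sketch below runs the pattern on a small sample and then narrows the result. The `content` string is hypothetical: its first two links come from the question, and the third is invented to show a link the asker would probably not want. The `endswith` filter is one assumed way to add specificity, not something the answer prescribes.

```python
import re

# Hypothetical sample; the third link is invented for illustration.
content = (
    '<a href="http://ab.test.com/flower/1111027378112/purple/119735281586093.html">a</a>\n'
    '<a href="/abcabcdef/coffee/su1/">b</a>\n'
    '<a href="/static/logo.png">c</a>'
)

href_pattern = re.compile(r'href="([^"]+)')

# findall returns the single capturing group, i.e. the text inside the quotes.
all_links = href_pattern.findall(content)

# Optional extra specificity: keep only links ending in ".html" or "/".
wanted = [url for url in all_links if url.endswith((".html", "/"))]

print(all_links)
print(wanted)
```

Filtering after the match keeps the regex itself simple, so the suffix rules can change without touching the pattern.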

2 Comments

You are using re.findall. r'href="([^"]+)' is enough.
@WiktorStribiżew indeed it is. Good catch. I'll modify the answer above.
1

Use an HTML parser like BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
print([a["href"] for a in soup.find_all("a", href=True)])

Don't use a regex to parse HTML.
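
If adding a third-party module is a concern, the same idea can be sketched with the standard library's `html.parser` instead of BeautifulSoup. This is a swapped-in alternative, not the code from this answer. The `LinkCollector` class and the sample `content` are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample: one link from the question, one anchor without href.
content = '<a href="/abcabcdef/coffee/su1/">b</a><a name="x">no href</a>'

collector = LinkCollector()
collector.feed(content)
print(collector.links)  # ['/abcabcdef/coffee/su1/']
```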

10 Comments

This requires adding an additional module, BeautifulSoup, to the project. I understand that there may be better tools to parse HTML than regular expressions. But this question asks how to extract web links using regular expressions. So while your answer works and is elegant, it seems to side-step what's being asked.
@wcarroll, stackoverflow.com/questions/1732348/… You should not use a regex to parse HTML. There is no side-stepping of what is being asked; it is the correct approach to what is essentially being asked.
I almost included in my comment "yes I have seen the infamous SO post". I guess I should have been explicit. This doesn't change my comment above. If he is only parsing small strings that contain HTML, regular expressions are fit for the task and I think preferable to including a third-party module and learning its API.
@wcarroll, where does it say they are parsing small strings that contain HTML? "I am trying to extract web links from web content" seems pretty clear that they are parsing the full content returned. I am not going to encourage anyone to parse HTML with a regex, and anyone who does is leading someone down a bad path.
But he is not parsing the whole HTML to build a DOM either. Isn't it OK to look for HTTP URIs in it? If not, why exactly?