0

I'm trying to figure out how to use regular expressions in Python to extract certain URLs from strings. For example, I might have 'blahblahblah (a href="example.com")'. In this case I want to extract all "example.com" links. How can I do that instead of just splitting the string?

Thanks!

2
  • Are your strings HTML? If so, do not use regex. If not, there are certainly many URI-matching regular expressions around. You might try giving one of those a shot and coming back to ask a more specific question if you get stuck. Commented Jan 23, 2013 at 0:54
  • Please don't forget to leave feedback for the responders and to upvote the answers you found useful! :) Commented Jan 23, 2013 at 8:00

3 Answers

1

There is a module called BeautifulSoup (link: http://www.crummy.com/software/BeautifulSoup/) that is well suited to parsing HTML. You should use it instead of regex to get information out of HTML. Here's an example of BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
>>> mylist = soup.find_all('a')
>>> for link in mylist:
...     print(link['href'])
...
http://link.com
http://second.com

Here is a link to the documentation, which is really easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
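
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This swaps BeautifulSoup for a different technique, and the class name LinkCollector is a made-up helper for illustration, not part of any library:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = '<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['http://link.com', 'http://second.com']
```

Like BeautifulSoup, this only sees real tags, so it would not pick up the non-HTML '(a href="example.com")' form from the question.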

0

Regexes are very powerful tools, but they are not the right tool in all circumstances (as others have already suggested). That said, here's a minimal example from the console that uses regex, as requested:

>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']

Focus on r'a href="(.*?)"'. In English it translates to: find a string beginning with a href=", then save as the result every character until you hit the next ". The syntax is:

  • the () means "save only the stuff in here"
  • the . means "any character"
  • the * means "any number of times"
  • the ? means "non-greedy", or in other terms: find the shortest string that satisfies the requirements (try without the question mark and you will see what happens).
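
To make the effect of the question mark concrete, here is a small sketch that runs the pattern both ways on the same string. The variable names lazy, greedy and negated are only illustrative; the third pattern, using a negated character class, is a common alternative and not part of the answer above:

```python
import re

s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'

# Non-greedy: each match stops at the first closing quote it can reach.
lazy = re.findall(r'a href="(.*?)"', s)

# Greedy: .* runs as far as possible, so one match swallows everything
# up to the last double quote in the string.
greedy = re.findall(r'a href="(.*)"', s)

# Alternative: "anything except a double quote" also stops at the first quote.
negated = re.findall(r'a href="([^"]*)"', s)

print(lazy)     # ['example.com', 'subdomain.example2.net']
print(greedy)   # ['example.com") another bla <a href="subdomain.example2.net']
print(negated)  # ['example.com', 'subdomain.example2.net']
```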

HTH!

0

Do not use regexp:

Here is why regex should not be your first choice when dealing with HTML or XML (or URLs).

If you wish to use regex anyway, you can find several patterns that do the job, and several ways to fetch the strings you are looking for.

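These patterns do the job on the example string:

r'\(a href="(.*?)"\)'

r'\(a href="(.*)"\)' (greedy, so it only gives the right result when the string contains a single link)

r'\(a href="(.+?)"\)'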

1. re.findall()

re.findall(pattern, string, flags=0) 

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

import re
st = 'blahblahblah (a href="example.com") another bla <a href="polymer.edu">'
# Only the parenthesised form matches this pattern; the <a href="..."> form does not
re.findall(r'\(a href="(.+?)"\)', st)
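
The note about groups in the description above is easy to miss, so here is a short sketch of how the return value changes with the number of groups. It assumes a string in which both links use the parenthesised form; the two-group pattern that splits off the last dotted part is made up for illustration:

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'

# No group: each item is the whole matched text.
no_group = re.findall(r'\(a href=".+?"\)', st)

# One group: each item is just the captured string.
one_group = re.findall(r'\(a href="(.+?)"\)', st)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'\(a href="(.+?)\.(\w+)"\)', st)

print(no_group)    # ['(a href="example.com")', '(a href="polymer.edu")']
print(one_group)   # ['example.com', 'polymer.edu']
print(two_groups)  # [('example', 'com'), ('polymer', 'edu')]
```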

2. re.search()

re.search(pattern, string, flags=0)

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance.

Then use the match object's group() method to go through the groups. For instance, with the regex r'\(a href="(.+?(.).+?)"\)', which also works here, you have nested groups: group 0 is the match for the whole pattern, group 1 is the match for the first sub-pattern enclosed in parentheses, (.+?(.).+?), and group 2 is the match for the inner (.)

You would use search when looking for the first occurrence of the pattern only. With your example this would be

>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m=re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'
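
As a rough sketch of how the group numbers line up, and of what to do when you want every occurrence rather than just the first, the same pattern can be compiled once and used with both search and finditer. The variable names here are illustrative only:

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
pattern = re.compile(r'\(a href="(.+?(.).+?)"\)')

# search() returns the first match only, or None when nothing matches.
m = pattern.search(st)
if m is not None:
    whole = m.group(0)  # the entire matched text
    outer = m.group(1)  # the outer parenthesised group
    inner = m.group(2)  # the nested (.) group
    print(whole)  # (a href="example.com")
    print(outer)  # example.com
    print(inner)  # x

# finditer() yields one match object per occurrence, left to right.
all_links = [match.group(1) for match in pattern.finditer(st)]
print(all_links)  # ['example.com', 'polymer.edu']
```

Checking for None before calling group() avoids an AttributeError when the string contains no link at all.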
