Python Regex Tokenize

Question

I'm trying to figure out how to use regular expressions in Python to extract out certain URLs in strings. For example, I might have 'blahblahblah (a href="example.com")'. In this case I want to extract all "example.com" links. How can I do that instead of just splitting the string?

Thanks!

Are your strings HTML? If so, do not use regex. If not, there are certainly many URI-matching regular expressions around. You might try giving one of those a shot and coming back to ask a more specific question if you get stuck. — mechanical_meat
– mechanical_meat, Commented Jan 23, 2013 at 0:54
Please don't forget to leave a feedback to the responders and to upvote those answer you found useful! :) — mac
– mac, Commented Jan 23, 2013 at 8:00

TerryA · Accepted Answer · 2013-01-23 01:09:25Z

1

There is a great module called BeautifulSoup (link: http://www.crummy.com/software/BeautifulSoup/) which is great for parsing HTML. You should use this instead of using regex to get info from HTML. Here's an example of BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html)
>>> mylist = soup.find_all('a')
>>> for link in mylist:
...    print link['href']
http://link.com
http://second.com

Here is a link to the documentation, which is really easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

edited Jan 23, 2013 at 1:09

answered Jan 23, 2013 at 1:04

TerryA

60.2k11 gold badges122 silver badges148 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

mac · Accepted Answer · 2013-01-23 01:33:14Z

Regex are very powerful tools, but they might not be your tool in all circumstances (as other has suggested already). That said, here's a minimal example from the console that uses - as per request - regex:

>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']

Focus on r'a href="(.*?)"'. In Englis it translates in: "find a string beginning with a href=", then save as a result any character until you hit the next ". The syntax is:

the () means "save only stuff in here"
the . means "any character"
the * means "any number of times"
the ? means "non greedy" or in other terms: find the shortest string that satisfy the requirements (try without the question mark and you will see what happens).

HTH!

Community · Accepted Answer · 2017-05-23 12:03:27Z

Do not use regexp:

Here is why you should not think at regex in the first place when dealing with HTML or XML (or URLs).

If you wish to use regex anyway,

You can find several pattern that do the job, and several way to fetch the strings you wish to find.

These patterns do the job:

r'\(a href="(.*?)"\)'

r'\(a href="(.*)"\)'

r'\(a href="(+*)"\)'

1. re.findall()

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

import re
st = 'blahblahblah (a href="example.com") another bla <a href="polymer.edu">'
re.findall(r'\(a href="(+*)"\)',s)

2. re.search()

re.search(pattern, string, flags=0)

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance.

Then, go with re.group() through groups. For instance, using regex r'\(a href="(.+?(.).+?)"\)', that is also working here, you have several enclosed groups: group 0 is a match to the whole pattern, group 1 is a match to the first enclosed sub-pattern surrounded with parenthesis, (.+?(.).+?)

You would use search when looking for first occurence of pattern only. And with your example this would be

>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m=re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'

Collectives™ on Stack Overflow

Python Regex Tokenize

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related