Regex Matching Error

Question

I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.

I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says

raise error, v # invalid expression
sre_constants.error: multiple repeat

I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.

Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!

Brock

import urllib2
from BeautifulSoup import BeautifulSoup
import re

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

pattern = r'<a href="http://forums.epicgames.com/archive/index.php?t-([0-9]+).html">(.?+)</a> <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'

for match in re.finditer(pattern, page, re.S):
    print match(0)

Unknown · Accepted Answer · 2009-08-12 21:19:26Z

1

That means your regular expression has an error.

(.?+)</a> <i>((.?+)

What does ?+ mean? Both ? and + are metacharacters that do not make sense right next to each other. Maybe you forgot to escape the '?' or something.
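A quick way to see which patterns the re module rejects is to try compiling them. The sketch below is written for Python 3 (the question uses Python 2), and compile_error is a hypothetical helper name. One caveat: as far as I know, Python 3.11 and later accept ?+ as a possessive quantifier, so the question's (.?+) only raises "multiple repeat" on older interpreters such as the one in the question, while a** is rejected on every version.

```python
import re

def compile_error(pattern):
    """Return the regex compile error message, or None if the pattern is valid."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Two repeat operators stacked directly on each other are rejected.
print(compile_error(r'a**'))

# A repeat followed by '?' is the valid non-greedy form.
print(compile_error(r'.+?'))

# Version-dependent: rejected by older interpreters, accepted as a
# possessive quantifier on Python 3.11 and later.
print(compile_error(r'(.?+)'))
```

Checking a pattern this way before running it against a whole page separates "the pattern is invalid" from "the pattern is valid but matches nothing".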

answered Aug 12, 2009 at 21:19

Unknown


1 Comment

retracile Over a year ago

They make sense in the other order. +? is non-greedy matching form of +.

retracile · Accepted Answer · 2009-08-12 21:27:10Z

1

You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.

Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.
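As a minimal sketch of both fixes together (Python 3 syntax, not the asker's Python 2), the pattern below escapes the literal '?', '.', '(' and ')' and uses '+?' for the title. The sample string is modeled on the commented-out test line in the question; whether the live page actually contains such text is a separate matter.

```python
import re

# Sample line modeled on the commented-out test string in the question.
sample = (
    '<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
    'Gears of War 2: Horde Gameplay</a> <i>(20 replies)'
)

pattern = re.compile(
    # \? and \. match a literal question mark and dot.
    r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">'
    # (.+?) is non-greedy; \( and \) match literal parentheses.
    r'(.+?)</a> <i>\(([0-9]+) replies\)'
)

m = pattern.search(sample)
if m:
    # Match objects are not callable: use m.group(0), not m(0).
    print(m.group(0))
    thread_id, title, replies = m.groups()
    print(thread_id, title, replies)
```

Testing the pattern on one known string first makes it easier to tell a regex problem from a page-content problem.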

4 Comments

Btibert3 Over a year ago

I changed my pattern and ran the script again, and yet no matches were found; at least, nothing is printed in the window when I iterate over my matches and print them. Any ideas?

retracile Over a year ago

Look at the content of the file by hand. When I look at it, I don't see the string 'replies' in it anywhere. So the regex won't find any matches.

retracile Over a year ago

pattern = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>( <i>\(([0-9]+?) replies\))?' might be closer?

Btibert3 Over a year ago

I tried your new pattern, and what I don't get is that it returned no matches. I even shortened the pattern and tried this code, and when I try to print match.group(0), nothing (I think) gets sent to the console. Any ideas?

pattern = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>'
for match in re.finditer(pattern, page, re.S):
    print match(0)

Ned Deily · Accepted Answer · 2009-08-12 21:46:48Z

1

As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!

answered Aug 12, 2009 at 21:46

Ned Deily


1 Comment

Btibert3 Over a year ago

I have tried the documentation. As I am new to Python, and even to HTML for that matter, I am having a hard time 'easily' finding how to make it do what I need, although I have no doubt it can.

hughdbrown · Accepted Answer · 2009-08-13 21:51:23Z

0

import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

# Get all the links, rendered back to HTML strings
links = [str(match) for match in soup('a')]

# Keep only thread links (t-<digits>.html); escape the literal '?' and '.'
s = r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-\d+\.html">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        # group(1) is the text between <a ...> and </a>
        print m.group(1)

edited Aug 13, 2009 at 21:51

answered Aug 12, 2009 at 22:01

hughdbrown


4 Comments

Btibert3 Over a year ago

Is it possible to filter the links I want...as you can see in my attempt to do a regex, I want a certain set of links. Additionally, and I know I am pushing my luck, I was hoping to get the link text along with it. In short, is it possible to filter the links returned and get the link text with it?

hughdbrown Over a year ago

A couple of things: what is the "link text"? The stuff between <a href...> and </a>? Or the href value? Or some stuff after the opening <a> and closing </a>? Or something else?

Here's what I don't get: the page you point to, forums.epicgames.com/archive/index.php?f-356-p-164.html, doesn't even have a single instance of 'replies' in the HTML source. Are you sure you are looking for that? And why have you accepted as an answer a regex that cannot match any links in the data?

Btibert3 Over a year ago

New to Stack Overflow, didn't realize that meant I was done, sorry. By link text, I simply mean the text after the link in the source code (the text right before </a>). Since I am new to Python and web scraping, I am starting slow and trying to learn as much as I can. All I am looking to do is grab the links from that archive (every page), follow each link (discussion), and grab all of the posts for that discussion. I will need to parse the data into a 'dataset', which can be a list. Put simply, I want to scrape the archives and collect all of the message titles and posts for each.

hughdbrown Over a year ago

Marking a solution as "the one" usually means that you are satisfied with it and responders will not expect to get any credit for further efforts. Also, if you select one of the solutions and it doesn't actually work, what should responders make of that? The new version of the code goes to the web page you cited, scrapes all the links, and then prints all the text between the opening and closing anchor tags. I think that's what you want.
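For readers who want the filtering and the link text in one pass, here is a hedged sketch in Python 3 that swaps in the standard library's html.parser instead of Beautiful Soup, so it is not the code from this answer. ThreadLinkParser and THREAD_HREF are hypothetical names, and the sample HTML is made up from URLs that appear in the question.

```python
import re
from html.parser import HTMLParser

# Thread links end in index.php?t-<digits>.html; forum index links use f- instead.
THREAD_HREF = re.compile(r'index\.php\?t-(\d+)\.html$')

class ThreadLinkParser(HTMLParser):
    """Collect (href, link text) pairs for anchors that point at a thread."""

    def __init__(self):
        super().__init__()
        self.links = []      # collected (href, text) pairs
        self._href = None    # href of the matching anchor currently open
        self._text = []      # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if THREAD_HREF.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

# Made-up sample: one forum index link (skipped) and one thread link (kept).
sample = (
    '<a href="http://forums.epicgames.com/archive/index.php?f-356-p-164.html">Page 164</a>'
    '<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
    'Gears of War 2: Horde Gameplay</a>'
)

parser = ThreadLinkParser()
parser.feed(sample)
parser.close()
for href, text in parser.links:
    print(href, text)
```

The regex here only inspects the href value, which is a much smaller job than matching whole tags, and the parser takes care of finding the anchor text.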

machineghost · Accepted Answer · 2009-08-12 21:24:03Z

0

To extend on what others wrote:

.? means "one or zero of any character"

.+ means "one or more of any character"

As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.

answered Aug 12, 2009 at 21:24

machineghost


1 Comment

retracile Over a year ago

Except that .+? is non-greedy matching of one or more characters. Which is what he's after.
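The difference between the greedy .+ and the non-greedy .+? shows up as soon as a string contains more than one link. A small sketch in Python 3, using a made-up two-link string:

```python
import re

# Made-up sample with two anchors on one line.
html = '<a href="x.html">first</a> <a href="y.html">second</a>'

# Greedy: .+ grabs as much as it can, so it runs to the LAST </a>.
greedy = re.findall(r'>(.+)</a>', html)

# Non-greedy: .+? stops at the FIRST </a> it can reach.
lazy = re.findall(r'>(.+?)</a>', html)

print(greedy)
print(lazy)
```

Here greedy holds a single string spanning both links, while lazy holds the two link texts separately, which is the behavior the question is after.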
