
I am new to Python (I don't have any programming training either), so please keep that in mind as I ask my question.

I am trying to search a retrieved webpage and find all links using a specified pattern. I have done this successfully in other scripts, but I am getting an error that says

raise error, v # invalid expression

sre_constants.error: multiple repeat

I have to admit I do not know why, but again, I am new to Python and regular expressions. However, even when I don't use patterns and use a specific link (just to test the matching), I do not believe I get any matches (nothing is sent to the window when I print match.group(0)). The link I tested is commented out below.

Any ideas? It usually is easier for me to learn by example, but any advice you can give is greatly appreciated!

Brock

import urllib2
from BeautifulSoup import BeautifulSoup
import re

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

pattern = r'<a href="http://forums.epicgames.com/archive/index.php?t-([0-9]+).html">(.?+)</a> <i>((.?+) replies)'
#pattern = r'href="http://forums.epicgames.com/archive/index.php?t-622233.html">Gears of War 2: Horde Gameplay</a> <i>(20 replies)'

for match in re.finditer(pattern, page, re.S):
    print match.group(0)

5 Answers

1

That means your regular expression has an error.

(.?+)</a> <i>((.?+)

What does ?+ mean? Both ? and + are metacharacters (quantifiers) that do not make sense right next to each other in that order. Maybe you forgot to escape the '?' or something.
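One way to see which quantifier combinations the engine accepts is to try compiling them. This is only a minimal sketch: `compile_error` is a hypothetical helper name, not part of the `re` module. Note that Python 3.11 and later treat `?+` as a possessive quantifier, so `(.?+)` only raises "multiple repeat" on older versions such as the Python 2 used in the question.

```python
import re

def compile_error(pattern):
    """Return the error message if the pattern fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# '?+' stacks two quantifiers; before Python 3.11 this reports "multiple repeat".
print(compile_error(r'(.?+)'))
# '+?' is the non-greedy form of '+', so it compiles.
print(compile_error(r'(.+?)'))
# '**' is another stacked quantifier and is rejected.
print(compile_error(r'a**'))
```

Wrapping the compile step this way makes it easy to test a pattern in isolation before running it against a downloaded page.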

1 Comment

They make sense in the other order: +? is the non-greedy form of +.
1

You need to escape the literal '?' and the literal '(' and ')' that you are trying to match.

Also, instead of '?+', I think you're looking for the non-greedy matching provided by '+?'.

More documentation here.

For your case, try this:

pattern = r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">(.+?)</a> <i>\((.+?) replies\)'
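To check a pattern like this without fetching the page, it can be run against a small sample string. The sketch below uses the link from the commented-out line in the question as sample input, and its pattern escapes the literal dots as well as the '?' and the parentheses. It also shows why the unescaped literal link found nothing: in `php?t-` the '?' makes the preceding 'p' optional instead of matching a question mark.

```python
import re

# Sample markup, taken from the commented-out test link in the question.
sample = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
          'Gears of War 2: Horde Gameplay</a> <i>(20 replies)')

# Literal '.', '?', '(' and ')' are escaped; '+?' keeps the captures short.
pattern = (r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-([0-9]+)\.html">'
           r'(.+?)</a> <i>\((.+?) replies\)')

for match in re.finditer(pattern, sample, re.S):
    # Groups are: thread id, link text, reply count.
    print('%s | %s | %s' % match.groups())

first = re.search(pattern, sample, re.S)

# With '?' left unescaped, the regex looks for 'ph' + optional 'p' + 't-'.
unescaped = re.search(r'index.php?t-', sample)
print(unescaped)
```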

4 Comments

I changed my pattern and ran the script again, and yet no matches were found; at least, nothing is printed in the window when I try to iterate over my matches and print them. Any ideas?
Look at the content of the file by hand. When I look at it, I don't see the string 'replies' in it anywhere. So the regex won't find any matches.
pattern = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>( <i>\(([0-9]+?) replies\))?' might be closer?
I tried your new pattern, and what I don't get is that it returned no matches. I even shortened the pattern and tried this code, and when I try to print match.group(0), nothing (I think) gets sent to the console. Any ideas? pattern = r'<a href="forums.epicgames.com/archive/index.php\?t-([0-9]+).html">(.+?)</a>' for match in re.finditer(pattern, page, re.S): print match(0)
1

As you're discovering, parsing arbitrary HTML is not easy to do correctly. That's what packages like Beautiful Soup do. Note, you're calling it in your script but then not using the results. Refer to its documentation here for examples of how to make your task a lot easier!

1 Comment

I have tried the documentation. As I am new to Python, and even HTML for that matter, I am having a hard time 'easily' finding what I need it to do, although I have no doubt it can do what I need.
0
import urllib2
import re
from BeautifulSoup import BeautifulSoup

url = "http://forums.epicgames.com/archive/index.php?f-356-p-164.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

# Get all the links as strings; soup('a') is shorthand for soup.findAll('a')
links = [str(match) for match in soup('a')]

# Escape the literal '?' and '.'; (.+?) captures the link text non-greedily
s = r'<a href="http://forums\.epicgames\.com/archive/index\.php\?t-\d+\.html">(.+?)</a>'
r = re.compile(s)
for link in links:
    m = r.match(link)
    if m:
        # Print the captured link text
        print m.group(1)
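The same idea (let a parser find the anchors, then filter on the href) can be sketched without matching serialized tags at all. The example below is an illustration only: it swaps Beautiful Soup for the standard library's `html.parser` so it runs with no extra packages, it is written for Python 3, and `ThreadLinkParser` and the sample markup are hypothetical names and data, not part of the original script.

```python
import re
from html.parser import HTMLParser  # Python 3; the module is named HTMLParser in Python 2

class ThreadLinkParser(HTMLParser):
    """Collect (href, link text) pairs for anchors whose href matches a pattern."""

    def __init__(self, href_pattern):
        HTMLParser.__init__(self)
        self.href_re = re.compile(href_pattern)
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the matching anchor currently open
        self._text = []     # text chunks seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if self.href_re.search(href):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

# Hypothetical snippet standing in for the downloaded page.
sample = ('<a href="http://forums.epicgames.com/archive/index.php?t-622233.html">'
          'Gears of War 2: Horde Gameplay</a> '
          '<a href="http://forums.epicgames.com/index.php">Home</a>')

parser = ThreadLinkParser(r'index\.php\?t-\d+\.html$')
parser.feed(sample)
for href, text in parser.links:
    print(href + ' -> ' + text)
```

Filtering on the href value rather than on the whole tag means attribute order or extra attributes in the markup do not affect the match.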

4 Comments

Is it possible to filter the links I want...as you can see in my attempt to do a regex, I want a certain set of links. Additionally, and I know I am pushing my luck, I was hoping to get the link text along with it. In short, is it possible to filter the links returned and get the link text with it?
A couple of things: what is the "link text"? The stuff between <a href...> and </a>? Or the href value? Or some stuff after the opening <a> and closing </a>? Or something else? Here's what I don't get: the page you point to, forums.epicgames.com/archive/index.php?f-356-p-164.html, doesn't even have a single instance of 'replies' in the HTML source. Are you sure you are looking for that? And why have you accepted as an answer a regex that cannot match any links in the data?
New to Stack Overflow, didn't realize that meant I was done, sorry. By link text, I simply want the text after the link in the source code (the text right before </a>). Since I am new to Python and web scraping, I am starting slow and trying to learn as much as I can. But all I am looking to do is grab the links from that archive (every page), follow each link (discussion), and grab all of the posts for that discussion. I will need to parse the data into a 'dataset', which can be a list, but simply, I want to scrape the archives and collect all of the message titles and posts for each.
Marking a solution as "the one" usually means that you are satisfied with it and responders will not expect to get any credit for further efforts. Also, if you select one of the solutions and it doesn't actually work, what should responders make of that? The new version of the code goes to the web page you cited, scrapes all the links, and then prints all the text between the opening and closing anchor tags. I think that's what you want.
0

To extend on what others wrote:

.? means "one or zero of any character"

.+ means "one or more of any character"

As you can hopefully see, combining the two makes no sense; they are different and contradictory "repeat" characters. So, your error about "multiple repeats" is because you combined those two "repeat" characters in your regular expression. To fix it, just decide which one you actually meant to use, and delete the other.
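A short illustration of how these quantifiers behave may help. It is a sketch on a made-up string, not the forum page, and it also shows that the reverse order, `+?`, is valid: it is the non-greedy form of `+`.

```python
import re

text = '<b>one</b><b>two</b>'

# '.?' matches zero or one character.
single = re.match(r'.?', text).group(0)
print(single)

# '.+' is greedy: it runs as far as it can while still allowing a final '</b>'.
greedy = re.search(r'<b>(.+)</b>', text).group(1)
print(greedy)

# '.+?' is non-greedy: it stops at the first '</b>' it can reach.
lazy = re.search(r'<b>(.+?)</b>', text).group(1)
print(lazy)
```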

1 Comment

Except that .+? is non-greedy matching of one or more characters. Which is what he's after.
