I'm trying to do some scraping using Python 2.7.2. I've just started with Python, and unfortunately it is not as intuitive as I thought it would be. I'm trying to collect all specific <span> elements from all pages, but I don't know how to accumulate the results from all pages in a list of strings. So far I'm getting results from only one page. I know this is a very easy question for people who write Python, so please help me. Here is the code:

import urllib
import re
j=1
while j<10:
    url="http://www.site.com/search?page=" + str(j) + "&query=keyword"
    print url
    htmlfile=urllib.urlopen(url)
    htmltext=htmlfile.read()
    regex='<span class="class33">(.+?)</span>'
    pattern=re.compile(regex)
    spans=re.findall(pattern,htmltext)
    #spans[j] instead of spans doesn't work
    #spans.append(spans) doesn't work
    j+=1
i=0
while i<len(spans):
    print spans[i]
    i+=1
    You are actually making things harder on yourself. I would use BeautifulSoup for this problem. Commented Jul 3, 2013 at 15:40

3 Answers

  1. Put all loop-invariant code (such as compiling the regex) outside the loop.
  2. Outside the loop, initialize s to an empty list:

    s = []
    
  3. Inside the loop, extend s with the matches from the current page:

        s.extend(re.findall(pattern, htmltext))
    

If you prefer, s += re.findall(pattern, htmltext) will do the same.
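
A minimal sketch of the three steps above. To keep it runnable without network access, it assumes a few made-up HTML strings (the hypothetical list `pages`) in place of the text returned by urllib.urlopen(url).read(); the helper name `collect_matches` is also invented for illustration.

```python
import re

# Step 1: invariant code lives outside the loop, so compile the pattern once.
pattern = re.compile(r'<span class="class33">(.+?)</span>')

# Hypothetical stand-ins for the HTML text of three result pages.
pages = [
    '<span class="class33">a</span><span class="class33">b</span>',
    '<p>no matches here</p>',
    '<span class="class33">c</span>',
]

def collect_matches(pattern, pages):
    # Step 2: start from an empty list before the loop.
    s = []
    for htmltext in pages:
        # Step 3: add this page's matches to the running list.
        s.extend(pattern.findall(htmltext))
    return s

s = collect_matches(pattern, pages)
print(s)
```

A page with no matches contributes an empty list, so extend simply leaves s unchanged for that page.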


Change

spans=re.findall(pattern,htmltext)

to

spans.extend(re.findall(pattern,htmltext))

I'd also change your loop syntax a bit:

import urllib
import re
spans = []
for j in range(1,10):  # pages 1-9, same as the original while j<10
    url="http://www.site.com/search?page=" + str(j) + "&query=keyword"
    print url
    htmlfile=urllib.urlopen(url)
    htmltext=htmlfile.read()
    regex='<span class="class33">(.+?)</span>'
    pattern=re.compile(regex)
    spans.extend(re.findall(pattern,htmltext))
for span in spans:
    print span

2 Comments

Unfortunately it doesn't work, here is the message: spans.append(re.findall(pattern,htmltext)) NameError: name 'spans' is not defined
Sorry, see the edit; I forgot to initialize spans to an empty list.

Before your loop, define spans:

spans = []

Then in your loop:

spans.extend(re.findall(pattern,htmltext))

re.findall returns a list of matches, so you want to extend the spans list with the new matches on each iteration.
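
A small sketch of why extend fits here and append does not, using made-up lists (`page1`, `page2`) in place of real re.findall results:

```python
# Hypothetical results of re.findall on two pages.
page1 = ['a', 'b']
page2 = ['c']

# extend adds each element of the argument to the list.
extended = []
extended.extend(page1)
extended.extend(page2)

# append adds the argument itself as a single element.
appended = []
appended.append(page1)
appended.append(page2)

print(extended)  # one flat list: ['a', 'b', 'c']
print(appended)  # a list of lists: [['a', 'b'], ['c']]
```

With append, the final printing loop would print one list per page rather than one match per line.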

2 Comments

still getting an error: File "C:\Python27\Debug\mml01.py", line 13, in <module> spans.extend(re.findall(pattern,htmltext)) NameError: name 'spans' is not defined
You need to define spans before your loop: spans = []
