Using string array in Python 2.7

Question

I'm trying to do some scraping using Python 2.7.2. I've just started with Python, and unfortunately it is not as intuitive as I thought it would be. I'm trying to collect all the specific <span>s from all pages, but I don't know how to accumulate the results from all pages in a string array. So far I'm getting results from one page only. I know that this is a super easy question for people who write in Python, so please help me. Here is the code:

import urllib
import re
j=1
while j<10:
    url="http://www.site.com/search?page=" + str(j) + "&query=keyword"
    print url
    htmlfile=urllib.urlopen(url)
    htmltext=htmlfile.read()
    regex='<span class="class33">(.+?)</span>'
    pattern=re.compile(regex)
    spans=re.findall(pattern,htmltext)
    #spans[j] instead of spans doesn't work
    #spans.append(spans) doesn't work
    j+=1
i=0
while i<len(spans):
    print spans[i]
    i+=1

You are actually making things harder on yourself. I would use BeautifulSoup for this problem.
– Jason Sperske, commented Jul 3, 2013 at 15:40

Stefano M · Accepted Answer · 2013-07-03 15:45:40Z

1

Put all loop-invariant code (such as compiling the regex) outside the loop.
Also outside the loop, initialize s to an empty list:
```
s = []
```

Inside the loop, extend s with the matches from each page:

    s.extend(re.findall(pattern, htmltext))

If you prefer, s += re.findall(pattern, htmltext) does the same thing.
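
A minimal sketch of the accumulation pattern described above, using two hard-coded HTML snippets in place of downloaded pages. The `pages` list and its contents are invented for illustration. The pattern is compiled once, before the loop, and the two lists show that `extend` and `+=` build the same flat result.

```python
import re

# Hypothetical stand-ins for the HTML of two result pages
pages = [
    '<span class="class33">alpha</span><span class="class33">beta</span>',
    '<span class="class33">gamma</span>',
]

# Invariant code: compile the pattern once, before the loop
pattern = re.compile(r'<span class="class33">(.+?)</span>')

s = []  # accumulated with extend
t = []  # accumulated with +=
for htmltext in pages:
    matches = pattern.findall(htmltext)
    s.extend(matches)
    t += matches
```

After the loop, both `s` and `t` hold the matches from every page in order, rather than only those from the last page.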

answered Jul 3, 2013 at 15:45


Brian · Accepted Answer · 2013-07-03 15:45:33Z

0

Change

spans=re.findall(pattern,htmltext)

to

spans.extend(re.findall(pattern,htmltext))

I'd also change your loop syntax a bit

import urllib
import re

# Compile the pattern once; it does not change between pages
pattern = re.compile('<span class="class33">(.+?)</span>')
spans = []  # accumulates the matches from every page
for j in range(1, 10):  # pages 1 through 9, as in the question's while j<10 loop
    url = "http://www.site.com/search?page=" + str(j) + "&query=keyword"
    print url
    htmltext = urllib.urlopen(url).read()
    spans.extend(pattern.findall(htmltext))
for span in spans:
    print span
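
For experimenting offline, the same loop can be sketched as a function that takes the page-fetching step as a parameter. The names `collect_spans`, `fetch`, and `fake_fetch` are hypothetical and not part of the answer; in real use on Python 2.7, `fetch` would presumably wrap `urllib.urlopen(url).read()`.

```python
import re

PATTERN = re.compile(r'<span class="class33">(.+?)</span>')

def collect_spans(urls, fetch):
    """Return the matches from every page as one flat list."""
    spans = []
    for url in urls:
        spans.extend(PATTERN.findall(fetch(url)))
    return spans

def fake_fetch(url):
    # Hypothetical stand-in for a download: echoes the page number back
    page = url.split("page=")[1].split("&")[0]
    return '<span class="class33">item %s</span>' % page

urls = ["http://www.site.com/search?page=%d&query=keyword" % j
        for j in range(1, 4)]
result = collect_spans(urls, fake_fetch)
```

Keeping the download separate from the extraction makes it possible to check the accumulation logic without hitting the site.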

edited Jul 3, 2013 at 15:45

answered Jul 3, 2013 at 15:40


2 Comments

Ash Over a year ago

Unfortunately it doesn't work, here is the message: spans.append(re.findall(pattern,htmltext)) NameError: name 'spans' is not defined

Brian Over a year ago

Sorry, see the edit. I forgot to initialize spans to an empty list.

Craig · Accepted Answer · 2013-07-03 15:57:56Z

0

Before your loop, define spans:

spans = []

Then in your loop:

spans.extend(re.findall(pattern,htmltext))

The findall method will return a list. You want to extend the spans list with the new spans on each iteration.
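
A small sketch of why `extend` is the right call here and `append` is not, using an invented HTML snippet in place of a downloaded page. The variable names `appended` and `extended` are illustrative only.

```python
import re

pattern = re.compile(r'<span class="class33">(.+?)</span>')
# Hypothetical page content with two matching spans
htmltext = '<span class="class33">one</span><span class="class33">two</span>'

appended = []
appended.append(pattern.findall(htmltext))  # adds the whole list as ONE element

extended = []
extended.extend(pattern.findall(htmltext))  # adds each match individually
```

With `append`, the result is a list of lists (one inner list per page); with `extend`, it stays a flat list of strings that can be printed in a single loop.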

edited Jul 3, 2013 at 15:57

answered Jul 3, 2013 at 15:42


2 Comments

Ash Over a year ago

still getting an error: File "C:\Python27\Debug\mml01.py", line 13, in <module> spans.extend(re.findall(pattern,htmltext)) NameError: name 'spans' is not defined

Craig Over a year ago

You need to define spans before your loop: spans = []
