
Hi, I am trying to scrape some data from this URL:

http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1

As you may have noticed, if the cookies and session data are not yet set, you are redirected to the base URL (http://www.21cineplex.com/).

I tried to do it like this:

import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        # Visit the base URL first so the session cookie (PHPSESSID) lands in the jar.
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)

        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()

        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)

        print splitSource

    except Exception, e:
        # Print the actual error message instead of discarding it.
        print "Error occurred in main block:", str(e)

However, I still failed to scrape that particular URL.

A quick inspection reveals that the website sets a session ID (PHPSESSID) and stores a copy in the client's cookies.

The question is: how do I work around this?

PS: I tried to install request (via pip), but it gives me a 404:

  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Getting page https://pypi.python.org/simple/
  URLs to search for versions for request:
  * https://pypi.python.org/simple/request/
  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Could not find any downloads that satisfy the requirement request

Cleaning up...

2 Answers


Thanks to @Chainik, I got it to work. I ended up modifying my code like this:

import urllib2
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'

# Hit the base URL first so the session cookie is stored in the jar.
opener.open(baseurl)
urllib2.install_opener(opener)

# Send a Referer header; without it the site answers with a 302 to the base URL.
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)

requestData = urllib2.urlopen(request)
htmlText = requestData.read()

Once the HTML text is retrieved, it's all about parsing its content.
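For the parsing step, here is a minimal sketch that uses the same pattern as the regex above. The sample HTML is made up for illustration (the real page markup may differ), and extract_blocks is a hypothetical helper name. re.DOTALL is added on the assumption that each <ul> block spans several lines, since "." does not match newlines by default.

```python
import re

# Same pattern as the regex above; DOTALL lets "." match newlines too.
BLOCK_RE = re.compile(r'<ul class="w462">(.*?)</ul>', re.DOTALL)

def extract_blocks(html_text):
    """Return the inner HTML of every <ul class="w462"> block."""
    return [block.strip() for block in BLOCK_RE.findall(html_text)]

# Made-up sample markup, only to show the call.
sample_html = '''
<ul class="w462">
  <li>Movie A</li>
</ul>
<ul class="w462">
  <li>Movie B</li>
</ul>
'''

blocks = extract_blocks(sample_html)
```

In the real script, htmlText would be passed in place of sample_html. For anything beyond a quick extraction, an HTML parser is usually more robust than a regex.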

Cheers



Try setting a referer URL; see below.

Without referer URL set (302 redirect):

$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/

With referer URL set (HTTP/200):

$ curl -I -e "http://www.21cineplex.com/" "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/

To set the referer URL using urllib, see this post.
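As a rough Python equivalent of the curl -e call, the sketch below builds a request that carries a Referer header. The helper names build_request and fetch are hypothetical, and the import fallback is only there so the same snippet runs on both Python 2 (urllib2) and Python 3 (urllib.request).

```python
try:
    import urllib2 as urlrequest          # Python 2
except ImportError:
    import urllib.request as urlrequest   # Python 3

BASE_URL = "http://www.21cineplex.com/"
PAGE_URL = "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"

def build_request(url, referer):
    """Build a request with a Referer header (what curl -e does)."""
    request = urlrequest.Request(url)
    request.add_header("Referer", referer)
    return request

def fetch(url, referer):
    """Open the URL and return (final_url, body). This performs a network call."""
    response = urlrequest.urlopen(build_request(url, referer))
    try:
        return response.geturl(), response.read()
    finally:
        response.close()

# Building the request does not touch the network.
request = build_request(PAGE_URL, BASE_URL)
```

Calling fetch(PAGE_URL, BASE_URL) and comparing the returned final URL with PAGE_URL is one way to tell whether the server still redirected, because urlopen follows the 302 on its own and geturl() reports where it ended up.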

-- ab1

1 Comment

Hi, thanks. In theory this should work, but I am not sure about the expiration part. Anyway, I'll let you know.
