I'm trying to scrape a page like this one

The site loads all of its data from the server and stores it in a JavaScript block, so that each button you click displays a different part of it. I was trying to simply request the page and read all the data from that script. The page is structured roughly like this:

<!DOCTYPE html>
<html lang="en" xmlns:wb="http://open.weibo.com/wb">
<head>
    <meta charset="utf-8">
        <title>Historical Statistics of Kristiansund BK vs Molde on 2020/07/03 - ScoreBing</title>

#Several script tags over here....
</head>

<body class="vEn">

#Some stuff here...

#This is where the buttons that deploy the data are
<div class="panel-body">
     <div id="live-filter-bar">
         <div class="row MBTitle">
               <div class="small-6 columns PL0">
                    <a href="javascript:set_type(1);" id="tabtypeid1" class="button tiny radius MB0 MRMini VM font-bold">All</a>
                    <a href="javascript:set_type(2);" id="tabtypeid2" class="button tiny radius action MB0 MRMini VM">This League</a>
                    <a href="javascript:" onClick="select(1)" id="tabid1" class="button tiny radius MB0 MRMini VM">All</a>
                    <a href="javascript:" onClick="select(2)" id="tabid2" class="button tiny radius action MB0 MRMini VM">HA</a>
                    <a href="javascript:" onClick="select(3)" id="tabid3" class="button tiny radius action MB0 MRMini VM">AH</a>
                    <a href="javascript:" onClick="select(4)" id="tabid4" class="button tiny radius action MB0 MRMini VM">HH</a>
                    <a href="javascript:" onClick="select(5)" id="tabid5" class="button tiny radius action MB0 MRMini VM">AA</a>
                </div>
                <div class="small-6 columns text-right PR0">
                    <a href="javascript:set_num(10);" id=td10 class="button tiny radius action MB0 MRMini VM">Last 10</a>
                    <a href="javascript:set_num(8);" id=td8 class="button tiny radius action MB0 MRMini VM">Last 8</a>
                    <a href="javascript:set_num(6);" id=td6 class="button tiny radius MB0 MRMini VM">Last 6</a>
                    <a href="javascript:set_num(4);" id=td4 class="button tiny radius action MB0 MRMini VM">Last 4</a>
                </div>
         </div>
     </div>
     <div id="history_table">

     </div>
     <div id="history1">

     </div>
     <div id="history2">

     </div>
</div>

</body>
</html>
<script type="text/javascript">

var kind=1,num=6,typenum=1;
var race=[],league_bgcolor=[],league_i= 1;
var race_have_corner_handicap=1;
var home_id = [];
var guest_id = [];
home_id.push(1405);    guest_id.push(4503);
var sclass='',leaue_id=198;
var tongji_info=[];
var half_goal_av='-',goal_av='-',half_corner_av='-',corner_av='-';
var tmp_host_name,tmp_guest_name,tmp_league_name;
        tmp_host_name = "Mjondalen";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[0]=[746711,198,'20/06/29 12:01','903',1410,tmp_host_name,'957',1405,tmp_guest_name,'0.0','2.5','11 ',tmp_league_name,'2' ,'1','0','0','3',' 3','1','2','0.0','5.5','1.0',1];
    tmp_host_name = "Molde";
tmp_guest_name = "Stabaek";
tmp_league_name = "Norway Tippeligaen";
race[1]=[746712,198,'20/06/29 12:00','661',4503,tmp_host_name,'1162',1396,tmp_guest_name,'-1.0','3.0','10 ',tmp_league_name,'2' ,'1','1','0','5',' 3','4','1','-0.5','4.5','1.0,1.5',1];
    tmp_host_name = "Haugesund";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[2]=[746167,198,'20/06/25 12:00','673',1390,tmp_host_name,'957',1405,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'1','0','1','8',' 3','2','2','0.0','5','1.0',1];
    tmp_host_name = "IK Start";
tmp_guest_name = "Molde";
tmp_league_name = "Norway Tippeligaen";
race[3]=[746169,198,'20/06/25 12:00','667',1392,tmp_host_name,'661',4503,tmp_guest_name,'+1.0','3.0','10 ',tmp_league_name,'4' ,'3','1','2','6',' 8','2','3','+0.5','4.5','1.0,1.5',1];
    tmp_host_name = "Kristiansund BK";
tmp_guest_name = "Aalesund";
tmp_league_name = "Norway Tippeligaen";
race[4]=[744697,198,'20/06/22 12:01','957',1405,tmp_host_name,'677',1321,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'2','3','2','6',' 2','7','2','0.0','5','1.0',1];
    tmp_host_name = "Molde";
tmp_guest_name = "Rosenborg";
tmp_league_name = "Norway Tippeligaen";
race[5]=[744698,198,'20/06/21 02:30','661',4503,tmp_host_name,'1161',2482,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'0','0','0','9',' 4','1','0','0.0','5','1.0',1];
    tmp_host_name = "Aalesund";
tmp_guest_name = "Molde";
tmp_league_name = "Norway Tippeligaen";
race[6]=[743531,198,'20/06/17 12:00','677',1321,tmp_host_name,'661',4503,tmp_guest_name,'0.0,+0.5','2.5,3.0','9.5 ',tmp_league_name,'8' ,'1','1','2','8',' 4','1','4','0.0,+0.5','4.5','1.0,1.5',1];
    tmp_host_name = "Rosenborg";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[7]=[743533,198,'20/06/17 12:00','1161',2482,tmp_host_name,'957',1405,tmp_guest_name,'-1.0','2.5,3.0','11.5 ',tmp_league_name,'7' ,'4','0','0','8',' 9','0','0','0.0,-0.5','5.5','1.0',1];

The script tag continues beyond this snippet. I have two problems. First, when I call response = requests.get(url=url) and inspect response.content, the output seems to stop at the closing html tag, so the script tag with all the data is missing. How do I get it with requests?

Second, how do I extract the data once I have it?

  • Are you sure this is an issue with the library? I just made a sample page with a script tag after the html body and was able to retrieve it without any issues. Maybe the webpage itself returns different responses depending on whether or not you use a web browser (things like the user agent are used to deny access to scrapers). Commented Jul 3, 2020 at 9:57

2 Answers

Well, it appears that it is simply a parser setting that should be adjusted with BeautifulSoup:

import re

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'www.scorebing.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    
}

response = requests.get('https://www.scorebing.com/match_history/747514', headers=headers)


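# html.parser keeps the <script> block that follows the closing </html> tag
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
# Find the script tag that contains the match data
script = soup.find('script', string=re.compile('race_have_corner_handicap'))
print(script)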

Output

<script type="text/javascript">
    var is_en = 1;
    var kind=1,num=6,typenum=1;
    var race=[],league_bgcolor=[],league_i= 1;
    var race_have_corner_handicap=1;
    var home_id = [];
    var guest_id = [];
    home_id.push(1405);    guest_id.push(4503);
...
</script>
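
For the second part of the question (extracting the data once the script tag is found), one possible approach is sketched below. It is only a sketch under assumptions: the helper name parse_races is hypothetical, the sample text is copied from the snippet in the question, and it assumes that every tmp_* assignment and every race[n] entry sits on its own line as shown there. The meaning of most fields is not documented in the question, so the code only relies on positions 5 and 8 holding the host and guest names.

```python
import ast
import re

# Sample copied from the question. In practice this would be the text of the
# <script> tag found with BeautifulSoup (for example script.string).
SAMPLE_SCRIPT = '''
var tmp_host_name,tmp_guest_name,tmp_league_name;
        tmp_host_name = "Mjondalen";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[0]=[746711,198,'20/06/29 12:01','903',1410,tmp_host_name,'957',1405,tmp_guest_name,'0.0','2.5','11 ',tmp_league_name,'2' ,'1','0','0','3',' 3','1','2','0.0','5.5','1.0',1];
    tmp_host_name = "Molde";
tmp_guest_name = "Stabaek";
tmp_league_name = "Norway Tippeligaen";
race[1]=[746712,198,'20/06/29 12:00','661',4503,tmp_host_name,'1162',1396,tmp_guest_name,'-1.0','3.0','10 ',tmp_league_name,'2' ,'1','1','0','5',' 3','4','1','-0.5','4.5','1.0,1.5',1];
'''

# tmp_xxx = "some string";
ASSIGN_RE = re.compile(r'^\s*(tmp_\w+)\s*=\s*("(?:[^"\\]|\\.)*")\s*;')
# race[n]=[ ... ];
RACE_RE = re.compile(r'^\s*race\[(\d+)\]\s*=\s*(\[.*\])\s*;')
# References to the tmp_* variables inside a race[...] array
NAME_RE = re.compile(r'\btmp_\w+\b')


def parse_races(script_text):
    """Return each race[n] array as a Python list (hypothetical helper)."""
    variables = {}
    races = []
    for line in script_text.splitlines():
        assign = ASSIGN_RE.match(line)
        if assign:
            # Remember the latest value of tmp_host_name and friends.
            # Assumption: the JS string is simple enough to be read as a
            # Python string literal.
            variables[assign.group(1)] = ast.literal_eval(assign.group(2))
            continue
        race = RACE_RE.match(line)
        if race:
            # Substitute the variable names with their current values so the
            # array becomes a plain literal that ast.literal_eval can read.
            literal = NAME_RE.sub(
                lambda m: repr(variables[m.group(0)]), race.group(2)
            )
            races.append(ast.literal_eval(literal))
    return races


for row in parse_races(SAMPLE_SCRIPT):
    # Positions 2, 5 and 8 hold the date string, host name and guest name.
    print(row[2], row[5], "vs", row[8])
```

The same function could be applied to the full script text from the page, as long as the layout matches the snippet in the question.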
1 Comment

Curious: if you parse it with BeautifulSoup(response.content, 'lxml') you get nothing after the closing html tag, but with BeautifulSoup(response.content, 'html.parser'), as you suggested, the trailing script tag appears. Thanks!
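
Since the result depends on which parser is used, another option is to skip HTML parsing and search the raw response text for the script block. The following is a minimal sketch, assuming the data sits in an ordinary script element. The function name find_script and the sample HTML are hypothetical and made up for illustration; in practice the input would be response.text.

```python
import re

# Matches the body of each <script ...> ... </script> element in raw HTML.
SCRIPT_RE = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL | re.IGNORECASE)

# Made-up page with one script in the head and one after the closing html tag.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
<head>
<script type="text/javascript">var unrelated = true;</script>
</head>
<body class="vEn"></body>
</html>
<script type="text/javascript">
var race=[],league_bgcolor=[],league_i= 1;
var race_have_corner_handicap=1;
</script>
"""


def find_script(html, marker):
    """Return the body of the first script that contains marker, else None."""
    for match in SCRIPT_RE.finditer(html):
        body = match.group(1)
        if marker in body:
            return body
    return None


script_text = find_script(SAMPLE_HTML, 'race_have_corner_handicap')
print(script_text)
```

A regular expression is less robust than a real parser for general HTML, but here it only has to locate one known block, and it does not care whether that block comes before or after the closing html tag.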
The page looks to be updated by a script after loading.

You can work around this by using Selenium instead of requests:

from selenium import webdriver
from bs4 import BeautifulSoup
import re

# Start Firefox in private browsing mode
options = webdriver.FirefoxOptions()
options.set_preference("browser.privatebrowsing.autostart", True)

driver = webdriver.Firefox(options=options)

driver.get("https://www.scorebing.com/match_history/747514")
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
#Find the script tag that contains specific text:
data = soup.find('script', string=re.compile('race_have_corner_handicap'))
print(data)

Output

<script type="text/javascript">
    var is_en = 1;
    var kind=1,num=6,typenum=1;
    var race=[],league_bgcolor=[],league_i= 1;
    var race_have_corner_handicap=1;
    var home_id = [];
    var guest_id = [];
    home_id.push(1405);    guest_id.push(4503);
...
</script>

1 Comment

The problem is that I need to get data from more than 100 links, and Selenium is too time-consuming. If, after loading the page, they call the database to load the data into a script, isn't it possible to replicate that call using requests?
