
I am testing how to use Selenium in Python and can successfully open a page with the code below on Ubuntu 16.04:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Point Selenium at the locally installed Firefox binary
firefox_options = Options()
firefox_options.binary_location = '/usr/bin/firefox'

# Start Firefox through geckodriver; `options=` supersedes the deprecated `firefox_options=` keyword
driver = webdriver.Firefox(executable_path='/home/myname/geckodriver', options=firefox_options)
driver.get('https://www.toutiao.com')

However, some data/content is missing compared with opening this page ('https://www.toutiao.com') manually.


My Firefox version is 72.0.2 and my geckodriver version is 0.26.0. Could anybody help me with this issue? Thanks in advance!

  • The question is not clear. Do you want to verify this text? Commented Feb 17, 2020 at 11:36
  • Yes, there is text here when I browse this page with Firefox, but nothing here when I use Selenium to open it. Commented Feb 17, 2020 at 11:41
  • Wait for the page to load; maybe it takes time. Commented Feb 17, 2020 at 11:42
  • Are you giving the page enough time to load? I checked the page and it takes a while to load. Commented Feb 17, 2020 at 11:43
  • Yes, of course. I waited a long time and loading had already completed, but the page is still blank with Selenium. The text appears quickly if I open the page the normal way. Commented Feb 17, 2020 at 11:46
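
The waiting suggested in these comments can be made explicit instead of relying on a fixed pause. The following is a minimal sketch, assuming Selenium's `WebDriverWait` is available; `body_has_text`, `wait_for_body_text`, and the `min_chars` threshold are hypothetical names and values chosen for illustration, not part of the original question.

```python
def body_has_text(driver, min_chars=200):
    """Return True if the rendered <body> holds at least min_chars of text.

    min_chars is an arbitrary illustrative threshold (an assumption).
    """
    text = driver.execute_script(
        "return document.body ? document.body.innerText : ''"
    )
    return len(text or "") >= min_chars


def wait_for_body_text(driver, timeout=30, min_chars=200):
    """Poll until the body has enough text; return False on timeout."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        WebDriverWait(driver, timeout).until(
            lambda d: body_has_text(d, min_chars)
        )
        return True
    except TimeoutException:
        return False
```

If `wait_for_body_text(driver)` still returns False after a generous timeout, slow loading is unlikely to be the explanation, which matches what the asker reports.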

1 Answer


I took your code, simplified the script, and on execution ran into the same issue, i.e. data/content is missing compared with opening this page manually, as follows:

  • Code Block:

        from selenium import webdriver
    
        driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
        driver.get('https://www.toutiao.com')
        print(driver.page_source)
    
  • Console Output:

        <html><head><style class="vjs-styles-defaults">
              .video-js {
            width: 300px;
            height: 150px;
              }
    
              .vjs-fluid {
            padding-top: 56.25%
              }
            </style><meta charset="utf-8"><title>????</title><meta http-equiv="x-dns-prefetch-control" content="on"><meta name="renderer" content="webkit"><link rel="dns-prefetch" href="//s3.pstatp.com/"><link rel="dns-prefetch" href="//s3a.pstatp.com/"><link rel="dns-prefetch" href="//s3b.pstatp.com"><link rel="dns-prefetch" href="//p1.pstatp.com/"><link rel="dns-prefetch" href="//p3.pstatp.com/"><meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,minimum-scale=1,user-scalable=no,minimal-ui"><meta name="360-site-verification" content="b96e1758dfc9156a410a4fb9520c5956"><meta name="360_ssp_verify" content="2ae4ad39552c45425bddb738efda3dbb"><meta name="google-site-verification" content="3PYTTW0s7IAfkReV8wAECfjIdKY-bQeSkVTyJNZpBKE"><meta name="shenma-site-verification" content="34c05607e2a9430ad4249ed48faaf7cb_1432711730"><meta name="baidu_union_verify" content="b88dd3920f970845bad8ad9f90d687f7"><meta name="domain_verify" content="pmrgi33nmfuw4ir2ej2g65lunfqw6ltdn5wselbcm52wszbchirdqyztge3tenrsgq3dknjume2tayrvmqytemlfmiydimddgu4gcnzcfqrhi2lnmvjwc5tfei5dcnbwhazdcobuhe2dqobrpu"><meta name="keywords" content="????,??,???,????,??????"><meta name="description" content="«????»(www.toutiao.com)????????????????,?????????????????,?????????????,??????????????????????"><link rel="alternate" media="only screen and (max-width: 640px)" href="//m.toutiao.com/"><link rel="shortcut icon" href="//s3a.pstatp.com/toutiao/resource/ntoutiao_web/static/image/favicon_5995b44.ico" type="image/x-icon"><link rel="stylesheet" href="//s3.pstatp.com/toutiao/player/dist/pc_vue2.css" media="screen" title="no title"><!--[if lt IE 9]>
          <p>?????????,?<a href="http://browsehappy.com/">?????</a></p>
        .
        .
        .
        <script>var imgUrl = '/c/9ubkblw9out4h9t6ya05r7h0uu7q2u341jhsdh7l4r4yphpuxlqgdm/';</script><script>tac='i+2gv2ch1tigds!i$1dmgs"yZl!%s"l"u&kLs#l l#vr*charCodeAtx0[!cb^i$1em7b*0d#>>>s j\uffeel  s#0,<8~z|\x7f@QGNCJF[\\^D\\KFYSk~^WSZhg,(lfi~ah`{md"inb|1d<,%Dscafgd"in,8[xtm}nLzNEGQMKAdGG^NTY\x1ckgd"inb<b|1d<g,&TboLr{m,(\x02)!jx-2n&vr$testxg,%@tug{mn ,%vrfkbm[!cb|'</script><script type="text/javascript" crossorigin="anonymous" src="//s3b.pstatp.com/toutiao/static/js/vendor.63b66d4280309ac2fb48.js"></script><script type="text/javascript" crossorigin="anonymous" src="//s3a.pstatp.com/toutiao/static/js/page/index_node/index.e6afc60a3a3f653cfdba.js"></script><script type="text/javascript" crossorigin="anonymous" src="//s3b.pstatp.com/toutiao/static/js/ttstatistics.a083f6cd9b1a9a970725.js"></script><script src="//s3.pstatp.com/inapp/lib/raven.js" crossorigin="anonymous"></script><script>;(function(window) {
            // sentry
            window.Raven && Raven.config('//[email protected]/log/sentry/v2/96', {
              whitelistUrls: [/pstatp\.com/],
              shouldSendCallback: function(data) {
            var ua = navigator && navigator.userAgent;
            var isDeviceOK = !/Mobile|Linux/i.test(navigator.userAgent);
            return isDeviceOK;
              },
              tags: {
            bid: 'toutiao_pc',
            pid: 'index_new'
              },
              autoBreadcrumbs: {
            'xhr': false,
            'console': true,
            'dom': true,
            'location': true
              }
            }).install();
          })(window);</script><script>document.getElementsByTagName('body')[0].addEventListener('click', function(e) {
            var target = e.target,
            ga_event,
            ga_category,
            ga_label,
            ga_value;
            while(target && target.nodeName.toUpperCase() !== 'BODY') {
              ga_event = target.getAttribute('ga_event');
              ga_category = target.getAttribute('ga_category') || '/';
              ga_label = target.getAttribute('ga_label') || '';
              ga_value = target.getAttribute('ga_value') || 1;
    
              ga_event && window.ttAnalysis && ttAnalysis.send('event', { ev: ga_event });
              target = target.parentNode;
            }
          });</script><script src="https://xxbg.snssdk.com/websdk/v1/getInfo?q=YOsueEs6CjZquUQrQwttBa2p27c%2FmJBGcEmZKypwf%2Fh%2B%2FFzCVrIwzk9L3bo%2FZb2O8gVTNaA4L2Bk10qWfZ2s94e6qe8KRXlOEjnI%2FrONB4jQynV3bfJ9exD2E4QPsgydRGjRLlDXE9uYD7HU3IZ%2FOU2MJG2vMgfNU55%2FmsOAlVSrPQH2wo4Eor0lgghKHjRi28vVvBdKY7JT4gG7S7ThRFD2YBIc%2Fs4JYViQu1Ll1Bg5Xn5bKuD6jZRz3AzfFqzSOWguO6vUbzL0wBc4mpa22mdpmAXIvUNWtjg5MUfXh9rfWI0ti7saL%2B0r4%2BaBdN5y4lrmxAcQZq2oeAKl4WjOeJsN%2BePpYmisoxTzdBZ6TL8IGE0E7ZUUlFlPGyUWhU3E4IRbtbCCd0QdVaJajiSOIhg9cImqTZYI56kIao1yVnV%2Bxu4%2BhaC1kHu5xsk49%2BX%2FNdwGcel%2BlOUzagkE5s8X6jEswA7jzW%2ByD6%2FusfkNyyx8WOWCJmZlTGQ4SNQr%2FQHvmK2QscQ7KnTvKVqjedUd7IFcvyTyYz3iFFrmRkOMRN9042sLiQwerXsn0f%2Fc%2Bh46PNdeU1S6BsFKq%2BZhMDxw1vI2Y1C%2Fa0RBdZC%2BGZq%2BkbNaoVotfvslg05ahevHTainlZR9DHEiWawFBJbTwjMeYrmo4NZiL5eNBUvslFn%2BDPHk%2F6Oj0Nbb89Rx8Ihi2pRH04voRog9848H2o2LR9gx0N0i0o6%3D&amp;callback=_8712_1581940674310"></script></body></html>
    

Analysis

While inspecting the DOM tree of the webpage, you will find that some of the <script> and <link> tags refer to resources whose paths contain the keyword dist. For example:

  • <link rel="stylesheet" href="//s3.pstatp.com/toutiao/player/dist/pc_vue2.css" media="screen" title="no title">
  • <script src="//unpkg.pstatp.com/byted/sec_sdk_build/1.1.12/dist/captcha.js"></script>
  • //s3a.pstatp.com/toutiao/picc_mig/dist/img.min.js

This suggests that the website is protected by a bot management service such as Distil Networks, and that navigation by the GeckoDriver-controlled Firefox gets detected and subsequently blocked.
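
The inspection described above can be repeated on `driver.page_source` with a short script. This is only a sketch using Python's standard `html.parser`; `ResourceCollector`, `find_marked_resources`, and the default marker strings are hypothetical names and choices made for illustration.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect the src/href URLs of <script> and <link> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "link"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)


def find_marked_resources(html, markers=("/dist/", "captcha")):
    """Return resource URLs that contain any of the marker substrings.

    The default markers are assumptions taken from the examples listed above.
    """
    collector = ResourceCollector()
    collector.feed(html)
    return [url for url in collector.urls if any(m in url for m in markers)]
```

Calling `find_marked_resources(driver.page_source)` lists the matching URLs. Note that `dist` is also a very common name for a build output directory, so a match is a hint worth following up rather than proof of any particular vendor.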


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium, the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
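
One automation signal that a page's own JavaScript can read is the `navigator.webdriver` property, which the W3C WebDriver specification defines as true while the browser is under WebDriver control. Whether Distil or toutiao.com relies on this property is an assumption here, not something the article states, and `reports_webdriver_flag` is a hypothetical helper name. A minimal sketch for checking what the page sees:

```python
def reports_webdriver_flag(driver):
    """Return True if in-page JavaScript sees navigator.webdriver as true.

    `driver` is any object exposing Selenium's execute_script method.
    """
    return bool(driver.execute_script("return navigator.webdriver === true"))
```

Running `reports_webdriver_flag(driver)` after `driver.get(...)` shows whether this one signal is exposed to the page; detection services may well combine it with other signals.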


Reference

You can find a couple of detailed discussions in:


3 Comments

Done! It seems there is no easy way to work around this, though :(
@X.C. There are a couple of ways to try to work around this, but they would be beyond the scope of Selenium. See the reference discussions, this and this discussion.
Using headless mode is probably even harder. I tried setting a user agent, but to no avail.
