Headless Chrome returning empty HTML when using a Proxy

Question

I am looking to use a headless browser to scrape some websites and need to use a proxy server.

I'm a bit lost and am looking for help.

When I disable the proxy it works perfectly every time.

When I disable headless mode I get an empty browser window, but if I press enter on the URL bar that has "https://www.whatsmyip.org" the page loads (using the proxy server showing a different IP).

I get the same result on other websites as well; it's not just whatsmyip.org.

I am running CentOS 7, Python 3.6 and Selenium 3.14.0.

I have also tried it on a Windows machine running Anaconda and have the same results.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.proxy import Proxy, ProxyType

my_proxy = "x.x.x.x:xxxx"  # I have a real proxy address here
proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': my_proxy,
    'ftpProxy': my_proxy,
    'sslProxy': my_proxy,
    'noProxy': ''
})

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--allow-insecure-localhost')
chrome_options.add_argument('--allow-running-insecure-content')
chrome_options.add_argument('--ignore-ssl-errors')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--ssl-protocol=any')
chrome_options.add_argument('--window-size=800,600')  # Chrome expects width,height
chrome_options.add_argument('--disable-application-cache')

capabilities = dict(DesiredCapabilities.CHROME)
proxy.add_to_capabilities(capabilities)
capabilities['acceptSslCerts'] = True
capabilities['acceptInsecureCerts'] = True

browser = webdriver.Chrome(executable_path=r'/home/glen/chromedriver', chrome_options=chrome_options, desired_capabilities=capabilities)

browser.get('https://www.whatsmyip.org/')

print(browser.page_source)

browser.close()

When I run the code I get the following returned:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

Not the website.

Gabriel Santos · Accepted Answer · 2022-02-21 20:53:43Z

3

There are two problems here:

1. You need to wait for the browser to load the website.
2. browser.page_source doesn't return what you want.

The first problem is solved by waiting for an element to appear in the DOM. Usually, you will want to scrape something, so you know how to identify the element. Add code to wait until that element exists.
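
The waiting step can be sketched as a small polling helper. This is a hand-rolled illustration of an explicit wait, not Selenium's own implementation; the helper name `wait_until`, the default timeout and poll interval, and the `h1` tag in the usage comment are all assumptions. Selenium ships the same pattern as `WebDriverWait(driver, timeout).until(condition)`.

```python
import time


def wait_until(driver, condition, timeout=10.0, poll_interval=0.5):
    """Call condition(driver) repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Usage with a real browser (the "h1" tag is a placeholder for an element
# you actually expect on the page):
#
#   from selenium.webdriver.common.by import By
#   headings = wait_until(browser, lambda d: d.find_elements(By.TAG_NAME, "h1"))
```

`find_elements` returns an empty list while the element is missing, so the helper keeps polling until at least one match exists.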

The second problem is that page_source doesn't return the current DOM but the HTML the browser initially loaded. If JavaScript has modified the page since then, you won't see those changes this way.

The solution is to locate the html element and ask for its outerHTML property:

from selenium.webdriver.common.by import By

# `browser` is the WebDriver instance created in the question
htmlElement = browser.find_element(By.TAG_NAME, "html")
dom = htmlElement.get_attribute("outerHTML")  # serialized current DOM
print(dom)

For details, see the examples at: https://www.seleniumhq.org/docs/03_webdriver.jsp#introducing-the-selenium-webdriver-api-by-example
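
As a complement, here is a minimal sketch that reads the live DOM through JavaScript and checks whether the body is still empty, as in the output shown in the question. `current_dom` and `body_is_empty` are hypothetical helper names introduced for illustration; `execute_script` is the standard WebDriver method for running JavaScript in the page.

```python
from html.parser import HTMLParser


def current_dom(driver):
    """Return the live DOM serialized as HTML, via JavaScript."""
    return driver.execute_script("return document.documentElement.outerHTML")


class _BodyContentFinder(HTMLParser):
    """Records whether <body> contains any tag or non-whitespace text."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.has_content = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.has_content = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body and data.strip():
            self.has_content = True


def body_is_empty(html):
    """True if the document has no tags or visible text inside <body>."""
    finder = _BodyContentFinder()
    finder.feed(html)
    finder.close()
    return not finder.has_content
```

One way to combine this with an explicit wait would be `WebDriverWait(browser, 10).until(lambda d: not body_is_empty(current_dom(d)))`, which keeps polling until the page has rendered something.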

edited Feb 21, 2022 at 20:53

Gabriel Santos

answered Mar 27, 2019 at 14:07

Aaron Digulla

2 Comments

Glen Over a year ago

Thanks for your response. How do you do this "The solution is to locate the html element and ask for the outerHtml property."?

miazo Over a year ago

For Python, shouldn't it rather be: dom = htmlElement.get_attribute("outerHTML")?

lenord · Accepted Answer · 2022-01-25 20:57:55Z

0

If the above did not solve the problem for you, try this (Python):

options.add_argument("--disable-blink-features=AutomationControlled")

Some sites can detect automation software and deliberately prevent the content from loading properly.

Source: ChromeDriver with Selenium displays a blank page
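
A minimal sketch of how the flag might be applied and checked follows. The helper names `add_stealth_argument` and `reports_automation` are hypothetical, and the check relies on the assumption that `navigator.webdriver` is one of the signals a site inspects; sites may use other detection methods that this flag does not affect.

```python
AUTOMATION_FLAG = "--disable-blink-features=AutomationControlled"


def add_stealth_argument(options):
    """Add the flag to a ChromeOptions-like object and return it."""
    options.add_argument(AUTOMATION_FLAG)
    return options


def reports_automation(driver):
    """Return True if the page sees navigator.webdriver as truthy."""
    return bool(driver.execute_script("return navigator.webdriver"))


# Usage with Selenium:
#
#   from selenium import webdriver
#   options = add_stealth_argument(webdriver.ChromeOptions())
#   browser = webdriver.Chrome(options=options)
#   print(reports_automation(browser))
```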

answered Jan 25, 2022 at 20:57

lenord
