I'm working on a scraper that runs as a Chrome extension. It grabs all of the HTML on the page(s) and sends it to a Python script that filters and saves the data. I'm scraping this way because the website uses Distil Networks, which blocks a 'traditional' scraper.

The two programs connect successfully, but whenever I try to send 'Test.' to the Python server, it just prints the browser's request headers:

b'GET / HTTP/1.1 Host: localhost:18364 Connection: Upgrade Pragma: no-cache Cache-Control: no-cache User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 Upgrade: websocket Origin: chrome-extension://ocplnbpkkcpcomkjioockgnlohhkdeic Sec-WebSocket-Version: 13 Accept-Encoding: gzip, deflate, br Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7 Sec-WebSocket-Key: SDC7zPgHK/eV+QRSJy0DZQ== Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits'

The JavaScript Code (Client):

chrome.runtime.onMessage.addListener(function(request, sender) {
  if (request.action == "getSource") {
    var pageAmount = parseInt(request.source, 10)

    var allHTML = ""
    var BaseURL = "https://www.funda.nl/huur/rotterdam/p"

    function encode_utf8(s) {
      return unescape(encodeURIComponent(s));
    }

    var websocket = new WebSocket('ws://localhost:18364');

    websocket.onopen = function () {
      data = encode_utf8('Test.')
      websocket.send('Test.');
    };
    message.innerText = request.source;
  }
});

function onWindowLoad() {

  var message = document.querySelector('#message');

  chrome.tabs.executeScript(null, {
    file: "getPageContent.js"
  }, function() {
    // If you try to inject into an extension page or the webstore/NTP you'll get an error
    if (chrome.runtime.lastError) {
      message.innerText = 'There was an error injecting script : \n' + chrome.runtime.lastError.message;
    }
  });
}

window.onload = onWindowLoad;

The Python code (Server):

import socket

LocalSocket = socket.socket()
allHTML = ''

try:  # Connecting the Socket
    LocalSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    LocalSocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    LocalSocket.bind(('localhost', 18364))
    print("Connected.")
except socket.error as err:
    print("ConnectionError: %s" % err)


def main():
    LocalSocket.listen(1)

    c, addr = LocalSocket.accept()
    print('Got connection from', addr)
    print(c.recv(1024))

    c.close()

if __name__ == "__main__":
    main()

1 Answer

WebSockets are layered over HTTP, so this is expected behaviour. A WebSocket connection starts as an ordinary HTTP request, and that request is what your server printed. You need a web server (or something that speaks HTTP) to handle the Connection: Upgrade and Upgrade: websocket headers, then perform the rest of the handshake before you get a valid connection that supports bi-directional communication.
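
To illustrate what the handshake involves, here is a minimal sketch of the server's reply using only the standard library. It assumes the raw request bytes have already been read from the socket (as in the question's c.recv(1024) call); accept_key and handshake_response are hypothetical helper names, not part of any library. After this reply the connection carries WebSocket frames, and client frames are masked, so the server would still need to decode them before it could read 'Test.'.

```python
import base64
import hashlib

# Fixed GUID from RFC 6455; the server appends it to the client's key.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(client_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a Sec-WebSocket-Key."""
    digest = hashlib.sha1((client_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_response(request: bytes) -> bytes:
    """Build the 101 reply for a raw upgrade request (hypothetical helper)."""
    headers = {}
    # Skip the request line, then collect "Name: value" header pairs.
    for line in request.decode("latin-1").split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    key = headers["sec-websocket-key"]
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept_key(key)}\r\n"
        "\r\n"
    ).encode("ascii")
```

In the question's server this would be used as c.send(handshake_response(c.recv(1024))), although a single recv is not guaranteed to return the whole request.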

You could look at using the websockets package, which wraps this up nicely.
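
As a rough sketch of what that could look like, the server below listens on the same port as the question's code and prints whatever the extension sends. It assumes a recent version of the third-party websockets package (pip install websockets), where a handler takes a single connection argument; older releases also passed a path argument. The handler name handle and the "ok" reply are illustrative choices, not something the question requires.

```python
import asyncio


async def handle(websocket):
    """Print each message from the extension and acknowledge it."""
    async for message in websocket:
        print("Received:", message)
        await websocket.send("ok")


async def main():
    # Third-party package; imported here so handle() can be exercised
    # on its own without the package installed.
    import websockets

    # Same host and port as the question's raw-socket server.
    async with websockets.serve(handle, "localhost", 18364):
        await asyncio.Future()  # run until cancelled
```

Start it with asyncio.run(main()). Note that the package limits the size of incoming messages by default (the max_size parameter of serve), which may matter when sending the full HTML of a page.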
