Python HTMLParser(encoding='utf-8') error

Question

When I print the result, I get: ['Ordinateur', 'Impression', 'Tablette & TÃ©lÃ©phonie ', 'MultimÃ©dia',...] What I want instead is: ['Ordinateur', 'Impression', 'Tablette & Téléphonie ', 'Multimédia',...]

I'm trying to scrape a list of entries from the header of a website with the accented characters intact. Here is my code:

from tkinter import *
import tkinter.ttk
from lxml import html
import requests
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from time import sleep
import csv
import os
import re


             
index="https://www.disway.com/"
p=requests.get(index)
pages_s=[]
script= html.fromstring(p.text,parser=html.HTMLParser(encoding='utf-16'))

pages_s.extend(script.xpath('//*[@id="7ea42b1d-f4c2-41af-9908-eaaec09f308c"]/li/a/text()'))
pages_s.extend(script.xpath('//*[@id="7ea42b1d-f4c2-41af-9908-eaaec09f308c"]/li/ul/li/a/text()'))
print(pages_s)

Where is the usage of Selenium here?

– undetected Selenium, Commented Dec 17, 2021 at 5:27

Mark Tolonen · Accepted Answer · 2021-12-17 00:29:01Z

requests thinks the web page is encoded in ISO-8859-1, but it is really UTF-8. The web page doesn't declare the content encoding correctly, so p.text decodes the bytes with the wrong codec. Use p.content to get the raw bytes of the response, and have the parser decode them as UTF-8 instead:

from lxml import html
import requests

index = "https://www.disway.com/"
p=requests.get(index)
pages_s = []
script = html.fromstring(p.content,parser=html.HTMLParser(encoding='utf8'))

pages_s.extend(script.xpath('//*[@id="7ea42b1d-f4c2-41af-9908-eaaec09f308c"]/li/a/text()'))
pages_s.extend(script.xpath('//*[@id="7ea42b1d-f4c2-41af-9908-eaaec09f308c"]/li/ul/li/a/text()'))
print(pages_s)

Console output:

['Ordinateur', 'Impression', 'Tablette & Téléphonie ', 'Multimédia', 'Accessoires', 'PC portable', 'PC bureau', 'Tout en un ', 'Options', 'Imprimante', 'Scanner', 'Terminal point de vente', 'Traceur', 'Copieur', 'Fax', 'Consommable', 'Options', 'Tablette', 'Smartphone ', 'Objet connecté', 'Casque & écouteurs', 'Options', 'Écran PC', 'Téléviseur', 'Vidéoprojecteur', 'Ecran projection', 'Visioconférence', 'Photo & vidéo', 'Options', 'Câble', 'Lecteur', 'Disque dur', 'Mémoire flash', 'Bagagerie', 'Clavier & souris', 'Barrette mémoire', 'Gaming', 'Audio', 'Webcam', 'Power bank', 'Multi-prise', 'Onduleur', 'Autres & divers']
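
To see why the original output was garbled, the mismatch can be reproduced offline with plain strings. This is only a sketch of the mechanism, not part of the scraper: the sample string is copied from the list above, and the variable names (original, raw, garbled, repaired) are made up for illustration.

```python
# Offline illustration of the decoding mismatch; no network access needed.
original = "Tablette & Téléphonie"

# Bytes as a server sending UTF-8 would transmit them.
raw = original.encode("utf-8")

# Decoding those bytes as ISO-8859-1 yields the mojibake seen in the question.
garbled = raw.decode("iso-8859-1")
print(garbled)

# Decoding the same bytes as UTF-8 gives the intended text.
print(raw.decode("utf-8"))

# Text that was already decoded wrongly can be repaired by reversing the step.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

Each accented character occupies two bytes in UTF-8, and ISO-8859-1 maps every byte to its own character, which is why one 'é' turns into the two characters 'Ã©'.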
