How to fix Cyrillic characters while web-scraping with Python

Question

I'm scraping a Cyrillic website with Python using BeautifulSoup, but every word comes out garbled like this:

Ð¡Ð¸Ð»ÑÐ°Ð½Ð¾Ð²ÑÐºÐ° ÐÐ°Ð²ÐºÐ¾Ð²Ð° Ð²Ð¾ ÐÐ°Ð·Ð¸

I also tried some other Cyrillic websites, and they work fine.

My code is this:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://').text

soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())

How should I fix it?

Patryk Bratkowski · Accepted Answer · 2019-04-22 21:58:49Z

4

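requests fails to detect the page as UTF-8 and decodes it as ISO-8859-1 instead, so you need to override the encoding manually before reading the text: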

from bs4 import BeautifulSoup
import requests

source = requests.get('https://time.mk/')  # don't convert to text just yet

# print(source.encoding)
# prints out ISO-8859-1

source.encoding = 'utf-8'  # override encoding manually

soup = BeautifulSoup(source.text, 'lxml')  # this will now decode utf-8 correctly
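To see why the override helps, here is a minimal standard-library sketch of the mechanism: UTF-8 bytes decoded as ISO-8859-1 turn into the kind of garbled text shown in the question. The sample string is an illustrative assumption, not text taken from the site.

```python
# Illustrative sample only; any Cyrillic string behaves the same way.
original = "Кирилица"

# What the server sends: the text encoded as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# What you get when those bytes are decoded with the wrong codec.
garbled = raw_bytes.decode("iso-8859-1")

# Reversing the wrong decode and decoding as UTF-8 recovers the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

Setting source.encoding = 'utf-8' avoids the wrong decode in the first place, so no repair step is needed.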

edited Apr 22, 2019 at 21:58

answered Apr 22, 2019 at 21:23


6 Comments

snakecharmerb Over a year ago

The site's content-type header doesn't declare a charset, so requests falls back to ISO-8859-1 (latin-1). However, there is a meta tag in the HTML that defines the charset, so another approach might be to pass source.content to BeautifulSoup and let BeautifulSoup handle the decoding.
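A rough sketch of that alternative, assuming the page declares its charset in a meta tag. The inline HTML stands in for the real response body so the example runs offline, and 'html.parser' is used here only to avoid depending on lxml; with a live response you would pass source.content instead.

```python
from bs4 import BeautifulSoup

# Stand-in for source.content: raw, undecoded bytes with a charset meta tag.
# The markup and text are illustrative, not copied from the site.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>Вести</p></body></html>'
).encode("utf-8")

# Given bytes, BeautifulSoup works out the encoding itself,
# taking the declared charset into account.
soup = BeautifulSoup(html_bytes, "html.parser")

print(soup.original_encoding)  # the encoding BeautifulSoup settled on
print(soup.p.get_text())
```

The difference from the accepted answer is that nothing is hard-coded: the encoding comes from the document rather than from the caller.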

scpbook Over a year ago

When I add the line source.encoding = 'utf-8' I don't get any errors, but the output is blank. Did you get any result with this?

Patryk Bratkowski Over a year ago

@scpbook setting a variable doesn't print anything. Just like foo = 42 doesn't print anything unless you print(foo). You can add a print(source.encoding) on the following line to test it, or simply see if it fixed your problem. It has for me, at least.

scpbook Over a year ago

@PatrykBratkowski Of course I'm printing it. My code:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://time.mk/')

source.encoding = 'utf-8'

soup = BeautifulSoup(source.text, 'lxml')

print(soup)

It shows that I have 2740 lines of text, but when I open it, it's empty.

Patryk Bratkowski Over a year ago

@scpbook I think you should make a new post if you are having a different problem now, as SO isn't really suited to discussing it in comments. The code I posted definitely works.
