How to remove non-ASCII characters in Python

Question

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()

While scraping these web pages, I found some non-ASCII characters in the HTML file written by this script. I need to either remove them or convert them into readable characters. Any advice? Thanks.

The script you wrote does not throw any errors. What is the problem with the non-ASCII letters? Do you not want them in the file you are writing?
– jcr, Commented Oct 21, 2015 at 16:04
I know there are no errors, but there are some characters such as "Â" in the HTML that I need to remove.
– Poggio, Commented Oct 21, 2015 at 16:06
@Poggio Maybe this will help: stackoverflow.com/questions/17732695/…
– LetzerWille, Commented Oct 21, 2015 at 16:24
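
A stray "Â" like the one mentioned in the comments often appears when UTF-8 bytes are read as Latin-1. Whether that is what happens on these pages is an assumption, not something the thread confirms. The sketch below shows the effect for a non-breaking space; it is written for Python 3, and the u'' prefixes are there so it should also run on Python 2.7. The variable names are illustrative only.

```python
# Assumption: the page contains a non-breaking space (U+00A0) encoded as UTF-8.
nbsp = u'\u00a0'
utf8_bytes = nbsp.encode('utf-8')       # two bytes: 0xC2 0xA0

# Reading those two bytes as Latin-1 turns each byte into its own character,
# so the single space becomes "Â" followed by a non-breaking space.
misread = utf8_bytes.decode('latin-1')
print(len(misread))  # prints: 2
```

If this is the cause, the fix is to decode and write the text with a consistent encoding instead of deleting characters.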

jcr · Accepted Answer · 2015-10-21 18:47:44Z

4

A byte holds 8 bits (values 0-255), while ASCII characters use only 7 bits (values 0-127), so you can simply drop every character whose ord value is 128 or above.

chr converts an integer to a character; ord converts a character to an integer.

text = ''.join(c for c in str(div) if ord(c) < 128)
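
As a minimal sketch of the same filter on a plain string, the helper below keeps only characters with a code point under 128. The name keep_ascii and the sample text are made up for illustration; the code is written for Python 3, and the u'' prefix is there so it should also run on Python 2.7.

```python
def keep_ascii(text):
    """Return text without any character whose code point is 128 or above."""
    return ''.join(c for c in text if ord(c) < 128)

# Sample containing three non-ASCII characters: e-acute, A-circumflex, i-diaeresis.
sample = u'Caf\u00e9 \u00c2 na\u00efve'
print(keep_ascii(sample))  # prints: Caf  nave
```

Note that the accented letters are removed outright, not replaced by their unaccented forms.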

This should be your final code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
# Keep only characters whose ord value is below 128 (plain ASCII).
text = ''.join(c for c in str(div) if ord(c) < 128)
f.write(text)
f.close()

edited Oct 21, 2015 at 18:47

answered Oct 21, 2015 at 15:51

jcr


3 Comments

Poggio Over a year ago

Traceback (most recent call last):
  File "pppp.py", line 38, in <module>
    div = ''.join((c for c in div if ord(c) < 128))
  File "pppp.py", line 38, in <genexpr>
    div = ''.join((c for c in div if ord(c) < 128))
TypeError: ord() expected string of length 1, but Tag found

This is the error.

jcr Over a year ago

There should be a str(div) to convert the div tag to a text string; I forgot that.

Poggio Over a year ago

There are some characters I need to handle in a better way, such as accented letters, for example à, è, ì, ò, ù, which I need to print along with the rest of the text. Do you know if there is a solution?
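
If the goal is to keep accented letters instead of dropping them, one option is to write the file with an explicit UTF-8 encoding. This is a sketch under assumptions and not something confirmed in the thread: the temporary path and sample text are placeholders. It is written for Python 3; io.open also exists in Python 2.7.

```python
import io
import os
import tempfile

# Placeholder text: the five accented letters from the comment above.
text = u'\u00e0 \u00e8 \u00ec \u00f2 \u00f9'

# Placeholder output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'out.html')

# Write and read back with the same explicit encoding.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

print(round_trip == text)  # prints: True
```

Whatever reads the file afterwards (a browser or an editor) also has to interpret it as UTF-8 for the letters to display correctly.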

Dušan Maďar · Accepted Answer · 2015-10-21 16:28:28Z

4

Try normalizing the string and then encoding it as ASCII, ignoring errors.

# -*- coding: utf-8 -*-
from unicodedata import normalize

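string = 'úäô§'

# In Python 2 a plain str holds bytes; decode it to unicode first.
if isinstance(string, str):
    string = string.decode('utf-8')

# NFKD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with 'ignore' then drops the marks and the '§'.
print normalize('NFKD', string).encode('ASCII', 'ignore')
# prints: uao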

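The same idea can be wrapped in a small helper. This is a sketch written for Python 3 (the answer above targets Python 2); the name to_ascii is made up for illustration, and the u'' prefixes are there so it should also run on Python 2.7.

```python
from unicodedata import normalize

def to_ascii(text):
    """Replace accented letters with their base letters and drop the rest."""
    # NFKD decomposition: one accented letter becomes base letter + combining mark.
    decomposed = normalize('NFKD', text)
    # Characters that cannot be encoded as ASCII (marks, symbols) are skipped.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'\u00e0 \u00e8 \u00ec \u00f2 \u00f9'))  # prints: a e i o u
print(to_ascii(u'\u00fa\u00e4\u00f4\u00a7'))            # prints: uao
```

Characters without an ASCII base form, such as the section sign, are still removed entirely.
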
edited Oct 21, 2015 at 16:28

answered Oct 21, 2015 at 16:21

Dušan Maďar

1 Comment

jcr Over a year ago

I think your solution is the best, because my solution does weird things to 16-bit encoded letters, whereas yours behaves slightly more sanely.
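
The "16-bit encoded letters" remark presumably refers to characters that take two bytes in UTF-8; that reading is an assumption. The sketch below (Python 3, and it should also run on Python 2.7) shows that both bytes of such a letter are 128 or above, so a byte-level filter removes the letter completely.

```python
# The accented letter e-grave (U+00E8), used here only as an example.
letter = u'\u00e8'

# bytearray yields integers when iterated, on both Python 2 and Python 3.
utf8_bytes = bytearray(letter.encode('utf-8'))
print(list(utf8_bytes))  # prints: [195, 168]

# Filtering byte values below 128 leaves nothing of the letter.
kept = [b for b in utf8_bytes if b < 128]
print(kept)  # prints: []
```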

LetzerWille · Answer · 2015-10-21 16:47:26Z

-2

To remove non-ASCII characters from text:

import string

# Keep only printable ASCII characters and join them back into a string.
text = ''.join([ch for ch in text if ch in string.printable])
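
For comparison, a regular expression can express the same goal of deleting everything outside the ASCII range. This is a sketch and not part of the original answer; the names NON_ASCII and strip_non_ascii are made up for illustration. It is written for Python 3, and the u'' prefix is there so it should also run on Python 2.7.

```python
import re

# Matches one or more characters outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(u'[^\\x00-\\x7F]+')

def strip_non_ascii(text):
    """Delete every run of non-ASCII characters from text."""
    return NON_ASCII.sub(u'', text)

print(strip_non_ascii(u'na\u00efve caf\u00e9'))  # prints: nave caf
```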

edited Oct 21, 2015 at 16:47

Dušan Maďar

answered Oct 21, 2015 at 15:51

LetzerWille

2 Comments

Poggio Over a year ago

This throws errors, because I can't write non-ASCII characters in the Python file.

LetzerWille Over a year ago

@Poggio You can't run this list comprehension? What errors are you getting?
