1

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()

When scraping these web pages, I've found some non-ASCII characters in the HTML file written by this script, which I need to either remove or convert into readable characters. Any advice? Thanks.

3
  • The script you wrote does not throw any errors. What is the problem with the non-ASCII letters? Do you not want them in the file you are writing? Commented Oct 21, 2015 at 16:04
  • I know there are no errors, but there are some characters such as "Â" in the HTML that I need to remove. Commented Oct 21, 2015 at 16:06
  • @Poggio Maybe this will be of help: stackoverflow.com/questions/17732695/… Commented Oct 21, 2015 at 16:24
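
A stray "Â" like the one mentioned in the comments is often a sign that UTF-8 bytes are being displayed as Latin-1, rather than a sign of bad data. The sketch below reproduces that effect on a non-breaking space. The helper name `misread_as_latin1` is hypothetical, and whether this is what happens with the pages in the question is an assumption.

```python
# -*- coding: utf-8 -*-

def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were Latin-1."""
    return text.encode('utf-8').decode('latin-1')

# A non-breaking space (U+00A0) is two bytes in UTF-8: 0xC2 0xA0.
nbsp = u'\u00a0'
garbled = misread_as_latin1(nbsp)

# Read as Latin-1, byte 0xC2 is the letter "Â" (U+00C2).
print(garbled[0] == u'\u00c2')
```

If that is the cause, the file contents may already be correct, and declaring or using the right encoding when reading the file would make the "Â" disappear.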

3 Answers

4

Characters in a byte string are 8 bits (0-255), while ASCII characters are 7 bits (0-127), so you can simply drop every character with an ord value of 128 or above.

chr converts an integer to a character; ord converts a character to an integer.

text = ''.join(c for c in str(div) if ord(c) < 128)

This should be your final code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
# Keep only the characters whose ordinal value is in the ASCII range.
text = ''.join(c for c in str(div) if ord(c) < 128)
f.write(text)
f.close()
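
The filtering step in the script above can be tried in isolation. This is a minimal sketch on a made-up sample string rather than on scraped HTML; the function name `keep_ascii` is hypothetical, and it works on Unicode text, whereas `str(div)` in Python 2 yields a byte string.

```python
def keep_ascii(text):
    """Return text without any character whose code point is 128 or above."""
    return ''.join(c for c in text if ord(c) < 128)

# "cafe" with an accented e, a space, a non-breaking space, then "ok".
sample = u'caf\u00e9 \u00a0ok'
print(keep_ascii(sample))  # prints: caf ok
```

Note that the accented letter is dropped entirely rather than replaced with its unaccented form.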

3 Comments

This is the error I get:
Traceback (most recent call last):
  File "pppp.py", line 38, in <module>
    div = ''.join((c for c in div if ord(c) < 128))
  File "pppp.py", line 38, in <genexpr>
    div = ''.join((c for c in div if ord(c) < 128))
TypeError: ord() expected string of length 1, but Tag found
There should be a str(div) to convert the div tag to a text string; I forgot that.
There are some characters I need to handle in a better way, such as accented letters (for example à, è, ì, ò, ù), which I need to print along with the rest of the text. Do you know if there is a solution?
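
If the accented letters should be kept rather than stripped, one option is to write the text to the file as UTF-8 instead of filtering it. The sketch below assumes the content is available as Unicode text (in Python 2 that would mean something like `unicode(div)` instead of `str(div)`, which is an assumption about how it would plug into the script). The helper names and the temporary file are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def write_utf8(path, text):
    """Write Unicode text to path, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def read_utf8(path):
    """Read the file at path back as Unicode text, decoding from UTF-8."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Sample text containing the accented letters a, e, i, o, u with grave accents.
sample = u'<p>\u00e0 \u00e8 \u00ec \u00f2 \u00f9</p>'

path = os.path.join(tempfile.mkdtemp(), 'sample.html')
write_utf8(path, sample)
print(read_utf8(path) == sample)  # prints: True
```

For a browser to display such a file correctly, the HTML would also need to declare UTF-8 as its character set.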
4

Try normalizing the string and then encoding it as ASCII, ignoring errors.

# -*- coding: utf-8 -*-
from unicodedata import normalize

string = 'úäô§'

if isinstance(string, str):
    string = string.decode('utf-8')

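# NFKD splits each accented letter into a base letter plus a combining mark;
# the ASCII encode then drops the marks and anything else that is not ASCII.
print normalize('NFKD', string).encode('ASCII', 'ignore')
# Output: uao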
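
The same idea can be wrapped in a small function that takes Unicode text directly. This is a sketch under the assumption that the input is already decoded; the name `strip_accents` is hypothetical and is not part of `unicodedata`.

```python
# -*- coding: utf-8 -*-
from unicodedata import normalize

def strip_accents(text):
    """Replace accented letters with their base letters and drop other non-ASCII."""
    decomposed = normalize('NFKD', text)
    # Combining marks and symbols with no ASCII form are discarded here.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# The five grave-accented vowels from the earlier comment.
print(strip_accents(u'\u00e0 \u00e8 \u00ec \u00f2 \u00f9'))  # prints: a e i o u
```

Characters without a decomposition, such as the section sign in the example above, are removed rather than converted.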

1 Comment

I think your solution is the best, because my solution does weird things to 16-bit encoded letters, whereas yours behaves slightly more sanely.
-2

To remove non-ASCII characters from text:

import string

# Keep only printable ASCII characters.
text = ''.join(ch for ch in text if ch in string.printable)

2 Comments

This throws errors, because I can't write non-ASCII characters in the Python source.
@Poggio You can't run this list comprehension? What errors are you getting?
