I'm trying to extract plain text from a website using Python. My code looks something like this (a slightly modified version of what I found here):

import requests
import urllib
from bs4 import BeautifulSoup
url = "http://www.thelatinlibrary.com/vergil/aen1.shtml"
r = requests.get(url)
k = r.content
file = open('C:\\Users\\Anirudh\\Desktop\\NEW2.txt','w')
soup = BeautifulSoup(k)
for script in soup(["Script","Style"]):
    script.exctract()
text = soup.get_text
file.write(repr(text))

This doesn't seem to work. I'm guessing that BeautifulSoup doesn't accept r.content. What can I do to fix this?

This is the error:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 8 of the file C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))
Traceback (most recent call last):
  File "C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py", line 12, in <module>
    file.write(repr(text))
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 2130: character maps to <undefined>

Process finished with exit code 1
  • What is the error you get? Commented Aug 14, 2016 at 13:03
  • @OrDuan I have edited the error into the question. Commented Aug 14, 2016 at 13:10
  • Try soup = BeautifulSoup(k, 'html.parser') and tell me if the error changes. Commented Aug 14, 2016 at 13:14
  • @Harrison, it is now: Traceback (most recent call last): File "C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py", line 12, in <module> file.write(repr(text)) File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 2130: character maps to <undefined>. Oh, by the way, what was that warning, and what happened to it when you included html.parser? Commented Aug 14, 2016 at 13:18
  • @AnirudhGanesh If you look at the error message, it's telling you that it can't encode this character: codetable.net/hex/97 Commented Aug 14, 2016 at 13:21

2 Answers

The "error" is only a warning and is of no consequence. Silence it by naming the parser explicitly: soup = BeautifulSoup(k, 'html.parser')

There is also a typo: script.exctract() should be script.extract().

The actual error seems to be that the content is a bytestring, but you are writing to a file opened in text mode. The source contains an em dash, and the file's default encoding (cp1252, according to the traceback) cannot encode the character it was turned into.

You can encode the output yourself with soup.encode("utf-8"), although this means hardcoding the encoding into your script (which is bad). Alternatively, open the file in binary mode with open(..., 'wb'), or convert the content to a string before passing it to Beautiful Soup, using the correct encoding for that page, for example k = str(r.content, "utf-8").
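To make the encoding problem concrete, here is a minimal standard-library sketch that involves neither requests nor Beautiful Soup. The sample bytes and the file name are made up for illustration. It also assumes that the page stores the em dash as the Windows-1252 byte 0x97, which is one plausible reading of the '\x97' in the traceback and is not confirmed by the question.

```python
import os
import tempfile

# Hypothetical sample: byte 0x97 is an em dash in Windows-1252,
# but it is not a valid UTF-8 sequence on its own.
raw = b"Arma virumque cano \x97 Troiae qui primus ab oris"

# Decoding with the encoding the bytes were written in (assumed: cp1252)
# yields a real em dash, U+2014.
text = raw.decode("cp1252")

# Decoding the same bytes as Latin-1 yields the control character U+0097,
# which is what the traceback reports as '\x97'.
mis_decoded = raw.decode("latin-1")

# cp1252 has no byte for U+0097, so encoding the mis-decoded text fails
# with the same kind of 'charmap' error as in the question.
try:
    mis_decoded.encode("cp1252")
except UnicodeEncodeError as exc:
    print("encode failed:", exc)

# Opening the file with an explicit encoding avoids the platform default.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    with open(out_path, encoding="utf-8") as in_file:
        round_trip = in_file.read()
```

Passing encoding="utf-8" to open() still fixes an encoding in the script, but it keeps the file in text mode, which may be more convenient than writing bytes in 'wb' mode.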

9 Comments

Still the same error with the typo corrected and with repr(k).
Maybe convert to a string before passing it to Beautiful Soup.
I'm sorry, I think I misunderstood, but str(k) doesn't help either. I did what you said in the answer; still the same result.
Can you print instead of file.write?
I seem to have fixed it; the code ought to have been soup.get_text().

There was a — in the page, which resulted in an error because the '—' was not stored as valid UTF-8. Changing the encoding before passing the text on to BeautifulSoup fixed the issue.

Another error was due to soup.get_text. Leaving out the () meant I was referencing the method, not its output.
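The difference between soup.get_text and soup.get_text() can be sketched as follows. The inline HTML is a made-up stand-in for the downloaded page, and the tag names are written in lowercase on the assumption that html.parser reports them that way, so this is an illustration and not the exact script I ended up with.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Arma virumque cano</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Remove script and style elements so their contents do not end up in the text.
# Lowercase names are used because html.parser lowercases tag names.
for tag in soup(["script", "style"]):
    tag.extract()

method = soup.get_text   # without (): a bound method object, not the text
text = soup.get_text()   # with (): the extracted text as a str

print(type(method).__name__)
print(text)
```

Writing repr(method) to a file records a description of the method object, whereas text holds the string that was meant to be written.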
