Right now my output to a file is like:

<b>Nov 22–24</b>   <b>Nov 29–Dec 1</b>    <b>Dec 6–8</b> <b>Dec 13–15</b>   <b>Dec 20–22</b>   <b>Dec 27–29</b>   <b>Jan 3–5</b> <b>Jan 10–12</b>   <b>Jan 17–19</b>   <b><i>Jan 17–20</i></b>    <b>Jan 24–26</b>   <b>Jan 31–Feb 2</b>    <b>Feb 7–9</b> <b>Feb 14–16</b>   <b><i>Feb 14–17</i></b>    <b>Feb 21–23</b>   <b>Feb 28–Mar 2</b>    <b>Mar 7–9</b> <b>Mar 14–16</b>   <b>Mar 21–23</b>   <b>Mar 28–30</b>   

I want to remove all the "Â" characters and the HTML tags (<b>, </b>). I tried using .remove and .replace, but I get an error:

SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The output above is in a list, which I get from a webcrawling function:

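import requests
from bs4 import BeautifulSoup

def getWeekend(item_url):
    # Insert the "page=weekend&" query parameter at a fixed offset in the URL.
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    # select() returns a list of <b> Tag objects, not plain strings.
    date = soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return date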

I write it to a file like so:

for item in listOfDate:
    wr.writerow(item)

How can I remove all the tags so that only the date is left?

    what is the page encoding? Commented Jun 27, 2015 at 22:23

4 Answers


I'm not sure, but I think re.sub('toFind', 'toReplace', aString) from the re module should work. Either that, or write it to a file and then run sed on it, like: sed -i 's/toFind/toReplace/g' FILE
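
As a rough illustration of the regex route, here is a minimal sketch that strips anything shaped like a tag. The function name strip_tags and the pattern are assumptions made for this example; a regex like this is crude and will not cope with arbitrary HTML, so a real parser is usually the safer choice.

```python
import re

def strip_tags(markup):
    """Remove anything that looks like an HTML tag, e.g. <b> or </i>."""
    # '<', then one or more characters that are not '>', then '>'
    return re.sub(r'<[^>]+>', '', markup)

# Hypothetical sample input modelled on the output shown in the question.
sample = '<b><i>Jan 17-20</i></b>'
print(strip_tags(sample))
```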

1 Comment

Thanks, I'll just use the Excel find and replace function. I just tried it and it's super easy.

You already got a working solution, but for the future:

  1. Use get_text() to get rid of the tags. Note that select() returns a list of tags, so call get_text() on each element:

date = [tag.get_text() for tag in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]

  2. Use .replace(u'\xc2', u'') to get rid of the Â. The u prefix makes u'\xc2' a unicode string. (This might take some futzing around with encodings, but for me get_text() already returns a unicode object.)

(Additionally, consider .replace(u'\u2013', u'-'), because right now you have an en dash :P.)

date = [d.replace(u'\xc2', u'').replace(u'\u2013', u'-') for d in date]
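
The cleanup steps above could be bundled into one helper, sketched below on made-up strings rather than the real page. The name clean_date and the sample data are hypothetical, and the sketch assumes each Â is followed by a non-breaking space (u'\xa0'), which is a common pattern but is not confirmed by the question.

```python
def clean_date(text):
    """Drop stray mojibake and normalise a date-range string."""
    return (text.replace(u'\xc2', u'')      # stray A-circumflex
                .replace(u'\xa0', u' ')     # non-breaking space -> plain space (assumed present)
                .replace(u'\u2013', u'-')   # en dash -> ASCII hyphen
                .strip())

# Hypothetical values standing in for the text of each <b> tag.
raw = [u'Nov 22\u201324\xc2\xa0', u'Nov 29\u2013Dec 1\xc2\xa0']
dates = [clean_date(item) for item in raw]
```

Keeping the replacements in a single function means the same cleanup is applied to every item before it is written out.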


The problem is that the text you get from the website is not plain ASCII. You need to decode it into a Unicode string, using the page's actual encoding, before manipulating it.

Python will use Unicode when given a chance. There's plenty of information out there if you just have a look. For example, you can find more help from other questions on this website:

Python: Converting from ISO-8859-1/latin1 to UTF-8

python: unicode in Windows terminal, encoding used?

What is the difference between encode/decode?
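
To make the encode/decode point concrete, here is a small sketch of one common way a stray Â can appear: a non-breaking space encoded as UTF-8 and then decoded as Latin-1. Whether this is what happened on the page in the question is an assumption, since its encoding is not stated.

```python
nbsp = u'\xa0'                          # non-breaking space, as in HTML &nbsp;

utf8_bytes = nbsp.encode('utf-8')       # two bytes: 0xC2 0xA0
wrong = utf8_bytes.decode('latin-1')    # each byte becomes one character: A-circumflex + nbsp
right = utf8_bytes.decode('utf-8')      # decoding with the matching codec restores the nbsp

print(repr(wrong))
print(repr(right))
```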


If your Python 2 source code has literal non-ASCII characters such as Â, then you should declare the source code encoding, as the error message says. Put at the top of your Python file:

# -*- coding: utf-8 -*-

Make sure the file is saved using the utf-8 encoding and use Unicode strings to work with the text.
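
A minimal sketch of that advice, assuming the goal is simply to write Unicode text to a UTF-8 file: the source declares its encoding, the strings are Unicode literals, and io.open handles the encoding on write. The dates list and the temporary output path are hypothetical stand-ins.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Unicode literals containing a literal en dash; legal because of the coding line above.
dates = [u'Nov 22–24', u'Nov 29–Dec 1']

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'dates.txt')

# io.open accepts an encoding argument on both Python 2 and Python 3.
with io.open(path, 'w', encoding='utf-8') as f:
    for d in dates:
        f.write(d + u'\n')

# Read the file back with the same encoding to confirm nothing was mangled.
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read().splitlines()
```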

2 Comments

If you are more of a Vim user than an Emacs one, you can instead put near the top: # vim:set fileencoding=utf8:.
@bufh: Python doesn't care as long as it matches the "coding[:=]\s*([-\w.]+)" regular expression.
