Replacing all instances of string in string Python

Question

Right now my output to a file is like:

<b>Nov 22Â–24</b>   <b>Nov 29Â–Dec 1</b>    <b>Dec 6Â–8</b> <b>Dec 13Â–15</b>   <b>Dec 20Â–22</b>   <b>Dec 27Â–29</b>   <b>Jan 3Â–5</b> <b>Jan 10Â–12</b>   <b>Jan 17Â–19</b>   <b><i>Jan 17Â–20</i></b>    <b>Jan 24Â–26</b>   <b>Jan 31Â–Feb 2</b>    <b>Feb 7Â–9</b> <b>Feb 14Â–16</b>   <b><i>Feb 14Â–17</i></b>    <b>Feb 21Â–23</b>   <b>Feb 28Â–Mar 2</b>    <b>Mar 7Â–9</b> <b>Mar 14Â–16</b>   <b>Mar 21Â–23</b>   <b>Mar 28Â–30</b>

I want to remove all the "Â" characters and the HTML tags (<b>, </b>). I tried using the .remove and .replace functions, but I get an error:

SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

The output above is in a list, which I get from a webcrawling function:

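import requests
from bs4 import BeautifulSoup

def getWeekend(item_url):
    # Insert the weekend-page parameter at a fixed offset in the URL.
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    # select() returns a list of <b> Tag objects, not plain strings.
    date = soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return date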

I write it to a file like so:

for item in listOfDate:
    wr.writerow(item)

How can I remove all the tags so that only the date is left?

what is the page encoding?

– Padraic Cunningham, Commented Jun 27, 2015 at 22:23

D Swartz · Accepted Answer · 2015-06-27 21:48:25Z

2

I'm not sure, but I think aString.replace('toFind', 'toReplace') should work (or re.sub('toFind', 'toReplace', aString) if you need a regular expression). Either that or write it to a file, and then run sed on it like: sed -i 's/toFind/toReplace/g' FILE
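As a rough illustration of replacing every occurrence of a substring with the standard library, here is a minimal Python 3 sketch. The sample string and the names `raw`, `no_tags`, and `no_tags_re` are made up for the example and are not from the question's code.

```python
import re

# Hypothetical sample resembling one cell of the output in the question.
raw = "<b>Nov 22\u201324</b>"

# str.replace substitutes every occurrence of a literal substring.
no_tags = raw.replace("<b>", "").replace("</b>", "")

# re.sub does the same for a pattern; this one matches any simple tag.
no_tags_re = re.sub(r"</?\w+>", "", raw)

print(no_tags)
print(no_tags_re)
```

Both calls return a new string, so the result has to be assigned; the original string is left unchanged.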

answered Jun 27, 2015 at 21:48

D Swartz

1 Comment

alphamonkey Over a year ago

Thanks, I'll just use the Excel find and replace function. I just tried it and it's super easy.

NightShadeQueen · Accepted Answer · 2015-06-27 22:45:43Z

1

You already got a working solution, but for the future:

Use get_text() to get rid of the tags. select() returns a list of tags, so call it on each one:

date = [b.get_text() for b in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]

Use .replace(u'\xc2',u'') to get rid of the Â. The u makes u'\xc2' a unicode string. (This might take some futzing around with encoding, but for me get_text() is already returning a unicode object.)

(Additionally, possibly consider .replace(u'\u2013',u'-') because right now, you have an en-dash :P.)

date = [d.replace(u'\xc2',u'').replace(u'\u2013',u'-') for d in date]
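The cleanup step can be sketched on its own, independent of the scraping. This is a Python 3 sketch under assumptions: the helper `clean_date`, the table `REPLACEMENTS`, and the sample strings are hypothetical, and U+0096 is included only on the guess that a Windows-1252 en dash was decoded as Latin-1 somewhere along the way.

```python
# Sketch: clean strings that get_text() has already extracted.
# Assumption: the unwanted characters are U+00C2 (shown as "Â") and possibly
# U+0096, the control character a mis-decoded Windows-1252 en dash becomes.
REPLACEMENTS = {
    "\u00c2": "",    # stray "Â"
    "\u0096": "-",   # mis-decoded en dash
    "\u2013": "-",   # real en dash
}


def clean_date(text):
    """Return text with the stray characters removed and dashes normalized."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text.strip()


# Hypothetical inputs standing in for the get_text() results.
samples = ["Nov 22\u00c2\u201324", "Nov 29\u0096Dec 1", " Dec 6\u20138 "]
cleaned = [clean_date(s) for s in samples]
print(cleaned)
```

Keeping the replacements in one table makes it easy to add another character if the output still shows debris.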

answered Jun 27, 2015 at 22:45

NightShadeQueen

Community · Accepted Answer · 2017-05-23 12:14:08Z

1

The problem is that the text from the website is not plain ASCII. You need to decode the bytes into a Unicode string, using the page's encoding, before manipulating it.

Python works with Unicode strings once the text has been decoded. There is plenty of information on this; for example, you can find more help from other questions on this website:

Python: Converting from ISO-8859-1/latin1 to UTF-8

python: unicode in Windows terminal, encoding used?

What is the difference between encode/decode?
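To make the decoding point concrete, here is a small Python 3 sketch of one way the "Â–" pair can arise. It assumes the page bytes are Windows-1252 (where byte 0x96 is an en dash); that is a guess about the site, not something the question confirms, and `page_bytes` is an invented sample.

```python
# Assumption: the page bytes are Windows-1252, where 0x96 is an en dash.
page_bytes = b"Nov 22\x9624"

# Decoding with the matching codec gives a proper en dash (U+2013).
good = page_bytes.decode("cp1252")

# Decoding as Latin-1 silently yields the control character U+0096 ...
bad = page_bytes.decode("latin-1")

# ... which, written out as UTF-8 and then viewed as Windows-1252, shows "Â–".
seen_in_viewer = bad.encode("utf-8").decode("cp1252")

print(repr(good))
print(repr(bad))
print(seen_in_viewer)
```

If this is what is happening, decoding with the right codec up front avoids the cleanup entirely; BeautifulSoup's from_encoding argument is one place such a codec name could be supplied.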

edited May 23, 2017 at 12:14

CommunityBot

answered Jun 27, 2015 at 22:15

Peter Brittain

jfs · Accepted Answer · 2015-06-27 23:04:53Z

0

If your Python 2 source code has literal non-ASCII characters such as Â, then you should declare the source code encoding, as the error message says. Put this at the top of your Python file:

# -*- coding: utf-8 -*-

Make sure the file is saved using the utf-8 encoding and use Unicode strings to work with the text.
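Put together, a file following this advice might look like the sketch below. The names `STRAY` and `text` and the sample value are hypothetical; the u prefix keeps the literals Unicode on Python 2 and is also accepted by Python 3.3 and later.

```python
# -*- coding: utf-8 -*-
# Sketch: with the declaration above (and the file saved as UTF-8), literal
# non-ASCII characters are allowed in Python 2 source code.

STRAY = u"Â"            # a literal non-ASCII character in the source
text = u"Nov 22Â-24"    # hypothetical sample value

cleaned = text.replace(STRAY, u"")
print(cleaned)
```

An alternative that avoids non-ASCII source altogether is to write the character as an escape, u'\xc2', in which case no declaration is needed.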

answered Jun 27, 2015 at 23:04

jfs

2 Comments

bufh Over a year ago

If you are a Vim user more than an Emacs one, you can instead put on the first or second line: # vim:set fileencoding=utf8:.

jfs Over a year ago

@bufh: Python doesn't care as long as it matches "coding[:=]\s*([-\w.]+)" regular expression.
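The regular expression quoted in this comment can be tried directly. A minimal sketch, where `CODING_RE`, `candidates`, and `found` are illustrative names and the third line is an invented counterexample:

```python
import re

# Pattern as quoted in the comment.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

candidates = [
    "# -*- coding: utf-8 -*-",        # Emacs style
    "# vim:set fileencoding=utf8:",   # Vim style
    "# just an ordinary comment",     # no declaration
]

found = []
for line in candidates:
    match = CODING_RE.search(line)
    found.append(match.group(1) if match else None)

print(found)
```

The Vim form matches because "fileencoding=utf8" contains "coding=utf8"; the captured name stops at the trailing colon.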
