
I am creating a Python program that crawls and indexes a site. When I run my current code I get the error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>

I'm not sure why this error is occurring, but I believe it is due to my regex expressions. I decode the text, then run it through multiple regex expressions to remove all links, brackets, hex values, etc.

if isinstance(page_contents, bytes):  # bytes to string
    c = page_contents.decode('utf-8')
else:
    c = page_contents
if isinstance(c, bytes):
    print('page not converted to string')
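For context, not every page is actually UTF-8. A sketch of a charset-aware decode (the helper name and header value are illustrative; the stdlib email.message does the header parsing) might look like:

```python
from email.message import Message

def decode_page(page_contents, content_type=''):
    """Decode raw page bytes using the charset declared in the
    Content-Type header, falling back to UTF-8 with replacement."""
    if not isinstance(page_contents, bytes):
        return page_contents
    msg = Message()
    msg['Content-Type'] = content_type or 'text/html'
    charset = msg.get_content_charset() or 'utf-8'
    return page_contents.decode(charset, errors='replace')

# A Latin-1 page correctly declared in its header decodes cleanly:
print(decode_page('café'.encode('latin-1'), 'text/html; charset=ISO-8859-1'))  # → café
```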

## the regex route
c = re.sub(r'\\n|\\r|\\t', ' ', c)  # remove literal \n, \r, \t escape sequences
c = re.sub(r"\\'", "'", c)  # replace \' with '
c = re.sub(r'<script.*?script>', ' ', c, flags=re.DOTALL)  # remove scripts
c = re.sub(r'<!\[CDATA\[.*?\]\]', ' ', c, flags=re.DOTALL)  # remove CDATA ?redundant
c = re.sub(r'<link.*?link>|<link.*?>', ' ', c, flags=re.DOTALL)  # remove links
c = re.sub(r'<style.*?style>', ' ', c, flags=re.DOTALL)  # remove styles
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)  # remove HTML tags
c = re.sub(r'\\x..', ' ', c)  # remove literal \xNN hex escape sequences
c = re.sub(r'<!--|-->', ' ', c, flags=re.DOTALL)  # remove comment delimiters
c = re.sub(r'<|>', ' ', c)  # remove stray angle brackets
c = re.sub(r'&.*?;|#.*?;', ' ', c)  # remove HTML entities
page_text = re.sub(r'\s+', ' ', c)  # collapse runs of whitespace to a single space
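As the commenters point out, an HTML parser is far more reliable than a stack of regexes for this. A minimal sketch using the stdlib html.parser, which strips tags, decodes entities, and skips script/style contents in one pass:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; etc. for us
        self.parts = []
        self.skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # join fragments, then collapse whitespace runs
        return ' '.join(' '.join(self.parts).split())

parser = TextExtractor()
parser.feed('<p>Hello &amp; <script>var x;</script>world</p>')
print(parser.text())  # → Hello & world
```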

I then split the document into individual words, which are then sorted and processed. The problem occurs when I print the results: the loop prints the data for the first URL (document) fine, but when it moves on to the second, the error is raised.

docids.append(url)
docid = str(docids.index(url))

##### stemming and other processing goes here #####
# page_text is the initial content, transformed to words
words = page_text
#   Send document to stemmer
stemmed_doc = stem_doc(words)

# add the vocab counts and postings
for word in stemmed_doc.split():
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1
    if word not in postings:
        postings[word] = [docid]
    elif docid not in postings[word]:
        postings[word].append(docid)

    print('make_index3: docid=', docid, ' word=', word, ' count=', vocab[word], ' postings=', postings[word])
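For what it's worth, the traceback names the 'charmap' codec, which points at print() itself: a Windows console using a charmap encoding such as cp1252 cannot represent U+200B (zero-width space). A small demonstration of the failure, plus an errors='replace' workaround (assuming a cp1252 console):

```python
ch = '\u200b'  # ZERO WIDTH SPACE, as scraped from the page

# This is effectively what print() does on a cp1252 console, and why it raises:
try:
    ch.encode('cp1252')
except UnicodeEncodeError as e:
    print('encode failed:', e.reason)  # → character maps to <undefined>

# Workaround: substitute unencodable characters instead of raising
safe = ch.encode('cp1252', errors='replace').decode('cp1252')
print(repr(safe))  # → '?'
```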

I would like to know whether this error is due to incorrect regex, or whether something else is going on.

Solved

I added the expression

c = re.sub(r'[\W_]+', ' ', c)

which replaces all non-alphanumerics with a space.
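This works because U+200B (zero-width space) is not a word character, so [\W_]+ matches and replaces it before it ever reaches print(). A quick check:

```python
import re

c = 'index\u200bpage'           # zero-width space embedded in scraped text
cleaned = re.sub(r'[\W_]+', ' ', c)
print(cleaned)                   # → index page
```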

  • stackoverflow.com/a/17551962/1172714 Commented Nov 4, 2015 at 14:26
  • Please check this answer of mine; I hope it will help. Commented Nov 4, 2015 at 14:28
  • Are you trying to parse HTML with regex? That's generally not a great idea; use an HTML parser. Commented Nov 4, 2015 at 14:42
  • Insert link to The Famous Answer here Commented Nov 4, 2015 at 14:47
  • Other comments here are correct that your approach of using regexes to "sanitize" the page contents is fundamentally flawed. But your problem here isn't with the regex, it's with how you convert the bytes to a string. Not all web pages will use UTF-8. Instead you need to parse the Content-Type header (which can be overridden in a <meta> tag) to determine the correct encoding. Commented Nov 4, 2015 at 15:19

2 Answers


The problem you are getting seems to be with encoding, not with the regex. Have you tried changing

c = page_contents.decode('utf-8')

to use another encoding, for example:

c = page_contents.decode('latin-1')

?
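A note on why latin-1 never raises on decode: it maps every byte value 0-255 to a character, so decoding always succeeds. Bytes that were actually written in another encoding come out silently garbled rather than correct, though:

```python
# latin-1 decodes any byte sequence without error...
data = bytes(range(256))
assert len(data.decode('latin-1')) == 256

# ...but UTF-8 bytes decoded as latin-1 are mangled, not fixed:
print('café'.encode('utf-8').decode('latin-1'))  # → cafÃ©
```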


3 Comments

The correct encoding to use will be part of the HTTP response, either in the Content-Type header or in a <meta> tag. Simply guessing a different encoding isn't any better.
OK, I was just stating that the problem reported was not with the regex but with encoding, and suggesting a way to check that, not giving a solution.
Yep... you are correct about that (I just added a similar comment above).

This worked; it replaced all non-alphanumerics with a space:

c = re.sub(r'[\W_]+', ' ', c)

