
I am creating a Python program that crawls and indexes a site. When I run my current code I get the error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>

I'm not sure why this error is occurring, but I believe it is due to my regex expressions. I decode the text, then run it through multiple regex expressions to remove all links, brackets, hex values, etc.

if isinstance(page_contents, bytes):  # bytes to string
    c = page_contents.decode('utf-8')
else:
    c = page_contents
if isinstance(c, bytes):
    print('page not converted to string')
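For context, not every page is actually UTF-8. A sketch of a charset-aware decode (the helper name and header value are illustrative; the stdlib email.message does the header parsing) might look like:

```python
from email.message import Message

def decode_page(page_contents, content_type=''):
    """Decode raw page bytes using the charset declared in the
    Content-Type header, falling back to UTF-8 with replacement."""
    if not isinstance(page_contents, bytes):
        return page_contents
    msg = Message()
    msg['Content-Type'] = content_type or 'text/html'
    charset = msg.get_content_charset() or 'utf-8'
    return page_contents.decode(charset, errors='replace')

# A Latin-1 page correctly declared in its header decodes cleanly:
print(decode_page('café'.encode('latin-1'), 'text/html; charset=ISO-8859-1'))  # → café
```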

## the regex route
c = re.sub(r'\\n|\\r|\\t', ' ', c)  # remove literal \n, \r, \t escape sequences
c = re.sub(r"\\'", "'", c)  # replace \' with '
c = re.sub(r'<script.*?script>', ' ', c, flags=re.DOTALL)  # remove scripts
c = re.sub(r'<!\[CDATA\[.*?\]\]', ' ', c, flags=re.DOTALL)  # remove CDATA ?redundant
c = re.sub(r'<link.*?link>|<link.*?>', ' ', c, flags=re.DOTALL)  # remove links
c = re.sub(r'<style.*?style>', ' ', c, flags=re.DOTALL)  # remove styles
c = re.sub(r'<.*?>', ' ', c, flags=re.DOTALL)  # remove HTML tags
c = re.sub(r'\\x..', ' ', c)  # remove literal \xNN hex escape sequences
c = re.sub(r'<!--|-->', ' ', c, flags=re.DOTALL)  # remove comment delimiters
c = re.sub(r'<|>', ' ', c)  # remove stray angle brackets
c = re.sub(r'&.*?;|#.*?;', ' ', c)  # remove HTML entities
page_text = re.sub(r'\s+', ' ', c)  # collapse runs of whitespace to a single space
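As the commenters point out, an HTML parser is far more reliable than a stack of regexes for this. A minimal sketch using the stdlib html.parser, which strips tags, decodes entities, and skips script/style contents in one pass:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; etc. for us
        self.parts = []
        self.skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        # join fragments, then collapse whitespace runs
        return ' '.join(' '.join(self.parts).split())

parser = TextExtractor()
parser.feed('<p>Hello &amp; <script>var x;</script>world</p>')
print(parser.text())  # → Hello & world
```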

I then split the document into individual words, which are then sorted and processed. The problem occurs when I print the results: the loop prints the data for the first URL (document) fine, but when it moves on to the second, the error is raised.

docids.append(url)
docid = str(docids.index(url))

##### stemming and other processing goes here #####
# page_text is the initial content, transformed to words
words = page_text
#   Send document to stemmer
stemmed_doc = stem_doc(words)

# add the vocab counts and postings
for word in stemmed_doc.split():
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1
    if word not in postings:
        postings[word] = [docid]
    elif docid not in postings[word]:
        postings[word].append(docid)

    print('make_index3: docid=', docid, ' word=', word, ' count=', vocab[word], ' postings=', postings[word])
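For what it's worth, the traceback names the 'charmap' codec, which points at print() itself: a Windows console using a charmap encoding such as cp1252 cannot represent U+200B (zero-width space). A small demonstration of the failure, plus an errors='replace' workaround (assuming a cp1252 console):

```python
ch = '\u200b'  # ZERO WIDTH SPACE, as scraped from the page

# This is effectively what print() does on a cp1252 console, and why it raises:
try:
    ch.encode('cp1252')
except UnicodeEncodeError as e:
    print('encode failed:', e.reason)  # → character maps to <undefined>

# Workaround: substitute unencodable characters instead of raising
safe = ch.encode('cp1252', errors='replace').decode('cp1252')
print(repr(safe))  # → '?'
```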

I would like to know whether this error is due to incorrect regex, or whether something else is going on.

Solved

I added the expression

c = re.sub(r'[\W_]+', ' ', c)

which replaces all non-alphanumerics with a space.
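This works because U+200B (zero-width space) is not a word character, so [\W_]+ matches and replaces it before it ever reaches print(). A quick check:

```python
import re

c = 'index\u200bpage'           # zero-width space embedded in scraped text
cleaned = re.sub(r'[\W_]+', ' ', c)
print(cleaned)                   # → index page
```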

  • stackoverflow.com/a/17551962/1172714 Commented Nov 4, 2015 at 14:26
  • Please check this answer of mine; I hope it will help. Commented Nov 4, 2015 at 14:28
  • Are you trying to parse HTML with regex? That's generally not a great idea; use an HTML parser. Commented Nov 4, 2015 at 14:42
  • Insert link to The Famous Answer here Commented Nov 4, 2015 at 14:47
  • Other comments here are correct that your approach of using regexes to "sanitize" the page contents is fundamentally flawed. But your problem here isn't with the regex, it's with how you convert the bytes to a string. Not all web pages will use UTF-8. Instead you need to parse the Content-Type header (which can be overridden in a <meta> tag) to determine the correct encoding. Commented Nov 4, 2015 at 15:19

2 Answers


The problem you are getting seems to be with encoding, not with the regex. Have you tried changing

c = page_contents.decode('utf-8')

to use another encoding, for example:

c = page_contents.decode('latin-1')

?
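A note on why latin-1 never raises on decode: it maps every byte value 0-255 to a character, so decoding always succeeds. Bytes that were actually written in another encoding come out silently garbled rather than correct, though:

```python
# latin-1 decodes any byte sequence without error...
data = bytes(range(256))
assert len(data.decode('latin-1')) == 256

# ...but UTF-8 bytes decoded as latin-1 are mangled, not fixed:
print('café'.encode('utf-8').decode('latin-1'))  # → cafÃ©
```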


3 Comments

The correct encoding to use will be part of the HTTP response, either in the Content-Type header or in a <meta> tag. Simply guessing a different encoding isn't any better.
OK, I was just stating that the problem reported was not with the regex but with encoding, and suggesting a way to check that, not giving a solution.
Yep... you are correct about that (I just added a similar comment above).

This worked; it replaced all non-alphanumerics with a space:

c = re.sub(r'[\W_]+', ' ', c)

