0

I wrote a scraping script with Python and Selenium. It scrapes data from a Spanish-language website:

for i, line in enumerate(browser.find_elements_by_xpath(xpath)):
    tds = line.find_elements_by_tag_name('td')  # takes <td> tags from line
    print tds[0].text  # FIRST PRINT
    if len(tds)%2 == 0:  # takes data from lines with even quantity of cells only
        data.append([u"".join(tds[0].text), u"".join(tds[1].text), ])
    print data  # SECOND PRINT

The first print statement gives me a normal Spanish string. But the second print gives me a string like this: "Data de Distribui\u00e7\u00e3o". What's the reason for this?

1
  • could you show the original string, and the data in tds please? Commented Dec 2, 2015 at 13:26

2 Answers

3

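Python 2 has two string types:

u''  # unicode string
b''  # byte string (plain str in Python 2)

The escapes in the second print most likely come from how lists are printed, not from damaged data. The first print outputs a single string directly, so it is encoded for the terminal and shown with its accents. The second print outputs a list, and Python 2 renders each list element with repr(), which writes non-ASCII characters as escape sequences. The strings stored in data are unchanged; print the elements individually to see the accented text.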
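As a rough illustration of showing scraped rows without escape sequences, the sketch below joins the cells into one string before printing, rather than printing the list itself. The format_rows helper, its separator, and the sample row are hypothetical and not part of the original script; the __future__ imports are only there so the same code behaves alike on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def format_rows(rows, sep=" | "):
    """Join each row's cells into one line of text, one row per line.

    Hypothetical helper: `rows` is a list of lists of text strings,
    shaped like the `data` list built in the question.
    """
    return "\n".join(sep.join(cells) for cells in rows)


# Hypothetical sample standing in for the scraped `data` list.
data = [["Data de Distribui\u00e7\u00e3o", "02/12/2015"]]

# A single text string: printing it shows the accented characters,
# because no repr() of a list is involved.
formatted = format_rows(data)
```

Calling print(formatted) then outputs the row as ordinary text, provided the terminal encoding can represent the characters.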

0

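In Python 2, accented characters held in a byte string have to be decoded to unicode before you work with them, and encoded again when you write them out.

# -*- coding: utf-8 -*-
accent_char = "ôâ"  # byte string, encoded as UTF-8 by the source file
name = accent_char.decode('utf-8')  # unicode string
print(name)

The code above decodes the UTF-8 bytes into a unicode string. It is Python 2 only: in Python 3, str is already unicode and has no decode method.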
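For code that may receive either bytes or text, a small guard can keep the decode step from failing on values that are already text. This is only a sketch: to_text is a hypothetical helper name, and UTF-8 is assumed as the input encoding.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it only if it is a byte string.

    Hypothetical helper; `encoding` is an assumption about the input.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Bytes as they might arrive from a file or a socket.
raw = "\u00f4\u00e2".encode("utf-8")

# Decoded once at the boundary; the rest of the program works with text.
text = to_text(raw)
```

Passing an already decoded string through to_text returns it unchanged, so the helper is safe to call more than once on the same value.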
