Python 3 does not split strings when encoded

Question

Here is the snippet:

for eachLine in content.splitlines(True):
    entity = str(eachLine.encode("utf-8"))[1:]
    splitResa = entity.split('\t')
    print(entity)
    print(splitResa)

Basically I am getting this result:

'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'
['\'<!ENTITY DOCUMENT_STATUS\\t\\t\\t\\t\\t"draft">\\n\'']

However, in IDLE it all works fine:

>>> '<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\n']

I couldn't figure out why. I've also tried the answers to "splitting a string based on tab in the file", but the behaviour is still the same. What is the issue here?

Why are you encoding in the first place? And why remove the b from the bytes representation (debugging output!) but leave in the single or double quotes? What is the problem you are trying to solve here?
– Martijn Pieters, Commented Mar 23, 2015 at 9:37
Moreover, you appear to be processing an XML DTD. Why not use an XML parser for the task?
– Martijn Pieters, Commented Mar 23, 2015 at 9:38
@SarpKaya I meant: where is it coming from? I don't understand why you are encoding.
– Padraic Cunningham, Commented Mar 23, 2015 at 9:39
@MartijnPieters If I don't encode, then I get UnicodeEncodeError: 'charmap' codec can't encode characters in position 141-142
– Sarp Kaya, Commented Mar 23, 2015 at 9:41

thefourtheye · Accepted Answer · 2015-03-23 09:39:41Z

Looks like entity contains literal backslash-and-t sequences, as a raw string literal would, rather than tab characters.

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\t')
['<!ENTITY DOCUMENT_STATUS\\t\\t\\t\\t\\t"draft">\\n']

So, you should either split that with a raw \t (r'\t'), like this:

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split(r'\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\\n']

or with a properly escaped \t ('\\t'), like this:

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\\n']
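A minimal sketch of why both spellings behave the same: the r prefix only changes how the literal is parsed, and the resulting object is an ordinary str. The shortened line value below is made up for illustration and is not the asker's actual data.

```python
# The r prefix affects only how the literal is parsed; the result is a plain str.
tab = '\t'        # one character: a real tab
escaped = r'\t'   # two characters: a backslash followed by 't'

assert len(tab) == 1
assert len(escaped) == 2
assert escaped == '\\t'   # both spellings produce the same two-character string

# Hypothetical, shortened line that holds backslash-t sequences instead of tabs.
line = r'<!ENTITY DOCUMENT_STATUS\t\t"draft">'

no_split = line.split('\t')   # looks for a real tab, finds none
fields = line.split('\\t')    # looks for backslash + 't'

print(no_split)
print(fields)
```

Splitting on a real tab leaves the line in one piece, while splitting on the two-character sequence separates the fields.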

3 Comments

Martijn Pieters Over a year ago

They shouldn't be using string representations of a bytes object in the first place. Any UTF-8 bytes are also going to be mangled.

Sarp Kaya Over a year ago

Thanks for the answer. How do I convert a raw string to a normal string so that I can avoid using r completely?

thefourtheye Over a year ago

@SarpKaya What do you mean by that? Raw strings are normal strings only. If you want to avoid r, follow the second method I mentioned in the answer \\t.

Martijn Pieters · Accepted Answer · 2015-03-23 09:51:21Z

You produced the string representation of a bytes object, so you are splitting mangled repr() debugging output. In that representation, any non-printable or special character is replaced by its escape sequence. The output you produced has no tab characters in it; it contains sequences of the two characters \ and t:

>>> '\t'
'\t'
>>> '\t'.encode('utf8')
b'\t'
>>> str('\t'.encode('utf8'))
"b'\\t'"
>>> str('\t'.encode('utf8'))[1:]
"'\\t'"
>>> str('\t'.encode('utf8'))[1:][1:-1]
'\\t'
>>> len(str('\t'.encode('utf8'))[1:][1:-1])
2

It is not clear to me why you are encoding the text into bytes and then converting it back to a string in the first place. Generally speaking, you don't want to do that.

In IDLE, you did not produce such mangled output; you just have a regular string with actual tabs, so splitting on those works. My only advice is to not encode to bytes here.
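As a rough sketch of what that could look like: split the str directly, and deal with the UnicodeEncodeError only at the point of printing. This assumes the error comes from printing to a console whose encoding cannot represent some characters, which the 'charmap' message suggests but the question does not confirm. The names split_fields and safe_print and the sample content are hypothetical.

```python
import sys

def split_fields(line):
    """Split one line on real tab characters, without encoding it first."""
    return line.split('\t')

def safe_print(text):
    """Print text, using backslash escapes for anything the console encoding cannot represent."""
    # Assumption: stdout's encoding is the limiting factor; fall back to UTF-8 if unknown.
    encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    printable = text.encode(encoding, errors='backslashreplace').decode(encoding)
    print(printable)

# Made-up sample data with real tabs and one non-ASCII character.
content = '<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n<!ENTITY AUTHOR\t"Zo\u00eb">\n'

# splitlines() without True drops the line endings, so they do not end up in the last field.
rows = [split_fields(line) for line in content.splitlines()]

for row in rows:
    safe_print(repr(row))
```

The text stays a str throughout, so split('\t') sees actual tab characters. Only the output step touches an encoding, and there unencodable characters are escaped rather than raising an error.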
