Here is the snippet:

for eachLine in content.splitlines(True):
    entity = str(eachLine.encode("utf-8"))[1:]
    splitResa = entity.split('\t')
    print(entity)
    print(splitResa)

Basically I am getting this result:

'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'
['\'<!ENTITY DOCUMENT_STATUS\\t\\t\\t\\t\\t"draft">\\n\'']

However, in IDLE it all works fine:

>>> '<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\n']

I couldn't figure out why. I've also tried the answers to "splitting a string based on tab in the file", but I still get the same behaviour. What is the issue here?

  • @PadraicCunningham it's <class 'str'> Commented Mar 23, 2015 at 9:34
  • Why are you encoding in the first place? And then removing the b from the bytes representation (debugging output!) but leaving in the single or double quotes? What is the problem you are trying to solve here? Commented Mar 23, 2015 at 9:37
  • Moreover, you appear to be processing a XML DTD. Why not use a XML parser for the task? Commented Mar 23, 2015 at 9:38
  • @SarpKaya. I meant where is it coming from, I don't understand why you are encoding Commented Mar 23, 2015 at 9:39
  • @MartijnPieters if I don't encode then I get UnicodeEncodeError: 'charmap' codec can't encode characters in position 141-142 Commented Mar 23, 2015 at 9:41

2 Answers

It looks like eachLine holds the literal two characters \ and t instead of real tab characters, just as a raw string literal would.

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\t')
['<!ENTITY DOCUMENT_STATUS\\t\\t\\t\\t\\t"draft">\\n']

So, you should either split that with a raw \t (r'\t'), like this

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split(r'\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\\n']

or with properly escaped \t ('\\t'), like this

>>> r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'.split('\\t')
['<!ENTITY DOCUMENT_STATUS', '', '', '', '', '"draft">\\n']
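As a quick check of why both spellings behave the same, here is a minimal sketch. The variable names are only illustrative and are not from the question; the sample line is the one used above.

```python
# r'\t' and '\\t' are two spellings of the same two-character string
# (a backslash followed by the letter t); '\t' is a single tab character.
raw_spelling = r'\t'
escaped_spelling = '\\t'
real_tab = '\t'

same_value = raw_spelling == escaped_spelling
lengths = (len(raw_spelling), len(escaped_spelling), len(real_tab))

# The sample line, written as a raw literal so it holds backslash + t pairs.
line = r'<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'
fields = line.split(escaped_spelling)   # splits on backslash + t
unsplit = line.split(real_tab)          # no real tabs, so nothing to split on

print(same_value, lengths)
print(fields)
print(unsplit)
```

The r prefix only changes how the literal is written in source code; the resulting object is an ordinary str either way.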

3 Comments

They shouldn't be using string representations of a bytes object in the first place. Any UTF-8 bytes are also going to be mangled.
Thanks for the answer. How do I convert a raw string to a normal string so that I can avoid using r completely?
@SarpKaya What do you mean by that? Raw strings are just normal strings. If you want to avoid r, follow the second method I mentioned in the answer, '\\t'.

You produced a bytes representation and then mangled that repr() debugging output. In such a representation, any non-printable or special character is replaced by its escape sequence. The string you ended up with contains no tab characters; it contains sequences of the two characters \ and t:

>>> '\t'
'\t'
>>> '\t'.encode('utf8')
b'\t'
>>> str('\t'.encode('utf8'))
"b'\\t'"
>>> str('\t'.encode('utf8'))[1:]
"'\\t'"
>>> str('\t'.encode('utf8'))[1:][1:-1]
'\\t'
>>> len(str('\t'.encode('utf8'))[1:][1:-1])
2

It is not clear to me why you are encoding the text into bytes and then converting it back to a string in the first place. Generally speaking, you don't want to do that.

In IDLE, you did not produce such mangled output; you just have a regular string with actual tabs, so splitting on those works. My only advice is not to encode to bytes here.
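A minimal sketch of splitting without the bytes round trip, assuming content is already a str (as confirmed in the comments). The function name split_lines_on_tabs is hypothetical, and using ascii() for printing is only one possible way to sidestep the UnicodeEncodeError mentioned in the comments, on the assumption that it came from print() writing to a console with a limited code page.

```python
def split_lines_on_tabs(content):
    """Split each line of a str on real tab characters, with no encode step."""
    return [line.split("\t") for line in content.splitlines(True)]


# Sample line from the question; here the \t escapes are real tab characters.
content = '<!ENTITY DOCUMENT_STATUS\t\t\t\t\t"draft">\n'
rows = split_lines_on_tabs(content)

for row in rows:
    # ascii() escapes any non-ASCII characters in the debugging output,
    # so printing does not depend on the console's encoding.
    print(ascii(row))
```

The data in rows stays untouched; only the printed debugging output is escaped.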
