While migrating to Python 3, I noticed that some files we generate with the built-in csv module now have a b'...' prefix around each string.

Here's the code, which should generate a .csv file for a list of dogs according to some parameters defined by export_fields (so the values are always unicode data):

file_content = StringIO()
csv_writer = csv.writer(
    file_content, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL
)
csv_writer.writerow([
    header_name.encode('cp1252') for _v, header_name in export_fields
])
# Write content
for dog in dogs:
    csv_writer.writerow([
        get_value(dog).encode('cp1252') for get_value, _header in export_fields
    ])

The problem is that once I return file_content.getvalue(), I get:

b'Does he bark?'    b'Full     Name'    b'Gender'
b'Sometimes, yes'   b'Woofy the dog'    b'Male' 

Instead of (indentation has been modified to be readable on SO):

'Does he bark?'   'Full     Name'   'Gender'
'Sometimes, yes'  'Woofy the dog'   'Male' 

I did not find any encoding parameter in the csv module. I would like the whole file to be encoded in cp1252, so I don't really care whether the encoding is done while iterating over the lines or on the constructed file itself.

So, does anyone know how to generate a proper string, containing only cp1252 encoded strings?

  • Why are you encoding in the first place? The open file object takes care of that. Commented Jul 29, 2016 at 10:52
  • @MartijnPieters Maybe my question is incomplete then: I want to return the string through Django: return HttpResponse(generate_csv_file()). Should I handle encoding at Django level instead? Commented Jul 29, 2016 at 10:55
  • See my answer; you are approaching this at the wrong level; tabs and quotechars need to be encoded too, but this is the job of the I/O level, not the csv module or the code producing rows. Commented Jul 29, 2016 at 10:57

1 Answer

The csv module deals with text, and converts anything that is not a string to a string using str().

Don't pass in bytes objects. Pass in str objects or types that cleanly convert to strings with str(). That means you should not encode strings.
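
To see where the b'...' prefix comes from, here is a minimal sketch (the sample value is made up for illustration): the writer calls str() on the bytes object, and in Python 3 that returns its repr rather than decoded text.

```python
import csv
from io import StringIO

# Hypothetical sample value, encoded the way the question's code does it.
encoded = 'Woofy the dog'.encode('cp1252')

# str() on a bytes object returns its repr, including the b'' prefix.
as_text = str(encoded)

buffer = StringIO()
writer = csv.writer(
    buffer, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL
)
writer.writerow([encoded])          # bytes: the repr ends up in the file
writer.writerow(['Woofy the dog'])  # str: written unchanged

lines = buffer.getvalue().splitlines()
print(lines)
```

The first line carries the unwanted prefix, the second does not, which is why the rows should hold plain str values.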

If you need cp1252 output, encode the StringIO value:

file_content.getvalue().encode('cp1252')

as StringIO objects also deal in text only.
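
Put together, this first approach could look like the sketch below. The rows are made-up stand-ins for whatever export_fields and dogs produce; the second data row is only there to show a non-ASCII character being encoded.

```python
import csv
from io import StringIO

# Hypothetical rows standing in for the question's export_fields / dogs data.
rows = [
    ['Does he bark?', 'Full     Name', 'Gender'],
    ['Sometimes, yes', 'Woofy the dog', 'Male'],
    ['Never', 'Médor', 'Male'],
]

text_buffer = StringIO()
writer = csv.writer(
    text_buffer, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL
)
# Pass str values only; no per-cell encoding.
writer.writerows(rows)

# Encode the whole document once, delimiters and line endings included.
encoded = text_buffer.getvalue().encode('cp1252')
print(encoded)
```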

Better yet, use a BytesIO object with a TextIOWrapper() to do the encoding for you as the csv module writes to the file object:

import csv
from io import BytesIO, TextIOWrapper

file_content = BytesIO()
wrapper = TextIOWrapper(file_content, encoding='cp1252', line_buffering=True)
csv_writer = csv.writer(
    wrapper, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)

# write rows

result = file_content.getvalue()

I've enabled line-buffering on the wrapper so that it'll auto-flush to the BytesIO instance every time a row is written.

Now file_content.getvalue() produces a bytestring:

>>> from io import BytesIO, TextIOWrapper
>>> import csv
>>> file_content = BytesIO()
>>> wrapper = TextIOWrapper(file_content, encoding='cp1252', line_buffering=True)
>>> csv_writer = csv.writer(wrapper, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL)
>>> csv_writer.writerow(['Does he bark?', 'Full     Name', 'Gender'])
36
>>> csv_writer.writerow(['Sometimes, yes', 'Woofy the dog', 'Male'])
35
>>> file_content.getvalue()
b'Does he bark?\tFull     Name\tGender\r\nSometimes, yes\tWoofy the dog\tMale\r\n'
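
For the use case in the comments (returning the file from a function), the wrapper approach might be packaged as in the sketch below. The function name generate_csv_bytes and the sample data are assumptions for illustration, not part of the original code. Two deliberate differences from the snippet above: newline='' is passed to the wrapper so the writer's '\r\n' line endings are not translated on any platform, and the wrapper is flushed explicitly instead of relying on line buffering.

```python
import csv
from io import BytesIO, TextIOWrapper


def generate_csv_bytes(export_fields, dogs, encoding='cp1252'):
    """Return tab-separated CSV data for the given dogs as encoded bytes."""
    buffer = BytesIO()
    # newline='' disables newline translation on write, so the '\r\n'
    # emitted by the csv writer reaches the buffer unchanged.
    wrapper = TextIOWrapper(buffer, encoding=encoding, newline='')
    writer = csv.writer(
        wrapper, delimiter='\t', quotechar='"', quoting=csv.QUOTE_MINIMAL
    )

    writer.writerow([header for _get_value, header in export_fields])
    for dog in dogs:
        writer.writerow(
            [get_value(dog) for get_value, _header in export_fields]
        )

    # Flush, then read the bytes while the wrapper is still referenced:
    # a garbage-collected TextIOWrapper closes its underlying buffer.
    wrapper.flush()
    return buffer.getvalue()


# Hypothetical sample data shaped like the question's export_fields / dogs.
export_fields = [
    (lambda dog: dog['barks'], 'Does he bark?'),
    (lambda dog: dog['name'], 'Full     Name'),
    (lambda dog: dog['gender'], 'Gender'),
]
dogs = [{'barks': 'Sometimes, yes', 'name': 'Woofy the dog', 'gender': 'Male'}]

csv_bytes = generate_csv_bytes(export_fields, dogs)
print(csv_bytes)
```

The returned bytes can then be handed to whatever sends the response, for example the HttpResponse call mentioned in the comments.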

3 Comments

Looks like it works with the wrapper indeed (once flushed, but you made the edit before I had the time to comment). Tests passed so 99% sure it is the right answer :)
@MaximeLorant: I've now switched it to using line-buffering; avoids having to manually flush. Sorry about that.
Seems cleaner indeed! Thanks for the tip.
