I'm trying to set up a Python script that reads many fixed-width data files and converts them to CSV. To do this I'm using pandas like this:

pandas.read_fwf('source.txt', colspecs=column_position_length).\
         to_csv('output.csv', header=column_name, index=False, encoding='utf-8')

Here, column_position_length and column_name are lists containing the information needed to read and write the data.

Within these files I have long strings of digits representing test answers. For instance, 333133322122222223133313222222221222111133313333 represents the correct answers on a multiple-choice test, so it is a code rather than a numeric value. The problem is that pandas interprets these values as floats and then writes them to the CSV in scientific notation (3.331333221222221e+47).

I found a lot of questions about this, but none of them quite resolved my problem.

  1. Solution 1 - I believe at this point the values have already been converted to floats so this wouldn't help.
  2. Solution 2 - according to the pandas documentation, dtype is not supported as an argument for read_fwf in Python.
  3. Solution 3 (use converters) - the issue with converters is that you need to specify the name or index of each column to convert, but I would like to read all of the columns as strings.

The second option seems to be the go-to answer for reading every column in as a string, but unfortunately it just isn't supported for read_fwf. Any suggestions?

  • dtype is supported, and yes, setting it to object would be the optimal solution. Commented May 5, 2017 at 18:11
  • dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python'). Use str or object to preserve and not interpret dtype. pandas.pydata.org/pandas-docs/stable/generated/… Commented May 5, 2017 at 18:24
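
As a rough illustration of the dtype suggestion in the comments above, the sketch below reads a small in-memory fixed-width sample with dtype=str and writes it back out as CSV. The sample rows, the column positions, and the column names (student_id, answers) are made up for the example, and it assumes a pandas version in which read_fwf accepts dtype.

```python
import io

import pandas as pd

# Hypothetical sample: a 3-character ID followed by the answer code.
ANSWERS = "333133322122222223133313222222221222111133313333"
raw = f"001{ANSWERS}\n002{ANSWERS}\n"

# Half-open [start, end) character ranges for each column.
colspecs = [(0, 3), (3, 3 + len(ANSWERS))]
names = ["student_id", "answers"]  # illustrative column names

# dtype=str asks pandas not to infer numeric types for any column.
df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    names=names,
    header=None,
    dtype=str,
)

# Write to an in-memory buffer instead of a file so the result can be inspected.
out = io.StringIO()
df.to_csv(out, index=False)
csv_text = out.getvalue()
print(csv_text)
```

Because every column is read as text, leading zeros in student_id are kept as well, which may or may not be what a given file needs.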

1 Answer


So I think I figured out a solution, but I don't know why it works. Pandas was interpreting these values as floats because there were NaN values (blank lines) in the columns. Adding keep_default_na=False to the read_fwf() call resolved the issue. According to the documentation:

keep_default_na : bool, default True. If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.

I guess I'm not quite understanding how this is fixing my issue. Could anyone add any clarity on this?
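
One likely explanation, offered as a sketch rather than a definitive account: NaN is itself a float, so once a blank field is read as NaN, a column that pandas treats as numeric can no longer be held as integers and may end up as float64. With keep_default_na=False the blank is kept as an empty string, which is not numeric, so the column stays as text. The sample below uses made-up data and column names (student_id, answers), and the exact type inference may differ between pandas versions.

```python
import io

import pandas as pd

# Hypothetical sample: the second record has a blank answer field.
ANSWERS = "333133322122222223133313222222221222111133313333"
raw = f"001{ANSWERS}\n002{' ' * len(ANSWERS)}\n003{ANSWERS}\n"

colspecs = [(0, 3), (3, 3 + len(ANSWERS))]
names = ["student_id", "answers"]  # illustrative column names


def read(**kwargs):
    """Read the sample with the same layout, varying only the extra options."""
    return pd.read_fwf(
        io.StringIO(raw), colspecs=colspecs, names=names, header=None, **kwargs
    )


# Default behaviour: the blank field is treated as a missing value (NaN).
with_na = read()

# keep_default_na=False: the blank field is kept as an empty string.
without_na = read(keep_default_na=False)

print(with_na["answers"].isna().tolist())
print(without_na["answers"].tolist())
```

Where read_fwf accepts dtype, combining keep_default_na=False with dtype=str should cover columns both with and without blanks.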
