I am trying to read a space-delimited file in Python using read_csv from pandas. It works when I specify delimiter=" ". The problem arises when a column has a missing value: the blank field is treated as part of the delimiter, so the missing value is silently skipped.

Is there a way to resolve this problem?

1600    1141.0000  020006        600    1141.0000    69.0000   OAUC     0.0000   
   1    1070.5000  020032          1    1070.5000   400.0000            0.0000

You can see there is a missing value in the column that contains OAUC. The uneven spacing between columns makes this more difficult. The number of columns is fixed, so I can tell that some value is missing, but I have not yet been able to work out which one.

  • You say there is uneven spacing between columns, but isn't there always more space between values when one is missing than when none is? Commented Aug 1, 2013 at 15:06
  • I would suggest cleaning this file with command line tools or Python first before trying to read it as structured data. (Emacs org mode would do wonders!) And have you tried pandas.read_fwf for reading fixed-width files? Commented Aug 1, 2013 at 15:07
  • @Justin Yes, I did try pandas.read_fwf, but not all the columns have a fixed width, especially the numeric ones. As you can see, the value in the first column can be 1600 or 1 or 1600000. Commented Aug 1, 2013 at 15:15
  • But the ends of each column line up, so you can take the maximum width of a column to be one less than the distance to the next non-space character. However, using some command line tools or a text editor on the file beforehand will be much cleaner, IMHO. Commented Aug 1, 2013 at 15:20

1 Answer

I agree with Justin that cleaning it up first is the best way to be sure of getting it right. If you can skim the results to check that they look correct, then this hack might get the job done in this case.

pd.read_csv(filename, header=None, sep=r'\s{1,7}', engine='python')
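To show what that separator does, here is a minimal sketch using Python's re module directly rather than pandas. The pattern `\s{1,7}` (written with no space inside the braces) matches a run of one to seven whitespace characters, and it is equivalent to the longer form `\s\s?\s?\s?\s?\s?\s?`. The variable names and the two line fragments are only illustrative, with spacing modelled on the sample rows in the question.

```python
import re

# One to seven whitespace characters are treated as a single delimiter.
SEP = r"\s{1,7}"
# The same pattern spelled out: one required \s plus six optional ones.
LONG_FORM = r"\s\s?\s?\s?\s?\s?\s?"

# A fragment where every value is present (gaps of 3 and 5 spaces).
short_gap = "69.0000   OAUC     0.0000"
# A fragment where a value is missing, leaving a 12-space gap.
long_gap = "400.0000" + " " * 12 + "0.0000"

fields_short = re.split(SEP, short_gap)
fields_long = re.split(SEP, long_gap)

print(fields_short)  # ['69.0000', 'OAUC', '0.0000']
print(fields_long)   # ['400.0000', '', '0.0000']
```

The 12-space gap is consumed as two delimiters (7 spaces, then 5), which leaves an empty field where the value is missing. This only works when the gap around a missing value is longer than 7 characters and every ordinary gap is shorter, which is why the result has to be checked by eye.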

I'll say again, this is not a great idea. If you just want to get a smallish data set loaded, it will do the job. But if you can't verify that it worked, better use read_fwf and carefully specify colspecs, or follow Justin's advice and clean up the file.
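For the read_fwf route, a rough sketch might look like the following. The widths are inferred only from the two sample rows in the question, so treat them as an assumption that has to be checked against the real file. Each width covers a right-aligned value plus the padding to its left, as suggested in the comments. The sample text is rebuilt in memory to mimic that layout, and `WIDTHS`, `rows` and `sample` are illustrative names.

```python
import io

import pandas as pd

# Field widths inferred from the two sample rows (assumption: the rest of the
# file follows the same layout). Each width includes the left padding.
WIDTHS = [4, 13, 8, 11, 13, 11, 7, 11]

rows = [
    ("1600", "1141.0000", "020006", "600", "1141.0000", "69.0000", "OAUC", "0.0000"),
    ("1", "1070.5000", "020032", "1", "1070.5000", "400.0000", "", "0.0000"),
]
# Rebuild right-aligned text lines that mimic the layout of the file.
sample = "\n".join(
    "".join(f"{value:>{width}}" for value, width in zip(row, WIDTHS))
    for row in rows
)

# dtype={2: str} keeps the leading zeros in the third column.
df = pd.read_fwf(io.StringIO(sample), widths=WIDTHS, header=None, dtype={2: str})
print(df)
```

With a real file you would pass its path instead of the StringIO object. The blank field in the second row comes back as NaN instead of shifting the later values left. If the columns are easier to describe by position, `colspecs` takes explicit (start, end) pairs in place of `widths`.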

3 Comments

I think you can also use '\s{1,7}', but still... :s
Sorry to bother but I am not really that acquainted with regular expressions. What does '\s\s?\s?\s?\s?\s?\s?' really signify?
Hahaha... I should probably have been ashamed to show the world my meager regex-fu. Let me edit that...
