26

How do I read the following (two columns) data (from a .dat file) with Pandas

TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17

Column separator is (at least) 2 spaces.

I tried

df = pd.read_table("test.dat", sep="\s+", usecols=['TIME', 'XGSM'])
print df

But it prints

   TIME  XGSM
   2004     6
   2004     6
   2004     6
   2004     6
   2004     6
1

3 Answers 3

13

You can use parameter usecols with order of columns:

import pandas as pd
from pandas.compat import StringIO

temp=u"""TIME             XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), 
                 sep="\s+", 
                 skiprows=1, 
                 usecols=[0,7], 
                 names=['TIME','XGSM'])

print (df)
   TIME  XGSM
0  2004     1
1  2004     5
2  2004     8
3  2004    11
4  2004    17

Edit:

You can use separator regex - 2 and more spaces and then add engine='python' because warning:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.

import pandas as pd
from pandas.compat import StringIO

temp=u"""TIME              XGSM
2004 006 01 00 01 37 600   1
2004 006 01 00 02 32 800   5
2004 006 01 00 03 28 000   8
2004 006 01 00 04 23 200   11
2004 006 01 00 05 18 400   17"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), sep=r'\s{2,}', engine='python')

print (df)
                       TIME  XGSM
0  2004 006 01 00 01 37 600     1
1  2004 006 01 00 02 32 800     5
2  2004 006 01 00 03 28 000     8
3  2004 006 01 00 04 23 200    11
4  2004 006 01 00 05 18 400    17
Sign up to request clarification or add additional context in comments.

1 Comment

Question edited to explicit say there are two columns there. The first column contains 2004 006 01 00 01 37 600, i.e.
11

Could also try pd.read_fwf() (Read a table of fixed-width formatted lines into DataFrame):

import pandas as pd
from io import StringIO

pd.read_fwf(StringIO("""TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""), usecols = ["TIME", "XGSM"])

#   TIME    XGSM
#0  2004    1
#1  2004    5
#2  2004    8
#3  2004    11
#4  2004    17

2 Comments

So if you dont pass widths does it automatically figure it out based on headers?
@ayhan. From the docs, it is using the first 100 rows of the data to detect the column specifications by default.
2

I too experienced the problem while importing when there are lots of white space. I could solve by using

pd.read_fwf(file_name)

If you want to import files with fixed width text file, then read_fwf might be the solution without needing to use StringIO.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.