Read data (.dat file) with Pandas

Question

How do I read the following two-column data from a .dat file with Pandas?

TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17

Column separator is (at least) 2 spaces.

I tried

df = pd.read_table("test.dat", sep="\s+", usecols=['TIME', 'XGSM'])
print df

But it prints

Possible duplicate of Python pandas: Generate Document-Term matrix from whitespace delimited '.dat' file
– Shihe Zhang, Commented Aug 21, 2017 at 1:22

jezrael · Accepted Answer · 2016-12-07 19:27:37Z

13

You can pass the usecols parameter as a list of column positions, with skiprows=1 and names to replace the header row:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from current pandas

temp = """TIME             XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""
# after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp),
                 sep=r"\s+",       # raw string: any run of whitespace
                 skiprows=1,       # skip the original header line
                 usecols=[0, 7],   # first and last whitespace-separated fields
                 names=['TIME', 'XGSM'])

print(df)
   TIME  XGSM
0  2004     1
1  2004     5
2  2004     8
3  2004    11
4  2004    17
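Note that the output above keeps only the year in TIME, because \s+ splits the timestamp into seven separate fields. A minimal sketch of one way to rebuild the full timestamp, assuming every row has exactly seven timestamp fields followed by the XGSM value (the names raw and data are only illustrative):

```python
from io import StringIO

import pandas as pd

data = """TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""

# Read every whitespace-separated field as text so leading zeros ("006") survive.
raw = pd.read_csv(StringIO(data), sep=r"\s+", skiprows=1, header=None, dtype=str)

df = pd.DataFrame({
    # Assumption: the first seven fields together form the TIME column.
    "TIME": raw.iloc[:, :7].agg(" ".join, axis=1),
    # Assumption: the eighth field is the integer XGSM value.
    "XGSM": raw.iloc[:, 7].astype(int),
})

print(df)
```

This keeps TIME as a plain string; converting it to a datetime would need a format that the question does not spell out.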

Edit:

You can also use a regex separator that matches two or more spaces. Add engine='python' to avoid this warning:

ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from current pandas

temp = """TIME              XGSM
2004 006 01 00 01 37 600   1
2004 006 01 00 02 32 800   5
2004 006 01 00 03 28 000   8
2004 006 01 00 04 23 200   11
2004 006 01 00 05 18 400   17"""
# after testing, replace StringIO(temp) with the filename
# two or more whitespace characters separate the columns, so TIME stays whole
df = pd.read_csv(StringIO(temp), sep=r'\s{2,}', engine='python')

print(df)
                       TIME  XGSM
0  2004 006 01 00 01 37 600     1
1  2004 006 01 00 02 32 800     5
2  2004 006 01 00 03 28 000     8
3  2004 006 01 00 04 23 200    11
4  2004 006 01 00 05 18 400    17
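Once this works with StringIO, the same call can take a path instead. A minimal sketch, assuming the file looks like the sample in the question; it writes a throwaway test.dat into a temporary directory only so the example is self-contained:

```python
import os
import tempfile

import pandas as pd

content = (
    "TIME                      XGSM\n"
    "2004 006 01 00 01 37 600  1\n"
    "2004 006 01 00 02 32 800  5\n"
    "2004 006 01 00 03 28 000  8\n"
    "2004 006 01 00 04 23 200  11\n"
    "2004 006 01 00 05 18 400  17\n"
)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; use the path of your own .dat file instead.
    path = os.path.join(tmp, "test.dat")
    with open(path, "w") as fh:
        fh.write(content)

    # Columns are split on two or more whitespace characters only,
    # so the single spaces inside the timestamp are kept.
    df = pd.read_csv(path, sep=r"\s{2,}", engine="python")

print(df)
```

This only works while the gap between the two columns is at least two characters wide and no gap inside the timestamp is.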

edited Dec 7, 2016 at 19:27

answered Dec 7, 2016 at 19:12


1 Comment

KcFnMi Over a year ago

Question edited to explicitly say there are two columns. The first column contains 2004 006 01 00 01 37 600, i.e.

akuiper · Accepted Answer · 2016-12-07 19:15:24Z

11

You could also try pd.read_fwf(), which reads a table of fixed-width formatted lines into a DataFrame:

import pandas as pd
from io import StringIO

pd.read_fwf(StringIO("""TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""), usecols = ["TIME", "XGSM"])

#   TIME    XGSM
#0  2004    1
#1  2004    5
#2  2004    8
#3  2004    11
#4  2004    17

answered Dec 7, 2016 at 19:15


2 Comments

user2285236 Over a year ago

So if you don't pass widths, does it automatically figure them out based on the headers?

akuiper Over a year ago

@ayhan According to the docs, by default it uses the first 100 rows of the data to detect the column specifications.
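If the inferred boundaries are not the ones you want, read_fwf also accepts explicit column positions through colspecs. A minimal sketch, assuming the timestamp occupies characters 0 to 23 and XGSM starts at character 26, as in the sample from the question (the offsets would need adjusting for a file laid out differently):

```python
from io import StringIO

import pandas as pd

data = """TIME                      XGSM
2004 006 01 00 01 37 600  1
2004 006 01 00 02 32 800  5
2004 006 01 00 03 28 000  8
2004 006 01 00 04 23 200  11
2004 006 01 00 05 18 400  17"""

df = pd.read_fwf(
    StringIO(data),
    colspecs=[(0, 24), (26, 30)],  # half-open [start, end) character ranges (assumed)
    skiprows=1,                    # skip the header line and name the columns ourselves
    header=None,
    names=["TIME", "XGSM"],
)

print(df)
```

With explicit colspecs the whole timestamp stays in one TIME column instead of being cut at every space.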

Suraj · Accepted Answer · 2022-01-29 23:28:40Z

2

I also ran into this problem when importing a file that contains a lot of whitespace. I solved it by using

pd.read_fwf(file_name)

If you want to import a fixed-width text file, read_fwf might be the solution. It takes the file name directly, so there is no need for StringIO.

answered Jan 29, 2022 at 23:28

