
I need to extract some data from a .dat file, which I usually do with

import numpy as np
file = np.loadtxt('blablabla.dat')

Here my data are not separated by a specific delimiter but have predefined field widths (numbers of digits), and some lines have no value in some columns. Here is a sample to make it clear:

 3  0  36  0  0 0  0   0    0  0         99. 
-2  0   0  0  0 0  0   0    0  0         99. 
 2  0   0  0  0 0  0   0    0  0 .LA.0?.  3. 
 5  0   0  0  0 2  4   0    0  0 .SAS7?. 99. 
-5  0   0  0  0 0  0   0    0  0         99. 
99  0   0  0  0 0  0   0    0  0 .S..3*.  3.5

My little code above gets the error:

# Convert each value according to its column and store
ValueError: Wrong number of columns at line 3

Does someone have an idea about how to collect this kind of data?

  • By the way, I have the format of the file, which for the given example is: I2 / I3 / I2 / I2 / I1 / I2 / I3 / I4 / I2 / A7 / F4.1 (Commented Feb 29, 2016 at 12:09)
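As an aside, Fortran-style format descriptors like these can be translated mechanically into a list of field widths. A minimal sketch (the helper name `fortran_widths` is made up for illustration; note these widths cover only the fields themselves, not any blank padding between them in the actual file, so they may need adjusting before being fed to a fixed-width reader):

```python
import re

def fortran_widths(spec):
    """Turn a Fortran-style format spec like 'I2 / A7 / F4.1'
    into a list of field widths (the digits after I/A/F)."""
    widths = []
    for field in spec.split('/'):
        # match the type letter (I, A, or F) followed by the width
        m = re.match(r'\s*([IAF])(\d+)', field.strip())
        if m:
            widths.append(int(m.group(2)))
    return widths

print(fortran_widths('I2 / I3 / I2 / I2 / I1 / I2 / I3 / I4 / I2 / A7 / F4.1'))
# [2, 3, 2, 2, 1, 2, 3, 4, 2, 7, 4]
```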

2 Answers


numpy.genfromtxt seems to be what you want; you can specify field widths for each column, and it treats missing data as NaN.

For this case:

import numpy as np
data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5])

If you want to keep information in the string part of the file, you could read twice and specify the usecols parameter:

import numpy as np
number_data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],\
                            usecols=(0,1,2,3,4,5,6,7,8,9,11))
string_data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],\
                            usecols=(10),dtype=str)
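To check the idea without the actual file, here is a self-contained run of the same two-pass approach on an inline copy of the sample (the `delimiter` widths below are one reading of the sample's layout and would need verifying against the real file; `autostrip=True` is added so the string field comes back without padding):

```python
import io
import numpy as np

# inline copy of the sample rows for demonstration
sample = (
    " 3  0  36  0  0 0  0   0    0  0         99. \n"
    "-2  0   0  0  0 0  0   0    0  0         99. \n"
    " 2  0   0  0  0 0  0   0    0  0 .LA.0?.  3. \n"
)
widths = [2, 3, 4, 3, 3, 2, 3, 4, 5, 3, 8, 5]

# first pass: the numeric columns (skipping the string column 10)
numbers = np.genfromtxt(io.StringIO(sample), delimiter=widths,
                        usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11))

# second pass: only the string column
strings = np.genfromtxt(io.StringIO(sample), delimiter=widths,
                        usecols=(10,), dtype=str, autostrip=True)

print(numbers.shape)  # (3, 11)
print(strings)
```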

2 Comments

I think it should be usecols=(10,)
I finally got my data extracted, thanks for the help. My file is too big to use exactly the same method (the delimiter array is 375 entries long), but the usecols option helps!

What you essentially need is to get the list of empty "column" positions that serve as delimiters. This will get you started:

In [108]: table = ''' 3  0  36  0  0 0  0   0    0  0         99. 
   .....: -2  0   0  0  0 0  0   0    0  0         99. 
   .....:  2  0   0  0  0 0  0   0    0  0 .LA.0?.  3. 
   .....:  5  0   0  0  0 2  4   0    0  0 .SAS7?. 99. 
   .....: -5  0   0  0  0 0  0   0    0  0         99. 
   .....: 99  0   0  0  0 0  0   0    0  0 .S..3*.  3.5'''.split('\n')

In [110]: max_row_len = max(len(row) for row in table)

In [111]: from functools import reduce  # reduce lives in functools on Python 3

In [117]: spaces = reduce(lambda res, row: res.intersection(idx for idx, c in enumerate(row) if c == ' '), table, set(range(max_row_len)))

This code starts from the set of all character positions in the longest row, and reduce keeps only the positions that hold a space in every row.
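One way to take this the rest of the way (a sketch, not part of the original answer): turn the all-blank positions into (start, stop) slices and cut each row into fields with them.

```python
from functools import reduce  # reduce lives in functools on Python 3

table = [
    ' 3  0  36  0  0 0  0   0    0  0         99. ',
    '-2  0   0  0  0 0  0   0    0  0         99. ',
    ' 2  0   0  0  0 0  0   0    0  0 .LA.0?.  3. ',
]
max_row_len = max(len(row) for row in table)

# positions that are blank in every row
spaces = reduce(
    lambda res, row: res.intersection(
        idx for idx, c in enumerate(row) if c == ' '),
    table, set(range(max_row_len)))

# collapse the blank positions into (start, stop) slices for the data fields
fields = []
start = 0
for pos in sorted(spaces) + [max_row_len]:
    if pos > start:
        fields.append((start, pos))
    start = pos + 1

rows = [[row[a:b].strip() for a, b in fields] for row in table]
print(rows[0])
```

Missing string fields come out as empty strings, so every row ends up with the same number of columns.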

1 Comment

Thanks Volcano, I didn't use what you propose, but your code does work if someone wants to understand more.
