Importing data from a text file using python

Question

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.

The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).

What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.

What exactly do you mean by "last two columns"? The last two characters in the line, or the last two space-separated entries?
– Tim Pietzcker, Commented Jun 10, 2010 at 8:02
@Tim: the OP writes "...the last two columns if the second column contains the string 'OW' ...", so you seem to think he may have switched meaning mid-sentence: "the last two characters if the second field contains the string 'OW'"??? Consider (re)reading his 2nd paragraph: "The columns ... are also not all the same number of characters ... the last one is 5".
– John Machin, Commented Jun 10, 2010 at 9:16

tzaman · Accepted Answer · 2010-06-10 08:36:21Z

4

Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:

import struct

def parsefile(filename):
    with open(filename) as myfile:
        for line in myfile:
            line = line.rstrip('\n')
            fields = struct.unpack('11s11s8s8s5s', line)
            if 'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))

Usage:

if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print field

Test data:

1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444

Output:

(88888888, 55555)
(66666666, 44444)
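
On Python 3, struct.unpack only accepts bytes, so the function above would need the file opened in binary mode. The following is a minimal sketch of that variant, not part of the original answer: the names RECORD, parse_lines and sample are illustrative, and it assumes every record is exactly 43 bytes before the line terminator (a record of any other length raises struct.error rather than being skipped).

```python
import io
import struct

# Field widths from the question: 11, 11, 8, 8 and 5 bytes (43 in total).
RECORD = struct.Struct('11s11s8s8s5s')

def parse_lines(lines):
    """Yield (int, int) pairs from the last two fields of each row
    whose second field contains b'OW'. `lines` is any iterable of bytes."""
    for line in lines:
        # Binary mode does no newline translation, so strip \r as well as \n.
        line = line.rstrip(b'\r\n')
        fields = RECORD.unpack(line)  # raises struct.error on a wrong-length record
        if b'OW' in fields[1]:
            # int() accepts bytes and ignores surrounding whitespace.
            yield int(fields[3]), int(fields[4])

# The test data from the answer, held in memory instead of a file.
sample = io.BytesIO(
    b"1234567890a1234567890a123456781234567812345\n"
    b"something  maybe OW d 111111118888888855555\n"
    b"aaaaa      bbbbb      1234    1212121233333\n"
    b"other thinganother OW 121212  6666666644444\n"
)
pairs = list(parse_lines(sample))
print(pairs)
```

With a real file, the object returned by open('file.txt', 'rb') can be passed to parse_lines in place of the in-memory sample.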

edited Jun 10, 2010 at 8:36

answered Jun 10, 2010 at 7:48

tzaman

4 Comments

John Machin Over a year ago

+1 for the concept, -1 for the attention to detail. Why strip instead of rstrip? Why include \r in the chars to strip? In any case he hasn't mentioned lines at all; maybe the rows aka records are fixed-length without separators. He has FIVE fields; the unpack format should be '11s11s8s8s5s' and the output indices should be 3 and 4, not 2 and 3.

tzaman Over a year ago

@John - Yeah, I noticed the fields myself and fixed already. Switched to rstrip too, nice tip. \r\n is just for robustness in the face of different line-endings... probably just \n works fine, but adding \r doesn't hurt imo. About the lines themselves - he has actually, mentions his data is in "rows and columns" - sounds like lines to me.

John Machin Over a year ago

(1) If the file is being read in 'r' or 'rU' mode, lines will end with \n (except possibly the last line which may not be terminated). If reading in 'r' mode, ending up with '\r' before your line terminator is a BUG in the data; stripping it silently is NOT "robust". Having '\r' in there will make people reading your code wonder why. (2) Fixed-length no-separator records often go hand-in-hand with fixed-length fields.

tzaman Over a year ago

@John - okay, \r removed. Leaving the separator as-is until the OP clarifies.

Dave Kirby · Accepted Answer · 2010-06-10 07:44:34Z

3

In Python you can extract a substring at known positions using a slice - this is normally done with the sequence[start:end] syntax. However, you can also create slice objects ahead of time and use them later to do the indexing.

So you can do something like this:

columns = [slice(11,22), slice(30,38), slice(38,44)]

with open('some/file/path') as myfile:
    for line in myfile:
        # Cut out each field and drop the padding spaces (and the newline).
        fields = [line[column].strip() for column in columns]
        if "OW" in fields[0]:
            value1 = int(fields[1])
            value2 = int(fields[2])
            # ...

Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
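
One way to make a layout change even cheaper is to derive the slices from the column widths instead of typing the offsets by hand. This is only a sketch under the widths stated in the question (11, 11, 8, 8, 5); WIDTHS, COLUMNS and select are hypothetical names, and the sample rows are the test data from the struct answer.

```python
from itertools import accumulate

# Column widths from the question; edit this list if the layout changes.
WIDTHS = [11, 11, 8, 8, 5]

ends = list(accumulate(WIDTHS))   # running totals: [11, 22, 30, 38, 43]
starts = [0] + ends[:-1]          # [0, 11, 22, 30, 38]
COLUMNS = [slice(a, b) for a, b in zip(starts, ends)]

def select(lines):
    """Yield the last two fields as ints when the second field contains 'OW'."""
    for line in lines:
        fields = [line[col].strip() for col in COLUMNS]
        if 'OW' in fields[1]:
            yield int(fields[3]), int(fields[4])

sample_lines = [
    "1234567890a1234567890a123456781234567812345\n",
    "something  maybe OW d 111111118888888855555\n",
    "aaaaa      bbbbb      1234    1212121233333\n",
    "other thinganother OW 121212  6666666644444\n",
]
result = list(select(sample_lines))
print(result)
```

Here all five columns are sliced, so the indices are 1, 3 and 4 rather than the 0, 1 and 2 used above.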

edited Jun 10, 2010 at 7:44

answered Jun 10, 2010 at 7:33

Dave Kirby

1 Comment

Matthew Flaschen Over a year ago

+1. Nice use of slice objects. Nit: You have an off-by-one error on the last slice. It should be 38,44

Glyph · Accepted Answer · 2010-06-10 07:26:54Z

Here's a function which might help you:

def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size: # EOF
                return
            row[key] = value
        yield row

Here's an example of how it's used:

from StringIO import StringIO

sample = StringIO("""aaabbbccc
d  e  f  
g  h  i  
""")

for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print repr(row)

Note that unlike the other answers, this example is not line-delimited: it uses the file purely as a provider of bytes, not an iterator of lines. Since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either. The newline is accounted for explicitly by making the last column one character wider ('third' is read as 4 characters, not 3).
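
For reference, here is a sketch of the same reader on Python 3 (where StringIO lives in the io module), applied to the layout from the question. The field names c1 to c5 and the LAYOUT list are made up for illustration, and the sketch assumes each record ends with a single '\n', which is why the last field is read as 6 characters instead of 5.

```python
from io import StringIO

def rows(f, column_sizes):
    """Yield one dict per record, reading exactly `size` characters per field."""
    while True:
        row = {}
        for key, size in column_sizes:
            value = f.read(size)
            if len(value) < size:  # EOF, or a truncated final record
                return
            row[key] = value
        yield row

# Widths from the question; the last field is 5 + 1 so it swallows the newline.
LAYOUT = [('c1', 11), ('c2', 11), ('c3', 8), ('c4', 8), ('c5', 6)]

sample = StringIO(
    "something  maybe OW d 111111118888888855555\n"
    "aaaaa      bbbbb      1234    1212121233333\n"
    "other thinganother OW 121212  6666666644444\n"
)

# int() ignores the trailing newline captured in the last field.
matches = [(int(r['c4']), int(r['c5']))
           for r in rows(sample, LAYOUT)
           if 'OW' in r['c2']]
print(matches)
```

If the final record is not newline-terminated, its last field comes up one character short and the record is dropped, so this approach only suits files whose record length is truly constant.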

You can test if one string is a substring of another with the 'in' operator. For example,

>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True

So in this case, you might do

if 'OW' in row['third']:
    stuff()

but you can obviously test any field for any value as you see fit.

Matthew Flaschen · Accepted Answer · 2010-06-10 08:23:50Z

0

entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])

for num1, num2 in entries:
    pass  # whatever

edited Jun 10, 2010 at 8:23

answered Jun 10, 2010 at 7:20

Matthew Flaschen

Justin L. · Accepted Answer · 2010-06-10 07:37:15Z

-2

entries = []
with open('my_file.txt', 'r') as f:
  for line in f.read().splitlines():
    line = line.split()
    if line[1].find('OW') >= 0:
      entries.append( ( int(line[-2]) , int(line[-1]) ) )

entries is a list of tuples holding the last two entries of each matching line

edit: oops

edited Jun 10, 2010 at 7:37

answered Jun 10, 2010 at 7:26

Justin L.

1 Comment

Matthew Flaschen Over a year ago

This is wrong. line[1] will be the second character of the line, etc.
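
Separately from the indexing point, splitting on whitespace does not seem able to recover fixed-width fields at all when a field contains spaces or when adjacent fields have no padding between them. A quick check on one of the test rows from the struct answer illustrates this; the variable names are only for this example.

```python
line = "something  maybe OW d 111111118888888855555"

# Whitespace splitting: the second field breaks apart and the three
# numeric fields run together into a single token.
tokens = line.split()
print(tokens)

# Positional slicing with the widths from the question (11, 11, 8, 8, 5).
fixed = [line[0:11], line[11:22], line[22:30], line[30:38], line[38:43]]
print(fixed)
```

On this row tokens[1] is 'maybe', so the 'OW' test fails even though the second column contains it, and tokens[-2] is not a number.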
