4

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.

The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).

What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.

3
  • What exactly do you mean by "last two columns"? The last two characters in the line, or the last two space-separated entries? Commented Jun 10, 2010 at 8:02
  • Are your "rows" separated by newlines? Commented Jun 10, 2010 at 8:38
  • @Tim: the OP writes "...the last two columns if the second column contains the string 'OW' ...", so are you thinking he may have switched meaning mid-sentence, i.e. "the last two characters if the second field contains the string 'OW'"? Consider rereading his 2nd paragraph: "The columns ... are also not all the same number of characters ... the last one is 5". Commented Jun 10, 2010 at 9:16

5 Answers

4

Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:

import struct

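def parsefile(filename):
    # Lazily yield (int, int) pairs from the last two columns of matching rows.
    with open(filename) as myfile:
        for line in myfile:
            line = line.rstrip('\n')
            # Field widths: 11, 11, 8, 8 and 5 characters (43 in total).
            fields = struct.unpack('11s11s8s8s5s', line)
            if 'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))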

Usage:

if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print field

Test data:

1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444

Output:

(88888888, 55555)
(66666666, 44444)
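
The function above is written for Python 2, where lines read from a file are byte strings. As a rough sketch only, here is how the same idea might look on Python 3, where struct.unpack needs bytes. The names RECORD and parse_lines are made up for this illustration, and skipping rows that are not exactly 43 bytes wide is an assumption, not something the question asks for:

```python
import struct

# Field widths from the question: 11, 11, 8, 8 and 5 characters (43 in total).
RECORD = struct.Struct('11s11s8s8s5s')

def parse_lines(lines):
    """Yield (int, int) pairs from the last two fields of rows whose
    second field contains b'OW'. `lines` is any iterable of bytes,
    for example a file opened with mode 'rb'."""
    for line in lines:
        # Binary mode does not translate line endings, so strip both.
        line = line.rstrip(b'\r\n')
        if len(line) != RECORD.size:
            continue  # assumption: silently skip rows of the wrong width
        fields = RECORD.unpack(line)
        if b'OW' in fields[1]:
            yield int(fields[3]), int(fields[4])

sample = [
    b'something  maybe OW d 111111118888888855555\n',
    b'aaaaa      bbbbb      1234    1212121233333\n',
    b'other thinganother OW 121212  6666666644444\n',
]
results = list(parse_lines(sample))
print(results)
```

With a real file you would pass the open file object instead of the sample list, e.g. parse_lines(open('file.txt', 'rb')).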

4 Comments

+1 for the concept, -1 for the attention to detail. Why strip instead of rstrip? Why include \r in the chars to strip? In any case he hasn't mentioned lines at all; maybe the rows aka records are fixed-length without separators. He has FIVE fields; the unpack format should be '11s11s8s8s5s' and the output indices should be 3 and 4, not 2 and 3.
@John - Yeah, I noticed the fields myself and fixed already. Switched to rstrip too, nice tip. \r\n is just for robustness in the face of different line-endings... probably just \n works fine, but adding \r doesn't hurt imo. About the lines themselves - he has actually, mentions his data is in "rows and columns" - sounds like lines to me.
(1) If the file is being read in 'r' or 'rU' mode, lines will end with \n (except possibly the last line which may not be terminated). If reading in 'r' mode, ending up with '\r' before your line terminator is a BUG in the data; stripping it silently is NOT "robust". Having '\r' in there will make people reading your code wonder why. (2) Fixed-length no-separator records often go hand-in-hand with fixed-length fields.
@John - okay, \r removed. Leaving the separator as-is until the OP clarifies.
3

In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.

So you can do something like this:

columns = [slice(11,22), slice(30,38), slice(38,44)]

with open('some/file/path') as myfile:
    for line in myfile:
        # Cut each column out of the line and trim the padding spaces.
        fields = [line[column].strip() for column in columns]
        if "OW" in fields[0]:
            value1 = int(fields[1])
            value2 = int(fields[2])
            # ... use value1 and value2 here

Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
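
If you would rather not work out the start and end positions by hand, the slices can be derived from the column widths. This is only a sketch: make_slices and WIDTHS are names invented here, and the widths are taken from the question as stated (11, 11, 8, 8, 5). Note that with these widths the last column ends at index 43; a slice ending at 44 also picks up the trailing newline, which strip() then removes.

```python
from itertools import accumulate

# Column widths as described in the question (assumed exact).
WIDTHS = [11, 11, 8, 8, 5]

def make_slices(widths):
    """Turn a list of column widths into slice objects, one per column."""
    ends = list(accumulate(widths))   # running totals: 11, 22, 30, 38, 43
    starts = [0] + ends[:-1]          # each column starts where the previous ended
    return [slice(start, end) for start, end in zip(starts, ends)]

columns = make_slices(WIDTHS)

line = 'other thinganother OW 121212  6666666644444\n'
fields = [line[column].strip() for column in columns]
print(fields)
```

Changing the layout then only means editing the WIDTHS list.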

1 Comment

+1. Nice use of slice objects. Nit: You have an off-by-one error on the last slice. It should be 38,44
0

Here's a function which might help you:

def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size: # EOF
                return
            row[key] = value
        yield row

Here's an example of how it's used:

from StringIO import StringIO

sample = StringIO("""aaabbbccc
d  e  f  
g  h  i  
""")

for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print repr(row)

Note that unlike the other answers, this example is not line-delimited: it uses the file purely as a provider of bytes, not as an iterator of lines. Since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either. In the sample above the newline is accounted for explicitly by making the third column 4 characters wide instead of 3.

You can test if one string is a substring of another with the 'in' operator. For example,

>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True

So in this case, you might do

if 'OW' in row['third']:
    stuff()

but you can obviously test any field for any value as you see fit.
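
Putting this together for the layout in the question might look roughly like the following. It is a Python 3 sketch, not a drop-in solution: the column names in LAYOUT are hypothetical, the widths are the ones given in the question, and the extra 1-character 'newline' column assumes every row ends with a single '\n'.

```python
from io import StringIO

def rows(f, column_sizes):
    """Read fixed-width records from a file-like object, one dict per record."""
    while True:
        row = {}
        for key, size in column_sizes:
            value = f.read(size)
            if len(value) < size:  # ran out of data: stop
                return
            row[key] = value
        yield row

# Hypothetical column names; widths from the question plus 1 for the newline.
LAYOUT = [('a', 11), ('b', 11), ('c', 8), ('d', 8), ('e', 5), ('newline', 1)]

sample = StringIO(
    'something  maybe OW d 111111118888888855555\n'
    'aaaaa      bbbbb      1234    1212121233333\n'
    'other thinganother OW 121212  6666666644444\n'
)

# Keep the last two columns of every record whose second column contains 'OW'.
matches = [(int(row['d']), int(row['e']))
           for row in rows(sample, LAYOUT)
           if 'OW' in row['b']]
print(matches)
```

If the rows turn out not to be newline-terminated, dropping the 'newline' entry from LAYOUT should be the only change needed.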

0
entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])

for num1, num2 in entries:
  pass  # whatever

-2
entries = []
with open('my_file.txt', 'r') as f:
  for line in f.read().splitlines():
    line = line.split()
    if line[1].find('OW') >= 0:
      entries.append( ( int(line[-2]) , int(line[-1]) ) )

entries is a list of tuples, each holding the last two entries of a matching line

edit: oops

1 Comment

This is wrong. line[1] will be the second character of the line, etc.
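
To illustrate why whitespace splitting does not suit this layout, here is a small Python 3 sketch using one of the test rows posted in another answer. The column positions assume the widths given in the question (11, 11, 8, 8, 5):

```python
line = 'other thinganother OW 121212  6666666644444'

# Whitespace splitting breaks fields apart at embedded spaces and glues
# adjacent columns together when there is no padding between them.
tokens = line.split()
print(tokens)

# Slicing by position recovers the five columns as intended.
fields = [line[0:11], line[11:22], line[22:30], line[30:38], line[38:43]]
print(fields)
```

Here tokens[1] is 'thinganother', so the 'OW' test misses the row, and the last token is the final two columns run together.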
