CSV file processing in Python

Question

I work with spatial data that is output to text files with the following format:

COMPANY NAME
P.O. BOX 999999
ZIP CODE , CITY 
+99 999 9999
23 April 2013 09:27:55

PROJECT: Link Ref
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Design DTM is 30MB 2.5X2.5
Stripping applied to design is 0.000

Point Number      Easting     Northing        R.L. Design R.L.  Difference  Tol  Name
     3224808   422092.700  6096059.380       2.520     -19.066     -21.586  --   
     3224809   422092.200  6096059.030       2.510     -19.065     -21.575  --   
<Remainder of lines>
 3273093   422698.920  6096372.550       1.240     -20.057     -21.297  --   

Average height difference is -21.390
RMS  is  21.596
0.00 % above tolerance
98.37 % below tolerance
End of Report

As shown, the files have a header and a footer. The data is delimited by spaces, but the number of spaces between columns varies.

What I need are comma-delimited files containing only Easting, Northing and Difference.

I'd like to avoid modifying several hundred large files by hand, so I am writing a small script to process them. This is what I have so far:

#! /usr/bin/env python
import csv,glob,os
from itertools import islice
list_of_files = glob.glob('C:/test/*.txt')
for filename in list_of_files:
    (short_filename, extension )= os.path.splitext(filename)
    print short_filename
    file_out_name = short_filename + '_ed' + extension
    with open (filename, 'rb') as source:
        reader = csv.reader( source) 
        for row in islice(reader, 10, None):
            file_out= open (file_out_name, 'wb')
            writer= csv.writer(file_out)
            writer.writerows(reader)
            print 'Created file: '+ file_out_name
            file_out.close()
print 'All done!'

Questions:

How can I make the line starting with 'Point Number' the header in the output file? I'm trying to put DictReader in place of the reader/writer bit but can't get it to work.
Writing the output file with delimiter ',' does work, but it writes a comma in place of each space, giving far too many empty columns in my output file. How do I avoid this?
How do I remove the footer?

Would you please reward the user who gave you the best indication by clicking the big tick outline to the left of the answer?
– Joël, Commented Nov 28, 2013 at 8:27

fortran · Accepted Answer · 2013-04-29 14:19:29Z

6

I can see a problem with your code: you are creating a new writer for each row, so you will end up with only the last one.
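As a minimal sketch of the general pitfall (not the asker's exact script), the snippet below reopens an output file in 'w' mode inside a loop and compares the result with opening it once. The temporary path and the sample rows are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
rows = ['a', 'b', 'c']

# Reopening with mode 'w' inside the loop truncates the file on every pass.
for row in rows:
    with open(path, 'w') as out:
        out.write(row + '\n')
with open(path) as f:
    reopened_result = f.read()

# Opening once, outside the loop, keeps every row.
with open(path, 'w') as out:
    for row in rows:
        out.write(row + '\n')
with open(path) as f:
    opened_once_result = f.read()
```

After running it, reopened_result holds only the last row, while opened_once_result holds all three.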

Your code could look something like this. It doesn't need CSV readers or writers, because the data is simple enough to parse as plain text (problems would arise only if you had text columns with escaped characters and so on).

import glob, os

list_of_files = glob.glob('C:/test/*.txt')

def process_file(source, dest):
  header_found = False
  for line in source:
    line = line.strip()
    if not header_found:
      #ignore everything until we find this text (the header line itself is skipped too)
      header_found = line.startswith('Point Number')
    elif not line:
      return #we are done when we find an empty line, I guess
    else:
      #write the needed columns: Easting, Northing, Difference
      columns = line.split()
      dest.write(','.join(columns[i] for i in (1, 2, 5)) + '\n')

for filename in list_of_files:
  short_filename, extension = os.path.splitext(filename)
  file_out_name = short_filename + '_ed' + extension
  with open(filename, 'r') as source:
    with open(file_out_name, 'w') as dest:
      process_file(source, dest)

edited Apr 29, 2013 at 14:19

answered Apr 29, 2013 at 10:11


2 Comments

thegrinner Over a year ago

Use elif instead of else if.

Joël Over a year ago

+1 for pointing out escaped characters, which could be hard to handle with built-in functions.

Termininja · Accepted Answer · 2016-12-17 10:40:35Z

This worked:

#! /usr/bin/env python

import glob,os

list_of_files = glob.glob('C:/test/*.txt')

def process_file(source, dest):
  header_found = False
  for line in source:
    line = line.strip()
    if not header_found:
      #ignore everything until we find this text
      header_found = line.startswith('Stripping applied') #otherwise, header is lost
    elif not line:
      return #we are done when we find an empty line
    else:
      #write the needed columns
      columns = line.split()
      dest.write(','.join(columns[i] for i in (1, 2, 5))+"\n") #the newline character has to be added explicitly

for filename in list_of_files:
  short_filename, extension = os.path.splitext(filename)
  file_out_name = short_filename + '_ed' + ".csv"
  with open(filename, 'r') as source:
    with open(file_out_name, 'w') as dest: #text mode, since strings are written
      process_file(source, dest)
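One thing to watch when keeping the header: splitting the title line on whitespace breaks multi-word names such as 'Point Number' and 'Design R.L.' into separate tokens, so indices 1, 2 and 5 no longer line up with the data rows. A possible workaround, sketched below under the assumption that the layout matches the sample in the question, is to skip the title line and write a fixed header instead. The names extract_rows and convert are hypothetical helpers, not part of any answer above.

```python
import csv
import io

def extract_rows(lines):
    """Yield (easting, northing, difference) for each data row."""
    in_table = False
    for line in lines:
        line = line.strip()
        if not in_table:
            # Skip the report header up to and including the column titles.
            in_table = line.startswith('Point Number')
        elif not line:
            # The first blank line after the table starts the footer.
            return
        else:
            columns = line.split()
            yield columns[1], columns[2], columns[5]

def convert(source, dest):
    """Write a fixed header followed by the selected columns."""
    writer = csv.writer(dest, lineterminator='\n')
    writer.writerow(['Easting', 'Northing', 'Difference'])
    writer.writerows(extract_rows(source))

# Shortened sample in the format shown in the question.
sample = io.StringIO(
    "PROJECT: Link Ref\n"
    "Stripping applied to design is 0.000\n"
    "\n"
    "Point Number      Easting     Northing        R.L. Design R.L.  Difference  Tol  Name\n"
    "     3224808   422092.700  6096059.380       2.520     -19.066     -21.586  --   \n"
    "     3224809   422092.200  6096059.030       2.510     -19.065     -21.575  --   \n"
    "\n"
    "Average height difference is -21.390\n"
    "End of Report\n"
)
output = io.StringIO()
convert(sample, output)
result = output.getvalue()
```

With real files on Python 3, the csv module documentation recommends opening the destination with newline='' before passing it to csv.writer.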

Joël · Accepted Answer · 2013-04-29 11:32:27Z

1

To answer your first and last questions: it is simply a matter of ignoring the corresponding lines, i.e. not writing them to the output. This corresponds to the if not header_found and elif not line: blocks of fortran's proposal.

The second point is that there is no dedicated delimiter in your file: columns are separated by one or more spaces, which makes the file hard to parse with the csv module. Calling split() with no arguments splits each line on runs of whitespace and returns the list of non-blank fields, so it returns only the useful values.
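To illustrate the difference, here is a small sketch that parses one data row from the question both ways. It assumes the csv reader is configured with a single space as the delimiter, which is one plausible way to end up with the empty columns described in the question.

```python
import csv

line = '     3224808   422092.700  6096059.380       2.520     -19.066     -21.586  --   '

# csv.reader treats every single space as a delimiter,
# so each run of spaces produces empty fields.
csv_fields = next(csv.reader([line], delimiter=' '))

# str.split() with no arguments splits on runs of whitespace
# and ignores leading and trailing blanks.
split_fields = line.split()
```

Here split_fields contains just the seven values of the row, while csv_fields is padded with empty strings.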

answered Apr 29, 2013 at 11:32

