Python, CSV, skipping lines based on content

Question

I'm handling CSV files that have buffer rows before the header. The number of buffer rows varies, and some of them contain strings while others don't. The only consistent thing is that every buffer row has a null value in one or more of its cells, so I'm trying to skip any row with a null cell.

I've got the following script, but it outputs a blank file. I'm guessing that I'm never reaching the 'else' branch, but I suspect that if I put it in a loop I'll end up creating a file for every row...

with open(fileName, 'rb') as inf, open("out_"+fileName, 'wb') as outf:
    csvreader = csv.DictReader(inf)

    if '' in csvreader.fieldnames:
        next(csvreader)
    else:
        fieldnames = ['url_source','downloaded_at'] + csvreader.fieldnames  # add column names to beginning
        csvwriter = csv.DictWriter(outf, fieldnames)
        csvwriter.writeheader()
        for node, row in enumerate(csvreader, 1):
            csvwriter.writerow(dict(row, url_source=csvUrl, downloaded_at=today))
    return

Are you saying that the CSV contains empty values in the first rows and that therefore the auto-loading of fieldnames fails?
– Martijn Pieters, Commented Jul 27, 2014 at 22:24

Martijn Pieters · Accepted Answer · 2014-07-27 22:40:03Z

Your code does only one of two things: either it reads and discards (skips) one row and then returns, or it reads the whole file and copies it over to a new CSV. It never does both.

If you cannot count on the first row containing the header, then don't rely on auto-loading the DictReader() fieldnames from the file. Find the header manually, then pass those on to the DictReader() constructor.
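To see why auto-loading is the problem, here is a minimal sketch (Python 3, with made-up sample data in an in-memory file) showing that DictReader() simply takes the first row it reads as the field names, whether or not that row is the real header. The sample contents and variable names are assumptions for illustration only:

```python
import csv
import io

# Hypothetical sample: two buffer rows (each with an empty cell),
# followed by the real header and one data row.
sample = io.StringIO(
    "Report,,\n"
    ",generated 2014-07-27,\n"
    "name,age,city\n"
    "Ann,30,Oslo\n"
)

reader = csv.DictReader(sample)

# Accessing .fieldnames consumes the first row of the file,
# so the buffer row ends up being treated as the header.
fieldnames = reader.fieldnames
has_empty_header_cell = '' in fieldnames

print(fieldnames, has_empty_header_cell)
```

Here the first buffer row becomes the field names, so a check like `'' in csvreader.fieldnames` only ever inspects that one row.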

Open the CSV as a regular csv.reader() first, find the first row that is the actual header, then re-load the file as a csv.DictReader() with that row as the fieldnames explicitly:

with open(fileName, 'rb') as inf, open("out_"+fileName, 'wb') as outf:
    reader = csv.reader(inf)
    # find header row
    for row in reader:
        if '' not in row:
            fieldnames = row
            break
    else:
        # oops, *only* rows with empty cells found
        raise ValueError('Unable to determine header row')

    # rewind, switch to DictReader, skip past header
    inf.seek(0)
    reader = csv.DictReader(inf, fieldnames)
    for row in reader:
        if row.keys() == row.values():
            break

    # copy all rows across with extra two columns
    writer = csv.DictWriter(outf, ['url_source','downloaded_at'] + fieldnames)
    writer.writeheader()
    writer.writerows(dict(r, url_source=csvUrl, downloaded_at=today)
                     for r in reader)
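The code above targets Python 2 (binary file modes, list-returning keys() and values()). As a rough sketch of the same idea on Python 3, a single-pass variant is possible: once the plain csv.reader() has found the header, it is already positioned just past it, so the remaining rows can be copied without rewinding. This is a variation rather than the approach above. The function name copy_after_header, the sample data, and the example URL are hypothetical, and the extra `row and` check (so that a completely blank line is not mistaken for a header) is an added assumption. With real files on Python 3, open them in text mode with newline='' instead of 'rb' and 'wb'.

```python
import csv
import io


def copy_after_header(inf, outf, url_source, downloaded_at):
    """Copy the rows after the header, prepending two extra columns.

    The header is assumed to be the first non-blank row with no empty cells.
    """
    reader = csv.reader(inf)
    for row in reader:
        if row and '' not in row:
            fieldnames = row
            break
    else:
        # Only rows with empty cells (or no rows at all) were found.
        raise ValueError('Unable to determine header row')

    writer = csv.writer(outf)
    writer.writerow(['url_source', 'downloaded_at'] + fieldnames)
    # The reader is already just past the header, so no seek(0) is needed.
    for row in reader:
        writer.writerow([url_source, downloaded_at] + row)


# In-memory stand-ins for the input and output files.
source = io.StringIO("Report,,\n,,\nname,age\nAnn,30\nBob,\n")
target = io.StringIO()
copy_after_header(source, target, 'http://example.com/data.csv', '2014-07-27')
result = target.getvalue()
```

Rows after the header are copied even when they contain empty cells, matching the behaviour of the code above, which only skips rows that come before the header.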

edited Jul 27, 2014 at 22:40

answered Jul 27, 2014 at 22:32

Martijn Pieters

1 Comment

woodbine Over a year ago

Hi Martijn, thanks for this, I see exactly what I was doing wrong. Thanks very much for the excellent and fulsome answer.
