
I'm handling CSV files that have buffer rows before the header. The number of buffer rows varies, and some of them contain strings while others don't. The only consistent thing is that every buffer row contains a null value in one or more of its cells, so I'm trying to skip any row with a null cell.

I've got the following script, but it outputs a blank file. I'm guessing that I never reach the 'else' branch, but I suspect that if I put it in a loop I'll end up creating a file for every row...

with open(fileName, 'rb') as inf, open("out_"+fileName, 'wb') as outf:
    csvreader = csv.DictReader(inf)

    if '' in csvreader.fieldnames:
        next(csvreader)
    else:
        fieldnames = ['url_source','downloaded_at'] + csvreader.fieldnames  # add column names to beginning
        csvwriter = csv.DictWriter(outf, fieldnames)
        csvwriter.writeheader()
        for node, row in enumerate(csvreader, 1):
            csvwriter.writerow(dict(row, url_source=csvUrl, downloaded_at=today))
    return
  • Are you saying that the CSV contains empty values in the first rows and that therefore the auto-loading of fieldnames fails? Commented Jul 27, 2014 at 22:24
  • Perhaps you could give a few lines of sample CSV files? Commented Jul 27, 2014 at 22:28

1 Answer


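Your code only ever does one of two things: either it reads and discards (skips) a single row and then returns, or it reads the whole file and copies it over to a new CSV. It never does both, so when the first row contains an empty cell nothing is written and the output file stays blank.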
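To make that concrete, here is a minimal sketch that mirrors the if/else from the question against in-memory data. It is written for Python 3 and uses `io.StringIO` in place of real files. The sample rows and the `original_logic` helper are made up for illustration, and the two extra columns are left out to keep it short.

```python
import csv
import io

# Hypothetical sample data: one buffer row with empty cells, then the real header.
with_buffer = "report,,\nname,url,count\nfoo,http://example.com,3\n"
without_buffer = "name,url,count\nfoo,http://example.com,3\n"

def original_logic(text):
    """Mirror the if/else from the question and return what would be written."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    if '' in reader.fieldnames:
        # Discards exactly one data row; nothing is ever written afterwards.
        next(reader)
    else:
        writer = csv.DictWriter(out, reader.fieldnames)
        writer.writeheader()
        writer.writerows(reader)
    return out.getvalue()

print(repr(original_logic(with_buffer)))     # empty string: the blank output file
print(repr(original_logic(without_buffer)))  # header plus the data row
```

The first call takes the `if` branch because `DictReader` treats the buffer row as the header, so the writer is never created.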

If you cannot count on the first row containing the header, then don't rely on auto-loading the DictReader() fieldnames from the file. Find the header manually, then pass those on to the DictReader() constructor.

Open the CSV as a regular csv.reader() first, find the first row that is the actual header, then re-load the file as a csv.DictReader() with that row as the fieldnames explicitly:

import csv

with open(fileName, 'rb') as inf, open("out_"+fileName, 'wb') as outf:
    reader = csv.reader(inf)
    # find header row
    for row in reader:
        if '' not in row:
            fieldnames = row
            break
    else:
        # oops, *only* rows with empty cells found
        raise ValueError('Unable to determine header row')

    # rewind, switch to DictReader, skip past header
    inf.seek(0)
    reader = csv.DictReader(inf, fieldnames)
    for row in reader:
        # the header row is the one where every key maps to itself
        if list(row.keys()) == list(row.values()):
            break

    # copy all rows across with extra two columns
    writer = csv.DictWriter(outf, ['url_source','downloaded_at'] + fieldnames)
    writer.writeheader()
    writer.writerows(dict(r, url_source=csvUrl, downloaded_at=today)
                     for r in reader)
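
If rewinding the file is not an option (for example when reading from a stream), the same idea can be sketched as a single pass with plain `csv.reader` and `csv.writer` instead of the `DictReader`/`DictWriter` pair above. This is a Python 3 variant under a few assumptions: `copy_with_extra_columns` is a hypothetical helper name, the sample data is made up, and completely blank lines are also treated as buffer rows (the `row and` check), which the code above does not do.

```python
import csv
import io

def copy_with_extra_columns(inf, outf, url_source, downloaded_at):
    """Skip leading rows that have empty cells, then copy the rest
    with two extra columns prepended (hypothetical helper)."""
    reader = csv.reader(inf)
    # find header row: the first non-blank row without an empty cell
    for row in reader:
        if row and '' not in row:
            fieldnames = row
            break
    else:
        # only rows with empty cells were found
        raise ValueError('Unable to determine header row')

    writer = csv.writer(outf)
    writer.writerow(['url_source', 'downloaded_at'] + fieldnames)
    # the reader is already positioned just past the header, so no rewind is needed
    for row in reader:
        writer.writerow([url_source, downloaded_at] + row)

# Usage with made-up in-memory data standing in for the real files
src = io.StringIO("report,,\n,generated,\nname,count\nfoo,3\n")
dst = io.StringIO()
copy_with_extra_columns(src, dst, 'http://example.com/data.csv', '2014-07-27')
print(dst.getvalue())
```

With real files in Python 3, open both in text mode with `newline=''` rather than `'rb'`/`'wb'`, as the `csv` module documentation recommends.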

1 Comment

Hi Martijn, thanks for this, I see exactly what I was doing wrong. Thanks very much for the excellent and fulsome answer.
