1

I am trying read three csv files and wants to put output in single csv file by making first column as ID so it should not repeat as it's common in all input csv files. I have written some code but it's giving errors. I am not sure this is best way to perform my task.

code:

#! /usr/bin/python
import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = ("ID")

for csvfile in ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv"):
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key) 
                result[id][key] = row[key]

    with open("out.csv", "w") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item]

input csv files are listed below:

FR1.1.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED

FR2.0.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.0 , COMPILE_FAILED , EXECUTION_FAILED

FR2.5.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

out.csv (required)-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED

thanks to post best method to achieve above result.

6
  • 2
    And what errors are you seeing? Please do include the full traceback. Commented Oct 11, 2013 at 6:53
  • 1
    And the indentation for your code sample is incorrect; presumably the second with statement is not indented that far? Commented Oct 11, 2013 at 6:53
  • 1
    Sorting the fieldnames is probably not what you wanted to do; and you should only add fieldnames for the first row of each CSV file. Commented Oct 11, 2013 at 6:54
  • @MartijnPieters Kindly requesting you to check my updated code and requirement and suggest me the way to achive this requirement. Commented Oct 23, 2013 at 9:21
  • Instead of expanding your question to cover new problems, ask a new question instead. That way far more people get to see it too. I've reverted your edit; this specific question has already been answered. Commented Oct 23, 2013 at 9:24

1 Answer 1

2

If you want to just join each CSV row based on ID, then don't use a DictReader. Dictionary keys must be unique, but you are producing rows with multiple EXECUTION_STATUS and RELEASE, etc. columns.

Moreover, how will you handle ids where one or two of the input CSV files has no input?

Use regular readers and store each row keyed by filename. Make fieldnames a list as well:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in sorted(result):
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This code stores rows per filename, so that you can later build a whole row from each file but still fill in blanks if the row was missing in that file.

The alternative would be to make column names unique by appending a number or filename to each; that way your DictReader approach could work too.

The above gives:

TEST_ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

If you need to base your order on one of the input files, then omit that input file from the first reading loop; instead, read that file while writing the output loop and use its first column to look up the other file data:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = []

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("FR1.1.csv", "rb") as infile, open("out.csv", "wb") as outfile:
    reader = csv.reader(infile)
    headers = next(reader, [])  # read first line, headers

    writer = csv.writer(outfile)
    writer.writerow(headers + fieldnames)

    for row in sorted(reader):
        data = result[row[0]]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This does mean that any TEST_ID values extra in the other two files are ignored.

If you wanted to preserve all TEST_IDs then I'd use collections.OrderedDict(); new TEST_IDs found in the later files will be tacked onto the end:

import csv
from collections import OrderedDict

result = OrderedDict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            if row[0] not in result:
                result[row[0]] = {}
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in result:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

The OrderedDict maintains entries in insertion order; so FR1.1.csv sets the order for all keys, but any FR2.0.csv ids not found in the first file are appended to the dictionary at the end, and so on.

For Python versions < 2.7, either install a backport (see OrderedDict for older versions of python) or track the ID order manually with:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]
ids, seen = [], set()

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            id_ = row[0]
            # track ordering
            if id_ not in seen:
                seen.add(id_)
                ids.append(id_)
            result[id_][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in ids:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
Sign up to request clarification or add additional context in comments.

15 Comments

@ Pieters i have tried your code header row is printed very much the way i want only fisrt value is wrong. It should be TEST_ID. All row values are not printed from all files, only one row is printed from each file. TEST_ID is also not printed in out.csv file. here is output of above code: .ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS <built-in function id>, FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED
@Pieters your edited code gives output correctly. Only the issue here is its printing ID as first column header it should be TEST_ID. It is not possible to print TEST_ID??
Also one problem is out file ID sequence is not matching with the infile ID. infile sequence is B_019, B_020 and B_021 but in out file it is printing it as B_021, B_019 and B_020. last row is printed first first row is in middle and second row is printed at last. can we keep the sequenec as input files have.Thanks!!!
@RamMore: My code writes ID because your original code did. Simply replace fieldnames = ['ID'] with the desired column name.
@RamMore: What should the order be? Sorted on id, or based on the order of one of the CSV input files? Note that the latter is tricky if ids are missing from one of the files; those could end up at the end of the file in that case.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.