Python script to read three CSV files and write them to one CSV file

Question

I am trying to read three CSV files and write the output to a single CSV file. The first column is an ID that is common to all the input files, so it should appear only once in the output. I have written some code, but it gives errors, and I am not sure this is the best way to perform the task.

code:

#! /usr/bin/python
import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = ("ID")

for csvfile in ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv"):
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key) 
                result[id][key] = row[key]

    with open("out.csv", "w") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item]

input csv files are listed below:

FR1.1.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED

FR2.0.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.0 , COMPILE_FAILED , EXECUTION_FAILED

FR2.5.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

out.csv (required)-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

Please suggest the best way to achieve the result above. Thanks.

And what errors are you seeing? Please do include the full traceback.
– Martijn Pieters, Commented Oct 11, 2013 at 6:53
And the indentation for your code sample is incorrect; presumably the second with statement is not indented that far?
– Martijn Pieters, Commented Oct 11, 2013 at 6:53
Sorting the fieldnames is probably not what you wanted to do; and you should only add fieldnames for the first row of each CSV file.
– Martijn Pieters, Commented Oct 11, 2013 at 6:54
@MartijnPieters Could you please check my updated code and requirement and suggest a way to achieve it?
– Ram More, Commented Oct 23, 2013 at 9:21
Instead of expanding your question to cover new problems, ask a new question. That way far more people get to see it too. I've reverted your edit; this specific question has already been answered.
– Martijn Pieters, Commented Oct 23, 2013 at 9:24

Community · Accepted Answer · 2017-05-23 12:05:14Z

2

If you want to just join each CSV row based on ID, then don't use a DictReader. Dictionary keys must be unique, but you are producing rows with multiple EXECUTION_STATUS and RELEASE, etc. columns.
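A small demonstration of that collision, as a sketch in Python 3 with an invented in-memory row (the sample data is an assumption, not taken from the question's files):

```python
import csv
import io

# Hypothetical merged row in which the RELEASE column appears twice.
merged = io.StringIO("TEST_ID,RELEASE,RELEASE\nFC/B_019.config,FR1.1,FR2.0\n")

row = next(csv.DictReader(merged))

# Only one RELEASE key survives: the later column overwrites the earlier one.
print(sorted(row))
print(row["RELEASE"])
```

The first RELEASE value is silently lost, which is why a dictionary per row cannot hold the repeated columns the desired output needs.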

Moreover, how will you handle IDs for which one or two of the input CSV files have no row?

Use regular readers and store each row keyed by filename. Make fieldnames a list as well:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in sorted(result):
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This code stores rows per filename, so that you can later build a whole row from each file but still fill in blanks if the row was missing in that file.

The alternative would be to make column names unique by appending a number or filename to each; that way your DictReader approach could work too.

The above gives:

TEST_ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED
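The DictReader alternative mentioned earlier, with column names made unique by prefixing a per-file label, might look roughly like the sketch below. It is written for Python 3 and uses in-memory sample data; the function name `merge_with_unique_columns`, the `(label, file)` calling convention, and the `label_column` naming scheme are all assumptions made for illustration. It also assumes each source is non-empty and keeps its ID in the first column.

```python
import csv
import io
from collections import OrderedDict


def merge_with_unique_columns(sources):
    """Merge (label, file-like) CSV sources into one dict per ID."""
    fieldnames = ["TEST_ID"]
    merged = OrderedDict()  # ID -> combined row, in first-seen order
    for label, handle in sources:
        reader = csv.DictReader(handle)
        id_name = reader.fieldnames[0]  # first column holds the ID
        value_names = reader.fieldnames[1:]
        for name in value_names:
            fieldnames.append("%s_%s" % (label, name))
        for row in reader:
            out = merged.setdefault(row[id_name], {"TEST_ID": row[id_name]})
            for name in value_names:
                out["%s_%s" % (label, name)] = row[name]
    return fieldnames, list(merged.values())


sources = [
    ("FR1.1", io.StringIO(
        "TEST_ID,RELEASE,COMPILE_STATUS\n"
        "FC/B_019.config,FR1.1,COMPILE_PASSED\n")),
    ("FR2.0", io.StringIO(
        "TEST_ID,RELEASE,COMPILE_STATUS\n"
        "FC/B_019.config,FR2.0,COMPILE_PASSED\n"
        "FC/B_020.config,FR2.0,COMPILE_FAILED\n")),
]
fieldnames, rows = merge_with_unique_columns(sources)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames, restval="")  # restval fills gaps
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because every column name is now unique, `DictWriter` can fill in missing cells through `restval`, so no separate length bookkeeping is needed. The trade-off is that the header no longer repeats the plain column names shown in the desired output.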

If you need to base your order on one of the input files, then omit that input file from the first reading loop; instead, read that file while writing the output loop and use its first column to look up the other file data:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = []

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("FR1.1.csv", "rb") as infile, open("out.csv", "wb") as outfile:
    reader = csv.reader(infile)
    headers = next(reader, [])  # read first line, headers

    writer = csv.writer(outfile)
    writer.writerow(headers + fieldnames)

    for row in reader:  # iterate in the order of FR1.1.csv itself
        data = result[row[0]]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This does mean that any TEST_ID values extra in the other two files are ignored.
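To see beforehand which IDs would be dropped this way, one could compare the first columns of the files. A minimal sketch in Python 3 with invented in-memory data; `first_column` is a hypothetical helper, not part of the csv module:

```python
import csv
import io


def first_column(handle):
    """Return the set of first-column values, skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # discard the header
    return {row[0] for row in reader if row}


base = io.StringIO(
    "TEST_ID,RELEASE\nFC/B_019.config,FR1.1\nFC/B_020.config,FR1.1\n")
other = io.StringIO(
    "TEST_ID,RELEASE\nFC/B_020.config,FR2.0\nFC/B_021.config,FR2.0\n")

# IDs present in the other file but absent from the file that sets the order.
dropped = first_column(other) - first_column(base)
print(sorted(dropped))
```

An empty result means the ordering file covers every ID and nothing is lost.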

If you wanted to preserve all TEST_IDs then I'd use collections.OrderedDict(); new TEST_IDs found in the later files will be tacked onto the end:

import csv
from collections import OrderedDict

result = OrderedDict()  # takes no default factory; entries are created explicitly below
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            if row[0] not in result:
                result[row[0]] = {}
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in result:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

The OrderedDict maintains entries in insertion order; so FR1.1.csv sets the order for all keys, but any FR2.0.csv ids not found in the first file are appended to the dictionary at the end, and so on.
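The ordering behaviour can be checked in isolation. The sketch below replaces the file reading with hypothetical ID lists (an assumption for the sake of a short example) and records only which file each ID was seen in:

```python
from collections import OrderedDict

# Hypothetical first-column values, in the order each file lists them.
ids_per_file = [
    ("FR1.1.csv", ["FC/B_020.config", "FC/B_019.config"]),
    ("FR2.0.csv", ["FC/B_021.config", "FC/B_019.config"]),
]

result = OrderedDict()
for filename, ids in ids_per_file:
    for id_ in ids:
        if id_ not in result:
            result[id_] = {}
        result[id_][filename] = True  # stand-in for the row data

order = list(result)
print(order)
```

The two IDs from the first file keep their original positions, and FC/B_021.config, first seen in the second file, lands at the end.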

For Python versions < 2.7, either install a backport (see OrderedDict for older versions of python) or track the ID order manually with:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]
ids, seen = [], set()

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            id_ = row[0]
            # track ordering
            if id_ not in seen:
                seen.add(id_)
                ids.append(id_)
            result[id_][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in ids:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
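The code in this answer targets Python 2, where CSV files are opened in binary mode ('rb' and 'wb'). On Python 3 the csv module expects text-mode files opened with newline=''. A sketch of the manual-ordering variant under that assumption follows; the function name `merge_csv_files`, its parameters, and the throwaway sample files are invented for the example.

```python
import csv
import os
import tempfile
from collections import defaultdict


def merge_csv_files(filenames, out_path, id_header="TEST_ID"):
    """Join rows from several CSV files on their first column (Python 3)."""
    result = defaultdict(dict)
    lengths = {}
    fieldnames = [id_header]
    ids, seen = [], set()

    for csvfile in filenames:
        # Python 3: text mode with newline="" replaces the 'rb' mode.
        with open(csvfile, newline="") as infile:
            reader = csv.reader(infile)
            headers = next(reader, [])
            fieldnames.extend(headers[1:])
            lengths[csvfile] = max(len(headers) - 1, 0)
            for row in reader:
                if not row:
                    continue  # skip blank lines
                id_ = row[0]
                if id_ not in seen:  # remember first-seen order
                    seen.add(id_)
                    ids.append(id_)
                result[id_][csvfile] = row[1:]

    with open(out_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(fieldnames)
        for id_ in ids:
            row = [id_]
            for filename in filenames:
                # Backfill with empty cells when a file lacks this ID.
                row.extend(result[id_].get(filename) or [""] * lengths[filename])
            writer.writerow(row)


# Usage with two small throwaway files (sample data invented for the demo).
workdir = tempfile.mkdtemp()
samples = [
    ("a.csv", "TEST_ID,RELEASE\nFC/B_019.config,FR1.1\n"),
    ("b.csv", "TEST_ID,RELEASE\nFC/B_020.config,FR2.0\n"),
]
paths = []
for name, text in samples:
    path = os.path.join(workdir, name)
    with open(path, "w", newline="") as handle:
        handle.write(text)
    paths.append(path)

out_path = os.path.join(workdir, "out.csv")
merge_csv_files(paths, out_path)
with open(out_path, newline="") as handle:
    merged_rows = list(csv.reader(handle))
print(merged_rows)
```

Each ID appears once, in first-seen order, with empty cells where a file has no row for it.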

edited May 23, 2017 at 12:05

CommunityBot


answered Oct 11, 2013 at 7:07

Martijn Pieters


15 Comments

Ram More Over a year ago

@Pieters I tried your code. The header row is printed almost the way I want; only the first value is wrong: it should be TEST_ID. Not all rows from the files are printed, only one row from each file, and the TEST_ID value is not printed in out.csv either. Here is the output of the above code:
.ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
<built-in function id>, FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

Ram More Over a year ago

@Pieters Your edited code gives the correct output. The only issue is that it prints ID as the first column header; it should be TEST_ID. Is it not possible to print TEST_ID?

Ram More Over a year ago

Another problem is that the ID sequence in the output file does not match the input files. The input sequence is B_019, B_020, B_021, but the output file prints B_021, B_019, B_020: the last row comes first, the first row is in the middle, and the second row is last. Can we keep the sequence that the input files have? Thanks!

Martijn Pieters Over a year ago

@RamMore: My code writes ID because your original code did. Simply replace fieldnames = ['ID'] with the desired column name.

Martijn Pieters Over a year ago

@RamMore: What should the order be? Sorted on id, or based on the order of one of the CSV input files? Note that the latter is tricky if ids are missing from one of the files; those could end up at the end of the file in that case.
