
I have three CSV files with the attributes Product_ID, Name, Cost, and Description. Each file contains Product_ID. I want to combine Name (file 1), Cost (file 2), and Description (file 3) into a new CSV file with Product_ID and all three of those attributes. I need efficient code, as the files contain over 130,000 rows.

After combining all the data into the new file, I have to load it into a dictionary, with Product_ID as the key and Name, Cost, and Description as the value.

  • And what have you tried so far? Show us your code, so we might be able to help you better. Commented Apr 8, 2016 at 21:46
  • All I have tried is to combine the data from the three files into a dictionary and then write it out, but I am getting an error. In the code below I read one file into a dictionary with row[1] as the key and row[2], row[3] as the value, but I am not able to append another file to the same dictionary. with open('train_1.csv', 'r', encoding="utf8") as file: text_file = csv.reader(file) next(text_file) for rows in text_file: maindict[rows[1]] = rows[2], rows[3] Commented Apr 8, 2016 at 21:56
  • @Sameer May want to edit your question with that code, comments aren't exactly easy on the eyes. Commented Apr 8, 2016 at 22:01
  • I am taking this approach for feature extraction; after that I have to apply Multinomial Naive Bayes. Although I have no experience with this method, I am learning it. Commented Apr 8, 2016 at 22:01
  • I don't know how to add a new line in comments. Commented Apr 8, 2016 at 22:04

2 Answers


It might be more efficient to read each input .csv into a dictionary before creating your aggregated result.

Here's a solution for reading in each file and storing the columns in a dictionary with Product_IDs as the keys. I assume that each Product_ID value exists in each file and that headers are included. I also assume that there are no duplicate columns across the files aside from Product_ID.

import csv
from collections import defaultdict

entries = defaultdict(list)
files = ['names.csv', 'costs.csv', 'descriptions.csv']
headers = ['Product_ID']

for filename in files:
   with open(filename, newline='') as f:   # Open each file in files (Python 3).
      reader = csv.reader(f)            # Create a reader to iterate csv lines
      heads = next(reader)              # Grab first line (headers)

      pk = heads.index(headers[0])      # Get the position of 'Product_ID' in
                                        # the list of headers
      # Add the rest of the headers to the list of collected columns (skip 'Product_ID')
      headers.extend([x for i,x in enumerate(heads) if i != pk])

      for row in reader:
         # For each line, add new values (except 'Product_ID') to the
         # entries dict with the line's Product_ID value as the key
         entries[row[pk]].extend([x for i,x in enumerate(row) if i != pk])

with open('result.csv', 'w', newline='') as out:   # Open file to write csv lines
   writer = csv.writer(out)
   writer.writerow(headers)                        # Write the headers first
   for key, value in entries.items():
      # Write the product ID concatenated with the other values
      writer.writerow([key] + value)
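The question also asks for the merged data to be loaded into a dictionary keyed by Product_ID. A minimal sketch of that step follows, assuming the merged file has a header row containing 'Product_ID' and is UTF-8 encoded. The function name load_products is hypothetical and not part of the answer above.

```python
import csv

def load_products(path):
    """Map each Product_ID to a tuple of the remaining column values."""
    products = {}
    with open(path, newline='', encoding='utf8') as f:
        reader = csv.reader(f)
        heads = next(reader)              # Header row
        pk = heads.index('Product_ID')    # Position of the key column
        for row in reader:
            if not row:
                continue                  # Skip blank lines
            # Everything except the key column becomes the value
            products[row[pk]] = tuple(x for i, x in enumerate(row) if i != pk)
    return products
```

Calling load_products('result.csv') would then give a dictionary where each value is a tuple in header order, for example (Name, Cost, Description) if the files were merged in that order. If a Product_ID appears more than once, the last row wins.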

7 Comments

If I want to append more than one column from a CSV, the above code won't work. Suppose names.csv contains Product_ID, Names, Tags. What if I want to append both row[1] and row[2]?
You didn't include much information about your csv columns. I assumed that there was no other data included with them. You can read in the headers from the first line, rather than skipping them, in order to find the correct row indices for the key and the value to append. To clarify, you want every column from each file added, with the product ID as their key?
I've edited my answer to include every column from each file.
Thanks for the help, I will look into the code you have provided. Will comment further if needed.
With the above code, I am getting an error: heads = reader.next() AttributeError: '_csv.reader' object has no attribute 'next'

A general solution that produces a record, possibly incomplete, for each id it encounters while processing the three files needs a data structure with a preassigned number of slots. Fortunately, a plain list is enough:

d = {id:[name,None,None] for id, name in [line.strip().split(',') for line in open(fn1)]}
for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id][1] = cost
    else:
        d[id] = [None, cost, None]
for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id][2] = desc
    else:
        d[id] = [None, None, desc]

for id in d:
    if all(d[id]):
        print(','.join([id] + d[id]))
    else:
        # Incomplete info for this id: you have to decide on your own
        # what you want to do with it; here it is simply skipped.
        pass

If you are sure that you don't want to further process incomplete records, the code above can be simplified:

d = {id:[name] for id, name in [line.strip().split(',') for line in open(fn1)]}
for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d: d[id].append(cost)
for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d: d[id].append(desc)

for id in d:
    if len(d[id])==3: print(','.join([id]+d[id]))
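One caveat with line.strip().split(','): it breaks as soon as a field, such as a description, contains a comma. The sketch below keeps the same fixed-slot idea but parses with csv.reader, which handles quoted fields. It assumes, like the code above, that each file has exactly two columns (id, value) and no header row. The names merge_by_id and complete_records are hypothetical.

```python
import csv

def merge_by_id(paths):
    """Merge two-column (id, value) CSV files into {id: [v1, v2, ...]}.

    Slot k holds the value from paths[k], or None if that file
    had no row for the id.
    """
    merged = {}
    for slot, path in enumerate(paths):
        with open(path, newline='') as f:
            for row in csv.reader(f):
                if len(row) != 2:
                    continue              # Skip blank or malformed lines
                key, value = row
                # Create the preassigned slots on first sight of this id
                merged.setdefault(key, [None] * len(paths))[slot] = value
    return merged

def complete_records(merged):
    """Yield [id, v1, v2, ...] only for ids with a value in every slot."""
    for key, values in merged.items():
        if all(v is not None for v in values):
            yield [key] + values
```

With fn1, fn2 and fn3 as above, merge_by_id([fn1, fn2, fn3]) builds the dictionary and complete_records filters out the ids that are missing from at least one file. Unlike all(d[id]), the check here treats an empty string as a present value, which is a design choice you may want to change.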

3 Comments

@gboffi, I will look into the code today. Thanks for the help.
Could you please check this question out? stackoverflow.com/questions/54192260/…
@Barbie I've checked that question of yours but I have no working knowledge of pandas and I do not clearly understand the issue, so I'm afraid that I cannot help you, sorry...
