-1

I have multiple .csv files of various departments, with same header (Name, Full Address, Department ), Full address is a long string around 20 - 50 words , it includes all details including city. Each of my .CSV file contain around 100K rows, and i have around 400 such files in same folder.

What i want to do is that i want to create a separate csv file for all the employees (after reading all csv files) who lives in New York, (the resultant file should contain Name, full address and department) manually it is not possible but i know it is possible in python.

Can anyone help me out with this?

4
  • have you already created a dataframe ? if you need to parse out the city please add a sample of your data please read minimal reproducible example Commented Jun 10, 2020 at 15:29
  • I have not created the dataframe or any code yet. m just a beginner. Commented Jun 10, 2020 at 15:34
  • One possible way: generate list of csv files in directory. Create empty list y. Iterate over list of files. Read files line by line using csv module (stackoverflow.com/a/13428432/503835). Check if string "new york" exists in line, if not a duplicate, store in y. After done reading convert to dataframe and save to csv. Commented Jun 10, 2020 at 15:34
  • @eNc I have to consider the every file is of around 400 MB, and there are 400 such files.. so reading all one by one , will it crash my system? Commented Jun 10, 2020 at 15:36

1 Answer 1

1

As suggested, due to the size you probably should just parse all the data a file at a time and write out any matching rows to a new CSV file. This approach does not require whole files to be loaded and so will work with any size.

This can be done as follows:

import csv
import glob

write_header = True
output_csv = 'new york.csv'     # assume this is not already used

with open(output_csv, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    for csv_filename in glob.glob('*.csv'):
        if csv_filename != output_csv:
            with open(csv_filename) as f_input:
                csv_input = csv.reader(f_input)
                header = next(csv_input)

                if write_header:
                    csv_output.writerow(header)
                    write_header = False

                for row in csv_input:
                    if "new york" in row[1].lower():
                        csv_output.writerow(row)

It should produce an output new york.csv file containing any rows where the address column contained the word new york. Obviously you might need to improve this test to cover badly formed addresses.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.