
Is there an efficient way in Python to load only specific rows from a huge CSV file into memory (for further processing) without burdening memory?
E.g., let's say I want to filter the rows for a specific date out of a file in the following format, and let's say this file is tens or hundreds of gigabytes (the dates are not ordered):

Date         event_type    country
2015/03/01   impression    US
2015/03/01   impression    US
2015/03/01   impression    CA
2015/03/01   click         CA
2015/03/02   impression    FR
2015/03/02   click         FR
2015/03/02   impression    US
2015/03/02   click         US
  • @Li-aungYip, can you answer? Commented Mar 21, 2016 at 13:26
  • How do you specify them? Commented Mar 21, 2016 at 13:29
  • You don't have to read the file into memory; you can filter as you iterate, but without some idea of what the data looks like and what you are filtering on, it is impossible to supply a working example. Commented Mar 21, 2016 at 13:29
  • Are the dates ordered? Commented Mar 21, 2016 at 14:00
  • No. The dates are not ordered. Commented Mar 21, 2016 at 14:03

4 Answers

This streams the file row by row, so only one line is held in memory at a time; here the filter drops rows whose country is listed:

import csv

filter_countries = {'US'}  # a set is enough; countries to filter out
with open('data.tsv') as f_name:
    for line in csv.DictReader(f_name, delimiter='\t'):
        if line['country'] not in filter_countries:
            print(line)
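
The same streaming idea applies to the question's date filter; a minimal sketch, assuming the file is space-delimited with the header shown in the question (the file name and date are placeholders):

import csv

target_date = '2015/03/01'  # placeholder filter value
with open('data.txt') as f:  # placeholder file name
    # DictReader maps each row to the header names, so the date column
    # can be looked up as row['Date'].
    for row in csv.DictReader(f, delimiter=' ', skipinitialspace=True):
        if row['Date'] == target_date:
            print(row)  # only the current row is held in memory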

You still need to process every row in the file in order to check your condition. However, you don't need to load the whole file into memory, so you can stream it as follows:

import csv

with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in spamreader:
        if row[0] != '2015/03/01':  # skip rows for other dates
            continue

        # Process matching rows here

If you just need a list of the matched rows, it's even simpler to use a list comprehension:

import csv

with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    rows = [row for row in spamreader if row[0] == '2015/03/01']

3 Comments

Using a filter instead of a for-loop with an if can be much faster for bigger inputs.
@SergeiLebedev, of course. Especially if you need the list of the matched rows rather than computing aggregated values.
Actually, I was referring to the Python3 filter, which does not produce an intermediate list. The idea is to delegate as much iteration is possible to the C-side.
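
A minimal sketch of that suggestion, assuming the same file layout as in the answer above: Python 3's filter is lazy, so it yields matching rows one at a time without building an intermediate list.

import csv

with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    # filter() in Python 3 is lazy: rows are tested on demand,
    # keeping as much of the iteration as possible on the C side.
    for row in filter(lambda r: r[0] == '2015/03/01', spamreader):
        print(row)  # process each matched row here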

If the dates can appear anywhere, you will have to parse the whole file:

import csv

def get_rows(k, fle):
    with open(fle) as f:
        next(f)  # skip the header line
        for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
            if row[0] == k:
                yield row  # lazily yield only matching rows


for row in get_rows("2015/03/02", "in.txt"):
    print(row)

You could use multiprocessing to speed up the parsing by splitting the data into chunks. There are some ideas here.
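
A rough sketch of that idea, assuming line-based chunking and a single header line; the chunk size, file name, and target date are illustrative, not tuned:

import csv
from itertools import islice
from multiprocessing import Pool

TARGET = "2015/03/02"  # placeholder filter value

def filter_chunk(lines):
    """Parse a chunk of raw lines and keep only rows for TARGET."""
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    return [row for row in reader if row and row[0] == TARGET]

def chunks(f, size=100_000):
    """Yield lists of raw lines so each worker parses independently."""
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    with open("in.txt") as f, Pool() as pool:
        next(f)  # skip the header line
        for rows in pool.imap(filter_chunk, chunks(f)):
            for row in rows:
                print(row)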

1 Comment

You can relax the ordering requirement by using filter instead of dropwhile.
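
To unpack that comment: itertools.dropwhile helps only if the dates are sorted, since it skips rows until the first match and a following takewhile can stop at the end of the matching block; on unordered data you need filter, which tests every row. A minimal sketch of the contrast (function names are illustrative):

from itertools import dropwhile, takewhile

# Only valid if the file is sorted by date: skip the leading block of
# other dates, then stop as soon as the matching block ends.
def rows_if_sorted(reader, k):
    matched = dropwhile(lambda row: row[0] != k, reader)
    return takewhile(lambda row: row[0] == k, matched)

# Works on unordered dates: every row is tested.
def rows_unordered(reader, k):
    return filter(lambda row: row[0] == k, reader)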

I prefer a pandas-only approach to this, which lets you use all the features of read_csv(). This approach envisions a situation where you may need to filter on different dates at different times, so it is worth a little overhead to create a date registry that can be saved to disk and reused.

First, create a registry holding just the date data for your csv:

import pandas as pd

my_date_registry = pd.read_csv('data.csv', usecols=['Date'], engine='c')

(Note: in newer versions of pandas, you can use engine='pyarrow', which will be faster.)
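
Since the point of the registry is reuse, it can be persisted and reloaded on later runs; a minimal sketch (the registry file name is a placeholder):

my_date_registry.to_csv('date_registry.csv', index=False)  # save once
my_date_registry = pd.read_csv('date_registry.csv')        # reload on later runs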

There are two ways of using this registry with the skiprows parameter to filter out the rows you don't want. You may wish to experiment to see which one is faster for your specific data.

Option 1: Build a list of integer indexes

filter_date = '2017-03-09'

my_rows = my_date_registry['Date'] == filter_date   # boolean mask of matching rows
skip_rows = ~my_rows                                # invert: rows to drop
my_skip_indexes = my_date_registry[skip_rows].index
my_skip_list = [x + 1 for x in my_skip_indexes]     # +1 to account for the header row
my_selected_rows = pd.read_csv('data.csv', engine='c', skiprows=my_skip_list)

N.B. Since your data has a header row, you must add 1 to every index in my_skip_indexes to make up for the header row.

Option 2: Create a Callable function

filter_date = '2017-03-09'

my_rows = my_date_registry[my_date_registry['Date'] == filter_date]
my_row_indexes = my_rows.index
# Keep row 0 (the header) plus each matching row, shifted by 1 for the header.
my_keep_indexes = set([0] + [x + 1 for x in my_row_indexes])
my_filter = lambda x: x not in my_keep_indexes  # skiprows callable: True means skip
my_selected_rows = pd.read_csv('data.csv', engine='c', skiprows=my_filter)
