
Is there an efficient way in Python to load only specific rows from a huge CSV file into memory (for further processing) without burdening memory?
E.g., let's say I want to filter the rows for a specific date out of a file in the following format, and let's say this file is tens or hundreds of gigabytes (the dates are not ordered):

Date         event_type    country
2015/03/01   impression    US
2015/03/01   impression    US
2015/03/01   impression    CA
2015/03/01   click         CA
2015/03/02   impression    FR
2015/03/02   click         FR
2015/03/02   impression    US
2015/03/02   click         US
  • @Li-aungYip, can you answer? Commented Mar 21, 2016 at 13:26
  • How do you specify them? Commented Mar 21, 2016 at 13:29
  • You don't have to read the file into memory; you can filter as you iterate, but without some idea of what the data looks like and what you are filtering on, it is impossible to supply a working example. Commented Mar 21, 2016 at 13:29
  • Are the dates ordered? Commented Mar 21, 2016 at 14:00
  • No. The dates are not ordered. Commented Mar 21, 2016 at 14:03

4 Answers

This streams the file row by row, so only one line is held in memory at a time; here the filter drops rows whose country is listed:

import csv

filter_countries = {'US'}  # a set is enough; countries to filter out
with open('data.tsv') as f_name:
    for line in csv.DictReader(f_name, delimiter='\t'):
        if line['country'] not in filter_countries:
            print(line)
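
The same streaming idea applies to the question's date filter; a minimal sketch, assuming the file is space-delimited with the header shown in the question (the file name and date are placeholders):

import csv

target_date = '2015/03/01'  # placeholder filter value
with open('data.txt') as f:  # placeholder file name
    # DictReader maps each row to the header names, so the date column
    # can be looked up as row['Date'].
    for row in csv.DictReader(f, delimiter=' ', skipinitialspace=True):
        if row['Date'] == target_date:
            print(row)  # only the current row is held in memory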

You still need to process every row in the file in order to check your condition. However, you don't need to load the whole file into memory, so you can stream it as follows:

import csv

with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    for row in spamreader:
        if row[0] != '2015/03/01':  # skip rows for other dates
            continue

        # Process matching rows here

If you just need a list of the matched rows, it's even simpler to use a list comprehension:

import csv

with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    rows = [row for row in spamreader if row[0] == '2015/03/01']

3 Comments

Using a filter instead of a for-loop with an if can be much faster for bigger inputs.
@SergeiLebedev, of course. Especially if you need the list of the matched rows rather than computing aggregated values.
Actually, I was referring to the Python3 filter, which does not produce an intermediate list. The idea is to delegate as much iteration is possible to the C-side.
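
A minimal sketch of that suggestion, assuming the same file layout as in the answer above: Python 3's filter is lazy, so it yields matching rows one at a time without building an intermediate list.

import csv

with open('huge.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
    # filter() in Python 3 is lazy: rows are tested on demand,
    # keeping as much of the iteration as possible on the C side.
    for row in filter(lambda r: r[0] == '2015/03/01', spamreader):
        print(row)  # process each matched row here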

If the dates can appear anywhere, you will have to parse the whole file:

import csv

def get_rows(k, fle):
    with open(fle) as f:
        next(f)  # skip the header line
        for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
            if row[0] == k:
                yield row  # lazily yield only matching rows


for row in get_rows("2015/03/02", "in.txt"):
    print(row)

You could use multiprocessing to speed up the parsing by splitting the data into chunks. There are some ideas here.
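
A rough sketch of that idea, assuming line-based chunking and a single header line; the chunk size, file name, and target date are illustrative, not tuned:

import csv
from itertools import islice
from multiprocessing import Pool

TARGET = "2015/03/02"  # placeholder filter value

def filter_chunk(lines):
    """Parse a chunk of raw lines and keep only rows for TARGET."""
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    return [row for row in reader if row and row[0] == TARGET]

def chunks(f, size=100_000):
    """Yield lists of raw lines so each worker parses independently."""
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    with open("in.txt") as f, Pool() as pool:
        next(f)  # skip the header line
        for rows in pool.imap(filter_chunk, chunks(f)):
            for row in rows:
                print(row)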

1 Comment

You can relax the ordering requirement by using filter instead of dropwhile.
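
To unpack that comment: itertools.dropwhile helps only if the dates are sorted, since it skips rows until the first match and a following takewhile can stop at the end of the matching block; on unordered data you need filter, which tests every row. A minimal sketch of the contrast (function names are illustrative):

from itertools import dropwhile, takewhile

# Only valid if the file is sorted by date: skip the leading block of
# other dates, then stop as soon as the matching block ends.
def rows_if_sorted(reader, k):
    matched = dropwhile(lambda row: row[0] != k, reader)
    return takewhile(lambda row: row[0] == k, matched)

# Works on unordered dates: every row is tested.
def rows_unordered(reader, k):
    return filter(lambda row: row[0] == k, reader)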

I prefer a pandas-only approach to this, which lets you use all the features of read_csv(). This approach envisions a situation where you may need to filter on different dates at different times, so it is worth a little overhead to create a date registry that can be saved to disk and reused.

First, create a registry holding just the date data for your csv:

import pandas as pd

my_date_registry = pd.read_csv('data.csv', usecols=['Date'], engine='c')

(Note: in newer versions of pandas, you can use engine='pyarrow', which will be faster.)
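
Since the point of the registry is reuse, it can be persisted and reloaded on later runs; a minimal sketch (the registry file name is a placeholder):

my_date_registry.to_csv('date_registry.csv', index=False)  # save once
my_date_registry = pd.read_csv('date_registry.csv')        # reload on later runs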

There are two ways of using this registry with the skiprows parameter to filter out the rows you don't want. You may wish to experiment to see which one is faster for your specific data.

Option 1: Build a list of integer indexes

filter_date = '2017-03-09'

my_rows = my_date_registry['Date'] == filter_date   # boolean mask of matching rows
skip_rows = ~my_rows                                # invert: rows to drop
my_skip_indexes = my_date_registry[skip_rows].index
my_skip_list = [x + 1 for x in my_skip_indexes]     # +1 to account for the header row
my_selected_rows = pd.read_csv('data.csv', engine='c', skiprows=my_skip_list)

N.B. Since your data has a header row, you must add 1 to every index in my_skip_indexes to make up for the header row.

Option 2: Create a Callable function

filter_date = '2017-03-09'

my_rows = my_date_registry[my_date_registry['Date'] == filter_date]
my_row_indexes = my_rows.index
# Keep row 0 (the header) plus each matching row, shifted by 1 for the header.
my_keep_indexes = set([0] + [x + 1 for x in my_row_indexes])
my_filter = lambda x: x not in my_keep_indexes  # skiprows callable: True means skip
my_selected_rows = pd.read_csv('data.csv', engine='c', skiprows=my_filter)
