Sorting CSV in Python

Question

I assumed sorting a CSV file on multiple text/numeric fields using Python would be a problem that was already solved. But I can't find any example code anywhere, except for specific code focusing on sorting date fields.

How would one go about sorting a relatively large CSV file (tens of thousands of lines) on multiple fields, in order?

Python code samples would be appreciated.

Alex Martelli · Accepted Answer · 2010-01-18 21:04:20Z

10

Python's sort works in-memory only; however, tens of thousands of lines should fit in memory easily on a modern machine. So:

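import csv
import operator  # needed for itemgetter below

def sortcsvbymanyfields(csvfilename, themanyfieldscolumnnumbers):
  # 'rb'/'wb' are the Python 2 idiom for csv files;
  # on Python 3 use open(csvfilename, newline='') and open(csvfilename, 'w', newline='')
  with open(csvfilename, 'rb') as f:
    readit = csv.reader(f)
    thedata = list(readit)
  # itemgetter(*cols) builds a tuple key, so rows compare column by column, in order
  thedata.sort(key=operator.itemgetter(*themanyfieldscolumnnumbers))
  with open(csvfilename, 'wb') as f:
    writeit = csv.writer(f)
    writeit.writerows(thedata)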
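
For readers on Python 3, the same approach can be sketched as follows. This is an adaptation, not part of the original answer: the name `sort_csv_py3` is made up for illustration, and every field is still compared as text.

```python
import csv
import operator

def sort_csv_py3(csv_filename, key_columns):
    """Sort a CSV file in place by the given column indexes (text comparison)."""
    # newline='' leaves line-ending handling to the csv module (Python 3 idiom).
    with open(csv_filename, newline='') as f:
        rows = list(csv.reader(f))
    # Tuple key: rows are ordered by the first listed column, then the next, and so on.
    rows.sort(key=operator.itemgetter(*key_columns))
    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
```

Calling `sort_csv_py3('myfile.csv', (0, 1))` would sort by the first column and break ties with the second.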

edited Jan 18, 2010 at 21:04

answered Jan 18, 2010 at 20:45

Alex Martelli


5 Comments

jcdyer Over a year ago

This is why I need to spend a weekend (or week) reviewing the standard library reference. itemgetter looks sweet.

Pranab Over a year ago

I also loved the answer at stackoverflow.com/questions/1143671/…

John Machin Over a year ago

This doesn't address the OP's "multiple text/numeric fields" requirement; it treats all fields as text.

Alex Martelli Over a year ago

@John, if some fields need to be treated differently (e.g. subjected to transformations such as multiple disparate type coercions) before the sorting is done, that's not hard to arrange, but there's just not enough detail in the Q about how such a potentially complicated issue would be specified by function arguments (no doubt worthy of a separate Q, as it can be important quite apart from sorting!). If you want that info, why not open a Q yourself?

Mr_and_Mrs_D Over a year ago

Btw usually the first line of the csv is a header - be careful to omit that from sorting
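
A minimal sketch of the header caveat above, assuming Python 3 and a file whose first row is a header. The function name `sort_csv_keep_header` is hypothetical; the header row is set aside before sorting and written back first.

```python
import csv
import operator

def sort_csv_keep_header(csv_filename, key_columns):
    """Sort the data rows of a CSV file in place, leaving the header row on top."""
    with open(csv_filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader, None)  # None if the file is empty
        rows = list(reader)          # everything after the header
    rows.sort(key=operator.itemgetter(*key_columns))
    with open(csv_filename, 'w', newline='') as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
```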

Robert Rossney · Answer · 2010-01-19 09:22:50Z

4

Here's Alex's answer, reworked to support column data types:

import csv
import operator

def sort_csv(csv_filename, types, sort_key_columns):
    """sort (and rewrite) a csv file.
    types:  data types (conversion functions) for each column in the file
    sort_key_columns: column numbers of columns to sort by"""
    data = []
    with open(csv_filename, 'rb') as f:
        for row in csv.reader(f):
            data.append(convert(types, row))
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'wb') as f:
        csv.writer(f).writerows(data)

Edit:

I did a stupid. I was playing with various things in IDLE and wrote a convert function a couple of days ago. I forgot I'd written it, and I haven't closed IDLE in a good long while - so when I wrote the above, I thought convert was a built-in function. Sadly no.

Here's my implementation, though John Machin's is nicer:

def convert(types, values):
    return [t(v) for t, v in zip(types, values)]

Usage:

import datetime
def date(s):
    # strptime lives on the datetime class, not on the datetime module
    return datetime.datetime.strptime(s, '%m/%d/%y')

>>> convert((int, date, str), ('1', '2/15/09', 'z'))
[1, datetime.datetime(2009, 2, 15, 0, 0), 'z']
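
One thing to watch with the approach above: the converted values are what get written back, so a date column would come out in `str(datetime)` form rather than in its original text. A possible variant, sketched here for Python 3 with hypothetical names (`sort_csv_typed`, `parse_date`), applies the conversions only inside the sort key and writes the original strings back unchanged.

```python
import csv
import datetime

def parse_date(s):
    # Same format as the date() helper above, e.g. '2/15/09'.
    return datetime.datetime.strptime(s, '%m/%d/%y')

def sort_csv_typed(csv_filename, types, sort_key_columns):
    """Sort a CSV file in place, comparing key columns as converted values."""
    def sort_key(row):
        # Convert only the key columns, and only for comparison purposes.
        return tuple(types[i](row[i]) for i in sort_key_columns)

    with open(csv_filename, newline='') as f:
        rows = list(csv.reader(f))
    rows.sort(key=sort_key)
    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(rows)  # rows still hold the original text
```

With `types=(int, parse_date, str)` and `sort_key_columns=(0, 1)`, a row starting with `9` sorts before one starting with `10`, which a plain text sort would get the other way round.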

edited Jan 19, 2010 at 9:22

answered Jan 19, 2010 at 0:18

Robert Rossney


4 Comments

Pranab Over a year ago

What is the convert() function? Also, are the second and third argument lists?

Pranab Over a year ago

sort_csv('myfile.csv', [?, ?, ?, ?], ['Name', 'BirthDate', 'Age', 'Price'])

John Machin Over a year ago

@Pranab: both the second and third arguments can be any iterable

Robert Rossney Over a year ago

The convert function is, ah, something I forgot to include. See the edit. You'd call this function using something like: sort_csv('myfile.csv', (str, int, float, int), (2, 3)) if you wanted to sort a four-column CSV file by its last two columns.

John Machin · Answer · 2010-01-19 06:46:48Z

2

Here's the convert() that's missing from Robert's fix of Alex's answer:

>>> def convert(convert_funcs, seq):
...    return [
...        item if func is None else func(item)
...        for func, item in zip(convert_funcs, seq)
...        ]
...
>>> convert(
...     (None, float, lambda x: x.strip().lower()),
...     [" text ", "123.45", " TEXT "]
...     )
[' text ', 123.45, 'text']
>>>

I've changed the name of the first arg to highlight that the per-column functions can do whatever you need, not merely type coercion. None is used to indicate no conversion.

answered Jan 19, 2010 at 6:46

John Machin


telliott99 · Answer · 2010-01-18 21:17:39Z

-1

You bring up 3 issues:

1. file size
2. csv data
3. sorting on multiple fields

Here is a solution for the third part. The split(',') below is only for illustration; you can handle csv data in a more sophisticated way, e.g. with the csv module.

>>> data = 'a,b,c\nb,b,a\nb,c,a\n'
>>> lines = [e.split(',') for e in data.strip().split('\n')]
>>> lines
[['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
>>> def f(e):
...     field_order = [2,1]
...     return [e[i] for i in field_order]
... 
>>> sorted(lines, key=f)
[['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]

Edited to use a list comprehension; a generator does not work as I had expected it to.
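
For comparison, the key function `f` above can probably be replaced by `operator.itemgetter(2, 1)`, which builds the same kind of multi-field key as a tuple. The sketch below also shows one common way to mix directions (one field ascending, another descending): since Python's sort is stable, sort by the least significant field first, then by the most significant one. The variable names are illustrative only.

```python
import operator

lines = [['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]

# Same ordering as sorted(lines, key=f): field 2 first, then field 1.
by_2_then_1 = sorted(lines, key=operator.itemgetter(2, 1))

# Mixed directions: field 2 ascending, ties broken by field 1 descending.
# Stable sort, so apply the least significant key first.
mixed = sorted(lines, key=operator.itemgetter(1), reverse=True)
mixed.sort(key=operator.itemgetter(2))
```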

edited Jan 18, 2010 at 21:17

answered Jan 18, 2010 at 20:51

telliott99
