7

I assumed sorting a CSV file on multiple text/numeric fields using Python would be a problem that was already solved. But I can't find any example code anywhere, except for specific code focusing on sorting date fields.

How would one go about sorting a relatively large CSV file (tens of thousands of lines) on multiple fields, in order?

Python code samples would be appreciated.

4 Answers

10

Python's sort works in-memory only; however, tens of thousands of lines should fit in memory easily on a modern machine. So:

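import csv
import operator

def sortcsvbymanyfields(csvfilename, themanyfieldscolumnnumbers):
  # Read the whole file into memory (Python 3: text mode with newline='').
  with open(csvfilename, newline='') as f:
    readit = csv.reader(f)
    thedata = list(readit)
  # Sort by the given column indexes, in priority order.
  thedata.sort(key=operator.itemgetter(*themanyfieldscolumnnumbers))
  # Overwrite the original file with the sorted rows.
  with open(csvfilename, 'w', newline='') as f:
    writeit = csv.writer(f)
    writeit.writerows(thedata)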

5 Comments

This is why I need to spend a weekend (or a week) reviewing the standard library reference. itemgetter looks sweet.
I also loved the answer at stackoverflow.com/questions/1143671/…
This doesn't address the OP's "multiple text/numeric fields" requirement; it treats all fields as text.
@John, if some fields need to be treated differently (e.g. subjected to transformations such as multiple disparate type-coercions) before the sorting is done, that's not hard to arrange, but there's just not enough spec detail in the Q about how such a potentially complicated issue would be specified by function arguments (no doubt worthy of a separate Q, as it can be important quite apart from sorting!) -- if you want that info, why not open a Q yourself?
Btw usually the first line of the csv is a header - be careful to omit that from sorting
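The header caveat can be handled by reading the first row separately and writing it back before the sorted rows. This is only a minimal sketch: the function name `sort_csv_keep_header` is made up for illustration, and it assumes the file has exactly one header row.

```python
import csv
import operator

def sort_csv_keep_header(csv_filename, sort_key_columns):
    """Sort a CSV file in place by column indexes, keeping the first row on top.

    Hypothetical helper; assumes exactly one header row.
    """
    with open(csv_filename, newline='') as f:
        reader = csv.reader(f)
        header = next(reader, None)  # None if the file is empty
        rows = list(reader)
    # Compare rows on the chosen columns only, in priority order.
    rows.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'w', newline='') as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
```

Like the answer above, this compares every field as text.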
4

Here's Alex's answer, reworked to support column data types:

import csv
import operator

def sort_csv(csv_filename, types, sort_key_columns):
    """sort (and rewrite) a csv file.
    types:  data types (conversion functions) for each column in the file
    sort_key_columns: column numbers of columns to sort by"""
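    data = []
    # Python 3: open in text mode with newline='' as the csv module expects.
    with open(csv_filename, newline='') as f:
        for row in csv.reader(f):
            data.append(convert(types, row))
    data.sort(key=operator.itemgetter(*sort_key_columns))
    # Converted values are written back using their string form.
    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(data)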

Edit:

I did a stupid. I was playing with various things in IDLE and wrote a convert function a couple of days ago. I forgot I'd written it, and I haven't closed IDLE in a good long while - so when I wrote the above, I thought convert was a built-in function. Sadly no.

Here's my implementation, though John Machin's is nicer:

def convert(types, values):
    return [t(v) for t, v in zip(types, values)]

Usage:

import datetime
def date(s):
    # strptime lives on the datetime class inside the datetime module
    return datetime.datetime.strptime(s, '%m/%d/%y')

>>> convert((int, date, str), ('1', '2/15/09', 'z'))
[1, datetime.datetime(2009, 2, 15, 0, 0), 'z']
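To see why the type conversion matters, here is a small end-to-end sketch in Python 3. The sample data (name, quantity, price) and the temporary file are invented for illustration; `sort_csv` and `convert` follow the definitions above.

```python
import csv
import operator
import os
import tempfile

def convert(types, values):
    return [t(v) for t, v in zip(types, values)]

def sort_csv(csv_filename, types, sort_key_columns):
    with open(csv_filename, newline='') as f:
        data = [convert(types, row) for row in csv.reader(f)]
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(data)

# Hypothetical sample data: name, quantity, price.
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([['pear', '10', '0.5'],
                             ['apple', '9', '1.25'],
                             ['fig', '10', '0.25']])

# Sort by quantity (as int), then by price (as float).
sort_csv(path, (str, int, float), (1, 2))

with open(path, newline='') as f:
    result = list(csv.reader(f))
os.remove(path)
print(result)
# [['apple', '9', '1.25'], ['fig', '10', '0.25'], ['pear', '10', '0.5']]
```

With a plain text sort, '10' would come before '9'. Note that the converted values are what gets written back, so fields such as dates or floats may come out formatted differently from the input.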

4 Comments

What is the convert() function? Also, are the second and third argument lists?
sort_csv('myfile.csv', [?, ?, ?, ?], ['Name', 'BirthDate', 'Age', 'Price'])
@Pranab: both the second and third arguments can be any iterable
The convert function is, ah, something I forgot to include. See the edit. You'd call this function using something like: sort_csv('myfile.csv', (str, int, float, int), (2, 3)) if you wanted to sort a four-column CSV file by its last two columns.
2

Here's the convert() that's missing from Robert's fix of Alex's answer:

>>> def convert(convert_funcs, seq):
...    return [
...        item if func is None else func(item)
...        for func, item in zip(convert_funcs, seq)
...        ]
...
>>> convert(
...     (None, float, lambda x: x.strip().lower()),
...     [" text ", "123.45", " TEXT "]
...     )
[' text ', 123.45, 'text']
>>>

I've changed the name of the first arg to highlight that the per-column functions can do whatever you need, not merely type-coercion. None is used to indicate no conversion.
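One possible variation, sketched here under my own assumptions rather than taken from any answer: apply the per-column functions only inside the sort key, so the rows themselves keep their original text. The name `make_sort_key` and the dict-of-functions argument are hypothetical.

```python
def make_sort_key(convert_funcs, sort_key_columns):
    """Build a key function that converts only the columns being compared.

    convert_funcs maps column index -> callable; a missing entry or None
    means "compare the raw text". Hypothetical helper for this sketch.
    """
    def key(row):
        parts = []
        for col in sort_key_columns:
            func = convert_funcs.get(col)
            parts.append(row[col] if func is None else func(row[col]))
        return tuple(parts)
    return key

rows = [[" Text ", "123.45"], ["text", "9.5"], ["alpha", "9.5"]]
# Sort by column 1 as a float, then by column 0 ignoring case and padding.
key = make_sort_key({0: lambda s: s.strip().lower(), 1: float}, (1, 0))
rows.sort(key=key)
print(rows)
# [['alpha', '9.5'], ['text', '9.5'], [' Text ', '123.45']]
```

Because the rows are never modified, writing them back with csv.writer would reproduce the original field text exactly.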


-1

You bring up 3 issues:

  • file size
  • csv data
  • sorting on multiple fields

Here is a solution for the third part. You can handle csv data in a more sophisticated way.

>>> data = 'a,b,c\nb,b,a\nb,c,a\n'
>>> lines = [e.split(',') for e in data.strip().split('\n')]
>>> lines
[['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
>>> def f(e):
...     field_order = [2,1]
...     return [e[i] for i in field_order]
... 
>>> sorted(lines, key=f)
[['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]

Edited to use a list comprehension; a generator does not work as a sort key the way I had expected, because generator objects are not compared by their contents.
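For comparison, a tuple key or operator.itemgetter gives the same ordering as the list-returning function. This is a small sketch reusing the sample data from above; the variable names `by_tuple` and `by_itemgetter` are mine.

```python
import operator

lines = [['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
field_order = [2, 1]

# Wrapping the generator expression in tuple() gives a comparable key.
by_tuple = sorted(lines, key=lambda e: tuple(e[i] for i in field_order))

# itemgetter with several indexes returns a tuple of those fields.
by_itemgetter = sorted(lines, key=operator.itemgetter(*field_order))

print(by_tuple)
# [['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]
```

In Python 3, returning the bare generator as the key raises a TypeError, since generator objects do not support `<`.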
