Python csv Module Error: index out of range

Question

I have a CSV file and I want to extract columns from it, but only from some of the rows. It looks like this:

gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

Essentially I want the 2nd and 6th column, but only from the rows which have "gene_name" in the 5th column. So I want to extract:

ENSDARG00000104632, RERG

(It goes on from there with many thousands of rows)

This is what I wrote:

import csv


with open('filename.csv', 'rb') as infh:
        reader = csv.reader(infh)
        for row in reader:
                if row[4] == 'gene_name':
                        print row[1, 5]

However, it gives me this error:

File "./gene_name_grabber.sh", line 10, in <module>
    if row[4] == 'gene_name':
IndexError: list index out of range

I understand that this error means I've asked it to look at an index number greater than the number of indexes in the rows...but there are clearly more than 4 indexes in each row. Help please?

Thanks!

Are you sure that all your lines have the same number of columns? Can you add a print statement right before the if condition, so that we can see the line that gives this error?
– Antimony, Commented Sep 21, 2017 at 22:49

I changed it to this:

import csv

with open('zebrafish_gtf_IDs_and_names.csv', 'rb') as infh:
        reader = csv.reader(infh)
        for row in reader:
                print row
                if row[4] == 'gene_name':
                        print row[1, 5]

but it still gives me this error:

File "./gene_name_grabber.sh", line 11, in <module>
    if row[4] == 'gene_name':
IndexError: list index out of range
– David Tatarakis, Commented Sep 21, 2017 at 22:54

adder · Accepted Answer · 2017-09-21 23:10:51Z

1

Obviously, there are some rows that do not contain enough columns. Try this:

import csv

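with open('input.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            # Substring test: each field keeps the space that follows the comma
            if 'gene_name' in row[4]:
                print('%s, %s' % (row[1].strip(), row[5].strip()))
        except IndexError:
            # Row has fewer than six fields (a blank line, for example): skip it
            continue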

...output:

ENSDARG00000104632, RERG
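
To see which rows are short before deciding how to handle them, a small diagnostic along these lines may help. It is only a sketch: the sample text, the `find_short_rows` helper and its `min_fields` parameter are made up for illustration, and the blank and truncated lines are an assumption about what the real file might contain.

```python
import csv
import io

# Hypothetical sample: the second line is blank, the third is truncated.
SAMPLE = (
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG\n"
    "\n"
    "gene_id, ENSDARG00000104632\n"
)


def find_short_rows(lines, min_fields=6):
    """Return (line_number, row) pairs for rows with fewer than min_fields fields."""
    reader = csv.reader(lines)
    short = []
    for row in reader:
        if len(row) < min_fields:
            # reader.line_num is the number of source lines read so far
            short.append((reader.line_num, row))
    return short


short_rows = find_short_rows(io.StringIO(SAMPLE))
for line_num, row in short_rows:
    print(line_num, row)
```

With a real file, the same helper could be given the open file object instead of the `io.StringIO` wrapper.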

edited Sep 21, 2017 at 23:10

answered Sep 21, 2017 at 22:57


2 Comments

David Tatarakis Over a year ago

This did work, although it surrounded the elements in each column with (' so that they look like this: ('ENSDARG00000104632', 'RERG'). Is there a way to make the output simply: ENSDARG00000104632, RERG ?

David Tatarakis Over a year ago

Ahh I see. Thank you very much this helps a lot!

J_H · Accepted Answer · 2017-09-22 00:41:18Z

0

I want the 2nd and 6th column, but only from the rows which have "gene_name" in the 5th column.

I love python. But this is most naturally expressed as

awk '$5 ~ /gene_name/ {print $2, $6}'

Let's move back to python. This isn't what you wanted to write:

                    print row[1, 5]

Phrase it as print(row[1], row[5]) instead.

Some of your lines have only a small number of columns. So you'll want to wrap dereferences of e.g. row[4] or row[5] in an if statement that verifies it's a long enough line:

    if len(row) > 5:
        ...
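
Putting those two points together, a minimal sketch might look like the following. `row[1, 5]` fails because a list cannot be indexed with a tuple (Python raises a TypeError), so each column is indexed on its own. The sample data and the `extract_gene_names` name are illustrative only; `skipinitialspace=True` is passed to `csv.reader` so that the space after each comma does not end up inside the field, which lets the comparison be an exact match.

```python
import csv
import io

# Illustrative sample: the last line is too short to have a 5th or 6th column.
SAMPLE = (
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG\n"
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186\n"
    "gene_id, ENSDARG00000104632\n"
)


def extract_gene_names(lines):
    """Collect (2nd column, 6th column) for rows whose 5th column is 'gene_name'."""
    reader = csv.reader(lines, skipinitialspace=True)
    pairs = []
    for row in reader:
        # Check the length first so row[4] and row[5] are safe to index
        if len(row) > 5 and row[4] == 'gene_name':
            pairs.append((row[1], row[5]))
    return pairs


pairs = extract_gene_names(io.StringIO(SAMPLE))
for gene_id, gene_name in pairs:
    print('%s, %s' % (gene_id, gene_name))
```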

edited Sep 22, 2017 at 0:41

answered Sep 21, 2017 at 22:55


2 Comments

David Tatarakis Over a year ago

I tried using the awk command, but it didn't seem to do anything at all. I wrote it like this: cat filename.csv | awk '$5 ~ /^gene_name$/ {print $2, $6}' > newfile.csv. Is this not correct?

J_H Over a year ago

Sorry, I shouldn't have anchored it. I will revise the answer to delete ^ and $ as they're not needed in an awk context.
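
One possible reason the anchored pattern matched nothing: awk splits fields on whitespace by default, so with this comma-and-space layout the fifth field would be `gene_name,` with the comma still attached. The Python sketch below mimics that splitting with `str.split()` to show the effect. It illustrates the idea and does not reproduce awk itself.

```python
import re

line = "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG"

# Whitespace splitting keeps the trailing comma on every field but the last
fields = line.split()
fifth = fields[4]

# The anchored pattern requires the field to be exactly 'gene_name'
anchored = re.search(r'^gene_name$', fifth) is not None
# The unanchored pattern only requires 'gene_name' to appear somewhere
unanchored = re.search(r'gene_name', fifth) is not None

print(repr(fifth), anchored, unanchored)
```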

jrd1 · Accepted Answer · 2017-09-22 02:19:30Z

As Antimony noted, it sounds as though your data has occasional missing values in it, which csv cannot easily handle out-of-the-box. I'd suggest using a library like pandas which has a read_csv function, and can handle missing values. Using this data as an example:

gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id,
gene_id, ENSDARG00000104632, gene_version, , transcript_id,
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

it could be read as follows:

import pandas as pd

# Use the 2nd, 5th and 6th columns, i.e. column indices 1, 4 and 5 respectively.
# We also set the 'not available' marker, i.e. `na_values`, to 'N/A'.
data = pd.read_csv('test.dat', na_values='N/A', header=None, skipinitialspace=True, usecols=[1,4,5])

# now select only the rows that have 'gene_name' in column 4:
d = data.loc[data[4] == 'gene_name']

# and, now we only select columns with index 1 and 5:
selected_data = d[[1, 5]]

Yielding:

                    1     5
0  ENSDARG00000104632  RERG

As desired.

If a matching row has missing data (in this example the gaps are only in transcript_id rows, which the filter already excludes), you can remove such rows with:

selected_data.dropna()

(However, this may not be what you want.)

REFERENCE

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You said usecols=[1, 5] but I think you need usecols=[1, 4, 5]. That way you preserve enough information to mask out any rows where element 4 is not "gene_name".
