0

I have a CSV file and I want to extract columns from it, but only from some of the rows. It looks like this:

gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

Essentially I want the 2nd and 6th column, but only from the rows which have "gene_name" in the 5th column. So I want to extract:

ENSDARG00000104632, RERG

(It goes on from there with many thousands of rows)

This is what I wrote:

import csv


with open('filename.csv', 'rb') as infh:
        reader = csv.reader(infh)
        for row in reader:
                if row[4] == 'gene_name':
                        print row[1, 5]

However, it gives me this error:

File "./gene_name_grabber.sh", line 10, in if row[4] == 'gene_name': IndexError: list index out of range

I understand that this error means I've asked it to look at an index number greater than the number of indexes in the rows...but there are clearly more than 4 indexes in each row. Help please?

Thanks!

  • Are you sure that all your lines have the same number of columns? Can you add a print statement right before the if condition, so that we can see the line that gives this error? Commented Sep 21, 2017 at 22:49
  • I changed it to this: import csv with open('zebrafish_gtf_IDs_and_names.csv', 'rb') as infh: reader = csv.reader(infh) for row in reader: print row if row[4] == 'gene_name': print row[1, 5] but it still gives me this error: File "./gene_name_grabber.sh", line 11, in if row[4] == 'gene_name': IndexError: list index out of range Commented Sep 21, 2017 at 22:54
  • Which line does it print last? Commented Sep 21, 2017 at 22:56
  • It prints no lines Commented Sep 21, 2017 at 22:57
  • Just the error and nothing else Commented Sep 21, 2017 at 22:57

3 Answers

1

Obviously, there are some rows that do not contain enough columns. Try this:

import csv

with open('input.csv', 'r') as f:

    reader = csv.reader(f)

    for row in reader:
        try:
            if 'gene_name' in row[4]:
                print('%s, %s' % (row[1].strip(), row[5].strip()))
        except IndexError:
            continue

...output:

ENSDARG00000104632, RERG
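
One possible source of short rows, assuming the file contains blank lines (as the sample in the question appears to): csv.reader returns an empty list for a blank line, so row[4] fails on it. A minimal sketch with made-up sample lines, not the asker's real file:

```python
import csv

# Hypothetical sample: two data lines separated by a blank line.
sample_lines = [
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG",
    "",
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186",
]

# csv.reader accepts any iterable of strings, not only file objects.
rows = list(csv.reader(sample_lines))
row_lengths = [len(row) for row in rows]
print(row_lengths)  # the blank line shows up as a row of length 0
```

Printing len(row) for each row of the real file is a quick way to check whether this is what is happening.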


2 Comments

This did work, although it surrounded the elements in each column with (' So that they look like this: ('ENSDARG00000104632', 'RERG') Is there a way to make it so the output is simply: ENSDARG00000104632, RERG ?
Ahh I see. Thank you very much this helps a lot!
0

I want the 2nd and 6th column, but only from the rows which have "gene_name" in the 5th column.

I love python. But this is most naturally expressed as

awk '$5 ~ /gene_name/ {print $2, $6}'

Let's move back to python. This isn't what you wanted to write:

                    print row[1, 5]

Phrase it as print(row[1], row[5]) instead.
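
To illustrate why row[1, 5] is not valid, here is a small sketch in Python 3 syntax using one row from the question as a hard-coded list. The variable names are only for illustration.

```python
row = ['gene_id', 'ENSDARG00000104632', 'gene_version', '2', 'gene_name', 'RERG']

# row[1, 5] passes the tuple (1, 5) as a single index, which lists reject.
try:
    row[1, 5]
except TypeError as exc:
    error_message = str(exc)

# Index each column separately, then join them for comma-separated output.
output_line = ', '.join([row[1], row[5]])
print(output_line)
```

Joining the two strings also avoids the ('...', '...') tuple formatting that a Python 2 print statement produces when given parentheses.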

Some of your lines have only a small number of columns. So you'll want to wrap dereferences of e.g. row[4] or row[5] in an if statement that verifies it's a long enough line:

    if len(row) > 5:
        ...
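
Putting the length check and the corrected print together, a minimal Python 3 sketch could look like this. The function name extract_gene_names and the sample lines are hypothetical, and skipinitialspace=True is an assumption that every comma in the file is followed by a space, as in the question's sample.

```python
import csv


def extract_gene_names(lines):
    """Yield (gene_id, gene_name) pairs from rows whose 5th column is 'gene_name'."""
    # skipinitialspace drops the blank after each comma, so an exact comparison works.
    reader = csv.reader(lines, skipinitialspace=True)
    for row in reader:
        # Skip blank or short rows instead of raising IndexError.
        if len(row) > 5 and row[4] == 'gene_name':
            yield row[1], row[5]


# Hypothetical sample, including a blank line and a short line.
sample_lines = [
    "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG",
    "",
    "gene_id, ENSDARG00000104632",
    "gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186",
]

for gene_id, gene_name in extract_gene_names(sample_lines):
    print('%s, %s' % (gene_id, gene_name))
```

For a real file, the same function can be given an open file object, for example one opened with open('filename.csv', newline='').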

2 Comments

I tried using the awk command, but it didn't seem to do anything at all. I wrote it like this: cat filename.csv | awk '$5 ~ /^gene_name$/ {print $2, $6}' > newfile.csv Is this not correct?
Sorry, I shouldn't have anchored it. I will revise the answer to delete ^ and $ as they're not needed in an awk context.
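
A rough Python sketch of why the anchored pattern matched nothing, assuming awk was splitting fields on whitespace (its default) rather than on commas. str.split() is used here only as an approximation of that behaviour.

```python
import re

line = "gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG"

# str.split() with no arguments splits on runs of whitespace, roughly like
# awk's default field splitting, so the commas stay attached to the fields.
fields = line.split()
fifth = fields[4]

anchored = re.search(r'^gene_name$', fifth)  # no match: the field ends with a comma
unanchored = re.search(r'gene_name', fifth)  # match: the name occurs inside the field
print(repr(fifth), anchored is not None, unanchored is not None)
```

The trailing comma on the fifth field is what defeats the ^...$ anchors, while the unanchored pattern still finds the name inside the field.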
0

As Antimony noted, it sounds as though your data has occasional missing values in it, which csv cannot easily handle out-of-the-box. I'd suggest using a library like pandas which has a read_csv function, and can handle missing values. Using this data as an example:

gene_id, ENSDARG00000104632, gene_version, 2, gene_name, RERG
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id,
gene_id, ENSDARG00000104632, gene_version, , transcript_id,
gene_id, ENSDARG00000104632, gene_version, 2, transcript_id, ENSDART00000166186

it could be read as follows:

import pandas as pd

# Use the 2nd, 5th and 6th columns, i.e. column indices 1, 4 and 5 respectively.
# Also treat the string 'N/A' as missing data via `na_values` (empty fields are read as NaN by default).
data = pd.read_csv('test.dat', na_values='N/A', header=None, skipinitialspace=True, usecols=[1,4,5])

# now select only the rows with 'gene_name' in column 4:
d = data.loc[data[4] == 'gene_name']

# and, now we only select columns with index 1 and 5:
selected_data = d[[1, 5]]

Yielding:

                    1     5
0  ENSDARG00000104632  RERG

As desired.

Without the row filter, data[[1, 5]] shows how the empty fields in the example are read in as NaN:

                    1                   5
0  ENSDARG00000104632                RERG
1  ENSDARG00000104632  ENSDART00000166186
2  ENSDARG00000104632  ENSDART00000166186
3  ENSDARG00000104632  ENSDART00000166186
4  ENSDARG00000104632  ENSDART00000166186
5  ENSDARG00000104632  ENSDART00000166186
6  ENSDARG00000104632  ENSDART00000166186
7  ENSDARG00000104632                 NaN
8  ENSDARG00000104632                 NaN
9  ENSDARG00000104632  ENSDART00000166186

If there is missing data like this in the rows you keep, all you'll have to do is remove those rows like:

data[[1, 5]].dropna()

Which outputs:

                    1                   5
0  ENSDARG00000104632                RERG
1  ENSDARG00000104632  ENSDART00000166186
2  ENSDARG00000104632  ENSDART00000166186
3  ENSDARG00000104632  ENSDART00000166186
4  ENSDARG00000104632  ENSDART00000166186
5  ENSDARG00000104632  ENSDART00000166186
6  ENSDARG00000104632  ENSDART00000166186
9  ENSDARG00000104632  ENSDART00000166186

(However, this may not be what you want.)

REFERENCE

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

2 Comments

You said usecols=[1, 5] but I think you need usecols=[1, 4, 5]. That way you preserve enough information to mask out any rows where element 4 is not "gene_name".
@JH Ah! Good catch!! Thanks! Edited.
