Python: Ignoring specific rows in a csv file

Question

I am trying to create a simple line graph to compare columns from two files. I have written some code and would like to know how to ignore lines in the two .csv files that I have. The code is as follows:

import numpy as np
import csv
from matplotlib import pyplot as plt

def read_cell(x, y):
        with open('Illumina_Heart_Gencode_Paired_End_Novel_Junctions.csv', 'r') as f:
                reader = csv.reader(f)
                y_count = 0
                for n in reader:
                        if y_count == y:
                                cell = n[x]
                                return cell
                        y_count += 1
print(read_cell(6, 932))

def read_cell(x, y):
        with open('Illumina_Heart_RefSeq_Paired_End_Novel_Junctions.csv', 'r') as f:
                reader = csv.reader(f)
                y_count = 0
                for n in reader:
                        if y_count == y:
                                cell = n[x]
                                return cell
                        y_count += 1
print(read_cell(6, 932))


d1 = []
for i in set1:
    try:
        d1.append(float(i[5]))
    except ValueError:
        continue

d2 = []
for i in set2:
    try:
        d2.append(float(i[5]))
    except ValueError:
        continue

min_len = len(d1)
if len(d2) < min_len:
    min_len = len(d2)
d1 = d1[0:min_len]
d2 = d2[0:min_len]

plt.plot(d1, d2, 'r*')
plt.plot(d1, d2, 'b-')
plt.xlabel('Data Set 1: PE_NJ')
plt.ylabel('Data Set 2: PE_SJ')
plt.show()

The first csv file has 932 rows and the second one has 99,154 rows. I am only interested in taking the first 932 rows from both files and then want to compare the 7th column in both files.

How do I go about doing that?

The first file looks like this:

chr1    1718493 1718764 2   2   0   12  0   24
chr1    8928117 8930883 2   2   0   56  0   24
chr1    8930943 8931949 2   2   0   48  0   25
chr1    9616316 9627341 1   1   0   12  0   24
chr1    10166642    10167279    1   1   0   31  1   24

The second file looks like so:

chr1    880181  880421  2   2   0   15  0   21
chr1    1718493 1718764 2   2   0   12  0   24
chr1    8568735 8585817 2   2   0   12  0   21
chr1    8617583 8684368 2   2   0   14  0   23
chr1    8928117 8930883 2   2   0   56  0   24

@ChrisArena I am new to this sort of stuff. How would a CSV file look different from a .txt file? I got this output by doing head -5 "filename".
– Ruchik Yajnik, Commented Jun 24, 2014 at 6:47
A CSV file has entries separated by commas. Your file has entries separated by tabs.
– Ashalynd, Commented Jun 24, 2014 at 6:48
There are also tab-delimited CSVs, but they are rarely supported.
– Arusekk, Commented Aug 12, 2014 at 8:37

Ashalynd · Accepted Answer · 2014-06-24 06:55:59Z

One possible approach would be to read all lines from the first (shorter) file, find its length (N), read N lines from the second file, and then take the kth column you are interested in from both files.

Something like this (adjust the delimiter for your case):

import csv

def read_tsv_file(fname):
    # Reads the full contents of a tab-separated file (like the ones you have).
    with open(fname, newline='') as f:
        return list(csv.reader(f, delimiter='\t'))

def take_nth_column(first_array, second_array, n):
    # Returns a tuple containing the nth columns (0-based) from both arrays,
    # truncated to the length of the shorter array.
    min_len = min(len(first_array), len(second_array))
    col1 = [row[n] for row in first_array[:min_len]]
    col2 = [row[n] for row in second_array[:min_len]]
    return (col1, col2)


first_array = read_tsv_file('your-first-file')
second_array = read_tsv_file('your-second-file')
# Index 6 selects the 7th column, since indices start at 0.
(col1, col2) = take_nth_column(first_array, second_array, 6)
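
Since the second file is much longer than the first, it may be worth stopping the read early instead of loading every row. The following is only a sketch of that idea, not part of the answer above: read_first_rows is a hypothetical helper, and the two in-memory "files" are made-up sample data standing in for the real ones. It assumes Python 3 and tab-separated input.

```python
import csv
import io
from itertools import islice

def read_first_rows(handle, limit, column):
    """Return `column` (0-based) from at most `limit` tab-separated rows."""
    reader = csv.reader(handle, delimiter='\t')
    # islice stops pulling rows from the reader once `limit` is reached;
    # limit=None means "read everything".
    return [row[column] for row in islice(reader, limit)]

# Made-up stand-ins for the two files (hypothetical sample data).
short_file = io.StringIO(
    "chr1\t10\t20\t2\t2\t0\t12\n"
    "chr1\t30\t40\t2\t2\t0\t56\n"
)
long_file = io.StringIO(
    "chr1\t5\t9\t2\t2\t0\t15\n"
    "chr1\t10\t20\t2\t2\t0\t12\n"
    "chr1\t50\t60\t2\t2\t0\t14\n"
)

col1 = read_first_rows(short_file, None, 6)      # whole short file
col2 = read_first_rows(long_file, len(col1), 6)  # only as many rows as col1
```

With real files, the io.StringIO objects would be replaced by handles from open(fname, newline='').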

Chris Arena · Accepted Answer · 2014-06-24 06:49:23Z

So, your file isn't comma separated, which actually makes this a bit easier. We go through the first file and take the 7th item in each row after splitting the row on whitespace (tabs/spaces that separate the items in your data). Then we do the same thing for the next file, but if we get past the 932nd line we break out of the loop and finish.

I'd do it something like this:

file1_values = []
file2_values = []

with open('file1') as f1:
    for line in f1:
        seventh_column = line.split()[6]
        file1_values.append(seventh_column)

with open('file2') as f2:
    for i, line in enumerate(f2):
        if i >= 932:  # i is 0-based, so this stops after 932 lines
            break
        seventh_column = line.split()[6]
        file2_values.append(seventh_column)

Then, you have the values that you're interested in placed into two lists of hopefully equal length, and can go from there doing whatever comparisons or graphing you'd like to do.
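
As a rough sketch of where to go from there: the values collected above are still strings, so they need converting before plotting. The helper names to_floats and plot_columns are hypothetical, the sample lists are made up, and the plotting function assumes matplotlib is installed (it is left uncalled here).

```python
def to_floats(values):
    """Convert strings to floats, skipping entries that are not numeric."""
    numbers = []
    for value in values:
        try:
            numbers.append(float(value))
        except ValueError:
            continue  # e.g. a header cell or a placeholder such as 'n/a'
    return numbers

def plot_columns(first, second):
    """Plot both columns against row index (requires matplotlib)."""
    from matplotlib import pyplot as plt
    plt.plot(first, 'r-', label='file1')
    plt.plot(second, 'b-', label='file2')
    plt.xlabel('Row')
    plt.ylabel('Column 7 value')
    plt.legend()
    plt.show()

# Made-up sample values standing in for the lists built above.
file1_values = ['12', '56', '48']
file2_values = ['15', '12', 'n/a']

d1 = to_floats(file1_values)
d2 = to_floats(file2_values)

# Trim both lists to the same length before comparing them.
n = min(len(d1), len(d2))
d1, d2 = d1[:n], d2[:n]
# plot_columns(d1, d2)  # uncomment to draw the line graph
```

Note that skipping non-numeric entries shifts later rows up, so if row-by-row alignment between the files matters, it may be safer to drop the same row from both lists instead.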

Serge Ballesta · Accepted Answer · 2014-06-24 07:54:53Z

EDIT: added the delimiter option and a clarification about the function definition.

If you just want to keep one column and stop reading after a given number of lines, simply append the values to a list in your loop and break once that count is reached. But if your file uses anything other than a comma (,) as the delimiter, you have to specify it. And do not repeat the function definition: one def is enough. So your reader function could look like this:

def read_column(file_name, x, y):
    # Return column x (0-based) from the first y rows of a tab-separated file.
    cells = []
    with open(file_name, 'r') as f:
        reader = csv.reader(f, delimiter="\t")
        y_count = 0
        for n in reader:
            y_count += 1
            if y_count > y:
                break
            cells.append(n[x])
    return cells

That way the function returns a list containing column x from the first y lines.
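
To illustrate why the delimiter matters, here is a small sketch that parses one line from the question's first file twice, once with the default comma delimiter and once with a tab. It assumes the gaps shown in the question's sample are really tab characters, as the comments suggest.

```python
import csv
import io

# One row from the question's sample, assuming tab-separated fields.
line = "chr1\t1718493\t1718764\t2\t2\t0\t12\t0\t24\n"

# Default delimiter is a comma: the whole line comes back as a single field.
default_row = next(csv.reader(io.StringIO(line)))

# With delimiter='\t' the line is split into its nine columns.
tab_row = next(csv.reader(io.StringIO(line), delimiter='\t'))

print(len(default_row))  # 1
print(len(tab_row))      # 9
print(tab_row[6])        # 12
```

With the default delimiter, indexing the row with 6 would raise an IndexError, because the row holds only one field.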
