I am trying to create a simple line graph to compare columns from two files. I have written some code and would like to know how to ignore lines in the two .csv files that I have. The code is as follows:

import numpy as np
import csv
from matplotlib import pyplot as plt

def read_cell(x, y):
        with open('Illumina_Heart_Gencode_Paired_End_Novel_Junctions.csv', 'r') as f:
                reader = csv.reader(f)
                y_count = 0
                for n in reader:
                        if y_count == y:
                                cell = n[x]
                                return cell
                        y_count += 1
print(read_cell(6, 932))

def read_cell(x, y):
        with open('Illumina_Heart_RefSeq_Paired_End_Novel_Junctions.csv', 'r') as f:
                reader = csv.reader(f)
                y_count = 0
                for n in reader:
                        if y_count == y:
                                cell = n[x]
                                return cell
                        y_count += 1
print(read_cell(6, 932))


d1 = []
for i in set1:
    try:
        d1.append(float(i[5]))
    except ValueError:
        continue

d2 = []
for i in set2:
    try:
        d2.append(float(i[5]))
    except ValueError:
        continue

min_len = len(d1)
if len(d2) < min_len:
    min_len = len(d2)
d1 = d1[0:min_len]
d2 = d2[0:min_len]

plt.plot(d1, d2, 'r*')
plt.plot(d1, d2, 'b-')
plt.xlabel('Data Set 1: PE_NJ')
plt.ylabel('Data Set 2: PE_SJ')
plt.show()

The first csv file has 932 rows and the second one has 99,154 rows. I am only interested in taking the first 932 rows from both files and then want to compare the 7th column in both files.

How do I go about doing that?

The first file looks like this:

chr1    1718493 1718764 2   2   0   12  0   24
chr1    8928117 8930883 2   2   0   56  0   24
chr1    8930943 8931949 2   2   0   48  0   25
chr1    9616316 9627341 1   1   0   12  0   24
chr1    10166642    10167279    1   1   0   31  1   24

The second file looks like so:

chr1    880181  880421  2   2   0   15  0   21
chr1    1718493 1718764 2   2   0   12  0   24
chr1    8568735 8585817 2   2   0   12  0   21
chr1    8617583 8684368 2   2   0   14  0   23
chr1    8928117 8930883 2   2   0   56  0   24
  • CSVs are comma separated: Comma Separated Values. Commented Jun 24, 2014 at 6:42
  • @ChrisArena I am new to this sort of stuff. How would a CSV file look different from a .txt file? I got this output by doing head -5 "filename". Commented Jun 24, 2014 at 6:47
  • A CSV file has entries which are separated by commas. Your file has entries which are separated by tabs. Commented Jun 24, 2014 at 6:48
  • There are also tab-delimited CSVs, but they are rarely supported. Commented Aug 12, 2014 at 8:37

3 Answers

1

One possible approach would be to read all lines from the first (shorter) file, find out its length (N), read N lines from the second file, and take the column you are interested in from both files.

Something like (adjusting delimiter for your case):

import csv

def read_tsv_file(fname):
    # Read the full contents of a tab-separated file (like yours) into a list of rows.
    with open(fname, newline='') as f:
        return list(csv.reader(f, delimiter='\t'))

def take_nth_column(first_array, second_array, n):
    # Return a tuple with the nth (zero-based) column of both arrays,
    # each trimmed to the length of the shorter array.
    min_len = min(len(first_array), len(second_array))
    col1 = [row[n] for row in first_array[:min_len]]
    col2 = [row[n] for row in second_array[:min_len]]
    return (col1, col2)


first_array = read_tsv_file('your-first-file')
second_array = read_tsv_file('your-second-file')
# Columns are zero-based, so the 7th column is index 6.
(col1, col2) = take_nth_column(first_array, second_array, 6)
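
The code above loads the whole second file before trimming it. A minimal sketch of the approach as described, reading only as many rows from the second file as the first one contains, could use `itertools.islice`. The helper names `read_tsv_rows` and `column_pair` are made up for illustration, and both files are assumed to be tab-separated as in the question.

```python
import csv
from itertools import islice


def read_tsv_rows(fname, limit=None):
    """Return up to `limit` rows of a tab-separated file (all rows if limit is None)."""
    with open(fname, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        # islice stops pulling rows once `limit` is reached,
        # so the rest of a large file is never parsed.
        return list(islice(reader, limit))


def column_pair(first_fname, second_fname, n):
    """Return column n (zero-based) from both files, trimmed to the shorter one."""
    first_rows = read_tsv_rows(first_fname)
    second_rows = read_tsv_rows(second_fname, limit=len(first_rows))
    # The second file may still turn out to be the shorter one, so trim both.
    min_len = min(len(first_rows), len(second_rows))
    col1 = [row[n] for row in first_rows[:min_len]]
    col2 = [row[n] for row in second_rows[:min_len]]
    return col1, col2
```

With the question's files, `column_pair('your-first-file', 'your-second-file', 6)` would return the seventh column of the first 932 rows of each file, as lists of strings.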

0

So, your file isn't comma separated, which actually makes this a bit easier. We go through the first file and take the 7th item in each row after splitting the row on whitespace (tabs/spaces that separate the items in your data). Then we do the same thing for the next file, but if we get past the 932nd line we break out of the loop and finish.

I'd do it something like this:

file1_values = []
file2_values = []

with open('file1') as f1:
    for line in f1:
        seventh_column = line.split()[6]
        file1_values.append(seventh_column)

with open('file2') as f2:
    for i, line in enumerate(f2):
        # enumerate starts at 0, so i == 932 is already the 933rd line
        if i >= 932:
            break
        seventh_column = line.split()[6]
        file2_values.append(seventh_column)

Then, you have the values that you're interested in placed into two lists of hopefully equal length, and can go from there doing whatever comparisons or graphing you'd like to do.
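
Both lists hold strings at this point, so they need converting before they can be plotted. Here is a minimal sketch, assuming matplotlib is installed and that the goal is one line per file plotted against the row index. `to_floats` and `plot_columns` are illustrative names, not part of the code above.

```python
def to_floats(values):
    """Convert strings to floats, skipping entries that are not numeric."""
    numbers = []
    for value in values:
        try:
            numbers.append(float(value))
        except ValueError:
            # For example a header cell or an empty field.
            continue
    return numbers


def plot_columns(first_values, second_values):
    """Plot both columns against the row index as two lines."""
    # Imported here so to_floats stays usable without matplotlib installed.
    from matplotlib import pyplot as plt

    plt.plot(to_floats(first_values), 'r-', label='file1, column 7')
    plt.plot(to_floats(second_values), 'b-', label='file2, column 7')
    plt.xlabel('Row index')
    plt.ylabel('Column 7 value')
    plt.legend()
    plt.show()
```

Calling `plot_columns(file1_values, file2_values)` would then draw the two series. Note that skipping non-numeric entries shifts the remaining values up, so the rows of the two files stop lining up if either column contains such entries.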


0

EDIT: added the delimiter option and a note on the function definition.

If you just want to keep one column and stop reading after a given number of lines, simply append the values to a list in your loop and break once that number is reached. But if your file uses anything other than a comma (,) as its delimiter, you have to specify it. And do not repeat the function definition: one def taking the file name as a parameter is enough. So your reader function could look like this:

def read_column(file_name, x, y):
    # Return column x (zero-based) from the first y lines of a tab-separated file.
    cells = []
    with open(file_name, 'r') as f:
        reader = csv.reader(f, delimiter="\t")
        y_count = 0
        for n in reader:
            y_count += 1
            if y_count > y:
                break
            cells.append(n[x])
    return cells

That way the function returns a list with column x from the first y lines.
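
If the files may contain blank or truncated lines that should be ignored, indexing the row directly would raise an `IndexError`. A possible variant that skips such rows is sketched below. It is an assumption about what "ignoring lines" should mean here, and `read_column_skipping_short_rows` is a hypothetical name. It accepts any iterable of lines (an open file or a list of strings), and `y` counts input lines, including the skipped ones.

```python
import csv
from itertools import islice


def read_column_skipping_short_rows(lines, x, y):
    """Return column x (zero-based) from the first y lines, ignoring rows without it."""
    cells = []
    # csv.reader accepts any iterable of strings; islice limits it to y lines.
    reader = csv.reader(islice(lines, y), delimiter='\t')
    for row in reader:
        if len(row) <= x:
            # Blank or truncated line: there is no column x to take.
            continue
        cells.append(row[x])
    return cells
```

With a real file this could be called as `with open(file_name, newline='') as f: cells = read_column_skipping_short_rows(f, 6, 932)`.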
