Iterating through CSV reader to slice data frame

Question

I have a data frame that contains 508383 rows. I am only showing the first 10 rows.

       0        1        2
0  chr3R  4174822  4174922
1  chr3R  4175400  4175500
2  chr3R  4175466  4175566
3  chr3R  4175521  4175621
4  chr3R  4175603  4175703
5  chr3R  4175619  4175719
6  chr3R  4175692  4175792
7  chr3R  4175889  4175989
8  chr3R  4175966  4176066
9  chr3R  4176044  4176144

I want to iterate through the rows and compare the value in column #2 of the first row with the value in the next row, checking whether the difference between them is less than 5000. If the difference is greater than 5000, I want to slice the data frame from the first row to the previous row and keep that slice as a subset data frame.

I then want to repeat this process and create a second subset data frame. I've only managed to get this done by using the CSV reader in combination with Pandas.

Here is my code:

#!/usr/bin/env python

import pandas as pd

data = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None, index_col=None)

import csv

file = open('sort_cov_emb_sg.bed')

readCSV = csv.reader(file, delimiter="\t")

first_row = readCSV.next()
print first_row

count_1 = 0
while count_1 < 100000:
    next_row = readCSV.next()
    value_1 = int(next_row[1]) - int(first_row[1])
    count_1 = count_1 + 1
    if value_1 < 5000:
        continue
    else:
        break

print next_row
print count_1
print value_1

window_1 = data[0:63]
print window_1

first_row = readCSV.next()
print first_row

count_2 = 0
while count_2 < 100000:
    next_row = readCSV.next()
    value_2 = int(next_row[1]) - int(first_row[1])
    count_2 = count_2 + 1
    if value_2 < 5000:
        continue
    else:
        break

print next_row
print count_2
print value_2

window_2 = data[0:74]
print window_2

I want to know if there is a better way to do this (without repeating the code every time) and get all the subset data frames I need.

Thanks.

Rodrigo

I remember seeing a similar question and someone came up with a great answer to it, but I can't find it. As a start, this will give you every row where the difference is greater than 5000: data[abs(data["2"] - data["2"].shift()) > 5000]. I guess you could iterate through that and slice accordingly.
– Bob Haffner, Commented Apr 15, 2015 at 1:36

DSM · Accepted Answer · 2015-04-15 02:46:25Z

This is yet another example of the compare-cumsum-groupby pattern. Using only the rows you showed (and so changing the threshold from 5000 to 100), with df being your data frame:

jumps = df[2] > df[2].shift() + 100
grouped = df.groupby(jumps.cumsum())
for k, group in grouped:
    print(k)
    print(group)

produces

0
       0        1        2
0  chr3R  4174822  4174922
1
       0        1        2
1  chr3R  4175400  4175500
2  chr3R  4175466  4175566
3  chr3R  4175521  4175621
4  chr3R  4175603  4175703
5  chr3R  4175619  4175719
6  chr3R  4175692  4175792
2
       0        1        2
7  chr3R  4175889  4175989
8  chr3R  4175966  4176066
9  chr3R  4176044  4176144
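
If you want to keep the subsets instead of printing them, the groups can be collected into a list. This is a minimal sketch that rebuilds the ten sample rows by hand so it runs on its own; the helper name split_on_jumps and the variable subsets are made up for illustration, and on the real file you would pass your own frame with a threshold of 5000.

```python
import pandas as pd

# The ten sample rows from the question; columns are labelled 0, 1, 2,
# which is what read_csv(..., header=None) would give.
df = pd.DataFrame(
    [
        ["chr3R", 4174822, 4174922],
        ["chr3R", 4175400, 4175500],
        ["chr3R", 4175466, 4175566],
        ["chr3R", 4175521, 4175621],
        ["chr3R", 4175603, 4175703],
        ["chr3R", 4175619, 4175719],
        ["chr3R", 4175692, 4175792],
        ["chr3R", 4175889, 4175989],
        ["chr3R", 4175966, 4176066],
        ["chr3R", 4176044, 4176144],
    ]
)


def split_on_jumps(frame, column, threshold):
    """Hypothetical helper: split `frame` wherever `column` rises by more
    than `threshold` relative to the previous row."""
    # True on every row that starts a new group (the first row compares
    # against NaN, which evaluates to False).
    jumps = frame[column] > frame[column].shift() + threshold
    # The running count of jumps acts as a group id.
    return [group for _, group in frame.groupby(jumps.cumsum())]


subsets = split_on_jumps(df, 2, 100)
print([len(s) for s in subsets])  # [1, 6, 3]
```

Each element of subsets is an ordinary DataFrame, so subsets[0], subsets[1], and so on play the role of window_1, window_2 in the question.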

This works because the comparison gives us a new True every time a new group starts, and when we take the cumulative sum of that, we get what is effectively a group id, which we can group on:

>>> jumps
0    False
1     True
2    False
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: 2, dtype: bool
>>> jumps.cumsum()
0    0
1    1
2    1
3    1
4    1
5    1
6    1
7    2
8    2
9    2
Name: 2, dtype: int32
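
One caveat: the grouping above compares each row with the row immediately before it, whereas the loop in the question measures the difference from the first row of the current window. If that anchored behaviour is what you actually need, a plain loop that builds the group ids can feed the same groupby step. The sketch below rests on an assumption about the intended semantics (a new window starts at the first row whose value is at least the threshold above the window's first row); split_from_window_start is a hypothetical helper name, and the sample frame is rebuilt by hand so the block runs on its own.

```python
import pandas as pd

# Sample rows from the question, columns labelled 0, 1, 2.
df = pd.DataFrame(
    [
        ["chr3R", 4174822, 4174922],
        ["chr3R", 4175400, 4175500],
        ["chr3R", 4175466, 4175566],
        ["chr3R", 4175521, 4175621],
        ["chr3R", 4175603, 4175703],
        ["chr3R", 4175619, 4175719],
        ["chr3R", 4175692, 4175792],
        ["chr3R", 4175889, 4175989],
        ["chr3R", 4175966, 4176066],
        ["chr3R", 4176044, 4176144],
    ]
)


def split_from_window_start(frame, column, threshold):
    """Hypothetical helper: start a new subset once a value is `threshold`
    or more above the first value of the current subset."""
    values = frame[column].tolist()
    if not values:
        return []
    group_ids = []
    start = values[0]  # first value of the current window
    current = 0        # id of the current window
    for value in values:
        if value - start >= threshold:
            current += 1
            start = value  # the row that broke the window opens the next one
        group_ids.append(current)
    # Wrap the ids in a Series aligned to the frame's index so that groupby
    # treats them as per-row labels rather than as column names (the ids
    # 0, 1, 2 would otherwise collide with this frame's column labels).
    keys = pd.Series(group_ids, index=frame.index)
    return [group for _, group in frame.groupby(keys)]


windows = split_from_window_start(df, 2, 100)
print([len(w) for w in windows])  # [1, 2, 3, 1, 2, 1]
```

On the sample rows this gives more, smaller windows than the consecutive-row comparison, because a window is closed as soon as it spans 100 or more, even when no single step is that large.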
