
I have a data frame that contains 508383 rows. I am only showing the first 10 rows.

       0        1        2
0  chr3R  4174822  4174922
1  chr3R  4175400  4175500
2  chr3R  4175466  4175566
3  chr3R  4175521  4175621
4  chr3R  4175603  4175703
5  chr3R  4175619  4175719
6  chr3R  4175692  4175792
7  chr3R  4175889  4175989
8  chr3R  4175966  4176066
9  chr3R  4176044  4176144

I want to iterate through the rows, comparing the value in column #2 of the first row with the value in the next row, and check whether the difference between them is less than 5000. If the difference is greater than 5000, I want to slice the data frame from the first row to the previous row and keep that slice as a subset data frame.

I then want to repeat this process to create a second subset data frame. I've only managed to get this done by using the CSV reader in combination with Pandas.

Here is my code:

#!/usr/bin/env python

import pandas as pd

data = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None, index_col=None)

import csv

file = open('sort_cov_emb_sg.bed')

readCSV = csv.reader(file, delimiter="\t")

first_row = readCSV.next()
print first_row

count_1 = 0
while count_1 < 100000:
    next_row = readCSV.next()
    value_1 = int(next_row[1]) - int(first_row[1])
    count_1 = count_1 + 1
    if value_1 < 5000:
        continue
    else:
        break

print next_row
print count_1
print value_1

window_1 = data[0:63]
print window_1

first_row = readCSV.next()
print first_row

count_2 = 0
while count_2 < 100000:
    next_row = readCSV.next()
    value_2 = int(next_row[1]) - int(first_row[1])
    count_2 = count_2 + 1
    if value_2 < 5000:
        continue
    else:
        break

print next_row
print count_2
print value_2

window_2 = data[0:74]
print window_2

I wanted to know if there is a better way to do this (without repeating the code every time) and get all the subset data frames I need.

Thanks.

Rodrigo

  • I remember seeing a similar question where someone came up with a great answer, but I can't find it. As a start, this will give you every row where the difference is greater than 5000: data[abs(data[2] - data[2].shift()) > 5000]. I guess you could iterate through that and slice accordingly. Commented Apr 15, 2015 at 1:36
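
The idea in the comment above (find the rows where the jump is too large, then slice between them) could be sketched roughly as follows. This is only an illustration: the sample frame is rebuilt by hand from the rows shown in the question, the names `threshold`, `starts`, `bounds` and `subsets` are made up here, and the threshold is set to 100 rather than 5000 so that the ten sample rows actually split.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; integer column labels, as with header=None.
data = pd.DataFrame({
    0: ["chr3R"] * 10,
    1: [4174822, 4175400, 4175466, 4175521, 4175603,
        4175619, 4175692, 4175889, 4175966, 4176044],
    2: [4174922, 4175500, 4175566, 4175621, 4175703,
        4175719, 4175792, 4175989, 4176066, 4176144],
})

threshold = 100  # assumption: 100 for the sample rows; the question uses 5000

# True wherever the gap to the previous row exceeds the threshold.
# The first row has no previous row, so its diff is NaN and compares False.
gaps = data[2].diff().abs() > threshold

# Integer positions of the rows that start a new subset.
starts = np.flatnonzero(gaps.to_numpy())

# Slice between consecutive break positions.
bounds = [0, *starts, len(data)]
subsets = [data.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

for subset in subsets:
    print(subset)
```

Each element of `subsets` is an ordinary data frame, so there is no need to repeat the loop by hand for every window.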

1 Answer


This is yet another example of the compare-cumsum-groupby pattern. Using only the rows you showed (and so changing the threshold to 100 instead of 5000):

jumps = df[2] > df[2].shift() + 100
grouped = df.groupby(jumps.cumsum())
for k, group in grouped:
    print(k)
    print(group)

produces

0
       0        1        2
0  chr3R  4174822  4174922
1
       0        1        2
1  chr3R  4175400  4175500
2  chr3R  4175466  4175566
3  chr3R  4175521  4175621
4  chr3R  4175603  4175703
5  chr3R  4175619  4175719
6  chr3R  4175692  4175792
2
       0        1        2
7  chr3R  4175889  4175989
8  chr3R  4175966  4176066
9  chr3R  4176044  4176144

This works because the comparison gives us a new True every time a new group starts, and when we take the cumulative sum of that, we get what is effectively a group id, which we can group on:

>>> jumps
0    False
1     True
2    False
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: 2, dtype: bool
>>> jumps.cumsum()
0    0
1    1
2    1
3    1
4    1
5    1
6    1
7    2
8    2
9    2
Name: 2, dtype: int32
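
One caveat: the grouping above starts a new group whenever a row jumps more than the threshold past the previous row, whereas the loop in the question measures each row against the first row of the current window. If that anchored behaviour is what is wanted, a plain loop that builds the group ids may be easier than a vectorised comparison. The sketch below is one possible way to do it, not a definitive implementation: `split_by_anchor` and `max_span` are hypothetical names, the row that breaks a window is assumed to become the anchor of the next one, and the span is set to 200 so that the sample rows split.

```python
import pandas as pd


def split_by_anchor(df, column, max_span):
    """Split df into consecutive windows anchored on each window's first row."""
    if df.empty:
        return []
    values = df[column].to_numpy()
    anchor = values[0]
    group = 0
    group_ids = []
    for value in values:
        # Start a new window once a row is max_span or more past the anchor.
        if value - anchor >= max_span:
            group += 1
            anchor = value
        group_ids.append(group)
    # Wrap the ids in a Series so pandas treats them as per-row labels,
    # not as column names (the columns here are integers too).
    labels = pd.Series(group_ids, index=df.index)
    return [window for _, window in df.groupby(labels)]


# Sample rows from the question; integer column labels, as with header=None.
df = pd.DataFrame({
    0: ["chr3R"] * 10,
    1: [4174822, 4175400, 4175466, 4175521, 4175603,
        4175619, 4175692, 4175889, 4175966, 4176044],
    2: [4174922, 4175500, 4175566, 4175621, 4175703,
        4175719, 4175792, 4175989, 4176066, 4176144],
})

# Column 1 is the one the question's loop reads (row[1]).
windows = split_by_anchor(df, column=1, max_span=200)
print([len(w) for w in windows])
```

With the real file, the same call with `max_span=5000` would return every window in one list.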