Parse bed file with Pandas

Question

For the sections between count=1s and the start and end; combine overlapping positions and output the median of the counts.

Input

chr         start   stop   strand   count
chr1        0       13320   -       1
chr1        13320   13321   -       2
chr1        13321   13328   -       1
chr1        13328   13342   -       2
chr1        13342   13343   -       18
chr1        13343   13344   -       36
chr1        13344   13345   -       18
chr1        13345   13346   -       6
chr1        13346   16923   -       1
chr1        16923   16942   -       3
chr1        16942   16943   -       2

Output

chr1        13320   13321    2
chr1        13328   13346   18
chr1        16923   16943   2.5

For the second value:

Start 13328 - this is because the 4th value in the table has the start 13328.
This is the row after the second count=1.
Stop 13346 - this is because the 8th value in the table has the stop 13346.
This is the row before the third count=1.
Count 18 - this is the median of counts between the 4th and 8th inclusive.

Here is my code.

from pathlib import Path
import pandas as pd
file = Path("bed_file.bed")
# load with pandas
df = pd.read_csv(file, sep='\t', header=None)

# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

# create new dataframe
newDF = pd.DataFrame(columns=['chr','start', 'stop', 'count'])
# last position
end = df.index[-1]

# parse dataframe
for idx, elem in enumerate(li):
    if elem != li[-1]: 
        next_elem = li[(idx + 1) % len(li)] # next element where count=1
        start = df.iloc[elem]['stop'] # start position 
        stop = df.iloc[next_elem-1]['stop'] # stop position
        if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
            count = df.iloc[elem+1]['count']
            #print(f"start={start}\tstop={stop}\tcount={count}")
        else:
            count = df.iloc[elem+1:next_elem]['count'].median()
            #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr' : df.loc[0,'chr'],
            'start' : start,
            'stop' : stop,
            'count' : count
            
        },ignore_index=True)
    else: # last element in the list
        start = df.iloc[elem]['stop']
        stop = df.iloc[end]['stop']
        count = df.iloc[elem+1:end+1]['count'].median()
        #print(f"start={start}\tstop={stop}\tcount={count}")
        newDF = newDF.append({
            'chr' : df.loc[0,'chr'],
            'start' : start,
            'stop' : stop,
            'count' : count
        },ignore_index=True)

Is there a better way to do this? Is my code Pythonic?

spyr03 · Accepted Answer · 2020-08-11 16:56:02Z

I'll first offer some critique of your code, and then I'll show you how I would approach the problem.

Commented out code should be removed before asking for a code review #print(f"start={start}\tstop={stop}\tcount={count}")
Many of the comments don't add value. # last position doesn't mean much on its own. Why do you want the last position? Why doesn't the code do a good enough job explaining that?
Generally an if/else in a loop where one of branches is only taken once, either at the start or end, can be removed. You can iterate less, and deal with the case explicitly. You can add a sentinel value so you don't have to check if you are at the end of the iterator. You can used the available libraries or built-in functions, which will deal with the case for you.

# load with pandas
df = pd.read_csv(file, sep='\t', header=None)

# set colnames
header = ['chr','start','stop','strand','count']
df.columns = header[:len(df.columns)]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

If the header is cut short len(df.columns) < len(header), the first thing to be cut off is the column df['count']. You then assume it exists straight away after by using it. Which is it? Will it always exist, or sometimes will there not be enough columns? Erring on the side of it always exists, the code becomes

# load with pandas
df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

# index where count=1
col_count = df['count'].tolist()
li = [i for i, n in enumerate(col_count) if n == 1]

...

for idx, elem in enumerate(li):

If you are using pandas (or numpy) it is generally not the best to move the data back and forth between the library and Python. You lose most of the efficiency of the library, and the code generally becomes far less readable.

Don't use names like li. It doesn't give any information to the reader. If you have a list of indices, what will you use the list for? That would make a much better name.

Using pandas more, and renaming gives something like

splitting_indices = df.index[df['count'] == 1].tolist()

for idx, elem in enumerate(splitting_indices):

if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
    count = df.iloc[elem+1]['count']
    #print(f"start={start}\tstop={stop}\tcount={count}")
else:
    count = df.iloc[elem+1:next_elem]['count'].median()

Finding this logic in amongst getting the data out from the dataframe is not easy. This is the core logic, and should be treated as such. Put this in a function at the very least.

def extract_median(df, elem, next_elem):
    if next_elem - (elem+1) == 1: # cases where only one position and we cannot compute median
        count = df.iloc[elem+1]['count']
    else:
        count = df.iloc[elem+1:next_elem]['count'].median()
    return count

Now it should be much more apparent that the comment is bogus. You CAN compute the median of a single element list. So why are we special casing this? df.iloc[elem+1:next_elem] works even if next_elem is only one bigger than elem+1.

def extract_median(df, elem, next_elem):
    return df.iloc[elem+1:next_elem]['count'].median()

And now we can see that a function is probably not necessary.

The approach I would take to implementing this is to try and stay using pandas as long as possible. No loops. No tolist. Since I won't want loops, indices are probably not needed too, so I can limit usage of iloc and df.index.

First, read in the data

df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)

     chr  start   stop strand  count
0   chr1      0  13320      -      1
1   chr1  13320  13321      -      2
2   chr1  13321  13328      -      1
3   chr1  13328  13342      -      2
4   chr1  13342  13343      -     18
5   chr1  13343  13344      -     36
6   chr1  13344  13345      -     18
7   chr1  13345  13346      -      6
8   chr1  13346  16923      -      1
9   chr1  16923  16942      -      3
10  chr1  16942  16943      -      2

Then, find every row of interest. That would be everywhere count is not 1.

df['count'] != 1

0     False
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8     False
9      True
10     True

I want to group all the consecutive rows that are True together. The usual method to group consecutive rows by a column value is

Keep a running tally.
Compare each value in the column with the next one.
If they are the same, don't do anything.
If they are different, add 1 to a running tally.
Associate the tally to that value.
Groupby the tally.

In code

mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()

    count   mask  tally
0       1  False      1
1       2   True      2
2       1  False      3
3       2   True      4
4      18   True      4
5      36   True      4
6      18   True      4
7       6   True      4
8       1  False      5
9       3   True      6
10      2   True      6

Grouping then gives

df.groupby(tally).groups

{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1], dtype='int64'),
 3: Int64Index([2], dtype='int64'),
 4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
 5: Int64Index([8], dtype='int64'),
 6: Int64Index([9, 10], dtype='int64')}

Since you only want the rows where count is not 1, we can reuse the mask to filter them out.

df[mask].groupby(tally).groups

{2: Int64Index([1], dtype='int64'),
 4: Int64Index([3, 4, 5, 6, 7], dtype='int64'),
 6: Int64Index([9, 10], dtype='int64')}

And finally the median is quick to get from a grouper

df[mask].groupby(tally).median()

         start     stop  count
count                         
2      13320.0  13321.0    2.0
4      13343.0  13344.0   18.0
6      16932.5  16942.5    2.5

In the end, the code is a lot shorter

df = pd.read_csv(file, sep='\t', names=('chr', 'start', 'stop', 'strand', 'count'), header=None)
mask = df['count'] != 1
tally = (mask != mask.shift()).cumsum()
df[mask].groupby(tally).median()

I found the answer using agg() function ! df[mask].groupby(tally).agg({'start': np.min, 'stop': np.max, 'count' : np.median}) — PIFASTE
– PIFASTE, Commented Aug 12, 2020 at 10:04

Stack Exchange Network

Parse bed file with Pandas

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Parse bed file with Pandas

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions