I want to create a data frame with Python's pandas by reading a text file. The values are tab-separated, but when I use this code:

import sys
import pandas as pd

query = sys.argv[1]

df = pd.DataFrame()

with open(query) as file_open:

    for line in iter(file_open.readline, ''):

        if line.startswith("#CHROM"):
            columns = line.split("\t")

        if line.startswith("chr7"):
            df = df.append(line.split("\t"))

print df
print len(df)

My output is:

...
0                                                chr7
1                                           158937585
2                                           rs3763427
3                                                   T
4                                                   C
5                                              931.21
6                                                   .
7   AC=2;AF=1.00;AN=2;DP=24;Dels=0.00;FS=0.000;HRu...
8                              GT:DP:GQ:PL:A:C:G:T:IR
9         1/1:24:72.24:964,72,0:0,0:11,12:0,0:0,0:0\n
0                                                chr7
1                                           158937597
2                                                   .
3                                                   C
4                                                  CG
5                                              702.73
6                                                   .
7   AC=2;AF=1.00;AN=2;BaseQRankSum=-1.735;DP=19;FS...
8                              GT:DP:GQ:PL:A:C:G:T:IR
9         1/1:19:41.93:745,42,0:0,0:10,8:0,0:0,0:17\n

[510350 rows x 1 columns]
510350

The text file contains this format:

#CHROM \t POS \t ID \t REF \t ALT \t QUAL \t FILTER \t INFO \n
chr7 \t 149601 \t MERGED_DEL_2_39754 \t T \t . \t 141.35 \t . \t AC=0;AF=0.00;AN=2;DP=37;MQ=37.00;MQ0=0;1000gALT=<DEL>;AF1000g=0.09.. \n
chr7 \t 149616 \t rs190051229 \t C \t . \t 108.65 \t . \t AC=0;AF=0.00;AN=2;DP=35;MQ=37.00;MQ0=0;1000gALT=T;AF1000g=0.00.. \n
...

I want the data frame to look like:

 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO   
 chr7  149601  MERGED..   T      .       141.35    .    AC=0;AF=0.00;A..
 chr7  149616  rs1900..   C      .       108.65    .    AC=0;AF=0.00;A..
 ...

Reading each line with the code above creates a list of the values in that line:

['chr7','149601','MERGED..','T','.','141.35','.','AC=0;AF=0;A..\n']

What is wrong with my code?

Thank you.

Rodrigo

1 Answer

Don’t read the file by hand. Use pandas’ powerful read_csv:

df = pd.read_csv(query, sep='\t')

Full program:

import sys
import pandas as pd

query = sys.argv[1]
df = pd.read_csv(query, sep='\t')
print(df)
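
As for what went wrong in the original loop: passing a flat list to df.append makes pandas treat every element of the list as its own row, which is why the output has one column and the index keeps repeating 0 to 9. The header fields stored in columns were also never used. The sketch below shows two ways around that, assuming Python 3 and a file laid out like the sample in the question. The in-memory SAMPLE data, the chr8 row, and the helper names read_with_loop and read_with_read_csv are made up for illustration. Note that DataFrame.append was removed in pandas 2.0, so collecting rows in a plain list and building the frame once is the safer pattern if you do want to loop.

```python
import io
import pandas as pd

# Made-up sample in the layout the question describes (tab-separated,
# header line starting with "#CHROM"); the chr8 row is only there to
# show the filter working.
SAMPLE = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr7\t149601\tMERGED_DEL_2_39754\tT\t.\t141.35\t.\tAC=0;AF=0.00;AN=2\n"
    "chr7\t149616\trs190051229\tC\t.\t108.65\t.\tAC=0;AF=0.00;AN=2\n"
    "chr8\t200\trs1\tA\tG\t50.00\t.\tAC=1;AF=0.50;AN=2\n"
)


def read_with_loop(handle, chrom="chr7"):
    """Hypothetical helper: collect rows in a list, build the frame once."""
    columns = None
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            columns = fields
        elif fields[0] == chrom:
            rows.append(fields)  # one list per row, not one row per field
    return pd.DataFrame(rows, columns=columns)


def read_with_read_csv(handle, chrom="chr7"):
    """Hypothetical helper: let read_csv parse, then filter on #CHROM."""
    df = pd.read_csv(handle, sep="\t", dtype=str)  # dtype=str keeps raw text
    return df[df["#CHROM"] == chrom].reset_index(drop=True)


by_loop = read_with_loop(io.StringIO(SAMPLE))
by_csv = read_with_read_csv(io.StringIO(SAMPLE))
print(by_csv)
print(len(by_csv))
```

Both helpers should give the same rows here. With a real file you would pass the path (or an open file) instead of io.StringIO, and you can drop dtype=str if you want read_csv to infer numeric types for POS and QUAL. If your file has extra comment lines before the #CHROM header, read_csv would need to be told to skip them, which this sketch does not handle.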