1

I want to extract the whole content of a column from a multi-column data frame using pandas but I am getting only a part of the column.

The code I am using is:

import pandas
import csv
data = pandas.read_csv('data1.csv', usecols = ['dbSNP RS ID'])

import sys  
sys.stdout = open("data2.csv", "w") 
print data

What I get is something like this:

       dbSNP RS ID
0        rs4147951
1        rs2022235
2        rs6425720
3       rs12997193
4        rs9933410
5        rs7142489
...            ...
934963  rs10262938
934964   rs6140985
934965   rs2704067
934966   rs2239441
934967  rs10041689

[934968 rows x 1 columns]

The first 2 lines of the csv file are:

"Probe Set ID","dbSNP RS ID","Chromosome","Physical Position","Strand","ChrX    pseudo-autosomal region 1","Cytoband","Flank","Allele A","Allele B","Associated Gene","Genetic Map","Microsatellite","Fragment Enzyme Type Length Start Stop","Allele Frequencies","Heterozygous Allele Frequencies","Number of individuals","In Hapmap","Strand Versus dbSNP","Copy Number Variation","Probe Count","ChrX pseudo-autosomal region 2","In Final List","Minor Allele","Minor Allele Frequency","% GC","OMIM"

"AFFX-   SNP_10000979","rs4147951","17","66943738","+","0","q24.2","GGATAAGGATGGGCTA[A/G]ATTATCATTGCTGTTA","A","G","ENST00000269080 // intron // 0 // Hs.58351 // ABCA8 // 10351 // ATP-binding cassette, sub-family A (ABC1), member 8 /// ENST00000428549 // intron // 0 // Hs.58351 // ABCA8 // 10351 // ATP-binding cassette, sub-family A (ABC1), member 8 /// ENST00000541225 // intron // 0 // Hs.58351 // ABCA8 // 10351 // ATP-binding cassette, sub-family A (ABC1), member 8 /// ENST00000542396 // intron // 0 // Hs.58351 // ABCA8 // 10351 // ATP-binding cassette, sub-family A (ABC1), member 8 /// NM_007168 // intron // 0 // Hs.58351 // ABCA8 // 10351 // ATP-binding cassette, sub-family A (ABC1), member 8","99.8510 // D17S795 // D17S2182 // --- // --- // deCODE /// 90.7912 // D17S1870 // D17S840 // AFM323TB1 // AFM207VF4 // Marshfield /// 82.3131 // --- // D17S1786 // 147671 // --- // SLM1","D17S795 // downstream // 265562 /// D17S1474E // upstream // 113179","NspI // ACATGT_ACATGT // 536 // 66943408 // 66943943 /// StyI // CCTTGG_CCATGG // 2334 // 66941614 // 66943947","0.3917 // 0.6083 // CEU /// 0.6444 // 0.3556 // CHB /// 0.6000 // 0.4000 // JPT /// 0.5667 // 0.4333 // YRI","0.3833 // CEU /// 0.4889 // CHB /// 0.4444 // JPT /// 0.5667 // YRI","60 // CEU /// 45 // CHB /// 45 // JPT /// 60 // YRI","YES","reverse","---","6","0","YES","A // CEU /// G // CHB /// G // JPT /// G // YRI","0.3917 // CEU /// 0.3556 // CHB /// 0.4000 // JPT /// 0.4333 // YRI","---","---"

Any idea about how to extract the 'dbSNP RS ID' from the 934968 rows??. Thank you very much !

12
  • what do the first couple of lines of your CSV look like? Commented Dec 10, 2015 at 14:07
  • IIUC do you want data.values ? Commented Dec 10, 2015 at 14:09
  • Hi @mirosval, I will edit the question and include the first two lines of the csv file Commented Dec 10, 2015 at 14:10
  • Hi @Fabio, some column values are strings, others are integers Commented Dec 10, 2015 at 14:14
  • Are you refering to the dots in the middle of your second code listing? If so it is just how pandas shows data, you don't want to print all of the 900k columns out... if you actually try to save them, they should all be there... Commented Dec 10, 2015 at 14:16

1 Answer 1

1

IIUC you should read and write again a .csv file with:

data = pandas.read_csv('data1.csv', usecols = ['dbSNP RS ID'])

data.to_csv('data2.csv')

The problem with your code is that the print function actually writes on file only the part of the file that pandas shows in the terminal prompt. When there are too much rows it splits the output adding ... in the middle.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.