I have a data.frame that looks like:
SNP CLST A1 A2 FRQ IMP POS CHR BVAL
1 rs2803291 Brahui C T 0.660000 0 1882185 1 878
2 rs2803291 Balochi C T 0.750000 0 1882185 1 878
3 rs2803291 Hazara C T 0.772727 0 1882185 1 878
4 rs2803291 Makrani C T 0.620000 0 1882185 1 878
5 rs2803291 Sindhi C T 0.770833 0 1882185 1 878
6 rs2803291 Pathan C T 0.681818 0 1882185 1 878
53 rs12060022 Brahui T C 0.0600000 1 3108186 1 982
54 rs12060022 Balochi T C 0.0416667 1 3108186 1 982
55 rs12060022 Hazara T C 0.0000000 1 3108186 1 982
56 rs12060022 Makrani T C 0.0200000 1 3108186 1 982
57 rs12060022 Sindhi T C 0.0625000 1 3108186 1 982
58 rs12060022 Pathan T C 0.0681818 1 3108186 1 982
105 rs870171 Brahui T G 0.2200000 0 3332664 1 976
106 rs870171 Balochi T G 0.3333330 0 3332664 1 976
107 rs870171 Hazara T G 0.3636360 0 3332664 1 976
108 rs870171 Makrani T G 0.1800000 0 3332664 1 976
109 rs870171 Sindhi T G 0.2083330 0 3332664 1 976
110 rs870171 Pathan T G 0.1590910 0 3332664 1 976
157 rs4282783 Brahui G T 0.8400000 1 4090545 1 992
158 rs4282783 Balochi G T 0.9583333 1 4090545 1 992
159 rs4282783 Hazara G T 0.8409090 1 4090545 1 992
160 rs4282783 Makrani G T 0.9000000 1 4090545 1 992
161 rs4282783 Sindhi G T 0.8958330 1 4090545 1 992
162 rs4282783 Pathan G T 0.9772727 1 4090545 1 992
Each SNP locus has several populations (CLST) associated with it, and a frequency (FRQ) for each population. There are L unique SNPs in the full data.frame. I would like to randomly sample 3 SNPs from the data.frame and then take the sum (FRQ_Balochi_SNP1 - FRQ_Pathan_SNP1) * (FRQ_Y_SNP1 - FRQ_Pathan_SNP1) + (FRQ_Balochi_SNP2 - FRQ_Pathan_SNP2) * (FRQ_Y_SNP2 - FRQ_Pathan_SNP2) + (FRQ_Balochi_SNP3 - FRQ_Pathan_SNP3) * (FRQ_Y_SNP3 - FRQ_Pathan_SNP3) over the 3 randomly chosen SNPs. In shorthand: Value = Sum(i = 1 to 3) of (FRQ_Bal_i - FRQ_Pat_i) * (FRQ_Y_i - FRQ_Pat_i), where Y is a given population, for example "Hazara".
I would like my output to be a list of Values from this calculation along with their Y populations.
For example, let's walk through Hazara as our Y population. Say we randomly sample the first, second, and fourth SNPs shown above. The first SNP (rs2803291) gives us (0.75 - 0.681818) * (0.772727 - 0.681818) for a value of 0.006198. The second SNP (rs12060022) gives us (0.0416667 - 0.0681818) * (0.0000 - 0.0681818) for a value of 0.001808. The fourth SNP (rs4282783) gives us (0.9583333 - 0.9772727) * (0.8409090 - 0.9772727) for a value of 0.002583. Summing these, we get 0.006198 + 0.001808 + 0.002583 for a total of 0.01059. Thus the first line of the output file would be
Population Value
Hazara 0.01059
Makrani ???
I would like this done for every population, including Balochi and Pathan if possible.
Pathan will always be zero, because for Y = Pathan the second factor becomes (FRQ_Pathan - FRQ_Pathan). Just an FYI.
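For reference, the calculation described above can be sketched in base R roughly as follows. This is only a minimal sketch under some assumptions: it uses just the SNP, CLST and FRQ columns (retyped here from the table so the snippet runs on its own), it assumes every SNP has exactly one row per population, and `snp_values` is a made-up helper name, not an existing function.

```r
# Toy copy of the SNP, CLST and FRQ columns from the table above
pops <- c("Brahui", "Balochi", "Hazara", "Makrani", "Sindhi", "Pathan")
df <- data.frame(
  SNP  = rep(c("rs2803291", "rs12060022", "rs870171", "rs4282783"), each = 6),
  CLST = rep(pops, times = 4),
  FRQ  = c(0.660000, 0.750000, 0.772727, 0.620000, 0.770833, 0.681818,
           0.0600000, 0.0416667, 0.0000000, 0.0200000, 0.0625000, 0.0681818,
           0.2200000, 0.3333330, 0.3636360, 0.1800000, 0.2083330, 0.1590910,
           0.8400000, 0.9583333, 0.8409090, 0.9000000, 0.8958330, 0.9772727),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every population Y, sum over the chosen SNPs of
# (FRQ_a - FRQ_ref) * (FRQ_Y - FRQ_ref), with a = Balochi and ref = Pathan
snp_values <- function(df, snps, a = "Balochi", ref = "Pathan") {
  sub <- df[df$SNP %in% snps, ]
  # Wide matrix: one row per SNP, one column per population
  # (mean() is a no-op here, assuming one row per SNP/population pair)
  wide <- tapply(sub$FRQ, list(sub$SNP, sub$CLST), mean)
  d_a <- wide[, a] - wide[, ref]   # (FRQ_a - FRQ_ref), one value per SNP
  d_y <- wide - wide[, ref]        # (FRQ_Y - FRQ_ref), for every Y at once
  vals <- colSums(d_a * d_y)       # sum over SNPs, one total per population
  data.frame(Population = names(vals), Value = unname(vals))
}

# Random draw of 3 SNPs (the seed is only there to make the sketch repeatable)
set.seed(1)
picked <- sample(unique(df$SNP), 3)
snp_values(df, picked)

# Fixed draw of the first, second and fourth SNPs in the table;
# with these FRQ values the Hazara row comes out at about 0.01059
snp_values(df, c("rs2803291", "rs12060022", "rs4282783"))
```

Reshaping to a SNP-by-population matrix lets a single call return the value for every Y, and the Pathan row is zero by construction. To repeat the random draw many times, the two sampling lines could be wrapped in `replicate()` or a loop.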