1

I have a data.frame that looks like:

              SNP              CLST A1 A2       FRQ IMP     POS CHR BVAL
    1   rs2803291            Brahui  C  T  0.660000   0 1882185   1  878
    2   rs2803291           Balochi  C  T  0.750000   0 1882185   1  878
    3   rs2803291            Hazara  C  T  0.772727   0 1882185   1  878
    4   rs2803291           Makrani  C  T  0.620000   0 1882185   1  878
    5   rs2803291            Sindhi  C  T  0.770833   0 1882185   1  878
    6   rs2803291            Pathan  C  T  0.681818   0 1882185   1  878
    53  rs12060022           Brahui  T  C 0.0600000   1 3108186   1  982
    54  rs12060022          Balochi  T  C 0.0416667   1 3108186   1  982
    55  rs12060022           Hazara  T  C 0.0000000   1 3108186   1  982
    56  rs12060022          Makrani  T  C 0.0200000   1 3108186   1  982
    57  rs12060022           Sindhi  T  C 0.0625000   1 3108186   1  982
    58  rs12060022           Pathan  T  C 0.0681818   1 3108186   1  982
    105   rs870171           Brahui  T  G 0.2200000   0 3332664   1  976
    106   rs870171          Balochi  T  G 0.3333330   0 3332664   1  976
    107   rs870171           Hazara  T  G 0.3636360   0 3332664   1  976
    108   rs870171          Makrani  T  G 0.1800000   0 3332664   1  976
    109   rs870171           Sindhi  T  G 0.2083330   0 3332664   1  976
    110   rs870171           Pathan  T  G 0.1590910   0 3332664   1  976
    157  rs4282783           Brahui  G  T 0.8400000   1 4090545   1  992
    158  rs4282783          Balochi  G  T 0.9583333   1 4090545   1  992
    159  rs4282783           Hazara  G  T 0.8409090   1 4090545   1  992
    160  rs4282783          Makrani  G  T 0.9000000   1 4090545   1  992
    161  rs4282783           Sindhi  G  T 0.8958330   1 4090545   1  992
    162  rs4282783           Pathan  G  T 0.9772727   1 4090545   1  992

Each SNP locus has certain populations associated with it and a certain frequency (FRQ) for each population. There are "L" amount of unique SNPs in the total data.frame. I would like to randomly sample 3 SNPs from the data.frame and then I would like to take the sum of (FRQ_balochi_SNP1 - FRQ_Pathan_SNP1)* *(FRQ_Y_SNP1 - FRQ_Pathan_SNP1) across + (FRQ_balochi_SNP2 - FRQ_Pathan_SNP2) * (FRQ_Y_SNP2 - FRQ_Pathan_SNP2) + (FRQ_balochi_SNP3 - FRQ_Pathan_SNP3) * (FRQ_Y_SNP3 - FRQ_Pathan_SNP3) using the "3" randomly generated SNPs. The notation looks something like Value = Sum(i to 3) of (FRQ_Bal_i - FRQ_Pat_i) * (FRQ_Y_i - FRQ_Pat_i). Y is a given population. For example: "Hazara".

I would like my output to be a list of Values from this calculation along with their Y populations.

For example, let's walk through Hazara as our Y population. We randomly sample and get SNP1, SNP2, and SNP4. The first SNP (rs2803291) gives us (0.75 - 0.681818) * (0.772727 - 0.681818) for a value of 0.006198. The second SNP (rs12060022) gives us (0.041666 - 0.0681818) * (0.0000 - 0.061818) for a value of 0.001639. The fourth SNP (rs4282783) gives us (0.958333 - 0.9772727) * (0.8409090 - 0.9772727) for a value of 0.002582. Summing our values together we would get 0.006198+0.001639+0.002582 for a total sum of 0.01402. Thus the first line of the output file would be

Population   Value
Hazara       0.01402
Makrani      ???

I would like this done for every population, including Balochi and Pathan if possible.

3
  • Why did you rollback the edit cleaning up the alignment? Commented Jul 22, 2016 at 16:28
  • I didn't mean to, I was mid-edit changing some of the math to make it cleaner Commented Jul 22, 2016 at 16:30
  • Pathan will always be zero because the function subtracts Y - Pathan. Just an fyi. Commented Jul 22, 2016 at 17:06

1 Answer 1

2

I would create a helper function then place it into a looping mechanism that will try out each label:

library(dplyr)

snp_sum <- function(SNP, FRQ, CLST) {
  (FRQ[CLST == "Balochi"] - FRQ[CLST == "Pathan"]) * (FRQ[CLST == SNP] - FRQ[CLST == "Pathan"])
}

sum_df <- function(mydf, clst_list) {
  lst <- lapply(clst_list, function(x) {
           mydf %>% group_by(SNP) %>%
           summarise(FRQ_SUM=snp_sum(x, FRQ, CLST)) %>%
           summarise(Value=sum(FRQ_SUM[sample(n(), 3)]))
         })
  cbind.data.frame(Population=clst_list, do.call("rbind", lst))
}

sum_df(df1, unique(df1$CLST))
#   Population        Value
# 1     Brahui 0.0134297098
# 2    Balochi 0.0353677606
# 3     Hazara 0.0400308238
# 4    Makrani 0.0008918497
# 5     Sindhi 0.0161916643
# 6     Pathan 0.0000000000

Edit

Possible speed up with a built-in R package called parallel:

library(parallel)
no_cores <- detectCores() - 1L
cl <- makeCluster(no_cores)
clusterExport(cl, c("df1", "snp_sum"))
clusterEvalQ(cl, library(dplyr))

sum_parallel <- parLapply(cl, unique(df1$CLST), function(x) {

  df1 %>% group_by(SNP) %>%
    summarise(FRQ_SUM = snp_sum(x, FRQ, CLST)) %>%
    summarise(Value=sum(FRQ_SUM[sample(n(), 3)]))
})

cbind.data.frame(Population=unique(df1$CLST), do.call("rbind", sum_parallel))

stopCluster(cl)
Sign up to request clarification or add additional context in comments.

8 Comments

do you have any estimates on how long this would take if my actual file was 3,000,000 unique SNPs with 52 populations so like 150,000,000 lines long, and if I wanted to do a random sample of 5,000 SNPs? Just like a rough ballpark estimate. It takes like 20 minutes to read in the file itself. But are we talking an hour, few hours, or a few days?
Gradually build up to see how time increases. Try 10k rows, then 100k,and monitor the timing.
The bigger issue will be memory usage. Do you have enough working memory to read in that many rows?
You can probably run quicker if you use a parallel process. Would you like to try it out.
I added a way with library(parallel). It does not have to be installed. You may see speed increases.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.