R - Removing rows from dataframe based on multiple conditions for a single column

Question

I have the following example dataframe in R:

SampleID <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "E", "E", "E", "E", "F", "F")
Analyte <- c("A1", "A1", "A2", "A2", "B1", "B2", "C1", "C1", "C1", "C2", "C2", "C2", "D1", "D2", "E1", "E1", "E2", "E2", "F1", "F2")
Fraction <- c("Dissolved", "Total", "Dissolved", "Total", "Total", "Total", "Dissolved", "Suspended", "Total", "Dissolved", "Suspended", "Total", "Unknown", "Unknown", "Dissolved", "Suspended", "Dissolved", "Suspended", "Dissolved", "Dissolved")
Concentration <- c(4.2, 5.6, 8.6, 11.2, 2.1, 9.6, 15.6, 28.7, 42.3, 18.3, 23.2, 48.6, 6.4, 28.8, 9.1, 32.5, 36.4, 24.5, 10.7, 3.4)
MyData <- data.frame(SampleID, Analyte, Fraction, Concentration)

MyData

   SampleID Analyte  Fraction Concentration
1         A      A1 Dissolved           4.2
2         A      A1     Total           5.6
3         A      A2 Dissolved           8.6
4         A      A2     Total          11.2
5         B      B1     Total           2.1
6         B      B2     Total           9.6
7         C      C1 Dissolved          15.6
8         C      C1 Suspended          28.7
9         C      C1     Total          42.3
10        C      C2 Dissolved          18.3
11        C      C2 Suspended          23.2
12        C      C2     Total          48.6
13        D      D1   Unknown           6.4
14        D      D2   Unknown          28.8
15        E      E1 Dissolved           9.1
16        E      E1 Suspended          32.5
17        E      E2 Dissolved          36.4
18        E      E2 Suspended          24.5
19        F      F1 Dissolved          10.7
20        F      F2 Dissolved           3.4

I would like to do the following:

For each SampleID, if an Analyte has a "Total" Fraction reported, retain only that row for the Analyte and remove rows with any other Fraction value (i.e., Dissolved, Suspended) for that Analyte.
If an Analyte for a SampleID includes both Dissolved and Suspended in the Fraction column (and no other values for Fraction), sum the concentrations for Dissolved and Suspended and add a row for that Analyte with the Fraction column labeled Total and the Concentration column listing the sum. Remove the original rows for Dissolved and Suspended for that Analyte.

So for the dataframe above, the two Analytes of SampleID "A" have Dissolved and Total, so I would want to remove the rows with the Dissolved Fraction. For SampleID "C", I would want to remove the Dissolved and Suspended Fractions of both Analytes and just keep the rows with Total. And lastly, for SampleID "E", the Dissolved and Suspended Fractions for each of the two Analytes would be summed together and the result would be a new row for each Analyte that represents the sum (relabeled as Total), and the rows associated with the Dissolved and Suspended Fractions would be removed.

The output of the above dataframe MyData would be the following:

   SampleID Analyte  Fraction Concentration
2         A      A1     Total           5.6
4         A      A2     Total          11.2
5         B      B1     Total           2.1
6         B      B2     Total           9.6
9         C      C1     Total          42.3
12        C      C2     Total          48.6
13        D      D1   Unknown           6.4
14        D      D2   Unknown          28.8
15        E      E1     Total          41.6 
17        E      E2     Total          60.9
19        F      F1 Dissolved          10.7
20        F      F2 Dissolved           3.4

Note that the example I have provided is just a small subset of a much larger dataset that includes hundreds of SampleIDs, but the Fraction column can only equal the values listed in the original dataframe above (i.e., Dissolved, Suspended, Total, or Unknown).

Thank you!

Anoushiravan R · Accepted Answer · 2021-05-21 10:20:40Z

1

You can also use the following solution. It may sound a bit verbose but will also get the job done:

library(dplyr)
library(purrr)


MyData %>%
  group_split(SampleID, Analyte) %>%
  map(~ if("Total" %in% .x$Fraction) {
    .x %>% filter(Fraction == "Total")} else {
      .x
    }) %>%
  map(~ if(all(c("Dissolved", "Suspended") %in% .x$Fraction)) {
    add_row(.x, SampleID = .x$SampleID[1], Analyte = .x$Analyte[1], 
            Fraction = "Total", Concentration = sum(.x$Concentration))
  } else {
    .x
  }) %>%
  map_dfr(~ if("Total" %in% .x$Fraction) {
    .x %>% filter(Fraction == "Total")} else {
      .x
    })


# A tibble: 12 x 4
   SampleID Analyte Fraction  Concentration
   <chr>    <chr>   <chr>             <dbl>
 1 A        A1      Total               5.6
 2 A        A2      Total              11.2
 3 B        B1      Total               2.1
 4 B        B2      Total               9.6
 5 C        C1      Total              42.3
 6 C        C2      Total              48.6
 7 D        D1      Unknown             6.4
 8 D        D2      Unknown            28.8
 9 E        E1      Total              41.6
10 E        E2      Total              60.9
11 F        F1      Dissolved          10.7
12 F        F2      Dissolved           3.4

edited May 21, 2021 at 10:20

answered May 20, 2021 at 19:55

Anoushiravan R

22k3 gold badges22 silver badges44 bronze badges

Sign up to request clarification or add additional context in comments.

11 Comments

akrun Over a year ago

Have you tried with if/else if/else in a single map

akrun Over a year ago

What I find in your post is that you are looping 3 times with map. You have a single if/else in each map.

akrun Over a year ago

Or are you saying that it should be sequential after each filter

akrun Over a year ago

In some cases, it is needed sequential i.e. all(c("Dissolved", "Suspended") may not be TRUE if it is not filtered. I haven't run the whole code

Anoushiravan R Over a year ago

Oh thank god! I don't have to be worried about OP running into a problem with my solution. Yes it's weird I noticed sometimes it doesn't run. Thank you for your time I really appreciate your help :)

|

Onyambu · Accepted Answer · 2021-05-20 16:57:35Z

This could be done as:

library(tidyverse)
MyData %>%
  pivot_wider(c(SampleID, Analyte),Fraction, values_from = Concentration) %>%
  mutate(Total = coalesce(Total, Dissolved + Suspended), 
         Dissolved = ifelse(is.na(Total)&is.na(Suspended), Dissolved, NA),
         Suspended = ifelse(is.na(Total)&is.na(Dissolved), Suspended, NA)) %>%
  pivot_longer(-c(SampleID, Analyte), values_drop_na = TRUE)

# A tibble: 12 x 4
   SampleID Analyte name      value
   <chr>    <chr>   <chr>     <dbl>
 1 A        A1      Total       5.6
 2 A        A2      Total      11.2
 3 B        B1      Total       2.1
 4 B        B2      Total       9.6
 5 C        C1      Total      42.3
 6 C        C2      Total      48.6
 7 D        D1      Unknown     6.4
 8 D        D2      Unknown    28.8
 9 E        E1      Total      41.6
10 E        E2      Total      60.9
11 F        F1      Dissolved  10.7
12 F        F2      Dissolved   3.4

Collectives™ on Stack Overflow

R - Removing rows from dataframe based on multiple conditions for a single column

2 Answers 2

11 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

11 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related