4

I have the following query in which I want to group by treatment_selections.treatment_id and select the treatments.name column to be called:

@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('DISTINCT ON (treatment_selections.treatment_id) treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")

I get the dreaded error:

PG::GroupingError: ERROR:  column "treatment_selections.treatment_id" must appear in the GROUP BY clause or be used in an aggregate function

So I switched to the following query:

@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatments.name, treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN #{params[:start_year]} AND #{params[:end_year]}")  
.group('treatment_selections.treatment_id')

Which I know won't work because of not referencing treatments.name in the group clause. But I figured the top method should of worked as I'm not grouping by anything. I understand that using such methods as AVG and SUM are not needed to be referenced in the group clause, but what about columns that don't reference any aggregate functions?

I have seen that nesting queries is a possible way of doing what I'm after, but I'm unsure of how best to implement this using the above query. Hoping someone could help me out here.

Log

SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield FROM "trials" INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id" INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id" INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id" INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id" WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018) GROUP BY treatment_selections.treatment_id)
7
  • 1
    one minor comment - don't use #{params[:start_year]} (read about SQL injections) Commented Mar 15, 2019 at 7:16
  • Thanks for the feedback @IgorKasyanchuk Commented Mar 15, 2019 at 7:23
  • You could try to add .order('treatment_selections.treatment_id') to the first query. Commented Mar 15, 2019 at 9:05
  • There are a lot of SO posts could be found by "Postgresql must appear in the GROUP BY clause or be used in an aggregate function" query - you can choose appropriate one. Commented Mar 15, 2019 at 9:08
  • Thanks @PavelMikhailyuk the order method still gives me that same error. I know there are a lot of other posts around this error, but each one that is similar in some way to my problem just doesn't seem to work for me. Commented Mar 15, 2019 at 22:25

2 Answers 2

2
+25

Selecting multiple columns (without aggregation) and using aggregate functions together won't be possible, unless you group by the selected columns - otherwise there is no way to determine how the average should be calculated (entire data set vs grouped by something). You could do this -

@search = Trial.joins(:quality_datum, treatment_selections: :treatment)
.select('treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield')
.where("EXTRACT(year from season_year) BETWEEN ? AND ?", params[:start_year], params[:end_year])  
.group('treatment_selections.treatment_id, treatments.name')

Although this might not work well for your use case if one treatments.id can be associated with mutiple treatment.name

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks Jeechu, I have tried this method and the treatments.name throws the average calculations off and gives an undesired result.
What's the result you're looking for? Average across all records, or grouped by id?
I'm looking to average by quality_data.yield and group by treatment_selections.treatment_id whilst allowing to call on other associated columns.
0

I am not expert on Rails but lets analyze the logged query:

SELECT treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id

Maybe you are relying in the clause DISTINCT ON to make this work without specifying both columns. But as you see in the log, this is not being translated into SQL.

SELECT [missing DISTINCT ON(treatment_selections.treatment_id)] treatment_selections.treatment_id, treatment.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id

But even if you managed to force Rails to implement DISTINCT ON, you might not get your intended result because DISTINCT ON should return only one row per treatment_id.

The standard SQL way is to specify both columns as grouping in the aggregation:

If it is the case that treatment_id has a 1:1 relationship to treatment_name, then if you run the query without the AVG function (and without DISTINCT ON), the data would look similar to:

|   treatment_id    |       name          |  yield    |  
------------------------------------------------------
|        1          |   treatment 1       |    0.50   |
|        1          |   treatment 1       |    0.45   |
|        2          |   treatment 2       |    0.65   |
|        2          |   treatment 2       |    0.66   |
|        3          |   treatment 3       |    0.85   |

Now to use the average function you must aggregate by (both) treatment_id and treatment_name.

The reason you must specify both is because the database manager assumes that all the columns in the resulting data set are not related among each other. So, aggregating by both columns

SELECT treatment_selections.treatment_id, treatments.name, AVG(quality_data.yield) as yield
FROM "trials"
INNER JOIN "treatment_selections" ON "treatment_selections"."trial_id" = "trials"."id"
INNER JOIN "quality_data" ON "quality_data"."treatment_selection_id" = "treatment_selections"."id"
INNER JOIN "treatment_selections" "treatment_selections_trials" ON "treatment_selections_trials"."trial_id" = "trials"."id"
INNER JOIN "treatments" ON "treatments"."id" = "treatment_selections_trials"."treatment_id"
WHERE (EXTRACT(year from season_year) BETWEEN 2018 AND 2018)
GROUP BY treatment_selections.treatment_id, treatments.name

will give you the following result:

|   treatment_id    |       name          |   AVG(yield)   |  
------------------------------------------------------------
|        1          |   treatment 1       |      0.475     |
|        2          |   treatment 2       |      0.655     |
|        3          |   treatment 3       |      0.85      |

To understand this better, if the resulting data in the first two columns was not related; for example:

|   year    |       name          |   yield   |  
-----------------------------------------------
|    2000   |   treatment 1       |    0.1    |
|    2000   |   treatment 1       |    0.2    |
|    2000   |   treatment 2       |    0.3    |
|    2000   |   treatment 3       |    0.4    |
|    2001   |   treatment 2       |    0.5    |
|    2001   |   treatment 3       |    0.6    |
|    2002   |   treatment 3       |    0.7    |

you must still group by year and name and, in this case, the average function would only be used when year and name are the same (note that it is not possible to do otherwise) resulting:

|   year    |       name          |   AVG(yield)   |  
---------------------------------------------------
|    2000   |   treatment 1       |     0.15       |
|    2000   |   treatment 2       |     0.3        |
|    2000   |   treatment 3       |     0.4        |
|    2001   |   treatment 2       |     0.5        |
|    2001   |   treatment 3       |     0.6        |
|    2002   |   treatment 3       |     0.7        |

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.