How do I create an agent that does aggregations on the data from the connected datastore?

The Aim:-

I want to create an agent connected to a dataset. The user should be able to ask questions about the data, and the agent should return responses by running SQL-like analytical queries.

Specifically, I expect the agent to handle:

  • Selection: filtering rows based on conditions across multiple columns.
  • Projection: returning values from another column.
  • Aggregations: average, sum, min, max, and count on filtered subsets.
  • Simple computations: sum, difference, or percentage if needed (I doubt I will need this much).
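
To make the four operations above concrete, here is a minimal sketch in plain Python over a few made-up rows. Only the column names game_date, events, and description come from my dataset; the release_speed column and all of the values are assumptions added purely for illustration, and this is not how the agent or the datastore tool runs queries internally.

```python
from statistics import mean

# Hypothetical sample rows. release_speed is an assumed numeric column,
# included only so there is something to aggregate.
rows = [
    {"game_date": "2024-04-26", "events": "strikeout", "description": "swinging_strike", "release_speed": 95.1},
    {"game_date": "2024-04-26", "events": "strikeout", "description": "called_strike", "release_speed": 88.4},
    {"game_date": "2024-04-26", "events": "single", "description": "hit_into_play", "release_speed": 91.0},
    {"game_date": "2024-04-27", "events": "strikeout", "description": "swinging_strike", "release_speed": 97.3},
]

# Selection: keep rows matching conditions on more than one column.
selected = [
    r for r in rows
    if r["game_date"] == "2024-04-26" and r["events"] == "strikeout"
]

# Projection: return the values of another column for those rows.
descriptions = [r["description"] for r in selected]

# Aggregation: summary statistics over the filtered subset.
speeds = [r["release_speed"] for r in selected]
summary = {
    "count": len(speeds),
    "avg": mean(speeds),
    "min": min(speeds),
    "max": max(speeds),
}

# Simple computation: percentage of that day's rows that were strikeouts.
same_day = [r for r in rows if r["game_date"] == "2024-04-26"]
strikeout_pct = 100 * len(selected) / len(same_day)

print(descriptions, summary, round(strikeout_pct, 1))
```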

What I did:-

  1. Created a Default Generative Playbook and connected it with a datastore tool.
  2. For now, I only want it to handle one test question reliably. The question is: “What are all the pitch results when the event is a strikeout on 26th April, 2024?”

To answer the above question, I have defined the steps for the agent as follows:

  1. Filter rows by the “game_date” column. (Do I need to tell it which format the dates in the dataset are in, or will it understand the YYYY-MM-DD format on its own?)
  2. Filter again by the “events” column for a specific event.
  3. Return all corresponding values from the description column.
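
For reference, the three steps above correspond to the following SQL-style query. This is a minimal sketch using Python's built-in sqlite3 module on an in-memory table, not BigQuery or the datastore tool itself. The table name pitches, the helper to_iso, and the sample rows are assumptions made up for illustration. The helper shows one way to normalize the date as written in the test question into YYYY-MM-DD before filtering.

```python
import re
import sqlite3
from datetime import datetime


def to_iso(date_text):
    """Convert text like '26th April, 2024' to '2024-04-26' (hypothetical helper)."""
    # Drop the ordinal suffix (st/nd/rd/th) and the comma, then parse.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", date_text).replace(",", "")
    return datetime.strptime(cleaned, "%d %B %Y").date().isoformat()


# In-memory stand-in for the real table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitches (game_date TEXT, events TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO pitches VALUES (?, ?, ?)",
    [
        ("2024-04-26", "strikeout", "swinging_strike"),
        ("2024-04-26", "strikeout", "called_strike"),
        ("2024-04-26", "single", "hit_into_play"),
        ("2024-04-27", "strikeout", "swinging_strike"),
    ],
)

# Step 1 filters on game_date, step 2 on events, step 3 projects description.
query = """
    SELECT description
    FROM pitches
    WHERE game_date = ? AND events = ?
    ORDER BY rowid
"""
target_date = to_iso("26th April, 2024")
results = [row[0] for row in conn.execute(query, (target_date, "strikeout"))]
print(results)
```

Running the same filter outside the agent gives a ground-truth answer to compare the agent's response against.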

Constraints I Set :

  • The agent should never guess answers; all outputs must come directly from the data.

  • The agent is allowed to use aggregation if the question requires it.

Datastore Tool Setup

This playbook is connected to a structured-data datastore that I imported from BigQuery.

I configured the datastore so only the relevant columns are enabled with the right properties:

  • description (key property): Searchable, Indexable, Retrievable

  • events: Searchable, Indexable, Retrievable, Dynamic Facetable

  • game_date: Indexable, Retrievable, Dynamic Facetable

  • All other columns have Searchable, Indexable, Retrievable, Dynamic Facetable turned off.

Problems

  1. When I ask the agent the one test question, it says that it can’t do aggregations on the dataset. It is, however, able to access the different columns in the dataset.

  2. While uploading the dataset to BigQuery, I set it to detect the schema automatically. It automatically defined one of the columns, “description”, as the key property. To test whether the agent can see the dataset, I asked it a simple question: “Give me all the unique entries in the ‘xyz’ column.” The agent is able to fetch some (or sometimes all) of the unique entries under a specific column name, but it can’t give me any values from the description column.

So my questions are: first, why does auto schema detection set description as the key property, and second, is that the reason I am unable to fetch values from the description column? If so, how do I fix it?

  3. The agent is connected to a structured dataset from BigQuery. When I ask it “Give me all the unique entries in the ‘xyz’ column” for the events column, why does it return all of the unique entries in one test, but only two or three of them in a different test of the same agent?

Hi Gaurangi_Garg,

Let’s go through your problems one by one:

Problem 1: The agent may be misinterpreting the query or encountering a platform-specific limitation of the datastore tool related to aggregation within a Default Generative Playbook. If the agent is attempting an aggregation like COUNT by default when it shouldn’t, try rewording your test question to be more specific about listing or returning the values. For example, you could say, “List the values in the description column for rows where events is ‘strikeout’ and game_date is ‘2024-04-26’.” On the other hand, if the agent states that it can’t perform any aggregation, even though your constraints allow it, this likely points to a configuration mismatch or a limitation of the particular datastore connector or playbook type in use.

Problem 2: A possible workaround is to ensure that the column you need to project (i.e. return values from) is not set as the Key Property. To do this, start by reconfiguring the datastore setup and manually editing the schema or properties. If you have a true unique identifier, such as a pitch ID or row number, consider assigning that as the Key Property instead. Make sure that the description column is treated as a standard column, configured to be Searchable, Indexable, and Retrievable (as it currently is), but not as the Key Property. With the Key Property designation removed from the description column, the agent’s tool should be able to treat it as a regular column and return its values during the Projection step of your test query.
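
If the dataset has no natural unique identifier, one option is to add a row number before loading it. The following is a minimal sketch using Python's standard csv module on an in-memory sample. The column name row_id and the sample rows are assumptions for illustration, and you would still need to select that column as the Key Property in the datastore schema yourself.

```python
import csv
import io

# Made-up sample standing in for the exported dataset.
source_csv = io.StringIO(
    "game_date,events,description\n"
    "2024-04-26,strikeout,swinging_strike\n"
    "2024-04-26,single,hit_into_play\n"
)

reader = csv.DictReader(source_csv)
output = io.StringIO()
writer = csv.DictWriter(
    output,
    fieldnames=["row_id"] + reader.fieldnames,  # row_id is a hypothetical key column
    lineterminator="\n",
)
writer.writeheader()

# Number the rows from 1 so every record gets a unique identifier.
for row_id, row in enumerate(reader, start=1):
    writer.writerow({"row_id": row_id, **row})

print(output.getvalue())
```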

Problem 3: It’s generally best to accept this limitation, as retrieving a complete, exhaustive list of unique entries for a high-cardinality column is often not the intended use case for a generative agent. The primary function of the agent is to answer specific questions about the data. You can consider using Facets to efficiently get the unique values (and their counts) in a column for filtering and summarization. To verify that the agent is working correctly for unique values, test it with a column that has a small and stable number of unique entries, such as a “starter_batter” column with only two values. If the agent still produces inconsistent results, the issue is likely due to a strict platform limitation on result set size.
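
Conceptually, a facet over a column is the set of its distinct values together with a count for each. The sketch below computes that offline with Python's collections.Counter on made-up values, so that you have a reference list to compare the agent's answers against. It does not call the datastore or any facet API.

```python
from collections import Counter

# Made-up values standing in for the events column.
events = ["strikeout", "single", "strikeout", "walk", "single", "strikeout"]

# Distinct values with their counts, similar in spirit to a facet result.
facet = Counter(events)

# The complete list of unique entries, sorted for stable comparison.
unique_values = sorted(facet)

print(unique_values)
print(facet.most_common())
```

If the agent returns fewer values than this reference list, the gap comes from how many results the tool retrieved rather than from the data itself.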

Additionally, you may refer to this documentation, which might help you with your use case.