
We have a DynamoDB table whose primary key consists of a hash key and a range key.

Hash = date.random_number
Range = timestamp

How can we get items between timestamp X and timestamp Y? Since the hash key has a random number appended, the query has to be fired once for each possible random value. Is it possible to pass multiple hash key values with a single RangeKeyCondition?

What would be most efficient in terms of cost and time?

The random number ranges from 1 to 10.

  • Your question is not quite clear. What do you mean by ", and, any why i can get items from X time stamp to Y time stamp." Commented Apr 16, 2015 at 12:47
  • I guessed what the OP was trying to say. It looks like they don't know how to get the current time minus x minutes, or how to put two time instants into a BETWEEN query. Commented Apr 16, 2015 at 13:16
  • Why do you need the random number after the date? You have a timestamp that has date and time, right? What queries have you tried? Do you know how to get current time - 10 minutes? Commented Apr 16, 2015 at 13:17

1 Answer


If I understood correctly, you have a table with the following definition of Primary Keys:

Hash Key  : date.random_number 
Range Key : timestamp

One thing to keep in mind is that, whether you are using GetItem or Query, you must be able to compute the hash key in your application in order to successfully retrieve one or more items from your table.

It makes sense to use the random numbers as part of your hash key so your records are evenly distributed across the DynamoDB partitions. However, you have to do it in a way that lets your application still calculate those numbers when you need to retrieve the records.
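To make that concrete, here is a minimal sketch (plain Java, class and method names are illustrative) of how an application can compute the key with a random suffix on the write path and re-derive every possible key for a day on the read path, assuming the suffix range 1 to 10 from the question:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class HashKeyScheme {
    static final int SUFFIX_COUNT = 10; // random number range 1..10, as in the question

    // Write path: pick a random suffix so writes spread across partitions.
    static String writeKey(LocalDate date) {
        int n = ThreadLocalRandom.current().nextInt(1, SUFFIX_COUNT + 1);
        return date + "." + n; // e.g. "2015-04-20.7"
    }

    // Read path: the application can re-derive every possible key for a day,
    // which is what makes the randomized key still queryable.
    static List<String> readKeys(LocalDate date) {
        List<String> keys = new ArrayList<>();
        for (int n = 1; n <= SUFFIX_COUNT; n++) {
            keys.add(date + "." + n);
        }
        return keys;
    }
}
```

The essential constraint is that the suffix range is fixed and known to the reader; a suffix drawn from an unbounded range could not be re-derived at query time.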

With that in mind, let's create the query needed for the specified requirements. The native AWS DynamoDB operations that you have available to obtain several items from your table are:

Query, BatchGetItem and Scan
  • In order to use BatchGetItem you would need to know beforehand the entire primary key (Hash Key and Range Key), which is not the case.

  • The Scan operation will literally go through every record of your table, something that in my opinion is unnecessary for your requirements.

  • Lastly, the Query operation lets you retrieve one or more items by applying the EQ (equality) operator to the Hash Key, together with a number of other operators for the Range Key that you can use when you don't have the entire Range Key or want to match more than one item.

The operator options for the Range Key condition are: EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN

It seems to me that the most suitable operator for your requirements is BETWEEN. That said, here is how you could build the query with the AWS SDK for Java Document API:

Table table = dynamoDB.getTable(tableName);

String hashKey = "<YOUR_COMPUTED_HASH_KEY>";
String timestampX = "<YOUR_TIMESTAMP_X_VALUE>";
String timestampY = "<YOUR_TIMESTAMP_Y_VALUE>";

RangeKeyCondition rangeKeyCondition =
    new RangeKeyCondition("RangeKeyAttributeName").between(timestampX, timestampY);

ItemCollection<QueryOutcome> items = table.query("HashKeyAttributeName", hashKey,
    rangeKeyCondition,
    null, // FilterExpression - not used in this example
    null, // ProjectionExpression - not used in this example
    null, // ExpressionAttributeNames - not used in this example
    null); // ExpressionAttributeValues - not used in this example

You might want to look at the following post to get more information about DynamoDB Primary Keys: DynamoDB: When to use what PK type?

QUESTION: My concern is querying multiple times because of the random_number attached to the key. Is there a way to combine these queries and hit DynamoDB once?

Your concern is completely understandable. However, the only way to fetch all the records via BatchGetItem is to know the entire primary key (HASH + RANGE) of every record you intend to get. Although minimizing the HTTP round trips to the server might seem to be the best solution at first sight, the documentation actually suggests doing exactly what you are doing, to avoid hot partitions and uneven use of your provisioned throughput:

Design For Uniform Data Access Across Items In Your Tables

"Because you are randomizing the hash key, the writes to the table on each day are spread evenly across all of the hash key values; this will yield better parallelism and higher overall throughput. [...] To read all of the items for a given day, you would still need to Query each of the 2014-07-09.N keys (where N is 1 to 200), and your application would need to merge all of the results. However, you will avoid having a single "hot" hash key taking all of the workload."
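The query-each-key-and-merge pattern the quote describes can be sketched as follows. Here `queryFn` stands in for `table.query(...)` so the merging logic can be shown without a live DynamoDB connection; the class and parameter names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class MergedQuery {
    // Runs the per-suffix query (hashKey -> items) for every suffix of a day
    // and merges the results, as the guideline suggests. In a real application
    // queryFn would wrap table.query(hashKey, rangeKeyCondition).
    static <T> List<T> queryAllSuffixes(String date, int suffixCount,
                                        Function<String, List<T>> queryFn) {
        List<T> merged = new ArrayList<>();
        for (int n = 1; n <= suffixCount; n++) {
            merged.addAll(queryFn.apply(date + "." + n));
        }
        return merged;
    }
}
```

Since the per-suffix queries are independent, they could also be issued in parallel (for example with an ExecutorService) to reduce wall-clock latency; either way it is still ten Query calls, not one.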

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

Here is another interesting point, suggesting moderate use of reads within a single partition: if you remove the random number from the hash key so you can get all records in one shot, you are likely to run into this issue regardless of whether you use Scan, Query, or BatchGetItem:

Guidelines for Query and Scan - Avoid Sudden Bursts of Read Activity

"Note that it is not just the burst of capacity units the Scan uses that is a problem. It is also because the scan is likely to consume all of its capacity units from the same partition because the scan requests read items that are next to each other on the partition. This means that the request is hitting the same partition, causing all of its capacity units to be consumed, and throttling other requests to that partition. If the request to read data had been spread across multiple partitions, then the operation would not have throttled a specific partition."

And lastly, because you are working with time series data, it might be helpful to look into some best practices suggested by the documentation as well:

Understand Access Patterns for Time Series Data

For each table that you create, you specify the throughput requirements. DynamoDB allocates and reserves resources to handle your throughput requirements with sustained low latency. When you design your application and tables, you should consider your application's access pattern to make the most efficient use of your table's resources.

Suppose you design a table to track customer behavior on your site, such as URLs that they click. You might design the table with a hash-and-range primary key, with Customer ID as the hash attribute and date/time as the range attribute. In this application, customer data grows indefinitely over time; however, the application might show an uneven access pattern across the items, where the latest customer data is more relevant: the latest items are accessed frequently, and as time passes they are accessed less, until eventually the oldest items are rarely touched.

If this is a known access pattern, you could take it into consideration when designing your table schema. Instead of storing all items in a single table, you could use multiple tables, for example one per month or week. For the table storing data from the latest month or week, where the data access rate is high, request higher throughput; for tables storing older data, you could dial down the throughput and save on resources.

You can save on resources by storing "hot" items in one table with higher throughput settings, and "cold" items in another table with lower throughput settings. You can remove old items by simply deleting the tables. You can optionally backup these tables to other storage options such as Amazon Simple Storage Service (Amazon S3). Deleting an entire table is significantly more efficient than removing items one-by-one, which essentially doubles the write throughput as you do as many delete operations as put operations.

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
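As a sketch of that table-per-period idea, the application would derive the table name from the time period and then create, dial down, or delete whole tables as periods roll over. The naming scheme below is purely illustrative, not an AWS convention:

```java
import java.time.YearMonth;

public class TimeSeriesTables {
    // One table per month: the "hot" current-month table gets high throughput,
    // older tables are dialed down and eventually deleted wholesale instead of
    // removing their items one by one.
    static String tableFor(YearMonth month) {
        return "events_" + month; // e.g. "events_2015-04"
    }
}
```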


4 Comments

spin, yes you got it right. Since there is a random_number appended to the hash key, we have to query 10 times if random_number is between 1 and 10.
String timestampX = "2015-04-20_07:37:30.366";
String timestampY = "2015-04-20_07:37:33.366";
RangeKeyCondition rangeKeyCondition =
    new RangeKeyCondition("timestamp").between(timestampX, timestampY);
int i = 0;
while (i < 10) {
    String hashKey = "2015-04-20." + i;
    ItemCollection<QueryOutcome> items = table.query("day", hashKey, rangeKeyCondition);
    Iterator<Item> iterator = items.iterator();
    while (iterator.hasNext()) {
        System.out.println(iterator.next().toJSONPretty());
    }
    i++;
}
Sorry for not using StackOverflow correctly... my concern is querying multiple times because of random_number attached to it. Is there a way to combine these queries and hit dynamoDB once ?
Hello Mr. Javalkar, I've updated my answer to include your last question, in case you believe this is the right answer please click on the check mark next to it so other users can be correctly guided. I hope that helps.
