Background
I have an S3 bucket called sample-level-test. It contains one folder per day, such as 2020-10-08, 2020-10-09 and 2020-10-10.
Each date folder contains many folders, each named after a player id, such as 2020-10-08/31001457373383.
These are player-level folders, and each one contains 3 files.
My Code
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(name="sample-level-test")
for my_bucket_object in my_bucket.objects.all():
    print(my_bucket_object)
My code sample output
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-08/31001457373383/player-DNA.json')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-08/31001457373383/player-DNA.csv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-08/31001457373383/player-DNA_report.tsv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-09/31001461776686/player-DNA.json')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-09/31001461776686/player-DNA.csv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-09/31001461776686/player-DNA_report.tsv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-10/310014685532736/player-DNA.json')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-10/310014685532736/player-DNA.csv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-10/310014685532736/player-DNA_report.tsv')
My Problem
I am trying to create a tiny service that, given some number X, returns the first X player keys in the bucket, ordered by oldest date first.
For example, if X = 1 then my output should be ['2020-10-08/31001457373383'].
If X = 2 then my output should be ['2020-10-08/31001457373383', '2020-10-09/31001461776686'].
My Current Approach
Currently I loop through the entire output, which is essentially the list of all objects in the bucket, and parse out the individual date folders. Then I check each date folder and collect player keys until I hit X.
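Roughly, that approach looks like the sketch below. The function name oldest_player_prefixes and the idea of passing in plain key strings are only for illustration; in the real code the keys would come from my_bucket.objects.all(). It assumes every object key has the form date/player_id/file.

```python
def oldest_player_prefixes(keys, x):
    """Return up to x 'date/player_id' prefixes, oldest date first."""
    if x <= 0:
        return []
    keys = list(keys)  # the whole bucket listing is held in memory

    # Pass 1: parse out the distinct date folders.
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    dates = sorted({key.split("/", 1)[0] for key in keys if "/" in key})

    result = []
    # Pass 2: check each date folder, oldest first, until x players are found.
    for date in dates:
        for key in keys:  # rescans every key for every date, hence slow
            parts = key.split("/")
            if len(parts) < 3 or parts[0] != date:
                continue  # not a date/player_id/file key for this date
            prefix = f"{date}/{parts[1]}"
            if prefix not in result:
                result.append(prefix)
                if len(result) == x:
                    return result
    return result


sample_keys = [
    "2020-10-08/31001457373383/player-DNA.json",
    "2020-10-08/31001457373383/player-DNA.csv",
    "2020-10-08/31001457373383/player-DNA_report.tsv",
    "2020-10-09/31001461776686/player-DNA.json",
    "2020-10-09/31001461776686/player-DNA.csv",
    "2020-10-09/31001461776686/player-DNA_report.tsv",
    "2020-10-10/310014685532736/player-DNA.json",
    "2020-10-10/310014685532736/player-DNA.csv",
    "2020-10-10/310014685532736/player-DNA_report.tsv",
]
print(oldest_player_prefixes(sample_keys, 2))
```

With the real bucket I would call it as oldest_player_prefixes((obj.key for obj in my_bucket.objects.all()), X). The inner loop runs once per date folder over the full listing, which is where most of the time goes.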
I think my approach is flawed and very slow, and I am wondering if there is a better way to do this. I know that in Java there are tree data structures where I could store this kind of directory listing in a tree format, so that retrieving information from it would be fast. Is there something similar I can use in Python?
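To make the tree idea concrete, the closest thing I can picture in Python is a set of nested dicts, as in the sketch below. The helper names build_tree and first_players are hypothetical, and I am again assuming every key looks like date/player_id/file. I don't know whether this is the right direction, it is only meant to show the shape of the structure I have in mind.

```python
from itertools import islice


def build_tree(keys):
    """Nest 'date/player_id/file' keys as tree[date][player_id] -> [files]."""
    tree = {}
    for key in keys:
        parts = key.split("/")
        if len(parts) != 3:
            continue  # assumption: objects sit exactly at date/player_id/file
        date, player, filename = parts
        tree.setdefault(date, {}).setdefault(player, []).append(filename)
    return tree


def first_players(tree, x):
    """Return up to x 'date/player_id' paths, oldest date first."""
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings;
    # players within a date keep the order in which they were inserted.
    paths = (f"{date}/{player}" for date in sorted(tree) for player in tree[date])
    return list(islice(paths, max(x, 0)))


keys = [
    "2020-10-08/31001457373383/player-DNA.json",
    "2020-10-08/31001457373383/player-DNA.csv",
    "2020-10-08/31001457373383/player-DNA_report.tsv",
    "2020-10-09/31001461776686/player-DNA.json",
    "2020-10-09/31001461776686/player-DNA.csv",
    "2020-10-09/31001461776686/player-DNA_report.tsv",
    "2020-10-10/310014685532736/player-DNA.json",
    "2020-10-10/310014685532736/player-DNA.csv",
    "2020-10-10/310014685532736/player-DNA_report.tsv",
]
tree = build_tree(keys)
print(first_players(tree, 2))
```

Looking up tree['2020-10-08'] is a plain dict access once the tree exists, but building it still means listing every object in the bucket first, so by itself it does not remove the slow part.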