Background

I have a bucket on S3 called sample-level-test. It contains a folder for each day, such as 2020-10-08, 2020-10-09 and 2020-10-10.

Each date folder contains many folders, each named after a player id, like 2020-10-08/31001457373383, 2020-10-08/31001457373383 etc.

The folders 31001457373383 and 31001457373383 are player-level folders, and each player-level folder contains 3 files.

My Code

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket(name="sample-level-test")

# Print a summary of every object in the bucket
for my_bucket_object in my_bucket.objects.all():
    print(my_bucket_object)

My code sample output

s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-08/31001457373383/player-DNA.json')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-08/31001457373383/player-DNA.csv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-08/31001457373383/player-DNA_report.tsv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-09/31001461776686/player-DNA.json')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-09/31001461776686/player-DNA.csv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-09/31001461776686/player-DNA_report.tsv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-10/310014685532736/player-DNA.json')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-10/310014685532736/player-DNA.csv')
s3.ObjectSummary(bucket_name='sample-level-test', key='2020-10-10/310014685532736/player-DNA_report.tsv')

My Problem

I am trying to create a tiny service that, given some number X, returns the first X player keys in the bucket, ordered by oldest date first.

For example if X = 1 then my output should be ['2020-10-08/31001457373383'].

For example if X = 2 then my output should be ['2020-10-08/31001457373383', '2020-10-09/31001461776686'].

My Current Approach

Currently I loop through the entire output, which is essentially the list of all objects in the bucket, and parse out the individual date folders. Then I check each date folder and collect keys until I reach X.

I think my approach is flawed and very slow, and I am wondering if there is a better way. I know that Java has tree data structures in which I could store this kind of directory listing so that lookups are fast. Is there something similar I can use in Python?

  • Is a tree-like structure essential? It seems like your output is already sorted. Commented Nov 25, 2020 at 2:58

1 Answer

Assuming all your keys follow the same structure, you can split the keys, remove duplicates, sort, and get your answer by slicing the resulting list:

keys_list = ['2020-10-08/31001457373383/player-DNA.json',
             '2020-10-08/31001457373383/player-DNA.csv',
             '2020-10-08/31001457373383/player-DNA_report.tsv',
             '2020-10-09/31001461776686/player-DNA.json',
             '2020-10-09/31001461776686/player-DNA.csv',
             '2020-10-09/31001461776686/player-DNA_report.tsv',
             '2020-10-10/310014685532736/player-DNA.json',
             '2020-10-10/310014685532736/player-DNA.csv',
             '2020-10-10/310014685532736/player-DNA_report.tsv']

x = 2
# Drop the file name from each key, de-duplicate, then sort so the oldest date comes first
new_list = sorted({s.split('/player')[0] for s in keys_list})
answer_list = new_list[:x]

Output for x=2

['2020-10-08/31001457373383', '2020-10-09/31001461776686']
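
If scanning the whole bucket is the slow part, a variation on the same idea is to stop reading keys as soon as X prefixes have been found. The sketch below is only an illustration and rests on two assumptions: the keys arrive already sorted (S3 listings for general purpose buckets come back in lexicographic key order, which matches date order for YYYY-MM-DD folder names), and every key has the form date/player/file. The function name first_player_prefixes is made up for this example.

```python
from itertools import islice


def first_player_prefixes(keys, x):
    """Return up to x distinct 'date/player' prefixes, in order of first appearance.

    Assumes `keys` is sorted, so all files of one player folder are adjacent.
    """
    def distinct_prefixes():
        previous = None
        for key in keys:
            # '2020-10-08/31001457373383/player-DNA.json' -> '2020-10-08/31001457373383'
            prefix = key.rsplit('/', 1)[0]
            if prefix != previous:
                previous = prefix
                yield prefix

    # islice stops pulling from the generator once x prefixes have been produced,
    # so the remaining keys are never read.
    return list(islice(distinct_prefixes(), x))


keys_list = ['2020-10-08/31001457373383/player-DNA.json',
             '2020-10-08/31001457373383/player-DNA.csv',
             '2020-10-08/31001457373383/player-DNA_report.tsv',
             '2020-10-09/31001461776686/player-DNA.json',
             '2020-10-09/31001461776686/player-DNA.csv',
             '2020-10-09/31001461776686/player-DNA_report.tsv',
             '2020-10-10/310014685532736/player-DNA.json',
             '2020-10-10/310014685532736/player-DNA.csv',
             '2020-10-10/310014685532736/player-DNA_report.tsv']

print(first_player_prefixes(keys_list, 2))
# ['2020-10-08/31001457373383', '2020-10-09/31001461776686']
```

With the bucket from the question, the keys could be fed in lazily with something like first_player_prefixes((obj.key for obj in my_bucket.objects.all()), x). Since the boto3 collection fetches pages on demand, this should avoid listing the rest of the bucket once X prefixes are found, though I have not measured it on a large bucket.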