
Assume a user connects via a WebSocket connection to a server, which serves a personalized TypeScript function based on a personalized JSON file.

So when a user connects,

  • the personalized JSON file is loaded from an S3-like bucket (around 60-100 MB per user),
  • when the user types, TypeScript/JavaScript/Python code is executed which returns a string as a reply, and the JSON-like data structure gets updated,
  • and when the user disconnects, the JSON gets persisted back to the S3-like bucket.

In total, think of about 10,000 users, so around 600 GB of such data in total.
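For concreteness, a minimal sketch of that per-connection lifecycle in TypeScript on top of the ws package; loadUserState, persistUserState and runUserScript are placeholder stand-ins for the S3-like bucket and the personalized function, not a real API:

    import { WebSocketServer, WebSocket } from 'ws';
    import type { IncomingMessage } from 'node:http';

    // Placeholder stand-ins; the real implementations are whatever is actually used.
    async function loadUserState(userId: string): Promise<string> {
      return '{}';                     // would stream the 60-100 MB JSON text from the bucket
    }
    async function persistUserState(userId: string, json: string): Promise<void> {
      // would write the JSON text back to the bucket on disconnect
    }
    function runUserScript(state: Record<string, unknown>, input: string): string {
      return `echo: ${input}`;         // would execute the personalized TS/JS/Python function
    }

    const wss = new WebSocketServer({ port: 8080 });

    wss.on('connection', async (socket: WebSocket, request: IncomingMessage) => {
      const userId = String(request.headers['x-user-id']);   // assumed: set during the auth handshake

      // 1. cold start: pull + parse the personalized JSON (60-100 MB per user)
      const state = JSON.parse(await loadUserState(userId)) as Record<string, unknown>;

      // 2. per message: run the personalized function, mutate the state, reply with a string
      socket.on('message', (raw) => {
        socket.send(runUserScript(state, raw.toString()));
      });

      // 3. on disconnect: serialize the whole blob again and persist it
      socket.on('close', () => {
        void persistUserState(userId, JSON.stringify(state));
      });
    });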

It should

  • spin up fast for a user,
  • scale well with the number of users (so that we do not waste money), and
  • have a global latency of a few tens of ms.

Is that possible? If so, what architecture seems to be the most fitting?

1 Answer

Is that possible?

Let's make a sketch of a single-user, single-transaction, end-to-end latency budget composition:

  1. The user may spend from about 1 [ms] if colocated, yet up to 150+ [ms], for sending a packet over the live connection ( RTT-bound; here we ignore all socket initiation & setup negotiations for simplicity )

  2. The server may spend anything above 25+ [ms] for "reading" the auth'd-user-specific JSON-formatted string from RAM upon a first seek/index into the still-string, SER/DES-ed representation of the key:value pairs ( here we ignore, for simplicity, all the add-on costs of non-exclusive use of the NUMA ecosystem spent on actually finding, physically reading and cross-NUMA transporting those 60 ~ 100 MB of auth'd-user-specific data from a remote, about TB-sized, off-RAM storage into their final destination inside a local CPU-core RAM area )

  3. The JSON decoder may spend any amount of additional time on repetitive key:value lookups over the 60 ~ 100 MB data dictionary ( a rough measurement sketch follows after the subtotal below )

  4. The ML-model may spend any amount of additional time on the .predict()-method's internal evaluation

  5. The server will spend some additional time assembling a reply to the user

  6. The network will again add transport latency, principally similar to that experienced under item 1 above

  7. The server will next spend some additional time on a per-user & per-incident modification of the in-RAM, per-user-maintained, JSON-encoded 60 ~ 100 MB data dictionary ( this part ought always to happen after the items above, if UX latency is a design priority )

  8. The server will next spend some additional time on the opposite direction of cross-NUMA ecosystem data transport & storage. While mirroring item 2, this time the data-flow may enjoy non-critical / async / cached / latency-masked deferred use of physical resources, which was not the case under item 2, where no pre-caching will happen unless some TB-sized, exclusive-use, never-evicted cache footprints are present and reserved end-to-end along the whole data transport trajectory, from the local CPU-core in-RAM representation, through re-SER-ialisation into a string, over all the cross-NUMA ecosystem interconnects, down to the very last cold-storage physical device ( which is almost surely not going to happen here )

( subtotal ... [ms] for a single-user single-transaction single-prediction )
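As a rough check of items 2, 3 & 7 alone, here is a tiny Node/TypeScript benchmark that parses and re-stringifies a synthetic JSON document of roughly the question's size ( the ~80 MB figure comes from the question; the payload shape is an assumption; run it on the target hardware, where the SER/DES steps alone tend to consume far more than the whole few-tens-of-[ms] budget ):

    import { performance } from 'node:perf_hooks';

    // Build an ~80 MB JSON text: ~800k key:value pairs of short strings
    // (synthetic shape; only the total size matches the question).
    const entries: Record<string, string> = {};
    for (let i = 0; i < 800_000; i++) {
      entries[`key_${i}`] = 'x'.repeat(80);
    }
    const blob = JSON.stringify(entries);
    console.log(`payload size   : ${(blob.length / 1e6).toFixed(0)} MB`);

    let t = performance.now();
    const parsed = JSON.parse(blob);                 // items 2/3: first touch of the still-string state
    console.log(`JSON.parse     : ${(performance.now() - t).toFixed(0)} ms`);

    t = performance.now();
    JSON.stringify(parsed);                          // items 7/8 mirror: full re-serialisation
    console.log(`JSON.stringify : ${(performance.now() - t).toFixed(0)} ms`);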

Let's sketch what else goes wrong once the many-users, many-transactions reality enters the ZOO:

a.
All the so-far optimistic ( assumed exclusive ) resources will start to degrade in processing performance / transport throughput, which will add to and/or increase the actually achieved latencies, because concurrent requests will now enter blocking states ( both on the micro-level, like CPU-core LRU-cache resupply delays and cross-QPI extended access times to CPU-core non-local RAM areas, and on the macro-level, like enqueuing all calls to wait until a local ML-model .predict()-method is free to run; none of these were present in the non-shared, single-user, single-transaction case above with unlimited exclusive resource usage, so never expect a fair split of resources )

b.
Everything that was "permissive" for a deferred ( ALAP ) write in items 7 & 8 above will now become part of the end-to-end latency critical path, because the JSON-encoded 60 ~ 100 MB write-back has to be completed ASAP, not ALAP, as one never knows how soon another request from the same user will arrive, and any next shot has to re-fetch the already updated JSON data ( perhaps even some user-specific serialisation of the sequence of requests will have to get implemented, so as to avoid losing the mandatory ordering of this very same user-specific JSON-data's sequential self-updates; a minimal sketch of that follows after the subtotal below )

( subtotal for about 10k+ many-users many-transactions many-predictions
will IMHO hardly remain inside a few tens of [ms] )
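If one still insists on the O/P strategy, item b. at minimum implies per-user request serialisation; a minimal sketch in TypeScript ( illustrative names, a plain in-process promise chain, ignoring multi-node placement ):

    // Each user gets a promise chain, so updates to that user's JSON state are
    // applied strictly in arrival order, as item b. above demands.
    const userQueues = new Map<string, Promise<unknown>>();

    function enqueuePerUser<T>(userId: string, task: () => Promise<T>): Promise<T> {
      const tail = userQueues.get(userId) ?? Promise.resolve();
      // chain the new task behind whatever is already pending for this user;
      // swallow earlier failures so one bad request does not wedge the queue
      const next = tail.catch(() => undefined).then(task);
      userQueues.set(userId, next);
      return next;
    }

    // usage: every incoming message for a user funnels through the same queue
    //   enqueuePerUser(userId, () => handleMessage(userId, payload));

Note that this only fixes ordering inside one process; once the ~10k users are spread across nodes, the same ordering guarantee has to be re-established across the fleet, which again costs latency.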


Architecture?

Well, given the computation strategy the O/P sketched, there seems to be no architecture that could "save" it from all the principal inefficiencies requested there.

For industry segments where ultra-low-latency designs are a must, the core design principle is to avoid any unnecessary sources of increased end-to-end latency.

  • binary-compact BLOBs rule ( JSON strings are hell-expensive at every stage, from storage, through all network transport flows, to the repetitive SER-/DES-erialisation re-processing; see the toy comparison after this list )

  • poor in-RAM computing scaling makes big designs move ML-models closer to the ecosystem periphery, not into a singular CPU/RAM-blocking, CACHE-depleting spot inside the core of the NUMA ecosystem
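A toy comparison for the first point, keeping the same one million numeric values as a JSON string versus a packed binary buffer ( illustrative only; a real design would use a proper schema'd binary format rather than a bare Float64Array ):

    import { performance } from 'node:perf_hooks';

    const values = Array.from({ length: 1_000_000 }, () => Math.random());

    const asJson = JSON.stringify(values);           // text: ~18 digits + separator per value
    const asBinary = Float64Array.from(values);      // fixed 8 bytes per value

    console.log(`JSON text : ${(asJson.length / 1e6).toFixed(1)} MB, must be re-parsed on every load`);
    console.log(`binary    : ${(asBinary.byteLength / 1e6).toFixed(1)} MB, usable almost in place`);

    let t = performance.now();
    JSON.parse(asJson);                              // the repetitive SER/DES cost named above
    console.log(`JSON.parse  : ${(performance.now() - t).toFixed(1)} ms`);

    t = performance.now();
    new Float64Array(asBinary.buffer.slice(0));      // a raw copy of the binary view
    console.log(`binary copy : ${(performance.now() - t).toFixed(1)} ms`);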

( Does it seem complex? Yeah, it is: complex & heterogeneous distributed computing for (ultra-)low latency is a technically hard domain, not a free pick of some "golden bullet" architecture )
