
Context:

  • I'm working with an end-to-end deep learning TTS framework (you give it text input and it gives you back a WAV object)
  • I've created a FastAPI endpoint in a Docker container that uses the TTS framework to do inference
  • My frontend client will hit this FastAPI endpoint to do inference on a GPU server
  • I'm going to have multiple Docker containers behind a load balancer (HAProxy), all running the same FastAPI endpoint image

My questions:

  • Storage Choice: What is the recommended approach for hosting model files when deploying multiple Docker containers? Should I use Docker volumes, or is it advisable to use a cloud storage solution like S3 or DigitalOcean Spaces for centralized model storage?
  • Latency Concerns: How can I minimize latency when fetching models from cloud storage? Are there specific techniques or optimizations (caching, partial downloads, etc.) that can reduce the impact of latency, especially when switching between different models for inference? (One possible caching approach is sketched right after this list.)
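
For the caching question, here is a rough sketch (not a definitive implementation) of the usual pattern: each container keeps a local disk cache and only hits S3 on a cache miss, so a given voice's weights are downloaded at most once per container. The bucket name, key layout, and cache directory below are made-up placeholders, not something from my setup.

    # Rough sketch: container-local disk cache in front of S3.
    # S3_BUCKET, the "voices/" key layout, and CACHE_DIR are hypothetical placeholders.
    from pathlib import Path

    import boto3

    S3_BUCKET = "my-tts-models"          # hypothetical bucket name
    CACHE_DIR = Path("/models/cache")    # container-local disk or a mounted volume

    s3 = boto3.client("s3")

    def get_model_path(voice: str) -> Path:
        """Return a local path to the voice's weights, downloading from S3 on first use."""
        local_path = CACHE_DIR / f"{voice}.pt"
        if not local_path.exists():
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            # The first request for this voice pays the S3 download cost once;
            # every later request reads straight from local disk.
            s3.download_file(S3_BUCKET, f"voices/{voice}.pt", str(local_path))
        return local_path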

I'm still learning about MLOps, so I appreciate any help.

  • When you say "switching between different models for inference", do you mean swapping weights within a container, or between containers (i.e. sequential models in different containers)? Commented Oct 31, 2023 at 5:52
  • @Karl Apologies for the confusion, I mean swapping weights within a container. Each container will need to be able to swap between any of the 7 voice models I have currently trained (and more than 7 in the future). So, for example, someone might make a TTS request with voice Brian and then want to switch to voice Mike. I could make single containers dedicated to only one voice, but I don't think that is feasible given server costs. (See the in-memory cache sketch after these comments.) Commented Oct 31, 2023 at 6:22
  • I would strongly recommend a "one model per container" approach, unless you have a situation where you have a large "main" model that is augmented by small embeddings/fine-tuned weights. If cost is an issue, you can look into serverless inference or try to juggle containers within a server. For loading, you want to download a container's models once on build. Commented Oct 31, 2023 at 19:16
  • So I have one large model and then fine-tuned weights. The fine-tuned weights are still large, about 1 GB each. I could download each fine-tuned model at build time and store them locally. One downside of this is consuming more local storage. The other downside is that if I wanted to add new models to the system later, I'd have to rebuild each container and re-pull the models. I could write some Python logic to check for new ones in the S3 bucket every 24 hours or something. I was planning on running multiple containers on one GPU machine to help save on server costs (rather than one container per GPU). Commented Oct 31, 2023 at 23:11
  • Long term, my plan is to have easily 20-30 fine-tuned models. If I had 20-30 models, I don't think I could afford 30 individual containers each running one model, but 3-5 containers that have access to all 30 models would be feasible. I was just trying to figure out whether it made sense to store them on S3 and constantly read them from S3, or whether it makes more sense to store them on S3 but download them locally and check for new ones every 24 hours or something like that. (A sketch of that periodic-sync idea follows these comments.) Commented Oct 31, 2023 at 23:14
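
Following up on the "swapping weights within a container" comment: a common pattern is to keep the last few voices loaded in memory so the endpoint doesn't reload ~1 GB from disk on every request. This is only a sketch under the assumption that the fine-tuned weights are ordinary PyTorch checkpoints; the path layout and cache size are placeholders, not part of the question.

    # Sketch of an in-process LRU cache of loaded voices (assumes plain PyTorch checkpoints;
    # the path layout and MAX_LOADED_VOICES are placeholders).
    from collections import OrderedDict
    from threading import Lock

    import torch

    MAX_LOADED_VOICES = 3                # tune to fit GPU/host memory
    _loaded: OrderedDict[str, object] = OrderedDict()
    _lock = Lock()

    def _load_from_disk(voice: str):
        # Reads from the container-local cache (see the disk-cache sketch earlier).
        return torch.load(f"/models/cache/{voice}.pt", map_location="cuda")

    def get_voice(voice: str):
        """Return a loaded model, evicting the least recently used voice when over the cap."""
        with _lock:
            if voice in _loaded:
                _loaded.move_to_end(voice)       # mark as most recently used
                return _loaded[voice]
            model = _load_from_disk(voice)       # cold path: load from local disk
            _loaded[voice] = model
            if len(_loaded) > MAX_LOADED_VOICES:
                _loaded.popitem(last=False)      # drop the least recently used voice
            return model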
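
And here is a sketch of the "check S3 every 24 hours" idea from the last comment: a daemon thread that lists the bucket and downloads any voice file not yet present locally, so new fine-tuned models become available without rebuilding the image. Bucket name, prefix, cache directory, and interval are assumptions for illustration.

    # Sketch of a periodic S3 sync (S3_BUCKET, VOICES_PREFIX, CACHE_DIR, and the interval are placeholders).
    import threading
    import time
    from pathlib import Path

    import boto3

    S3_BUCKET = "my-tts-models"
    VOICES_PREFIX = "voices/"
    CACHE_DIR = Path("/models/cache")
    SYNC_INTERVAL_S = 24 * 60 * 60       # once a day, as suggested in the comments

    def sync_new_voices() -> None:
        """Download any voice weights that exist in S3 but are missing locally."""
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=VOICES_PREFIX):
            for obj in page.get("Contents", []):
                local_path = CACHE_DIR / Path(obj["Key"]).name
                if not local_path.exists():
                    CACHE_DIR.mkdir(parents=True, exist_ok=True)
                    s3.download_file(S3_BUCKET, obj["Key"], str(local_path))

    def start_background_sync() -> None:
        """Start the sync loop in a daemon thread, e.g. from a FastAPI startup hook."""
        def loop() -> None:
            while True:
                sync_new_voices()
                time.sleep(SYNC_INTERVAL_S)
        threading.Thread(target=loop, daemon=True).start()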
