- Added support for NVIDIA A100 GPUs with 80GB of memory, enabling faster inference and higher memory bandwidth. Select the "Fastest" hardware configuration setting (see graphic below) when creating an endpoint to use an A100. OctoML's compute service lets users start with a single A100 and scale up as traffic increases, without paying for idle hardware.
- Added the ability for users to specify the health check server path. This path returns an HTTP 200 response when an endpoint is ready to receive requests. For example, the Flan-T5 container tutorial defines a '/healthcheck' path that returns 200 after the Flan-T5 model has been loaded and initialized.
- Coming Soon: New templates (e.g. Vicuna and Llama), demo apps (LLM chatbot), improvements to service and event logging, and reduced cold start time.
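
As a rough illustration of the health check path described above, here is a minimal sketch of a container server that reports readiness only after its model has loaded. It uses Python's standard library rather than whatever framework the Flan-T5 tutorial uses. The names `health_status`, `load_model`, `serve`, and the port are hypothetical, and returning 503 before the model is ready is an assumption; the notes above only specify the 200 response.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Path configured as the endpoint's health check path (matches the tutorial's example).
HEALTH_PATH = "/healthcheck"

# Set once the (hypothetical) model has finished loading.
model_ready = threading.Event()


def health_status(path: str, ready: bool) -> int:
    """Return the HTTP status code to send for a GET request to `path`."""
    if path != HEALTH_PATH:
        return 404
    # Assumption: 503 signals "not ready yet"; only the 200 is documented.
    return 200 if ready else 503


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(health_status(self.path, model_ready.is_set()))
        self.end_headers()


def load_model() -> None:
    """Placeholder for real model loading and initialization."""
    # ... load weights, warm up, etc. ...
    model_ready.set()


def serve(port: int = 8000) -> None:
    """Load the model in the background and serve health checks meanwhile."""
    threading.Thread(target=load_model, daemon=True).start()
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling `serve()` starts the server. Loading the model on a background thread lets the server answer health checks immediately, so the platform sees "not ready" instead of a refused connection during startup.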