Core Concepts

OctoML is on a mission to offer easy access to efficient compute and enable users to integrate their choice of AI models into applications. The OctoAI compute service helps you turn any container or Python code into a production-grade endpoint within minutes. This page outlines core concepts within OctoAI.


Templates

A template is a foundational AI model that has been pre-containerized to support inference. Templates can be cloned to create an endpoint in your account. A subset of templates, called Quickstart templates, are immediately available for inference without cloning or a cold start (up to a certain rate limit). Cloning a template lets you overcome rate limits, customize autoscaling settings, and set your own privacy settings for the endpoint.


Endpoints

An endpoint provides a dedicated URL to serve inference. You can create an endpoint by cloning a template or by creating a new endpoint from your own container. Endpoints are private by default, and you can optionally allow public access. Each endpoint has an autoscaling configuration that includes minimum replicas, maximum replicas, and a timeout duration.

Endpoints are described in detail in Introduction and their API reference can be found in Endpoint.


Replicas

One replica is equivalent to one GPU or hardware instance. You can specify minimum and maximum replicas for each endpoint. With 0 minimum replicas, all hardware instances are stopped when there are no inference requests within the specified timeout. Setting the minimum replicas to 1 reduces cold start occurrences. More details on cold start are available here.

The maximum replicas value is the maximum number of simultaneous hardware instances that will be used. In general, 1 GPU can support a single request at a time. Concurrent requests exceeding the maximum replicas will be placed in a queue until a replica is available.
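The queueing behavior described above can be illustrated with a small back-of-the-envelope model. This is only a sketch under simplifying assumptions that are not stated in the documentation: all requests arrive at the same moment, every replica is already warm, each replica handles exactly one request at a time, every request takes the same time, and the queue is first in, first out. The function name and parameters are hypothetical.

```python
def estimate_completion_times(
    num_requests: int, max_replicas: int, seconds_per_request: float
) -> list[float]:
    """Estimate when each of `num_requests` simultaneous requests finishes.

    Assumptions (illustrative only): all requests arrive at t=0, replicas
    are warm, one request per replica at a time, FIFO queue.
    """
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    # Request i is served in "wave" i // max_replicas; each wave lasts
    # seconds_per_request, so it finishes at the end of its wave.
    return [
        (i // max_replicas + 1) * seconds_per_request
        for i in range(num_requests)
    ]


# Five concurrent requests, two replicas, 3 seconds per inference:
# the first two run immediately, the rest wait in the queue.
print(estimate_completion_times(5, 2, 3.0))  # [3.0, 3.0, 6.0, 6.0, 9.0]
```

Under these assumptions, raising the maximum replicas shortens the queue wait, while the time for a single inference is unchanged.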


Timeout

The timeout is how long an endpoint waits without receiving inference requests before it scales down to the configured minimum replicas. This value is specified in seconds. A higher value reduces cold start occurrences.
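To make the timeout rule concrete, here is a minimal sketch of the scale-down decision. It is not the service's actual autoscaler: the function name, its parameters, and the choice of "greater than or equal" at the boundary are all assumptions made for illustration.

```python
def replicas_after_idle_check(
    current_replicas: int,
    min_replicas: int,
    idle_seconds: float,
    timeout_seconds: float,
) -> int:
    """Return the replica count after checking how long the endpoint was idle.

    Hypothetical rule: once the endpoint has been idle for at least
    `timeout_seconds`, scale down to `min_replicas`; otherwise keep
    the current replica count.
    """
    if idle_seconds >= timeout_seconds:
        return min_replicas
    return current_replicas


# With a 300-second timeout and 0 minimum replicas:
print(replicas_after_idle_check(3, 0, idle_seconds=120, timeout_seconds=300))  # 3
print(replicas_after_idle_check(3, 0, idle_seconds=300, timeout_seconds=300))  # 0
```

With 0 minimum replicas, the next request after a scale-down incurs a cold start, which is why a longer timeout or a minimum of 1 replica reduces how often cold starts happen.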

Registry Credentials

When creating an endpoint from a privately stored custom container, you’ll need to provide the registry credentials. You can provide these credentials when creating your endpoint. More details are available here.

Endpoint Secrets

When creating an endpoint from a custom container, you may wish to mount database secrets or any other environment variables onto your container. You can provide these secrets when creating your endpoint. More details are available here.
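Inside the container, code can then read such values as ordinary environment variables. The sketch below uses only the Python standard library; the helper name and the variable name `DATABASE_PASSWORD` are hypothetical, and it assumes the secret is exposed to the process as an environment variable.

```python
import os


def load_required_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly if it is missing.

    Assumes the secret was provided at endpoint creation and is visible
    to the container process as an environment variable called `name`.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

For example, application startup code could call `load_required_secret("DATABASE_PASSWORD")` once and pass the result to its database client, so a misconfigured endpoint fails at startup rather than on the first request.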

Web UI

Quickstart templates and cloned templates have a web interface where you can try out inference. For example, with Stable Diffusion you can try different input prompts and parameters to generate images, and you’ll see the output directly in the web interface.


REST API

The REST API can be used from your application in any programming language. Example curl commands for inference are provided with each template. A guide to the REST API is available here.
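As a rough illustration of calling an endpoint from Python without any SDK, the sketch below builds a JSON POST request with the standard library. The URL, the bearer-token `Authorization` header, and the `prompt` input key are placeholders and assumptions, not values taken from this page; copy the real URL, headers, and payload from the curl command shown for your template.

```python
import json
import urllib.request


def build_inference_request(
    endpoint_url: str, token: str, inputs: dict
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an inference endpoint.

    Assumption: the endpoint accepts a JSON body and a bearer token.
    """
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical placeholder values; replace them with your endpoint's details.
request = build_inference_request(
    "https://example.invalid/predict", "YOUR_TOKEN", {"prompt": "a cat"}
)
```

Calling `send(request)` performs the network call. Keeping request construction separate from sending makes it easy to inspect or log the request before it goes out.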

Python SDK

The Python SDK is a library that simplifies using OctoAI endpoints in Python applications. It lets you run inference against an endpoint by passing a dictionary with the necessary inputs. A guide to the Python SDK is available here.