September 20, 2023

  • We have improved OctoAI's SDXL. Our latency-optimized SDXL endpoints now take about 2.8s to generate a 30-step image and are available here. We also have a cost-optimized SDXL endpoint that takes about 8s for 30 steps. Contact us at [email protected] or in Discord to request access to the cost-optimized version.
  • Added support for multi-user accounts, which allows your team to manage endpoints, view logs & metrics, and securely share access to an account. Contact us directly to set up your multi-user account.

September 14, 2023

  • Check out our GitHub repo for template applications to help you get started on building your own app with OctoAI. Right now, we have an example using Python and deployable on Streamlit as well as one using TypeScript and deployable on Vercel.
  • We also have a new audio generation endpoint available under private preview (e.g. Bark, Tortoise TTS). Please reach out to us at [email protected] if you want to request early access to this feature!

August 30, 2023

  • Added a Llama2 70B quickstart template endpoint. We can also host custom Llama2 LoRAs/checkpoints for you--please reach out on Discord if you're interested.
  • Enabled users to upload data via URL in the authoring experience (CLI + Python SDK)
  • Added real-time streaming capabilities to our Whisper audio flow, with a React hook called useWhisper for ease of integration into web/mobile apps. You can learn how to use this feature in our docs.
  • Changed the domain for all newly created endpoints. Existing endpoints on octoai.cloud will still work, but we suggest that you start updating your code to call the new domain, since we'll also migrate existing endpoints in about a month.

August 16, 2023

  • Added a new section in our Docs on Image Generation, including how to fine-tune and use Stable Diffusion.
  • Reduced cold start substantially on endpoints created with our authoring experience (can be multiple minutes of improvement depending on the model). Upgrade to the latest version of the CLI and SDK and author new endpoints to get faster cold start for your custom models.
  • Improved error boundaries in the UI. Users are now less likely to run into the "Whoops Beta Mode engaged" message.
  • Enabled concurrency handling improvements to all new endpoints created from now on. We will also be gradually rolling out this change on previously created endpoints in upcoming weeks.
  • A faster version of SDXL with dimension 1024x1024 is now available under private preview. We'll be gradually rolling out this new version over the next week or so. Contact us on Discord for early access.
  • Reminder: OctoAI's quickstart template endpoints are for demo/testing purposes only. On these endpoints, we rate-limit to 15 inferences per hour. If you would like to exceed this limit for production use, please clone the endpoint to your own account.

August 10, 2023

  • Stable Diffusion 1.5 Template Feature Additions: OctoAI's Stable Diffusion endpoint, running on A10Gs, has been upgraded to include the following features to help users customize styling and achieve higher-quality images:
    • Popular Checkpoints like DreamShaper and Realistic Vision, Low Rank Adaptations (LoRAs), and Textual Inversions (navigate to the SD 1.5 template and review the drop-downs for a full list of fine-tuned options). Note: LoRA weights must sum to 1.
    • Additional image dimensions
    • Updated user interface
  • Whisper Template Feature Additions: Multi-hour audio files are now supported. Furthermore, you can specify a URL to the audio input file (e.g., MP3, WAV, or MP4 formats) instead of uploading a file from your local environment (navigate to the Whisper template).
  • Private Registry: OctoAI's container authoring experience has been upgraded. Users are no longer required to provide registry credentials to get started. Images can be uploaded directly to a private OctoAI Registry. User-uploaded images in OctoAI's Registry are accessible only to you and OctoAI services, i.e., no other user can view or access your images.
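The LoRA-weight constraint noted above (weights must sum to 1) is easy to enforce before submitting a request. A minimal sketch; the helper and the LoRA names here are illustrative, not part of the OctoAI API:

```python
def normalize_lora_weights(weights):
    """Scale a dict of LoRA weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("LoRA weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}

# Two LoRAs mixed 2:1 become 2/3 and 1/3 after normalization.
weights = normalize_lora_weights({"add-detail": 2.0, "paint-style": 1.0})
```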

July 26, 2023

  • Added more graceful concurrency handling: when users send more than N concurrent requests to an endpoint with N replicas actively running, we will queue all extra requests instead of failing them. This queuing behavior has been activated for selected customers, and will be gradually rolled out over this week and next week. You may temporarily see a new replica spin up while the rollout is occurring on your endpoint.
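The queuing behavior above can be pictured as each replica being a slot: requests beyond the number of slots wait in line rather than erroring. A toy conceptual model (not OctoAI code):

```python
import threading
import time

def serve(request_ids, n_replicas=2):
    """Toy model of endpoint queuing: when all n_replicas are busy,
    extra concurrent requests block in line instead of being rejected."""
    results = []
    lock = threading.Lock()
    slots = threading.Semaphore(n_replicas)  # one slot per active replica

    def handle(req_id):
        with slots:            # waits here when all replicas are busy
            time.sleep(0.01)   # simulated inference latency
            with lock:
                results.append(req_id)

    threads = [threading.Thread(target=handle, args=(i,)) for i in request_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Five concurrent requests against two replicas: all five complete, none fail.
completed = serve(range(5), n_replicas=2)
```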

  • Updated our Python SDK from 0.1.2 to 0.2.0--it now supports both streaming and async inference requests.

  • Added diarization to our Whisper template endpoint and rectified the list of languages supported. Diarization enables use cases where you'd like to identify the speaker of each segment in a speech recording. You can view the full API spec in our docs. Here's an example of how to use the template with diarization:

    • import requests
      import base64

      def download_file(url, filename):
          response = requests.get(url)
          if response.status_code == 200:
              with open(filename, "wb") as f:
                  f.write(response.content)
              print(f"File downloaded successfully as {filename}.")
          else:
              print(f"Failed to download the file. Status code: {response.status_code}")

      def make_post_request(filename):
          # Base64-encode the audio file so it can be sent as JSON.
          with open(filename, "rb") as f:
              encoded_audio = base64.b64encode(f.read()).decode("utf-8")
          headers = {
              "Content-Type": "application/json"
          }
          data = {
              "audio": encoded_audio,
              "task": "transcribe",
              "diarize": True
          }
          response = requests.post("<YOUR_ENDPOINT_URL_HERE>", json=data, headers=headers)
          if response.status_code == 200:
              # Handle the successful response here
              json_response = response.json()
              for seg in json_response["response"]["segments"]:
                  print(seg)
          else:
              print(f"Request failed with status code: {response.status_code}")

      if __name__ == "__main__":
          url = "<YOUR_FILE_HERE>.wav"
          filename = "sample.wav"
          download_file(url, filename)
          make_post_request(filename)

July 20, 2023

  • Added an OctoAI template for Llama2-7B Chat, which is an instruction-tuned model for chatbots. Users can now work with this newly released LLM directly in the web UI (with a limited token response) or programmatically with additional options. A similar template for Llama2-70B is coming soon!

July 18, 2023

  • Changed the HTTP status code to 201 for the REST API calls for create secret and create registry credentials.  Previously, we returned 200 for these calls.  The behavior of the SDK and web frontend is not affected.

June 27, 2023

  • Released a Doc tutorial to show users how to use OctoAI's server class GPUs with Automatic1111 Stable Diffusion web user interface.
  • Released a video tutorial to show users how to apply custom model checkpoints using Automatic1111's Stable Diffusion web user interface on OctoAI.
  • Updated our Falcon template to use a different server implementation behind the scenes. The inference API is now available at /generate, but inferences at /predict will continue to work.

June 12, 2023

  • Join us for the OctoAI compute service public beta launch this Wednesday, June 14th! Register here.
  • With the launch of our service, changes will be made to our billing. You can find pricing plans and hardware options here. Changes and new-user incentives taking immediate effect are noted below:
    • Tomorrow, June 13th, any existing endpoints will be set to min replicas=0 so that you are not billed for an instance unintentionally left active and running. Be prepared for a cold start before your first inference and reset to min replicas=1 if you prefer to keep the instance warm.
    • Every user who logs in during public beta will receive credits for 2 free compute hrs on A100 (or 10+ hrs on A10!) to use in their first two weeks.
    • The first 500 users to create a new endpoint will receive credits for 12 free compute hrs on A100 (or 50+ hrs on A10!) to use within their first month.
  • You now have two options to integrate OctoAI endpoints into your application:
    • Our new Python client (supports synchronous inference). Read more about it here.
    • Our HTTP REST API now supports both synchronous and asynchronous calls, allowing users to request inference without persisting a connection, poll for status, and retrieve the completed prediction data. This is most effective when managing longer-running requests. Read more about it here.
  • We’ve updated our Whisper model to be much faster - don't worry, the input / output schema is the same!
  • We've also added MPT 7B and Vicuña 7B as new quickstart templates; they are better alternatives to Dolly, which will be removed soon.
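The asynchronous REST flow described above boils down to submit-then-poll. Here's a generic sketch of the polling half; `get_status` stands in for an HTTP GET against the poll URL the async API returns, and the response shape shown is illustrative, not the exact OctoAI schema:

```python
import time

def poll_until_done(get_status, interval=0.01, timeout=5.0):
    """Poll a status callable until the prediction completes or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status["state"] == "completed":
            return status["result"]
        time.sleep(interval)
    raise TimeoutError("prediction did not complete in time")

# Simulated status endpoint that completes on the third poll.
calls = {"n": 0}
def fake_status():
    calls["n"] += 1
    if calls["n"] < 3:
        return {"state": "pending"}
    return {"state": "completed", "result": {"text": "hello"}}

result = poll_until_done(fake_status)
```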