Fine-tuning Stable Diffusion

OctoAI lets you fine-tune Stable Diffusion to customize generated images. Fine-tuning is the process of training a model with additional data for your task. You'll provide training images of your subject for fine-tuning, and when it's complete, you can use the fine-tuned model to create new images. There are a few simple steps:

  1. Upload your training images
  2. Run the fine-tuning job
  3. Use the fine-tuned LoRA in your image generation inference requests

We're using the LoRA (Low-Rank Adaptation) fine-tuning method. It's a fast way to fine-tune Stable Diffusion, and usually takes about 5 to 8 minutes. Fine-tuning is supported for Stable Diffusion v1.5.
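For intuition on why LoRA is fast: instead of updating a full weight matrix, it learns two small low-rank matrices whose product is added to the frozen weights. The sketch below is a conceptual illustration with hypothetical dimensions, not OctoAI's implementation:

```python
import numpy as np

# Conceptual LoRA sketch (dimensions and rank are hypothetical):
# the pretrained weight W stays frozen; only the small rank-r
# matrices A and B are trained, and their product adapts W.
d, k, r = 768, 768, 8

W = np.random.randn(d, k)           # frozen pretrained weight
A = np.random.randn(r, k) * 0.01    # trainable, rank r
B = np.zeros((d, r))                # trainable, starts at zero

W_adapted = W + B @ A               # effective weight after fine-tuning

full_params = d * k                 # parameters in the full matrix
lora_params = r * (d + k)           # parameters in the LoRA adapter
```

Because the adapter has far fewer parameters than the full matrix (here `r * (d + k)` versus `d * k`), training converges quickly, which is why the job typically finishes in minutes.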

Fine-tuning is currently in Private Preview and limited to web UI access while we make several improvements to the fine-tuning API. During the Private Preview, fine-tuning is free and won't charge your account. If you're interested in early access, send us a request using our sign-up form. All uploaded images used for fine-tuning must comply with our terms of service.


In the web UI, navigate to the Fine-tune page to get started; any previously tuned models will also be listed here. Click "Create tune" to continue.

Create fine-tune


Specify the name of your fine-tune, and the trigger word of the subject you're fine-tuning. The trigger word can be used in your inference requests to customize the images with your subject. We recommend using a unique trigger word, such as "sks1", that's unlikely to be associated with a different subject in Stable Diffusion. In this example, we'll fine-tune using images of people wearing a virtual reality headset.

Next, specify the number of steps to train. A good guideline is about 75 to 150 steps per training image. The model can underfit if the number of training steps is too low, resulting in poor quality. If it's too high, the model can overfit and struggle to generalize beyond details represented in the training images.
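The guideline above works out to a simple calculation; the helper below is just for illustration, with the 75-150 range taken from the recommendation:

```python
def recommended_steps(num_images, low=75, high=150):
    """Suggested training-step range: roughly 75-150 steps per training image."""
    return num_images * low, num_images * high

# For example, with the 12-15 images suggested later in this guide:
lo, hi = recommended_steps(12)   # (900, 1800)
```

So a 12-image dataset would call for somewhere between 900 and 1,800 training steps.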


Select the subject you want to tune: person, style, animal, or landscape. This will help the tuning process understand which parts of the image are most important. We'll use the Person subject for this example.

Interested in different subjects? We'd love to hear your feedback in our Discord community!

Upload images & finalize

Next, upload your images. We recommend using about 12-15 varied images, including different backgrounds, lighting conditions, and distances. Finding a balance between variation and consistency can help improve image generation quality.

Optionally, you can provide captions for each image that describe the custom subject. This can also help improve fine-tuning and the quality of generated images. Make sure to include your trigger word in the caption.
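To make the captioning advice concrete, here is what a set of captions might look like for the VR-headset example. The filenames and wording are hypothetical; the key point is that every caption describes the image and includes the trigger word (here, "sks1"):

```python
# Hypothetical captions for the VR-headset example.
# Filenames and wording are illustrative; each caption should
# describe the image and include the trigger word "sks1".
captions = {
    "photo_01.jpg": "a person wearing a sks1 virtual reality headset indoors",
    "photo_02.jpg": "a person wearing a sks1 virtual reality headset outdoors, side view",
    "photo_03.jpg": "close-up of a person wearing a sks1 virtual reality headset",
}
```

Notice that the captions vary the setting and framing while keeping the trigger word consistent, mirroring the variation-plus-consistency advice above.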

When you're ready, click "Finalize", and the fine-tune job will progress from pending to running before completing.

Generating images

Specify the image prompt and additional fields, then click "Generate" to create your image. Again, be sure to include your trigger word within the prompt:
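During the Private Preview, generation with your fine-tune happens through the web UI, but for readers planning ahead, the sketch below shows roughly what a programmatic request body might look like. The field names (`prompt`, `negative_prompt`, `loras`), the fine-tune name, and the weight value are all assumptions for illustration, not a documented OctoAI schema:

```python
import json

# Sketch of a hypothetical image-generation request body that
# references a fine-tuned LoRA. All field names, the fine-tune name
# ("my-vr-headset-tune"), and the weight are illustrative assumptions.
payload = {
    "prompt": "a professional portrait of a person wearing a sks1 virtual reality headset",
    "negative_prompt": "blurry, low quality",
    "loras": {"my-vr-headset-tune": 0.8},  # hypothetical name and weight
    "num_images": 1,
}

body = json.dumps(payload)  # serialized request body
```

Whatever the final API shape, the essentials carry over from the UI workflow: the trigger word must appear in the prompt, and the request must reference the fine-tune by name.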

Fine-tuning tips

We recommend using some amount of variation in your images, including different backgrounds, lighting conditions, and distances. If every image is a close-up, the fine-tuned model may be limited to representing that distance. It's also helpful to have some level of consistency among the images to ensure the model learns the intended subject.

Finding the right balance between consistency and variation can require a few iterations, and we encourage you to experiment!