Running Large Language Models (LLMs) on Oracle Cloud Infrastructure with Ollama and Open-WebUI

In today’s tech world, Large Language Models (LLMs) are a hot topic, with new models boasting billions of parameters released regularly. Companies everywhere are rushing to integrate AI into their systems. While I’m not a huge fan of mass AI adoption, running your own LLM is a fun and practical way to explore the technology. With Ollama and Open-WebUI, you can run models either in the cloud or on your own machine, which is particularly useful for tasks involving data you’d rather not send to a public service.

In this article, I’ll walk you through how to run your own LLM on Oracle Cloud Infrastructure (OCI) using free resources, the bundled Docker version of Open-WebUI with Ollama, and Tailscale for secure access.

Leveraging Oracle’s Free Tier for LLMs

In my previous articles, I discussed how Oracle provides free cloud services, including ARM-based virtual machines. While the performance of OCI’s free tier is not cutting-edge, the VM.Standard.A1.Flex shape is still capable of running smaller LLMs for personal tasks. I couldn’t run the largest models, probably due to the limited power of the ARM CPUs, but the setup works well for moderately sized ones. The free-tier shape gives you:

  • 4 ARM CPUs
  • 24 GB of RAM
  • 200 GB SSD storage

You’ll also want to ensure you allocate the full 200 GB of SSD storage, but there’s a catch: by default, not all of that space is made available to the root partition, so I had to resize the root partition to take advantage of it (see the resizing section below).
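If you’d rather script the instance creation with the OCI CLI instead of clicking through the console, a launch command along these lines should produce an equivalent VM. Treat it as a sketch: every OCID, the availability domain, and the SSH key path are placeholders you’d replace with your own values.

# Launch a free-tier ARM instance with 4 OCPUs, 24 GB RAM, and a 200 GB boot volume
# (all OCIDs and the availability domain below are placeholders)
oci compute instance launch \
  --availability-domain "AD-1" \
  --compartment-id ocid1.compartment.oc1..xxxx \
  --shape VM.Standard.A1.Flex \
  --shape-config '{"ocpus": 4, "memoryInGBs": 24}' \
  --boot-volume-size-in-gbs 200 \
  --image-id ocid1.image.oc1..xxxx \
  --subnet-id ocid1.subnet.oc1..xxxx \
  --assign-public-ip true \
  --ssh-authorized-keys-file ~/.ssh/id_rsa.pub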

Using Open-WebUI and Ollama

Open-WebUI (see its GitHub page) is an interface similar to ChatGPT, but it offers more flexibility by letting you run open-source LLMs served through Ollama. You can also use Open-WebUI with a ChatGPT API key, but the real magic happens when you tap into Ollama's open-source models for private or customized use.

I used the Ollama-bundled Docker version of Open-WebUI for simplicity. This version lets you quickly spin up a local instance of Open-WebUI together with Ollama's open-source LLMs. There’s also a CUDA-enabled image for those looking to accelerate inference on NVIDIA GPUs, but since Oracle’s free tier doesn’t include GPU access, I didn’t test this feature. However, if you’re running this setup on a machine with an NVIDIA card, you can take advantage of CUDA acceleration.

Docker Installation

Once the VM is up, we’ll need to install Docker. Here’s how I did it, following Docker’s official installation guide for RPM-based distributions:

# Update the system
sudo yum update -y

# Install yum-utils
sudo yum install -y yum-utils

# Add Docker repository
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

# Install Docker
sudo yum install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Enable Docker to start automatically after a reboot
sudo systemctl enable docker

# Add your user to the docker group so you no longer need "sudo" for Docker
# (log out and back in for the group change to take effect)
sudo usermod -aG docker $USER

# Start Docker
sudo systemctl start docker
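
Before moving on, it’s worth a quick sanity check that the daemon is running and your user can talk to it. The hello-world image is Docker’s own test container:

# Check that the daemon is up and reachable without sudo
docker version

# Optional: run Docker's test container
docker run --rm hello-world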


With Docker installed, you’re all set to move on to the next step: setting up Open-WebUI with Ollama.

Running Open-WebUI with Ollama

Next up is getting Open-WebUI running. As mentioned, it’s a ChatGPT-like interface, and the coolest part is that you can search for and run open-source LLMs directly through it. I used the Ollama-bundled Docker version of Open-WebUI, which simplifies the whole process. Since I was working with CPUs only (no GPU), here’s the command I used:

docker run -d -p 3000:8080 -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

This runs Open-WebUI on port 3000, and you can access it via your browser. Unfortunately, I couldn’t test the GPU version because the OCI free tier doesn’t provide access to GPUs. However, if you have a local machine with an NVIDIA GPU, you can run the GPU version using a CUDA-enabled Docker image; you can find more details on the Open-WebUI GitHub page.
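
For reference, the GPU-enabled variant of the bundled image described in the Open-WebUI README looks roughly like this. I couldn’t verify it without a GPU, and it assumes the NVIDIA Container Toolkit is already installed on the host:

# Bundled Ollama image with NVIDIA GPU support (untested here; no GPU on the free tier)
docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama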

Secure Access with Tailscale

Now, you probably don’t want to expose Open-WebUI directly to the internet. To avoid that, I used Tailscale to connect to the web UI securely through a private network. Tailscale lets you create a private network called a "tailnet" that links your devices without needing to expose anything to the outside world.

To get Tailscale up and running on the OCI instance, I used the following Docker command (see Tailscale’s installation docs for other options):

docker run -d --name=tailscaled --restart=always -v /var/lib:/var/lib -v /dev/net/tun:/dev/net/tun --network=host --cap-add=NET_ADMIN --cap-add=NET_RAW --env TS_AUTHKEY=tskey-auth-ab1CDE2CNTRL-0123456789abcdef tailscale/tailscale

Make sure to replace the TS_AUTHKEY value with your own Tailscale authentication key. Once this is set up, you can access Open-WebUI on your OCI instance via its Tailscale IP address, something like http://100.x.x.x:3000, all without opening any ports to the public internet. Of course, you also need Tailscale installed on the machine you’re connecting from.
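
On a Linux client, installing Tailscale and joining your tailnet can be as simple as the official quick-install script; on macOS or Windows you’d use the regular desktop apps instead:

# Install Tailscale on the client machine and join your tailnet
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up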

Resizing the Root Disk

One small problem I ran into was that the full 200 GB of SSD wasn’t available on the root partition. By default, Oracle allocates less space to the root partition, so I had to resize it to use all the available storage. If you plan to store multiple models or work with larger ones, this step is crucial.
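
On Oracle Linux images, the oci-utils package ships a helper that grows the root partition and filesystem to fill the whole boot volume. A minimal sketch, assuming the default Oracle Linux image with an XFS root:

# Grow the root partition and filesystem to use the full boot volume
# (oci-growfs is provided by the oci-utils package on Oracle Linux images)
sudo /usr/libexec/oci-growfs -y

# Verify the new size
df -h /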

Running Models with Open-WebUI and Ollama

Once everything’s up and running, you can use Open-WebUI to search for and download models. For example, to run the Llama 3.1 model with 8 billion parameters, all you need to do is type the model name and tag directly into the Open-WebUI interface, like this: llama3.1:8b. You can browse the available models in the Ollama model library.

Open-WebUI will download the model and get it running for you. Keep in mind that the size of the model you can run depends on how much RAM you have. According to Ollama’s GitHub guide:

  • You need at least 8 GB of RAM for 7B parameter models
  • 16 GB for 13B models
  • 32 GB for 33B models

Since Oracle's free tier gives you 24 GB of RAM, you’ll be able to run models with up to 13 billion parameters comfortably.
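
If you prefer the command line, you can also pull and test a model through the bundled Ollama instance. This assumes the ollama binary is available inside the open-webui container from the bundled image:

# Pull and chat with a model from inside the bundled container
docker exec -it open-webui ollama pull llama3.1:8b
docker exec -it open-webui ollama run llama3.1:8b "Hello, what can you do?"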

Fine-Tuning Your LLM: Key Parameters

One of the best things about running your own LLM is that you can tweak the model settings to suit your needs. Here are some of the key parameters you can play with:

  • Mirostat: Enables Mirostat sampling, which steers output toward a target perplexity for more coherent responses.
  • Temperature: Adjusts the creativity of the output. Higher values make responses more creative, while lower values make them more predictable.
  • Top-K: Limits the number of tokens considered. A higher value leads to more varied answers.
  • Top-P: Balances diversity and coherence in the responses.
  • Min-P: Filters out less likely tokens to improve response quality.

These parameters give you a lot of control over how the model behaves, letting you fine-tune it for your specific tasks. The full list is in the Ollama Modelfile documentation.
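
As a sketch of how this looks in practice, Ollama lets you bake these settings into a custom model variant via a Modelfile; the values below are arbitrary examples, not recommendations, and the model name my-llama is just a placeholder:

# Write a Modelfile with tweaked sampling parameters
cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER mirostat 1
EOF

# Build the custom model wherever the Ollama CLI is available
# (e.g. inside the bundled container via "docker exec")
ollama create my-llama -f Modelfile

The same knobs can also be adjusted per chat from Open-WebUI’s settings, so the Modelfile route is only needed if you want the settings baked into a reusable model.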

Why Not Just Use ChatGPT?

Sure, ChatGPT is fast, free, and great for general use. But there’s something empowering about running your own LLM. You get more control, more customization, and full ownership of the data being processed. This is especially important for privacy-conscious users or those who want to tweak models for specific tasks.

Wrapping Up

Running your own LLM on Oracle Cloud Infrastructure using Ollama and Open-WebUI is a great way to dip your toes into the world of AI. While the free tier has its limitations, it’s more than enough to experiment with smaller models and explore what LLMs can do. The setup is fairly straightforward with Docker, and using Tailscale ensures that your models stay private.

If you’re curious about AI but don’t want to rely on public services, this is a perfect way to get started. Give it a try, you might be surprised at how powerful and flexible running your own LLM can be.
