Ollama has become the de facto standard for running Large Language Models (LLMs) locally. In this tutorial, I want to show you the most important things you should know about it.
Watch on YouTube: https://youtu.be/AGAETsxjg0o
What is Ollama?
Ollama is an open-source platform for running and managing large language models entirely on your local machine. It bundles model weights, configuration, and data into a single package defined by a Modelfile. Ollama offers a command-line interface (CLI), a REST API, and official Python and JavaScript libraries, allowing you to download models, run them offline, and even let models call user-defined functions (tool calling). Running models locally gives you privacy, removes network latency, and keeps your data on your own device.
Install Ollama
Visit the official website to download Ollama: https://ollama.com/download
Linux:
curl -fsSL https://ollama.com/install.sh | sh
macOS:
brew install ollama
Windows: download the .exe installer and run it.
How to Run Ollama
Before running models, it is essential to understand quantization. Ollama typically runs models quantized to 4 bits (e.g., q4_0), which significantly reduces memory usage with minimal loss in quality. A rough memory estimate is sketched after the hardware list below.
Recommended Hardware:
- 7B Models (e.g., Llama 3, Mistral): require ~8 GB of RAM (they run on most modern laptops).
- 13B–30B Models: require 16–32 GB of RAM.
- 70B+ Models: require 64 GB+ of RAM or dual GPUs.
- GPU: An NVIDIA GPU or Apple Silicon (M1/M2/M3) is highly recommended for speed.
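To make the numbers above less mysterious, here is a rough back-of-the-envelope sketch: at 4 bits per weight, each parameter takes about half a byte, and the KV cache plus runtime overhead add more on top of that. This is only an approximation, not an exact sizing tool.

def approx_weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory needed just for the model weights, in GB."""
    bytes_per_weight = bits_per_weight / 8  # 4-bit quantization -> 0.5 bytes per parameter
    return params_billion * 1e9 * bytes_per_weight / 1e9

for size in (7, 13, 30, 70):
    print(f"{size}B at 4-bit needs roughly {approx_weight_memory_gb(size):.1f} GB just for the weights")
# 7B  -> ~3.5 GB of weights, which is why it fits on an 8 GB laptop
# 70B -> ~35 GB of weights, which is why it needs a 64 GB+ machine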
Go to the model library on the Ollama website (https://ollama.com).
After that, click on a model name and copy the terminal command shown on its page.
Then, open a terminal window and paste the command.
This downloads the model and lets you chat with it immediately.
Ollama CLI — Core Commands
Ollama’s CLI is central to model management. Common commands include:
- ollama pull <model> — Download a model
- ollama run <model> — Run a model interactively
- ollama list or ollama ls — List downloaded models
- ollama rm <model> — Remove a model
- ollama create <name> -f <Modelfile> — Create a custom model from a Modelfile
- ollama serve — Start the Ollama API server
- ollama ps — Show running models
- ollama stop <model> — Stop a running model
- ollama help — Show help
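When the Ollama server is running (see the server section below), most of these commands also have REST equivalents. As a small example, the /api/tags endpoint returns the same information as ollama list; here is a minimal sketch using the requests library:

import requests

# GET /api/tags lists the locally downloaded models, similar to `ollama list`
r = requests.get("http://localhost:11434/api/tags")
for model in r.json().get("models", []):
    size_gb = model["size"] / 1e9
    print(f"{model['name']}  ({size_gb:.1f} GB)")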
Advanced Customization: Custom model with Modelfiles
You can customize a model’s personality and constraints using a Modelfile (this changes the prompt and inference parameters, not the underlying weights, so it is not real fine-tuning). The format is similar to a Dockerfile.
- Create a file named Modelfile
- Add the following configuration:
# 1. Base the model on an existing one
FROM llama3
# 2. Set the creative temperature (0.0 = precise, 1.0 = creative)
PARAMETER temperature 0.7
# 3. Set the context window size (default is 4096 tokens)
PARAMETER num_ctx 4096
# 4. Define the System Prompt (The AI’s “brain”)
SYSTEM """
You are a Senior Python Backend Engineer.
Only answer with code snippets and brief technical explanations.
Do not be conversational.
"""
- FROM defines the base model
- SYSTEM sets the system prompt
- PARAMETER controls inference behavior
After that, you need to build the model by using this command:
ollama create [change-to-your-custom-name] -f Modelfile
This wraps the model + prompt template together into a reusable package.
Then run it:
ollama run [change-to-your-custom-name]
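Once built, the custom model behaves like any other local model, so you can also call it from the official Python library introduced later in this tutorial. A minimal sketch (the model name below is a placeholder; use whatever name you passed to ollama create):

import ollama

# 'my-python-expert' is a placeholder: use the name you passed to `ollama create`
response = ollama.chat(
    model='my-python-expert',
    messages=[{'role': 'user', 'content': 'Write a FastAPI endpoint that returns the current time.'}],
)
print(response['message']['content'])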
Ollama Server (Local API)
Ollama can run as a local server that apps can call. To start the server, use the command:
ollama serve
It listens on http://localhost:11434 by default.
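A quick way to confirm the server is up is to hit the root endpoint, which replies with a plain-text "Ollama is running" message; a minimal sketch:

import requests

# The root endpoint returns a plain-text health message when the server is running
r = requests.get("http://localhost:11434")
print(r.status_code, r.text)  # expected: 200 Ollama is running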
Raw HTTP
import requests

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Hello Ollama"}],
        "stream": False,  # return a single JSON response instead of a token stream
    },
)
print(r.json()["message"]["content"])
This lets you embed Ollama into apps or services.
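Note that /api/chat streams newline-delimited JSON by default, which is why the example above sets "stream": False. If you want tokens as they are generated, read the response line by line instead; a sketch:

import json
import requests

# Without "stream": False, /api/chat sends one JSON object per line as tokens arrive
with requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Hello Ollama"}],
    },
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk["message"]["content"], end="", flush=True)
        if chunk.get("done"):
            break
print()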
Python Integration
Use Ollama inside Python applications with the official library.
First, create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
Install the official library:
pip install ollama
Use this simple Python code:
import ollama

# This sends a message to the model 'gemma:2b'
response = ollama.chat(
    model='gemma:2b',
    messages=[
        {
            'role': 'user',
            'content': 'Write a short poem about coding.',
        },
    ],
)

# Print the AI's reply
print(response['message']['content'])
This works over the local API automatically when Ollama is running.
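The official library also supports streaming, which makes chat-style apps feel much more responsive. A sketch using the library's stream=True option:

import ollama

# stream=True makes ollama.chat return an iterator of partial responses
stream = ollama.chat(
    model='gemma:2b',
    messages=[{'role': 'user', 'content': 'Write a short poem about coding.'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()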
You can also call the local REST API directly with requests, exactly as shown in the Raw HTTP example above.
Using Ollama Cloud
Ollama also supports cloud models — useful when your machine can’t run very large models.
First, create an account on ollama.com.
In the models list, you will see models with the -cloud suffix, which means they run in the Ollama cloud.
Click on one and copy the CLI command. Then, inside the terminal, sign in to your Ollama account:
ollama signin
Once you are signed in, you can run cloud models:
ollama run nemotron-3-nano:30b-cloud
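Once signed in, cloud models can be used through the same CLI, API, and Python library as local models; the inference simply runs on Ollama's servers. A hedged sketch reusing the model tag from the command above (availability may depend on your account):

import ollama

# After `ollama signin`, cloud models are addressed like local ones;
# the inference runs on Ollama's servers instead of your machine
response = ollama.chat(
    model='nemotron-3-nano:30b-cloud',
    messages=[{'role': 'user', 'content': 'Summarize what Ollama Cloud is in one sentence.'}],
)
print(response['message']['content'])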
Your Own Model in the Cloud
While Ollama is local-first, Ollama Cloud allows you to push your custom models (the ones you built with Modelfiles) to the web to share with your team or use across devices.
- Create an account at ollama.com.
- Add your public key (found in ~/.ollama/id_ed25519.pub).
- Push your custom model:
ollama push your-username/change-to-your-custom-model-name
Conclusion
That is the complete overview of Ollama! It is a powerful tool that gives you full control over the AI models you run. If you liked this tutorial, please like it and share your feedback in the comments below.
Cheers! ;)