Ollama Provider Reference¶
The Ollama provider enables running LLMs locally with complete privacy and no API costs.
Configuration¶
Prerequisites¶
Install Ollama:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download from https://ollama.ai/download
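# Verify the CLI is on your PATH before continuing (version output will vary)
ollama --version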
Start Ollama Service¶
# Start Ollama (required before using AgentiCraft)
ollama serve
# Pull models you want to use
ollama pull llama2 # 7B model (3.8GB)
ollama pull llama2:13b # 13B model (7.3GB)
ollama pull mistral # Fast alternative
ollama pull codellama # For code generation
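# List the models installed locally - these names are what you pass as `model`
ollama list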
Environment Variables¶
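Ollama needs no API key. The only environment-related setting you typically care about is the server address: the Ollama daemon honors OLLAMA_HOST, and on the AgentiCraft side the address is passed through the base_url parameter shown in the next section. A minimal sketch of wiring an environment variable into the agent (the OLLAMA_BASE_URL name here is just an example for your own code, not a variable AgentiCraft reads automatically):
import os
from agenticraft import Agent

# Fall back to the local default if the variable is not set
agent = Agent(
    provider="ollama",
    model="llama2",
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
    timeout=120,
)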
Initialization¶
from agenticraft import Agent
# IMPORTANT: Always set appropriate timeout for Ollama
agent = Agent(
name="LocalBot",
provider="ollama",
model="llama2", # or "llama2:latest"
timeout=120 # 2 minutes - essential for CPU inference!
)
# Custom host
agent = Agent(
name="RemoteBot",
provider="ollama",
model="mistral",
base_url="http://192.168.1.100:11434",
timeout=180
)
⚠️ Critical: Timeout Configuration¶
Ollama requires longer timeouts than cloud providers, especially on CPU:
# ❌ This will likely timeout on CPU
agent = Agent(provider="ollama", model="llama2") # Default timeout too short
# ✅ Always set explicit timeout
agent = Agent(
provider="ollama",
model="llama2",
timeout=120, # Minimum 2 minutes recommended
max_tokens=100 # Limit response length for faster generation
)
Timeout Guidelines¶
Scenario | Recommended Timeout | Notes |
---|---|---|
First run (model loading) | 300s (5 min) | Model loads into memory |
Simple queries | 60-120s | Short prompts, limited tokens |
Complex queries | 180-300s | Longer responses |
GPU available | 30-60s | Much faster than CPU |
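If a single agent has to cover several of these scenarios, it is simplest to size the timeout for the worst case you expect (a sketch; the values are taken from the table above):
from agenticraft import Agent

# Sized for a cold start on CPU; lower it once the model stays loaded
agent = Agent(
    provider="ollama",
    model="llama2",
    timeout=300,     # first run: model load + generation
    max_tokens=150,  # keep generation time bounded
)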
Supported Models¶
Model | Size | Command | Use Case |
---|---|---|---|
llama2 | 3.8GB | ollama pull llama2 | General purpose |
llama2:13b | 7.3GB | ollama pull llama2:13b | Better quality |
llama2:70b | 40GB | ollama pull llama2:70b | Best quality |
mistral | 4.1GB | ollama pull mistral | Fast, efficient |
codellama | 3.8GB | ollama pull codellama | Code generation |
phi | 1.6GB | ollama pull phi | Tiny, very fast |
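Any tag from the table can be passed directly as the model name once it has been pulled. For example, a code-focused agent (a sketch assuming codellama is already pulled and, like the other snippets on this page, an async context):
from agenticraft import Agent

code_agent = Agent(
    name="CodeHelper",
    provider="ollama",
    model="codellama",
    timeout=180,
    max_tokens=200,
)
response = await code_agent.arun("Write a Python function that reverses a string.")
print(response.content)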
Performance Characteristics¶
Expected Generation Times (CPU)¶
# First request (model loading)
# Llama2 7B: 15-30 seconds to load
# Then: 1-5 tokens/second generation
# Subsequent requests (model in memory)
# Simple prompt (10-50 tokens): 5-15 seconds
# Medium prompt (100-200 tokens): 20-60 seconds
# Long prompt (500+ tokens): 2-5 minutes
# With GPU acceleration
# 5-10x faster than CPU
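These figures vary a lot with hardware, so it is worth timing a request on your own machine. A minimal sketch using the same Agent API as above (model name and timeout are just examples):
import asyncio
import time
from agenticraft import Agent

async def measure_once():
    # Generous timeout to cover a cold start; small max_tokens keeps the test short
    agent = Agent(provider="ollama", model="llama2", timeout=300, max_tokens=50)
    start = time.time()
    response = await agent.arun("Explain what RAM is in one sentence.")
    print(f"Generated in {time.time() - start:.1f}s")
    print(response.content)

asyncio.run(measure_once())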
⚠️ Common Issues and Solutions¶
Issue: Timeouts during normal operation¶
Problem: Default timeout too short for local inference
# This often fails with timeout
agent = Agent(provider="ollama", model="llama2")
response = await agent.arun("Explain quantum computing") # Timeout!
Solution: Set appropriate timeout and limit response length
agent = Agent(
provider="ollama",
model="llama2",
timeout=180, # 3 minutes
max_tokens=100 # Limit response length
)
Issue: First request very slow¶
Problem: Model needs to load into memory (15-30 seconds)
Solution: Warm up the model
async def warm_up_model():
"""Load model into memory with simple query"""
agent = Agent(provider="ollama", model="llama2", timeout=300)
await agent.arun("Hi") # Simple query to load model
print("Model loaded and ready!")
# Run warmup before main tasks
await warm_up_model()
Issue: Inconsistent performance¶
Problem: System load and whether the model is still in memory both affect performance
Solution: Add delays between requests
import asyncio
# Process multiple queries with delays
queries = ["Question 1", "Question 2", "Question 3"]
for query in queries:
response = await agent.arun(query)
print(response.content)
await asyncio.sleep(2) # Give Ollama time to stabilize
Configuration Options¶
# Optimized configuration for local inference
agent = Agent(
name="OptimizedOllama",
provider="ollama",
model="llama2",
# Essential settings
timeout=180, # 3 minutes - adjust based on your hardware
max_tokens=150, # Limit response length for speed
# Ollama-specific options
temperature=0.7, # 0.0-1.0
top_p=0.9, # Nucleus sampling
top_k=40, # Top-k sampling
repeat_penalty=1.1, # Penalize repetition
seed=42, # Reproducible outputs
# Advanced options (if needed)
num_ctx=2048, # Context window (default: 2048)
num_gpu=1, # GPU layers (if available)
num_thread=8, # CPU threads
)
Performance Optimization¶
Quick Responses Configuration¶
# Optimized for speed
fast_agent = Agent(
provider="ollama",
model="llama2",
timeout=60,
temperature=0.1, # Low randomness for short, focused answers
max_tokens=50, # Short responses
top_k=10 # Restrict vocabulary
)
# Use for simple queries
response = await fast_agent.arun("What is 2+2?")
Quality Responses Configuration¶
# Optimized for quality (slower)
quality_agent = Agent(
provider="ollama",
model="llama2:13b", # Larger model
timeout=300, # 5 minutes
temperature=0.7,
max_tokens=500,
num_ctx=4096 # Larger context
)
Batch Processing¶
async def batch_process(queries: list, delay: float = 2.0):
"""Process multiple queries with delays"""
agent = Agent(
provider="ollama",
model="llama2",
timeout=120,
max_tokens=100
)
results = []
for i, query in enumerate(queries):
print(f"Processing {i+1}/{len(queries)}...")
try:
response = await agent.arun(query)
results.append(response.content)
except Exception as e:
results.append(f"Error: {e}")
# Delay between requests
if i < len(queries) - 1:
await asyncio.sleep(delay)
return results
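For example (inside an async context, as in the other snippets on this page):
queries = ["What is RAM?", "Define CPU", "List 3 programming languages"]
results = await batch_process(queries, delay=2.0)
for query, result in zip(queries, results):
    print(f"Q: {query}\nA: {result}\n")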
Best Practices¶
- Always set explicit timeout: Minimum 120 seconds for CPU
- Limit response length: Use max_tokens to control generation time
- Warm up models: First request loads model into memory
- Add delays: Space out requests to prevent overwhelming Ollama
- Monitor resources: Check CPU/RAM usage during inference (see the sketch after this list)
- Use appropriate models: Smaller models for speed, larger for quality
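For the "monitor resources" item, a minimal sketch using the third-party psutil package (pip install psutil; not part of AgentiCraft):
import psutil

def print_resource_usage(label: str) -> None:
    """Print a one-line CPU/RAM snapshot."""
    mem = psutil.virtual_memory()
    cpu = psutil.cpu_percent(interval=1)
    print(f"[{label}] CPU: {cpu:.0f}% | RAM: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")

print_resource_usage("before request")
# ... run your agent call here ...
print_resource_usage("after request")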
Complete Working Example¶
import asyncio
import time
from agenticraft import Agent
class LocalAssistant:
def __init__(self):
# Check if Ollama is running
self._check_ollama()
# Create agents for different purposes
self.fast_agent = Agent(
name="FastLocal",
provider="ollama",
model="llama2",
timeout=90,
temperature=0.1,
max_tokens=50
)
self.balanced_agent = Agent(
name="BalancedLocal",
provider="ollama",
model="llama2",
timeout=180,
temperature=0.7,
max_tokens=200
)
def _check_ollama(self):
"""Verify Ollama is accessible"""
import httpx
try:
response = httpx.get("http://localhost:11434/api/tags")
print("✅ Ollama is running")
except Exception:
raise Exception(
"❌ Ollama not running. Start with: ollama serve"
)
async def _warmup(self):
"""Load models into memory"""
try:
await self.fast_agent.arun("Hi")
print("✅ Models loaded")
except Exception as e:
print(f"⚠️ Warmup failed: {e}")
async def quick_answer(self, question: str) -> str:
"""Fast responses for simple questions"""
start = time.time()
try:
response = await self.fast_agent.arun(question)
elapsed = time.time() - start
print(f"⏱️ Response time: {elapsed:.1f}s")
return response.content
except Exception as e:
return f"Error: {e}"
async def detailed_response(self, prompt: str) -> str:
"""Detailed responses (slower)"""
start = time.time()
try:
response = await self.balanced_agent.arun(prompt)
elapsed = time.time() - start
print(f"⏱️ Response time: {elapsed:.1f}s")
return response.content
except Exception as e:
return f"Error: {e}"
async def batch_queries(self, queries: list) -> list:
"""Process multiple queries efficiently"""
results = []
for i, query in enumerate(queries):
print(f"\nProcessing {i+1}/{len(queries)}: {query[:50]}...")
# Use fast agent for simple queries
if len(query) < 50 and "?" in query:
result = await self.quick_answer(query)
else:
result = await self.detailed_response(query)
results.append(result)
# Delay between requests
if i < len(queries) - 1:
await asyncio.sleep(2)
return results
# Usage example
async def main():
print("🦙 Local LLM Assistant")
print("=" * 50)
# Initialize assistant
assistant = LocalAssistant()
# Warm up models inside the running event loop
# (asyncio.run() cannot be called from __init__ while main() is already running)
print("Warming up models...")
await assistant._warmup()
# Quick questions
print("\n📌 Quick Answers:")
quick_q = [
"What is 2+2?",
"Capital of France?",
"Define CPU"
]
for q in quick_q:
answer = await assistant.quick_answer(q)
print(f"Q: {q}")
print(f"A: {answer}\n")
await asyncio.sleep(1)
# Detailed response
print("\n📌 Detailed Response:")
detailed = await assistant.detailed_response(
"Explain the benefits of running AI models locally"
)
print(f"Response: {detailed[:200]}...")
# Batch processing
print("\n📌 Batch Processing:")
batch = [
"What is RAM?",
"Explain how neural networks work",
"List 3 programming languages"
]
results = await assistant.batch_queries(batch)
for q, r in zip(batch, results):
print(f"\nQ: {q}")
print(f"A: {r[:100]}...")
if __name__ == "__main__":
asyncio.run(main())
Troubleshooting Guide¶
Debugging Timeout Issues¶
import time
import httpx
from agenticraft import Agent

async def debug_ollama():
"""Diagnose Ollama performance issues"""
print("🔍 Ollama Diagnostics")
print("=" * 40)
# Check connection
try:
async with httpx.AsyncClient() as client:
response = await client.get("http://localhost:11434/api/tags")
models = response.json().get("models", [])
print(f"✅ Connected. Models: {len(models)}")
for model in models:
print(f" - {model['name']} ({model['size'] / 1e9:.1f}GB)")
except Exception:
print("❌ Cannot connect to Ollama")
return
# Test performance
timeouts = [30, 60, 120, 180]
for timeout in timeouts:
print(f"\n⏱️ Testing {timeout}s timeout...")
agent = Agent(
provider="ollama",
model="llama2",
timeout=timeout,
max_tokens=10
)
try:
start = time.time()
await agent.arun("Say hello")
elapsed = time.time() - start
print(f" ✅ Success in {elapsed:.1f}s")
except Exception as e:
print(f" ❌ Failed: {type(e).__name__}")
# Run diagnostics if having issues
await debug_ollama()
Hardware Recommendations¶
Model Size | Minimum RAM | Recommended RAM | GPU Recommended |
---|---|---|---|
2-3B (phi) | 4GB | 8GB | No |
7B (llama2) | 8GB | 16GB | Yes |
13B | 16GB | 32GB | Yes |
70B | 64GB | 128GB | Required |
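As a rough rule of thumb based on the table, you can pick a default model from the RAM that is actually installed. A sketch using the third-party psutil package; the thresholds are approximations, not hard limits:
import psutil

def pick_default_model() -> str:
    """Suggest a model tag based on total system RAM."""
    total_gb = psutil.virtual_memory().total / 1e9
    if total_gb >= 32:
        return "llama2:13b"
    if total_gb >= 16:
        return "llama2"
    return "phi"

print(f"Suggested model: {pick_default_model()}")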
See Also¶
- Agent API - Core agent functionality
- WorkflowAgent Guide - Tool usage patterns
- Performance Tuning - Optimization tips
- Ollama Docs - Official Ollama documentation