Telemetry Metrics Reference¶
Complete reference for all metrics collected by AgentiCraft's telemetry system.
Metric Naming Convention¶
All AgentiCraft metrics follow a consistent naming pattern; reconstructed from the metric names in this reference, it takes the form:
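```
agenticraft.<component>.<measurement>.<unit>
```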
Examples:
- agenticraft.agent.requests.total
- agenticraft.tokens.used.count
- agenticraft.provider.latency.milliseconds
Automatic Metrics¶
These metrics are automatically collected when telemetry is enabled.
Agent Metrics¶
agenticraft.agent.requests.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total number of agent requests
- Attributes:
    - agent.name: Name of the agent
    - agent.type: Type of agent (base, reasoning, workflow, etc.)
    - status: Success or failure
agenticraft.agent.latency.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Agent request processing time
- Attributes:
    - agent.name: Name of the agent
    - agent.type: Type of agent
    - operation: Operation performed (execute, plan, reason)
- Buckets: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
agenticraft.agent.errors.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total number of agent errors
- Attributes:
    - agent.name: Name of the agent
    - error.type: Exception class name
    - operation: Operation that failed
agenticraft.agent.active.count¶
- Type: Gauge
- Unit: 1 (count)
- Description: Number of currently active agents
- Attributes:
    - agent.type: Type of agent
Token Usage Metrics¶
agenticraft.tokens.prompt.total¶
- Type: Counter
- Unit: 1 (tokens)
- Description: Total prompt tokens consumed
- Attributes:
    - provider: LLM provider (openai, anthropic, ollama)
    - model: Model name (gpt-4, claude-3, etc.)
    - agent.name: Agent that made the request
agenticraft.tokens.completion.total¶
- Type: Counter
- Unit: 1 (tokens)
- Description: Total completion tokens generated
- Attributes:
    - provider: LLM provider
    - model: Model name
    - agent.name: Agent that made the request
agenticraft.tokens.total.total¶
- Type: Counter
- Unit: 1 (tokens)
- Description: Total tokens (prompt + completion)
- Attributes:
    - provider: LLM provider
    - model: Model name
    - agent.name: Agent that made the request
agenticraft.tokens.cost.dollars¶
- Type: Counter
- Unit: dollars
- Description: Estimated cost of token usage
- Attributes:
    - provider: LLM provider
    - model: Model name
    - agent.name: Agent that made the request
Provider Metrics¶
agenticraft.provider.requests.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total provider API requests
- Attributes:
    - provider: Provider name
    - model: Model name
    - status: Success or failure
agenticraft.provider.latency.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Provider API response time
- Attributes:
    - provider: Provider name
    - model: Model name
    - operation: Type of operation (complete, stream, embed)
- Buckets: [50, 100, 250, 500, 1000, 2500, 5000, 10000, 25000]
agenticraft.provider.errors.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Provider API errors
- Attributes:
    - provider: Provider name
    - error.type: Error type (rate_limit, timeout, api_error)
    - status_code: HTTP status code (if applicable)
agenticraft.provider.rate_limit.remaining¶
- Type: Gauge
- Unit: 1 (requests)
- Description: Remaining rate limit
- Attributes:
    - provider: Provider name
    - limit_type: requests_per_minute, tokens_per_minute
Tool Metrics¶
agenticraft.tool.executions.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total tool executions
- Attributes:
    - tool.name: Name of the tool
    - tool.category: Tool category
    - status: Success or failure
agenticraft.tool.latency.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Tool execution time
- Attributes:
    - tool.name: Name of the tool
    - tool.category: Tool category
- Buckets: [10, 50, 100, 500, 1000, 5000, 10000, 30000]
agenticraft.tool.errors.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Tool execution errors
- Attributes:
    - tool.name: Name of the tool
    - error.type: Exception class name
agenticraft.tool.input.size.bytes¶
- Type: Histogram
- Unit: bytes
- Description: Size of tool input data
- Attributes:
    - tool.name: Name of the tool
- Buckets: [100, 1000, 10000, 100000, 1000000]
agenticraft.tool.output.size.bytes¶
- Type: Histogram
- Unit: bytes
- Description: Size of tool output data
- Attributes:
    - tool.name: Name of the tool
- Buckets: [100, 1000, 10000, 100000, 1000000]
Memory Metrics¶
agenticraft.memory.operations.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total memory operations
- Attributes:
    - operation: store, retrieve, search, delete
    - memory.type: simple, vector, graph
    - status: Success or failure
agenticraft.memory.latency.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Memory operation latency
- Attributes:
    - operation: store, retrieve, search, delete
    - memory.type: simple, vector, graph
- Buckets: [1, 5, 10, 25, 50, 100, 250, 500]
agenticraft.memory.size.items¶
- Type: Gauge
- Unit: 1 (items)
- Description: Number of items in memory
- Attributes:
    - memory.type: simple, vector, graph
agenticraft.memory.size.bytes¶
- Type: Gauge
- Unit: bytes
- Description: Memory storage size
- Attributes:
    - memory.type: simple, vector, graph
agenticraft.memory.hits.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Memory cache hits
- Attributes:
    - memory.type: simple, vector, graph
agenticraft.memory.misses.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Memory cache misses
- Attributes:
    - memory.type: simple, vector, graph
Workflow Metrics¶
agenticraft.workflow.executions.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total workflow executions
- Attributes:
    - workflow.name: Name of the workflow
    - status: Success, failure, cancelled
agenticraft.workflow.latency.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Workflow execution time
- Attributes:
    - workflow.name: Name of the workflow
- Buckets: [100, 500, 1000, 5000, 10000, 30000, 60000, 300000]
agenticraft.workflow.steps.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total workflow steps executed
- Attributes:
    - workflow.name: Name of the workflow
    - step.name: Name of the step
    - status: Success or failure
agenticraft.workflow.active.count¶
- Type: Gauge
- Unit: 1 (count)
- Description: Currently active workflows
- Attributes:
    - workflow.name: Name of the workflow
Reasoning Metrics¶
agenticraft.reasoning.operations.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total reasoning operations
- Attributes:
    - pattern: chain_of_thought, tree_of_thoughts, react
    - status: Success or failure
agenticraft.reasoning.latency.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Reasoning operation time
- Attributes:
    - pattern: Reasoning pattern used
- Buckets: [100, 500, 1000, 2500, 5000, 10000, 25000]
agenticraft.reasoning.steps.count¶
- Type: Histogram
- Unit: 1 (steps)
- Description: Number of reasoning steps
- Attributes:
    - pattern: Reasoning pattern used
- Buckets: [1, 2, 3, 5, 10, 20, 50, 100]
agenticraft.reasoning.confidence.ratio¶
- Type: Histogram
- Unit: ratio (0-1)
- Description: Reasoning confidence score
- Attributes:
    - pattern: Reasoning pattern used
- Buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]
Streaming Metrics¶
agenticraft.streaming.chunks.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Total stream chunks sent
- Attributes:
    - provider: LLM provider
    - model: Model name
agenticraft.streaming.latency.first_chunk.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Time to first stream chunk
- Attributes:
    - provider: LLM provider
    - model: Model name
- Buckets: [10, 25, 50, 100, 250, 500, 1000, 2500]
agenticraft.streaming.duration.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Total stream duration
- Attributes:
    - provider: LLM provider
    - model: Model name
- Buckets: [100, 500, 1000, 5000, 10000, 30000, 60000]
agenticraft.streaming.interruptions.total¶
- Type: Counter
- Unit: 1 (count)
- Description: Stream interruptions
- Attributes:
    - provider: LLM provider
    - reason: timeout, user_cancelled, error
System Metrics¶
agenticraft.system.cpu.percent¶
- Type: Gauge
- Unit: percent
- Description: CPU usage percentage
- Attributes:
    - process: agenticraft
agenticraft.system.memory.bytes¶
- Type: Gauge
- Unit: bytes
- Description: Memory usage
- Attributes:
    - process: agenticraft
    - type: rss, vms, heap
agenticraft.system.threads.count¶
- Type: Gauge
- Unit: 1 (count)
- Description: Number of active threads
- Attributes:
    - process: agenticraft
agenticraft.system.gc.count¶
- Type: Counter
- Unit: 1 (count)
- Description: Garbage collection runs
- Attributes:
    - generation: 0, 1, 2
agenticraft.system.gc.duration.milliseconds¶
- Type: Histogram
- Unit: milliseconds
- Description: Garbage collection duration
- Attributes:
    - generation: 0, 1, 2
- Buckets: [1, 5, 10, 25, 50, 100, 250, 500]
Custom Metrics¶
Creating Custom Counters¶
```python
from agenticraft.telemetry import create_counter

# Create a counter for processed documents
doc_counter = create_counter(
    name="custom.documents.processed",
    description="Number of documents processed",
    unit="1"
)

# Use in your code
doc_counter.add(1, {
    "document.type": "pdf",
    "document.size": "large",
    "processor.version": "2.0"
})
```
Creating Custom Histograms¶
```python
from agenticraft.telemetry import create_histogram

# Create a histogram for processing time
processing_time = create_histogram(
    name="custom.processing.duration",
    description="Document processing duration",
    unit="milliseconds",
    boundaries=[10, 50, 100, 500, 1000, 5000]
)

# Record values
processing_time.record(
    value=234.5,
    attributes={
        "processor": "nlp",
        "complexity": "high"
    }
)
```
Creating Custom Gauges¶
```python
from agenticraft.telemetry import create_gauge

processing_queue = []  # stand-in for your application's work queue

# Create a gauge for queue size
def get_queue_size():
    return len(processing_queue)

queue_gauge = create_gauge(
    name="custom.queue.size",
    description="Processing queue size",
    unit="1"
)

# Register callback
queue_gauge.add_callback(
    callback=get_queue_size,
    attributes={"queue.name": "documents"}
)
```
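Callback-based gauges such as get_queue_size above are typically sampled when metrics are collected or exported rather than at the call site, so keep callbacks fast and free of side effects.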
Metric Aggregations¶
Prometheus Queries¶
```promql
# Per-second request rate, averaged over the last minute
rate(agenticraft_agent_requests_total[1m])

# Average latency by agent
avg by (agent_name) (
  rate(agenticraft_agent_latency_milliseconds_sum[5m]) /
  rate(agenticraft_agent_latency_milliseconds_count[5m])
)

# 95th percentile latency
histogram_quantile(0.95,
  rate(agenticraft_provider_latency_milliseconds_bucket[5m])
)

# Error rate percentage
100 * (
  rate(agenticraft_agent_errors_total[5m]) /
  rate(agenticraft_agent_requests_total[5m])
)

# Token usage per hour by model
sum by (model) (
  increase(agenticraft_tokens_total_total[1h])
)

# Memory hit rate
rate(agenticraft_memory_hits_total[5m]) /
(rate(agenticraft_memory_hits_total[5m]) + rate(agenticraft_memory_misses_total[5m]))

# Cost per agent over the last 24h
sum by (agent_name) (
  increase(agenticraft_tokens_cost_dollars[24h])
)
```
Grafana Dashboard Panels¶
Request Overview¶
```json
{
  "title": "Request Rate",
  "targets": [{
    "expr": "sum(rate(agenticraft_agent_requests_total[5m]))",
    "legendFormat": "Total RPS"
  }]
}
```
Latency Distribution¶
```json
{
  "title": "Latency Percentiles",
  "targets": [
    {
      "expr": "histogram_quantile(0.50, rate(agenticraft_agent_latency_milliseconds_bucket[5m]))",
      "legendFormat": "p50"
    },
    {
      "expr": "histogram_quantile(0.90, rate(agenticraft_agent_latency_milliseconds_bucket[5m]))",
      "legendFormat": "p90"
    },
    {
      "expr": "histogram_quantile(0.99, rate(agenticraft_agent_latency_milliseconds_bucket[5m]))",
      "legendFormat": "p99"
    }
  ]
}
```
Metric Best Practices¶
1. Attribute Cardinality¶
Keep attribute cardinality low to prevent metric explosion:
```python
# Bad - High cardinality
metric.add(1, {"user_id": user_id})  # Millions of values

# Good - Low cardinality
metric.add(1, {"user_tier": get_user_tier(user_id)})  # Few values
```
2. Consistent Naming¶
Follow the naming convention:
# Good names
"agenticraft.cache.hits.total"
"agenticraft.api.latency.milliseconds"
"agenticraft.queue.size.items"
# Bad names
"hits" # Too generic
"agenticraft_cache_hits" # Wrong separator
"latency" # Missing unit
3. Meaningful Attributes¶
Include attributes that aid in debugging and analysis:
```python
# Good attributes
processing_time.record(duration, {
    "stage": "preprocessing",
    "document_type": "pdf",
    "size_category": "large",  # Not exact size
    "version": "2.0"
})

# Avoid sensitive data
# Never include: user_email, api_keys, passwords, PII
```
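One way to enforce this is to filter attributes centrally before recording them. The helper below is a hypothetical sketch (not an AgentiCraft API), with an illustrative blocklist:

```python
# Hypothetical attribute filter: drop keys that commonly carry PII or secrets
# before they are attached to any metric. Extend the blocklist as needed.
BLOCKED_KEYS = {"user_email", "email", "api_key", "password", "ssn"}

def safe_attributes(attributes: dict) -> dict:
    return {k: v for k, v in attributes.items() if k.lower() not in BLOCKED_KEYS}

# Usage with the histogram from the earlier example:
# processing_time.record(duration, safe_attributes(raw_attributes))
```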
4. Histogram Buckets¶
Choose buckets that match your SLOs:
```python
# For API latency (SLO: 99% < 1s)
latency_histogram = create_histogram(
    name="api.latency",
    boundaries=[10, 50, 100, 250, 500, 750, 1000, 2000, 5000]
)

# For batch processing (SLO: 99% < 5m)
batch_histogram = create_histogram(
    name="batch.duration",
    boundaries=[1000, 10000, 30000, 60000, 120000, 180000, 300000]
)
```
5. Resource Metrics¶
Monitor resource usage to prevent issues:
```python
# Alert on high memory usage
if memory_gauge.get() > 0.9 * max_memory:
    alert("High memory usage detected")
```
Alerting Examples¶
Prometheus Alert Rules¶
```yaml
groups:
  - name: agenticraft
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(agenticraft_agent_errors_total[5m]) /
          rate(agenticraft_agent_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5%"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(agenticraft_agent_latency_milliseconds_bucket[5m])
          ) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 5s"

      # Token usage spike
      - alert: TokenUsageSpike
        expr: |
          rate(agenticraft_tokens_total_total[5m]) >
          2 * avg_over_time(rate(agenticraft_tokens_total_total[5m])[1h:5m])
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Token usage 2x above average"

      # Memory pressure
      # Note: assumes a {type="limit"} series is exported alongside rss/vms/heap.
      - alert: HighMemoryUsage
        expr: |
          agenticraft_system_memory_bytes{type="rss"} /
          agenticraft_system_memory_bytes{type="limit"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 80%"
```
Performance Impact¶
Metric collection overhead:
| Operation | Overhead |
|---|---|
| Counter increment | ~10ns |
| Histogram record | ~50ns |
| Gauge callback | ~100ns |
| Attribute addition | ~20ns per attribute |
Total overhead with standard instrumentation: <1% of request time
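To sanity-check these figures in your own environment, a rough micro-benchmark along the following lines can be used; it relies only on the create_counter API shown above, and the loop size and reported figure are illustrative, not guaranteed overheads:

```python
import time

from agenticraft.telemetry import create_counter

bench_counter = create_counter(
    name="custom.bench.increments",
    description="Scratch counter for measuring instrumentation overhead",
    unit="1"
)

N = 100_000
start = time.perf_counter()
for _ in range(N):
    bench_counter.add(1, {"source": "benchmark"})
elapsed = time.perf_counter() - start

# Average cost of one increment, in nanoseconds
print(f"~{elapsed / N * 1e9:.0f} ns per counter.add()")
```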
Troubleshooting Metrics¶
Missing Metrics¶
1. Verify telemetry is enabled in your configuration (a quick scrape-endpoint check is sketched below).
2. Check that the metric is registered before it is first recorded.
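A quick way to check both points is to look for the metric on the Prometheus-format endpoint your exporter serves; the URL below is an assumption and depends on your exporter configuration:

```python
import urllib.request

# Assumed scrape endpoint; adjust host/port to match your exporter configuration.
METRICS_URL = "http://localhost:9464/metrics"

def metric_is_exported(name: str) -> bool:
    # Fetch the Prometheus exposition text and check that the metric appears.
    with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
        body = resp.read().decode("utf-8")
    return name in body

print(metric_is_exported("agenticraft_agent_requests_total"))
```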
Incorrect Values¶
1. Verify units: confirm that recorded values match the unit in the metric name (for example, milliseconds rather than seconds).
2. Check attribute values: look for typos or inconsistent casing that split one logical series into several.
High Cardinality¶
Monitor cardinality:
```python
import logging

from agenticraft.telemetry import get_metric_cardinality

logger = logging.getLogger(__name__)

for metric_name, cardinality in get_metric_cardinality().items():
    if cardinality > 1000:
        logger.warning(f"High cardinality metric: {metric_name} = {cardinality}")
```
Next Steps¶
- Integration Guide - Connect to monitoring platforms
- Configuration Guide - Detailed configuration options
- Examples - Real-world usage examples