TenSpire runs the infrastructure. You get the API. OpenAI-compatible endpoints on open-weight models. Stretch your platform credits, gain negotiating power, and build failover-ready infrastructure.
Most teams building on AI APIs end up locked into one provider. Costs are unpredictable, rate limits get hit at the worst times, and when model behavior changes, your product breaks. Meanwhile, open-weight models have gotten genuinely good, but running them yourself means building and operating inference infrastructure.
Offload routine or high-volume workloads to TenSpire while preserving your OpenAI, Anthropic, or Google credits for workloads that genuinely require frontier models. Stretch your existing agreements further.
Having a working alternative changes every vendor conversation. When renewal time comes, you're not locked in. You have production-tested capacity running real workloads on open models.
Outages happen. Rate limits get hit. Inference capacity across multiple providers lets you fail over, load-balance, or route by workload type. Don't put all your tokens in one basket.
Public API providers change model behavior and content policies without notice. Open-weight models don't have external policy teams. No surprise capability regressions, no unexplained refusals breaking your product overnight.
Your prompts and outputs are never used for model training. Open-weight models don't phone home.
TenSpire is seeking pilot customers to help shape our API-accessible AI network. We're particularly interested in working with:
Agent platforms and orchestration companies needing reliable, cost-effective inference backends.
Software vendors embedding AI capabilities into products who need predictable costs and control.
SaaS providers with AI features embedded in their platform looking for margin-friendly inference.
Teams looking to diversify inference providers and reduce single-vendor dependency.
As a pilot customer, you'll get hands-on support, direct input into our roadmap, and preferred pricing as we scale.
Same API format you're already using. Drop-in replacement running on open-weight models.
Standard HTTPS endpoints with TLS encryption
Direct connectivity to AWS, Azure, or GCP
Site-to-site VPN for additional network isolation
Drop-in replacement for OpenAI API (/v1/chat/completions, /v1/embeddings, /v1/audio/transcriptions) with standard authentication.
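As a sketch of what the drop-in base URL and standard authentication mean in practice, here is how a chat completion request could be assembled with only the Python standard library. The hostname and key below are placeholders, not real endpoints; with the official OpenAI SDK you would simply pass the same base URL via its `base_url` parameter.

```python
# Build a standard /v1/chat/completions request against any OpenAI-compatible
# base URL. The TenSpire hostname below is a placeholder, not a real endpoint.
import json
from urllib.request import Request

def chat_request(base_url: str, api_key: str, model: str, messages: list) -> Request:
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return Request(
        url=f"{base_url}/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",  # standard bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request(
    "https://standard.tenspire.example",  # placeholder tier endpoint
    "YOUR_API_KEY",
    "qwen3:32b",
    [{"role": "user", "content": "Hello!"}],
)
```

The only change from calling OpenAI directly is the hostname; the payload, path, and auth header are identical.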
A curated lineup covering general chat, coding, reasoning, vision, moderation, and embeddings.
| Model | Parameters | Origin | Best For |
|---|---|---|---|
| Qwen3 235B MoE | 235B MoE | 🇨🇳 Alibaba | Flagship reasoning, complex tasks, best quality output |
| Llama 4 Scout | 109B MoE | 🇺🇸 Meta | General chat, long context (10M tokens), fast 70B-class quality |
| Qwen3 32B | 32B Dense | 🇨🇳 Alibaba | Translations (119 languages), agent/tool calling workflows, medium reasoning |
| Devstral Small 2 | 24B Dense | 🇫🇷 Mistral | General coding |
| Magistral 24B | 24B Dense | 🇫🇷 Mistral | Deep reasoning |
| Llama Guard 3 | 8B Dense | 🇺🇸 Meta | Content moderation |
| Gemma 3 4B | 4B Dense | 🇺🇸 Google | Fast general chat, high-volume workloads |
| Llama 3.2 1B | 1B Dense | 🇺🇸 Meta | Ultra-fast classification, simple extraction |
| mxbai Embed Large | 335M Dense | 🇩🇪 Mixedbread | Text embeddings for semantic search and RAG |
Models can be swapped or added based on pilot requirements.
Use the approach that fits your architecture. Specify the model in your request, or use a model-specific endpoint.
Use the default hostname and specify the model in your request body. This is the simplest approach and works exactly like OpenAI's API.
```
# Hostname determines the tier; the model field determines the model
POST https://{tier-endpoint}/v1/chat/completions
{
  "model": "qwen3:32b",
  "messages": [{"role": "user", "content": "Hello!"}]
}
```
Use a hostname that encodes both tier and model. The model field in your request is ignored. Routing is determined entirely by hostname.
```
# Hostname determines BOTH the tier and the model
POST https://{tier}-{model}-{size}/v1/chat/completions
{
  "model": "ignored",
  "messages": [{"role": "user", "content": "Hello!"}]
}
```
Dynamic selection
Use Option 1 when your application needs to switch models at runtime based on task complexity, user tier, or A/B testing.
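A sketch of that runtime selection under Option 1. The task labels and routing table are illustrative assumptions; the model identifiers follow the lineup above.

```python
# Option 1 sketch: one endpoint, and the "model" field in the request body
# picks the model. Task labels and this routing table are illustrative only.
def pick_model(task: str) -> str:
    routes = {
        "classify": "llama3.2:1b",   # ultra-fast classification
        "chat": "gemma3:4b",         # high-volume general chat
        "code": "devstral-small-2",  # general coding
        "reason": "qwen3:235b",      # flagship reasoning, complex tasks
    }
    return routes.get(task, "qwen3:32b")  # balanced default

payload = {
    "model": pick_model("classify"),
    "messages": [{"role": "user", "content": "Is this message spam?"}],
}
```

The routing decision lives in application code, which is exactly what makes per-request A/B tests or user-tier splits possible.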
Infrastructure routing
Use Option 2 when routing decisions should live in DNS or load balancer config. No code changes needed to switch models.
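Under Option 2 the application stays model-agnostic. A minimal sketch, assuming the base URL (a hostname encoding tier and model, shown here with placeholder values) is supplied through the environment:

```python
# Option 2 sketch: the hostname encodes tier and model, so the application
# only knows a base URL. The example hostnames are placeholders.
import os

BASE_URL = os.environ.get(
    "INFERENCE_BASE_URL",
    "https://standard-qwen3-32b.tenspire.example",  # placeholder default
)

def completions_url() -> str:
    # Switching models is a config or DNS change, never a code change.
    return f"{BASE_URL}/v1/chat/completions"
```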
Different tiers for different workloads: cost-optimized, balanced, or lowest latency.
Cost-optimized for batch jobs, background processing, development. Higher latency acceptable.
Balanced performance and availability for general use. The default tier.
Fastest inference, dedicated resources for production-critical workloads.
Whether you're building an agent platform, embedding AI into a product, or orchestrating multi-step workflows, the same routing patterns apply.
Route high-volume, low-complexity queries (classification, extraction, simple Q&A) to fast/cheap endpoints. Route complex reasoning to high-end models and endpoints.
Test new models by shifting DNS or changing a hostname. No deployments, no feature flags in code, no SDK updates.
Point your primary at one endpoint, your fallback at another. Your application keeps making the same API calls.
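A sketch of that failover pattern. The transport is abstracted behind a `send` callable so the ordering logic stays visible, and the endpoint hostnames are placeholders.

```python
# Failover sketch: try a primary endpoint, then a fallback, with the same
# OpenAI-style payload. Hostnames are placeholders; `send` does the HTTP call.
ENDPOINTS = [
    "https://priority.tenspire.example",  # primary (placeholder)
    "https://api.openai.com",             # fallback provider
]

def chat_with_failover(payload: dict, send) -> dict:
    """Try each base URL in order; return the first successful response."""
    last_error = None
    for base in ENDPOINTS:
        try:
            return send(f"{base}/v1/chat/completions", payload)
        except Exception as exc:  # connection error, rate limit, timeout...
            last_error = exc      # fall through to the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_error!r}")
```

In production, `send` would be an HTTP POST carrying the right API key per provider; retries and circuit breaking can layer on top without the payload changing.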
Interactive user-facing requests go to priority endpoints. Background batch jobs go to economy. All without touching application logic.
Inference optimization becomes an infrastructure concern, not an application concern. Your code calls "the API." Which model, which tier, which tradeoffs: those live in hostnames and config, not in code.
Per-endpoint usage tracking makes cost attribution straightforward. Know exactly what each workload, team, or customer costs you.
We're seeking pilot customers to help shape the network. Hands-on support, direct roadmap input, and preferred pricing as we scale.