Blueprint · AI Infrastructure

AI Model Routing Platform

A routing layer that chooses the right AI model based on task type, cost, latency, user plan, and required output.

Status: In Progress · Difficulty: Hard · Tags: OpenRouter, Model Routing, Usage Metering, AI SaaS, Prompt Precheck

Overview

A model gateway that turns a user's prompt into a policy decision: which assistant family should handle it, which model should be called, what budget applies, and how usage should be shown back to the user.

Problem

AI products often start as one chat box backed by one model. That setup breaks down once tasks start to vary in cost, latency, quality, modality, safety category, and user plan.

Core users

  • AI app builders who need predictable cost control
  • Premium users expecting high-quality assistant behavior
  • Admins configuring model access and plan limits
  • Support engineers investigating usage and provider failures

MVP scope

  • Prompt intake and prompt precheck
  • Assistant family mapping for task/product structure
  • Model policy engine with plan-aware routing
  • Provider adapter for OpenRouter or direct LLM APIs
  • Usage and cost tracker
  • Fallback when a provider fails
  • Admin model activation and plan request flow

Non-goals

  • Do not build custom model training in MVP
  • Do not expose raw provider cost directly as the product price
  • Do not route every request through an expensive classifier if simple rules work
  • Do not promise perfect task classification

Core system components

  • Prompt precheck and category filter
  • Assistant family mapping
  • Model policy engine
  • Provider adapters
  • Usage metering
  • Cost display multiplier
  • Plan limits and entitlements
  • Admin overrides and activation workflow
  • Safety/category filters
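The cost display multiplier is the component that separates raw provider spend from what the user sees. A minimal sketch of one way it could work; the multiplier value, unit scale, and function names are illustrative assumptions, not part of the blueprint:

```python
# Convert raw provider cost into user-facing display units ("credits").
# Both constants are illustrative: the multiplier covers margin and
# smooths provider price changes; the unit scale hides raw dollar cost.
DISPLAY_MULTIPLIER = 1.8   # assumed markup over raw provider cost
UNITS_PER_DOLLAR = 1000    # assumed credits per display dollar

def to_display_units(raw_cost_usd: float) -> int:
    """Map a raw provider cost to whole display credits, never below 1."""
    display_cost = raw_cost_usd * DISPLAY_MULTIPLIER
    return max(1, round(display_cost * UNITS_PER_DOLLAR))
```

A floor of 1 credit keeps even near-free calls visible in the usage display, which matters once credits are the unit users reason about.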

Suggested architecture

  • Flow: prompt intake -> prompt classifier/precheck -> assistant recommendation -> model policy engine -> provider adapter -> usage/cost tracker -> response stream.
  • Frontend: assistant picker, plan state, precheck feedback, streaming response UI, and usage display.
  • Backend: model gateway API that owns routing policies, provider calls, retries, usage records, and cost calculations.
  • Database: assistants, models, providers, plans, entitlements, prompt checks, model calls, usage records, costs, and conversations.
  • Workers: optional async jobs for analytics, usage aggregation, and failed-call reconciliation.
  • External APIs: OpenRouter or direct model providers.
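The flow above can be sketched as a small pipeline of stages. Every function here is a stub standing in for the real component, and all names are assumptions for illustration only:

```python
def precheck(prompt):
    # Stub classifier: a real version would score category and risk.
    is_code = "def " in prompt or "function" in prompt
    return {"category": "code" if is_code else "general", "risk": "low"}

def recommend_assistant(check):
    # Map the prompt category to an assistant family.
    return {"code": "code-assistant", "general": "general-assistant"}[check["category"]]

def choose_model(assistant, plan):
    # Stub policy: premium plans get the larger model.
    return "large-model" if plan == "premium" else "small-model"

def call_provider(model, prompt):
    # Stub provider adapter; a real one streams tokens and records usage.
    return {"model": model, "text": f"response to: {prompt[:20]}", "tokens": 42}

def handle_chat(prompt, plan):
    # The gateway owns the whole chain, so every stage is observable.
    check = precheck(prompt)
    assistant = recommend_assistant(check)
    model = choose_model(assistant, plan)
    result = call_provider(model, prompt)
    return {"assistant": assistant, **result}
```

Keeping each stage a plain function makes the later replacement of any one stage (for example, swapping the rule-based precheck for a classifier model) a local change.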

Data model

  • Assistant: id, name, family, defaultTaskTypes, enabled
  • Model: id, providerId, name, contextWindow, supportsImages, inputCost, outputCost, latencyTier
  • ModelProvider: id, name, apiBaseUrl, status
  • Plan: id, name, monthlyCredits, allowedModels, maxContext
  • UserEntitlement: id, userId, planId, overrides
  • PromptCheck: id, conversationId, category, risk, recommendedAssistant, createdAt
  • ModelCall: id, userId, conversationId, provider, model, status, latencyMs, tokens
  • UsageRecord: id, userId, type, quantity, displayUnits, createdAt
  • CostRecord: id, modelCallId, rawCost, displayCost, multiplier
  • Conversation: id, userId, assistantId, title, createdAt
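The entity fields above leave types open; this dataclass sketch makes one plausible reading explicit. The types, units in the comments, and defaults are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    id: str
    providerId: str
    name: str
    contextWindow: int
    supportsImages: bool
    inputCost: float    # raw provider cost per 1K input tokens (assumed unit)
    outputCost: float   # raw provider cost per 1K output tokens (assumed unit)
    latencyTier: str    # e.g. "fast" | "balanced" | "slow" (assumed values)

@dataclass
class Plan:
    id: str
    name: str
    monthlyCredits: int
    allowedModels: list = field(default_factory=list)
    maxContext: int = 8192  # assumed default cap

@dataclass
class CostRecord:
    id: str
    modelCallId: str
    rawCost: float      # internal only; never shown to users
    displayCost: float  # rawCost * multiplier, in display units
    multiplier: float   # stored per record so past bills survive repricing
```

Storing the multiplier on each CostRecord (rather than only in config) is the detail that keeps historical usage reports stable when pricing changes.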

API design

  • POST /api/ai/precheck - classify prompt and recommend assistant
  • POST /api/ai/chat - route and stream a model response
  • GET /api/ai/models - list available models for the user's plan
  • POST /api/admin/models/:id/activate - enable or disable a model
  • GET /api/usage - show user-facing usage
  • GET /api/admin/model-calls - inspect raw model calls and costs
  • POST /api/plan-requests - request access to a higher plan or model family
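POST /api/ai/precheck can start as cheap keyword rules before any classifier model is involved, in line with the non-goal of not routing every request through an expensive classifier. A hedged sketch; the keyword lists, category names, and response shape are illustrative:

```python
# Illustrative rule tables; a real deployment would tune and extend these.
CATEGORY_KEYWORDS = {
    "code": ("def ", "class ", "stack trace", "compile"),
    "image": ("draw", "generate an image", "picture of"),
}
BLOCKED_TERMS = ("credit card dump",)  # placeholder safety filter

def precheck(prompt: str) -> dict:
    """Classify a prompt and recommend an assistant, cheapest rules first."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return {"category": "blocked", "risk": "high", "recommendedAssistant": None}
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return {"category": category, "risk": "low",
                    "recommendedAssistant": f"{category}-assistant"}
    return {"category": "general", "risk": "low",
            "recommendedAssistant": "general-assistant"}
```

The same response shape maps directly onto the PromptCheck entity (category, risk, recommendedAssistant), so persisting the result is a straight insert.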

Key technical challenges

  • Cost control without making the product feel cheap
  • Matching task type to model capability
  • Fallbacks when providers fail or stream errors occur
  • Streaming output while usage is still being counted
  • Separating raw backend cost from user-facing usage
  • Abuse prevention and prompt category filtering
  • Keeping routing decisions explainable enough for debugging
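Provider fallback can begin as a simple ordered chain. A sketch under the assumption that each adapter is a callable that raises on failure; ProviderError and the adapter signature are illustrative:

```python
class ProviderError(Exception):
    """Raised by a provider adapter when a call or stream fails."""

def call_with_fallback(prompt, adapters):
    """Try each (name, adapter) pair in order; surface all failures at the end."""
    errors = []
    for name, adapter in adapters:
        try:
            return {"provider": name, "text": adapter(prompt)}
        except ProviderError as exc:
            # Record the failure so the fallback rate stays observable.
            errors.append((name, str(exc)))
    raise ProviderError(f"all providers failed: {errors}")
```

Returning the provider name alongside the response keeps the routing decision explainable, which the challenges list calls out as a debugging requirement.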

Tradeoffs

  • Start with rule-based routing and assistant families before adding learned routing.
  • Keep raw cost internal and expose user-facing usage units to avoid confusing users.
  • Use a precheck only when it improves the UX or policy decision.
  • Centralize provider calls so usage and failures are observable in one place.
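The first tradeoff, rule-based routing before learned routing, can be as small as "cheapest plan-allowed model that satisfies the task's requirements". A sketch with an illustrative model table; all names and numbers are assumptions:

```python
# Illustrative model catalog; costs are relative, not real provider prices.
MODELS = [
    {"name": "small-model", "inputCost": 0.2, "supportsImages": False, "contextWindow": 8192},
    {"name": "vision-model", "inputCost": 1.0, "supportsImages": True, "contextWindow": 32768},
    {"name": "large-model", "inputCost": 3.0, "supportsImages": False, "contextWindow": 128000},
]

def route(plan_allowed, needs_images=False, min_context=0):
    """Pick the cheapest allowed model meeting the task requirements."""
    candidates = [
        m for m in MODELS
        if m["name"] in plan_allowed
        and (m["supportsImages"] or not needs_images)
        and m["contextWindow"] >= min_context
    ]
    if not candidates:
        return None  # caller can surface an upgrade / plan-request prompt
    return min(candidates, key=lambda m: m["inputCost"])["name"]
```

A None result is itself useful product signal: it is exactly the point where the POST /api/plan-requests flow can be offered to the user.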

Security considerations

  • Rate-limit chat, precheck, and image requests.
  • Validate user entitlement before selecting premium models.
  • Store provider keys server-side only.
  • Log enough for debugging while avoiding sensitive prompt exposure in admin views.
  • Apply safety/category filters before model calls.
  • Add abuse detection for repeated high-cost or policy-breaking prompts.
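The rate-limiting item above can start as a fixed-window counter per user, kept in process memory. A sketch; the limit, window size, and class name are illustrative, and a real deployment would evict old windows or back this with a shared store:

```python
import time
from collections import defaultdict
from typing import Optional

class FixedWindowLimiter:
    """Per-user fixed-window request counter (in-memory sketch)."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._counts = defaultdict(int)  # (user_id, window) -> count

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        """Return True and count the request if the user is under the limit."""
        now = time.monotonic() if now is None else now
        window = int(now // self.window_s)
        key = (user_id, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True
```

Separate limiter instances (or separate limits) for chat, precheck, and image requests match the bullet above, since their costs differ by orders of magnitude.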

Scaling path

  • Start with one provider adapter and rule-based policy.
  • Add multiple providers and fallback chains.
  • Add per-plan concurrency, budget caps, and model health checks.
  • Add usage aggregation and admin analytics.
  • Add learned routing from historical quality/cost outcomes.

Observability

  • Metrics for model latency, error rate, fallback rate, token volume, and raw cost.
  • Trace each chat request through precheck, route decision, provider call, and usage record.
  • Admin dashboards for provider health and expensive users.
  • Audit logs for admin model activation and plan overrides.
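Tracing each chat request through precheck, route decision, provider call, and usage record can begin as a per-request event list long before a full tracing stack exists. A minimal sketch; the class and stage names are illustrative:

```python
import time

class RequestTrace:
    """Accumulate timestamped stage events for one chat request."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.events = []

    def record(self, stage: str, **details):
        # Monotonic timestamps make per-stage latency easy to derive later.
        self.events.append({"stage": stage, "t": time.monotonic(), **details})

    def stages(self):
        return [e["stage"] for e in self.events]
```

Emitting the finished trace as one structured log line per request is usually enough to answer "why did this prompt get this model at this cost" during support investigations.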

Future features

  • Learned routing
  • Provider health scoring
  • Per-assistant memory and retrieval
  • Model quality feedback loop
  • Budget simulation before model calls
  • Admin model marketplace