LLM Integration

This is the default adapter guide for AI products.

Use it when the billable work happens at an LLM request boundary rather than at a generic HTTP handler or worker entrypoint.

Why this is different

LLM integration is mostly about choosing the right boundary:

  • authorize must happen before expensive model work starts
  • commit must use actual post-call usage
  • failed or interrupted work must trigger best-effort cancel
  • retries must not create duplicate accounting

This is why the best integration point is usually the model boundary itself, not the outer request handler.

Anchor order

Prefer these anchors in order:

  1. direct model SDK request boundary
  2. framework callback or span hook around model execution
  3. runner or plugin boundary that owns model and tool lifecycle
  4. API handler, worker, or queue consumer only as fallback

Examples:

  • direct SDK wrapper: wrap the provider SDK methods directly
  • framework injection: inject a gated client or model provider into the framework
  • callback or plugin: use before_model / after_model style hooks
  • shared helper: route all model calls through one helper if the framework offers no central callback

Core lifecycle

For one LLM operation:

  1. build request context
  2. derive feature_code and optional feature_family_code
  3. authorize
  4. execute the model call
  5. extract actual usage
  6. commit
  7. if execution fails before commit, best-effort cancel
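The steps above can be sketched as one function. This is a minimal illustration, not a real SDK: the `Gate` recorder and the `model_call` callable are hypothetical stand-ins for your entitlement client and LLM provider.

```python
# Hedged sketch of steps 1-7 for a single LLM operation.
# `Gate` here just records calls so the flow is visible; a real
# gate client would talk to your entitlement/billing service.

class Gate:
    def __init__(self):
        self.calls = []

    def authorize(self, **ctx):
        # Step 3: authorize BEFORE any expensive model work starts.
        self.calls.append(("authorize", ctx))
        return {"grant_id": "g-1", **ctx}

    def commit(self, grant, quantity_minor):
        # Step 6: commit actual post-call usage, never an estimate.
        self.calls.append(("commit", quantity_minor))

    def cancel(self, grant):
        # Step 7: best-effort cancel for interrupted work.
        self.calls.append(("cancel", grant["grant_id"]))


def run_llm_operation(gate, model_call, ctx, prompt):
    # Steps 1-2 happened upstream: ctx carries request_id and feature_code.
    grant = gate.authorize(
        request_id=ctx["request_id"],
        feature_code=ctx["feature_code"],
    )
    try:
        response = model_call(prompt)           # step 4: execute
        usage = response["usage"]               # step 5: actual usage
        gate.commit(grant, quantity_minor=usage["total_tokens"])
        return response
    except Exception:
        try:
            gate.cancel(grant)                  # step 7: best-effort
        except Exception:
            pass                                # cancel must not mask the error
        raise
```

Note that cancel failures are swallowed: cancel is best-effort by design, and the original model error is the one worth surfacing.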

Request context

Each LLM call should carry:

  • request_id
  • principal_id or billing_account_id
  • enough metadata to derive feature_code
  • optional feature_family_code
  • optional budget_id

Do not rely on broad globals if the framework already gives you per-turn or per-run state.
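One way to make that explicit is a small per-call context object whose fields mirror the list above. The class and the feature_code derivation are illustrative assumptions; adapt both to your own catalog.

```python
# Sketch of an explicit per-call context, passed down instead of
# read from globals. Field names mirror the request-context list.

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LlmRequestContext:
    request_id: str
    principal_id: str                       # or billing_account_id
    model: str                              # metadata used to derive feature_code
    feature_family_code: Optional[str] = None
    budget_id: Optional[str] = None

    def feature_code(self) -> str:
        # Hypothetical derivation: one feature per model family.
        # Replace with whatever mapping your catalog defines.
        return f"llm.{self.model.split('-')[0]}"
```

Because the dataclass is frozen, a context cannot be mutated mid-turn, which keeps per-run state honest when the framework hands you fresh state for each turn.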

quantity_minor and meters[]

For LLM workloads:

  • top-level quantity_minor is the feature-level usage quantity and should still be sent even when you also send meters[]
  • if meters[] is omitted, top-level quantity_minor also applies to the same-named primary meter
  • if meters[] is present, meter-level pricing and settlement follow meters[]

Practical default:

  • use total tokens as top-level quantity_minor
  • send meters[] for input, output, and cached tokens when the model pricing differentiates them

In modern LLM pricing, separate meters[] is usually the right choice.
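The practical default above can be sketched as a small payload builder. The meter codes (`input_tokens`, `output_tokens`, `cached_tokens`) are assumptions; use whatever names your catalog actually defines.

```python
# Builds commit usage following the practical default: total tokens as
# the top-level quantity_minor, plus per-meter breakdown for pricing
# that differentiates token kinds. Meter codes are hypothetical.

def build_commit_usage(usage: dict) -> dict:
    input_t = usage.get("input_tokens", 0)
    output_t = usage.get("output_tokens", 0)
    cached_t = usage.get("cached_tokens", 0)
    return {
        # Feature-level quantity: still sent even alongside meters[].
        "quantity_minor": input_t + output_t + cached_t,
        # Meter-level pricing and settlement follow these entries.
        "meters": [
            {"meter_code": "input_tokens", "quantity_minor": input_t},
            {"meter_code": "output_tokens", "quantity_minor": output_t},
            {"meter_code": "cached_tokens", "quantity_minor": cached_t},
        ],
    }
```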

Streaming

Streaming changes the commit point:

  • authorize before opening the stream
  • do not commit on partial chunks
  • commit only once a final usage-bearing response or terminal event is available
  • if the stream ends before final usage is available, cancel best-effort instead of guessing
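A streaming consumer under those rules might look like the sketch below. The event shape (`delta` chunks, a terminal `done` event carrying usage) is an assumption; map it onto whatever your provider's stream actually emits.

```python
# Streaming sketch: authorize happened before the stream was opened,
# so `grant` is already held. Commit only on a terminal usage-bearing
# event; if the stream ends without one, cancel rather than guess.

def consume_stream(gate, grant, events):
    final_usage = None
    chunks = []
    for event in events:
        if event.get("type") == "delta":
            chunks.append(event["text"])      # never commit on partial chunks
        elif event.get("type") == "done":
            final_usage = event.get("usage")  # terminal usage-bearing event
    if final_usage is None:
        gate.cancel(grant)                    # stream ended early: don't guess
        return None
    gate.commit(grant, quantity_minor=final_usage["total_tokens"])
    return "".join(chunks)
```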

Tool calls

Model calls and tool calls are often different billable resources.

When tools are billable:

  • gate tool execution separately from model execution
  • use separate feature_code values
  • use separate feature_family_code values only if that distinction matters for entitlement, packaging, reporting, or limits
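Keeping model and tool gating apart can be as simple as a lookup from operation kind to feature_code, so each billable resource gets its own authorize -> commit cycle. The codes below are hypothetical examples, not a prescribed naming scheme.

```python
# Illustrative mapping from operation kind to feature_code. Model calls
# and billable tool calls resolve to distinct codes so they are gated
# and accounted separately. All codes here are made-up examples.

FEATURE_CODES = {
    "model": "llm.chat",
    "tool.web_search": "tool.web_search",
    "tool.code_exec": "tool.code_exec",
}


def feature_code_for(kind: str) -> str:
    try:
        return FEATURE_CODES[kind]
    except KeyError:
        # Fail loudly: an unmapped kind means unmetered billable work.
        raise ValueError(f"no feature_code for operation kind: {kind}")
```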

Multi-turn sessions

Sessions are not the billing unit. Individual model and tool operations are.

Good default:

  • one authorize -> commit/cancel cycle per model call
  • one authorize -> commit/cancel cycle per billable tool call
  • fresh gate context per turn or per operation

Retries

LLM frameworks often retry under the hood.

Rules:

  • each logical operation needs one stable idempotency base
  • derive authorize, commit, and cancel keys from that base
  • reuse the same keys on retry of the same logical operation
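One way to follow those rules is to derive all three keys deterministically from the single idempotency base, so a retry of the same logical operation reproduces identical keys. The hashing scheme below is an assumption; any stable derivation works.

```python
# Derives stable authorize/commit/cancel keys from one idempotency
# base per logical operation. Same base -> same keys, so framework
# retries never create duplicate accounting.

import hashlib


def derive_keys(idempotency_base: str) -> dict:
    def key(suffix: str) -> str:
        digest = hashlib.sha256(f"{idempotency_base}:{suffix}".encode())
        return digest.hexdigest()[:32]

    return {
        "authorize_key": key("authorize"),
        "commit_key": key("commit"),
        "cancel_key": key("cancel"),
    }
```

The base itself should identify the logical operation (for example, a per-call UUID minted once, before any retry loop), not the physical attempt.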

Next

Choose the framework page that matches your runtime: