
4 May 2026

How to Handle Too Many Requests Errors in AI APIs

Learn why AI model APIs return HTTP 429 errors, how to retry safely, reduce agent overload, and use routing to improve reliability.

A familiar failure shows up when an agent is halfway through real work. It has already planned, called tools, fetched context, and started the next model step. Then the run stops with HTTP 429 Too Many Requests.

For AI systems, that error rarely means one simple thing. It can come from the API gateway in front of the model, the model provider behind the gateway, the account’s own quota or balance rules, or upstream capacity pressure during busy periods. In agent workflows, one blocked call can also stall a whole chain of tool use, retries, and follow-up prompts.

The practical goal isn't to “fix 429” once. The goal is to build clients and workflows that treat rate limits as a normal operating condition. Teams building with AI agent frameworks need retry discipline, concurrency control, queueing, and sensible routing if they want coding agents and long-context jobs to stay reliable under load.


What HTTP 429 Means for Your AI Agents

HTTP 429 means the server is telling the client to slow down. In plain English, the application has sent more traffic than that service is willing or able to process right now.

For AI APIs, that matters more than it does in a simple CRUD app. An agent often makes multiple dependent calls. One model call drafts a plan, another selects a tool, another summarises output, and another turns the result into code or an action. If one of those steps gets a 429, the whole workflow may pause or fail.

A 429 also isn't a precise diagnosis by itself. It doesn't automatically mean “bad client”. It can signal:

  • Bursting too fast: the client sent too many requests in a short window
  • Too much parallel work: several agents or workers shared the same API key
  • Heavy requests: long prompts or long outputs consumed more capacity than expected
  • Provider pressure: the chosen model was temporarily saturated
  • Quota trouble: the account had balance, quota, or policy limits

Practical rule: Treat 429 as a flow-control signal, not as a surprise exception.

That framing changes the design. Instead of retrying immediately and hoping for the best, the client needs to respond with pacing, backoff, and workload shaping. For agentic systems, that usually means adding logic above the raw API call. The retry policy needs context about job priority, token budgets, and whether the request can switch to another model or wait in a queue.

The most useful mindset is simple. A resilient AI client assumes that too many requests errors will happen, especially during batch runs, coding sessions with tool loops, or long-context workloads. The system should slow itself down before the provider has to do it repeatedly.

Common Causes of 429 Errors in AI Workflows


A 429 during AI inference usually comes from one of a few repeat offenders. The challenge is that these look similar in logs unless the client records enough detail around each request.

General API guidance often frames HTTP 429 as a simple rate-limit response, but AI inference makes the diagnosis more complicated. A single agent can combine long prompts, tool calls, retries, and parallel workers behind one visible user action. That means a small number of user-facing tasks can still create a large amount of upstream pressure.

One status code, several bottlenecks

The first cause is request rate. A coding tool or agent runner may fire off bursts of small calls while planning, reading files, generating patches, and checking outputs. Even if each request is cheap, the burst can still trip a per-minute or per-second limit.

The second is parallelism. A single agent might behave well, but ten workers sharing one account can overwhelm the same endpoint. This is common when a team adds background jobs, eval runs, and production traffic on the same key.

Long-context traffic adds another problem. AI providers don't only care about request count. They also care about resource consumption. Large prompts, retrieval-heavy context windows, and high max_tokens settings can consume enough compute to trigger throttling sooner than expected.

Another common source is provider-side capacity limits. One model may be crowded. The gateway returns too many requests even when the application isn't misbehaving by its own standards. This is especially visible on popular reasoning or coding models during peak periods.

Then there are account-level issues. If the account has exhausted quota, hit spend controls, or run into a platform-specific allowance, the client may still receive a 429-style response. Engineers often misclassify these as transient overload and keep retrying, which wastes time and money.

A practical diagnosis checklist

When too many requests errors start appearing, these checks usually separate the causes quickly:

  • Check request shape: Log model name, prompt size, expected output size, and whether tools were enabled. Big requests often correlate with throttling even when raw request count looks modest.
  • Check concurrency: Record how many jobs were in flight when the error occurred. Shared workers, cron tasks, and eval pipelines often create hidden contention.
  • Check account state: Inspect balance, quota, and any per-key restrictions before adding more retries.
  • Check whether retries piled up: A naive loop can turn a short spike into a prolonged outage.
  • Check whether one model is the hotspot: If failures cluster around one endpoint, capacity is likely model-specific.

When a 429 appears, the fastest fix isn't guessing. It's identifying whether the bottleneck is local behaviour, account policy, or upstream model capacity.

Teams building agents should also remember that the visible failing call may not be the original problem. A planner call can succeed, then tool invocations stack up, retries begin, and the final model call surfaces the first obvious 429. Without per-step logging, the wrong component gets blamed.
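One way to get that per-step view is to record a small structured entry for every model call an agent makes. A minimal sketch, with illustrative field names rather than a fixed schema:

type ModelCallLog = {
  runId: string;            // identifies the whole agent run
  step: string;             // e.g. "plan", "tool-call", "summarise"
  model: string;
  promptTokensEstimate: number;
  maxTokens: number;
  inFlightRequests: number; // concurrency at the moment the call started
  status: number;           // HTTP status returned by the API
  retryAttempt: number;
  retryAfterMs: number | null;
  durationMs: number;
};

function logModelCall(entry: ModelCallLog): void {
  // A real system would ship this to a metrics or logging backend;
  // console output is enough to show the shape.
  console.log(JSON.stringify(entry));
}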

Safe Client-Side Retries


The worst response to too many requests is an immediate retry loop. That just sends fresh pressure into an already throttled path.

A better client does three things. It checks whether the server told it how long to wait, it backs off progressively when no exact wait is given, and it adds a bit of randomness so multiple workers don't all retry at the same instant.

The Gravitee guide to rate limiting strategies notes that the HTTP Retry-After header allows a server to specify the exact number of seconds a client should wait before retrying, which gives clients a standard way to throttle themselves without guesswork. That same guidance is why retry code should prefer server instructions over client guesses.

For teams wiring OpenAI-compatible clients, this logic belongs in a small shared wrapper. That keeps every model call consistent and avoids one part of the codebase using safe retries while another still hammers the API. Reference implementations and endpoint details should live close to platform docs, not scattered through application code. A central wrapper works best alongside an internal integration guide or a provider reference such as the Select.ax API docs.

What good retry logic actually does

Safe retry logic should follow a few rules:

  1. Retry only when the failure is retryable. A 429 often is. Invalid auth or malformed input isn't.
  2. Respect Retry-After first. If the server says wait, wait.
  3. Use exponential backoff with jitter. Delay should grow on each attempt.
  4. Cap retries. Infinite retry loops are operational debt.
  5. Protect non-idempotent operations. If a tool call writes state, the workflow needs idempotency protection before automatic retry (a minimal sketch follows this list).
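Rule 5 deserves its own illustration. A minimal sketch of idempotency protection for a state-changing tool call; the in-memory store and the opKey scheme are assumptions for the example, and production systems normally persist this state outside the process:

const completedToolCalls = new Map<string, unknown>();

async function runToolOnce(
  opKey: string,                     // stable key, e.g. run id + step + tool name
  runTool: () => Promise<unknown>
): Promise<unknown> {
  const previous = completedToolCalls.get(opKey);
  if (previous !== undefined) {
    // A prior attempt already committed this operation, so reuse its result
    // instead of re-executing a write during a retry.
    return previous;
  }
  const result = await runTool();
  completedToolCalls.set(opKey, result);
  return result;
}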

A compact decision table helps:

Condition | Response
429 with Retry-After | Wait for that duration, then retry
429 without header | Use exponential backoff with jitter
Repeated 429 on same model | Reduce concurrency or route elsewhere
Account quota or balance issue | Stop retrying and surface action to operator

Operational note: Retry logic should reduce traffic pressure over time. If it doesn't, it isn't retry logic. It's a request amplifier.

TypeScript example with exponential backoff

type ChatPayload = {
  model: string;
  messages: Array<{ role: string; content: string }>;
  max_tokens?: number;
  stream?: boolean;
};

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry-After may contain a number of seconds or an HTTP-date.
// Only the seconds form is handled here; any other value returns null
// so the caller falls back to exponential backoff.
function parseRetryAfter(headerValue: string | null): number | null {
  if (!headerValue) return null;
  const seconds = Number(headerValue);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  return null;
}

function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const exp = Math.min(baseMs * 2 ** attempt, maxMs);
  const jitter = Math.floor(Math.random() * 250);
  return exp + jitter;
}

export async function callWithRetry(
  url: string,
  apiKey: string,
  payload: ChatPayload,
  maxRetries = 5
) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": `Bearer ${apiKey}`
      },
      body: JSON.stringify(payload)
    });

    if (res.ok) {
      return res.json();
    }

    if (res.status !== 429) {
      const text = await res.text();
      throw new Error(`Request failed with ${res.status}: ${text}`);
    }

    if (attempt === maxRetries) {
      throw new Error("Max retries reached after HTTP 429 responses");
    }

    const retryAfterMs = parseRetryAfter(res.headers.get("Retry-After"));
    const delay = retryAfterMs ?? backoffDelay(attempt);

    await sleep(delay);
  }
}

This pattern is deliberately boring. That's a good thing. Most retry bugs come from trying to be clever.
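A short usage sketch for the wrapper above; the endpoint URL, environment variable, and model name are placeholders rather than values for any specific provider:

async function runExample() {
  const result = await callWithRetry(
    "https://api.example-provider.invalid/v1/chat/completions", // placeholder URL
    process.env.API_KEY ?? "",                                   // placeholder key source
    {
      model: "example-model",
      messages: [{ role: "user", content: "Summarise this changelog." }],
      max_tokens: 400
    }
  );
  console.log(result);
}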

Python example with safe 429 handling

import random
import time
import requests

def parse_retry_after(headers):
    """Return the Retry-After value in seconds, or None when it is missing
    or not a plain number (for example an HTTP-date)."""
    value = headers.get("Retry-After")
    if not value:
        return None
    try:
        return int(value)
    except ValueError:
        return None

def backoff_delay(attempt, base=1, cap=30):
    delay = min(base * (2 ** attempt), cap)
    jitter = random.uniform(0, 0.25)
    return delay + jitter

def call_with_retry(url, api_key, payload, max_retries=5):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

    for attempt in range(max_retries + 1):
        response = requests.post(url, headers=headers, json=payload, timeout=60)

        if response.ok:
            return response.json()

        if response.status_code != 429:
            raise RuntimeError(f"Request failed with {response.status_code}: {response.text}")

        if attempt == max_retries:
            raise RuntimeError("Max retries reached after repeated 429 responses")

        retry_after = parse_retry_after(response.headers)
        wait_seconds = retry_after if retry_after is not None else backoff_delay(attempt)
        time.sleep(wait_seconds)

In agent systems, this wrapper should sit below planner logic and tool orchestration. The planner shouldn't decide sleep durations itself. It should receive either a result or a clear failure after controlled retries.

Reducing Pressure with Request and Concurrency Management

Retries are necessary, but prevention is cheaper. Most avoidable 429s come from sending more work than the endpoint can comfortably absorb.

For AI workloads, “more work” doesn't just mean more requests. It also means larger prompts, larger outputs, too many simultaneous tool calls, and workflows that fan out before they have to.

Reduce the cost of each request

The simplest win is to cap output size. If an agent asks for huge completions by default, it burns capacity and keeps connections open longer. Set max_tokens to a realistic ceiling for that step instead of using one global high limit.

Streaming also helps when supported. Streaming doesn't remove rate limits, but it can improve behaviour for long outputs because the client starts consuming the response as it arrives rather than waiting on one long blocking completion. For coding agents, that's often enough to keep the workflow responsive and to cancel early when the output is clearly off track.

Large jobs should also be split. A document pipeline that pushes one giant prompt for classification, extraction, and summarisation is harder to recover than a staged flow with smaller units of work.

Examples that usually work better (a small request-shaping sketch follows the list):

  • Chunk documents: Summarise sections first, then combine summaries.
  • Separate planning from execution: Let one small call decide what to do before sending a heavier reasoning request.
  • Trim tool context: Don't send every prior tool result back into every next call.
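Those habits are easier to keep when each workflow step has its own request shape instead of one global default. A minimal sketch reusing the ChatPayload type from the earlier retry example; the step names and token ceilings are illustrative, not recommendations:

const stepProfiles = {
  plan:      { max_tokens: 300,  stream: false },
  toolCall:  { max_tokens: 500,  stream: false },
  summarise: { max_tokens: 800,  stream: true },
  codegen:   { max_tokens: 1500, stream: true },
} as const;

type StepName = keyof typeof stepProfiles;

function buildPayload(step: StepName, model: string, content: string): ChatPayload {
  // Each step gets its own output ceiling and streaming preference,
  // so a cheap planning call never inherits a generous codegen budget.
  const profile = stepProfiles[step];
  return {
    model,
    messages: [{ role: "user", content }],
    max_tokens: profile.max_tokens,
    stream: profile.stream,
  };
}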

Control concurrency before the provider does

Parallel agents are often the primary problem. A local benchmark may look fine with one worker, then fail in production when background sync jobs, user traffic, and eval runs all hit the same key.

Throttling can cascade when one bottleneck service slows a larger workflow. In agent systems, that bottleneck may be a model endpoint, a retrieval step, a tool API, or a shared account limit. Splitting large workflows into smaller stages often gives operators more control than simply raising a global limit.

A good control pattern is:

  • Set per-model concurrency caps: Reasoning-heavy models and long-context endpoints usually need stricter limits than lightweight classification steps.
  • Use separate pools for batch and interactive work: A nightly backlog shouldn't block live user requests.
  • Fail fast on queue overflow: If a queue is already too deep, reject low-priority work instead of letting latency spiral.
  • Apply backpressure at the worker level: Don't let every worker independently decide to send more requests.

A single saturated step can drag the whole workflow down further than a design that looks slower on paper but stays stable because its stages are smaller and controlled.

Many teams over-focus on provider limits and under-focus on application shape. If the client creates sudden fan-out and each branch retries on its own, too many requests is only the symptom. The core issue is missing admission control inside the application.
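A small in-process limiter is often enough to add that admission control. A minimal sketch of a per-model concurrency cap built on a counting semaphore; the model names and limits are illustrative assumptions:

// A simple counting semaphore used as a per-model concurrency cap.
// Callers wait for a slot before sending a request, which applies
// backpressure inside the application instead of at the provider.
class Semaphore {
  private queue: Array<() => void> = [];
  private available: number;

  constructor(limit: number) {
    this.available = limit;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the freed slot directly to the next waiter
    } else {
      this.available++;
    }
  }
}

// Illustrative caps: stricter for heavy reasoning models, looser for cheap steps.
const modelLimits: Record<string, Semaphore> = {
  "heavy-reasoning-model": new Semaphore(2),
  "light-classifier-model": new Semaphore(10),
};

async function withModelSlot<T>(model: string, fn: () => Promise<T>): Promise<T> {
  const sem = modelLimits[model];
  if (!sem) return fn();
  await sem.acquire();
  try {
    return await fn();
  } finally {
    sem.release();
  }
}

Wrapping every model call in withModelSlot keeps fan-out bounded even when many agent branches run at once.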

Architectural Solutions: Model Routing and Job Queuing


Sometimes the client behaves correctly and still gets throttled. The endpoint is busy, a model is under pressure, or an upstream provider has tightened capacity. At that point, reliability becomes an architectural problem, not a retry problem.

High-volume agentic inference needs routers, fallbacks, and throttling-aware design. A single busy model, shared account, or provider path can become the limiting step even when the rest of the application is healthy.

When one model is the bottleneck

If an application depends on one exact model for every task, saturation of that model becomes a single point of failure. That might be acceptable for tightly controlled outputs, but it's fragile for broad agent workloads.

There are two healthy patterns:

Pattern | Best use
Direct model selection | When the team needs stable behaviour, reproducibility, or a known output profile
Smart routing | When availability and throughput matter more than pinning every request to one model

Direct selection matters. Some coding flows, evals, and regulated workflows need predictable behaviour and don't want automatic switching.

Smart routing matters too. If a provider or model is saturated, the router can send the request to another suitable option instead of turning a transient bottleneck into a user-facing failure. A concrete implementation example helps make this pattern easier to reason about, and this integration example shows the kind of drop-in routing shape developers usually want from an OpenAI-compatible stack.
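A minimal sketch of that fallback shape, reusing the callWithRetry wrapper from earlier; the candidate model names are placeholders, and falling back only on throttling is an assumption for the example rather than a rule for every workload:

async function callWithFallback(
  url: string,
  apiKey: string,
  basePayload: Omit<ChatPayload, "model">,
  candidateModels: string[]
) {
  let lastError: unknown;
  for (const model of candidateModels) {
    try {
      // Use a small retry budget per model so a saturated route fails over quickly.
      return await callWithRetry(url, apiKey, { ...basePayload, model }, 2);
    } catch (error) {
      lastError = error;
      // Keep falling back only while the failure looks like throttling.
      // A real implementation would attach the status code to the error
      // instead of matching on the message text.
      if (!(error instanceof Error) || !error.message.includes("429")) {
        throw error;
      }
    }
  }
  throw lastError;
}

Calling it with an ordered list such as ["preferred-model", "fallback-model"] keeps interactive work moving while the preferred route recovers.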

Queueing turns spikes into manageable flow

Queueing is the companion to routing. Routing helps when another model can take the work. Queueing helps when the work should wait.

This matters most for:

  • Batch enrichment jobs
  • Large document pipelines
  • Background code review tasks
  • Non-urgent eval runs
  • Agent tasks triggered by webhooks that don't need instant completion

Instead of letting every caller hit the model immediately, the system admits work into a queue and processes it at a controlled rate. Priority lanes can separate live user requests from offline jobs.

A practical queue design often includes:

  1. Priority classes for interactive versus batch work.
  2. Per-model worker pools so one saturated endpoint doesn't freeze all traffic.
  3. Retry metadata stored with the job, not hidden inside one stateless function call.
  4. Circuit breakers that temporarily stop sending work to an unhealthy route.

Queueing doesn't make capacity infinite. It makes load predictable.

This also improves debugging. When requests move through an explicit queue and router, operators can see whether failures come from the client, the scheduler, or a particular model route. Without that layer, too many requests errors often look random even when the pattern is consistent.
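A minimal sketch of the admission side of that design, with two priority lanes drained by one dispatcher; the lane names, the overflow threshold, and the idle wait are illustrative assumptions:

type Job = () => Promise<void>;

const interactiveLane: Job[] = [];
const batchLane: Job[] = [];
const BATCH_LANE_LIMIT = 100; // illustrative overflow threshold

function submitInteractive(job: Job): void {
  interactiveLane.push(job);
}

function submitBatch(job: Job): boolean {
  if (batchLane.length >= BATCH_LANE_LIMIT) {
    return false; // fail fast instead of letting queue latency spiral
  }
  batchLane.push(job);
  return true;
}

async function dispatchLoop(): Promise<void> {
  // Interactive work is always served before batch work. Real systems
  // usually widen this into per-model worker pools with circuit breakers.
  while (true) {
    const job = interactiveLane.shift() ?? batchLane.shift();
    if (job) {
      await job();
    } else {
      await new Promise((resolve) => setTimeout(resolve, 50)); // idle wait
    }
  }
}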

Gaining Control Through Visibility and Smart Routing

Rate limiting gets easier to manage once the system exposes the right signals. Teams should monitor requests, tokens, cost, errors, and last-used timestamps per model, per agent, and where possible per workflow step. Without that view, 429 handling stays reactive.

That visibility matters for spend as much as reliability. Too many requests errors often reveal waste: duplicate requests, oversized prompts, retry loops, and badly paced agent swarms. Instrumenting those failures can show where the system needs better pacing, smaller prompts, or a different route.
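A tiny aggregation over the per-call records sketched earlier is often enough to start: count throttled calls per model so the busiest route stands out. The ModelCallLog shape is the illustrative one from the logging example above:

function count429sByModel(entries: ModelCallLog[]): Map<string, number> {
  const byModel = new Map<string, number>();
  for (const entry of entries) {
    if (entry.status === 429) {
      byModel.set(entry.model, (byModel.get(entry.model) ?? 0) + 1);
    }
  }
  return byModel;
}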

Useful dashboards usually answer five questions fast:

  • Which model is being throttled most often
  • Which agents create the largest bursts
  • Which routes consume the most tokens
  • Whether retries are helping or multiplying load
  • When a previously quiet workflow starts behaving differently

For teams that want those controls behind one endpoint, Select.ax provides an OpenAI-compatible API with direct model selection, Smart Select routing, visible usage, and pay-as-you-go pricing. It will not eliminate 429s, because no provider or router can promise that. It does make them easier to understand and easier to route around when suitable alternatives are available.