
Rate Limit Roulette: Building a Resilient AI Proxy Inside Remix Actions
How to handle flaky AI APIs and aggressive rate limits without crashing your Remix loaders, featuring custom queuing and fallback strategies.
We’ve all been there. You’re building the next great AI-powered feature in Remix. Your local environment is humming along beautifully, the prompts are crisp, and the UI feels like magic. Then you deploy to staging, three people use it at once, and suddenly your logs are a sea of red.
429: Too Many Requests.
It's the developer's version of a "Keep Out" sign. Integrating LLMs like GPT-4 or Claude directly into your Remix action functions seems straightforward until you realize that these APIs are significantly more temperamental than your standard REST endpoint. They’re slow, they’re expensive, and their rate limits are often shockingly low for new accounts.
Today, I want to share how I stopped playing "Rate Limit Roulette" and built a resilient proxy layer directly inside my Remix actions. We're going to talk about exponential backoff, fallback models, and a "poor man's queue" that keeps your app from falling over when OpenAI decides it’s had enough of your traffic.
Why Remix Actions?
In the Remix world, actions are our bread and butter for mutations. When a user submits a form to generate a summary or chat with a PDF, that request hits an action.
The temptation is to just await openai.chat.completions.create(...) and call it a day. But if that call takes 25 seconds or hits a rate limit, your user is staring at a hung UI, or worse, an error page that provides zero context.
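Here's what that naive version typically looks like (a sketch: the route name and form field are placeholders, and I'm assuming the official openai SDK):
// app/routes/summarize.tsx -- the "just await it" version
import { json, type ActionFunctionArgs } from '@remix-run/node'
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

export const action = async ({ request }: ActionFunctionArgs) => {
  const formData = await request.formData()
  const prompt = formData.get('prompt') as string
  // One call, no retries, no fallback. A 429 or a 25-second response
  // goes straight to the user.
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
  })
  return json({ summary: completion.choices[0].message.content })
}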
By building proxy logic _inside_ our server-side code, we can intercept these failures, retry them intelligently, or switch to a cheaper, faster model before the user even knows something went wrong.
Strategy 1: The "Graceful Failover" Loop
Let's start with the reality of AI providers: sometimes they just go down. Or maybe you've burned through your Tier 1 rate limit on GPT-4o.
The first thing I implement in any serious AI project is a prioritized list of models. If the "Premium" model fails, we drop down to the "Fast" model. It’s better to give the user a slightly less "intelligent" response than no response at all.
// app/utils/ai-proxy.server.ts
type AIConfig = {
  model: string
  provider: 'openai' | 'anthropic'
}

const MODEL_PRIORITY: AIConfig[] = [
  { provider: 'openai', model: 'gpt-4o' },
  { provider: 'anthropic', model: 'claude-3-5-sonnet-20240620' },
  { provider: 'openai', model: 'gpt-4o-mini' }, // The reliable fallback
]

export async function generateResilientCompletion(prompt: string) {
  let lastError
  for (const config of MODEL_PRIORITY) {
    try {
      // Logic to call OpenAI or Anthropic based on config.provider
      const result = await callProvider(config, prompt)
      return result
    } catch (error: any) {
      lastError = error
      // If it's a 429 or 500, we try the next model
      if (error.status === 429 || error.status >= 500) {
        console.warn(`Fallback: ${config.model} failed, trying next...`)
        continue
      }
      // If it's a validation error (400), don't bother retrying
      throw error
    }
  }
  throw new Error('All AI providers exhausted: ' + lastError?.message)
}
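The callProvider helper is deliberately glossed over above. For completeness, here's a minimal sketch of one way to write it with the official openai and @anthropic-ai/sdk clients (the client setup and the max_tokens value are my assumptions, not gospel):
// app/utils/ai-proxy.server.ts (continued) -- a sketch of callProvider
import OpenAI from 'openai'
import Anthropic from '@anthropic-ai/sdk'

const openai = new OpenAI() // picks up OPENAI_API_KEY from the environment
const anthropic = new Anthropic() // picks up ANTHROPIC_API_KEY

async function callProvider(config: AIConfig, prompt: string): Promise<string> {
  if (config.provider === 'openai') {
    const completion = await openai.chat.completions.create({
      model: config.model,
      messages: [{ role: 'user', content: prompt }],
    })
    return completion.choices[0].message.content ?? ''
  }
  // Anthropic's Messages API requires max_tokens up front
  const message = await anthropic.messages.create({
    model: config.model,
    max_tokens: 1024,
    messages: [{ role: 'user', content: prompt }],
  })
  const block = message.content[0]
  return block.type === 'text' ? block.text : ''
}
Both SDKs attach a status property to their HTTP errors, which is exactly what the fallback loop checks for.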
Strategy 2: Smart Retries with Exponential Backoff
A 429 doesn't always mean "Stop forever." Usually, it means "Stop for 500ms."
If you just wrap your call in a while loop, you'll likely hit the rate limit again immediately, digging yourself a deeper hole. I like to use a simple exponential backoff. It’s a fancy way of saying "wait a little bit, then wait a little bit more, then give up."
<Callout> Pro Tip: Don't write your own backoff logic if you can avoid it. Packages like p-retry are fantastic, but if you're in a restricted environment (like some Edge functions), a simple helper function does the trick. </Callout>
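For reference, here's roughly what the p-retry route looks like (a sketch, assuming p-retry v5+ and its AbortError escape hatch; retryOn429 is just a name I made up):
import pRetry, { AbortError } from 'p-retry'

// Retry only on 429s; bail out immediately on anything else.
function retryOn429<T>(fn: () => Promise<T>) {
  return pRetry(
    async () => {
      try {
        return await fn()
      } catch (error: any) {
        // Throwing AbortError tells p-retry to stop retrying and rethrow
        if (error.status !== 429) throw new AbortError(error.message ?? 'Non-retryable error')
        throw error
      }
    },
    { retries: 3, minTimeout: 1000, factor: 2 } // waits roughly 1s, 2s, 4s
  )
}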
Here is how I usually structure the hand-rolled retry logic to keep it readable:
async function fetchWithRetry(
  fn: () => Promise<any>,
  retries = 3,
  delay = 1000
) {
  try {
    return await fn()
  } catch (error: any) {
    if (retries <= 0 || error.status !== 429) throw error
    // Wait for the specified delay, then try again with double the delay
    await new Promise((resolve) => setTimeout(resolve, delay))
    return fetchWithRetry(fn, retries - 1, delay * 2)
  }
}
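To put it to work, wrap the provider call from Strategy 1 so each model gets a few retries before the loop falls through to the next one:
// Inside generateResilientCompletion's try block, instead of calling
// callProvider directly:
const result = await fetchWithRetry(() => callProvider(config, prompt))
return result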
Strategy 3: The Action-Level Queue
Here’s where things get interesting. Remix runs on the server. If you have 50 users hitting an action at once, you have 50 concurrent requests hitting your AI provider.
If your limit is 5 requests per minute (common on new accounts), 45 of those users are getting errors.
While a global Redis-backed queue is the "correct" enterprise solution, you can actually do quite a bit with a simple in-memory semaphore if you're running on a single server instance (like a Fly.io or DigitalOcean droplet).
// app/utils/limiter.server.ts
import pLimit from 'p-limit'
import { generateResilientCompletion } from './ai-proxy.server'

// Limit to 3 concurrent AI requests across the whole server instance
const limit = pLimit(3)

export async function queuedAICall(prompt: string) {
  return limit(() => generateResilientCompletion(prompt))
}
Now, in your Remix action, you use queuedAICall. If the "slots" are full, the request will wait (in a pending state) until a slot opens up.
// app/routes/generate.tsx
import { json, type ActionFunctionArgs } from '@remix-run/node'
import { queuedAICall } from '~/utils/limiter.server' // "~" is the default app/ alias

export const action = async ({ request }: ActionFunctionArgs) => {
  const formData = await request.formData()
  const prompt = formData.get('prompt') as string
  try {
    const result = await queuedAICall(prompt)
    return json({ success: true, data: result })
  } catch (error) {
    return json(
      { success: false, error: 'The robots are tired. Try again later.' },
      { status: 503 }
    )
  }
}
The UX Problem: "Is it working?"
The downside of queuing and retries is latency. If you're retrying a request three times with backoff, it might take 45 seconds to resolve.
In Remix, this is where useNavigation becomes your best friend. Don't just show a generic spinner. If the request is taking longer than usual, tell the user why.
import { Form, useNavigation } from '@remix-run/react'

function MyComponent() {
  const navigation = useNavigation()
  const isSubmitting = navigation.state === 'submitting'
  return (
    <Form method="post">
      <button type="submit" disabled={isSubmitting}>
        {isSubmitting ? 'Generating (AI is busy, hang tight...)' : 'Generate'}
      </button>
    </Form>
  )
}
Handling Timeouts (The Edge Constraint)
If you are deploying to Vercel or Cloudflare Pages, you have a hard timeout limit (usually 10-30 seconds). If your retry logic takes longer than that, the platform will kill your function.
So, here's the thing: You can't always wait.
If you're on a serverless platform, you might need to shift to an optimistic UI pattern combined with a background job. But for many "internal" or low-traffic apps, simply being smart about your timeouts is enough. I always set a signal with an AbortController to make sure I'm not waiting on a dead request.
const controller = new AbortController()
const timeoutId = setTimeout(() => controller.abort(), 15000) // 15s timeout
try {
  const response = await fetch(url, { signal: controller.signal })
} finally {
  clearTimeout(timeoutId)
}
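If you want the same guard around the whole queued pipeline rather than a single fetch, a tiny Promise.race wrapper does the job. This is a sketch (withTimeout is my name for it), and note that it only stops waiting, it doesn't cancel the underlying request, so pair it with the AbortController pattern above where you can:
// Give up waiting after `ms` milliseconds so the platform doesn't kill us first.
async function withTimeout<T>(promise: Promise<T>, ms = 15_000): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('AI request timed out')), ms)
  })
  try {
    return await Promise.race([promise, timeout])
  } finally {
    clearTimeout(timer)
  }
}

// In the action: const result = await withTimeout(queuedAICall(prompt), 15_000)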
Wrapping Up
Building with AI is inherently messy. We're moving from a world of deterministic APIs to a world where "The server is having a bad day" is a legitimate technical hurdle.
By implementing these three things—model fallbacks, exponential backoff, and concurrency limiting—you move your Remix app from "toy project" to "production grade." You stop playing roulette and start building something that feels solid to your users, even when the underlying APIs are flaky as hell.
Now, go forth and build something cool. Just maybe... don't use GPT-4 to generate your console.log messages. Your wallet will thank you.
Happy coding!

