Defending Expensive APIs: Rate Limiting Strategies for LLMs
The Cost of a Request
In the old days, a request cost the server microseconds. In the age of LLMs (Large Language Models), a single API call can tie up a GPU for 10 seconds. This makes Application Layer (Layer 7) DDoS attacks devastatingly effective and cheap to launch.
If you use a simple per-IP "10 requests per minute" limit, attackers will simply rotate through 10,000 IP addresses. We need a smarter defense.
Strategy 1: The Token Bucket with Redis & Lua
To handle high throughput correctly, we shouldn't do the check-then-increment in application code: it costs extra round trips, and concurrent requests can race past the limit between the check and the write. Instead, we push the logic into Redis as a Lua script, which Redis executes atomically. The script below shows the simplest variant, a fixed-window counter; a full token bucket follows the same pattern, storing a token count and last-refill timestamp in a hash.
```lua
-- redis_limiter.lua
-- KEYS[1] = per-client key (e.g. "rl:{api_key}")
-- ARGV[1] = max requests allowed per window
-- ARGV[2] = window length in seconds
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call("INCR", key)
if current == 1 then
  -- First request in this window: start the TTL clock
  redis.call("EXPIRE", key, window)
end

if current > limit then
  return 0 -- Rejected
else
  return 1 -- Accepted
end
```
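Calling the script from application code is then a single round trip. A minimal sketch using redis-py, which registers the script once and invokes it by SHA (the file name, key prefix, and 60-per-minute limit are assumptions for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Register the Lua script once; redis-py caches the SHA and calls EVALSHA.
with open("redis_limiter.lua") as f:
    limiter = r.register_script(f.read())

def allow_request(api_key: str, limit: int = 60, window_s: int = 60) -> bool:
    """Return True while this caller is still inside its fixed window."""
    result = limiter(keys=[f"rl:{api_key}"], args=[limit, window_s])
    return result == 1

if not allow_request("user-123"):
    print("429 Too Many Requests")
```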
Strategy 2: JA3 Fingerprinting
IP addresses are cheap. TLS fingerprints are expensive to change. JA3 fingerprints the SSL/TLS Client Hello: the TLS version, cipher suites, extension order, elliptic curves, and point formats a client offers. Even if a bot rotates its IP, these handshake parameters usually stay the same, because they are baked into its TLS library.
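Concretely, JA3 joins five Client Hello fields (TLS version, ciphers, extensions, elliptic curves, point formats) into a comma-separated string and MD5-hashes it. A toy sketch with made-up field values:

```python
import hashlib

# JA3 input format: TLSVersion,Ciphers,Extensions,EllipticCurves,PointFormats
# The values below are illustrative, not captured from a real client.
ja3_string = "771,4865-4866-4867-49195,0-23-65281-10-11-35,29-23-24,0"
ja3_hash = hashlib.md5(ja3_string.encode()).hexdigest()
print(ja3_hash)  # stable identifier for this TLS stack, even across rotating IPs
```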
By implementing JA3 filtering at your ingress (e.g., using Cloudflare Workers or Nginx Plus), you can block a specific type of bot client globally, regardless of which IP it comes from.
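If your ingress computes the JA3 hash and forwards it to the application (the X-JA3-Hash header below is an assumption; Cloudflare and the various Nginx JA3 modules expose it under their own names), the application-side check is just a set lookup:

```python
# Known-bad TLS fingerprints; the hashes below are placeholders, not real ones.
BLOCKED_JA3 = {
    "d41d8cd98f00b204e9800998ecf8427e",
    "5f4dcc3b5aa765d61d8327deb882cf99",
}

def is_blocked(headers: dict[str, str]) -> bool:
    """Reject requests whose ingress-computed JA3 hash is on the blocklist."""
    ja3 = headers.get("X-JA3-Hash", "").lower()
    return ja3 in BLOCKED_JA3
```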
Summary
Protecting GenAI APIs requires a multi-layered approach:
- Network: absorb volumetric floods at the edge (Cloudflare, AWS Shield).
- Identity: require authenticated tokens (Auth0, Cognito) so every request maps to an accountable caller.
- Compute: smart rate limiting based on token count, not just request count, so one expensive prompt can't hide behind one cheap-looking request (see the sketch below).
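A minimal sketch of token-weighted limiting, assuming a crude four-characters-per-token estimate and a per-minute token budget kept in Redis (the key prefix, budget, and estimator are illustrative, not a prescribed implementation):

```python
import redis

r = redis.Redis()

TOKEN_BUDGET_PER_MIN = 20_000  # illustrative per-caller budget; tune per pricing tier

def estimate_tokens(prompt: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Swap in a real tokenizer for accuracy.
    return max(1, len(prompt) // 4)

def consume_budget(api_key: str, prompt: str) -> bool:
    """Charge this request against the caller's per-minute token budget."""
    cost = estimate_tokens(prompt)
    key = f"tokbudget:{api_key}"
    used = r.incrby(key, cost)   # atomically add this request's estimated cost
    if used == cost:
        r.expire(key, 60)        # first charge in the window starts the TTL
    return used <= TOKEN_BUDGET_PER_MIN
```

For strict atomicity under heavy concurrency, the same INCRBY/EXPIRE pair can be folded into the Lua script from Strategy 1.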