What Are Rate Limits?
Rate limits restrict how often you can call an API:
- Requests per second
- Requests per minute
- Requests per day
- Tokens per minute (for AI APIs)
They protect services from overload and ensure fair usage.
Common Rate Limit Responses
HTTP 429 Too Many Requests
```json
{
  "error": "rate_limit_exceeded",
  "message": "Too many requests",
  "retry_after": 60
}
```
Rate Limit Headers
```
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706788800
Retry-After: 60
```
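A minimal sketch of turning those headers into a wait time. The header names and values here are assumptions for illustration; real providers vary (some use the IETF-draft `RateLimit-*` names instead of `X-RateLimit-*`), so check your API's documentation:

```python
# Hypothetical header values, as a plain dict standing in for response.headers
headers = {
    "X-RateLimit-Limit": "100",
    "X-RateLimit-Remaining": "0",
    "X-RateLimit-Reset": "1706788800",
    "Retry-After": "60",
}

def seconds_until_reset(headers, now):
    """Prefer explicit Retry-After; fall back to the reset timestamp."""
    if "Retry-After" in headers:
        return int(headers["Retry-After"])
    if "X-RateLimit-Reset" in headers:
        return max(0, int(headers["X-RateLimit-Reset"]) - now)
    return 0

print(seconds_until_reset(headers, 1706788740))  # → 60
```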
Handling Rate Limits
Basic Retry with Backoff
```python
import time
import random

def call_with_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:  # whatever exception your client raises on HTTP 429
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
```
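To see the retry loop in action, here is a self-contained run against a fake call that succeeds on its third attempt. `RateLimitError` is a stand-in for your client's 429 exception, and a tiny `base_wait` parameter is added only to keep the demo fast:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever your API client raises on HTTP 429."""

def call_with_retry(func, max_retries=5, base_wait=0.01):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter (base_wait kept tiny for the demo)
            time.sleep(base_wait * (2 ** attempt) + random.uniform(0, base_wait))

calls = {"n": 0}

def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return "ok"

print(call_with_retry(flaky_call))  # succeeds on the third attempt
```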
Exponential Backoff
Wait progressively longer:
- Attempt 1: Wait 1 second
- Attempt 2: Wait 2 seconds
- Attempt 3: Wait 4 seconds
- Attempt 4: Wait 8 seconds
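The schedule above is just powers of two times a base wait; with a 1-second base:

```python
base = 1
schedule = [base * 2 ** attempt for attempt in range(4)]
print(schedule)  # → [1, 2, 4, 8]
```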
Jitter
Add randomness to prevent thundering herd:
```python
wait = base_wait * (2 ** attempt) + random.uniform(0, base_wait)
```
Respect Retry-After
When provided, use it:
```python
if 'Retry-After' in response.headers:
    wait = int(response.headers['Retry-After'])
    time.sleep(wait)
```
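The two ideas combine naturally: server guidance wins when present, and capped backoff with jitter fills in otherwise. This helper is a sketch; the `base` and `cap` defaults are assumptions to tune per API:

```python
import random

def next_wait(attempt, retry_after=None, base=1.0, cap=60.0):
    """Server guidance wins; otherwise capped exponential backoff with jitter."""
    if retry_after is not None:
        return float(retry_after)
    return min(cap, base * 2 ** attempt) + random.uniform(0, base)

print(next_wait(3, retry_after=60))  # → 60.0
```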
Proactive Rate Limit Management
Track Your Usage
```python
import time

class RateLimiter:
    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self.calls = []

    def wait_if_needed(self):
        now = time.time()
        # Remove calls older than 1 minute
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_per_minute:
            oldest = self.calls[0]
            wait = 60 - (now - oldest)
            time.sleep(wait)
        self.calls.append(time.time())
```
Request Batching
Combine multiple operations:
```python
# Instead of 10 separate calls
for item in items:
    api.get_item(item.id)

# Batch into one call
api.get_items([item.id for item in items])
```
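If the API caps batch size, split the id list into chunks and make one call per chunk. The chunk size of 10 here is an assumption; use the provider's documented maximum:

```python
def chunked(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

ids = list(range(25))
print([len(batch) for batch in chunked(ids, 10)])  # → [10, 10, 5]
```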
Caching
Don't re-fetch what you already have:
```python
import time

cache = {}

def get_with_cache(key, ttl=300):
    if key in cache and cache[key]['expires'] > time.time():
        return cache[key]['value']
    value = api.get(key)
    cache[key] = {'value': value, 'expires': time.time() + ttl}
    return value
```
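A self-contained version of the same pattern, with a fake fetch function (an assumption standing in for the real API) that counts how many calls actually go out:

```python
import time

fetch_count = {"n": 0}

def fake_api_get(key):
    # Stand-in for a real API call
    fetch_count["n"] += 1
    return f"value-for-{key}"

cache = {}

def get_with_cache(key, ttl=300):
    entry = cache.get(key)
    if entry and entry["expires"] > time.time():
        return entry["value"]
    value = fake_api_get(key)
    cache[key] = {"value": value, "expires": time.time() + ttl}
    return value

get_with_cache("user/42")
get_with_cache("user/42")
print(fetch_count["n"])  # → 1: the second call was served from cache
```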
Rate Limits by Service
AI APIs
Typically limit by:
- Requests per minute (RPM)
- Tokens per minute (TPM)
- Tokens per day (TPD)
Social APIs
Often limit by:
- Posts per hour
- API calls per 15 minutes
- Actions per day
General APIs
Common patterns:
- 100-1000 requests per minute
- Higher limits for authenticated users
- Tiered by plan level
Designing for Rate Limits
Graceful Degradation
When rate limited:
```python
try:
    fresh_data = api.get_latest()
except RateLimitError:
    # Fall back to cached data
    fresh_data = get_cached_data()
    notify("Using cached data due to rate limit")
```
Priority Queues
Important requests first:
```python
import heapq, itertools

class PriorityQueue:
    def __init__(self):
        self._heap, self._count = [], itertools.count()

    def add(self, request, priority):
        # Lower priority number goes to the front; low priority waits
        heapq.heappush(self._heap, (priority, next(self._count), request))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```
Distributed Rate Limiting
When multiple agents share limits:
- Central rate limit tracking
- Reservation system
- Fair distribution
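A minimal sketch of central tracking: agents in one process share a fixed one-minute window behind a lock. The class name and window logic are illustrative assumptions; across machines you would back the counter with shared storage (for example, a Redis counter) instead:

```python
import threading
import time

class SharedWindowLimiter:
    """Agents sharing one limit acquire slots from a common counter."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self.lock = threading.Lock()
        self.window_start = time.time()
        self.count = 0

    def try_acquire(self):
        with self.lock:
            now = time.time()
            if now - self.window_start >= 60:
                self.window_start = now  # new one-minute window
                self.count = 0
            if self.count < self.max_per_minute:
                self.count += 1
                return True
            return False  # caller should wait or queue

limiter = SharedWindowLimiter(max_per_minute=2)
print([limiter.try_acquire() for _ in range(3)])  # → [True, True, False]
```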
Common Mistakes
Retrying Immediately
```python
# Bad - hammers the API
while True:
    try:
        return api.call()
    except RateLimitError:
        pass  # Immediate retry
```
No Maximum Retries
```python
# Bad - can retry forever
while True:
    try:
        return api.call()
    except RateLimitError:
        time.sleep(1)
```
Ignoring Retry-After
```python
# Bad - ignores server guidance
except RateLimitError:
    time.sleep(1)  # Server said wait 60 seconds
```
Not Batching
```python
# Bad - 100 calls when 1 would work
for id in ids:
    results.append(api.get_one(id))

# Good - 1 call
results = api.get_many(ids)
```
Communicating Rate Limits
To Your Human
"I'm being rate limited by the API.
Will retry in 60 seconds."
"Hit the daily limit on X API.
Can continue tomorrow, or we can use alternative Y."
In Logs
```
WARN: Rate limit hit for api.example.com (429)
INFO: Retrying in 32 seconds (attempt 3/5)
ERROR: Rate limit retry exhausted after 5 attempts
```
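One way to produce messages in that shape with Python's standard `logging` module. The `log_retry` helper and logger name are assumptions for illustration:

```python
import logging

logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.INFO)
logger = logging.getLogger("rate_limits")

def log_retry(logger, attempt, max_retries, wait_s, host):
    """Log a rate limit event at a severity matching its impact."""
    if attempt < max_retries:
        logger.warning("Rate limit hit for %s (429)", host)
        logger.info("Retrying in %d seconds (attempt %d/%d)", wait_s, attempt, max_retries)
    else:
        logger.error("Rate limit retry exhausted after %d attempts", max_retries)

log_retry(logger, 3, 5, 32, "api.example.com")
```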
Best Practices Summary
- Treat 429s as a normal condition: expect them and handle them in code
- Use exponential backoff with jitter, and cap the number of retries
- Honor Retry-After whenever the server provides it
- Track usage proactively; batch requests and cache responses to stay under limits
- Log rate limit events and tell your human when limits block progress
Conclusion
Rate limits are a fact of API life. Handle them gracefully:
- Expect them
- Retry intelligently
- Design around them
- Communicate when they impact work
Good rate limit handling is invisible to users—it just works.
Next: Webhook Integration - Receiving real-time events