How to Scrape Without Getting Blocked: Complete Anti-Detection Guide
Getting blocked is the single biggest challenge in web scraping. You've built your scraper, tested it locally, and deployed it—only to watch it get banned after 50 requests.
This guide covers everything you need to know to scrape successfully without triggering anti-bot systems.
Why Do Websites Block Scrapers?
Understanding how you get blocked is the first step to avoiding it.
1. Volume-Based Detection
How it works: Normal users make 5-20 requests per minute. Your scraper makes 500. That's an instant red flag.
What triggers it:
- Too many requests from one IP
- Requests happening too uniformly (exactly 1 per second)
- Unusual traffic patterns (all POST requests, no CSS/JS loads)
2. Behavioral Detection
How it works: Real users browse pages, hover over elements, scroll, and pause. Scrapers don't.
What triggers it:
- Zero mouse movement or scrolling
- Instant form submissions
- No time spent "reading" content
- Accessing pages in non-human order
3. Technical Fingerprinting
How it works: Your browser leaks hundreds of unique identifiers. When those don't match real users, you're flagged.
What triggers it:
- Missing or incorrect headers (User-Agent, Accept-Language)
- Headless browser detection (navigator.webdriver = true)
- Inconsistent TLS fingerprints
- Missing JavaScript execution artifacts
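To make fingerprint mismatches concrete, here is a hypothetical sketch of the kind of consistency check an anti-bot system might run over incoming headers. The rules and header names are illustrative, not any vendor's actual logic:

```python
# Hypothetical sketch of server-side fingerprint consistency checks.
# A User-Agent that claims one OS while the client-hint platform
# header claims another is a classic mismatch that gets flagged.
def fingerprint_mismatches(headers):
    """Return a list of detected header inconsistencies."""
    issues = []
    ua = headers.get('User-Agent', '')
    platform = headers.get('Sec-CH-UA-Platform', '').strip('"')

    # OS claimed by the User-Agent vs. the client-hint platform
    if 'Windows' in ua and platform and platform != 'Windows':
        issues.append('UA says Windows, client hints say ' + platform)
    if 'Macintosh' in ua and platform and platform != 'macOS':
        issues.append('UA says macOS, client hints say ' + platform)

    # A browser UA with no Accept-Language is almost never a human
    if ua and 'Accept-Language' not in headers:
        issues.append('browser UA but no Accept-Language header')
    return issues

suspicious = fingerprint_mismatches({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Sec-CH-UA-Platform': '"Linux"',
})
```

Real systems check hundreds of such signals (TLS fingerprints, canvas hashes, font lists); the principle is the same: inconsistencies between signals that should agree mark you as automated.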
The Anti-Detection Stack
Successful scraping at scale requires multiple layers of protection.
Layer 1: Rate Limiting and Timing
Never scrape at a constant rate. Real users are unpredictable.
```python
import time
import random
import requests

def human_delay(min_seconds=2, max_seconds=8):
    """Add a randomized delay that mimics human browsing pauses."""
    time.sleep(random.uniform(min_seconds, max_seconds))

# urls_to_scrape, proxy_config, and process_data are assumed defined elsewhere
for url in urls_to_scrape:
    response = requests.get(url, proxies=proxy_config)
    process_data(response)
    human_delay(3, 10)
```
Best practices:
- Vary delays between 2 and 10 seconds for normal browsing
- Increase delays during peak hours (sites monitor more closely)
- Implement exponential backoff on errors
- Respect robots.txt crawl-delay directives
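The crawl-delay directive mentioned above can be read with Python's standard library. This sketch parses an inlined robots.txt for clarity; in real use you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`:

```python
import urllib.robotparser

# Example robots.txt content, inlined for the sketch
robots_txt = """
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

delay = rp.crawl_delay("*")                                   # -> 5
allowed = rp.can_fetch("*", "https://example.com/products")   # -> True
```

Feed `delay` into your `human_delay` bounds so the site's own limit becomes your floor, not something you discover after a ban.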
Layer 2: Residential Proxies
Datacenter IPs are easily detected. Residential proxies make your traffic appear to come from real users' home connections.
```python
import requests

proxies = {
    'http': 'http://username:[email protected]:8080',
    'https': 'http://username:[email protected]:8080'
}

session = requests.Session()

for url in urls:
    response = session.get(
        url,
        proxies=proxies,
        timeout=30
    )
```
Critical details:
- Use rotating residential proxies, not datacenter
- Enable sticky sessions for multi-page flows (checkouts, logins)
- Target IPs from the same country as your target audience
- Private pools are cleaner than shared pools
ProxyLabs tip: Set session duration to 10-20 minutes for natural browsing patterns.
Layer 3: Header Management
Headers reveal your identity. Get them wrong and you're instantly flagged.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    session = requests.Session()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0',
    }
    session.headers.update(headers)

    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
```
Header rotation strategies:
- Rotate User-Agent per session (not per request)
- Keep headers consistent within a session
- Match User-Agent to Accept headers (mobile UA = mobile Accept)
- Include all modern browser headers (Sec-Fetch-*, Cache-Control)
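The per-session rotation rule above can be sketched as picking one coherent browser profile for the whole session, rather than mixing a random User-Agent with random headers per request. The two profiles here are illustrative, not exhaustive:

```python
import random

# Each profile keeps the UA and its matching client-hint platform together
PROFILES = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/122.0.0.0 Safari/537.36',
        'Sec-CH-UA-Platform': '"Windows"',
    },
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/122.0.0.0 Safari/537.36',
        'Sec-CH-UA-Platform': '"macOS"',
    },
]

COMMON = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}

def session_headers():
    """Choose one profile for the whole session and merge shared headers."""
    profile = random.choice(PROFILES)
    return {**COMMON, **profile}

headers = session_headers()
# The UA and the client-hint platform always agree, whichever profile is drawn
```

Call `session_headers()` once when creating a session, then reuse the result for every request in that session.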
Layer 4: JavaScript Rendering
Many modern sites render their content with JavaScript, so plain requests won't cut it.
```python
from playwright.sync_api import sync_playwright
import random

def scrape_with_playwright(url, proxy_config):
    # proxy_config example:
    # {"server": "http://gate.proxylabs.net:8080",
    #  "username": "your-username", "password": "your-password"}
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=proxy_config
        )
        context = browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            locale='en-US',
            timezone_id='America/New_York'
        )
        page = context.new_page()
        page.goto(url, wait_until='networkidle')

        # Light interaction so the visit looks less scripted
        page.mouse.move(
            random.randint(100, 500),
            random.randint(100, 500)
        )
        page.wait_for_timeout(random.randint(2000, 5000))

        content = page.content()
        browser.close()
        return content
```
Playwright advantages over Selenium:
- Better anti-detection (fewer fingerprints)
- Faster execution
- Auto-waiting for elements
- Network interception capabilities
Stealth tactics:
- Always set viewport to common resolutions (1920x1080, 1366x768)
- Set locale and timezone to match proxy location
- Simulate mouse movements and scrolling
- Use real browser user agents only
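Matching locale and timezone to the proxy's exit country can be as simple as a lookup table feeding `new_context()`. The mapping below is a small illustrative sample, not a complete table:

```python
# Illustrative country -> browser geography mapping
GEO_PROFILES = {
    'US': {'locale': 'en-US', 'timezone_id': 'America/New_York'},
    'GB': {'locale': 'en-GB', 'timezone_id': 'Europe/London'},
    'DE': {'locale': 'de-DE', 'timezone_id': 'Europe/Berlin'},
}

def context_options(proxy_country, viewport=(1920, 1080)):
    """Build Playwright new_context() kwargs consistent with a proxy's country."""
    geo = GEO_PROFILES[proxy_country]
    return {
        'viewport': {'width': viewport[0], 'height': viewport[1]},
        'locale': geo['locale'],
        'timezone_id': geo['timezone_id'],
    }

opts = context_options('GB')
# usage: context = browser.new_context(**opts)
```

A UK proxy with an `America/New_York` timezone is exactly the kind of inconsistency fingerprinting systems look for, so derive these values from the proxy, never hardcode them.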
Layer 5: Cookie and Session Management
Sessions reveal your behavior patterns. Manage them like a real user.
```python
import requests
import pickle

def save_cookies(session, filename):
    with open(filename, 'wb') as f:
        pickle.dump(session.cookies, f)

def load_cookies(session, filename):
    try:
        with open(filename, 'rb') as f:
            session.cookies.update(pickle.load(f))
    except FileNotFoundError:
        pass

session = requests.Session()
load_cookies(session, 'cookies.pkl')

response = session.get(target_url, proxies=proxy_config)

save_cookies(session, 'cookies.pkl')
```
Session best practices:
- Maintain sessions for multiple requests (like real users)
- Store and reuse cookies across scraping runs
- Don't share cookies across different proxy IPs
- Clear sessions after 10-15 minutes (natural session length)
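The 10-15 minute rule above can be enforced with a small wrapper that rebuilds the session once it exceeds a maximum age. This dependency-free sketch accepts any zero-argument session factory (pass `requests.Session` in a real scraper); the tiny lifetime in the demo exists only to show the rollover:

```python
import time

class ExpiringSession:
    """Recreate the underlying session after max_age_seconds.

    factory is any zero-arg callable; use requests.Session in real code.
    """

    def __init__(self, factory, max_age_seconds=12 * 60):
        self._factory = factory
        self.max_age = max_age_seconds
        self._started = time.monotonic()
        self.session = factory()

    def get(self):
        """Return the current session, rotating it when it is too old."""
        if time.monotonic() - self._started > self.max_age:
            closer = getattr(self.session, 'close', None)
            if closer:
                closer()          # release connections/cookies of the old session
            self.session = self._factory()
            self._started = time.monotonic()
        return self.session

# Demo with a trivial factory and an artificially short lifetime
mgr = ExpiringSession(object, max_age_seconds=0.05)
s1 = mgr.get()
time.sleep(0.1)
s2 = mgr.get()   # past max age: a brand-new session object
```

Pair the rollover with a proxy rotation so the fresh session also gets a fresh exit IP.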
Advanced Anti-Detection Techniques
Bypassing CAPTCHAs
CAPTCHAs are the final boss of web scraping.
Strategies:
- Avoid triggering them: Better anti-detection means fewer CAPTCHAs
- CAPTCHA solving services: 2captcha, Anti-Captcha (typically $1-3 per 1,000 solves)
- Machine learning: Train models on CAPTCHA patterns (high effort)
- Smart routing: If CAPTCHA appears, switch IP and retry
```python
def handle_captcha(page):
    """Return False when a reCAPTCHA iframe is present so the caller can rotate IPs."""
    if page.locator('iframe[src*="recaptcha"]').count() > 0:
        print("CAPTCHA detected, rotating IP...")
        return False
    return True
```
Handling Rate Limits
If you get rate limited, don't panic.
```python
import time
import random

def exponential_backoff(attempt, base_delay=1, max_delay=60):
    delay = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, delay * 0.1)  # jitter avoids synchronized retries
    return delay + jitter

attempt = 0
max_attempts = 5

while attempt < max_attempts:
    try:
        response = session.get(url, proxies=proxy_config, timeout=30)
        if response.status_code == 429:
            # Honor the server's Retry-After header when present
            retry_after = int(response.headers.get('Retry-After', 60))
            print(f"Rate limited, waiting {retry_after}s")
            time.sleep(retry_after)
            attempt += 1
            continue
        if response.status_code == 200:
            break
    except Exception as e:
        print(f"Error: {e}")
        time.sleep(exponential_backoff(attempt))
    attempt += 1
```
Browser Fingerprint Randomization
Every browser has a unique fingerprint. Randomize it.
```python
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]

viewports = [
    {'width': 1920, 'height': 1080},
    {'width': 1366, 'height': 768},
    {'width': 1536, 'height': 864}
]

def create_random_context(browser):
    return browser.new_context(
        viewport=random.choice(viewports),
        user_agent=random.choice(user_agents),
        locale=random.choice(['en-US', 'en-GB']),
        timezone_id='America/New_York',
        has_touch=False,
        is_mobile=False,
        device_scale_factor=1
    )
```
Target-Specific Strategies
E-Commerce Sites (Amazon, eBay, Walmart)
- Use residential proxies exclusively
- Mimic real shopping behavior (view product → add to cart → view cart)
- Maintain sessions across multiple page views
- Respect peak/off-peak hours
Social Media (Twitter, LinkedIn, Instagram)
- One account per proxy IP
- Use sticky sessions (10-30 min)
- Space actions 2-5 minutes apart
- Mimic mobile app traffic patterns
Search Engines (Google, Bing)
- Rotate IPs per search
- Include referrer headers
- Wait 3-10 seconds between searches
- Use location-matched proxies
Ticketing Sites (Ticketmaster, AXS)
- Private residential proxies only
- Sticky sessions for queue systems
- Sub-second response times critical
- Pre-warm sessions before sales
Monitoring Your Scraper Health
Track these metrics to catch blocks early:
```python
class ScraperMetrics:
    def __init__(self):
        self.requests = 0
        self.successes = 0
        self.blocks = 0
        self.captchas = 0
        self.response_times = []

    def record_request(self, success, blocked, captcha, response_time):
        self.requests += 1
        if success:
            self.successes += 1
        if blocked:
            self.blocks += 1
        if captcha:
            self.captchas += 1
        self.response_times.append(response_time)

    def get_stats(self):
        # Guard against division by zero before any requests are recorded
        if self.requests == 0:
            return {'total_requests': 0}
        success_rate = self.successes / self.requests * 100
        avg_response = sum(self.response_times) / len(self.response_times) if self.response_times else 0
        return {
            'total_requests': self.requests,
            'success_rate': f'{success_rate:.1f}%',
            'block_rate': f'{(self.blocks / self.requests * 100):.1f}%',
            'captcha_rate': f'{(self.captchas / self.requests * 100):.1f}%',
            'avg_response_time': f'{avg_response:.2f}s'
        }
```
Warning signs:
- Success rate drops below 95%
- Response times increase by 50%+
- CAPTCHA rate above 5%
- Consistent 403/429 errors
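These warning signs can be automated as a health check over the metrics counters. The thresholds mirror the figures above and should be tuned per target; this helper is a sketch, not part of `ScraperMetrics`:

```python
def health_warnings(requests_made, successes, captchas, baseline_avg, current_avg):
    """Return a list of triggered warning signs for a scraping run."""
    warnings = []
    if requests_made == 0:
        return warnings
    if successes / requests_made < 0.95:
        warnings.append('success rate below 95%')
    if captchas / requests_made > 0.05:
        warnings.append('CAPTCHA rate above 5%')
    if baseline_avg > 0 and current_avg > baseline_avg * 1.5:
        warnings.append('response times up 50%+ vs baseline')
    return warnings

# An unhealthy run: 90% success, 7% CAPTCHAs, response times nearly doubled
alerts = health_warnings(
    requests_made=200, successes=180, captchas=14,
    baseline_avg=0.8, current_avg=1.5,
)
```

Wire this into whatever alerting you already run so a flagged proxy pool or fingerprint gets caught within minutes, not after a full ban.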
Complete Scraping Example
Putting it all together:
```python
from playwright.sync_api import sync_playwright
import random
import time

class StealthScraper:
    def __init__(self, proxy_config):
        self.proxy = proxy_config
        self.metrics = ScraperMetrics()  # defined in the monitoring section above

    def scrape_page(self, url):
        with sync_playwright() as p:
            browser = p.chromium.launch(
                headless=True,
                proxy=self.proxy
            )
            context = browser.new_context(
                viewport={'width': 1920, 'height': 1080},
                user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                locale='en-US'
            )
            page = context.new_page()
            try:
                start_time = time.time()
                page.goto(url, wait_until='networkidle', timeout=30000)
                page.mouse.move(
                    random.randint(200, 800),
                    random.randint(200, 600)
                )
                page.wait_for_timeout(random.randint(2000, 5000))
                content = page.content()
                response_time = time.time() - start_time
                self.metrics.record_request(
                    success=True,
                    blocked=False,
                    captcha=False,
                    response_time=response_time
                )
                return content
            except Exception:
                self.metrics.record_request(
                    success=False,
                    blocked=True,
                    captcha=False,
                    response_time=0
                )
                raise
            finally:
                browser.close()

    def scrape_multiple(self, urls):
        for url in urls:
            try:
                self.scrape_page(url)
                print(f"✓ Scraped: {url}")
            except Exception as e:
                print(f"✗ Failed: {url} - {e}")
            time.sleep(random.uniform(3, 8))
        print("\nScraping Stats:")
        print(self.metrics.get_stats())

proxy = {
    "server": "http://gate.proxylabs.net:8080",
    "username": "your-username",
    "password": "your-password"
}

scraper = StealthScraper(proxy)
scraper.scrape_multiple([
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
])
```
Troubleshooting Common Issues
"I'm still getting blocked after implementing everything"
Check:
- Are you using residential proxies or datacenter?
- Is your request rate still too high?
- Are headers matching your User-Agent?
- Is JavaScript executing properly?
"My scraper is too slow now"
Solutions:
- Implement concurrent workers (but maintain rate limits per IP)
- Use faster proxies (ProxyLabs averages ~200ms)
- Cache responses where possible
- Optimize your parsing logic
"I keep getting CAPTCHAs"
This means:
- Your fingerprint is still detectable
- Your proxy IPs might be flagged
- Your behavior is too bot-like
- The site has very aggressive protection
Try:
- Switch to private proxy pools
- Increase delays between requests
- Add more human-like behavior
- Consider CAPTCHA solving services
Legal and Ethical Considerations
You should:
- Respect robots.txt
- Honor rate limits
- Avoid overloading small sites
- Comply with terms of service
- Use data responsibly
You shouldn't:
- Scrape private/authenticated data without permission
- Ignore cease-and-desist notices
- Scrape personal information for resale
- Overload sites to the point of service degradation
Final Checklist
Before deploying your scraper:
- [ ] Using residential proxies (not datacenter)
- [ ] Rate limiting with randomized delays
- [ ] Rotating headers per session
- [ ] JavaScript rendering for modern sites
- [ ] Cookie/session management implemented
- [ ] Error handling and retry logic
- [ ] Monitoring and alerting setup
- [ ] Tested at scale (not just locally)
- [ ] Legal review completed
Next Steps
- Test your scraper: Run for 24 hours at low volume first
- Monitor metrics: Track success rate, blocks, and response times
- Scale gradually: Double volume every 2-3 days if metrics stay healthy
- Iterate: Adjust delays and behavior based on real data
Successful scraping is about stealth, not speed. Go slow, stay undetected, and scale sustainably.
Ready to try the fastest residential proxies?
Join developers and businesses who trust ProxyLabs for mission-critical proxy infrastructure.