
How to Scrape Without Getting Blocked: Complete Anti-Detection Guide

James Liu
Lead Engineer @ ProxyLabs
January 28, 2026
9 min read


Getting blocked is the single biggest challenge in web scraping. You've built your scraper, tested it locally, and deployed it—only to watch it get banned after 50 requests.

This guide covers everything you need to know to scrape successfully without triggering anti-bot systems.

Why Do Websites Block Scrapers?

Understanding how you get blocked is the first step to avoiding it.

1. Volume-Based Detection

How it works: Normal users make 5-20 requests per minute. Your scraper makes 500. That's an instant red flag.

What triggers it:

  • Too many requests from one IP
  • Requests happening too uniformly (exactly 1 per second)
  • Unusual traffic patterns (all POST requests, no CSS/JS loads)

2. Behavioral Detection

How it works: Real users browse pages, hover over elements, scroll, and pause. Scrapers don't.

What triggers it:

  • Zero mouse movement or scrolling
  • Instant form submissions
  • No time spent "reading" content
  • Accessing pages in non-human order

3. Technical Fingerprinting

How it works: Your browser leaks hundreds of unique identifiers. When those don't match real users, you're flagged.

What triggers it:

  • Missing or incorrect headers (User-Agent, Accept-Language)
  • Headless browser detection (navigator.webdriver = true)
  • Inconsistent TLS fingerprints
  • Missing JavaScript execution artifacts

The Anti-Detection Stack

Successful scraping at scale requires multiple layers of protection.

Layer 1: Rate Limiting and Timing

Never scrape at a constant rate. Real users are unpredictable.

import random
import time

import requests

def human_delay(min_seconds=2, max_seconds=8):
    """
    Add randomized delays that mimic human behavior
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)

# urls_to_scrape, proxy_config, and process_data come from your own setup
for url in urls_to_scrape:
    response = requests.get(url, proxies=proxy_config)
    process_data(response)

    human_delay(3, 10)

Best practices:

  • Vary delay between 2-10 seconds for normal browsing
  • Increase delays during peak hours (sites monitor more closely)
  • Implement exponential backoff on errors
  • Respect robots.txt crawl-delay directives (see the sketch after this list)
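
The crawl-delay directive is easy to honor programmatically. A minimal sketch using Python's standard library (the default delay and the example URL are placeholders):

import urllib.robotparser

def get_crawl_delay(base_url, user_agent='*', default_delay=3):
    """Read crawl-delay from robots.txt, falling back to a default."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(f"{base_url}/robots.txt")
    parser.read()
    delay = parser.crawl_delay(user_agent)
    return delay if delay is not None else default_delay

# Use the site's requested pacing if it declares one
delay = get_crawl_delay('https://example.com')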

Layer 2: Residential Proxies

Datacenter IPs are easily detected. Residential proxies appear as real users.

import requests

proxies = {
    'http': 'http://username:password@gate.proxylabs.net:8080',
    'https': 'http://username:password@gate.proxylabs.net:8080'
}

session = requests.Session()

for url in urls:
    response = session.get(
        url,
        proxies=proxies,
        timeout=30
    )

Critical details:

  • Use rotating residential proxies, not datacenter
  • Enable sticky sessions for multi-page flows such as checkouts and logins (see the sketch below)
  • Target IPs from the same country as your target audience
  • Private pools are cleaner than shared pools

ProxyLabs tip: Set session duration to 10-20 minutes for natural browsing patterns.
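
How you pin a sticky session depends on your provider's gateway; a common pattern is encoding a session ID in the proxy username. A minimal sketch (the session-suffix syntax below is illustrative, not a documented format, so check your provider's docs):

import uuid
import requests

def make_sticky_proxies(username, password, gateway='gate.proxylabs.net:8080'):
    """Build a proxy config that keeps one exit IP for a whole flow.
    NOTE: the '-session-<id>' username convention is hypothetical."""
    session_id = uuid.uuid4().hex[:8]
    proxy_url = f"http://{username}-session-{session_id}:{password}@{gateway}"
    return {'http': proxy_url, 'https': proxy_url}

proxies = make_sticky_proxies('your-username', 'your-password')
session = requests.Session()
# Every request in the flow (login -> browse -> checkout) now shares one exit IP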

Layer 3: Header Management

Headers reveal your identity. Get them wrong and you're instantly flagged.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    session = requests.Session()
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Cache-Control': 'max-age=0',
    }
    
    session.headers.update(headers)
    
    retry = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    return session

Header rotation strategies:

  • Rotate User-Agent per session, not per request (see the sketch after this list)
  • Keep headers consistent within a session
  • Match User-Agent to Accept headers (mobile UA = mobile Accept)
  • Include all modern browser headers (Sec-Fetch-*, Cache-Control)
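
One way to rotate the User-Agent per session while keeping it stable inside the session is to pick it once when the session is created. A minimal sketch building on create_session() above (the macOS UA string is an added example, not from the original snippet):

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
]

def create_rotated_session():
    """Pick one User-Agent at session creation, then keep it fixed."""
    session = create_session()  # from the snippet above
    session.headers['User-Agent'] = random.choice(USER_AGENTS)
    return session

# Each session gets its own UA; every request inside it stays consistent
session = create_rotated_session()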

Layer 4: JavaScript Rendering

Many modern sites render their content with JavaScript, so the plain requests library won't cut it.

from playwright.sync_api import sync_playwright
import random

def scrape_with_playwright(url, proxy_config):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.proxylabs.net:8080",
                "username": "your-username",
                "password": "your-password"
            }
        )
        
        context = browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            locale='en-US',
            timezone_id='America/New_York'
        )
        
        page = context.new_page()
        
        page.goto(url, wait_until='networkidle')
        
        page.mouse.move(
            random.randint(100, 500),
            random.randint(100, 500)
        )
        page.wait_for_timeout(random.randint(2000, 5000))
        
        content = page.content()
        
        browser.close()
        
        return content

Playwright advantages over Selenium:

  • Better anti-detection (fewer fingerprints)
  • Faster execution
  • Auto-waiting for elements
  • Network interception capabilities

Stealth tactics:

  • Always set viewport to common resolutions (1920x1080, 1366x768)
  • Set locale and timezone to match proxy location
  • Simulate mouse movements and scrolling
  • Use real browser user agents only
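
One giveaway called out earlier is navigator.webdriver being true in automated browsers. Playwright can patch it with an init script that runs before any page code; a minimal sketch (dedicated stealth plugins go considerably further than this single property):

# browser is a Playwright Chromium instance, launched as in the snippet above
STEALTH_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

context = browser.new_context(
    viewport={'width': 1920, 'height': 1080},
    locale='en-US',
    timezone_id='America/New_York'
)
# Runs before any page script, so fingerprinting code sees a "normal" value
context.add_init_script(STEALTH_JS)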

Layer 5: Cookie and Session Management

Sessions reveal your behavior patterns. Manage them like a real user.

import requests
import pickle

def save_cookies(session, filename):
    with open(filename, 'wb') as f:
        pickle.dump(session.cookies, f)

def load_cookies(session, filename):
    try:
        with open(filename, 'rb') as f:
            session.cookies.update(pickle.load(f))
    except FileNotFoundError:
        pass

session = requests.Session()
load_cookies(session, 'cookies.pkl')

response = session.get(target_url, proxies=proxy_config)

save_cookies(session, 'cookies.pkl')

Session best practices:

  • Maintain sessions for multiple requests (like real users)
  • Store and reuse cookies across scraping runs
  • Don't share cookies across different proxy IPs
  • Clear sessions after 10-15 minutes, a natural session length (see the sketch after this list)
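
A simple way to enforce that session length is to track when each session was created and rebuild it (ideally rotating the proxy at the same time) once it goes stale. A minimal sketch:

import time
import requests

SESSION_MAX_AGE = 15 * 60  # seconds

_session = None
_session_started = 0.0

def get_session():
    """Return the current session, replacing it once it exceeds the max age."""
    global _session, _session_started
    if _session is None or time.time() - _session_started > SESSION_MAX_AGE:
        _session = requests.Session()  # fresh cookies; a good moment to rotate the proxy too
        _session_started = time.time()
    return _session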

Advanced Anti-Detection Techniques

Bypassing CAPTCHAs

CAPTCHAs are the final boss of web scraping.

Strategies:

  1. Avoid triggering them: Better anti-detection means fewer CAPTCHAs
  2. CAPTCHA solving services: 2captcha, Anti-Captcha (roughly $1-3 per 1,000 solves)
  3. Machine learning: Train models on CAPTCHA patterns (high effort)
  4. Smart routing: If a CAPTCHA appears, switch IP and retry

A minimal Playwright check for strategy 4:

def handle_captcha(page):
    if page.locator('iframe[src*="recaptcha"]').count() > 0:
        print("CAPTCHA detected, rotating IP...")
        return False
    return True

Handling Rate Limits

If you get rate limited, don't panic.

import random
import time

def exponential_backoff(attempt, base_delay=1, max_delay=60):
    delay = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, delay * 0.1)
    return delay + jitter

attempt = 0
max_attempts = 5

while attempt < max_attempts:
    try:
        response = session.get(url, proxies=proxy_config, timeout=30)
        
        if response.status_code == 429:
            retry_after = int(response.headers.get('Retry-After', 60))
            print(f"Rate limited, waiting {retry_after}s")
            time.sleep(retry_after)
            attempt += 1
            continue
        
        if response.status_code == 200:
            break
            
    except Exception as e:
        print(f"Error: {e}")
        time.sleep(exponential_backoff(attempt))
        attempt += 1

Browser Fingerprint Randomization

Every browser has a unique fingerprint. Randomize it.

import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
]

viewports = [
    {'width': 1920, 'height': 1080},
    {'width': 1366, 'height': 768},
    {'width': 1536, 'height': 864}
]

def create_random_context(browser):
    return browser.new_context(
        viewport=random.choice(viewports),
        user_agent=random.choice(user_agents),
        locale=random.choice(['en-US', 'en-GB']),
        timezone_id='America/New_York',
        has_touch=False,
        is_mobile=False,
        device_scale_factor=1
    )

Target-Specific Strategies

E-Commerce Sites (Amazon, eBay, Walmart)

  • Use residential proxies exclusively
  • Mimic real shopping behavior: view product → add to cart → view cart (sketched after this list)
  • Maintain sessions across multiple page views
  • Respect peak/off-peak hours
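
What that flow can look like in Playwright, sketched with hypothetical selectors and URLs (real sites need their own selectors and more variation):

import random

# page is a Playwright page routed through a residential proxy, as in Layer 4.
# The URLs and the add-to-cart selector below are hypothetical placeholders.
page.goto('https://shop.example.com/product/123', wait_until='networkidle')
page.wait_for_timeout(random.randint(3000, 8000))  # "read" the product page

page.mouse.wheel(0, random.randint(300, 900))  # scroll like a shopper
page.click('button.add-to-cart')
page.wait_for_timeout(random.randint(1500, 4000))

page.goto('https://shop.example.com/cart', wait_until='networkidle')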

Social Media (Twitter, LinkedIn, Instagram)

  • One account per proxy IP
  • Use sticky sessions (10-30 min)
  • Space actions 2-5 minutes apart
  • Mimic mobile app traffic patterns

Search Engines (Google, Bing)

  • Rotate IPs per search (see the sketch after this list)
  • Include referrer headers
  • Wait 3-10 seconds between searches
  • Use location-matched proxies
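
A hedged sketch of that pattern with requests: each query goes through a rotating gateway (so it typically gets a fresh exit IP), carries a referrer, and is followed by a randomized pause. The search URL is a placeholder and rotation behavior depends on your proxy configuration:

import random
import time
import requests

proxies = {
    'http': 'http://username:password@gate.proxylabs.net:8080',
    'https': 'http://username:password@gate.proxylabs.net:8080'
}

for query in ['residential proxies', 'web scraping guide']:
    response = requests.get(
        'https://search.example.com/search',  # placeholder search endpoint
        params={'q': query},
        headers={'Referer': 'https://search.example.com/'},
        proxies=proxies,
        timeout=30
    )
    time.sleep(random.uniform(3, 10))  # wait 3-10 seconds between searches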

Ticketing Sites (Ticketmaster, AXS)

  • Private residential proxies only
  • Sticky sessions for queue systems
  • Sub-second response times critical
  • Pre-warm sessions before sales

Monitoring Your Scraper Health

Track these metrics to catch blocks early:

class ScraperMetrics:
    def __init__(self):
        self.requests = 0
        self.successes = 0
        self.blocks = 0
        self.captchas = 0
        self.response_times = []
    
    def record_request(self, success, blocked, captcha, response_time):
        self.requests += 1
        if success:
            self.successes += 1
        if blocked:
            self.blocks += 1
        if captcha:
            self.captchas += 1
        self.response_times.append(response_time)
    
    def get_stats(self):
        success_rate = (self.successes / self.requests * 100) if self.requests > 0 else 0
        avg_response = sum(self.response_times) / len(self.response_times) if self.response_times else 0
        
        return {
            'total_requests': self.requests,
            'success_rate': f'{success_rate:.1f}%',
            'block_rate': f'{(self.blocks / self.requests * 100):.1f}%',
            'captcha_rate': f'{(self.captchas / self.requests * 100):.1f}%',
            'avg_response_time': f'{avg_response:.2f}s'
        }

Warning signs (a simple threshold check follows this list):

  • Success rate drops below 95%
  • Response times increase by 50%+
  • CAPTCHA rate above 5%
  • Consistent 403/429 errors
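
A minimal way to turn those thresholds into an automated check, building on the ScraperMetrics class above (the alerting hook is a placeholder):

def check_health(metrics, min_success=95.0, max_captcha=5.0):
    """Return warnings based on the thresholds listed above."""
    if metrics.requests == 0:
        return []
    warnings = []
    success_rate = metrics.successes / metrics.requests * 100
    captcha_rate = metrics.captchas / metrics.requests * 100
    if success_rate < min_success:
        warnings.append(f'Success rate dropped to {success_rate:.1f}%')
    if captcha_rate > max_captcha:
        warnings.append(f'CAPTCHA rate is {captcha_rate:.1f}%')
    return warnings

# metrics is the ScraperMetrics instance populated during the run
for warning in check_health(metrics):
    print(f'WARNING: {warning}')  # or push to your alerting system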

Complete Scraping Example

Putting it all together:

from playwright.sync_api import sync_playwright
import random
import time

class StealthScraper:
    def __init__(self, proxy_config):
        self.proxy = proxy_config
        self.metrics = ScraperMetrics()
    
    def scrape_page(self, url):
        with sync_playwright() as p:
            browser = p.chromium.launch(
                headless=True,
                proxy=self.proxy
            )
            
            context = browser.new_context(
                viewport={'width': 1920, 'height': 1080},
                user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                locale='en-US'
            )
            
            page = context.new_page()
            
            try:
                start_time = time.time()
                page.goto(url, wait_until='networkidle', timeout=30000)
                
                page.mouse.move(
                    random.randint(200, 800),
                    random.randint(200, 600)
                )
                
                page.wait_for_timeout(random.randint(2000, 5000))
                
                content = page.content()
                response_time = time.time() - start_time
                
                self.metrics.record_request(
                    success=True,
                    blocked=False,
                    captcha=False,
                    response_time=response_time
                )
                
                return content
                
            except Exception as e:
                self.metrics.record_request(
                    success=False,
                    blocked=True,
                    captcha=False,
                    response_time=0
                )
                raise e
            
            finally:
                browser.close()
    
    def scrape_multiple(self, urls):
        for url in urls:
            try:
                content = self.scrape_page(url)
                print(f"✓ Scraped: {url}")
            except Exception as e:
                print(f"✗ Failed: {url} - {e}")
            
            time.sleep(random.uniform(3, 8))
        
        print("\nScraping Stats:")
        print(self.metrics.get_stats())

proxy = {
    "server": "http://gate.proxylabs.net:8080",
    "username": "your-username",
    "password": "your-password"
}

scraper = StealthScraper(proxy)
scraper.scrape_multiple([
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
])

Troubleshooting Common Issues

"I'm still getting blocked after implementing everything"

Check:

  1. Are you using residential proxies, or datacenter IPs?
  2. Is your request rate still too high?
  3. Are headers matching your User-Agent?
  4. Is JavaScript executing properly?

"My scraper is too slow now"

Solutions:

  • Implement concurrent workers, but maintain rate limits per IP (see the sketch after this list)
  • Use faster proxies (ProxyLabs averages ~200ms)
  • Cache responses where possible
  • Optimize your parsing logic
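
One way to add concurrency without blowing per-IP rate limits is to give each worker its own proxy and its own delay loop. A minimal sketch with a thread pool (the proxy configs and URL list are placeholders):

import random
import time
from concurrent.futures import ThreadPoolExecutor
import requests

def worker(urls, proxy_config):
    """One worker per proxy, so human-like pacing is kept per IP."""
    session = requests.Session()
    for url in urls:
        response = session.get(url, proxies=proxy_config, timeout=30)
        # parse/store the response here
        time.sleep(random.uniform(3, 8))  # the rate limit applies per IP

# One proxy (and therefore one rate limit) per worker
proxy_configs = [proxy_config_a, proxy_config_b, proxy_config_c]
url_batches = [urls[i::3] for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    pool.map(worker, url_batches, proxy_configs)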

"I keep getting CAPTCHAs"

This means:

  • Your fingerprint is still detectable
  • Your proxy IPs might be flagged
  • Your behavior is too bot-like
  • The site has very aggressive protection

Try:

  • Switch to private proxy pools
  • Increase delays between requests
  • Add more human-like behavior
  • Consider CAPTCHA solving services

Legal and Ethical Considerations

You should:

  • Respect robots.txt
  • Honor rate limits
  • Avoid overloading small sites
  • Comply with terms of service
  • Use data responsibly

You shouldn't:

  • Scrape private/authenticated data without permission
  • Ignore cease-and-desist notices
  • Scrape personal information for resale
  • Overload sites to the point of service degradation

Final Checklist

Before deploying your scraper:

  • [ ] Using residential proxies (not datacenter)
  • [ ] Rate limiting with randomized delays
  • [ ] Rotating headers per session
  • [ ] JavaScript rendering for modern sites
  • [ ] Cookie/session management implemented
  • [ ] Error handling and retry logic
  • [ ] Monitoring and alerting setup
  • [ ] Tested at scale (not just locally)
  • [ ] Legal review completed

Next Steps

  1. Test your scraper: Run for 24 hours at low volume first
  2. Monitor metrics: Track success rate, blocks, and response times
  3. Scale gradually: Double volume every 2-3 days if metrics stay healthy
  4. Iterate: Adjust delays and behavior based on real data

Successful scraping is about stealth, not speed. Go slow, stay undetected, and scale sustainably.

Ready to try the fastest residential proxies?

Join developers and businesses who trust ProxyLabs for mission-critical proxy infrastructure.

James Liu
Lead Engineer @ ProxyLabs

Building proxy infrastructure since 2019. Previously failed at many things, now failing slightly less.
