When a scraper gets blocked, most developers start by checking their proxy health or rotating user agents. If that doesn't work, they might stare at the Chrome DevTools Network tab, trying to spot a missing header. This manual inspection is slow and prone to error. You'll miss a subtle cookie change or a hidden tracking pixel that's flagging your automation.
The professional way to debug these failures is using HAR (HTTP Archive) files. A HAR file is a JSON-formatted log of every single interaction between the browser and the server. It captures the exact headers, cookies, timing, and response bodies of a successful session. By comparing a HAR from a real browser with a HAR from your failing script, you can find the exact difference that's triggering the anti-bot wall.
What a HAR file captures
A HAR file is more than just a list of URLs. It's a complete snapshot of the network state. While a screenshot of DevTools shows you what happened, a HAR file gives you the data needed to recreate it.
It includes:
- Request and Response Headers: Every x-requested-with, referer, and custom security header.
- Cookies: Both the cookies sent by the browser and the set-cookie instructions from the server.
- POST Bodies: The exact payload sent in JSON or form-encoded requests.
- Timing Data: How long each request took, which helps identify rate-limiting thresholds or artificial delays.
- Binary Content: Images, scripts, and fonts that might contain hidden bot-detection challenges.
If you're asking for help with a blocked scraper, sending a HAR file is the fastest way to get an answer. It eliminates the "it works on my machine" problem by providing the raw data of the failure.
Capturing HAR in Chrome DevTools
The easiest way to generate a HAR file is through the browser you already use.
- Open Chrome and press F12 (or Right-Click > Inspect) to open DevTools.
- Navigate to the Network tab.
- Check the Preserve log and Disable cache boxes. "Preserve log" ensures that when the page redirects (common in anti-bot flows), the previous requests aren't wiped. "Disable cache" ensures you see the full handshake, not just cached assets.
- Perform the action you're trying to scrape. Log in, search, or navigate to the product page.
- Once the flow is complete, look for the Download (down arrow) icon or right-click any request and select Save all as HAR with content.
For anti-bot analysis, you must capture the full page load, not just the final API call. Many security tokens are generated during the initial HTML load or by background scripts before your target data is ever requested.
Understanding the HAR structure
Since HAR is just JSON, you can parse it with any language. The root object is log, which contains entries. Each entry represents one HTTP transaction.
```json
{
  "log": {
    "version": "1.2",
    "entries": [
      {
        "startedDateTime": "2026-03-16T10:00:00.000Z",
        "time": 245.3,
        "request": {
          "method": "GET",
          "url": "https://target.com/api/products",
          "headers": [
            { "name": "accept", "value": "application/json" },
            { "name": "user-agent", "value": "Mozilla/5.0..." }
          ],
          "cookies": [],
          "postData": { "mimeType": "application/json", "text": "{...}" }
        },
        "response": {
          "status": 200,
          "headers": [
            { "name": "content-type", "value": "application/json" }
          ],
          "content": {
            "size": 1024,
            "mimeType": "application/json",
            "text": "{\"products\": []}"
          }
        }
      }
    ]
  }
}
```
This structured format allows you to write scripts that automatically find the differences between a "clean" browser session and your "dirty" scraper session.
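As a minimal sketch of that idea (the file names and URL substring are placeholders), the following compares the request headers for the same endpoint across two captures:

```python
import json

def load_headers(har_path, url_substring):
    """Return the header dict of the first request whose URL matches."""
    with open(har_path) as f:
        har = json.load(f)
    for entry in har['log']['entries']:
        if url_substring in entry['request']['url']:
            return {h['name'].lower(): h['value'] for h in entry['request']['headers']}
    return {}

def diff_headers(browser_har, scraper_har, url_substring):
    browser = load_headers(browser_har, url_substring)
    scraper = load_headers(scraper_har, url_substring)
    missing = set(browser) - set(scraper)   # sent by the browser, not by you
    extra = set(scraper) - set(browser)     # sent by you, not by the browser
    changed = {k for k in browser.keys() & scraper.keys() if browser[k] != scraper[k]}
    return missing, extra, changed

# Usage (placeholder file names):
# missing, extra, changed = diff_headers("browser.har", "scraper.har", "/api/products")
```

Headers in the "missing" set, such as sec-ch-ua or accept-encoding, are usually the first things worth fixing.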
Scrubbing PII before sharing
HAR files are dangerously descriptive. If you record a session where you're logged into a site, that HAR file contains your session cookies, authentication tokens, and potentially your password if you recorded the login step. It also contains any personal data returned in the response bodies.
Before you share a HAR with a teammate, a consultant, or a proxy provider, you must scrub it. Here's a Python script to redact sensitive fields:
```python
import json

def scrub_har(input_path, output_path):
    with open(input_path) as f:
        har = json.load(f)

    # Sensitive headers to look for
    sensitive_headers = ('cookie', 'authorization', 'x-auth-token', 'proxy-authorization')

    for entry in har['log']['entries']:
        # Scrub auth-related request headers
        for header in entry['request']['headers']:
            if header['name'].lower() in sensitive_headers:
                header['value'] = '[REDACTED]'

        # Scrub the request cookies list
        for cookie in entry['request'].get('cookies', []):
            cookie['value'] = '[REDACTED]'

        # Scrub response Set-Cookie headers
        for header in entry['response']['headers']:
            if header['name'].lower() == 'set-cookie':
                header['value'] = '[REDACTED]'

        # Scrub response bodies, which might contain PII or session data.
        # You can use regex here to find emails/names, or just redact everything.
        if 'text' in entry['response'].get('content', {}):
            entry['response']['content']['text'] = '[REDACTED]'

    with open(output_path, 'w') as f:
        json.dump(har, f, indent=2)

# Usage: scrub_har("raw_capture.har", "clean_capture.har")
```
If you only need the headers for debugging, redacting the text field in the response content is the safest default.
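If you need the response bodies for debugging but still want to scrub them, a middle ground is pattern-based redaction. The regexes below are illustrative, not exhaustive; treat them as a starting point:

```python
import re

# Illustrative patterns; extend for your own data (phone numbers, names, etc.)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
BEARER_RE = re.compile(r'Bearer\s+[A-Za-z0-9\-._~+/]+=*')

def redact_text(text):
    """Replace emails and bearer tokens but keep the surrounding JSON structure."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    text = BEARER_RE.sub('Bearer [REDACTED]', text)
    return text
```

This keeps the shape of the JSON intact, so a teammate can still see which fields the server returned without seeing the values.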
Automated HAR capture with Playwright
Manual capture is fine for one-off debugging, but if you're dealing with a site that changes its security headers every hour, you need automation. Playwright has built-in support for recording HAR files.
This is useful for two things:
- Creating a "Golden HAR" from a clean, residential IP to use as a reference.
- Capturing what your scraper sees when it's running in a headless environment.
```python
import asyncio
from playwright.async_api import async_playwright

async def capture_har(url: str, output_path: str):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context(
            record_har_path=output_path,
            record_har_url_filter="**/*",  # Capture everything
        )
        page = await context.new_page()

        # Navigate and wait for the network to settle
        await page.goto(url, wait_until="networkidle")

        # Critical: close the context to flush the HAR to disk
        await context.close()
        await browser.close()

if __name__ == "__main__":
    asyncio.run(capture_har("https://target.com", "automated_capture.har"))
```
By running this on a schedule, you can catch exactly when a site rolls out a new version of its anti-bot script.
Analyzing a HAR for anti-bot signals
When you're blocked, the answer is usually hidden in a request you didn't think was important. Look for these specific patterns in your HAR:
1. Challenge Endpoints
Search for URLs containing _challenge, cf-chl, sensor_data, ak_bmsc, or __cf_bm. These are signs of Cloudflare, Akamai, or DataDome. If these requests return a 403 or stay in a loop, your scraper isn't solving the browser challenge correctly.
2. Unexpected Redirects
If your request to /api/data suddenly returns a 302 redirecting you to a login page or a captcha page, look at the cookies in the request that preceded it. Often, a "security" cookie is set on a seemingly unrelated image or CSS request.
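One way to hunt for that cookie is to scan the HAR for every Set-Cookie response and record which URL issued it. A sketch (the cookie name and file name are placeholders, and it assumes one cookie per Set-Cookie header):

```python
import json

def find_cookie_origin(har_path, cookie_name):
    """Return the URLs of every response that sets the given cookie, in request order."""
    with open(har_path) as f:
        har = json.load(f)
    origins = []
    for entry in har['log']['entries']:
        for header in entry['response']['headers']:
            if header['name'].lower() == 'set-cookie' and header['value'].startswith(cookie_name + '='):
                origins.append(entry['request']['url'])
    return origins

# Usage: find_cookie_origin("browser.har", "__cf_bm")
```

If the origin turns out to be a CSS file or a tracking script your scraper never fetches, you've found the missing step.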
3. Header Diffing
Compare the headers your scraper sends against the HAR. Common misses include:
- sec-ch-ua (Client Hints): Modern browsers send these, but many requests or axios setups don't.
- accept-encoding: If you're missing br (Brotli), it's a massive red flag for many CDNs.
- referer: Some APIs will fail if the referer doesn't match the exact expected path.
You can use a script to find these blocks quickly:
```python
import json

def find_blocking_requests(har_path):
    with open(har_path) as f:
        har = json.load(f)

    for entry in har['log']['entries']:
        req = entry['request']
        resp = entry['response']

        # Check for bot detection markers
        bot_markers = ['_challenge', 'cf-chl', 'sensor_data', 'ak_bmsc', '__cf_bm']
        if any(marker in req['url'] for marker in bot_markers):
            print(f"BOT CHECK DETECTED: {req['url']} → Status: {resp['status']}")

        # Check for redirects to security pages
        if resp['status'] in (301, 302, 307, 308):
            location = next((h['value'] for h in resp['headers'] if h['name'].lower() == 'location'), '')
            if 'captcha' in location or 'block' in location:
                print(f"SECURITY REDIRECT: {req['url']} → {location}")

        # Check for typical block codes
        if resp['status'] in (403, 429, 503):
            print(f"BLOCKED: {resp['status']} on {req['url']}")

find_blocking_requests("automated_capture.har")
```
Extracting headers to replicate
Once you find the "magic" request that works in the browser but fails in your script, you need to copy those headers exactly. HAR files make this simple.
```python
import json

def get_exact_headers(har_path, url_pattern):
    with open(har_path) as f:
        har = json.load(f)

    for entry in har['log']['entries']:
        if url_pattern in entry['request']['url']:
            # Build a dictionary of headers
            headers = {h['name']: h['value'] for h in entry['request']['headers']}
            # Remove pseudo-headers like :path and :method if you're using HTTP/1.1
            return {k: v for k, v in headers.items() if not k.startswith(':')}
    return None

# Get the headers for the API call
target_headers = get_exact_headers("automated_capture.har", "/api/products")
if target_headers:
    print(json.dumps(target_headers, indent=2))
```
Copying these directly into your headers={...} block in Python or Node.js is often enough to bypass simple header-based checks.
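As a dependency-free sketch of that step (the header values and URL are illustrative placeholders), here's the same idea with the standard library; the third-party requests library accepts an identical headers dict:

```python
import urllib.request

# Headers copied from the browser HAR (illustrative values)
headers = {
    "accept": "application/json",
    "accept-encoding": "gzip, deflate, br",
    "referer": "https://target.com/products",
    "user-agent": "Mozilla/5.0 ...",
}

# Build the request object first so you can inspect exactly what will go out
req = urllib.request.Request("https://target.com/api/products", headers=headers)
print(req.get_header("Referer"))

# To actually send it: urllib.request.urlopen(req)
```

Note that urllib normalizes header casing, which can itself matter to strict firewalls; for exact casing and ordering you'll want a client that preserves them.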
Comparing Browser vs Scraper
The "Diff" is where the truth lies. If you can't see why you're being blocked, follow this process:
- Capture a HAR from your browser (The Reference).
- Capture a HAR from your scraper using mitmproxy or Playwright (The Scraper).
- Compare them side-by-side using a tool like HAR Analyzer or a simple text diff.
Look for:
- Header Order: Some advanced firewalls (like Akamai) check if headers are in the same order as a real Chrome browser. HAR files preserve this order.
- TLS Fingerprint: While HAR doesn't show the TLS handshake itself, each entry records the negotiated protocol in its httpVersion field. If your browser uses HTTP/2 and your scraper uses HTTP/1.1, that's an easy signal for a bot manager to block.
- Cookie Progression: Does the browser receive a cookie on request #2 and send it back on request #5? If your scraper misses that middle step, the server knows you're not a real user.
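That progression can be audited programmatically. A sketch that builds a set/send timeline from the standard request.cookies list and response Set-Cookie headers (file name is a placeholder):

```python
import json

def cookie_timeline(har_path):
    """Return (entry_index, url, event) tuples showing cookie set/send order."""
    with open(har_path) as f:
        har = json.load(f)
    events = []
    for i, entry in enumerate(har['log']['entries']):
        url = entry['request']['url']
        # Cookies the client sent with this request
        for cookie in entry['request'].get('cookies', []):
            events.append((i, url, f"sent {cookie['name']}"))
        # Cookies the server set in this response
        for header in entry['response'].get('headers', []):
            if header['name'].lower() == 'set-cookie':
                name = header['value'].split('=', 1)[0]
                events.append((i, url, f"set {name}"))
    return events

# Usage:
# for i, url, event in cookie_timeline("browser.har"):
#     print(f"#{i} {event} on {url}")
```

A cookie that is "set" in the browser timeline but never "sent" in the scraper timeline is exactly the missed middle step described above.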
Essential HAR tools
You don't always have to write custom scripts to use HAR files. These tools are part of any serious scraper's toolkit:
- mitmproxy: An interactive HTTPS proxy. You can run your scraper through it and export the entire flow as a HAR file using the view_har addon. This is the best way to see what your code is actually sending.
- Fiddler Everywhere: A powerful alternative to Charles Proxy. It has excellent HAR export capabilities and is particularly good for debugging mobile app traffic.
- har-validator: A CLI tool to ensure your HAR files are well-formed. This is useful if you're generating HAR files programmatically.
- Google HAR Analyzer: A web-based tool that lets you search and filter large HAR files quickly. It's often faster than the Chrome DevTools UI for files with thousands of entries.
When to use HAR files
Capturing a HAR shouldn't be a last resort. It should be part of your standard development lifecycle:
- Before a new project: Record a clean HAR of the target site to understand the request flow.
- During development: Periodically compare your script's output to the browser's reference.
- After a block: Immediately capture a HAR of the failure to see if it's a 403, a redirect, or a silent drop.
- When switching proxies: Different proxies can sometimes trigger different server responses (like injecting headers). A HAR will reveal this.
By treating HAR files as the "source of truth" for network activity, you stop guessing why you're blocked and start fixing the specific data points that give you away.