
HAR Files for Web Scraping Debugging: Capture, Analyze, and Replicate

Alex Carter
Senior Engineer @ ProxyLabs
March 16, 2026
10 min read

When a scraper gets blocked, most developers start by checking their proxy health or rotating user agents. If that doesn't work, they might stare at the Chrome DevTools Network tab, trying to spot a missing header. This manual inspection is slow and prone to error. You'll miss a subtle cookie change or a hidden tracking pixel that's flagging your automation.

The professional way to debug these failures is using HAR (HTTP Archive) files. A HAR file is a JSON-formatted log of every single interaction between the browser and the server. It captures the exact headers, cookies, timing, and response bodies of a successful session. By comparing a HAR from a real browser with a HAR from your failing script, you can find the exact difference that's triggering the anti-bot wall.

What a HAR file captures

A HAR file is more than just a list of URLs. It's a complete snapshot of the network state. While a screenshot of DevTools shows you what happened, a HAR file gives you the data needed to recreate it.

It includes:

  • Request and Response Headers: Every x-requested-with, referer, and custom security header.
  • Cookies: Both the cookies sent by the browser and the set-cookie instructions from the server.
  • POST Bodies: The exact payload sent in JSON or form-encoded requests.
  • Timing Data: How long each request took, which helps identify rate-limiting thresholds or artificial delays.
  • Binary Content: Images, scripts, and fonts that might contain hidden bot-detection challenges.

If you're asking for help with a blocked scraper, sending a HAR file is the fastest way to get an answer. It eliminates the "it works on my machine" problem by providing the raw data of the failure.
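Because the format is plain JSON, first-pass triage can be scripted. As a minimal sketch (the file path is illustrative), this counts response status codes across a capture, which quickly shows whether a session was mostly clean 200s or riddled with 403s:

```python
import json
from collections import Counter

def summarize_har(har_path: str) -> Counter:
    """Count response status codes across all entries in a HAR file."""
    with open(har_path) as f:
        har = json.load(f)
    return Counter(entry['response']['status'] for entry in har['log']['entries'])

# summarize_har("capture.har") returns a Counter mapping status code -> count
```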

Capturing HAR in Chrome DevTools

The easiest way to generate a HAR file is through the browser you already use.

  1. Open Chrome and press F12 (or Right-Click > Inspect) to open DevTools.
  2. Navigate to the Network tab.
  3. Check the Preserve log and Disable cache boxes. "Preserve log" ensures that when the page redirects (common in anti-bot flows), the previous requests aren't wiped. "Disable cache" ensures you see the full handshake, not just cached assets.
  4. Perform the action you're trying to scrape. Log in, search, or navigate to the product page.
  5. Once the flow is complete, look for the Download (down arrow) icon or right-click any request and select Save all as HAR with content.

For anti-bot analysis, you must capture the full page load, not just the final API call. Many security tokens are generated during the initial HTML load or by background scripts before your target data is ever requested.
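You can sanity-check a capture before spending time analyzing it. The heuristic below is an assumption, not part of the HAR spec: it treats a capture as complete only if the first entry is the HTML document itself rather than a bare API call.

```python
import json

def has_full_page_load(har_path: str) -> bool:
    """Heuristic: a capture that starts with the HTML document probably
    includes the initial load; one that starts with an API call does not."""
    with open(har_path) as f:
        entries = json.load(f)['log']['entries']
    if not entries:
        return False
    mime = entries[0]['response'].get('content', {}).get('mimeType', '')
    return mime.startswith('text/html')
```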

Understanding the HAR structure

Since HAR is just JSON, you can parse it with any language. The root object is log, which contains entries. Each entry represents one HTTP transaction.

{
  "log": {
    "version": "1.2",
    "entries": [
      {
        "startedDateTime": "2026-03-16T10:00:00.000Z",
        "time": 245.3,
        "request": {
          "method": "GET",
          "url": "https://target.com/api/products",
          "headers": [
            { "name": "accept", "value": "application/json" },
            { "name": "user-agent", "value": "Mozilla/5.0..." }
          ],
          "cookies": []
        },
        "response": {
          "status": 200,
          "headers": [
            { "name": "content-type", "value": "application/json" }
          ],
          "content": {
            "size": 1024,
            "mimeType": "application/json",
            "text": "{\"products\": []}"
          }
        }
      }
    ]
  }
}

This structured format allows you to write scripts that automatically find the differences between a "clean" browser session and your "dirty" scraper session.

Scrubbing PII before sharing

HAR files are dangerously descriptive. If you record a session where you're logged into a site, that HAR file contains your session cookies, authentication tokens, and potentially your password if you recorded the login step. It also contains any personal data returned in the response bodies.

Before you share a HAR with a teammate, a consultant, or a proxy provider, you must scrub it. Here's a Python script to redact sensitive fields:

import json
import re

def scrub_har(input_path, output_path):
    with open(input_path) as f:
        har = json.load(f)
    
    # Sensitive headers to look for
    sensitive_headers = ('cookie', 'authorization', 'x-auth-token', 'proxy-authorization')
    
    for entry in har['log']['entries']:
        # Scrub request cookies and auth headers
        for header in entry['request']['headers']:
            if header['name'].lower() in sensitive_headers:
                header['value'] = '[REDACTED]'
        
        # Scrub request cookies list
        if 'cookies' in entry['request']:
            for cookie in entry['request']['cookies']:
                cookie['value'] = '[REDACTED]'
                
        # Scrub response Set-Cookie
        for header in entry['response']['headers']:
            if header['name'].lower() == 'set-cookie':
                header['value'] = '[REDACTED]'
                
        # Scrub response bodies which might contain PII or session data
        if 'content' in entry['response'] and 'text' in entry['response']['content']:
            # You can use regex here to find emails/names, or just redact everything
            entry['response']['content']['text'] = '[REDACTED]'
    
    with open(output_path, 'w') as f:
        json.dump(har, f, indent=2)

# Usage: scrub_har("raw_capture.har", "clean_capture.har")

If you only need the headers for debugging, redacting the text field in the response content is the safest default.

Automated HAR capture with Playwright

Manual capture is fine for one-off debugging, but if you're dealing with a site that changes its security headers every hour, you need automation. Playwright has built-in support for recording HAR files.

This is useful for two things:

  1. Creating a "Golden HAR" from a clean, residential IP to use as a reference.
  2. Capturing what your scraper sees when it's running in a headless environment.

from playwright.async_api import async_playwright
import asyncio

async def capture_har(url: str, output_path: str):
    async with async_playwright() as p:
        # Launch Chromium; the context below records every request into the HAR
        browser = await p.chromium.launch()
        context = await browser.new_context(
            record_har_path=output_path,
            record_har_url_filter="**/*" # Capture everything
        )
        page = await context.new_page()
        
        # Navigate and wait for the network to settle
        await page.goto(url, wait_until="networkidle")
        
        # Critical: Close the context to flush the HAR to disk
        await context.close()
        await browser.close()

if __name__ == "__main__":
    asyncio.run(capture_har("https://target.com", "automated_capture.har"))

By running this on a schedule, you can catch exactly when a site rolls out a new version of their anti-bot script.

Analyzing a HAR for anti-bot signals

When you're blocked, the answer is usually hidden in a request you didn't think was important. Look for these specific patterns in your HAR:

1. Challenge Endpoints

Search for URLs containing _challenge, cf-chl, sensor_data, ak_bmsc, or __cf_bm. These are signs of Cloudflare, Akamai, or DataDome. If these requests return a 403 or stay in a loop, your scraper isn't solving the browser challenge correctly.

2. Unexpected Redirects

If your request to /api/data suddenly returns a 302 redirecting you to a login page or a captcha page, look at the cookies in the request that preceded it. Often, a "security" cookie is set on a seemingly unrelated image or CSS request.
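Tracing where a cookie was first planted helps here. As a sketch (the cookie name and file path are illustrative), this scans response Set-Cookie headers and returns the URL of the request that originally set a given cookie:

```python
import json

def find_cookie_origin(har_path: str, cookie_name: str):
    """Return the URL of the first response that set the given cookie.
    Useful when a 'security' cookie is planted by an unrelated asset request."""
    with open(har_path) as f:
        entries = json.load(f)['log']['entries']
    for entry in entries:
        for header in entry['response']['headers']:
            if header['name'].lower() == 'set-cookie' and \
               header['value'].startswith(cookie_name + '='):
                return entry['request']['url']
    return None

# find_cookie_origin("capture.har", "bm_sv") might point at a pixel or CSS URL
```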

3. Header Diffing

Compare the headers your scraper sends against the HAR. Common misses include:

  • sec-ch-ua (Client Hints): Modern browsers send these, but many Python requests or Node.js axios setups don't.
  • accept-encoding: If you're missing br (Brotli), it's a massive red flag for many CDNs.
  • referer: Some APIs will fail if the referer doesn't match the exact expected path.

You can use a script to find these blocks quickly:

import json

def find_blocking_requests(har_path):
    with open(har_path) as f:
        har = json.load(f)
    
    for entry in har['log']['entries']:
        req = entry['request']
        resp = entry['response']
        
        # Check for bot detection markers
        bot_markers = ['_challenge', 'cf-chl', 'sensor_data', 'ak_bmsc', '__cf_bm']
        if any(marker in req['url'] for marker in bot_markers):
            print(f"BOT CHECK DETECTED: {req['url']} → Status: {resp['status']}")
        
        # Check for redirects to security pages
        if resp['status'] in (301, 302, 307, 308):
            location = next((h['value'] for h in resp['headers'] if h['name'].lower() == 'location'), '')
            if 'captcha' in location.lower() or 'block' in location.lower():
                print(f"SECURITY REDIRECT: {req['url']} → {location}")
        
        # Check for typical block codes
        if resp['status'] in (403, 429, 503):
            print(f"BLOCKED: {resp['status']} on {req['url']}")

find_blocking_requests("automated_capture.har")

Extracting headers to replicate

Once you find the "magic" request that works in the browser but fails in your script, you need to copy those headers exactly. HAR files make this simple.

def get_exact_headers(har_path, url_pattern):
    with open(har_path) as f:
        har = json.load(f)
    
    for entry in har['log']['entries']:
        if url_pattern in entry['request']['url']:
            # Create a dictionary of headers
            headers = {h['name']: h['value'] for h in entry['request']['headers']}
            # Remove pseudo-headers like :path, :method if you're using HTTP/1.1
            headers = {k: v for k, v in headers.items() if not k.startswith(':')}
            return headers
    return None

# Get the headers for the API call
target_headers = get_exact_headers("automated_capture.har", "/api/products")
if target_headers:
    print(json.dumps(target_headers, indent=2))

Copying these directly into your headers={...} block in Python or Node.js is often enough to bypass simple header-based checks.
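As a sketch of that replay step using only the standard library (the URL pattern and target paths are illustrative), you can build a urllib request that mirrors the captured headers, dropping HTTP/2 pseudo-headers since urllib speaks HTTP/1.1:

```python
import json
import urllib.request

def build_replay(har_path: str, url_pattern: str):
    """Build (but don't send) a urllib Request mirroring the captured headers."""
    with open(har_path) as f:
        har = json.load(f)
    for entry in har['log']['entries']:
        req = entry['request']
        if url_pattern in req['url']:
            # Drop pseudo-headers like :authority and :path
            headers = {h['name']: h['value']
                       for h in req['headers'] if not h['name'].startswith(':')}
            return urllib.request.Request(req['url'], headers=headers,
                                          method=req['method'])
    return None

# Send it with: urllib.request.urlopen(build_replay("capture.har", "/api/products"))
```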

Comparing browser vs scraper

The diff is where the truth lies. If you can't see why you're being blocked, follow this process:

  1. Capture a HAR from your browser (The Reference).
  2. Capture a HAR from your scraper using mitmproxy or Playwright (The Scraper).
  3. Compare them side-by-side using a tool like HAR Analyzer or a simple text diff.

Look for:

  • Header Order: Some advanced firewalls (like Akamai) check if headers are in the same order as a real Chrome browser. HAR files preserve this order.
  • TLS Fingerprint: While HAR doesn't show the TLS handshake, it shows the resulting HTTP version. If your browser uses HTTP/2 and your scraper uses HTTP/1.1, that's an easy signal for a bot manager to block.
  • Cookie Progression: Does the browser receive a cookie on request #2 and send it back on request #5? If your scraper misses that middle step, the server knows you're not a real user.
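A minimal diff helper along these lines (paths and the URL pattern are illustrative) can surface missing or mismatched headers. Note it lowercases header names for matching, so checking header order would require comparing the raw lists instead:

```python
import json

def diff_headers(reference_har: str, scraper_har: str, url_pattern: str):
    """Compare request headers for the first matching URL in two HAR files.
    Returns (missing, mismatched) relative to the reference capture."""
    def headers_for(path):
        with open(path) as f:
            har = json.load(f)
        for entry in har['log']['entries']:
            if url_pattern in entry['request']['url']:
                return {h['name'].lower(): h['value']
                        for h in entry['request']['headers']}
        return {}

    ref, scr = headers_for(reference_har), headers_for(scraper_har)
    missing = sorted(set(ref) - set(scr))
    mismatched = {k: (ref[k], scr[k]) for k in ref if k in scr and ref[k] != scr[k]}
    return missing, mismatched
```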

Essential HAR tools

You don't always have to write custom scripts to use HAR files. These tools are part of any serious scraper's toolkit:

  • mitmproxy: An interactive HTTPS proxy. You can run your scraper through it and export the entire flow as a HAR file (recent versions support HAR export directly). This is the best way to see what your code is actually sending.
  • Fiddler Everywhere: A powerful alternative to Charles Proxy. It has excellent HAR export capabilities and is particularly good for debugging mobile app traffic.
  • har-validator: A CLI tool to ensure your HAR files are well-formed. This is useful if you're generating HAR files programmatically.
  • Google HAR Analyzer: A web-based tool that lets you search and filter large HAR files quickly. It's often faster than the Chrome DevTools UI for files with thousands of entries.

When to use HAR files

Capturing a HAR shouldn't be a last resort. It should be part of your standard development lifecycle:

  • Before a new project: Record a clean HAR of the target site to understand the request flow.
  • During development: Periodically compare your script's output to the browser's reference.
  • After a block: Immediately capture a HAR of the failure to see if it's a 403, a redirect, or a silent drop.
  • When switching proxies: Different proxies can sometimes trigger different server responses (like injecting headers). A HAR will reveal this.

By treating HAR files as the "source of truth" for network activity, you stop guessing why you're blocked and start fixing the specific data points that give you away.
