
How to Scrape Google Search Results with Proxies (Python)

James Liu
Lead Engineer @ ProxyLabs
March 14, 2026
8 min read

Google is the hardest public website to scrape at scale. Not because of CAPTCHAs (those are a symptom), but because Google runs one of the most sophisticated IP reputation systems on the internet. They maintain behavioral fingerprints per IP, track request cadence across sessions, and flag anomalies faster than most anti-bot vendors.

Here's what actually happens when you scrape Google with different proxy types:

| Proxy type | Requests before CAPTCHA | Requests before hard block | Notes |
|---|---|---|---|
| Datacenter (fresh IP) | 5–15 | 30–80 | Google maintains datacenter ASN blocklists |
| Datacenter (known range) | 1–3 | 5–10 | Immediate suspicion on AWS, GCP, Azure ranges |
| Residential (rotating) | 200–500+ | Rare with proper pacing | Depends on ISP diversity |
| Residential (sticky) | 50–100 | 150–300 | Better for paginated results |

These numbers come from testing 10,000 queries across each proxy type over a week in February 2026.

Why Datacenter IPs Fail on Google

Google doesn't just check if your IP is residential or datacenter. They cross-reference multiple signals:

  1. ASN reputation — Google maintains internal scores for every ASN. AWS (AS16509), Google Cloud (AS15169), and Hetzner (AS24940) are permanently flagged. Even a brand-new IP from these ranges gets elevated scrutiny.

  2. Request fingerprint consistency — Google checks if the HTTP headers, TLS fingerprint, and cookie behavior match what a real browser from that IP's ISP would produce. A Python requests call from a Comcast residential IP still looks suspicious if the TLS fingerprint screams python-requests/2.31.

  3. Query pattern analysis — Real users don't search 200 keywords per minute from the same IP. Google's rate detection is per-IP and per-session, and they correlate across IPs from the same subnet.

  4. Geographic consistency — If your IP geolocates to London but your Accept-Language header says en-US and your timezone cookie says America/Chicago, that's a signal.
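Signal 4 is the easiest one to get wrong silently. A minimal sketch of keeping `Accept-Language` consistent with the proxy's exit country (the country-to-locale mapping here is an illustrative assumption, not an exhaustive list):

```python
# Map proxy exit countries to the Accept-Language a local browser would send.
# Illustrative subset only -- extend for the countries you actually target.
LOCALE_BY_COUNTRY = {
    'US': 'en-US,en;q=0.9',
    'GB': 'en-GB,en;q=0.9',
    'DE': 'de-DE,de;q=0.9,en;q=0.8',
    'JP': 'ja-JP,ja;q=0.9,en;q=0.8',
}

def headers_for_country(base_headers, country):
    """Return a copy of base_headers with Accept-Language matched to the IP's country."""
    headers = dict(base_headers)  # copy, so the shared base dict is not mutated
    headers['Accept-Language'] = LOCALE_BY_COUNTRY.get(country, 'en-US,en;q=0.9')
    return headers
```

Pick the locale from the same place you pick the proxy country, so the two can never drift apart.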

Basic Google Scraper with Residential Proxies

import requests
from bs4 import BeautifulSoup
import time
import random

PROXY = {
    'http': 'http://your-username-country-US:[email protected]:8080',
    'https': 'http://your-username-country-US:[email protected]:8080',
}

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

def scrape_google(query, num_results=10):
    url = 'https://www.google.com/search'
    params = {
        'q': query,
        'num': num_results,
        'hl': 'en',
        'gl': 'us',
    }

    response = requests.get(url, params=params, headers=HEADERS, proxies=PROXY, timeout=15)

    if response.status_code == 429:
        print(f"Rate limited. Status: {response.status_code}")
        return None

    if response.status_code != 200:
        print(f"Failed: {response.status_code}")
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    results = []

    for g in soup.select('div.g'):
        title_el = g.select_one('h3')
        link_el = g.select_one('a[href]')
        snippet_el = g.select_one('div[data-sncf]') or g.select_one('.VwiC3b')

        if title_el and link_el:
            results.append({
                'title': title_el.text,
                'url': link_el['href'],
                'snippet': snippet_el.text if snippet_el else '',
            })

    return results

This works for a handful of queries. For anything beyond 50 queries, you need rate limiting and session management.

Rate Limiting Strategy That Actually Works

The single biggest mistake in Google scraping: fixed delays. If you time.sleep(2) between every request, that's more suspicious than variable timing because real humans don't operate on fixed intervals.

import random
import time

class GoogleScraper:
    def __init__(self, proxy_username, proxy_password):
        self.proxy_username = proxy_username
        self.proxy_password = proxy_password
        self.session_count = 0
        self.request_count = 0

    def _get_proxy(self, country='US', session_id=None):
        username = self.proxy_username
        if country:
            username += f'-country-{country}'
        if session_id:
            username += f'-session-{session_id}'

        proxy_url = f'http://{username}:{self.proxy_password}@gate.proxylabs.app:8080'
        return {'http': proxy_url, 'https': proxy_url}

    def _human_delay(self):
        """Mimic human browsing: mostly short pauses, occasional longer ones."""
        base = random.uniform(3, 7)
        # 15% chance of a longer pause (reading results, thinking)
        if random.random() < 0.15:
            base += random.uniform(10, 25)
        # 5% chance of a very long pause (distracted)
        if random.random() < 0.05:
            base += random.uniform(30, 60)
        time.sleep(base)

    def _rotate_session(self):
        """New session every 15-25 requests to avoid per-session detection."""
        self.session_count += 1
        return f'google-{self.session_count}-{random.randint(1000, 9999)}'

    def scrape_batch(self, queries, country='US'):
        results = {}
        session_id = self._rotate_session()
        requests_this_session = 0

        for query in queries:
            # Rotate session every 15-25 requests
            if requests_this_session > random.randint(15, 25):
                session_id = self._rotate_session()
                requests_this_session = 0
                # Longer pause on session rotation
                time.sleep(random.uniform(10, 20))

            proxy = self._get_proxy(country=country, session_id=session_id)
            result = self._single_query(query, proxy)

            if result is None:
                # Rate limited — back off, rotate, retry
                print(f"Rate limited on: {query}. Backing off...")
                time.sleep(random.uniform(30, 60))
                session_id = self._rotate_session()
                proxy = self._get_proxy(country=country, session_id=session_id)
                result = self._single_query(query, proxy)

            results[query] = result
            requests_this_session += 1
            self._human_delay()

        return results

    def _single_query(self, query, proxy):
        # ... same as scrape_google() above
        pass

The Rate Limits That Matter

Through testing, here are the thresholds we've observed:

| Behavior | Threshold | Consequence |
|---|---|---|
| Requests per IP per minute | >8 | CAPTCHA on next request |
| Requests per IP per hour | >100 | Temporary soft block (30 min) |
| Requests per session cookie | >30 | Session flagged; all subsequent requests get CAPTCHA |
| Identical User-Agent across IPs | >50 requests/hr | Cross-IP correlation flag |
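The per-IP-per-minute ceiling can be enforced client-side with a sliding window, so you never hand Google an over-limit request in the first place. A sketch (the 8/min default comes from the table above; the clock is injectable so the logic can be tested without sleeping):

```python
import time
from collections import deque

class PerIpRateLimiter:
    """Allow at most max_per_minute requests per IP over a sliding 60-second window."""

    def __init__(self, max_per_minute=8, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock
        self.history = {}  # ip -> deque of request timestamps

    def wait_time(self, ip):
        """Seconds to wait before the next request from this IP is safe (0 = go now)."""
        now = self.clock()
        window = self.history.setdefault(ip, deque())
        # Drop timestamps that have aged out of the 60s window
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) < self.max_per_minute:
            return 0.0
        return 60 - (now - window[0])

    def record(self, ip):
        """Call after each request actually sent through this IP."""
        self.history.setdefault(ip, deque()).append(self.clock())
```

In a worker loop: `time.sleep(limiter.wait_time(ip))`, send, then `limiter.record(ip)`.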

Handling Google's CAPTCHA

When Google serves a CAPTCHA, it redirects to a /sorry/index URL with a 429 or 302 status. Don't try to solve it; rotate the IP and move on.

def handle_response(self, response, query, proxy):
    if response.status_code == 429 or 'sorry/index' in response.url:
        return 'captcha'
    if response.status_code == 200 and 'unusual traffic' in response.text.lower():
        return 'soft_block'
    if response.status_code == 200:
        return 'success'
    return 'error'
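A caller can then dispatch on those labels. A sketch of one sensible policy (the label-to-action mapping and the backoff numbers are assumptions for illustration, not Google-documented behavior):

```python
def next_action(label):
    """Map a response label from handle_response() to (action, backoff_seconds)."""
    return {
        'captcha':    ('rotate_ip', 0),    # burn the session immediately, no point waiting
        'soft_block': ('rotate_ip', 30),   # rotate and cool down before resuming
        'error':      ('retry_same', 5),   # likely transient; one retry on the same IP
        'success':    ('continue', 0),
    }[label]
```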

CAPTCHA-solving services exist, but for SERP scraping they're not worth it. At $2-3 per 1000 CAPTCHAs and a solve time of 10-30 seconds each, it's cheaper and faster to just use residential proxies with proper pacing and avoid CAPTCHAs entirely.

Country-Targeted SERP Scraping

Google returns different results based on the IP's geolocation, not just the gl parameter. The gl parameter is a hint, but Google gives precedence to IP geolocation when they conflict. If you need accurate local results for SEO rank tracking, the IP must match the target country.

# Scrape UK SERPs with a UK residential IP
proxy_uk = self._get_proxy(country='GB', session_id='uk-serp-001')

# Scrape German SERPs — note: use google.de for better accuracy
# German users typically use google.de, not google.com
params = {
    'q': query,
    'num': 10,
    'hl': 'de',
    'gl': 'de',
}
response = requests.get('https://www.google.de/search', params=params,
                        headers=HEADERS_DE, proxies=self._get_proxy(country='DE'))

For city-level accuracy (e.g., "plumber near me" results for Chicago), use city-level geo-targeting:

proxy_chicago = self._get_proxy(country='US', session_id='chi-local')
# Add city targeting in the proxy username
username = f'{self.proxy_username}-country-US-city-Chicago-session-chi-local'

ProxyLabs supports city-level targeting in 195+ countries, which matters for local SEO monitoring where results shift between cities in the same state.

Scaling to 10K+ Queries/Day

At scale, the bottleneck isn't proxy quality — it's orchestration. You need parallel workers with independent sessions and coordinated rate limiting.

from concurrent.futures import ThreadPoolExecutor
import threading

class ScaledGoogleScraper:
    def __init__(self, proxy_username, proxy_password, max_workers=5):
        self.scraper = GoogleScraper(proxy_username, proxy_password)
        self.max_workers = max_workers
        self.global_rate = threading.Semaphore(max_workers)

    def scrape_all(self, queries, country='US'):
        # Split queries across workers
        chunks = [queries[i::self.max_workers] for i in range(self.max_workers)]

        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = [
                executor.submit(self.scraper.scrape_batch, chunk, country)
                for chunk in chunks
            ]
            results = {}
            for future in futures:
                results.update(future.result())
        return results
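The `queries[i::self.max_workers]` slicing above splits the work round-robin: chunk *i* gets queries *i*, *i + workers*, *i + 2·workers*, and so on, so every query lands in exactly one chunk:

```python
def split_round_robin(queries, workers):
    """Round-robin split used by scrape_all(): one interleaved chunk per worker."""
    return [queries[i::workers] for i in range(workers)]
```

Round-robin (rather than contiguous blocks) also spreads any alphabetically clustered keywords across workers, which keeps per-session query patterns looking more varied.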

Key numbers for planning:

  • 5 workers × 8 requests/min = 40 queries/min = ~2,400/hour
  • At 3-7 second delays, each worker handles ~10-15 queries/min peak
  • Factor in backoffs and you'll average 8,000-12,000 queries/day with 5 workers
  • Bandwidth per query: ~50-100KB = negligible on ProxyLabs' per-GB pricing
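Those planning numbers generalize into a quick estimator. The efficiency factor (the fraction of wall-clock capacity left after backoffs, session rotations, and long pauses) is an assumption you should calibrate from your own logs; the observed 8,000–12,000/day with 5 workers implies a factor of roughly 0.2:

```python
def daily_capacity(workers, avg_delay_s, per_ip_per_min_cap=8, hours=24, efficiency=0.2):
    """Rough queries/day estimate for a multi-worker scraper.

    Each worker is bounded both by its own pacing delay and by the
    per-IP ceiling from the rate-limit table (~8 requests/minute).
    """
    per_worker_per_min = min(60.0 / avg_delay_s, per_ip_per_min_cap)
    return int(workers * per_worker_per_min * 60 * hours * efficiency)
```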

What About Google's API?

The Custom Search JSON API exists and is legitimate. It gives you 100 free queries/day and $5 per 1,000 after that. If you need fewer than 1,000 queries/day and only care about the top 10 results, the API is worth considering. It won't give you SERP features (People Also Ask, featured snippets, local packs) — only organic results.
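For comparison, a minimal Custom Search JSON API call looks like this. The endpoint and parameter names are the real API's; the key and engine ID are placeholders you must supply from the Google Cloud console:

```python
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'        # placeholder: API key from the Google Cloud console
ENGINE_ID = 'YOUR_ENGINE_CX'    # placeholder: programmable search engine ID (cx)

def custom_search_url(query, num=10, gl='us'):
    """Build a Custom Search JSON API request URL (num is capped at 10 per call)."""
    params = {'key': API_KEY, 'cx': ENGINE_ID, 'q': query, 'num': min(num, 10), 'gl': gl}
    return 'https://www.googleapis.com/customsearch/v1?' + urlencode(params)

# data = requests.get(custom_search_url('residential proxies')).json()
# organic = [(item['title'], item['link']) for item in data.get('items', [])]
```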

For rank tracking at scale, ad intelligence, or SERP feature monitoring, scraping is the only option. The official API simply doesn't return the data you need.

Common Pitfalls

Using requests without TLS fingerprint masking — Google can fingerprint the TLS handshake of Python's requests library. For high-volume scraping, consider curl_cffi or tls_client which mimic Chrome's TLS fingerprint. See our Playwright proxy guide for browser-based approaches that avoid this entirely.

Scraping google.com for non-US results — Real users in Germany use google.de, users in France use google.fr. Using google.com with a German IP is a subtle but detectable anomaly.

Not matching Accept-Language to IP geo — If your IP is in Japan, your Accept-Language should include ja. This is basic but frequently missed.

Parsing HTML structure changes — Google A/B tests layout constantly. Don't rely on specific CSS class names. Build resilient parsers that extract based on DOM hierarchy rather than class selectors, and monitor for parse failures.
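One way to parse by hierarchy rather than class names: treat any `h3` nested inside an outbound link as a result title. That structural pattern has been far more stable than `div.g` and friends, though this sketch is illustrative and not a guarantee against every layout variant:

```python
from bs4 import BeautifulSoup

def parse_results_resilient(html):
    """Extract results by structure (an h3 inside an a[href]), not by class names."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for h3 in soup.find_all('h3'):
        link = h3.find_parent('a', href=True)
        if link and link['href'].startswith('http'):
            results.append({'title': h3.get_text(strip=True), 'url': link['href']})
    return results
```

Pair this with a parse-failure alarm: if a page returns HTTP 200 but zero results, log the HTML for inspection instead of silently recording an empty SERP.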

For a complete scraping setup with browser automation and all anti-detection measures, check the Playwright proxy guide. To test your proxy setup before running at scale, use the proxy tester.
