Google is the hardest public website to scrape at scale. Not because of CAPTCHAs (those are a symptom), but because Google runs one of the most sophisticated IP reputation systems on the internet. They maintain behavioral fingerprints per IP, track request cadence across sessions, and flag anomalies faster than most anti-bot vendors.
Here's what actually happens when you scrape Google with different proxy types:
| Proxy type | Requests before CAPTCHA | Requests before hard block | Notes |
|---|---|---|---|
| Datacenter (fresh IP) | 5–15 | 30–80 | Google maintains datacenter ASN blocklists |
| Datacenter (known range) | 1–3 | 5–10 | Immediate suspicion on AWS, GCP, Azure ranges |
| Residential (rotating) | 200–500+ | Rare with proper pacing | Depends on ISP diversity |
| Residential (sticky) | 50–100 | 150–300 | Better for paginated results |
These numbers come from testing 10,000 queries across each proxy type over a week in February 2026.
## Why Datacenter IPs Fail on Google
Google doesn't just check if your IP is residential or datacenter. They cross-reference multiple signals:
- **ASN reputation** — Google maintains internal scores for every ASN. AWS (AS16509), Google Cloud (AS15169), and Hetzner (AS24940) are permanently flagged. Even a brand-new IP from these ranges gets elevated scrutiny.
- **Request fingerprint consistency** — Google checks whether the HTTP headers, TLS fingerprint, and cookie behavior match what a real browser from that IP's ISP would produce. A Python `requests` call from a Comcast residential IP still looks suspicious if the TLS fingerprint screams `python-requests/2.31`.
- **Query pattern analysis** — Real users don't search 200 keywords per minute from the same IP. Google's rate detection is per-IP and per-session, and they correlate across IPs from the same subnet.
- **Geographic consistency** — If your IP geolocates to London but your Accept-Language header says `en-US` and your timezone cookie says `America/Chicago`, that's a signal.
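To keep these signals aligned, you can derive the Accept-Language header and expected timezone from the proxy's exit country before building a request. A minimal sketch, assuming a simplified country-to-locale mapping (the mapping and helper below are illustrative, not from any library):

```python
# Hypothetical helper: choose headers consistent with the proxy exit country.
# The mapping is deliberately small; real locales vary by region within a country.
LOCALE_MAP = {
    'US': {'accept_language': 'en-US,en;q=0.9', 'timezone': 'America/Chicago'},
    'GB': {'accept_language': 'en-GB,en;q=0.9', 'timezone': 'Europe/London'},
    'DE': {'accept_language': 'de-DE,de;q=0.9,en;q=0.5', 'timezone': 'Europe/Berlin'},
    'JP': {'accept_language': 'ja-JP,ja;q=0.9,en;q=0.5', 'timezone': 'Asia/Tokyo'},
}

def headers_for_country(country, base_headers=None):
    """Return (headers, timezone) where Accept-Language matches the proxy geo."""
    locale = LOCALE_MAP.get(country, LOCALE_MAP['US'])  # fall back to US defaults
    headers = dict(base_headers or {})
    headers['Accept-Language'] = locale['accept_language']
    return headers, locale['timezone']
```

If you also drive a browser, the returned timezone is what you'd set as the browser's timezone so cookies and JavaScript agree with the IP.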
## Basic Google Scraper with Residential Proxies
```python
import requests
from bs4 import BeautifulSoup
import time
import random

PROXY = {
    'http': 'http://your-username-country-US:your-password@gate.proxylabs.app:8080',
    'https': 'http://your-username-country-US:your-password@gate.proxylabs.app:8080',
}

# Headers matching a current desktop Chrome on Windows
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

def scrape_google(query, num_results=10):
    url = 'https://www.google.com/search'
    params = {
        'q': query,
        'num': num_results,
        'hl': 'en',   # interface language
        'gl': 'us',   # geolocation hint
    }
    response = requests.get(url, params=params, headers=HEADERS, proxies=PROXY, timeout=15)
    if response.status_code == 429:
        print(f"Rate limited. Status: {response.status_code}")
        return None
    if response.status_code != 200:
        print(f"Failed: {response.status_code}")
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    results = []
    for g in soup.select('div.g'):
        title_el = g.select_one('h3')
        link_el = g.select_one('a[href]')
        snippet_el = g.select_one('div[data-sncf]') or g.select_one('.VwiC3b')
        if title_el and link_el:
            results.append({
                'title': title_el.text,
                'url': link_el['href'],
                'snippet': snippet_el.text if snippet_el else '',
            })
    return results
```
This works for a handful of queries. For anything beyond 50 queries, you need rate limiting and session management.
## Rate Limiting Strategy That Actually Works
The single biggest mistake in Google scraping is fixed delays. If you `time.sleep(2)` between every request, that's more suspicious than variable timing, because real humans don't operate on fixed intervals.
```python
import random
import time

class GoogleScraper:
    def __init__(self, proxy_username, proxy_password):
        self.proxy_username = proxy_username
        self.proxy_password = proxy_password
        self.session_count = 0
        self.request_count = 0

    def _get_proxy(self, country='US', session_id=None):
        username = self.proxy_username
        if country:
            username += f'-country-{country}'
        if session_id:
            username += f'-session-{session_id}'
        proxy_url = f'http://{username}:{self.proxy_password}@gate.proxylabs.app:8080'
        return {'http': proxy_url, 'https': proxy_url}

    def _human_delay(self):
        """Mimic human browsing: mostly short pauses, occasional longer ones."""
        base = random.uniform(3, 7)
        # 15% chance of a longer pause (reading results, thinking)
        if random.random() < 0.15:
            base += random.uniform(10, 25)
        # 5% chance of a very long pause (distracted)
        if random.random() < 0.05:
            base += random.uniform(30, 60)
        time.sleep(base)

    def _rotate_session(self):
        """New session every 15-25 requests to avoid per-session detection."""
        self.session_count += 1
        return f'google-{self.session_count}-{random.randint(1000, 9999)}'

    def scrape_batch(self, queries, country='US'):
        results = {}
        session_id = self._rotate_session()
        requests_this_session = 0
        for query in queries:
            # Rotate session every 15-25 requests
            if requests_this_session > random.randint(15, 25):
                session_id = self._rotate_session()
                requests_this_session = 0
                # Longer pause on session rotation
                time.sleep(random.uniform(10, 20))
            proxy = self._get_proxy(country=country, session_id=session_id)
            result = self._single_query(query, proxy)
            if result is None:
                # Rate limited — back off, rotate, retry
                print(f"Rate limited on: {query}. Backing off...")
                time.sleep(random.uniform(30, 60))
                session_id = self._rotate_session()
                proxy = self._get_proxy(country=country, session_id=session_id)
                result = self._single_query(query, proxy)
            results[query] = result
            requests_this_session += 1
            self._human_delay()
        return results

    def _single_query(self, query, proxy):
        # Same request/parse logic as scrape_google() above
        pass
```
## The Rate Limits That Matter
Through testing, here are the thresholds we've observed:
| Behavior | Threshold | Consequence |
|---|---|---|
| Requests per IP per minute | >8 | CAPTCHA on next request |
| Requests per IP per hour | >100 | Temporary soft block (30 min) |
| Requests per session cookie | >30 | Session flagged, all subsequent requests get CAPTCHA |
| Identical User-Agent across IPs | >50 requests/hr | Cross-IP correlation flag |
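The per-IP-per-minute threshold above can be enforced client-side with a small sliding-window limiter in front of each worker. A sketch, with the 8-requests-per-60-seconds cap taken from the table (the class itself is illustrative, not a library API):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Refuses a request when more than `limit` were made in the last `window` seconds."""
    def __init__(self, limit=8, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock          # injectable for testing
        self.timestamps = deque()   # send times within the current window

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Call `allow()` before each query; on `False`, sleep until the oldest timestamp ages out, or rotate to a fresh session so the next request leaves from a different IP.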
## Handling Google's CAPTCHA
When Google serves a CAPTCHA, it redirects (302) to a `sorry/index` page, which itself responds with a 429. Don't try to solve it — rotate the IP and move on.
```python
def handle_response(self, response, query, proxy):
    if response.status_code == 429 or 'sorry/index' in response.url:
        return 'captcha'
    if response.status_code == 200 and 'unusual traffic' in response.text.lower():
        return 'soft_block'
    if response.status_code == 200:
        return 'success'
    return 'error'
```
CAPTCHA-solving services exist, but for SERP scraping they're not worth it. At $2-3 per 1000 CAPTCHAs and a solve time of 10-30 seconds each, it's cheaper and faster to just use residential proxies with proper pacing and avoid CAPTCHAs entirely.
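A rough back-of-the-envelope makes that trade-off concrete. The solver price, CAPTCHA rate, and per-GB bandwidth price below are illustrative assumptions, not quotes:

```python
# Hypothetical cost comparison: solving CAPTCHAs vs. avoiding them with pacing.
def captcha_route_cost(queries, captcha_rate=0.5, price_per_1000=2.5):
    """Solver cost if `captcha_rate` of queries trigger a CAPTCHA."""
    return queries * captcha_rate * price_per_1000 / 1000

def residential_route_cost(queries, kb_per_query=75, price_per_gb=5.0):
    """Bandwidth cost of residential proxies at an assumed per-GB price."""
    return queries * kb_per_query / 1_000_000 * price_per_gb
```

Under these assumptions, 10,000 queries/day costs a few dollars in residential bandwidth, while solving CAPTCHAs on half of them costs several times more, before counting the 10-30 seconds of latency each solve adds.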
## Country-Targeted SERP Scraping
Google returns different results based on the IP's geolocation, not just the `gl` parameter. The `gl` parameter is a hint, but Google gives precedence to IP geolocation when they conflict. If you need accurate local results for SEO rank tracking, the IP must match the target country.
```python
# Scrape UK SERPs with a UK residential IP
proxy_uk = self._get_proxy(country='GB', session_id='uk-serp-001')

# Scrape German SERPs — note: use google.de for better accuracy.
# German users typically use google.de, not google.com.
params = {
    'q': query,
    'num': 10,
    'hl': 'de',
    'gl': 'de',
}
# HEADERS_DE: the HEADERS dict from earlier with a German Accept-Language
# ('de-DE,de;q=0.9,en;q=0.5')
response = requests.get('https://www.google.de/search', params=params,
                        headers=HEADERS_DE, proxies=self._get_proxy(country='DE'))
```
For city-level accuracy (e.g., "plumber near me" results for Chicago), use city-level geo-targeting:
```python
proxy_chicago = self._get_proxy(country='US', session_id='chi-local')
# Add city targeting in the proxy username
username = f'{self.proxy_username}-country-US-city-Chicago-session-chi-local'
```
ProxyLabs supports city-level targeting in 195+ countries, which matters for local SEO monitoring where results shift between cities in the same state.
## Scaling to 10K+ Queries/Day
At scale, the bottleneck isn't proxy quality — it's orchestration. You need parallel workers with independent sessions and coordinated rate limiting.
```python
from concurrent.futures import ThreadPoolExecutor

class ScaledGoogleScraper:
    def __init__(self, proxy_username, proxy_password, max_workers=5):
        self.scraper = GoogleScraper(proxy_username, proxy_password)
        self.max_workers = max_workers

    def scrape_all(self, queries, country='US'):
        # Split queries round-robin across workers; the executor's max_workers
        # already caps how many batches run concurrently. Note the workers share
        # one GoogleScraper — per-worker instances would isolate session counters.
        chunks = [queries[i::self.max_workers] for i in range(self.max_workers)]
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            futures = [
                executor.submit(self.scraper.scrape_batch, chunk, country)
                for chunk in chunks
            ]
        results = {}
        for future in futures:
            results.update(future.result())
        return results
```
Key numbers for planning:
- 5 workers × 8 requests/min = 40 queries/min = ~2,400/hour
- At 3-7 second delays, each worker peaks at roughly 10-15 queries/min
- Factor in backoffs and you'll average 8,000-12,000 queries/day with 5 workers
- Bandwidth per query: ~50-100KB = negligible on ProxyLabs' per-GB pricing
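Those planning numbers can be sanity-checked with a small throughput model. The ~10-second effective delay below folds in the occasional long pauses from `_human_delay`, and the 0.7 discount for backoffs and session-rotation pauses is an assumption:

```python
def daily_throughput(workers=5, avg_delay_s=10.0, backoff_factor=0.7, hours_active=8):
    """Rough queries/day: parallel workers each issue one query per
    `avg_delay_s` seconds (including humanized pauses), discounted by
    `backoff_factor` for rate-limit backoffs and session rotations."""
    per_worker_per_hour = 3600 / avg_delay_s
    return int(workers * per_worker_per_hour * backoff_factor * hours_active)
```

With the defaults this lands at about 10,000 queries/day for 5 workers, consistent with the 8,000-12,000 range above; stretch `hours_active` toward 24 and you trade throughput for a more machine-like traffic profile.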
## What About Google's API?
The Custom Search JSON API exists and is legitimate. It gives you 100 free queries/day and $5 per 1,000 after that. If you need fewer than 1,000 queries/day and only care about the top 10 results, the API is worth considering. It won't give you SERP features (People Also Ask, featured snippets, local packs) — only organic results.
For rank tracking at scale, ad intelligence, or SERP feature monitoring, scraping is the only option. The official API simply doesn't return the data you need.
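If the API route fits your volume, a request is a single GET against the Custom Search endpoint. A minimal sketch; the API key and search engine ID (`cx`) are placeholders you would create in Google Cloud Console:

```python
from urllib.parse import urlencode

CSE_ENDPOINT = 'https://www.googleapis.com/customsearch/v1'

def build_cse_url(api_key, cx, query, num=10):
    """Build a Custom Search JSON API request URL (the API caps `num` at 10)."""
    params = {'key': api_key, 'cx': cx, 'q': query, 'num': num}
    return f'{CSE_ENDPOINT}?{urlencode(params)}'

# Fetch with any HTTP client, e.g.:
#   data = requests.get(build_cse_url(KEY, CX, 'rank tracking')).json()
#   organic = [(item['title'], item['link']) for item in data.get('items', [])]
```

No proxies, headers, or pacing needed, which is exactly why it's worth considering below 1,000 queries/day.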
## Common Pitfalls
**Using `requests` without TLS fingerprint masking** — Google can fingerprint the TLS handshake of Python's `requests` library. For high-volume scraping, consider `curl_cffi` or `tls_client`, which mimic Chrome's TLS fingerprint. See our Playwright proxy guide for browser-based approaches that avoid this entirely.

**Scraping google.com for non-US results** — Real users in Germany use google.de; users in France use google.fr. Using google.com with a German IP is a subtle but detectable anomaly.

**Not matching Accept-Language to IP geo** — If your IP is in Japan, your Accept-Language should include `ja`. This is basic but frequently missed.

**Parsing HTML structure changes** — Google A/B tests layouts constantly. Don't rely on specific CSS class names. Build resilient parsers that extract based on DOM hierarchy rather than class selectors, and monitor for parse failures.
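To illustrate the last point, a parser can key off the `<a href>`/`<h3>` hierarchy Google has long used for organic results instead of class names. A stdlib-only sketch against a simplified SERP structure; real pages need more edge-case handling:

```python
from html.parser import HTMLParser

class SerpParser(HTMLParser):
    """Extract (title, url) pairs from <a href="..."><h3>Title</h3></a>
    patterns, ignoring CSS class names entirely."""
    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None        # href of the <a> we're currently inside
        self._in_h3 = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
        elif tag == 'h3' and self._href:
            self._in_h3 = True
            self._title_parts = []

    def handle_data(self, data):
        if self._in_h3:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3' and self._in_h3:
            self._in_h3 = False
            self.results.append((''.join(self._title_parts).strip(), self._href))
        elif tag == 'a':
            self._href = None
```

When Google shuffles class names in an A/B test, this keeps working as long as titles remain `<h3>` elements inside result links; a drop in `results` count is your parse-failure alarm.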
For a complete scraping setup with browser automation and all anti-detection measures, check the Playwright proxy guide. To test your proxy setup before running at scale, use the proxy tester.