How to Use Proxies for Web Scraping (2026 Guide)

What Web Scraping Needs from Proxies

Web scraping is the broadest proxy use case. The proxy requirements vary significantly by target site — a public news archive has very different bot-detection than Amazon or Google. The key dimensions:

Target difficulty: Sites like Amazon, Google, LinkedIn, and Cloudflare-protected domains run sophisticated bot-detection. Press-release pages, government data portals, and smaller e-commerce stores are much more permissive. Match proxy type to target difficulty.

Volume and throughput: High-volume scraping (millions of requests/day) requires large IP pools to avoid repeat-visit detection. Low-volume scraping can work with smaller, cheaper pools.

JavaScript rendering: Sites that load content via JavaScript (single-page apps, infinite scroll, dynamically loaded prices) require either a headless browser or a provider’s managed scraping API. Standard HTTP proxies do not render JavaScript.

Data freshness: Near-real-time scraping (price monitoring, news) requires low-latency proxies with fast rotation. Batch collection (research datasets, periodic snapshots) can tolerate higher latency for better cost-per-GB.

Recommended Proxies for Web Scraping

Provider	Why It Fits Web Scraping	Measured performance	Pricing from
Smartproxy	65M+ rotating residential; city-level geo; Site Unblocker API for JS-heavy targets	pending	~$8.50/GB
Bright Data	150M+ residential IPs; Datasets marketplace; Scraping Browser for JS rendering	pending	~$10.50/GB
Oxylabs	Web Scraper API handles JS, CAPTCHAs, and parsing in one managed service	pending	~$12/GB
IPRoyal	Budget residential; suitable for permissive targets and moderate volume	pending	~$7/GB

All performance figures are still being measured; see /benchmark/.

Setup: Standard Web Scraping Configuration

Choose the right proxy type for your target

Target	Recommended Proxy Type	Why
Amazon, Google SERP	Residential rotating	High detection; needs real consumer IPs
Cloudflare-protected sites	Residential + Scraping API	JS rendering + CAPTCHA handling required
Public government data	Datacenter	Low detection; cost-effective
Social media (public pages)	Residential	Platforms expect traffic from consumer IPs and flag other ranges
Travel aggregators	Residential + city geo	Geo-localized results
API endpoints (authenticated)	ISP / static residential	Stable sessions for OAuth flows
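
The table above can be encoded as a small lookup so a scraper picks a proxy tier per target. This is only a minimal sketch: the category keys and the proxy_type_for helper are hypothetical names, and falling back to rotating residential for unknown targets is an assumption (the safer but pricier default).

```python
# Hypothetical lookup mirroring the table above; the keys are illustrative labels.
PROXY_TYPE_BY_TARGET = {
    "amazon_google_serp": "residential rotating",
    "cloudflare_protected": "residential + scraping API",
    "public_government_data": "datacenter",
    "social_media_public": "residential",
    "travel_aggregator": "residential + city geo",
    "authenticated_api": "ISP / static residential",
}


def proxy_type_for(target_category: str) -> str:
    """Return the recommended proxy type; unknown targets get rotating residential."""
    return PROXY_TYPE_BY_TARGET.get(target_category, "residential rotating")


print(proxy_type_for("public_government_data"))
```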

Basic Python example (requests + proxy rotation)

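import requests

# Placeholder target; replace with the page you want to fetch
target_url = "https://www.example.com/"

# The provider gateway rotates the exit IP, so there is no IP list to manage
proxies = {
    "http":  "http://USERNAME:PASSWORD@gate.smartproxy.com:10000",
    "https": "http://USERNAME:PASSWORD@gate.smartproxy.com:10000",
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    # Add "br" only if the brotli package is installed; requests cannot decode it otherwise
    "Accept-Encoding": "gzip, deflate",
}

resp = requests.get(target_url, proxies=proxies, headers=headers, timeout=15)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page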

Scraping API for JS-heavy targets

If target pages load prices or content via JavaScript:

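# Oxylabs Web Scraper API example
import requests

payload = {
    "source": "universal",
    "url": "https://www.example.com/product",
    "render": "html",  # enables headless Chrome rendering
}

resp = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),
    json=payload,
    timeout=180,  # rendered requests can take far longer than plain HTTP fetches
)
resp.raise_for_status()  # fail loudly on 4xx/5xx before reading the result

# The rendered HTML is expected under results[0]["content"]; confirm against the provider docs
html = resp.json()["results"][0]["content"]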

Avoiding Blocks: Anti-Detection Basics

Rotate IPs per request: For stateless page fetches, send each request to a given target from a different IP instead of reusing one IP repeatedly. Most provider gateways handle this automatically.

Randomize request timing: Scraping at a fixed interval (for example, exactly once every second) is a bot signal. Add jitter: time.sleep(random.uniform(0.5, 2.5)).
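
The jitter above can be wrapped in a small helper so every request loop uses it the same way. A minimal sketch; polite_sleep is a hypothetical name, and the 0.5 to 2.5 second defaults simply reuse the range from the text.

```python
import random
import time


def polite_sleep(min_s: float = 0.5, max_s: float = 2.5) -> float:
    """Sleep for a random duration between min_s and max_s seconds and return it."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Call polite_sleep() between consecutive requests to the same domain; returning the delay makes it easy to log how long the scraper actually waited.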

Match headers to browser fingerprint: Use realistic browser headers (User-Agent, Accept-Language, Accept-Encoding, Sec-Fetch-*). Requests with missing standard headers are flagged by sophisticated detection.
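
As an illustration of a fuller header set, the sketch below builds headers resembling a top-level page navigation. The helper name browser_like_headers is hypothetical, and the values are assumptions modeled on what a desktop browser sends when a URL is typed directly; real browsers send additional headers, and headers alone do not cover TLS or JavaScript fingerprinting.

```python
def browser_like_headers(user_agent: str) -> dict:
    """Build headers resembling a direct, user-initiated page navigation."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        # Fetch metadata headers describing a top-level document navigation
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }
```

The returned dict can be passed as the headers argument of requests.get; keep the User-Agent consistent with the other headers for the whole session.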

Handle CAPTCHAs gracefully: Retry with a new IP rather than attempting CAPTCHA solving on the same IP. Managed scraping APIs (Oxylabs, Smartproxy Site Unblocker) handle CAPTCHAs internally.
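
The retry-with-a-new-IP idea can be sketched as a small loop. This assumes fetch sends the request through a rotating gateway, so each call exits from a different IP; fetch, is_challenge, and fetch_with_fresh_ip are hypothetical names supplied by the caller, not part of any provider SDK.

```python
from typing import Callable, Optional


def fetch_with_fresh_ip(
    fetch: Callable[[str], str],
    is_challenge: Callable[[str], bool],
    url: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call fetch(url) until the body is not a challenge page, or give up."""
    for _ in range(max_attempts):
        body = fetch(url)  # assumed to use a new exit IP on every call
        if not is_challenge(body):
            return body
    return None  # every attempt hit a challenge page
```

Returning None rather than raising leaves it to the caller to decide whether to queue the URL for later or hand it to a managed scraping API.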

Respect robots.txt for sensitive targets: While robots.txt is not legally binding in most jurisdictions, scraping disallowed paths on sensitive targets (user-generated content, medical/legal data) increases legal and ethical risk.
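
A robots.txt check can be done with Python's standard library before a path is queued. A minimal sketch, assuming the robots.txt text has already been downloaded; is_allowed, the sample rules, and the "my-scraper" agent name are illustrative.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "my-scraper", "https://example.com/private/profile"))
print(is_allowed(rules, "my-scraper", "https://example.com/press/"))
```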

Pitfalls for Web Scraping

Using datacenter proxies on guarded targets: Amazon, Google, and most major platforms actively block known datacenter ASNs. Datacenter proxies are appropriate for permissive targets only.

Not rotating at all: A single IP making hundreds of requests per hour will be rate-limited and blocked. Use provider rotation gateways.

Trusting status codes alone: A 200 response whose body is CAPTCHA HTML is not a success. Check the status code, then parse the response body to confirm you received actual data, not a challenge page.
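
One way to catch challenge pages that arrive with a 200 status is a simple body check. This is a heuristic sketch: the marker strings and the looks_like_challenge name are assumptions and should be tuned to what each target actually returns when it blocks a request.

```python
# Assumed marker phrases; adjust them per target site.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "unusual traffic")


def looks_like_challenge(status_code: int, body: str) -> bool:
    """Heuristically flag block or CAPTCHA pages, including ones served with HTTP 200."""
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

A stricter variant checks for the presence of an element you expect on a real page (for example, a price node) instead of the absence of challenge text.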

Scraping at unsustainable concurrency: High concurrency (100+ concurrent requests to the same domain) triggers rate-limiting on both the target site and sometimes the proxy provider. Limit concurrency per domain.
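
A per-domain cap can be sketched with one semaphore per domain. This assumes a thread-based scraper; DomainLimiter and the default of 5 in-flight requests are hypothetical choices, and fetch is whatever function performs the actual request.

```python
import threading
from urllib.parse import urlparse


class DomainLimiter:
    """Cap the number of in-flight requests per domain across threads."""

    def __init__(self, max_per_domain: int = 5):
        self._max = max_per_domain
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, url: str) -> threading.Semaphore:
        domain = urlparse(url).netloc
        with self._lock:  # guard creation so each domain gets exactly one semaphore
            if domain not in self._semaphores:
                self._semaphores[domain] = threading.Semaphore(self._max)
            return self._semaphores[domain]

    def run(self, url, fetch):
        """Call fetch(url) while holding one of the domain's slots."""
        with self._semaphore_for(url):
            return fetch(url)
```

Worker threads call limiter.run(url, fetch) instead of fetch(url); requests to different domains do not block each other, while requests to the same domain queue once the cap is reached.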

FAQ

Residential vs datacenter for web scraping: which do I choose?

Use residential for sites with active bot-detection (Amazon, Google, major retailers, social platforms). Use datacenter for permissive targets (government data, public archives, small business sites) where cost-per-GB matters more than block rate.

How many IPs do I need for web scraping?

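As a rough rule: 1 IP per 50-100 requests/hour per target domain. At 10,000 requests/hour spread evenly across 5 domains, each domain receives 2,000 requests/hour, which the rule translates to 20-40 IPs per domain, or 100-200 in use at once if each domain gets its own IPs. Plan for a pool of 1,000+ IPs so there is headroom for rotation and for IPs that get blocked. Rotating residential providers manage the pool for you: you do not specify a fixed IP count, just the bandwidth or request volume.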
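
The rule of thumb can be checked with quick arithmetic. The sketch below assumes the 10,000 requests/hour are spread evenly across the 5 domains; ips_per_domain is a hypothetical helper, and the result is the number of IPs in use at once, not the total pool size, which is usually sized larger for rotation headroom.

```python
import math


def ips_per_domain(requests_per_hour: float, per_ip_hourly_cap: int) -> int:
    """IPs needed so that no single IP exceeds the hourly cap on one domain."""
    return math.ceil(requests_per_hour / per_ip_hourly_cap)


per_domain_load = 10_000 / 5                 # 2,000 requests/hour per domain
low = ips_per_domain(per_domain_load, 100)   # optimistic cap: 100 requests/hour per IP
high = ips_per_domain(per_domain_load, 50)   # conservative cap: 50 requests/hour per IP
print(f"{low}-{high} IPs per domain, {low * 5}-{high * 5} across 5 domains")
```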

Do I need a headless browser for web scraping?

Only if the target site requires JavaScript rendering (which you can test by disabling JavaScript in your browser — if the content disappears, you need rendering). Many sites serve complete HTML on first load. Headless browsers are slower and more expensive; use them only when necessary.
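
The disable-JavaScript test can also be approximated in code: fetch the page with a plain HTTP client and check whether text you can see in the browser appears in the raw HTML. A minimal sketch; needs_js_rendering and the two sample pages are illustrative.

```python
def needs_js_rendering(raw_html: str, expected_text: str) -> bool:
    """Return True if text visible in the browser is missing from the raw HTML."""
    return expected_text.lower() not in raw_html.lower()


static_page = '<html><body><span class="price">$19.99</span></body></html>'
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(needs_js_rendering(static_page, "$19.99"))
print(needs_js_rendering(spa_shell, "$19.99"))
```

Pick a string that only appears once the content has loaded, such as a price or headline, rather than something in the page template.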

Is web scraping legal?

Scraping publicly accessible data is generally lawful in many jurisdictions, including the US, EU, UK, and Australia, for research, competitor monitoring, and internal business use, although the rules differ by country. In the US, hiQ v. LinkedIn (Ninth Circuit) and Van Buren v. United States (Supreme Court) indicate that accessing publicly available information is unlikely to violate the CFAA; neither case rules out other claims, such as breach of a site's terms of service or copyright infringement. Restrictions apply to: bypassing paywalls or authentication, scraping personal data without a lawful basis (GDPR), and republishing scraped content at scale. Consult a lawyer for your specific use case.

This article was produced with AI assistance and reviewed by an editor. As of 2026-05-31. Benchmark figures measured via free trial — see /benchmark/. Use proxies for legitimate purposes only.