This guide explains how to use Pydoll to scrape JavaScript-heavy websites, bypass Cloudflare, and scale with rotating proxies like Bright Data's.
- An Introduction to Pydoll
- Using Pydoll for Web Scraping: Complete Tutorial
- Bypassing Cloudflare With Pydoll
- Limitations of This Approach to Web Scraping
- Integrating Pydoll with Bright Data's Rotating Proxies
- Alternatives to Pydoll for Web Scraping
This guide only explains the basics of Pydoll. You can learn more about its functionality and what makes it stand out as a Python web scraping library in this blog post.
Pydoll is a Python browser automation library built for web scraping, testing, and automating repetitive tasks. What sets it apart is that it eliminates the need for traditional web drivers: it connects directly to browsers through the DevTools Protocol, with no external dependencies required.
- Zero webdrivers: No browser driver dependency for easier setup and fewer version issues.
- Async-first: Fully asynchronous, built on asyncio for high concurrency and efficiency.
- Human-like interactions: Realistic typing, mouse movements, and clicks to evade bot detection.
- Event-driven: React to browser, DOM, network, and lifecycle events in real-time.
- Multi-browser support: Works with Chrome, Edge, and other Chromium browsers via a unified API.
- Screenshot & PDF export: Capture pages or elements, and generate high-quality PDFs.
- Native Cloudflare bypass: Bypass Cloudflare without third-party tools (given good IP reputation).
- Concurrent scraping: Scrape multiple pages/sites in parallel to speed up tasks.
- Advanced keyboard control: Simulate real typing with control over keys and timing.
- Powerful event system: Monitor and handle network and page events dynamically.
- File upload support: Automate uploads via inputs or file chooser dialogs.
- Proxy integration: Rotate IPs and geotarget using proxies.
- Request interception: Intercept, modify, or block HTTP requests and responses.
Learn more in the official documentation.
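To get a feel for the async-first design and the concurrent scraping feature, here is a minimal sketch that scrapes two pages in parallel. Treat it as an illustration rather than the library's canonical pattern: it only relies on calls shown later in this tutorial (Chrome(), start(), get_page(), go_to(), find_element(), get_element_text()), and it launches one browser per URL to avoid assuming any tab-creation API.
import asyncio
from pydoll.browser.chrome import Chrome
from pydoll.constants import By

async def scrape_heading(url):
    # One browser instance per URL keeps the sketch simple
    async with Chrome() as browser:
        await browser.start()
        page = await browser.get_page()
        await page.go_to(url)
        heading = await page.find_element(By.CSS_SELECTOR, "h1")
        return await heading.get_element_text()

async def main():
    urls = [
        "https://quotes.toscrape.com/",
        "https://quotes.toscrape.com/tag/love/",
    ]
    # Run both scrapes concurrently and collect the results
    headings = await asyncio.gather(*(scrape_heading(url) for url in urls))
    print(headings)

asyncio.run(main())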
In this section, you'll learn how to use Pydoll to extract data from the asynchronous, JavaScript-powered version of "Quotes to Scrape":
This webpage dynamically renders quote elements using JavaScript after a brief delay. Consequently, conventional scraping tools won't function properly. To extract content from this page, you need a browser automation solution like Pydoll.
Before you start, ensure you have Python 3+ installed on your system. If not, download it and follow the installation guide.
Next, run this command to create a directory for your scraping project:
mkdir pydoll-scraper
The pydoll-scraper folder will serve as your project directory.
Navigate to the folder in your terminal and initialize a Python virtual environment within it:
cd pydoll-scraper
python -m venv venv
Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are excellent choices.
Create a scraper.py file in the project folder.
At this point, scraper.py is just an empty Python script, but it will soon contain the data parsing logic.
Next, activate the virtual environment in your IDE's terminal. On Linux or macOS, execute:
source venv/bin/activate
Similarly, on Windows, run:
venv\Scripts\activate
Great! Your Python environment is now configured for web scraping with Pydoll.
In your activated virtual environment, install Pydoll through the pydoll-python package:
pip install pydoll-python
Now, add the following code to the scraper.py file to begin using Pydoll:
import asyncio
from pydoll.browser.chrome import Chrome

async def main():
    async with Chrome() as browser:
        # Launch the Chrome browser and open a new page
        await browser.start()
        page = await browser.get_page()

        # scraping logic...

# Execute the async scraping function
asyncio.run(main())
Note that Pydoll provides an asynchronous API for web scraping and requires the use of Python's asyncio standard library.
Invoke the go_to() method available through the page object to browse to the target website:
await page.go_to("https://quotes.toscrape.com/js-delayed/?delay=2000")
The ?delay=2000 query parameter instructs the page to load the desired data dynamically after a 2-second delay. This is a feature of the target sandbox site, designed to help test dynamic scraping behavior.
Now, try executing the above script. If everything works correctly, Pydoll will:
- Launch a Chrome instance
- Navigate to the target site
- Close the browser window immediately—since there's no additional logic in the script yet
This is what you should briefly see before it closes:
Examine the last image from the previous step. It shows the content of the page controlled by Pydoll in the Chrome instance. You'll notice it's completely empty—no data has loaded yet.
This occurs because the target site dynamically renders data after a 2-second delay. While this delay is specific to the example site, waiting for page elements to render is a common requirement when scraping SPAs (single-page applications) and other dynamic websites that depend on AJAX.
Learn more in our article about scraping dynamic websites with Python.
To handle this common scenario, Pydoll offers a built-in waiting mechanism through the wait_element() method, which waits for a single element to appear (with timeout support). This method supports CSS selectors, XPath expressions, and more—similar to how Selenium's By object works.
Let's study the HTML structure of the target page. Open it in your browser, wait for the quotes to load, right-click one of the quotes, and select the "Inspect" option:
In the DevTools panel, you'll observe that each quote is wrapped in a <div> with the class quote. This means you can target them using the CSS selector:
.quote
Now, use Pydoll to wait for these elements to appear before proceeding:
await page.wait_element(By.CSS_SELECTOR, ".quote", timeout=3)
Don't forget to import By:
from pydoll.constants import By
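The same wait can also be expressed with an XPath selector. This is a minimal variation assuming the By enum exposes an XPATH constant alongside the CSS_SELECTOR constant used above:
# Wait up to 3 seconds for the first quote, targeted via XPath this time
await page.wait_element(By.XPATH, "//div[@class='quote']", timeout=3)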
Run the script again, and this time you'll notice that Pydoll waits for the quote elements to load before closing the browser.
Remember, the target page contains multiple quotes. Since you want to extract all of them, you need a data structure to store this information. A simple array works perfectly, so initialize one:
quotes = []
To locate elements on the page, Pydoll provides two useful methods:
- find_element(): Locates the first matching element
- find_elements(): Locates all matching elements
Just like with wait_element(), these methods accept a selector using the By object.
So, select all quote elements on the page with:
quote_elements = await page.find_elements(By.CSS_SELECTOR, ".quote")
Next, iterate through the elements and prepare to apply your scraping logic:
for quote_element in quote_elements:
    # Scraping logic...
Begin by examining a single quote element:
As evident from the HTML above, each quote element contains:
- The quote text in a .text node
- The author in the .author element
- A list of tags in the .tag elements
Implement the scraping logic to select these elements and extract the relevant data:
# Extract the quote text (and remove curly quotes)
text_element = await quote_element.find_element(By.CSS_SELECTOR, ".text")
text = (await text_element.get_element_text()).replace("“", "").replace("”", "")
# Extract the author name
author_element = await quote_element.find_element(By.CSS_SELECTOR, ".author")
author = await author_element.get_element_text()
# Extract all associated tags
tag_elements = await quote_element.find_elements(By.CSS_SELECTOR, ".tag")
tags = [await tag_element.get_element_text() for tag_element in tag_elements]
Note: The replace() method removes the unnecessary curly double quotes from the extracted quote text.
Now, use the scraped data to create a new dictionary object and add it to the quotes array:
# Populate a new quote with the scraped data
quote = {
"text": text,
"author": author,
"tags": tags
}
# Append the extracted quote to the list
quotes.append(quote)
Currently, the scraped data resides in a Python list. Make it easier to share and analyze by exporting it to a human-readable format like CSV.
Use Python to generate a new file named quotes.csv and populate it with the extracted data:
with open("quotes.csv", "w", newline="", encoding="utf-8") as csvfile:
# Add the header
fieldnames = ["text", "author", "tags"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
# Populate the output file with the scraped data
writer.writeheader()
for quote in quotes:
writer.writerow(quote)
Remember to import csv from the Python Standard Library:
import csv
The complete scraper.py file should now contain:
import asyncio
from pydoll.browser.chrome import Chrome
from pydoll.constants import By
import csv

async def main():
    async with Chrome() as browser:
        # Launch the Chrome browser and open a new page
        await browser.start()
        page = await browser.get_page()

        # Navigate to the target page
        await page.go_to("https://quotes.toscrape.com/js-delayed/?delay=2000")

        # Wait up to 3 seconds for the quote elements to appear
        await page.wait_element(By.CSS_SELECTOR, ".quote", timeout=3)

        # Where to store the scraped data
        quotes = []

        # Select all quote elements
        quote_elements = await page.find_elements(By.CSS_SELECTOR, ".quote")

        # Iterate over them and scrape data from them
        for quote_element in quote_elements:
            # Extract the quote text (and remove curly quotes)
            text_element = await quote_element.find_element(By.CSS_SELECTOR, ".text")
            text = (await text_element.get_element_text()).replace("“", "").replace("”", "")

            # Extract the author
            author_element = await quote_element.find_element(By.CSS_SELECTOR, ".author")
            author = await author_element.get_element_text()

            # Extract all tags
            tag_elements = await quote_element.find_elements(By.CSS_SELECTOR, ".tag")
            tags = [await tag_element.get_element_text() for tag_element in tag_elements]

            # Populate a new quote with the scraped data
            quote = {
                "text": text,
                "author": author,
                "tags": tags
            }

            # Append the extracted quote to the list
            quotes.append(quote)

        # Export the scraped data to CSV
        with open("quotes.csv", "w", newline="", encoding="utf-8") as csvfile:
            # Add the header
            fieldnames = ["text", "author", "tags"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

            # Populate the output file with the scraped data
            writer.writeheader()
            for quote in quotes:
                writer.writerow(quote)

# Execute the async scraping function
asyncio.run(main())
Test the script by executing:
python scraper.py
Once completed, a quotes.csv file will appear in your project folder.
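To quickly sanity-check the export without opening the file, you can read it back with the standard library. This is just a convenience sketch; the field names match the header written above:
import csv

with open("quotes.csv", newline="", encoding="utf-8") as csvfile:
    rows = list(csv.DictReader(csvfile))

print(f"Exported {len(rows)} quotes")
if rows:
    print(rows[0]["author"])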
When interacting with websites through browser automation tools, one of the major challenges you'll encounter is web application firewalls (WAFs), such as Cloudflare.
When your requests are identified as coming from an automated browser, these systems often display a CAPTCHA. In certain cases, they present it to all visitors during their initial visit to the site.
Bypassing CAPTCHAs in Python is challenging. However, techniques exist to convince Cloudflare you're a legitimate user, preventing CAPTCHA challenges from appearing initially. Pydoll addresses this by providing a dedicated API for exactly this purpose.
To demonstrate this functionality, we'll use the "Antibot Challenge" test page from the ScrapingCourse website:
As shown, the page consistently performs the Cloudflare JavaScript Challenge. After bypassing it, sample content appears confirming that the anti-bot protection has been defeated.
Pydoll offers two approaches for handling Cloudflare:
- Context manager approach: Manages the anti-bot challenge synchronously, pausing script execution until the challenge is resolved.
- Background processing approach: Handles the anti-bot asynchronously in the background.
We'll cover both methods. However, as mentioned in the official documentation, be aware that Cloudflare bypassing isn't guaranteed. Factors like IP reputation or browsing history can affect success.
For more sophisticated techniques, check our comprehensive tutorial on scraping Cloudflare-protected sites.
To let Pydoll automatically handle the Cloudflare anti-bot challenge, use the expect_and_bypass_cloudflare_captcha() method like this:
import asyncio
from pydoll.browser.chrome import Chrome
from pydoll.constants import By

async def main():
    async with Chrome() as browser:
        # Launch the Chrome browser and open a new page
        await browser.start()
        page = await browser.get_page()

        # Wait for the Cloudflare challenge to be executed
        async with page.expect_and_bypass_cloudflare_captcha():
            # Connect to the Cloudflare-protected page:
            await page.go_to("https://www.scrapingcourse.com/antibot-challenge")
            print("Waiting for Cloudflare anti-bot to be handled...")

        # This code runs only after the anti-bot is successfully bypassed
        print("Cloudflare anti-bot bypassed! Continuing with automation...")

        # Print the text message on the success page
        await page.wait_element(By.CSS_SELECTOR, "#challenge-title", timeout=3)
        success_element = await page.find_element(By.CSS_SELECTOR, "#challenge-title")
        success_text = await success_element.get_element_text()
        print(success_text)

asyncio.run(main())
When you execute this script, the Chrome window will automatically overcome the challenge and load the target page.
The output will be:
Waiting for Cloudflare anti-bot to be handled...
Cloudflare anti-bot bypassed! Continuing with automation...
You bypassed the Antibot challenge! :D
If you prefer not to halt script execution while Pydoll addresses the Cloudflare challenge, you can use the enable_auto_solve_cloudflare_captcha() and disable_auto_solve_cloudflare_captcha() methods like this:
import asyncio
from pydoll.browser import Chrome
from pydoll.constants import By

async def main():
    async with Chrome() as browser:
        # Launch the Chrome browser and open a new page
        await browser.start()
        page = await browser.get_page()

        # Enable automatic captcha solving before navigating
        await page.enable_auto_solve_cloudflare_captcha()

        # Connect to the Cloudflare-protected page:
        await page.go_to("https://www.scrapingcourse.com/antibot-challenge")
        print("Page loaded, Cloudflare anti-bot will be handled in the background...")

        # Disable anti-bot auto-solving when no longer needed
        await page.disable_auto_solve_cloudflare_captcha()

        # Print the text message on the success page
        await page.wait_element(By.CSS_SELECTOR, "#challenge-title", timeout=3)
        success_element = await page.find_element(By.CSS_SELECTOR, "#challenge-title")
        success_text = await success_element.get_element_text()
        print(success_text)

asyncio.run(main())
This approach enables your scraper to perform other operations while Pydoll resolves the Cloudflare anti-bot challenge in the background.
This time, the output will be:
Page loaded, Cloudflare anti-bot will be handled in the background...
You bypassed the Antibot challenge! :D
With Pydoll—or any scraping tool—sending too many requests will likely get you blocked by the target server. This happens because most websites implement rate limiting to prevent bots (like your scraping script) from overwhelming their servers with excessive requests.
This represents a standard anti-scraping and anti-DDoS strategy. Understandably, website owners want to protect their sites from automated traffic floods.
Even when following best practices such as respecting robots.txt, making numerous requests from a single IP address can still trigger suspicion. Consequently, you might encounter 403 Forbidden or 429 Too Many Requests errors.
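Before reaching for proxies, a common stopgap is to slow down and retry failed navigations with an increasing delay. The sketch below illustrates that exponential backoff pattern; whether go_to() actually raises an exception when a request is throttled depends on the site and on Pydoll's behavior, so treat the error handling as an assumption rather than a guarantee:
import asyncio

async def go_to_with_backoff(page, url, max_attempts=4):
    # Retry navigation, waiting 1s, 2s, 4s, ... between attempts
    for attempt in range(max_attempts):
        try:
            await page.go_to(url)
            return True
        except Exception:
            await asyncio.sleep(2 ** attempt)
    return False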
The most effective solution is rotating your IP address using a web proxy.
For those unfamiliar with this concept, a web proxy functions as an intermediary between your scraper and the target website. It forwards your requests and returns responses, making it appear to the target site that traffic originates from the proxy—not your actual device.
This technique not only helps conceal your real IP but also assists in bypassing geo-restrictions, among many other use cases.
Various proxy types exist. To avoid blocking, you need a premium provider offering authentic rotating proxies like Bright Data.
In the following section, you'll learn how to combine Bright Data's rotating proxies with Pydoll for more effective web scraping—particularly at scale.
Let's implement Bright Data's residential proxies with Pydoll.
If you don't have an account yet, register for Bright Data. Otherwise, proceed and sign in to access your dashboard:
From the dashboard, select the "Get proxy products" button:
You'll be directed to the "Proxies & Scraping Infrastructure" page:
In the table, locate the "Residential" row and click it:
You'll arrive at the residential proxy configuration page:
For first-time users, follow the setup wizard to configure the proxy according to your requirements.
Navigate to the "Overview" tab and find your proxy's host, port, username, and password:
Utilize those details to construct your proxy URL:
proxy_url = "<brightdata_proxy_username>:<brightdata_proxy_password>@<brightdata_proxy_host>:<brightdata_proxy_port>"
Replace the placeholders (<brightdata_proxy_username>, <brightdata_proxy_password>, <brightdata_proxy_host>, <brightdata_proxy_port>) with your actual proxy credentials.
Ensure you activate the proxy product by switching the toggle from "Off" to "On":
With your proxy configured, here's how to incorporate it into Pydoll using its built-in proxy configuration capabilities:
import asyncio
from pydoll.browser.chrome import Chrome
from pydoll.browser.options import Options
from pydoll.constants import By

async def main():
    # Create browser options
    options = Options()

    # The URL of your Bright Data proxy
    proxy_url = "<brightdata_proxy_username>:<brightdata_proxy_password>@<brightdata_proxy_host>:<brightdata_proxy_port>"  # Replace it with your proxy URL

    # Configure the proxy integration option
    options.add_argument(f"--proxy-server={proxy_url}")

    # To avoid potential SSL errors
    options.add_argument("--ignore-certificate-errors")

    # Start browser with proxy configuration
    async with Chrome(options=options) as browser:
        await browser.start()
        page = await browser.get_page()

        # Visit a special page that returns the IP of the caller
        await page.go_to("https://httpbin.io/ip")

        # Extract the page content containing only the IP of the incoming
        # request and print it
        body_element = await page.find_element(By.CSS_SELECTOR, "body")
        body_text = await body_element.get_element_text()
        print(f"Current IP address: {body_text}")

# Execute the async scraping function
asyncio.run(main())
Each time you run this script, you'll observe a different exit IP address, thanks to Bright Data's proxy rotation.
Note: Typically, Chrome's --proxy-server flag doesn't support authenticated proxies directly. However, Pydoll's advanced proxy manager overcomes this limitation, allowing password-protected proxy server usage.
While Pydoll is certainly a powerful web scraping library, especially for browser automation with built-in anti-bot circumvention features, other valuable tools exist.
Here are several strong Pydoll alternatives worth exploring:
- SeleniumBase: A Python framework built upon Selenium/WebDriver APIs, providing a professional-grade toolkit for web automation. It supports everything from end-to-end testing to sophisticated scraping workflows.
- Undetected ChromeDriver: A modified version of ChromeDriver engineered to avoid detection by popular anti-bot services like Imperva, DataDome, and Distil Networks. Ideal for stealthy scraping when using Selenium.
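As a point of comparison, here is a minimal Undetected ChromeDriver sketch. It assumes the undetected-chromedriver package is installed, and the target URL is just the sandbox site used earlier:
import undetected_chromedriver as uc

# Launch a patched Chrome that hides common automation fingerprints
driver = uc.Chrome()
try:
    driver.get("https://quotes.toscrape.com/")
    print(driver.title)
finally:
    driver.quit()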
Using Pydoll without IP rotation mechanisms can lead to inconsistent results. To make web scraping reliable and scalable, try Bright Data's proxy networks that include datacenter proxies, residential proxies, ISP proxies, and mobile proxies.
Create an account and begin testing our proxies at no cost today!