Selenium Spider
Introduction to Selenium
Selenium is a powerful browser automation framework primarily used for web application testing, but it excels at web scraping tasks as well. Unlike traditional HTTP request libraries (such as Requests), Selenium allows you to simulate real browser behavior and automate interactions with web pages, including:
- Rendering JavaScript-heavy websites
- Filling out forms and clicking buttons
- Scrolling through infinite-loading pages
- Handling dynamic content that loads after page initialization
- Managing authentication and cookies
This makes Selenium particularly valuable for scraping modern, dynamic websites where content is loaded asynchronously or requires user interaction.
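As a rough illustration of the difference, the sketch below fetches the same placeholder URL once with Requests and once with a headless Chrome driven by Selenium; only the latter reflects content that JavaScript inserts after the initial response (the URL is an assumption for illustration only).

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = 'https://example.com/js-heavy-page'  # placeholder URL for illustration

# Plain HTTP request: returns the raw HTML, before any JavaScript runs
raw_html = requests.get(url, timeout=10).text

# Selenium: a real (headless) browser executes the page's JavaScript first
options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
rendered_html = browser.page_source  # the rendered DOM, including dynamic content
browser.quit()

print(len(raw_html), len(rendered_html))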
Integrating Selenium Spider in Crawlab
Below, we'll walk through how to integrate a Selenium spider into Crawlab and display the scraping results in the Crawlab interface. This guide demonstrates the process using a real-world example: scraping article information from TechCrunch, a popular technology news website.
Creating the Spider
- In the Crawlab web interface, navigate to the Spider list
- Click the "New Spider" button
- Fill in the following details:
- Name: "techcrunch"
- Execute Command: python main.py
- Parameter: (leave empty)
- Click "Confirm" to create the spider
Setting Up the Selenium Script
After creating the spider, we need to write the Python script that drives Selenium. Create a main.py file with the following content:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from crawlab import save_item
import time

# Configure Chrome options for headless operation
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

browser = webdriver.Chrome(options=chrome_options)

try:
    # Navigate to TechCrunch
    browser.get('https://techcrunch.com/category/startups/')

    # Wait for the page to load
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "article.post-block"))
    )

    # Find all article items on the page
    articles = browser.find_elements(By.CSS_SELECTOR, "article.post-block")

    # Extract data from each article
    for article in articles:
        try:
            # Extract title and URL
            title_element = article.find_element(By.CSS_SELECTOR, "h2 a")
            title = title_element.text
            url = title_element.get_attribute('href')

            # Extract author
            try:
                author = article.find_element(By.CSS_SELECTOR, "span.river-byline__authors a").text
            except NoSuchElementException:
                author = "Unknown"

            # Extract date
            try:
                date = article.find_element(By.CSS_SELECTOR, "time").get_attribute('datetime')
            except NoSuchElementException:
                date = None

            # Extract description/excerpt
            try:
                description = article.find_element(By.CSS_SELECTOR, "div.post-block__content").text
            except NoSuchElementException:
                description = None

            # Extract image URL
            try:
                image_element = article.find_element(By.CSS_SELECTOR, "figure img")
                image_url = image_element.get_attribute('src')
            except NoSuchElementException:
                image_url = None

            # Save extracted data to Crawlab
            save_item({
                'title': title,
                'url': url,
                'author': author,
                'date': date,
                'description': description,
                'image_url': image_url
            })
        except Exception as e:
            print(f"Error processing article: {e}")
            continue
finally:
    # Clean up resources
    browser.quit()
Key Configuration for Selenium in Crawlab
When running Selenium in a headless environment like Crawlab, certain Chrome options are essential for proper operation:
chrome_options.add_argument('--headless') # Run Chrome in headless mode (without GUI)
chrome_options.add_argument('--no-sandbox') # Bypass OS security model, required in containerized environments
chrome_options.add_argument('--disable-dev-shm-usage') # Overcome limited /dev/shm size in containers
These arguments are critical for running Selenium in Crawlab's containerized environment. Without them, your script may fail to execute properly!
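For reuse across spiders, these flags can be wrapped in a small helper. This is only an illustrative sketch: make_headless_browser and the --window-size value are my own choices, not Crawlab requirements.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_headless_browser():
    # Flags required for Crawlab's containerized environment
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # Optional (assumption): a fixed window size makes element layout more predictable
    options.add_argument('--window-size=1920,1080')
    return webdriver.Chrome(options=options)

browser = make_headless_browser()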
Integration with Crawlab
The save_item function from the Crawlab SDK is used to send the extracted data to Crawlab's database. It accepts a dictionary containing your scraped data and automatically handles the storage process.
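For example, each call stores one record, and the dictionary keys become the fields you see in the Data tab (the values below are placeholders for illustration):

from crawlab import save_item

# Each call stores one result record for the current task
save_item({
    'title': 'Example article',            # placeholder value
    'url': 'https://example.com/article',  # placeholder value
})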
Uploading the Script to Crawlab
- In the Crawlab interface, navigate to your "techcrunch" spider
- Click on the "Files" tab
- Click "New File" and create a file named
main.py - Paste the code above into the editor
- Click "Save" to store the file
Running the Spider
- Navigate to your spider's detail page
- Click the "Run" button
- Select the desired node for execution
- Click "Confirm" to start the spider
After the spider completes, you can view the collected articles in the "Data" tab of your spider's detail page.
Troubleshooting Selenium Spiders
If you encounter issues with your Selenium spider in Crawlab, check the following:
- Chrome/ChromeDriver compatibility: Ensure the Chrome version installed on your Crawlab nodes is compatible with the ChromeDriver version being used
- Missing dependencies: Verify that all required packages (selenium, crawlab-sdk) are installed on your nodes
- Memory issues: If the browser crashes, consider adding chrome_options.add_argument('--disable-gpu') and limiting the number of concurrent browser instances (see the sketch after this list)
- Selector errors: Confirm that your CSS/XPath selectors match the current structure of the target website
- Network issues: Check if your Crawlab nodes have internet access and can reach the target website
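If Selenium cannot locate a matching driver on its own, you can point it at a specific ChromeDriver binary explicitly. The sketch below assumes a driver at /usr/local/bin/chromedriver; adjust the path to wherever ChromeDriver is installed on your nodes.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')  # helps on memory/GPU-constrained nodes

# Assumed driver location; replace with the actual path on your node
service = Service('/usr/local/bin/chromedriver')
browser = webdriver.Chrome(service=service, options=chrome_options)
print(browser.capabilities.get('browserVersion'))  # quick version/compatibility check
browser.quit()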
Advanced Selenium Techniques
Waiting for Elements to Load
For pages that load content dynamically, use explicit waits:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to be present
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".dynamic-content"))
)
Handling Pagination
To scrape multiple pages, implement pagination logic:
# Example pagination code
page = 1
max_pages = 5

while page <= max_pages:
    # Process current page...

    # Find and click next page button
    try:
        next_button = browser.find_element(By.CSS_SELECTOR, ".pagination-next")
        next_button.click()
        page += 1
        # Wait for new page to load
        time.sleep(2)
    except Exception:
        # No more pages or error
        break
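If page loads are slow or variable, the fixed time.sleep(2) can be replaced with an explicit wait that detects when the previous page's content has been replaced. A sketch of that variant, reusing the selectors from above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Alternative to time.sleep(2): keep a handle to an element from the current
# page, click "next", then wait until that element is detached from the DOM
old_first_article = browser.find_element(By.CSS_SELECTOR, "article.post-block")
next_button = browser.find_element(By.CSS_SELECTOR, ".pagination-next")
next_button.click()
WebDriverWait(browser, 10).until(EC.staleness_of(old_first_article))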
Handling Authentication
For sites requiring login:
# Navigate to login page
browser.get("https://example.com/login")
# Find and fill username and password fields
browser.find_element(By.ID, "username").send_keys("your_username")
browser.find_element(By.ID, "password").send_keys("your_password")
# Submit the form
browser.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
# Wait for login to complete
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".logged-in-indicator"))
)
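Once logged in, the session cookies can be saved and restored on later runs so the spider does not have to log in every time. A minimal sketch using Selenium's cookie API (cookies.json is just an example filename):

import json

# Save the authenticated session's cookies after a successful login
with open('cookies.json', 'w') as f:
    json.dump(browser.get_cookies(), f)

# On a later run: open the site first, then restore the cookies and reload
browser.get("https://example.com")
with open('cookies.json') as f:
    for cookie in json.load(f):
        cookie.pop('sameSite', None)  # some driver versions reject extra keys
        browser.add_cookie(cookie)
browser.refresh()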