Spider
What is a Spider?
A Spider in Crawlab is the fundamental unit of web crawling functionality. Think of it as a complete web scraping program or project that contains all the necessary code, configuration, and logic to extract data from specific websites.
Spiders can be built using various technologies and frameworks:
- Scrapy projects
- Python scripts with libraries like BeautifulSoup or Selenium
- JavaScript-based crawlers using Puppeteer or Playwright
- Any executable program that can extract web data
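For illustration, here is a minimal sketch of such a spider written as a plain Python script with requests and BeautifulSoup. The file name (spider.py), target URL, and selectors are hypothetical; you would point the spider's Execute Command (described below) at it, e.g. python spider.py.

```python
# spider.py -- a minimal, hypothetical spider runnable as "python spider.py".
# Assumes the requests and beautifulsoup4 packages are installed on the node.
import json

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.com"  # placeholder target site


def main():
    # Fetch the page and parse it.
    response = requests.get(START_URL, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract something simple: the page title and all link targets.
    items = [
        {"title": soup.title.string if soup.title else None, "url": a["href"]}
        for a in soup.find_all("a", href=True)
    ]

    # Print results; how you persist them (stdout, a database, or an SDK)
    # depends on your setup.
    for item in items:
        print(json.dumps(item))


if __name__ == "__main__":
    main()
```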
The concept of a Spider is central to Crawlab's architecture. Understanding how to create, configure, and manage spiders is essential for effective web scraping at scale.
Spider vs. Project
In Crawlab, a Spider represents an individual web crawler implementation, while a Project is an organizational unit that can contain multiple related spiders. For example:
- Project: E-commerce Data Collection
  - Spider 1: Amazon Product Scraper
  - Spider 2: eBay Product Scraper
  - Spider 3: Walmart Product Scraper
This hierarchical organization helps manage complex data collection operations with multiple crawling targets.
Creating a Spider
Basic Creation Steps
- Navigate to the Spiders page from the main sidebar
- Click the New Spider button in the top-left corner
- Fill in the required information:
  - Name: A unique, descriptive name (e.g., "amazon_product_scraper")
  - Project: (Optional) The project this spider belongs to
  - Execute Command: The command to run your spider (e.g., python spider.py or scrapy crawl amazon)
- Configure additional options as needed
- Click Confirm to create the spider
Configuration Options Explained
- Name: A unique identifier for your spider. Use descriptive names that indicate the purpose or target website.
- Project: The organizational group this spider belongs to. Grouping related spiders helps with management and monitoring.
- Execute Command: The shell command that will be executed when running the spider. Examples:
  python main.py          # For a Python script
  scrapy crawl my_spider  # For a Scrapy spider
  node crawler.js         # For a Node.js crawler
  ./custom_crawler        # For a compiled executable
- Parameters: Additional arguments passed to your execute command. These can be used to modify spider behavior without changing code (see the sketch after this list). Examples:
  --start-url="https://example.com"
  -a category=electronics -a pages=5
  --limit=100 --output=json
- Default Mode: Determines how the spider will be distributed across your Crawlab nodes:
  - Random Node: Executes on one randomly selected node (good for testing)
  - All Nodes: Runs the same spider on every available node (useful for distributed crawling)
  - Selected Nodes: Allows you to choose specific nodes for execution (for specialized hardware requirements)
- Priority: Determines the execution order when multiple spiders are queued. Higher-priority (larger number) spiders execute first.
- Results Collection: The MongoDB collection name where scraped data will be stored. If left blank, it defaults to results_<spider_name>.
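For a plain Python spider, a common way to consume such parameters is standard command-line argument parsing, since the parameters are passed as extra arguments to the execute command. The sketch below uses argparse; the flags --start-url, --limit, and --output simply mirror the sample parameters above and are not built-in Crawlab options. (Scrapy spiders would instead receive -a key=value pairs as spider arguments.)

```python
# spider.py -- hypothetical example of reading run parameters.
# With Execute Command "python spider.py" and Parameters "--limit=100 --output=json",
# the effective invocation becomes "python spider.py --limit=100 --output=json".
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Example spider with run parameters")
    parser.add_argument("--start-url", default="https://example.com", help="First page to fetch")
    parser.add_argument("--limit", type=int, default=100, help="Maximum number of records")
    parser.add_argument("--output", choices=["json", "csv"], default="json", help="Output format")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Crawling {args.start_url}, limit={args.limit}, output={args.output}")
    # ... actual crawling logic goes here ...
```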
Code Management
Uploading Code
Crawlab provides several methods to upload your spider code:
Method 1: Upload Folder (Recommended for Complete Projects)
- Navigate to the spider detail page
- Select the Files tab
- Click the Upload button in the navigation bar
- Choose the Folder option
- Click Click to Select Folder to Upload
- Select the folder containing your spider project
- Click Confirm to upload
This method preserves your project structure, making it ideal for frameworks like Scrapy that depend on specific directory layouts.
Method 2: Upload Individual Files
- Navigate to the spider detail page
- Select the Files tab
- Click the Upload button
- Choose the Files option
- Either drag and drop files into the upload zone or click to select files
- Click Confirm to upload
Use this approach when you need to add or update specific files rather than the entire project.
Method 3: Drag & Drop Upload (Quick Method)
- Navigate to the spider detail page and open the Files tab
- Directly drag files or folders from your local file explorer
- Drop them onto the file navigator on the left side
This method provides a quick way to update or add files to specific directories.
Creating and Editing Files
Crawlab includes a built-in code editor that supports:
- Syntax highlighting for multiple languages
- Code completion
- File creation and deletion
- Directory management
To edit a file:
- Navigate to the spider's Files tab
- Click on a file in the navigator to open it in the editor
- Make your changes
- Click the Save button (or use Ctrl+S/Cmd+S)
To create a new file:
- Navigate to the spider's Files tab
- Click the New File button in the toolbar
- Enter the file name with an appropriate extension
- Click Confirm
For more details on using the code editor, see the File Editor documentation.
Running Spiders
Basic Execution
- From the spider detail page, click the Run button (play icon) in the navigation bar
- Alternatively, from the Spiders list page, click the Run button for the specific spider
- In the run dialog:
  - Verify or adjust the execution parameters
  - Select the execution mode (Random Node, All Nodes, or Selected Nodes)
  - Add any custom parameters needed for this run
- Click Confirm to start the execution
Advanced Execution Options
- Custom Parameters: Override default parameters for specific runs
- Priority Override: Temporarily change the execution priority
- Node Selection: Choose specific nodes for execution based on capabilities
Monitoring Execution
Once a spider is running:
- Navigate to the Tasks section to see all active and completed spider runs
- Click on a specific task to view:
  - Real-time logs (see the logging sketch below)
  - Execution statistics
  - Error messages (if any)
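Log output is simply what your spider process writes while it runs, so standard logging is usually enough to make the task log useful. A minimal sketch for a Python spider, assuming (as is typical) that the task log captures the process's standard output and error streams:

```python
# Hypothetical logging setup for a Python spider; log lines written to
# stdout are assumed to show up in the Crawlab task log.
import logging
import sys

logging.basicConfig(
    stream=sys.stdout,  # send log lines to stdout
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("example_spider")

logger.info("Spider started")
try:
    raise ValueError("simulated parse error")  # placeholder for real crawling work
except ValueError:
    # Logged tracebacks make failed tasks much easier to diagnose from the task detail page.
    logger.exception("Failed to parse page")
logger.info("Spider finished")
```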
Working with Scraped Data
Viewing Results
After a spider completes its run:
- Navigate to the spider detail page
- Click the Data tab
- Browse the table of collected records
- Use filters and search to find specific entries
- Click on any row to view the complete record details
Data Management
Crawlab provides several tools for working with scraped data:
- Export: Download data in CSV, JSON, or Excel format using the export button
- Filtering: Apply filters to find specific records based on field values
- Pagination: Navigate through large result sets with the pagination controls
- Field Selection: Choose which fields to display in the table view
Database Integration
All scraped data is stored in MongoDB with these characteristics:
- The collection name follows the pattern results_<spider_name> unless configured otherwise
- Each record contains an automatic _id field and a _tid (task ID) field that links back to the execution
- You can access this data directly via MongoDB clients for advanced querying or processing (see the sketch below)
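As a sketch of such direct access, the example below reads a results collection with pymongo. The connection string, database name, spider name, and task id are placeholders to replace with your deployment's actual values; depending on your setup, _tid may be stored as a string or an ObjectId.

```python
# Hypothetical direct read of scraped results with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")          # placeholder connection string
db = client["crawlab_db"]                                   # placeholder database name
collection = db["results_amazon_product_scraper"]           # results_<spider_name> by default

# Count all records produced by a single task via the _tid field.
task_id = "64a0c0ffee0000000000abcd"  # placeholder task id
print(collection.count_documents({"_tid": task_id}))

# Fetch the ten most recently inserted records.
for doc in collection.find().sort("_id", -1).limit(10):
    print(doc)
```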
You can refer to the Database Integration section for more details.
Database integration with other mainstream database systems (MySQL, PostgreSQL, ElasticSearch, etc.) is supported in Crawlab Pro.
Best Practices
Spider Organization
- Use meaningful, consistent naming for your spiders
- Group related spiders into projects
- Include a README.md file in your spider directory explaining its purpose and usage
- Add appropriate comments in your code to explain complex logic
Performance Optimization
- Set appropriate request delays to avoid overloading target websites
- Implement proper error handling and retry mechanisms (a brief sketch of both follows this list)
- Use the scheduler for recurring tasks rather than continuous manual execution
- Consider distributing large crawling tasks across multiple nodes
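A minimal sketch of request delays and retries for a plain Python spider using the requests library; the URL, delay, and retry counts are illustrative only:

```python
# Hypothetical fetch helper with a polite delay and simple retries.
import time

import requests

REQUEST_DELAY_SECONDS = 1.0  # pause between requests to avoid overloading the site
MAX_RETRIES = 3


def fetch(url: str) -> str:
    """Fetch a URL, retrying transient failures with exponential backoff."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            time.sleep(REQUEST_DELAY_SECONDS)  # polite delay after each successful request
            return response.text
        except requests.RequestException:
            if attempt == MAX_RETRIES:
                raise  # give up and let the task fail visibly in the logs
            time.sleep(2 ** attempt)  # back off before retrying


if __name__ == "__main__":
    html = fetch("https://example.com")  # placeholder URL
    print(len(html))
```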
Troubleshooting Common Issues
- Spider Fails Immediately: Check your execute command and ensure all dependencies are installed
- Empty Results: Verify your selectors/extractors; the website structure may have changed
- Timeouts: Adjust the timeout setting or optimize your spider's performance
- High Resource Usage: Implement pagination or chunking for large data extraction tasks
Entity Relationships
Spiders relate to other components in the Crawlab ecosystem as follows (a conceptual sketch follows this list):
- A Spider belongs to a Project (optional)
- A Spider can have multiple Tasks (execution instances)
- A Spider can have multiple Schedules
- Tasks run on specific Nodes
- Schedules trigger Tasks
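As a purely conceptual sketch (not Crawlab's actual schema or API), these relationships could be modeled like this:

```python
# Conceptual sketch of the entity relationships; class and field names are
# illustrative and do not reflect Crawlab's real data model.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    name: str


@dataclass
class Spider:
    name: str
    execute_command: str
    project: Optional[Project] = None  # a Spider may optionally belong to a Project


@dataclass
class Node:
    hostname: str


@dataclass
class Task:
    spider: Spider        # each execution instance belongs to one Spider
    node: Node            # and runs on a specific Node
    status: str = "pending"


@dataclass
class Schedule:
    spider: Spider        # a Schedule triggers Tasks for one Spider
    cron: str             # e.g. "0 3 * * *" for a daily run at 03:00
```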
Next Steps
After mastering the basics of spider management, consider exploring these advanced topics: