Node
What is a Node?
A node is a Crawlab instance that performs specific functions within your distributed web crawling system. In simple terms, a node is a server running Crawlab software that can execute crawling tasks or provide management capabilities.
Nodes are the building blocks of Crawlab's distributed architecture, allowing you to scale your web crawling operations across multiple machines to increase throughput and resilience.
Types of Nodes
Crawlab uses a master-worker architecture with two distinct node types:
Master Node
The Master Node serves as the control center of your Crawlab system. It:
- Manages and coordinates all nodes in the system
- Assigns tasks to Worker Nodes and itself
- Deploys and distributes spider files across the system
- Provides APIs for the frontend application
- Handles communication between nodes
- Monitors system health and performance
There must be exactly ONE Master Node in a Crawlab cluster. This node is crucial as it orchestrates the entire system.
Worker Node
Worker Nodes focus on executing crawling tasks assigned by the Master Node. They:
- Run crawling tasks as directed
- Report task status and results back to the Master Node
- Can be scaled horizontally to increase crawling capacity
Adding more Worker Nodes allows you to:
- Crawl more websites simultaneously
- Distribute load across multiple machines
- Improve fault tolerance
- Overcome rate limiting by distributing requests across different IP addresses
A Crawlab cluster can have zero or more Worker Nodes. A system can function with just a Master Node, but adding Worker Nodes allows for greater scalability.
System Architecture
Topology
A Crawlab cluster consists of a single Master Node coordinating zero or more Worker Nodes, with all nodes connected to the same database.
Communication Flow
- The Master Node assigns tasks to Worker Nodes
- Worker Nodes execute their assigned tasks
- Worker Nodes report task status and results back to the Master Node
- The Master Node aggregates and stores results
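The sketch below models this flow in miniature: a master thread assigns tasks to worker threads over a queue, the workers execute them, and results are reported back and aggregated. It is a conceptual illustration only; the class and function names are made up and it does not reflect Crawlab's actual internals.

```python
# Conceptual model of the master/worker task flow (illustration only,
# not Crawlab's implementation).
import queue
import threading


class Task:
    def __init__(self, task_id, spider):
        self.task_id = task_id
        self.spider = spider
        self.status = "pending"
        self.result = None


def worker_loop(name, task_queue, result_queue):
    """Worker Node: execute assigned tasks and report status/results back."""
    while True:
        task = task_queue.get()
        if task is None:                        # shutdown signal from the master
            task_queue.task_done()
            break
        task.status = "running"
        task.result = f"{name} crawled {task.spider}"   # stand-in for a real crawl
        task.status = "finished"
        result_queue.put(task)                  # report back to the Master Node
        task_queue.task_done()


def master(tasks, num_workers=2):
    """Master Node: assign tasks, then aggregate the reported results."""
    task_queue, result_queue = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker_loop, args=(f"worker-{i}", task_queue, result_queue))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    for task in tasks:                          # 1. assign tasks to workers
        task_queue.put(task)
    for _ in workers:                           # tell workers to stop when done
        task_queue.put(None)
    task_queue.join()                           # 2-3. workers execute and report
    for w in workers:
        w.join()
    return [result_queue.get() for _ in tasks]  # 4. aggregate and store results


if __name__ == "__main__":
    for task in master([Task(i, f"spider-{i}") for i in range(4)]):
        print(task.task_id, task.status, task.result)
```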
Node Management
Viewing Node Status
In the Nodes page of the Crawlab UI, you can view all registered nodes and their current status (online/offline). This helps you monitor the health of your crawling infrastructure.
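If you prefer to check node status from a script, the node list is also exposed through the Master Node's HTTP API. The sketch below is an assumption-laden example: the endpoint path (`/api/nodes`), the token-based `Authorization` header, and the response fields may differ between Crawlab versions, so verify them against your version's API documentation.

```python
# Sketch: list registered nodes and their status via the Crawlab REST API.
# Endpoint path, auth header, and response fields are assumptions to verify.
import requests

CRAWLAB_URL = "http://localhost:8080"   # Master Node address (adjust as needed)
API_TOKEN = "<your-api-token>"          # token generated in the Crawlab UI

resp = requests.get(
    f"{CRAWLAB_URL}/api/nodes",
    headers={"Authorization": API_TOKEN},
    timeout=10,
)
resp.raise_for_status()

for node in resp.json().get("data", []) or []:
    print(node.get("name"), "-", node.get("status"))
```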
Enabling and Disabling Nodes
You can temporarily remove a node from the task scheduling pool without removing it from the system:
- Navigate to the Nodes page
- Toggle the Enabled switch for the desired node
- Alternatively, you can change this setting in the node detail page
Disabled nodes will not receive new tasks but will continue to run any currently executing tasks.
Configuring Maximum Concurrent Tasks
To control how many tasks a node can run simultaneously:
- Navigate to the node detail page
- Adjust the Max Runners setting
This setting helps you optimize resource usage based on each node's capabilities. By default, this is set to unlimited.
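Conceptually, Max Runners is a cap on how many tasks may execute on a node at the same time. The sketch below illustrates the idea with a semaphore; it is not Crawlab's scheduler, just a minimal model of bounded concurrency.

```python
# Minimal model of a Max Runners cap: at most MAX_RUNNERS tasks run at once.
# Illustration only, not Crawlab code.
import threading
import time

MAX_RUNNERS = 4                                 # analogous to the node's Max Runners
runner_slots = threading.Semaphore(MAX_RUNNERS)


def run_task(task_id):
    with runner_slots:                          # block until a runner slot is free
        print(f"task {task_id} started")
        time.sleep(1)                           # stand-in for the actual crawl
        print(f"task {task_id} finished")


threads = [threading.Thread(target=run_task, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```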
For production environments, it's recommended to set Max Runners based on:
- Available CPU cores
- Available memory
- Network bandwidth limitations
- Target website constraints
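As a rough starting point, you can derive a value from the first two factors; the 512 MB per-task figure below is an assumption, so replace it with memory measurements from your own spiders, and lower the result further if bandwidth or target-site limits are the real bottleneck.

```python
# Rough heuristic for an initial Max Runners value on a Worker Node.
# The 512 MB per-task estimate is an assumption; measure your own spiders.
import os


def suggest_max_runners(memory_gb: float, per_task_mb: int = 512) -> int:
    cpu_cores = os.cpu_count() or 1
    by_memory = int(memory_gb * 1024 // per_task_mb)
    return max(1, min(cpu_cores, by_memory))


# Example: a Worker Node with 2 GB of RAM (the minimum from the table below)
print(suggest_max_runners(memory_gb=2))
```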
Node Deployment
Hardware Recommendations
| Node Type | CPU | Memory | Disk Space |
|---|---|---|---|
| Master Node | 2+ cores | 4GB+ | 20GB+ |
| Worker Node | 2+ cores | 2GB+ | 10GB+ |
Actual requirements will vary based on your specific workload and the complexity of your spiders.
Adding a New Node
To expand your Crawlab cluster by adding Worker Nodes:
- Install Crawlab on the new server
- Configure it to connect to the same database as your Master Node
- Set the node type to "Worker" in the configuration
- Start the Crawlab service
For detailed instructions, refer to Set up Worker Nodes in the Multi-Node Deployment section.
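Once the new Worker Node starts, it should register itself and appear as online in the Nodes page. If you want to verify this from a script, a sketch like the one below can poll the Master Node's API; the endpoint path, auth header, and response fields are assumptions to check against your Crawlab version.

```python
# Sketch: poll the Master Node's API until a newly added worker shows online.
# Endpoint path, auth header, and field names ("name", "status") are assumptions.
import time
import requests

CRAWLAB_URL = "http://localhost:8080"
API_TOKEN = "<your-api-token>"
NEW_NODE_NAME = "worker-01"              # name of the node you just added


def node_is_online(name: str) -> bool:
    resp = requests.get(
        f"{CRAWLAB_URL}/api/nodes",
        headers={"Authorization": API_TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    nodes = resp.json().get("data", []) or []
    return any(n.get("name") == name and n.get("status") == "online" for n in nodes)


for _ in range(30):                      # wait up to ~5 minutes
    if node_is_online(NEW_NODE_NAME):
        print(f"{NEW_NODE_NAME} registered and online")
        break
    time.sleep(10)
else:
    print(f"{NEW_NODE_NAME} did not come online; check its logs and database settings")
```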
Troubleshooting
Common Node Issues
- Node shows as offline
  - Check if the Crawlab service is running
  - Verify network connectivity between nodes
  - Ensure database connection is working properly
- Node not receiving tasks
  - Check if the node is enabled
  - Verify the node has not reached its Max Runners limit
  - Check log files for potential errors
- Communication issues between nodes
  - Verify firewall settings allow necessary communication
  - Check that all nodes are connected to the same database
  - Ensure consistent Crawlab versions across all nodes
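For offline-node and communication issues, a quick connectivity check run from the affected node can narrow things down. The sketch below assumes a MongoDB backend and a Master Node HTTP API on port 8080; adjust the addresses (and the checks themselves) to match your deployment.

```python
# Quick connectivity check to run from a problem node.
# Assumes a MongoDB backend and a Master Node API on port 8080 (adjust as needed).
import requests
from pymongo import MongoClient
from pymongo.errors import PyMongoError

MONGO_URI = "mongodb://mongo-host:27017"   # the database all nodes should share
MASTER_URL = "http://master-host:8080"     # Master Node address

# 1. Can this node reach the shared database?
try:
    MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000).admin.command("ping")
    print("database: reachable")
except PyMongoError as exc:
    print(f"database: UNREACHABLE ({exc})")

# 2. Can this node reach the Master Node over HTTP?
try:
    requests.get(MASTER_URL, timeout=5)
    print("master node: reachable")
except requests.RequestException as exc:
    print(f"master node: UNREACHABLE ({exc})")
```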
Best Practices
- Start small: Begin with a single Master Node and add Worker Nodes as needed
- Monitor resource usage: Adjust Max Runners based on actual performance
- Regular maintenance: Update all nodes simultaneously to avoid version conflicts
- Geographic distribution: For global crawling, consider placing Worker Nodes in different regions
- Back up the Master Node: As it's critical to the system, ensure proper backup procedures are in place
While you can run multiple Crawlab instances (nodes) on a single physical server, it's generally NOT recommended. A single instance per server is typically more efficient.