Node
What is a Node?
A node is a Crawlab instance that performs specific functions within your distributed web crawling system. In simple terms, a node is a server running Crawlab software that can execute crawling tasks or provide management capabilities.
Nodes are the building blocks of Crawlab's distributed architecture, allowing you to scale your web crawling operations across multiple machines to increase throughput and resilience.
Types of Nodes
Crawlab uses a master-worker architecture with two distinct node types:
Master Node
The Master Node serves as the control center of your Crawlab system. It:
- Manages and coordinates all nodes in the system
- Assigns tasks to Worker Nodes and itself
- Deploys and distributes spider files across the system
- Provides APIs for the frontend application
- Handles communication between nodes
- Monitors system health and performance
There must be exactly ONE Master Node in a Crawlab cluster. This node is crucial as it orchestrates the entire system.
Worker Node
Worker Nodes focus on executing crawling tasks assigned by the Master Node. They:
- Run crawling tasks as directed
- Report task status and results back to the Master Node
- Can be scaled horizontally to increase crawling capacity
Adding more Worker Nodes allows you to:
- Crawl more websites simultaneously
- Distribute load across multiple machines
- Improve fault tolerance
- Overcome rate limiting by distributing requests across different IP addresses
A Crawlab cluster can have zero or more Worker Nodes. A system can function with just a Master Node, but adding Worker Nodes allows for greater scalability.
System Architecture
Topology
A Crawlab cluster consists of a single Master Node coordinating zero or more Worker Nodes, with all nodes connected to the same database.
Communication Flow
- The Master Node assigns tasks to Worker Nodes
- Worker Nodes execute their assigned tasks
- Worker Nodes report task status and results back to the Master Node
- The Master Node aggregates and stores results
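The sketch below models this flow in miniature: a master thread assigns tasks to worker threads over a queue, the workers execute them, and results are reported back and aggregated. It is a conceptual illustration only; the class and function names are made up and it does not reflect Crawlab's actual internals.

```python
# Conceptual model of the master/worker task flow (illustration only,
# not Crawlab's implementation).
import queue
import threading


class Task:
    def __init__(self, task_id, spider):
        self.task_id = task_id
        self.spider = spider
        self.status = "pending"
        self.result = None


def worker_loop(name, task_queue, result_queue):
    """Worker Node: execute assigned tasks and report status/results back."""
    while True:
        task = task_queue.get()
        if task is None:                        # shutdown signal from the master
            task_queue.task_done()
            break
        task.status = "running"
        task.result = f"{name} crawled {task.spider}"   # stand-in for a real crawl
        task.status = "finished"
        result_queue.put(task)                  # report back to the Master Node
        task_queue.task_done()


def master(tasks, num_workers=2):
    """Master Node: assign tasks, then aggregate the reported results."""
    task_queue, result_queue = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker_loop, args=(f"worker-{i}", task_queue, result_queue))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    for task in tasks:                          # 1. assign tasks to workers
        task_queue.put(task)
    for _ in workers:                           # tell workers to stop when done
        task_queue.put(None)
    task_queue.join()                           # 2-3. workers execute and report
    for w in workers:
        w.join()
    return [result_queue.get() for _ in tasks]  # 4. aggregate and store results


if __name__ == "__main__":
    for task in master([Task(i, f"spider-{i}") for i in range(4)]):
        print(task.task_id, task.status, task.result)
```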
Node Management
Viewing Node Status
In the Nodes page of the Crawlab UI, you can view all registered nodes and their current status (online/offline). This helps you monitor the health of your crawling infrastructure.
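If you prefer to check node status from a script, the node list is also exposed through the Master Node's HTTP API. The sketch below is an assumption-laden example: the endpoint path (`/api/nodes`), the token-based `Authorization` header, and the response fields may differ between Crawlab versions, so verify them against your version's API documentation.

```python
# Sketch: list registered nodes and their status via the Crawlab REST API.
# Endpoint path, auth header, and response fields are assumptions to verify.
import requests

CRAWLAB_URL = "http://localhost:8080"   # Master Node address (adjust as needed)
API_TOKEN = "<your-api-token>"          # token generated in the Crawlab UI

resp = requests.get(
    f"{CRAWLAB_URL}/api/nodes",
    headers={"Authorization": API_TOKEN},
    timeout=10,
)
resp.raise_for_status()

for node in resp.json().get("data", []) or []:
    print(node.get("name"), "-", node.get("status"))
```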
Enabling and Disabling Nodes
You can temporarily remove a node from the task scheduling pool without removing it from the system:
- Navigate to the Nodes page
- Toggle the Enabled switch for the desired node
- Alternatively, you can change this setting in the node detail page
Disabled nodes will not receive new tasks but will continue to run any currently executing tasks.
Configuring Maximum Concurrent Tasks
To control how many tasks a node can run simultaneously:
- Navigate to the node detail page
- Adjust the Max Runners setting
This setting helps you optimize resource usage based on each node's capabilities. By default, this is set to unlimited.
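Conceptually, Max Runners is a cap on how many tasks may execute on a node at the same time. The sketch below illustrates the idea with a semaphore; it is not Crawlab's scheduler, just a minimal model of bounded concurrency.

```python
# Minimal model of a Max Runners cap: at most MAX_RUNNERS tasks run at once.
# Illustration only, not Crawlab code.
import threading
import time

MAX_RUNNERS = 4                                 # analogous to the node's Max Runners
runner_slots = threading.Semaphore(MAX_RUNNERS)


def run_task(task_id):
    with runner_slots:                          # block until a runner slot is free
        print(f"task {task_id} started")
        time.sleep(1)                           # stand-in for the actual crawl
        print(f"task {task_id} finished")


threads = [threading.Thread(target=run_task, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```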
For production environments, it's recommended to set Max Runners based on:
- Available CPU cores
- Available memory
- Network bandwidth limitations
- Target website constraints
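As a rough starting point, you can derive a value from the first two factors; the 512 MB per-task figure below is an assumption, so replace it with memory measurements from your own spiders, and lower the result further if bandwidth or target-site limits are the real bottleneck.

```python
# Rough heuristic for an initial Max Runners value on a Worker Node.
# The 512 MB per-task estimate is an assumption; measure your own spiders.
import os


def suggest_max_runners(memory_gb: float, per_task_mb: int = 512) -> int:
    cpu_cores = os.cpu_count() or 1
    by_memory = int(memory_gb * 1024 // per_task_mb)
    return max(1, min(cpu_cores, by_memory))


# Example: a Worker Node with 2 GB of RAM (the minimum from the table below)
print(suggest_max_runners(memory_gb=2))
```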
Node Deployment
Hardware Recommendations
| Node Type | CPU | Memory | Disk Space |
|---|---|---|---|
| Master Node | 2+ cores | 4GB+ | 20GB+ |
| Worker Node | 2+ cores | 2GB+ | 10GB+ |
Actual requirements will vary based on your specific workload and the complexity of your spiders.
Adding a New Node
To expand your Crawlab cluster by adding Worker Nodes:
- Install Crawlab on the new server
- Configure it to connect to the same database as your Master Node
- Set the node type to "Worker" in the configuration
- Start the Crawlab service
For detailed instructions, refer to Set up Worker Nodes in the Multi-Node Deployment section.
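Once the new Worker Node starts, it should register itself and appear as online in the Nodes page. If you want to verify this from a script, a sketch like the one below can poll the Master Node's API; the endpoint path, auth header, and response fields are assumptions to check against your Crawlab version.

```python
# Sketch: poll the Master Node's API until a newly added worker shows online.
# Endpoint path, auth header, and field names ("name", "status") are assumptions.
import time
import requests

CRAWLAB_URL = "http://localhost:8080"
API_TOKEN = "<your-api-token>"
NEW_NODE_NAME = "worker-01"              # name of the node you just added


def node_is_online(name: str) -> bool:
    resp = requests.get(
        f"{CRAWLAB_URL}/api/nodes",
        headers={"Authorization": API_TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    nodes = resp.json().get("data", []) or []
    return any(n.get("name") == name and n.get("status") == "online" for n in nodes)


for _ in range(30):                      # wait up to ~5 minutes
    if node_is_online(NEW_NODE_NAME):
        print(f"{NEW_NODE_NAME} registered and online")
        break
    time.sleep(10)
else:
    print(f"{NEW_NODE_NAME} did not come online; check its logs and database settings")
```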
Troubleshooting
Common Node Issues
- Node shows as offline
  - Check if the Crawlab service is running
  - Verify network connectivity between nodes
  - Ensure database connection is working properly
- Node not receiving tasks
  - Check if the node is enabled
  - Verify the node has not reached its Max Runners limit
  - Check log files for potential errors
- Communication issues between nodes
  - Verify firewall settings allow necessary communication
  - Check that all nodes are connected to the same database
  - Ensure consistent Crawlab versions across all nodes
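For offline-node and communication issues, a quick connectivity check run from the affected node can narrow things down. The sketch below assumes a MongoDB backend and a Master Node HTTP API on port 8080; adjust the addresses (and the checks themselves) to match your deployment.

```python
# Quick connectivity check to run from a problem node.
# Assumes a MongoDB backend and a Master Node API on port 8080 (adjust as needed).
import requests
from pymongo import MongoClient
from pymongo.errors import PyMongoError

MONGO_URI = "mongodb://mongo-host:27017"   # the database all nodes should share
MASTER_URL = "http://master-host:8080"     # Master Node address

# 1. Can this node reach the shared database?
try:
    MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000).admin.command("ping")
    print("database: reachable")
except PyMongoError as exc:
    print(f"database: UNREACHABLE ({exc})")

# 2. Can this node reach the Master Node over HTTP?
try:
    requests.get(MASTER_URL, timeout=5)
    print("master node: reachable")
except requests.RequestException as exc:
    print(f"master node: UNREACHABLE ({exc})")
```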
Best Practices
- Start small: Begin with a single Master Node and add Worker Nodes as needed
- Monitor resource usage: Adjust Max Runners based on actual performance
- Regular maintenance: Update all nodes simultaneously to avoid version conflicts
- Geographic distribution: For global crawling, consider placing Worker Nodes in different regions
- Back up the Master Node: As it's critical to the system, ensure proper backup procedures are in place
While you can run multiple Crawlab instances (nodes) on a single physical server, it's generally NOT recommended. A single instance per server is typically more efficient.