Task

What is a Task?

A Task in Crawlab represents a single execution instance of a spider. Whenever a spider runs—whether triggered manually, by a schedule, or through an API call—a new task is created to track that specific execution from start to finish. Tasks are the fundamental units of work in Crawlab that handle the actual data collection process.

Each task maintains its own:

  • Execution log
  • Status information
  • Resource usage metrics
  • Result data
  • Error records
  • Performance statistics
info

Tasks are the operational heartbeat of Crawlab. Understanding how they work is essential for monitoring, troubleshooting, and optimizing your web scraping operations.

Task vs. Spider vs. Schedule

Understanding how Tasks relate to other Crawlab concepts is important:

  • Spider: The definition of a web crawler (the code, configuration, and logic)
  • Schedule: A time-based trigger that determines when a spider should run
  • Task: A single execution instance of a spider at a specific point in time

This relationship can be understood as:

  • A Spider is the "what" (what code runs)
  • A Schedule is the "when" (when it should run)
  • A Task is the "instance" (one specific execution with its own results)

One spider can have many tasks (historical executions) and multiple schedules (different timing patterns).

Task Lifecycle

Every task goes through a series of states during its lifecycle:

  1. Pending: Task has been created but is waiting in the execution queue
  2. Running: Task is actively executing on a node
  3. Finished: Task completed successfully
  4. Error: Task encountered an error and stopped execution
  5. Cancelled: Task was manually stopped before completion

This lifecycle is visualized below:
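
In addition to the diagram, the same lifecycle can be sketched in code as a simple state machine. The status names mirror the list above; the transition map is an illustration of the described lifecycle, not Crawlab's internal implementation (for example, whether a pending task can be cancelled directly may vary by version).

```python
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"
    CANCELLED = "cancelled"

# Illustrative transition map: which states a task may move to next.
TRANSITIONS = {
    TaskStatus.PENDING: {TaskStatus.RUNNING, TaskStatus.CANCELLED},
    TaskStatus.RUNNING: {TaskStatus.FINISHED, TaskStatus.ERROR, TaskStatus.CANCELLED},
    TaskStatus.FINISHED: set(),   # terminal
    TaskStatus.ERROR: set(),      # terminal
    TaskStatus.CANCELLED: set(),  # terminal
}

def can_transition(current: TaskStatus, new: TaskStatus) -> bool:
    """Return True if the sketched lifecycle allows this state change."""
    return new in TRANSITIONS[current]

print(can_transition(TaskStatus.RUNNING, TaskStatus.FINISHED))  # True
print(can_transition(TaskStatus.FINISHED, TaskStatus.RUNNING))  # False
```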

Creating Tasks

Tasks are typically created in one of three ways:

Method 1: Manual Execution

  1. Navigate to the Spiders page
  2. Find the spider you want to run
  3. Click the Run button (play icon)
  4. Configure any run-time parameters in the dialog
  5. Click Confirm to create and start the task

Method 2: Scheduled Execution

  1. A previously configured Schedule triggers based on its cron expression
  2. The system automatically creates a new task for the associated spider
  3. The task begins execution according to the schedule's configuration

Method 3: API Integration

  1. An external system calls Crawlab's API with a request to run a specific spider
  2. The API creates a new task based on the parameters provided
  3. The task is queued for execution like any other task
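
A rough illustration of this API-based method is shown below. The endpoint path, payload fields, and authentication header are assumptions made for the sketch; consult the API reference for your Crawlab version for the exact contract.

```python
import requests

CRAWLAB_URL = "http://localhost:8080"   # assumption: default Crawlab address
API_TOKEN = "<your-api-token>"          # generated in the Crawlab UI
SPIDER_ID = "<spider-id>"

# Assumed endpoint and payload shape -- verify against your version's API docs.
resp = requests.post(
    f"{CRAWLAB_URL}/api/spiders/{SPIDER_ID}/run",
    headers={"Authorization": API_TOKEN},
    json={
        "mode": "random",              # random / all-nodes / selected-nodes
        "priority": 5,                 # larger number = higher priority
        "param": "--category books",   # run-time parameters for this execution
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the response typically includes the id of the new task
```
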
tip

All three methods create identical task objects—the only difference is how they're initiated.

Task Configuration Options

When creating a task (particularly through manual execution), you can configure several parameters:

Core Parameters

  • Mode: Determines how the task will be distributed:

    • Random Node: Executes on one randomly selected node
    • All Nodes: Runs on every available node simultaneously
    • Selected Nodes: Allows choosing specific nodes for execution
  • Priority: Sets the execution order when multiple tasks are queued. Higher priority (larger number) tasks execute first.

  • Parameters: Custom arguments passed to the spider for this specific execution. These override the spider's default parameters.
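
Run-time parameters ultimately reach the spider process; a common pattern is for the spider to read them as ordinary command-line arguments, as sketched below. How the parameter string is delivered (here assumed to be appended to the spider's execute command) can differ by setup, so verify against your configuration.

```python
# A minimal sketch of a spider entry point that consumes run-time parameters.
import argparse

parser = argparse.ArgumentParser(description="Example spider entry point")
parser.add_argument("--category", default="all", help="category to crawl")
parser.add_argument("--max-pages", type=int, default=10, help="page limit for this run")
args = parser.parse_args()

print(f"Crawling category={args.category}, up to {args.max_pages} pages")
# ... spider logic goes here ...
```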

Advanced Options

  • Node Selection: When using "Selected Nodes" mode, you can choose exactly which nodes will execute the task

Monitoring Tasks

Task List View

The Tasks page provides an overview of all tasks in the system:

  1. Navigate to the Tasks page from the main sidebar
  2. View the list of recent tasks with their:
    • ID
    • Spider name
    • Status
    • Node
    • Start/end times
    • Duration
    • Result count

This view supports:

  • Filtering by status, spider, node, and time range
  • Sorting by various columns
  • Searching by task ID or spider name
  • Batch operations on selected tasks

Task Detail View

For in-depth information about a specific task:

  1. Click on any task in the task list
  2. Access detailed information through the following tabs:

Overview Tab

Provides summary information including:

  • Task metadata (ID, spider, node, times)
  • Status and duration
  • Result statistics
  • Resource utilization graphs
  • Key performance metrics

Logs Tab

Displays the complete execution log:

  • Real-time streaming of logs for active tasks
  • Full console output from the spider execution
  • Log filtering by severity level
  • Log search functionality
  • Log download option

Results Tab

Shows the data collected by this task:

  • Tabular view of all scraped items
  • Field filtering and sorting
  • Record search capabilities
  • Data export options (CSV, JSON, Excel)
  • Relationship to MongoDB collection

Task Management

Cancelling Tasks

To stop a running task before completion:

  1. Navigate to the Tasks page
  2. Find the running task you want to stop
  3. Click the Cancel button (stop icon)
  4. Confirm the cancellation in the dialog

The system will attempt to gracefully terminate the execution process.

warning

Cancelling a task is not always instantaneous. Some operations might continue briefly before the task fully stops.
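
If your spider needs to flush buffers or close connections on cancellation, it can watch for the termination signal. The sketch below assumes cancellation reaches the spider process as a SIGTERM, which is a common pattern for stopping child processes; verify the exact behavior in your deployment.

```python
import signal
import sys
import time

_stop_requested = False

def _handle_term(signum, frame):
    """Mark the spider for shutdown so in-flight work can be flushed."""
    global _stop_requested
    _stop_requested = True

# Assumption: task cancellation reaches the spider process as SIGTERM.
signal.signal(signal.SIGTERM, _handle_term)

for page in range(1, 1001):
    if _stop_requested:
        print("Cancellation requested; flushing buffered results and exiting.")
        # flush buffers / close connections here
        sys.exit(0)
    # ... crawl one page ...
    time.sleep(0.1)
```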

Rerunning Tasks

To execute a task again with the same parameters:

  1. Navigate to the Tasks page
  2. Find the task you want to rerun
  3. Click the Rerun button (refresh icon)
  4. The system creates a new task with identical configuration

This is useful when:

  • A task failed due to temporary issues
  • You need to refresh data with the same parameters
  • You want to compare results over time

Task Results and Data

Every task that successfully scrapes data stores its results in a MongoDB collection:

Default Collection

By default, results are stored in a collection named:

  • results_<spider_name> (e.g., results_amazon_product_scraper)

This means all tasks from the same spider share a collection, with each record containing a _tid (task id) field that links back to the specific task that created it.

Accessing Results

Task results can be accessed in several ways:

  1. Web Interface:

    • Navigate to the task's detail page
    • Click the Results tab
    • Browse, search, and export the data
  2. MongoDB Integration:

    • Connect directly to the MongoDB instance
    • Query the appropriate collection
    • Filter by _tid (task id) to get results for a specific task (see the query sketch after this list)
  3. API:

    • Use Crawlab's API to programmatically retrieve results
    • Filter and format data as needed for integration with other systems
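
For option 2 above, a direct MongoDB query filtered on _tid pulls out exactly the records written by one task, as sketched below. The connection string, database name, and spider name are placeholders; the collection name follows the results_<spider_name> convention described earlier, and depending on the Crawlab version _tid may be stored as a string or an ObjectId.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["crawlab_db"]                          # placeholder: use your Crawlab database name

collection = db["results_amazon_product_scraper"]  # results_<spider_name>
task_id = "<task-id>"                              # may need ObjectId(...) depending on version

# All records written by one specific task execution
for doc in collection.find({"_tid": task_id}).limit(10):
    print(doc)

# Item counts per task, to compare executions of the same spider
pipeline = [
    {"$group": {"_id": "$_tid", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in collection.aggregate(pipeline):
    print(row["_id"], row["count"])
```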

Performance Metrics (WIP)

Tasks collect and display various performance metrics to help you understand and optimize your spiders:

System Metrics

  • CPU Usage: Percentage of CPU utilized by the task
  • Memory Usage: RAM consumption over time
  • Network Traffic: Bytes sent and received
  • Disk I/O: Read/write operations

Crawling Metrics

  • Request Count: Total number of HTTP requests made
  • Success Rate: Percentage of successful requests
  • Response Time: Average and distribution of request times
  • Data Throughput: Items scraped per second/minute
  • Request Frequency: Requests per second

Custom Metrics

Spiders can report custom metrics through the Crawlab SDK, allowing you to track domain-specific performance indicators.
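
As a point of reference, the snippet below uses the Crawlab Python SDK's item-saving helper, which feeds the result counts shown for a task. The exact helper for reporting custom metrics varies between SDK versions, so it is not shown here; check the SDK documentation for your version.

```python
# Sketch using the Crawlab Python SDK (package: crawlab-sdk); it associates
# each saved record with the currently running task.
from crawlab import save_item

item = {
    "title": "Example product",
    "price": 19.99,
    "url": "https://example.com/p/123",
}
save_item(item)
```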

Task Logs

Task logs are crucial for monitoring execution and troubleshooting issues:

Log Best Practices

  • Enable appropriate log levels based on your needs
  • Add contextual information to log messages
  • Use structured logging when possible
  • Implement custom logging for domain-specific events
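
Because Crawlab captures the spider's console output as the task log, Python's standard logging module is enough to apply these practices. A minimal sketch with leveled, structured (JSON) messages:

```python
import json
import logging
import sys

# Log to stdout so Crawlab captures the messages as part of the task log.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("my_spider")

def log_event(event: str, **context):
    """Emit a structured (JSON) log line with contextual fields."""
    log.info(json.dumps({"event": event, **context}))

log_event("page_fetched", url="https://example.com/page/1", status=200, items=24)
log_event("retry", url="https://example.com/page/2", attempt=2, reason="timeout")
```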

Log Retention

Logs are retained according to your system configuration:

  • By default, logs are kept for 30 days
  • Configure retention policies based on your storage capacity
  • Consider exporting critical logs for longer-term storage

Best Practices for Task Management

Performance Optimization

  • Batch Size Control: Configure your spiders to process data in appropriate batch sizes
  • Resource Allocation: Assign higher priority to time-sensitive tasks
  • Concurrency Settings: Tune parallel execution parameters based on target site capabilities
  • Node Selection: Choose appropriate nodes based on task requirements

Monitoring Strategy

  • Active Observation: Keep an eye on long-running tasks to catch issues early
  • Alert Configuration: Set up notifications for task failures
  • Performance Baseline: Establish normal performance metrics to identify anomalies
  • Regular Review: Periodically analyze task history to spot trends

Troubleshooting Tips

  • Log Analysis: Start by examining logs for error messages or warnings
  • Parameter Verification: Confirm the task received the correct parameters
  • Node Inspection: Check if node-specific issues might be affecting the task
  • Incremental Testing: Modify spider parameters to isolate problematic components

Entity Relationships

The diagram below illustrates how Tasks relate to other components in the Crawlab ecosystem:

This shows that:

  • Each Task executes exactly one Spider
  • A Task may be triggered by a Schedule (optional)
  • Each Task runs on exactly one Node
  • A Task produces multiple Result records
  • A Task generates multiple Log entries
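
Expressed as a simplified data model (the field names below are illustrative, not Crawlab's actual schema), the relationships look like this:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    id: str
    spider_id: str                      # executes exactly one Spider
    node_id: str                        # runs on exactly one Node
    schedule_id: Optional[str] = None   # set only when triggered by a Schedule
    status: str = "pending"
    result_ids: List[str] = field(default_factory=list)  # many Result records
    log_lines: List[str] = field(default_factory=list)   # many Log entries
```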

Advanced Task Concepts

Task Queue and Scheduling

Crawlab uses a priority queue mechanism to manage pending tasks:

  1. Tasks enter the queue with a specified priority level
  2. The scheduler evaluates available node resources
  3. Higher priority tasks are assigned to nodes before lower priority ones
  4. Tasks with equal priority are processed in FIFO (First In, First Out) order
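
This ordering can be mimicked with a standard priority queue: a larger priority number wins, and a monotonically increasing sequence number breaks ties in first-in, first-out order. The snippet below illustrates the described behavior; it is not Crawlab's scheduler code.

```python
import heapq
import itertools

_counter = itertools.count()  # FIFO tie-breaker for equal priorities
_queue = []

def enqueue(task_id: str, priority: int) -> None:
    # heapq is a min-heap, so negate the priority to run larger numbers first.
    heapq.heappush(_queue, (-priority, next(_counter), task_id))

def next_task() -> str:
    _, _, task_id = heapq.heappop(_queue)
    return task_id

enqueue("task-a", priority=1)
enqueue("task-b", priority=5)
enqueue("task-c", priority=5)

print([next_task() for _ in range(3)])  # ['task-b', 'task-c', 'task-a']
```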

Distributed Execution Modes

The task distribution strategy affects how work is divided:

  • Random Node: Simple allocation to a single node

    • Best for: Testing, simple spiders, or when node selection doesn't matter
  • All Nodes: Replication of the same task across all nodes

    • Best for: Distributing identical crawling tasks with different starting points
    • Requires spider code to handle the distribution logic (see the sketch after this list)
  • Selected Nodes: Manual assignment to specific nodes

    • Best for: Specialized tasks requiring specific capabilities
    • Useful when certain nodes have special access or resources
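
For the All Nodes mode, each replica needs a way to claim its own share of the work. One sketch is to partition start URLs by a node identifier; the CRAWLAB_NODE_ID and NODE_COUNT environment variables below are assumptions for illustration, so adapt them to whatever your deployment actually exposes.

```python
import hashlib
import os

# Assumed environment variables -- replace with however your deployment
# exposes the node identity and cluster size to the spider process.
NODE_ID = os.environ.get("CRAWLAB_NODE_ID", "node-0")
NODE_COUNT = int(os.environ.get("NODE_COUNT", "3"))
NODE_INDEX = int(hashlib.md5(NODE_ID.encode()).hexdigest(), 16) % NODE_COUNT

start_urls = [f"https://example.com/category/{i}" for i in range(30)]

# Each replica crawls only the URLs that hash into its own slot.
my_urls = [u for u in start_urls
           if int(hashlib.md5(u.encode()).hexdigest(), 16) % NODE_COUNT == NODE_INDEX]

print(f"{NODE_ID} handles {len(my_urls)} of {len(start_urls)} start URLs")
```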

Next Steps

After mastering task management, consider exploring these advanced topics: