Task
What is a Task?
A Task in Crawlab represents a single execution instance of a spider. Whenever a spider runs—whether triggered manually, by a schedule, or through an API call—a new task is created to track that specific execution from start to finish. Tasks are the fundamental units of work in Crawlab that handle the actual data collection process.
Each task maintains its own:
- Execution log
- Status information
- Resource usage metrics
- Result data
- Error records
- Performance statistics
Tasks are the operational heartbeat of Crawlab. Understanding how they work is essential for monitoring, troubleshooting, and optimizing your web scraping operations.
Task vs. Spider vs. Schedule
Understanding how Tasks relate to other Crawlab concepts is important:
- Spider: The definition of a web crawler (the code, configuration, and logic)
- Schedule: A time-based trigger that determines when a spider should run
- Task: A single execution instance of a spider at a specific point in time
This relationship can be understood as:
- A Spider is the "what" (what code runs)
- A Schedule is the "when" (when it should run)
- A Task is the "instance" (one specific execution with its own results)
One spider can have many tasks (historical executions) and multiple schedules (different timing patterns).
Task Lifecycle
Every task goes through a series of states during its lifecycle:
- Pending: Task has been created but is waiting in the execution queue
- Running: Task is actively executing on a node
- Finished: Task completed successfully
- Error: Task encountered an error and stopped execution
- Cancelled: Task was manually stopped before completion
A typical task moves from Pending to Running and then ends in exactly one terminal state: Finished, Error, or Cancelled.
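For readers who prefer code to prose, here is a minimal illustrative sketch of the same lifecycle in Python. The state names mirror the list above; the transition table is an assumption for illustration and is not taken from Crawlab's source.

```python
from enum import Enum


class TaskStatus(Enum):
    """Task states as described above (names are illustrative)."""
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"
    CANCELLED = "cancelled"


# Assumed transitions: a task leaves the queue to run, and a run ends
# in exactly one terminal state. A pending task can also be cancelled.
TRANSITIONS = {
    TaskStatus.PENDING: {TaskStatus.RUNNING, TaskStatus.CANCELLED},
    TaskStatus.RUNNING: {TaskStatus.FINISHED, TaskStatus.ERROR, TaskStatus.CANCELLED},
    TaskStatus.FINISHED: set(),
    TaskStatus.ERROR: set(),
    TaskStatus.CANCELLED: set(),
}


def can_transition(current: TaskStatus, target: TaskStatus) -> bool:
    """Return True if the move between states is allowed in this model."""
    return target in TRANSITIONS[current]


if __name__ == "__main__":
    print(can_transition(TaskStatus.PENDING, TaskStatus.RUNNING))   # True
    print(can_transition(TaskStatus.FINISHED, TaskStatus.RUNNING))  # False
```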
Creating Tasks
Tasks are typically created in one of three ways:
Method 1: Manual Execution
- Navigate to the Spiders page
- Find the spider you want to run
- Click the Run button (play icon)
- Configure any run-time parameters in the dialog
- Click Confirm to create and start the task
Method 2: Scheduled Execution
- A previously configured Schedule triggers based on its cron expression
- The system automatically creates a new task for the associated spider
- The task begins execution according to the schedule's configuration
Method 3: API Integration
- An external system calls Crawlab's API with a request to run a specific spider
- The API creates a new task based on the parameters provided
- The task is queued for execution like any other task
All three methods create identical task objects—the only difference is how they're initiated.
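As a sketch of the API route, the snippet below triggers a run with Python's requests library. The endpoint path (/api/spiders/<id>/run), the token header, and the payload fields are assumptions based on a typical Crawlab deployment; verify them against the API documentation for your Crawlab version.

```python
import requests

CRAWLAB_URL = "http://localhost:8080"   # assumed Crawlab base URL
API_TOKEN = "<your-api-token>"          # generated in the Crawlab UI
SPIDER_ID = "<spider-id>"               # taken from the spider's detail page

# Assumed endpoint and payload shape; check your version's API docs.
resp = requests.post(
    f"{CRAWLAB_URL}/api/spiders/{SPIDER_ID}/run",
    headers={"Authorization": API_TOKEN},
    json={
        "mode": "random",               # random / all-nodes / selected-nodes
        "param": "--category books",    # extra arguments for this run
        "priority": 5,
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # the response typically identifies the task that was created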
Task Configuration Options
When creating a task (particularly through manual execution), you can configure several parameters:
Core Parameters
- Mode: Determines how the task will be distributed:
  - Random Node: Executes on one randomly selected node
  - All Nodes: Runs on every available node simultaneously
  - Selected Nodes: Allows choosing specific nodes for execution
- Priority: Sets the execution order when multiple tasks are queued. Higher priority (larger number) tasks execute first.
- Parameters: Custom arguments passed to the spider for this specific execution. These override the spider's default parameters.
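How the spider receives these parameters depends on its language and on how its execute command is defined; Crawlab typically appends them to the command line. A minimal Python sketch, assuming the parameters arrive as ordinary command-line arguments (the flag names below are examples, not a Crawlab convention):

```python
import argparse

# Example spider entry point that accepts run-time parameters.
parser = argparse.ArgumentParser(description="Example spider entry point")
parser.add_argument("--category", default="all", help="category to crawl")
parser.add_argument("--max-pages", type=int, default=10, help="page limit")
args = parser.parse_args()

print(f"Crawling category={args.category} up to {args.max_pages} pages")
```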
Advanced Options
- Node Selection: When using "Selected Nodes" mode, you can choose exactly which nodes will execute the task
Monitoring Tasks
Task List View
The Tasks page provides an overview of all tasks in the system:
- Navigate to the Tasks page from the main sidebar
- View the list of recent tasks with their:
  - ID
  - Spider name
  - Status
  - Node
  - Start/end times
  - Duration
  - Result count
This view supports:
- Filtering by status, spider, node, and time range
- Sorting by various columns
- Searching by task ID or spider name
- Batch operations on selected tasks
Task Detail View
For in-depth information about a specific task:
- Click on any task in the task list
- Access detailed information through the following tabs:
Overview Tab
Provides summary information including:
- Task metadata (ID, spider, node, times)
- Status and duration
- Result statistics
- Resource utilization graphs
- Key performance metrics
Logs Tab
Displays the complete execution log:
- Real-time streaming of logs for active tasks
- Full console output from the spider execution
- Log filtering by severity level
- Log search functionality
- Log download option
Results Tab
Shows the data collected by this task:
- Tabular view of all scraped items
- Field filtering and sorting
- Record search capabilities
- Data export options (CSV, JSON, Excel)
- Relationship to MongoDB collection
Task Management
Cancelling Tasks
To stop a running task before completion:
- Navigate to the Tasks page
- Find the running task you want to stop
- Click the Cancel button (stop icon)
- Confirm the cancellation in the dialog
The system will attempt to gracefully terminate the execution process.
Cancelling a task is not always instantaneous. Some operations might continue briefly before the task fully stops.
Rerunning Tasks
To execute a task again with the same parameters:
- Navigate to the Tasks page
- Find the task you want to rerun
- Click the Rerun button (refresh icon)
- The system creates a new task with identical configuration
This is useful when:
- A task failed due to temporary issues
- You need to refresh data with the same parameters
- You want to compare results over time
Task Results and Data
Every task that successfully scrapes data stores its results in a MongoDB collection:
Default Collection
By default, results are stored in a collection named:
results_<spider_name> (e.g., results_amazon_product_scraper)
This means all tasks from the same spider share a collection, with each record containing a _tid (task id) field that links back to the specific task that created it.
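As an example, the query below uses pymongo to pull the records produced by one task. The connection string and database name are placeholders for your own deployment; only the _tid field and the results_<spider_name> naming follow the convention described above.

```python
from pymongo import MongoClient

# Placeholder connection details; point these at your own MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
db = client["crawlab_test"]  # assumed database name; check your configuration

task_id = "<task-id>"  # copy this from the task's detail page

# Results for one spider live in results_<spider_name>; each record
# carries the _tid of the task that produced it.
collection = db["results_amazon_product_scraper"]
records = list(collection.find({"_tid": task_id}))

print(f"Task {task_id} produced {len(records)} records")
```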
Accessing Results
Task results can be accessed in several ways:
- Web Interface:
  - Navigate to the task's detail page
  - Click the Results tab
  - Browse, search, and export the data
- MongoDB Integration:
  - Connect directly to the MongoDB instance
  - Query the appropriate collection
  - Filter by _tid (task id) to get results for a specific task
- API:
  - Use Crawlab's API to programmatically retrieve results
  - Filter and format data as needed for integration with other systems
Performance Metrics (WIP)
Tasks collect and display various performance metrics to help you understand and optimize your spiders:
System Metrics
- CPU Usage: Percentage of CPU utilized by the task
- Memory Usage: RAM consumption over time
- Network Traffic: Bytes sent and received
- Disk I/O: Read/write operations
Crawling Metrics
- Request Count: Total number of HTTP requests made
- Success Rate: Percentage of successful requests
- Response Time: Average and distribution of request times
- Data Throughput: Items scraped per second/minute
- Request Frequency: Requests per second
Custom Metrics
Spiders can report custom metrics through the Crawlab SDK, allowing you to track domain-specific performance indicators.
Task Logs
Task logs are crucial for monitoring execution and troubleshooting issues:
Log Best Practices
- Enable appropriate log levels based on your needs
- Add contextual information to log messages
- Use structured logging when possible
- Implement custom logging for domain-specific events
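Crawlab captures whatever the spider writes to stdout and stderr, so any standard logging setup will show up in the task's Logs tab. A small Python example of leveled, structured logging:

```python
import json
import logging
import sys

# Log to stdout so the output is captured in the task's Logs tab.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("example-spider")


def log_event(event: str, **context) -> None:
    """Emit a structured (JSON) log line with contextual fields."""
    log.info(json.dumps({"event": event, **context}))


log_event("page_fetched", url="https://example.com/page/1", status=200)
log_event("item_saved", item_id="sku-123", fields=12)
log.warning("retrying request after timeout")
```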
Log Retention
Logs are retained according to your system configuration:
- By default, logs are kept for 30 days
- Configure retention policies based on your storage capacity
- Consider exporting critical logs for longer-term storage
Best Practices for Task Management
Performance Optimization
- Batch Size Control: Configure your spiders to process data in appropriate batch sizes
- Resource Allocation: Assign higher priority to time-sensitive tasks
- Concurrency Settings: Tune parallel execution parameters based on target site capabilities
- Node Selection: Choose appropriate nodes based on task requirements
Monitoring Strategy
- Active Observation: Keep an eye on long-running tasks to catch issues early
- Alert Configuration: Set up notifications for task failures
- Performance Baseline: Establish normal performance metrics to identify anomalies
- Regular Review: Periodically analyze task history to spot trends
Troubleshooting Tips
- Log Analysis: Start by examining logs for error messages or warnings
- Parameter Verification: Confirm the task received the correct parameters
- Node Inspection: Check if node-specific issues might be affecting the task
- Incremental Testing: Modify spider parameters to isolate problematic components
Entity Relationships
Tasks relate to other components in the Crawlab ecosystem as follows:
- Each Task executes exactly one Spider
- A Task may be triggered by a Schedule (optional)
- Each Task runs on exactly one Node
- A Task produces multiple Result records
- A Task generates multiple Log entries
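A minimal illustrative model of these relationships in Python; the field names are for illustration only and do not mirror Crawlab's internal schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """One execution instance, tied to exactly one spider and one node."""
    id: str
    spider_id: str                      # each task executes exactly one spider
    node_id: str                        # and runs on exactly one node
    schedule_id: Optional[str] = None   # set only when a schedule triggered it
    result_ids: List[str] = field(default_factory=list)  # many result records
    log_lines: List[str] = field(default_factory=list)   # many log entries
```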
Advanced Task Concepts
Task Queue and Scheduling
Crawlab uses a priority queue mechanism to manage pending tasks:
- Tasks enter the queue with a specified priority level
- The scheduler evaluates available node resources
- Higher priority tasks are assigned to nodes before lower priority ones
- Tasks with equal priority are processed in FIFO (First In, First Out) order
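The snippet below models that ordering with Python's heapq: higher priority first, FIFO within the same priority. It is a simplified model of the behaviour described above, not Crawlab's actual scheduler code.

```python
import heapq
import itertools

counter = itertools.count()  # monotonically increasing sequence for FIFO ties
queue = []                   # min-heap of (-priority, arrival_order, task_name)


def enqueue(task_name: str, priority: int) -> None:
    # Negate priority so that larger numbers are popped first.
    heapq.heappush(queue, (-priority, next(counter), task_name))


def next_task() -> str:
    _, _, task_name = heapq.heappop(queue)
    return task_name


enqueue("daily-prices", priority=5)
enqueue("backfill", priority=1)
enqueue("urgent-fix", priority=10)
enqueue("daily-reviews", priority=5)

print([next_task() for _ in range(4)])
# ['urgent-fix', 'daily-prices', 'daily-reviews', 'backfill']
```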
Distributed Execution Modes
The task distribution strategy affects how work is divided:
- Random Node: Simple allocation to a single node
  - Best for: Testing, simple spiders, or when node selection doesn't matter
- All Nodes: Replication of the same task across all nodes
  - Best for: Distributing identical crawling tasks with different starting points
  - Requires spider code to handle the distribution logic (see the sketch after this list)
- Selected Nodes: Manual assignment to specific nodes
  - Best for: Specialized tasks requiring specific capabilities
  - Useful when certain nodes have special access or resources
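One common pattern for the All Nodes mode is to have each node take a different slice of the work. The sketch below assumes the spider can learn its own identity and the node count, for example from environment variables injected by your deployment; the variable names NODE_INDEX and NODE_COUNT are assumptions for illustration, not documented Crawlab settings.

```python
import os

# Hypothetical: identify this node and the total node count. Substitute
# whatever your deployment actually exposes to the spider process.
node_index = int(os.environ.get("NODE_INDEX", "0"))
node_count = int(os.environ.get("NODE_COUNT", "1"))

start_urls = [f"https://example.com/category/{i}" for i in range(100)]

# Each node crawls only its own slice of the URL list, so running the
# same task on every node covers the full list without duplication.
my_urls = [u for i, u in enumerate(start_urls) if i % node_count == node_index]

print(f"Node {node_index}/{node_count} will crawl {len(my_urls)} URLs")
```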
Next Steps
After mastering task management, consider exploring these advanced topics: