Introduction

If you already know what Crawlab is and what it is used for, you can head straight to Quick Start or Installation to install and start using Crawlab.

If you are not familiar with Crawlab, you can read the sections below to learn more about it.

What is Crawlab?

Crawlab is a powerful Web Crawler Management Platform that can run web crawlers and spiders developed in various programming languages, including Python, Go, Node.js, Java and C#, as well as frameworks such as Scrapy, Colly, Selenium and Puppeteer. It is used for running, managing and monitoring web crawlers, particularly in production environments where traceability, scalability and stability are the main concerns.

Background and History

The Crawlab project has been under continuous development since it was first published in March 2019, and has gone through a number of major releases. It was initially designed to solve the management problem of coordinating and executing a large number of spiders. With many improvements and new features, Crawlab has become increasingly popular in developer communities, particularly amongst web crawler engineers.

Change Logs

Who can use Crawlab?

  • Web Crawler Engineers. By integrating web crawler programs into Crawlab, you can focus on crawling and parsing logic instead of spending time on common modules such as task queues, storage, logging and notifications.
  • Operations Engineers. The main benefit of Crawlab for operations engineers is convenient deployment, both of crawler programs and of Crawlab itself. Crawlab supports easy installation with Docker and Kubernetes.
  • Data Analysts. Data analysts who can code (e.g. in Python) can develop web crawler programs (e.g. with Scrapy) and upload them to Crawlab, then leave the rest of the work to Crawlab, which will collect data automatically (see the sketch after this list).
  • Others. Technically, anyone can enjoy the convenience of the automation provided by Crawlab. Although Crawlab specializes in running web crawler tasks, it can also be used for other types of tasks such as data processing and automation.
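As a rough illustration of the workflow described above, here is a minimal Scrapy spider of the kind that could be uploaded to Crawlab and run as a task. The spider name, target site and CSS selectors are purely illustrative, and how the yielded items end up stored as results depends on your Crawlab setup; see the rest of this documentation for details.

```python
# quotes_spider.py - a minimal, self-contained Scrapy spider.
# The spider name, target site and selectors are illustrative only.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block; when the spider runs as a
        # Crawlab task, these items can be collected as results.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination links so the crawl covers all pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Locally, this spider can be tried out with `scrapy runspider quotes_spider.py`; in Crawlab, the same code is uploaded as a spider and executed as a scheduled or ad-hoc task.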

Main Features

Core Components

💻 Node Management

  • Register & control multiple nodes in distributed systems
  • Real-time node monitoring & resource usage tracking

🕷️ Spider Operations

  • Multi-language spider support (Python/Node.js/Go/Java)
  • Framework integration (Scrapy/Colly/Selenium/Puppeteer)
  • Git version control & automatic deployment
  • Online code editor & file management

📋 Task Management

  • Distributed task scheduling & queueing
  • Real-time logging & execution monitoring
  • Detailed statistics & historical records

Key Capabilities

📦 Dependency Management

  • Python/Node.js package installation
  • Automatic dependency resolution

🔔 Notification System

  • Email notifications for task completion
  • Webhook integration support (see the sketch below)
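Webhook notifications can be consumed by any HTTP endpoint you control. Below is a minimal sketch of such a receiver, assuming Crawlab is configured to POST a JSON payload to it when a task finishes; the payload structure is not assumed here, so inspect the actual request body in your own setup.

```python
# webhook_receiver.py - a minimal HTTP endpoint that could receive
# Crawlab webhook notifications. It makes no assumptions about the
# payload fields; it simply logs whatever JSON body is posted.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body or b"{}")
        except json.JSONDecodeError:
            payload = {"raw": body.decode("utf-8", errors="replace")}
        # Replace this print with your own handling (alerting, metrics, ...).
        print("Received webhook:", json.dumps(payload, indent=2))
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")


if __name__ == "__main__":
    # Point the webhook notification channel at http://<host>:8000/
    HTTPServer(("0.0.0.0", 8000), WebhookHandler).serve_forever()
```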

👥 Team Collaboration

  • Multi-user access control
  • Role-based permissions

Infrastructure

🌐 Web-based Interface

  • Responsive dashboard
  • Cross-platform compatibility

🐳 Deployment Flexibility

  • Cloud-native architecture
  • Horizontal scaling capabilities