Data Integration

You can integrate your spiders with the Crawlab SDK, which allows you to view scraped results visually in Crawlab.

The Crawlab SDK supports integration with various web crawler frameworks, including Scrapy, and with multiple programming languages, including Python, Node.js, Go, and Java.

note

By default, the Crawlab Python SDK (crawlab-sdk) is pre-installed in the base Docker image of Crawlab, so you can use it directly inside Crawlab containers without any additional installation.
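
If your spiders run outside the Crawlab Docker image, the SDK can be installed from PyPI, assuming the package name crawlab-sdk given in the note above:

pip install crawlab-sdk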

Basic Usage

The code snippet below shows how to save a basic item with the Python SDK. The item is a dictionary with the key hello and the value crawlab; once the code is executed, the item is saved to the database and displayed in the Crawlab web interface.

from crawlab import save_item

# Save dictionary as item
save_item({'hello': 'crawlab'})
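
In a real spider you would typically call save_item once per scraped record. Below is a minimal sketch of that pattern; it uses only save_item from the snippet above, while the requests and BeautifulSoup libraries, the target URL, and the CSS selectors are illustrative assumptions, not part of the Crawlab SDK.

import requests
from bs4 import BeautifulSoup
from crawlab import save_item

# Hypothetical target page; replace the URL and selectors with your own.
response = requests.get('https://quotes.toscrape.com')
soup = BeautifulSoup(response.text, 'html.parser')

for quote in soup.select('div.quote'):
    # Each scraped record is saved to Crawlab as one item.
    save_item({
        'text': quote.select_one('span.text').get_text(),
        'author': quote.select_one('small.author').get_text(),
    })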

Scrapy

Scrapy is a very popular web crawler framework for efficient and scalable web crawling tasks in Python.

Integrating Scrapy with Crawlab is very easy: you only need to add crawlab.CrawlabPipeline to ITEM_PIPELINES in settings.py.

ITEM_PIPELINES = {
    'crawlab.CrawlabPipeline': 888,
}
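
With the pipeline registered, items yielded by your spiders pass through crawlab.CrawlabPipeline and are saved to Crawlab automatically; no extra calls are needed in spider code. Below is a minimal sketch of such a spider; the spider name, URL, and selectors are hypothetical.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Each yielded item goes through the configured item pipelines,
        # including crawlab.CrawlabPipeline, which saves it to Crawlab.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }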

More Examples

Crawlab also integrates easily with other web crawling frameworks.

See Examples for more detailed data integration examples.

Data Preview

Crawlab provides a data preview feature that allows users to inspect crawled data directly in the UI.

View Task Data

You can view task data following the steps below:

  1. Navigate to the detail page of a task
  2. Click on the Data tab to view the task data

View Spider Data

You can view spider data following the steps below:

  1. Navigate to the detail page of a spider
  2. Click on the Data tab to view the spider data