Data Integration
You can integrate your spiders with the Crawlab SDK, which allows you to view scraped results visually in Crawlab.
The Crawlab SDK supports integration with web crawler frameworks such as Scrapy, and with programming languages including Python, Node.js, Go, and Java.
By default, the Crawlab Python SDK (crawlab-sdk) is pre-installed in Crawlab's base image, so you can use it directly in the Crawlab Docker image.
Basic Usage
The code snippets below show how to save a basic item in different programming languages. The item is a dictionary with
key hello and value crawlab. Once the code is executed, the item is saved to the database and displayed in the Crawlab
web interface.
- Python
- Node.js
- Go
- Java
```python
from crawlab import save_item

# Save dictionary as item
save_item({'hello': 'crawlab'})
```
```js
const { saveItem } = require('@crawlab/sdk');

// Save object as item
saveItem({ hello: 'crawlab' });
```
Node.js is only supported in Crawlab Pro.
```go
package main

import "github.com/crawlab-team/crawlab-sdk-go"

func main() {
	// Save map as item
	crawlab.SaveItem(map[string]interface{}{
		"hello": "crawlab",
	})
}
```
Go is only supported in Crawlab Pro.
```java
import io.crawlab.sdk.CrawlabSdk;
import java.util.HashMap;

public class Main {
    public static void main(String[] args) {
        // Save HashMap as item
        CrawlabSdk.saveItem(new HashMap<String, Object>() {{
            put("hello", "crawlab");
        }});
    }
}
```
Java is only supported in Crawlab Pro.
Scrapy
Scrapy is a popular Python framework for efficient, scalable web crawling.
Integrating Scrapy with Crawlab is easy: simply add crawlab.CrawlabPipeline to ITEM_PIPELINES in settings.py.
```python
ITEM_PIPELINES = {
    'crawlab.CrawlabPipeline': 888,
}
```
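Under the hood, a Scrapy item pipeline exposes a process_item(item, spider) hook that Scrapy calls for every item a spider yields. The sketch below shows the general shape of such a pipeline with a stand-in save function; it is an illustration, not the actual CrawlabPipeline implementation:

```python
# Illustrative sketch of a Scrapy item pipeline. Scrapy calls
# process_item() for each yielded item; a pipeline like CrawlabPipeline
# forwards the item to Crawlab's result storage (stubbed here) and
# returns it so lower-priority pipelines can process it too.
saved = []

def save_item(item):
    """Stand-in for crawlab.save_item."""
    saved.append(item)

class SketchPipeline:
    def process_item(self, item, spider):
        save_item(dict(item))
        return item  # pass the item on to the next pipeline

pipeline = SketchPipeline()
for item in [{"hello": "crawlab"}]:
    pipeline.process_item(item, spider=None)
```

The number 888 in ITEM_PIPELINES is the pipeline's priority: pipelines run in ascending order, so a high value lets other pipelines transform the item before it is saved.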
More Examples
Crawlab integrates easily with other web crawling frameworks.
Refer to Examples for more detailed data integration examples.
Data Preview
Crawlab provides a data preview feature that allows users to inspect crawled results directly in the UI.
View Task Data
You can view task data by following the steps below:
- Navigate to the Tasks detail page
- Click on the Data tab to view the task data
View Spider Data
You can view spider data by following the steps below:
- Navigate to the Spiders detail page
- Click on the Data tab to view the spider data