
Node.js Spider

Introduction to Node.js Web Scraping

Node.js is a powerful platform for building web scrapers due to its asynchronous nature and rich ecosystem of libraries. Popular packages like Axios, Cheerio, and Puppeteer make it easy to fetch, parse, and interact with web content.
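
As a quick illustration of how two of these libraries fit together, here is a minimal sketch (the URL and selector are placeholders, not part of this guide's target site) that fetches a page with Axios and queries it with Cheerio:

const axios = require('axios');
const cheerio = require('cheerio');

async function main() {
  // Fetch the raw HTML of a page (placeholder URL)
  const response = await axios.get('https://example.com');

  // Load the HTML into Cheerio and query it with CSS selectors
  const $ = cheerio.load(response.data);
  $('h1').each((_, el) => {
    console.log($(el).text().trim());
  });
}

main().catch(console.error);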

Integrating Node.js Scrapers with Crawlab

This guide demonstrates how to create a Node.js-based web scraper and integrate it with Crawlab to collect job listings from a fictional job board website.

Creating the Spider in Crawlab

  1. In the Crawlab web interface, navigate to the Spider list
  2. Click the "New Spider" button
  3. Fill in the following details:
    • Name: "job_board_scraper"
    • Execute Command: node index.js
    • Parameter: (leave empty for now; a job category can be passed here later, which index.js reads via CRAWLAB_SPIDER_PARAM)
  4. Click "Confirm" to create the spider

Setting Up the Node.js Project

After creating the spider, we need to set up a Node.js project. Here's the structure we'll use:

job_board_scraper/
├── index.js
├── package.json
└── package-lock.json
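
If you're assembling this project locally before uploading it to Crawlab, the structure can be created with standard npm commands (a sketch; it assumes npm is available and that the crawlab-sdk package listed below is reachable from your registry):

mkdir job_board_scraper && cd job_board_scraper
npm init -y
npm install axios cheerio crawlab-sdk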

1. package.json

{
  "name": "job_board_scraper",
  "version": "1.0.0",
  "description": "A job board scraper for Crawlab",
  "main": "index.js",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "axios": "^1.3.4",
    "cheerio": "^1.0.0-rc.12",
    "crawlab-sdk": "^0.6.0"
  }
}

2. index.js

const axios = require('axios');
const cheerio = require('cheerio');
const { saveItem } = require('crawlab-sdk');

// Configuration
const baseUrl = 'https://example-jobs.com';
const startUrl = `${baseUrl}/jobs?page=1`;

// Extract job parameters from Crawlab
let jobCategory = '';
if (process.env.CRAWLAB_SPIDER_PARAM) {
  try {
    const params = JSON.parse(process.env.CRAWLAB_SPIDER_PARAM);
    jobCategory = params.category || '';
  } catch (err) {
    console.error('Error parsing parameters:', err.message);
  }
}

// Use category in the URL if provided
const initialUrl = jobCategory
  ? `${baseUrl}/jobs/category/${jobCategory}?page=1`
  : startUrl;

// Main scraping function
async function scrapeJobListings() {
  try {
    // Start with the first page
    await scrapeJobPage(initialUrl);
  } catch (error) {
    console.error(`Error in main scraper: ${error.message}`);
  }
}

// Function to scrape a single page of job listings
async function scrapeJobPage(url) {
  try {
    console.log(`Scraping page: ${url}`);

    // Fetch the page content
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
      }
    });

    // Parse the HTML
    const $ = cheerio.load(response.data);

    // Extract job listings
    const jobListings = $('.job-card');
    console.log(`Found ${jobListings.length} job listings on this page`);

    // Process each job listing
    for (let i = 0; i < jobListings.length; i++) {
      const jobCard = $(jobListings[i]);

      // Extract job data
      const title = jobCard.find('h2.job-title').text().trim();
      const company = jobCard.find('div.company-name').text().trim();
      const location = jobCard.find('div.job-location').text().trim();
      const salary = jobCard.find('div.job-salary').text().trim() || 'Not specified';
      const jobType = jobCard.find('span.job-type').text().trim();
      const postedDate = jobCard.find('div.posted-date').text().trim();
      const jobUrl = baseUrl + jobCard.find('a.job-link').attr('href');
      const logoUrl = jobCard.find('img.company-logo').attr('src');

      // Create job object
      const job = {
        title,
        company,
        location,
        salary,
        jobType,
        postedDate,
        url: jobUrl,
        logoUrl,
        scrapedAt: new Date().toISOString()
      };

      // Save to Crawlab
      saveItem(job);
      console.log(`Saved job: ${title} at ${company}`);
    }

    // Check if there's a next page
    const nextPageLink = $('a.pagination-next');
    if (nextPageLink.length > 0) {
      const nextPageUrl = baseUrl + nextPageLink.attr('href');

      // Add a small delay to avoid overloading the server
      await new Promise(resolve => setTimeout(resolve, 1000));

      // Recursively scrape the next page
      await scrapeJobPage(nextPageUrl);
    }
  } catch (error) {
    console.error(`Error scraping page ${url}: ${error.message}`);
  }
}

// Start the scraper
scrapeJobListings().then(() => {
  console.log('Scraping completed successfully');
}).catch(err => {
  console.error('Scraping failed:', err.message);
  process.exit(1);
});

Key Integration Points

The main integration points with Crawlab are:

  1. Importing the Crawlab SDK:

    const { saveItem } = require('crawlab-sdk');
  2. Saving data to Crawlab:

    saveItem(job);
  3. Reading parameters from Crawlab (a local test sketch follows this list):

    if (process.env.CRAWLAB_SPIDER_PARAM) {
      try {
        const params = JSON.parse(process.env.CRAWLAB_SPIDER_PARAM);
        // Use parameters
      } catch (err) {
        console.error('Error parsing parameters:', err.message);
      }
    }
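
To verify the parameter handling outside Crawlab, a quick local test might look like the sketch below (the category value is only an example):

// Simulate the parameter value that Crawlab would normally inject
process.env.CRAWLAB_SPIDER_PARAM = JSON.stringify({ category: 'engineering' });

let jobCategory = '';
if (process.env.CRAWLAB_SPIDER_PARAM) {
  try {
    const params = JSON.parse(process.env.CRAWLAB_SPIDER_PARAM);
    jobCategory = params.category || '';
  } catch (err) {
    console.error('Error parsing parameters:', err.message);
  }
}
console.log(`Scraping category: ${jobCategory || '(all categories)'}`);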

Environment Setup

Crawlab Pro

If you're using Crawlab Pro, Node.js comes pre-installed on your nodes, and the browsers required by Puppeteer are pre-installed as well, so you can run JavaScript-based spiders without additional configuration.

You just need to upload your project files and install the dependencies:

  1. Navigate to your spider's detail page
  2. Click on the "Files" tab
  3. Upload both index.js and package.json files
  4. Use the Dependency Management feature in the Crawlab UI to install your dependencies from the package.json file

Crawlab Community

If you're using Crawlab Community, you'll need to ensure Node.js is installed on your nodes:

  1. SSH into your node server
  2. Install Node.js following the official installation instructions
  3. Verify installation with node -v and npm -v

Then, upload your project files and install dependencies:

  1. Navigate to your spider's detail page
  2. Click on the "Files" tab
  3. Upload both index.js and package.json files
  4. Connect to your node using the terminal and navigate to your spider directory
  5. Run the following command to install dependencies:
    npm install

Running the Spider

  1. In Crawlab, navigate to your spider's detail page
  2. Click the "Run" button
  3. Select the desired node for execution
  4. Click "Confirm" to start the spider

After the spider completes, you can view the collected job listings in the "Data" tab of your spider's detail page.

Troubleshooting

If you encounter errors when running your Node.js spider in Crawlab, check the following:

  1. Missing Node.js environment: Ensure Node.js is installed on your Crawlab nodes
  2. Dependency issues: Verify that all required packages are installed with npm install
  3. Network connectivity: Check if your nodes have internet access to the target website
  4. Empty results: The website structure may have changed; update your selectors (a selector-debugging sketch follows below)
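
For the empty-results case, a small standalone script can confirm whether the selectors still match anything on the live page. This is a sketch using the same Axios/Cheerio stack; the URL is the example one used throughout this guide:

const axios = require('axios');
const cheerio = require('cheerio');

async function debugSelectors(url) {
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Report how many elements each selector matches on the live page
  ['.job-card', 'h2.job-title', 'a.pagination-next'].forEach(selector => {
    console.log(`${selector}: ${$(selector).length} match(es)`);
  });

  // Print the start of the HTML to spot redirects, captchas, or error pages
  console.log(response.data.slice(0, 500));
}

debugSelectors('https://example-jobs.com/jobs?page=1').catch(console.error);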

Advanced Configuration

Using Puppeteer for JavaScript-Heavy Sites

For websites that require JavaScript rendering, you can use Puppeteer instead of Axios and Cheerio:

// Add to package.json dependencies:
// "puppeteer": "^19.7.2"

const puppeteer = require('puppeteer');
const { saveItem } = require('crawlab-sdk');

async function scrapeDynamicWebsite() {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();

    // Navigate to the target page
    await page.goto('https://example-dynamic-jobs.com/listings', {
      waitUntil: 'networkidle2'
    });

    // Wait for the job listings to load
    await page.waitForSelector('.job-card');

    // Extract job data
    const jobListings = await page.evaluate(() => {
      const jobs = [];

      document.querySelectorAll('.job-card').forEach(card => {
        jobs.push({
          title: card.querySelector('h2.job-title').innerText.trim(),
          company: card.querySelector('div.company-name').innerText.trim(),
          location: card.querySelector('div.job-location').innerText.trim(),
          url: card.querySelector('a.job-link').href
        });
      });

      return jobs;
    });

    // Save each job to Crawlab
    jobListings.forEach(job => {
      saveItem(job);
      console.log(`Saved job: ${job.title} at ${job.company}`);
    });

  } finally {
    // Always close the browser
    await browser.close();
  }
}

scrapeDynamicWebsite().catch(console.error);

Rate Limiting and Retry Logic

To make your scraper more robust, add rate limiting and retry logic:

// Utility function for delay
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Function with retry logic
async function fetchWithRetry(url, maxRetries = 3) {
  let attempts = 0;

  while (attempts < maxRetries) {
    try {
      // Add a random delay between 1-3 seconds
      const randomDelay = 1000 + Math.random() * 2000;
      await delay(randomDelay);

      return await axios.get(url, {
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        },
        timeout: 10000 // 10 second timeout
      });
    } catch (error) {
      attempts++;
      console.error(`Attempt ${attempts} failed for URL ${url}: ${error.message}`);

      if (attempts >= maxRetries) {
        throw new Error(`Maximum retry attempts reached for URL ${url}`);
      }

      // Exponential backoff
      await delay(Math.pow(2, attempts) * 1000);
    }
  }
}
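
With this helper in place, the direct axios.get call in scrapeJobPage can be swapped out, for example:

// Inside scrapeJobPage, use the retrying helper instead of calling axios directly
const response = await fetchWithRetry(url);
const $ = cheerio.load(response.data);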