API Reference

WebScraper Class

new WebScraper(options?: ScraperOptions)

Creates a new instance of the WebScraper class.

Parameters:

  • options (optional) - ScraperOptions - An object containing configuration options for the scraper.

    • usePuppeteer - boolean (optional) - Whether to use Puppeteer for JavaScript-rendered pages. Default: true.
    • throttle - number (optional) - Delay in milliseconds between requests. Default: 1000.
    • rules - Record<string, string> - CSS selectors defining data extraction rules.

Returns:

  • A new instance of WebScraper.

Example:

import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'h1', content: 'p' },
});
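
The throttle option is described above as a delay in milliseconds between requests. The sketch below illustrates that idea in isolation; it is not the library's implementation, and `sleep` and `runThrottled` are hypothetical names introduced only for this example.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs tasks one at a time and waits `throttleMs`
// between consecutive tasks (no wait before the first task).
async function runThrottled<T>(
  tasks: Array<() => Promise<T>>,
  throttleMs: number,
): Promise<T[]> {
  const results: T[] = [];
  let first = true;
  for (const task of tasks) {
    if (!first) await sleep(throttleMs);
    first = false;
    results.push(await task());
  }
  return results;
}
```

With a throttle of 1000, three tasks would take at least two seconds in this sketch, because the delay is applied between tasks rather than before each one.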

Methods

scrape(url: string): Promise<Record<string, any>>

Scrapes the given URL based on the configured options.

Parameters:

  • url - string - The webpage URL to scrape.

Returns:

  • Promise<Record<string, any>> - The extracted data as an object.

Example:

const data = await scraper.scrape('https://example.com');
console.log(data);
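
Because scrape returns a promise, a failed request surfaces as a rejection. One way to scrape several pages without letting a single failure abort the batch is sketched below. `ScraperLike`, `ScrapeOutcome`, and `scrapeAll` are hypothetical names for this example; the only thing assumed about the scraper is the scrape(url) method documented above.

```typescript
// Minimal shape this sketch relies on: just the documented `scrape` method.
interface ScraperLike {
  scrape(url: string): Promise<Record<string, unknown>>;
}

// Hypothetical result type: either `data` or `error` is set for each URL.
interface ScrapeOutcome {
  url: string;
  data?: Record<string, unknown>;
  error?: string;
}

// Hypothetical helper: scrapes URLs sequentially and records failures
// instead of rethrowing them.
async function scrapeAll(
  scraper: ScraperLike,
  urls: string[],
): Promise<ScrapeOutcome[]> {
  const outcomes: ScrapeOutcome[] = [];
  for (const url of urls) {
    try {
      outcomes.push({ url, data: await scraper.scrape(url) });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      outcomes.push({ url, error: message });
    }
  }
  return outcomes;
}
```

A WebScraper instance should fit the `ScraperLike` shape, so `await scrapeAll(scraper, ['https://example.com'])` would be a typical call.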

exportToJSON(data: any, filePath: string): void

Exports the given data to a JSON file.

Parameters:

  • data - any - The data to be exported.
  • filePath - string - The path where the JSON file should be saved.

Returns:

  • void

Example:

import { exportToJSON } from 'simple-web-scraper';

const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');
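
For readers curious what an export like this involves, here is a minimal sketch built on Node's fs module. It is a stand-in, not the library's exportToJSON: `writeJsonFile` is a hypothetical name, and the two-space indentation and the temporary output path are assumptions.

```typescript
import { readFileSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical stand-in: serializes `data` and writes it synchronously.
function writeJsonFile(data: unknown, filePath: string): void {
  // Pretty-print with two-space indentation (an assumption of this sketch).
  const json = JSON.stringify(data, null, 2);
  writeFileSync(filePath, json + '\n', 'utf8');
}

// Round trip: write the file, then read it back to confirm the contents.
const filePath = join(tmpdir(), 'output.json');
writeJsonFile({ name: 'Example', value: 42 }, filePath);
const roundTrip = JSON.parse(readFileSync(filePath, 'utf8'));
console.log(roundTrip);
```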

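exportToCSV(data: any | any[], filePath: string, options?: { preserveNulls?: boolean }): void

Exports the given data to a CSV file.

Parameters:

  • data - any | any[] - The data to be exported: a single object or an array of objects.
  • filePath - string - The path where the CSV file should be saved.
  • options (optional) - object - Export options.

    • preserveNulls - boolean (optional) - Whether to preserve nullish values in the output, as in the example below.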

Returns:

  • void

Example:

import { exportToCSV } from 'simple-web-scraper';

const data = [
  { name: 'Example 1', value: 42 },
  { name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');

// preserve nullish values
exportToCSV(data, 'output.csv', { preserveNulls: true });
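
Since data may be a single object or an array, a CSV export has to normalize the input and escape field values. The sketch below shows one way to do that as a pure function. It is not the library's implementation: `Row`, `escapeCsvField`, and `toCsv` are hypothetical names, rendering nullish values as empty fields is an assumption, and the preserveNulls option is deliberately not modeled.

```typescript
type Row = Record<string, unknown>;

// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded quotes (RFC 4180 style).
function escapeCsvField(value: unknown): string {
  // Assumption of this sketch: null and undefined become empty fields.
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical helper: accepts one row or many and returns CSV text.
function toCsv(data: Row | Row[]): string {
  const rows = Array.isArray(data) ? data : [data];

  // Header row: every key seen across all rows, in first-seen order.
  const headers: string[] = [];
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (!headers.includes(key)) headers.push(key);
    }
  }

  const lines = [headers.map(escapeCsvField).join(',')];
  for (const row of rows) {
    lines.push(headers.map((header) => escapeCsvField(row[header])).join(','));
  }
  return lines.join('\n');
}
```

Keeping serialization separate from file writing makes the escaping rules easy to check without touching the disk.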

Backend Example

This example demonstrates how to use simple-web-scraper in a Node.js backend:

import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from 'simple-web-scraper';

const app = express();
const scraper = new WebScraper({
  usePuppeteer: true,
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

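app.get('/scrape-example', async (_req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // Export as JSON
    exportToCSV(data, 'output.csv'); // Export as CSV

    // Preserve nullish values while exporting to CSV
    exportToCSV(data, 'outputPreserveNulls.csv', { preserveNulls: true });

    res.status(200).json({ success: true, data });
  } catch (error) {
    // A caught value is `unknown` in strict TypeScript, so narrow it before reading `.message`
    const message = error instanceof Error ? error.message : String(error);
    res.status(500).json({ success: false, error: message });
  }
});

// Start the server; port 3000 is an arbitrary choice for this example
app.listen(3000);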

Contributing

Contributions are welcome! Please submit issues or pull requests on GitHub.