API Reference

WebScraper Class

`new WebScraper(options?: ScraperOptions)`

Creates a new instance of the WebScraper class.

Parameters:

options (optional) - ScraperOptions - An object containing configuration options for the scraper.
- usePuppeteer - boolean (optional) - Whether to use Puppeteer for JavaScript-rendered pages. Default: true.
- throttle - number (optional) - Delay in milliseconds between requests. Default: 1000.
- rules - Record<string, string> - CSS selectors defining data extraction rules.

Returns:

A new instance of WebScraper.

Example:

import { WebScraper } from 'simple-web-scraper';

const scraper = new WebScraper({
  usePuppeteer: true,
  rules: { title: 'h1', content: 'p' },
});

Methods

`scrape(url: string): Promise<Record<string, any>>`

Scrapes the given URL based on the configured options.

Parameters:

url - string - The webpage URL to scrape.

Returns:

Promise<Record<string, any>> - The extracted data as an object.

Example:

const data = await scraper.scrape('https://example.com');
console.log(data);

`exportToJSON(data: any, filePath: string): void`

Exports the given data to a JSON file.

Parameters:

data - any - The data to be exported.
filePath - string - The path where the JSON file should be saved.

Returns:

void

Example:

import { exportToJSON } from 'simple-web-scraper';

const data = { name: 'Example', value: 42 };
exportToJSON(data, 'output.json');

`exportToCSV(data: any | any[], filePath: string): void`

Exports the given data to a CSV file.

Parameters:

data - any | any[] - The data to be exported.
filePath - string - The path where the CSV file should be saved.

Returns:

void

Example:

import { exportToCSV } from 'simple-web-scraper';

const data = [
  { name: 'Example 1', value: 42 },
  { name: 'Example 2', value: 99 },
];
exportToCSV(data, 'output.csv');

// preserve nullish values
exportToCSV(data, 'output.csv', { preserveNulls: true });

Backend Example

This example demonstrates how to use simple-web-scraper in a Node.js backend:

import express from 'express';
import { WebScraper, exportToJSON, exportToCSV } from 'simple-web-scraper';

const app = express();
const scraper = new WebScraper({
  usePuppeteer: true,
  rules: {
    fullHTML: 'html', // Entire page HTML
    title: 'head > title', // Page title
    description: 'meta[name="description"]', // Meta description
    keywords: 'meta[name="keywords"]', // Meta keywords
    favicon: 'link[rel="icon"]', // Favicon URL
    mainHeading: 'h1', // First H1 heading
    allHeadings: 'h1, h2, h3, h4, h5, h6', // All headings on the page
    firstParagraph: 'p', // First paragraph
    allParagraphs: 'p', // All paragraphs on the page
    links: 'a', // All links on the page
    images: 'img', // All image URLs
    imageAlts: 'img', // Alternative text for images
    videos: 'video, iframe[src*="youtube.com"], iframe[src*="vimeo.com"]', // Video sources
    tables: 'table', // Capture table elements
    tableData: 'td', // Capture table cells
    lists: 'ul, ol', // Capture all lists
    listItems: 'li', // Capture all list items
    scripts: 'script', // JavaScript file sources
    stylesheets: 'link[rel="stylesheet"]', // External CSS files
    structuredData: 'script[type="application/ld+json"]', // JSON-LD structured data
    socialLinks:
      'a[href*="facebook.com"], a[href*="twitter.com"], a[href*="linkedin.com"], a[href*="instagram.com"]', // Social media links
    author: 'meta[name="author"]', // Author meta tag
    publishDate: 'meta[property="article:published_time"], time', // Publish date
    modifiedDate: 'meta[property="article:modified_time"]', // Last modified date
    canonicalURL: 'link[rel="canonical"]', // Canonical URL
    openGraphTitle: 'meta[property="og:title"]', // OpenGraph title
    openGraphDescription: 'meta[property="og:description"]', // OpenGraph description
    openGraphImage: 'meta[property="og:image"]', // OpenGraph image
    twitterCard: 'meta[name="twitter:card"]', // Twitter card type
    twitterTitle: 'meta[name="twitter:title"]', // Twitter title
    twitterDescription: 'meta[name="twitter:description"]', // Twitter description
    twitterImage: 'meta[name="twitter:image"]', // Twitter image
  },
});

app.get('/scrape-example', async (req, res) => {
  try {
    const url = 'https://github.com/The-Node-Forge';
    const data = await scraper.scrape(url);

    exportToJSON(data, 'output.json'); // export JSON
    exportToCSV(data, 'output.csv'); // export CSV

    // preserve nullish values while exporting to csv
    exportToCSV(data, 'outputPreserveNulls.csv', { preserveNulls: true });

    res.status(200).json({ success: true, data });
  } catch (error) {
    res.status(500).json({ success: false, error: error.message });
  }
});

Contributing

Contributions are welcome! Please submit issues or pull requests on GitHub.

WebScraper Class​

new WebScraper(options?: ScraperOptions)​

Methods​

scrape(url: string): Promise<Record<string, any>>​

exportToJSON(data: any, filePath: string): void​

exportToCSV(data: any | any[], filePath: string): void​

Backend Example​

Contributing​

WebScraper Class

`new WebScraper(options?: ScraperOptions)`

Methods

`scrape(url: string): Promise<Record<string, any>>`

`exportToJSON(data: any, filePath: string): void`

`exportToCSV(data: any | any[], filePath: string): void`

Backend Example

Contributing