2.0.0-beta.12 • Published 7 months ago

@marbec/web-auto-extractor v2.0.0-beta.12

License
MIT

Web Auto Extractor 2.0

This project is a fork of indix/web-auto-extractor.

Parse semantically structured information from any HTML webpage.

Supported formats:

  • Encodings that support Schema.org vocabularies:
    • Microdata
    • RDFa-lite
    • JSON-LD
  • Meta tags

Many websites mark up their webpages with Schema.org vocabularies to improve SEO. This library parses that information into JSON.

Installation

npm i --save @marbec/web-auto-extractor

Usage

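import WebAutoExtractor from '@marbec/web-auto-extractor';

// Placeholder input: any HTML document as a string.
const sampleHTML = '<html><head><meta name="description" content="Example page"></head><body></body></html>';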

const parsed = new WebAutoExtractor({
  // Add location information to the root elements in the parsed data.
  // Location is stored as start,end offset values in the @location property.
  addLocation: false,

  // Embed the source HTML of each root element in the parsed data under the @source property.
  // Set to true to embed sources for all data types, or pass an array of data types to embed sources only for those.
  embedSource: false,
}).parse(sampleHTML);

// Output format
/* {
    "metatags": {},
    "microdata": {},
    "rdfa": {},
    "jsonld": {}
} */
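
The result always has the four top-level keys shown above, so a common first step is to check which formats actually yielded data. The following is a minimal sketch of that check. The helper name nonEmptyFormats and the contents of exampleResult are assumptions made for illustration; they are not part of the library, and the inner shape of each format may differ from what is shown here.

```javascript
// Hypothetical helper, not part of @marbec/web-auto-extractor.
// Returns the names of the top-level formats that contain any data.
function nonEmptyFormats(parsed) {
  return ['metatags', 'microdata', 'rdfa', 'jsonld'].filter((format) => {
    const value = parsed[format];
    return value != null && Object.keys(value).length > 0;
  });
}

// Stand-in for a parse() result. Only the four top-level keys come from
// the README; the nested values are illustrative assumptions.
const exampleResult = {
  metatags: { description: ['An example page'] },
  microdata: {},
  rdfa: {},
  jsonld: { Product: [{ '@type': 'Product', name: 'Example' }] },
};

console.log(nonEmptyFormats(exampleResult));
```

In real use, the value returned by parse() would be passed in place of exampleResult.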

Browser

You can run the parser directly in the browser on any website using the following commands:

const { default: WebAutoExtractor } = await import(
  'https://unpkg.com/@marbec/web-auto-extractor@latest/dist/index.js'
);
new WebAutoExtractor().parse(document.documentElement.outerHTML);

Examples

See the test cases for sample inputs and outputs.
