0.8.1 • Published 12 months ago

lusail v0.8.1

lusail-js

JavaScript implementation of Lusail, a domain-specific language for extracting structured data from HTML.

What is Lusail?

Lusail is an extensible domain-specific language designed to make it easy to express the structure of the data that needs to be extracted from an HTML document. It relies on a combination of field definitions and transformation pipelines to dictate data extraction and processing for each field. The transforms within a pipeline process input data sequentially, with each transform receiving the output of its predecessor, applying its specific logic, and then passing the result to the subsequent transform.
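
The sequential hand-off described above can be pictured with a small sketch. This is not lusail's implementation: the `Step` type, `runPipeline`, and the three stand-in steps are hypothetical names invented for illustration, and the sketch is synchronous for brevity, whereas the library's transformers are asynchronous.

```typescript
// Illustrative sketch only; these names are hypothetical and not part of lusail's API.
type Step = (input: unknown) => unknown;

// Feeds the initial input through each step in order: the output of one step
// becomes the input of the next.
function runPipeline(input: unknown, steps: Step[]): unknown {
  return steps.reduce((value, step) => step(value), input);
}

// Stand-ins for "select elements", "pick the first", and "get its text".
const selectTitles: Step = () => [' Lusail ', ' Other '];
const pickFirst: Step = (value) => (value as string[])[0];
const trimText: Step = (value) => (value as string).trim();

const pageTitle = runPipeline('<html>...</html>', [selectTitles, pickFirst, trimText]);
```

Reordering the steps changes the result, which is why the order of entries in a field's pipeline matters.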

A Lusail template can be defined using any object notation. Here's a simple example of a Lusail template in YAML:

# Get the text content of the first element matching the CSS selector "title" and assign it to the
# field "pageTitle".
pageTitle:
  - cssSelector: title
  - get: single
  - get: text
# Get the text content of the first element matching the CSS selector ".description" and assign it
# to the field "pageDescription".
pageDescription:
  - cssSelector: .description
  - get: single
  - get: text
# Get all the href attributes of elements matching the CSS selector "body > a" and assign the
# resulting array to the "links" field.
links:
  - cssSelector: "body > a"
  - attribute: href
# Get all elements matching the CSS selector ".post" and extract certain fields from each.
posts:
  - cssSelector: .post
  - fields:
      title:
        - cssSelector: .title
        - get: single
        - get: text
      content:
        - cssSelector: .content
        - get: single
        - get: text

Now consider this HTML document:

<html>
  <head>
    <title>Lusail</title>
  </head>
  <body>
    <h1 class="description">JavaScript implementation of Lusail</h1>
    <a href="https://www.example.com">Example</a>
    <a href="https://www.github.com">Example 2</a>
    <div class="post">
      <h2 class="title">Post 1</h2>
      <p class="content">Content 1.</p>
    </div>
    <div class="post">
      <h2 class="title">Post 2</h2>
      <p class="content">Content 2.</p>
    </div>
  </body>
</html>

Applying the above template to the given HTML document will produce:

{
  "pageTitle": "Lusail",
  "pageDescription": "JavaScript implementation of Lusail",
  "links": [ "https://www.example.com", "https://www.github.com" ],
  "posts": [
    { "title": "Post 1", "content": "Content 1." },
    { "title": "Post 2", "content": "Content 2." }
  ]
}

This library is a JavaScript parser for the Lusail language.

Installation

npm install --save lusail

Usage

Create a Lusail instance by passing in a template as a JavaScript object:

import { Lusail, LusailTemplate } from 'lusail';

// Define your Lusail template.
const template: LusailTemplate = {
  pageTitle: [
    { cssSelector: '.title' },
    { get: 'single', index: 0 },
    { get: 'text' },
  ],
};

// Create a Lusail instance.
const lusail = new Lusail(template);

Or define it as a YAML template for a more concise structure:

import { Lusail } from 'lusail';

const yamlTemplate = `
pageTitle:
  - cssSelector: .title
  - get: single
    index: 0
  - get: text
`;

const lusail = Lusail.fromYaml(yamlTemplate);

Then parse your HTML as a string:

const result = await lusail.parseFromString(html);

Or let it fetch the HTML from a URL:

const result = await lusail.parseFromUrl(url);

Supported Transforms

Single

Retrieves a single element from an array by index.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | If index is not defined | single |
| index | The index to pick | If getBy is not specified | 0 |

Range

Retrieves a range of elements by start and end indexes.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | If none of the other properties are provided | range |
| start | The starting index of the range | If none of the other properties are provided | 0 |
| end | The ending index of the range | If none of the other properties are provided | End of the input array |
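
As a rough illustration of how the Single and Range defaults behave, here is a minimal sketch. `pickSingle` and `pickRange` are hypothetical helpers, not lusail exports, and the sketch assumes that `end` is exclusive (as in `Array.prototype.slice`), which the tables above do not state.

```typescript
// Illustrative sketch only; these helpers are hypothetical, not lusail exports.
interface RangeOptions {
  start?: number;
  end?: number;
}

// Picks one element by index (default 0, per the Single table).
function pickSingle<T>(items: T[], index = 0): T | undefined {
  return items[index];
}

// Picks a sub-array. ASSUMPTION: `end` is exclusive, as in Array.prototype.slice.
// Defaults follow the Range table: start = 0, end = end of the input array.
function pickRange<T>(items: T[], { start = 0, end = items.length }: RangeOptions = {}): T[] {
  return items.slice(start, end);
}

const posts = ['Post 1', 'Post 2', 'Post 3', 'Post 4'];
const firstPost = pickSingle(posts);
const middlePosts = pickRange(posts, { start: 1, end: 3 });
const fromSecond = pickRange(posts, { start: 1 });
```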

CSS Selector

Retrieves elements matching the given selector.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | cssSelector |
| cssSelector | The CSS selector to match elements | Yes | - |

Element Text

Retrieves the text content of input element(s).

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Triggers this transform | Yes | text |

Attribute

Retrieves the value of the specified attribute of input element(s).

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | attribute |
| attribute | The name of the attribute to retrieve | Yes | - |

Cast

Casts incoming value(s) to a target type.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | cast |
| castTo | The field type to cast the value to | Yes | - |

Date

Casts incoming value(s) to date(s), using an optional format and locale.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | cast |
| castTo | The field type to cast the value to | Yes | date |
| format | The format of the date string or 'timeAgo' for relative time | No | ISO 8601 format |
| locale | The locale to be used when parsing the date | No | - |

Regex

Applies a regular expression substitution to the input value(s).

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | regex |
| regex | The regex pattern to apply | Yes | - |
| replaceWith | The string to replace matched patterns with | No | '$1' |
| requireMatch | Whether to pass the input value if it does not match the pattern | No | false |
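
To make the interplay of `replaceWith` and `requireMatch` more concrete, here is a minimal sketch using the built-in `RegExp` and `String.prototype.replace`. `applyRegex` is a hypothetical helper, not a lusail export. The sketch assumes that a non-matching value passes through unchanged by default and is dropped when `requireMatch` is true; the library's actual behavior may differ.

```typescript
// Illustrative sketch only; `applyRegex` is a hypothetical helper, not a lusail export.
interface RegexOptions {
  regex: string;
  replaceWith?: string; // defaults to '$1', per the table
  requireMatch?: boolean; // defaults to false, per the table
}

// ASSUMPTION: when requireMatch is true, a non-matching value is dropped (undefined);
// otherwise it passes through unchanged.
function applyRegex(value: string, options: RegexOptions): string | undefined {
  const { regex, replaceWith = '$1', requireMatch = false } = options;
  const pattern = new RegExp(regex);
  if (!pattern.test(value)) {
    return requireMatch ? undefined : value;
  }
  // '$1' in the replacement string refers to the first capture group.
  return value.replace(pattern, replaceWith);
}

const price = applyRegex('Price: 42 USD', { regex: '^Price: (\\d+) USD$' });
const passedThrough = applyRegex('n/a', { regex: '(\\d+)' });
const dropped = applyRegex('n/a', { regex: '(\\d+)', requireMatch: true });
```

With the default `replaceWith` of `'$1'`, a pattern that spans the whole input effectively extracts the first capture group.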

Extract Fields

Extracts fields by applying a sub-template to the input.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | fields |
| fields | The LusailTemplate for extracting fields | Yes | - |

Follow Links

Follows links from input strings and extracts fields given by a sub-template.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | followingLinks, followLinks, or links |
| followLinks | The LusailTemplate to apply to the linked content | Yes | - |

Literal

Transforms the input(s) into a fixed literal value.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | literal |
| literal | The fixed literal value | Yes | - |

Hoist

Hoists nested fields to the top level of the result.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Triggers this transform | Yes | hoist or hoisting |

Existence

Determines whether the value transformed up to this point exists or not. Existence is determined by truthiness. If the value is an array, then existence is determined by the existence of a truthy value in the array.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | existence or exists |
| exists | Whether to check for existence (true) or absence (false) | No | true |
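
The truthiness rule described above can be sketched in a few lines. `checkExistence` is a hypothetical helper written for illustration, not the library's code.

```typescript
// Illustrative sketch only; `checkExistence` is a hypothetical helper, not a lusail export.
// A scalar "exists" when it is truthy; an array "exists" when it holds at least one truthy value.
// With exists = false the check is inverted, so the result reports absence instead.
function checkExistence(value: unknown, exists = true): boolean {
  const present = Array.isArray(value) ? value.some(Boolean) : Boolean(value);
  return exists ? present : !present;
}

const hasText = checkExistence('Post 1');
const emptyString = checkExistence('');
const emptyArray = checkExistence([]);
const mixedArray = checkExistence(['', 0, 'a']);
const absent = checkExistence([null], false);
```

Under this reading, an empty array and an array holding only falsy values both count as absent.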

Map

Replaces incoming values with new values using a key/value map.

| Property | Description | Required | Default / required value |
| --- | --- | --- | --- |
| getBy | Explicitly triggers this transform | No | map or mapping |
| map | The map to use for the conversion | Yes | - |
| strict | Whether to allow unmatched values to pass | No | false |
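
A minimal sketch of a key/value replacement with a `strict` switch follows. `applyMap` is a hypothetical helper, not a lusail export. The sketch assumes that unmatched values pass through unchanged when `strict` is false (the default) and are dropped when it is true; check the API documentation for the library's actual behavior.

```typescript
// Illustrative sketch only; `applyMap` is a hypothetical helper, not a lusail export.
// ASSUMPTION: strict = true drops unmatched values (undefined);
// strict = false (the default) lets them pass through unchanged.
function applyMap(
  value: string,
  map: Record<string, string>,
  strict = false
): string | undefined {
  // Use hasOwnProperty so inherited keys such as "toString" are not treated as matches.
  if (Object.prototype.hasOwnProperty.call(map, value)) {
    return map[value];
  }
  return strict ? undefined : value;
}

const statusMap: Record<string, string> = { 'In stock': 'available', 'Sold out': 'unavailable' };
const mapped = applyMap('In stock', statusMap);
const unmatchedLoose = applyMap('Pre-order', statusMap);
const unmatchedStrict = applyMap('Pre-order', statusMap, true);
```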

Adding Custom Transforms

Lusail-js allows you to extend its functionality by registering custom transformations. These additional transformations can then be used in your Lusail templates.

Here's how to create and register a custom transform:

1. Create a custom transformer

Implement a custom transformer that extends the Transformer abstract class.

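import { FieldTransform, Transformer } from 'lusail';

// Options accepted by the custom transform; 'mine' is the value of getBy that selects it.
export interface MyTransform extends FieldTransform<'mine'> {
  myOption: any;
}

export default class MyTransformer extends Transformer<MyTransform> {
  async execute(input: Element | Element[]): Promise<string | string[]> {
    // Your transformation logic goes here. As a placeholder, return the text content of the input.
    return Array.isArray(input)
      ? input.map((element) => element.textContent ?? '')
      : input.textContent ?? '';
  }
}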

2. Register a custom transformer factory

A custom transformer factory is a function that returns a Transformer instance if the given FieldTransform matches the custom transformation type. To make your custom transformation available in Lusail templates, you need to register it with Lusail.

import { FieldTransform, Lusail } from 'lusail';
// Hypothetical path: point this at the file that defines MyTransformer from step 1.
import MyTransformer, { MyTransform } from './my-transformer';

// Type guard that recognizes the custom transform by its getBy value.
function isMyTransform(transform: FieldTransform): transform is MyTransform {
  return transform.getBy === 'mine';
}

Lusail.registerTransform(
  (transform, options) => {
    return isMyTransform(transform)
      ? new MyTransformer(transform, options)
      : undefined;
  },
  // Optional precedence argument. Factories that claim higher precedence will be chosen over those
  // with lower precedence in case of conflict.
  2
);
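
To illustrate how precedence could settle a conflict between two factories that claim the same transform, here is a self-contained sketch. It does not reflect lusail's internals: `FactoryRegistry`, `TransformSpec`, and `Handler` are hypothetical names, and the default precedence of 0 is an assumption.

```typescript
// Illustrative sketch only; hypothetical types, not lusail's internals.
interface TransformSpec {
  getBy?: string;
  [key: string]: unknown;
}
type Handler = (input: string) => string;
type Factory = (spec: TransformSpec) => Handler | undefined;

class FactoryRegistry {
  private entries: { factory: Factory; precedence: number }[] = [];

  // ASSUMPTION: precedence defaults to 0 when omitted.
  register(factory: Factory, precedence = 0): void {
    this.entries.push({ factory, precedence });
    // Highest precedence first; the sort is stable, so ties keep registration order.
    this.entries.sort((a, b) => b.precedence - a.precedence);
  }

  // Returns the handler from the first factory, in precedence order, that claims the spec.
  resolve(spec: TransformSpec): Handler | undefined {
    for (const { factory } of this.entries) {
      const handler = factory(spec);
      if (handler) return handler;
    }
    return undefined;
  }
}

const registry = new FactoryRegistry();
registry.register((spec) => (spec.getBy === 'mine' ? (s: string) => `low:${s}` : undefined), 1);
registry.register((spec) => (spec.getBy === 'mine' ? (s: string) => `high:${s}` : undefined), 2);

const resolved = registry.resolve({ getBy: 'mine' });
const output = resolved ? resolved('x') : undefined;
const unclaimed = registry.resolve({ getBy: 'other' });
```

Returning `undefined` from a factory is what lets the next candidate try, which mirrors the shape of the factory passed to `Lusail.registerTransform` above.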

3. Use the custom transformer in your templates

Now, your custom transformation type can be used in Lusail templates:

customField:
  - getBy: mine
    myOption: <value>

Documentation

See API Documentation for more details.

Development Status

Please note that Lusail is still under development and has not been thoroughly tested. As such, its use in production environments is not yet recommended. Also note that while we will attempt to follow semantic versioning for the library, there might be breaking changes between minor versions from time to time until we reach a stable state. Please report any issues you encounter and/or submit a pull request so we can make the library better.

Contributing

This is an evolving project, and contributions are welcome. Please read the CONTRIBUTING.md file for guidelines on how to contribute.

License

This project is licensed under the MIT License. See the LICENSE file for details.
