0.12.0 • Published 1 month ago

interrobot-plugin v0.12.0


Your web crawler just got superpowers. InterroBot plugins transform your web crawler into a customizable data powerhouse, unleashing unlimited potential for data extraction and analysis.

InterroBot plugins are simple HTML/JS/CSS pages that transform raw web crawl data into profound insights, stunning visualizations, and interactive dashboards. With our flexible API, you can create custom plugins that analyze website content across entire domains, connecting with analytics, LLMs, or your favorite SaaS for deeper insights.

Our plugin ecosystem is designed for versatility. Whether you're building proprietary tools, developing plugins for clients, or contributing to the open-source community, InterroBot plugins adapt to your needs. Available for Windows 10/11, macOS, and Android, our platform ensures your data analysis can happen wherever you work.

How Does it Work?

InterroBot hosts an iframe of your webpage and exposes an API from which you can pull data down for analysis.

If you're familiar with vanilla TypeScript or JavaScript, creating a custom plugin script for InterroBot is remarkably straightforward. Start with a bare-bones HTML file and a script extending the Plugin base class.

// TypeScript vs. JavaScript, both are fine. See examples.
import { Plugin } from "./src/ts/core/plugin";
class BasicExamplePlugin extends Plugin {    
    static meta = {
        "title": "Example Plugin",
        "category": "Example",
        "version": "1.0.0",
        "author": "InterroBot",
        "synopsis": `a basic plugin example`,
        "description": `This example is as simple as it gets.`,
    };
    constructor() {
        super();
        // index() has nothing to do with the crawl index, btw. it is 
        // the plugin index (think index.html), a view that shows by
        // default, and would generally consist of a form or visualization.
        this.index();
    }
}
// configure to load when page is ready
Plugin.initialize(BasicExamplePlugin);

BasicExamplePlugin will not do much at this point, but it will load and run the default index() behavior. You can, of course, override the default index() behavior, rendering your page however you wish.

protected async index() {
    // add your form and supporting HTML (a lone button stands in for a form here)
    this.render(`<div><button>Process</button></div>`);
    // initialize the plugin within InterroBot, from within iframe
    await this.initData(BasicExamplePlugin.meta, {}, []);
    // add handlers to the form; the button may be absent if you change the HTML
    const button = document.querySelector("button");
    button?.addEventListener("click", async (ev) => {
        await this.process(); // where process() is a form handler
    });
}

The process() method called above is where you process data. Here a query is executed on the crawl index, and each result is run through exampleResultHandler.

protected async process() {
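    // SearchQuery, SearchQueryType, Search, and SearchResult come from the
    // plugin library and must be imported alongside Plugin
    // gather title words and running counts with a result handler
    const titleWords: Map<string, number> = new Map<string, number>();
    // initialized up front so it can be passed to Search.execute below
    const resultsMap: Map<number, SearchResult> = new Map<number, SearchResult>();
    const exampleResultHandler = async (result: SearchResult, 
        titleWordsMap: Map<string, number>) => {
        // split on whitespace, hyphens, and em dashes; skip empty terms
        const terms: string[] = result.name.trim().split(/[\s\-—]+/g)
            .filter(term => term.length > 0);
        terms.forEach(term => titleWordsMap.set(term, 
            (titleWordsMap.get(term) ?? 0) + 1));
    };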
    // projectId comes for free as a member of Plugin
    const projectId: number = this.getProjectId();
    // anything you put into InterroBot search, field or fulltext works
    // here we limit to HTML documents, which will have a <title> -> name
    const freeQueryString: string = "headers: text/html";
    // pipe delimited fields you want retrieved. id and url come with 
    // the base model, everything else must be requested explicitly
    const fields: string = "name";
    const internalHtmlPagesQuery = new SearchQuery(projectId, 
        freeQueryString, fields, SearchQueryType.Any, false);
    // run each SearchResult through its handler, and we're done processing
    await Search.execute(internalHtmlPagesQuery, resultsMap, "Processing…", 
        async (result: SearchResult) => {
            await exampleResultHandler(result, titleWords);
        }
    );
    // call for HTML presentation of titleWords with processing complete
    await this.report(titleWords);
}

The above snippets are pulled (and gently modified) from a plugin in the repository, basic.js. For more ideas on getting started, check out the examples directory.
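The process() method ends by calling this.report(titleWords), which is not shown in the snippets. The following is a minimal sketch of the counting and reporting steps as standalone functions, not the repository's implementation. The names countTitleWords, escapeHtml, and renderReport are hypothetical and not part of the InterroBot API, and the titles array stands in for the name field of each SearchResult.

```typescript
// Hypothetical helpers for illustration; not part of the InterroBot plugin API.

// Split titles on whitespace, hyphens, and em dashes (the same pattern used in
// process()) and keep a running count per term, skipping empty terms.
function countTitleWords(titles: string[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const title of titles) {
        const terms = title.trim().split(/[\s\-—]+/g).filter(t => t.length > 0);
        for (const term of terms) {
            counts.set(term, (counts.get(term) ?? 0) + 1);
        }
    }
    return counts;
}

// Escape characters that would otherwise be interpreted as HTML markup.
function escapeHtml(text: string): string {
    return text.replace(/&/g, "&amp;").replace(/</g, "&lt;")
        .replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Build an HTML table of terms, highest count first, ties broken alphabetically.
function renderReport(counts: Map<string, number>): string {
    const rows = Array.from(counts.entries())
        .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
        .map(([term, count]) =>
            `<tr><td>${escapeHtml(term)}</td><td>${count}</td></tr>`);
    return `<table><thead><tr><th>Term</th><th>Count</th></tr></thead>` +
        `<tbody>${rows.join("")}</tbody></table>`;
}
```

Within a plugin, a report() method could pass the string returned by renderReport to this.render(), assuming the plugin wants a plain table as its output view.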

What data is available via API?

InterroBot's robust API provides plugin developers with access to crawled data, enabling deep analysis and useful customizations. This data forms the foundation of your plugin, allowing you to create insightful visualizations, perform complex analysis, or build interactive tools. Whether you're tracking SEO metrics, analyzing content structures, or developing custom reporting tools, our API offers the flexibility and depth you need. Below is an overview of the key data points available, organized by API endpoint:

GetProjects

Retrieves a list of projects using the Plugin API.

Optional Fields

| Field | Description |
| --- | --- |
| created | ISO 8601 date/time, project created |
| image | data URI of project icon |
| modified | ISO 8601 date/time, project modified |
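As an illustration of working with these optional fields, the sketch below sorts project records by their modified timestamp. The ProjectRecord shape is an assumption made for this example, and the objects actually returned by the API may differ; only the field names created, image, and modified come from the table. The helper names toMillis and newestFirst are likewise hypothetical.

```typescript
// Hypothetical record shape: only the optional field names (created, image,
// modified) come from the GetProjects table; the rest is assumed.
interface ProjectRecord {
    id: number;
    created?: string;   // ISO 8601 date/time
    modified?: string;  // ISO 8601 date/time
    image?: string;     // data URI of project icon
}

// Parse an ISO 8601 string to epoch milliseconds.
// Missing or unparseable values map to -Infinity so they sort last.
function toMillis(iso?: string): number {
    const ms = iso === undefined ? NaN : Date.parse(iso);
    return isNaN(ms) ? -Infinity : ms;
}

// Return a new array with the most recently modified project first.
function newestFirst(projects: ProjectRecord[]): ProjectRecord[] {
    return projects.slice().sort((a, b) => {
        const ma = toMillis(a.modified);
        const mb = toMillis(b.modified);
        return ma === mb ? 0 : (mb > ma ? 1 : -1);
    });
}
```

Because modified is optional, the comparison avoids subtracting timestamps directly, which would yield NaN when both values are missing.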

GetResources

Retrieves a list of resources associated with a project using the Plugin API.

Optional Fields

| Field | Description |
| --- | --- |
| assets | array of assets, HTML only |
| content | page/file contents |
| created | ISO 8601 date/time, crawled resource |
| headers | HTTP headers |
| links | array of outlinks, HTML only |
| modified | ISO 8601 date/time, resource modified |
| name | page/file name |
| norobots | crawler indexable |
| origin | forwarding URL, if applicable |
| size | size in bytes |
| status | HTTP status code |
| time | request time, in millis |
| type | resource type, html, pdf, image, etc. |
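The process() example passes the requested fields to SearchQuery as a pipe-delimited string. A small helper can catch typos in field names before a query is sent. This is only a sketch: RESOURCE_FIELDS and buildFields are hypothetical names, not part of the InterroBot API, and the list simply mirrors the optional fields in the table.

```typescript
// Optional GetResources fields, copied from the table (id and url come with
// the base model and need not be requested).
const RESOURCE_FIELDS: readonly string[] = ["assets", "content", "created",
    "headers", "links", "modified", "name", "norobots", "origin", "size",
    "status", "time", "type"];

// Hypothetical helper: validate the requested field names, drop duplicates
// while preserving order, and join them with a pipe character.
function buildFields(...requested: string[]): string {
    const unknown = requested.filter(f => RESOURCE_FIELDS.indexOf(f) === -1);
    if (unknown.length > 0) {
        throw new Error(`Unknown resource field(s): ${unknown.join(", ")}`);
    }
    const unique = requested.filter((f, i) => requested.indexOf(f) === i);
    return unique.join("|");
}
```

For example, buildFields("name", "status", "time") returns "name|status|time", which could then be passed as the fields argument of SearchQuery.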

GetCrawls

Retrieves a list of crawls using the Plugin API.

Optional Fields

| Field | Description |
| --- | --- |
| created | ISO 8601 date/time, crawl created |
| modified | ISO 8601 date/time, crawl modified |
| report | Crawl details as JSON |
| time | Crawl time in millis |
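Two of these fields usually need light post-processing before display: time is in milliseconds, and report is JSON. The sketch below shows one way to handle both. It assumes report arrives as a JSON string, which the table does not confirm, and it makes no assumption about the structure inside it. The names formatMillis and parseReport are hypothetical.

```typescript
// Format a duration in milliseconds as a short string such as "1h 2m 3s".
function formatMillis(millis: number): string {
    const totalSeconds = Math.floor(millis / 1000);
    const hours = Math.floor(totalSeconds / 3600);
    const minutes = Math.floor((totalSeconds % 3600) / 60);
    const seconds = totalSeconds % 60;
    const parts: string[] = [];
    if (hours > 0) parts.push(`${hours}h`);
    if (hours > 0 || minutes > 0) parts.push(`${minutes}m`);
    parts.push(`${seconds}s`);
    return parts.join(" ");
}

// Assumption: report is delivered as a JSON string. Its structure is not
// documented here, so the result is typed as unknown; null signals bad JSON.
function parseReport(report: string): unknown {
    try {
        return JSON.parse(report);
    } catch {
        return null;
    }
}
```

Typing the parsed report as unknown forces callers to check for the properties they need rather than relying on a guessed structure.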

Licensing

MPL 2.0, with exceptions. This repo contains JavaScript to TypeScript ports and a Markdown library based on existing code, all contained within ./src/lib. As they arrived under existing licenses, they will remain under those.

  • Typo.js: TypeScript port continues under the original Modified BSD License.
  • Snowball.js: TypeScript port continues under the original MPL 1.1 license.
  • HTML To Markdown Text: The Markdown library contains a modified version of an HTML to Markdown XSLT transformer by Michael Eichelsdoerfer. MIT license.

The InterroBot plugins and the Typo.js TypeScript port make use of a handful of unmodified Hunspell dictionaries, as found in wooorm's UTF-8 collection: dictionary-en, dictionary-en-gb, dictionary-es, dictionary-es-mx, dictionary-fr, and dictionary-ru.
