interrobot-plugin v0.12.0
Your web crawler just got superpowers. InterroBot plugins transform your web crawler into a customizable data powerhouse, unleashing unlimited potential for data extraction and analysis.
InterroBot plugins are simple HTML/JS/CSS pages that transform raw web crawl data into profound insights, stunning visualizations, and interactive dashboards. With our flexible API, you can create custom plugins that analyze website content across entire domains, connecting with analytics, LLMs, or your favorite SaaS for deeper insights.
Our plugin ecosystem is designed for versatility. Whether you're building proprietary tools, developing plugins for clients, or contributing to the open-source community, InterroBot plugins adapt to your needs. Available for Windows 10/11, macOS, and Android, our platform ensures your data analysis can happen wherever you work.
How Does it Work?
InterroBot hosts an iframe of your webpage and exposes an API from which you can pull data down for analysis.
If you're familiar with vanilla TypeScript or JavaScript, creating a custom plugin script for InterroBot is remarkably straight forward. First you start with a bare-bones HTML file and a script extending the Plugin base class.
// TypeScript vs. JavaScript, both are fine. See examples.
import { Plugin } from "./src/ts/core/plugin";
class BasicExamplePlugin extends Plugin {
static meta = {
"title": "Example Plugin",
"category": "Example",
"version": "1.0.0",
"author": "InterroBot",
"synopsis": `a basic plugin example`,
"description": `This example is as simple as it gets.`,
};
constructor() {
super();
// index() has nothing to do with the crawl index, btw. it is
// the plugin index (think index.html), a view that shows by
// default, and would generally consist of a form or visualization.
this.index();
}
}
// configure to load when page is ready
Plugin.initialize(BasicExamplePlugin);
BasicExamplePlugin will not do much at this point, but it will load and run the default index()
behavior.
You can, of course, override the default index()
behavior, rendering your page however you wish.
protected async index() {
// add your form and supporting HTML
this.render(`<div>HTML</div>`);
// initialize the plugin within InterroBot, from within iframe
await this.initData(BasicExamplePlugin.meta, {}, []);
// add handlers to the form
const button = document.querySelector("button");
button.addEventListener("click", async (ev) => {
await this.process(); // where process() is a form handler
});
}
The process()
method called above would be where you process data. Here a query is executed on
the crawl index, and each result run through the exampleResultsHandler.
protected async process() {
// gather title words and running counts with a result handler
const titleWords: Map<string, number> = new Map<string, number>();
let resultsMap: Map<number, SearchResult>;
const exampleResultHandler = async (result: SearchResult,
titleWordsMap: Map<string, number>) => {
const terms: string[] = result.name.trim().split(/[\s\-—]+/g);
terms.forEach(term => titleWordsMap.set(term,
(titleWordsMap.get(term) ?? 0) + 1));
}
// projectId comes for free as a member of Plugin
const projectId: number = this.getProjectId();
// anything you put into InterroBot search, field or fulltext works
// here we limit to HTML documents, which will have a <title> -> name
const freeQueryString: string = "headers: text/html";
// pipe delimited fields you want retrieved. id and url come with
// the base model, everything else must be requested explicitly
const fields: string = "name";
const internalHtmlPagesQuery = new SearchQuery(projectId,
freeQueryString, fields, SearchQueryType.Any, false);
// run each SearchResult through its handler, and we're done processing
await Search.execute(internalHtmlPagesQuery, resultsMap, "Processing…",
async (result: SearchResult) => {
await exampleResultHandler(result, titleWords);
}
);
// call for HTML presentation of titleWords with processing complete
await this.report(titleWords);
}
The above snippets are pulled (and gently modified) from a plugin in the repository, basic.js. For more ideas getting started, check out the examples directory.
What data is available via API?
InterroBot's robust API provides plugin developers with access to crawled data, enabling deep analysis and useful customizations. This data forms the foundation of your plugin, allowing you to create insightful visualizations, perform complex analysis, or build interactive tools. Whether you're tracking SEO metrics, analyzing content structures, or developing custom reporting tools, our API offers the flexibility and depth you need. Below is an overview of the key data points available, organized by API endpoint:
GetProjects
Retrieves a list of projects using the Plugin API.
Optional Fields
Field | Description |
---|---|
created | ISO 8601 date/time, project created |
image | datauri of project icon |
modified | ISO 8601 date/time, project modified |
GetResources
Retrieves a list of resources associated with a project using the Plugin API.
Optional Fields
Field | Description |
---|---|
assets | array of assets, HTML only |
content | page/file contents |
created | ISO 8601 date/time, crawled resource |
headers | HTTP headers |
links | array of outlinks, HTML only |
modified | ISO 8601 date/time, resource modified |
name | page/file name |
norobots | crawler indexable |
origin | forwarding URL, if applicable |
size | size in bytes |
status | HTTP status code |
time | request time, in millis |
type | resource type, html, pdf, image, etc. |
GetCrawls
Retrieves a list of crawls using the Plugin API.
Optional Fields
Field | Description |
---|---|
created | ISO 8601 date/time, crawl created |
modified | ISO 8601 date/time, crawl modified |
report | Crawl details as JSON |
time | Crawl time in millis |
Licensing
MPL 2.0, with exceptions. This repo contains JavaScript to TypeScript ports and a Markdown library based on existing code, all contained within ./src/lib
. As they arrived under existing licenses, they will remain under those.
- Typo.js: TypeScript port continues under the original Modified BSD License.
- Snowball.js: TypeScript port continues under the original MPL 1.1 license.
- HTML To Markdown Text: The Markdown library contains a modified version of an HTML to Markdown XSLT transformer by Michael Eichelsdoerfer. MIT license.
The InterroBot plugins and the Typo.js TypeScript port make use of a handful of unmodified Hunspell dictionaries, as found in wooorm's UTF-8 collection: dictionary-en
, dictionary-en-gb
, dictionary-es
, dictionary-es-mx
, dictionary-fr
, and dictionary-ru
.