robotstxt.js

robotstxt.js is a lightweight JavaScript library for parsing robots.txt files. It provides a standards-compliant parser for both browser and Node.js environments.

Directives

  • Clean-param
  • Host
  • Sitemap
  • User-agent
    • Allow
    • Disallow
    • Crawl-delay
    • Cache-delay
    • Comment
    • NoIndex
    • Request-rate
    • Robot-version
    • Visit-time
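
For orientation, a hypothetical robots.txt using several of these directives might look like the following; the host, paths, and values are placeholders, not output from this library.

```
User-agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-delay: 5
Visit-time: 0600-0845

Clean-param: sessionid /catalog
Host: example.com
Sitemap: https://example.com/sitemap.xml
```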

Benefits

  • Accurately parse and interpret robots.txt rules.
  • Ensure compliance with robots.txt standards to avoid accidental blocking of legitimate bots.
  • Easily check URL permissions for different user agents programmatically.
  • Simplify the process of working with robots.txt in JavaScript applications.

Usage

Here's how to use robotstxt.js to analyze robots.txt content and check crawler permissions.

Node.js

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js")
// ...
```

JavaScript

```javascript
// Parse robots.txt content
const robotsTxtContent = `
User-Agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-Delay: 5
Sitemap: https://example.com/sitemap.xml
`;

const parser = robotstxt(robotsTxtContent);

// Check URL permissions
console.log(parser.isAllowed("/public/data", "GoogleBot"));   // true
console.log(parser.isDisallowed("/private/admin", "GoogleBot")); // true

// Get specific user agent group
const googleBotGroup = parser.getGroup("googlebot"); // Case-insensitive
if (googleBotGroup) {
    console.log("Crawl Delay:", googleBotGroup.getCrawlDelay()); // 5
    console.log("Rules:", googleBotGroup.getRules().map(rule =>
        `${rule.type}: ${rule.path}`
    )); // ["allow: /public", "disallow: /private"]
}

// Get all sitemaps
console.log("Sitemaps:", parser.getSitemaps()); // ["https://example.com/sitemap.xml"]

// Check rules for the default (wildcard *) user agent
console.log(parser.isAllowed("/protected", "*")); // true (no wildcard rules exist in this example, so access defaults to allowed)
```

Installation

NPM

npm i @playfulsparkle/robotstxt-js

Yarn

yarn add @playfulsparkle/robotstxt-js

Bower (deprecated)

bower install playfulsparkle/robotstxt.js

API Documentation

Core Methods

  • robotstxt(content: string): RobotsTxtParser - Creates a new parser instance with the provided robots.txt content.
  • getReports(): string[] - Get an array of parsing reports (errors, warnings, etc.).
  • isAllowed(url: string, userAgent: string): boolean - Check if a URL is allowed for the specified user agent (throws if parameters are missing).
  • isDisallowed(url: string, userAgent: string): boolean - Check if a URL is disallowed for the specified user agent (throws if parameters are missing).
  • getGroup(userAgent: string): Group | undefined - Get the rules group for a specific user agent (case-insensitive match).
  • getSitemaps(): string[] - Get an array of discovered sitemap URLs from Sitemap directives.
  • getCleanParams(): string[] - Retrieve Clean-param directives for URL parameter sanitization.
  • getHost(): string | undefined - Get canonical host declaration for domain normalization.
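
A minimal sketch exercising these core methods follows; the robots.txt content is illustrative, and the exact string format returned by getReports() and getCleanParams() is not asserted here.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /tmp
Clean-param: sessionid /catalog
Host: example.com
Sitemap: https://example.com/sitemap.xml
`);

// Parser diagnostics collected while reading the content.
console.log(parser.getReports());

// Permission checks for the wildcard agent.
console.log(parser.isAllowed("/catalog/item", "*")); // true
console.log(parser.isDisallowed("/tmp/cache", "*")); // true

// Site-wide directives gathered during parsing.
console.log(parser.getSitemaps());    // ["https://example.com/sitemap.xml"]
console.log(parser.getCleanParams()); // Clean-param entries as strings
console.log(parser.getHost());        // "example.com" (or undefined if absent)
```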

Group Methods (via getGroup() result)

User Agent Info

  • getName(): string - User agent name for this group.
  • getComment(): string[] - Associated comments for this group from Comment directives.
  • getRobotVersion(): string | undefined - Robots.txt specification version.
  • getVisitTime(): string | undefined - Recommended crawl time window.

Crawl Management

  • getCacheDelay(): number | undefined - Cache delay in seconds.
  • getCrawlDelay(): number | undefined - Crawl delay in seconds.
  • getRequestRates(): string[] - Request rate limitations.

Rule Access

  • getRules(): Rule[] - All rules (allow/disallow/noindex) for this group.
  • addRule(type: string, path: string): void - Add a rule to this group (throws if type or path is missing).
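
The group accessors can be exercised with a short sketch like the one below; the directive values are illustrative, the logged results depend on the parsed content, and the lowercase rule type passed to addRule() is an assumption rather than a documented requirement.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-delay: 5
Cache-delay: 10
Visit-time: 0600-0845
`);

const group = parser.getGroup("googlebot"); // case-insensitive lookup

if (group) {
    console.log(group.getName());         // user agent name for this group
    console.log(group.getCrawlDelay());   // 5 (seconds)
    console.log(group.getCacheDelay());   // 10 (seconds)
    console.log(group.getVisitTime());    // recommended crawl window, or undefined
    console.log(group.getRequestRates()); // request-rate strings (empty here)

    // Inspect existing rules, then add one programmatically.
    group.getRules().forEach(rule => console.log(rule.type, rule.path));
    group.addRule("disallow", "/drafts"); // throws if type or path is missing
}
```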

Specification Support

Full Support

  • User-agent groups and inheritance
  • Allow/Disallow directives
  • Wildcard pattern matching (*)
  • End-of-path matching ($)
  • Crawl-delay directives
  • Sitemap discovery
  • Case-insensitive matching
  • Default user-agent (*) handling
  • Multiple user-agent declarations
  • Rule precedence by specificity
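
As a quick illustration of the wildcard, end-of-path, and precedence behavior listed above, here is a small sketch; the expected results assume the standard longest-match semantics of the Robots Exclusion Protocol.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Allow: /downloads/public
Disallow: /downloads/
Disallow: /*.pdf$
`);

// "$" anchors a pattern to the end of the URL path.
console.log(parser.isAllowed("/files/report.pdf", "*"));      // false (matches /*.pdf$)
console.log(parser.isAllowed("/files/report.pdf.html", "*")); // true (no longer ends in ".pdf")

// Precedence by specificity: the longer Allow rule beats the shorter Disallow.
console.log(parser.isAllowed("/downloads/public/tool.zip", "*")); // true
console.log(parser.isAllowed("/downloads/archive.zip", "*"));     // false
```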

Support

Node.js

robotstxt.js runs on Node.js 6.x and later, including all currently supported Node.js versions.

Browser Support

This library is written using modern JavaScript ES2015 (ES6) features. It is expected to work in the following browser versions and later:

| Browser | Minimum Supported Version |
| --- | --- |
| Desktop Browsers | |
| Chrome | 49 |
| Edge | 13 |
| Firefox | 45 |
| Opera | 36 |
| Safari | 14.1 |
| Mobile Browsers | |
| Chrome Android | 49 |
| Firefox for Android | 45 |
| Opera Android | 36 |
| Safari on iOS | 14.5 |
| Samsung Internet | 5.0 |
| WebView Android | 49 |
| WebView on iOS | 14.5 |
| Other | |
| Node.js | 6.13.0 |

License

robotstxt.js is licensed under the terms of the BSD 3-Clause License.
