0.0.4 • Published 4 years ago

blogsearch-crawler v0.0.4

BlogSearch index building tool for generic static webpages (crawler)


:project-version: 0.0.3
:rootdir: https://github.com/kbumsik/blogsearch

ifdef::env-github[]
:tip-caption: :bulb:
:note-caption: :information_source:
:important-caption: :heavy_exclamation_mark:
:caution-caption: :fire:
:warning-caption: :warning:
endif::[]

CAUTION: This is a part of the link:{rootdir}[BlogSearch project]. If you would like to know the overall concept, go to link:{rootdir}[the parent directory].

1. Building a search index file

The easiest way

[source,bash]
----
npx blogsearch-crawler
----

The formal way


[source,bash]
----
npm install -g blogsearch-crawler
----



CAUTION: For more details on how to configure fields, go to link:{rootdir}#whats-in-the-index[the "What's in the index file" section] of the main project.



[source,javascript]
----
module.exports = {
  // Mandatory. This must be 'simple'.
  type: 'simple',
  // Mandatory. Generated blogsearch database file.
  output: './my_blog_index.db.wasm',
  // Mandatory. List of entries to parse. The crawler uses glob patterns internally.
  // How to use glob: https://github.com/isaacs/node-glob
  entries: [
    './reactjs.org/public/docs/**/*.html',
  ],
  // Mandatory. Field configurations.
  // See: https://github.com/kbumsik/blogsearch#whats-in-the-index
  fields: {
    title: {
      // The value can be a CSS selector.
      parser: 'article > header',
    },
    body: {
      // Set false if you want to reduce the size of the database.
      hasContent: true,
      // It can be a function as well.
      parser: (entry, page) => {
        // Use the puppeteer page object.
        // It's okay to return a promise.
        return page.$eval('article > div > div:first-child', el => el.textContent);
      },
    },
    url: {
      // By setting this false, the search engine won't index the URL.
      indexed: false,
      parser: (entry, page) => {
        // entry is a string of the path being parsed.
        return entry.replace('./reactjs.org/public', 'https://reactjs.org');
      },
    },
    categories: {
      // This is disabled because the target website doesn't have categories.
      enabled: false,
      // This is a dummy parser. It is unused because the field is disabled.
      parser: () => 'categories-1, categories-2',
    },
    tags: {
      // This is disabled because the target website doesn't have tags.
      enabled: false,
      // This is a dummy parser. It is unused because the field is disabled.
      parser: () => 'tags-1, tags-2',
    },
  },
};
----
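The `url` parser in the configuration turns the local path of a crawled file into the public URL of that page. The following is a minimal sketch of that mapping as a standalone function, so it can be tried in plain Node.js without running the crawler. The name `toPublicUrl` and the two constants are hypothetical and are not part of blogsearch-crawler; the sketch also assumes the crawler passes each entry with the same `./reactjs.org/public` prefix used in the `entries` pattern.

```javascript
// Hypothetical helper for illustration; not part of blogsearch-crawler.
// Assumption: entries start with the same prefix as the glob pattern.
const LOCAL_ROOT = './reactjs.org/public';
const SITE_ROOT = 'https://reactjs.org';

function toPublicUrl(entry) {
  // With a string pattern, String.prototype.replace only replaces the
  // first occurrence, so only the leading path prefix is rewritten.
  return entry.replace(LOCAL_ROOT, SITE_ROOT);
}

console.log(toPublicUrl('./reactjs.org/public/docs/hooks-intro.html'));
// prints "https://reactjs.org/docs/hooks-intro.html"
```

A function like this could be used as the body of the `url` field's `parser`, ignoring the second `page` argument, since the URL is derived from the file path alone.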


2. Enabling the search engine in the webpage

You need to enable the search engine in the web page. Go to link:../blogsearch[blogsearch Engine].

Again, if you would like to understand the concept of BlogSearch, go to link:{rootdir}/[the parent directory].