@24hr/rawb-search

Combined scraper and wrapper for fuse search

The module exports a function that, when called, returns an object with functions for searching and for starting new scrapes of a site. Use it to maintain a searchable index of a site's content and to query that index later.

Init

Requiring the module gives you a function that initializes rawbSearch. The function takes the following arguments:

options

A standard fuse.js options object, used to initialize fuse.
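
As an illustration, an options object for an index whose entries carry the title, link and text fields returned by the example parser further down might look like the sketch below. The chosen keys, weights and threshold are assumptions made for this example, not defaults of this module.

```javascript
// Illustrative fuse.js options; every value here is an assumption for the sketch.
const fuseOptions = {
  // Fields of each indexed page object that fuse.js should search.
  keys: [
    { name: 'title', weight: 2 },
    { name: 'text', weight: 1 },
  ],
  // 0 requires a perfect match, 1 matches anything.
  threshold: 0.3,
  // Include the match score with each result.
  includeScore: true,
  // Ignore matches shorter than two characters.
  minMatchCharLength: 2,
};
```

Weighting the title above the body text is one way to rank pages whose title matches the query higher; tune or drop the weights to suit the site.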

baseUrl

The base URL of the site to scrape and search. For example: https://www.24hr.se

parsers

A list of at least one parser. The parsers are tested in order, from the first to the last index in the list, and the first parser whose filter matches is used to parse the page. If no earlier parser matches, the last parser's parse function is executed, so put the parser you want as the default last in the list.

A parser, in this context, is an object that follows this structure:

const blogStartPage = {
    filter: (res, baseURL) => {
        /*
          This will use the res object that the scrape
          will return and scan the page for identifiers
          that it will use to determine if this parser will
          apply its parse function on the current page or if
          it will pass the current page along to the next
          parser.

          It is possible for a parser to not have a filter
          function. But that will mean that it will always
          get applied and thus it should be placed last in
          the list as the default parser.
        */
    },
    parse: async (res, baseURL) => {
        /*
          We get a cheerio function from the scraper that
          we can use to scrape the page. Below is a very
          simple example. The returned object is the object
          that will be returned and used as search index
          for this page.
          
          One solution is to have an attribute that marks an element containing
          relevant and indexable text. This gives a lot of control but of course
          demands that developers think about this and mark both indexable and
          non-indexable elements as they are developing. Example below
        */

        const $ = res.$;
        // Remove any style tags if found, as they are guaranteed to be irrelevant.
        $('style').remove();
        // Remove all elements nested inside elements marked with the data-non-indexable attribute
        $('[data-non-indexable] *').remove();
        // Below we get all the text nodes (nodeType 3) that are nested below
        // elements with data-indexable. We then grab the text from them.
        const $indexableElements = $('[data-indexable] *');
        const text = $indexableElements
          .contents()
          .filter(function() {
            return this.nodeType === 3;
          })
          .text();
        
        // Here we return the object that will be put in the fuse.js list of
        // indexable content. We control what goes in here and how; this is
        // just one example.
        return {
          title: $("title").text(),
          link: res.request.uri.href.split(baseURL)[1],
          text: text,
        };

    },
}
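
The selection rule described above (test each parser in order, fall back to the last one) can be sketched as follows. This is a minimal illustration of the rule, not the module's actual implementation; selectParser, blogParser, defaultParser and the page object are hypothetical names introduced for the example.

```javascript
// Sketch of the selection rule; not the module's actual implementation.
function selectParser(parsers, res, baseURL) {
  if (!Array.isArray(parsers) || parsers.length === 0) {
    throw new Error('At least one parser is required');
  }
  for (const parser of parsers) {
    // A parser without a filter function always applies.
    if (typeof parser.filter !== 'function' || parser.filter(res, baseURL)) {
      return parser;
    }
  }
  // No filter matched: fall back to the last parser in the list.
  return parsers[parsers.length - 1];
}

// Hypothetical parsers and page object, for illustration only.
const blogParser = {
  filter: (res) => res.request.uri.href.includes('/blog/'),
  parse: async () => ({ type: 'blog' }),
};
const defaultParser = {
  parse: async () => ({ type: 'default' }),
};
const page = { request: { uri: { href: 'https://www.24hr.se/about' } } };

const chosen = selectParser([blogParser, defaultParser], page, 'https://www.24hr.se');
console.log(chosen === defaultParser); // true
```

Because a parser without a filter always applies, placing it anywhere but last would shadow every parser after it.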

search

The search function takes a query in string format and returns a list of results.

startNewScrape

Takes a URL and starts a new scrape of that page. It also finds all internal links on the page and starts scrapes of those pages as well. If the site has a sitemap page, that page should probably be used as the startUrl.
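
To make "internal link" concrete, here is one way such a check could be written with the standard URL class. It is a sketch under the assumption that internal means "same origin as the base URL"; the module may apply a different rule, and isInternalLink is a hypothetical helper, not part of its API.

```javascript
// Sketch only: treats a link as internal when it shares the base URL's origin.
function isInternalLink(href, pageUrl, baseUrl) {
  let target;
  try {
    // Resolve relative links against the page they were found on.
    target = new URL(href, pageUrl);
  } catch (err) {
    return false; // Not a parseable URL.
  }
  // Skip mailto:, tel:, javascript: and similar non-page links.
  if (target.protocol !== 'http:' && target.protocol !== 'https:') {
    return false;
  }
  return target.origin === new URL(baseUrl).origin;
}

const base = 'https://www.24hr.se';
const currentPage = 'https://www.24hr.se/blog/';

console.log(isInternalLink('/contact', currentPage, base)); // true
console.log(isInternalLink('https://example.com/', currentPage, base)); // false
console.log(isInternalLink('mailto:hello@24hr.se', currentPage, base)); // false
```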
