@24hr/rawb-search v0.29.10
Combined scraper and wrapper for fuse search
Calling the exported function returns an object with functions for searching and for starting new scrapes of a site. Use it to maintain a searchable index of a site's content and to query that index later.
Init
The function exported by the module initializes rawbSearch. It takes the following arguments:
options
A standard fuse.js options object, used to initialize fuse.
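As an illustration, an options object for pages indexed with a title and a text field might look like the sketch below. `keys`, `threshold`, and `includeScore` are standard Fuse.js options. The field names `title` and `text`, the weights, and the threshold value are assumptions, and the field names must match whatever your parsers return.

```javascript
// Sketch of a Fuse.js options object; weights and threshold are illustrative.
const fuseOptions = {
    // Fields of each indexed page to search in. These names are assumed to
    // match the object returned by your parser's parse function.
    keys: [
        { name: 'title', weight: 0.7 },
        { name: 'text', weight: 0.3 },
    ],
    // 0.0 requires an exact match, 1.0 matches anything.
    threshold: 0.3,
    // Include a relevance score with each result.
    includeScore: true,
};
```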
baseUrl
The base URL of the site to scrape and index. For example: https://www.24hr.se
parsers
A list of at least one parser. Parsers are tested in order, from the first to the last index in the list. If no earlier parser has tested true, the last parser's parse function is executed, so put the parser you want as the default last in the list.
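The selection order can be sketched as follows. This is not the library's own code: `selectParser` is a hypothetical helper written from the description in this README, where each parser is an object with an optional `filter` function and a `parse` function, and `res.path` is a made-up field used only for the demo.

```javascript
// Hypothetical helper, not part of rawb-search: picks the parser for a page.
function selectParser(parsers, res, baseURL) {
    if (!Array.isArray(parsers) || parsers.length === 0) {
        throw new Error('At least one parser is required');
    }
    for (const parser of parsers) {
        // A parser without a filter always applies.
        if (typeof parser.filter !== 'function' || parser.filter(res, baseURL)) {
            return parser;
        }
    }
    // No parser matched: fall back to the last one as the default.
    return parsers[parsers.length - 1];
}

// Usage with stand-in parsers; `res.path` is a made-up field for this demo.
const blogParser = {
    name: 'blog',
    filter: (res) => res.path.startsWith('/blog'),
    parse: async () => ({}),
};
const defaultParser = { name: 'default', parse: async () => ({}) };

console.log(selectParser([blogParser, defaultParser], { path: '/blog/hello' }, '').name); // blog
console.log(selectParser([blogParser, defaultParser], { path: '/about' }, '').name); // default
```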
A parser is an object with an optional filter function and a parse function, structured like this:
const blogStartPage = {
    filter: (res, baseURL) => {
        /*
            Uses the res object returned by the scrape to look for
            identifiers on the page and decide whether this parser
            should apply its parse function to the current page or
            pass the page along to the next parser.
            A parser may omit the filter function. It will then always
            be applied, so it should be placed last in the list as the
            default parser.
        */
        // Illustrative check only: the data-page-type attribute is a
        // hypothetical marker, replace it with one your pages use.
        return res.$('[data-page-type="blog-start"]').length > 0;
    },
    parse: async (res, baseURL) => {
        /*
            We get a cheerio function from the scraper that we can use
            to scrape the page. Below is a very simple example. The
            returned object is what will be stored in the search index
            for this page.
            One solution is to have an attribute that marks an element
            containing relevant and indexable text. This gives a lot of
            control, but it demands that developers think about this and
            mark both indexable and non-indexable elements as they are
            developing. Example below.
        */
        const $ = res.$;
        // Remove any style tags, as their content is never relevant.
        $('style').remove();
        // Remove any elements marked with the data-non-indexable attribute,
        // including everything nested inside them.
        $('[data-non-indexable]').remove();
        // Select the elements marked with data-indexable and everything
        // nested below them, then grab the text from their text nodes.
        const $indexableElements = $('[data-indexable], [data-indexable] *');
        const text = $indexableElements
            .contents()
            .filter(function() {
                // nodeType 3 is a text node.
                return this.nodeType === 3;
            })
            .text();
        // Return the object that will be put in the fuse.js list of indexable
        // content. We control what goes in here and how; this is just one
        // example.
        return {
            title: $('title').text(),
            link: res.request.uri.href.split(baseURL)[1],
            text: text,
        };
    },
};
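The `link` field in the example parser is derived by splitting the page URL on the base URL. A minimal sketch of that step in isolation, with `relativeLink` as a hypothetical helper name, shows what it yields and what happens when the page URL does not contain the base URL:

```javascript
// Derive a site-relative link the same way the example parser does.
// `relativeLink` is a hypothetical name used only for this illustration.
function relativeLink(href, baseURL) {
    return href.split(baseURL)[1];
}

console.log(relativeLink('https://www.24hr.se/blog/post', 'https://www.24hr.se')); // "/blog/post"
// A URL outside the base URL produces undefined, so guard against it if needed.
console.log(relativeLink('https://example.com/page', 'https://www.24hr.se')); // undefined
```

Note that a base URL with a trailing slash would produce links without a leading slash, so it helps to pick one form and use it consistently.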
search
The search function takes a query in string format and returns a list of results.
startNewScrape
Takes a URL and starts a new scrape from that page. It also finds all internal links on the page and starts scrapes for them as well. If the site has a sitemap page, that page should probably be used as the startUrl.
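Putting the pieces together, the calls might be wired up roughly as in the sketch below. Several things here are assumptions rather than documented behavior: the argument order (options, baseUrl, parsers) is taken from the order the arguments are listed under Init, `buildIndexAndSearch` is a hypothetical wrapper, and whether the index is complete once startNewScrape returns is not stated in this README. The exported function is passed in as a parameter so the snippet does not assume how the module is loaded.

```javascript
// Sketch only: `rawbSearch` is the function exported by the module.
// The argument order (options, baseUrl, parsers) is an assumption.
async function buildIndexAndSearch(rawbSearch, query) {
    const baseUrl = 'https://www.24hr.se';
    const options = { keys: ['title', 'text'] };
    // A single parser without a filter acts as the default parser.
    const parsers = [
        {
            parse: async (res) => ({
                title: res.$('title').text(),
                text: res.$('body').text(),
            }),
        },
    ];

    const searcher = rawbSearch(options, baseUrl, parsers);
    // Assumption: startNewScrape may return a promise. Awaiting a plain
    // value is harmless, so this works either way.
    await searcher.startNewScrape(baseUrl);
    return searcher.search(query);
}
```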