ipbcrawler v1.0.0
Invision Power Board Crawler
Package to mine data from forums that run on the Invision Power Board platform.
Goals
The goal is to let you retrieve every topic already posted in a given forum, for example to build a seeder for your blog or for your own forum.
What can you scrape with this package?
- Home
- List of topics
- Topic page (top post)
1. What is scraped from home
- Forum Areas
  - Categories: the icon, title, subcategories and description are returned
- Ranks: the name of each rank is returned
2. What is scraped from topic list
- Pagination
- Topics: the id, title and URL are returned
3. What is scraped from topic page
- Topic title
- Topic content
- Topic author
How to use
To install
npm i ipbcrawler --save
Process all topics
We provide an asynchronous function that visits and scrapes every topic in a category (yes, it walks through all of the pages).
const { findPosts } = require('ipbcrawler')
You can mine several forums in one call and get the posts back already tagged with the category id of your own forum.
To do this, build the options array and pass it to findPosts:
const options = [{
  // source categories to crawl
  url: [
    'https://example/forum/153-games',
    'https://otherexample/forum/23-games-pc'
  ],
  // the category id of my gaming forum
  id: 140
}]
findPosts(options)
  .then(topics => console.log(topics))
  .catch(e => console.log(e))
The return value looks like this:
[{
  category: 140,
  posts: [{
    author: "Filipe",
    post: "This is a sample post."
  }]
}]
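Since the stated goal is to feed a seeder, it can help to flatten this result into one record per post. The following is only a minimal sketch, assuming the return value has exactly the shape shown above; `toSeedRecords` and the `categoryId`/`body` field names are hypothetical and not part of ipbcrawler.

```javascript
// Hypothetical helper (not part of ipbcrawler): turns the documented
// findPosts result into a flat list with one record per post.
function toSeedRecords(results) {
  return results.flatMap(({ category, posts }) =>
    posts.map(({ author, post }) => ({
      categoryId: category, // id of the target category in your own forum
      author,
      body: post
    }))
  )
}

// Sample data copied from the return example above.
const sample = [{
  category: 140,
  posts: [{ author: 'Filipe', post: 'This is a sample post.' }]
}]

console.log(toSeedRecords(sample))
```

In real use you would call it inside the `.then` of `findPosts(options)` instead of on sample data.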
Extractions
If you want to run an extraction on its own, you can import any of the following:
- homeExtraction
- postExtraction
- listTopicsExtraction
Each of them receives a Cheerio object, which you can obtain with domObject, as in this example:
const { domObject, homeExtraction } = require('ipbcrawler')
const extraction = async url => {
  const $ = await domObject(url)
  return homeExtraction($)
}
extraction("https://example.com/forum/home")
  .then(home => console.log(home))
  .catch(e => console.log(e))
Object returned by each extraction
homeExtraction
{
  "zones": [
    {
      "title": string,
      "categories": [
        {
          "icon": string,
          "title": string,
          "description": string,
          "subCategories": [ { "title": string } ]
        }
      ]
    }
  ],
  "ranks": [{
    "name": string,
    "withHTML": string
  }]
}
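To show how this object can be walked, here is a small sketch that lists every category and subcategory as a path. It assumes the result matches the shape above; `categoryPaths` and the sample data are made up for illustration and are not part of ipbcrawler.

```javascript
// Hypothetical helper: builds "zone / category / subcategory" labels
// from an object shaped like the homeExtraction result.
function categoryPaths(home) {
  return home.zones.flatMap(zone =>
    zone.categories.flatMap(category => [
      `${zone.title} / ${category.title}`,
      ...category.subCategories.map(
        sub => `${zone.title} / ${category.title} / ${sub.title}`
      )
    ])
  )
}

// Invented sample data that follows the documented shape.
const sampleHome = {
  zones: [{
    title: 'Games',
    categories: [
      { icon: '', title: 'PC', description: '', subCategories: [{ title: 'Mods' }] },
      { icon: '', title: 'Console', description: '', subCategories: [] }
    ]
  }],
  ranks: []
}

console.log(categoryPaths(sampleHome))
```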
listTopicsExtraction
{
  "topics": [
    {
      "id": string,
      "url": string,
      "title": string
    }
  ]
}
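If you run this extraction on several pages of the same category yourself, the same topic might show up more than once (for instance a pinned topic; this is an assumption about the forum, not something the package documents). A possible sketch for merging the pages by topic id, where `mergeTopicPages` is a hypothetical helper:

```javascript
// Hypothetical helper: merges several listTopicsExtraction results,
// keeping the first occurrence of each topic id.
function mergeTopicPages(pages) {
  const byId = new Map()
  for (const { topics } of pages) {
    for (const topic of topics) {
      if (!byId.has(topic.id)) byId.set(topic.id, topic)
    }
  }
  return [...byId.values()]
}

// Invented sample data that follows the documented shape.
const samplePages = [
  { topics: [
    { id: '1', url: 'https://example.com/topic/1-rules', title: 'Rules' },
    { id: '2', url: 'https://example.com/topic/2-hello', title: 'Hello' }
  ] },
  { topics: [
    { id: '1', url: 'https://example.com/topic/1-rules', title: 'Rules' },
    { id: '3', url: 'https://example.com/topic/3-news', title: 'News' }
  ] }
]

console.log(mergeTopicPages(samplePages).map(topic => topic.id))
```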
postExtraction
{
  "title": string,
  "post": string,
  "author": string
}
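The individual extractions can also be chained by hand. The sketch below reads one topic list page and then extracts each topic it links to. It assumes the functions behave as in the example above and return the objects documented here; `extractTopicPosts` is a hypothetical name, and the functions are passed in as a parameter so the sketch does not depend on how the package is loaded.

```javascript
// Hypothetical helper: given one topic list URL, returns the extracted
// post of every topic on that page, together with the topic id and url.
// `api` is expected to provide domObject, listTopicsExtraction and
// postExtraction, e.g. api = require('ipbcrawler').
async function extractTopicPosts(listUrl, api) {
  const $list = await api.domObject(listUrl)
  // await is harmless if the extraction is synchronous
  const { topics } = await api.listTopicsExtraction($list)

  const posts = []
  // Visit topics one at a time to avoid flooding the forum with requests.
  for (const topic of topics) {
    const $topic = await api.domObject(topic.url)
    const post = await api.postExtraction($topic)
    posts.push({ id: topic.id, url: topic.url, ...post })
  }
  return posts
}

// Possible usage (not executed here):
// extractTopicPosts('https://example/forum/153-games', require('ipbcrawler'))
//   .then(posts => console.log(posts))
//   .catch(e => console.log(e))
```

Unlike findPosts, this sketch only covers a single page of the topic list and does not follow pagination.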
Sorry for English.