@spider-rs/spider-rs v0.0.157
License: MIT
Repository: github
Last release: 10 months ago

spider-rs

The spider web crawler project, ported to Node.js.

Getting Started

  1. npm i @spider-rs/spider-rs --save
```ts
import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withHeaders({
    authorization: 'somerandomjwt',
  })
  .withBudget({
    '*': 20, // limit the entire website to a max of 20 pages
    '/docs': 10, // limit the `/docs` paths to 10 pages
  })
  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths
  .build()

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page) // comment out to increase performance if the title is not needed
  console.info(`Title of ${page.url} is '${title}'`)
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  })
}

await website.crawl(onPageEvent)
await website.exportJsonlData('./storage/rsseau.jsonl')
console.log(website.getLinks())
```
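The JSONL export writes one JSON object per line, so it can be read back without any extra dependencies. A minimal sketch of a parser for such a file's contents (the `readJsonl` helper is illustrative, not part of the library):

```ts
// Parse a JSONL string into an array of records, skipping blank lines.
// Each non-empty line is expected to hold one JSON object, as produced
// by website.exportJsonlData above.
function readJsonl(text: string): unknown[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line))
}
```

Pair it with `fs.readFileSync('./storage/rsseau.jsonl', 'utf8')` to load the exported file.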

Collect the resources for a website.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({
    '*': 20,
    '/docs': 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(['/resume'])
  .build()

await website.scrape()
console.log(website.getPages())
```
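The collected pages expose the same fields used in the event handler earlier (`url`, `statusCode`, `content`), so post-processing is plain JavaScript. A sketch assuming that shape (the `okUrls` helper is illustrative, not part of the library):

```ts
// Minimal shape of a crawled page, matching the fields
// referenced in the page event handler above.
interface PageLike {
  url: string
  statusCode: number
}

// Keep only pages that returned HTTP 200, mapped to their URLs.
function okUrls(pages: PageLike[]): string[] {
  return pages.filter((page) => page.statusCode === 200).map((page) => page.url)
}
```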

Run the crawls in the background on another thread.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the second param runs the crawl in the background:
// the await resolves immediately while the crawl continues
await website.crawl(onPageEvent, true)
```

Use headless Chrome rendering for crawls.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr').withChromeIntercept(true, true)

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the third param enables headless Chrome rendering
await website.crawl(onPageEvent, false, true)
console.log(website.getLinks())
```

Cron jobs can be scheduled as follows.

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')

// helper to stop the cron handle after a delay
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop())
    }, time)
  })
}

const links = []

const onPageEvent = (err, value) => {
  links.push(value)
}

const handle = await website.runCron(onPageEvent)

// stop the cron in 4 seconds
await stopCron(4000, handle)
```

Use the crawl shortcut to get the page content and URLs.

```ts
import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')
console.log(pages)
```

Benchmarks

View the benchmarks to see a breakdown between libs and platforms.

Test URL: https://espn.com

| library | pages | speed |
| --- | --- | --- |
| spider(rust): crawl | 150,387 | 1m |
| spider(nodejs): crawl | 150,387 | 153s |
| spider(python): crawl | 150,387 | 186s |
| scrapy(python): crawl | 49,598 | 1h |
| crawlee(nodejs): crawl | 18,779 | 30m |

The benchmarks above were run on a Mac M1; on Linux ARM machines, spider performs roughly 2-10x faster.

Development

Install the napi CLI: npm i @napi-rs/cli --global.

  1. yarn build:test
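The steps above can be sketched end to end; a minimal local development workflow, assuming a standard yarn setup (the `yarn install` step is an assumption, not stated in the docs):

```shell
# install the napi CLI used to build the native addon
npm i @napi-rs/cli --global

# install JS dependencies (assumed standard yarn workflow)
yarn install

# compile the Rust addon and run the tests
yarn build:test
```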