@spider-rs/spider-rs v0.0.157
MIT license · repository on GitHub · last release 12 months ago

spider-rs

The spider project ported to Node.js

Getting Started

Install the package: npm i @spider-rs/spider-rs --save

Then configure and run a crawl:

import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withHeaders({
    authorization: 'somerandomjwt',
  })
  .withBudget({
    '*': 20, // limit the whole website to a max of 20 pages
    '/docs': 10, // limit the `/docs` path to 10 pages
  })
  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths
  .build()

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page) // skip this call for better performance when the title is not needed
  console.info(`Title of ${page.url} is '${title}'`)
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  })
}

await website.crawl(onPageEvent)
await website.exportJsonlData('./storage/rsseau.jsonl')
console.log(website.getLinks())
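Each line of the exported file is a standalone JSON object. A minimal sketch for reading it back with Node's standard library, assuming the records have the shape pushed via pushData above:

import { createReadStream } from 'node:fs'
import { createInterface } from 'node:readline'

const rl = createInterface({ input: createReadStream('./storage/rsseau.jsonl') })

for await (const line of rl) {
  if (!line.trim()) continue // skip blank lines
  const record = JSON.parse(line) // { status, html, url, title } per the pushData call above
  console.log(record.url, record.title)
}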

Collect the resources for a website.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({
    '*': 20,
    '/docs': 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(['/resume'])
  .build()

await website.scrape()
console.log(website.getPages())
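The scraped pages can then be post-processed in plain JavaScript. A small sketch that extracts titles with the pageTitle helper shown earlier, assuming each entry from getPages() carries the same url field as the crawl event payload:

import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({ '*': 20 })
  .build()

await website.scrape()

// map each scraped page to its title
for (const page of website.getPages()) {
  console.log(page.url, '->', pageTitle(page))
}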

Run crawls in the background on another thread.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the second param runs the crawl in the background; the call returns immediately
await website.crawl(onPageEvent, true)
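Because background mode returns before the crawl finishes, any bookkeeping has to happen inside the event handler. A sketch using only the API shown above:

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
const seen = []

const onPageEvent = (_err, page) => {
  seen.push(page.url)
}

// returns immediately; pages keep arriving via onPageEvent
await website.crawl(onPageEvent, true)

// the main thread is free to do other work while the crawl progresses
console.log(`pages so far: ${seen.length}`)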

Use headless Chrome rendering for crawls.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr').withChromeIntercept(true, true)

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the third param enables headless Chrome rendering
await website.crawl(onPageEvent, false, true)
console.log(website.getLinks())
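Rendering with Chrome is useful when content is injected by JavaScript. A rough way to check whether a site needs it is to compare fetched content sizes with rendering on and off; this is a sketch built only from the options shown above, with a small budget to keep the test cheap:

import { Website } from '@spider-rs/spider-rs'

const contentSizes = async (headless: boolean) => {
  const website = new Website('https://rsseau.fr').withBudget({ '*': 5 }).build()
  const sizes: number[] = []
  await website.crawl((_err, page) => sizes.push(page.content?.length ?? 0), false, headless)
  return sizes
}

console.log('static:', await contentSizes(false))
console.log('rendered:', await contentSizes(true))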

Cron jobs can be scheduled as follows.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')
// helper that stops the cron handle after the given delay
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop())
    }, time)
  })
}

const links = []

const onPageEvent = (_err, value) => {
  links.push(value)
}

const handle = await website.runCron(onPageEvent)

// stop the cron in 4 seconds
await stopCron(4000, handle)
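The same shutdown can be written more directly with Node's promisified timers instead of a hand-rolled sleep:

import { setTimeout as sleep } from 'node:timers/promises'

// stop the cron after 4 seconds
await sleep(4000)
await handle.stop()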

Use the crawl shortcut to get the page content and URL.

import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')
console.log(pages)
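Here links is the list of URLs discovered during the crawl and pages holds the fetched pages, so no Website instance needs to be configured for one-off crawls.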

Benchmarks

View the benchmarks below for a breakdown across libraries and platforms.

Test url: https://espn.com

libraries                 pages      speed
spider(rust): crawl       150,387    1m
spider(nodejs): crawl     150,387    153s
spider(python): crawl     150,387    186s
scrapy(python): crawl     49,598     1h
crawlee(nodejs): crawl    18,779     30m

The benchmarks above were run on a Mac M1; spider on Linux ARM machines performs about 2-10x faster.

Development

Install the napi CLI: npm i @napi-rs/cli --global

Build and test the bindings: yarn build:test
Versions

12 months ago: 0.0.157
1 year ago: 0.0.153, 0.0.152, 0.0.151, 0.0.156, 0.0.155, 0.0.154, 0.0.149, 0.0.147, 0.0.79, 0.0.142, 0.0.141, 0.0.146, 0.0.145, 0.0.144, 0.0.70, 0.0.69, 0.0.65, 0.0.66, 0.0.64
2 years ago: 0.0.63, 0.0.62, 0.0.61, 0.0.60, 0.0.59, 0.0.56, 0.0.55, 0.0.52, 0.0.43, 0.0.45, 0.0.41, 0.0.38, 0.0.39, 0.0.37, 0.0.31, 0.0.26, 0.0.24, 0.0.21, 0.0.19, 0.0.18, 0.0.16, 0.0.15, 0.0.14, 0.0.11, 0.0.7, 0.0.6