0.7.1 • Published 6 months ago

website-scrap-engine v0.7.1

License: ISC

website-scrap-engine

A configurable website scraper written in TypeScript.

Features

  • Resource types
  • Configurable process pipeline
  • Options
  • Logger
  • Concurrent downloader
  • Multi-threaded processing (with native worker_threads)
  • Process CSS
  • Process HTML
  • Process sitemaps (paths inside them are not rewritten)
  • Configurable logging

Multi-thread processing

Note: use multi-threaded processing only if your processing is CPU-intensive.

  • Main thread
    • download resources from the queue
    • process after download
    • save binary resources to disk
    • send other resources to a worker thread
    • enqueue non-duplicate resources returned by worker threads
  • Worker thread
    • receive a downloaded resource from the main thread
    • process after download
      • parse HTML, CSS, etc.
    • collect referenced resources
    • process and filter referenced resources before download
    • send referenced resources to the main thread
    • save resources to disk
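
The division of work above can be sketched roughly as follows. This is only an illustration of the routing and de-duplication decisions made on the main thread, not website-scrap-engine's actual API: `MainThreadQueue`, `DownloadedResource`, `WorkerStub` and `collectLinks` are hypothetical names, and the worker thread is replaced by a plain synchronous callback so the sketch stays self-contained.

```typescript
// Hypothetical types for illustration; not the library's real API.
type ResourceKind = "binary" | "html" | "css";

interface DownloadedResource {
  url: string;
  kind: ResourceKind;
  body: string;
}

// Stand-in for a worker thread: parses a resource and returns referenced URLs.
type WorkerStub = (res: DownloadedResource) => string[];

class MainThreadQueue {
  private readonly seen = new Set<string>();
  private readonly worker: WorkerStub;
  readonly pending: string[] = [];      // URLs accepted into the download queue
  readonly savedByMain: string[] = [];  // binary resources saved by the main thread
  readonly sentToWorker: string[] = []; // resources handed over for parsing

  constructor(worker: WorkerStub) {
    this.worker = worker;
  }

  // Enqueue a URL only if it has not been queued before.
  enqueue(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Called once a download has finished on the main thread.
  onDownloaded(res: DownloadedResource): void {
    if (res.kind === "binary") {
      this.savedByMain.push(res.url); // binary resources never reach a worker
      return;
    }
    this.sentToWorker.push(res.url);
    // The worker reports referenced resources; duplicates are dropped here.
    for (const ref of this.worker(res)) this.enqueue(ref);
  }
}

// Naive worker stub: collects href/src attribute values with a regular expression.
const collectLinks: WorkerStub = (res) => {
  const links: string[] = [];
  const pattern = /(?:href|src)="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(res.body)) !== null) {
    const link = match[1];
    if (link !== undefined) links.push(link);
  }
  return links;
};

const queue = new MainThreadQueue(collectLinks);
queue.enqueue("https://example.com/");
queue.onDownloaded({
  url: "https://example.com/",
  kind: "html",
  body: '<link href="https://example.com/a.css"><a href="https://example.com/">home</a>',
});
// queue.pending now holds the start page and a.css; the self-link was rejected.
```

In a real setup the callback would presumably be replaced by message passing to a `worker_threads` worker, which is only worthwhile when parsing dominates the run time.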

Pipeline life cycle

  • skip or redirect link
  • detect resource type
  • create
  • process before download
  • download
  • process after download
  • save to disk
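
As a rough illustration of how these stages chain together, the sketch below runs one URL through each step in order. Everything in it is an assumption made for the example: `Resource`, `Stages`, `runPipeline` and `demoStages` are hypothetical names rather than the library's API, a stage signals "skip" by returning `undefined`, and the download step is stubbed synchronously (a real downloader would be asynchronous).

```typescript
// Hypothetical shapes for illustration; not the library's real types.
interface Resource {
  url: string;
  type: "html" | "css" | "binary";
  body?: string;
  savePath?: string;
}

interface Stages {
  linkRedirect(url: string): string | undefined;              // skip or redirect link
  detectType(url: string): Resource["type"];                  // detect resource type
  processBeforeDownload(res: Resource): Resource | undefined; // process before download
  download(res: Resource): Resource | undefined;              // download (stubbed)
  processAfterDownload(res: Resource): Resource | undefined;  // process after download
  saveToDisk(res: Resource): void;                            // save to disk
}

// Runs the life cycle in order; any stage returning undefined stops processing.
function runPipeline(url: string, stages: Stages): Resource | undefined {
  const target = stages.linkRedirect(url);
  if (target === undefined) return undefined;

  const created: Resource = { url: target, type: stages.detectType(target) }; // create

  const ready = stages.processBeforeDownload(created);
  if (ready === undefined) return undefined;

  const downloaded = stages.download(ready);
  if (downloaded === undefined) return undefined;

  const processed = stages.processAfterDownload(downloaded);
  if (processed === undefined) return undefined;

  stages.saveToDisk(processed);
  return processed;
}

// In-memory stages so the sketch runs without network or file system access.
const saved: Resource[] = [];
const demoStages: Stages = {
  linkRedirect: (url) =>
    url.startsWith("mailto:") ? undefined : url.replace("http://", "https://"),
  detectType: (url) =>
    url.endsWith(".css") ? "css" : url.endsWith(".png") ? "binary" : "html",
  processBeforeDownload: (res) => res,
  download: (res) => ({ ...res, body: `content of ${res.url}` }),
  processAfterDownload: (res) => ({
    ...res,
    savePath: res.url.replace(/^https?:\/\//, ""),
  }),
  saveToDisk: (res) => {
    saved.push(res);
  },
};

runPipeline("http://example.com/style.css", demoStages);
```

Keeping each stage as a separate function is what makes such a pipeline configurable: any single step can be swapped without touching the others.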