16.8.1 • Published 4 years ago

mega-scraper v16.8.1

Weekly downloads: 127
License: MIT
Repository: github
Last release: 4 years ago

mega-scraper

scrape a website's content.

npm i -g mega-scraper

mega-scraper https://www.wikipedia.org

requirements

  • running redis instance on host 0.0.0.0 port 6379

  • on debian/ubuntu, install additional required libraries via sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

api

see api.md for more usage examples and options.

e.g.

(async () => {
  // '.' loads the package from the current directory (the repository root)
  const {browser: {createBrowser, takeScreenshot}, queue: {createQueue}} = require('.')

  const url = 'https://www.wikipedia.org/'

  // start a browser and create a queue named 'wikipedia'
  const browser = await createBrowser()
  const queue = createQueue('wikipedia')

  const page = await browser.newPage(url)

  // enqueue one job for the url
  await queue.add({ url })

  // for each job: navigate, take a screenshot, print the first 500 characters of the html
  queue.process(async (job) => {
    await page.goto(job.data.url)
    await takeScreenshot(page, job.data)
    const content = await page.content()
    console.log('content', content.substring(0, 500))
  })
})()
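
the example above adds a single job. to queue several pages, one option is a small wrapper around `queue.add`. this is a minimal sketch, not part of the mega-scraper api: `enqueueAll` is a hypothetical helper, and it assumes only what the example shows, namely that `queue.add({ url })` accepts an object with a `url` field.

```javascript
// hypothetical helper (not part of mega-scraper):
// adds one job per distinct url and returns a promise that settles once all adds have settled
function enqueueAll (queue, urls) {
  const unique = [...new Set(urls)] // drop exact duplicates, keep first-seen order
  return Promise.all(unique.map((url) => queue.add({ url })))
}
```

with the `queue` from the example above, a call would look like `await enqueueAll(queue, ['https://www.wikipedia.org/', 'https://www.wikipedia.org/wiki/Main_Page'])`.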

cli options

--headless default: true

set to false to run the scraper in "headful" mode (non-headless)

e.g.

mega-scraper https://www.wikipedia.org --headless false

--screenshot default: true

set to false to avoid taking a screenshot of each scraped page

e.g.

mega-scraper https://www.wikipedia.org --screenshot false

--proxy default: true

set to false to avoid proxying each request through a free proxy service (currently the module get-free-https-proxy is used)

e.g.

mega-scraper https://www.wikipedia.org --proxy false

--timeout default: 5000

set the timeout to a desired number in milliseconds (5000 = 5 seconds)

e.g.

mega-scraper https://www.wikipedia.org --timeout 10000

--images default: true

set to false to avoid loading images

e.g.

mega-scraper https://www.wikipedia.org --images false

--stylesheets default: true

set to false to avoid loading stylesheets

e.g.

mega-scraper https://www.wikipedia.org --stylesheets false

--javascript default: true

set to false to avoid loading javascript

e.g.

mega-scraper https://www.wikipedia.org --javascript false

--monitor default: true

set to false to avoid opening the web dashboard on localhost:4000

e.g.

mega-scraper https://www.wikipedia.org --monitor false

--exit default: false

set to true to exit the program with a success or failure status code once scraping is done.

e.g.

mega-scraper https://www.wikipedia.org --exit

--cookie default: none

set to a desired cookie to further prevent detection

e.g.

mega-scraper https://www.wikipedia.org --cookie 'my=cookie'
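
each example above sets a single flag. when calling the cli from a node script it can help to build the argument list from an options object. the sketch below is an illustration only: `buildArgs` and `DEFAULTS` are hypothetical names, the defaults are copied from the list above, and it assumes that several flags can be combined in one command, which this readme does not state explicitly.

```javascript
// defaults as documented in the cli options above (cookie has no default)
const DEFAULTS = {
  headless: true,
  screenshot: true,
  proxy: true,
  timeout: 5000,
  images: true,
  stylesheets: true,
  javascript: true,
  monitor: true,
  exit: false
}

// hypothetical helper: turns an options object into an argv array for the mega-scraper cli
function buildArgs (url, options = {}) {
  const args = [url]
  for (const [name, value] of Object.entries(options)) {
    if (name !== 'cookie' && !(name in DEFAULTS)) {
      throw new Error(`unknown option: ${name}`)
    }
    if (name === 'exit') {
      // documented as a bare flag: "mega-scraper <url> --exit"
      if (value) args.push('--exit')
      continue
    }
    if (value === DEFAULTS[name]) continue // skip values equal to the documented default
    args.push(`--${name}`, String(value))
  }
  return args
}

console.log(buildArgs('https://www.wikipedia.org', { headless: false, timeout: 10000 }).join(' '))
```

the resulting array can be passed to `child_process.spawn('mega-scraper', args, { stdio: 'inherit' })`, which avoids shell quoting issues with values such as cookies.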