adult-detection-scraper v0.1.0
The adult-detection-scraper is part of the adult detection service and acts as a plain scraping service for the adult-detection-api.
Its main objective is to load and parse websites, scrape their metadata and the src and alt attributes of their images,
and return a JSON object to the adult-detection-api.
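To make the description above more concrete, here is a minimal sketch of how such a JSON object could be assembled. The function name `buildScrapeResult` and the field names `url`, `metadata`, and `images` are assumptions for illustration only; the actual schema returned by the service is not documented here.

```javascript
// Hypothetical response shape; the field names are illustrative assumptions.
function buildScrapeResult(url, metaTags, imageTags) {
  return {
    url,
    // Collapse { name, content } pairs from <meta> tags into one object,
    // skipping tags that lack a name or a content value.
    metadata: Object.fromEntries(
      metaTags
        .filter((tag) => tag.name && tag.content)
        .map((tag) => [tag.name, tag.content])
    ),
    // Keep only the src and alt attributes; images without a src are dropped.
    images: imageTags
      .filter((img) => img.src)
      .map((img) => ({ src: img.src, alt: img.alt || "" })),
  };
}
```

The result can be passed directly to `JSON.stringify` or to Express's `res.json`.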
Table of contents
[TOC]
Infrastructure
The adult-detection-scraper runs on Kubernetes, like all our other tools.
It provides a RESTful API, which is reachable at https://adult-detection-scraper.nlsn-eng-ops.com (https://adult-detection-scraper-staging.nlsn-eng-ops.com for staging).
It defaults to 3 running pods and autoscales horizontally up to 20 pods based on CPU and memory usage.
Design
ExpressJS provides the web server and accepts the incoming requests.
To save time and compute resources, Express creates a single Puppeteer instance and handles the requests coming in through the API by opening tabs in that instance.
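The shared-instance approach can be sketched roughly as follows. This is not the service's actual code: `createScraper` and `launchBrowser` are hypothetical names, and the browser is injected so that the sketch does not depend on how Puppeteer is configured in the project. With Puppeteer, `launchBrowser` would typically be `() => puppeteer.launch()`.

```javascript
// Sketch: one shared browser, one tab (page) per incoming request.
// `launchBrowser` is an injected factory, e.g. () => puppeteer.launch().
function createScraper(launchBrowser) {
  let browserPromise = null; // launched lazily on first use, then reused

  return async function scrape(url) {
    if (!browserPromise) browserPromise = launchBrowser();
    const browser = await browserPromise;
    const page = await browser.newPage(); // open a new tab for this request
    try {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      return await page.title();
    } finally {
      await page.close(); // always release the tab, even on errors
    }
  };
}
```

Storing the launch promise rather than the browser itself means that concurrent first requests still share a single launch.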
To speed up downloads and reduce load, we do not download media, fonts, and other resources. The different resourceTypes are described in the official documentation.
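One common way to skip such resources in Puppeteer is request interception. The sketch below assumes that approach. The helper name `blockHeavyResources` and the exact contents of `BLOCKED_RESOURCE_TYPES` are assumptions, since the text only mentions media, fonts, and "other resources".

```javascript
// Resource types to skip. Only "media" and "font" are named in the text;
// extend this set as needed (assumption: the real list may differ).
const BLOCKED_RESOURCE_TYPES = new Set(["media", "font"]);

// Enable interception on a Puppeteer page and abort blocked request types.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
      request.abort(); // never download this resource
    } else {
      request.continue(); // let everything else through
    }
  });
}
```

It would be called once per tab, before `page.goto`, so that the very first requests are already filtered.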
Based on our tests, we estimate a limit of roughly 13-15 tabs per pod.
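A per-pod tab limit like this could be enforced with a small concurrency limiter. The sketch below is only an illustration: `createLimiter` is a hypothetical helper, and whether the service caps tabs in code or relies solely on autoscaling is not stated here.

```javascript
// Sketch of a promise-based limiter: at most `maxConcurrent` tasks run at once,
// the rest wait in a FIFO queue.
function createLimiter(maxConcurrent) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--; // free the slot, then start the next queued task
        next();
      });
  };

  // Returns a promise that settles with the task's own result or error.
  return (task) =>
    new Promise((resolve, reject) => {
      queue.push({ task, resolve, reject });
      next();
    });
}
```

For example, `const limit = createLimiter(13)` followed by `limit(() => scrapeOnePage(url))` would keep at most 13 tabs open, where `scrapeOnePage` stands in for whatever function opens and processes a tab.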
Below is a flowchart that explains, at a high level, how the adult-detection-scraper works.
Development & Contribution
- Clone this repository
- Run it on your local machine (you'll need to have Chrome or Chromium installed)
  - run `yarn install`
  - run `yarn dev`
- Run it in Docker
  - run `docker build -t adult-detection-scraper .` or, using make, `make build_docker`
  - run `docker run -p 8080:8080 adult-detection-scraper` or, using make, `make run_docker`
The first option (running on your local machine) is great for debugging, as you can run it with headless set to false.
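As a rough illustration of toggling headless mode for local debugging, the launch options could be derived from an environment variable. Both the helper `buildLaunchOptions` and the `HEADLESS` variable are hypothetical; the project may set this option differently.

```javascript
// HEADLESS is a hypothetical environment variable, not one the project documents.
function buildLaunchOptions(env) {
  return {
    // Show the browser window only when HEADLESS is explicitly set to "false".
    headless: env.HEADLESS !== "false",
  };
}
```

With Puppeteer, this would be used as `puppeteer.launch(buildLaunchOptions(process.env))`, for example after starting the server with `HEADLESS=false yarn dev`.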
If you submit a merge request, please add a new item to CHANGELOG.md and follow the GitLab guide.