adult-detection-scraper

The adult-detection-scraper is part of the adult detection service and acts as a plain scraping service for the adult-detection-api. Its main objective is to load and parse websites, scrape their metadata and the src and alt attributes of their images, and return a JSON object to the adult-detection-api.
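As a rough illustration of the kind of JSON object described above, the sketch below assembles scraped metadata and image attributes into one result. The function name buildScrapeResult and the field names url, metadata and images are hypothetical; the actual response format of the service is not documented here.

```javascript
// Hypothetical result shape: the field names are illustrative only,
// not the service's documented response format.
function buildScrapeResult(url, metaTags, images) {
  const metadata = {};
  for (const { name, content } of metaTags) {
    // Skip meta tags without a usable name or content.
    if (name && content) {
      metadata[name] = content;
    }
  }
  return {
    url,
    metadata,
    // Keep only src and alt; a missing alt becomes an empty string.
    images: images.map(({ src, alt }) => ({ src, alt: alt || '' })),
  };
}

const result = buildScrapeResult(
  'https://example.com',
  [{ name: 'description', content: 'An example page' }, { name: '', content: 'ignored' }],
  [{ src: '/logo.png', alt: 'Logo' }, { src: '/banner.jpg', alt: null }]
);
console.log(JSON.stringify(result, null, 2));
```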

[TOC]

Infrastructure

The adult-detection-scraper runs on Kubernetes, like all our other tools. It provides a RESTful API, which is reachable at https://adult-detection-scraper.nlsn-eng-ops.com (https://adult-detection-scraper-staging.nlsn-eng-ops.com for staging).

It defaults to 3 running pods and autoscales horizontally up to 20 pods based on CPU and memory usage.

Design

ExpressJS provides the web server and accepts the incoming requests. To save time and compute resources, the service creates a single instance of Puppeteer and opens tabs in it to process the requests coming in through the API. To speed up page loads and reduce load, it does not download media, fonts and other such resources. The different resourceTypes are described in the official documentation.
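The tab-per-request approach with resource blocking could look roughly like the sketch below. It is not the service's actual code: scrapePage and shouldBlock are hypothetical names, the browser argument is assumed to be one shared Puppeteer Browser (for example from puppeteer.launch()), and including 'image' in the blocked set is an assumption on top of the media and font types named above.

```javascript
// 'media' and 'font' come from the description above; 'image' is an
// assumption (src and alt are read from the DOM, not from the image file).
const BLOCKED_RESOURCE_TYPES = new Set(['media', 'font', 'image']);

function shouldBlock(resourceType) {
  return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// `browser` is a shared Puppeteer Browser instance. Each call opens one
// tab, scrapes the page, and always closes the tab again.
async function scrapePage(browser, url) {
  const page = await browser.newPage();
  try {
    // Intercept requests so unwanted resource types are never downloaded.
    await page.setRequestInterception(true);
    page.on('request', (request) => {
      if (shouldBlock(request.resourceType())) {
        request.abort();
      } else {
        request.continue();
      }
    });

    await page.goto(url, { waitUntil: 'domcontentloaded' });

    // Collect the src and alt attributes of every <img> element.
    const images = await page.$$eval('img', (imgs) =>
      imgs.map((img) => ({
        src: img.getAttribute('src'),
        alt: img.getAttribute('alt'),
      }))
    );
    return { url, images };
  } finally {
    await page.close();
  }
}
```

An Express route handler would call scrapePage with the shared browser for every incoming request, so only one browser process is started per pod.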

Based on our tests, we estimate a limit of ~13-15 tabs per pod.
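One way to respect such a per-pod tab limit is a small in-process limiter that queues work once the maximum number of tabs is open. The sketch below is an assumption about how this could be done, not a description of the existing implementation; TabLimiter and MAX_TABS_PER_POD are hypothetical names, and 13 is simply the lower end of the estimate above.

```javascript
// Hypothetical limit, taken from the lower end of the ~13-15 estimate.
const MAX_TABS_PER_POD = 13;

class TabLimiter {
  constructor(maxTabs) {
    this.maxTabs = maxTabs;
    this.active = 0;   // number of slots currently in use
    this.waiting = []; // resolve callbacks of queued tasks
  }

  // Runs `task` (an async function) once a slot is free.
  async run(task) {
    if (this.active >= this.maxTabs) {
      // Wait until a finishing task hands its slot over to us.
      await new Promise((resolve) => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // pass the slot on; `active` stays unchanged
      } else {
        this.active -= 1;
      }
    }
  }
}

const limiter = new TabLimiter(MAX_TABS_PER_POD);
```

A request handler would then wrap its scraping call, for example limiter.run(() => scrape(url)), where scrape stands for whatever function opens the tab. With 3 to 20 pods and about 13 tabs each, this suggests roughly 39 to 260 concurrent tabs across the deployment.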

Below is a flowchart that explains, at a high level, how the adult-detection-scraper works.

[Figure: flowchart of the adult-detection-scraper]

Development & Contribution

Clone this repository

Run it on your local machine (you'll need to have Chrome or Chromium installed)

Run yarn install
Run yarn dev

Run it in Docker

Run docker build -t adult-detection-scraper . (or, using make, run make build_docker)
Run docker run -p 8080:8080 adult-detection-scraper (or, using make, run make run_docker)

Option one (running it on your local machine) is well suited to debugging, as you can run it with headless set to false.
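A minimal sketch of how the headless setting could be toggled for local debugging is shown below. The environment variable name HEADLESS and the helper buildLaunchOptions are hypothetical; headless itself is a regular Puppeteer launch option.

```javascript
// HEADLESS is a hypothetical environment variable name, used here only
// for illustration. Anything other than the string 'false' keeps the
// browser headless.
function buildLaunchOptions(env = process.env) {
  return { headless: env.HEADLESS !== 'false' };
}

// Usage (assuming puppeteer is installed):
//   const browser = await puppeteer.launch(buildLaunchOptions());
```

Running the service locally with HEADLESS=false would then open a visible browser window, which makes it easier to watch what each tab is doing.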

If you submit a merge request, please add a new entry to the CHANGELOG.md and follow the GitLab guide.

aws-sdk express express-prometheus-middleware kafkajs prom-client puppeteer tenv winston

adult-detection-scraper v0.1.0