adult-detection-scraper

Version 0.1.0 · License: ISC

The adult-detection-scraper is part of the adult detection service and acts as a plain scraping service for the adult-detection-api. Its main job is to load and parse websites, scrape their metadata and the src and alt attributes of their images, and return the result as a JSON object to the adult-detection-api.
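To make the returned JSON object more concrete, here is a minimal sketch of how such a payload could be assembled. The function name `buildScrapeResult` and the field names `url`, `metadata` and `images` are assumptions for illustration only; the actual response schema is not documented in this README.

```javascript
// Hypothetical payload builder: the field names below are assumptions,
// not the documented schema of the adult-detection-scraper.
function buildScrapeResult(url, metaTags, images) {
  // Collapse <meta name="..." content="..."> pairs into one object,
  // skipping tags that lack a name or a content value.
  const metadata = {};
  for (const { name, content } of metaTags) {
    if (name && content) metadata[name] = content;
  }

  return {
    url,
    metadata,
    // Keep only images that have a src; a missing alt becomes "".
    images: images
      .filter((img) => Boolean(img.src))
      .map((img) => ({ src: img.src, alt: img.alt ?? "" })),
  };
}
```

The result can be passed to `JSON.stringify` or to Express's `res.json` as-is.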

Table of contents

[TOC]

Infrastructure

The adult-detection-scraper runs on Kubernetes, like all our other tools. It provides a RESTful API, reachable at https://adult-detection-scraper.nlsn-eng-ops.com (staging: https://adult-detection-scraper-staging.nlsn-eng-ops.com).

It defaults to 3 running pods and autoscales horizontally up to 20 pods based on CPU and memory usage.
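As a rough illustration of the scaling behaviour, the sketch below applies the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, takes the largest proposal across metrics, and clamps it to the 3 to 20 pod range stated above. It is simplified (no tolerance band or stabilization window), and the target utilization values in the usage note are assumptions, since the real targets are not given here.

```javascript
const MIN_PODS = 3; // default pod count from this README
const MAX_PODS = 20; // autoscaling ceiling from this README

// metrics: array of { current, target } utilization values (same unit each).
// Simplified HPA rule: every metric proposes a replica count, the largest
// proposal wins, and the result is clamped to [MIN_PODS, MAX_PODS].
function desiredReplicas(currentReplicas, metrics) {
  const proposals = metrics.map(({ current, target }) =>
    Math.ceil(currentReplicas * (current / target))
  );
  const wanted = Math.max(...proposals);
  return Math.min(MAX_PODS, Math.max(MIN_PODS, wanted));
}
```

For example, with 3 pods, CPU at 90% against an assumed 60% target and memory at 50% against an assumed 70% target, `desiredReplicas(3, [{ current: 90, target: 60 }, { current: 50, target: 70 }])` proposes 5 pods.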

Design

ExpressJS provides the web server and accepts the incoming requests. To save time and compute resources, Express creates one instance of Puppeteer and processes each request coming in through the API in its own tab. To speed up page loads and reduce load, we do not download media, fonts, and other such resources. The different resourceTypes are described in the official documentation.
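The resource filtering described above can be sketched with Puppeteer's request interception. This is a minimal sketch, not the service's actual code: the names `filterRequest` and `enableResourceBlocking` are made up for illustration, and the blocklist only contains the two types the text names explicitly, so the real list may be longer. `page` is expected to be a Puppeteer `Page`.

```javascript
// Assumption: only "media" and "font" are named in this README;
// the real blocklist may contain further resource types.
const BLOCKED_RESOURCE_TYPES = new Set(["media", "font"]);

// Handler for Puppeteer's "request" event: abort blocked resource
// types and let every other request through.
function filterRequest(request) {
  if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Enable interception on a tab and attach the handler.
// Call this before page.goto() so the first navigation is filtered too.
async function enableResourceBlocking(page) {
  await page.setRequestInterception(true);
  page.on("request", filterRequest);
}
```

Because image src and alt values are read from the DOM, the image files themselves would not need to be downloaded either; whether the service also blocks the `image` type is not stated here.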

From our tests, we estimate a limit of ~13-15 tabs per pod.
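One way to respect such a per-pod limit is to cap the number of concurrently open tabs with a small semaphore. The sketch below is an assumption about how this could be done, not a description of the service's code; `createTabLimiter`, `withTab` and `MAX_TABS_PER_POD` are illustrative names, and 13 is simply the lower end of the estimate above.

```javascript
const MAX_TABS_PER_POD = 13; // lower bound of the ~13-15 estimate

// Returns a function that runs async tasks with at most `maxTabs`
// of them active at once; further callers wait in FIFO order.
function createTabLimiter(maxTabs) {
  let active = 0;
  const waiting = [];

  async function acquire() {
    if (active < maxTabs) {
      active += 1;
      return;
    }
    // Wait until release() hands this caller an already-counted slot.
    await new Promise((resolve) => waiting.push(resolve));
  }

  function release() {
    const next = waiting.shift();
    if (next) {
      next(); // pass the slot straight to the next waiter
    } else {
      active -= 1;
    }
  }

  return async function withTab(task) {
    await acquire();
    try {
      return await task();
    } finally {
      release();
    }
  };
}
```

A request handler could then wrap its work as `withTab(async () => { /* open a tab, scrape, close the tab */ })`, where `withTab = createTabLimiter(MAX_TABS_PER_POD)`. Handing the slot directly to the next waiter avoids a window in which a new caller could slip in and exceed the limit.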

[Flowchart: high-level overview of how the adult-detection-scraper works]

Development & Contribution

  • Clone this repository, then choose one of the two options:
  1. Run it on your local machine (you'll need to have Chrome or Chromium installed)
    • run yarn install
    • run yarn dev
  2. Run it in Docker
    • run docker build -t adult-detection-scraper . (or, using make, make build_docker)
    • run docker run -p 8080:8080 adult-detection-scraper (or, using make, make run_docker)

Option 1 is well suited to debugging, because you can run it with headless set to false and watch the browser.

If you submit a merge request, please add a new entry to CHANGELOG.md and follow the GitLab guide.