adult-detection-scraper v0.1.0
adult-detection-scraper
The adult-detection-scraper is part of the adult detection service and acts as a plain scraping service for the adult-detection-api.
Its main objective is to load and parse websites, scrape their metadata as well as the src and alt attributes of their images, and return a JSON object to the adult-detection-api.
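To make the scraping step more concrete, here is a minimal sketch of how metadata and image attributes could be collected into a JSON-friendly object. The function name `collectPageData` and the result fields (`title`, `meta`, `images`) are assumptions for illustration; the actual response format of the service is not documented in this README.

```javascript
// Hypothetical result shape: { title, meta: { name: content }, images: [{ src, alt }] }.
// The real field names used by the adult-detection-scraper may differ.
function collectPageData(doc) {
  const meta = {};
  for (const el of Array.from(doc.querySelectorAll("meta"))) {
    // Both <meta name="..."> and Open Graph style <meta property="..."> are common.
    const key = el.getAttribute("name") || el.getAttribute("property");
    const content = el.getAttribute("content");
    if (key && content !== null) meta[key] = content;
  }
  const images = Array.from(doc.querySelectorAll("img")).map((img) => ({
    src: img.getAttribute("src") || "",
    alt: img.getAttribute("alt") || "", // a missing alt becomes an empty string
  }));
  return { title: doc.title || "", meta, images };
}

// Usage with a tiny stand-in for a DOM document (enough for a quick check).
const el = (attrs) => ({
  getAttribute: (name) => (name in attrs ? attrs[name] : null),
});
const fakeDocument = {
  title: "Example",
  querySelectorAll: (selector) =>
    selector === "meta"
      ? [el({ name: "description", content: "An example page" })]
      : [el({ src: "/logo.png", alt: "Logo" }), el({ src: "/banner.jpg" })],
};
console.log(JSON.stringify(collectPageData(fakeDocument), null, 2));
```

In a real browser context the same function would receive the page's `document` instead of the stand-in object.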
Table of contents
[TOC]
Infrastructure
The adult-detection-scraper runs on Kubernetes, like all our other tools.
It provides a RESTful API, which is reachable at https://adult-detection-scraper.nlsn-eng-ops.com (https://adult-detection-scraper-staging.nlsn-eng-ops.com for staging).
It defaults to 3 running pods and autoscales horizontally up to 20 pods based on CPU and memory usage.
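As a rough illustration of the scaling behaviour, the sketch below applies the replica formula used by the Kubernetes Horizontal Pod Autoscaler (ceil of current replicas times current metric over target metric, taking the largest proposal across metrics) and clamps it to the 3 to 20 pod range stated above. The 70% target utilization in the usage lines is an assumption, and the sketch ignores the autoscaler's tolerance and stabilization settings.

```javascript
const MIN_PODS = 3; // default pod count from this README
const MAX_PODS = 20; // upper autoscaling bound from this README

// metrics: one entry per scaled resource, e.g. CPU and memory utilization in percent.
function desiredPods(currentPods, metrics) {
  const proposals = metrics.map(({ current, target }) =>
    Math.ceil(currentPods * (current / target))
  );
  // With several metrics, the largest proposed replica count wins.
  const wanted = Math.max(...proposals);
  return Math.min(MAX_PODS, Math.max(MIN_PODS, wanted));
}

// Usage (the target of 70 is assumed, not taken from the actual deployment):
const cpu = { current: 90, target: 70 };
const memory = { current: 50, target: 70 };
console.log(desiredPods(3, [cpu, memory])); // CPU is above target, so one pod is added
console.log(desiredPods(15, [{ current: 140, target: 70 }])); // capped at MAX_PODS
```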
Design
ExpressJS provides the web server and accepts the incoming requests.
To save time and compute resources, Express creates an instance of puppeteer and opens tabs in it to process the requests coming in through the API.
To speed up page loads and reduce load, we do not download any media, font, or other resources. The different resourceTypes are described in the official documentation.
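The resource blocking described above can be sketched with puppeteer's request interception. The handler below is a minimal example, assuming the blocked set is exactly `media`, `font`, and `other`; the helper names `handleRequest` and `fakeRequest` are hypothetical, and a stand-in request object is used so the snippet runs without a browser.

```javascript
// Resource types named in this README; the exact list used by the service is an assumption.
const BLOCKED_RESOURCE_TYPES = new Set(["media", "font", "other"]);

// Abort requests for blocked resource types, let everything else through.
function handleRequest(request, blocked = BLOCKED_RESOURCE_TYPES) {
  if (blocked.has(request.resourceType())) {
    return request.abort();
  }
  return request.continue();
}

// Wiring with puppeteer (not executed here):
//   await page.setRequestInterception(true);
//   page.on("request", (request) => handleRequest(request));

// Stand-in for a puppeteer request that records which method was called.
function fakeRequest(type) {
  const calls = [];
  return {
    calls,
    resourceType: () => type,
    abort: () => calls.push("abort"),
    continue: () => calls.push("continue"),
  };
}

const fontRequest = fakeRequest("font");
handleRequest(fontRequest);
console.log(fontRequest.calls); // the font request is aborted
```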
From our tests, we estimate a limit of ~13-15 tabs per pod.
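One way to respect such a per-pod limit is a simple slot counter that a request handler consults before opening a new tab. This is only a sketch: the class `TabSlots` and the constant `MAX_TABS_PER_POD` are hypothetical, and how the service actually reacts when the limit is reached is not described here.

```javascript
// Lower end of the ~13-15 estimate; the value enforced by the service (if any) is an assumption.
const MAX_TABS_PER_POD = 13;

class TabSlots {
  constructor(maxTabs) {
    this.maxTabs = maxTabs;
    this.inUse = 0;
  }

  // Reserve a slot for a new tab; returns false when the pod is already full.
  tryAcquire() {
    if (this.inUse >= this.maxTabs) return false;
    this.inUse += 1;
    return true;
  }

  // Free a slot once the tab has been closed.
  release() {
    if (this.inUse > 0) this.inUse -= 1;
  }
}

// Usage: fill the pod, then observe that one more tab is refused.
const slots = new TabSlots(MAX_TABS_PER_POD);
for (let i = 0; i < MAX_TABS_PER_POD; i += 1) slots.tryAcquire();
console.log(slots.tryAcquire()); // false: all slots are taken
slots.release();
console.log(slots.tryAcquire()); // true: a slot was freed
```

A handler would typically call `release()` in a `finally` block after closing the tab, so that a failed scrape does not leak a slot.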
Below is a flowchart that explains, at a high level, how the adult-detection-scraper works.
Development & Contribution
- Clone this repository
- Run it on your local machine (you'll need to have Chrome or Chromium installed)
  - run `yarn install`
  - run `yarn dev`
- Run it in Docker
  - run `docker build -t adult-detection-scraper .` or, using make, `make build_docker`
  - run `docker run -p 8080:8080 adult-detection-scraper` or, using make, `make run_docker`
Option one (running it on your local machine) is great for debugging, as you can run it with `headless` set to `false`.
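As a small illustration of that debugging setup, the sketch below builds puppeteer launch options with `headless` switched off on demand. The `HEADLESS` environment variable and the helper `buildLaunchOptions` are assumptions made for this example; the repository may configure this differently.

```javascript
// HEADLESS is a hypothetical environment variable, used only for this sketch.
// Anything other than the string "false" keeps the browser headless.
function buildLaunchOptions(env = {}) {
  return { headless: env.HEADLESS !== "false" };
}

// With puppeteer (not executed here):
//   const browser = await puppeteer.launch(buildLaunchOptions(process.env));

console.log(buildLaunchOptions({ HEADLESS: "false" })); // visible browser window for debugging
console.log(buildLaunchOptions({})); // headless by default
```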
If you submit a merge request, please add a new entry to the CHANGELOG.md and follow the GitLab guide.