mi-crawl-system

MI Crawl Job Controller and Workers

Architecture

Build and Lint

# The following builds the entire project
npm run build

# Run the linter
npm run lint

Tests

# Tests require Redis and Kafka; spin them up with docker-compose first.
# The Kafka container may take some time to create the topics, so wait a few seconds.
docker-compose up

# run tests now
npm run test

Priority Queue

  1. Redis supports a data structure called 'Sorted Set' that provides a priority queue while also ensuring the uniqueness of every key in the queue.

  2. Every search term and product that enters the MI system is put into its respective queue, with its priority set to the next crawl timestamp.

We leverage BullMQ, which allows us to set up repeatable (cron-based) jobs. Internally, BullMQ keeps these jobs in a Sorted Set and moves them to the 'waiting' state based on their next crawl timestamp.
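As an illustration, here is a minimal TypeScript sketch of registering a repeatable crawl job with BullMQ. The queue name 'crawl-schedule', the Redis connection settings, and the job payload are assumptions for this sketch, not the project's actual values.

import { Queue } from 'bullmq';

// Hypothetical queue name and Redis connection; adjust to your deployment.
const scheduleQueue = new Queue('crawl-schedule', {
  connection: { host: '127.0.0.1', port: 6379 },
});

// Register a repeatable (cron-based) job. BullMQ stores repeatable jobs in a
// Sorted Set and promotes them to the 'waiting' state at the next timestamp.
async function scheduleSearchTermCrawl(searchTerm: string, country: string) {
  await scheduleQueue.add(
    'crawl-search-term',
    { searchTerm, country },
    // Older BullMQ versions use `repeat: { cron: ... }` instead of `pattern`.
    { repeat: { pattern: '0 10 * * *' } },
  );
}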

To handle priority among crawl jobs, the service that submits a job to the JobController can supply a dynamic cron expression in place of a standard one.

Let's take the following example:

i) 0 10 * * * (a regular cron expression)

ii) H 10 * * * (a dynamic cron expression)

Here 'H' introduces dynamicity into the cron expression: it instructs the JobController to schedule the task at any time between 10:00 and 10:59. This allows lower-priority tasks to be spread out around higher-priority ones scheduled for exactly 10:00, as expressed by i).

For TRE, to schedule crawl tasks every 15 days, the following expression can be used:

H H H/15 * *

This expression instructs the JobController to schedule the task every 15 days, at any hour and minute within those days. The days themselves are chosen randomly by the JobController to ensure an even distribution over the entire month.
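As a rough illustration of how the JobController might resolve such a dynamic expression into a standard cron expression, here is a TypeScript sketch. The helper name resolveDynamicCron and the exact resolution rules (a random value for 'H', a random starting offset for 'H/n') are assumptions for illustration only, not the project's actual implementation.

// Hypothetical helper: replaces 'H' fields of a dynamic cron expression with
// concrete random values so the result is a standard cron expression.
// Field ranges: minute 0-59, hour 0-23, day-of-month 1-31, month 1-12, day-of-week 0-6.
const FIELD_RANGES: Array<[number, number]> = [
  [0, 59], [0, 23], [1, 31], [1, 12], [0, 6],
];

function randomBetween(min: number, max: number): number {
  return min + Math.floor(Math.random() * (max - min + 1));
}

function resolveDynamicCron(expr: string): string {
  return expr
    .trim()
    .split(/\s+/)
    .map((field, i) => {
      const [min, max] = FIELD_RANGES[i];
      if (field === 'H') {
        // Plain 'H': pick any value within the field's range.
        return String(randomBetween(min, max));
      }
      const step = field.match(/^H\/(\d+)$/);
      if (step) {
        // 'H/n': pick a random starting offset, then repeat every n units
        // using standard range/step cron syntax (e.g. '7-31/15').
        const n = Number(step[1]);
        return `${randomBetween(min, min + n - 1)}-${max}/${n}`;
      }
      return field; // leave ordinary fields untouched
    })
    .join(' ');
}

// Example: 'H 10 * * *' might resolve to '37 10 * * *',
// and 'H H H/15 * *' might resolve to '12 4 7-31/15 * *'.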

To maintain uniqueness across crawlable entities, the JobController compares the existing cron expression with the newly requested one, picks the more frequent of the two, and updates its internal queue if required.

Partitioned BullMQ (Priority Queue)

  1. In production, Redis will be deployed in cluster mode, which enables horizontal scaling and is more resilient to failures (due to replicas).

  2. Keys are distributed across shards in clustered Redis. However, a single Sorted Set (priority queue) is not sharded across the cluster nodes, since it is treated as a single key.

  3. A single priority queue will not scale in the future if we plan to ingest hundreds of millions of search terms or products.

  4. To solve this problem, we will create more than one BullMQ queue, say N of them.

  5. Search terms are distributed among these queues using a simple algorithm (see the sketch after this list):

    • crc-checksum(search-term + country, or productId + country) modulo N
  6. This is needed only once, when an entry is inserted into the priority queue for the first time; it is performed by the consumer (mi-tre-consumer) or by the API that creates a new search term for a customer. For the search terms and productIds already in the MI database, we will write a job that pulls this data and writes it to the respective priority queues.

  7. The Job Controller (described in more detail below) fetches the top records from all partitioned queues to create 'crawl tasks' for workers.
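A minimal sketch of the partition-selection step from point 5, assuming the crc-32 npm package and a hypothetical NUM_QUEUES constant; the exact key format is only an illustration of "search-term + country (or productId + country)":

import CRC32 from 'crc-32';

// Hypothetical number of partitioned BullMQ priority queues.
const NUM_QUEUES = 8;

// Picks the priority-queue partition for a search term (or productId) + country:
// crc-checksum(key) modulo N, as described above.
function pickQueueIndex(entity: string, country: string): number {
  const key = `${entity}+${country}`;
  // CRC32.str returns a signed 32-bit integer; take the absolute value before the modulo.
  return Math.abs(CRC32.str(key)) % NUM_QUEUES;
}

// Example: pickQueueIndex('wireless earbuds', 'US') -> an index in [0, NUM_QUEUES).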

Job Controller and Crawl Workers Interaction

  1. The role of the JobController is to pull crawlable records (search terms/products) from the priority queue and make them available to workers.

  2. Crawl workers pull tasks from the task queue and perform the crawl operation (see the sketch after this list).

  3. The JobController updates the last-crawled timestamp for every entity once the crawl completes, and also records whether it succeeded or failed.
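As an illustration of the worker side of this interaction, here is a minimal BullMQ Worker sketch in TypeScript. The queue name 'crawl-tasks', the job payload shape, and the performCrawl helper are assumptions, not the project's actual code.

import { Worker } from 'bullmq';

// Hypothetical crawl routine; the real workers perform SERP/PDP scraping here.
async function performCrawl(task: { type: 'serp' | 'pdp'; target: string }) {
  // ... crawl logic ...
}

// Crawl workers pull tasks from the task queue and perform the crawl operation.
const worker = new Worker(
  'crawl-tasks',
  async (job) => {
    await performCrawl(job.data);
  },
  { connection: { host: '127.0.0.1', port: 6379 } },
);

// The JobController can observe these events (e.g. via QueueEvents) to record
// the last-crawled timestamp and the success/failure status.
worker.on('completed', (job) => console.log(`crawl ${job.id} succeeded`));
worker.on('failed', (job, err) => console.error(`crawl ${job?.id} failed: ${err.message}`));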

Serp Crawl Workers:

Serp workers crawl the requested search terms. Once a crawl is completed, the worker checks whether brand info is present for all products in the SERP results. If not, it looks them up in the product-brand cache (stored in Redis) and updates the SERP results. If there is still no brand info, it sets NO_BRAND_MARKER (----------) and sends the entire updated result to a Kafka topic.

For NO_BRAND_MARKER products, it also calls the JobController in parallel to scrape brand information for them.
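A minimal sketch of the brand-lookup step, assuming ioredis, a hypothetical sha1-based key builder, a 'brand' hash field, and the 'p_' key prefix described in the MI-Entities section below; all of these details are illustrative assumptions.

import Redis from 'ioredis';
import { createHash } from 'crypto';

const NO_BRAND_MARKER = '----------';
const redis = new Redis(); // product-brand cache

// Hypothetical key builder matching the 'p_' + hash(product_id, country_code, channel) scheme.
function productKey(productId: string, country: string, channel: string): string {
  return 'p_' + createHash('sha1').update(`${productId}:${country}:${channel}`).digest('hex');
}

// Fill in missing brand info from the cache; fall back to NO_BRAND_MARKER.
async function resolveBrand(productId: string, country: string, channel: string): Promise<string> {
  const cached = await redis.hget(productKey(productId, country, channel), 'brand');
  return cached ?? NO_BRAND_MARKER;
}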

PDP Crawl Workers:

PDP workers submit PDP results to a Kafka topic once the scrape for a product is completed.
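A minimal sketch of publishing PDP results to Kafka, assuming kafkajs; the broker list, client id, and topic name 'mi-pdp-results' are illustrative assumptions.

import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'pdp-crawl-worker', brokers: ['localhost:9092'] });
const producer = kafka.producer();

// Publish the scraped PDP result for a single product.
async function publishPdpResult(productId: string, result: object) {
  await producer.connect();
  await producer.send({
    topic: 'mi-pdp-results', // hypothetical topic name
    messages: [{ key: productId, value: JSON.stringify(result) }],
  });
}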

MI-Entities Redis Cluster

This Redis cluster serves three main purposes.

  1. Helps the mi-tre Kafka consumer app quickly check whether a search term or product already exists in the MI system. This reduces the load on the MI management database. The Redis key is 'st_' followed by a hash of (search_term, country_code, and channel).

  2. Holds brand info for every product scraped by the PDP workers. The Redis key is 'p_' followed by a hash of (product_id, country_code, and channel).

  3. Records the last-crawled timestamp for every search term and product. The Redis keys are the same as described above. Note that we use a Redis hash to record multiple values for a given key. The last-crawled information is used for diagnostics and possibly by the workers to avoid re-crawling (TBD).
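A minimal sketch of the key scheme and Redis hash described above, assuming ioredis in cluster mode; the hash algorithm (sha1), the cluster node address, and the field names last_crawled_at / last_status are illustrative assumptions.

import Redis from 'ioredis';
import { createHash } from 'crypto';

// MI-Entities Redis cluster; node addresses are placeholders.
const cluster = new Redis.Cluster([{ host: '127.0.0.1', port: 7000 }]);

// 'st_' + hash(search_term, country_code, channel), as described in point 1.
function searchTermKey(term: string, country: string, channel: string): string {
  return 'st_' + createHash('sha1').update(`${term}:${country}:${channel}`).digest('hex');
}

// Record the last-crawled timestamp and status in a Redis hash for the key.
async function recordCrawl(term: string, country: string, channel: string, ok: boolean) {
  await cluster.hset(
    searchTermKey(term, country, channel),
    'last_crawled_at', Date.now().toString(),
    'last_status', ok ? 'success' : 'failed',
  );
}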