aljazeera-crawler

aljazeera-crawler is a command line application that helps crawl the https://www.aljazeera.net/ website.

Installation

Either install the tool globally so that it is available on your system path:

npm install -g aljazeera-crawler

Or run it directly with npx, without a global install:

npx aljazeera-crawler [options]

Usage

For CLI options, use the -h (or --help) argument:

aljazeera-crawler -h

Al Jazeera Crawler
Usage: aljazeera-crawler options

Options:
  --version        Show version number                          boolean
  -t, --threshold  the minimum number of words to be crawled    number
  -d, --domain     the domain to crawl                          string
                   [choices: "politics", "economy", "culture", "sport", "art", "technology", "heritage"]
  -h, --help       Show help                                    boolean

Suppose we want to crawl at least 100,000 words in the technology domain.

We can use either:

aljazeera-crawler -t 100000 -d technology

Or:

aljazeera-crawler --threshold 100000 --domain technology

When the crawl finishes, a file named output-technology-100000.txt is created.
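
The file name appears to follow the pattern output-<domain>-<threshold>.txt. As a minimal sketch, the two helpers below build that name and check that a file contains at least the requested number of words. Both `expected_output_file` and `meets_threshold` are hypothetical names and are not part of aljazeera-crawler. The check assumes that counting whitespace-separated words with `wc -w` is close to how the crawler counts words, and that the file is written to the current directory.

```shell
# Hypothetical helpers; they are NOT shipped with aljazeera-crawler.

# Build the output file name from a domain ($1) and a threshold ($2),
# following the pattern seen above: output-<domain>-<threshold>.txt
expected_output_file() {
  printf 'output-%s-%s.txt\n' "$1" "$2"
}

# Succeed if the file ($1) holds at least $2 whitespace-separated words.
# Assumption: `wc -w` approximates the crawler's own word count.
meets_threshold() {
  count=$(wc -w < "$1" | tr -d '[:space:]')
  [ "$count" -ge "$2" ]
}

# Example use after a crawl (not run here):
#   aljazeera-crawler -t 100000 -d technology
#   meets_threshold "$(expected_output_file technology 100000)" 100000 \
#     && echo "threshold reached"
```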

Domains

The domains that can currently be crawled are:

Category	Link
politics سياسة	https://www.aljazeera.net/news/politics/
economy اقتصاد	https://www.aljazeera.net/news/ebusiness/
culture ثقافة	https://www.aljazeera.net/news/cultureandart/
sport رياضة	https://www.aljazeera.net/sport/
art فن	https://www.aljazeera.net/news/arts/
technology تكنولوجيا	https://www.aljazeera.net/news/scienceandtechnology/
heritage تراث	https://www.aljazeera.net/turath/
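
For scripting, the table above can be expressed as a small lookup. This is only a sketch: `domain_url` and `crawl_all_commands` are hypothetical helper names, and the crawler's own internal mapping may be organised differently. The second helper prints the command for each domain rather than running it, so nothing is crawled until you execute those lines yourself.

```shell
# Hypothetical lookup built from the table above; not part of the tool.
# Prints the section URL for a domain name, or fails for unknown names.
domain_url() {
  case "$1" in
    politics)   echo "https://www.aljazeera.net/news/politics/" ;;
    economy)    echo "https://www.aljazeera.net/news/ebusiness/" ;;
    culture)    echo "https://www.aljazeera.net/news/cultureandart/" ;;
    sport)      echo "https://www.aljazeera.net/sport/" ;;
    art)        echo "https://www.aljazeera.net/news/arts/" ;;
    technology) echo "https://www.aljazeera.net/news/scienceandtechnology/" ;;
    heritage)   echo "https://www.aljazeera.net/turath/" ;;
    *)          echo "unknown domain: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) one crawl command per domain for a threshold ($1).
crawl_all_commands() {
  for domain in politics economy culture sport art technology heritage; do
    echo "aljazeera-crawler --threshold $1 --domain $domain"
  done
}
```

For example, `crawl_all_commands 100000` prints seven commands, one per domain, which can be reviewed and then piped to `sh` if they look right.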

Licence

MIT
