aljazeera-crawler v1.0.3
aljazeera-crawler
aljazeera-crawler is a command-line application for crawling the https://www.aljazeera.net/ website.
Installation
Either install the tool globally so that it is available on your system path:
npm install -g aljazeera-crawler
Or run it directly with npx:
npx aljazeera-crawler [options]
Usage
For CLI options, use the -h (or --help) argument:
aljazeera-crawler -h
Al Jazeera Crawler
Usage: aljazeera-crawler options

Options:
  --version        Show version number                        boolean
  -t, --threshold  the minimum number of words to be crawled  number
  -d, --domain     the domain to crawl                        string
                   [choices: "politics", "economy", "culture", "sport", "art", "technology", "heritage"]
  -h, --help       Show help                                  boolean
Suppose we want to crawl a minimum of 100,000 words in the technology domain. We can use either:
aljazeera-crawler -t 100000 -d technology
Or:
aljazeera-crawler --threshold 100000 --domain technology
After that, a file named output-technology-100000.txt will be created.
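When the crawler is driven from a script, it can help to know in advance which file to look for. The sketch below assumes the output name always follows the pattern output-<domain>-<threshold>.txt, which is inferred only from the single example above and is not confirmed elsewhere in this README. The helper names expectedOutputFile and buildArgs are hypothetical and not part of aljazeera-crawler.

```javascript
// Hypothetical helpers, not part of aljazeera-crawler itself.

// Assumption: the output file is named output-<domain>-<threshold>.txt,
// as suggested by the output-technology-100000.txt example.
function expectedOutputFile(domain, threshold) {
  return `output-${domain}-${threshold}.txt`;
}

// Builds the argument list using the documented -t and -d options.
function buildArgs(domain, threshold) {
  return ['-t', String(threshold), '-d', domain];
}

console.log(expectedOutputFile('technology', 100000)); // output-technology-100000.txt
console.log(buildArgs('technology', 100000).join(' ')); // -t 100000 -d technology
```

The argument list could then be passed to the installed command, for example through Node's child_process module, and the expected file name checked once the run finishes.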
Domains
The domains that can currently be crawled are:
Category | Link |
---|---|
politics سياسة | https://www.aljazeera.net/news/politics/ |
economy اقتصاد | https://www.aljazeera.net/news/ebusiness/ |
culture ثقافة | https://www.aljazeera.net/news/cultureandart/ |
sport رياضة | https://www.aljazeera.net/sport/ |
art فن | https://www.aljazeera.net/news/arts/ |
technology تكنولوجيا | https://www.aljazeera.net/news/scienceandtechnology/ |
heritage تراث | https://www.aljazeera.net/turath/ |
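For scripts that wrap the tool, the table above can be mirrored as a simple lookup that rejects unknown domains before the crawler is invoked. This is only a sketch: DOMAIN_LINKS and linkForDomain are hypothetical names, and the crawler's own internal mapping may be organized differently.

```javascript
// Hypothetical mirror of the domain table; not taken from the package source.
const DOMAIN_LINKS = {
  politics: 'https://www.aljazeera.net/news/politics/',
  economy: 'https://www.aljazeera.net/news/ebusiness/',
  culture: 'https://www.aljazeera.net/news/cultureandart/',
  sport: 'https://www.aljazeera.net/sport/',
  art: 'https://www.aljazeera.net/news/arts/',
  technology: 'https://www.aljazeera.net/news/scienceandtechnology/',
  heritage: 'https://www.aljazeera.net/turath/',
};

// Returns the section link for a domain, or throws if the domain
// is not one of the choices accepted by the -d option.
function linkForDomain(domain) {
  if (!Object.prototype.hasOwnProperty.call(DOMAIN_LINKS, domain)) {
    const choices = Object.keys(DOMAIN_LINKS).join(', ');
    throw new RangeError(`Unknown domain "${domain}". Choices: ${choices}`);
  }
  return DOMAIN_LINKS[domain];
}

console.log(linkForDomain('technology'));
```

Checking the domain up front gives a clear error message without starting a crawl.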
Licence
MIT