1.0.3 • Published 3 years ago

aljazeera-crawler

aljazeera-crawler is a command line application that helps crawl the https://www.aljazeera.net/ website.

Installation

Either install the tool globally so that it is available on your system path:

npm install -g aljazeera-crawler

Or run it directly with npx:

npx aljazeera-crawler [options]

Usage

For CLI options, use the -h (or --help) argument:

aljazeera-crawler -h

Al Jazeera Crawler
Usage: aljazeera-crawler options

Options:
  --version        Show version number                        boolean
  -t, --threshold  the minimum number of words to be crawled  number
  -d, --domain     the domain to crawl                        string [choices: "politics", "economy", "culture", "sport", "art", "technology", "heritage"]
  -h, --help       Show help                                  boolean
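When the tool is driven from a Node.js script, the options listed in the help text can be checked before the command is launched. The following is a minimal sketch, not part of aljazeera-crawler itself: `buildArgs` and `crawl` are hypothetical helper names, and the requirement that the threshold be a positive integer is an assumption (the help text only says it is a number).

```javascript
// Hypothetical wrapper around the CLI; not part of aljazeera-crawler itself.
const { spawnSync } = require("child_process");

// Domain choices copied from the --help output.
const CHOICES = ["politics", "economy", "culture", "sport", "art", "technology", "heritage"];

// Build the argument list for npx, rejecting values the CLI would not accept.
function buildArgs(domain, threshold) {
  if (!CHOICES.includes(domain)) {
    throw new Error(`Unknown domain: ${domain}`);
  }
  // Assumption: a useful threshold is a positive whole number of words.
  if (!Number.isInteger(threshold) || threshold <= 0) {
    throw new Error(`Threshold must be a positive integer, got: ${threshold}`);
  }
  return ["aljazeera-crawler", "--threshold", String(threshold), "--domain", domain];
}

// Run the crawler through npx and show its output in the current terminal.
// On Windows the executable may need to be "npx.cmd" instead of "npx".
function crawl(domain, threshold) {
  return spawnSync("npx", buildArgs(domain, threshold), { stdio: "inherit" });
}
```

Calling `crawl("technology", 100000)` would then run the equivalent of `npx aljazeera-crawler --threshold 100000 --domain technology`.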

Suppose we want to crawl at least 100,000 words in the technology domain.

We will use either:

aljazeera-crawler -t 100000 -d technology

Or:

aljazeera-crawler --threshold 100000 --domain technology

When the crawl finishes, a file named output-technology-100000.txt is created.
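To check the result from a script, the file name can be derived from the arguments and the words in the file counted. This is only a sketch under stated assumptions: the pattern `output-<domain>-<threshold>.txt` is inferred from the single example above, the helper names are hypothetical, and counting whitespace-separated tokens may not match how the crawler itself counts words.

```javascript
// Hypothetical helpers; not part of aljazeera-crawler itself.
const fs = require("fs");

// Assumed naming pattern, inferred from output-technology-100000.txt.
function expectedOutputName(domain, threshold) {
  return `output-${domain}-${threshold}.txt`;
}

// Count whitespace-separated tokens; the crawler's own word count may differ.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Read the output file and report whether it holds at least `threshold` words.
function meetsThreshold(filePath, threshold) {
  const text = fs.readFileSync(filePath, "utf8");
  return countWords(text) >= threshold;
}
```

For the run above, `meetsThreshold(expectedOutputName("technology", 100000), 100000)` would read the generated file from the current directory, assuming that is where the tool writes it.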

Domains

The domains that can currently be crawled are:

| Category | Link |
| --- | --- |
| politics سياسة | https://www.aljazeera.net/news/politics/ |
| economy اقتصاد | https://www.aljazeera.net/news/ebusiness/ |
| culture ثقافة | https://www.aljazeera.net/news/cultureandart/ |
| sport رياضة | https://www.aljazeera.net/sport/ |
| art فن | https://www.aljazeera.net/news/arts/ |
| technology تكنولوجيا | https://www.aljazeera.net/news/scienceandtechnology/ |
| heritage تراث | https://www.aljazeera.net/turath/ |
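The same mapping can be kept as a lookup table in a script, for example to log which section a run will read. The sketch below copies the links listed above; `DOMAIN_LINKS` and `linkForDomain` are hypothetical names, and nothing here is claimed to mirror how the tool stores these links internally.

```javascript
// Hypothetical lookup table; the links are copied from the list of domains above.
const DOMAIN_LINKS = {
  politics: "https://www.aljazeera.net/news/politics/",
  economy: "https://www.aljazeera.net/news/ebusiness/",
  culture: "https://www.aljazeera.net/news/cultureandart/",
  sport: "https://www.aljazeera.net/sport/",
  art: "https://www.aljazeera.net/news/arts/",
  technology: "https://www.aljazeera.net/news/scienceandtechnology/",
  heritage: "https://www.aljazeera.net/turath/",
};

// Return the section link for a domain, or throw if the domain is not listed.
function linkForDomain(domain) {
  if (!Object.prototype.hasOwnProperty.call(DOMAIN_LINKS, domain)) {
    throw new Error(`Unknown domain: ${domain}`);
  }
  return DOMAIN_LINKS[domain];
}
```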

Licence

MIT
