2.1.7 • Published 2 years ago

generate-corpus v2.1.7

Weekly downloads: 8
License: Apache
Repository: github
Last release: 2 years ago

This module builds a corpus from a Google search or from a set of URLs. It can also run basic semantic analysis on the corpus.

This module is still a work in progress. You are welcome to contribute or suggest new ideas!

Install

npm install generate-corpus --save

Build a corpus from a Google search

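const generator = require("generate-corpus");

const options = {
    host : "google.be",
    num : 100,
    qs: {
        q: "barbecue",
        pws : 0
    }
};

// In a CommonJS script, await is only valid inside an async function
(async () => {
  try {
    // Named `result` so that it does not shadow the imported module
    const result = await generator.generateCorpus(options);
    console.log(result); // Data structure describing the corpus
  } catch (error) {
    console.log(error);
  }
})();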

Build a corpus from a set of URLs

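const search = require("generate-corpus");

const options = {
    // Replace with your own list of URLs
    urls : ["http://www.site.com", "http://www.site2.com" /* , ... */]
};

// In a CommonJS script, await is only valid inside an async function
(async () => {
  try {
    // Call the imported module (`search`) and store the corpus in `result`
    const result = await search.generateCorpus(options);
    console.log(result);
  } catch (error) {
    console.log(error);
  }
})();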
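
Before passing a list of URLs to the module, it can help to drop malformed entries and duplicates. The sketch below is only an illustration: cleanUrls is a hypothetical helper that is not part of generate-corpus, and it assumes that only http and https URLs are wanted.

```javascript
// Hypothetical helper (not part of generate-corpus): keeps only well-formed
// http/https URLs and removes duplicates after normalisation.
function cleanUrls(candidates) {
  const seen = new Set();
  for (const candidate of candidates) {
    let parsed;
    try {
      parsed = new URL(candidate); // throws a TypeError on malformed input
    } catch {
      continue; // skip entries that are not valid URLs
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      continue; // assumption: only web pages are wanted
    }
    seen.add(parsed.href); // href is the normalised form of the URL
  }
  return [...seen];
}

const urlOptions = {
  urls: cleanUrls([
    "http://www.site.com",
    "http://www.site.com/", // same page once normalised
    "not a url",
    "ftp://files.site.com"
  ])
};

console.log(urlOptions);
```

The resulting urlOptions object has the same shape as the options object in the example above.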

Understanding the options

In both previous examples, the options object can contain the following parameters :

For the Google search

  • host : the Google domain (google.com, google.fr, ...). Default value : google.com.
  • num : the size of the SERP (number of pages to search).
  • qs : used to customize the Google search. q is the search keyword (replace spaces with +); it can also be an array of keywords. qs can also contain other Google search parameters, see this document : https://moz.com/ugc/the-ultimate-guide-to-the-google-search-parameters.

  • User-Agent : optional. Default value : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
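
The parameters above can be assembled with a small helper. This is a minimal sketch under stated assumptions: buildSearchOptions is a hypothetical function that is not part of generate-corpus. It only applies the rules described above (default host google.com, spaces replaced by +, q given as a string or an array of keywords).

```javascript
// Hypothetical helper (not part of generate-corpus): builds an options object
// following the parameter descriptions above.
function buildSearchOptions(keywords, overrides = {}) {
  // Replace runs of whitespace with "+", as the q parameter description asks
  const toQuery = (keyword) => keyword.trim().split(/\s+/).join("+");

  // q may be a single keyword or an array of keywords
  const q = Array.isArray(keywords) ? keywords.map(toQuery) : toQuery(keywords);

  const { qs = {}, ...rest } = overrides;
  return {
    host: "google.com", // documented default, overridden by `rest` if provided
    ...rest,
    qs: { ...qs, q }
  };
}

const searchOptions = buildSearchOptions(
  ["choisir son champagne", "barbecue"],
  { host: "google.fr", num: 15, qs: { pws: 0 } }
);

console.log(searchOptions);
```

The returned object can then be passed to generateCorpus in the same way as the options in the first example.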

Other options

The proxy parameter can be replaced by proxyList if you are using a list of proxies (see below).

With proxies

To use a single proxy for all HTTP requests, set the proxy URL in the options :

const options = {
    host : "google.fr",
    num : 15,
    qs: {
        q: "choisir son champagne",
        pws : 0
    },
    language : 'fr',
    proxy : "http://user:password@host:port"
};
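
If the proxy credentials contain characters such as @ or :, the URL has to be built carefully. The sketch below uses a hypothetical helper, buildProxyUrl, that is not part of generate-corpus. It percent-encodes the user name and password, on the assumption that the underlying HTTP client decodes them again, which is not confirmed here.

```javascript
// Hypothetical helper (not part of generate-corpus): builds a proxy URL of the
// form http://user:password@host:port from its parts.
function buildProxyUrl({ host, port, user, password }) {
  // Percent-encode credentials so that "@" or ":" cannot break the URL.
  // Assumption: the HTTP client decodes them before authenticating.
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return `http://${credentials}@${host}:${port}`;
}

const proxyOptions = {
  host: "google.fr",
  num: 15,
  qs: { q: "choisir son champagne", pws: 0 },
  // Placeholder values, replace with your own proxy settings
  proxy: buildProxyUrl({ host: "10.0.0.1", port: 8080, user: "bob", password: "p@ss" })
};

console.log(proxyOptions.proxy);
```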

If you want to use several proxies, you can use the Node.js module simple-proxies (https://github.com/christophebe/simple-proxies). This component loads proxies from a text file or a database.

2.1.7

2 years ago

2.1.6

4 years ago

2.1.5

4 years ago

2.1.3

5 years ago

2.1.2

5 years ago

2.1.1

5 years ago

2.1.0

5 years ago

2.0.3

5 years ago

2.0.2

5 years ago

2.0.1

5 years ago

1.0.9

8 years ago

1.0.8

8 years ago

1.0.7

8 years ago

1.0.6

8 years ago

1.0.5

8 years ago

1.0.4

8 years ago

1.0.3

8 years ago

1.0.2

8 years ago

1.0.1

8 years ago

1.0.0

8 years ago

0.0.1

8 years ago