synonym-optimizer

Gives a score to a string depending on the variety of the synonyms used.

For instance, compare "The coffee is good. I love that coffee." with "The coffee is good. I love that beverage." The second alternative is better because it uses a synonym for "coffee", so this module gives it a better score.

The lower the score, the better.

Fully supported languages are French, German, English, Italian and Spanish.

What it does / How it works:

- single words are extracted with a tokenizer, wink-tokenizer
- words are lowercased
- stopwords are removed
  - for fully supported languages, a default stopword list is included, which you can customize
  - for all other languages, no default list is included, but you can provide a custom stopword list
- for fully supported languages, words are stemmed using snowball-stemmer (for all other languages: no stemming)
- when the same word appears multiple times, it raises the score depending on the distance between the two occurrences (the closer the occurrences, the more the score rises)
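
The steps above can be illustrated with a minimal sketch. This is not the module's implementation: it assumes a plain regular-expression tokenizer instead of wink-tokenizer, a tiny hard-coded stopword list, no stemming, and a hypothetical penalty of 1 / distance for each repeated word. The names `extractWords` and `repetitionScore` are invented for this illustration, and the numbers it produces will not necessarily match the module's scores.

```javascript
// Hypothetical sketch, NOT the module's actual algorithm.
// Assumptions: regex tokenizer, tiny stopword list, no stemming,
// and a penalty of 1 / distance for each repeated word.
const STOPWORDS = new Set(['the', 'is', 'i', 'that', 'a']);

function extractWords(text) {
  // Extract runs of letters, lowercase them, then drop stopwords.
  const tokens = text.toLowerCase().match(/\p{L}+/gu) || [];
  return tokens.filter((word) => !STOPWORDS.has(word));
}

function repetitionScore(text) {
  const words = extractWords(text);
  const lastSeen = new Map(); // word -> index of its previous occurrence
  let score = 0;
  words.forEach((word, index) => {
    if (lastSeen.has(word)) {
      // The closer the two occurrences, the larger the penalty.
      score += 1 / (index - lastSeen.get(word));
    }
    lastSeen.set(word, index);
  });
  return score;
}

console.log(repetitionScore('The coffee is good. I love that coffee.'));
console.log(repetitionScore('The coffee is good. I love that beverage.'));
```

In this sketch the first sentence gets a positive score because "coffee" is repeated, while the second gets 0 because no non-stopword is repeated.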

Designed primarily to test the output of an NLG (Natural Language Generation) system.

The stemmer is not perfect. For instance, in Italian, cameriere and cameriera have the same stem (camerier), while camerieri and cameriera have different stems (camer and camerier, respectively).

Installation

npm install synonym-optimizer

Usage

const synOptimizer = require('synonym-optimizer');

// Two alternatives to compare: the second uses a synonym for "coffee".
const alts = [
  'The coffee is good. I love that coffee.',
  'The coffee is good. I love that beverage.',
];

/*
Expected output (lower is better):
The coffee is good. I love that coffee.: 0.5
The coffee is good. I love that beverage.: 0
*/
alts.forEach((alt) => {
  const score = synOptimizer.scoreAlternative('en_US', alt, null, null, null, null);
  console.log(`${alt}: ${score}`);
});

The main function is scoreAlternative. It takes a string and returns its score. Arguments are:

- lang (string, mandatory): the language.
  - fully supported languages are fr_FR, en_US, de_DE, it_IT and es_ES
  - with any other language (for instance Dutch, nl_NL), stemming is disabled and stopwords are not removed unless you provide your own list
- alternative (string, mandatory): the string to score
- stopWordsToAdd (string[], optional): list of stopwords to add to the standard stopword list
- stopWordsToRemove (string[], optional): list of stopwords to remove from the standard stopword list
- stopWordsOverride (string[], optional): replaces the standard stopword list
- identicals (string[][], optional): list of groups of words that should be considered identical, for instance [ ['phone', 'cellphone', 'smartphone'] ].
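
To make the shape of the identicals argument concrete, here is a minimal sketch of one way such groups could be applied: every word in a group is mapped to the group's first word before repetitions are counted. The helpers `buildCanonicalMap` and `normalizeWords` are hypothetical and are not part of the module's API; how the module handles identicals internally is not shown here.

```javascript
// Hypothetical helpers, not part of synonym-optimizer's API.
// Assumption: each group is collapsed onto its first word.
function buildCanonicalMap(identicals) {
  const canonical = new Map();
  for (const group of identicals) {
    const representative = group[0];
    for (const word of group) {
      canonical.set(word, representative);
    }
  }
  return canonical;
}

function normalizeWords(words, identicals) {
  const canonical = buildCanonicalMap(identicals);
  // Words outside every group are left unchanged.
  return words.map((word) => (canonical.has(word) ? canonical.get(word) : word));
}

const identicals = [['phone', 'cellphone', 'smartphone']];
console.log(normalizeWords(['my', 'smartphone', 'and', 'your', 'cellphone'], identicals));
```

After normalization, "smartphone" and "cellphone" both become "phone", so a repetition check would treat them as the same word. With the module itself, the groups are passed as the sixth argument, for example scoreAlternative('en_US', text, null, null, null, identicals).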

You can also use the getBest function. Its arguments are the same, except that alternative is replaced by alternatives (string[]). The returned number is not a score but the index of the best alternative.
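
The idea of returning an index rather than a score can be sketched as follows. This is an illustration only: `indexOfBest` and the stand-in scorer `countRepeats` are hypothetical names, and the tie-breaking rule (first alternative wins) is an assumption, not documented behavior of getBest.

```javascript
// Hypothetical helper: returns the index of the lowest-scoring alternative.
// Assumption: on a tie, the earliest alternative wins; an empty list yields -1.
function indexOfBest(alternatives, scoreFn) {
  let bestIndex = -1;
  let bestScore = Infinity;
  alternatives.forEach((alt, index) => {
    const score = scoreFn(alt);
    if (score < bestScore) {
      bestScore = score;
      bestIndex = index;
    }
  });
  return bestIndex;
}

// Stand-in scorer for the demo: number of repeated words in the text.
function countRepeats(text) {
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  return words.length - new Set(words).size;
}

const candidates = [
  'The coffee is good. I love that coffee.',
  'The coffee is good. I love that beverage.',
];
console.log(indexOfBest(candidates, countRepeats));
```

Here the second alternative (index 1) is selected because it has no repeated word. In real use, the scorer could wrap the module's own function, for example (alt) => synOptimizer.scoreAlternative('en_US', alt, null, null, null, null), or getBest can be called directly.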

The tokenizer is wink-tokenizer. It works with many languages (English, French, German, Hindi, Sanskrit, Marathi, etc.) but not with Asian languages such as Japanese or Chinese, so the module will not work properly with them.