0.0.8 • Published 14 years ago
nseg v0.0.8
Node.js Version of MMSG for Chinese Word Segmentation
MMSG originally invented by Chih-Hao Tsai is a very popular Chinese word segmentation algorithm. Many implementations are available on different platforms including Python, Java, etc.
This package provide Node.js version of MMSG algorithm.
So far this package is still in developing, but the basic functionalities are ready.
Install
Use nseg in your own package
$ npm install nsegOr if you want to use nseg command
$ npm install -g nsegCommand line
After intsalling globally, you can use nseg command to:
help
$ nseg helpsegment text using default dictionary
$ nseg segf -i ~/project/text/shi.txt -o ~/project/output/shi.txt
$ nseg segd -i ~/project/text -o ~/project/outputbuild user dictionary for loading aftermath
$ nseg dict ~/project/data/dict.js ~/dict/dict1.txt ~/dict/dict2.txt
$ nseg dict ~/project/data/dict.js ~/dictbuild character-frequecy map for loading aftermath
$ nseg freq ~/project/data/freq.js ~/freq/data1.csv ~/freq/data2.csv
$ nseg freq ~/project/data/freq.js ~/freqsegment text using customized settings
$ nseg seg -d ~/project/data/dict.js -f ~/project/data/freq.js -i ~/project/text -o ~/project/output
$ nseg seg -l ~/project/lex/ -i ~/project/text -o ~/project/outputcheck the existence of a word
$ nseg check "石狮"
$ nseg check -d ~/project/data/dict.js "石狮"Using nseg in program
Preparation
- (Optional) build your own dictionay and freqency map
- (Optional) create your own lexical handler for special text pattern
Examples
var dict = require('../data/dict'),
freq = require('../data/freq'),
date = require('../lex/datetime'),
sina = require('../lex/sina');
var opts = {
dict: dict,
freq: freq,
lex: [date, sina],
};
var nseg = require('nseg').evented(opts);
var strmOut = fs.createWriteStream(target, {flags: 'w+', encoding: 'utf-8'}),
strmIn = fs.createReadStream(input);
var pipe = nseg(strmIn, strmOut);
pipe.on('error', function (err) {
console.log('error', err);
});
pipe.start();Lexical handler customization
Lexical handlers support definitions by regexp patterns or acceptor functions.
License
MIT License
