0.1.0 • Published 8 years ago

node-word-boundaries v0.1.0

Word Boundaries

The task is simple: take a String as input, and return an Array of every word boundary.

This implements the Unicode 8.0 Text Segmentation Algorithm (UAX #29). That makes it valid for English and other European languages, but it works poorly for Chinese, Japanese, and other languages that do not put any separator characters between words.

Usage

You may need to install a prerequisite: apt-get install libicu-dev or dnf install libicu-devel. (Node itself depends on ICU; you just need the development headers.)

(On Mac: try brew install icu4c && brew link icu4c --force)

Add it to your project: npm install --save node-word-boundaries

Then use it:

const word_boundaries = require('node-word-boundaries')
const text = 'See Jack run.'

// Find the index of every word boundary in the text
const boundaries = word_boundaries.find_word_boundaries(text)
console.log(boundaries) // [ 0, 3, 4, 8, 9, 12, 13 ]

// Split the text at every word boundary
const parts = word_boundaries.split(text)
console.log(parts) // [ 'See', ' ', 'Jack', ' ', 'run', '.' ]

Boundary indices are pretty standard in C-like languages. As a refresher: they point to the gaps between characters in a String, so index 0 sits before the first character and the last index sits after the final one. Visually:

 S e e   J a c k   r u n .
^ - - ^ ^ - - - ^ ^ - - ^ ^
0   2   4   6   8  10  12
  1   3   5   7   9  11  13
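
To make the relationship between boundary indices and split parts concrete, here is a minimal sketch in plain JavaScript. It does not call this library: sliceAtBoundaries is a hypothetical helper written for illustration, and the boundary list is copied by hand from the usage example above. It only shows that each part is the slice between two adjacent boundaries.

```javascript
// Hypothetical helper (not part of this library's API): cut `text` at each
// pair of adjacent boundary indices.
function sliceAtBoundaries(text, boundaries) {
  const parts = []
  for (let i = 0; i + 1 < boundaries.length; i++) {
    parts.push(text.slice(boundaries[i], boundaries[i + 1]))
  }
  return parts
}

const sample = 'See Jack run.'
// Boundary list copied from the usage example, not computed here.
const sampleBoundaries = [0, 3, 4, 8, 9, 12, 13]

console.log(sliceAtBoundaries(sample, sampleBoundaries))
// [ 'See', ' ', 'Jack', ' ', 'run', '.' ]
```

Seven boundaries produce six parts, because every part needs a boundary on each side.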

Constraints

The input must be valid Unicode. In particular, a string like \uDC00\uD800 is invalid (it's a low surrogate followed by a high surrogate) and will cause undefined behavior. (This constraint is true of most programs that deal with Strings.)
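
If you cannot trust your input, one option is to check for lone surrogates before passing a String to the library. The following is a sketch, not something this package provides: hasLoneSurrogate is a hypothetical helper name, and it assumes that rejecting the String is the right response for your application.

```javascript
// Hypothetical helper (not part of this library): returns true if `text`
// contains a surrogate code unit that is not part of a high+low pair.
function hasLoneSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i)
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = text.charCodeAt(i + 1) // NaN at the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return true
      i++ // skip the low half of the valid pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no high surrogate before it.
      return true
    }
  }
  return false
}

console.log(hasLoneSurrogate('See Jack run.')) // false
console.log(hasLoneSurrogate('\uD83D\uDE00')) // false (a valid pair)
console.log(hasLoneSurrogate('\uDC00\uD800')) // true (low, then high)
```

Recent JavaScript runtimes also offer String.prototype.isWellFormed(), which performs an equivalent check; the loop above is only needed where that method is unavailable.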

Competition

Developing

Download and npm install.

Run mocha -w in the background as you implement features. Write tests in test/.

TODO

Pull requests are welcome! In particular, this library could use:

LICENSE

AGPL-3.0. This project is (c) Overview Services Inc. and Adam Hooper. Please contact both should you desire a more permissive license.