spacy v0.0.4 · MIT License · Repository: GitHub · Last release: 6 years ago

spaCy JS


JavaScript interface for accessing linguistic annotations provided by spaCy. This project is mostly experimental and was developed for fun to play around with different ways of mimicking spaCy's Python API.

The results will still be computed in Python and made available via a REST API. The JavaScript API resembles spaCy's Python API as closely as possible (with a few exceptions, as the values are all pre-computed and it's tricky to express complex recursive relationships).

const spacy = require('spacy');

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text about Facebook.');
    for (let ent of doc.ents) {
        console.log(ent.text, ent.label);
    }
    for (let token of doc) {
        console.log(token.text, token.pos, token.head.text);
    }
})();

⌛️ Installation

Installing the JavaScript library

You can install the JavaScript package via npm:

npm install spacy

Setting up the Python server

First, clone this repo and install the requirements. If you've installed the package via npm, you can also use the api/server.py and requirements.txt in your ./node_modules/spacy directory. It's recommended to use a virtual environment.

pip install -r requirements.txt

You can then run the REST API. By default, this will serve the API via 0.0.0.0:8080:

python api/server.py

If you like, you can install more models and specify a comma-separated list of models to load as the first argument when you run the server. All models need to be installed in the same environment.

python api/server.py en_core_web_sm,de_core_news_sm
| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| `models` | positional (str) | Comma-separated list of models to load and make available. | `en_core_web_sm` |
| `--host`, `-ho` | option (str) | Host to serve the API. | `0.0.0.0` |
| `--port`, `-p` | option (int) | Port to serve the API. | `8080` |

🎛 API

spacy.load

"Load" a spaCy model. This method mostly exists for consistency with the Python API. It sets up the REST API and nlp object, but doesn't actually load anything, since the models are already available via the REST API.

const nlp = spacy.load('en_core_web_sm');
| Argument | Type | Description |
| --- | --- | --- |
| `model` | String | Name of model to load, e.g. `'en_core_web_sm'`. Needs to be available via the REST API. |
| `api` | String | Alternative URL of REST API. Defaults to `http://0.0.0.0:8080`. |
| **RETURNS** | Language | The `nlp` object. |

nlp async

The nlp object created by spacy.load can be called on a string of text and makes a request to the REST API. The easiest way to use it is to wrap the call in an async function and use await:

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('This is a text.');
})();
| Argument | Type | Description |
| --- | --- | --- |
| `text` | String | The text to process. |
| **RETURNS** | Doc | The processed Doc. |

Doc

Just like in the original API, the Doc object can be constructed with an array of words and spaces. It also takes an additional attrs object, which corresponds to the JSON-serialized linguistic annotations created in doc2json in api/server.py.

The Doc behaves just like the regular spaCy Doc – you can iterate over its tokens, index into individual tokens, access the Doc attributes and properties and also use native JavaScript methods like map and slice (since there's no real way to make Python's slice notation like doc[2:4] work).
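Since native array methods are available, the Python slice pattern translates directly. A minimal sketch of the idea, using a plain array of token texts to stand in for a processed Doc (the real Doc requires the REST API):

```javascript
// A plain array stands in for a Doc's tokens in this sketch
const tokens = ['This', 'is', 'a', 'text'];

// Equivalent of Python's doc[2:4] slice notation
const slice = tokens.slice(2, 4);
console.log(slice); // [ 'a', 'text' ]

// map works the same way it would on a Doc
const upper = tokens.map(t => t.toUpperCase());
console.log(upper); // [ 'THIS', 'IS', 'A', 'TEXT' ]
```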

Construction

import { Doc } from 'spacy';

const words = ['Hello', 'world', '!'];
const spaces = [true, false, false];
const doc = new Doc(words, spaces);
console.log(doc.text); // 'Hello world!'
| Argument | Type | Description |
| --- | --- | --- |
| `words` | Array | The individual token texts. |
| `spaces` | Array | Whether the token at this position is followed by a space or not. |
| `attrs` | Object | JSON-serialized attributes, see `doc2json`. |
| **RETURNS** | Doc | The newly constructed Doc. |
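The words/spaces convention is the same as in Python's spaCy: the text can be reconstructed by appending a space after each word whose flag is true. A minimal sketch of that logic (illustrative, not the library's internal code):

```javascript
const words = ['Hello', 'world', '!'];
const spaces = [true, false, false];

// Append a space after each word whose flag is true, then join
const text = words.map((w, i) => w + (spaces[i] ? ' ' : '')).join('');
console.log(text); // 'Hello world!'
```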

Symbol.iterator and token indexing

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Hello world');

    for (let token of doc) {
        console.log(token.text);
    }
    // Hello
    // world

    const token1 = doc[0];
    console.log(token1.text);
    // Hello
})();
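Supporting both `for...of` iteration and numeric indexing on a custom object comes down to `Symbol.iterator` plus index properties. A simplified sketch of the pattern (a hypothetical `TinyDoc`, not the library's actual implementation):

```javascript
// TinyDoc is a hypothetical stand-in for the real Doc class
class TinyDoc {
    constructor(tokens) {
        this.tokens = tokens;
        // Expose tokens via numeric indices, so doc[0] works
        tokens.forEach((token, i) => { this[i] = token; });
    }
    // Makes the object work with for...of and Array.from
    [Symbol.iterator]() {
        return this.tokens[Symbol.iterator]();
    }
}

const doc = new TinyDoc(['Hello', 'world']);
for (const token of doc) {
    console.log(token); // 'Hello', then 'world'
}
console.log(doc[0]); // 'Hello'
```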

Properties and Attributes

| Name | Type | Description |
| --- | --- | --- |
| `text` | String | The Doc text. |
| `length` | Number | The number of tokens in the Doc. |
| `ents` | Array | A list of Span objects, describing the named entities in the Doc. |
| `sents` | Array | A list of Span objects, describing the sentences in the Doc. |
| `nounChunks` | Array | A list of Span objects, describing the base noun phrases in the Doc. |
| `cats` | Object | The document categories predicted by the text classifier, if available in the model. |
| `isTagged` | Boolean | Whether the part-of-speech tagger has been applied to the Doc. |
| `isParsed` | Boolean | Whether the dependency parser has been applied to the Doc. |
| `isSentenced` | Boolean | Whether the sentence boundary detector has been applied to the Doc. |

Span

A Span object is a slice of a Doc and consists of one or more tokens. Just like in the original API, it can be constructed from a Doc, a start and end index and an optional label, or by slicing a Doc.

Construction

import { Doc, Span } from 'spacy';

const doc = new Doc(['Hello', 'world', '!'], [true, false, false]);
const span = new Span(doc, 1, 3);
console.log(span.text); // 'world!'
| Argument | Type | Description |
| --- | --- | --- |
| `doc` | Doc | The reference document. |
| `start` | Number | The start token index. |
| `end` | Number | The end token index. This is exclusive, i.e. "up to token X". |
| `label` | String | Optional label. |
| **RETURNS** | Span | The newly constructed Span. |
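The exclusive end index works exactly like JavaScript's own `Array.prototype.slice`, so `Span(doc, 1, 3)` covers tokens 1 and 2. Sketched here with a plain array of token texts standing in for the Doc:

```javascript
const words = ['Hello', 'world', '!'];

// start = 1, end = 3 (exclusive), like Span(doc, 1, 3)
const spanWords = words.slice(1, 3);
console.log(spanWords); // [ 'world', '!' ]
```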

Properties and Attributes

| Name | Type | Description |
| --- | --- | --- |
| `text` | String | The Span text. |
| `length` | Number | The number of tokens in the Span. |
| `doc` | Doc | The parent Doc. |
| `start` | Number | The Span's start index in the parent document. |
| `end` | Number | The Span's end index in the parent document. |
| `label` | String | The Span's label, if available. |

Token

For token attributes that exist as string and ID versions (e.g. Token.pos vs. Token.pos_), only the string versions are exposed.

Usage Examples

(async function() {
    const nlp = spacy.load('en_core_web_sm');
    const doc = await nlp('Hello world');

    for (let token of doc) {
        console.log(token.text, token.pos, token.isLower);
    }
    // Hello INTJ false
    // world NOUN true
})();

Properties and Attributes

| Name | Type | Description |
| --- | --- | --- |
| `text` | String | The token text. |
| `whitespace` | String | Whitespace character following the token, if available. |
| `textWithWs` | String | Token text with trailing whitespace. |
| `orth` | Number | ID of the token text. |
| `doc` | Doc | The parent Doc. |
| `head` | Token | The syntactic parent, or "governor", of this token. |
| `i` | Number | Index of the token in the parent document. |
| `entType` | String | The token's named entity type. |
| `entIob` | String | IOB code of the token's named entity tag. |
| `lemma` | String | The token's lemma, i.e. the base form. |
| `norm` | String | The normalised form of the token. |
| `lower` | String | The lowercase form of the token. |
| `shape` | String | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| `prefix` | String | A length-N substring from the start of the token. Defaults to N=1. |
| `suffix` | String | A length-N substring from the end of the token. Defaults to N=3. |
| `pos` | String | The token's coarse-grained part-of-speech tag. |
| `tag` | String | The token's fine-grained part-of-speech tag. |
| `isAlpha` | Boolean | Does the token consist of alphabetic characters? |
| `isAscii` | Boolean | Does the token consist of ASCII characters? |
| `isDigit` | Boolean | Does the token consist of digits? |
| `isLower` | Boolean | Is the token lowercase? |
| `isUpper` | Boolean | Is the token uppercase? |
| `isTitle` | Boolean | Is the token titlecase? |
| `isPunct` | Boolean | Is the token punctuation? |
| `isLeftPunct` | Boolean | Is the token left punctuation? |
| `isRightPunct` | Boolean | Is the token right punctuation? |
| `isSpace` | Boolean | Is the token a whitespace character? |
| `isBracket` | Boolean | Is the token a bracket? |
| `isCurrency` | Boolean | Is the token a currency symbol? |
| `likeUrl` | Boolean | Does the token resemble a URL? |
| `likeNum` | Boolean | Does the token resemble a number? |
| `likeEmail` | Boolean | Does the token resemble an email address? |
| `isOov` | Boolean | Is the token out-of-vocabulary? |
| `isStop` | Boolean | Is the token a stop word? |
| `isSentStart` | Boolean | Does the token start a sentence? |
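The `entIob` codes follow the standard IOB scheme: `B` marks the beginning of an entity, `I` its continuation, and `O` a token outside any entity. A sketch of how per-token codes can be grouped back into entity spans (an illustrative helper, not part of the library):

```javascript
// Group tokens annotated with IOB codes into labelled entity spans
function iobToEnts(tokens) {
    const ents = [];
    for (const tok of tokens) {
        if (tok.entIob === 'B') {
            // Start a new entity
            ents.push({ label: tok.entType, texts: [tok.text] });
        } else if (tok.entIob === 'I' && ents.length) {
            // Continue the current entity
            ents[ents.length - 1].texts.push(tok.text);
        }
        // 'O' tokens are outside any entity and are skipped
    }
    return ents.map(e => ({ label: e.label, text: e.texts.join(' ') }));
}

const exampleTokens = [
    { text: 'San', entIob: 'B', entType: 'GPE' },
    { text: 'Francisco', entIob: 'I', entType: 'GPE' },
    { text: 'rocks', entIob: 'O', entType: '' },
];
console.log(iobToEnts(exampleTokens));
// [ { label: 'GPE', text: 'San Francisco' } ]
```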

🔔 Run Tests

Python

First, make sure you have pytest and all dependencies installed. You can then run the tests by pointing pytest to /tests:

python -m pytest tests

JavaScript

This project uses Jest for testing. Make sure you have all dependencies and development dependencies installed. You can then run:

npm run test

To allow testing the code without a REST API providing the data, the test suite currently uses a mock of the Language class, which returns static data located in tests/util.js.
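The idea behind the mock is simple: replace the network-backed `nlp` function with one that resolves to static data. A minimal sketch of the pattern (names are illustrative; the real mock and its static data live in tests/util.js):

```javascript
// A hypothetical mock factory: returns an "nlp" function that resolves
// to the same static doc for any input text, with no network call
function mockNlp(staticDoc) {
    return async text => staticDoc;
}

const nlp = mockNlp({ text: 'Hello world', ents: [] });

(async () => {
    const doc = await nlp('any input at all');
    console.log(doc.text); // 'Hello world'
})();
```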

✅ Ideas and Todos

  • Add Travis CI integration.
  • Improve JavaScript tests.
  • Experiment with Node.js bindings to make the Python integration easier. To be fair, running a separate API in an environment the user controls, rather than hiding it a few levels deep, is often much easier. But maybe there are some modern Node tricks this project could benefit from.