text-corpus v0.0.2
text-corpus
A set of classes for representing the elements of a text corpus. Currently, these are mainly used by cetem-publico, tnt-tagger and other modules, but hopefully they will be generic enough to be useful in other contexts as well.
Installation
$ npm install text-corpus
Classes
Token
Used to represent the tokens (words) in the corpus.
new Token(word, info)
- word is the word in the original corpus text
- info (all these are optional)
  - tokenId: an ID for this token
  - lemma: the lemmatized version of word
  - pos: the part-of-speech (POS) tag for word
  - other*: more information about the token
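To make the constructor signature concrete, here is a minimal sketch of creating tokens with and without the optional info fields. The Token class below is a hypothetical stand-in that only mirrors the documented signature. It is not the package's implementation, and the property names (word, info) and the sample tag values are assumptions.

```javascript
// Hypothetical stand-in mirroring the documented signature: new Token(word, info).
// NOT the text-corpus implementation; the stored property names are assumptions.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    // Shallow-copy the optional fields (tokenId, lemma, pos, anything else).
    this.info = { ...info };
  }
}

// A token with no extra information: info is optional.
const bare = new Token('cats');

// A token carrying an ID, a lemma and a POS tag (illustrative values).
const annotated = new Token('cats', { tokenId: 't1', lemma: 'cat', pos: 'NNS' });

console.log(bare.word, bare.info);
console.log(annotated.word, annotated.info.lemma, annotated.info.pos);
```

In real code the class would presumably come from the installed package rather than being defined inline. How the package exports it is not stated here.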
MultiWordExpression
This class provides a way to group some tokens into multi-word expressions.
MWEs can have attributes indicating the lemma and the POS tag for the whole expression.
new MultiWordExpression({lemma, pos}, tokens)
- lemma: the lemma for the multi-word expression
- pos: the POS tag for the multi-word expression
- tokens: an array of Token objects which make this MWE
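As an illustration, the sketch below groups two tokens into a single multi-word expression with its own lemma and POS tag. Both classes are hypothetical stand-ins that follow the documented constructor signatures. The property names (lemma, pos, tokens) and the example tag are assumptions, not the package's actual internals.

```javascript
// Hypothetical stand-ins mirroring the documented signatures.
// NOT the text-corpus implementation; property names are assumptions.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = { ...info };
  }
}

class MultiWordExpression {
  constructor({ lemma, pos } = {}, tokens = []) {
    this.lemma = lemma;   // lemma for the whole expression
    this.pos = pos;       // POS tag for the whole expression
    this.tokens = tokens; // the Token objects that make up the MWE
  }
}

// "New York" as one expression built from two tokens (illustrative values).
const mwe = new MultiWordExpression(
  { lemma: 'New York', pos: 'NNP' },
  [new Token('New'), new Token('York')]
);

// Rebuild the surface form from the member tokens.
const surface = mwe.tokens.map((t) => t.word).join(' ');
console.log(surface, mwe.pos);
```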
Sentence
Sentences contain a list of tokens (the words in that sentence).
Because some words can form multi-word expressions, inside a
Sentence we can find both Tokens and MultiWordExpressions
(which, in turn, have Token objects inside).
new Sentence(id, tokens)
- id: an id for the sentence
- tokens: an array of tokens and MWEs which form this sentence
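Because a sentence can mix Tokens and MultiWordExpressions, code that wants the plain word sequence has to look inside the MWEs. The sketch below shows one way this might be done. All three classes are hypothetical stand-ins for the documented constructors, and wordsOf is an illustrative helper that is not part of the package.

```javascript
// Hypothetical stand-ins mirroring the documented signatures.
// NOT the text-corpus implementation; property names are assumptions.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = { ...info };
  }
}

class MultiWordExpression {
  constructor({ lemma, pos } = {}, tokens = []) {
    this.lemma = lemma;
    this.pos = pos;
    this.tokens = tokens;
  }
}

class Sentence {
  constructor(id, tokens = []) {
    this.id = id;
    this.tokens = tokens; // a mix of Token and MultiWordExpression objects
  }
}

// Illustrative helper (not a package function): flatten a sentence to its words,
// expanding each MWE into the words of its member tokens.
function wordsOf(sentence) {
  return sentence.tokens.flatMap((item) =>
    item instanceof MultiWordExpression
      ? item.tokens.map((t) => t.word)
      : [item.word]
  );
}

const sentence = new Sentence('s1', [
  new Token('She'),
  new Token('visited'),
  new MultiWordExpression(
    { lemma: 'New York', pos: 'NNP' },
    [new Token('New'), new Token('York')]
  ),
]);

console.log(wordsOf(sentence));
```

The sentence holds three top-level items, while the flattened view yields four words.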
Paragraph
Paragraphs are composed of a sequence of sentences.
new Paragraph(id, sentences)
- id: an id for the paragraph
- sentences: an array of sentences which form this paragraph
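A short sketch of how the pieces might nest: a paragraph holds sentences, and each sentence holds tokens. The classes are hypothetical stand-ins that follow the documented constructor signatures. The property names (sentences, tokens) and the ids are assumptions made for illustration.

```javascript
// Hypothetical stand-ins mirroring the documented signatures.
// NOT the text-corpus implementation; property names are assumptions.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = { ...info };
  }
}

class Sentence {
  constructor(id, tokens = []) {
    this.id = id;
    this.tokens = tokens;
  }
}

class Paragraph {
  constructor(id, sentences = []) {
    this.id = id;
    this.sentences = sentences; // Sentence objects, in document order
  }
}

const paragraph = new Paragraph('p1', [
  new Sentence('s1', [new Token('Hello'), new Token('world')]),
  new Sentence('s2', [new Token('Goodbye')]),
]);

// Walk the hierarchy: count the top-level items across all sentences.
const tokenCount = paragraph.sentences.reduce(
  (total, s) => total + s.tokens.length,
  0
);

console.log(paragraph.sentences.map((s) => s.id), tokenCount);
```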
Bugs and stuff
Open a GitHub issue or, preferably, send me a pull request.
License
MIT