0.1.7 • Published 6 years ago

License
MIT
tokenize-file

Read a file, tokenize it, and spit out a handy JSON.

Installation

npm i tokenize-file -S

Example

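var tokenizeFile = require("tokenize-file");

// Log every token that is neither a stop word nor a singular noun.
// "NN" is the Penn Treebank tag for a singular or mass noun.
tokenizeFile("path/to/file.txt", tokens => {
  console.log(tokens.filter(d => !d.stop_word && d.pos !== "NN"));
});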
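
For code that prefers promises over callbacks, the call can be wrapped. This is a minimal sketch, assuming the callback receives only the token array, as in the example above. `tokenizeFileAsync` is a hypothetical helper and not part of tokenize-file. It takes the tokenizer function as an argument so it can be tried without any files on disk.

```javascript
// Hypothetical helper: wraps a callback-style tokenizer in a Promise.
// Assumes the callback is invoked with a single argument, the token array.
function tokenizeFileAsync(tokenizeFile, path) {
  return new Promise(resolve => {
    tokenizeFile(path, tokens => resolve(tokens));
  });
}

// Stand-in tokenizer with made-up data, used here only for illustration.
function fakeTokenizeFile(path, callback) {
  callback([{ value: "door", count: 1, pos: "NN", stop_word: false }]);
}

tokenizeFileAsync(fakeTokenizeFile, "path/to/file.txt").then(tokens => {
  console.log(tokens.length);
});
```

In real use, the first argument would be the function returned by `require("tokenize-file")`.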

API

# tokenizeFile(path/to/file_name, callback)

Reads a file, tokenizes it, and passes the result to the callback function as an array of objects. Each token in the array is an object of the form:

{
  value: "String", // the token
  count: Number, // the number of times it appears in the file
  pos: "String", // the token's Penn Treebank POS tag
  stop_word: Boolean // whether the token value is a stop word, which can be filtered out in some analyses
}
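
As an illustration of working with this shape, the sketch below ranks the non-stop-word tokens by `count`. The sample array is made-up data in the format described above, and `topTokens` is a hypothetical helper, not part of tokenize-file.

```javascript
// Made-up sample in the documented token shape; in real use this array
// would be the argument passed to the tokenizeFile callback.
const sampleTokens = [
  { value: "the", count: 12, pos: "DT", stop_word: true },
  { value: "door", count: 4, pos: "NN", stop_word: false },
  { value: "open", count: 2, pos: "VB", stop_word: false },
  { value: "Vikings", count: 7, pos: "NNPS", stop_word: false }
];

// Hypothetical helper: returns the values of the `limit` most frequent
// tokens that are not stop words, most frequent first.
function topTokens(tokens, limit) {
  return tokens
    .filter(t => !t.stop_word)          // drop stop words (returns a new array)
    .sort((a, b) => b.count - a.count)  // sort by descending count
    .slice(0, limit)
    .map(t => t.value);
}

console.log(topTokens(sampleTokens, 2)); // [ 'Vikings', 'door' ]
```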

tokenizeFile can read any type of file supported by textract:

  • HTML, HTM
  • ATOM, RSS
  • Markdown
  • XML, XSL
  • PDF
  • DOC, DOCX
  • ODT, OTT (experimental, feedback needed!)
  • RTF
  • XLS, XLSX, XLSB, XLSM, XLTX
  • CSV
  • ODS, OTS
  • PPTX, POTX
  • ODP, OTP
  • ODG, OTG
  • PNG, JPG, GIF
  • DXF
  • application/javascript
  • All text/* mime-types.

The POS tags are:

| POS Tag | Description | Example |
| --- | --- | --- |
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, third |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d'oeuvre |
| IN | preposition/subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun, plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend's |
| PRP | personal pronoun | I, he, it |
| PRP$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, singular present, non-3rd person | take |
| VBZ | verb, 3rd person singular present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
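
Because related tags share a prefix (all noun tags start with `NN`, all verb tags with `VB`, and so on), tokens can be grouped into coarser classes with a prefix check. The sketch below is one possible way to do that. `coarseClass` and `countByClass` are hypothetical helpers, not part of tokenize-file, and the sample tokens are made up.

```javascript
// Hypothetical helper: maps a Penn Treebank tag to a coarse word class
// by its prefix. Tags outside these four families fall into "other".
function coarseClass(pos) {
  if (pos.startsWith("NN")) return "noun";      // NN, NNS, NNP, NNPS
  if (pos.startsWith("VB")) return "verb";      // VB, VBD, VBG, VBN, VBP, VBZ
  if (pos.startsWith("JJ")) return "adjective"; // JJ, JJR, JJS
  if (pos.startsWith("RB")) return "adverb";    // RB, RBR, RBS (not WRB)
  return "other";
}

// Hypothetical helper: sums token counts per coarse word class.
function countByClass(tokens) {
  const totals = {};
  for (const t of tokens) {
    const key = coarseClass(t.pos);
    totals[key] = (totals[key] || 0) + t.count;
  }
  return totals;
}

// Made-up tokens in the shape described in the API section.
const taggedTokens = [
  { value: "door", count: 4, pos: "NN", stop_word: false },
  { value: "doors", count: 1, pos: "NNS", stop_word: false },
  { value: "took", count: 3, pos: "VBD", stop_word: false },
  { value: "where", count: 2, pos: "WRB", stop_word: true }
];

console.log(countByClass(taggedTokens)); // { noun: 5, verb: 3, other: 2 }
```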