0.0.2 • Published 6 years ago

penn-treebank-sample v0.0.2

License
CC-BY-NC-4.0

A small sample of the Penn Treebank part-of-speech-tagged English dataset, with tags from the nlp-compromise tagset.

This is simply a transformation of the fair-use subset of the Penn Treebank distributed with the NLTK library, with cosmetic formatting changes for use in JavaScript.

This data is for non-commercial fair use only, and all users are encouraged to purchase a license for the full dataset for any commercial project.

The data is (only) 4,000 tagged sentences, with compromise tag mappings and some opinionated lumping of punctuation, contractions, etc.

972 kB uncompressed.

sample:

{ text: 'Another OTC bank stock involved in a buy-out deal, First Constitution Financial, was higher.',
  tags:
   [ 'Determiner',
     'Noun',
     'Noun',
     'Noun',
     'Verb',
     'Preposition',
     'Determiner',
     'Noun',
     'Noun',
     'Noun',
     'Noun',
     'Noun',
     'Verb',
     'Comparative'
   ]
}
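
In the sample above, the tags line up one-to-one with the whitespace-separated words of the text, with punctuation lumped onto the neighbouring word. Here is a minimal sketch of pairing them up, assuming that alignment holds for a given record. The helper name `pairWordsWithTags` is hypothetical and not part of this package, and the record is pasted inline rather than loaded from the module.

```javascript
// Sample record copied from this README.
const sample = {
  text: 'Another OTC bank stock involved in a buy-out deal, First Constitution Financial, was higher.',
  tags: [
    'Determiner', 'Noun', 'Noun', 'Noun', 'Verb', 'Preposition', 'Determiner',
    'Noun', 'Noun', 'Noun', 'Noun', 'Noun', 'Verb', 'Comparative'
  ]
};

// Hypothetical helper (not part of the package).
// Assumption: one tag per whitespace-separated word, with punctuation
// attached to the word next to it, as in the sample above.
function pairWordsWithTags(record) {
  const words = record.text.split(/\s+/).filter(Boolean);
  if (words.length !== record.tags.length) {
    // Fail loudly instead of silently misaligning words and tags.
    throw new Error(
      'expected ' + record.tags.length + ' words but found ' + words.length
    );
  }
  return words.map((word, i) => [word, record.tags[i]]);
}

const pairs = pairWordsWithTags(sample);
console.log(pairs[0]); // [ 'Another', 'Determiner' ]
```

The length check is there because the alignment is only an observation from this one sample; records where contractions or punctuation are lumped differently may need different splitting.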

Original statement in NLTK:

Copyright (C) 1995 University of Pennsylvania;
This is a 10% fragment of Penn Treebank, (C) LDC 1995, which has been dependency parsed.
It is made available under fair use for the purposes of illustrating NLTK tools for tokenizing, tagging, chunking and parsing.
This data is for non-commercial use only.;

Please file an issue if you have any copyright concerns about this data being placed on npm or GitHub.