1.0.0 • Published 10 months ago

@picosearch/language-english v1.0.0

License
MIT
Repository
github

English Text Preprocessor

This module provides basic text preprocessing functions for English text, including tokenization, punctuation removal, stopword filtering, and stemming.

Functions

tokenizer(doc: string): string[]

This function takes a string as input and returns an array of tokens (words) extracted by matching the input against word characters. If the input is not a string, it returns an empty array.
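The behavior described above can be sketched as follows. This is a minimal illustration, not the package's source: the name `tokenizeSketch` is hypothetical, and the `\w+` pattern is an assumption about what "word characters" means here.

```typescript
// Hypothetical sketch of the tokenizer behavior described in this README.
// Assumption: "word characters" corresponds to the regex class \w
// (ASCII letters, digits, and underscore).
function tokenizeSketch(doc: unknown): string[] {
  // Non-string input yields an empty array, as the README states.
  if (typeof doc !== "string") return [];
  // String.prototype.match with a global regex returns all matches, or null if there are none.
  return doc.match(/\w+/g) || [];
}

const tokens = tokenizeSketch("Hello, world! It's 2024.");
console.log(tokens); // [ 'Hello', 'world', 'It', 's', '2024' ]
```

Note that under this assumption an apostrophe splits a word, so "It's" becomes two tokens. Case is preserved at this stage; lowercasing happens later, per token.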

analyzer(token: string): string

This function processes a single token by removing punctuation and converting it to lowercase. It then checks the token against a list of English stopwords and removes it if it is found there. Otherwise, it stems the token using the Porter stemmer.
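The order of those steps can be sketched as below. Everything here is a stand-in chosen so the example runs without dependencies: `STOPWORDS_SKETCH` is a tiny hypothetical list rather than the `stopword` library's English list, `toyStem` is a crude suffix stripper and not the Porter algorithm, and returning an empty string for a removed token is an assumption that this README does not confirm.

```typescript
// Hypothetical stand-in for the English stopword list (the real list is much longer).
const STOPWORDS_SKETCH = new Set(["a", "an", "and", "the", "is", "of", "to", "in"]);

// Toy suffix stripper standing in for the Porter stemmer. It is NOT the Porter
// algorithm; it only strips "ing", "ed", or "s" from sufficiently long words.
function toyStem(word: string): string {
  for (const suffix of ["ing", "ed", "s"]) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Sketch of the analyzer steps: strip punctuation, lowercase, drop stopwords, stem.
function analyzeSketch(token: string): string {
  const cleaned = token.replace(/[^\w]/g, "").toLowerCase();
  // Assumption: a removed token is represented by an empty string.
  if (cleaned === "" || STOPWORDS_SKETCH.has(cleaned)) return "";
  return toyStem(cleaned);
}

console.log(analyzeSketch("Jumping!")); // "jump"
console.log(analyzeSketch("The"));      // ""
```

A caller applying such an analyzer to every token would then filter out the empty results before indexing.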

Dependencies

  • porter-stemmer: English word stemmer.
  • stopword: A library containing a list of stopwords for various languages, including English.

Versions

  • 1.0.0: 10 months ago
  • 1.0.0-rc1: 11 months ago