1.0.0 • Published 7 months ago

@picosearch/language-english v1.0.0

License
MIT

English Text Preprocessor

This module provides basic text preprocessing functions for English text, including tokenization, punctuation removal, stopword filtering, and stemming.

Functions

tokenizer(doc: string): string[]

This function takes a string as input and returns an array of tokens (words) extracted by matching the input against word characters. If the input is not a string, it returns an empty array.
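The behaviour described above can be sketched as follows. This is a minimal illustration, not the package's source: the name `sketchTokenizer` is hypothetical, and the `\w+` pattern is an assumption about what "word characters" means here.

```typescript
// Hypothetical re-implementation for illustration only;
// the regex actually used by the package may differ.
function sketchTokenizer(doc: unknown): string[] {
  // Non-string input yields an empty array, as the description states.
  if (typeof doc !== "string") {
    return [];
  }
  // \w+ matches runs of ASCII letters, digits, and underscores.
  // String.prototype.match returns null when nothing matches.
  return doc.match(/\w+/g) || [];
}

console.log(sketchTokenizer("Hello, world!")); // [ 'Hello', 'world' ]
```

Under this assumption an apostrophe splits a word, so "It's" would produce the two tokens "It" and "s".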

analyzer(token: string): string

This function processes a single token: it removes punctuation and converts the token to lowercase, then checks the result against a list of English stopwords. A stopword is removed; any other token is stemmed with the Porter stemmer.
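The steps above can be sketched in a self-contained form. Everything named here is an assumption made for illustration: `sketchAnalyzer`, the short `SKETCH_STOPWORDS` list, and `toyStem` are hypothetical stand-ins for the real stopword list and Porter stemmer, and returning an empty string for a stopword is a guess at how removal is signalled.

```typescript
// Tiny stand-in list; the real module takes its stopwords from the `stopword` package.
const SKETCH_STOPWORDS: string[] = ["a", "an", "and", "the", "is", "of", "to"];

// Toy suffix stripper standing in for the Porter stemmer.
// This is NOT the Porter algorithm; it only shows where stemming happens.
function toyStem(word: string): string {
  const suffixes = ["ing", "ed", "s"];
  for (const suffix of suffixes) {
    const longEnough = word.length > suffix.length + 2;
    if (longEnough && word.slice(-suffix.length) === suffix) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

function sketchAnalyzer(token: string): string {
  // Step 1: drop every non-word character, then lowercase.
  const cleaned = token.replace(/[^\w]/g, "").toLowerCase();
  // Step 2 (assumption): signal a removed stopword with an empty string.
  if (cleaned === "" || SKETCH_STOPWORDS.indexOf(cleaned) !== -1) {
    return "";
  }
  // Step 3: stem whatever survives the stopword filter.
  return toyStem(cleaned);
}

console.log(sketchAnalyzer("The"));   // "" (stopword)
console.log(sketchAnalyzer("cats,")); // "cat"
```

A caller would typically filter out the empty strings after mapping the tokens through the analyzer.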

Dependencies

  • porter-stemmer: English word stemmer.
  • stopword: A library containing a list of stopwords for various languages, including English.

1.0.0, 7 months ago
1.0.0-rc1, 8 months ago