0.1.2 • Published 6 years ago

morpheme-splitter-np v0.1.2

License
ISC

Morpheme Splitter - Nepali

The script is adapted from this code in an IPython notebook.

Nepali words are composed of morphemes, which can be broadly divided into two categories: vowels and consonants. A given word can be resolved into its morphemes with a few elementary rules. While these rules are relatively straightforward, the Unicode representation makes them slightly non-trivial to apply. Consider these scenarios:

  • क is actually a single character in Unicode, while it is two morphemes, क् + अ in Nepali.
  • क + ् in Unicode representation translates to क्, a single morpheme in Nepali.
  • क + ि in Unicode representation translates to क् + इ in Nepali.
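The mismatch between code points and morphemes can be seen by listing the code points behind each of the three examples. This is a small illustrative sketch; the helper name `codePoints` is hypothetical and is not part of this package.

```javascript
// Return the Unicode code points of a string in "U+XXXX" notation.
// `codePoints` is a hypothetical helper used only for illustration.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// क alone: one code point, although it reads as क् + अ.
console.log(codePoints("\u0915"));
// क followed by the halanta: two code points, one morpheme (क्).
console.log(codePoints("\u0915\u094D"));
// क followed by the vowel sign ि: two code points, read as क् + इ.
console.log(codePoints("\u0915\u093F"));
```

In each case the number of code points differs from what a reader would count as morphemes, which is why the splitting rules have to look at the following character.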

In this script, we define rules for separating morphemes in the Unicode representation of Nepali. This will serve as a building block for later systems that separate syllables in multi-syllable Nepali words.

Rules

  • If a character is a vowel, leave it as it is.
  • If a character is a single Unicode consonant (क - ह):
    • If it is the last letter, it makes two morphemes: the consonant with a halanta, followed by the independent vowel अ.
    • If the next character is a halanta (्), the consonant together with the halanta is a single morpheme.
    • If the next character is a vowel, the consonant and that vowel make two morphemes (क् + ि).
    • If the next character is a consonant, it makes two morphemes: the consonant with a halanta, followed by the independent vowel אअ.
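The rules above can be sketched as a single pass over the characters of a word. This is a minimal sketch and not necessarily how the package implements it: the names `splitMorphemes`, `isConsonant`, and `MATRA_TO_VOWEL` are hypothetical, and the sketch assumes that dependent vowel signs are reported as independent vowels, following the क + ि to क् + इ example earlier in this document.

```javascript
const HALANTA = "\u094D"; // ्
const INHERENT_A = "\u0905"; // अ

// Assumption: dependent vowel signs are mapped to independent vowels.
const MATRA_TO_VOWEL = new Map([
  ["\u093E", "\u0906"], // ा -> आ
  ["\u093F", "\u0907"], // ि -> इ
  ["\u0940", "\u0908"], // ी -> ई
  ["\u0941", "\u0909"], // ु -> उ
  ["\u0942", "\u090A"], // ू -> ऊ
  ["\u0943", "\u090B"], // ृ -> ऋ
  ["\u0947", "\u090F"], // े -> ए
  ["\u0948", "\u0910"], // ै -> ऐ
  ["\u094B", "\u0913"], // ो -> ओ
  ["\u094C", "\u0914"], // ौ -> औ
]);

// True for the single-code-point consonants क (U+0915) to ह (U+0939).
function isConsonant(ch) {
  return ch >= "\u0915" && ch <= "\u0939";
}

// Hypothetical splitter: returns the morphemes of `word` as an array.
function splitMorphemes(word) {
  const chars = Array.from(word);
  const out = [];
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (!isConsonant(ch)) {
      // Independent vowels (and any other sign) pass through unchanged.
      out.push(ch);
      continue;
    }
    const next = chars[i + 1];
    if (next === HALANTA) {
      // Consonant + halanta is a single morpheme.
      out.push(ch + HALANTA);
      i++;
    } else if (MATRA_TO_VOWEL.has(next)) {
      // Consonant + vowel sign gives two morphemes.
      out.push(ch + HALANTA, MATRA_TO_VOWEL.get(next));
      i++;
    } else {
      // Last letter or followed by a consonant: add the inherent अ.
      // Assumption: any other following character is treated the same way.
      out.push(ch + HALANTA, INHERENT_A);
    }
  }
  return out;
}

console.log(splitMorphemes("\u0928\u0947\u092A\u093E\u0932\u0940").join(" + "));
// नेपाली -> न् + ए + प् + आ + ल् + ई
```

Using a lookup table for the vowel signs keeps the consonant branch to three cases, which mirror the three sub-rules for consonants.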

License

MIT License

Copyright
