hanzi2reading v0.0.2

License: BSD-2-Clause
Repository: github
Last release: 5 years ago

hanzi2reading

This library is distributed with a default database. The license information for that database is available at https://github.com/g0v/moedict-data/blob/master/README.md

Design goals

  • Annotate Chinese characters with Standard Mandarin (國語/普通話) readings.
  • Be agnostic to simplified/traditional script and to transliteration method.
  • Work offline, with a database format that is as compact as possible, e.g. Protocol Buffers loaded by WebAssembly.
  • Support word-based disambiguation of characters with multiple readings.
  • Separate code from data, so that the dictionary backend is swappable.
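
The last two goals can be illustrated with a minimal sketch. Everything named here (`DictionaryBackend`, `MapBackend`, `annotate`, and the numbered-pinyin strings) is hypothetical and chosen for illustration; it is not the API of hanzi2reading. The sketch assumes the input is a single word or phrase: it tries a whole-word lookup first and falls back to each character's first listed reading.

```typescript
// Hypothetical backend interface (an assumption, not the library's API).
// Any object that offers these two lookups could be swapped in.
interface DictionaryBackend {
  // Readings for a whole word, one per character, or undefined if unknown.
  lookupWord(word: string): string[] | undefined;
  // Candidate readings for one character, default reading first.
  lookupChar(char: string): string[] | undefined;
}

// A trivial in-memory backend used only for this example.
class MapBackend implements DictionaryBackend {
  private words: Map<string, string[]>;
  private chars: Map<string, string[]>;

  constructor(words: Map<string, string[]>, chars: Map<string, string[]>) {
    this.words = words;
    this.chars = chars;
  }

  lookupWord(word: string): string[] | undefined {
    return this.words.get(word);
  }

  lookupChar(char: string): string[] | undefined {
    return this.chars.get(char);
  }
}

// Annotate a non-sentence input: prefer the word entry, otherwise use
// per-character defaults. Unknown characters are passed through unchanged.
function annotate(input: string, dict: DictionaryBackend): string[] {
  const wordReadings = dict.lookupWord(input);
  if (wordReadings !== undefined) {
    return wordReadings;
  }
  return Array.from(input).map((ch) => {
    const candidates = dict.lookupChar(ch);
    return candidates !== undefined && candidates.length > 0 ? candidates[0] : ch;
  });
}

// Sample data: 行 defaults to xing2 but reads hang2 in the word 银行.
const sampleDict = new MapBackend(
  new Map([["银行", ["yin2", "hang2"]]]),
  new Map([
    ["银", ["yin2"]],
    ["行", ["xing2", "hang2"]],
  ])
);
```

With this sample data, `annotate("银行", sampleDict)` returns `["yin2", "hang2"]`, while `annotate("行", sampleDict)` returns `["xing2"]`. The readings are treated as opaque strings, so a different transliteration only requires different data.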

Limitations

  • Word segmentation is a non-goal.
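  • The target is good performance on non-sentence inputs (isolated words and phrases), without needing part-of-speech classification. Characters such as 得, whose reading depends on grammatical role, are therefore not fully disambiguated in running text.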

Database Format

Part     Bits
Initial  5
Medial   2
Final    4
Tone     3
Erhua    1

Total = 15 bits per syllable. This is less compact than enumerating all standard syllables, but allows dictionaries to have non-standard syllables.
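
As a rough illustration of the 15-bit layout, the sketch below packs the five fields into one integer and unpacks them again. The field order (initial in the highest bits, erhua in the lowest) and the names `Syllable`, `packSyllable`, and `unpackSyllable` are assumptions made for this example. The table above specifies only the widths, and the numeric codes assigned to each initial, medial, final, and tone are not defined here.

```typescript
// Assumed layout (not confirmed by the source):
// bits 14..10 initial | 9..8 medial | 7..4 final | 3..1 tone | 0 erhua
interface Syllable {
  initial: number; // 0..31 (5 bits)
  medial: number;  // 0..3  (2 bits)
  final: number;   // 0..15 (4 bits)
  tone: number;    // 0..7  (3 bits)
  erhua: boolean;  // 1 bit
}

function packSyllable(s: Syllable): number {
  // Each entry is [value, width in bits], most significant field first.
  const fields: Array<[number, number]> = [
    [s.initial, 5],
    [s.medial, 2],
    [s.final, 4],
    [s.tone, 3],
    [s.erhua ? 1 : 0, 1],
  ];
  let packed = 0;
  for (const [value, width] of fields) {
    if (!Number.isInteger(value) || value < 0 || value >= (1 << width)) {
      throw new RangeError(`value ${value} does not fit in ${width} bits`);
    }
    // Shift what we have so far and append the next field.
    packed = (packed << width) | value;
  }
  return packed; // always in the range 0..32767
}

function unpackSyllable(packed: number): Syllable {
  return {
    initial: (packed >> 10) & 0x1f,
    medial: (packed >> 8) & 0x3,
    final: (packed >> 4) & 0xf,
    tone: (packed >> 1) & 0x7,
    erhua: (packed & 0x1) === 1,
  };
}
```

Because every field keeps its own bit range, a dictionary can store combinations that do not occur in the standard syllable inventory, which is the trade-off described above.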

Notes

Resources