1.0.1 • Published 7 days ago

taibun v1.0.1



Taiwanese Hokkien Transliterator and Tokeniser

It provides methods to customise transliteration and to retrieve information about Taiwanese Hokkien pronunciation. It also includes a word tokeniser for Taiwanese Hokkien.

Versions

Python Version

Install

Taibun can be installed from npm:

$ npm install taibun --save

Usage

Converter

The Converter class transliterates Chinese characters into the chosen transliteration system, using the parameters specified by the developer. It works with both Traditional and Simplified characters.

// Constructor
c = new Converter({ system, dialect, format, delimiter, sandhi, punctuation, convertNonCjk });

// Transliterate Chinese characters
c.get(input);

System

system String - system of transliteration.

text | Tailo | POJ | Zhuyin | TLPA | Pingyim | Tongiong | IPA
台灣 | Tâi-uân | Tâi-oân | ㄉㄞˊ ㄨㄢˊ | Tai5 uan5 | Dáiwán | Tāi-uǎn | Tai²⁵ uan²⁵
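
To compare systems side by side, as in the table above, one converter can be built per system. The sketch below is only an illustration: compareSystems is a hypothetical helper, not part of Taibun, and it takes the Converter class as an argument so that it relies on nothing beyond the documented constructor option `system` and the `get` method.

```javascript
// Hypothetical helper (not part of Taibun): transliterate the same text
// with several systems and collect the results by system name.
// `Converter` is injected, so the helper itself imports nothing.
function compareSystems(Converter, text, systems) {
  const results = {};
  for (const system of systems) {
    // One converter per system; every other option keeps its default.
    results[system] = new Converter({ system }).get(text);
  }
  return results;
}
```

With the package installed, a call would look like compareSystems(require('taibun').Converter, '台灣', ['Tailo', 'POJ', 'Zhuyin']).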

Dialect

dialect String - preferred pronunciation.

  • south (default) - Zhangzhou-leaning pronunciation
  • north - Quanzhou-leaning pronunciation
text | south | north
五月節 | Gōo-gue̍h-tseh | Gōo-ge̍h-tsueh

Format

format String - format in which tones will be represented in the converted sentence.

  • mark (default) - uses diacritics for each syllable. Not available for TLPA.
  • number - adds a number representing the tone to the end of each syllable
  • strip - removes any tone marking
text | mark | number | strip
台灣 | Tâi-uân | Tai5-uan5 | Tai-uan
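
The relation between the mark and strip formats can be illustrated without the library. The sketch below is not Taibun's implementation; it assumes Tailo input, where tones are written with combining diacritics, and simply removes those diacritics after Unicode decomposition. The name stripToneMarks is hypothetical.

```javascript
// Hypothetical helper (not part of Taibun): approximate the `strip`
// format for Tailo text by removing combining tone marks.
function stripToneMarks(tailoText) {
  return tailoText
    .normalize('NFD')        // split letters from their combining marks
    .replace(/\p{M}/gu, '')  // drop every combining mark
    .normalize('NFC');       // recompose whatever remains
}

console.log(stripToneMarks('Tâi-uân')); // Tai-uan
```

In practice, passing format: 'strip' to the Converter is the documented way to get this output; the sketch only shows what the option amounts to for Tailo.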

Delimiter

delimiter String - sets the delimiter character that will be placed in between syllables of a word.

Default value depends on the chosen system:

  • '-' - for Tailo, POJ, Tongiong
  • '' - for Pingyim
  • ' ' - for Zhuyin, TLPA, IPA
text | '-' | '' | ' '
台灣 | Tâi-uân | Tâiuân | Tâi uân
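
The default rule can be read as a lookup keyed by system, with an explicit delimiter taking precedence. The following is a minimal sketch of that rule using the defaults listed above; resolveDelimiter and joinSyllables are hypothetical names, not Taibun functions.

```javascript
// Defaults copied from the list above.
const DEFAULT_DELIMITERS = new Map([
  ['Tailo', '-'], ['POJ', '-'], ['Tongiong', '-'],
  ['Pingyim', ''],
  ['Zhuyin', ' '], ['TLPA', ' '], ['IPA', ' '],
]);

// Hypothetical helper: an explicit delimiter wins, otherwise use the
// default for the system.
function resolveDelimiter(system = 'Tailo', delimiter) {
  if (delimiter !== undefined) return delimiter;
  if (!DEFAULT_DELIMITERS.has(system)) {
    throw new RangeError('Unknown system: ' + system);
  }
  return DEFAULT_DELIMITERS.get(system);
}

// Hypothetical helper: join the syllables of one word.
function joinSyllables(syllables, system, delimiter) {
  return syllables.join(resolveDelimiter(system, delimiter));
}
```

Note that an empty string is a valid explicit delimiter, which is why the check is against undefined rather than a falsy value.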

Sandhi

sandhi String - applies the sandhi rules of Taiwanese Hokkien.

Since it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.

  • none - doesn't perform any tone sandhi
  • auto - closest approximation to full correct tone sandhi of Taiwanese, with proper sandhi of pronouns, suffixes, and words with 仔
  • excLast - changes tone for every syllable except for the last one
  • inclLast - changes tone for every syllable including the last one

Default value depends on the chosen system:

  • auto - for Tongiong
  • none - for Tailo, POJ, Zhuyin, TLPA, Pingyim, IPA
text | none | auto | excLast | inclLast
這是你的手機仔無 | Tse sī lí ê tshiú-ki-á bô | Tse sì li ē tshiu-kī-á bô? | Tsē sì li ē tshiu-kī-a bô | Tsē sì li ē tshiu-kī-a bō

Sandhi rules also change depending on the dialect chosen.

text | no sandhi | south | north
台灣 | Tâi-uân | Tāi-uân | Tài-uân
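
To make the excLast mode and the dialect difference concrete, here is a rough sketch that works on syllables in number format. It is not Taibun's code: the tone table is the commonly taught sandhi cycle and is an assumption here, the helper names are hypothetical, and none of the special cases handled by auto are attempted.

```javascript
// Hypothetical helper (not part of Taibun): textbook tone sandhi for one
// numbered syllable such as "Tai5". The table is an assumption.
function sandhiSyllable(syllable, dialect = 'south') {
  const match = /^(.+?)([1-8])$/.exec(syllable);
  if (!match) return syllable; // no trailing tone number: leave untouched
  const [, base, tone] = match;
  const endsInH = base.toLowerCase().endsWith('h');
  const table = {
    1: '7', 2: '1', 3: '2', 7: '3',
    4: endsInH ? '2' : '8',
    5: dialect === 'north' ? '3' : '7', // the only dialect-dependent tone
    8: endsInH ? '3' : '4',
  };
  return base + (table[tone] || tone);
}

// excLast: change every syllable except the last one.
function sandhiExcLast(syllables, dialect = 'south') {
  const last = syllables.length - 1;
  return syllables.map((s, i) => (i < last ? sandhiSyllable(s, dialect) : s));
}

console.log(sandhiExcLast(['Tai5', 'uan5']));          // [ 'Tai7', 'uan5' ]
console.log(sandhiExcLast(['Tai5', 'uan5'], 'north')); // [ 'Tai3', 'uan5' ]
```

The two printed results correspond to Tāi-uân (south) and Tài-uân (north) in the table above.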

Punctuation

punctuation String - controls how punctuation and sentence-initial capitalisation are handled in the output.

  • format (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence.
  • none - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences.
text | format | none
這是臺南,簡稱「南」(白話字:Tâi-lâm;注音符號:ㄊㄞˊ ㄋㄢˊ,國語:Táinán)。 | Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán). | tse sī Tâi-lâm,kán-tshing「lâm」(Pe̍h-uē-jī:Tâi-lâm;tsù-im hû-hō:ㄊㄞˊ ㄋㄢˊ,kok-gí:Táinán)。

Convert non-CJK

convertNonCjk Boolean - defines whether or not to convert non-Chinese words. Can be used to convert Tailo to another romanisation system.

  • true - convert non-Chinese character words
  • false (default) - convert only Chinese character words
text | false | true
我食pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ

Tokeniser

The Tokeniser class performs NLTK wordpunct_tokenize-like tokenisation of a Taiwanese Hokkien sentence.

// Constructor
t = new Tokeniser();

// Tokenise Taiwanese Hokkien sentence
t.tokenise(input);
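
For readers unfamiliar with NLTK, wordpunct_tokenize splits text into runs of word characters and runs of punctuation. The sketch below shows only that splitting rule in JavaScript. It is not Taibun's Tokeniser: it does not segment a run of Chinese characters into separate words, and it deliberately counts combining marks as word characters so that tone diacritics stay attached. The name wordPunctTokenise is hypothetical.

```javascript
// Runs of letters, combining marks, digits or underscores form one token;
// runs of anything else that is not whitespace form another.
const WORD_OR_PUNCT = /[\p{L}\p{M}\p{N}_]+|[^\p{L}\p{M}\p{N}_\s]+/gu;

// Hypothetical helper (not part of Taibun): wordpunct-style splitting.
function wordPunctTokenise(text) {
  return text.match(WORD_OR_PUNCT) || [];
}

console.log(wordPunctTokenise('Lin ho! Lin tsiah-pa bue?'));
// [ 'Lin', 'ho', '!', 'Lin', 'tsiah', '-', 'pa', 'bue', '?' ]
```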

Other Functions

Handy functions for NLP tasks in Taiwanese Hokkien.

// Convert to Traditional
toTraditional(input);

// Convert to Simplified
toSimplified(input);

// Check if the string is fully composed of Chinese characters
isCjk(input);
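
As an illustration of the kind of check isCjk performs, a string can be tested against the Unicode Han script property. This is a sketch under that assumption, not the library's implementation, which may cover a different set of characters; looksFullyHan is a hypothetical name.

```javascript
// Hypothetical helper (not part of Taibun): true only when every
// character belongs to the Han script. An empty string yields false.
function looksFullyHan(text) {
  return /^\p{Script=Han}+$/u.test(text);
}

console.log(looksFullyHan('我食麭'));    // true
console.log(looksFullyHan('我食pháng')); // false
```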

Example

// Converter
const { Converter } = require('taibun');

//// System
c = new Converter(); // Tailo system default
c.get('先生講,學生恬恬聽。');
>> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.

c = new Converter({ system: 'Zhuyin' });
c.get('先生講,學生恬恬聽。');
>> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.

//// Dialect
c = new Converter(); // south dialect default
c.get("我欲用箸食魚");
>> Guá beh īng tī tsia̍h hî

c = new Converter({ dialect: 'north' });
c.get("我欲用箸食魚");
>> Guá bueh īng tū tsia̍h hû

//// Format
c = new Converter(); // for Tailo, mark by default
c.get("生日快樂");
>> Senn-ji̍t khuài-lo̍k

c = new Converter({ format: 'number' });
c.get("生日快樂");
>> Senn1-jit8 khuai3-lok8

c = new Converter({ format: 'strip' });
c.get("生日快樂");
>> Senn-jit khuai-lok

//// Delimiter
c = new Converter({ delimiter: '' });
c.get("先生講,學生恬恬聽。");
>> Siansinn kóng, ha̍ksing tiāmtiām thiann.

c = new Converter({ system: 'Pingyim', delimiter: '-' });
c.get("先生講,學生恬恬聽。");
>> Siān-snī gǒng, hág-sīng diâm-diâm tinā.

//// Sandhi
c = new Converter(); // for Tailo, sandhi none by default
c.get("這是台灣囡仔");
>> Tse sī Tâi-uân gín-á

c = new Converter({ sandhi: 'auto' });
c.get("這是台灣囡仔");
>> Tse sì Tāi-uān gin-á

c = new Converter({ sandhi: 'excLast' });
c.get("這是台灣囡仔");
>> Tsē sì Tāi-uān gin-á

c = new Converter({ sandhi: 'inclLast' });
c.get("這是台灣囡仔");
>> Tsē sì Tāi-uān gin-a

//// Punctuation
c = new Converter(); // format punctuation default
c.get("太空朋友,恁好!恁食飽未?");
>> Thài-khong pîng-iú, lín-hó! Lín tsia̍h-pá buē?

c = new Converter({ punctuation: 'none' });
c.get("太空朋友,恁好!恁食飽未?");
>> thài-khong pîng-iú,lín-hó!lín tsia̍h-pá buē?

//// Convert non-CJK
c = new Converter({ system: 'Zhuyin' }); // false convertNonCjk default
c.get("我食pháng");
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng

c = new Converter({ system: 'Zhuyin', convertNonCjk: true });
c.get("我食pháng");
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ


// Tokeniser
const { Tokeniser } = require('taibun');

t = new Tokeniser();
t.tokenise("太空朋友,恁好!恁食飽未?");
>> ['太空', '朋友', ',', '恁好', '!', '恁', '食飽', '未', '?']


// Other Functions
const { toTraditional, toSimplified, isCjk } = require('taibun');

toTraditional("我听无台湾话");
>> 我聽無台灣話

toSimplified("我聽無臺灣話");
>> 我听无台湾话

isCjk('我食麭');
>> true

isCjk('我食pháng');
>> false

Data

Acknowledgements

Licence

Because Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. Note that the data used by the package is licensed separately.

The data is licensed under CC BY-SA 4.0.