5.2.0 • Published 6 months ago

trigrams v5.2.0


trigrams

Trigrams for 460+ languages.

What is this?

This package exposes the 300 most common trigrams for each of 460+ natural languages. They are based on the most translated copyright-free document on this planet: the Universal Declaration of Human Rights (UDHR).

When should I use this?

When you are dealing with natural language detection.

Install

This package is ESM only. In Node.js (version 14.14+, 16.0+), install with npm:

npm install trigrams

In Deno with esm.sh:

import {top, min} from 'https://esm.sh/trigrams@5'

In browsers with esm.sh:

<script type="module">
  import {top, min} from 'https://esm.sh/trigrams@5?bundle'
</script>

Use

import {top, min} from 'trigrams'

console.log((await top()).pam)
console.log((await min()).nld)

Yields:

{ // 300 top trigrams.
  'isa': 6,
  'upa': 6,
  'i k': 6,
  // …
  'ang': 273,
  'ing': 282,
  'ng ': 572 // Most common trigram with how often it was found.
}
[ // 300 top trigrams.
  ' ar',
  'eer',
  'tij',
  // …
  'de ',
  'an ',
  'en ' // Most common trigram.
]
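
As a rough illustration of how such lists could feed language detection, the sketch below scores each language by how many of its trigrams also occur in an input text. The function name guessLanguage and the tiny sample profiles are hypothetical and exist only for this example; real profiles would come from await min(). Counting overlap is a deliberately simple scoring choice, not the method of any particular detector.

```javascript
// Hypothetical helper (not part of this package): pick the language whose
// trigram list overlaps most with the trigrams found in `text`.
// `profiles` has the same shape as the result of `await min()`.
function guessLanguage(text, profiles) {
  // Simplified cleaning for the example: lowercase, collapse white space, pad.
  const padded = '  ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + '  '
  const seen = new Set()
  for (let i = 0; i + 3 <= padded.length; i++) {
    seen.add(padded.slice(i, i + 3))
  }

  let best = null
  let bestScore = -1
  for (const [code, trigrams] of Object.entries(profiles)) {
    let score = 0
    for (const trigram of trigrams) {
      if (seen.has(trigram)) score++
    }
    if (score > bestScore) {
      best = code
      bestScore = score
    }
  }
  return best
}

// Made-up miniature profiles, reusing the trigrams shown in the output above.
const sampleProfiles = {
  nld: [' ar', 'eer', 'tij', 'de ', 'an ', 'en '],
  pam: ['isa', 'upa', 'i k', 'ang', 'ing', 'ng ']
}

console.log(guessLanguage('de tijd van een eerlijk mens', sampleProfiles)) // nld
```

With the full data, the second argument would be the object that min() resolves to, so every language in the table under Data takes part in the comparison.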

API

This package exports the identifiers top and min. There is no default export.

top()

Get the top trigrams together with their occurrence counts.

Returns

A promise resolving to an object that maps UDHR in Unicode codes (such as nld, see Data) to objects that map the top 300 trigrams to their occurrence counts (Promise<Record<string, Record<string, number>>>).
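
Because each language maps to a plain object of counts, ordinary object helpers are enough to work with it. The sketch below is one possible way to list the most frequent trigrams of a single language; mostFrequent is a hypothetical helper name, and the sample object only reuses a few values from the output shown under Use.

```javascript
// Hypothetical helper (not part of this package): return the `n` entries with
// the highest counts as [trigram, count] pairs, most frequent first.
function mostFrequent(counts, n) {
  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
}

// A few entries in the shape of `(await top()).pam`.
const pamSample = {isa: 6, ang: 273, ing: 282, 'ng ': 572}

console.log(mostFrequent(pamSample, 2)) // [ [ 'ng ', 572 ], [ 'ing', 282 ] ]
```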

min()

Get the top trigrams, without occurrence counts.

Returns

A promise resolving to an object that maps UDHR in Unicode codes to arrays containing the top 300 trigrams, sorted from least occurring to most occurring (Promise<Record<string, Array<string>>>).
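
Since the arrays run from least to most occurring, the most common trigrams sit at the end. A minimal sketch of reading them in the opposite order follows; mostCommon is a hypothetical helper name and the sample array reuses the trigrams shown under Use.

```javascript
// Hypothetical helper (not part of this package): take the last `n` items of a
// least-to-most sorted list and return them most common first.
function mostCommon(trigrams, n) {
  // `slice(-0)` would return the whole array, so handle non-positive `n` first.
  return n > 0 ? trigrams.slice(-n).reverse() : []
}

// A few entries in the shape of `(await min()).nld`.
const nldSample = [' ar', 'eer', 'tij', 'de ', 'an ', 'en ']

console.log(mostCommon(nldSample, 3)) // [ 'en ', 'an ', 'de ' ]
```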

Data

The trigrams are based on the Unicode versions of the Universal Declaration of Human Rights.

The files are created from all paragraphs made available by wooorm/udhr. Headings and similar non-paragraph text are not included.

Before creating trigrams:

  • the Unicode characters from \u0021 to \u0040 (inclusive) are removed
  • one or more white space characters (\s+) are replaced with a single space
  • alphabetic characters ([A-Z]) are lowercased

Additionally, the input is padded with two spaces on both sides.
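
The steps above can be sketched in a few lines, which may help when preparing your own text for comparison against this data. This is an illustration of the description only, not the code used to build the package: clean and countTrigrams are hypothetical names, and the final trim() is an added assumption so that padding is the only source of leading and trailing spaces. Note that the range \u0021 to \u0040 covers ASCII punctuation as well as the digits 0 to 9.

```javascript
// Hypothetical helper: apply the cleaning steps described above.
function clean(value) {
  return value
    .replace(/[\u0021-\u0040]/g, '') // remove U+0021..U+0040 (punctuation, digits)
    .replace(/\s+/g, ' ') // collapse runs of white space into one space
    .replace(/[A-Z]/g, (character) => character.toLowerCase()) // lowercase A-Z
    .trim() // assumption: not listed above, added to keep the padding exact
}

// Hypothetical helper: count every trigram in the cleaned, padded input.
function countTrigrams(value) {
  const padded = '  ' + clean(value) + '  ' // two spaces on both sides
  const counts = {}
  for (let index = 0; index + 3 <= padded.length; index++) {
    const trigram = padded.slice(index, index + 3)
    counts[trigram] = (counts[trigram] || 0) + 1
  }
  return counts
}

console.log(clean('Hello,  World 42!')) // hello world
console.log(Object.keys(countTrigrams('Hi, World!')).length) // 10
```

The result of countTrigrams has the same shape as one language entry returned by top(), so the two can be compared key by key.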

Code | Name
007 | Sãotomense
008 | Crioulo, Upper Guinea (008)
009 | Mbundu (009)
010 | Tetun Dili
011 | Umbundu (011)
013 | (Mijisa)
014 | (Maiunan)
016 | (Minjiang, spoken)
017 | (Minjiang, written)
020 | Drung
021 | (Muzzi)
022 | (Klau)
025 | (Bizisa)
026 | (Yeonbyeon)
027 | Gumuz
028 | Kafa
029 | Sidamo
030 | Kituba (2)
032 | South Azerbaijani
041 | Latvian (2)
042 | Spanish (resolution)
043 | Zarma
aar | Afar
abk | Abkhaz
ace | Aceh
acu | Achuar-Shiwiar
acu_1 | Achuar-Shiwiar (1)
ada | Dangme
ady | Adyghe
afr | Afrikaans
agr | Aguaruna
aii | Assyrian Neo-Aramaic
ajg | Aja
aka_akuapem | Twi (Akuapem)
aka_asante | Twi (Asante)
aka_fante | Fante
als | Albanian, Tosk
alt | Altai, Southern
amc | Amahuaca
ame | Yaneshaʼ
amh | Amharic
ami | Amis
amr | Amarakaeri
arb | Arabic, Standard
arl | Arabela
arn | Mapudungun
ast | Asturian
auc | Waorani
auv | Occitan (Auvergnat)
ayr | Aymara, Central
azj_cyrl | Azerbaijani, North (Cyrillic)
azj_latn | Azerbaijani, North (Latin)
bam | Bamanankan
ban | Bali
bax | Bamun
bba | Baatonum
bci | Baoulé
bcl | Bicolano, Central
bel | Belarusan
bem | Bemba
ben | Bengali
bfa | Bari
bho | Bhojpuri
bin | Edo
bis | Bislama
blt | Tai Dam
blu | Hmong Njua
boa | Bora
bod | Tibetan, Central
bos_cyrl | Bosnian (Cyrillic)
bos_latn | Bosnian (Latin)
bre | Breton
btb | Bulu
buc | Bushi
bug | Bugis
bul | Bulgarian
cab | Garifuna
cak | Kaqchikel, Central
cat | Catalan-Valencian-Balear
cbi | Chachi
cbr | Cashibo-Cacataibo
cbs | Cashinahua
cbt | Chayahuita
cbu | Candoshi-Shapra
ccx | Zhuang, Yongbei
ceb | Cebuano
ces | Czech
cha | Chamorro
chj | Chinantec, Ojitlán
chk | Chuukese
chr_cased | Cherokee (cased)
chr_uppercase | Cherokee (uppercase)
chv | Chuvash
cic | Chickasaw
cjk | Chokwe
cjk_AO | Chokwe (Angola)
cjs | Shor
ckb | Kurdish, Central
cnh | Chin, Haka
cni | Asháninka
cnr | Montenegrin
cof | Colorado
cos | Corsican
cot | Caquinte
cpu | Ashéninka, Pichis
crh | Crimean Tatar
crs | Seselwa Creole French
csa | Chinantec, Chiltepec
csw | Cree, Swampy
ctd | Chin, Tedim
cym | Welsh
dag | Dagbani
dan | Danish
ddn | Dendi
deu_1901 | German, Standard (1901)
deu_1996 | German, Standard (1996)
dga | Dagaare, Southern
dip | Dinka, Northeastern
div | Maldivian
dyo | Jola-Fonyi
dyu | Jula
dzo | Dzongkha
ell_monotonic | Greek (monotonic)
ell_polytonic | Greek (polytonic)
emk | Maninkakan, Eastern
eml | Romagnolo
eng | English
epo | Esperanto
ese | Ese Ejja
est | Estonian
eus | Basque
eve | Even
evn | Evenki
ewe | Éwé
fao | Faroese
fij | Fijian
fin | Finnish
fkv | Finnish, Kven
flm | Chin, Falam
fon | Fon
fra | French
fri | Frisian, Western
fuf | Pular
fur | Friulian
fuv | Fulfulde, Nigerian
fuv2 | Fulfulde, Nigerian (2)
fvr | Fur
gaa | Ga
gag | Gagauz
gax | Oromo, Borana-Arsi-Guji
gjn | Gonja
gkp | Kpelle, Guinea
gla | Gaelic, Scottish
gld | Nanai
gle | Gaelic, Irish
glg | Galician
glv | Manx
gsw1 | Alemannisch (Elsassisch)
guc | Wayuu
gug | Guaraní, Paraguayan
guj | Gujarati
guu | Yanomamö
gyr | Guarayu
hat_kreyol | Haitian Creole French (Kreyol)
hat_popular | Haitian Creole French (Popular)
hau_NE | Hausa (Niger)
hau_NG | Hausa (Nigeria)
hau_3 | Hausa
haw | Hawaiian
hea | Hmong, Northern Qiandong
heb | Hebrew
hil | Hiligaynon
hin | Hindi
hlt | Chin, Matu
hms | Hmong, Southern Qiandong
hna | Gen
hni | Hani
hns | Hindustani, Sarnami
hrv | Croatian
hsb | Sorbian, Upper
hsf | Huastec (Sierra de Otontepec)
hun | Hungarian
hus | Huastec (Veracruz)
huu | Huitoto, Murui
hva | Huastec (San Luís Potosí)
hye | Armenian
ibb | Ibibio
ibo | Igbo
ido | Ido
idu | Idoma
ijs | Ijo, Southeast
ike | Inuktitut, Eastern Canadian
ilo | Ilocano
ina | Interlingua
ind | Indonesian
isl | Icelandic
ita | Italian
jav | Javanese (Latin)
jav_java | Javanese (Javanese)
jiv | Shuar
jpn | Japanese
jpn_osaka | Japanese (Osaka)
jpn_tokyo | Japanese (Tokyo)
kaa | Karakalpak
kal | Inuktitut, Greenlandic
kan | Kannada
kat | Georgian
kaz | Kazakh
kbd | Kabardian
kbp | Kabiyé
kde | Makonde
kdh | Tem
kea | Kabuverdianu
kek | Q'eqchi'
kha | Khasi
khk | Mongolian, Halh (Cyrillic)
khm | Khmer, Central
kin | Rwanda
kir | Kirghiz
kjh | Khakas
kkh_lana | Khün
kmb | Mbundu
kmr | Kurdish, Northern
knc | Kanuri, Central
kng | Koongo
kng_AO | Koongo (Angola)
koi | Komi-Permyak
koo | Konjo
kor | Korean
kqn | Kaonde
kqs | Kissi, Northern
kri | Krio
krl | Karelian
ktu | Kituba
kwi | Awa-Cuaiquer
lad | Ladino
lao | Lao
lat | Latin
lat_1 | Latin (1)
lav | Latvian
lia | Limba, West-Central
lij | Ligurian
lin | Lingala
lin_tones | Lingala (tones)
lit | Lithuanian
lld | Ladin
lnc | Occitan (Languedocien)
lns | Lamnso'
lob | Lobi
lot | Otuho
loz | Lozi
ltz | Luxembourgeois
lua | Luba-Kasai
lue | Luvale
lug | Ganda
lun | Lunda
lus | Mizo
mad | Madura
mag | Magahi
mah | Marshallese
mai | Maithili
mal | Malayalam
mal_chillus | Malayalam
mam | Mam, Northern
mar | Marathi
maz | Mazahua Central
mcd | Sharanahua
mcf | Matsés
men | Mende
mfq | Moba
mic | Micmac
min | Minangkabau
miq | Mískito
mkd | Macedonian
mlt | Maltese
mly_arab | Malay (Arabic)
mly_latn | Malay (Latin)
mnw | Mon
mor | Moro
mos | Mòoré
mri | Maori
mto | Mixe, Totontepec
mxi | Mozarabic
mxv | Mixtec, Metlatónoc
mya | Burmese
mzi | Mazatec, Ixcatlán
nav | Navajo
nba | Nyemba
nbl | Ndebele
ndo | Ndonga
nds | Saxon, Low
nep | Nepali
nhn | Nahuatl, Central
nio | Nganasan
niu | Niue
niv | Gilyak
njo | Naga, Ao
nku | Kulango, Bouna
nld | Dutch
nno | Norwegian, Nynorsk
nob | Norwegian, Bokmål
not | Nomatsiguenga
nso | Sotho, Northern
nya_chechewa | Nyanja (Chechewa)
nya_chinyanja | Nyanja (Chinyanja)
nym | Nyamwezi
nyn | Nyankore
nzi | Nzema
oaa | Orok
oci_1 | Occitan (Francoprovençal, Fribourg)
oci_2 | Occitan (Francoprovençal, Savoie)
oci_3 | Occitan (Francoprovençal, Vaud)
oci_4 | Occitan (Francoprovençal, Valais)
ojb | Ojibwa, Northwestern
oki | Okiek
orh | Oroqen
oss | Osetin
ote | Otomi, Mezquital
pam | Pampangan
pan | Panjabi, Eastern
pap | Papiamentu
pau | Palauan
pbb | Páez
pbu | Pashto, Northern
pcd | Picard
pcm | Pidgin, Nigerian
pes_1 | Farsi, Western
pes_2 | Dari
pis | Pijin
piu | Pintupi-Luritja
plt | Malagasy, Plateau
pnb | Panjabi, Western
pol | Polish
pon | Pohnpeian
por_BR | Portuguese (Brazil)
por_PT | Portuguese (Portugal)
pov | Crioulo, Upper Guinea
ppl | Pipil
prv | Occitan
quc | K'iche', Central
qud | Quechua (Unified Quichua, old Hispanic orthography)
qug | Quichua, Chimborazo Highland
quy | Quechua, Ayacucho
quz | Quechua, Cusco
qva | Quechua, Ambo-Pasco
qvc | Quechua, Cajamarca
qvh | Quechua, Huamalíes-Dos de Mayo Huánuco
qvm | Quechua, Margos-Yarowilca-Lauricocha
qvn | Quechua, North Junín
qwh | Quechua, Huaylas Ancash
qxa | Quechua, South Bolivian
qxn | Quechua, Northern Conchucos Ancash
qxu | Quechua, Arequipa-La Unión
rar | Rarotongan
rmn | Romani, Balkan
rmn_1 | Romani, Balkan (1)
rmy | Aromanian
roh | Romansch
roh_puter | Romansch (Puter)
roh_rumgr | Romansch (Grischun)
roh_surmiran | Romansch (Surmiran)
roh_sursilv | Romansch (Sursilvan)
roh_sutsilv | Romansch (Sutsilvan)
roh_vallader | Romansch (Vallader)
ron_1953 | Romanian (1953)
ron_1993 | Romanian (1993)
ron_2006 | Romanian (2006)
run | Rundi
rus | Russian
sag | Sango
sah | Yakut
san | Sanskrit
sco | Scots
sey | Secoya
shk | Shilluk
shn | Shan
shp | Shipibo-Conibo
sin | Sinhala
skr | Seraiki
slk | Slovak
slr | Salar
slv | Slovenian
sme | Saami, North
smo | Samoan
sna | Shona
snk | Soninke
snn | Siona
som | Somali
sot | Sotho, Southern
spa | Spanish
src | Sardinian, Logudorese
srp_cyrl | Serbian (Cyrillic)
srp_latn | Serbian (Latin)
srr | Serer-Sine
ssw | Swati
suk | Sukuma
sun | Sunda
sus | Susu
swb | Comorian, Maore
swe | Swedish
swh | Swahili
tah | Tahitian
tam | Tamil
tam_LK | Tamil (Sri Lanka)
tat | Tatar
tbz | Ditammari
tca | Ticuna
tel | Telugu
tem | Themne
tet | Tetun
tgk | Tajiki
tgl | Tagalog
tha | Thai
tha2 | Thai (2)
tir | Tigrigna
tiv | Tiv
tly | Talysh
tob | Toba
toi | Tonga
toj | Tojolabal
ton | Tongan
top | Totonac, Papantla
tpi | Tok Pisin
tsn | Tswana
tso_MZ | Tsonga (Mozambique)
tso_ZW | Tsonga (Zimbabwe)
tsz | Purepecha
tuk_cyrl | Turkmen (Cyrillic)
tuk_latn | Turkmen (Latin)
tur | Turkish
tyv | Tuva
tzc | Tzotzil (Chamula)
tzh | Tzeltal, Oxchuc
tzm | Tamazight, Central Atlas
udu | Uduk
uig_arab | Uyghur (Arabic)
uig_latn | Uyghur (Latin)
ukr | Ukrainian
umb | Umbundu
ura | Urarina
urd | Urdu
urd_2 | Urdu (2)
uzn_cyrl | Uzbek, Northern (Cyrillic)
uzn_latn | Uzbek, Northern (Latin)
vai | Vai
vec | Venetian
ven | Venda
ven2 | Venda
vep | Veps
vie | Vietnamese
vmw | Makhuwa
war | Waray-Waray
wln | Walloon
wol | Wolof
wwa | Waama
xho | Xhosa
xsm | Kasem
yad | Yagua
yao | Yao
yap | Yapese
ydd | Yiddish, Eastern
ykg | Yukaghir, Northern
yor | Yoruba
yrk | Nenets
yua | Maya, Yucatán
zam | Zapotec, Miahuatlán
zdj | Comorian, Ngazidja
zgh | Tamazight, Standard Morocan
zro | Záparo
ztu | Zapotec, Güilá
zul | Zulu
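
Several languages appear under more than one code, often as a base code plus an underscore suffix (for example roh and roh_puter). Assuming the codes in this table are the keys of the objects returned by top() and min(), the variants of one language could be collected as sketched below. variantsOf is a hypothetical helper name, and codes that do not use the underscore pattern (such as fuv2 or tha2) are not matched by it.

```javascript
// Hypothetical helper (not part of this package): list the keys of `data` that
// equal `base` or start with `base` followed by an underscore.
function variantsOf(data, base) {
  return Object.keys(data).filter(
    (code) => code === base || code.startsWith(base + '_')
  )
}

// Stand-in for the object returned by `await min()`, with empty lists.
const mockData = {roh: [], roh_puter: [], roh_rumgr: [], ron_1953: [], rus: []}

console.log(variantsOf(mockData, 'roh')) // [ 'roh', 'roh_puter', 'roh_rumgr' ]
```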

Types

This package is fully typed with TypeScript. It exports no additional types.

Compatibility

This package is at least compatible with all maintained versions of Node.js. As of now, that is Node.js 14.14+ and 16.0+. It also works in Deno and modern browsers.

Contribute

Yes please! See How to Contribute to Open Source.

Security

This package is safe.

License

MIT © Titus Wormer
