Biku_ssml_parser NPM

Shorthand SSML for Bikubot

What is this

This is a custom, shortened way to control the TTS voices of Bikubot. It uses AWS Polly SSML tags to control how the voice sounds, but shortens and simplifies the tags to make them quicker and easier to write.

How it works

Any change to how something is spoken starts with # followed by the modifications you want to apply to the voice. Each modification is represented by a letter, for example p for pitch, and some modifications also need a number to set the scale of the modification. Finally, the text you want the modification to apply to is encapsulated by [ and ]. Because of this, the characters [ and ] are reserved, and if they are used within a voice modification they need to form a matching pair.

As an example, the SSML <prosody pitch="+50%" rate="200%">This is a test</prosody> would in shorthand be #p150r200[This is a test]. Note that it's not one-to-one for some things: pitch in normal SSML goes from -30% to +50%, but the shorthand only works with positive numbers, so a conversion is done where the shorthand for pitch starts at 100 instead of 0.
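
The pitch conversion described above can be sketched as follows. This is only an illustration: the function names `pitchToSsmlPercent` and `prosodyTag` are made up for this example and are not part of the package's API, and the rate is assumed to pass through unchanged as a percentage.

```javascript
// Hypothetical helper: converts a shorthand pitch value (100 = unchanged)
// into the signed percentage used by <prosody pitch="...">.
function pitchToSsmlPercent(shorthandValue) {
  const offset = shorthandValue - 100; // 150 -> 50, 70 -> -30
  const sign = offset < 0 ? "-" : "+";
  return `${sign}${Math.abs(offset)}%`;
}

// Hypothetical helper: builds the prosody tag for a pitch/rate pair.
// Assumes the shorthand rate is already the SSML percentage.
function prosodyTag(pitch, rate, text) {
  return `<prosody pitch="${pitchToSsmlPercent(pitch)}" rate="${rate}%">${text}</prosody>`;
}

console.log(prosodyTag(150, 200, "This is a test"));
// <prosody pitch="+50%" rate="200%">This is a test</prosody>
```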

You can also mix any modifications. For example, to add a whisper to the example above, the shorthand would be #wp150r200[this is a test]. The order of the modification characters does not matter, so #p150wr200[this is a test] works the same.

If the same modification appears more than once in the same tag, as in #wr20r200[this is a test], only the last occurrence is used. In this case it is treated the same as #wr200[this is a test], and the r20 is thrown away.
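
As a rough sketch of the ordering and last-one-wins rules, the modification part can be read letter by letter into an object, so a repeated letter simply overwrites the earlier value. The name `parseModifiers` is hypothetical and this is not the package's actual parser. It only handles a letter with an optional number; presets such as ++ and arguments in () are left out, and unknown letters are not filtered.

```javascript
// Sketch only: parseModifiers is a hypothetical name, not the package's API.
// Reads the part between "#" and "[" (for example "wr20r200").
function parseModifiers(modPart) {
  const result = {};
  const pattern = /([a-z])(\d*\.?\d+)?/g;
  // Lowercasing first makes the modification letters case insensitive.
  for (const [, letter, value] of modPart.toLowerCase().matchAll(pattern)) {
    // A later occurrence of the same letter overwrites the earlier one.
    result[letter] = value === undefined ? true : Number(value);
  }
  return result;
}

console.log(parseModifiers("wr20r200")); // { w: true, r: 200 }
```

Because the result is keyed by letter, `parseModifiers("wp150r200")` and `parseModifiers("p150wr200")` contain the same values.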

The shorthand also supports nested tags, so you could do something like #p150[this is a #w[test]]. All modifications are also case insensitive, so #P150L(Sv-Se)[test] is the same as #p150l(sv-se)[test].
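
To show how nesting can be resolved, here is a small recursive sketch. It is not the package's implementation: `toSsml`, `findMatchingBracket` and `WRAPPERS` are names invented for this example, and only the three modifications that take no value (w, s and e) are handled to keep it short.

```javascript
// Only the value-less modifications are handled in this sketch.
const WRAPPERS = {
  w: ['<amazon:effect name="whispered">', "</amazon:effect>"],
  s: ['<amazon:effect phonation="soft">', "</amazon:effect>"],
  e: ['<say-as interpret-as="expletive">', "</say-as>"],
};

// Returns the index of the "]" matching the "[" at openIndex, or -1.
function findMatchingBracket(text, openIndex) {
  let depth = 0;
  for (let i = openIndex; i < text.length; i++) {
    if (text[i] === "[") depth++;
    else if (text[i] === "]" && --depth === 0) return i;
  }
  return -1;
}

function toSsml(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const header = /^#([wse]+)\[/i.exec(text.slice(i));
    const open = header ? i + header[0].length - 1 : -1;
    const close = header ? findMatchingBracket(text, open) : -1;
    if (close === -1) {
      out += text[i++]; // not a complete modification: keep the character
      continue;
    }
    const inner = toSsml(text.slice(open + 1, close)); // recurse for nesting
    const letters = [...new Set(header[1].toLowerCase())];
    // The first letter becomes the outermost tag.
    out += letters.reduceRight(
      (body, l) => WRAPPERS[l][0] + body + WRAPPERS[l][1],
      inner
    );
    i = close + 1;
  }
  return out;
}

console.log(toSsml("#w[this is #s[a test]]"));
```

Text that does not form a complete modification, such as `#w[oops` with no closing bracket, is passed through unchanged.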

The bot also does its best to fix any issues. For example, if a value is too high, it is set to the highest possible value for that modification. The possible modifications and their values are listed below.

Short Notes

A voice modification starts with # followed by one or more of the modifications found below, and ends with the speech you want modified encapsulated in [ and ].
The characters [ and ] are reserved, and when used outside their intended use case (marking what to modify) they need to be used in pairs.
You can do nested modifications.
- Example:
  - #p150[this is a nested pitch #w[whisper test]]
  - #p150[this is #w[deeply #r120s[nested and #t120[going deeper], and] now] back up]
  - #v11[#w[testing #s[softly whispering]] with a bit higher volume, #t50[ending with some timbre]]
You can add more than one modification per voice modification; the order does not matter.
- Example:
  - #p150w[this is a modified pitch with whisper]
  - #wst50l(sv-se)[this is a soft and whispering Swedish language voice with modified timbre]
  - #b.5t50p150r180[This starts with a 0.5s break and modified pitch, rate and timbre]
The modification part is case insensitive.
Any modification value outside its min or max range will be set to its min or max (whichever is closest).
Any modification value that is not valid will be set to a normalized default value.
Any characters that do not represent a modification will be ignored if they are part of the modification part.
A faulty voice modification, such as one with a space in the modification part or one that is not correctly encapsulated, will be read as normal text.
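
The range and default rules in these notes could look roughly like the sketch below. The min, max and default numbers are copied from the Modifications section; the table layout and the name `normalizeValue` are assumptions for this example, as is using the listed default as the "normalized default value".

```javascript
// Ranges copied from the Modifications section of this README.
const RANGES = {
  b: { min: 0, max: 10, fallback: 1 },     // break, seconds
  d: { min: 0, max: 60, fallback: 1 },     // max duration, seconds
  p: { min: 70, max: 150, fallback: 100 }, // pitch
  r: { min: 20, max: 200, fallback: 100 }, // rate
  t: { min: 50, max: 200, fallback: 100 }, // timbre
  v: { min: 4, max: 14, fallback: 10 },    // volume
};

// Hypothetical helper: clamps a raw value into the range for its letter.
function normalizeValue(letter, raw) {
  const range = RANGES[letter.toLowerCase()];
  if (!range) return undefined; // not a numeric modification
  const value = Number.parseFloat(raw);
  if (!Number.isFinite(value)) return range.fallback; // invalid -> default
  return Math.min(range.max, Math.max(range.min, value));
}

console.log(normalizeValue("p", 300));   // 150
console.log(normalizeValue("r", "5"));   // 20
console.log(normalizeValue("v", "abc")); // 10
```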

Modifications

Break

Break is represented by the letter b and supports either a following numeric value or + , ++ , - , --. The SSML equivalent is the <break time=""> tag. The break happens before the given text, if there is any in the encapsulating [].

Effect: Creates a break in the speech at the position of the tag for the given amount of time in seconds.
Characters: These represent the same preset values that normal SSML has.
- ++ = x-strong
- + = strong
- - = weak
- -- = x-weak
Numeric:
- default: 1.0
- max: 10.0
- min: 0.0
Example:
- Characters: #b+[] is equal to <break strength="strong" />
- Numeric:
  - #b1.2[A test] is equal to <break time="1200ms" />A test
  - #b.5[] is equal to <break time="500ms" />
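
A minimal sketch of the numeric case, assuming the seconds are clamped to the range above and written to the time attribute in milliseconds. `breakToSsml` is a hypothetical name, and the preset characters are not covered here.

```javascript
// Hypothetical helper: turns a break length in seconds into a <break> tag.
// Uses the documented default (1.0) and range (0.0 to 10.0).
function breakToSsml(seconds = 1.0) {
  const clamped = Math.min(10, Math.max(0, seconds));
  const ms = Math.round(clamped * 1000); // round away floating point noise
  return `<break time="${ms}ms" />`;
}

console.log(breakToSsml(1.2)); // <break time="1200ms" />
console.log(breakToSsml(0.5)); // <break time="500ms" />
```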

Emphasis

Emphasis is represented by the letter m and needs a following - , + , or ++. The SSML equivalent is the <emphasis level=""> tag.

Effect: Tries to emphasize or de-emphasize the word/sentence.
Characters: These represent the same preset values that normal SSML has.
- ++ = strong
- + = moderate
- - = reduced
Example:
- #m++[A test] is equal to <emphasis level="strong">A test</emphasis>
- #m-[A test] is equal to <emphasis level="reduced">A test</emphasis>

Expletive/Beep

Expletive/beep is represented by the letter e and does not need any additional data. The SSML equivalent is the <say-as interpret-as="expletive"> tag.

Effect: Beeps out the content.
Example: #e[A test] is equal to <say-as interpret-as="expletive">A test</say-as>

IPA (International Phonetic Alphabet)

IPA is represented by the letter i, followed by the phonetic symbols for the pronunciation encapsulated in (). The SSML equivalent is the <phoneme alphabet="ipa" ph=""> tag.

Effect: Changes how the word(s) encapsulated in [] are spoken.

Example: #i(pɪˈkɑːn)[pecan] is equal to <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>

Language

Language is represented by the letter l, followed by the language code for the language you want to use encapsulated in (). The SSML equivalent is the <lang xml:lang=""> tag, for example <lang xml:lang="fr-FR">.

Effect: Changes what language the voice will use to try to speak the words.
Language codes:

Example:
- #l(ja-jp)[A test] is equal to <lang xml:lang="ja-JP">A test</lang>
- #l(en-us)[A test] is equal to <lang xml:lang="en-US">A test</lang>

Max Duration

Max duration is represented by the letter d and needs a following numeric value. The SSML equivalent is the <prosody amazon:max-duration=""> tag. There are limits on how much the speech can be sped up, and if it already fits within the duration no changes are made.

Effect: Tries to speed up the speech so it fits within the given time.
Numeric:
- default: 1.0
- max: 60.0
- min: 0.0
Example:
- #d5.3[A test] is equal to <prosody amazon:max-duration="5300ms">A test</prosody>
- #d.5[A test] is equal to <prosody amazon:max-duration="500ms">A test</prosody>

Pitch

Pitch is represented by the letter p and supports either a following numeric value or + , ++ , - , --. The SSML equivalent is the <prosody pitch=""> tag.

Effect: Changes the pitch at which the words are spoken.
Characters: These represent the same preset values that normal SSML has.
- ++ = x-high
- + = high
- - = low
- -- = x-low
Numeric:
- default: 100
- max: 150
- min: 70
Example:
- Characters: #p++[A test] is equal to <prosody pitch="x-high">A test</prosody>
- Numeric: #p150[A test] is equal to <prosody pitch="+50%">A test</prosody>

Soft

Soft speech is represented by the letter s and does not need any additional data. The SSML equivalent is the <amazon:effect phonation="soft"> tag.

Effect: Makes the speech being spoken sound softer.
Example: #s[A test] is equal to <amazon:effect phonation="soft">A test</amazon:effect>

Rate

Rate is represented by the letter r and supports either a following numeric value or + , ++ , - , --. The SSML equivalent is the <prosody rate=""> tag.

Effect: Changes the speed at which the words are spoken.
Characters: These represent the same preset values that normal SSML has.
- ++ = x-fast
- + = fast
- - = slow
- -- = x-slow
Numeric:
- default: 100
- max: 200
- min: 20
Example:
- Characters: #r--[A test] is equal to <prosody rate="x-slow">A test</prosody>
- Numeric: #r150[A test] is equal to <prosody rate="150%">A test</prosody>

Timbre

Timbre is represented by the letter t and supports either a following numeric value or + , ++ , - , --. The SSML equivalent is the <amazon:effect vocal-tract-length=""> tag.

Effect: Changes the timbre of voice.
Characters:
- ++ = 200%
- + = 150%
- - = 75%
- -- = 50%
Numeric:
- default: 100
- max: 200
- min: 50
Example:
- Characters: #t--[A test] is equal to <amazon:effect vocal-tract-length="50%">A test</amazon:effect>
- Numeric: #t50[A test] is equal to <amazon:effect vocal-tract-length="50%">A test</amazon:effect>
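
Since both the preset characters and the numeric form end up as a percentage, one way to picture the timbre handling is a lookup followed by a clamp. This is a sketch under that assumption; `TIMBRE_PRESETS` and `timbreToSsml` are hypothetical names.

```javascript
// Preset characters and their percentages, as listed above.
const TIMBRE_PRESETS = new Map([
  ["++", 200],
  ["+", 150],
  ["-", 75],
  ["--", 50],
]);

// Hypothetical helper: accepts a preset ("++", "+", "-", "--") or a number.
function timbreToSsml(arg, text) {
  const raw = TIMBRE_PRESETS.has(arg)
    ? TIMBRE_PRESETS.get(arg)
    : Number.parseFloat(arg);
  // Invalid input falls back to the default (100); otherwise clamp to 50..200.
  const value = Number.isFinite(raw) ? Math.min(200, Math.max(50, raw)) : 100;
  return `<amazon:effect vocal-tract-length="${value}%">${text}</amazon:effect>`;
}

console.log(timbreToSsml("--", "A test"));
console.log(timbreToSsml(50, "A test")); // same tag as the line above
```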

Volume

Volume is represented by the letter v and supports either a following numeric value or + , ++ , - , --. The SSML equivalent is the <prosody volume=""> tag.

Effect: Changes the volume of the speech.
Characters: These represent the same preset values that normal SSML has.
- ++ = x-loud
- + = loud
- - = soft
- -- = x-soft
Numeric:
- default: 10
- max: 14
- min: 4
Example:
- Characters: #v+[A test] is equal to <prosody volume="loud">A test</prosody>
- Numeric: #v4[A test] is equal to <prosody volume="-6dB">A test</prosody>
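
The numeric example suggests that the shorthand volume is an offset from 10 expressed in decibels (4 becomes -6 dB). The sketch below assumes that this holds for the whole range, which this README does not state explicitly, and uses the volume attribute of the prosody tag named above. `volumeToSsml` is a hypothetical name.

```javascript
// Hypothetical helper. Assumption: dB = shorthand value - 10, so the
// documented range 4..14 maps to -6dB..+4dB and the default 10 to +0dB.
function volumeToSsml(level, text) {
  const clamped = Math.min(14, Math.max(4, level));
  const db = clamped - 10;
  const sign = db < 0 ? "-" : "+";
  return `<prosody volume="${sign}${Math.abs(db)}dB">${text}</prosody>`;
}

console.log(volumeToSsml(4, "A test")); // <prosody volume="-6dB">A test</prosody>
```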

Whisper

Whisper is represented by the letter w and does not need any additional data. The SSML equivalent is the <amazon:effect name="whispered"> tag.

Effect: Makes the words be spoken in a whispering voice.
Example: #w[A test] is equal to <amazon:effect name="whispered">A test</amazon:effect>

Special Effects

There are a few special effects that the shorthand supports. These sounds are represented by the effect name encapsulated by ::, like ::effectname::. Some of these are affected by modifications, as they are created with SSML and TTS; where that is the case it will be noted. The plan for the future is to allow streamers to add their own sounds to this system. Effect names are case insensitive.
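
As an illustration of the ::effectname:: syntax, the sketch below splits a message into speech and effect tokens. The function name `splitEffects`, the allowed characters in an effect name and the effect name `airhorn` are all assumptions made for this example; the real list of effects is not shown here.

```javascript
// Hypothetical helper: splits text into speech and ::effect:: tokens.
// Assumes effect names consist of letters, digits, "_" and "-".
function splitEffects(text) {
  const tokens = [];
  const pattern = /::([a-z0-9_-]+)::/gi;
  let last = 0;
  for (const match of text.matchAll(pattern)) {
    if (match.index > last) {
      tokens.push({ type: "speech", value: text.slice(last, match.index) });
    }
    // Effect names are case insensitive, so store them lowercased.
    tokens.push({ type: "effect", value: match[1].toLowerCase() });
    last = match.index + match[0].length;
  }
  if (last < text.length) {
    tokens.push({ type: "speech", value: text.slice(last) });
  }
  return tokens;
}

console.log(splitEffects("Hello ::Airhorn:: world"));
```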