0.7.0 • Published 2 years ago

mtl-text-processor v0.7.0


mtl-text-processor


A set of tools to pre-process text for machine translation and to post-process the translated results back into full sentences.

Get started

Install:

npm install mtl-text-processor

Use it in your project:

const { TextProcessor } = require('mtl-text-processor');
let myProcessor = new TextProcessor({myOptions});
let myProcess = myProcessor.process("My text.");
let translatableStrings = myProcess.getTranslatableLines();
/**
 * Send translatableStrings (Array<string>) to the translator.
 * Store response as an array in translationResult.
 */
myProcess.setTranslatedLines(...translationResult);
let translatedText = myProcess.getTranslatedLines();

Usage varies slightly with the translator you are using. If your translator accepts arrays and returns arrays, you're good to go. If it can only work with one string at a time, you would normally have to collect the entire array of translations before using it. However, you don't have to deliver all translations at once:

// ...
translatableStrings.forEach(translatableLine => {
    let translation = sendToMyTranslator(translatableLine);
    myProcess.setTranslatedLines([translation]);
});
let translatedText = myProcess.getTranslatedLines();
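
Most real translators respond asynchronously, so a loop like the one above has to wait for each response. The following is a minimal sketch, assuming a hypothetical `translateOne(line)` function that returns a Promise of the translated string. Neither `translateOne` nor `translateSequentially` is part of mtl-text-processor; they only stand in for whatever client your translator provides.

```javascript
// Hypothetical helper, not part of mtl-text-processor.
// `translateOne` is an assumed async function: (string) => Promise<string>.
async function translateSequentially(lines, translateOne) {
    const results = [];
    for (const line of lines) {
        // Wait for each response before sending the next request,
        // so the results keep the same order as the input lines.
        results.push(await translateOne(line));
    }
    return results;
}
```

The returned array has one translation per translatable line, in the original order, and can then be handed back to the process.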

Complete Example using Axios and Sugoi Translator to translate a .txt file

In this example we use Axios to send HTTP requests to a Sugoi Translator server listening on port 14366. You'll need to install both axios and mtl-text-processor:

npm install mtl-text-processor && npm install axios

const axios = require('axios');
const { TextProcessor } = require('mtl-text-processor');
const fs = require('fs');
let file = "MyTextFile";
let text = fs.readFileSync(`./${file}.txt`, {encoding : "utf8"});
let myProcessor = new TextProcessor();
let myProcess = myProcessor.process(text);
let toTranslate = myProcess.getTranslatableLines();

let translations = [];

let maxRequests = 10;
let i = 0;
function sendBatch () {
    let batch = [];
    while (i < toTranslate.length) {
        batch.push(toTranslate[i++]);
        if (batch.length >= maxRequests) {
            break;
        }
    }

    // Sugoi Translator is assumed to be running on this machine.
    axios.post("http://127.0.0.1:14366/", {
        message: "translate sentences",
        content : batch
    }).then(res => {
        translations.push(...res.data);
        if (i < toTranslate.length) {
            sendBatch();
        } else {
            myProcess.setTranslatedLines(...translations);
            fs.writeFileSync(
                `./${file}_translated.txt`,
                myProcess.getTranslatedLines().join("\n"),
                {encoding : "utf8"}
            );
        }
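    }).catch(err => {
        // Without this handler a failed request would go unnoticed.
        console.error("Translation request failed:", err.message);
    });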
}

sendBatch();
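
The batching logic above can also be written with async/await, which keeps the loop and the requests in one place. This is a minimal sketch under assumptions: `chunk` and `translateInBatches` are illustrative names, and `postBatch` is a hypothetical async function that takes an array of strings and resolves to an array of translations (for example, a thin wrapper around the `axios.post` call above that returns `res.data`).

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
    const chunks = [];
    for (let start = 0; start < items.length; start += size) {
        chunks.push(items.slice(start, start + size));
    }
    return chunks;
}

// Hypothetical helper, not part of mtl-text-processor.
// `postBatch` is an assumed async function:
// (Array<string>) => Promise<Array<string>>.
async function translateInBatches(lines, postBatch, batchSize = 10) {
    const translations = [];
    for (const batch of chunk(lines, batchSize)) {
        // Batches are sent one after another, so the order is preserved.
        const translated = await postBatch(batch);
        translations.push(...translated);
    }
    return translations;
}
```

Sending batches strictly in sequence is slower than sending them in parallel, but it avoids flooding a local translation server with simultaneous requests.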

Why?

Machine Translators are great. But most Machine Translators can't handle being thrown a ton of text at once - if they don't outright crash, they will produce less than ideal translations.

The main goal of this Text Processor is to safeguard untranslatable symbols, but the way this is achieved also produces smaller sentences, which are translated in isolation and then put back together. This is a clear win when the sentences have no relation to each other (for example, two entirely separate sentences that only appear in sequence, with no references between them). Translating in isolation also helps when the isolated part is a symbol, such as a name or a sub-sentence inside brackets, because it is then translated more consistently.

This processor attempts to give the machine the contextual information that "a thing is in this part of the sentence", without overwhelming it with the entire content of that symbol.

Example
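
To make the placeholder idea concrete, here is a minimal, self-contained sketch. It is an illustration only and not the library's implementation: `escapeSymbols`, `restoreSymbols`, the numbered `{{0}}` placeholder style and the sample pattern are all assumptions chosen for this example.

```javascript
// Illustration only: NOT how mtl-text-processor is implemented.
// Replace every match of `pattern` (a global RegExp) with a numbered
// placeholder such as {{0}}, remembering the original text for later.
function escapeSymbols(text, pattern) {
    const stored = [];
    const escaped = text.replace(pattern, match => {
        stored.push(match);
        return `{{${stored.length - 1}}}`;
    });
    return { escaped, stored };
}

// Put the original symbols back in place of their placeholders.
// Placeholders with an unknown index are left untouched.
function restoreSymbols(translated, stored) {
    return translated.replace(/\{\{(\d+)\}\}/g, (whole, index) =>
        Number(index) < stored.length ? stored[Number(index)] : whole
    );
}

// Script codes like \n[1] are hidden from the translator...
const demo = escapeSymbols('Hello, \\n[1]! Take \\v[5] gold.', /\\[nv]\[\d+\]/g);
console.log(demo.escaped); // Hello, {{0}}! Take {{1}} gold.

// ...and restored once the translation comes back.
console.log(restoreSymbols('Hola, {{0}}! Toma {{1}} de oro.', demo.stored));
// Hola, \n[1]! Toma \v[5] de oro.
```

The translator only ever sees the placeholders, so it still knows that "a thing" sits at that position without having to interpret the script code itself.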

Overview of uses

  • Escaping Symbols: Lets you set up Regular Expressions to catch and remove untranslatable symbols, like scripts. It can also be used to save the translation of names for later: the translator is fed a Placeholder Symbol instead of the original sequence. Most translators support certain symbols that they understand as "a thing", so the translator keeps the benefit of context without having to deal with the hard-to-understand original sequence.
    • Can be used for untranslatable scripts (\n0 used as a name, or an actual script call).
    • Can be used if you have a manually crafted translation for a name (instead of having the translator translate the name, it will keep the Placeholder).
    • There are many choices of placeholder, and the ideal one varies with the translator. Some translators, such as Google and DeepL, understand symbols like {{A}} as something that is relevant to the context but should not be touched. Other translators handle symbols like %A better.
  • Isolating Sentences: Internal parts of a sentence can be isolated for translation. Their place in the original sentence will be replaced by a placeholder.
  • Sentence Splitting: Sentences can be split in reasonable points so that each part is translated in isolation.
    • Most translators cannot properly handle multiple paragraphs in a row, so splitting sentences on new paragraphs yields better translation quality. It's faster and cheaper, too!
  • Protected Untranslatables: Parts of a sentence, like the bounding brackets of a quote, can be withheld from the translator entirely. The TextProcessor stores them until it gets the translation for what's inside, then puts everything back together.
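
Sentence splitting and protected untranslatables can be pictured with a small standalone sketch. Again, this is not the library's implementation: `splitProtected`, `translateInParts`, the bracket set and the synchronous `translate` callback are assumptions made for the illustration.

```javascript
// Illustration only: NOT how mtl-text-processor is implemented.
// Separate the bounding brackets of a line from its inner text, so that
// only the inner text has to be sent to the translator.
function splitProtected(line) {
    // All three groups may be empty, so this pattern always matches.
    const match = /^([「『(]*)([\s\S]*?)([」』)]*)$/.exec(line);
    return { open: match[1], inner: match[2], close: match[3] };
}

// Split the text on line breaks, translate each part in isolation with
// the assumed synchronous `translate` callback, then reassemble it.
function translateInParts(text, translate) {
    return text
        .split('\n')
        .map(line => {
            const { open, inner, close } = splitProtected(line);
            // Lines with nothing to translate are passed through as-is.
            return inner === '' ? line : open + translate(inner) + close;
        })
        .join('\n');
}
```

For example, with a stand-in translator that upper-cases its input, `translateInParts('「abc」\n\ndef', s => s.toUpperCase())` returns `'「ABC」\n\nDEF'`: the brackets and the blank line never reach the translator.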

Author

@reddo9999

License

GNU General Public License v3.0


README created with ❤️ by md-generate
