1.0.1 • Published 6 years ago

@dangbh1002/language-data v1.0.1

License: ISC

Build language data

Good data is essential for a language learning application. This sample code helps you generate sentence data: indexed JSON for any pair of languages covered by the raw data.

The raw data comes from the Tatoeba project and contains more than 6 million sentences across its languages.

Prepare

1. Install

Install via github:

git clone https://github.com/dangbh1002/language-data.git

Install via npm:

npm install @dangbh1002/language-data

2. Required

Download the three large files below at this link and put them in the ./rawData directory:

  • links.csv
  • sentences_with_audio.csv
  • sentences.csv
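
Before building, it can help to confirm that all three files are in place. The following is a minimal sketch, not part of the package: `missingRawFiles` is a hypothetical helper, and it assumes you run it from the project root so that ./rawData resolves correctly.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// File names taken from the list above.
const REQUIRED_FILES = ["links.csv", "sentences_with_audio.csv", "sentences.csv"];

// Hypothetical helper: returns the required file names that are missing from `dir`.
function missingRawFiles(dir) {
  return REQUIRED_FILES.filter((name) => !fs.existsSync(path.join(dir, name)));
}

const missing = missingRawFiles("./rawData");
if (missing.length > 0) {
  console.error(`Missing in ./rawData: ${missing.join(", ")}`);
} else {
  console.log("All raw data files are in place.");
}
```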

3. Project tree

  • build/
  • node_modules/
  • rawData/
    • links.csv
    • sentences_with_audio.csv
    • sentences.csv
  • src/
    • language-data.js
    • letter-code.js
  • .gitignore
  • index.js
  • package.json
  • README.md
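
To illustrate how the raw files relate to each other, here is a rough sketch of joining sentences with their translations. It is not the code in src/language-data.js. It assumes that sentences.csv is tab-separated with the columns id, language code, and text, and that links.csv is tab-separated with the columns sentence id and translation id. The ids in the sample rows are made up, and `parseTsv` and `pairSentences` are hypothetical names.

```javascript
// Made-up sample rows (ids are illustrative only). The real files are far
// larger and would need to be streamed rather than held in a string.
const sentencesTsv = [
  "1\teng\tI have to go to sleep.",
  "2\tvie\tTôi phải đi ngủ.",
  "3\teng\tThe password is Muiriel.",
  "4\tvie\tMật mã là Muiriel.",
].join("\n");
const linksTsv = ["1\t2", "2\t1", "3\t4", "4\t3"].join("\n");

// Split tab-separated text into rows of columns, skipping empty lines.
function parseTsv(text) {
  return text
    .split("\n")
    .filter((line) => line !== "")
    .map((line) => line.split("\t"));
}

// Hypothetical helper: pair every origin-language sentence with its
// linked translation in the target language.
function pairSentences(sentences, links, originCode, targetCode) {
  const byId = new Map();
  for (const [id, lang, text] of parseTsv(sentences)) {
    byId.set(id, { lang, text });
  }
  const pairs = [];
  for (const [fromId, toId] of parseTsv(links)) {
    const from = byId.get(fromId);
    const to = byId.get(toId);
    if (from && to && from.lang === originCode && to.lang === targetCode) {
      pairs.push({ [originCode]: from.text, [targetCode]: to.text });
    }
  }
  return pairs;
}

console.log(pairSentences(sentencesTsv, linksTsv, "eng", "vie"));
```

Indexing sentences by id first keeps the join to a single pass over the links, which matters when the links file has millions of rows.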

How To Use

1. Find the language you want to translate in the ./src/letter-code.js file.

/**
 * This sample code only has 5 languages, but you can add any language you want.
 * You can find the 3-letter code of every language in ./rawData/sentences.csv
 */

// Keys are the language names passed on the command line.
let letterCode = {
    "english": "eng",
    "vietnamese": "vie",
    "japanese": "jpn",
    "french": "fra",
    "german": "deu"
};
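
As a sketch of how such a map can be extended and queried, the snippet below adds one extra entry and looks up a code by name. The "spanish" entry and the `toLetterCode` helper are assumptions added for illustration; they are not part of the sample code.

```javascript
// Partial copy of the sample map, plus one hypothetical extra entry.
const letterCode = {
  english: "eng",
  vietnamese: "vie",
  spanish: "spa", // assumption: added for illustration, not in the sample
};

// Hypothetical helper: translate a language name into its 3-letter code.
function toLetterCode(name) {
  const key = name.toLowerCase();
  // Use an own-property check so inherited names such as "constructor" are rejected.
  if (!Object.prototype.hasOwnProperty.call(letterCode, key)) {
    throw new Error(`Unsupported language: ${name}`);
  }
  return letterCode[key];
}

console.log(toLetterCode("English")); // eng
```

Failing early on an unknown name avoids building an empty JSON file for a language that is not in the map.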

2. Run the build command

npm start @OriginLanguage @TargetLanguage

3. Example commands

  • Build indexed JSON for translating English to Vietnamese:
npm start english vietnamese
  • Build indexed JSON for translating English to Japanese:
npm start english japanese
  • Build indexed JSON for translating Japanese to English:
npm start japanese english
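
The commands above pass two language names to the script. A minimal sketch of how a script might read them, assuming they arrive through `process.argv` and that the output name follows the origin-to-target pattern shown in the Result section; `planBuild` is a hypothetical helper, not the package's own function.

```javascript
// Hypothetical helper: turn command-line arguments into a build plan.
function planBuild(argv) {
  // argv[0] is the node binary and argv[1] the script path,
  // so the user's arguments start at index 2.
  const [origin, target] = argv.slice(2);
  if (!origin || !target) {
    throw new Error("Usage: npm start <originLanguage> <targetLanguage>");
  }
  return { origin, target, outFile: `./build/${origin}-to-${target}.json` };
}

const plan = planBuild(["node", "index.js", "english", "vietnamese"]);
console.log(plan.outFile); // ./build/english-to-vietnamese.json
```

In a real script you would call `planBuild(process.argv)` instead of passing a hand-written array.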

4. Result

After running the command, you'll find the JSON file in the ./build directory.

For example: ./build/english-to-vietnamese.json

[   
    {"eng":"I have to go to sleep.","vie":"Tôi phải đi ngủ.","author":"tatoeba","soundID":"1277","syllables":6},
    {"eng":"The password is Muiriel.","vie":"Mật mã là Muiriel.","author":"tatoeba","soundID":"1283","syllables":4},
    {"eng":"I just don't know what to say.","vie":"Tôi không biết nên nói gì cả.","author":"tatoeba","soundID":"1288","syllables":7},
    {"eng":"I don't know what you mean.","vie":"Tôi không biết ý của bạn là gì.","author":"tatoeba","soundID":"1408","syllables":6}
    ...
]
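
Once built, the records can be filtered by their fields. The sketch below keeps sentences at or below a syllable limit and lists the shortest first, which might suit beginner lessons. It uses two records copied from the sample above; `shortestFirst` is a hypothetical helper.

```javascript
// Two records copied from the sample output above.
const records = [
  { eng: "I have to go to sleep.", vie: "Tôi phải đi ngủ.", author: "tatoeba", soundID: "1277", syllables: 6 },
  { eng: "The password is Muiriel.", vie: "Mật mã là Muiriel.", author: "tatoeba", soundID: "1283", syllables: 4 },
];

// Hypothetical helper: keep records at or below a syllable limit, shortest first.
// filter() returns a new array, so sorting does not reorder the input.
function shortestFirst(items, maxSyllables) {
  return items
    .filter((item) => item.syllables <= maxSyllables)
    .sort((a, b) => a.syllables - b.syllables);
}

for (const record of shortestFirst(records, 6)) {
  console.log(`${record.eng} -> ${record.vie}`);
}
```

With a real build you would load the array from the generated file, for example with `JSON.parse(fs.readFileSync(path, "utf8"))`, instead of defining it inline.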

Created by Brian Dhang. Powered by Tatoeba, Node.js, JavaScript, csvtojson and love.

All rights reserved.