0.2.0 • Published 8 months ago

@ioris/tokenizer-kuromoji v0.2.0

Weekly downloads
-
License
MIT
Repository
github
Last release
8 months ago

@ioris/tokenizer-kuromoji

Overview

@ioris/tokenizer-kuromoji is a tokenizer library for lyrics analysis that uses the Japanese morphological analysis engine "Kuromoji". This library is designed to be used in conjunction with @ioris/core, providing functionality to detect natural breaks in lyrics and divide them into meaningful phrases.

Features

  • Rule sets specialized for phrase segmentation in lyrics
  • Proper handling of mixed Japanese and English lyrics
  • Special processing for parentheses and symbols
  • Natural break point detection using part-of-speech information

Installation

npm install @ioris/tokenizer-kuromoji @ioris/core kuromoji
# or
yarn add @ioris/tokenizer-kuromoji @ioris/core kuromoji

Basic Usage

import path from "path";
import { Paragraph } from "@ioris/core";
import { builder } from "kuromoji";
import { LineArgsTokenizer } from "@ioris/tokenizer-kuromoji";

// Initialize kuromoji tokenizer
const kuromojiBuilder = builder({
  dicPath: path.resolve(__dirname, "node_modules/kuromoji/dict")
});

// Get kuromoji tokenizer (Promise)
const getTokenizer = () => new Promise((resolve, reject) => {
  kuromojiBuilder.build((err, tokenizer) => {
    if (err) reject(err);
    resolve(tokenizer);
  });
});

// Usage example
async function example() {
  // Get kuromoji tokenizer instance
  const tokenizer = await getTokenizer();

  // Prepare lyrics data
  const lyricData = {
    lyricID: "1",
    position: 1,
    timelines: [
      {
        wordID: "",
        begin: 1,
        end: 5,
        text: "あの花が咲いたのは、そこに種が落ちたからで"
      }
    ]
  };

  // Create and initialize Paragraph instance
  const paragraph = await new Paragraph({
    ...lyricData,
    lineTokenizer: (lineArgs) => LineArgsTokenizer({
      lineArgs,
      tokenizer
    })
  }).init();

  // Get processing results
  const lines = paragraph.allLines();
  console.log(lines[0].text());
  // Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"
}

example();

Lyrics Analysis Behavior

This tokenizer divides sentences considering the following characteristics of lyrics:

  • Natural breaks based on part-of-speech information (e.g., boundaries between particles and verbs)
  • Well-balanced division based on length (appropriate division when lines are too long)
  • Preservation of phrases enclosed in parentheses or quotation marks
  • Appropriate breaks in mixed Japanese and English text
  • Proper handling of repetitive expressions (e.g., "Baby Baby Baby")

Applying Custom Rules

You can also add your own breakpoint rules:

import { LineArgsTokenizer } from "@ioris/tokenizer-kuromoji";
import { DEFAULT_BRAKE_RULES } from "@ioris/tokenizer-kuromoji";

// Example with custom rules
const customRules = [
  ...DEFAULT_BRAKE_RULES,
  {
    // Define your own rule
    current: {
      surface_form: [/^(.*[specific string]).*$/]
    },
    after: {
      pos: [["noun", false]]
    }
  }
];

// Use custom rules
const result = LineArgsTokenizer({
  lineArgs,
  tokenizer,
  brakeRules: customRules
});

License

MIT

0.1.13

10 months ago

0.1.14

9 months ago

0.2.0

8 months ago

0.1.12

1 year ago

0.1.10

1 year ago

0.1.11

1 year ago

0.1.8

1 year ago

0.1.9

1 year ago

0.1.6

2 years ago

0.1.5

2 years ago

0.1.2

2 years ago

0.1.4

2 years ago

0.1.3

2 years ago

0.1.0

2 years ago

0.1.1

2 years ago

0.0.1

2 years ago