1.0.1 • Published 1 year ago

@scriptin/is-han v1.0.1

License
MIT

is-han

Unicode-aware Han characters (hanzi, kanji, hanja) detection

npm i @scriptin/is-han

Usage

Note: You need to use Unicode-aware methods/operators in JavaScript, such as Array.from(str) and for/of loops, in order to process all Han characters. Some of them have code points that don't fit into 16 bits, and JavaScript strings use UTF-16.

Examples of correct usage:

import { isHan } from "@scriptin/is-han";

for (const char of "漢字") {
  console.log(isHan(char));
}

// or

Array.from("漢字").filter(isHan);

Incorrect usage:

'𠀋'.split('').filter(isHan); // -> empty array
// because the code point of '𠀋' is U+2000B, which does not fit into 16 bits,
// so split('') breaks it into the two halves of a surrogate pair
console.log('𠀋'.split('')); // -> ['\uD840', '\uDC0B']

// Compare to:
console.log(Array.from('𠀋')); // -> ['𠀋']
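
The difference can be checked without the library. The following is a minimal sketch that uses only built-in string methods and the same character as above; the variable names are illustrative.

```javascript
const char = "𠀋"; // U+2000B, outside the Basic Multilingual Plane

// .length and .split("") count UTF-16 code units, not characters.
const codeUnits = char.split("");
console.log(char.length); // 2
console.log(codeUnits.length); // 2

// Array.from() and for/of iterate by code point.
const codePoints = Array.from(char);
console.log(codePoints.length); // 1

const seen = [];
for (const c of "漢𠀋字") {
  seen.push(c);
}
console.log(seen.length); // 3, although "漢𠀋字".length is 4
```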

API

  • isHan(char: string): boolean - Checks if a character is a Han script character: hanzi, kanji, hanja

  • isHanExt(char: string): boolean - Checks if a character is an "extended" Han script character. Useful when you're looking for obscure characters which contain Han script, e.g. symbols like 🈲, 🈯, 🈳, 🉐, 🉑, ㊄, ㋋, ㏾, ㍰, etc. "Extended" means all Unicode characters which:

    • contain Han characters with additional wrappers, such as characters inside brackets, circles, etc.
    • contain multiple "compacted" Han characters, such as Japanese "square era names", etc.
    • contain parts of Han characters, such as CJK strokes
    • 々 IDEOGRAPHIC ITERATION MARK (see below)
    • 〆 IDEOGRAPHIC CLOSING MARK (see below)
  • isIterationMark(char: string): boolean - Checks if character is 々 IDEOGRAPHIC ITERATION MARK. This mark means "repeat previous character". Can be useful if you want to replace this mark with the character it repeats/represents. See Wiktionary article about 々

  • isClosingMark(char: string): boolean - Checks if character is 〆 IDEOGRAPHIC CLOSING MARK. This mark is used in place of another Han character. See Wiktionary article about 〆

  • Some constants are also exported in case you need to extend the functionality.
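
As an illustration of the replacement use case mentioned for isIterationMark, here is a minimal sketch. The helper expandIterationMarks is hypothetical and not part of this library, and the local isIterationMark is a stand-in so the snippet runs on its own; in real code you would import the library's function instead. It assumes each mark repeats exactly one preceding character, which is a simplification.

```javascript
// Stand-in for the library's isIterationMark, so the sketch runs on its own.
const isIterationMark = (char) => char === "々";

// Hypothetical helper: replace each 々 with the character before it.
function expandIterationMarks(text, isMark = isIterationMark) {
  const result = [];
  // for/of iterates by code point, so characters outside the BMP stay intact.
  for (const char of text) {
    const previous = result[result.length - 1];
    // A mark with nothing before it is kept unchanged.
    result.push(isMark(char) && previous !== undefined ? previous : char);
  }
  return result.join("");
}

console.log(expandIterationMarks("時々")); // 時時
console.log(expandIterationMarks("々")); // 々
```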

FAQ

❓ Why do I have to use Array.from(str) and for/of?

Because JavaScript (and TypeScript) strings use UTF-16, and some of the more recent additions to Unicode don't fit into 16 bits. In such cases, a character is represented by a surrogate pair, that is, two 16-bit code units. Array.from() and for/of were added in ECMAScript 2015 and are Unicode-aware.

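This library cannot change this JavaScript behavior, so you have to use these two methods and avoid APIs that work on individual UTF-16 code units, such as String.prototype.split(''), String.prototype.charCodeAt(), and index-based loops over String.prototype.codePointAt().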
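
To make the surrogate mechanism concrete, the sketch below works through the standard UTF-16 arithmetic for U+2000B. The helper name toSurrogatePair is hypothetical and is not exported by this library.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x2000b);
console.log(high.toString(16), low.toString(16)); // d840 dc0b
console.log(String.fromCharCode(high, low) === "𠀋"); // true
```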

❓ Can I detect language (Chinese/Japanese/Korean) for a given Han character?

No. Because of Han unification, most CJK characters are represented with shared code points. Each code point can be associated with multiple versions/variants of the same character, including regional, stylistic, and other variations. In order to determine a language, you need to know some context. For example, the language can be set as an attribute of a web page or a PDF document, or as a setting in an operating system.

This library doesn't provide methods to distinguish between languages.

❓ Can I distinguish between Traditional and Simplified Chinese characters?

In some cases, yes. In others, traditional and simplified variants share the same code points. See this article. For a sufficiently long text, you can determine whether it's traditional or simplified by looking for specific code points.

This library doesn't provide methods to distinguish between traditional and simplified scripts.
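
If you want to try the "specific code points" approach yourself, a rough sketch could look like the following. Everything here is an assumption for illustration: guessChineseScript and both marker sets are hypothetical, the sets hold only six well-known pairs, and a usable check would need much larger tables.

```javascript
// Tiny illustrative samples; a real check would need far larger tables.
const SIMPLIFIED_MARKERS = new Set(Array.from("们这说为对还"));
const TRADITIONAL_MARKERS = new Set(Array.from("們這說為對還"));

// Hypothetical helper: count marker characters and report the majority.
function guessChineseScript(text) {
  let simplified = 0;
  let traditional = 0;
  for (const char of text) {
    if (SIMPLIFIED_MARKERS.has(char)) simplified += 1;
    else if (TRADITIONAL_MARKERS.has(char)) traditional += 1;
  }
  if (simplified === traditional) return "unknown";
  return simplified > traditional ? "simplified" : "traditional";
}

console.log(guessChineseScript("我们这样说")); // simplified
console.log(guessChineseScript("我們這樣說")); // traditional
console.log(guessChineseScript("山水")); // unknown
```

Short inputs often contain no marker characters at all, which is why the sketch returns "unknown" instead of guessing.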