virastar v0.21.0
Virastar (ویراستار)
Virastar is a Persian text cleaner.
A javascript port of aziz/virastar with lots of help from ebraminio/persiantools
see live demo
Install
npm install virastarUsage
var Virastar = require('virastar');
var virastar = new Virastar();
virastar.cleanup("فارسي را كمی درست تر می نويسيم"); // Outputs: "فارسی را کمی درستتر مینویسیم"Browser
<script src="lib/virastar.js"></script>
<script>
var virastar = new Virastar();
alert(virastar.cleanup("فارسي را كمی درست تر می نويسيم"));
</script>Virastar(text)
text
Type: string
string of persian source to be cleaned.
options
Type: object
Virastar("سلام 123" ,{"fix_english_numbers":false}); // Outputs: "سلام 123"Options and Specifications
Virastar comes with a list of options to control its behavior.
normalize_eol
default: true
- replaces windows end of lines with unix eol (
\n)
decode_htmlentities
default: true
- converts numeral and selected html character-sets into original characters
fix_dashes
default: true
- replaces triple dash to mdash
- replaces double dash to ndash
fix_three_dots
default: true
- removes spaces between dots
- replaces three dots with ellipsis character
normalize_ellipsis
default: true
- replaces more than one ellipsis with one
- replaces (space|tab|zwnj) after ellipsis with one space
normalize_dates
default: true
- re-orders date parts with slash as delimiter
fix_english_quotes_pairs
default: true
- replaces english quote pairs (
“”) with their persian equivalent («»)
fix_english_quotes
default: true
- replaces english quote marks with their persian equivalent
fix_hamzeh
default: true
- replaces
هfollowed by (space|ZWNJ|lrm) follow byیwithهٔ - replaces
هfollowed by (space|ZWNJ|lrm|nothing) follow byءwithهٔ - replaces
هٓor single-characterۀwith the standardهٔ
fix_hamzeh_arabic
default: false
- converts arabic hamzeh
ةtoهٔ
cleanup_rlm
default: true
- converts Right-to-left marks followed by persian characters to zero-width non-joiners (ZWNJ)
cleanup_zwnj
default: true
- converts all soft hyphens (
­) into zwnj - removes more than one zwnj
- cleans zwnj after characters that don't conncet to the next
- cleans zwnj before and after numbers, english words, spaces and punctuations
- removes unnecessary zwnj on start/end of each line
fix_arabic_numbers
default: true
- replaces arabic numbers with their persian equivalent
fix_english_numbers
default: true
- replaces english numbers with their persian equivalent
fix_numeral_symbols
default: true
- replaces english percent signs (U+066A)
- replaces dots between numbers into decimal separator (U+066B)
- replaces commas between numbers into thousands separator (U+066C)
fix_misc_non_persian_chars
default: true
- replaces arabic normal/swash kaf with its persian equivalent
- replaces arabic/urdu/pushtu/uyghur yeh with its persian equivalent
- replaces kurdish he with its persian equivalent
fix_punctuations
default: true
- replaces
,,;with its persian equivalent
fix_question_mark
default: true
- replaces question marks with its persian equivalent
fix_perfix_spacing
default: true
- puts zwnj between the word and the prefix:
-
mi*,nemi*,bi*
fix_suffix_spacing
default: true
- puts zwnj between the word and the suffix:
-
*ha,*haye-*am,*at,*ash,*ei,*eid,*eem,*and,*man,*tan,*shan-*tar,*tari,*tarin-*hayee,*hayam,*hayat,*hayash,*hayetan,*hayeman,*hayeshan
fix_suffix_misc
default: true
- replaces
هfollowed byئorی, and then byی, withهای
fix_spacing_for_braces_and_quotes
default: true
- removes inside spaces and more than one outside for
(),[],{},“”and«»
fix_spacing_for_punctuations
default: true
- removes space before punctuations
- removes more than one space after punctuations, except followed by new-lines
- removes space after colon that separates time parts
- removes space after dots in numbers
- removes space before some common domain tlds
- removes space between question and exclamation marks
- removes space between same marks
fix_diacritics
default: true
- cleans zwnj before diacritic characters
- cleans more than one diacritic characters
- cleans spaces before diacritic characters
remove_diacritics
default: false
- removes all diacritic characters
fix_persian_glyphs
default: true
- converts incorrect persian glyphs to standard characters
fix_misc_spacing
default: true
- removes space before parentheses on misc cases
- removes space before braces containing numbers
cleanup_spacing
default: true
- replaces more than one space with just a single one
- cleans whitespace/zwnj between new-lines
cleanup_line_breaks
default: true
- cleans more than two contiguous line breaks
cleanup_begin_and_end
default: true
- removes space/tab/zwnj/nbsp from the beginning of the new-lines
- removes spaces, tabs, zwnj, direction marks and new lines from the beginning and end of text
markdown
markdown_normalize_braces
default: true
- removes spaces between
[]and()([text] (link)into[text](link)) - removes space between
!and opening brace (! [alt](src)into) - removes spaces inside double
(),[],{}([[ text ]]into[[text]]) - removes spaces between double
(),[],{}([[text] ]into[[text]])
markdown_normalize_lists
default: true
- removes extra lines between two items on a markdown list beginning with
-,*or#
skip_markdown_ordered_lists_numbers_conversion
default: false
- skips converting english numbers of ordered lists in markdown
aggressive editing
cleanup_extra_marks
default: true
- replaces more than one exclamation mark with just one
- replaces more than one english or persian question mark with just one
- re-orders consecutive marks:
?!into!?
kashidas_as_parenthetic
default: true
- replaces kashidas to ndash in parenthetic
cleanup_kashidas
default: true
- converts kashida between numbers to ndash
- removes all kashidas between non-whitespace characters
extras
preserve_frontmatter
default: true
- preserves frontmatter data in the text
preserve_HTML
default: true
- preserves all html tags in the text
preserve_comments
default: true
- preserves all html comments in the text
preserve_entities
default: true
- preserves all html entities in the text
preserve_URIs
default: true
- preserves all uri strings in the text
preserve_brackets
default: false
- preserves strings inside square brackets (
[])
preserve_braces
default: false
- preserves strings inside curly braces (
{})
preserve_nbsps
default: true
- preserves all no-break space entities in the text
License
This software is licensed under the MIT License. View the license.
6 years ago
6 years ago
6 years ago
6 years ago
7 years ago
7 years ago
7 years ago
7 years ago
7 years ago
7 years ago
8 years ago
9 years ago
9 years ago
9 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
10 years ago
11 years ago
11 years ago
11 years ago
11 years ago
11 years ago