Random-mock

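Lightweight JavaScript sample generator

Random-mock is an elegant sample generator that produces sample points according to preset rules. Install it with npm:

npm install random-mock

Usage

The samples in the figure above were generated with the following code: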

const attributes = [
    {
        name: 'x',
        type: 'continuous',
        distribution: {
            type: 'uniform',
            begin: -5,
            end: 5
        }
    },
    {
        name: 'y',
        type: 'continuous',
        distribution: {
            type: 'uniform',
            begin: -5,
            end: 5
        }
    },
    {
        name: 'series',
        type: 'category',
        distribution: {
            type: 'standard',
            range: ['a', 'b']
        }
    }
]
const rules = [
    {
        source: ['x', 'y'],
        target: 'series',
        type: 'mappingtable',
        conditions: [
            {
                and: (item) =>
                    item.x * item.x + Math.pow(item.y - Math.pow(item.x * item.x, 1 / 2), 2) <= 5,
                value: 'a'
            },
            {
                and: (item) =>
                    item.x * item.x + Math.pow(item.y - Math.pow(item.x * item.x, 1 / 2), 2) > 5,
                value: 'b'
            }
        ]
    }
]
const config = {
    attributes,
    rules
}
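// Assumes the package is loaded under the name RandMock, e.g. via CommonJS
const RandMock = require('random-mock')
let mocker = new RandMock.Mocker(config)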
let data = mocker.create({
    count: 10000,
    mode: RandMock.DataMode.Object
})

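The code above defines two numeric variables x and y, each ranging over [-5, 5], and a categorical variable series, together with two rules:

When $x^2+(y-\sqrt{x^2})^2\leq5$ (a heart-shaped curve), series is assigned to class a
When $x^2+(y-\sqrt{x^2})^2>5$, series is assigned to class b

You can also adjust the rules as needed, for example: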
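The two conditions can be checked by hand on a few points. The sketch below reuses the predicate from the `and` functions in the rules above; `insideHeart` and `classify` are hypothetical helper names used only for illustration and are not part of random-mock.

```javascript
// Same predicate as the `and` functions in the rules above.
// Math.sqrt(x * x) is the same value as Math.pow(x * x, 1 / 2), i.e. |x|.
const insideHeart = (item) =>
    item.x * item.x + Math.pow(item.y - Math.sqrt(item.x * item.x), 2) <= 5

// Hypothetical helper: the series label the two conditions would assign.
const classify = (item) => (insideHeart(item) ? 'a' : 'b')

console.log(classify({ x: 0, y: 0 })) // 0 + 0 = 0 <= 5, so 'a'
console.log(classify({ x: 1, y: 3 })) // 1 + (3 - 1)^2 = 5 <= 5, so 'a'
console.log(classify({ x: 3, y: 0 })) // 9 + (0 - 3)^2 = 18 > 5, so 'b'
```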

const rules = [
    {
        source: ['x', 'y'],
        target: 'series',
        type: 'mappingtable',
        conditions: [
            {
                and: (item) => item.x * item.x + item.y * item.y <= 9,
                value: 'a'
            },
            {
                and: (item) => item.x * item.x + item.y * item.y > 9,
                value: 'b'
            }
        ]
    }
]

When $x^2+y^2\leq9$, series is assigned to class a
When $x^2+y^2>9$, series is assigned to class b

Generating 10000 samples that satisfy these conditions produces the following result:

Definitions

Random-mock has four main data structures: Attribute, Distribution, Regulation (rule), and the sample generator Mocker.

Attribute definition

let attributes = [
    {
        name: 'x',
        type: 'continuous',
        distribution: {
            type: 'uniform',
            begin: -5,
            end: 5
        }
    },
    {
        name: 'y',
        type: 'continuous',
        distribution: {
            type: 'uniform',
            begin: -5,
            end: 5
        }
    }
]

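Basic definition of attributes:

Parameter	Description	Type	Required
name	Attribute name	`string`	Yes
type	Attribute type	`string` (or the `AttributeType` enum)	Yes
distribution	Attribute distribution	`DistributionConfig` or `DistributionConstructor`	Yes, except for the compound type

Attribute types (AttributeType)

Attribute type	Description
category	Unordered categorical variable
compound	Compound variable
continuous	Continuous variable (can take infinitely many values within any interval)
date	Time-series (date) variable
discrete	Ordered discrete variable
primary	Primary-key variable (the combination of all primary-key variables is guaranteed to be unique; all primary variables are independent by default, and their initial values are computed as a Cartesian product before any rule is executed)
unique	Unique variable (its value is guaranteed to be unique across all samples)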
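The Cartesian product mentioned for primary variables can be illustrated with plain JavaScript. This is only a sketch of the idea, not the library's implementation; `cartesianProduct` and the attribute names `id` and `region` are assumptions chosen for the example.

```javascript
// Hypothetical helper, not part of random-mock: combine the candidate values
// of several primary attributes into every possible key combination.
function cartesianProduct(keys) {
    return Object.entries(keys).reduce(
        (rows, [name, values]) =>
            rows.flatMap((row) => values.map((value) => ({ ...row, [name]: value }))),
        [{}] // start from a single empty row
    )
}

const rows = cartesianProduct({ id: [1, 2], region: ['CHN', 'US'] })
console.log(rows.length) // 4 combinations, each one unique
console.log(rows[0])     // { id: 1, region: 'CHN' }
```

Because every combination appears exactly once, the tuple of primary values is unique by construction.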

Attribute.Category

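Parameter	Description	Type	Required	Default
binarization	Whether to binarize the attribute	`boolean`	No	`false`
binaryFormat	Binarization template	`[any, any]`	No	`[false, true]`

Attribute.Compound

Parameter	Description	Type	Required	Default
arguments	Sub-attributes	`string[]` (must be attribute names)	Yes	-

Attribute.Continuous

No additional parameters

Attribute.Date

Parameter	Description	Type	Required	Default
format	Date format template	`string` (must be a valid date format template)	No	`'YYYY/MM/DD'`
record	Whether to record each value (when true, every generated record is saved in the attribute's range)	`boolean`	No	`false`
sort	Whether to sort (when true, range is sorted after every generated record)	`boolean`	No	`false`

Attribute.Discrete

Parameter	Description	Type	Required	Default
step	Discretization step (interval)	`number`	No	`1`
record	Same as in `Attribute.Date`	`boolean`	No	`false`
sort	Same as in `Attribute.Date`	`boolean`	No	`false`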

Attribute.Primary

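Parameter	Description	Type	Required	Default
count	Number of key values to generate from the specified distribution	`number`	No	`100`
formatToValue	Converts a formatted value to a number	`function`	No	`(source)=>source`
valueToFormat	Converts a number to a formatted value	`function`	No	`(source)=>source`
retryCount	When a value drawn from the specified distribution is a duplicate, the draw is retried; this is the maximum number of retries allowed	`number`	No	`100`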

Attribute.Unique

Parameter	Description	Type	Required	Default
formatToValue	Same as in `Attribute.Primary`	`function`	No	`(source)=>source`
valueToFormat	Same as in `Attribute.Primary`	`function`	No	`(source)=>source`
retryCount	Same as in `Attribute.Primary`	`number`	No	`100`
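The role of retryCount can be pictured as a bounded retry loop. The sketch below is an assumption about the general technique, not the library's code: `drawUnique` is a hypothetical helper, and throwing once the retries are exhausted is a choice made here because the behavior in that case is not documented above.

```javascript
// Hypothetical helper, not part of random-mock: draw until an unseen value appears.
function drawUnique(draw, seen, retryCount = 100) {
    // One initial attempt plus up to `retryCount` retries
    for (let attempt = 0; attempt <= retryCount; attempt++) {
        const value = draw()
        if (!seen.has(value)) {
            seen.add(value)
            return value
        }
    }
    // Assumption: the sketch fails loudly when no unique value is found
    throw new Error(`no unique value found after ${retryCount} retries`)
}

const seenValues = new Set()
const rollDie = () => 1 + Math.floor(Math.random() * 6)
const rolls = Array.from({ length: 3 }, () => drawUnique(rollDie, seenValues))
console.log(new Set(rolls).size === rolls.length) // true: no duplicates among the draws
```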

Distribution definition

let distribution = {
    type: 'uniform',
    begin: 0,
    end: 10
}

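The object above defines a uniform distribution over [0, 10].

Basic definition of distribution:

Parameter	Description	Type	Required
type	Distribution type the variable follows	`string` (or the `DistributionType` enum)	Yes

Distribution types (DistributionType)

Currently implemented distributions:

Distribution type	Description
cauchy	Cauchy distribution
disposable	Disposable (one-time) distribution
exponential	Exponential distribution
hypergeometric	Hypergeometric distribution
normal	Normal distribution
standard	Standard probability distribution
uniform	Uniform distribution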

Distribution.Cauchy

$F(x)=\frac{1}{\pi}\arctan(\frac{x-x_0}{\theta})+\frac{1}{2}$, which gives $x=x_0+\theta\tan(\pi(F(x)-\frac{1}{2}))$

Parameter	Description	Type	Required	Default
x0	$x_0$	`number`	Yes	-
theta	$\theta$	`number`	Yes	-
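Solving the Cauchy CDF for $x$ gives $x=x_0+\theta\tan(\pi(F(x)-\frac{1}{2}))$, which can be used for inverse-transform sampling. The following is a minimal sketch of that technique; `sampleCauchy` and its injectable `rand` parameter are hypothetical names introduced here and are not part of the random-mock API.

```javascript
// Inverse-transform sampling for the Cauchy distribution (illustrative sketch).
// `rand` defaults to Math.random and can be replaced for deterministic checks.
function sampleCauchy(x0, theta, rand = Math.random) {
    const p = rand() // p plays the role of F(x), uniform on [0, 1)
    return x0 + theta * Math.tan(Math.PI * (p - 0.5))
}

console.log(sampleCauchy(2, 3, () => 0.5))  // 2, the median x0
console.log(sampleCauchy(2, 3, () => 0.75)) // close to 5, since tan(pi / 4) is close to 1
```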

Distribution.Disposable

Parameter	Description	Type	Required	Default
range	Disposable (one-time) samples	`any[]`	Yes	-

Distribution.Exponential

$F(x)=1-e^{-\lambda(x-offset)}\ (x\geq offset)$, which gives $x=offset-\frac{\ln(1-F(x))}{\lambda}$

Parameter	Description	Type	Required	Default
offset	$offset$	`number`	Yes	-
lambda	$\lambda$	`number`	Yes	-
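The inverse formula $x=offset-\frac{\ln(1-F(x))}{\lambda}$ translates directly into code. This is a sketch under the assumption that $F(x)$ is replaced by a uniform random number; `sampleExponential` is a hypothetical name, not a random-mock function.

```javascript
// Inverse-transform sampling for a shifted exponential distribution (illustrative sketch).
function sampleExponential(offset, lambda, rand = Math.random) {
    const p = rand() // uniform on [0, 1), so 1 - p is never 0
    return offset - Math.log(1 - p) / lambda
}

console.log(sampleExponential(1, 2, () => 0)) // 1: the smallest possible value is the offset
console.log(sampleExponential(1, 2))          // a random value >= 1
```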

Distribution.Hypergeometric

$P(x=k)=\frac{C_M^kC_{N-M}^{n-k}}{C_N^n}$, which gives $F(x)=\sum_{k=0}^{x}\frac{C_M^kC_{N-M}^{n-k}}{C_N^n}$

Parameter	Description	Type	Required	Default
range	Its length should equal $\min(n,M)$	`Array`	Yes	-
n	$n$	`number`	Yes	-
M	$M$	`number`	Yes	-
N	$N$	`number`	Yes	-
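For a discrete distribution such as this one, a sample can be drawn by accumulating $P(x=k)=\frac{C_M^kC_{N-M}^{n-k}}{C_N^n}$ until the running total exceeds a uniform random number. The sketch below illustrates that general technique; the function names are hypothetical, and how the library maps the result onto `range` is not shown here.

```javascript
// Binomial coefficient C(n, k), computed iteratively with exact intermediate integers
function choose(n, k) {
    if (k < 0 || k > n) return 0
    let c = 1
    for (let i = 1; i <= k; i++) c = (c * (n - k + i)) / i
    return c
}

// P(x = k) for n draws from a population of N items, M of which are "successes"
function hypergeometricPmf(k, n, M, N) {
    return (choose(M, k) * choose(N - M, n - k)) / choose(N, n)
}

// Walk the cumulative distribution until it exceeds a uniform random number
function sampleHypergeometric(n, M, N, rand = Math.random) {
    const p = rand()
    const kMax = Math.min(n, M)
    let cumulative = 0
    for (let k = 0; k <= kMax; k++) {
        cumulative += hypergeometricPmf(k, n, M, N)
        if (p < cumulative) return k
    }
    return kMax // guards against floating-point round-off
}

console.log(hypergeometricPmf(1, 3, 4, 10))         // 0.5, i.e. 4 * 15 / 120
console.log(sampleHypergeometric(3, 4, 10, () => 0.5)) // 1
```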

Distribution.Normal

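The normal distribution has no closed-form cumulative distribution function $F(x)$, so this API uses the Box-Muller algorithm (polar form): given variables $u$ and $v$ uniformly distributed on $(-1,1)$, let $w=u^2+v^2$ and keep only pairs with $0<w<1$. Then $n=u\sqrt{\frac{-2\ln{w}}{w}}$ or $n=v\sqrt{\frac{-2\ln{w}}{w}}$ follows the standard normal distribution, which gives $x=\mu+n\sigma$.

Parameter	Description	Type	Required	Default
u	$\mu$	`number`	Yes	-
sigma	$\sigma$	`number`	Yes	-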
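The polar form of the Box-Muller method can be sketched as follows. This is an illustration of the technique, not the library's source: `sampleNormal` is a hypothetical name, and the rejection loop (discarding pairs with $w\geq1$ or $w=0$) is the standard requirement of the polar method.

```javascript
// Polar Box-Muller sampling of a normal variable (illustrative sketch)
function sampleNormal(mu, sigma, rand = Math.random) {
    let u, v, w
    do {
        u = 2 * rand() - 1 // uniform on [-1, 1)
        v = 2 * rand() - 1
        w = u * u + v * v
    } while (w >= 1 || w === 0) // keep only points strictly inside the unit circle
    const n = u * Math.sqrt((-2 * Math.log(w)) / w) // standard normal variate
    return mu + n * sigma
}

// Rough check: the average of many samples should be near mu
const draws = Array.from({ length: 20000 }, () => sampleNormal(10, 2))
const mean = draws.reduce((sum, x) => sum + x, 0) / draws.length
console.log(mean.toFixed(1)) // typically prints 10.0
```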

Distribution.Standard

For every sample $k\in range$ in the set $range$, $P(x=k)=p_k$.

Parameter	Description	Type	Required	Default
range	$range$	`any[]`	Yes	-
p	$p$	`string[]`	Yes	-
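Sampling from such a table of values and probabilities is commonly done by walking the cumulative sum of $p$. The sketch below shows that approach under the assumption that the entries of `p` sum to 1; `sampleStandard` is a hypothetical name, and `Number(...)` is used because the table above types `p` as `string[]`.

```javascript
// Sample one element of `range` with probabilities `p` (illustrative sketch)
function sampleStandard(range, p, rand = Math.random) {
    const r = rand()
    let cumulative = 0
    for (let i = 0; i < range.length; i++) {
        cumulative += Number(p[i]) // accepts numbers or numeric strings
        if (r < cumulative) return range[i]
    }
    return range[range.length - 1] // guards against floating-point round-off
}

console.log(sampleStandard(['a', 'b', 'c'], ['0.2', '0.5', '0.3'], () => 0.1)) // 'a'
console.log(sampleStandard(['a', 'b', 'c'], ['0.2', '0.5', '0.3'], () => 0.6)) // 'b'
```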

Distribution.Uniform

$F(x)=\frac{x-a}{b-a}$, which gives $x=a+(b-a)F(x)$

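Parameter	Description	Type	Required	Default
begin	$a$	`number`	Yes, unless `range` is given	-
end	$b$	`number`	Yes, unless `range` is given	-
range	$[a,b]$	`[number, number]`	Yes, unless `begin` and `end` are given	-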
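Inverting $F(x)=\frac{x-a}{b-a}$ gives $x=a+(b-a)F(x)$, so a uniform sample on $[a,b)$ is a scaled and shifted `Math.random()` value. `sampleUniform` below is a hypothetical helper written for illustration only.

```javascript
// Inverse-transform sampling for the uniform distribution on [a, b) (illustrative sketch)
function sampleUniform(a, b, rand = Math.random) {
    return a + (b - a) * rand()
}

console.log(sampleUniform(-5, 5, () => 0))   // -5, the lower bound
console.log(sampleUniform(-5, 5, () => 0.5)) // 0, the midpoint
```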

Rules (Regulation)

let rules = [
    {
        target: 'y',
        source: ['x'],
        type: 'expression',
        // ** is exponentiation; ^ would be bitwise XOR in JavaScript
        expression: (item) => item.x ** 2 + 2 * item.x + 5,
        distribution: 'normal',
        sigma: 5,
        confidence: 0.98
    },
    {
        target: 'z',
        source: ['region', 'x'],
        // a rule with `conditions` is a mapping-table rule
        type: 'mappingtable',
        conditions: [
            {
                region: 'CHN',
                value: 10000
            },
            {
                region: ['US', 'UK'],
                and: (item) => item.x >= 100,
                value: {
                    type: 'uniform',
                    range: [5000, 50000]
                }
            },
            {
                region: 'RUS',
                or: (item) => item.x < 50,
                value: {
                    type: 'expression',
                    expression: (item) => item.x * 1000
                }
            }
        ],
        confidence: 0.98
    }
]

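The code above sets up the following rules:

An expression rule for y in terms of x: $y$ follows a normal distribution with mean $y_0=x^2+2x+5$ and standard deviation sigma.
A mapping-table rule for z in terms of region and x:

region	x	z
'CHN'	-	10000
'US' or 'UK'	and x >= 100	Uniform distribution over [5000, 50000]
'RUS'	or x < 50	Expression z = x * 1000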

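Basic definition of regulation:

Parameter	Description	Type	Required	Default
source	Independent variables used by this rule	`string[]`	Yes	-
target	Dependent variable determined by this rule	`string`	Yes	-
type	Rule type	`string` or `RegulationType`	Yes	-
confidence	Confidence: the probability that an item satisfying the preconditions has this rule applied (used to introduce noise)	`number`	No	`1`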
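One way to picture confidence is as the probability that a rule fires for a matching item. The following sketch makes that reading concrete; `applyWithConfidence` is a hypothetical helper, and leaving the item unchanged when the rule does not fire is an assumption made for this example, not documented library behavior.

```javascript
// Hypothetical helper, not part of random-mock: apply an expression rule
// to an item with probability `rule.confidence` (default 1).
function applyWithConfidence(item, rule, rand = Math.random) {
    const confidence = rule.confidence === undefined ? 1 : rule.confidence
    if (rand() < confidence) {
        return { ...item, [rule.target]: rule.expression(item) }
    }
    return item // assumption: the item is left untouched, which acts as noise
}

const doubleRule = { target: 'y', confidence: 0.98, expression: (item) => item.x * 2 }
console.log(applyWithConfidence({ x: 3, y: 0 }, doubleRule, () => 0.5))  // { x: 3, y: 6 }
console.log(applyWithConfidence({ x: 3, y: 0 }, doubleRule, () => 0.99)) // { x: 3, y: 0 }
```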

Regulation types (RegulationType)

Rule type	Description
expression	Expression rule
mappingtable	Mapping-table rule

Regulation.Expression

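Parameter	Description	Type	Required	Default
expression	Expression over the element `item`; its return value becomes `item[target]`	`Function`	Yes	-
distribution	When set, target values are scattered around the expression value according to the specified distribution	Only `cauchy \| normal \| uniform`	No	-
theta	$\theta$	`number`	Only when `distribution` is `cauchy`	-
sigma	$\sigma$	`number`	Only when `distribution` is `normal`	-
difference	Half-width $d$ of the interval $[x_0-d,x_0+d]$	`number`	Only when `distribution` is `uniform`	-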

Regulation.MappingTable

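Parameter	Description	Type	Required	Default
conditions	List of mapping-table entries	`Condition[]`	Yes	-

Condition

Parameter	Description	Type	Required	Default
and	The rule is applied when the item satisfies the attributeName base condition and this function returns true	`Function`	No	-
or	The rule is applied when the item satisfies the attributeName base condition or this function returns true (when both and and or are present, and is evaluated first)	`Function`	No	-
value	Value assigned to the target when the condition matches	`ExpressionConfig`\|`DistributionConfig`\|`any`	Yes	-
attributeName	Value(s) of the independent variable; the key is the name of a source attribute	`Array<any>`\|`any`	No	-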
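The interplay of the attributeName base condition with and and or can be sketched in a few lines. This is one reading of the table above, not the library's code: `matchesBase` and `matches` are hypothetical helpers, an array value is treated as "any of these values", and the combined result is taken to be (base and `and`) or `or`.

```javascript
// Keys with a special meaning inside a condition; every other key is an attribute name
const RESERVED_KEYS = new Set(['and', 'or', 'value'])

// Base condition: every listed attribute must equal (or be one of) the expected value(s)
function matchesBase(item, condition) {
    return Object.keys(condition)
        .filter((key) => !RESERVED_KEYS.has(key))
        .every((key) => {
            const expected = condition[key]
            return Array.isArray(expected)
                ? expected.includes(item[key])
                : item[key] === expected
        })
}

// `and` narrows the base condition first, then `or` widens the result
function matches(item, condition) {
    let result = matchesBase(item, condition)
    if (typeof condition.and === 'function') result = result && condition.and(item)
    if (typeof condition.or === 'function') result = result || condition.or(item)
    return result
}

const usUk = { region: ['US', 'UK'], and: (item) => item.x >= 100, value: 1 }
const rus = { region: 'RUS', or: (item) => item.x < 50, value: 2 }
console.log(matches({ region: 'UK', x: 150 }, usUk)) // true: region matches and x >= 100
console.log(matches({ region: 'UK', x: 50 }, usUk))  // false: x is below 100
console.log(matches({ region: 'CHN', x: 10 }, rus))  // true: x < 50 satisfies `or`
```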

DataConfiguration

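Sets the format of the output data.

Parameter	Description	Type	Required	Default
count	Number of records to output	`number`	No	`100`
type	Output format	`'object'`\|`'table'`	No	`Datatype.Object`
settings	Output settings	`object`	No	-

DataSettings

Parameter	Description	Type	Required	Default
categoryBinarization	Whether to binarize all `Category` values globally	`boolean`	No	`false`
categoryBinaryFormat	Binarization template; takes effect only when `categoryBinarization` is `true`	`[any, any]`	No	`[false, true]`
saveOriginal	Whether to keep the original attributes (if not, they are removed after compound attributes are built and after binarization)	`boolean`	No	`false`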