1.2.1 • Published 4 years ago

schema-analyzer v1.2.1

Weekly downloads
10
License
MIT
Repository
github
Last release
4 years ago

Build Status Codacy Badge GitHub package version GitHub stars Node.js CI npm bundle size npm downloads codecov

Schema Analyzer

An Open Source joint by Dan Levy

Analyze column type & size summary from any input JSON array

Schema Analyzer is the core library behind Dan's Schema Generator.

Features

The primary goal is to support any input JSON/CSV and infer as much as possible. More data will generally yield better results.

  • Heuristic type analysis for arrays of objects.
  • Browser-based (local, no server necessary)
  • Automatic type detection:
    • ID - Identifier column, by name and unique Integer check (detects BigInteger)
    • ObjectId (MongoDB's 96 bit/12 Byte ID. 32bit timestamp + 24bit MachineID + 16bit ProcessID + 24bit Counter)
    • UUID/GUID (Common 128 bit/16 Byte ID. Stored as a hex string, dash delimited in parts: 8, 4, 4, 4, 12)
    • Boolean (detects obvious strings true, false, Y, N)
    • Date (Smart detection via comprehensive regex pattern)
    • Timestamp (integer, number of milliseconds since unix epoch)
    • Currency (62 currency symbols supported)
    • Float (w/ scale & precision measurements)
    • Number (Integers)
    • Null (sparse column data helps w/ certain inferences)
    • Email (falls back to string)
    • String (big text and varchar awareness)
    • Array (includes min/max/avg length)
    • Object
  • Detects column size minimum, maximum and average
  • Includes data points at the 30th, 60th and 90th percentiles (for detecting outliers and enum types!)
  • Handles some error/outliers
  • Quantify # of unique values per column
  • Identify enum Fields w/ Values
  • Identify Not Null fields
  • Nested data structure & multi-table relational output.

Getting Started

npm install schema-analyzer
import { schemaBuilder } from 'schema-builder'

schemaBuilder(schemaName: String, data: Array<Object>): TypeSummary

Preview Analysis Results

What does this library's analysis look like?

It consists of 3 key top-level properties:

  • totalRows - # of rows analyzed.
  • fields: FieldTypeSummary - a map of field names with all detected types (includes meta-data for each type detected, with possible overlaps. e.g. an Email is also a String, "42" is a String and Number)

Review the raw results below

Details about each field can be found below.

{
  "totalRows": 5,
  "fields": {
    "id": {
      "types": {
        "Number": {
          "rank": 8,
          "count": 5,
          "value": { "min": 1, "mean": 3, "max": 5, "p25": 2, "p33": 2, "p50": 3, "p66": 4, "p75": 4, "p99": 5 }
        },
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 1, "mean": 1, "max": 1, "p25": 1, "p33": 1, "p50": 1, "p66": 1, "p75": 1, "p99": 1 }
        }
      }
    },
    "name": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 3, "mean": 7.2, "max": 15, "p25": 3, "p33": 3, "p50": 5, "p66": 10, "p75": 10, "p99": 15 }
        }
      }
    },
    "role": {
      "types": {
        "String": {
          "rank": 12,
          "count": 5,
          "length": { "min": 4, "mean": 5.4, "max": 9, "p25": 4, "p33": 4, "p50": 5, "p66": 5, "p75": 5, "p99": 9 }
        }
      }
    },
    "email": {
      "types": {
        "Email": {
          "rank": 11,
          "count": 5,
          "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
        }
      }
    },
    "createdAt": {
      "types": {
        "Date": {
          "rank": 4,
          "count": 4,
          "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 6, "mean": 6, "max": 6, "p25": 6, "p33": 6, "p50": 6, "p66": 6, "p75": 6, "p99": 6 }
        }
      }
    },
    "accountConfirmed": {
      "types": {
        "Unknown": {
          "rank": -1,
          "count": 1
        },
        "String": {
          "rank": 12,
          "count": 1,
          "length": { "min": 9, "mean": 9, "max": 9, "p25": 9, "p33": 9, "p50": 9, "p66": 9, "p75": 9, "p99": 9 }
        },
        "Boolean": {
          "rank": 3,
          "count": 4
        }
      }
    }
  }
}

Sample input dataset for the example results above

idnameroleemailcreatedAtaccountConfirmed
1Evepowerusereve@example.com01/20/2020undefined
2Aliceuserali@example.com02/02/2020true
3Bobuserrobert@example.com12/31/2019true
4Elliot Aldersonadminfalkensmaze@protonmail.com01/01/2001false
5Sam Sepioladminfalkensmaze@hotmail.com9/9/99true

AggregateSummary

Numeric and String types include a summary of the observed field sizes:

Number & String Range Object Details

Properties
  • min the minimum number or string length
  • max the maximum number or string length
  • mean the average number or string length
  • percentiles[25th, 33th, 50th, 66th, 75th, 99th] values from the Nth percentile number or string length

Percentile is based on input data, as-is with out sorting.

Length Range Data

Range data for the length of a String field type:

{
  "rank": 11,
  "count": 5,
  "length": { "min": 15, "mean": 19.4, "max": 26, "p25": 15, "p33": 15, "p50": 18, "p66": 23, "p75": 23, "p99": 26 }
}

This is useful for defining strict length limits or minimums, for example as SQL servers often require..

Range data for a Date fields value:

{
  "rank": 4,
  "count": 4,
  "value": { "min": "2001-01-01T00:00:00.000Z", "mean": "2015-04-14T18:00:00.000Z", "max": "2020-02-02T00:00:00.000Z", "p25": "2020-02-02T00:00:00.000Z", "p33": "2020-02-02T00:00:00.000Z", "p50": "2019-12-31T00:00:00.000Z", "p66": "2019-12-31T00:00:00.000Z", "p75": "2001-01-01T00:00:00.000Z", "p99": "2001-01-01T00:00:00.000Z" }
}

Notes

We recommend you provide at least 100+ rows. Accuracy increases greatly with 1,000 rows.

The following features require a certain minimum # of records:

  • Enumeration detection.
    • 100+ Rows Required.
    • Number of unique values must not exceed 20 or 5% of the total number of records. (100 records will identify as Enum w/ 5 values. Up to 20 are possible given 400 or 1,000+.)
  • Not Null detection.
    • where rowCount === field count

Full List of Detected Types

  • Unknown
  • ObjectId
  • UUID
  • Boolean
  • Date
  • Timestamp
  • Currency
  • Float
  • Number
  • Email
  • String
  • Array
  • Object
  • Null
1.2.0

4 years ago

1.2.1

4 years ago

1.1.0

4 years ago

1.0.1

4 years ago

1.0.0

4 years ago

0.0.1

4 years ago