1.1.6 • Published 2 years ago

@datatables/type-detector v1.1.6

Weekly downloads
-
License
MIT
Repository
github
Last release
2 years ago

Type-Detection

This Library provieds raw data type detection for Numbers, HTML, Dates and Strings. It was created with by the CloudTables team.

Install

Install from npm by using

npm install @datatables/type-detector

Usage

ES3

var TypeDetect = require('@datatables/type-detector');

var detector = new TypeDetect["default"]();

var type = detector.typeDetect(dataArray);

ES6

import TypeDetect from '@datatables/type-detector';

let detector = new TypeDetect();

let type = detector.typeDetect(dataArray);

Where dataArray is a single dimensional array of data who's type is to be detected.

The constructor takes 2 optional parameters as follows. | Parameter | Default | Description | |:---------:|:-------:|:-----------:| | decimalCharacter | '.' | This is used when determining the number of decimal places that a number has. | | thousandsSeparator | ',' | This is removed from strings when attempting to determine a number type. |

API

There is a single API method i18n().

detector.i18n(langOpts)

The langOpts parameter is an object that is used to add internationalisation options used to detect dates. The default value of the internal object is as follows.

{
	"abbrDays": {
		"deDE": /^(mo\.|di\.|mi\.|do\.|fr\.|sa\.|so\.)$/gi,
		"en": /^(mon|tue|wed|thu|fri|sat|sun)$/gi,
		"esES": /^(lun\.|mar\.|mié\.|jue\.|vie\.|sáb\.|dom\.)$/gi,
		"frFR": /^(lun\.|mar\.|mer\.|jeu\.|ven\.|sam\.|dim\.)$/gi
	},
	"abbrMonths": {
		"deDE": /^(jan\.|feb\.|märz\.|apr\.|mai\.|juni\.|juli\.|aug\.|sep\.|okt\.|nov\.|dez\.)$/gi,
		"en": /^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/gi,
		"esES": /^(ene\.|feb\.|mar\.|abr\.|may\.|jun\.|jul\.|ago\.|sep\.|oct\.|nov\.|dic\.)$/gi,
		"frFR": /^(janv\.|févr\.|mars\.|avr\.|mai\.|juin\.|juil\.|août\.|sept\.|oct\.|nov\.|dec\.)$/gi
	},
	"days": {
		"deDE": /(montag|dienstag|mittwoch|donnerstag|freitag|samstag|sonntag)/gi,
		"en": /(monday|tuesday|wednesday|thursday|friday|saturday|sunday)/gi,
		"esES": /(lunes|martes|miércoles|jueves|viernes|sábado|domingo)/gi,
		"frFR": /(lundi|mardi|mercredi|jeudi|vendredi|samedi|dimanche)/gi
	},
	"months": {
		"deDE": /^(januar|februar|märz|april|mai|juni|juli|august|september|oktober|november|dezember)$/gi,
		"en": /^(january|february|march|april|may|june|july|august|september|october|november|december)$/gi,
		"esES": /^(enero|febrero|marzo|abril|mayo|junio|julio|agosto|septiembre|octubre|noviembre|diciembre)$/gi,
		"frFR": /^(janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre)$/gi
	},
	"postFixes": {
		"deDE": /^[0-9]+(st|nd|rd|th)$/gi,
		"en":   /^[0-9]+(st|nd|rd|th)$/gi,
		"esES": /^[0-9]+(st|nd|rd|th)$/gi,
		"frFR": /^[0-9]+(st|nd|rd|th)$/gi
	}
}

To add internationalisation options for a language(s) an object of this form must be passed in. The values within the RegExps need to be parseable by moment otherwise dates will not be identified correctly. The key of the object value pairs should match the locale for the language that is being added. For example to add english as a language you would make the following call.

detector.i18n({
	"abbrDays": {
		"en": /^(mon|tue|wed|thu|fri|sat|sun)$/gi,
	},
	"abbrMonths": {
		"en": /^(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/gi,
	},
	"days": {
		"en": /(monday|tuesday|wednesday|thursday|friday|saturday|sunday)/gi,
	},
	"months": {
		"en": /^(january|february|march|april|may|june|july|august|september|october|november|december)$/gi,
	},
	"postFixes": {
		"en":   /^[0-9]+(st|nd|rd|th)$/gi,
	}
});

The method returns the detector instance so that chaining can occur.

Return Format

The following object is returned from the detector.typeDetect(dataArray) call mentioned above. If a given property is not relevant to the type that has been detected then it is set to null.

{
	type: null | string;
	format: null | string;
	locale: null | string;
	prefix: null | string;
	postfix: null | string;
	dp: null | number;
}
PropertyDescription
typeThe type that has been determined for the data.
formatIf a date has been detected as the type then this property is set to the moment format of that date.
localeIf a date has been detected then this property is set to store the locale of that date.
prefixIf numeric data has been detected then any common prefix across the entire data set is stored on this property. This excludes any numbers that are at the end of the prefix.
postfixIf numeric data has been detected then any common postfix across the entire data set is stored on this property. This excludes any numbers that are at the start of the postfix.
dpIf numeric data has been detected then the largest number of decimal places present within the data set is stored on this property.

Detectable Types

The following types can be detected by the Type-Detection library. They are listed here in the order that identification is attempted in.

Excel

CloudTables allows the importing of data from excel sheets. When this occurs the data that is read in has both a value and a format within an object.

[{ value: 1.1, excel: '#.#'}, { value: 2.5, excel: '#.#'}, { value: 3.6, excel: '#.#'}]

The data within this set is numeric, and therefore the type returned will be number. The dp, prefix and postfix properties are also determined using a combination of the value and excel properties. This page is useful for understanding the different formats that excel can create.

Excel imports can also include dates, in which case a format is also set in the excel property. In this case the date detection methods implemented within this library are used as the excel date formats are different to the moment formats that Cloudtables uses.

Strings and can also be determined from excel data.

Numbers

The Type-Detection library is capable of identifying a variety of formats of numbers.

  • JS numbers (100)
  • Numeric Strings ('100')
  • Numeric Strings with a prefix, such ascurrency ('$100')
  • Numeric Strings with a postfix, such as a unit ('100 cm')
  • Numeric Strings with both a prefix and a postfix ('$100 per month')
  • All of the above with decmial places, '.' by default (100.5, '100.5', '$100.50', '100.5 cm', '$100.50 per month')
  • All of the above (except JS Numbers) with a thousands separator, ',' by default ('100,000.5', '$100,000.50', '100,000.5 cm', '$100,000.50 per month')

There is no limit to the size of prefixes or postfixes.

Numbers

The Type-Detection library is capable of identifying sequences of numbers. To do this all of the numbers in the data must be integer values. The must also not include and prefixes or postfixes, but they can include thousands separators.

  • JS numbers (1, 2, 3)
  • Numeric Strings ('1', '2', '3')
  • Numeric Strings ('1,000', '2,000', '3,000')

HTML

To determine HTML the string inputs are checked against the following Regular Expression...

/<(“[^”]*”|'[^’]*’|[^'”>])*>/g

A data set does not need to be solely comprised of html code to get the html type. If there are also strings present then an html type can be returned.

Dates

Moment is used to aid in the detection of dates and their formats. Any date containing the following tokens can be detected, but each token should only be used once in the field.

Tokens

TokenDescriptionExampleExtra Notes
YYYYFull Year2020, 2021Cannot be used with YY
YYShort Year20, 21Cannot be used with YYYY
MMMMFull MonthJanuary, FebruaryCannot be used with MMM, MM, M
MMMAbbreviated MonthJan, FebCannot be used with MMMM, MM, M
MMMonth as Number01, 02Cannot be used with MMMM, MMM, M
MMonth as Short Number1, 2Cannot be used with MMMM, MMM, MM
DDDay as Number01, 02Cannot be used with D, Do
DDay as Short Number1, 2Cannot be used with DD, Do
DoDay with Postfix1st, 2ndCannot be used with DD, D
ddddDay of the WeekMonday, TuesdayCannot be used with ddd
dddAbbreviated Day of the WeekMon, TueCannot be used with dddd
HH24 Hour01, 02, ..., 23Cannot be used with H, hh, h. If used with A or a will become h
H24 Hour Short1, 2, ..., 23Cannot be used with HH, hh, h. If used with A or a will become h
hh12 Hour01, 02, ..., 12Cannot be used with HH, H, h
h12 Hour Short1, 2, ..., 12Cannot be used with HH, H, hh
mmMinutes01, 02, ..., 59Cannot be used with m
mMinutes Short1, 2, ..., 59Cannot be used with mm
ssSeconds01, 02, ..., 59Cannot be used with s
sSeconds Short1, 2, ..., 59Cannot be used with ss
AAM/PM in CapitalsAM, PMCannot be used with HH, H
aam/pm in lower caseam, pmCannot be used with HH, H

Some Examples

The date detection segment will always return one of the following.

  • A moment format - in this case a date has been succesfully identified with a consistent format.
  • A mixed type - in this case multiple different date formats have been identified.
  • null - no dates could be identified and other types will be checked.

Consider the following array of data.

['Thursday 1st Oct, 2020 11:12:47 AM','Sunday 17th Oct, 1999 10:56:07 PM','Friday 3rd Jan, 2025 9:02:00 AM']

The moment format that is detected is dddd Do MMM, YYYY h:mm:ss A, which is correct. Some changes can be made to this data to create ambiguity in the format that is used. If the AM/PM part of the dates is removed as follows...

['Thursday 1st Oct, 2020 11:12:47','Sunday 17th Oct, 1999 10:56:07','Friday 3rd Jan, 2025 9:02:00']

Now the moment format that is returned is dddd Do MMM, YYYY H:mm:ss, Notice the change from 12 hours (h) to 24 hours (H). In this case either token is valid, H is selected as it allows more flexibility if more data is added in the future.

There are a number of cases such as the above where some ambiguity is present - this is why there isn't a pre-existing type detection library in common use. A common example that is used is...

['10/10/10', '11/11/11', '12/12/12']

There isn't a definitive way to derive a format for these tokens, instead a good guess is taken. The moment format that is produced in this case is M/D/Y which again is valid, but it could be MM/DD/YY, YY/MM/DD , YY/D/M or a whole variety of combinations of the Day, Month and Year. The best that can be done in situations like these is to take a guess with a format that is valid for the entire set.

When it comes to dates in the above format where certainty is not guaranteed the order of priority is Months > Days > Years.

When it comes to time the order of priority is Hours > Minutes > Seconds.

Locales

Differnet locales can also be detected where the tokens are strings rather than numbers (Do, dddd etc.). The languages that are currently integrated are

  • English (default)
  • German
  • French
  • Spanish

As with other elements of date detection, where a single locale cannot be determined, one will be selected that can apply to all of the data in the set.

The detection of locales is completed using a series of Regular Expressions. These are iterated over in an attempt to find matches.

Mixed

If multiple different types are detected across the entire dataset then a mixed type is returned. This type will also be returned if there are dates with multiple different formats.

Note. The only exception to the mixed rule is a combination of strings and html. In this case a type of html will be returned.

Strings

If no other type can be detected then the type defaults to string.

Performance

The library has been tested with a data set 10,000 elements in length for each type that can be detected. The response times achieved were around 10-20 ms for most types. The exception to this was complex date formats where the response time increased to around 400 ms.

Pull requests are welcome and encouraged to increase the performance of the library.

1.1.6

2 years ago

1.1.5

2 years ago

1.1.4

2 years ago

1.1.3

2 years ago

1.1.2

4 years ago

1.1.1

4 years ago

1.1.0

4 years ago

1.0.6

4 years ago

1.0.5

4 years ago

1.0.4

4 years ago

1.0.3

4 years ago

1.0.2

4 years ago

1.0.1

4 years ago

1.0.0

4 years ago

0.0.5

4 years ago

0.0.6

4 years ago

0.0.4

4 years ago

0.0.3

4 years ago

0.0.2

4 years ago

0.0.1

4 years ago