0.2.6 • Published 10 years ago

sql-data-profiler v0.2.6

Weekly downloads
18
License
MIT
Repository
github
Last release
10 years ago

npm version Build Status

sql-data-profiler

sql-data-profiler is a utility module that generates sql code to profile data sets (eg. tables) in Redshift.

Installation

npm install sql-data-profiler --save

Quick start

var sql_data_profiler = require('sql-data-profiler');

Table profiler

Generates a sql statement that provides basic stats on the data in the "contacts" table.

var data_profiler = sql_data_profiler.data_profiler;

var options = {
	target_table: 'contacts',
	target_columns: ['email', 'a_industry']
};
var sql_code = data_profiler(options);

which accepts the following options

  • target_table
  • target_columns
  • results_table
  • calculate_frequency
  • use_perm_table
  • truncate_table

Distribution analysis

What does this do?

var distribution_analyzer = sql_data_profiler.distribution_analyzer;

var sql_code = distribution_analyzer({
	target_table: 'contacts',
	target_variable: {
		data_set: 'SELECT DISTINCT contact_id FROM events WHERE meta_event = \'conversion\'',
		join: {
			source_column: 'contact_id',
			target_column: 'contact_id'
		}
	},
	target_columns: [
		{
			name: 'a_name',
			max_number_of_values: 10,
		},
		{
			name: 'a_gender',
			max_number_of_values: 10
		},
		{
			name: 'a_address',
			max_number_of_values: 10,
			transformation: 'has_any_value'
		}
	]
});

Table stats

The following stats are calculated for each column.

StatsDescription
count_totalnumber of records in the table
count_not_nullnumber of records where the value for the specified column is not null
fill_ratenumber of non-null values divided by number of records
count_distinctnumber of distinct values
dupe_ratenumber of distinct values divided by number of records where the value for the specified column is not null
maximum_value
minimum_value
most_frequent_value_1
most_frequent_value_1_frequency
most_frequent_value_2
most_frequent_value_2
most_frequent_value_3
most_frequent_value_3

TO DO

  • handle different data type (eg. boolean)
  • performance improvement
0.2.6

10 years ago

0.2.5

10 years ago

0.2.4

10 years ago

0.2.3

10 years ago

0.2.2

10 years ago

0.2.1

10 years ago

0.1.2

10 years ago

0.1.1

10 years ago

0.1.0

10 years ago