# sqlite-file-mirror v0.3.0

SQLite File Mirror, next version (?) of MKTS Mirage
## Objective

A facility to mirror the contents of text files into SQLite tables with configurable names and fields;
provide a set of formats (CSV, TSV and so on) to simplify reading and persisting configuration data and
such.

In contradistinction to the current version of MKTS Mirage, which uses one big table to hold all sources
and has a few bells and whistles to accommodate the particular needs of DataMill, SQLite File Mirror
uses one table per source file.
## Source Descriptions

```coffee
source_description =
  name:         'nickname'            # used to identify this source in API calls
  path:         'path/to/source.txt'  # where the source file is to be found
  fields: [                           # a list of field descriptions; see below
    { name: 'linenr', type: 'integer',              } # field description
    { name: 'text',   type: 'text',   null: false,  } # field description
    ]
  primary_key:  'linenr'              # can be a list of field names
```
The name of the source becomes the name of the table that will hold the file contents. So when you add a
source with `SFM.add_source { name: 'config', path: 'myfile.txt', }`, you can then load the contents of
`myfile.txt` with `SFM.refresh 'config'` and browse the resulting table with an SQL statement like
`select * from "config" order by linenr;`.
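The mapping from a source description to a table definition can be pictured as follows; this is an
illustrative sketch in plain JavaScript, not the library's actual implementation, and the helper name
`ddlFromSource` is made up for the example:

```javascript
// Sketch: derive a CREATE TABLE statement from a source description.
// `ddlFromSource` is a hypothetical helper, not part of the SFM API.
const ddlFromSource = (source) => {
  const quote = (name) => `"${name.replace(/"/g, '""')}"`; // escape as an SQL identifier
  const columns = source.fields.map((f) => {
    let clause = `${quote(f.name)} ${f.type}`;
    if (f.null === false) clause += ' not null';
    if (f.unique) clause += ' unique';
    return clause;
  });
  // `primary_key` may be a single name or a list of names:
  const pk = [].concat(source.primary_key).map(quote).join(', ');
  return `create table ${quote(source.name)} ( ${columns.join(', ')}, primary key ( ${pk} ) );`;
};

// The default source layout from the example above:
const ddl = ddlFromSource({
  name: 'config',
  fields: [
    { name: 'linenr', type: 'integer' },
    { name: 'text', type: 'text', null: false },
  ],
  primary_key: 'linenr',
});
console.log(ddl);
// create table "config" ( "linenr" integer, "text" text not null, primary key ( "linenr" ) );
```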
## Field Descriptions

The `fields` attribute of a source description may contain field descriptions:
```coffee
field_description =
  name:     'foo "bar" baz' # will be escaped as a name, so can contain weird characters
  type:     'integer'       # will be used as-is, so must make sense to SQLite
  default:  42              # will be escaped as a value; do not write an SQL literal
  null:     false           # set to true to omit the `not null` clause
  unique:   false           # set to true to add a `unique` clause
```
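The distinction between escaping "as a name" and "as a value" can be sketched like this; the helpers are
hypothetical and simply assume SQLite's quote-doubling conventions (double quotes for identifiers,
single quotes for string literals):

```javascript
// Sketch of the two escaping rules mentioned above; these helpers are
// illustrative, not part of the SFM API.
const escapeName = (name) => `"${name.replace(/"/g, '""')}"`; // identifiers: double quotes, doubled
const escapeValue = (value) => {
  if (value === null) return 'null';
  if (typeof value === 'number') return String(value);
  return `'${String(value).replace(/'/g, "''")}'`;            // strings: single quotes, doubled
};

console.log(escapeName('foo "bar" baz')); // "foo ""bar"" baz"
console.log(escapeValue(42));             // 42
console.log(escapeValue("it's"));         // 'it''s'
```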
## The `format`, `max_split`, `transform` Settings

- `format` can be set to the name of a known format to have Mirage preprocess file contents. Its default
  value is `plain`, which is a no-op; other formats are
  - `wsv` for whitespace-separated values,
  - `tsv` for tab-separated values.
- `transform`
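How the formats differ in their splitting behavior can be sketched as follows; the regexes are
assumptions based on the descriptions above, not the library's actual definitions:

```javascript
// Sketch: per-format field splitting as described above. `plain` is a no-op;
// `wsv` splits on runs of arbitrary whitespace; `tsv` splits on single tabs.
// These regexes are assumptions, not taken from the SFM source.
const splitters = {
  plain: (line) => [line],
  wsv: (line) => line.trim().split(/\s+/u),
  tsv: (line) => line.split(/\t/u),
};

console.log(splitters.plain('a b\tc')); // [ 'a b\tc' ]
console.log(splitters.wsv('a b\tc'));   // [ 'a', 'b', 'c' ]
console.log(splitters.tsv('a b\tc'));   // [ 'a b', 'c' ]
```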
## General Behavior of `_preprocess()`

- Blank lines—lines that match `/^\s*$/u`—will be skipped entirely. (### TAINT make configurable)
- Commented lines will be skipped entirely; by default, all lines matching `/^\s*#/u` (read: lines that
  start with a hash `#`, optionally preceded by whitespace) are considered comments and silently
  skipped. (### TAINT make configurable)
- Blank fields—again, fields that match `/^\s*$/u`—will be set to the default value for that field.
  Since it is possible to set the default to `null` while simultaneously configuring the field to not
  allow `null`s (which, in fact, is what happens unless the field is explicitly configured otherwise),
  this means that, by default, lines with missing values will trigger data integrity errors.
- For each line, `_preprocess()` will always return as many fields as there are fields in that DB table;
  the first field will always be the line number, and the last field will always hold the unsplit
  remainder of the line. Put more simply, each line gets split into as many parts as there are content
  fields minus one, and whatever material exceeds that many splits is returned in the last position; in
  effect, the last field acts as a catch-all field. (### TAINT make configurable)

As an illustration, consider a line like
`いち に さん よん ご` (observe that longish space between the last two fields), consisting of five
whitespace-separated entries. We will use `{ format: 'wsv', }` to configure this source, which stands
for 'whitespace-separated values', a format where stretches of arbitrary whitespace characters are
considered field separators.

In case the source has been configured to contain columns `lnr`, `c1`, `c2`, `c3`, `c4`, `c5`, that
means one will get something like `[ 314, 'いち', 'に', 'さん', 'よん', 'ご', ]` as a representation of
that line: there's a line number and five content fields. If, however, one had configured one column
more for that source, the result would have been `[ 314, 'いち', 'に', 'さん', 'よん', 'ご', null, ]`:
all columns are filled, and the remainder is padded with default values (`null` in this case, but could
be anything).

If, on the other hand, one had configured the source with fields `lnr`, `c1`, `c2`, `c3`, then the
result would have been `[ 314, 'いち', 'に', 'さん よん ご', ]` instead. Observe how that longish space
is now preserved in the value of the last (catch-all) field.

As a logical consequence of the above rules, not configuring source fields means that no splitting is
done at all, no matter the `format` setting, since there is only a single content field, named `text`,
in the basic configuration.
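The catch-all splitting described above can be sketched as follows; this is a plain JavaScript
illustration of the documented behavior for a `wsv` source, not the actual `_preprocess()`
implementation, and the padding default is assumed to be `null`:

```javascript
// Sketch: split a line into at most `contentFields` parts, where the last
// part keeps the unsplit remainder of the line (the catch-all field); the
// result is padded with `null` if the line yields too few parts. This mimics
// the documented behavior, but is not the actual implementation.
const preprocessLine = (linenr, line, contentFields) => {
  const parts = [];
  let rest = line.trim();
  // Split on runs of whitespace, but at most `contentFields - 1` times, so
  // the final part retains the rest of the line verbatim:
  while (parts.length < contentFields - 1) {
    const match = rest.match(/\s+/u);
    if (match == null) break;
    parts.push(rest.slice(0, match.index));
    rest = rest.slice(match.index + match[0].length);
  }
  parts.push(rest);
  while (parts.length < contentFields) parts.push(null); // pad with defaults
  return [linenr, ...parts];
};

console.log(preprocessLine(314, 'いち に さん よん ご', 5));
// [ 314, 'いち', 'に', 'さん', 'よん', 'ご' ]
console.log(preprocessLine(314, 'いち に さん よん ご', 6));
// [ 314, 'いち', 'に', 'さん', 'よん', 'ご', null ]
console.log(preprocessLine(314, 'いち に さん よん ご', 3));
// [ 314, 'いち', 'に', 'さん よん ご' ]
```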
## Mirroring Procedure

Steps:

- The file is read line-by-line;
- each line is fed to the preprocessor, which
  - filters out blank lines,
  - filters out commented lines, and
  - splits the line into fields as described in the section on line-splitting.
- The result is one list `[ linenr, fields..., ]` per unfiltered line, where `linenr` is a positive
  integer and the `fields` are strings.
- Each field value of each line is then subjected to casting, i.e. literal values are turned into
  appropriate data types (e.g. the string `'true'` might be turned into a Boolean `true` and so on);
- finally, a custom formatter is called with an object that represents the 'raw' (but cast) fields of
  each line; the formatter is expected to return either `null` (to indicate the line should be dropped)
  or else an object matching the names and data types of the table's configured target structure.
## Performance Considerations

For unknown reasons, serializing the DB object to a file is unbearably slow; this is in any event no
fault of the application code, as the problem goes away when using an in-memory DB (with
`new SqliteDB ':memory:'`). Currently the best strategy is to always use an in-memory DB and then use
the `DB.save()` method to persist the data to disk:

```coffee
mirage = SFM.new_mirror new SqliteDB ':memory:'
SFM.add_source mirage, { name: 'readme', path: 'README.md', }
await SFM.refresh mirage
SFM.save mirage, 'path/to/mirage.db'
```

The `save()` method uses SQLite's `vacuum into 'path';` command, which is very fast indeed.