0.3.0 • Published 4 years ago

sqlite-file-mirror v0.3.0

Weekly downloads
-
License
MIT
Repository
-
Last release
4 years ago

SQLite File Mirror, next version (?) of MKTS Mirage

Objective:

  • A facility to mirror the contents of text files into SQLite tables with configurable names and fields;

  • provide a set of formats (CSV, TSV and so on) to simplify reading and persisting configuration data and such.

In contradistinction to the current version of MKTS Mirage, which uses one big table to hold all sources and has a few bells and whistles to accommodate the particular needs of DataMill, SQLite File Mirror uses one table per source file.

Source Descriptions

source_description =
  name:         'nickname'                            # used to identify this source in API calls
  path:         'path/to/source.txt'                  # where the source file is to be found
  fields:       [                                     # a list of field descriptions; see below
    { name: 'linenr', 'type': 'integer', }            # field description
    { name: 'text',   'type': 'text', null: false, }  # field description
    ]
  primary_key:  'linenr'                              # can be list of field names

The name of the source becomes the name of the table that will hold the file contents, so when you add a source with SFM.add_source { name: 'config', path: 'myfile.txt', }, you can then load the contents of myfile.txt with SFM.refresh 'config' and browse the resulting table with an SQL statement like select * from "config" order by linenr;.

Field Descriptions

The fields attribute of a source description may contain field descriptions:

field_description =
  name:     'foo "bar" baz' # will be escaped as a name, so can contain weird characters
  type:     'integer'       # will to be used as-is, so must make sense to SQLite
  default:  42              # will be escaped as a value; do not write an SQL literal
  null:     false           # set to true to omit `not null` clause
  unique:   false           # set to true to add `unique` clause

The Format, Max_Split, Transform Settings

format can be set to the name of a known format to have Mirage preprocess file contents. Its default value is plain, which is a no-op; other formats are

  • wsv for whitespace-separated values

  • tsv for tab-separated values

transform

General Behavior of _preprocess()

  • Blank lines—lines that match /^\s*$/u—will be skipped entirely. (### TAINT make configurable)

  • Commented lines will be skipped entirely; by default, all lines matching /^\s*#/u (read: lines that start with a hash #, optionally preceded by whitespace) are considered comments and silently skipped. (### TAINT make configurable)

  • Blank fields—again, fields that match /^\s*$/u—will be set to the default value for that field. Since it is possible to set the default to null while simultaneously configure the field to not allow nulls (which, in fact, is what happens unless the field is explicitly configured otherwise), this means that, by default, lines with missing values will trigger data integrity errors.

  • For each line, _preprocess() will always return as many fields as there are fields in that DB table; the first field will always be the line number, and the last field will always hold the remainders of the line with the full text of the unsplit line contents. Put more simply, each line gets split into as many parts as there are content fields minus one, and whatever material exceeds that many splits is returned in the last position; in effect, the last field acts a catch-all field. (### TAINT make configurable)

    As an illustration, consider a line like いち に さん よん    ご (observe that longish space between the last two fields), consisting of five whitespace-separated entries. We will use { format: 'wsv', } to configure this source, which stands for 'whitespace-separated values', a format where stretches of arbitrary whitespace characters are considered field separators.

    In case the source has been configured to contain columns lnr, c1, c2, c3, c4, c5, that means one will get something like [ 314, 'いち', 'に', 'さん', 'よん', 'ご', ] as a representation of that line: there's a line number and five content fields. If, however, one had configured one column more for that source, the result would've been [ 314, 'いち', 'に', 'さん', 'よん', 'ご', null, ]: all columns are filled, and the remainder is padded with default values (null in this case, but could be anything).

    If, on the other hand, one had configured the source with fields lnr, c1, c2, c3, then the result would have been [ 314, 'いち', 'に', 'さん よん    ご', ] instead. Observe how that longish space is now preserved in the value of the last (catch-all) field.

    As a logical consequence of the above rules, not configuring source fields means that no splitting is done at all no matter the format setting, since there is only a single content field named text in the basic configuration.

Mirroring Procedure

Steps:

  • File is read line-by-line;
  • each line is fed to the preprocessor which
  • The result is one list [ linenr, fields..., ] per unfiltered line where linenr is a positive integer and the fields are strings.
  • Each field value of each line is then subjected to casting, i.e. literal values are turned into appropriate data types (e.g. the string 'true' might be turned into a Boolean true and so on);
  • finally, a custom formatter is called with an object that represents the 'raw' (but cast) fields of each line; the formatter is expected to return either null (to indicate the line should be dropped) or else an object matching the names and data types of the table's configured target structure.

Performance Considerations

  • For unknown reasons, serializing the DB object to a file is unbearably slow;

  • this is in any event no fault of the application code as the problem goes away when using an in-memory DB (with new SqliteDB ':memory:');

  • currently the best strategy is to always use an in-memory DB and then use the DB.save() method to persist the data to disk:

    mirage = SFM.new_mirror new SqliteDB ':memory:'
    SFM.add_source mirage, { name: 'readme', path: 'README.md', }
    await SFM.refresh mirage
    SFM.save mirage, 'path/to/mirage.db'
  • The save() method used SQLite's vacuum into 'path'; command which is very fast indeed.

0.3.0

4 years ago

0.2.0

4 years ago

0.1.0

4 years ago

0.0.2

5 years ago