1.0.8 • Published 4 years ago

sicil

Licence

Apache-2.0

Version

1.0.8

Deps

Size

24 kB

Vulns

Weekly

Summary Dependency Versions

Sicil - Efficient Integer Stabbing

Quick Start

To install:

npm install sicil

To import:

const Sicil = require('sicil');

To construct an index on a set of intervals:

// Create a collection of intervals (with no duplicates).
let ivls = []
for (let i = 0; i < 100; i++) {
    let a = Math.floor(1 + 100*Math.random());
    let b = a + 3*(i + 1);
    ivls.push([a, b]);
}
let idx = new Sicil(ivls);

Find the intervals that overlap a given position:

let res = idx.find(42);

Find the intervals that overlap a given interval:

let res = idx.findRange([42, 61]);

Background

This module provides a clean Javascript implementation of the Integer Stabbing method described by Jens M. Schmidt in:

Schmidt J.M. (2009) Interval Stabbing Problems in Small Integer Ranges. In: Dong Y., Du DZ., Ibarra O. (eds) Algorithms and Computation. ISAAC 2009. Lecture Notes in Computer Science, vol 5878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10631-6_18

In this paper, the author presents a clever data structure that supports efficient queries over a set of (closed) intervals in a dense integer domain. The last bit is important, because it means we can use a O(1) lookup in an array to find things, and during construction, we can pre-compute the starting points for all queries.

The downside, is that in the context of genomic annotations, we want to query over genomic positions which are sparse, with respect to annoations. To address this, we introduce a pair of rank/select data structures to allow efficient mapping between the sparse and dense domains.

Consider the following set of intervals from the annotation of the gene TRAPPC4 on Chromosome 11:

    |-----------------------------------------------------------| [119018763,119025454]
    |-----------------------------------------------------------| [119018766,119025454]
    |-----------------------------------------------|             [119018766,119024134]
    |-------------------------------------------|                 [119018766,119023672]
    |------------------------------------------|                  [119018766,119023665]
    |------------------------------------------|                  [119018766,119023642]
    |----------------------------------------|                    [119018766,119023421]
    ||                                                            [119018766,119018970]
       ||                                                         [119019143,119019317]
       ||                                                         [119019143,119019210]
       |                                                          [119019143,119019188]
                ||                                                [119020150,119020253]
                ||                                                [119020205,119020253]
                              |-|                                 [119021760,119021886]
                               ||                                 [119021874,119021886]
                                            |-------|             [119023321,119024134]
                                            |---|                 [119023321,119023672]
                                            |--|                  [119023321,119023665]
                                            |--|                  [119023321,119023642]
                                            ||                    [119023321,119023421]
                                                          |-----| [119024857,119025454]
   -+---------+---------+---------+---------+---------+---------+-
    + 119018763         + 119020993         + 119023223         + 119025454
              + 119019878         + 119022108         + 119024338

If we take the positions, and condense them into ranks, the set of intervals becomes:

    |-------------------|     [119018763,119025454]/[0,20]
     |------------------|     [119018766,119025454]/[1,20]
     |----------------|       [119018766,119024134]/[1,18]
     |---------------|        [119018766,119023672]/[1,17]
     |--------------|         [119018766,119023665]/[1,16]
     |-------------|          [119018766,119023642]/[1,15]
     |------------|           [119018766,119023421]/[1,14]
     ||                       [119018766,119018970]/[1,2]
       |--|                   [119019143,119019317]/[3,6]
       |-|                    [119019143,119019210]/[3,5]
       ||                     [119019143,119019188]/[3,4]
           |-|                [119020150,119020253]/[7,9]
            ||                [119020205,119020253]/[8,9]
              |-|             [119021760,119021886]/[10,12]
               ||             [119021874,119021886]/[11,12]
                 |----|       [119023321,119024134]/[13,18]
                 |---|        [119023321,119023672]/[13,17]
                 |--|         [119023321,119023665]/[13,16]
                 |-|          [119023321,119023642]/[13,15]
                 ||           [119023321,119023421]/[13,14]
                       ||     [119024857,119025454]/[19,20]
   -+---------+---------+-----
    + 0       + 10      + 20

This is now in a suitable form for the the algorithm discussed in the Schmidt paper. We can construct a sparse array using the implentation of Okanohara and Sadakane's sdarray in the SDSL library by Simon Gog et al. This contains the positions of the coordinates of interest, and then we can use rank/select to map between the two domains. The beauty of this approach is that both operations are O(1).

The problem is that it doesn't readily support all the queries we want, since we might want to query for annotations at the genomic position chr11:119020256, which lies between two exons (i.e. in an intron). Since, it doesn't coincide with with any of the positions used to build the dense domain, and lies between positions 9 and 10, we can't efficiently query it directly. One solution would be to use the dense domain, and stab both positions 9 and 10, and take the intersection of the results, but this doubles the amount of work.

Instead, we use an augmented dense domain. When we construct the list of positions from the intervals, we note any positions that are not contiguous in the sparse (genomic) domain, and add an element to the dense domain to stand for all positions between the two. In practice, this approximately doubles the size of the domain, since most positions we want to mention have gaps between them.

Taking the example above, this yields the set of intervals:

     |---------------------------------------|     [119018763,119025454]/[1,41]
       |-------------------------------------|     [119018766,119025454]/[3,41]
       |---------------------------------|         [119018766,119024134]/[3,37]
       |-------------------------------|           [119018766,119023672]/[3,35]
       |-----------------------------|             [119018766,119023665]/[3,33]
       |---------------------------|               [119018766,119023642]/[3,31]
       |-------------------------|                 [119018766,119023421]/[3,29]
       |-|                                         [119018766,119018970]/[3,5]
           |-----|                                 [119019143,119019317]/[7,13]
           |---|                                   [119019143,119019210]/[7,11]
           |-|                                     [119019143,119019188]/[7,9]
                   |---|                           [119020150,119020253]/[15,19]
                     |-|                           [119020205,119020253]/[17,19]
                         |---|                     [119021760,119021886]/[21,25]
                           |-|                     [119021874,119021886]/[23,25]
                               |---------|         [119023321,119024134]/[27,37]
                               |-------|           [119023321,119023672]/[27,35]
                               |-----|             [119023321,119023665]/[27,33]
                               |---|               [119023321,119023642]/[27,31]
                               |-|                 [119023321,119023421]/[27,29]
                                           |-|     [119024857,119025454]/[39,41]
   -+---------+---------+---------+---------+------
    + 0       + 10      + 20      + 30      + 40

Now, our query can map to dense position 19, which correctly stabs between the relevant exons.

We implement this by generating the list of positions of interest from the intervals as before, and we construct the same sparse array, but in addition, we construct a list of the positions of the coordinates in the augmented dense domain. We can now construct a dense array with rank/select support. We can take ranks from the original sparse domain, and use select to determine the position in the augmented domain, and similarly, when we get an interval back from the the data structure, we can use rank on the end-points to get their ranks in the sparse domain, and we can then use select to recover the original genomic position. Again, we can perform these operations in O(1).