@peterwmwong/gto NPM

GTO: Gremlin TypeScript ORM

WARNING: This project is an experiment and has not been put into production yet.

Developer Getting Started

npm ci
npm run test-db-build-docker-image
npm run test-db-start
npm run test

Enforced consistency and correctness

Currently, our repository methods are ad-hoc groupings of raw read/write DB queries, which makes it hard to enforce consistency or correctness.

Example: IngestionRow's rowNumber

The Excel import path added the rowNumber property; the IRI/CSV import path did not. If repositories were object-oriented, with a consistent read/write view of properties, this would not have happened.

Example: IngestionRow's rowNumber PART 2.

Mike and I attempted to set rowNumber in the IRI/CSV import path, but only found out later that it was incorrectly set as a string instead of a number, effectively causing ingestion row chunking to take forever or blow up.
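A minimal sketch of how a typed property schema could catch both rowNumber mistakes at compile time. The names IngestionRowProps and createIngestionRow are hypothetical illustrations, not GTO's actual API.

```typescript
// Hypothetical property schema for an IngestionRow node (not GTO's real API).
interface IngestionRowProps {
  rowNumber: number; // declared once, as a number, for every import path
}

// Every import path creates rows through the same typed function, so a
// property cannot be skipped by one path or written with a different type.
function createIngestionRow(props: IngestionRowProps): IngestionRowProps {
  return { ...props };
}

// The Excel and IRI/CSV import paths would share this call shape.
const row = createIngestionRow({ rowNumber: 42 });

// Neither of these would compile:
//   createIngestionRow({});                  // rowNumber is missing
//   createIngestionRow({ rowNumber: "42" }); // string is not a number
```

With a single typed entry point, both bugs described above become compiler errors instead of runtime surprises.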

Benefits from GTO

Nodes and Edges are created and filtered with the correct properties and the correct property types
Traversals between Nodes and Edges are always correct
- ex. Prevent accidentally going from Ingestion to IngestionRow through the wrong edge (HAS_???)
- ex. Prevent accidentally using the wrong direction (in_? out?)
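One way such traversal checks could work is to encode each edge's endpoints in its type, so the compiler rejects a wrong edge or a wrong direction. This is a sketch under assumptions: EdgeDef, Traversal, from, and the EXAMPLE_EDGE label are all hypothetical (the real edge label is not given above), and the traversal only builds a Gremlin string instead of talking to a database.

```typescript
// Hypothetical schema types (illustrative only, not GTO's actual API).
type NodeLabel = "Ingestion" | "IngestionRow";

interface EdgeDef<From extends NodeLabel, To extends NodeLabel> {
  label: string;
  from: From;
  to: To;
}

// Placeholder edge; the label is an assumption made for this sketch.
const EXAMPLE_EDGE: EdgeDef<"Ingestion", "IngestionRow"> = {
  label: "EXAMPLE_EDGE",
  from: "Ingestion",
  to: "IngestionRow",
};

class Traversal<Current extends NodeLabel> {
  constructor(readonly current: Current, readonly steps: string[]) {}

  // Only edges that start at the current node type are accepted.
  out<To extends NodeLabel>(edge: EdgeDef<Current, To>): Traversal<To> {
    return new Traversal(edge.to, [...this.steps, `out('${edge.label}')`]);
  }

  // Only edges that end at the current node type are accepted.
  in_<From extends NodeLabel>(edge: EdgeDef<From, Current>): Traversal<From> {
    return new Traversal(edge.from, [...this.steps, `in('${edge.label}')`]);
  }

  toGremlin(): string {
    return ["g", ...this.steps].join(".");
  }
}

function from<L extends NodeLabel>(label: L): Traversal<L> {
  return new Traversal(label, [`V().hasLabel('${label}')`]);
}

const rows = from("Ingestion").out(EXAMPLE_EDGE);
const back = from("IngestionRow").in_(EXAMPLE_EDGE);

// This would not compile, because EXAMPLE_EDGE does not start at IngestionRow:
//   from("IngestionRow").out(EXAMPLE_EDGE);
```

Here rows.toGremlin() yields g.V().hasLabel('Ingestion').out('EXAMPLE_EDGE'), and the type of rows records that the traversal now sits on IngestionRow nodes.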

FUTURE IDEA: DB/Query Metrics/Statistics

Individual
Aggregate
- Which queries take the longest?
- What are the most frequent queries?
- What are the biggest queries?
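A rough sketch of what an aggregate collector might look like, assuming every query is routed through one wrapper. QueryMetrics and its methods are hypothetical and not part of GTO; result size ("biggest queries") could be tracked the same way but is left out here.

```typescript
// Hypothetical metrics collector (illustrative only, not part of GTO).
interface QueryStat {
  count: number;   // how many times the query ran
  totalMs: number; // summed duration across all runs
  maxMs: number;   // slowest single run
}

class QueryMetrics {
  private stats = new Map<string, QueryStat>();

  // Wrap an async query so its duration is recorded, even if it throws.
  async measure<T>(name: string, run: () => Promise<T>): Promise<T> {
    const start = Date.now();
    try {
      return await run();
    } finally {
      this.record(name, Date.now() - start);
    }
  }

  // Record one run of a named query.
  record(name: string, ms: number): void {
    const stat = this.stats.get(name) || { count: 0, totalMs: 0, maxMs: 0 };
    stat.count += 1;
    stat.totalMs += ms;
    stat.maxMs = Math.max(stat.maxMs, ms);
    this.stats.set(name, stat);
  }

  // Name of the query with the highest score, or undefined if none ran.
  private top(score: (stat: QueryStat) => number): string | undefined {
    let best: string | undefined = undefined;
    let bestScore = -Infinity;
    this.stats.forEach((stat, name) => {
      const value = score(stat);
      if (value > bestScore) {
        best = name;
        bestScore = value;
      }
    });
    return best;
  }

  slowest(): string | undefined {
    return this.top((stat) => stat.maxMs);
  }

  mostFrequent(): string | undefined {
    return this.top((stat) => stat.count);
  }
}

// Durations are recorded by hand here to keep the example deterministic.
const metrics = new QueryMetrics();
metrics.record("allIngestions", 12);
metrics.record("allIngestions", 8);
metrics.record("rowsForIngestion", 250);
```

In real use, a call such as metrics.measure("allIngestions", () => runQuery()) would wrap each query, where runQuery stands in for whatever function executes it.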

FUTURE IDEA: Automated DB validation

It is still possible for the database's structure to be tampered with outside of the application (JupyterHub, direct '/gremlin' access).

As GTO provides a single source of truth/schema for the DB, we could easily build a script that runs through each GTO Node, Edge, and their properties and makes sure the database is still in sync with the schema.

ex. Using Node.name and new Node(g).properties, query for nodes that are missing required properties, have mis-typed properties, have extra properties, etc.
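The per-node check could be as small as the sketch below, which compares one node's fetched properties against a schema. Schema, validateProps, and the schema contents are assumptions made for illustration; how GTO actually exposes its property definitions is not shown here.

```typescript
// Hypothetical schema shape: property name -> expected JavaScript type.
type PropType = "string" | "number";
type Schema = Record<string, PropType>;

// Assumed schema; only rowNumber is taken from the examples above.
const ingestionRowSchema: Schema = { rowNumber: "number" };

// Compare one node's properties against the schema and list every mismatch.
function validateProps(schema: Schema, props: Record<string, unknown>): string[] {
  const issues: string[] = [];
  for (const name of Object.keys(schema)) {
    if (!(name in props)) {
      issues.push(`missing: ${name}`);
    } else if (typeof props[name] !== schema[name]) {
      issues.push(
        `mis-typed: ${name} (expected ${schema[name]}, got ${typeof props[name]})`
      );
    }
  }
  for (const name of Object.keys(props)) {
    if (!(name in schema)) {
      issues.push(`extra: ${name}`);
    }
  }
  return issues;
}

const tampered = validateProps(ingestionRowSchema, { rowNumber: "7", color: "red" });
// tampered is ["mis-typed: rowNumber (expected number, got string)", "extra: color"]

const valid = validateProps(ingestionRowSchema, { rowNumber: 7 });
// valid is []
```

A validation script would run this over the properties of every node and edge of each label and report any non-empty result.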

A more accessible Graph DB

Currently, the learning curve for Product, QA, and Developers to access data in the DB is steep for a number of reasons:

Gremlin Querying
- Not as widely known as other DB query languages (ex. SQL)
- Fewer Stack Overflow answers
- Less documentation
- Little-to-no tooling support (is this Gremlin query syntactically correct?)
No Schema
- Unlike SQL DBs, where out-of-the-box tooling can surface tables, columns (name, type), relationships between tables... Neptune does not.
- This makes it hard to even know where to begin when trying to access data:
  - What nodes/edges are available?
  - What properties do nodes/edges have, and what are their types (number? string?)
  - Which direction is the edge? (inE? outE? in_? out?)
- Currently, the structure of the Graph DB is enforced by our code.
- Even worse, the code currently has no single source of truth for which nodes/edges exist or for the properties (name/type) on those nodes/edges.
Constants
- Labels for Nodes/Edges and property names are mostly in flat "lists" of constants
- It is incredibly easy to use the wrong constant in the wrong place. Nothing stops you from using P_VERTEX_TYPE when querying against an Edge.
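To illustrate the constants problem, the sketch below uses branded string types so that a vertex property constant cannot be passed where an edge property is expected. The value given to P_VERTEX_TYPE, the P_EXAMPLE_EDGE_PROP constant, and both helper functions are assumptions made for this example, not the project's real definitions.

```typescript
// Branded key types: plain strings at runtime, but not interchangeable
// as far as the type checker is concerned.
type VertexKey = string & { readonly __kind: "vertex" };
type EdgeKey = string & { readonly __kind: "edge" };

// The constant name comes from the text above; its value is assumed.
const P_VERTEX_TYPE = "vertexType" as VertexKey;
// Entirely hypothetical edge property constant.
const P_EXAMPLE_EDGE_PROP = "exampleEdgeProp" as EdgeKey;

// Hypothetical helpers that build a Gremlin filter as a string.
function hasVertexProp(key: VertexKey, value: string): string {
  return `V().has('${key}', '${value}')`;
}

function hasEdgeProp(key: EdgeKey, value: string): string {
  return `E().has('${key}', '${value}')`;
}

const vertexFilter = hasVertexProp(P_VERTEX_TYPE, "Ingestion");
const edgeFilter = hasEdgeProp(P_EXAMPLE_EDGE_PROP, "x");

// This would not compile, because a VertexKey is not an EdgeKey:
//   hasEdgeProp(P_VERTEX_TYPE, "Ingestion");
```

With flat untyped string constants, the commented-out call would be accepted and the mistake would only surface at query time.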

Benefits from GTO

Single source of truth for a Node/Edge and relationship between Nodes and Edges
Type/Editor driven querying
- Type information provides users accurate hints on what's possible and valid
  - ex. Typing Ingestion. offers: all, byId
  - ex. Typing Ingestion.all(g, { offers: source (property)
  - ex. Typing Ingestion.all(g, {source: 'Annotator'}). offers: having, count, fetchOne, fetchAll, IngestionRows
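The editor hints above fall out of ordinary TypeScript typing. Below is a simplified sketch of how a filter argument could be tied to a node's property schema. IngestionProps and this stand-in Ingestion object are hypothetical; the real call also takes the traversal source g, which is omitted so the sketch stays self-contained, and it returns a Gremlin string instead of a query object.

```typescript
// Assumed property schema; only `source` appears in the examples above.
interface IngestionProps {
  source: string;
}

// Hypothetical stand-in for a GTO node class.
const Ingestion = {
  // The filter accepts only known property names with matching value types,
  // which is what lets an editor suggest `source` inside the braces.
  all(filter: Partial<IngestionProps> = {}): string {
    const keys = Object.keys(filter) as (keyof IngestionProps)[];
    const hasSteps = keys.map((key) => `.has('${key}', '${filter[key]}')`);
    return `g.V().hasLabel('Ingestion')${hasSteps.join("")}`;
  },
};

const query = Ingestion.all({ source: "Annotator" });

// Neither of these would compile:
//   Ingestion.all({ sorce: "Annotator" }); // unknown property name
//   Ingestion.all({ source: 42 });         // wrong value type
```

In this sketch, query is the string g.V().hasLabel('Ingestion').has('source', 'Annotator').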

Discoveries

Gremlin: GraphTraversalSource, GraphTraversal, and Statics (Anonymous Traversal) each expose a different set of steps. In the table below, ✖ marks a step that is available.

Step     GraphTraversalSource  GraphTraversal  Statics
E        ✖
V        ✖                     ✖               ✖
addE     ✖                     ✖               ✖
addV     ✖                     ✖               ✖
toList                         ✖
iterate                        ✖
next                           ✖