collabo-flow v0.1.0 • Published 8 years ago • License: MIT • Repository: github

CollaboFlow


'CollaboFlow' is a node.js environment for the coordinated execution of flows (or workflows) whose tasks can be implemented in node, python, and other languages. It is part of the CollaboScience initiative.

Introduction

Collabo-Flow Class diagram

Flow

'Flow' is a structured set of tasks that, taken together, serve a particular purpose. The purpose could be to parse a set of data in a particular way, to analyse the corpus of an author (see Bukvik: http://bukvik.litterra.info), or to crawl Wikipedia and analyse community behavior (see Democracy-Framework - Wikipedia: http://cha-os.github.io/democracy-framework-wikipedia/), and so on.

A flow can be executed fully or partially. A user can run the entire flow, a particular task, or a particular task with dependencies, i.e. the task together with all the other tasks that it depends upon. For example, a task calculating some data may depend on the task that loads that data in the first place (its 'dependency').
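The "task with dependencies" mode described above can be sketched as a transitive dependency walk. The function name and the dependency-map shape below are illustrative assumptions, not CollaboFlow's actual API:

```javascript
// Illustrative sketch: collect a task together with all of its
// (transitive) dependencies, ordered so every dependency runs before
// the task that needs it. 'dependsOn' maps a task name to the names
// of the tasks it depends upon.
function tasksToRun(taskName, dependsOn) {
  const ordered = [];
  const visited = new Set();
  function visit(name) {
    if (visited.has(name)) return;
    visited.add(name);
    for (const dep of dependsOn[name] || []) visit(dep);
    ordered.push(name); // dependencies are pushed before the task itself
  }
  visit(taskName);
  return ordered;
}

// Example: 'calc' depends on 'load', so running 'calc' with
// dependencies also runs 'load', and runs it first.
console.log(tasksToRun("calc", { calc: ["load"], load: [] }));
// → [ 'load', 'calc' ]
```

Running the entire flow would simply apply the same walk to every task in the flow.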


Task

'Task' describes a clearly isolated unit of work: both 'what' has to be done and 'how' it has to be done. 'How' a task is executed is described by referring to a specific 'processor' that will perform the work. 'What' the task works with is described by a set of 'datasets' that should be used for performing the intended work.


Processor

'Processor' is the 'code-implementation' of a particular clearly defined type of work, called up by a task. A processor usually takes the form of an internal or external module in the system. It can be encoded in different programming languages (JavaScript (nodejs), Python, ...). Each task relates one-to-one with the processor that performs it.

Users can:

  1. use preexisting processors from free or commercial bundles (packages),
  2. develop new processors themselves, or hire someone to develop them, or
  3. 'wrap' existing complex tools by creating light wrapper-processors that make those tools understandable to CollaboFlow.


Dataset

A 'dataset' is a reference to data of any type (and cardinality) organized under a single logical/conceptual unit and identified with a unique namespace(:name). A dataset has various metadata associated with it and can be versioned; overall, it is a set of data that a task can uniquely access, load, update, or store.


Port

A 'port' is a structural unit that connects tasks and datasets. Every task can have multiple named ports for accessing datasets. The two most common ports are 'in' and 'out': they contain the input datasets a task uses to retrieve all data necessary for processing, and the output datasets a task uses for storing its working results, respectively. Usually each port contains just one dataset, but when a task processes an array of datasets, a port can contain an arbitrarily large array of datasets.

Each port has its semantically described direction. It can be one of the following:

  • in: datasets will be used just for reading from
  • out: datasets will be used just for writing to
  • inout: datasets will be used both for reading from and writing to
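The three directions above map to the access a port grants. A minimal sketch, with helper names that are assumptions for illustration rather than CollaboFlow's API:

```javascript
// Illustrative helpers mapping each declared port direction to the
// dataset access it allows (helper names assumed, not CollaboFlow API).
const DIRECTIONS = new Set(["in", "out", "inout"]);

function canRead(port)  { return port.direction === "in"  || port.direction === "inout"; }
function canWrite(port) { return port.direction === "out" || port.direction === "inout"; }

console.log(canRead({ direction: "inout" }), canWrite({ direction: "in" }));
// → true false
```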


Namespace and name

'Namespace' is a notation used for organizing both tasks and datasets. It consists of a set of words separated with periods, for example:

  • df-wiki.get-pages-revisions.by-pages-list or
  • <NAMESPACE_DATA>.wiki-user.registered.contribution, or
  • <NAMESPACE_DATA>.wiki-user.registered.talk-pages.snapshots.

The last two namespaces share the same beginning: <NAMESPACE_DATA>.wiki-user.registered.

To fully specify (address) one entity, either a dataset or a task, we use both namespace and the name of the entity in the form 'namespace:name'.

'Dependency' between tasks is mainly structured through the datasets they use. For example, if taskB consumes dataset1 (through its input port), and taskA produces dataset1 (through its output port), then we say that 'taskB depends on taskA'.
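The 'namespace:name' addressing and the dataset-based dependency rule above can be sketched together. The task shape below mirrors the JSON examples later in this README, but the helper names are assumptions, not CollaboFlow's actual API:

```javascript
// Illustrative sketch: split a 'namespace:name' address, and derive
// task-to-task dependencies from the datasets their ports reference.
function parseNsN(nsN) {
  const i = nsN.lastIndexOf(":");
  return { namespace: nsN.slice(0, i), name: nsN.slice(i + 1) };
}

function dependencies(tasks) {
  // producers: dataset address -> name of the task that writes it
  const producers = new Map();
  for (const t of tasks)
    for (const port of Object.values(t.ports))
      if (port.direction !== "in") // 'out' and 'inout' ports produce
        for (const d of port.datasets)
          producers.set(`${d.namespace}:${d.name}`, t.name);

  // deps: consumer task name -> names of producer tasks it depends on
  const deps = {};
  for (const t of tasks) {
    deps[t.name] = [];
    for (const port of Object.values(t.ports))
      if (port.direction !== "out") // 'in' and 'inout' ports consume
        for (const d of port.datasets) {
          const p = producers.get(`${d.namespace}:${d.name}`);
          if (p && p !== t.name) deps[t.name].push(p);
        }
  }
  return deps;
}
```

With taskA writing dataset1 through an 'out' port and taskB reading it through an 'in' port, `dependencies` yields `{ taskA: [], taskB: ["taskA"] }`, i.e. taskB depends on taskA.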

Examples

Task

{
	"type": "task",
	"namespace": "<NAMESPACE_TASK>.execute",
	"name": "titles-to-pageids",
	"comment": "Converts list of titles (+adding title prefix) into list with titles and pageids",
	"processorNsN": "df-wiki.get-pages.titles-to-pageids",
	"execution": 0,
	"forceExecute": true,
	"isPower": true,
	"params": {
		"outputFileFormats": ["json", "csv"],
		"processorSubType": "titles-to-pageids",
		"pagePrefix": "",
		"wereSpecialPages": false,
		"retrievingDepth": 0,
		"wpNamespacesToRetrieve": [0],
		"everyWhichToRetrieve": 1,
		"csvSeparator": ", ",
		"storeInDb": false,
		"transferExtraFields": ["popularity", "accuracy"]
	},
	"ports": {
		"in": {
			"direction": "in",
			"datasets": [
				{
					"namespace": "<NAMESPACE_DATA>",
					"name": "zeitgeist_2008-2013"
				}
			]
		},
		"listOut": {
			"direction": "out",
			"datasets": [
				{
					"namespace": "<NAMESPACE_DATA>",
					"name": "zeitgeist_2008-2013-with-ids"
				}
			]
		}
	}
}

Notably, there are a few important aspects:

  • type: tells us that this entity in the flow is a task
  • namespace: the namespace that the task belongs to
  • name: the name of the task (inside the namespace)
  • comment: a free-form comment about the task (what it does, etc)
  • processorNsN: the full namespace:name path to the processor that will execute the task
  • params: a set of properties passed to the processor to tweak the task's execution
  • ports: the set of the task's ports, where each port has an array of datasets the task can access. In this case we have two ports, each with one dataset associated with it.

For more details, please see TaskStructure (collabo-flow.Task.TaskStructure).

Ports and dataset

Creating a flow

In order to create a new flow, the user needs to create a new JSON file with the flow's description in it (see FlowStructure: collabo-flow.Flow.FlowStructure).

Example of a blank flow:

{
	"name": "Micro-analysis",
	"authors": "Sinisha Rudan, Sasha Mile Rudan, Lazar Kovacevic",
	"version": "0.1",
	"description": "Micro-analyses of Wikipedia behavior",
	"environment": {
		"variables": {
			"NAMESPACE_DATA": "democracy-framework.wikipedia"
		}
	},
	"data": [
	]
}

Note: Our sub-project Bukvik (http://bukvik.litterra.info) supports visualization and editing of flows, but we are still working on migrating it to the more general context necessary to visualize, edit, and run collabo-flows.

The flow can then be executed from the command line:

node df-wiki/exec.js micro-analysis.json

After setting up all the metadata in the newly created flow, the user can add the first task (see TaskStructure: collabo-flow.Task.TaskStructure). For each task, the user needs to find the appropriate processor (see the Processor implementation: collabo-flow.Processor), or create a new one. Some easy-to-understand processors can be found among the core Collabo-Flow processors (collabo-flow.processors).

We also need to set up the root of the namespace; at the moment this is done through the FFS-constructor setting in 'collabo-flow.execute-flow.js'.

Creating a new processor

Create a new nodejs file: analyse-wp-polices.js and store it with other processors: df-wiki/processors/analyse-wp-polices.js.

Write stub content for the new processor.

Creating a new sub-processor

Register it in 'collabo-flow.execute-flow:loadProcessors' :

// analyze-wp-polices
this.processors['df-wiki.processors.analyze-wp-polices.analyze-user-talk-pages'] = "../df-wiki/processors/analyze-wp-polices";

Possible Scenarios

Democracy Framework

Bukvik

multiple environments

  • python,
  • nodejs, and
  • potentially more (R, java, ...)

communicating through messaging

communicating through streams

communicating through caches

communicating through filesystem storage

communicating through databases

communicating through services

need to control tasks'/services' execution

need to pipeline

need to verify which parts of dataflow have to be reprocessed/updated

LitTerra

ToDo

CollaboFlow


Release History