@ebay/flow-telemetry

flow-telemetry

Adding observability to feature flow with OpenTelemetry.

More and more companies rely on software as a competitive advantage, allowing them to serve their customers better, understand and optimize their business, improve their operations, and so on. But building software is not the same as building physical objects. There is no factory floor where you can see what is going on. It is also not a repeatable process where every widget is built in exactly the same way with the same parts.

Seminal books like Accelerate and Project to Product have shown us what we need to measure to observe and improve the flow of software delivery. Many organizations are building tools to help them measure and visualize how their software is being built, and some companies are starting to provide platforms that deliver some of this tooling. But what we're lacking is a shared, open-source framework for software delivery observability.

OpenTelemetry has allowed the software community to build an ecosystem of tools and libraries for observability of software systems. It helps us answer questions like "why is it taking so long to render the page?" and "what happened that caused this significant increase in latency?"

But these are the same questions we have for feature delivery. Just like we want to improve and maintain latency and throughput as we scale our software, we want to maintain and improve the speed and throughput of our feature delivery as we scale our organization.

This project is a space where we as a community can start adding OpenTelemetry observability into feature delivery, allowing us to have the same rich ecosystem of dashboards, tools and libraries for digging deep into how we are building software that we currently have for observing software systems.

Conceptual mapping

OpenTelemetry models everything as Signals. There are three types of signals: traces, logs, and metrics. Let's look at how those concepts apply when we are looking at feature flow rather than software system flow.

Tracing

In order for us to understand how our software delivery systems are behaving and where the hidden bottlenecks are, we need to trace the flow of a feature through our systems, from concept to delivery. This maps to an OpenTelemetry Trace.

For example, for many companies a JIRA Epic maps to a new feature. This becomes our Root Span. Each story under the epic represents a new span under the epic. A pull request in GitHub would start a new span. Jobs that are run as part of PR validation would be spans under the PR span.

A commit kicks off a CI/CD pipeline, and each job in the pipeline is a new span. Each step in a job, such as building the software, running unit tests, and publishing results, is a new span under that job.
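
The hierarchy above can be sketched without any dependencies. The snippet below only illustrates the parent/child mapping and is not this project's implementation: the startSpan helper, the ticket keys, and the PR number are hypothetical, and it assumes that a pull request hangs under a story and that pipeline jobs hang under the pull request. A real exporter would create these spans through an OpenTelemetry SDK.

```javascript
// Hypothetical, dependency-free model of the epic > story > PR > job > step hierarchy.
let nextId = 1;

function startSpan(name, parent = null, attributes = {}) {
  return {
    spanId: String(nextId++),
    // Every span for one feature shares the trace id of the root (epic) span.
    traceId: parent ? parent.traceId : `trace-${name}`,
    parentSpanId: parent ? parent.spanId : null,
    name,
    attributes,
  };
}

const epic = startSpan('epic FLOW-1'); // root span
const story = startSpan('story FLOW-2', epic);
const pullRequest = startSpan('pull request #7', story);
const validation = startSpan('job: pr-validation', pullRequest);
const pipelineJob = startSpan('job: build', pullRequest);
const step = startSpan('step: unit tests', pipelineJob);

const spans = [epic, story, pullRequest, validation, pipelineJob, step];

// Follow the parent links from a span up to the root and return the path.
function pathToRoot(span) {
  const byId = new Map(spans.map((s) => [s.spanId, s]));
  const path = [];
  for (let s = span; s; s = byId.get(s.parentSpanId)) path.unshift(s.name);
  return path.join(' > ');
}

console.log(pathToRoot(step));
// epic FLOW-1 > story FLOW-2 > pull request #7 > job: build > step: unit tests
```

Because every span carries its parent's id, a tracing backend can reconstruct the whole feature as one tree from the flat list of spans.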

[Figure: epic span]

Imagine if you could use OpenTelemetry dashboards to view the overall cycle time of feature delivery, drill into a specific feature to see where the time was being spent, or run queries to understand patterns of behavior across a series of features. With this observability, you could take advantage of all the visualization and querying products and tools that already exist in the OpenTelemetry ecosystem.
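
As a rough illustration of the "where was the time spent" question, the sketch below computes the cycle time of one feature and the time taken by each direct child of the root span. The span records, names, and timestamps are made up for the example. In practice a tracing backend would answer this query for you.

```javascript
// Hypothetical span records for one delivered feature (times in UTC).
const featureSpans = [
  { name: 'epic', parent: null, start: '2024-03-01T09:00:00Z', end: '2024-03-11T09:00:00Z' },
  { name: 'story: design', parent: 'epic', start: '2024-03-01T09:00:00Z', end: '2024-03-03T09:00:00Z' },
  { name: 'story: implement', parent: 'epic', start: '2024-03-03T09:00:00Z', end: '2024-03-09T09:00:00Z' },
  { name: 'story: release', parent: 'epic', start: '2024-03-09T09:00:00Z', end: '2024-03-11T09:00:00Z' },
];

const MS_PER_HOUR = 60 * 60 * 1000;
const durationHours = (span) => (Date.parse(span.end) - Date.parse(span.start)) / MS_PER_HOUR;

// The root span (no parent) covers the whole feature, so its duration is the cycle time.
const rootSpan = featureSpans.find((span) => span.parent === null);
const cycleTimeHours = durationHours(rootSpan);

// Time spent in each direct child of the root, longest first.
const breakdown = featureSpans
  .filter((span) => span.parent === rootSpan.name)
  .map((span) => ({ name: span.name, hours: durationHours(span) }))
  .sort((a, b) => b.hours - a.hours);

console.log(`cycle time: ${cycleTimeHours}h`);
for (const { name, hours } of breakdown) {
  console.log(`${name}: ${hours}h (${Math.round((hours / cycleTimeHours) * 100)}%)`);
}
```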

Events

Events capture data about the system at a specific moment in time. For example, you can have events that tell you when a JIRA item changed state, when a comment was added to a ticket, when somebody was assigned, when a change was committed in GitHub, and so on.

In OpenTelemetry, events are modeled as a specific type of log.
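
To make this concrete, here is a minimal sketch of turning a ticket state change into a log-record-like event. The input shape, the toEvent helper, the event name, and the attribute keys are all assumptions made for illustration. They loosely follow the OpenTelemetry log data model but are not taken from this project or from an official semantic convention.

```javascript
// Hypothetical shape of a ticket state change as received from an issue tracker.
const change = {
  ticket: 'FLOW-2',
  from: 'In Progress',
  to: 'Blocked',
  at: '2024-01-15T10:00:00Z',
};

// Convert the change into a log-record-like event: a timestamp, a name,
// a human-readable body, and structured attributes for querying.
function toEvent(stateChange) {
  return {
    timestamp: Date.parse(stateChange.at), // milliseconds since the Unix epoch
    eventName: 'ticket.state_changed',
    body: `${stateChange.ticket}: ${stateChange.from} -> ${stateChange.to}`,
    attributes: {
      'ticket.key': stateChange.ticket,
      'ticket.state.from': stateChange.from,
      'ticket.state.to': stateChange.to,
    },
  };
}

const stateChangeEvent = toEvent(change);
console.log(JSON.stringify(stateChangeEvent, null, 2));
```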

Metrics

OpenTelemetry defines a metric as a measurement about a service, captured at runtime.

In the feature flow world, a "service" is any tool or process we use in support of delivering a feature, and "runtime" means while the feature is being built and delivered.

Some examples of metrics around feature flow we might want to deliver to the OpenTelemetry collector:

Number of comments on a pull request or a JIRA ticket
How many times a ticket was marked as blocked
The number of times a build job was retried
How much memory was used to compile a project
Code coverage for a test run
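
Count-style metrics from the list above, such as blocked tickets or retried build jobs, could be recorded roughly as follows. FlowCounter is a hypothetical in-memory stand-in whose add(value, attributes) shape mirrors the OpenTelemetry Counter instrument. The metric names and attributes are illustrative assumptions, and a real setup would obtain counters from an OpenTelemetry meter instead.

```javascript
// Minimal in-memory counter used as a stand-in for an OpenTelemetry Counter.
class FlowCounter {
  constructor(name) {
    this.name = name;
    this.totals = new Map(); // one running total per attribute set
  }

  // Add a non-negative increment for the given attribute set.
  add(value, attributes = {}) {
    // Simplification: the key depends on attribute insertion order.
    const key = JSON.stringify(attributes);
    this.totals.set(key, (this.totals.get(key) || 0) + value);
  }

  get(attributes = {}) {
    return this.totals.get(JSON.stringify(attributes)) || 0;
  }
}

const blocked = new FlowCounter('ticket.blocked.count');
const retries = new FlowCounter('build.job.retries');

blocked.add(1, { ticket: 'FLOW-2' }); // ticket marked as blocked
blocked.add(1, { ticket: 'FLOW-2' }); // ...and blocked again later
retries.add(1, { job: 'unit-tests' }); // build job retried once

console.log(blocked.get({ ticket: 'FLOW-2' })); // 2
console.log(retries.get({ job: 'unit-tests' })); // 1
```

Values that go up and down or are sampled, such as memory used by a compile or code coverage for a test run, would fit a gauge or histogram better than a counter.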

OpenTelemetry for the win!

As you can see, OpenTelemetry is a great fit for how we model and observe feature flow in our software delivery pipeline. By taking advantage of this well-established model and community, we can start gaining deep, powerful and actionable insights into what is going on as we are building features.

When we combine this with the science, math, and concepts behind Lean, Continuous Delivery, Accelerate, and the Flow Framework, we can much more readily understand what to focus on and what to improve, so that our efforts to optimize and improve feature delivery are effective and based on real, visible data.

Installation

Install node and npm
Run npm install

Usage

The service is currently hardcoded to talk to a local Jaeger instance. You can run Jaeger locally by running tools/run-local-jaegar.sh.

If you would like it to talk to a different OpenTelemetry collector, you can create a provider under github/opentelemetry and pass that provider in index.js when you create the GithubEventProcessor.

We would like to make the collector configurable, but that is not yet implemented.

Once you have Jaeger running, you can start the service by running npm start.