0.1.0 • Published 1 year ago

@chanzuckerberg/edu-platform-observability v0.1.0

edu-platform-observability

The edu-platform-observability library is intended to provide a "paved road" for all edu platform services that wish to capture telemetry data. This data includes logs, traces, and metrics. We are striving for the following:

  • Minimize friction in setting up observability for a new service.
  • Encourage standardization of a basic set of telemetry to report.
  • Encourage standardization of telemetry metadata (e.g., log attributes).
  • Minimize the effort required to make strategic changes to how we sample or format our data, and to which downstream tools we use.

This library makes use of OpenTelemetry and Winston.

Installation

npm install @chanzuckerberg/edu-platform-observability

Note that this package is not currently published to npm. Whether we should publish it to our public npm repo is TBD.

Usage

import path from 'node:path';
import {init} from '@chanzuckerberg/edu-platform-observability';
import type {TelemetryConfig} from '@chanzuckerberg/edu-platform-observability';
import {createRequestHandler} from '@remix-run/express';
import type {AppLoadContext} from '@remix-run/node';
import express from 'express'; // needed for the express() call below
import type {Response, Request} from 'express';

const BUILD_DIR = path.join(process.cwd(), 'build');

const telemetryConfig: TelemetryConfig = {
  serviceName: 'my-service',
  reportExpressRoutes: false // Set to 'false' for remix apps
};
const telemetry = init(telemetryConfig);

// For all services (vanilla express, remix express, or apollo express)
const app = express();
app.use(telemetry.createMiddleware());

// Additional steps for express-remix:
const getLoadContext = (
  req: Request,
  res: Response,
): AppLoadContext => {
  return telemetry.createExpressRemixContext(req, res);
};

app.all('*', createRequestHandler({
  build: telemetry.instrumentRemixBuild(require(BUILD_DIR)),
  getLoadContext,
}));

// setup rest of app

This setup alone provides the following basic functionality:

  • All requests are logged (request received, request sent, and request error).
  • Traces include spans for all outgoing HTTP requests, as well as remix action and loader functions.
  • Traces are sampled (100% sampling when running locally, and 10% when running on happy infra).
  • All requests are metered with a histogram.
  • On happy infra: logs, traces, and metrics will be properly captured and made available.
  • Node runtime metrics are exposed (these all have the nodejs prefix).
  • Support for metrics and traces in a local development environment (see below).
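
To make the sampling behavior above concrete, here is a minimal sketch of a deterministic, ratio-based sampling decision. It is not the library's implementation and not OpenTelemetry's exact algorithm: the names samplingRatio and shouldSample are hypothetical, and the decision uses only the first 32 bits of the trace ID for simplicity.

```typescript
// Hypothetical helper: the ratios come from the list above
// (100% locally, 10% on happy infra).
function samplingRatio(isDev: boolean): number {
  return isDev ? 1.0 : 0.1;
}

// Simplified decision: map the first 8 hex characters (32 bits) of the
// trace ID onto [0, 1) and keep the trace if it falls below the ratio.
// The same trace ID always yields the same decision.
function shouldSample(traceId: string, ratio: number): boolean {
  const prefix = parseInt(traceId.slice(0, 8), 16);
  return prefix / 0x100000000 < ratio;
}
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same choice, which keeps sampled traces complete.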

All express handlers will have telemetry tools accessible through the res.locals object:

const {
  logger,
  tracer,
  meter
} = res.locals as TelemetryContext;

All remix loader and action functions will have telemetry tools available through context:

export async function loader({request, context, params}: LoaderArgs) {
  const {tracer, meter, logger} = context as TelemetryContext;
  //...
}

Local Development

When running on AWS, OpenTelemetry collection is enabled by default (this corresponds to the enableCollection property in the config). To use the local telemetry tools, you must enable collection explicitly. The best way to do this is to set the ENABLE_OTEL_COLLECTION env var to true, typically in your .env file or local docker-compose.yml.

Then, a local telemetry stack can be spun up with the following command (executed at the root of your project):

npx -p @chanzuckerberg/edu-platform-observability telemetry-up

To shut down the stack:

npx -p @chanzuckerberg/edu-platform-observability telemetry-down

In your browser, you can view traces and metrics at:

  • Zipkin: http://localhost:9411
  • Prometheus: http://localhost:9090

Alternatively, you can enable console telemetry like this:

const telemetryConfig = {
  //...
  enableConsoleTracingAndMetrics: true, //or env var ENABLE_CONSOLE_TRACING_AND_METRICS = true
};

Time Measurement

Use the TimeMeasurement class to do time measurement for service-specific metrics. Example:

import {TimeMeasurement} from '@chanzuckerberg/edu-platform-observability'

const measurement = new TimeMeasurement();

//do stuff

const elapsedTime = measurement.getElapsedMs();

histogram.record(elapsedTime, histogramAttributes);
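
If the same measure-then-record pattern repeats in many places, it can be wrapped in a small helper. The sketch below is an illustration only: HistogramLike and measureSync are hypothetical names, and it uses Date.now() instead of the library's TimeMeasurement class so that it runs without the package installed.

```typescript
// Minimal stand-in for the part of a histogram used here
// (a record method taking a value and optional attributes).
interface HistogramLike {
  record(value: number, attributes?: Record<string, string>): void;
}

// Hypothetical helper: runs fn, then records the elapsed milliseconds,
// even if fn throws.
function measureSync<T>(
  histogram: HistogramLike,
  attributes: Record<string, string>,
  fn: () => T,
): T {
  const start = Date.now();
  try {
    return fn();
  } finally {
    histogram.record(Date.now() - start, attributes);
  }
}
```

In a real service, the Date.now() calls would be replaced by a TimeMeasurement instance and its getElapsedMs() method, as in the example above.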

Apollo Plugin

To instrument an Apollo server with metrics and logging, you can do the following:

const server = createApolloServer([telemetry.createApolloPlugin()]);

Configuration

The following configuration options are available in TelemetryConfig. Some have defaults, and some have alternative environment variables that can be used if the value is not provided in TelemetryConfig.

Option | Meaning | Environment Variable | Default Value
--- | --- | --- | ---
isDev | Indicates that the service is running in a local dev env. | (none) | !process.env.DEPLOYMENT_STAGE
enableConsoleTracingAndMetrics | If true, and isDev is true, metrics and tracing are output to the console. Very noisy! | ENABLE_CONSOLE_TRACING_AND_METRICS | false
serviceName | The name of the service, to be used in telemetry metadata. | (none) | No default value
serviceVersion | The version of the service, to be used in telemetry metadata. | (none) | When isDev is false: TBD (auto-detect). When isDev is true: dev
collectorHost | The hostname of the OpenTelemetry collector. | OTEL_COLLECTOR_HOST | When isDev is false: scraper-collector.opentelemetry-operator-system.svc.cluster.local. When isDev is true: localhost
logLevel | The minimum log level to output for logging. | LOG_LEVEL | When isDev is false: info. When isDev is true: debug
enableCollection | When true, collectorHost is used to publish metrics and traces. | ENABLE_OTEL_COLLECTION | When isDev is false: true. When isDev is true: false
ignoreOutgoingRequestHook | A function used to ignore certain outgoing requests for tracing. Signature is: (req: RequestOptions) => boolean | (none) | No default implementation
enableGraphQLTracing | When enabled, GraphQL instrumentation is enabled for tracing. | (none) | false
reportExpressRoutes | When enabled, request/response lifecycle logs and metrics will have a route attribute from express (disable this for remix apps). | (none) | true
histogramBuckets | A map from histogram name to an array of numbers, where the numbers are the bucket boundaries. | (none) | {}
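
The precedence described above (explicit config value, then environment variable, then a default that may depend on isDev) can be sketched as follows. This is an illustration, not the library's code: SketchConfig and resolveOptions are hypothetical names, and treating only the string 'true' as a true flag is an assumption.

```typescript
// Hypothetical subset of TelemetryConfig, for illustration only.
type SketchConfig = {
  isDev?: boolean;
  logLevel?: string;
  enableCollection?: boolean;
  collectorHost?: string;
};
type Env = Record<string, string | undefined>;

const CLUSTER_COLLECTOR =
  'scraper-collector.opentelemetry-operator-system.svc.cluster.local';

function resolveOptions(config: SketchConfig, env: Env) {
  // isDev defaults to "no deployment stage is set".
  const isDev = config.isDev ?? !env['DEPLOYMENT_STAGE'];
  // Assumption: a flag env var is true only when it equals 'true'.
  const envFlag = (name: string): boolean | undefined =>
    env[name] === undefined ? undefined : env[name] === 'true';

  return {
    isDev,
    logLevel: config.logLevel ?? env['LOG_LEVEL'] ?? (isDev ? 'debug' : 'info'),
    enableCollection:
      config.enableCollection ?? envFlag('ENABLE_OTEL_COLLECTION') ?? !isDev,
    collectorHost:
      config.collectorHost ??
      env['OTEL_COLLECTOR_HOST'] ??
      (isDev ? 'localhost' : CLUSTER_COLLECTOR),
  };
}
```

Passing the environment in as an argument (for example, process.env) keeps the function easy to test.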

Usage with MSW Mocking

When running a service with mocked dependencies using MSW, you should not enable tracing for the mocked outgoing requests. Use ignoreOutgoingRequestHook to handle this: either hard-code its return value to true or implement a check that identifies the mocked requests. If this is not handled properly, you will see the following error:

TypeError: Cannot destructure property 'remoteAddress' of 'socket' as it is null.
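
As one way to implement such a check, the sketch below ignores outgoing requests whose host is in a set of mocked hosts. The host names and the MOCKED_HOSTS set are made up for illustration, and OutgoingRequestOptions is a local stand-in for the RequestOptions type from node:http so that the snippet is self-contained. The simple port stripping assumes hostnames rather than IPv6 literals.

```typescript
// Local stand-in for the fields of RequestOptions (node:http) used here.
type OutgoingRequestOptions = {
  hostname?: string | null;
  host?: string | null;
};

// Hypothetical hosts that MSW intercepts in this service.
const MOCKED_HOSTS = new Set(['api.example.test', 'auth.example.test']);

// Returns true when the request should be ignored for tracing.
function ignoreOutgoingRequestHook(req: OutgoingRequestOptions): boolean {
  const raw = req.hostname ?? req.host ?? '';
  // Drop an optional ":port" suffix and normalize case.
  const [name = ''] = raw.split(':');
  return MOCKED_HOSTS.has(name.toLowerCase());
}
```

The function would then be passed as the ignoreOutgoingRequestHook option in TelemetryConfig.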

Out of the Box Metrics

Node Runtime

nodejs_active_handles
nodejs_active_handles_total
nodejs_active_requests_total
nodejs_eventloop_lag_max_seconds
nodejs_eventloop_lag_mean_seconds
nodejs_eventloop_lag_min_seconds
nodejs_eventloop_lag_p50_seconds
nodejs_eventloop_lag_p90_seconds
nodejs_eventloop_lag_p99_seconds
nodejs_eventloop_lag_seconds
nodejs_eventloop_lag_stddev_seconds
nodejs_external_memory_bytes
nodejs_gc_duration_seconds_bucket
nodejs_gc_duration_seconds_count
nodejs_gc_duration_seconds_sum
nodejs_heap_size_total_bytes
nodejs_heap_size_used_bytes
nodejs_heap_space_size_available_bytes
nodejs_heap_space_size_total_bytes
nodejs_heap_space_size_used_bytes
nodejs_version_info

Express

http_request_duration_ms_bucket
http_request_duration_ms_count
http_request_duration_ms_sum

Apollo

graphql_errors_encountered
graphql_operations_resolved
graphql_total_request_time_ms

Histogram Buckets

There are currently 3 histograms that come with this library. Each has an explicit set of bucket boundaries:

  • nodejs_gc_duration_seconds - [0.001, 0.01, 0.1, 1, 2, 5]
  • http_request_duration_ms - [0, 5, 10, 25, 50, 75, 100, 250, 500, 1000, 2500, 5000, 7500, 10000]
  • graphql_total_request_time_ms - [0, 5, 10, 25, 50, 75, 100, 250, 500, 1000, 2500, 5000, 7500, 10000]

For all other histograms, the default bucket boundaries are:

[0, 5, 10, 25, 50, 75, 100, 250, 500, 1000]
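
To illustrate how a list of boundaries turns into buckets, here is a minimal sketch that finds the bucket for a recorded value. It assumes upper-inclusive boundaries, as in the OpenTelemetry specification and Prometheus le buckets; bucketIndex is a hypothetical helper, not part of this library.

```typescript
const DEFAULT_BOUNDARIES = [0, 5, 10, 25, 50, 75, 100, 250, 500, 1000];

// N boundaries define N + 1 buckets:
//   index 0      -> value <= boundaries[0]
//   index i      -> boundaries[i - 1] < value <= boundaries[i]
//   index N      -> value > boundaries[N - 1] (overflow bucket)
function bucketIndex(boundaries: number[], value: number): number {
  const i = boundaries.findIndex((upper) => value <= upper);
  return i === -1 ? boundaries.length : i;
}
```

With the default boundaries, a 7 ms measurement lands in the (5, 10] bucket, and anything above 1000 ms lands in the overflow bucket, which is why long-running operations may warrant a custom set of boundaries.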

If you create a service-specific histogram, and would like to specify a specific set of boundaries for that histogram, the histogramBuckets config option can be used. Example:

const config: TelemetryConfig = {
  serviceName: 'my-service',
  histogramBuckets: {
    my_cool_histogram: [5, 10, 15, 20],
  },
};

Then use that same histogram name when creating it:

const histo = telemetry.meter.createHistogram('my_cool_histogram');

Viewing Telemetry in AWS

Logs

Currently, logs show up in CloudWatch. The best way to search for service logs is to use the Logs Insights tab. Queries like the following can be executed:

fields @timestamp, @message
| filter data.service == "edu-platform-data-service"
| filter data.path == "/graphql"
| sort @timestamp desc

All service log fields show up under data. Kubernetes metadata is also available under kubernetes.

The current log groups are:

  • rdev - /edu-platform-rdev-eks/fluentbit-cloudwatch
  • staging - /aws/eks/edu-platform-staging-eks/cluster

Traces

TBD Details - Traces in X-Ray

Metrics

TBD Details - Metrics in Grafana

Contributing

Limitations and Issues

  • Care must be taken to initialize this library before any usage of Node's http module. Otherwise, the monkeypatching that otel performs for tracing will not work.

  • opentelemetry-node-metrics is not a widely adopted library. Unfortunately, there wasn't much else available for Node runtime metrics. We should eventually replace it. This functionality may also be brought into the main otel libraries at some point.