Observability Model

Purpose

This document defines initial observability requirements.

Observability must support human operators and future AI-assisted support workflows.

Logs

Services must use structured logs.

Logs should include:

Timestamp.
Log level.
Service name.
Message.
Correlation ID where available.
Tenant ID where safe and relevant.
Organization ID where safe and relevant.
Device ID where safe and relevant.
Error code where relevant.

Logs must not include secrets, tokens, or raw credentials.

The Phase 4 service baseline uses JSON logs for backend deployables. Every log record includes:

Service name.
Runtime environment.
Log level.
Timestamp.
Message.

Logs include correlationId when the request, event, or worker context provides one. The x-correlation-id HTTP header is the initial transport header for operation endpoints and future HTTP APIs.
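As an illustration of the record shape described above, the following is a minimal sketch, not the baseline's actual logger. The type and function names (LogContext, LogRecord, buildLogRecord, formatLogLine) and the JSON field names other than correlationId are assumptions made for this example.

```typescript
// Hypothetical types for illustration; only `correlationId` is a name taken from this document.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  serviceName: string;
  environment: string;
  correlationId?: string;
}

interface LogRecord {
  timestamp: string;
  level: LogLevel;
  serviceName: string;
  environment: string;
  message: string;
  correlationId?: string;
}

// Builds a record with the five mandatory fields and an optional correlationId.
function buildLogRecord(
  ctx: LogContext,
  level: LogLevel,
  message: string,
  now: Date = new Date(),
): LogRecord {
  const record: LogRecord = {
    timestamp: now.toISOString(),
    level,
    serviceName: ctx.serviceName,
    environment: ctx.environment,
    message,
  };
  // Attach correlationId only when the calling context provides one.
  if (ctx.correlationId !== undefined) {
    record.correlationId = ctx.correlationId;
  }
  return record;
}

// Serializes the record as a single JSON line suitable for stdout.
function formatLogLine(record: LogRecord): string {
  return JSON.stringify(record);
}
```

Passing the clock value as a parameter keeps the function deterministic in tests; a real logger would more likely read the time itself.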

Metrics

Initial metrics should cover:

HTTP request count.
HTTP error count.
HTTP latency.
MQTT connection state.
MQTT messages received.
MQTT messages rejected.
Internal stream backlog.
Telemetry messages processed.
Decoder failures.
Unknown device messages.
Database write failures.
Export jobs created.
Export jobs completed.
Export job failures.

Backend services expose Prometheus-compatible metrics at:

GET /metrics

The Phase 4 baseline exposes:

sens_service_info{service_name,environment} 1
sens_service_ready{service_name,environment}
sens_http_requests_total{service_name,environment,method,route,status_code}
sens_http_request_duration_seconds
Node.js process metrics from prom-client

Metric labels must not contain tenant IDs, organization IDs, device IDs, user IDs, tokens, raw payload values, or any other high-cardinality or sensitive value.
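One way to enforce the label rule is an allow-list checked before a metric is recorded. The sketch below is an assumption about how such a guard could look, not part of the baseline. The helper names (ALLOWED_LABELS, findDisallowedLabels, assertSafeLabels) are hypothetical, and the allowed names are the labels that appear in the Phase 4 metric list above.

```typescript
// Label names that appear on the Phase 4 baseline metrics in this document.
const ALLOWED_LABELS: ReadonlySet<string> = new Set([
  "service_name",
  "environment",
  "method",
  "route",
  "status_code",
]);

// Returns every label name that is not on the allow-list.
function findDisallowedLabels(labels: Record<string, string>): string[] {
  return Object.keys(labels).filter((name) => !ALLOWED_LABELS.has(name));
}

// Throws when a metric is about to be recorded with a label that is not allowed.
function assertSafeLabels(labels: Record<string, string>): void {
  const rejected = findDisallowedLabels(labels);
  if (rejected.length > 0) {
    throw new Error(`disallowed metric labels: ${rejected.join(", ")}`);
  }
}
```

A name allow-list does not catch high-cardinality values. For example, a raw URL path used as the route value would still produce one time series per distinct path, so route values need to be normalized separately.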

Local Observability Verification

The repository includes a local dev cockpit for Phase 4 observability checks. It can start the backend service skeletons, capture their stdout and stderr, inspect structured JSON logs, call health and readiness endpoints, validate baseline Prometheus metrics, and verify correlation ID behavior.

The dev cockpit is a local development tool only. It is not a log aggregation stack, not a production dashboard, not a Kubernetes component, and not part of the public Platform API.

Health and Readiness

Every service must expose health and readiness signals.

Health answers whether the process is alive.

Readiness answers whether the service can serve traffic or process work.

Readiness should consider critical dependencies such as database or broker where appropriate.

Backend deployables expose these operation endpoints:

GET /healthz
GET /readyz
GET /metrics

GET /healthz returns HTTP 200 when the process is alive. It does not check external dependencies.

GET /readyz returns HTTP 200 when all readiness checks pass and HTTP 503 when one or more checks fail. Phase 4 includes only a configuration readiness check, because the database, NATS, MQTT, decoder, and storage integrations are not implemented yet.
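The pass/fail rule for GET /readyz can be sketched as a small aggregation over named checks. This is a minimal sketch under stated assumptions: the names ReadinessCheck, ReadinessResult, and evaluateReadiness are hypothetical, the per-check result map is an invented response shape (this document specifies only the status codes), and checks are synchronous here for brevity even though later dependency checks would probably be asynchronous.

```typescript
// Hypothetical shapes; the document defines only the 200/503 behavior.
interface ReadinessCheck {
  name: string;
  run: () => boolean;
}

interface ReadinessResult {
  statusCode: 200 | 503;
  checks: Record<string, boolean>;
}

// Runs every check and maps the combined outcome to an HTTP status code.
function evaluateReadiness(checks: ReadinessCheck[]): ReadinessResult {
  const results: Record<string, boolean> = {};
  let allPassed = true;
  for (const check of checks) {
    let passed: boolean;
    try {
      passed = check.run();
    } catch {
      // A check that throws is treated as a failed check.
      passed = false;
    }
    results[check.name] = passed;
    if (!passed) {
      allPassed = false;
    }
  }
  return { statusCode: allPassed ? 200 : 503, checks: results };
}
```

In Phase 4 the list would hold a single configuration check. Adding a database or broker check later would mean appending another entry rather than changing the aggregation.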

Responses include x-correlation-id. If the caller supplies a valid x-correlation-id, the service reuses it; otherwise the service generates a new correlation ID.
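The reuse-or-generate rule can be sketched as follows. This document does not define what makes a supplied ID valid, so the pattern below (1 to 128 letters, digits, dots, underscores, or hyphens) is purely an assumption for illustration. The function name resolveCorrelationId is also hypothetical, and the generator is passed in so the sketch stays independent of any particular ID scheme.

```typescript
// ASSUMPTION: the document does not define "valid"; this pattern is illustrative only.
const CORRELATION_ID_PATTERN = /^[A-Za-z0-9._-]{1,128}$/;

// Reuses the caller-supplied ID when it matches the pattern, otherwise generates a new one.
function resolveCorrelationId(
  headerValue: string | string[] | undefined,
  generate: () => string,
): string {
  // HTTP frameworks may surface a repeated header as an array; use the first value.
  const candidate = Array.isArray(headerValue) ? headerValue[0] : headerValue;
  if (candidate !== undefined && CORRELATION_ID_PATTERN.test(candidate)) {
    return candidate;
  }
  return generate();
}
```

In a Node.js service, randomUUID from node:crypto would be a natural choice for the generate argument. Restricting the accepted character set also keeps caller-controlled values from injecting line breaks into logs or response headers.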

The platform-api also exposes:

GET /test

GET /test returns { "success": true } and emits a structured JSON log entry with the message "test endpoint called". This endpoint exists only as an early deployment smoke test and must not be used as a product API contract.

Tracing

Correlation IDs should flow through HTTP requests, internal events, and worker logs.
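To make the flow from request to internal event to worker log concrete, the sketch below carries the ID in an event envelope. The envelope shape and the names EventEnvelope, wrapEvent, and handleEvent are assumptions for illustration; this document does not define an internal event format.

```typescript
// Hypothetical envelope; the document does not specify an internal event format.
interface EventEnvelope<T> {
  type: string;
  correlationId?: string;
  payload: T;
}

// Producer side: copy the request's correlation ID onto the outgoing event.
function wrapEvent<T>(type: string, payload: T, correlationId?: string): EventEnvelope<T> {
  const envelope: EventEnvelope<T> = { type, payload };
  if (correlationId !== undefined) {
    envelope.correlationId = correlationId;
  }
  return envelope;
}

// Worker side: include the envelope's correlation ID in the log line it emits.
function handleEvent<T>(envelope: EventEnvelope<T>, log: (line: string) => void): void {
  // JSON.stringify omits the correlationId key when the value is undefined.
  log(
    JSON.stringify({
      message: `handling ${envelope.type}`,
      correlationId: envelope.correlationId,
    }),
  );
}
```

With this shape, a log search for one correlation ID would return the HTTP request log and the worker log for the same operation.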

Full distributed tracing may be introduced later.

Alerting

Initial alerting should consider:

Service down.
Database unavailable.
Broker unavailable.
MQTT disconnected.
Queue backlog growing.
Decoder errors above threshold.
Unknown devices above threshold.
Export job failures.
Disk/storage pressure.

Open Decisions

Metrics stack.
Log aggregation stack.
Dashboard tooling.
Alert manager.