# relay-exporter

Production-oriented Prometheus exporter for monitoring a fixed set of Nostr relays.

## What it does

- Probes each configured relay either on an interval, or on-demand when `/metrics` is scraped.
- Uses `@nostrwatch/nocap` for `open` + `read` checks.
- Performs a low-noise write confirmation check (kind `30078` by default) using deterministic `d` tags.
- By default, requires read-after-write verification.
- Exposes:
  - `/metrics` for Prometheus scraping
  - `/healthz` for process and probe-loop health
- Publishes default process metrics via `prom-client`.

## Install

```bash
pnpm install
```

If you are bootstrapping from scratch, these are the exact dependency commands:

```bash
pnpm add @nostrwatch/nocap prom-client nostr-tools p-limit
pnpm add -D typescript tsx @types/node
```

## Configuration

Copy and edit:

```bash
cp .env.example .env
```

Example: run probes every 5 minutes instead of on each scrape:

```bash
export PROBE_INTERVAL_SECONDS=300
```

Environment variables:

- `RELAYS` (required): comma-separated `wss://` URLs
- `PROBE_INTERVAL_SECONDS` (default: `0`; `0` means run probes on each `/metrics` scrape)
- `PROBE_TIMEOUT_SECONDS` (default: `10`)
- `LISTEN_ADDR` (default: `0.0.0.0`)
- `PORT` (default: `9464`)
- `LOG_LEVEL` (default: `info`; one of `debug|info|warn|error`)
- `WRITE_CHECK_ENABLED` (default: `true`)
- `WRITE_CHECK_VERIFY_READ` (default: `true`; set `false` to treat publish `OK` as sufficient)
- `WRITE_CHECK_KIND` (default: `30078`)
- `WRITE_CHECK_PRIVKEY` (optional; supports `nsec1...` or 64-char hex)

### Write confirmation key material

- `WRITE_CHECK_PRIVKEY` may be an `nsec1...` value or a 64-character hex private key.
- `WRITE_CHECK_PUBKEY` is not needed; write-check pubkey is always derived from the private key.
- If `WRITE_CHECK_PRIVKEY` is missing or invalid and write checks are enabled, the exporter generates an ephemeral key for the running process and continues write checks.
- Private key values are never logged.

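
The variables and defaults listed above could be read along the following lines. This is a minimal sketch, not the exporter's actual `src/config.ts`: the `ExporterConfig` shape, the helper names, the accepted boolean spellings, and the `keySource` field are all assumptions made for illustration. The key classifier only checks the textual format (`nsec1...` or 64 hex characters); it does not decode or checksum-validate the key.

```typescript
// Illustrative loader only; names and structure are hypothetical.
type Env = Record<string, string | undefined>;
type KeySource = "nsec" | "hex" | "ephemeral";

interface ExporterConfig {
  relays: string[];
  probeIntervalSeconds: number; // 0 = probe on each /metrics scrape
  probeTimeoutSeconds: number;
  port: number;
  writeCheckEnabled: boolean;
  writeCheckVerifyRead: boolean;
  writeCheckKind: number;
  keySource: KeySource;
}

// Parse a non-negative integer, falling back to the documented default.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name]?.trim();
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer`);
  }
  return value;
}

// Assumption: only the literal strings "true" and "false" are accepted.
function boolFromEnv(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name]?.trim().toLowerCase();
  if (raw === undefined || raw === "") return fallback;
  if (raw !== "true" && raw !== "false") {
    throw new Error(`${name} must be "true" or "false"`);
  }
  return raw === "true";
}

// Format check only: anything that is neither nsec-like nor 64-char hex
// is treated as "generate an ephemeral key for this process".
function classifyPrivkey(raw: string | undefined): KeySource {
  const value = raw?.trim() ?? "";
  if (/^nsec1[02-9ac-hj-np-z]+$/.test(value)) return "nsec";
  if (/^[0-9a-fA-F]{64}$/.test(value)) return "hex";
  return "ephemeral";
}

function loadConfig(env: Env): ExporterConfig {
  const relays = (env.RELAYS ?? "")
    .split(",")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
  if (relays.length === 0) throw new Error("RELAYS is required");
  for (const url of relays) {
    if (!url.startsWith("wss://")) {
      throw new Error(`RELAYS entry is not a wss:// URL: ${url}`);
    }
  }
  return {
    relays,
    probeIntervalSeconds: intFromEnv(env, "PROBE_INTERVAL_SECONDS", 0),
    probeTimeoutSeconds: intFromEnv(env, "PROBE_TIMEOUT_SECONDS", 10),
    port: intFromEnv(env, "PORT", 9464),
    writeCheckEnabled: boolFromEnv(env, "WRITE_CHECK_ENABLED", true),
    writeCheckVerifyRead: boolFromEnv(env, "WRITE_CHECK_VERIFY_READ", true),
    writeCheckKind: intFromEnv(env, "WRITE_CHECK_KIND", 30078),
    keySource: classifyPrivkey(env.WRITE_CHECK_PRIVKEY),
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to exercise in tests with a plain object.
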
## Run

Development with auto-reload:

```bash
pnpm dev
```

Production build:

```bash
pnpm build
pnpm start
```

Run tests:

```bash
pnpm test
```

Run the live write-confirm diagnostic for `offchain.pub` (opt-in):

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
pnpm test test/live.offchain.test.ts
```

Full example with write verification enabled:

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_WRITE_VERIFY_READ=1 \
LIVE_RELAY_TEST_TIMEOUT_SECONDS=8 \
LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1 \
pnpm test test/live.offchain.test.ts
```

Faster local loop (reduced stability sampling):

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_SAMPLES=2 \
LIVE_RELAY_TEST_SCRAPE_EVERY_MS=250 \
pnpm test test/live.offchain.test.ts
```

Optional knobs (defaults favor faster feedback):

- `LIVE_RELAY_TEST_SAMPLES` (default `4`)
- `LIVE_RELAY_TEST_SCRAPE_EVERY_MS` (default `500`)
- `LIVE_RELAY_TEST_TIMEOUT_SECONDS` (default `8`)
- `LIVE_RELAY_TEST_RELAYS` (default `"wss://offchain.pub"`; comma-separated relay list)
- `LIVE_RELAY_TEST_WRITE_VERIFY_READ=1` to force read-after-write verification
- `LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1` to make the test fail on any write-confirm/probe failures

## Exposed metrics

Relay-level labels use `{relay}` unless stated:

- `nostr_relay_up` (gauge)
- `nostr_relay_open_ok` (gauge)
- `nostr_relay_read_ok` (gauge)
- `nostr_relay_write_confirm_ok` (gauge)
- `nostr_relay_open_duration_ms` (gauge)
- `nostr_relay_read_duration_ms` (gauge)
- `nostr_relay_write_duration_ms` (gauge, `-1` when unavailable/disabled)
- `nostr_relay_last_success_unixtime` (gauge)
- `nostr_relay_probe_errors_total{relay,check}` (counter)
- `nostr_relay_probe_runs_total{relay,result}` (counter; `result=success|failure`)

Also includes all default Node.js process/runtime metrics from `prom-client`.

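
To give a feel for what a scrape of some of these gauges might look like, here is a small sketch that renders one relay's values as Prometheus text exposition lines. The exporter itself publishes through `prom-client`; this hand-rolled renderer is only an illustration. The `ProbeSample` shape and its field names are hypothetical, and the `0`/`1` encoding of the `_ok` gauges is an assumption. Only the `-1` sentinel for an unavailable or disabled write duration comes from the list above.

```typescript
// Hypothetical probe result; field names are illustrative only.
interface ProbeSample {
  relay: string;
  openOk: boolean;
  readOk: boolean;
  openDurationMs: number;
  readDurationMs: number;
  writeDurationMs: number | null; // null = write check disabled or unavailable
}

// Escape backslash, double quote, and newline in a label value,
// as the Prometheus text exposition format requires.
function escapeLabel(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render a subset of the relay-level gauges for one relay.
function renderGauges(sample: ProbeSample): string {
  const label = `{relay="${escapeLabel(sample.relay)}"}`;
  const lines = [
    // Assumption: boolean gauges are exported as 1 (ok) or 0 (not ok).
    `nostr_relay_open_ok${label} ${sample.openOk ? 1 : 0}`,
    `nostr_relay_read_ok${label} ${sample.readOk ? 1 : 0}`,
    `nostr_relay_open_duration_ms${label} ${sample.openDurationMs}`,
    `nostr_relay_read_duration_ms${label} ${sample.readDurationMs}`,
    // -1 marks "unavailable/disabled", matching the metric description.
    `nostr_relay_write_duration_ms${label} ${sample.writeDurationMs ?? -1}`,
  ];
  return lines.join("\n") + "\n";
}

const rendered = renderGauges({
  relay: "wss://relay.example",
  openOk: true,
  readOk: false,
  openDurationMs: 120,
  readDurationMs: 340,
  writeDurationMs: null,
});
console.log(rendered);
```

Because `-1` is a sentinel rather than a measurement, queries over `nostr_relay_write_duration_ms` will usually want to filter it out (for example with `>= 0`) before averaging.
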
## Prometheus scrape config example

```yaml
scrape_configs:
  - job_name: nostr-relay-exporter
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "relay-exporter.internal:9464"
```

## Example Grafana queries

- Relay up status by relay:
  - `max by (relay) (nostr_relay_up)`
- Open latency:
  - `avg_over_time(nostr_relay_open_duration_ms[5m])`
- Read latency:
  - `avg_over_time(nostr_relay_read_duration_ms[5m])`
- Write confirmation success ratio (15m):
  - `sum by (relay) (increase(nostr_relay_probe_runs_total{result="success"}[15m])) / sum by (relay) (increase(nostr_relay_probe_runs_total[15m]))`
- Probe errors by check:
  - `sum by (relay, check) (increase(nostr_relay_probe_errors_total[15m]))`

## Health endpoint

- `GET /healthz` returns:
  - `200` when process is running and probe data is fresh enough
  - `503` when shutting down or probe data is stale/not yet available

## Notes

- Relay probes are isolated; one relay failure does not block others.
- `nocap` default adapters are explicitly loaded before checks.
- Probe concurrency is bounded in code (`DEFAULT_PROBE_CONCURRENCY` in `src/config.ts`).
- Graceful shutdown handles `SIGINT` and `SIGTERM`.

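
The first and third notes (isolated probes, bounded concurrency) can be illustrated with a small worker-pool sketch. It is not the exporter's code: the project depends on `p-limit`, whereas this version uses only built-in promises so that it stays self-contained. The `ProbeOutcome` type, the `ProbeFn` signature, and the `probeAll` name are hypothetical.

```typescript
// Hypothetical result shape; the exporter's real types may differ.
type ProbeOutcome =
  | { relay: string; ok: true; durationMs: number }
  | { relay: string; ok: false; error: string };

// A probe resolves with its duration in ms, or rejects on failure.
type ProbeFn = (relay: string) => Promise<number>;

async function probeAll(
  relays: string[],
  probe: ProbeFn,
  concurrency: number,
): Promise<ProbeOutcome[]> {
  const outcomes: ProbeOutcome[] = new Array(relays.length);
  let next = 0;

  // Each worker claims the next unprobed relay until none are left.
  async function worker(): Promise<void> {
    while (next < relays.length) {
      const index = next++;
      const relay = relays[index];
      try {
        outcomes[index] = { relay, ok: true, durationMs: await probe(relay) };
      } catch (err) {
        // The failure is recorded for this relay only; the worker moves on.
        const error = err instanceof Error ? err.message : String(err);
        outcomes[index] = { relay, ok: false, error };
      }
    }
  }

  // Never start more workers than there are relays, and always at least one.
  const workerCount = Math.max(1, Math.min(concurrency, relays.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return outcomes;
}
```

Catching errors inside the worker, rather than letting them reject the outer `Promise.all`, is what keeps one failing relay from affecting the others. Results are written by index, so the output order matches the input relay list regardless of completion order.
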