pleb/relay-exporter


pleb ec51dcc52e Fix write verification and re-enable it

2026-03-22 22:59:02 -07:00


relay-exporter

Production-oriented Prometheus exporter for monitoring a fixed set of Nostr relays.

What it does

Probes each configured relay either on a fixed interval or on demand when /metrics is scraped.
Uses @nostrwatch/nocap for open + read checks.
Performs a low-noise write confirmation check (kind 30078 by default) using deterministic d tags.
By default, requires read-after-write verification.
Exposes:
- /metrics for Prometheus scraping
- /healthz for process and probe-loop health
Publishes default process metrics via prom-client.
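
Kinds in the 30000-39999 range are addressable in Nostr: a relay keeps only the latest event per pubkey, kind, and d tag. Reusing the same d tag for a relay on every probe therefore replaces the previous probe event instead of piling up new ones. The sketch below illustrates the idea only. The `relay-exporter:` prefix, the hashing scheme, and the function names are assumptions for illustration, not the exporter's actual d-tag format.

```typescript
import { createHash } from "node:crypto";

// Unsigned Nostr event template (NIP-01 field names).
interface EventTemplate {
  kind: number;
  created_at: number;
  tags: string[][];
  content: string;
}

// Hypothetical d-tag scheme: fixed prefix plus a short hash of the relay URL.
// The same relay URL always maps to the same tag.
function writeCheckDTag(relayUrl: string): string {
  const normalized = relayUrl.trim().toLowerCase().replace(/\/+$/, "");
  const digest = createHash("sha256").update(normalized).digest("hex");
  return `relay-exporter:${digest.slice(0, 16)}`;
}

// Build the unsigned probe event; signing and publishing are out of scope here.
function buildWriteCheckTemplate(
  relayUrl: string,
  nowMs: number,
  kind = 30078,
): EventTemplate {
  return {
    kind,
    created_at: Math.floor(nowMs / 1000),
    tags: [["d", writeCheckDTag(relayUrl)]],
    content: "",
  };
}
```

Signing the template (for example with nostr-tools) and the read-after-write query are left out to keep the sketch dependency-free.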

Install

pnpm install

If you are bootstrapping from scratch, these are the exact dependency commands:

pnpm add @nostrwatch/nocap prom-client nostr-tools p-limit
pnpm add -D typescript tsx @types/node

Configuration

Copy and edit:

cp .env.example .env

Example: run probes every 5 minutes instead of on each scrape:

export PROBE_INTERVAL_SECONDS=300

Environment variables:

RELAYS (required): comma-separated wss:// URLs
PROBE_INTERVAL_SECONDS (default: 0; 0 means run probes on each /metrics scrape)
PROBE_TIMEOUT_SECONDS (default: 10)
LISTEN_ADDR (default: 0.0.0.0)
PORT (default: 9464)
LOG_LEVEL (default: info; one of debug|info|warn|error)
WRITE_CHECK_ENABLED (default: true)
WRITE_CHECK_VERIFY_READ (default: true; set false to treat publish OK as sufficient)
WRITE_CHECK_KIND (default: 30078)
WRITE_CHECK_PRIVKEY (optional; supports nsec1... or 64-char hex)
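
As a rough illustration of how these variables and defaults fit together, here is a minimal parsing sketch. The type and function names (`ExporterConfig`, `loadConfig`, `intVar`, `boolVar`) are hypothetical, and the accepted boolean spellings (`true`/`false`/`1`/`0`) are an assumption; the actual loader in src/config.ts may behave differently. WRITE_CHECK_PRIVKEY is omitted here.

```typescript
const LOG_LEVELS = ["debug", "info", "warn", "error"] as const;
type LogLevel = (typeof LOG_LEVELS)[number];
type Env = Record<string, string | undefined>;

interface ExporterConfig {
  relays: string[];
  probeIntervalSeconds: number; // 0 = probe on each /metrics scrape
  probeTimeoutSeconds: number;
  listenAddr: string;
  port: number;
  logLevel: LogLevel;
  writeCheckEnabled: boolean;
  writeCheckVerifyRead: boolean;
  writeCheckKind: number;
}

function isLogLevel(v: string): v is LogLevel {
  return (LOG_LEVELS as readonly string[]).includes(v);
}

// Parse a non-negative integer, falling back to the default when unset.
function intVar(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer`);
  }
  return n;
}

// Parse a boolean; accepted spellings are an assumption of this sketch.
function boolVar(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name]?.trim().toLowerCase();
  if (raw === undefined || raw === "") return fallback;
  if (raw === "true" || raw === "1") return true;
  if (raw === "false" || raw === "0") return false;
  throw new Error(`${name} must be true or false`);
}

function loadConfig(env: Env): ExporterConfig {
  const relays = (env["RELAYS"] ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  if (relays.length === 0) throw new Error("RELAYS is required");
  const bad = relays.find((r) => !r.startsWith("wss://"));
  if (bad !== undefined) throw new Error(`RELAYS entry is not a wss:// URL: ${bad}`);

  const logLevel = env["LOG_LEVEL"]?.trim() || "info";
  if (!isLogLevel(logLevel)) throw new Error("LOG_LEVEL must be debug|info|warn|error");

  return {
    relays,
    probeIntervalSeconds: intVar(env, "PROBE_INTERVAL_SECONDS", 0),
    probeTimeoutSeconds: intVar(env, "PROBE_TIMEOUT_SECONDS", 10),
    listenAddr: env["LISTEN_ADDR"]?.trim() || "0.0.0.0",
    port: intVar(env, "PORT", 9464),
    logLevel,
    writeCheckEnabled: boolVar(env, "WRITE_CHECK_ENABLED", true),
    writeCheckVerifyRead: boolVar(env, "WRITE_CHECK_VERIFY_READ", true),
    writeCheckKind: intVar(env, "WRITE_CHECK_KIND", 30078),
  };
}
```

Taking the environment as a parameter rather than reading `process.env` directly keeps the function easy to test; a caller would pass `process.env`.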

Write confirmation key material

WRITE_CHECK_PRIVKEY may be an nsec1... value or a 64-character hex private key.
WRITE_CHECK_PUBKEY is not needed; the write-check pubkey is always derived from the private key.
If WRITE_CHECK_PRIVKEY is missing or invalid and write checks are enabled, the exporter generates an ephemeral key for the running process and continues write checks.
Private key values are never logged.
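
The fallback behavior described above can be sketched roughly as follows. Everything here is illustrative: `resolveWriteCheckKey` and the `decodeNsec` hook are hypothetical names, and the hook stands in for a bech32 decoder from a Nostr library. The ephemeral key is produced with plain random bytes for brevity; a real implementation would more likely use a library key generator that guarantees a valid secp256k1 scalar.

```typescript
import { randomBytes } from "node:crypto";

type KeySource = "configured" | "ephemeral";

interface WriteCheckKey {
  secretHex: string; // never log this value
  source: KeySource; // safe to log
}

const HEX_KEY = /^[0-9a-fA-F]{64}$/;

// Hypothetical hook: turns an nsec1... string into 64-char hex, or throws.
type NsecDecoder = (nsec: string) => string;

function ephemeralKey(): WriteCheckKey {
  // Assumption: 32 random bytes; validity as a curve scalar is not checked.
  return { secretHex: randomBytes(32).toString("hex"), source: "ephemeral" };
}

function resolveWriteCheckKey(
  raw: string | undefined,
  decodeNsec?: NsecDecoder,
): WriteCheckKey {
  const value = raw?.trim() ?? "";
  if (HEX_KEY.test(value)) {
    return { secretHex: value.toLowerCase(), source: "configured" };
  }
  if (value.startsWith("nsec1") && decodeNsec !== undefined) {
    try {
      const hex = decodeNsec(value);
      if (HEX_KEY.test(hex)) {
        return { secretHex: hex.toLowerCase(), source: "configured" };
      }
    } catch {
      // Invalid nsec: fall through to the ephemeral key without logging the input.
    }
  }
  // Missing or invalid key: generate one for the lifetime of this process.
  return ephemeralKey();
}
```

Returning a `source` field alongside the key lets the caller log which path was taken without ever touching the key material itself.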

Run

Development with auto-reload:

pnpm dev

Production build:

pnpm build
pnpm start

Run tests:

pnpm test

Run the live write-confirm diagnostic for offchain.pub (opt-in):

LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
pnpm test test/live.offchain.test.ts

Full example with write verification enabled:

LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_WRITE_VERIFY_READ=1 \
LIVE_RELAY_TEST_TIMEOUT_SECONDS=8 \
LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1 \
pnpm test test/live.offchain.test.ts

Faster local loop (reduced stability sampling):

LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_SAMPLES=2 \
LIVE_RELAY_TEST_SCRAPE_EVERY_MS=250 \
pnpm test test/live.offchain.test.ts

Optional knobs (defaults favor faster feedback):

LIVE_RELAY_TEST_SAMPLES (default 4)
LIVE_RELAY_TEST_SCRAPE_EVERY_MS (default 500)
LIVE_RELAY_TEST_TIMEOUT_SECONDS (default 8)
LIVE_RELAY_TEST_RELAYS (default "wss://offchain.pub"; comma-separated relay list)
LIVE_RELAY_TEST_WRITE_VERIFY_READ=1 to force read-after-write verification
LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1 to make the test fail on any write-confirm/probe failures

Exposed metrics

Relay-level metrics carry a single relay label unless stated otherwise:

nostr_relay_up (gauge)
nostr_relay_open_ok (gauge)
nostr_relay_read_ok (gauge)
nostr_relay_write_confirm_ok (gauge)
nostr_relay_open_duration_ms (gauge)
nostr_relay_read_duration_ms (gauge)
nostr_relay_write_duration_ms (gauge, -1 when unavailable/disabled)
nostr_relay_last_success_unixtime (gauge)
nostr_relay_probe_errors_total{relay,check} (counter)
nostr_relay_probe_runs_total{relay,result} (counter; result=success|failure)

Also includes all default Node.js process/runtime metrics from prom-client.
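
To make the shape of the scraped output concrete, the sketch below renders a few of the relay gauges in the Prometheus text exposition format. It is a hand-rolled illustration, not how the exporter produces output (that goes through prom-client). `RelayProbeResult` and `renderRelayGauges` are hypothetical names, and encoding the `*_ok` gauges as 1/0 is an assumption.

```typescript
interface RelayProbeResult {
  relay: string;
  openOk: boolean;
  readOk: boolean;
  writeConfirmOk: boolean;
  openDurationMs: number;
  readDurationMs: number;
  // null when the write check is disabled or produced no timing.
  writeDurationMs: number | null;
}

// Escape a label value per the Prometheus text exposition format.
function escapeLabel(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render one sample line per gauge for a single relay.
function renderRelayGauges(r: RelayProbeResult): string[] {
  const label = `{relay="${escapeLabel(r.relay)}"}`;
  const flag = (ok: boolean): number => (ok ? 1 : 0);
  const samples: Array<[string, number]> = [
    ["nostr_relay_open_ok", flag(r.openOk)],
    ["nostr_relay_read_ok", flag(r.readOk)],
    ["nostr_relay_write_confirm_ok", flag(r.writeConfirmOk)],
    ["nostr_relay_open_duration_ms", r.openDurationMs],
    ["nostr_relay_read_duration_ms", r.readDurationMs],
    // -1 marks an unavailable or disabled write check.
    ["nostr_relay_write_duration_ms", r.writeDurationMs ?? -1],
  ];
  return samples.map(([name, value]) => `${name}${label} ${value}`);
}
```

Because the write duration uses -1 as a sentinel, dashboards that average it should filter out negative samples first.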

Prometheus scrape config example

scrape_configs:
  - job_name: nostr-relay-exporter
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "relay-exporter.internal:9464"

Example Grafana queries

Relay up status by relay:
- max by (relay) (nostr_relay_up)
Open latency:
- avg_over_time(nostr_relay_open_duration_ms[5m])
Read latency:
- avg_over_time(nostr_relay_read_duration_ms[5m])
Probe run success ratio (15m):
- sum by (relay) (increase(nostr_relay_probe_runs_total{result="success"}[15m])) / sum by (relay) (increase(nostr_relay_probe_runs_total[15m]))
Probe errors by check:
- sum by (relay, check) (increase(nostr_relay_probe_errors_total[15m]))

Health endpoint

GET /healthz returns:
- 200 when process is running and probe data is fresh enough
- 503 when shutting down or probe data is stale/not yet available
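
The status rule above amounts to a small pure function. This is a sketch under assumptions: `healthStatus` and its input fields are hypothetical names, and the freshness threshold (`maxAgeMs`) is a placeholder, since the README does not state how "fresh enough" is defined.

```typescript
interface HealthInput {
  shuttingDown: boolean;
  // Completion time of the most recent probe run, or null if none has finished yet.
  lastProbeCompletedAtMs: number | null;
  nowMs: number;
  // Hypothetical freshness threshold; the exporter's real value is not documented here.
  maxAgeMs: number;
}

function healthStatus(input: HealthInput): 200 | 503 {
  if (input.shuttingDown) return 503;
  if (input.lastProbeCompletedAtMs === null) return 503; // no probe data yet
  const ageMs = input.nowMs - input.lastProbeCompletedAtMs;
  return ageMs <= input.maxAgeMs ? 200 : 503;
}
```

A handler would call this with `Date.now()` and write the returned code as the HTTP status.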

Notes

Relay probes are isolated; one relay failure does not block others.
nocap default adapters are explicitly loaded before checks.
Probe concurrency is bounded in code (DEFAULT_PROBE_CONCURRENCY in src/config.ts).
Graceful shutdown handles SIGINT and SIGTERM.
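
The isolation and bounded-concurrency notes can be illustrated with a small worker pool. The project lists p-limit as a dependency; the sketch below swaps in a dependency-free pool to stay self-contained, so it shows the technique rather than the exporter's code. `mapBounded` and `Settled` are hypothetical names.

```typescript
type Settled<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run fn over items with at most `limit` calls in flight.
// A rejected call is recorded for its item and never affects the others.
async function mapBounded<I, O>(
  items: readonly I[],
  limit: number,
  fn: (item: I) => Promise<O>,
): Promise<Settled<O>[]> {
  const results: Settled<O>[] = new Array(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (true) {
      const i = next++; // safe: JavaScript runs this synchronously per worker turn
      if (i >= items.length) return;
      try {
        results[i] = { ok: true, value: await fn(items[i]) };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Usage might look like `await mapBounded(relays, 4, probeRelay)`, where `probeRelay` is a placeholder for whatever function performs the open, read, and write checks for one relay.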
