# relay-exporter

Production-oriented Prometheus exporter for monitoring a fixed set of Nostr relays.

## What it does

- Probes each configured relay either on an interval, or on-demand when `/metrics` is scraped.
- Uses `@nostrwatch/nocap` for `open` + `read` checks.
- Optionally performs a low-noise write confirmation check (kind `30078` by default) using deterministic `d` tags.
- By default, treats relay `OK` for publish as success; optional read-after-write verification can be enabled.
- Exposes:
  - `/metrics` for Prometheus scraping
  - `/healthz` for process and probe-loop health
- Publishes default process metrics via `prom-client`.

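To illustrate the deterministic `d` tag idea, here is a minimal TypeScript sketch. It assumes the tag is derived from the relay URL; the helper names (`deterministicDTag`, `buildWriteCheckTemplate`), the `relay-exporter:` prefix, and the FNV-1a hash are hypothetical stand-ins, not the exporter's actual derivation. Kind `30078` falls in the addressable range defined by NIP-01, where relays keep only the latest event per pubkey, kind, and `d` tag, so reusing one tag per relay keeps the write check low-noise.

```typescript
// Unsigned event template; signing and publishing are out of scope for this sketch.
interface EventTemplate {
  kind: number;
  created_at: number;
  tags: string[][];
  content: string;
}

// 32-bit FNV-1a over UTF-16 code units (identical to bytes for ASCII URLs).
// Assumption: any stable hash would do; this one just avoids extra imports.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, modulo 2^32.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical helper: the same relay URL always yields the same `d` tag,
// so repeated write checks replace one event instead of piling up new ones.
function deterministicDTag(relayUrl: string): string {
  const normalized = relayUrl.trim().toLowerCase().replace(/\/+$/, "");
  return "relay-exporter:" + fnv1a32(normalized);
}

function buildWriteCheckTemplate(
  relayUrl: string,
  kind: number = 30078,
  nowMs: number = Date.now()
): EventTemplate {
  return {
    kind,
    created_at: Math.floor(nowMs / 1000), // Nostr timestamps are in seconds
    tags: [["d", deterministicDTag(relayUrl)]],
    content: "",
  };
}

const template = buildWriteCheckTemplate("wss://offchain.pub");
console.log(template.kind, template.tags[0]);
```

The template would still need to be signed with the write-check key before publishing.
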
## Install

```bash
pnpm install
```

If you are bootstrapping from scratch, these are the exact dependency commands:

```bash
pnpm add @nostrwatch/nocap prom-client nostr-tools p-limit
pnpm add -D typescript tsx @types/node
```

## Configuration

Copy and edit:

```bash
cp .env.example .env
```

Example: run probes every 5 minutes instead of on each scrape:

```bash
export PROBE_INTERVAL_SECONDS=300
```

Environment variables:

- `RELAYS` (required): comma-separated `wss://` URLs
- `PROBE_INTERVAL_SECONDS` (default: `0`; `0` means run probes on each `/metrics` scrape)
- `PROBE_TIMEOUT_SECONDS` (default: `10`)
- `LISTEN_ADDR` (default: `0.0.0.0`)
- `PORT` (default: `9464`)
- `LOG_LEVEL` (default: `info`; one of `debug|info|warn|error`)
- `WRITE_CHECK_ENABLED` (default: `true`)
- `WRITE_CHECK_VERIFY_READ` (default: `false`; when `true`, require event read-back after publish)
- `WRITE_CHECK_KIND` (default: `30078`)
- `WRITE_CHECK_PRIVKEY` (optional; supports `nsec1...` or 64-char hex)

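As a rough illustration of how the variables and defaults listed above fit together, the following TypeScript sketch parses them from a plain object. It is not the exporter's `src/config.ts`: the `loadConfig` function, the `ExporterConfig` shape, the accepted boolean spellings (`true`/`false`/`1`/`0`), and the strict `wss://` prefix check are all assumptions made for the example. `WRITE_CHECK_PRIVKEY` is left out on purpose.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";
const LOG_LEVELS: LogLevel[] = ["debug", "info", "warn", "error"];

// Hypothetical config shape mirroring the documented variables.
interface ExporterConfig {
  relays: string[];
  probeIntervalSeconds: number; // 0 = probe on each /metrics scrape
  probeTimeoutSeconds: number;
  listenAddr: string;
  port: number;
  logLevel: LogLevel;
  writeCheckEnabled: boolean;
  writeCheckVerifyRead: boolean;
  writeCheckKind: number;
}

type Env = Record<string, string | undefined>;

function parseNonNegativeInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Assumption: only true/false/1/0 are accepted; the real parser may differ.
function parseBool(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = raw.trim().toLowerCase();
  if (value === "true" || value === "1") return true;
  if (value === "false" || value === "0") return false;
  throw new Error(`${name} must be true or false, got "${raw}"`);
}

function loadConfig(env: Env): ExporterConfig {
  const relays = (env.RELAYS ?? "")
    .split(",")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
  if (relays.length === 0) throw new Error("RELAYS is required");
  for (const url of relays) {
    if (url.indexOf("wss://") !== 0) {
      throw new Error(`RELAYS entries must start with wss://, got "${url}"`);
    }
  }

  const logLevel = (env.LOG_LEVEL ?? "info").trim().toLowerCase() as LogLevel;
  if (LOG_LEVELS.indexOf(logLevel) === -1) {
    throw new Error("LOG_LEVEL must be one of debug|info|warn|error");
  }

  return {
    relays,
    probeIntervalSeconds: parseNonNegativeInt(env, "PROBE_INTERVAL_SECONDS", 0),
    probeTimeoutSeconds: parseNonNegativeInt(env, "PROBE_TIMEOUT_SECONDS", 10),
    listenAddr: env.LISTEN_ADDR ?? "0.0.0.0",
    port: parseNonNegativeInt(env, "PORT", 9464),
    logLevel,
    writeCheckEnabled: parseBool(env, "WRITE_CHECK_ENABLED", true),
    writeCheckVerifyRead: parseBool(env, "WRITE_CHECK_VERIFY_READ", false),
    writeCheckKind: parseNonNegativeInt(env, "WRITE_CHECK_KIND", 30078),
  };
}

// Second relay URL is a placeholder for the example.
const config = loadConfig({
  RELAYS: "wss://offchain.pub, wss://relay.example",
  PROBE_INTERVAL_SECONDS: "300",
});
console.log(config.relays, config.probeIntervalSeconds);
```

In a real process the argument would typically be `process.env`; taking it as a parameter keeps the function easy to test.
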
### Write confirmation key material

- `WRITE_CHECK_PRIVKEY` may be an `nsec1...` value or a 64-character hex private key.
- `WRITE_CHECK_PUBKEY` is not needed; write-check pubkey is always derived from the private key.
- If `WRITE_CHECK_PRIVKEY` is missing or invalid and write checks are enabled, the exporter generates an ephemeral key for the running process and continues write checks.
- Private key values are never logged.

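The fallback rules above could be expressed roughly as follows. This is a sketch under assumptions: `classifyPrivkey` and the `KeySource` type are hypothetical names, the `nsec1...` check only looks at the bech32 character set without verifying the checksum, and real decoding, key generation, and pubkey derivation would be delegated to a library such as `nostr-tools`.

```typescript
// Hypothetical result type: which kind of key material the process will use.
type KeySource =
  | { kind: "hex"; hex: string }
  | { kind: "nsec"; nsec: string } // bech32 decoding intentionally omitted
  | { kind: "ephemeral"; reason: "missing" | "invalid" };

function classifyPrivkey(raw: string | undefined): KeySource {
  const value = (raw ?? "").trim();
  if (value === "") return { kind: "ephemeral", reason: "missing" };

  // 64 hex characters = 32-byte private key.
  if (/^[0-9a-fA-F]{64}$/.test(value)) {
    return { kind: "hex", hex: value.toLowerCase() };
  }

  // Shape check only: "nsec1" followed by bech32 data characters.
  if (/^nsec1[02-9ac-hj-np-z]+$/.test(value)) {
    return { kind: "nsec", nsec: value };
  }

  // Invalid input falls back to an ephemeral key. The reason is a fixed
  // string so the key value itself never ends up in a log line.
  return { kind: "ephemeral", reason: "invalid" };
}

const source = classifyPrivkey(undefined);
console.log(source.kind); // logs only the kind, never key material
```

Returning a tagged result instead of throwing matches the documented behavior of continuing write checks with an ephemeral key.
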
## Run

Development with auto-reload:

```bash
pnpm dev
```

Production build:

```bash
pnpm build
pnpm start
```

Run tests:

```bash
pnpm test
```

Run the live write-confirm diagnostic for `offchain.pub` (opt-in):

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
pnpm test test/live.offchain.test.ts
```

Full example with write verification enabled:

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_WRITE_VERIFY_READ=1 \
LIVE_RELAY_TEST_TIMEOUT_SECONDS=8 \
LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1 \
pnpm test test/live.offchain.test.ts
```

Optional knobs:

- `LIVE_RELAY_TEST_SAMPLES` (default `12`)
- `LIVE_RELAY_TEST_SCRAPE_EVERY_MS` (default `1500`)
- `LIVE_RELAY_TEST_TIMEOUT_SECONDS` (default `8`)
- `LIVE_RELAY_TEST_RELAYS` (default `"wss://offchain.pub"`; comma-separated relay list)
- `LIVE_RELAY_TEST_WRITE_VERIFY_READ=1` to enable read-after-write verification (disabled by default)
- `LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1` to make the test fail on any write-confirm/probe failures

## Exposed metrics

Relay-level metrics carry a `relay` label unless stated otherwise:

- `nostr_relay_up` (gauge)
- `nostr_relay_open_ok` (gauge)
- `nostr_relay_read_ok` (gauge)
- `nostr_relay_write_confirm_ok` (gauge)
- `nostr_relay_open_duration_ms` (gauge)
- `nostr_relay_read_duration_ms` (gauge)
- `nostr_relay_write_duration_ms` (gauge, `-1` when unavailable/disabled)
- `nostr_relay_last_success_unixtime` (gauge)
- `nostr_relay_probe_errors_total{relay,check}` (counter)
- `nostr_relay_probe_runs_total{relay,result}` (counter; `result=success|failure`)

Also includes all default Node.js process/runtime metrics from `prom-client`.

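To make the label and sentinel conventions concrete, here is a small TypeScript sketch that renders a few of these gauges as lines in the Prometheus text exposition format. The exporter itself uses `prom-client` for this; the `ProbeResult` shape and the `renderRelaySamples` helper are hypothetical and exist only to show how one probe result might map to samples, including the `-1` write duration when no write check ran.

```typescript
// Hypothetical per-relay probe result; field names are illustrative.
interface ProbeResult {
  relay: string;
  openOk: boolean;
  readOk: boolean;
  openDurationMs: number;
  readDurationMs: number;
  writeDurationMs?: number; // undefined when the write check is disabled or unavailable
}

// Label values escape backslash, double quote, and newline in the text format.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function sample(name: string, relay: string, value: number): string {
  return `${name}{relay="${escapeLabelValue(relay)}"} ${value}`;
}

function renderRelaySamples(result: ProbeResult): string[] {
  return [
    sample("nostr_relay_open_ok", result.relay, result.openOk ? 1 : 0),
    sample("nostr_relay_read_ok", result.relay, result.readOk ? 1 : 0),
    sample("nostr_relay_open_duration_ms", result.relay, result.openDurationMs),
    sample("nostr_relay_read_duration_ms", result.relay, result.readDurationMs),
    // -1 is the documented sentinel for "unavailable/disabled".
    sample("nostr_relay_write_duration_ms", result.relay, result.writeDurationMs ?? -1),
  ];
}

const lines = renderRelaySamples({
  relay: "wss://offchain.pub",
  openOk: true,
  readOk: true,
  openDurationMs: 120,
  readDurationMs: 85,
});
console.log(lines.join("\n"));
```

Because `-1` is a sentinel rather than a measurement, dashboards averaging `nostr_relay_write_duration_ms` may want to filter it out, for example with `nostr_relay_write_duration_ms >= 0`.
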
## Prometheus scrape config example

```yaml
scrape_configs:
  - job_name: nostr-relay-exporter
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "relay-exporter.internal:9464"
```

## Example Grafana queries

- Relay up status by relay:
  - `max by (relay) (nostr_relay_up)`
- Open latency:
  - `avg_over_time(nostr_relay_open_duration_ms[5m])`
- Read latency:
  - `avg_over_time(nostr_relay_read_duration_ms[5m])`
- Probe run success ratio (15m):
  - `sum by (relay) (increase(nostr_relay_probe_runs_total{result="success"}[15m])) / sum by (relay) (increase(nostr_relay_probe_runs_total[15m]))`
- Probe errors by check:
  - `sum by (relay, check) (increase(nostr_relay_probe_errors_total[15m]))`

## Health endpoint

- `GET /healthz` returns:
  - `200` when the process is running and probe data is fresh enough
  - `503` when shutting down or when probe data is stale or not yet available

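The status rules above can be sketched as a pure function. This is an illustration, not the exporter's handler: `healthStatus`, the `HealthInput` fields, and the `maxAgeMs` freshness threshold are assumptions, since the README does not state how "fresh enough" is measured.

```typescript
// Hypothetical inputs to the health decision.
interface HealthInput {
  shuttingDown: boolean;
  lastProbeCompletedAtMs: number | null; // null until the first probe finishes
  nowMs: number;
  maxAgeMs: number; // assumed freshness threshold
}

function healthStatus(input: HealthInput): 200 | 503 {
  if (input.shuttingDown) return 503; // draining: report unhealthy
  if (input.lastProbeCompletedAtMs === null) return 503; // no probe data yet
  const ageMs = input.nowMs - input.lastProbeCompletedAtMs;
  return ageMs <= input.maxAgeMs ? 200 : 503; // stale data is unhealthy
}

const status = healthStatus({
  shuttingDown: false,
  lastProbeCompletedAtMs: 1_000,
  nowMs: 31_000,
  maxAgeMs: 60_000,
});
console.log(status);
```

Passing `nowMs` in explicitly, rather than calling `Date.now()` inside, keeps the decision deterministic under test.
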
## Notes

- Relay probes are isolated; one relay failure does not block others.
- `nocap` default adapters are explicitly loaded before checks.
- Probe concurrency is bounded in code (`DEFAULT_PROBE_CONCURRENCY` in `src/config.ts`).
- Graceful shutdown handles `SIGINT` and `SIGTERM`.
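
The isolation and bounded-concurrency notes can be illustrated with a small worker-pool sketch in TypeScript. It does not use `p-limit` and is not the exporter's probe loop: `runBounded`, `fakeProbe`, the `.example` relay URLs, and the limit of `2` are hypothetical, and the actual value of `DEFAULT_PROBE_CONCURRENCY` is not assumed here.

```typescript
// Per-item outcome, so one failure never rejects the whole batch.
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

async function runBounded<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<Settled<R>[]> {
  const results: Settled<R>[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JavaScript runs callbacks one at a time

  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await task(items[index]) };
      } catch (error) {
        results[index] = { ok: false, error }; // isolate the failure
      }
    }
  }

  // At most `limit` workers pull from the queue concurrently.
  const workerCount = Math.max(1, Math.min(limit, items.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Fake probe that fails for one relay and tracks peak concurrency.
let inFlight = 0;
let maxInFlight = 0;
async function fakeProbe(relay: string): Promise<string> {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  inFlight--;
  if (relay.indexOf("bad") !== -1) throw new Error("connect failed");
  return "ok";
}

const probeRun = runBounded(
  ["wss://a.example", "wss://bad.example", "wss://c.example"],
  2,
  fakeProbe
);
probeRun.then((results) => {
  console.log(results.map((r) => r.ok), "peak concurrency:", maxInFlight);
});
```

Collecting a result per relay, instead of letting a rejection propagate, is what allows a failed relay to be counted as a probe error while the remaining relays are still probed.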