# relay-exporter
Production-oriented Prometheus exporter for monitoring a fixed set of Nostr relays.
## What it does
- Probes each configured relay either on an interval, or on-demand when `/metrics` is scraped.
- Uses `@nostrwatch/nocap` for `open` + `read` checks.
- Optionally performs a low-noise write confirmation check (kind `30078` by default) using deterministic `d` tags.
- Exposes:
  - `/metrics` for Prometheus scraping
  - `/healthz` for process and probe-loop health
- Publishes default process metrics via `prom-client`.
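
To illustrate why deterministic `d` tags keep the write check low-noise, here is a minimal sketch of building the unsigned event for one relay. The tag scheme (`relay-exporter:write-check:<url>`), the URL normalization, and the function names are assumptions made for illustration; the exporter's actual scheme may differ, and signing and publishing are left out.

```typescript
// Unsigned event shape per NIP-01; signing and publishing are out of scope here.
interface EventTemplate {
  kind: number;
  created_at: number; // Unix time in seconds
  tags: string[][];
  content: string;
}

// Hypothetical tag scheme: a fixed prefix plus the normalized relay URL,
// so repeated probes of the same relay always produce the same `d` value.
function writeCheckDTag(relayUrl: string): string {
  const normalized = relayUrl.trim().toLowerCase().replace(/\/+$/, "");
  return `relay-exporter:write-check:${normalized}`;
}

// Builds the unsigned write-check event; 30078 mirrors the documented default kind.
function buildWriteCheckTemplate(
  relayUrl: string,
  nowMs: number,
  kind: number = 30078,
): EventTemplate {
  return {
    kind,
    created_at: Math.floor(nowMs / 1000),
    tags: [["d", writeCheckDTag(relayUrl)]],
    content: "",
  };
}
```

Kinds in the `30000`-`39999` range are addressable under NIP-01, so a relay keeps only the latest event per pubkey, kind, and `d` tag. With a stable key and a stable tag, each new write check replaces the previous one instead of accumulating.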
## Install
```bash
pnpm install
```
If you are bootstrapping from scratch, these are the exact dependency commands:
```bash
pnpm add @nostrwatch/nocap prom-client nostr-tools p-limit
pnpm add -D typescript tsx @types/node
```
## Configuration
Copy the example environment file and edit it:
```bash
cp .env.example .env
```
Example: run probes every 5 minutes instead of on each scrape:
```bash
export PROBE_INTERVAL_MS=300000
```
Environment variables:
- `RELAYS` (required): comma-separated `wss://` URLs
- `PROBE_INTERVAL_MS` (default: `0`; `0` means run probes on each `/metrics` scrape)
- `PROBE_TIMEOUT_MS` (default: `10000`)
- `LISTEN_ADDR` (default: `0.0.0.0`)
- `PORT` (default: `9464`)
- `LOG_LEVEL` (default: `info`; one of `debug|info|warn|error`)
- `WRITE_CHECK_ENABLED` (default: `true`)
- `WRITE_CHECK_KIND` (default: `30078`)
- `WRITE_CHECK_PRIVKEY` (optional; supports `nsec1...` or 64-char hex)
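
The variables and defaults above could be parsed along the following lines. This is a sketch, not the exporter's actual loader: the `ExporterConfig` shape, the function names, the validation rules, and the treatment of `false`/`0` as "disabled" for `WRITE_CHECK_ENABLED` are all assumptions.

```typescript
// Hypothetical config shape; the real type in the exporter may differ.
interface ExporterConfig {
  relays: string[];
  probeIntervalMs: number; // 0 = probe on each /metrics scrape
  probeTimeoutMs: number;
  listenAddr: string;
  port: number;
  logLevel: "debug" | "info" | "warn" | "error";
  writeCheckEnabled: boolean;
  writeCheckKind: number;
}

type Env = Record<string, string | undefined>;

const LOG_LEVELS: readonly string[] = ["debug", "info", "warn", "error"];

// Reads a non-negative integer, falling back to the documented default.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): ExporterConfig {
  const relays = (env["RELAYS"] ?? "")
    .split(",")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
  if (relays.length === 0) throw new Error("RELAYS is required");
  for (const url of relays) {
    if (!url.startsWith("wss://")) throw new Error(`not a wss:// URL: ${url}`);
  }

  const logLevel = env["LOG_LEVEL"] ?? "info";
  if (LOG_LEVELS.indexOf(logLevel) === -1) {
    throw new Error(`LOG_LEVEL must be one of ${LOG_LEVELS.join("|")}`);
  }

  // Assumption: only "false" and "0" disable the write check.
  const writeFlag = (env["WRITE_CHECK_ENABLED"] ?? "true").trim().toLowerCase();

  return {
    relays,
    probeIntervalMs: intFromEnv(env, "PROBE_INTERVAL_MS", 0),
    probeTimeoutMs: intFromEnv(env, "PROBE_TIMEOUT_MS", 10000),
    listenAddr: env["LISTEN_ADDR"] ?? "0.0.0.0",
    port: intFromEnv(env, "PORT", 9464),
    logLevel: logLevel as ExporterConfig["logLevel"],
    writeCheckEnabled: writeFlag !== "false" && writeFlag !== "0",
    writeCheckKind: intFromEnv(env, "WRITE_CHECK_KIND", 30078),
  };
}
```

Passing the environment in as a plain object (for example `loadConfig(process.env)`) keeps the parser easy to test without touching the real process environment.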
### Write confirmation key material
- `WRITE_CHECK_PRIVKEY` may be an `nsec1...` value or a 64-character hex private key.
- `WRITE_CHECK_PUBKEY` is not needed; the write-check public key is always derived from the private key.
- If `WRITE_CHECK_PRIVKEY` is missing or invalid and write checks are enabled, the exporter generates an ephemeral key for the running process and continues write checks.
- Private key values are never logged.
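
The key-handling rules above can be sketched as a small decision function. All names here are hypothetical, and the `nsec1...` test is a shape check only: a real implementation would also verify the bech32 checksum and decode the value (for example with a NIP-19 decoder) before using it.

```typescript
type KeyFormat = "nsec" | "hex" | "invalid";
type KeySource = "configured" | "ephemeral" | "disabled";

const HEX_KEY = /^[0-9a-fA-F]{64}$/;
// Shape check only: "nsec1" followed by 58 bech32 characters (63 in total).
const NSEC_SHAPE = /^nsec1[qpzry9x8gf2tvdw0s3jn54khce6mua7l]{58}$/;

function classifyPrivkey(raw: string | undefined): KeyFormat {
  const value = (raw ?? "").trim();
  if (HEX_KEY.test(value)) return "hex";
  if (NSEC_SHAPE.test(value)) return "nsec";
  return "invalid";
}

// A missing or invalid key falls back to an ephemeral key when write checks are on.
function chooseKeySource(writeCheckEnabled: boolean, raw: string | undefined): KeySource {
  if (!writeCheckEnabled) return "disabled";
  return classifyPrivkey(raw) === "invalid" ? "ephemeral" : "configured";
}

// Log line that reports only the detected format, never the key itself.
function describeKeyForLogs(raw: string | undefined): string {
  return `write-check key format=${classifyPrivkey(raw)}`;
}
```

Keeping the classification separate from the decoding makes it straightforward to log which path was taken without ever passing the secret to the logger.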
## Run
Development with auto-reload:
```bash
pnpm dev
```
Production build:
```bash
pnpm build
pnpm start
```
Run tests:
```bash
pnpm test
```
## Exposed metrics
Relay-level metrics carry a single `relay` label unless stated otherwise:
- `nostr_relay_up` (gauge)
- `nostr_relay_open_ok` (gauge)
- `nostr_relay_read_ok` (gauge)
- `nostr_relay_write_confirm_ok` (gauge)
- `nostr_relay_open_duration_ms` (gauge)
- `nostr_relay_read_duration_ms` (gauge)
- `nostr_relay_write_duration_ms` (gauge, `-1` when unavailable/disabled)
- `nostr_relay_last_success_unixtime` (gauge)
- `nostr_relay_probe_errors_total{relay,check}` (counter)
- `nostr_relay_probe_runs_total{relay,result}` (counter; `result=success|failure`)
Also includes all default Node.js process/runtime metrics from `prom-client`.
## Prometheus scrape config example
```yaml
scrape_configs:
  - job_name: nostr-relay-exporter
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "relay-exporter.internal:9464"
```
## Example Grafana queries
- Relay up status by relay:
  - `max by (relay) (nostr_relay_up)`
- Open latency:
  - `avg_over_time(nostr_relay_open_duration_ms[5m])`
- Read latency:
  - `avg_over_time(nostr_relay_read_duration_ms[5m])`
- Probe run success ratio (15m):
  - `sum by (relay) (increase(nostr_relay_probe_runs_total{result="success"}[15m])) / sum by (relay) (increase(nostr_relay_probe_runs_total[15m]))`
- Probe errors by check:
  - `sum by (relay, check) (increase(nostr_relay_probe_errors_total[15m]))`
## Health endpoint
- `GET /healthz` returns:
  - `200` when the process is running and probe data is fresh enough
  - `503` when shutting down or when probe data is stale or not yet available
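
The status rules above amount to a small pure function. The sketch below assumes a freshness threshold (`maxAgeMs`) and field names that do not appear in this README; the exporter's actual threshold and state tracking may differ.

```typescript
// Hypothetical inputs to the health decision.
interface HealthInput {
  shuttingDown: boolean;
  lastProbeCompletedAtMs: number | null; // null until the first probe run finishes
  nowMs: number;
  maxAgeMs: number; // assumed freshness threshold
}

function healthStatusCode(input: HealthInput): 200 | 503 {
  if (input.shuttingDown) return 503; // draining: report unhealthy
  if (input.lastProbeCompletedAtMs === null) return 503; // no probe data yet
  const ageMs = input.nowMs - input.lastProbeCompletedAtMs;
  return ageMs <= input.maxAgeMs ? 200 : 503; // stale data counts as unhealthy
}
```

Taking the current time as an argument rather than calling `Date.now()` inside the function keeps the decision deterministic and easy to unit-test.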
## Notes
- Relay probes are isolated; one relay failure does not block others.
- `nocap` default adapters are explicitly loaded before checks.
- Probe concurrency is bounded in code (`DEFAULT_PROBE_CONCURRENCY` in `src/config.ts`).
- Graceful shutdown handles `SIGINT` and `SIGTERM`.
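
The first and third notes (probe isolation and bounded concurrency) can be illustrated with a small worker pool. This is a self-contained sketch rather than the exporter's code, which depends on `p-limit`; `probeAll`, `ProbeResult`, and the `probe` callback are hypothetical names.

```typescript
type ProbeResult<T> =
  | { relay: string; ok: true; value: T }
  | { relay: string; ok: false; error: string };

// Runs `probe` for every relay with at most `concurrency` probes in flight.
// A rejected probe is recorded as a failure and never aborts the others.
async function probeAll<T>(
  relays: string[],
  probe: (relay: string) => Promise<T>,
  concurrency: number,
): Promise<ProbeResult<T>[]> {
  const results: ProbeResult<T>[] = new Array(relays.length);
  let next = 0; // index of the next relay to claim

  async function worker(): Promise<void> {
    while (next < relays.length) {
      const index = next++; // claim synchronously, so no two workers share an index
      const relay = relays[index];
      try {
        results[index] = { relay, ok: true, value: await probe(relay) };
      } catch (err) {
        const error = err instanceof Error ? err.message : String(err);
        results[index] = { relay, ok: false, error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, relays.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results come back in the same order as the input list, so a caller can update per-relay gauges by index even though probes finish out of order.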