relay-exporter/README.md

# relay-exporter

Production-oriented Prometheus exporter for monitoring a fixed set of Nostr relays.

## What it does

- Probes each configured relay either on a fixed interval or on demand when `/metrics` is scraped.
- Uses `@nostrwatch/nocap` for `open` + `read` checks.
- Optionally performs a low-noise write confirmation check (kind `30078` by default) using deterministic `d` tags.
- Exposes:
  - `/metrics` for Prometheus scraping
  - `/healthz` for process and probe-loop health
- Publishes default process metrics via `prom-client`.
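
The write check stays low-noise because kind `30078` falls in the addressable range (`30000`-`39999`), where a relay keeps only the latest event per pubkey, kind, and `d` tag. The sketch below shows one way a deterministic `d` tag could be built. The tag format and the `writeCheckDTag` and `buildWriteCheckEvent` helpers are assumptions for illustration, not the exporter's actual code.

```typescript
// Hypothetical shape of an event before signing (field names follow NIP-01).
interface UnsignedEvent {
  kind: number;
  created_at: number;
  tags: string[][];
  content: string;
}

// Assumed tag format: a fixed prefix plus the relay URL with surrounding
// whitespace and trailing slashes removed, so the same relay always maps
// to the same `d` value.
function writeCheckDTag(relayUrl: string): string {
  const normalized = relayUrl.trim().replace(/\/+$/, "");
  return `relay-exporter:write-check:${normalized}`;
}

// Builds the unsigned write-check event. Because the `d` tag is stable,
// each new event replaces the previous one for the same signing key.
function buildWriteCheckEvent(
  relayUrl: string,
  nowMs: number,
  kind = 30078,
): UnsignedEvent {
  return {
    kind,
    created_at: Math.floor(nowMs / 1000), // Nostr timestamps are in seconds
    tags: [["d", writeCheckDTag(relayUrl)]],
    content: "",
  };
}
```

Replacement only applies to events signed by the same key, so a stable `WRITE_CHECK_PRIVKEY` keeps the number of stored events per relay lower than a key that changes on every restart.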

## Install

```bash
pnpm install
```

If you are bootstrapping from scratch, these are the exact dependency commands:

```bash
pnpm add @nostrwatch/nocap prom-client nostr-tools p-limit
pnpm add -D typescript tsx @types/node
```

## Configuration

Copy and edit:

```bash
cp .env.example .env
```

Example: run probes every 5 minutes instead of on each scrape:

```bash
export PROBE_INTERVAL_MS=300000
```

Environment variables:

- `RELAYS` (required): comma-separated `wss://` URLs
- `PROBE_INTERVAL_MS` (default: `0`; `0` means run probes on each `/metrics` scrape)
- `PROBE_TIMEOUT_MS` (default: `10000`)
- `LISTEN_ADDR` (default: `0.0.0.0`)
- `PORT` (default: `9464`)
- `LOG_LEVEL` (default: `info`; one of `debug|info|warn|error`)
- `WRITE_CHECK_ENABLED` (default: `true`)
- `WRITE_CHECK_KIND` (default: `30078`)
- `WRITE_CHECK_PRIVKEY` (optional; supports `nsec1...` or 64-char hex)
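
As a rough illustration of how these variables and defaults fit together, the following sketch parses an environment map into a typed config object. The `ExporterConfig` shape and the `loadConfig` and `intFromEnv` helpers are hypothetical, and the validation rules (non-negative integers, `wss://` prefix check) are assumptions rather than the exporter's documented behavior.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical config shape mirroring the variables listed above.
interface ExporterConfig {
  relays: string[];
  probeIntervalMs: number;
  probeTimeoutMs: number;
  listenAddr: string;
  port: number;
  writeCheckEnabled: boolean;
  writeCheckKind: number;
}

// Reads a non-negative integer, falling back to the default when unset.
function intFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

function loadConfig(env: Env): ExporterConfig {
  // RELAYS is comma-separated; tolerate whitespace and empty segments.
  const relays = (env["RELAYS"] ?? "")
    .split(",")
    .map((url) => url.trim())
    .filter((url) => url.length > 0);
  if (relays.length === 0) throw new Error("RELAYS is required");
  for (const url of relays) {
    if (!url.startsWith("wss://")) {
      throw new Error(`RELAYS entry is not a wss:// URL: ${url}`);
    }
  }
  return {
    relays,
    probeIntervalMs: intFromEnv(env, "PROBE_INTERVAL_MS", 0), // 0 = probe on scrape
    probeTimeoutMs: intFromEnv(env, "PROBE_TIMEOUT_MS", 10000),
    listenAddr: env["LISTEN_ADDR"] ?? "0.0.0.0",
    port: intFromEnv(env, "PORT", 9464),
    // Assumption: anything other than "false" leaves write checks enabled.
    writeCheckEnabled: (env["WRITE_CHECK_ENABLED"] ?? "true").toLowerCase() !== "false",
    writeCheckKind: intFromEnv(env, "WRITE_CHECK_KIND", 30078),
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so that a malformed value fails fast instead of surfacing during the first probe.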

### Write confirmation key material

- `WRITE_CHECK_PRIVKEY` may be an `nsec1...` value or a 64-character hex private key.
- `WRITE_CHECK_PUBKEY` is not needed; the write-check pubkey is always derived from the private key.
- If `WRITE_CHECK_PRIVKEY` is missing or invalid and write checks are enabled, the exporter generates an ephemeral key for the running process and continues write checks.
- Private key values are never logged.
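
The fallback rules above can be sketched as a small pure function. Everything here is hypothetical: `resolveWriteKey` is not the exporter's API, and the `decodeNsec` and `generateEphemeral` callbacks stand in for whatever key library the exporter uses (presumably `nostr-tools`). The hex branch only checks the shape of the value, not whether it is a valid secp256k1 scalar.

```typescript
type KeySource = "configured" | "ephemeral";

interface ResolvedKey {
  source: KeySource;
  secretHex: string; // must never be written to logs
}

// Decides which secret the write check signs with.
// decodeNsec returns the hex secret, or null when the nsec is malformed.
function resolveWriteKey(
  raw: string | undefined,
  decodeNsec: (nsec: string) => string | null,
  generateEphemeral: () => string,
): ResolvedKey {
  const value = (raw ?? "").trim();

  // 64-character hex private key.
  if (/^[0-9a-fA-F]{64}$/.test(value)) {
    return { source: "configured", secretHex: value.toLowerCase() };
  }

  // Bech32-encoded nsec1... key.
  if (value.startsWith("nsec1")) {
    const decoded = decodeNsec(value);
    if (decoded !== null) {
      return { source: "configured", secretHex: decoded };
    }
  }

  // Missing or invalid: fall back to a key that lives only for this process.
  return { source: "ephemeral", secretHex: generateEphemeral() };
}
```

Returning the `source` alongside the secret lets the caller log which path was taken (for example, a warning on `ephemeral`) without ever logging the key itself.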

## Run

Development with auto-reload:

```bash
pnpm dev
```

Production build:

```bash
pnpm build
pnpm start
```

Run tests:

```bash
pnpm test
```

## Exposed metrics

Relay-level metrics carry a single `relay` label unless stated otherwise:

- `nostr_relay_up` (gauge)
- `nostr_relay_open_ok` (gauge)
- `nostr_relay_read_ok` (gauge)
- `nostr_relay_write_confirm_ok` (gauge)
- `nostr_relay_open_duration_ms` (gauge)
- `nostr_relay_read_duration_ms` (gauge)
- `nostr_relay_write_duration_ms` (gauge, `-1` when unavailable/disabled)
- `nostr_relay_last_success_unixtime` (gauge)
- `nostr_relay_probe_errors_total{relay,check}` (counter)
- `nostr_relay_probe_runs_total{relay,result}` (counter; `result=success|failure`)

Also includes all default Node.js process/runtime metrics from `prom-client`.

## Prometheus scrape config example

```yaml
scrape_configs:
  - job_name: nostr-relay-exporter
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "relay-exporter.internal:9464"
```

## Example Grafana queries

- Relay up status by relay:
  - `max by (relay) (nostr_relay_up)`
- Open latency:
  - `avg_over_time(nostr_relay_open_duration_ms[5m])`
- Read latency:
  - `avg_over_time(nostr_relay_read_duration_ms[5m])`
- Probe run success ratio (15m):
  - `sum by (relay) (increase(nostr_relay_probe_runs_total{result="success"}[15m])) / sum by (relay) (increase(nostr_relay_probe_runs_total[15m]))`
- Probe errors by check:
  - `sum by (relay, check) (increase(nostr_relay_probe_errors_total[15m]))`

## Health endpoint

- `GET /healthz` returns:
  - `200` when the process is running and probe data is fresh enough
  - `503` when shutting down or probe data is stale/not yet available
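
The status decision can be pictured as a pure function of process state and probe age. This is a minimal sketch: the `HealthInput` fields and the `maxAgeMs` freshness threshold are assumptions, since the actual staleness limit is not stated here.

```typescript
interface HealthInput {
  shuttingDown: boolean;
  lastProbeCompletedAtMs: number | null; // null until the first probe finishes
  nowMs: number;
  maxAgeMs: number; // assumed freshness threshold
}

// Maps the current state to the HTTP status returned by GET /healthz.
function healthStatusCode(input: HealthInput): 200 | 503 {
  if (input.shuttingDown) return 503;
  if (input.lastProbeCompletedAtMs === null) return 503; // no probe data yet
  const ageMs = input.nowMs - input.lastProbeCompletedAtMs;
  return ageMs <= input.maxAgeMs ? 200 : 503;
}
```

Keeping the decision free of I/O makes it straightforward to unit-test the stale, fresh, and shutting-down cases separately from the HTTP server.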

## Notes

- Relay probes are isolated; one relay failure does not block others.
- `nocap` default adapters are explicitly loaded before checks.
- Probe concurrency is bounded in code (`DEFAULT_PROBE_CONCURRENCY` in `src/config.ts`).
- Graceful shutdown handles `SIGINT` and `SIGTERM`.
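
To illustrate how probe isolation and bounded concurrency can combine, here is a minimal sketch. It uses a hand-rolled worker pool as a stand-in for `p-limit`, and `probeAll` and `Settled` are hypothetical names, not the exporter's actual code.

```typescript
type Settled<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Runs `probe` for every relay with at most `concurrency` probes in flight.
// A failing probe is captured as a result and never rejects the whole batch.
async function probeAll<T>(
  relays: string[],
  probe: (relay: string) => Promise<T>,
  concurrency: number,
): Promise<Map<string, Settled<T>>> {
  const results = new Map<string, Settled<T>>();
  let next = 0; // index of the next relay to claim

  async function worker(): Promise<void> {
    // Claiming an index is synchronous, so two workers never take the same relay.
    while (next < relays.length) {
      const relay = relays[next++];
      try {
        results.set(relay, { ok: true, value: await probe(relay) });
      } catch (error) {
        results.set(relay, { ok: false, error });
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, relays.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because each worker catches errors per relay, a relay that times out or refuses the connection only affects its own entry in the result map.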