# relay-exporter

Production-oriented Prometheus exporter for monitoring a fixed set of Nostr relays.

## What it does

- Probes each configured relay either on an interval, or on-demand when `/metrics` is scraped.
- Uses `@nostrwatch/nocap` for `open` + `read` checks.
- Performs a low-noise write confirmation check (kind `30078` by default) using deterministic `d` tags.
- By default, requires read-after-write verification.
- Exposes:
  - `/metrics` for Prometheus scraping
  - `/healthz` for process and probe-loop health
- Publishes default process metrics via `prom-client`.

## Install

```bash
pnpm install
```

If you are bootstrapping from scratch, these are the exact dependency commands:

```bash
pnpm add @nostrwatch/nocap prom-client nostr-tools p-limit
pnpm add -D typescript tsx @types/node
```

## Configuration

Copy and edit:

```bash
cp .env.example .env
```

Example: run probes every 5 minutes instead of on each scrape:

```bash
export PROBE_INTERVAL_SECONDS=300
```

Environment variables:

- `RELAYS` (required): comma-separated `wss://` URLs
- `PROBE_INTERVAL_SECONDS` (default: `0`; `0` means run probes on each `/metrics` scrape)
- `PROBE_TIMEOUT_SECONDS` (default: `10`)
- `LISTEN_ADDR` (default: `0.0.0.0`)
- `PORT` (default: `9464`)
- `LOG_LEVEL` (default: `info`; one of `debug|info|warn|error`)
- `WRITE_CHECK_ENABLED` (default: `true`)
- `WRITE_CHECK_VERIFY_READ` (default: `true`; set `false` to treat publish `OK` as sufficient)
- `WRITE_CHECK_KIND` (default: `30078`)
- `WRITE_CHECK_PRIVKEY` (optional; supports `nsec1...` or 64-char hex)

### Write confirmation key material

- `WRITE_CHECK_PRIVKEY` may be an `nsec1...` value or a 64-character hex private key.
- `WRITE_CHECK_PUBKEY` is not needed; write-check pubkey is always derived from the private key.
- If `WRITE_CHECK_PRIVKEY` is missing or invalid and write checks are enabled, the exporter generates an ephemeral key for the running process and continues write checks.
- Private key values are never logged.

## Run

Development with auto-reload:

```bash
pnpm dev
```

Production build:

```bash
pnpm build
pnpm start
```

Run tests:

```bash
pnpm test
```

Run the live write-confirm diagnostic for `offchain.pub` (opt-in):

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
pnpm test test/live.offchain.test.ts
```

Full example with write verification enabled:

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_WRITE_VERIFY_READ=1 \
LIVE_RELAY_TEST_TIMEOUT_SECONDS=8 \
LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1 \
pnpm test test/live.offchain.test.ts
```

Faster local loop (reduced stability sampling):

```bash
LIVE_RELAY_TEST_OFFCHAIN=1 \
LIVE_RELAY_TEST_RELAYS="wss://offchain.pub" \
LIVE_RELAY_TEST_SAMPLES=2 \
LIVE_RELAY_TEST_SCRAPE_EVERY_MS=250 \
pnpm test test/live.offchain.test.ts
```

Optional knobs (defaults favor faster feedback):

- `LIVE_RELAY_TEST_SAMPLES` (default `4`)
- `LIVE_RELAY_TEST_SCRAPE_EVERY_MS` (default `500`)
- `LIVE_RELAY_TEST_TIMEOUT_SECONDS` (default `8`)
- `LIVE_RELAY_TEST_RELAYS` (default `"wss://offchain.pub"`; comma-separated relay list)
- `LIVE_RELAY_TEST_WRITE_VERIFY_READ=1` to force read-after-write verification
- `LIVE_RELAY_TEST_EXPECT_NO_FAILURES=1` to make the test fail on any write-confirm/probe failures

## Exposed metrics

Relay-level labels use `{relay}` unless stated:

- `nostr_relay_up` (gauge)
- `nostr_relay_open_ok` (gauge)
- `nostr_relay_read_ok` (gauge)
- `nostr_relay_write_confirm_ok` (gauge)
- `nostr_relay_open_duration_ms` (gauge)
- `nostr_relay_read_duration_ms` (gauge)
- `nostr_relay_write_duration_ms` (gauge, `-1` when unavailable/disabled)
- `nostr_relay_last_success_unixtime` (gauge)
- `nostr_relay_probe_errors_total{relay,check}` (counter)
- `nostr_relay_probe_runs_total{relay,result}` (counter; `result=success|failure`)

Also includes all default Node.js process/runtime metrics from `prom-client`.

## Prometheus scrape config example

```yaml
scrape_configs:
  - job_name: nostr-relay-exporter
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - "relay-exporter.internal:9464"
```

## Example Grafana queries

- Relay up status by relay:
  - `max by (relay) (nostr_relay_up)`
- Open latency:
  - `avg_over_time(nostr_relay_open_duration_ms[5m])`
- Read latency:
  - `avg_over_time(nostr_relay_read_duration_ms[5m])`
- Write confirmation success ratio (15m):
  - `sum by (relay) (increase(nostr_relay_probe_runs_total{result="success"}[15m])) / sum by (relay) (increase(nostr_relay_probe_runs_total[15m]))`
- Probe errors by check:
  - `sum by (relay, check) (increase(nostr_relay_probe_errors_total[15m]))`

## Health endpoint

- `GET /healthz` returns:
  - `200` when process is running and probe data is fresh enough
  - `503` when shutting down or probe data is stale/not yet available

## Notes

- Relay probes are isolated; one relay failure does not block others.
- `nocap` default adapters are explicitly loaded before checks.
- Probe concurrency is bounded in code (`DEFAULT_PROBE_CONCURRENCY` in `src/config.ts`).
- Graceful shutdown handles `SIGINT` and `SIGTERM`.