Releases: BitpingApp/Bitping-Node
26.6.12-1
Changelog
⚙️ Miscellaneous Tasks
- Remove Ack.move_to field, MoveTo variant owns redirects
⛰️ Features
- Redirect shed arrivals with MoveTo, not Ack
An Ack carrying move_to said "you are in" while the hub had not taken
the node on — a redirected node believed it was connected while pooled
nowhere, serving nothing. The redirect is now its own response:
MoveTo(targets) means NOT accepted, hub is full, take the handshake to
one of these addresses.
Node side: on MoveTo the node either follows the redirect or — when it
refuses (own move cooldown, missing RecommendPeers scope, empty
targets) — treats it as a plain refusal and dials a different hub.
There is no state in which a node stays attached to a hub that
redirected it.
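
The node-side rule above can be sketched as a single pure function. This is a minimal illustration, not the project's code: the type names, the `NodeView` struct and the choice of the first target are assumptions made for the example.

```rust
/// Hypothetical, simplified mirror of the handshake response described above.
#[derive(Debug, Clone, PartialEq)]
pub enum HandshakeResponse {
    /// Accepted: the hub has taken the node on.
    Ack,
    /// Rejected for now: retry after this many seconds.
    Backoff(u64),
    /// NOT accepted, hub is full: take the handshake to one of these addresses.
    MoveTo(Vec<String>),
}

/// What the node does next. No variant keeps it attached after a MoveTo.
#[derive(Debug, Clone, PartialEq)]
pub enum NextStep {
    StayConnected,
    RetryAfter(u64),
    DialTarget(String),
    DialDifferentHub,
}

/// Node-local facts that decide whether a redirect may be followed.
pub struct NodeView {
    pub move_cooldown_active: bool,
    pub has_recommend_peers_scope: bool,
}

pub fn next_step(resp: &HandshakeResponse, node: &NodeView) -> NextStep {
    match resp {
        HandshakeResponse::Ack => NextStep::StayConnected,
        HandshakeResponse::Backoff(secs) => NextStep::RetryAfter(*secs),
        HandshakeResponse::MoveTo(targets) => {
            let may_follow = !node.move_cooldown_active && node.has_recommend_peers_scope;
            // Assumption: the first target is used; the source does not say
            // how a node picks among several.
            match targets.first() {
                Some(target) if may_follow => NextStep::DialTarget(target.clone()),
                // Refused or empty: a plain refusal, so dial some other hub.
                _ => NextStep::DialDifferentHub,
            }
        }
    }
}
```

Every `MoveTo` arm ends in a dial somewhere else, which is the "no state in which a node stays attached" property expressed in types.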
Wire compat: the variant is appended, so decoders that predate it fail
cleanly and redial elsewhere — shedding degrades to untargeted for
not-yet-updated nodes, with no Backoff loop (the node was never
pooled, so the dedup gate is not involved). Ack keeps its now-unused
move_to field: removing it changes the Ack encoding, and deployed
nodes that verify signatures over a re-encode would reject every Ack
(the 2026-06-11 incident class) until BIT-609 saturates.
🐛 Bug Fixes
- Make move_to hints land — hint after dedup, pool-clear on issue
Production showed hints firing at the full budget ceiling with zero net
drain off the hot hub, for two reasons. First, the hint was computed at
the top of the handshake actor, so budget tokens burned on handshakes
that ended in Backoff/Error — during the straggler churn that was most
of them. Second, a hinted node's move bounced off the cross-hub dedup
gate: the target's HasPeer probe raced the old hub's connection close
and answered present, costing every move a 60s Backoff and a random
re-roll. Hints now issue only after validation + dedup-allow, and the
issuing hub clears the node from its own pool in the same breath, so
the target sees it as absent immediately. The hint log now carries the
node's peer id so individual moves are traceable.
BIT-596
- Defer the post-move dial one tick
A hinted node disconnected from the shedding hub and dialed the target
in the same tick, so the target's HasPeer dedup probe raced the close
and bounced the move with a 60s Backoff. Arming the dial throttle after
the disconnect pushes the target dial to the next tick, giving the old
hub time to register the close.
BIT-596
- Exempt own-pool refreshes from cross-hub dedup
A token-refresh handshake from a node already in this hub pool is not a
double-connect — fanning it out let any stale pool entry on a peer hub
(a connection the node forgot to close) bounce every refresh into
Backoff(60s). At 5-min refresh intervals and 75s retries this grew into
the 2026-06-12 carousel: 90% of 19k handshakes/min rejected and all-hub
CPU climbing linearly toward the limit.
The exemption is TTL-bound (30-min fanout re-verify) so a dual-accept
race cannot hide forever; a re-verify that finds the node resident
elsewhere resolves current-hub-wins — the fresh signed handshake proves
the live session is here — with a metric instead of a reject storm.
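
A minimal sketch of the TTL-bound exemption, assuming the hub records when each pooled node was last verified by fanout. The names `DedupPath`, `dedup_path` and `REVERIFY_TTL` are illustrative, not taken from the codebase.

```rust
use std::time::{Duration, Instant};

/// Assumed re-verify interval, from the "30-min fanout re-verify" above.
pub const REVERIFY_TTL: Duration = Duration::from_secs(30 * 60);

#[derive(Debug, PartialEq)]
pub enum DedupPath {
    /// Refresh from a node already in this hub's pool: skip the cross-hub probe.
    Exempt,
    /// New arrival, or the exemption has expired: fan out HasPeer probes.
    Fanout,
}

/// `last_verified` is None when the node is not in this hub's pool.
pub fn dedup_path(last_verified: Option<Instant>, now: Instant) -> DedupPath {
    match last_verified {
        Some(at) if now.saturating_duration_since(at) < REVERIFY_TTL => DedupPath::Exempt,
        _ => DedupPath::Fanout,
    }
}
```

With 5-minute refreshes, a pooled node would take the exempt path for about six refreshes in a row and then be re-verified once.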
Drops the hint-time pool-clear from the shed path: unpooling at hint
issuance silently stopped dispatching to nodes that decline the hint
(15-min move cooldown) while they stay connected and green. The pool
entry now clears on the node's real ConnectionClosed; compliant nodes
close first and defer the target dial a tick, so the move target's
HasPeer probe still sees them as absent.
Also instruments the dedup reject path — it was invisible: rejects
returned signed Backoffs while handshake_status/handshake_backoff_issued
only counted the validation path, so 17k dedup Backoffs/min metered as
result="completed".
- Shed only new arrivals, never pooled nodes
A node already in the pool is serving — uprooting it mid-session was
the disruptive half of load shedding. The move_to hint is now an answer
to a knock from a NEW node only: "soz we are full, check these guys".
Hubs drain through natural churn — every reconnecting node gets
redirected away from a hot hub — instead of through mid-session
evictions.
A hinted node is not registered: being redirected means we did not take
it on, so it never enters the pool and never receives jobs here. That
is also the enforcement for ignoring the hint — a decliner sits jobless
until it follows the hint or this hub drops back under the shed
threshold and accepts it plainly.
🚜 Refactor
- Actor-model hub connection — one inbox, one machine
HubConnection logic was split across a 754-line multi-hub handshake
table, dial/move/refresh helpers, and five call sites that each mutated
state directly. The multi-hub table also made stray connections
possible: a node could dial a new hub while old connections lived on,
keeping it in other hubs' pools: the residency behind the 2026-06-12
dedup-Backoff carousel.
Now every signal enters HubConnection::handle as a HubMsg; ingestion
mutates a five-state session (Idle, Dialing, Handshaking, Connected,
plus Refreshing for in-place token rotation); plan() is the single
decision point turning state into actions; reconcile() applies them to
the swarm. Single-homing is enforced unconditionally in plan: every
known hub the session is not bound to gets disconnected, so a forgotten
connection is structurally impossible. Failures tear down to Idle with
the hub cooled down.
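
The shape of the session machine and its single decision point can be sketched as follows. This is an assumption-laden illustration: `HubId`, the `Action` enum and the `plan` signature are invented for the example, and the real `plan()` and `reconcile()` certainly carry more state.

```rust
/// Hypothetical hub identifier; the real code presumably uses a libp2p peer id.
pub type HubId = u32;

/// The five session states named above.
#[derive(Debug, Clone, PartialEq)]
pub enum Session {
    Idle,
    Dialing(HubId),
    Handshaking(HubId),
    Connected(HubId),
    Refreshing(HubId),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    Dial(HubId),
    Disconnect(HubId),
}

impl Session {
    /// The hub this session is bound to, if any.
    pub fn bound_hub(&self) -> Option<HubId> {
        match self {
            Session::Idle => None,
            Session::Dialing(h)
            | Session::Handshaking(h)
            | Session::Connected(h)
            | Session::Refreshing(h) => Some(*h),
        }
    }
}

/// Pure decision point: turns state into actions without touching the swarm.
/// `connected` lists every hub with a live connection; `candidate` is the
/// hub to dial when idle.
pub fn plan(session: &Session, connected: &[HubId], candidate: Option<HubId>) -> Vec<Action> {
    let bound = session.bound_hub();
    // Single-homing: disconnect every known hub the session is not bound to.
    let mut actions: Vec<Action> = connected
        .iter()
        .filter(|h| Some(**h) != bound)
        .map(|h| Action::Disconnect(*h))
        .collect();
    if *session == Session::Idle {
        if let Some(h) = candidate {
            actions.push(Action::Dial(h));
        }
    }
    actions
}
```

Because `plan` runs the disconnect filter on every call regardless of state, a stray connection cannot survive a planning pass, which is what makes the function easy to cover with plain unit tests.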
Hardening from adversarial review: connection-close redials are
jittered and avoid the closed hub (no synchronized herd on hub
restart); dial failures and identifies only act on the session's own
dial target (dcutr/relay failures cannot disturb the session); a slow
token refresh keeps the serving connection instead of migrating; every
teardown path surfaces DisconnectedFromHub to the UI; wire-supplied
Backoff durations are clamped; the bootstrap list is capped.
The session machine is pure and covered by 20 unit tests.
26.6.11-1
Changelog
⚙️ CI/CD
- Pin commit-lint to amd64 runners
The commit-lint job installs cocogitto, which ships only an x86_64 musl build
for Linux (the aarch64 release is glibc, incompatible with the alpine:3.20 musl
image). The docker runner pool started scheduling the job on aarch64 nodes,
so the x86_64 cog binary failed to exec (cog: not found -> job fails, blocking
every MR's merge). Pin to amd64, matching the customer release jobs.
- Sync alert rules to Grafana provisioning API on master
Grafana Cloud has no file-based alert provisioning and Git Sync only
covers dashboards, so alerting/*.yaml gets a real reconciler:
alerting/.gitlab-ci.yml adds a deploy-prod job (included from the root
pipeline) that runs alerting/sync-alerts.sh — bash + yq + jq + curl —
upserting every rule by UID via the provisioning API whenever alerting/
changes on master (manual trigger via SYNC_ALERTS=1). Rules are sent
with X-Disable-Provenance so they stay UI-editable between syncs; the
next sync overwrites drift. The jq transform expands the YAML shorthand
(refId/datasource/threshold-condition defaults) to exactly what Grafana
fills in server-side; --dry-run prints payloads without calling the API.
Uses the existing GRAFANA_CLOUD_URL / GRAFANA_ORG_ID /
GRAFANA_SERVICE_ACCOUNT_TOKEN CI variables — no new secrets.
BIT-604
- Tolerate trailing slash and redirects in Grafana alert sync
The first sync-alerts run failed with HTTP 301 on every rule:
GRAFANA_CLOUD_URL carries a trailing slash, the //api path gets
redirected to the canonical URL, and curl does not follow redirects by
default. Strip trailing slashes from the base URL and add -L (explicit
-X keeps PUT/POST across redirects) so both the slash and any
http-to-https canonicalization resolve.
BIT-604
- Drop X-Grafana-Org-Id from alert sync
The second sync run reached the API but every upsert got 401
organization-mismatch: GRAFANA_ORG_ID holds the grafana.com cloud-org
id, not the instance-internal org, and sending it as X-Grafana-Org-Id
makes Grafana reject an otherwise-valid token. Instance service-account
tokens are already org-bound, so drop the header entirely.
BIT-604
- Force frame pointers through the zigbuild env (BIT-606)
The build job's before_script regenerates .cargo/config.toml, clobbering
the committed force-frame-pointers config before every build — which is
why deployed hub heap profiles stayed 2-frame stacks. Set RUSTFLAGS in
the job env (env wins over any config file) and add CFLAGS
-fno-omit-frame-pointer so jemalloc's own C objects keep frame pointers
too: the profiler's stack walk starts inside jemalloc, so without C
frame pointers the chain dies before reaching Rust regardless of how
the Rust code was built.
⚙️ Miscellaneous Tasks
- Bump version to 1.0.2
⛰️ Features
- Re-add cross-hub handshake dedup gate (BIT-595)
Restores the one-hub-per-node guarantee the actor migration dropped — the
per-K8s-node connection-saturation guard from INCIDENT-DNS-CP2.
Gate (HandshakeActor): before accepting a node, fan out a HasPeer probe to the
peer hubs (QUERY_TIMEOUT budget, default-allow on timeout). If any hub already
holds the node, reject with a Backoff and don't register it. The HasPeer wire
protocol was still intact on the receiving side; the migration had dropped only
the sender.
Repeat-offender enforcement: a node that keeps double-connecting past
BLOCKLIST_THRESHOLD is blocked at the libp2p layer. The gate fires a BlockPeer
command; the event loop blocks via a new allow_block_list behaviour (closes
connections + denies new ones) and records an unblock time, and a periodic sweep
drains expired blocks to unblock_peer.
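
The gate and the repeat-offender rule can be sketched together. This is a simplified stand-in: the real tracker is described as a moka LRU+TTL cache, while the example uses a plain `HashMap` with no expiry, and the exact comparison against the threshold is an assumption.

```rust
use std::collections::HashMap;

/// Outcome of one HasPeer probe to a peer hub.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Probe {
    Present,
    Absent,
    /// No answer within the query budget.
    TimedOut,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Accept,
    /// Reject with a Backoff; do not register the node.
    Backoff,
    /// Repeat offender: reject and ask the event loop to block the peer.
    BackoffAndBlock,
}

/// Counts dedup rejections per node (no TTL or eviction in this sketch).
pub struct RejectionTracker {
    threshold: u32,
    counts: HashMap<String, u32>,
}

impl RejectionTracker {
    pub fn new(threshold: u32) -> Self {
        Self { threshold, counts: HashMap::new() }
    }

    /// Decide on one handshake given the probe results from the peer hubs.
    pub fn decide(&mut self, node: &str, probes: &[Probe]) -> Verdict {
        // Default-allow: only a positive "Present" rejects; timeouts do not.
        if !probes.contains(&Probe::Present) {
            return Verdict::Accept;
        }
        let count = self.counts.entry(node.to_string()).or_insert(0);
        *count += 1;
        // Assumption: "past the threshold" means strictly greater than it.
        if *count > self.threshold {
            Verdict::BackoffAndBlock
        } else {
            Verdict::Backoff
        }
    }
}
```

An empty probe list (no peer hubs) accepts, matching the "empty peer-hub set accepts" test case listed for this change.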
- RejectionTracker (moka LRU+TTL) decides when to block; BlocklistRegistry
(event-loop side) tracks unblock times; sign_response shared with validation.
- Metrics restored: dedup_pre_accept_total{outcome}, hub_dedup_blocks_total,
hub_dedup_unblocks_total, hub_dedup_blocked_peers.
- Tests: gate rejects a node a peer hub holds (no NewNodeConnection); empty
peer-hub set accepts; tracker thresholds; blocklist drain. 163 hub tests green.
- Event-loop heartbeat with liveness probe
The June 6 wedge ran silently for 4.5 days: the swarm loop parked, the
pod stayed Running, nothing alerted or restarted it. Beat a timestamp
(atomic + hub_event_loop_last_iteration_timestamp_seconds gauge) on
every loop iteration, serve /healthz/live on :9000 that fails after
120s without a beat (the idle loop still iterates every 30s mesh-dial
tick), and point a k8s liveness probe at it so a parked loop becomes a
pod restart within ~4 minutes.
BIT-602
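
The heartbeat check reduces to one atomic timestamp and one comparison. A minimal sketch, assuming Unix seconds are passed in by the caller; the `Heartbeat` type and its method names are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed staleness limit, from "fails after 120s without a beat" above.
pub const MAX_SILENCE_SECS: u64 = 120;

/// Last loop iteration as Unix seconds, shared between loop and probe.
pub struct Heartbeat {
    last_beat: AtomicU64,
}

impl Heartbeat {
    pub fn new(now_secs: u64) -> Self {
        Self { last_beat: AtomicU64::new(now_secs) }
    }

    /// Called on every event-loop iteration.
    pub fn beat(&self, now_secs: u64) {
        self.last_beat.store(now_secs, Ordering::Relaxed);
    }

    /// What a liveness endpoint would check before answering 200.
    pub fn is_live(&self, now_secs: u64) -> bool {
        let last = self.last_beat.load(Ordering::Relaxed);
        now_secs.saturating_sub(last) <= MAX_SILENCE_SECS
    }
}
```

Since the idle loop is said to iterate every 30s, a healthy hub stays well inside the 120s window, and only a parked loop lets the timestamp go stale.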
- Build with frame pointers for usable heap profiles
Every profile from the hub's jemalloc pprof endpoints was a 2-frame
stack (__libc_start_main → calloc) because release binaries lacked
frame pointers, making heap attribution impossible during the June leak
investigation. Enable force-frame-pointers workspace-wide via
.cargo/config.toml (~1% codegen cost, all binaries stay profilable).
BIT-606
- Cap gRPC request lifetime and per-connection concurrency
The hub's tonic server had no request timeout or concurrency bound, so
the June 6 wedge accumulated ~95k parked dispatch handlers (~4.4GB)
over 4.5 days. A 60s server-side timeout (clearing the 32s dispatch
timeout) and a 256-stream per-connection cap bound the retained request
state if anything ever parks handlers again.
BIT-605
- Hub wedge/leak alert rules as code
Six Grafana alert rules covering the June 6 incident signature, created
live in the Bitping Dashboards folder and snapshotted here in
provisioning format (UIDs match so re-import updates, not duplicates):
- Hub P2P event loop dead (rate(p2p_loop_event) == 0)
- Hub mesh partitioned (hub_mesh_peers_connected < 2)
- Hub ping-failure counter flatlined (live gauge, dead loop)
- Hub memory leak slope (predict_linear RSS past 90% of 16Gi in 7d)
- Hub event-loop heartbeat stale (BIT-602 metric; NoData=OK until deploy)
- Hub shedding P2P events (BIT-600 metric; NoData=OK until deploy)
Any one of these would have caught the June 6 wedge within minutes; it
ran silent for 4.5 days.
BIT-604
- Hub-load mesh report and move_to hint in handshake Ack
New /bitping/hub-load/1.0.0 notify protocol carries Auth
(raw connected-node counts) between mesh hubs — mesh-internal only, the
number never reaches nodes. HandshakeResponse::Ack grows a
serde-defaulted move_to address list: the node is authed and welcome,
but the hub asks it to take its session elsewhere (load shedding).
Compat tests prove old decoders skip the new field and new decoders
default it from old bytes, so hubs and nodes deploy in either order.
BIT-596
- Shed nodes via paced move_to hints at re-handshake
Hubs report raw connected-node counts to mesh peers every 30s; the
LoadLedger (90s TTL) knows each hub's surplus over the mesh average.
When this hub is over 1.2x average, routine node (re)handshakes —
already arriving every ~10min per node for credential refresh — get a
move_to hint in their otherwise-normal Ack pointing at the lightest
under-average peer. A token bucket (refill = surplus/hour, burst 5)
paces the hints, so shedding is per-node and gradual by construction —
no fleet-wide signal exists to flap on. SETTINGS__LOAD_SHED_DISABLED
stops the broadcast, which empties the budget and the hints. Metrics:
hub_load_mesh_total, hub_load_move_hints_total,
hub_load_reports_{sent,received}_total.
BIT-596
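
The pacing arithmetic can be sketched as a small token bucket. This is an illustration under stated assumptions: time is passed in explicitly, the bucket starts full, and the mesh average is computed over whatever counts the caller supplies; none of these details are confirmed by the notes above.

```rust
/// Token bucket pacing move_to hints. Names and types are illustrative.
pub struct HintBudget {
    tokens: f64,
    burst: f64,
}

impl HintBudget {
    /// Assumption: the bucket starts full.
    pub fn new(burst: f64) -> Self {
        Self { tokens: burst, burst }
    }

    /// Refill at `surplus_nodes` tokens per hour, capped at the burst size.
    pub fn refill(&mut self, surplus_nodes: f64, elapsed_secs: f64) {
        let per_sec = surplus_nodes.max(0.0) / 3600.0;
        self.tokens = (self.tokens + per_sec * elapsed_secs).min(self.burst);
    }

    /// Spend one token for one hint; false means "no hint this time".
    pub fn try_take(&mut self) -> bool {
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// A hub sheds only while it holds more than 1.2x the mesh average.
pub fn is_over_threshold(own: u64, mesh_counts: &[u64]) -> bool {
    if mesh_counts.is_empty() {
        return false;
    }
    let avg = mesh_counts.iter().sum::<u64>() as f64 / mesh_counts.len() as f64;
    own as f64 > 1.2 * avg
}
```

With a refill rate equal to the surplus per hour, a hub that is 300 nodes over average would hand out at most about 300 hints per hour plus the burst, so the drain rate scales with the imbalance and falls to zero as the surplus disappears.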
- Follow the hub's move_to hint
On an Ack carrying move_to (gated behind the same RecommendPeers scope
as the peer-hub list), the node remembers the move and switches on the
next reconnect tick: clean disconnect from the shedding hub, then dial
the target directly instead of a random bootstrap (break-before-make —
the cross-hub dedup gate rejects concurrent sessions). At most one move
per 15 minutes, so a target that is itself shedding cannot ping-pong a
node around the mesh. Hub selection otherwise stays uniform random.
BIT-596
🐛 Bug Fixes
- Repoint dashboard panels to post-actor-migration metrics
The actor-model migration renamed/dropped the metrics several panels queried,
leaving them flat or No-data:
- Jobs per min / Job Success % / Failed Job Requests: p2p_event{event_type="job"}
(and its job_msg_type label) was gutted -> hub_local_dispatch_total +
hub_fanout_total dispatch, job_rtt_count.
- Pending Jobs: no shared pending map in the actor model -> completed throughput.
- Hub Forward Outbound -> hub_fanout_total.
- Hub Forward Inbound Duration P95: malformed quantile() -> histogram_quantile.
- Peer Reputation Score -> peer_reputation_max/min aggregate gauges.
Dedup panels (Blocked Peers, Pre-Accept Fanout, HasPeer) left as-is: they track
cross-hub node dedup, which the migration removed entirely and needs a product
decision, not a dashboard repoint.
- Patch libp2p-stream to prune per-connection senders on close
crates.io libp2p-stream 0.4.0-alpha never removes a connection's
mpsc::Sender from Shared::senders when the connection closes, leaking
~480B per node reconnect on every hub pod (~10-22MB/h, unbounded —
hub-2 would hit its 16Gi limit in ~3-4 weeks). Upstream rust-libp2p
master still has the bug, so [patch.crates-io] swaps in the fixed
branch on our fork: Firaenix/rust-libp2p
bitping/libp2p-stream-v0.4.0-alpha-senders-fix — the v0.4.0-alpha
release tag plus the senders prune, DialError::Aborted pending-channel
drain, and reconnect-churn regression tests, with libp2p deps pinned to
registry versions so the patch unifies with our locked libp2p 0.56.
Fixes hub, bitpingd and p2proxy in one patch entry. Drop when upstream
ships the fix (PR branch fix/stream-senders-leak on the same fork).
BIT-598
- Bound the auth channel and reauth await
A half-open TCP flow to the auth service parked the orchestrator's
reauth arm forever (the channel had no request/connect timeout and no
keepalive), which cascaded into the June 6 full-swarm wedge: the
orchestrator is the sole P2PEvents...
26.6.10-1
Changelog
⚙️ Miscellaneous Tasks
- Local dev stack, CI updater verification, workspace config
Adds the tools/local-auth-stub dev gRPC server + justfile wrappers, registers libs/p2p-protocol + the stub in the workspace, and codifies the CLAUDE/AGENTS rules. CI: repairs cog.toml key order + hardens the cog install so commit-lint actually runs, and adds an updater-verification job (root .gitlab-ci.yml) running the wiremock update suite + the updater-first source-order guards on every pipeline that touches rust-node paths.
⛰️ Features
- Include QueryNodes in the customer default scope set
- Static-dispatch actor protocol crate
New libp2p-free crate: typed P2pRequest/P2pNotify protocols, a LibP2pClient (ask/ask_with_timeout/notify) over libp2p-stream, and inbound RequestActor/NotifyActor served by a bounded ProtocolRegistry — one register_request(actor, opts) call per protocol, RegisterOpts carrying per-protocol max_concurrent / serve_timeout / max_frame backpressure. Static dispatch throughout (no dyn, no async_trait); HandlerContext threads the app state + client so an actor can call other protocols mid-handling. Covered by in-memory listener tests + a registry-driven e2e swarm.
- Stream-era protocol framing + job/query/handshake/hub-forward types
Postbag-framed request/response types and their P2pRequest impls for the /bitping/*/2.0.0 stream protocols, plus the Auth envelope (PASETO + Ed25519 signature). Drops the legacy request_response codecs and the hand-rolled handshake ConnectionHandler — the wire is the typed impls now. Protocol-id pinning + wire-snapshot tests guard the on-wire format (no legacy fallback).
🐛 Bug Fixes
- Treat empty Exclusions as excluding nothing
A present-but-empty Exclusions object AND-ed every per-field exclusion
filter to a vacuous true, dropping every peer and failing dispatch with
"Failed to find any nodes that fit the criteria". Guard with
exclusions_are_empty so an all-empty exclusions behaves like None, with a
find_node regression test.
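
The vacuous-truth trap is easy to reproduce in miniature. The sketch below assumes a two-field `Exclusions` with optional entries; the real struct has more fields, and only the name `exclusions_are_empty` comes from the change description.

```rust
/// Simplified stand-in for the Exclusions object: each field is optional.
#[derive(Default)]
pub struct Exclusions {
    pub country: Option<String>,
    pub isp: Option<String>,
}

pub struct Peer {
    pub country: String,
    pub isp: String,
}

/// True when no exclusion is named at all.
pub fn exclusions_are_empty(ex: &Exclusions) -> bool {
    ex.country.is_none() && ex.isp.is_none()
}

/// A peer is excluded only when every named exclusion matches it.
pub fn is_excluded(peer: &Peer, ex: Option<&Exclusions>) -> bool {
    let ex = match ex {
        // The guard: an all-empty object behaves exactly like None.
        Some(ex) if !exclusions_are_empty(ex) => ex,
        _ => return false,
    };
    let checks = [
        ex.country.as_ref().map(|c| *c == peer.country),
        ex.isp.as_ref().map(|i| *i == peer.isp),
    ];
    // Unnamed fields are skipped. Without the guard above, an object with no
    // named fields would leave nothing to check, and `all` over an empty
    // iterator is true: every peer would be excluded.
    checks.iter().flatten().all(|hit| *hit)
}
```

The guard sits in front of the AND-chain rather than inside it, so the "every named exclusion must fire" semantics stay untouched for non-empty objects.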
- Backdate token nbf to tolerate clock skew
Mint nbf = now - 60s so a freshly issued token validates on a peer whose
clock trails the minter, instead of being rejected with ClaimValidation(Nbf)
and flapping the node through repeated reauth. Adds a testable
issued_at_and_not_before helper plus validator + mint regression tests.
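
A minimal sketch of the backdating arithmetic, using Unix seconds. The helper name matches the one mentioned above, but its signature and the `nbf_passes` check are assumptions for illustration.

```rust
/// Assumed skew allowance, from "nbf = now - 60s" above.
pub const NBF_BACKDATE_SECS: u64 = 60;

/// Returns (iat, nbf) as Unix seconds for a token minted at `now_secs`.
pub fn issued_at_and_not_before(now_secs: u64) -> (u64, u64) {
    (now_secs, now_secs.saturating_sub(NBF_BACKDATE_SECS))
}

/// The nbf check a validator applies with its own clock.
pub fn nbf_passes(nbf: u64, validator_now_secs: u64) -> bool {
    validator_now_secs >= nbf
}
```

For a token minted at t = 1000, a validator whose clock reads 970 accepts nbf = 940 but would have rejected nbf = 1000, which is the flapping case the fix removes.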
🚜 Refactor
- Stream notify + Arc validator; drop the legacy codec
Threads PASETOValidator by Arc (not &'static), exposes AuthedBandwidthReport as a P2pNotify, and removes the legacy BandwidthReporterCodec. Splits the oversized module into mod/binding/session_id/tests; tcp_forwarder + framing unchanged. Wire snapshots hold.
- Migrate the hub to the stream-era actor model
Hub swarm collapses from nine behaviours to five {connection_limits, identify, relay, stream, ping}: the query/bandwidth/hub-forward request_response wires and the custom HubHandshakeBehaviour are deleted, and the four inbound protocols become actors under p2p/actors// sharing one HubP2pState via HandlerContext. The handler/ folder is dissolved (validation/identity/fanout/forward cores re-homed with their actors, outbound dispatch → p2p/jobs.rs, node lifecycle → event_loop/swarm_events.rs). Statics retired to Arc/observable; from_settings bootstrap + an Orchestrator over concrete deps (ports seam deleted, with unit + event-routing coverage); optional rabbitmq. Registry backpressure sourced from settings (handshake re-carries its 256 validation-flood ceiling + 64 KiB pre-auth frame cap); reauth_tracker actor-local. Integration tests drive the stream wire; reauth-tracker instant math saturates; a flaky capability test pinned.
- Node actor-model p2p, supervisor/orchestrator, self-update hardening
Node p2p goes stream-era: an inbound JobActor over NodeP2pState, outbound handshake driven by the hub_connection manager (renamed from handshake/ — it's a client of the hub's HandshakeActor, not an inbound actor), swarm plumbing in event_loop/swarm_events.rs; the handlers/ folder, the write-only ConnectedPeers and the never-emitted NatChanged variant are gone. Adds the transport/UI-agnostic SupervisionLoop + bon-built NodeSupervisor orchestrator and the libp2p-free UiEvent/UiState surface. Self-update path hardened: perform_update split out for testing, explicit reqwest timeouts, catch_unwind around the restart hand-off, hub-driven panic vectors contained (Instant::checked_add backoff, clamped ICMP payload, fallible-write DNS diagnostics); a 14-case wiremock updater suite. Probe API executors + dead-code/indirection cleanup throughout.
- Bitpingd + tauri shells over the shared orchestrator
Both shells become thin UI over common's NodeSupervisor: bitpingd a TUI, src-tauri the desktop/mobile webview, with the duplicated supervisor bootstrap factored into one shared fn. bitpingd startup is panic-proofed (fallible home-dir/gRPC statics → ? after the updater, panic-isolated 15-min update loop, args_os restart) and the --system restart level falls back to euid. The dead UpdateScheduler chain is removed from both. Updater-first invariant guarded by source-order tests for both bitpingd main() and the desktop entrypoint.
- Migrate hub query + bandwidth reports to the stream client
Both legacy behaviours were Outbound-only, so a pure client-side swap: FindNodes via LibP2pClient::ask_with_timeout, bandwidth reports via notify, over the existing libp2p-stream behaviour. TUI/event semantics unchanged.
26.6.2-1
Changelog
⚙️ Miscellaneous Tasks
- Bump version to 1.3.4
- Widen prod RUST_LOG to surface orchestrator/handshake events
The existing span-allowlist (send_job/attempt_send_job/handle_commands/
job_events) filtered out every INFO/WARN from orchestrator
(NewNodeConnection upserts), handshake validation (new per-variant response
sends), the ping-failure tracker, and the dispatch-skip metric. The
2026-05-30 cohort wedge wasn't diagnosable from Loki because Jane's peer
had zero log lines on the hub side — not because the hub didn't know
about her, but because none of the relevant code paths' events were
allow-listed. Adds module-scoped INFO/WARN directives without opening
the firehose on every libp2p substream.
⛰️ Features
- Publish ghcr.io docker images via kaniko
Both customer apps now publish multi-arch images to
ghcr.io/bitpingapp/{p2proxy,distributed-metrics}: after the
goreleaser GitHub Release step. Closes the deferred TODO that meant
the hosted-metrics-operator's per-user MetricsInstance deployments
were chasing a bitping/distributed-metrics:1.3.4 nobody was publishing.
- Dockerfiles: multi-stage, alpine downloads + verifies the binary
from the GitHub release goreleaser just created, then distroless
runtime. No bind-mount-binary handshake with goreleaser.
- .customer_docker template: kaniko parallel matrix amd64/arm64,
pushes per-arch tags. Auth via the same GitHub App token already
minted for releases (.gh_token_mint anchor factored out so the
release + docker jobs share one impl).
- .customer_docker_manifest template: manifest-tool combines per-arch
tags into the versioned + :latest manifest lists.
- HTTP CDN Monitoring (v3): crowdsourced eyeball view
MVP dashboard showing what makes Bitping's distributed-metrics unique
vs Catchpoint/ThousandEyes/Datadog Synthetics. 8 collapsible sections,
22 panels, scoped by endpoint × continent × country × ISP with two
SLO target variables (P95 latency, success rate).
Differentiator panels:
- Crowdsourced reach (live counts of countries/ISPs/cities/OSes probing)
- Per-monitor health card (one row per endpoint with drift + SLO status)
- TLS handshake failures by country (state-level interference detector)
- Error-type heatmap (country × type): censorship + ISP-misconfig signal
- Distinct response-body hashes per endpoint (cache coherence /
split-horizon detection nobody else can see)
- ISP performance deep dive (residential ISP comparison)
- Emit CDN-grade metrics (phase timings, protocol, cache status, edge POP, cert expiry)
Extends HttpCollector::record_success_metrics to emit 10 new metric
families derived from PerformHttpResponse.results[].result fields the
collector was previously discarding:
- http_dns_resolve_ms / http_tcp_connect_ms / http_tls_handshake_ms /
http_ttfb_ms / http_content_download_ms: per-phase timings from
result.metrics (the optional phases are skipped when null, matching
the API semantics).
- http_protocol{protocol=h1|h2|h3}: from result.negotiatedProtocol.
- http_address_family{family=ipv4|ipv6}: from result.addressFamilyUsed.
- http_fallback_total{from_protocol,to_protocol}: one increment per
entry in result.fallbackChain.
- http_cdn_provider{provider=cloudflare|fastly|cloudfront|akamai|vercel|
bunny|gcore|netlify|none}: response-header sniff via a new cdn_headers
module.
- http_cache_status{status=hit|miss|stale|expired|bypass|dynamic|unknown}:
normalised across cf-cache-status, x-cache, x-vercel-cache, cdn-cache,
and server-timing cdn-cache desc.
- http_edge_pop{pop}: IATA code extracted per provider (cf-ray suffix,
x-amz-cf-pop prefix, x-served-by middle segment, x-vercel-id prefix,
bunny server suffix).
- http_cache_age_seconds: Age response header.
- http_ssl_expires_days_remaining + http_ssl_chain_valid: from
result.sslInfo (gated on the customer monitor enabling sslInfo=true).
cdn_headers module includes nine provider fixtures (one per supported
CDN + a no-CDN baseline) plus case-insensitive lookup and cache-status
normalisation tests; all green via cargo test.
No proto / API changes needed — the bitping-api was already returning
every field; the collector is the bottleneck this commit removes.
Bumps Cargo.toml to 1.3.5.
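
As an illustration of the header sniffing described above, here is a sketch of edge-POP extraction for two of the listed providers only (the cf-ray suffix and the x-amz-cf-pop prefix). The function names, the slice-of-pairs header representation and the three-letter validation are assumptions, not the cdn_headers module's actual API.

```rust
/// Case-insensitive header lookup over (name, value) pairs.
pub fn header<'a>(headers: &'a [(&'a str, &'a str)], name: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(name))
        .map(|(_, v)| *v)
}

/// Keep a candidate only if it looks like a three-letter IATA code.
fn iata(candidate: &str) -> Option<String> {
    let ok = candidate.len() == 3 && candidate.chars().all(|c| c.is_ascii_alphabetic());
    ok.then(|| candidate.to_ascii_uppercase())
}

/// Edge POP from Cloudflare (cf-ray suffix) or CloudFront (x-amz-cf-pop prefix).
pub fn edge_pop(headers: &[(&str, &str)]) -> Option<String> {
    if let Some(ray) = header(headers, "cf-ray") {
        // e.g. "8f1c2d3e4a5b6c7d-SIN": the code follows the last '-'.
        return ray.rsplit('-').next().and_then(iata);
    }
    if let Some(pop) = header(headers, "x-amz-cf-pop") {
        // e.g. "MEL50-C1": the code is the first three characters.
        return pop.get(..3).and_then(iata);
    }
    None
}
```

Validating the candidate before emitting it keeps malformed header values out of the metric label, which matters for label cardinality.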
- Add ssl_info/transport HttpConfig knobs + live CDN metrics smoke test
Two follow-ons to the CDN-metrics emission landed in 221f57b:
- HttpConfig gains two optional fields the collector previously
hardcoded:
  - ssl_info (Option<bool>): opts the probe into TLS cert capture.
  Without this the http_ssl_chain_valid / http_ssl_expires_days_remaining
  gauges are dead code in production.
  - transport (Option<String>): TCP (default) / AUTO / QUIC. AUTO is
  what populates http_fallback_total (the chain only has entries when
  happy-eyeballs tried h3 first and fell back).
Both default to None so existing customer configs are unchanged.
- New tests/cdn_metrics_live.rs integration test (#[ignore], runs
against the real Bitping API with BITPING_API_KEY set). Probes
cloudflare.com / aws.amazon.com / vercel.com with ssl_info=true,
transport=AUTO and asserts the full metric set lands in the rendered
Prometheus snapshot, including provider-label correctness. Verified
output:
  - Cloudflare: protocol=h3, pop=SIN, ssl_chain_valid=1,
  ssl_expires_days_remaining=65.9
  - CloudFront: pop=MEL, cache_status, cache_age_seconds
  - Vercel: pop=CDG1, cache_status=hit, cache_age_seconds
To enable the new integration test, mod collectors and the
generate_api! macro move from src/main.rs into src/lib.rs so
tests/* can use distributed_metrics::collectors::http::HttpCollector;.
Behavioural no-op — main.rs still owns setup, CONFIG, render_prom, main.
- V5 — eight new CDN insight sections wired to BIT-562 metrics
Adds 8 sections + 12 data panels to dashboards/http-cdn-monitoring.json
(also pushed live to uid bitping-http-cdn-mvp), all wired to the new
metric families distributed-metrics 1.3.5 emits:
- CDN: Latency waterfall. DNS / TCP / TLS / TTFB / Content download
stacked timeseries from http_dns_resolve_ms / http_tcp_connect_ms /
http_tls_handshake_ms / http_ttfb_ms / http_content_download_ms (P50).
- CDN: Provider distribution. Bar gauge + per-endpoint table from
http_cdn_provider{provider}.
- CDN: Cache performance. Status mix (stacked) and per-endpoint hit
rate from http_cache_status{status}.
- CDN: Edge POP geography. POP × endpoint table from
http_edge_pop{pop}. Picks up cf-ray / x-amz-cf-pop / x-served-by /
x-vercel-id / bunny IATA codes.
- CDN: HTTP version adoption. h1/h2/h3 stacked + HTTP/3 share stat
from http_protocol{protocol}. (h3 series require customer monitor
with transport: AUTO.)
- CDN: IPv6 reachability. Share stat + per-endpoint family table
from http_address_family{family}.
- CDN: Certificate expiry watch. Per-endpoint min days remaining +
chain validity from http_ssl_expires_days_remaining +
http_ssl_chain_valid. Empty until at least one monitor sets
ssl_info: true (panel description calls this out).
- CDN: Origin RTT (proxy). TTFB P95 for endpoints whose CDN is NOT
reporting cache hits. Approximates upstream origin response time.
Header description updated to list the new section names. Existing 35
panels (overview, per-monitor, crowdsourced reach, latency, errors,
geo, CDN diagnostics, ISP deep dive) untouched. Total panels: 55
(8 original rows + 27 original data + 8 new rows + 12 new data).
Live dashboard pushed via Grafana MCP patch operations against uid
bitping-http-cdn-mvp (version now 7 in Grafana).
- Observability for handshake variants and dispatch skips
The 2026-05-30 cohort wedge investigation drifted onto wrong hypotheses
for hours because the hub had no telemetry to answer either question:
"did this peer ever finish handshaking?" and "is find_node skipping this
peer, and why?" Both fixed here.
- handshake/validation.rs: per-variant info!/warn! on every response we
sign and ship back, tagged with a structured variant field. The Ack
path was previously silent (counter only); now every accepted node
leaves a Loki breadcrumb. Backoff / Reauthenticate / Error keep their
warn level but get the variant tag for grouping.
- connection_pool/mod.rs: apply_filters refactored from -> bool to
-> Option<&'static str> returning the rejection reason. find_node_iter
increments find_node_skipped_total{reason} counter on every silent
return-false plus emits a trace!() per peer for deep-dive. Reasons:
dedup_loser_unique_by_ip, capability_mismatch, req_{continent,country,
city,os,isp,proxy,mobile,residential,peer_id}, excluded. Exclusion
AND-chain semantics preserved (a peer is excluded only when every named
exclusion fires).
🐛 Bug Fixes
- Update spec
- Mode: replace makes goreleaser asset upload idempotent
The 1.3.4 distributed-metrics release attempt landed in an inconsistent
state — release tag created, partial asset uploads, then 422 already_exists
on retry. mode: replace tells goreleaser to delete existing assets with the
same name before upload, so re-runs converge on a clean release.
- Delete stale GitHub release before goreleaser
- Pass kaniko docker auth via release-job artifact
- Publish docker images to Docker Hub via DOCKER_PAT
GitHub App lacks packages:write scope so ghcr.io push hit 'installation
not allowed to Create organization package'. Switching to Docker Hub
(bitping account) since:
- DOCKER_PAT is already wired up as a GitLab CI variable
- The hosted-metrics-operator's metrics_image config already points at
bitping/distributed-metrics:, so no operator config flip
needed once 1.3.4 publishes
- Old 1.3.0 pods on the cluster are already pulling from the same repo
Image tags land under docker.io/bitping/{distributed-metrics,p2proxy}
with per-arch suffixes (-amd64/-arm64) and the manifest-tool job ties
them under : + :latest multi-a...
26.5.30-1
Changelog
⚙️ CI/CD
- Drop DMG bundling, ship .app.zip via ditto
Tauri's bundle_dmg.sh depends on the SetFile Carbon tool that Xcode 14+
removed (tauri-apps/tauri#3055, open since 2021). Drop dmg from
bundle.targets in tauri.conf.json so Tauri stops trying to build it, and
produce a codesigned + notarized .app.zip via ditto -c -k --keepParent --sequesterRsrc in the after_script. ditto preserves the notarization
staple that a plain zip would strip.
Naming mirrors the legacy DMG (Bitping.Desktop__.app.zip)
so the customer website download links and the GitHub release flow only
change extension. The Tauri updater path (.app.tar.gz + .sig) is
unaffected — that artifact is governed by createUpdaterArtifacts, not
bundle.targets.
⚙️ Miscellaneous Tasks
- Bump version to 1.3.3
⛰️ Features
- Tolerate transient ping failures before disconnect
Add PingFailureTracker counting consecutive hub-side ping failures per
peer. Disconnect only after MAX_CONSECUTIVE_PING_FAILURES (3) in a row —
a single timed-out ping is a transient blip for a globally-distributed
residential/mobile fleet, not a dead node. Ping success resets the
counter.
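
The consecutive-failure rule can be sketched in a few lines. The tracker name and constant come from the description above; the string peer ids, method names and the reset-on-disconnect behaviour are assumptions for the example.

```rust
use std::collections::HashMap;

/// From the text above: disconnect only after three failures in a row.
pub const MAX_CONSECUTIVE_PING_FAILURES: u32 = 3;

/// Consecutive ping failures per peer. Peer ids are plain strings here.
#[derive(Default)]
pub struct PingFailureTracker {
    failures: HashMap<String, u32>,
}

impl PingFailureTracker {
    /// Record a failure; returns true when the peer should be disconnected.
    pub fn record_failure(&mut self, peer: &str) -> bool {
        let n = self.failures.entry(peer.to_string()).or_insert(0);
        *n += 1;
        let reached = *n >= MAX_CONSECUTIVE_PING_FAILURES;
        if reached {
            // Assumption: forget the peer so a later reconnect starts from zero.
            self.failures.remove(peer);
        }
        reached
    }

    /// Any successful ping resets the streak.
    pub fn record_success(&mut self, peer: &str) {
        self.failures.remove(peer);
    }
}
```

A fail, fail, success, fail, fail sequence never disconnects, because the success clears the streak before it reaches three.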
- Surface proto ErrorCode as error_type label
Replace the substring-based error-message taxonomy in the HTTP collector
with the typed ErrorCode field forwarded by bitping-api. The proto enum
value is used verbatim — strip the redundant ERROR_CODE_ prefix and
lowercase — so new variants flow into the metric label automatically with
no hand-curated mapping table to keep in sync. Missing/unrecognised codes
bucket as "unknown" and log a warning.
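
The label derivation amounts to a prefix strip and a lowercase. A minimal sketch, assuming the code arrives as the proto enum's string name; the function name and the example variant names in the comments are made up, and the warning log is omitted.

```rust
/// Turn a proto enum name such as "ERROR_CODE_DNS_FAILURE" into a metric
/// label. `None` models a missing code.
pub fn error_type_label(code: Option<&str>) -> String {
    match code {
        Some(name) if !name.is_empty() => name
            .strip_prefix("ERROR_CODE_")
            .unwrap_or(name)
            .to_ascii_lowercase(),
        // Missing codes bucket as "unknown".
        _ => "unknown".to_string(),
    }
}
```

Because the mapping is purely mechanical, a new enum variant yields a new label value without any code change, at the cost of trusting the upstream enum to stay low-cardinality.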