Micro-telemetry: What to Log for Learning

Retail & Consumer • ~8–9 min read • Updated Jun 19, 2025 • By OneMind Strata Team

Context

Dashboards full of token counts won’t tell you why people trust—or abandon—your AI. The signals that move quality are small and behavioral: edits, reverts, abandonments, overrides, dwell-time. Instrument those well and you’ll know what to fix this week, not next quarter.

What to Log (the short list)

  • surface (inline | panel | slash | agent), task_id, session_id
  • event_name (suggestion_shown | applied | edited | reverted | abandoned)
  • edit_distance (0–1 normalized token-level Levenshtein diff; see the sketch after this list)
  • dwell_ms (show → action), time_to_first_action_ms
  • override (boolean) + override_reason (enum)
  • guardrail_event (refusal | unsafe_output | policy_violation) + classifier scores
  • latency_ms, cost_usd, model_version, prompt_pack_version
  • retrieval_ids (top-k IDs/facets—non-PII)
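
For reference, here is a minimal sketch of that normalized edit distance, assuming whitespace tokenization and a plain Levenshtein distance divided by the longer token count; the function name is illustrative, not part of any SDK.

def normalized_edit_distance(shown_text: str, final_text: str) -> float:
    """Token-level Levenshtein distance, normalized to 0–1.

    0.0 means the suggestion was kept verbatim; 1.0 means a full rewrite.
    """
    a, b = shown_text.split(), final_text.split()
    if not a and not b:
        return 0.0
    # Classic dynamic-programming Levenshtein over tokens.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

Under that convention, a user adding one word to a five-word suggestion lands around 0.2, while a complete rewrite approaches 1.0.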

Minimal Event Model

{
  "event_name": "suggestion_applied",
  "ts": "2025-06-19T14:22:05Z",
  "user_id_hash": "u_7f3c...", "tenant_id_hash": "t_12ab...",
  "session_id": "s_91c2...", "task_id": "order_refund_summary",
  "surface": "panel",
  "model_version": "omx-2025-06-01", "prompt_pack_version": "pp-14",
  "latency_ms": 412, "cost_usd": 0.0041,
  "dwell_ms": 5600, "edit_distance": 0.18,
  "override": false, "override_reason": null,
  "retrieval_ids": ["kb:returns#policy-2024", "kb:sop#refund-steps"],
  "guardrail_event": null
}
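
For teams that prefer a typed record over raw dicts, a hedged sketch of the same model as a Python dataclass; the field names mirror the JSON above, while the types and which fields are optional are assumptions.

from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class SuggestionEvent:
    event_name: str                        # suggestion_shown | applied | edited | reverted | abandoned
    ts: str                                # ISO-8601 UTC timestamp
    user_id_hash: str
    tenant_id_hash: str
    session_id: str
    task_id: str
    surface: str                           # inline | panel | slash | agent
    model_version: str
    prompt_pack_version: str
    latency_ms: int
    cost_usd: float
    dwell_ms: Optional[int] = None
    edit_distance: Optional[float] = None  # 0–1, meaningful on applied/edited events
    override: bool = False
    override_reason: Optional[str] = None
    retrieval_ids: Optional[list[str]] = None  # top-k IDs/facets, never raw documents
    guardrail_event: Optional[str] = None      # refusal | unsafe_output | policy_violation

    def to_json(self) -> str:
        return json.dumps(asdict(self))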

Derived Metrics that Matter

  • Adoption rate: applied / shown
  • Edit distance p50/p90: lower is better
  • Revert rate: reverted / applied
  • Abandonment: abandoned / shown
  • Time-to-decision: p50/p90 of dwell_ms
  • Override rate and top reasons
  • Guardrail precision/recall (incident-labeled)
  • Cost-per-accepted-action & latency budget hit-rate
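
Most of these ratios fall out of simple counting over the event stream. A minimal sketch, assuming events arrive as dicts shaped like the model above; the zero-division guards and function name are illustrative.

from collections import Counter
from statistics import median, quantiles

def derived_metrics(events: list[dict]) -> dict:
    """Adoption, revert, and abandonment rates plus edit-distance/dwell percentiles."""
    by_name = Counter(e["event_name"] for e in events)
    shown = by_name["suggestion_shown"] or 1        # guard against division by zero
    applied = by_name["suggestion_applied"] or 1
    edits = [e["edit_distance"] for e in events
             if e["event_name"] == "suggestion_applied" and e.get("edit_distance") is not None]
    dwell = [e["dwell_ms"] for e in events if e.get("dwell_ms") is not None]
    return {
        "adoption_rate": by_name["suggestion_applied"] / shown,
        "revert_rate": by_name["suggestion_reverted"] / applied,
        "abandonment_rate": by_name["suggestion_abandoned"] / shown,
        "edit_distance_p50": median(edits) if edits else None,
        "edit_distance_p90": quantiles(edits, n=10)[8] if len(edits) >= 2 else None,
        "time_to_decision_p50_ms": median(dwell) if dwell else None,
    }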

Dashboards that Travel Across Teams

  • By surface: adoption, edit distance, abandonment (inline vs panel vs slash vs agent)
  • By task: top 10 tasks by accepted actions and revert rate
  • Quality × Cost: acceptance vs cost
  • Guardrail view: refusals over time, FP/FN estimates, override heatmap
  • Experiment lane: A/B of prompt_pack_version vs acceptance & edit distance
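
The by-surface cut is just a group-by over the same events. A sketch assuming pandas and the field names above; the column choices are illustrative, not a fixed dashboard spec.

import pandas as pd

def adoption_by_surface(events: pd.DataFrame) -> pd.DataFrame:
    """One row per surface: shown/applied/abandoned counts, adoption rate, edit-distance p90."""
    flags = events.assign(
        shown=events["event_name"].eq("suggestion_shown"),
        applied=events["event_name"].eq("suggestion_applied"),
        abandoned=events["event_name"].eq("suggestion_abandoned"),
    )
    out = flags.groupby("surface").agg(
        shown=("shown", "sum"),
        applied=("applied", "sum"),
        abandoned=("abandoned", "sum"),
        edit_distance_p90=("edit_distance", lambda s: s.quantile(0.9)),
    )
    out["adoption_rate"] = out["applied"] / out["shown"]
    out["abandonment_rate"] = out["abandoned"] / out["shown"]
    return out.sort_values("adoption_rate", ascending=False)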

Privacy & Governance (non-negotiables)

  • Hash or pseudonymize IDs; avoid raw PII in events.
  • Strip payloads; log structure, not full prompts/outputs.
  • Retention: 90 days raw → 12 months aggregates; legal holds via tag.
  • Role-based access: product vs risk vs support.
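
One possible shape for the first two bullets: a keyed HMAC for pseudonymization plus an allow-list so prompts and outputs never reach the event stream. The environment variable, prefixes, and allow-list contents here are assumptions, not a prescribed setup.

import hmac, hashlib, os

# Assumption: a per-environment secret, rotated on your normal key schedule.
TELEMETRY_HASH_KEY = os.environ["TELEMETRY_HASH_KEY"].encode()

ALLOWED_FIELDS = {
    "event_name", "ts", "session_id", "task_id", "surface",
    "model_version", "prompt_pack_version", "latency_ms", "cost_usd",
    "dwell_ms", "edit_distance", "override", "override_reason",
    "retrieval_ids", "guardrail_event",
}

def pseudonymize(raw_id: str, prefix: str) -> str:
    """Stable, non-reversible ID: keyed HMAC-SHA256, truncated for readability."""
    digest = hmac.new(TELEMETRY_HASH_KEY, raw_id.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"

def sanitize(event: dict, user_id: str, tenant_id: str) -> dict:
    """Drop anything not on the allow-list (no prompts/outputs), then attach hashed IDs."""
    clean = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}
    clean["user_id_hash"] = pseudonymize(user_id, "u")
    clean["tenant_id_hash"] = pseudonymize(tenant_id, "t")
    return clean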

Instrument in a Week (practical steps)

  1. SDK wrapper: one function per event; auto-attach session/model/prompt_pack/surface.
  2. Event IDs & sampling: dedupe and throttle client-side.
  3. Warehouse tables: events_raw, events_daily, derived_metrics.
  4. Golden questions: daily replay to watch acceptance/latency drift.
  5. Governance sync: monthly guardrail metrics + top overrides.
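
Steps 1–2 can be as small as one emit function that attaches ambient context, dedupes on an event ID, and samples client-side. A sketch under those assumptions; the context object, transport, and sample rate are placeholders, not a real SDK.

import time, uuid, random

# Illustrative ambient context; a real SDK would scope this per session/request.
CONTEXT = {"session_id": "s_local", "surface": "panel",
           "model_version": "unknown", "prompt_pack_version": "unknown"}

_seen_event_ids = set()   # client-side dedupe window (bound/expire this in practice)
SAMPLE_RATE = 1.0         # drop toward 0.1 for very chatty events

def log_event(event_name, task_id, event_id=None, **fields):
    """One call per event; context, timestamp, and IDs are attached automatically."""
    event_id = event_id or str(uuid.uuid4())
    if event_id in _seen_event_ids:       # dedupe retries of the same logical event
        return
    if random.random() > SAMPLE_RATE:     # client-side sampling/throttling
        return
    _seen_event_ids.add(event_id)
    event = {"event_id": event_id, "event_name": event_name, "task_id": task_id,
             "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
             **CONTEXT, **fields}
    _send_to_collector(event)

def _send_to_collector(event):
    print(event)  # stand-in transport; batch to your warehouse pipeline in practice

# Example call:
# log_event("suggestion_applied", task_id="order_refund_summary",
#           dwell_ms=5600, edit_distance=0.18)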

Common Pitfalls

  • Logging everything, learning nothing: too many fields and no derived metrics.
  • Raw text hoarding: increases risk; prefer IDs/facets/hashes.
  • No versioning: can’t attribute change without model/prompt versions.
  • Surface blindness: mixing surfaces hides where UX truly works.

Quick Win Checklist

  • Ship shown/applied/edited/reverted/abandoned with edit_distance and dwell_ms.
  • Add model_version + prompt_pack_version to every event.
  • Publish one-pager of derived metrics & definitions.
  • Stand up a surface-by-task dashboard; review weekly.

Closing

Log less, learn more. A small, purpose-built telemetry set will show which surfaces to double down on, where to graduate to background agents, and which prompt packs to retire. Make learning a weekly habit.