Micro-telemetry: What to Log for Learning

Retail & Consumer • ~8–9 min read • Updated Jun 19, 2025 • By OneMind Strata Team

Context

Dashboards full of token counts won’t tell you why people trust—or abandon—your AI. The signals that move quality are small and behavioral: edits, reverts, abandonments, overrides, dwell-time. Instrument those well and you’ll know what to fix this week, not next quarter.

What to Log (the short list)

  • surface (inline | panel | slash | agent), task_id, session_id
  • event_name (suggestion_shown | applied | edited | reverted | abandoned)
  • edit_distance (0–1 normalized token-level Levenshtein diff; see the sketch after this list)
  • dwell_ms (show → action), time_to_first_action_ms
  • override (boolean) + override_reason (enum)
  • guardrail_event (refusal | unsafe_output | policy_violation) + classifier scores
  • latency_ms, cost_usd, model_version, prompt_pack_version
  • retrieval_ids (top-k IDs/facets—non-PII)
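
For reference, here is a minimal sketch of that normalized edit distance, assuming whitespace tokenization and a plain Levenshtein distance divided by the longer token count; the function name is illustrative, not part of any SDK.

def normalized_edit_distance(shown_text: str, final_text: str) -> float:
    """Token-level Levenshtein distance, normalized to 0–1.

    0.0 means the suggestion was kept verbatim; 1.0 means a full rewrite.
    """
    a, b = shown_text.split(), final_text.split()
    if not a and not b:
        return 0.0
    # Classic dynamic-programming Levenshtein over tokens.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

Under that convention, a user adding one word to a five-word suggestion lands around 0.2, while a complete rewrite approaches 1.0.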

Minimal Event Model

{
  "event_name": "suggestion_applied",
  "ts": "2025-06-19T14:22:05Z",
  "user_id_hash": "u_7f3c...", "tenant_id_hash": "t_12ab...",
  "session_id": "s_91c2...", "task_id": "order_refund_summary",
  "surface": "panel",
  "model_version": "omx-2025-06-01", "prompt_pack_version": "pp-14",
  "latency_ms": 412, "cost_usd": 0.0041,
  "dwell_ms": 5600, "edit_distance": 0.18,
  "override": false, "override_reason": null,
  "retrieval_ids": ["kb:returns#policy-2024", "kb:sop#refund-steps"],
  "guardrail_event": null
}
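
For teams that prefer a typed record over raw dicts, a hedged sketch of the same model as a Python dataclass; the field names mirror the JSON above, while the types and which fields are optional are assumptions.

from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class SuggestionEvent:
    event_name: str                        # suggestion_shown | applied | edited | reverted | abandoned
    ts: str                                # ISO-8601 UTC timestamp
    user_id_hash: str
    tenant_id_hash: str
    session_id: str
    task_id: str
    surface: str                           # inline | panel | slash | agent
    model_version: str
    prompt_pack_version: str
    latency_ms: int
    cost_usd: float
    dwell_ms: Optional[int] = None
    edit_distance: Optional[float] = None  # 0–1, meaningful on applied/edited events
    override: bool = False
    override_reason: Optional[str] = None
    retrieval_ids: Optional[list[str]] = None  # top-k IDs/facets, never raw documents
    guardrail_event: Optional[str] = None      # refusal | unsafe_output | policy_violation

    def to_json(self) -> str:
        return json.dumps(asdict(self))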

Derived Metrics that Matter

  • Adoption rate: applied / shown
  • Edit distance p50/p90: lower is better
  • Revert rate: reverted / applied
  • Abandonment: abandoned / shown
  • Time-to-decision: p50/p90 of dwell_ms
  • Override rate and top reasons
  • Guardrail precision/recall (incident-labeled)
  • Cost-per-accepted-action & latency budget hit-rate
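
Most of these ratios fall out of simple counting over the event stream. A minimal sketch, assuming events arrive as dicts shaped like the model above; the zero-division guards and function name are illustrative.

from collections import Counter
from statistics import median, quantiles

def derived_metrics(events: list[dict]) -> dict:
    """Adoption, revert, and abandonment rates plus edit-distance/dwell percentiles."""
    by_name = Counter(e["event_name"] for e in events)
    shown = by_name["suggestion_shown"] or 1        # guard against division by zero
    applied = by_name["suggestion_applied"] or 1
    edits = [e["edit_distance"] for e in events
             if e["event_name"] == "suggestion_applied" and e.get("edit_distance") is not None]
    dwell = [e["dwell_ms"] for e in events if e.get("dwell_ms") is not None]
    return {
        "adoption_rate": by_name["suggestion_applied"] / shown,
        "revert_rate": by_name["suggestion_reverted"] / applied,
        "abandonment_rate": by_name["suggestion_abandoned"] / shown,
        "edit_distance_p50": median(edits) if edits else None,
        "edit_distance_p90": quantiles(edits, n=10)[8] if len(edits) >= 2 else None,
        "time_to_decision_p50_ms": median(dwell) if dwell else None,
    }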

Dashboards that Travel Across Teams

  • By surface: adoption, edit distance, abandonment (inline vs panel vs slash vs agent)
  • By task: top 10 tasks by accepted actions and revert rate
  • Quality × Cost: acceptance vs cost
  • Guardrail view: refusals over time, FP/FN estimates, override heatmap
  • Experiment lane: A/B of prompt_pack_version vs acceptance & edit distance
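
The by-surface cut is just a group-by over the same events. A sketch assuming pandas and the field names above; the column choices are illustrative, not a fixed dashboard spec.

import pandas as pd

def adoption_by_surface(events: pd.DataFrame) -> pd.DataFrame:
    """One row per surface: shown/applied/abandoned counts, adoption rate, edit-distance p90."""
    flags = events.assign(
        shown=events["event_name"].eq("suggestion_shown"),
        applied=events["event_name"].eq("suggestion_applied"),
        abandoned=events["event_name"].eq("suggestion_abandoned"),
    )
    out = flags.groupby("surface").agg(
        shown=("shown", "sum"),
        applied=("applied", "sum"),
        abandoned=("abandoned", "sum"),
        edit_distance_p90=("edit_distance", lambda s: s.quantile(0.9)),
    )
    out["adoption_rate"] = out["applied"] / out["shown"]
    out["abandonment_rate"] = out["abandoned"] / out["shown"]
    return out.sort_values("adoption_rate", ascending=False)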

Privacy & Governance (non-negotiables)

  • Hash or pseudonymize IDs; avoid raw PII in events.
  • Strip payloads; log structure, not full prompts/outputs.
  • Retention: 90 days raw → 12 months aggregates; legal holds via tag.
  • Role-based access: product vs risk vs support.
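
One possible shape for the first two bullets: a keyed HMAC for pseudonymization plus an allow-list so prompts and outputs never reach the event stream. The environment variable, prefixes, and allow-list contents here are assumptions, not a prescribed setup.

import hmac, hashlib, os

# Assumption: a per-environment secret, rotated on your normal key schedule.
TELEMETRY_HASH_KEY = os.environ["TELEMETRY_HASH_KEY"].encode()

ALLOWED_FIELDS = {
    "event_name", "ts", "session_id", "task_id", "surface",
    "model_version", "prompt_pack_version", "latency_ms", "cost_usd",
    "dwell_ms", "edit_distance", "override", "override_reason",
    "retrieval_ids", "guardrail_event",
}

def pseudonymize(raw_id: str, prefix: str) -> str:
    """Stable, non-reversible ID: keyed HMAC-SHA256, truncated for readability."""
    digest = hmac.new(TELEMETRY_HASH_KEY, raw_id.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"

def sanitize(event: dict, user_id: str, tenant_id: str) -> dict:
    """Drop anything not on the allow-list (no prompts/outputs), then attach hashed IDs."""
    clean = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}
    clean["user_id_hash"] = pseudonymize(user_id, "u")
    clean["tenant_id_hash"] = pseudonymize(tenant_id, "t")
    return clean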

Instrument in a Week (practical steps)

  1. SDK wrapper: one function per event; auto-attach session/model/prompt_pack/surface.
  2. Event IDs & sampling: dedupe and throttle client-side.
  3. Warehouse tables: events_raw, events_daily, derived_metrics.
  4. Golden questions: daily replay to watch acceptance/latency drift.
  5. Governance sync: monthly guardrail metrics + top overrides.
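
Steps 1–2 can be as small as one emit function that attaches ambient context, dedupes on an event ID, and samples client-side. A sketch under those assumptions; the context object, transport, and sample rate are placeholders, not a real SDK.

import time, uuid, random

# Illustrative ambient context; a real SDK would scope this per session/request.
CONTEXT = {"session_id": "s_local", "surface": "panel",
           "model_version": "unknown", "prompt_pack_version": "unknown"}

_seen_event_ids = set()   # client-side dedupe window (bound/expire this in practice)
SAMPLE_RATE = 1.0         # drop toward 0.1 for very chatty events

def log_event(event_name, task_id, event_id=None, **fields):
    """One call per event; context, timestamp, and IDs are attached automatically."""
    event_id = event_id or str(uuid.uuid4())
    if event_id in _seen_event_ids:       # dedupe retries of the same logical event
        return
    if random.random() > SAMPLE_RATE:     # client-side sampling/throttling
        return
    _seen_event_ids.add(event_id)
    event = {"event_id": event_id, "event_name": event_name, "task_id": task_id,
             "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
             **CONTEXT, **fields}
    _send_to_collector(event)

def _send_to_collector(event):
    print(event)  # stand-in transport; batch to your warehouse pipeline in practice

# Example call:
# log_event("suggestion_applied", task_id="order_refund_summary",
#           dwell_ms=5600, edit_distance=0.18)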

Common Pitfalls

  • Logging everything, learning nothing: too many fields and no derived metrics.
  • Raw text hoarding: increases risk; prefer IDs/facets/hashes.
  • No versioning: can’t attribute change without model/prompt versions.
  • Surface blindness: mixing surfaces hides where UX truly works.

Quick Win Checklist

  • Ship shown/applied/edited/reverted/abandoned with edit_distance and dwell_ms.
  • Add model_version + prompt_pack_version to every event.
  • Publish one-pager of derived metrics & definitions.
  • Stand up a surface-by-task dashboard; review weekly.

Closing

Log less, learn more. A small, purpose-built telemetry set will show which surfaces to double down on, where to graduate to background agents, and which prompt packs to retire. Make learning a weekly habit.