Summary

AI at a distribution utility lives or dies on data readiness, and the typical operator's data is trapped in silos: AMI and meter data in one system, SCADA telemetry in another, GIS asset records in a third, and work orders and failure history in yet another, often in legacy formats with no lineage. Before a leak or asset-failure model can work, the utility must connect these sources, resolve assets to a common identity, and establish lineage. This playbook assesses AMI, SCADA, GIS, and asset data readiness and lays out the integration work that comes first.

Context

The data a utility needs for AI is real but scattered

A distribution utility already generates the data AI needs; the problem is that it lives in disconnected systems built over decades. AMI or metering platforms hold consumption and interval reads. SCADA holds real-time flow, pressure, voltage, and status telemetry. GIS holds the spatial record of every main, service line, feeder, and transformer, including material, install year, and diameter, though those attributes are frequently incomplete for older assets. The CMMS or work-management system holds failure history, repairs, and inspection results. Each was procured separately, keyed differently, and rarely designed to talk to the others.

An asset-failure model draws on all four: GIS for what and where the asset is, work orders for how it has failed, SCADA for how it is loaded, and sometimes AMI for downstream consumption effects. A leak model needs AMI or district-metered flow, SCADA pressure, and GIS network topology together. If those sources cannot be joined on a common asset identity, the model cannot be built, and if lineage is missing, the model's outputs cannot be audited or defended to a regulator. Data readiness, not the model, is therefore the real first project.
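As a minimal sketch of what "joined on a common asset identity" means in practice, the example below resolves records from three systems through a crosswalk table. The crosswalk contents, native key formats, and the `A-1001` style of common ID are all hypothetical; a real utility would build this mapping from its own GIS, CMMS, and SCADA keys.

```python
from typing import Optional

# Hypothetical crosswalk: (source system, native key) -> common asset ID.
CROSSWALK = {
    ("gis", "MAIN-00417"): "A-1001",
    ("cmms", "WO-LOC-88213"): "A-1001",
    ("scada", "PS3.ZONE4.PRESS"): "A-1001",
}

def resolve(system: str, native_key: str) -> Optional[str]:
    """Return the common asset ID, or None if the record cannot be resolved."""
    return CROSSWALK.get((system, native_key))

records = [
    ("gis", "MAIN-00417"),
    ("cmms", "WO-LOC-88213"),
    ("scada", "PS3.ZONE4.PRESS"),
    ("cmms", "WO-LOC-99999"),  # no crosswalk entry, so it stays unjoinable
]

resolved = {}    # common asset ID -> list of systems that resolved to it
unresolved = []  # records that no model can use until they are mapped
for system, key in records:
    asset_id = resolve(system, key)
    if asset_id is None:
        unresolved.append((system, key))
    else:
        resolved.setdefault(asset_id, []).append(system)
```

The useful output is often the `unresolved` list: it is a direct worklist of records that cannot feed any model until someone maps them.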

The framework

A readiness assessment across the four core data domains

Score each domain honestly before committing to a use case. A model is only as ready as its weakest required input.

Data domain | What it provides | Common readiness gap
--- | --- | ---
AMI / metering | Consumption and interval reads for demand, loss, and outage signals | Partial rollout, gaps in coverage, and reads not time-aligned across the network
SCADA / telemetry | Real-time flow, pressure, voltage, and equipment status | Historian data siloed and hard to extract; inconsistent tag naming across sites
GIS / asset inventory | Location, material, install year, and dimensions of every network asset | Missing material and age attributes on older assets; no stable unique asset ID
CMMS / work orders | Failure, repair, and inspection history per asset | Free-text failure notes, no link to the GIS asset record, inconsistent coding
Lineage and quality | Traceability from source to model output | No lineage at all; results cannot be reproduced or explained to a regulator
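The "weakest required input" rule can be sketched as a minimum over the domains a use case depends on. The 0 to 5 scale, the example scores, and the choice to treat lineage as a required input for both use cases are assumptions made for illustration, not part of the framework itself.

```python
# Hypothetical readiness scores per domain on an assumed 0-5 scale.
scores = {"ami": 3, "scada": 4, "gis": 2, "cmms": 3, "lineage": 1}

# Required inputs per use case. Including "lineage" here is an assumption.
REQUIRED = {
    "leak_detection": ["ami", "scada", "gis", "lineage"],
    "asset_failure": ["gis", "cmms", "scada", "lineage"],
}

def use_case_readiness(use_case, scores):
    """Return (score, domain) for the weakest required input of a use case."""
    required = REQUIRED[use_case]
    weakest = min(required, key=lambda domain: scores[domain])
    return scores[weakest], weakest

leak_score, leak_blocker = use_case_readiness("leak_detection", scores)
failure_score, failure_blocker = use_case_readiness("asset_failure", scores)
```

Reporting the blocking domain alongside the score keeps the assessment actionable: it names the integration work that has to come first.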

Recommended actions

Fix the foundation before the first model

  • Establish a common asset identity that ties every GIS record to its work-order history, telemetry, and meter data so the four silos can be joined on a single stable key.
  • Prioritize attribute cleanup on the assets that matter most: fill material and install-year gaps for high-consequence mains and feeders before modeling their failure risk.
  • Extract SCADA historian data into an accessible analytics store with standardized tag naming, rather than leaving it locked in the operational historian.
  • Structure work-order failure data by mining free-text notes into consistent failure codes so failure history becomes a usable model input.
  • Implement data lineage from source system through transformation to model output so every AI result can be traced, reproduced, and defended in a rate case.
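One of the actions above, turning free-text work-order notes into consistent failure codes, can be sketched with simple keyword rules. The code names, the patterns, and the sample notes are hypothetical; a production approach would likely need a reviewed code list and manual checks on whatever the rules cannot classify.

```python
import re

# Hypothetical failure codes and keyword patterns, checked in order.
RULES = [
    ("CORROSION", re.compile(r"\b(corro\w*|rust\w*|pitting)\b", re.I)),
    ("JOINT_LEAK", re.compile(r"\b(joint|coupling|gasket)\b", re.I)),
    ("BREAK", re.compile(r"\b(break|broke\w*|crack\w*|split)\b", re.I)),
]

def code_note(note: str) -> str:
    """Return the first matching failure code, or UNCODED for manual review."""
    for code, pattern in RULES:
        if pattern.search(note):
            return code
    return "UNCODED"

notes = [
    "Circumferential crack on 6in CI main, repaired w/ clamp",
    "Heavy rust and pitting found at service tap",
    "Leaking at coupling, replaced gasket",
    "Crew dispatched, see supervisor",
]
coded = [code_note(n) for n in notes]
```

Keeping an explicit `UNCODED` bucket matters: the share of notes that land there is a rough measure of how much failure history is still unusable as a model input.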

Common pitfalls

Data mistakes that quietly kill utility AI projects

  • Assuming GIS is complete: older mains and service lines frequently lack material or age, and a failure model trained on those blanks will be biased toward the newer, better-documented assets.
  • Modeling on a partial AMI rollout without accounting for coverage gaps, so the model silently ignores whole districts and produces misleading loss estimates.
  • Leaving SCADA data in the operational historian, where it is too slow or restricted to feed analytics, and then blaming the model for poor performance.
  • Skipping lineage, which makes results unreproducible and, in a regulated context, impossible to defend when a commission or auditor asks how a number was produced.
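The partial-rollout pitfall can be guarded against by checking AMI coverage per district before any loss estimate is produced. The sketch below assumes hypothetical district names, meter counts, and an 80 percent coverage threshold; the right threshold is a judgment call for each utility.

```python
# Hypothetical counts per district: (meters installed, meters reporting interval data).
districts = {"North": (1200, 1140), "Central": (950, 410), "South": (800, 0)}
MIN_COVERAGE = 0.80  # assumed threshold, not a standard

def coverage_report(districts, min_coverage):
    """Compute AMI coverage per district and flag whether it is usable."""
    report = {}
    for name, (installed, reporting) in districts.items():
        coverage = reporting / installed if installed else 0.0
        report[name] = {
            "coverage": round(coverage, 3),
            "usable": coverage >= min_coverage,
        }
    return report

report = coverage_report(districts, MIN_COVERAGE)

# Districts the model must report as "not assessed" rather than silently skip.
excluded = sorted(name for name, r in report.items() if not r["usable"])
```

Publishing the `excluded` list with every loss estimate makes the coverage gap visible instead of letting the model quietly ignore those districts.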

Metrics that matter

Quantify readiness, not just intent

  • Asset-attribute completeness: percentage of GIS assets with valid material and install-year values, tracked for high-consequence segments specifically.
  • Join rate: share of work orders that successfully link to a GIS asset record via the common identity key.
  • AMI coverage and time-alignment: percentage of meters reporting interval data on a consistent, synchronized cadence.
  • Lineage coverage: proportion of model inputs with documented source-to-output traceability.
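Two of these metrics, asset-attribute completeness and join rate, are simple enough to compute directly. The sketch below uses hypothetical records and field names, and treats any non-empty value as "valid"; real validity rules (plausible year ranges, an allowed material list) would be stricter.

```python
# Hypothetical GIS and work-order records.
gis_assets = [
    {"asset_id": "A-1", "material": "CI", "install_year": 1954, "high_consequence": True},
    {"asset_id": "A-2", "material": None, "install_year": None, "high_consequence": True},
    {"asset_id": "A-3", "material": "PVC", "install_year": 1998, "high_consequence": False},
    {"asset_id": "A-4", "material": "DI", "install_year": 1987, "high_consequence": False},
]
work_orders = [
    {"wo": "W-1", "asset_id": "A-1"},
    {"wo": "W-2", "asset_id": "A-9"},  # key does not exist in GIS
    {"wo": "W-3", "asset_id": None},   # never linked to an asset
    {"wo": "W-4", "asset_id": "A-3"},
]

def completeness(assets):
    """Share of assets with both material and install year populated."""
    if not assets:
        return 0.0
    complete = sum(1 for a in assets if a["material"] and a["install_year"])
    return complete / len(assets)

def join_rate(orders, assets):
    """Share of work orders whose asset_id matches a GIS asset record."""
    if not orders:
        return 0.0
    known = {a["asset_id"] for a in assets}
    return sum(1 for w in orders if w["asset_id"] in known) / len(orders)

overall = completeness(gis_assets)
high_consequence = completeness([a for a in gis_assets if a["high_consequence"]])
rate = join_rate(work_orders, gis_assets)
```

In this made-up data the overall completeness looks healthier than the high-consequence figure, which is why the metric is worth tracking for those segments separately.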

FAQ

Frequently asked questions

Do we need a full data lake before we can do anything?

No. A full enterprise data lake is a multi-year effort, and you should not wait for it. Instead, assemble just the data needed for your first use case: for leak detection, that might be AMI or district-metered flow, SCADA pressure, and GIS topology for one part of the network. Prove the model on that slice, then expand the integration as you take on more use cases.

Our GIS is missing material and age on a lot of old pipe. Is a failure model still possible?

Yes, but you must handle the gaps deliberately. Missingness is itself informative, so flag it explicitly rather than hiding it. You can also use proxies such as installation era by neighborhood, or infer material from historical records and confirm it with field verification. Prioritize filling attributes on high-consequence assets first. What you must not do is silently train on the blanks, because the model will then be biased toward newer, better-documented assets.
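A minimal sketch of handling the gaps deliberately: keep an explicit missing flag for each attribute and estimate install year from a neighborhood-era proxy. The records, field names, and the choice of the neighborhood median as the proxy are assumptions for illustration only.

```python
from statistics import median

# Hypothetical pipe records; P2 has no material or install year on file.
pipes = [
    {"id": "P1", "neighborhood": "Oldtown", "install_year": 1948, "material": "CI"},
    {"id": "P2", "neighborhood": "Oldtown", "install_year": None, "material": None},
    {"id": "P3", "neighborhood": "Oldtown", "install_year": 1952, "material": "CI"},
    {"id": "P4", "neighborhood": "Eastside", "install_year": 1995, "material": "PVC"},
]

def add_gap_features(pipes):
    """Add explicit missing flags and a neighborhood-era install-year estimate."""
    # Collect known install years per neighborhood to use as an era proxy.
    known_years = {}
    for p in pipes:
        if p["install_year"] is not None:
            known_years.setdefault(p["neighborhood"], []).append(p["install_year"])

    features = []
    for p in pipes:
        year_missing = p["install_year"] is None
        year_est = p["install_year"]
        if year_missing and p["neighborhood"] in known_years:
            year_est = median(known_years[p["neighborhood"]])
        features.append({
            "id": p["id"],
            "install_year_est": year_est,
            "install_year_missing": year_missing,  # the model can learn from the gap itself
            "material": p["material"] or "UNKNOWN",
            "material_missing": p["material"] is None,
        })
    return features

features = add_gap_features(pipes)
```

Because the flags travel with the estimates, anyone reviewing a prediction can see whether it rests on recorded or inferred attributes.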

How much does data lineage really matter for a utility?

In a regulated utility it matters more than in most industries. When an AI model reprioritizes capital or flags an asset, you may have to show a commission or auditor exactly how that result was produced. Without lineage you cannot reproduce or defend the output, which puts both the decision and its cost recovery at risk. Build lineage in from the start rather than reconstructing it later.
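Building lineage in from the start can be as lightweight as recording, for each transformation step, where the data came from and a fingerprint of what went in and what came out. The sketch below is one possible shape under stated assumptions: the source name, step name, record layout, and sample readings are hypothetical, and a real deployment would likely also capture timestamps, code versions, and operators.

```python
import hashlib
import json

def fingerprint(obj) -> str:
    """Stable SHA-256 of a JSON-serializable object (keys sorted for determinism)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def lineage_record(source, step, inputs, outputs):
    """Describe one transformation step so its result can be reproduced later."""
    return {
        "source": source,
        "step": step,
        "input_sha256": fingerprint(inputs),
        "output_sha256": fingerprint(outputs),
    }

# Hypothetical pressure readings extracted from a SCADA historian.
raw = [{"tag": "PS3.PRESS", "psi": 61.2}, {"tag": "PS3.PRESS", "psi": 58.9}]
cleaned = [r for r in raw if r["psi"] > 60]

record = lineage_record("scada_historian_export", "filter_psi_gt_60", raw, cleaned)
```

If an auditor later asks how a number was produced, re-running the step on the archived input and matching the output fingerprint shows that the result is reproducible.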