AI in manufacturing lives or dies on OT data. Sensor, PLC, and SCADA streams are high-frequency and messy, historians hold years of unlabeled tags, and MES and ERP systems sit in silos that never speak the same language. This playbook helps a plant or IT-OT leader assess data readiness across edge and cloud, decide where a unified namespace fits, and build the data foundation that predictive and optimization models actually need. It maps the real gaps (tag chaos, missing context, unsynchronized clocks) so you can fix the foundation before funding models.
The plant floor speaks in tags, not features
A modern line generates enormous data volume: a single CNC machine can emit thousands of tags at sub-second rates, and a plant historian may hold hundreds of thousands of tags accumulated over a decade. But the data is rarely model-ready. Tags carry cryptic names like TT_4021_PV with no unit, no asset context, and no product linkage. Two machines from different vendors label the same measurement differently. Clocks drift between the PLC, the SCADA server, and the MES, so aligning a quality result to the process conditions that caused it becomes guesswork.
The result is a familiar gap: leaders assume their historian is a data lake, but a historian optimized for trend display is not the same as a labeled, contextualized dataset a model can learn from. A commonly cited estimate is that data engineers on industrial AI projects spend the majority of their time, often 60 to 80 percent, on cleaning, contextualizing, and aligning OT data. Readiness is the work most programs underestimate and the reason most models arrive late.
Edge versus cloud is the architectural decision that data readiness forces early. High-frequency PLC and sensor streams cannot all be shipped to the cloud without exploding cost and latency, so raw signals are buffered and processed at the edge, while contextualized, aggregated data flows upward for training and cross-line analysis. Getting this split wrong is expensive in both directions: push too much raw data up and cloud bills balloon while models still lack context; keep too much at the edge and you cannot train across lines or plants. Readiness planning has to answer where each transformation happens before models are built.
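As a rough illustration of the edge side of that split, the sketch below collapses raw high-frequency samples into fixed-width window summaries, so that only the summaries would travel upstream. The `WindowSummary` type, the `summarize_windows` function, and the one-second window are assumptions made for this example, not part of any particular edge platform.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class WindowSummary:
    tag: str
    window_start: float  # start of the window, in seconds
    count: int
    mean: float
    minimum: float
    maximum: float


def summarize_windows(tag, samples, window_s=1.0):
    """Collapse (timestamp, value) samples into per-window summaries.

    Hypothetical edge-side step: raw samples stay local and only the
    summaries are forwarded for training and cross-line analysis.
    """
    buckets = {}
    for ts, value in samples:
        # Floor each timestamp to the start of the window it falls in.
        start = (ts // window_s) * window_s
        buckets.setdefault(start, []).append(value)
    return [
        WindowSummary(tag, start, len(vals), fmean(vals), min(vals), max(vals))
        for start, vals in sorted(buckets.items())
    ]


# Four raw samples at 0.5 s spacing become two one-second summaries.
raw = [(0.0, 10.0), (0.5, 12.0), (1.0, 20.0), (1.5, 22.0)]
summaries = summarize_windows("TT_4021_PV", raw, window_s=1.0)
```

Which statistics to keep per window depends on the use case; a vibration model may need spectral features computed at the edge rather than a simple mean, minimum, and maximum.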
The unified namespace has emerged as the leading pattern for solving the silo problem structurally. Instead of building a brittle point-to-point integration between every pair of systems, each source publishes to a single contextualized model of the plant, and consumers subscribe to it. This turns the historian, SCADA, MES, and ERP from isolated islands into one queryable picture where a quality result already carries the process conditions and the work order that produced it. The namespace is not a prerequisite for a first pilot, but it is the difference between a program that scales to many lines and one that drowns in integration plumbing.
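A minimal in-memory sketch can make the publish and subscribe shape concrete. Unified namespaces are often built on an MQTT broker with a hierarchical topic tree; the code below only imitates that idea with MQTT-style wildcards and is not a broker. The `Namespace` class, the topic layout, and the payload fields are all hypothetical.

```python
def topic_matches(pattern, topic):
    """Simplified MQTT-style match: '+' is one level, '#' is the remainder."""
    p_parts = pattern.split("/")
    t_parts = topic.split("/")
    for i, part in enumerate(p_parts):
        if part == "#":
            return True
        if i >= len(t_parts):
            return False
        if part != "+" and part != t_parts[i]:
            return False
    return len(p_parts) == len(t_parts)


class Namespace:
    """Toy stand-in for a unified namespace: one tree, many publishers."""

    def __init__(self):
        self._subscribers = []  # list of (pattern, callback)
        self._latest = {}       # topic -> most recent payload

    def subscribe(self, pattern, callback):
        self._subscribers.append((pattern, callback))

    def publish(self, topic, payload):
        self._latest[topic] = payload
        for pattern, callback in self._subscribers:
            if topic_matches(pattern, topic):
                callback(topic, payload)

    def latest(self, pattern):
        """Return the current value of every topic matching the pattern."""
        return {t: p for t, p in self._latest.items() if topic_matches(pattern, t)}


uns = Namespace()
seen = []
# A consumer asks for every temperature on line 3, whatever the asset.
uns.subscribe("acme/plant1/line3/+/temperature", lambda t, p: seen.append((t, p)))

# Hypothetical sources publish into the same tree, context included.
uns.publish("acme/plant1/line3/furnace_zone2/temperature",
            {"value": 812.4, "unit": "degC", "work_order": "WO-1001"})
uns.publish("acme/plant1/line3/furnace_zone2/pressure",
            {"value": 1.8, "unit": "bar", "work_order": "WO-1001"})
```

After these calls, `seen` holds only the temperature message, while `uns.latest("acme/plant1/#")` returns both topics. Each new consumer adds one subscription instead of one integration per source system.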
Assessing data readiness by layer
Evaluate each data layer against what AI needs, then invest where the gap is largest. The table below scores the common sources.
| Source | What it holds | Readiness gap |
|---|---|---|
| PLC and sensors | Real-time process values, states | No context, no units, high frequency to buffer |
| SCADA | Alarms, setpoints, operator actions | Event logs unstructured, clocks unsynced |
| Historian | Years of time-series tags | Cryptic tag names, no product or batch link |
| MES | Work orders, genealogy, quality | Siloed schema, hard to join to process data |
| ERP | Orders, inventory, shipments | Daily granularity, disconnected from the floor |

Score each source honestly against the gap column before assuming it is model-ready. The recurring mistake is to treat the historian as a finished data lake because it holds years of data, when in fact its cryptic tags and missing product linkage make it one of the harder sources to use. MES and ERP hold the context that makes floor data meaningful (the batch genealogy and the order), but they sit in schemas that are painful to join to time-series signals. The readiness work is precisely the labor of closing these gaps, and it is where the majority of an industrial AI project timeline actually goes.
Build the foundation first
- Stand up a unified namespace so PLC, SCADA, MES, and ERP data publish to one contextualized model rather than point-to-point integrations.
- Add asset and product context to raw tags, mapping TT_4021_PV to furnace zone 2 temperature on line 3, so features carry meaning.
- Synchronize clocks across PLC, SCADA, and MES with a common time source so quality results align to the conditions that produced them.
- Decide edge versus cloud per use case: run low-latency inference at the edge, and stream aggregated context to the cloud for training.
- Sample and label a representative dataset early to expose gaps before you commit to a modeling roadmap.
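
The tag-context step in the list above can be sketched as a small lookup that turns a raw tag into a named, unit-bearing feature. The registry contents, the `degC` unit, and the `contextualize` helper are assumptions for illustration; a real registry would be curated with process engineers or pulled from an asset model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagContext:
    line: str
    asset: str
    measurement: str
    unit: str


# Hypothetical registry: raw tag name -> asset and measurement context.
# The unit is assumed here; the raw tag itself carries none.
TAG_REGISTRY = {
    "TT_4021_PV": TagContext("line_3", "furnace_zone_2", "temperature", "degC"),
}


def contextualize(tag, value, timestamp):
    """Return a contextualized record, or None if the tag is unmapped.

    Unmapped tags are surfaced as None rather than guessed at, so the
    gap shows up in readiness reporting instead of in a model.
    """
    ctx = TAG_REGISTRY.get(tag)
    if ctx is None:
        return None
    return {
        "feature": f"{ctx.line}.{ctx.asset}.{ctx.measurement}",
        "value": value,
        "unit": ctx.unit,
        "timestamp": timestamp,
    }


record = contextualize("TT_4021_PV", 812.4, 1700000000.0)
unmapped = contextualize("PT_9999_PV", 1.2, 1700000000.0)
```

Here `record["feature"]` is `line_3.furnace_zone_2.temperature`, and `unmapped` is `None` because the second tag has no registry entry.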
Data traps to avoid
- Treating the historian as a ready data lake when it lacks the context and labels a model needs.
- Pushing every raw high-frequency tag to the cloud, driving up cost and latency while adding no context.
- Ignoring clock drift, so process and quality data never align and every model learns noise.
- Building point-to-point integrations between each system, creating brittle plumbing that breaks with every upgrade.
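
The clock-drift trap is easy to show with numbers. In the sketch below, a quality result stamped by an MES clock that runs 2.5 seconds ahead is joined to the most recent PLC reading, once without and once with offset correction. The offsets, readings, and function names are invented for the example; real offsets would have to be measured against a common time source.

```python
from bisect import bisect_right

# Assumed clock offsets in seconds relative to the reference time source.
# A positive value means that system's clock runs ahead.
CLOCK_OFFSET_S = {"plc": 0.0, "mes": 2.5}


def to_reference_time(source, timestamp):
    """Convert a source-local timestamp to reference time."""
    return timestamp - CLOCK_OFFSET_S[source]


def latest_reading_before(readings, timestamp, tolerance_s):
    """As-of lookup over (timestamp, value) pairs sorted by timestamp.

    Returns the last reading at or before `timestamp`, or None if that
    reading is older than `tolerance_s`.
    """
    times = [t for t, _ in readings]
    i = bisect_right(times, timestamp) - 1
    if i < 0 or timestamp - times[i] > tolerance_s:
        return None
    return readings[i]


# PLC furnace temperature, already in reference time.
plc_readings = [(100.0, 800.0), (101.0, 805.0), (102.0, 830.0), (103.0, 790.0)]

# The MES stamps a quality result at 104.6 on its own (fast) clock.
mes_stamp = 104.6

naive = latest_reading_before(plc_readings, mes_stamp, tolerance_s=5.0)
corrected = latest_reading_before(
    plc_readings, to_reference_time("mes", mes_stamp), tolerance_s=5.0
)
# naive     → (103.0, 790.0)
# corrected → (102.0, 830.0)
```

The uncorrected join attributes the quality result to a 790 reading, while the corrected join finds the 830 reading that was actually in effect. A model trained on the uncorrected pairs would be learning from the wrong process conditions.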
How to measure readiness
- Share of historian tags with asset context, units, and product linkage attached.
- Clock synchronization error across PLC, SCADA, and MES, with a target of under one second.
- Time to assemble a labeled training dataset for a new use case, a direct proxy for readiness.
- Edge-to-cloud data volume and cost per line, watched to catch runaway raw streaming.
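
Two of these measures reduce to simple arithmetic, sketched below under assumed inputs: a tag inventory exported as a list of dictionaries, and a table of measured clock offsets. The field names (`asset`, `unit`, `product_link`) and the sample values are hypothetical.

```python
REQUIRED_CONTEXT = ("asset", "unit", "product_link")


def context_coverage(tags):
    """Share of tags that have every required context field filled in."""
    if not tags:
        return 0.0
    complete = sum(1 for t in tags if all(t.get(k) for k in REQUIRED_CONTEXT))
    return complete / len(tags)


def max_clock_error(offsets_s):
    """Largest absolute offset from the reference time source, in seconds."""
    return max(abs(v) for v in offsets_s.values())


# Hypothetical tag inventory: only the first tag is fully contextualized.
tags = [
    {"name": "TT_4021_PV", "asset": "furnace_zone_2", "unit": "degC", "product_link": "work_order"},
    {"name": "PT_4022_PV", "asset": "furnace_zone_2", "unit": "bar"},
    {"name": "FT_1001_PV", "unit": "l/min"},
    {"name": "XX_0001"},
]
coverage = context_coverage(tags)  # 0.25

# Hypothetical measured offsets in seconds.
offsets = {"plc": 0.02, "scada": -0.4, "mes": 2.5}
worst = max_clock_error(offsets)   # 2.5
meets_sub_second_target = worst < 1.0
```

In this made-up inventory a quarter of the tags are fully contextualized and the MES clock misses the sub-second target. Tracking both numbers over time shows whether the foundation work is moving.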
Frequently asked questions
Is our historian enough to start building AI models?
Usually not on its own. A historian is optimized for trending, so its tags often lack units, asset context, and product linkage. You can start there, but budget for the contextualization and labeling work first, or your first model will learn from noise.
What is a unified namespace and do we need one?
A unified namespace is a single contextualized data model that all systems publish to, replacing brittle point-to-point integrations. You do not strictly need one to run a first pilot, but you will need it to scale AI across lines without a web of fragile connectors.
Should inference run at the edge or in the cloud?
Decide per use case. Low-latency, safety-adjacent inference like vision or fast anomaly detection belongs at the edge. Model training and cross-line optimization belong in the cloud, fed by aggregated, contextualized data rather than raw high-frequency tags.