AI in Manufacturing: Data Readiness

Summary

AI in manufacturing lives or dies on OT data. Sensor, PLC, and SCADA streams are high-frequency and messy, historians hold years of unlabeled tags, and MES and ERP systems sit in silos that never speak the same language. This playbook helps a plant or IT-OT leader assess data readiness across edge and cloud, decide where a unified namespace fits, and build the data foundation that predictive and optimization models actually need. It maps the real gaps (tag chaos, missing context, unsynchronized clocks) so you fix the foundation before funding models.

Context

The plant floor speaks in tags, not features

A modern line generates enormous data volume: a single CNC machine can emit thousands of tags at sub-second rates, and a plant historian may hold hundreds of thousands of tags accumulated over a decade. But the data is rarely model-ready. Tags carry cryptic names like TT_4021_PV with no unit, no asset context, and no product linkage. Two machines from different vendors label the same measurement differently. Clocks drift between the PLC, the SCADA server, and the MES, so aligning a quality result to the process conditions that caused it becomes guesswork.

The result is a familiar gap: leaders assume their historian is a data lake, but a historian optimized for trend display is not the same as a labeled, contextualized dataset a model can learn from. A commonly cited estimate is that data engineers on industrial AI projects spend the majority of their time, often 60 to 80 percent, on cleaning, contextualizing, and aligning OT data. Readiness is the work most programs underestimate and the reason most models arrive late.

Edge versus cloud is the architectural decision that data readiness forces early. High-frequency PLC and sensor streams cannot all be shipped to the cloud without exploding cost and latency, so raw signals are buffered and processed at the edge, while contextualized, aggregated data flows upward for training and cross-line analysis. Getting this split wrong is expensive in both directions: push too much raw data up and cloud bills balloon while models still lack context; keep too much at the edge and you cannot train across lines or plants. Readiness planning has to answer where each transformation happens before models are built.
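As a rough illustration of the edge side of that split, the sketch below reduces a raw high-frequency signal to one summary row per time window, which is the kind of aggregate that would flow upward instead of every sample. The function name, the 60-second window, the millisecond timestamps, and the summary fields are all assumptions made for the example, not part of any particular edge runtime.

```python
from collections import defaultdict
from statistics import mean

def aggregate_windows(samples, window_ms=60_000):
    """Reduce raw (timestamp_ms, value) samples to one summary row per window."""
    buckets = defaultdict(list)
    for ts_ms, value in samples:
        # Integer division snaps each sample to the start of its window.
        window_start = (ts_ms // window_ms) * window_ms
        buckets[window_start].append(value)
    return [
        {
            "window_start_ms": start,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for start, values in sorted(buckets.items())
    ]

# Hypothetical 10 Hz signal over two minutes: 1200 raw samples.
raw = [(i * 100, 800.0 + (i % 10)) for i in range(1200)]
summary = aggregate_windows(raw)
print(len(raw), "raw samples reduced to", len(summary), "rows")
```

Which statistics to keep per window is a per-use-case choice: a mean may be enough for drift monitoring, while fast anomaly detection may need the raw signal to stay at the edge.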

The unified namespace has emerged as the leading pattern for solving the silo problem structurally. Instead of building a brittle point-to-point integration between every pair of systems, each source publishes to a single contextualized model of the plant, and consumers subscribe to it. This turns the historian, SCADA, MES, and ERP from isolated islands into one queryable picture where a quality result already carries the process conditions and the work order that produced it. The namespace is not a prerequisite for a first pilot, but it is the difference between a program that scales to many lines and one that drowns in integration plumbing.
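The publish and subscribe pattern described here can be sketched with a small in-memory stand-in. This is not a broker or a product API: the Namespace class, the slash-separated topic layout, and the example payloads are hypothetical, chosen only to show sources publishing to one shared model while a consumer subscribes to a branch of it.

```python
from collections import defaultdict

class Namespace:
    """In-memory stand-in for a unified namespace: publish by topic, subscribe by prefix."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic prefix -> list of callbacks
        self.latest = {}                       # topic -> most recent payload

    def subscribe(self, prefix, callback):
        self._subscribers[prefix].append(callback)

    def publish(self, topic, payload):
        self.latest[topic] = payload
        for prefix, callbacks in self._subscribers.items():
            # Match the exact topic or anything beneath the subscribed branch.
            if topic == prefix or topic.startswith(prefix + "/"):
                for callback in callbacks:
                    callback(topic, payload)

uns = Namespace()
received = []
uns.subscribe("plant1/line3", lambda topic, payload: received.append((topic, payload)))

# Hypothetical topics: a process value from the PLC side and a work order from MES.
uns.publish("plant1/line3/furnace/zone2/temperature", {"value": 812.4, "unit": "degC"})
uns.publish("plant1/line3/mes/work_order", {"id": "WO-1001", "product": "bracket-A"})
uns.publish("plant1/line4/furnace/zone1/temperature", {"value": 790.0, "unit": "degC"})

print(len(received))  # the line 3 subscriber sees only the two line 3 messages
```

The point of the sketch is the shape of the integration: adding a new consumer means one subscription, not a new connector to each source system.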

The framework

Assessing data readiness by layer

Evaluate each data layer against what AI needs, then invest where the gap is largest. The table below lists the common sources and the readiness gap for each.

Score each source honestly against the gap column before assuming it is model-ready. The recurring mistake is to treat the historian as a finished data lake because it holds years of data, when in fact its cryptic tags and missing product linkage make it one of the harder sources to use. MES and ERP hold the context that makes floor data meaningful (the batch genealogy and the order), but they sit in schemas that are painful to join to time-series signals. The readiness work is precisely the labor of closing these gaps, and it is where the majority of an industrial AI project timeline actually goes.

Source	What it holds	Readiness gap
PLC and sensors	Real-time process values, states	No context, no units, high frequency to buffer
SCADA	Alarms, setpoints, operator actions	Event logs unstructured, clocks unsynced
Historian	Years of time-series tags	Cryptic tag names, no product or batch link
MES	Work orders, genealogy, quality	Siloed schema, hard to join to process data
ERP	Orders, inventory, shipments	Daily granularity, disconnected from the floor

Recommended actions

Build the foundation first

Stand up a unified namespace so PLC, SCADA, MES, and ERP data publish to one contextualized model rather than point-to-point integrations.
Add asset and product context to raw tags, mapping TT_4021_PV to furnace zone 2 temperature on line 3, so features carry meaning.
Synchronize clocks across PLC, SCADA, and MES with a common time source so quality results align to the conditions that produced them.
Decide edge versus cloud per use case: run low-latency inference at the edge, and stream aggregated context to the cloud for training.
Sample and label a representative dataset early to expose gaps before you commit to a modeling roadmap.
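The tag-context step above can be sketched as a lookup that turns a raw tag reading into a record with meaning attached. The TT_4021_PV mapping comes from the example in the list; the TagContext fields, the unit of degrees Celsius, and the contextualize function are assumptions for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TagContext:
    line: str
    asset: str
    measurement: str
    unit: str  # assumed unit; the source example does not state one

# Hypothetical context registry keyed by raw historian tag name.
TAG_CONTEXT = {
    "TT_4021_PV": TagContext(
        line="line 3",
        asset="furnace zone 2",
        measurement="temperature",
        unit="degC",
    ),
}

def contextualize(tag, value, timestamp):
    """Attach asset context to a raw reading, or flag the tag as unmapped."""
    record = {"tag": tag, "value": value, "timestamp": timestamp}
    context = TAG_CONTEXT.get(tag)
    if context is None:
        record["contextualized"] = False
        return record
    record.update(asdict(context))
    record["contextualized"] = True
    return record

mapped = contextualize("TT_4021_PV", 812.4, "2024-01-01T08:00:00Z")
unmapped = contextualize("XX_9999_PV", 1.0, "2024-01-01T08:00:00Z")
print(mapped["asset"], mapped["measurement"], mapped["unit"])
```

Flagging unmapped tags rather than dropping them keeps the gap visible, which is useful when sampling a dataset early to see how much context is still missing.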

Common pitfalls

Data traps to avoid

Treating the historian as a ready data lake when it lacks the context and labels a model needs.
Pushing every raw high-frequency tag to the cloud, driving up cost and latency while adding no context.
Ignoring clock drift, so process and quality data never align and every model learns noise.
Building point-to-point integrations between each system, creating brittle plumbing that breaks with every upgrade.
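To make the clock-drift trap concrete, the sketch below matches a quality result to the most recent process sample, once naively and once after correcting for a known clock offset. The function name, the 12-second offset, the 5-second tolerance, and the sample data are all invented for the example; in practice the offset would have to be measured or removed with a common time source.

```python
from bisect import bisect_right

def align_quality_to_process(process, quality, mes_offset_s=0.0, max_gap_s=5.0):
    """Match each quality result to the latest process sample at or before it.

    process: list of (timestamp_s, value) on the PLC clock, sorted by timestamp.
    quality: list of (timestamp_s, result) on the MES clock.
    mes_offset_s: assumed amount by which the MES clock runs ahead of the PLC clock.
    """
    times = [ts for ts, _ in process]
    aligned = []
    for q_ts, result in quality:
        corrected = q_ts - mes_offset_s
        idx = bisect_right(times, corrected) - 1
        if idx >= 0 and corrected - times[idx] <= max_gap_s:
            aligned.append((result, process[idx][1]))
        else:
            aligned.append((result, None))  # no process sample close enough
    return aligned

# Hypothetical furnace temperatures with a spike at t=20 s on the PLC clock.
process = [(0, 800.0), (10, 805.0), (20, 830.0), (30, 806.0)]
quality = [(33, "fail")]  # MES timestamp for the part that saw the spike

naive = align_quality_to_process(process, quality)
corrected = align_quality_to_process(process, quality, mes_offset_s=12.0)
print(naive, corrected)
```

Without the correction, the failed part is paired with a normal reading, so a model trained on that pairing would see no relationship between the spike and the defect.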

Metrics that matter

How to measure readiness

Share of historian tags with asset context, units, and product linkage attached.
Clock synchronization error across PLC, SCADA, and MES, targeted at sub-second.
Time to assemble a labeled training dataset for a new use case, a direct proxy for readiness.
Edge-to-cloud data volume and cost per line, watched to catch runaway raw streaming.
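The first metric above can be computed from a tag inventory with a few lines of code. This is a sketch under assumptions: the field names asset, unit, and product_link, and the four example tags other than TT_4021_PV, are hypothetical and would need to match however your tag metadata is actually stored.

```python
REQUIRED_FIELDS = ("asset", "unit", "product_link")

def context_coverage(tags):
    """Share of tags that have every required context field populated."""
    if not tags:
        return 0.0
    complete = sum(
        1 for tag in tags if all(tag.get(field) for field in REQUIRED_FIELDS)
    )
    return complete / len(tags)

def missing_by_field(tags):
    """Count how many tags lack each required field, to show where the gap is."""
    return {
        field: sum(1 for tag in tags if not tag.get(field))
        for field in REQUIRED_FIELDS
    }

# Hypothetical tag inventory exported from a historian.
inventory = [
    {"name": "TT_4021_PV", "asset": "furnace zone 2", "unit": "degC", "product_link": "work_order"},
    {"name": "PT_1100_PV", "asset": "press 1", "unit": "bar", "product_link": None},
    {"name": "FT_2200_PV", "asset": None, "unit": "", "product_link": None},
    {"name": "ST_3300_PV", "asset": "spindle 3", "unit": None, "product_link": None},
]

coverage = context_coverage(inventory)
gaps = missing_by_field(inventory)
print(f"{coverage:.0%} of tags fully contextualized; missing: {gaps}")
```

Tracking the per-field counts alongside the overall share shows whether the next investment should go into units, asset mapping, or product linkage.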

FAQ

Frequently asked questions

Is our historian enough to start building AI models?

Usually not on its own. A historian is optimized for trending, so its tags often lack units, asset context, and product linkage. You can start there, but budget for the contextualization and labeling work first, or your first model will learn from noise.

What is a unified namespace and do we need one?

A unified namespace is a single contextualized data model that all systems publish to, replacing brittle point-to-point integrations. You do not strictly need one to run a first pilot, but you will need it to scale AI across lines without a web of fragile connectors.

Should inference run at the edge or in the cloud?

Decide per use case. Low-latency, safety-adjacent inference like vision or fast anomaly detection belongs at the edge. Model training and cross-line optimization belong in the cloud, fed by aggregated, contextualized data rather than raw high-frequency tags.

Stratenity is the AI Operating System for Strategic Execution.
