Deep tech data is the opposite of big-tech data: small, expensive, siloed, and heterogeneous. A materials or fusion program may hold only hundreds of high-quality experiments, each costing thousands to millions of dollars, scattered across instrument files, ELN notebooks, simulation dumps, and engineers' heads. Getting AI value in deep tech depends less on model choice and more on unifying experimental, simulation, and instrument data with real lineage, and on methods built for the small-data regime. This playbook covers how to break instrument and simulation silos, capture negative results, standardize metadata, and choose physics-informed and active-learning approaches that work when you cannot simply gather more data.
Deep tech runs on small, costly, scattered data
The defining constraint is scarcity. A consumer-AI team trains on billions of examples; a deep tech program might have 200 to 2,000 well-characterized experiments in its entire history, and each one may have cost from thousands of dollars for a bench synthesis to over a million for a fab lot or a fusion shot. You cannot solve a data problem by collecting more when each new point costs a quarter of runway. That inverts the usual playbook: quality, metadata, and reuse of every existing point, including failures, matter more than volume. Every characterized sample is a capital asset, and treating it as a disposable log entry throws away money the venture has already spent.
The second problem is fragmentation. Experimental results sit in instrument-specific binary files, simulation outputs in HPC scratch directories, process parameters in spreadsheets, and interpretation in scientists' heads and PDF reports. Studies of R&D organizations repeatedly find that scientists spend a large share of their time, often cited as around a third, finding and reformatting data rather than doing science. Until that data is unified with lineage, from raw instrument reading to derived property, AI adoption stalls no matter how good the models are. A model trained on data whose provenance nobody can reconstruct is a model no scientist will trust with a six-figure decision, and rightly so. The unglamorous work of ingestion, schema, and lineage is therefore not a precursor to the AI project; it is the larger half of the AI project, and skipping it is the single most common reason deep tech AI programs quietly stall in year two.
A data-readiness ladder for the small-data regime
Climb the ladder in order. Each rung is a precondition for the AI approaches that make deep tech data usable despite scarcity.
| Readiness rung | What it fixes | Enables |
|---|---|---|
| Capture and ingest | Instrument files and sim outputs trapped in silos | A single queryable store of raw results |
| Metadata and standards | No shared schema for conditions, materials, units | Cross-experiment comparison and search |
| Lineage and provenance | Cannot trace a property back to its raw reading | Trust, reproducibility, audit |
| Negative-result capture | Failed runs discarded, information lost | Balanced training; active learning |
| Physics-informed methods | Too few points for pure data-driven models | Reliable models from hundreds not millions of points |
Make scarce data usable before buying more models
- Stand up a central experiment store that ingests instrument files and simulation outputs automatically, so results stop dying in local drives and scratch space.
- Adopt a shared metadata schema for materials, conditions, units, and provenance, and enforce it at capture; retrofitting metadata onto old data is far more expensive.
- Record negative and out-of-spec results as first-class data, because in small-data regimes a failure can be as informative as a success for active learning.
- Preserve full lineage from raw instrument reading to derived property so every AI training point is traceable and reproducible, which auditors and grant officers will require.
- Favor physics-informed and Bayesian methods that inject known laws as priors, so you get reliable models from hundreds of points rather than needing millions.
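The capture, metadata, negative-result, and lineage steps above can be sketched in a few lines. This is a minimal illustration, not a reference design: the required field names, the `RawReading` and `DerivedProperty` classes, the `outcome` labels, and the file path are all hypothetical, and a real store would add persistence and access control.

```python
from dataclasses import dataclass
import hashlib

# Hypothetical minimal schema; a real one would be agreed across teams.
REQUIRED_FIELDS = ("material", "conditions", "units", "instrument")


@dataclass(frozen=True)
class RawReading:
    """A raw instrument file, identified by the hash of its bytes."""
    path: str
    sha256: str


@dataclass(frozen=True)
class DerivedProperty:
    """A derived value that records which raw readings and method produced it."""
    name: str
    value: float
    unit: str
    parents: tuple             # sha256 digests of the raw readings used
    method: str                # e.g. analysis script name and version
    outcome: str = "success"   # or "failed" / "out_of_spec"; negatives are kept


def register_raw(path: str, payload: bytes) -> RawReading:
    """Fingerprint the raw bytes so later results can point back to them."""
    return RawReading(path=path, sha256=hashlib.sha256(payload).hexdigest())


def missing_metadata(metadata: dict) -> list:
    """Return required keys that are absent or empty, for enforcement at capture."""
    return [key for key in REQUIRED_FIELDS if not metadata.get(key)]


# Example: an out-of-spec result is stored with the same lineage as a success.
raw = register_raw("xrd/run_0042.raw", b"example instrument bytes")
prop = DerivedProperty(
    name="lattice_constant",
    value=4.05,
    unit="angstrom",
    parents=(raw.sha256,),
    method="fit_peaks v0.1",
    outcome="out_of_spec",
)
print(missing_metadata({"material": "Al", "units": "SI"}))
```

Hashing the raw bytes rather than trusting the file path means a renamed or moved file still resolves to the same lineage record, which is one simple way to keep derived properties traceable.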
Data mistakes that waste scarce, expensive experiments
- Throwing away failed experiments, which discards a large share of your hard-won information and biases any model toward only successful regimes.
- Deferring metadata standards until you have enough data, then facing an unaffordable retrofit across years of heterogeneous instrument files.
- Treating simulation and experimental data as interchangeable without tracking which is which, so model error and reality drift silently apart.
- Applying data-hungry deep learning to a few hundred points and getting confident but wrong predictions, when physics-informed methods would fit the regime.
Readiness metrics that predict AI success
- Share of experiments captured to the central store with complete metadata within a day of running, targeting over 90 percent.
- Percentage of derived properties with full lineage back to raw instrument data.
- Negative-result capture rate: failed and out-of-spec runs recorded as usable data.
- Scientist time spent finding and reformatting data, tracked downward from the common one-third baseline.
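Two of these metrics can be computed directly from a run log. The sketch below assumes a hypothetical `RunRecord` with the fields shown, and assumes failed runs are logged even when their data never reaches the store. Lineage coverage and scientist time need other inputs and are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RunRecord:
    """One experiment run as seen by a (hypothetical) run log."""
    ran_at: datetime
    captured_at: Optional[datetime]   # None if it never reached the central store
    metadata_complete: bool
    failed_or_out_of_spec: bool


def readiness_metrics(runs):
    """Return the timely-capture share and the negative-result capture rate."""
    total = len(runs)
    timely = sum(
        1 for r in runs
        if r.captured_at is not None
        and r.metadata_complete
        and r.captured_at - r.ran_at <= timedelta(days=1)
    )
    failed = [r for r in runs if r.failed_or_out_of_spec]
    failed_captured = sum(1 for r in failed if r.captured_at is not None)
    return {
        "timely_capture_share": timely / total if total else 0.0,
        "negative_capture_rate": failed_captured / len(failed) if failed else 0.0,
    }


t0 = datetime(2024, 1, 8, 9, 0)
example_runs = [
    RunRecord(t0, t0 + timedelta(hours=2), True, False),   # captured in time
    RunRecord(t0, t0 + timedelta(days=3), True, True),     # failure, captured late
    RunRecord(t0, None, False, True),                      # failure, never captured
    RunRecord(t0, t0 + timedelta(hours=1), False, False),  # metadata incomplete
]
print(readiness_metrics(example_runs))
```

Against the target above, a `timely_capture_share` below 0.9 would flag the capture rung as not yet solid.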
Frequently asked questions
We only have a few hundred good experiments. Is that enough for AI?
Yes, if you use methods built for scarcity. Physics-informed models, Gaussian processes, and Bayesian active learning extract signal from hundreds of points by encoding known laws as priors. The lever is not more data; it is better use of the data you have plus smart selection of the next experiment.
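As a rough illustration of both ideas, the sketch below fixes the model form to a known law and then picks the next experiment where the fit is least certain. The Arrhenius law is an assumed example, not something this playbook prescribes, and the data are synthetic. The selection rule is ordinary least-squares prediction variance under constant noise, a deliberately simple stand-in for a Gaussian process or full Bayesian acquisition function.

```python
import math

R_GAS = 8.314462618  # molar gas constant, J/(mol*K)


def fit_arrhenius(temps_k, rates):
    """Least-squares fit of ln(k) = ln(A) - Ea / (R * T), with x = 1/T.

    Returns (ln_A, Ea, x_mean, sxx, n); the last three describe the design.
    """
    xs = [1.0 / t for t in temps_k]
    ys = [math.log(k) for k in rates]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # equals -Ea / R
    intercept = y_mean - slope * x_mean   # equals ln(A)
    return intercept, -slope * R_GAS, x_mean, sxx, n


def relative_variance(temp_k, x_mean, sxx, n):
    """Variance of the fitted line at temp_k, in units of the noise variance."""
    x = 1.0 / temp_k
    return 1.0 / n + (x - x_mean) ** 2 / sxx


def next_experiment(candidates_k, x_mean, sxx, n):
    """Pick the candidate temperature where the fitted line is least certain."""
    return max(candidates_k, key=lambda t: relative_variance(t, x_mean, sxx, n))


# Synthetic, noise-free data generated from assumed Ea = 50 kJ/mol, A = 1e6.
temps = [300.0, 320.0, 340.0]
rates = [1e6 * math.exp(-50000.0 / (R_GAS * t)) for t in temps]
ln_a, ea, x_mean, sxx, n = fit_arrhenius(temps, rates)
print(round(ea), next_experiment([310.0, 330.0, 400.0], x_mean, sxx, n))
```

Because the law fixes the functional form, three points are enough to estimate two parameters. A black-box model would need far more data to learn the same shape.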
Why bother storing failed experiments?
Failures define the boundaries of the feasible region. Active-learning and surrogate models need to know where things do not work, not just where they do. Discarding negatives biases every downstream model and wastes the money you already spent running them.
Can we mix simulation and experimental data for training?
You can, but tag each point by source and fidelity. Multi-fidelity methods combine cheap simulation with scarce experiments deliberately. Blending them without provenance lets simulation error contaminate your view of physical reality.
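A minimal sketch of what tagging buys you is below. The `Point` fields and the fidelity labels are hypothetical, and the constant-offset correction is the simplest possible multi-fidelity scheme, shown only to make the idea concrete. Practical methods model a discrepancy that varies with conditions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Point:
    condition: float   # e.g. a process temperature
    value: float       # the measured or simulated property
    source: str        # "experiment", "simulation", or "simulation_corrected"
    fidelity: str      # hypothetical labels such as "high" or "low"


def simulation_bias(points):
    """Mean (experiment - simulation) gap at conditions where both exist.

    Conditions are matched by exact equality, which is enough for a sketch.
    """
    sim = {p.condition: p.value for p in points if p.source == "simulation"}
    gaps = [
        p.value - sim[p.condition]
        for p in points
        if p.source == "experiment" and p.condition in sim
    ]
    if not gaps:
        raise ValueError("no matched simulation/experiment pairs")
    return mean(gaps)


def corrected_simulation(points):
    """Shift simulated values by the estimated bias and retag them."""
    bias = simulation_bias(points)
    return [
        Point(p.condition, p.value + bias, "simulation_corrected", p.fidelity)
        for p in points
        if p.source == "simulation"
    ]


data = [
    Point(100.0, 1.0, "simulation", "low"),
    Point(200.0, 2.0, "simulation", "low"),
    Point(300.0, 3.0, "simulation", "low"),
    Point(100.0, 1.5, "experiment", "high"),
    Point(200.0, 2.5, "experiment", "high"),
]
print(simulation_bias(data))
print([p.value for p in corrected_simulation(data)])
```

Without the `source` tag, neither function could be written: the bias between simulation and experiment would be averaged into the training set instead of being estimated and corrected.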