How AI and IIoT Sensors Cut Unplanned Downtime in a Paper Mill

Papermaking is one of the oldest continuous-process industries in the world, and the economics haven't changed in a century: the line runs twenty-four hours a day, seven days a week, or it doesn't run profitably at all. A single unplanned stop on a wide paper machine can cost anywhere from two to five lakh rupees per hour when you add up lost production, off-spec broke that gets recycled at zero margin, rushed maintenance callouts, and the compounding knock-on through scheduling and logistics.

That is the problem we were brought in to solve. Not downtime in general — downtime specifically caused by surprise failures on critical rotating equipment that gave no warning until it was too late.

The Biggest Problem: Surprise Failures on Critical Rotating Equipment

A modern paper machine is a chain of rotating assemblies: dryer-section drives that spin steam-heated cylinders at precise speeds, high-consistency refiners that mechanically fibrillate pulp fibres, vacuum pumps that pull water through forming fabrics, and a fleet of fans and blowers managing press-section airflows. Every one of these sits in a hot, wet, vibration-rich environment, and every one of them is a single point of failure for the entire production line.

Bearing failure is the most common culprit. A dryer-section bearing that starts to spall generates a characteristic high-frequency vibration signature weeks before it seizes — but without instrumentation, no one hears it. Gearbox wear in a refiner drive produces side-band harmonics in the motor current long before any mechanical noise breaks through the ambient din of the mill. A vacuum pump with a failing mechanical seal shows elevated casing temperature well before the pump cavitates and trips.

When these failures arrive unannounced, the response is always the same: emergency shutdown, scramble for a spare (which may or may not be in stock), maintenance work carried out under pressure, and a production gap that cannot be recovered. In the engagement this case study describes, unplanned stops from rotating-equipment failures were occurring roughly twice a month and averaging six to eight hours of lost production each time.
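For a rough sense of scale, the stop frequency and duration above can be combined with the hourly cost range quoted at the start of this article. This is an illustrative sketch only: it assumes the generic two-to-five-lakh hourly figure applies to this mill, which the engagement data does not confirm, and the function name is ours.

```python
def annual_downtime_range(stops_per_month, hours_per_stop, cost_per_hour_lakh):
    """Return ((low, high) lost hours per year, (low, high) cost in lakh rupees).

    hours_per_stop and cost_per_hour_lakh are (low, high) tuples.
    """
    stops_per_year = stops_per_month * 12
    hours = (stops_per_year * hours_per_stop[0], stops_per_year * hours_per_stop[1])
    cost = (hours[0] * cost_per_hour_lakh[0], hours[1] * cost_per_hour_lakh[1])
    return hours, cost


# Two stops a month, six to eight hours each, two to five lakh rupees per hour.
hours, cost_lakh = annual_downtime_range(2, (6, 8), (2, 5))
print(hours)      # (144, 192)
print(cost_lakh)  # (288, 960)
```

Under those assumptions the exposure is on the order of 144 to 192 lost hours a year, or roughly 2.9 to 9.6 crore rupees.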

Why Traditional Maintenance Falls Short

The mill had been running two conventional maintenance strategies side by side, as most paper mills do.

Run-to-failure was applied to lower-criticality equipment. The logic is economically reasonable for cheap, easily swapped components, but it migrates over time to critical assets simply because no one has drawn a clear boundary. The result is catastrophic stops on equipment that was never supposed to be managed reactively.

Fixed-interval preventive maintenance (PM) was applied to critical rotating equipment. Every 90 days — or 1000 operating hours — bearings were greased, alignments were checked, and belts were tensioned. This approach sounds rigorous, but it has a fundamental flaw: equipment does not degrade on a calendar. A bearing running in a clean, well-lubricated, thermally stable environment may be perfectly healthy at 1000 hours. The same bearing running hot, contaminated, and slightly misaligned may be days from failure at 400 hours. Fixed-interval PM spends money replacing healthy parts and still misses the failures it was designed to prevent.

The real cost of fixed-interval PM isn't just the parts you replace unnecessarily — it's the induced failures you cause when you disturb a running machine, and the false confidence that the next inspection is 90 days away when a bearing is already in distress.

Neither strategy was giving the maintenance team the one thing they needed: enough warning, early enough, to schedule an intervention on their own terms.

What We Deployed

The solution architecture has five layers, each feeding the next.

IIoT Sensing: Getting the Right Signals

We instrumented fifteen critical assets across the dryer section, refiner drives, and vacuum system. Each asset received:

Triaxial MEMS accelerometers mounted directly on bearing housings, sampling at 10–25 kHz to capture the high-frequency envelope signals associated with bearing spall and gear mesh. Cheap vibration switches that trigger on gross amplitude are not sufficient here — you need frequency-domain data.
RTD temperature sensors on bearing caps and motor frames, polled every 60 seconds.
Non-intrusive motor current signature analysis (MCSA) clamps on the incoming power leads of each drive motor. Current draw carries mechanical fault signatures through the electro-mechanical coupling of the motor — rotor eccentricity, impeller imbalance, developing gear tooth wear — without any physical contact with the rotating shaft.
Acoustic emission sensors on high-speed gearboxes, capturing ultrasonic stress-wave events that precede visible wear.

Total: 47 sensing channels across 15 assets.

Edge Gateway: Processing Close to the Machine

Raw vibration data at 25 kHz produces far too much volume to push continuously to a cloud system. An edge gateway sitting in the motor control centre performs on-device FFT, computes envelope spectra, extracts 38 time-domain and frequency-domain features per channel, and transmits a compact feature vector every 5–15 minutes. This keeps OT network bandwidth low, reduces latency, and means the system keeps functioning during WAN outages.
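To make the idea of a compact feature vector concrete, here is a minimal sketch of edge-side feature extraction in plain Python. The 38 features computed by the gateway are not listed in this article, so the four time-domain features below are simply common condition-monitoring choices, and the function names are hypothetical. In place of a full FFT, the sketch uses the Goertzel algorithm, which evaluates a single frequency bin and is a reasonable stand-in when only a few fault frequencies matter. Envelope spectra are not shown.

```python
import math


def time_domain_features(samples):
    """RMS, peak, crest factor and kurtosis of one vibration window."""
    n = len(samples)
    mean = sum(samples) / n
    centred = [s - mean for s in samples]  # remove DC offset first
    rms = math.sqrt(sum(c * c for c in centred) / n)
    peak = max(abs(c) for c in centred)
    m4 = sum(c ** 4 for c in centred) / n  # fourth central moment
    return {
        "rms": rms,
        "peak": peak,
        "crest_factor": peak / rms if rms else 0.0,
        "kurtosis": m4 / rms ** 4 if rms else 0.0,
    }


def goertzel_amplitude(samples, sample_rate_hz, target_hz):
    """Amplitude of the DFT bin nearest target_hz (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * target_hz / sample_rate_hz)  # nearest bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
    return 2.0 * math.sqrt(max(power, 0.0)) / n


# Synthetic check: a 120 Hz tone of amplitude 0.5, sampled at 10 kHz.
fs, n = 10_000, 1_000
window = [0.5 * math.sin(2 * math.pi * 120 * i / fs) for i in range(n)]
features = time_domain_features(window)
features["amp_120hz"] = goertzel_amplitude(window, fs, 120)
print({name: round(value, 3) for name, value in features.items()})
```

A 1,000-sample window collapses to five numbers here, which is the kind of reduction that makes a transmission every few minutes practical.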

Secure OT/IT Data Path

The mill's operational technology (OT) network — PLC and DCS traffic — was isolated from the corporate IT network by a hardware firewall with a one-way data diode on the data feed into the analytics system. Sensor data flows outbound from OT to IT; no inbound connection crosses the boundary. This architecture satisfies both the mill's OT security requirements and the IT team's concerns about opening the DCS to external connectivity.
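One practical consequence of a one-way path is that the sender never receives an acknowledgement, so the receiving side has to detect corruption and gaps on its own. The sketch below shows one way to do that with a sequence number and a CRC32 checksum. The frame layout and function names are hypothetical and are not the protocol used at the mill.

```python
import struct
import zlib

# Hypothetical header: sequence number, asset id, feature count (network byte order).
HEADER = struct.Struct("!IHH")


def encode_frame(seq, asset_id, features):
    """Pack one feature vector and append a CRC32 of everything before it."""
    payload = HEADER.pack(seq, asset_id, len(features))
    payload += struct.pack(f"!{len(features)}f", *features)
    return payload + struct.pack("!I", zlib.crc32(payload))


def decode_frame(frame):
    """Verify the checksum and return (seq, asset_id, features)."""
    payload = frame[:-4]
    (crc,) = struct.unpack("!I", frame[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    seq, asset_id, count = HEADER.unpack_from(payload)
    features = struct.unpack_from(f"!{count}f", payload, HEADER.size)
    return seq, asset_id, list(features)


def missing_sequences(last_seen, seq):
    """Sequence numbers lost between two frames that did arrive."""
    return list(range(last_seen + 1, seq))


frame = encode_frame(7, 3, [0.5, 1.25, 3.0])
print(decode_frame(frame))       # (7, 3, [0.5, 1.25, 3.0])
print(missing_sequences(7, 10))  # [8, 9]
```

Because retransmission cannot be requested, the receiver can only log the gaps; the analytics layer then has to tolerate missing intervals.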

ML Anomaly Detection and Remaining Useful Life (RUL) Models

With twelve weeks of baseline sensor data collected under known-good conditions, we trained two complementary model families:

Unsupervised anomaly detection (isolation forest + autoencoder) flags when the multivariate feature vector for an asset deviates from its learned healthy baseline. This catches novel fault modes the model has never seen before — the unknown unknowns.
Supervised RUL regression models, trained on historical failure event data and vibration trajectories from the baseline period, estimate how many operating hours remain before intervention is required. These give the maintenance planner a confidence interval, not just a binary alarm.
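As a much simpler illustration of the RUL idea, the sketch below fits a least-squares line to a single degradation indicator and extrapolates it to a failure threshold. This is a stand-in, not the supervised regression described above: it returns a point estimate with no confidence interval, and the indicator, threshold, and function name are assumptions made for the example.

```python
def estimate_rul_hours(hours, indicator, failure_threshold):
    """Extrapolate a linear trend in `indicator` to `failure_threshold`.

    Returns remaining operating hours measured from the last sample,
    or None if the indicator is not trending upward.
    """
    n = len(hours)
    mean_h = sum(hours) / n
    mean_y = sum(indicator) / n
    sxx = sum((h - mean_h) ** 2 for h in hours)
    sxy = sum((h - mean_h) * (y - mean_y) for h, y in zip(hours, indicator))
    if sxx == 0:
        return None  # all samples at the same operating hour
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or improving trend: no failure predicted
    intercept = mean_y - slope * mean_h
    hours_at_threshold = (failure_threshold - intercept) / slope
    return max(hours_at_threshold - hours[-1], 0.0)


# Vibration RMS (mm/s) rising 0.5 per 100 operating hours, limit assumed at 4.5.
rul = estimate_rul_hours([0, 100, 200, 300], [1.0, 1.5, 2.0, 2.5], 4.5)
print(round(rul, 1))  # 400.0
```

Real bearing degradation is rarely linear, which is one reason a trained regression model with an uncertainty estimate is preferable in production.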

Both model families retrain on a rolling 90-day window so they adapt to seasonal changes in humidity, ambient temperature, and production grade.
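The combination of a learned healthy baseline and a rolling window can be sketched with a simple per-feature z-score. This is deliberately not an isolation forest or an autoencoder; it is a minimal stand-in that shows the mechanics. The class name is hypothetical, and the window is counted in feature vectors rather than days.

```python
import math
from collections import deque


class RollingBaseline:
    """Per-feature mean and standard deviation over the latest healthy vectors."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest vector once full.
        self.history = deque(maxlen=window)

    def update(self, vector):
        self.history.append(list(vector))

    def score(self, vector):
        """Largest absolute z-score across features (0.0 until a baseline exists)."""
        if len(self.history) < 2:
            return 0.0
        worst = 0.0
        for i, value in enumerate(vector):
            column = [row[i] for row in self.history]
            mean = sum(column) / len(column)
            var = sum((c - mean) ** 2 for c in column) / (len(column) - 1)
            std = math.sqrt(var) or 1e-9  # guard against a constant feature
            worst = max(worst, abs(value - mean) / std)
        return worst


# Feature vectors here are [vibration RMS in mm/s, bearing temperature in deg C].
baseline = RollingBaseline(window=4)
for healthy in ([1.0, 40.0], [1.2, 41.0], [0.8, 39.0], [1.0, 40.0]):
    baseline.update(healthy)

print(baseline.score([1.1, 40.5]) < 3.0)  # True: within normal scatter
print(baseline.score([3.0, 40.0]) > 3.0)  # True: vibration far outside baseline
```

A per-feature score like this misses faults that only show up as an unusual combination of otherwise normal values, which is what the multivariate models are for.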

Real-Time OEE Dashboard and Early-Warning Alerts

The output surfaces in two places. A maintenance supervisor dashboard shows live asset health scores (0–100), trend charts for each KPI, active anomaly flags with severity, and estimated RUL values. A work-order integration pushes a draft corrective work order into the CMMS automatically when an asset's health score drops below a configurable threshold — with the sensor evidence attached so the technician arrives informed, not just alerted.
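The threshold logic behind the work-order integration can be sketched as follows. The mapping from anomaly score to a 0 to 100 health score, the default threshold of 60, and the work-order fields are all assumptions made for illustration; no real CMMS API is shown.

```python
def health_score(anomaly_score, worst_case=10.0):
    """Map an anomaly score (0 = healthy) onto a 0-100 health scale."""
    clipped = min(max(anomaly_score, 0.0), worst_case)
    return round(100.0 * (1.0 - clipped / worst_case))


def draft_work_order(asset_id, score, evidence, threshold=60):
    """Return a draft work-order dict, or None if the asset is at or above threshold."""
    if score >= threshold:
        return None
    return {
        "asset_id": asset_id,
        "status": "DRAFT",
        "health_score": score,
        "evidence": evidence,  # sensor readings attached for the technician
    }


print(health_score(2.5))  # 75
print(draft_work_order("dryer-07", health_score(2.5), {}))  # None
order = draft_work_order("dryer-07", health_score(7.0), {"rms_mm_s": 4.1})
print(order["health_score"])  # 30
```

Keeping the order in a draft state leaves the final decision with the maintenance planner rather than the model.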

Alerts follow a three-tier escalation: an early-warning notification 2–4 weeks before predicted failure, a plan-now flag at 1–2 weeks, and an act-this-week critical alert. The goal is that by the time an asset reaches the critical tier, the spare part is already on the shelf and a maintenance window is already in the schedule.
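The three tiers map naturally onto a small lookup on predicted time to failure. In this sketch the input is expressed in calendar days, and the handling of the exact boundaries (for example, whether 14 days counts as plan-now) is an assumption; the function name is hypothetical.

```python
def alert_tier(rul_days):
    """Return the escalation tier for a predicted time to failure in days."""
    if rul_days is None or rul_days > 28:
        return None  # more than four weeks out: no alert yet
    if rul_days > 14:
        return "early-warning"  # two to four weeks
    if rul_days > 7:
        return "plan-now"  # one to two weeks
    return "act-this-week"  # critical


for days in (40, 21, 10, 3):
    print(days, alert_tier(days))
```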

The Results

The figures below are representative of the engagement; exact numbers are held in confidence.

Unplanned downtime from rotating-equipment failures decreased by approximately 35–40% over the twelve months following full deployment, compared with the prior twelve months.
Mean time between failures (MTBF) on instrumented assets increased by roughly 1.8×, primarily because early interventions were catching faults before they propagated to secondary damage on adjacent components.
Broke and off-spec scrap attributable to sudden grade-upset from equipment trips fell by around 20%, since most interventions now happen in a planned maintenance window at the end of a production run rather than mid-reel.
The maintenance spend mix shifted from approximately 70% reactive (30% planned) to approximately 55% planned (45% reactive) within eight months of go-live. The full shift to condition-based maintenance is ongoing.
Three specific failure events — two bearing faults and one refiner gearbox — were caught and resolved at planned cost; post-incident analysis showed that undetected, each would have resulted in a 10–18 hour forced stop.

How This Generalises to Your Plant

Papermaking is a particularly demanding use case because the environment is wet, hot, and chemically aggressive, and the rotating equipment is both physically large and often inaccessible during operation. If the sensor-to-AI-to-dashboard architecture works here, it should transfer to a wide range of process and discrete manufacturing settings.

The pattern is identical for sugar mills (diffuser drives, centrifuge motors, boiler feed pumps), automotive component plants (transfer line spindles, press eccentric drives, coolant pump trains), and chemical and pharma process plants (agitator drives, compressor trains, cooling tower fans). The specific fault signatures differ, but the underlying physics is universal: vibration, temperature, and current act as proxies for internal mechanical state.

The critical enabler is not the AI model. It is the quality and resolution of the sensor data and the discipline to collect a meaningful baseline before attempting to train. Mills that skip the baseline period and go straight to deployment typically get either too many false alarms or too few real ones.

Where to Start

The engagement model that delivered the results above followed three stages, and we recommend the same sequence for any new plant:

Asset criticality audit (two to three weeks). Map every rotating asset against two axes: consequence of failure (production impact, safety risk) and current maintenance strategy. This identifies the ten to twenty assets where condition monitoring delivers the highest return, and avoids instrumenting assets where run-to-failure is actually the right answer.
Instrumented pilot on one production line (eight to twelve weeks). Deploy sensors on the five to eight highest-priority assets, collect baseline data, validate the anomaly-detection models against any events that occur during the pilot, and build the dashboard and work-order integration. At the end of this phase you have a working system and a quantified business case for the rest of the plant.
Plant-wide rollout and model maturation (ongoing). Expand asset coverage, connect additional data sources (process historian, quality data), and tune RUL models as the failure-event dataset grows. Condition-based maintenance improves with data; the system gets more accurate the longer it runs.
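The first stage, the criticality audit, amounts to scoring and ranking assets. A minimal sketch is shown below; the 1 to 5 scales, the strategy weights, and the example assets are hypothetical and would be agreed with the plant's maintenance team in practice.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    production_impact: int  # 1 (minor) to 5 (full line stop)
    safety_risk: int        # 1 (negligible) to 5 (severe)
    strategy: str           # current maintenance strategy


# Assumed weights: the less the current strategy sees, the bigger the gap.
STRATEGY_GAP = {"run-to-failure": 3, "fixed-interval": 2, "condition-based": 1}


def criticality(asset):
    """Consequence of failure multiplied by the current strategy gap."""
    consequence = max(asset.production_impact, asset.safety_risk)
    return consequence * STRATEGY_GAP[asset.strategy]


def shortlist(assets, top_n):
    """The top_n assets where condition monitoring should pay back first."""
    return sorted(assets, key=criticality, reverse=True)[:top_n]


assets = [
    Asset("dryer drive bearing", 5, 3, "fixed-interval"),
    Asset("refiner gearbox", 5, 2, "run-to-failure"),
    Asset("roof exhaust fan", 1, 1, "run-to-failure"),
]
print([a.name for a in shortlist(assets, 2)])
# ['refiner gearbox', 'dryer drive bearing']
```

Low-scoring assets such as the exhaust fan in this example are the ones where run-to-failure remains the right answer.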

If you want to understand what an asset criticality audit would look like for your specific equipment and process, that is exactly the conversation we should have.