When AI Scores Sleep Like an Expert: Validating BIOSerenity A.I. SOMNO on 371 Polysomnographies

Sleep

2026

Author: Guillaume JUBIEN

How close can AI get to the trained eye of a sleep expert? We ran an internal clinical study to find out: BIOSerenity A.I. SOMNO, our end-to-end AI for automated polysomnography scoring, was evaluated on 371 recordings from 8 French sleep centers. For most of the clinical metrics used to diagnose sleep disorders, the results are statistically indistinguishable from those produced by human experts.

Intro

Polysomnography (PSG) is the gold standard for diagnosing sleep disorders, but manual scoring is slow, expensive, and varies from one scorer to the next. We built BIOSerenity A.I. SOMNO to handle the whole report from start to finish: sleep stages, respiratory events, arousals, limb movements, desaturations, and all the clinical metrics that follow. In a recent internal clinical study, we benchmarked it against expert scoring on a large real-world dataset.

In short:

BIOSerenity A.I. SOMNO is a fully automated AI pipeline for analyzing overnight polysomnography. We trained it on 2,375 PSG recordings and tested it on an independent set of 371 recordings collected from 8 expert sleep centers in France. Overall accuracy reached 93.5% for sleep stage classification, and went above 96% for arousal, hypopnea, and desaturation detection. For the clinical metrics that doctors rely on (Apnea-Hypopnea Index, Total Sleep Time, Sleep Efficiency, Wake After Sleep Onset), we found no statistically significant difference between AI and expert scoring. Performance also held up across patients with metabolic, neurological, psychiatric, pulmonary, and cardiac comorbidities. The sleep staging module will soon be partially integrated into our Playback platform, bringing automated sleep stage analysis into day-to-day clinical work.

Why automated PSG scoring matters

Sleep affects almost every part of how we function, from cardiovascular regulation to memory consolidation and emotional balance. When sleep goes wrong, the clinical picture is rarely straightforward. Close to a third of adults report sleep complaints, and conditions like obstructive sleep apnea, insomnia, or restless legs syndrome each call for a careful workup. Polysomnography remains the reference exam for that workup, but it produces hours of multichannel signal that someone has to score by hand.

That manual scoring step is the bottleneck. It is slow, expensive, and prone to fatigue. Different scorers can also disagree on the same recording, especially on tricky events like arousals and hypopneas. Automating part of this work with reliable AI is no longer just a research goal. It has direct consequences for how many patients a sleep center can handle in a week, and for how consistent its reports are across nights and across centers.

What we built: BIOSerenity A.I. SOMNO

BIOSerenity A.I. SOMNO combines deep-learning models and rule-based signal-processing modules to score a full night of PSG from start to finish, in line with the AASM v3 guidelines. The system handles:

Sleep stage classification (Wake, N1, N2, N3, REM), based on a many-to-many architecture inspired by RobustSleepNet.
Arousal detection from EEG, EOG, and EMG signals.
Apnea and hypopnea detection from nasal pressure, RIP belts, SpO₂, and PPG, with several model variants selected automatically depending on signal quality.
Apnea subtype classification (obstructive, central, mixed).
Periodic limb movement detection from EMG envelopes.
Oxygen desaturation detection from SpO₂.
Computation of all standard derived sleep metrics: AHI, ODI, PLMI, Arousal Index, Total Sleep Time, Sleep Efficiency, sleep stage durations, Hypoxic Burden, and the Sleep Breathing Impairment Index.
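As an illustration of the last item, here is a minimal Python sketch of how a few of these derived metrics follow from a hypnogram and event counts, using the standard definitions (30-second epochs, AHI as events per hour of sleep, WASO as time in bed minus sleep latency minus total sleep time). The function name, the label strings, and the dictionary keys are hypothetical and do not describe the A.I. SOMNO implementation.

```python
EPOCH_SECONDS = 30  # standard scoring epoch length
SLEEP_STAGES = {"N1", "N2", "N3", "REM"}  # label strings are an assumption


def derived_metrics(hypnogram, n_apneas, n_hypopneas):
    """Compute basic sleep metrics from one stage label per epoch.

    hypnogram is assumed to span lights-off to lights-on.
    """
    is_sleep = [stage in SLEEP_STAGES for stage in hypnogram]
    tib_min = len(hypnogram) * EPOCH_SECONDS / 60   # time in bed
    tst_min = sum(is_sleep) * EPOCH_SECONDS / 60    # total sleep time

    if tst_min == 0:
        # No sleep scored: latency, WASO and AHI are undefined.
        return {"tib_min": tib_min, "tst_min": 0.0, "sol_min": None,
                "waso_min": None, "sleep_efficiency_pct": 0.0, "ahi": None}

    sol_min = is_sleep.index(True) * EPOCH_SECONDS / 60  # sleep onset latency
    waso_min = tib_min - sol_min - tst_min               # wake after sleep onset
    return {
        "tib_min": tib_min,
        "tst_min": tst_min,
        "sol_min": sol_min,
        "waso_min": waso_min,
        "sleep_efficiency_pct": 100 * tst_min / tib_min,
        "ahi": (n_apneas + n_hypopneas) / (tst_min / 60),  # events per hour of sleep
    }


if __name__ == "__main__":
    # Synthetic 8-hour night, for illustration only.
    night = (["W"] * 20 + ["N1"] * 10 + ["N2"] * 400 + ["W"] * 30
             + ["N3"] * 200 + ["REM"] * 240 + ["W"] * 60)
    print(derived_metrics(night, n_apneas=50, n_hypopneas=35))
```

Because every metric is a deterministic function of the scored stages and events, a disagreement between AI and expert on a metric can always be traced back to differences in the underlying annotations.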

BIOSerenity A.I. SOMNO is being developed under the European Medical Device Regulation (MDR 2017/745, Class IIa) and the IEC 62304 Class B standard. It is not CE-marked yet. The sleep staging module is about to be partially integrated into Playback, so BIOSerenity users will soon have automated sleep stage analysis right inside their usual workflow.

How we tested it

This is a retrospective, multicenter, non-interventional validation study run across 8 BIOSerenity sleep centers in France. We started from a database of about 20,000 PSG studies recorded between May 2020 and April 2024 in routine clinical practice. After quality control and consent verification, two datasets came out of it:

Training set: 2,375 PSG recordings.
Test set: 371 PSG recordings, all scored by one of three senior expert technicians who supervise scoring across the BIOSerenity network.

Sleep stage classification was evaluated epoch by epoch. Sleep-related events (arousals, apneas, hypopneas, desaturations, PLMs) were analyzed using a temporal-overlap criterion. We then compared the clinical metrics computed by the AI to those derived from expert annotations with paired t-tests, and ran two subgroup analyses to see whether performance held up across comorbidities and across clinical diagnoses (normal PSG, insomnia, sleep-disordered breathing).
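To make the temporal-overlap criterion concrete, the sketch below matches predicted events to expert events one-to-one whenever their time intervals intersect. The exact rule used in the study (for example a minimum overlap duration or ratio) is not stated here, so "any overlap" is an assumption, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) intervals in seconds share any time."""
    return a[0] < b[1] and b[0] < a[1]


def match_events(predicted, reference):
    """Greedy one-to-one matching of predicted events to expert events.

    Returns (true_positives, false_positives, false_negatives).
    Assumption: any non-zero overlap counts as a match.
    """
    predicted = sorted(predicted)
    reference = sorted(reference)
    used = set()  # indices of expert events already matched
    tp = 0
    for p in predicted:
        for i, r in enumerate(reference):
            if i not in used and overlaps(p, r):
                used.add(i)
                tp += 1
                break
    return tp, len(predicted) - tp, len(reference) - tp


if __name__ == "__main__":
    ai_events = [(10, 25), (100, 115), (300, 320)]      # synthetic
    expert_events = [(12, 30), (200, 215), (310, 330)]  # synthetic
    tp, fp, fn = match_events(ai_events, expert_events)
    print(f"TP={tp} FP={fp} FN={fn} sensitivity={tp / (tp + fn):.2f}")
```

Matching each expert event at most once keeps a single long prediction from being credited for several separate expert events.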

The study is registered with the French Health Data Hub (project number 213760051). It was conducted in compliance with GDPR, the French Data Protection Act, the MDR, and the principles of the Declaration of Helsinki.

What we found

SLEEP STAGES

Overall accuracy for sleep stage classification reached 93.5% (95% CI 93.2 to 93.8), with 83.8% sensitivity and 96.0% specificity. By stage: 96.7% on Wake, 91.4% on N1, 87.8% on N2, 95.2% on N3, 96.5% on REM.

As shown in the table below, N1 is the hardest stage to classify, which is expected. It is also the stage on which human scorers agree the least.
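For readers who want to see how per-stage figures of this kind are typically obtained from epoch-by-epoch labels, here is a minimal one-vs-rest sketch in Python. That the study used one-vs-rest counting is an assumption: the text does not say so, although the five by-stage accuracies above average to 93.5%, which is consistent with that reading. The function name and the toy labels are hypothetical.

```python
STAGES = ["Wake", "N1", "N2", "N3", "REM"]


def one_vs_rest(expert, ai, stage):
    """Accuracy, sensitivity and specificity for one stage against all others."""
    pairs = list(zip(expert, ai))
    tp = sum(e == stage and a == stage for e, a in pairs)
    tn = sum(e != stage and a != stage for e, a in pairs)
    fp = sum(e != stage and a == stage for e, a in pairs)
    fn = sum(e == stage and a != stage for e, a in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
    }


if __name__ == "__main__":
    # Ten synthetic epochs, for illustration only.
    expert = ["Wake", "Wake", "N1", "N2", "N2", "N2", "N3", "N3", "REM", "REM"]
    ai = ["Wake", "N1", "N2", "N2", "N2", "N2", "N3", "N2", "REM", "REM"]
    for stage in STAGES:
        print(stage, one_vs_rest(expert, ai, stage))
```

With this kind of counting, the accuracy of an infrequent stage is dominated by true negatives, so sensitivity is usually the more telling number for a stage like N1.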

SLEEP EVENTS

Event detection held up across the board:

Arousals: 97.7% accuracy, specificity above 99%.
Apneas (overall): 93.5% accuracy.
Hypopneas: 96.4% accuracy.
Desaturations: 98.1% accuracy, 92.0% sensitivity.
Periodic limb movement series: 94.1% accuracy.

CLINICAL METRICS: THE KEY RESULT

This is probably the most important finding for clinical practice. For most of the metrics that doctors look at when they read a sleep study, there was no statistically significant difference between values computed by BIOSerenity A.I. SOMNO and those derived from expert annotations (all p > 0.05). That includes the Apnea-Hypopnea Index, the Oxygen Desaturation Index, Hypoxic Burden, the Sleep Breathing Impairment Index, Total Sleep Time, Sleep Efficiency, Sleep Onset Latency, REM Onset Latency, Wake After Sleep Onset, and time spent in N1 and N2.

Three metrics did show a statistically significant difference: PLM Index, Arousal Index, and REM duration. The gaps stay within clinically interpretable ranges, and they help us pinpoint where the algorithms can still be improved.
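The comparison behind these statements is a paired t-test: each recording contributes one AI value and one expert value for a given metric, and the test asks whether the mean of the per-recording differences is distinguishable from zero. The sketch below is a dependency-free illustration with made-up AHI values. In practice a library routine such as `scipy.stats.ttest_rel` would normally be used, and the function names here are hypothetical.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution (log-gamma avoids overflow)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def paired_t_test(a, b, steps=2000):
    """Two-sided paired t-test; returns (t, degrees_of_freedom, p_value).

    Needs at least two pairs and non-constant differences.
    The p-value comes from Simpson integration of the t density.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    df = n - 1
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3          # P(0 < T < |t|)
    p_value = max(0.0, 1.0 - 2.0 * central_mass)
    return t, df, p_value


if __name__ == "__main__":
    ai_ahi = [5.2, 12.1, 30.4, 8.0, 17.3, 2.1]      # synthetic values
    expert_ahi = [4.9, 12.8, 29.7, 8.4, 16.9, 2.5]  # synthetic values
    t, df, p = paired_t_test(ai_ahi, expert_ahi)
    print(f"t={t:.3f}, df={df}, p={p:.3f}")
```

A p-value above 0.05 means the test found no evidence of a systematic offset between AI and expert values. It does not by itself prove that the two are equivalent.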

DOES IT HOLD UP ACROSS PATIENT PROFILES?

To check that performance does not collapse outside the easy cases, we ran subgroup analyses on patients with metabolic, neurological, psychiatric, pulmonary, and cardiac comorbidities, and on patients grouped by clinical diagnosis (normal PSG, insomnia, sleep-disordered breathing). In almost every cell of that matrix, performance stayed within a 5% non-inferiority margin of the overall population. The few exceptions involve very rare events, like mixed apneas in small subgroups, where a handful of misclassifications is mathematically enough to move the metric.
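The margin check and the small-subgroup effect can both be shown with a few lines of arithmetic. In the sketch below the 5% margin is read as 5 absolute percentage points below the overall figure, which is an assumption, and all counts and function names are hypothetical.

```python
MARGIN_POINTS = 5.0  # assumed to be absolute percentage points


def accuracy_pct(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def is_non_inferior(subgroup_pct, overall_pct, margin=MARGIN_POINTS):
    """One-sided check: the subgroup may not fall more than `margin` below overall."""
    return subgroup_pct >= overall_pct - margin


if __name__ == "__main__":
    overall = 93.5  # overall sleep staging accuracy reported in this study

    # Hypothetical subgroup figures.
    print(is_non_inferior(90.1, overall))  # inside the margin
    print(is_non_inferior(87.0, overall))  # outside the margin

    # Why rare events are fragile: with 12 events, one extra miss
    # shifts accuracy by more than 8 points; with 1200 events, by under 0.1.
    small_shift = accuracy_pct(11, 12) - accuracy_pct(10, 12)
    large_shift = accuracy_pct(1100, 1200) - accuracy_pct(1099, 1200)
    print(f"small subgroup shift: {small_shift:.2f} points")
    print(f"large subgroup shift: {large_shift:.2f} points")
```

This is why an out-of-margin cell for a rare event in a small subgroup says more about the sample size than about the algorithm.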

Why this matters for clinical sleep medicine

Most existing AI tools in sleep focus on a single task: sleep staging, or arousal detection, or apnea detection. BIOSerenity A.I. SOMNO has been validated as a single end-to-end pipeline that produces every component of a sleep study report, and it has been validated not just on accuracy figures but on the clinical metrics doctors actually use to make a diagnosis.

In practice, an AI-assisted workflow can shorten the time needed to produce a sleep report, reduce scoring variability across centers, and free expert technicians and physicians to focus on the complex cases where their judgment really makes the difference.

What’s next

There is still work to do. The next steps include extending validation to pediatric recordings and to recordings from patients taking medications known to alter EEG patterns. BIOSerenity A.I. SOMNO will also be deployed in at-home video EEG monitoring solutions, and soon in intensive care units through long-duration inEEG monitoring. This continuum, from lab to home to ICU, ensures that improvements in our algorithms directly benefit patients and clinicians in real-world settings.

In parallel, we are exploring EEG foundation models to push sleep staging and arousal detection further, and we are expanding the range of detectable events to include, for example, RERA and Cheyne-Stokes breathing.

On the product side, the sleep staging module is about to be partially integrated into Playback, our data viewer, enabling automated sleep stage analysis within the BIOSerenity ecosystem. Regulatory efforts are also underway to bring the full A.I. SOMNO system to CE marking.

References

Study registration (Health Data Hub, France): https://www.health-data-hub.fr/projets (project number 213760051).

AASM Manual for the Scoring of Sleep and Associated Events, version 3.0 (American Academy of Sleep Medicine, 2023).

Guillot A. and Thorey V., RobustSleepNet: Transfer Learning for Automated Sleep Staging at Scale, IEEE Trans. Neural Syst. Rehabil. Eng., 2021.

Full clinical study report (internal, BIOSerenity Medical Devices Group, 2026). A peer-reviewed publication based on these results is being prepared.