Feb 26, 2026

Peter Busk

AI and GxP: How to validate machine learning models

Introduction

"Can we even use AI in a GxP environment?" That is a question we often encounter at Hyperbolic when we work with pharma and medicinal companies. The answer is yes, but it requires a fundamentally different approach to validation than traditional software.

Machine learning models are not deterministic like classic software. They learn from data, and their output can change over time. This creates unique challenges in regulated environments where validation, traceability, and reproducibility are legal requirements.

Why is ML validation different?

Traditional software is validated by verifying that it does exactly what it is coded to do. An algorithm that calculates a dose will always give the same output for the same input. But an ML model:

  • Learns patterns from training data

  • Can produce different results upon retraining

  • Has built-in uncertainty

  • Changes when data changes

This means that classic Software Development Life Cycle (SDLC) validation is not sufficient. We need to think in terms of "Model Lifecycle Management."

Regulatory landscape

FDA, EMA, and other authorities are still developing specific guidelines for AI/ML. But the existing rules still apply in full:

  • 21 CFR Part 11: Electronic records and signatures

  • EU GMP Annex 11: Computerized systems

  • GAMP 5: Good Automated Manufacturing Practice

At Hyperbolic, we operate on the principle: AI systems in GxP must meet the same requirements for quality, safety, and data integrity as all other software, plus additional requirements specifically for ML.

Validation framework for ML in GxP

Phase 1: Data Governance and Qualification

Everything starts with data. In GxP, it's not enough to have a lot of data; it needs to be qualified data.

Data lineage: Document exactly where data comes from. Which systems? Which processes? How was it collected?

Data quality: Validate that the data is:

  • Complete (no critical gaps)

  • Correct (validated against source systems)

  • Consistent (no conflicts or duplication)

  • Current (updated according to requirements)
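The four checks above can be automated at batch intake. The sketch below is a minimal, hand-rolled illustration (the article's recommended tool for this is Great Expectations); `qualify_batch` and the sample columns are hypothetical:

```python
import pandas as pd

def qualify_batch(df: pd.DataFrame, required_cols: list[str]) -> list[str]:
    """Return a list of data-quality findings; an empty list means the batch passes."""
    findings = []
    # Complete: required columns present, no critical gaps
    for col in required_cols:
        if col not in df.columns:
            findings.append(f"missing column: {col}")
        elif df[col].isna().any():
            findings.append(f"gaps in column: {col}")
    # Consistent: no duplicated records
    if df.duplicated().any():
        findings.append("duplicate records found")
    return findings

batch = pd.DataFrame({"batch_id": ["B1", "B2"], "assay": [99.1, None]})
print(qualify_batch(batch, ["batch_id", "assay"]))  # → ['gaps in column: assay']
```

In practice each finding would be logged to the quality system rather than printed, and "correct" and "current" require checks against the source systems themselves.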

Data splitting: Document how data is divided into training, validation, and test. This split must be reproducible and traceable.
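One way to make the split reproducible and traceable is to derive it deterministically from each record's identifier instead of from a random-number generator whose state can be lost. A sketch, assuming hashed record IDs; the `assign_split` helper and the 70/15/15 ratios are illustrative, not prescribed by any guideline:

```python
import hashlib

def assign_split(record_id: str, train: float = 0.70, val: float = 0.15) -> str:
    """Deterministically map a record ID to train/val/test via a hash.

    The split can be re-derived at any time from the IDs alone, which makes
    it fully reproducible and auditable."""
    bucket = int(hashlib.sha256(record_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"
```

Because the assignment depends only on the ID, adding new records later never reshuffles existing ones into a different partition.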

In a project for a pharmaceutical company, we established complete data lineage from production systems through cleansing to training data. Each transformation was documented and validated.

Phase 2: Model Development and Documentation

Requirements Specification: What should the model be able to do? Define clearly:

  • Problem formulation

  • Acceptable accuracy levels

  • Performance requirements

  • Safety requirements

Model Selection: Document why this specific model type was chosen. We typically compare 3-5 different approaches and document the rationale for the choice.

Hyperparameter tuning: All tuning must be traceable. We log all experiments with MLflow or similar tools, so it is documented how we arrived at the final configuration.
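The core idea — every trial, its parameters, and its metrics land in an append-only record — can be shown without the full MLflow stack. This is a deliberately minimal JSON-lines stand-in for the tracking that MLflow provides; `log_experiment` is a hypothetical helper:

```python
import json
import time

def log_experiment(path: str, params: dict, metrics: dict) -> dict:
    """Append one tuning run to an append-only JSON-lines log.

    A stand-in for MLflow tracking: the run's hyperparameters and metrics
    are recorded so the path to the final configuration stays traceable."""
    entry = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

In a real setup the log also captures the code version and data hash for each run, and lives in a system with access control and audit trails.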

Phase 3: Validation and Testing

Here, GxP validation really differs from standard ML practices.

Testing on independent data: Test data must NEVER have been seen during training or tuning. In GxP, we often require a "locked" test set that is only opened when the model is completed.
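A "locked" test set can be enforced technically, not just procedurally: fingerprint the frozen data at lock time and record the hash in the validation file. A sketch; `lock_test_set` is an illustrative helper, not a standard API:

```python
import hashlib
import json

def lock_test_set(records: list[dict]) -> str:
    """Compute a fingerprint of the frozen test set.

    The hash is recorded in the validation documentation when the set is
    locked; any later modification of the data changes the hash and is
    therefore detectable at unlock time."""
    canonical = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Verifying the same hash just before final testing gives documented evidence that the test data was untouched during development.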

Performance qualification: Define acceptance criteria in advance. Example:

  • Minimum accuracy: 95%

  • Maximum false negative rate: 2%

  • Performance must be stable across different batches
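Predefined criteria like these translate directly into an automated gate that either qualifies the model or lists exactly which criteria failed. A sketch using the example thresholds above; `performance_qualified` is an illustrative name:

```python
def performance_qualified(metrics: dict) -> tuple[bool, list[str]]:
    """Check observed metrics against pre-defined acceptance criteria.

    Thresholds mirror the example criteria in the text: accuracy >= 95%,
    false negative rate <= 2%. Returns (passed, list of failures)."""
    failures = []
    if metrics["accuracy"] < 0.95:
        failures.append(f"accuracy {metrics['accuracy']:.3f} below 0.95")
    if metrics["false_negative_rate"] > 0.02:
        failures.append(f"false negative rate {metrics['false_negative_rate']:.3f} above 0.02")
    return (not failures, failures)
```

The point of writing the gate in code is that the criteria cannot quietly shift after results are seen: the thresholds are versioned with the validation plan.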

Edge case testing: Test the model on:

  • Outliers and extreme values

  • Missing data

  • Data outside the training distribution

  • Known failure modes
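Edge-case behavior is easiest to defend when the deployed wrapper refuses bad input rather than silently extrapolating, and when that refusal is itself tested. A hedged sketch — `predict`, its training range of 0–100, and the pass/fail rule are all hypothetical:

```python
import math

def predict(value):
    """Hypothetical model wrapper that rejects out-of-distribution input
    instead of extrapolating (training range assumed to be 0-100)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        raise ValueError("missing input")
    if not 0 <= value <= 100:
        raise ValueError("input outside training distribution")
    return "pass" if value >= 50 else "fail"

# Edge-case checks, run as part of the validation protocol: each known bad
# input must be rejected, never scored.
for bad in (None, float("nan"), -5, 1e6):
    try:
        predict(bad)
        raise AssertionError(f"{bad!r} was not rejected")
    except ValueError:
        pass
```

Encoding the rejections as executable checks means they rerun automatically on every model revision, not just during initial validation.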

Bias analysis: Document that the model does not have unacceptable bias. For a model screening clinical trial candidates, we tested performance across age, gender, and ethnicity to ensure there was no discrimination.

Phase 4: Deployment and Change Control

Versioning: Each model version must be uniquely identified. We version:

  • Model architecture

  • Training data (including exact split)

  • Hyperparameters

  • Dependencies (libraries and versions)
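The four versioned items above can be bundled into a single manifest whose content-derived hash becomes the unique version identifier. A sketch; `model_manifest` is an illustrative helper:

```python
import hashlib
import json

def model_manifest(architecture: str, data_hash: str,
                   hyperparams: dict, dependencies: dict) -> dict:
    """Bundle everything that defines a model version into one record and
    derive a unique version ID from its contents: change any ingredient
    and the ID changes with it."""
    manifest = {
        "architecture": architecture,
        "data_hash": data_hash,       # fingerprint of training data incl. split
        "hyperparams": hyperparams,
        "dependencies": dependencies,  # library names pinned to exact versions
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["version_id"] = hashlib.sha256(payload).hexdigest()[:12]
    return manifest
```

Tools like DVC provide this kind of content-addressed versioning out of the box; the sketch only shows the principle.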

Change control: Any change must go through formal change control. Even minor adjustments require:

  • Impact assessment

  • Testing

  • Approval

  • Documentation

Rollback plan: What do we do if the model fails in production? There should always be a plan to roll back to the previous version or to a manual process.

Phase 5: Continuous Monitoring

ML models are not "set and forget." In GxP, we require continuous monitoring.

Performance monitoring: Track continuously:

  • Prediction accuracy

  • Distribution of input data (data drift)

  • Distribution of outputs

  • Response times
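Data drift can be surfaced with even very simple statistics: compare incoming feature distributions against the training-time reference. The sketch below uses a standardized mean shift — a deliberately crude stand-in for the richer drift tests in tools like Evidently AI; `drift_score` and its alert threshold are illustrative:

```python
import statistics

def drift_score(reference: list[float], current: list[float]) -> float:
    """Standardized shift of the current batch mean vs. the training
    reference: how many reference standard deviations the incoming data
    has moved. Large values suggest data drift worth investigating."""
    mu = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return abs(statistics.mean(current) - mu) / sigma

reference = [10.0, 10.2, 9.8, 10.1, 9.9]   # feature values at training time
assert drift_score(reference, reference) < 0.5  # no drift against itself
```

A production setup would track this per feature, alert above an agreed threshold, and feed the alerts into the periodic review described below.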

Periodic review: Quarterly or semi-annual reviews where we verify that the model still performs as expected.

Retraining and revalidation: When should the model be retrained? Define clear criteria:

  • Performance falls below threshold

  • Significant data drift detected

  • New regulatory requirements

  • Changes in underlying processes

Practical challenges and solutions

Challenge: Explainability

Regulators often want to know "why" the model makes a decision. Deep learning models are notoriously difficult to explain.

Our approach:

  • Prefer explainable models where possible (decision trees, linear models)

  • For complex models: Implement SHAP or LIME to explain individual predictions

  • Document model behavior thoroughly through sensitivity analysis
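The core idea behind per-prediction explanation can be illustrated without the SHAP library itself: measure how the output changes when one feature at a time is reset to a baseline. This is only a crude sketch of the concept — SHAP computes principled, game-theoretic attributions — and the `attribution` helper and linear "model" are hypothetical:

```python
def attribution(model, x: dict, baseline: dict) -> dict:
    """Crude per-feature attribution: the change in model output when one
    feature is replaced by its baseline value. Conveys the idea behind
    explaining a single prediction; SHAP does this rigorously."""
    base_out = model(x)
    contrib = {}
    for feat in x:
        perturbed = {**x, feat: baseline[feat]}
        contrib[feat] = base_out - model(perturbed)
    return contrib

# A toy linear "model" so the attributions are easy to verify by hand
model = lambda r: 2 * r["dose"] + 0.5 * r["age"]
print(attribution(model, {"dose": 10, "age": 40}, {"dose": 0, "age": 0}))
```

For the linear model the attributions simply recover each term's contribution, which is exactly why we prefer inherently explainable models where possible.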

Challenge: Reproducibility

Being able to reproduce the exact same model is critical in GxP, but ML often involves randomness.

Our approach:

  • Set ALL random seeds and document them

  • Version control of everything (code, data, config)

  • Containerization (Docker) of the entire environment

  • Automated pipelines that ensure identical processes
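The seed-pinning step typically lives in one function called at the top of every training run, with the seed value itself recorded in the validation file. A sketch; `set_all_seeds` is an illustrative name, and the numpy/torch lines only apply if those libraries are in the environment:

```python
import os
import random

def set_all_seeds(seed: int = 42) -> None:
    """Pin every source of randomness used in training; the seed value is
    logged alongside the run so it can be reproduced later."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Note that seeds alone do not guarantee bit-identical results across hardware or library versions — which is why the containerized environment and pinned dependencies matter just as much.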

Challenge: Audit trails

GxP requires a complete audit trail of all changes and decisions.

Our approach:

  • Automatic logging of all model interactions

  • Integration with electronic QMS systems

  • 21 CFR Part 11 compliant signatures on critical decisions
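Automatic logging of model interactions is often implemented as a thin wrapper around the critical operations. A minimal sketch, assuming an in-memory log stands in for the QMS integration; `audited`, `release_batch`, and the captured fields are all illustrative:

```python
import functools
import json
import os
import time

AUDIT_LOG = []  # stand-in: in production an append-only, access-controlled store

def audited(fn):
    """Record who performed which action, with what inputs and outcome —
    a minimal sketch of the automatic audit logging described above."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        AUDIT_LOG.append({
            "timestamp": time.time(),
            "user": os.getenv("USER", "system"),
            "action": fn.__name__,
            "inputs": json.dumps([args, kwargs], default=str),
            "result": json.dumps(result, default=str),
        })
        return result
    return wrapper

@audited
def release_batch(batch_id: str) -> str:
    return f"{batch_id}: released"
```

A Part 11-compliant system additionally needs tamper evidence, secure timestamps, and electronic signatures on the critical decisions — the decorator only shows where the capture hooks in.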

Case: Validation of quality control model

We developed an AI model for automatic inspection of pharmaceutical tablets. This was a GAMP Category 5 system (custom application) with direct GxP impact.

Our approach:

  1. 6 months of data collection and qualification from the production line

  2. Selection of CNN architecture after comparing it with 4 alternative approaches (documented)

  3. Locked test set with 10,000 tablets manually verified by 3 independent inspectors

  4. Performance requirements: Min 99% accuracy, max 0.1% false negatives (defective tablets marked as OK)

  5. Complete validation documentation: IQ/OQ/PQ of 300+ pages

  6. Continuous monitoring with weekly performance reviews

Result: Model approved by QA, implemented in production, and has operated stably for over 18 months with consistent >99.5% accuracy.

Tools and best practices

MLOps for GxP:

  • MLflow for experiment tracking (with audit logging)

  • DVC for data and model versioning

  • Great Expectations for data validation

  • Evidently AI for monitoring data drift

  • SHAP/LIME for model explainability

Documentation templates: We have developed GxP-ready templates for:

  • ML Model Requirements Specification

  • ML Model Design Document

  • Validation Plan and Report

  • Change Control procedures for ML

Conclusion

AI and ML can absolutely be used in GxP environments, but it requires discipline, thorough documentation, and a structured approach to validation. It is not enough to have a model that "works"; it must be validated, reproducible, and continuously monitored.

At Hyperbolic, we combine a deep understanding of both AI/ML technology and GxP requirements. We help pharma companies navigate this complex landscape and implement AI solutions that deliver value while meeting regulatory requirements.

Contact us for a consultation on validating AI in your GxP environment.

By

Peter Busk

CEO & Partner