Feb 26, 2026
Peter Busk
AI and GxP: How to validate machine learning models
Introduction
"Can we even use AI in a GxP environment?" That is a question we often encounter at Hyperbolic when we work with pharma and medicinal companies. The answer is yes, but it requires a fundamentally different approach to validation than traditional software.
Machine learning models are not deterministic like classic software. They learn from data, and their output can change over time. This creates unique challenges in regulated environments where validation, traceability, and reproducibility are legal requirements.
Why is ML validation different?
Traditional software is validated by verifying that it does exactly what it is coded to do. An algorithm that calculates a dose will always give the same output for the same input. But an ML model:
Learns patterns from training data
Can produce different results upon retraining
Has built-in uncertainty
Changes when data changes
This means that classic Software Development Life Cycle (SDLC) validation is not sufficient. We need to think in terms of "Model Lifecycle Management."
Regulatory landscape
FDA, EMA, and other authorities are still developing specific guidelines for AI/ML. But the existing regulations apply in full:
21 CFR Part 11: Electronic records and signatures
EU GMP Annex 11: Computerized systems
GAMP 5: Good Automated Manufacturing Practice
At Hyperbolic, we operate on the principle: AI systems in GxP must meet the same requirements for quality, safety, and data integrity as all other software, plus additional requirements specifically for ML.
Validation framework for ML in GxP
Phase 1: Data Governance and Qualification
Everything starts with data. In GxP, it's not enough to have a lot of data; it needs to be qualified data.
Data lineage: Document exactly where data comes from. Which systems? Which processes? How was it collected?
Data quality: Validate that the data is:
Complete (no critical gaps)
Correct (validated against source systems)
Consistent (no conflicts or duplication)
Current (updated according to requirements)
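The completeness and consistency checks above can be automated. Below is a minimal, library-free sketch of such a qualification step; the field names are hypothetical, and correctness against source systems would be verified separately:

```python
def qualify_dataset(records: list, required_fields: set) -> dict:
    """Basic data-qualification checks: completeness and duplication.

    Returns counts of records failing each check. Correctness against
    source systems and data currency are verified in separate,
    system-specific steps.
    """
    incomplete = sum(
        1 for r in records
        if any(r.get(f) in (None, "") for f in required_fields)
    )
    seen, duplicates = set(), 0
    for r in records:
        key = tuple(sorted((f, r.get(f)) for f in required_fields))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"total": len(records), "incomplete": incomplete,
            "duplicates": duplicates}

# Hypothetical batch records: one duplicate, one with a missing value.
records = [
    {"batch_id": "B1", "value": 1.0},
    {"batch_id": "B1", "value": 1.0},
    {"batch_id": "B2", "value": None},
]
report = qualify_dataset(records, {"batch_id", "value"})
```

In practice a tool such as Great Expectations (mentioned under Tools below) replaces this hand-rolled version, but the principle is the same: checks are explicit, versioned, and produce an auditable report.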
Data splitting: Document how data is divided into training, validation, and test. This split must be reproducible and traceable.
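One way to make the split reproducible is to derive the partition from a hash of a stable record ID rather than a random shuffle. This is a sketch, assuming each record carries a hypothetical `BATCH-…` identifier:

```python
import hashlib

def assign_split(record_id: str, test_pct: int = 20) -> str:
    """Deterministically assign a record to 'train' or 'test'.

    Hashing a stable record ID (instead of random shuffling) means the
    same record always lands in the same partition, independent of row
    order, library version, or random state -- which makes the split
    reproducible and traceable.
    """
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 100 < test_pct else "train"

# Hypothetical batch IDs; the assignment is identical on every run.
records = [f"BATCH-{i:04d}" for i in range(1000)]
split = {rid: assign_split(rid) for rid in records}
```

Because the assignment is a pure function of the ID, documenting the hash scheme and the percentage is enough to reconstruct the exact split later.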
In a project for a pharmaceutical company, we established complete data lineage from production systems through cleansing to training data. Each transformation was documented and validated.
Phase 2: Model Development and Documentation
Requirements Specification: What should the model be able to do? Define clearly:
Problem formulation
Acceptable accuracy levels
Performance requirements
Safety requirements
Model Selection: Document why this specific model type was chosen. We typically compare 3-5 different approaches and document the rationale for the choice.
Hyperparameter tuning: All tuning must be traceable. We log all experiments with MLflow or similar tools, so it is documented how we arrived at the final configuration.
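MLflow handles this automatically via its tracking API; as a library-free illustration of the same idea, each tuning run can be appended to a tamper-evident logbook (the parameter and metric names here are made up for the example):

```python
import hashlib
import json
import time

def log_experiment(logbook: list, params: dict, metrics: dict) -> dict:
    """Append one tuning run to an audit log.

    Each entry gets a timestamp and a content hash, so later edits to
    the record are detectable -- a minimal stand-in for what an MLflow
    tracking server records automatically.
    """
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "params": params,
        "metrics": metrics,
    }
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    entry["content_hash"] = hashlib.sha256(payload).hexdigest()
    logbook.append(entry)
    return entry

runs = []
log_experiment(runs, {"learning_rate": 0.01, "max_depth": 6},
               {"val_accuracy": 0.952})
log_experiment(runs, {"learning_rate": 0.05, "max_depth": 4},
               {"val_accuracy": 0.948})
best = max(runs, key=lambda r: r["metrics"]["val_accuracy"])
```

The point is that the final configuration is not a claim but a query over the logged experiments.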
Phase 3: Validation and Testing
Here, GxP validation really differs from standard ML practices.
Testing on independent data: Test data must NEVER have been seen during training or tuning. In GxP, we often require a "locked" test set that is only opened when the model is completed.
Performance qualification: Define acceptance criteria in advance. Example:
Minimum accuracy: 95%
Maximum false negative rate: 2%
Performance must be stable across different batches
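Acceptance criteria like these can be fixed in code before the locked test set is opened, so the release decision is mechanical. A sketch with illustrative thresholds:

```python
# Hypothetical acceptance criteria, frozen in the validation plan
# *before* the locked test set is opened.
ACCEPTANCE_CRITERIA = {
    "accuracy":            ("min", 0.95),
    "false_negative_rate": ("max", 0.02),
}

def performance_qualification(metrics: dict) -> dict:
    """Compare measured test-set metrics against predefined criteria.

    Returns a per-criterion verdict plus an overall pass/fail; the
    model is only released if every criterion passes.
    """
    results = {}
    for name, (kind, limit) in ACCEPTANCE_CRITERIA.items():
        value = metrics[name]
        passed = value >= limit if kind == "min" else value <= limit
        results[name] = {"value": value, "limit": limit, "passed": passed}
    results["overall_pass"] = all(r["passed"] for r in results.values())
    return results

report = performance_qualification(
    {"accuracy": 0.971, "false_negative_rate": 0.014})
```

Stability across batches would be checked the same way, with one report per batch.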
Edge case testing: Test the model on:
Outliers and extreme values
Missing data
Data outside the training distribution
Known failure modes
Bias analysis: Document that the model does not have unacceptable bias. For a model screening clinical trial candidates, we tested performance across age, gender, and ethnicity to ensure there was no discrimination.
Phase 4: Deployment and Change Control
Versioning: Each model version must be uniquely identified. We version:
Model architecture
Training data (including exact split)
Hyperparameters
Dependencies (libraries and versions)
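One way to bind all four elements into a single identifiable version is a manifest that fingerprints each artifact. A sketch, with hypothetical artifact names and contents:

```python
import hashlib

def build_model_manifest(version: str, artifacts: dict,
                         hyperparams: dict, dependencies: dict) -> dict:
    """Bundle everything that identifies one model version.

    `artifacts` maps names (e.g. 'weights', 'train_split') to file
    contents as bytes; each is fingerprinted with SHA-256, so any
    change to data, weights, or config yields a distinct version.
    """
    return {
        "model_version": version,
        "artifact_hashes": {
            name: hashlib.sha256(blob).hexdigest()
            for name, blob in artifacts.items()
        },
        "hyperparameters": hyperparams,
        "dependencies": dependencies,
    }

# Hypothetical artifacts for version 1.2.0.
m1 = build_model_manifest(
    "1.2.0",
    {"weights": b"\x00\x01", "train_split": b"ids:1,2,3"},
    {"max_depth": 6},
    {"scikit-learn": "1.4.2"},
)
```

Tools such as DVC (see Tools below) implement this pattern at scale, but even a hand-built manifest makes "which model was this exactly?" answerable during an audit.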
Change control: Any change must go through formal change control. Even minor adjustments require:
Impact assessment
Testing
Approval
Documentation
Rollback plan: What do we do if the model fails in production? There should always be a plan to roll back to the previous version or to a manual process.
Phase 5: Continuous Monitoring
ML models are not "set and forget." In GxP, we require continuous monitoring.
Performance monitoring: Track continuously:
Prediction accuracy
Distribution of input data (data drift)
Distribution of outputs
Response times
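A common metric for input-data drift is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against the training-time reference. A minimal sketch with simulated data:

```python
import math
import random

def population_stability_index(expected: list, actual: list,
                               bins: int = 10) -> float:
    """PSI between a reference (training-time) and a current
    (production) sample of one numeric input feature.

    Common rule of thumb: PSI < 0.1 = stable, 0.1-0.25 = moderate
    drift, > 0.25 = significant drift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Simulated feature samples: one stable, one shifted by a full sigma.
random.seed(0)
reference = [random.gauss(0, 1) for _ in range(5000)]
stable    = [random.gauss(0, 1) for _ in range(5000)]
shifted   = [random.gauss(1, 1) for _ in range(5000)]
```

Monitoring tools such as Evidently AI (see Tools below) compute PSI and related statistics out of the box; the value of coding it once is knowing exactly what the alert threshold means.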
Periodic review: Quarterly or semi-annual reviews where we verify that the model still performs as expected.
Retraining and revalidation: When should the model be retrained? Define clear criteria:
Performance falls below threshold
Significant data drift detected
New regulatory requirements
Changes in underlying processes
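Criteria like these are most useful when encoded as an explicit, auditable check rather than left to judgment. The metric names and thresholds below are illustrative only:

```python
def retraining_triggers(status: dict, limits: dict) -> list:
    """Return the list of criteria that currently call for retraining.

    An empty list means the model may stay in production; any entry
    should open a formal change-control case.
    """
    triggers = []
    if status["accuracy"] < limits["min_accuracy"]:
        triggers.append("performance below threshold")
    if status["input_psi"] > limits["max_input_psi"]:
        triggers.append("significant data drift")
    if status.get("process_changed") or status.get("new_regulation"):
        triggers.append("process or regulatory change")
    return triggers

limits = {"min_accuracy": 0.95, "max_input_psi": 0.25}
status = {"accuracy": 0.96, "input_psi": 0.31, "process_changed": False}
triggers = retraining_triggers(status, limits)
```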
Practical challenges and solutions
Challenge: Explainability
Regulators often want to know "why" the model makes a decision. Deep learning models are notoriously difficult to explain.
Our approach:
Prefer explainable models where possible (decision trees, linear models)
For complex models: Implement SHAP or LIME to explain individual predictions
Document model behavior thoroughly through sensitivity analysis
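The sensitivity analysis mentioned above can be as simple as perturbing one input at a time and recording how much the output moves. A library-free sketch, using a made-up surrogate model and made-up feature names:

```python
def sensitivity_analysis(predict, baseline: dict,
                         delta: float = 0.05) -> dict:
    """One-at-a-time sensitivity analysis of a prediction function.

    Each feature is nudged by +/- `delta` (relative) while the others
    stay at their baseline values; the score is the resulting rate of
    change in the model output. Large scores flag the features that
    drive the prediction -- a simple complement to SHAP or LIME.
    """
    scores = {}
    for name, value in baseline.items():
        up = dict(baseline, **{name: value * (1 + delta)})
        down = dict(baseline, **{name: value * (1 - delta)})
        scores[name] = abs(predict(up) - predict(down)) / (2 * delta)
    return scores

# Hypothetical model: output driven mainly by temperature.
model = lambda x: 3.0 * x["temperature"] + 0.1 * x["pressure"]
scores = sensitivity_analysis(model, {"temperature": 40.0, "pressure": 2.0})
```

Unlike SHAP values, these scores are local and one-dimensional, but they are trivially reproducible and easy to explain to an auditor.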
Challenge: Reproducibility
Being able to reproduce the exact same model is critical in GxP, but ML often involves randomness.
Our approach:
Set ALL random seeds and document them
Version control of everything (code, data, config)
Containerization (Docker) of the entire environment
Automated pipelines that ensure identical processes
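Pinning the seeds typically looks like the sketch below (assuming NumPy is in use; framework-specific seeds such as `torch.manual_seed` would be added in the same place):

```python
import os
import random

import numpy as np

def set_global_seeds(seed: int = 42) -> None:
    """Pin every source of randomness we control.

    Note: PYTHONHASHSEED only takes effect if set before the
    interpreter starts; it is included here for completeness and so
    the value ends up in the documented configuration.
    """
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
```

The chosen seed values belong in the versioned configuration alongside the hyperparameters, so a revalidation run can reproduce the training exactly.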
Challenge: Audit trails
GxP requires a complete audit trail of all changes and decisions.
Our approach:
Automatic logging of all model interactions
Integration with electronic QMS systems
21 CFR Part 11 compliant signatures on critical decisions
Case: Validation of quality control model
We developed an AI model for automatic inspection of pharmaceutical tablets. This was a GAMP Category 5 (custom application) system with direct GxP impact.
Our approach:
6 months of data collection and qualification from the production line
Selection of CNN architecture after comparing it with 4 alternative approaches (documented)
Locked test set with 10,000 tablets manually verified by 3 independent inspectors
Performance requirements: Min 99% accuracy, max 0.1% false negatives (defective tablets marked as OK)
Complete validation documentation: IQ/OQ/PQ of 300+ pages
Continuous monitoring with weekly performance reviews
Result: Model approved by QA, implemented in production, and has operated stably for over 18 months with consistent >99.5% accuracy.
Tools and best practices
MLOps for GxP:
MLflow for experiment tracking (with audit logging)
DVC for data and model versioning
Great Expectations for data validation
Evidently AI for monitoring data drift
SHAP/LIME for model explainability
Documentation templates: We have developed GxP-ready templates for:
ML Model Requirements Specification
ML Model Design Document
Validation Plan and Report
Change Control procedures for ML
Conclusion
AI and ML can absolutely be used in GxP environments, but it requires discipline, thorough documentation, and a structured approach to validation. It is not enough to have a model that "works"; it must be validated, reproducible, and continuously monitored.
At Hyperbolic, we combine a deep understanding of both AI/ML technology and GxP requirements. We help pharma companies navigate this complex landscape and implement AI solutions that deliver value while meeting regulatory requirements.
Contact us for a consultation on validating AI in your GxP environment.

By
Peter Busk
CEO & Partner