The Model Card Template That Passes FDA Pre-Cert Review
The FDA Submission That Got Rejected
Startup: "We're submitting our AI diagnostic tool for FDA Pre-Cert."
FDA Reviewer: "Provide documentation: training data, model architecture, evaluation metrics, clinical validation."
Startup: "We have a white paper..."
FDA: "We need structured documentation. Model card, data card, and clinical evaluation report. Resubmit in 6 months."
The Delay: 6 months of scrambling to create documentation that should've existed from day one.
What FDA Pre-Cert Requires (The Checklist)
Three Documents:
- Model Card: What the AI does, how it was trained, limitations
- Data Card: Where training data came from, bias testing, quality control
- Clinical Evaluation Report: Real-world validation, safety monitoring
Timeline:
- Without documentation: 12-18 months to approval
- With documentation: 6-9 months
Cost Savings: Roughly 6 months of engineering time, plus faster time to market
The FDA-Ready Model Card Template
Section 1: Intended Use
What FDA Wants:
- Medical condition/disease targeted
- Patient population (age, sex, comorbidities)
- Clinical setting (hospital, clinic, home use)
- User (physician, nurse, patient)
Example:
INTENDED USE
Medical Condition: Type 2 Diabetes screening
Patient Population: Adults 18-75, no prior diabetes diagnosis
Clinical Setting: Primary care clinic
Primary User: Primary care physician
Decision Support: AI flags high-risk patients for lab testing (HbA1c)
What NOT to Say: "General health screening" (too vague—FDA will reject)
Section 2: Model Architecture
What FDA Wants:
- Algorithm type (e.g., "Gradient boosting classifier")
- Input features (e.g., "Age, BMI, blood pressure, family history")
- Output (e.g., "Risk score 0-100, with threshold at 70 for high-risk")
Example:
MODEL ARCHITECTURE
Algorithm: XGBoost (gradient boosting decision trees)
Version: XGBoost 1.7.0
Inputs: 12 clinical features (age, BMI, systolic BP, fasting glucose, etc.)
Output: Diabetes risk score (0-100)
Threshold: Score ≥70 = High Risk (recommend HbA1c lab test)
Why This Matters: FDA needs to understand how the AI makes decisions (interpretability requirement).
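To make the inputs → score → threshold pipeline concrete, here's a minimal scoring sketch in Python. The feature names, the model artifact path, and the probability-to-score scaling are illustrative assumptions, not the submitted device's actual code.

```python
# Minimal sketch of the documented inputs -> score -> threshold pipeline.
# Feature names, the model artifact, and the 0-100 scaling are assumptions.
import numpy as np
import xgboost as xgb

FEATURES = ["age", "bmi", "systolic_bp", "fasting_glucose"]  # 4 of the 12 features
HIGH_RISK_THRESHOLD = 70  # score >= 70 triggers the HbA1c recommendation

model = xgb.XGBClassifier()
model.load_model("diabetes_risk_v1.json")  # hypothetical versioned artifact

def risk_score(patient: dict) -> tuple[int, bool]:
    """Return (0-100 risk score, high-risk flag) for one patient."""
    x = np.array([[patient[f] for f in FEATURES]])
    prob = model.predict_proba(x)[0, 1]  # P(diabetes) from the classifier
    score = int(round(prob * 100))       # rescale to the documented 0-100 range
    return score, score >= HIGH_RISK_THRESHOLD
```

Documenting the exact version and threshold here is what lets FDA reproduce the decision logic from the card alone.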
Section 3: Training Data
What FDA Wants:
- Source (where data came from)
- Volume (how many patients)
- Demographics (age, sex, race, ethnicity)
- Date range (when data was collected)
- Quality control (how you ensured data accuracy)
Example:
TRAINING DATA
Source: Electronic Health Records from [Hospital System], IRB-approved (Protocol #12345)
Volume: 50,000 patients (2018-2023)
Demographics:
- Age: Mean 52 (range 18-75), SD 14
- Sex: 52% female, 48% male
- Race: 60% White, 20% Black, 8% Asian, 12% Other
- Ethnicity: 85% Non-Hispanic, 15% Hispanic
Data Quality:
- Missing data: <5% per feature (imputed using median)
- Outliers: Values >99th percentile reviewed by clinician, corrected or removed
De-Identification: HIPAA-compliant (dates shifted, names removed, rare diagnoses aggregated)
Red Flag: If your training demographics don't match the US patient population, FDA will ask about bias.
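The quality-control bullets translate directly into a preprocessing step you can show a reviewer. A minimal pandas sketch, assuming hypothetical column names:

```python
# Sketch of the quality-control steps above using pandas.
# Column names and the review flag are illustrative assumptions.
import pandas as pd

def quality_control(df: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    df = df.copy()
    for col in features:
        # Record missingness before imputing (this is the "<5% per feature" figure).
        missing_pct = df[col].isna().mean() * 100
        print(f"{col}: {missing_pct:.1f}% missing")
        df[col] = df[col].fillna(df[col].median())  # median imputation, as documented
        # Flag values above the 99th percentile for clinician review
        # instead of silently dropping them.
        cutoff = df[col].quantile(0.99)
        df[f"{col}_needs_review"] = df[col] > cutoff
    return df
```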
Section 4: Evaluation Metrics
What FDA Wants:
- Accuracy, sensitivity, specificity (clinical gold standards)
- Performance by demographic subgroup (fairness testing)
- Comparison to human clinicians (is AI better?)
- Clinical impact (does AI improve patient outcomes?)
Example:
EVALUATION METRICS
Test Set: 10,000 patients (held out, not used in training)
Overall Performance:
- Sensitivity (Recall): 87% (95% CI: 85-89%)
- Specificity: 82% (95% CI: 80-84%)
- AUC: 0.91
Subgroup Performance (Fairness Testing):
- Female: Sensitivity 88%, Specificity 83%
- Male: Sensitivity 86%, Specificity 81%
- White: Sensitivity 89%, Specificity 84%
- Black: Sensitivity 84%, Specificity 79% (within 5pp, acceptable)
Comparison to Physician:
- Physician sensitivity: 78% (AI +9pp improvement)
- Physician specificity: 85% (AI -3pp, acceptable trade-off)
Clinical Impact:
- Early detection: AI flags 12% more high-risk patients than physician alone
- Estimated prevented complications: 200 cases/year per 10,000 patients screened
Why This Matters: FDA cares about patient outcomes, not just model accuracy.
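Here's a sketch of how the evaluation table above could be generated: sensitivity, specificity, AUC, a Wilson 95% confidence interval, and per-subgroup breakdowns. The column names (label, prediction, risk_score) are assumptions, not a standard schema.

```python
# Hedged sketch of the evaluation metrics and fairness table above.
import pandas as pd
from sklearn.metrics import confusion_matrix, roc_auc_score
from statsmodels.stats.proportion import proportion_confint

def evaluate(y_true, y_pred, y_score) -> dict:
    """Clinical metrics for one cohort: sensitivity, specificity, AUC, 95% CI."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens_lo, sens_hi = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_95ci": (round(sens_lo, 3), round(sens_hi, 3)),
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_true, y_score),
    }

def subgroup_report(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Fairness table: the same metrics for each demographic subgroup."""
    rows = {
        group: evaluate(g["label"], g["prediction"], g["risk_score"])
        for group, g in df.groupby(group_col)
    }
    return pd.DataFrame(rows).T
```

Running subgroup_report over sex, race, and age bands produces exactly the fairness rows FDA expects to see in Section 4.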
Section 5: Limitations and Warnings
What FDA Wants:
- Known failure modes (when AI is unreliable)
- Contraindications (when NOT to use AI)
- Required human oversight (physician must review)
Example:
LIMITATIONS
Known Failure Modes:
- Lower accuracy for patients with rare comorbidities (<1% of population)
- Not validated for patients under 18 or over 75
- Not validated for Type 1 Diabetes (only Type 2)
Contraindications:
- Do NOT use for patients with pre-existing diabetes diagnosis
- Do NOT use as sole diagnostic tool (lab confirmation required)
Required Human Oversight:
- Physician must review all high-risk flags before ordering lab tests
- AI is decision support, not autonomous diagnosis
- Physician retains final clinical decision authority
Why This Matters: FDA wants proof you're not overselling the AI's capabilities.
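Contraindications carry more weight with reviewers when they're enforced in software, not just listed in a PDF. A minimal eligibility gate, with hypothetical field names:

```python
# Minimal sketch of enforcing the contraindications above in software.
# Field names are illustrative assumptions.
def eligible_for_screening(patient: dict) -> tuple[bool, str]:
    """Gate the AI behind the documented intended-use limits."""
    if patient.get("diabetes_diagnosis"):  # contraindication: prior diagnosis
        return False, "Prior diabetes diagnosis: do not screen with this tool."
    if not 18 <= patient["age"] <= 75:  # only the validated age range
        return False, "Outside validated age range (18-75)."
    return True, "Eligible: show risk score to physician for review."
```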
Section 6: Post-Market Surveillance
What FDA Wants:
- How you'll monitor AI performance in production
- What triggers a safety alert (accuracy drop, adverse events)
- How often you'll retrain/update the model
Example:
POST-MARKET SURVEILLANCE
Monitoring Plan:
- Monthly accuracy tracking on production data (random sample of 500 patients)
- Alert trigger: Sensitivity drops below 80% OR specificity drops below 75%
- Physician feedback: Track overrides, false positives, false negatives
Safety Reporting:
- Adverse events (patient harm) reported to FDA within 30 days
- Quarterly summary report to FDA (performance metrics, user feedback)
Model Updates:
- Annual retraining with new data (subject to FDA review)
- Version control: All model versions documented, old versions archived
Why This Matters: FDA Pre-Cert assumes continuous improvement (not "set it and forget it").
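The monitoring plan maps to a small scheduled job. A sketch, assuming labeled outcomes are available for the monthly sample and using a placeholder alert hook:

```python
# Sketch of the monthly surveillance job. Thresholds come from the plan above;
# the logging call stands in for a real paging / FDA-reporting workflow.
import logging
import pandas as pd

SENSITIVITY_FLOOR = 0.80
SPECIFICITY_FLOOR = 0.75

def monthly_check(sample: pd.DataFrame) -> None:
    """Run on a random sample of ~500 labeled production cases each month."""
    tp = ((sample["prediction"] == 1) & (sample["label"] == 1)).sum()
    fn = ((sample["prediction"] == 0) & (sample["label"] == 1)).sum()
    tn = ((sample["prediction"] == 0) & (sample["label"] == 0)).sum()
    fp = ((sample["prediction"] == 1) & (sample["label"] == 0)).sum()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if sensitivity < SENSITIVITY_FLOOR or specificity < SPECIFICITY_FLOOR:
        # Placeholder alert: in production this would page on-call and open
        # the investigation and FDA-reporting workflow described above.
        logging.critical("Safety alert: sens=%.2f spec=%.2f", sensitivity, specificity)
```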
Real Example: Diabetic Retinopathy Detection AI
Product: AI analyzes retinal images, flags diabetic retinopathy.
FDA Submission:
Intended Use: Screen diabetic patients for retinopathy in primary care settings (not ophthalmology clinics).
Model: Convolutional neural network (ResNet-50 architecture)
Training Data: 120,000 retinal images from 5 hospital systems (2015-2020)
Evaluation:
- Sensitivity: 92% (FDA target: >85%)
- Specificity: 88%
- Comparison: Ophthalmologist sensitivity 95% (AI -3pp, acceptable for screening)
Limitations:
- Not for patients with cataracts (image quality too poor)
- Requires human ophthalmologist to confirm positive findings
Post-Market:
- Monthly monitoring: Random sample of 1,000 images re-reviewed by ophthalmologist
- Alert: If AI sensitivity drops below 88%, auto-disable pending investigation
FDA Decision: Cleared (6 months from submission to clearance).
Why It Worked: Documentation was complete upfront. No back-and-forth with FDA.
The Data Card (Companion to Model Card)
What FDA Wants (separate document):
- Data provenance: IRB approval, patient consent, HIPAA compliance
- Bias testing: Performance by race, sex, age, socioeconomic status
- Data retention: How long you keep training data, why
- Data security: Encryption, access controls, audit logs
Example Snippet:
DATA CARD
Provenance:
- Source: [Hospital System] EHR database
- IRB: Approved under Protocol #12345, waiver of consent (de-identified data)
- HIPAA: Compliant (Business Associate Agreement signed)
Bias Testing:
- Racial parity: Sensitivity within 5pp across racial groups
- Gender parity: Sensitivity within 3pp (female 88%, male 86%)
- Age: Lower sensitivity for patients >70 (79% vs. 87% for 40-60 age group)
→ Mitigation: Added warning for physicians treating elderly patients
Data Retention:
- Training data: Retained for 10 years (FDA device record requirement)
- Production data: De-identified logs retained for 3 years (monitoring)
Data Security:
- Encryption: AES-256 at rest, TLS 1.3 in transit
- Access: Role-based (PM, ML engineer, clinical validator—7 people total)
- Audit logs: Reviewed quarterly by compliance team
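The parity figures in the bias-testing section of the card above can be checked automatically before each submission or retraining run. A sketch, assuming label and prediction columns plus a demographic column:

```python
# Hedged sketch of the parity check: max sensitivity gap across subgroups,
# compared against the documented 5-percentage-point tolerance.
import pandas as pd

MAX_GAP_PP = 5.0  # tolerance from the bias-testing section above

def parity_gap(df: pd.DataFrame, group_col: str) -> float:
    """Max sensitivity gap (percentage points) across subgroups of group_col."""
    def sensitivity(g: pd.DataFrame) -> float:
        positives = g[g["label"] == 1]
        return (positives["prediction"] == 1).mean() * 100
    per_group = df.groupby(group_col).apply(sensitivity)
    return float(per_group.max() - per_group.min())

# Example: fail fast on racial parity before an FDA reviewer has to ask.
# assert parity_gap(test_df, "race") <= MAX_GAP_PP
```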
Checklist: Is Your Model Card FDA-Ready?
- Intended use (specific medical condition, patient population, clinical setting)
- Model architecture (algorithm, inputs, outputs, threshold)
- Training data (source, volume, demographics, quality control)
- Evaluation metrics (sensitivity, specificity, AUC, subgroup performance)
- Comparison to human clinician (is AI better/worse?)
- Clinical impact (does AI improve patient outcomes?)
- Limitations (failure modes, contraindications, required oversight)
- Post-market surveillance (monitoring plan, safety reporting, update schedule)
If any box is unchecked, FDA will request more documentation.
Common PM Mistakes
Mistake 1: Claiming "General Purpose" AI
- Reality: FDA requires narrow, well-defined medical use cases
- Fix: Specify exact condition, population, setting (not "health screening")
Mistake 2: No Bias Testing
- Reality: FDA will reject if you haven't tested performance across demographics
- Fix: Report sensitivity/specificity by race, sex, age (minimum)
Mistake 3: No Post-Market Plan
- Reality: FDA Pre-Cert assumes you'll monitor and update the AI
- Fix: Document monitoring frequency, alert triggers, update process
Alex Welcing is a Senior AI Product Manager in New York who writes FDA-ready model cards before submitting medical device AI. His regulatory approvals take 6 months, not 18, because documentation is a product requirement from day one.