
The September Retro: What Your AI Team Learned in Q3 (And What to Fix in Q4)

September 8, 2025 · Alex Welcing · 8 min read

The Q3 Review That Exposed the Real Blockers

CEO: "We planned to ship 6 AI features in Q3. We shipped 3. What happened?"

PM: "Well, Feature A got delayed by compliance, Feature B by data quality issues, Feature C by..."

CEO: "Are these one-offs, or do we have systemic problems?"

PM: Realizes no one has been tracking delay patterns.

The Fix: Monthly retrospectives → identify patterns → fix root causes (not symptoms).

The Q3 Retrospective Framework

Three Questions:

  1. What shipped? (wins)
  2. What didn't ship? (delays)
  3. What patterns emerge? (systemic issues)

Output: Action items for Q4 (not just "we'll try harder").

Real Q3 Retrospective: Legal Tech Startup

Team: 5 engineers, 2 PMs, 1 data scientist

Q3 Goals: Ship 6 AI features

Results: 3 shipped, 3 delayed

What Shipped (Wins)

Feature 1: Contract clause extraction AI (deployed Aug 15)

  • Why it shipped: Model was ready by July, UX was simple (one button), and compliance approval from a previous feature carried over (no new review needed)

Feature 2: AI-powered search (deployed Aug 22)

  • Why it shipped: Used existing model (no retraining), frontend-only changes, no PII concerns

Feature 3: Legal citation validator (deployed Sep 10)

  • Why it shipped: PM wrote compliance docs in advance (no delay waiting for legal team)

What Didn't Ship (Delays)

Feature 4: AI contract summarization (planned July, now Oct)

  • Delay cause: Legal review took 6 weeks (not 2 weeks as estimated)
  • Root cause: PM didn't know the legal team was understaffed in July (vacation season)

Feature 5: Bias detection in hiring docs (planned Aug, now Nov)

  • Delay cause: Training data quality issues (50% of labels were wrong and had to be re-labeled)
  • Root cause: PM didn't QA the labeled dataset before training started

Feature 6: Multi-language support (planned Sep, now Q1 2026)

  • Delay cause: Model accuracy in Spanish was 68% (vs. 89% in English)—not shippable
  • Root cause: PM assumed "fine-tune on Spanish data" would work; didn't test early enough

Patterns (Systemic Issues)

Pattern 1: Compliance Reviews Take 4-6 Weeks (Not 2)

  • Impact: 1 of 3 delays was compliance-related, and it pushed Feature 4 back a full quarter (July to October)
  • Fix: Start compliance docs 8 weeks before planned launch (not 4 weeks)

Pattern 2: Data Quality Not Checked Until Training Starts

  • Impact: 1 feature delayed 3 months to re-label data
  • Fix: Add "Data QA Week" to project plan (before training, not during)

Pattern 3: Multi-Language Features Underestimated

  • Impact: Spanish model took 2x longer than expected (low accuracy, needed more data)
  • Fix: Add 4-week buffer for non-English features; test on small dataset first

The Action Items Template

For Each Pattern, Define:

  1. Problem: What went wrong?
  2. Root Cause: Why did it happen?
  3. Fix: What will we change in Q4?
  4. Owner: Who's responsible?
  5. Deadline: When will this be implemented?

Example:

| Problem | Root Cause | Fix | Owner | Deadline |
| --- | --- | --- | --- | --- |
| Compliance reviews take 6 weeks (not 2) | PM didn't account for legal team workload | Start compliance docs 8 weeks before launch; check legal team calendar for conflicts | PM (Alex) | Oct 1 |
| Data quality issues delay training | No QA process for labeled datasets | Add "Data QA Week" to project template; PM reviews 10% sample before training starts | PM + Data Scientist | Oct 1 |
| Multi-language features underestimated | Assumed fine-tuning on Spanish data would work without testing | Spike: test on 100 Spanish examples before committing to feature; add 4-week buffer to estimate | Data Scientist | Oct 15 |
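
If this table lives in a spreadsheet, that works fine. For teams that prefer something scriptable, below is a minimal Python sketch of the same template; the field values come from the table above, and the overdue check is a hypothetical monthly-check-in helper, not an existing tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    problem: str      # what went wrong
    root_cause: str   # why it happened
    fix: str          # what will change in Q4
    owner: str        # who is responsible
    deadline: date    # when the fix must be in place

    def is_overdue(self, today: date) -> bool:
        return today > self.deadline

# Q3 action items from the table above (deadlines assume the 2025 calendar).
q4_actions = [
    ActionItem("Compliance reviews take 6 weeks (not 2)",
               "PM didn't account for legal team workload",
               "Start compliance docs 8 weeks before launch",
               "PM (Alex)", date(2025, 10, 1)),
    ActionItem("Data quality issues delay training",
               "No QA process for labeled datasets",
               "Add Data QA Week; review 10% sample before training",
               "PM + Data Scientist", date(2025, 10, 1)),
    ActionItem("Multi-language features underestimated",
               "Assumed fine-tuning on Spanish data would work without testing",
               "Spike on 100 Spanish examples; add 4-week buffer",
               "Data Scientist", date(2025, 10, 15)),
]

# Monthly check-in: surface anything past its deadline.
for item in q4_actions:
    if item.is_overdue(date.today()):
        print(f"OVERDUE: {item.problem} (owner: {item.owner})")
```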

The "Start, Stop, Continue" Exercise

For Each Team Member, Ask:

START doing (new practices):

  • Start writing compliance docs 8 weeks before launch
  • Start QA'ing labeled data before training
  • Start testing multi-language features on small datasets first

STOP doing (bad habits):

  • Stop assuming compliance reviews take 2 weeks (they take 6)
  • Stop training models on unvalidated data
  • Stop estimating multi-language features the same way as English-only features

CONTINUE doing (what's working):

  • Continue writing model cards upfront (Feature 3 shipped on time because docs were ready)
  • Continue reusing existing models when possible (Feature 2 shipped fast)
  • Continue UX simplicity (one-button features ship faster than complex workflows)

The Velocity Tracking Dashboard

Track These Metrics Q3 → Q4:

| Metric | Q3 Actual | Q4 Target |
| --- | --- | --- |
| Features shipped on time | 50% (3 of 6) | 75% (6 of 8) |
| Compliance review time | 6 weeks | 4 weeks (with 8-week lead time) |
| Data quality issues (features delayed) | 33% (1 of 3 delays) | 10% (max 1 delay) |
| Model accuracy on first attempt | 67% (2 of 3 hit target) | 85% (7 of 8 hit target) |

How to Use This: Monthly check-in (are we on track to hit Q4 targets?).
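
You don't need a BI tool for four numbers. A minimal Python sketch like the one below is enough for the monthly check-in; the feature names and delay causes come from the retro above, everything else is illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feature:
    name: str
    on_time: bool
    delay_cause: Optional[str] = None  # e.g. "compliance", "data quality", "accuracy"

# Q3 features from the retro above.
q3 = [
    Feature("Contract clause extraction", True),
    Feature("AI-powered search", True),
    Feature("Legal citation validator", True),
    Feature("Contract summarization", False, "compliance"),
    Feature("Bias detection", False, "data quality"),
    Feature("Multi-language support", False, "accuracy"),
]

shipped = sum(f.on_time for f in q3)
print(f"Shipped on time: {100 * shipped / len(q3):.0f}% ({shipped} of {len(q3)})")

# Group delays by cause so the same numbers also surface the patterns.
causes = Counter(f.delay_cause for f in q3 if not f.on_time)
for cause, count in causes.most_common():
    print(f"Delayed by {cause}: {count}")
```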

Real Examples of Q3 Learnings

Startup A (Healthcare AI):

  • Q3 Lesson: "Red-teaming takes 2 weeks, not 1 day"
  • Q4 Fix: Add 2-week red-team sprint to every AI feature (before security review)

Startup B (Legal Tech):

  • Q3 Lesson: "Attorneys need training on AI features (not just docs)"
  • Q4 Fix: Add 1-week training period for each feature (PM runs 3 workshops with attorneys)

Startup C (AdTech):

  • Q3 Lesson: "A/B tests need 4 weeks, not 2 weeks, for statistical significance"
  • Q4 Fix: Extend all A/B tests to 4 weeks; if the sample size is too small, expand to 10% of users instead of 5% (a rough duration check is sketched below)
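
Startup C's rule of thumb can be sanity-checked with a standard two-proportion power calculation. The sketch below is illustrative only: the baseline conversion rate, target lift, traffic, and exposure levels are hypothetical, not Startup C's real numbers.

```python
from math import ceil

from scipy.stats import norm

def required_n_per_arm(p_base: float, p_variant: float,
                       alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test (standard approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2)

# Hypothetical test: 3.0% baseline conversion, hoping to detect a lift to 3.6%.
n = required_n_per_arm(0.030, 0.036)

daily_users = 20_000                       # hypothetical traffic
for exposure in (0.05, 0.10):              # experiment gets 5% or 10% of users
    per_arm_per_day = daily_users * exposure / 2   # split evenly across two arms
    weeks = n / per_arm_per_day / 7
    print(f"{exposure:.0%} exposure: ~{n:,} users per arm -> ~{weeks:.1f} weeks")
```

With these made-up inputs, a 5% rollout needs roughly four weeks while a 10% rollout finishes in about two, which is exactly the trade-off Startup C's fix describes.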

The Blameless Post-Mortem Culture

Bad Retro:

  • "Feature 4 was delayed because [Engineer] didn't finish the model on time."

Good Retro:

  • "Feature 4 was delayed because we didn't account for compliance review time. Root cause: PM estimated 2 weeks, actual was 6 weeks. Fix: PM will check legal team calendar and add 8-week buffer in Q4."

Why This Matters: Blame kills learning. Focus on systems, not individuals.

The Q4 Commitments (Based on Q3 Learnings)

Commitment 1: Compliance Docs Start 8 Weeks Before Launch

  • What Changed: PM writes model card, risk register, and red-team report during development (not after); the 8-week lead-time check is sketched below
  • Impact: No compliance delays in Q4
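
The lead-time rule is simple enough to automate. A tiny Python sketch like this one (the Q4 launch dates are hypothetical) flags any feature whose compliance docs should already be underway.

```python
from datetime import date, timedelta

LEAD_TIME = timedelta(weeks=8)   # Commitment 1's compliance lead time

def compliance_doc_start(launch: date) -> date:
    """Date the model card, risk register, and red-team report work must begin."""
    return launch - LEAD_TIME

# Hypothetical Q4 launch dates for the two delayed features.
for name, launch in [("Contract summarization", date(2025, 10, 20)),
                     ("Bias detection", date(2025, 11, 17))]:
    start = compliance_doc_start(launch)
    status = "LATE: start now" if start < date.today() else f"start by {start}"
    print(f"{name}: launch {launch} -> compliance docs {status}")
```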

Commitment 2: Data QA Week (Before Training)

  • What Changed: PM + data scientist review a 10% sample of labeled data and fix quality issues before training starts (a minimal sketch of this QA gate follows below)
  • Impact: No data quality delays in Q4
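
What the Data QA Week gate looks like will vary by team. The Python sketch below is one hypothetical version: it assumes labeled examples are dicts with a `label` field and that the reviewer records a `reviewer_label` during the spot check; the 5% error threshold is an example, not a standard.

```python
import random

def qa_sample(labeled_examples: list[dict], sample_frac: float = 0.10,
              seed: int = 7) -> list[dict]:
    """Draw the 10% sample the PM and data scientist review before training."""
    rng = random.Random(seed)
    k = max(1, int(len(labeled_examples) * sample_frac))
    return rng.sample(labeled_examples, k)

def label_error_rate(reviewed: list[dict]) -> float:
    """Share of reviewed examples where the reviewer disagreed with the label."""
    wrong = sum(1 for ex in reviewed if ex["reviewer_label"] != ex["label"])
    return wrong / len(reviewed)

# Hypothetical gate: if more than 5% of the sample is mislabeled,
# fix the labeling process before any training run starts.
# error = label_error_rate(reviewed_sample)
# assert error <= 0.05, f"Label error rate {error:.0%} is too high; relabel first"
```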

Commitment 3: Multi-Language Spike (Before Committing to Feature)

  • What Changed: Test on 100 examples in the target language; if accuracy is under 80%, add a 4-week buffer or kill the feature (this decision rule is sketched below)
  • Impact: No multi-language surprises in Q4
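
The spike itself can be scripted in a few lines. The sketch below is hypothetical: `model.predict` stands in for whatever inference call your stack exposes, and the 80% bar and 4-week buffer come straight from the commitment above.

```python
def spike_accuracy(model, examples: list[dict]) -> float:
    """Accuracy of `model` on a small labeled sample in the target language."""
    correct = sum(1 for ex in examples if model.predict(ex["text"]) == ex["label"])
    return correct / len(examples)

def spike_decision(accuracy: float, bar: float = 0.80) -> str:
    """Commitment 3's decision rule."""
    if accuracy >= bar:
        return "commit: estimate like any other feature"
    return "caution: add a 4-week buffer or drop the feature this quarter"

# Hypothetical usage with ~100 labeled Spanish examples:
# acc = spike_accuracy(spanish_model, spanish_sample)   # e.g. 0.68 in Q3
# print(spike_decision(acc))
```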

Commitment 4: A/B Tests Run for 4 Weeks (Not 2)

  • What Changed: All A/B tests extended to 4 weeks for statistical significance
  • Impact: No "inconclusive test" delays in Q4

Checklist: Is Your Q3 Retro Actionable?

  • Documented: What shipped vs. what didn't
  • Identified: 3-5 patterns (not just one-off issues)
  • Defined: Action items with owners and deadlines
  • Committed: Specific process changes for Q4 (not "we'll try harder")
  • Measured: Velocity metrics to track improvement (Q3 vs. Q4)
  • Blameless: Focused on systems, not individuals

If any box is unchecked, your retro won't drive change.

The Monthly Retro Cadence

Don't wait for Q4 to reflect on Q3. Run monthly retros:

July Retro:

  • What shipped in June? What delayed?
  • 1-2 action items for July

August Retro:

  • What shipped in July? What delayed?
  • Did July action items work? (If not, adjust)

September Retro (Q3 Summary):

  • Roll up patterns from July + August + September
  • Define Q4 process changes

Why This Works: Monthly retros catch patterns early (not 3 months later when memory fades).

The One-Page Q3 Summary (For Your CEO)

Q3 RETROSPECTIVE: AI FEATURES

SHIPPED:
✅ Contract clause extraction (Aug 15)
✅ AI-powered search (Aug 22)
✅ Legal citation validator (Sep 10)

DELAYED:
❌ Contract summarization (Oct, was July) - Compliance review took 6 weeks
❌ Bias detection (Nov, was Aug) - Data quality issues (re-labeling)
❌ Multi-language support (Q1 2026, was Sep) - Spanish accuracy too low

PATTERNS:
1. Compliance reviews take 6 weeks (not 2) → Fix: Start docs 8 weeks before launch
2. Data quality not checked until training starts → Fix: Add Data QA Week
3. Multi-language features underestimated → Fix: Test on small dataset first; add buffer

Q4 TARGETS:
- Ship 6 of 8 features on time (75% vs. 50% in Q3)
- Zero compliance delays (with 8-week lead time)
- Zero data quality delays (with QA Week)

CONFIDENCE: High (process changes address root causes, not symptoms)

Time to Prepare: 2 hours (but saves weeks of delays in Q4).


Alex Welcing is a Senior AI Product Manager in New York who runs monthly retrospectives and quarterly pattern analysis. His Q4 features ship on time because Q3 lessons turn into process changes, not just "lessons learned" docs that no one reads.
