[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-311d0b34-8f61-4107-bc7c-bfedbd59dfa7":3,"$fxi-jJZndlGWopUa4APA_atkVyTFi8GPhNTeko4pQHGM":42},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":33},"311d0b34-8f61-4107-bc7c-bfedbd59dfa7","postmortem-writing","Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.","cat_writing_article","mod_writing","sickn33,writing","---\nname: postmortem-writing\ndescription: \"Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.\"\nrisk: unknown\nsource: community\ndate_added: \"2026-02-27\"\n---\n\n# Postmortem Writing\n\nComprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.\n\n## Do not use this skill when\n\n- The task is unrelated to postmortem writing\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources\u002Fimplementation-playbook.md`.\n\n## Use this skill when\n\n- Conducting post-incident reviews\n- Writing postmortem documents\n- Facilitating blameless postmortem meetings\n- Identifying root causes and contributing factors\n- Creating actionable follow-up items\n- Building organizational learning culture\n\n## Core Concepts\n\n### 1. 
Blameless Culture\n\n| Blame-Focused | Blameless |\n|---------------|-----------|\n| \"Who caused this?\" | \"What conditions allowed this?\" |\n| \"Someone made a mistake\" | \"The system allowed this mistake\" |\n| Punish individuals | Improve systems |\n| Hide information | Share learnings |\n| Fear of speaking up | Psychological safety |\n\n### 2. Postmortem Triggers\n\n- SEV1 or SEV2 incidents\n- Customer-facing outages > 15 minutes\n- Data loss or security incidents\n- Near-misses that could have been severe\n- Novel failure modes\n- Incidents requiring unusual intervention\n\n## Quick Start\n\n### Postmortem Timeline\n```\nDay 0: Incident occurs\nDay 1-2: Draft postmortem document\nDay 3-5: Postmortem meeting\nDay 5-7: Finalize document, create tickets\nWeek 2+: Action item completion\nQuarterly: Review patterns across incidents\n```\n\n## Templates\n\n### Template 1: Standard Postmortem\n\n```markdown\n# Postmortem: [Incident Title]\n\n**Date**: 2024-01-15\n**Authors**: @alice, @bob\n**Status**: Draft | In Review | Final\n**Incident Severity**: SEV2\n**Incident Duration**: 47 minutes\n\n## Executive Summary\n\nOn January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. 
The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.\n\n**Impact**:\n- 12,000 customers unable to complete purchases\n- Estimated revenue loss: $45,000\n- 847 support tickets created\n- No data loss or security implications\n\n## Timeline (All times UTC)\n\n| Time | Event |\n|------|-------|\n| 14:23 | Deployment v2.3.4 completed to production |\n| 14:31 | First alert: `payment_error_rate > 5%` |\n| 14:33 | On-call engineer @alice acknowledges alert |\n| 14:35 | Initial investigation begins, error rate at 23% |\n| 14:41 | Incident declared SEV2, @bob joins |\n| 14:45 | Database connection exhaustion identified |\n| 14:52 | Decision to roll back deployment |\n| 14:58 | Rollback to v2.3.3 initiated |\n| 15:10 | Rollback complete, error rate dropping |\n| 15:18 | Service fully recovered, incident resolved |\n\n## Root Cause Analysis\n\n### What Happened\n\nThe v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently called endpoint. Each request opened a new database connection instead of reusing pooled connections.\n\n### Why It Happened\n\n1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.\n\n2. **Contributing Factors**:\n   - Code review did not catch the connection handling change\n   - No integration tests specifically for connection pool behavior\n   - Staging environment has lower traffic, masking the issue\n   - Database connection metrics alert threshold was too high (90%)\n\n3. **5 Whys Analysis**:\n   - Why did the service fail? → Database connections exhausted\n   - Why were connections exhausted? → Each request opened a new connection\n   - Why did each request open a new connection? → Code bypassed the connection pool\n   - Why did the code bypass the connection pool? → Developer unfamiliar with codebase patterns\n   - Why was the developer unfamiliar? 
→ No documentation on connection management patterns\n\n### System Diagram\n\n```\n[Client] → [Load Balancer] → [Payment Service] → [Database]\n                                    ↓\n                            Connection Pool (broken)\n                                    ↓\n                            Direct connections (cause)\n```\n\n## Detection\n\n### What Worked\n- Error rate alert fired within 8 minutes of deployment\n- Grafana dashboard clearly showed connection spike\n- On-call response was swift (2-minute acknowledgment)\n\n### What Didn't Work\n- Database connection metric alert threshold too high\n- No deployment-correlated alerting\n- No canary deployment, which would likely have caught this earlier\n\n### Detection Gap\nThe deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.\n\n## Response\n\n### What Worked\n- On-call engineer quickly identified database as the issue\n- Rollback decision was made decisively\n- Clear communication in incident channel\n\n### What Could Be Improved\n- Took 10 minutes to correlate issue with recent deployment\n- Had to manually check deployment history\n- Rollback took 12 minutes (could be faster)\n\n## Impact\n\n### Customer Impact\n- 12,000 unique customers affected\n- Average impact duration: 35 minutes\n- 847 support tickets (~7% of affected users)\n- Customer satisfaction score dropped 12 points\n\n### Business Impact\n- Estimated revenue loss: $45,000\n- Support cost: ~$2,500 (agent time)\n- Engineering time: ~8 person-hours\n\n### Technical Impact\n- Database primary experienced elevated load\n- Some replica lag during incident\n- No permanent damage to systems\n\n## Lessons Learned\n\n### What Went Well\n1. Alerting detected the issue before customer reports\n2. Team collaborated effectively under pressure\n3. Rollback procedure worked smoothly\n4. Communication was clear and timely\n\n### What Went Wrong\n1. 
Code review missed critical change\n2. Test coverage gap for connection pooling\n3. Staging environment doesn't reflect production traffic\n4. Alert thresholds were not tuned properly\n\n### Where We Got Lucky\n1. Incident occurred during business hours with full team available\n2. Database handled the load without failing completely\n3. No other incidents occurred simultaneously\n\n## Action Items\n\n| Priority | Action | Owner | Due Date | Ticket |\n|----------|--------|-------|----------|--------|\n| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |\n| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |\n| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |\n| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |\n| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |\n| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |\n\n## Appendix\n\n### Supporting Data\n\n#### Error Rate Graph\n[Link to Grafana dashboard snapshot]\n\n#### Database Connection Graph\n[Link to metrics]\n\n### Related Incidents\n- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)\n\n### References\n- Connection Pool Best Practices\n- Deployment Runbook\n```\n\n### Template 2: 5 Whys Analysis\n\n```markdown\n# 5 Whys Analysis: [Incident]\n\n## Problem Statement\nPayment service experienced 47-minute outage due to database connection exhaustion.\n\n## Analysis\n\n### Why #1: Why did the service fail?\n**Answer**: Database connections were exhausted, causing all new requests to fail.\n\n**Evidence**: Metrics showed connection count at 100\u002F100 (max), with 500+ pending requests.\n\n---\n\n### Why #2: Why were database connections exhausted?\n**Answer**: Each incoming request opened a new database connection instead of using the connection pool.\n\n**Evidence**: Code diff shows direct 
`DriverManager.getConnection()` instead of pooled `DataSource`.\n\n---\n\n### Why #3: Why did the code bypass the connection pool?\n**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.\n\n**Evidence**: PR #1234 shows the change, made while fixing a different bug.\n\n---\n\n### Why #4: Why wasn't this caught in code review?\n**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.\n\n**Evidence**: Review comments only discuss business logic.\n\n---\n\n### Why #5: Why isn't there a safety net for this type of change?\n**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.\n\n**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.\n\n## Root Causes Identified\n\n1. **Primary**: Missing automated tests for infrastructure behavior\n2. **Secondary**: Insufficient documentation of architectural patterns\n3. 
**Tertiary**: Code review checklist doesn't include infrastructure considerations\n\n## Systemic Improvements\n\n| Root Cause | Improvement | Type |\n|------------|-------------|------|\n| Missing tests | Add infrastructure behavior tests | Prevention |\n| Missing docs | Document connection patterns | Prevention |\n| Review gaps | Update review checklist | Detection |\n| No canary | Implement canary deployments | Mitigation |\n```\n\n### Template 3: Quick Postmortem (Minor Incidents)\n\n```markdown\n# Quick Postmortem: [Brief Title]\n\n**Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3\n\n## What Happened\nAPI latency spiked to 5s due to cache miss storm after cache flush.\n\n## Timeline\n- 10:00 - Cache flush initiated for config update\n- 10:02 - Latency alerts fire\n- 10:05 - Identified as cache miss storm\n- 10:08 - Enabled cache warming\n- 10:12 - Latency normalized\n\n## Root Cause\nFull cache flush for minor config update caused thundering herd.\n\n## Fix\n- Immediate: Enabled cache warming\n- Long-term: Implement partial cache invalidation (ENG-999)\n\n## Lessons\nDon't full-flush cache in production; use targeted invalidation.\n```\n\n## Facilitation Guide\n\n### Running a Postmortem Meeting\n\n```markdown\n## Meeting Structure (60 minutes)\n\n### 1. Opening (5 min)\n- Remind everyone of blameless culture\n- \"We're here to learn, not to blame\"\n- Review meeting norms\n\n### 2. Timeline Review (15 min)\n- Walk through events chronologically\n- Ask clarifying questions\n- Identify gaps in timeline\n\n### 3. Analysis Discussion (20 min)\n- What failed?\n- Why did it fail?\n- What conditions allowed this?\n- What would have prevented it?\n\n### 4. Action Items (15 min)\n- Brainstorm improvements\n- Prioritize by impact and effort\n- Assign owners and due dates\n\n### 5. 
Closing (5 min)\n- Summarize key learnings\n- Confirm action item owners\n- Schedule follow-up if needed\n\n## Facilitation Tips\n- Keep discussion on track\n- Redirect blame to systems\n- Encourage quiet participants\n- Document dissenting views\n- Time-box tangents\n```\n\n## Anti-Patterns to Avoid\n\n| Anti-Pattern | Problem | Better Approach |\n|--------------|---------|-----------------|\n| **Blame game** | Shuts down learning | Focus on systems |\n| **Shallow analysis** | Doesn't prevent recurrence | Ask \"why\" 5 times |\n| **No action items** | Waste of time | Always have concrete next steps |\n| **Unrealistic actions** | Never completed | Scope to achievable tasks |\n| **No follow-up** | Actions forgotten | Track in ticketing system |\n\n## Best Practices\n\n### Do's\n- **Start immediately** - Memory fades fast\n- **Be specific** - Exact times, exact errors\n- **Include graphs** - Visual evidence\n- **Assign owners** - No orphan action items\n- **Share widely** - Organizational learning\n\n### Don'ts\n- **Don't name and shame** - Ever\n- **Don't skip small incidents** - They reveal patterns\n- **Don't make it a blame doc** - That kills learning\n- **Don't create busywork** - Actions should be meaningful\n- **Don't skip follow-up** - Verify actions completed\n\n## Resources\n\n- [Google SRE - Postmortem Culture](https:\u002F\u002Fsre.google\u002Fsre-book\u002Fpostmortem-culture\u002F)\n- [Etsy's Blameless Postmortems](https:\u002F\u002Fcodeascraft.com\u002F2012\u002F05\u002F22\u002Fblameless-postmortems\u002F)\n- [PagerDuty Postmortem Guide](https:\u002F\u002Fpostmortems.pagerduty.com\u002F)\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are 
missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,144,201,"2026-05-16 13:34:35",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Writing & Research","writing","mdi-pencil-outline","From academic writing to creative copy, let AI be your dedicated writing assistant",1,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":25,"skillCount":32,"createdAt":26},"Article Writing","article","mdi-file-document-edit-outline","Blogs, press releases, social media articles, and more",61,[34],{"id":35,"skillId":4,"version":36,"fileName":37,"fileSize":38,"filePath":39,"fileHash":40,"manifest":41,"createdAt":19},"bc426809-320a-4b1a-a69b-c0b45b7aee03","1.0.0","postmortem-writing.zip",5265,"uploads\u002Fskills\u002F311d0b34-8f61-4107-bc7c-bfedbd59dfa7\u002Fpostmortem-writing.zip","fd8d93f928a5ccd3bcb3758f2661e5ec72f56f290c4515c5e04089e1e47bbb85","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":12790}]",{"code":43,"message":44,"data":45},200,"success",{"items":46,"stats":47,"page":50},[],{"averageRating":48,"totalRatings":48,"ratingCounts":49},0,[48,48,48,48,48],{"limit":51,"offset":48,"hasMore":52,"nextOffset":51,"ratedOnly":16},15,false]