[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-d5bfa4ae-e175-41e7-a09c-6a00106a948a":3,"$f0TEA-tRGNLIRX4RZo7r5VrmTZi3BwcpJ2hJ5COZsSRs":43},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":34},"d5bfa4ae-e175-41e7-a09c-6a00106a948a","incident-runbook-templates","Production-ready templates for incident response runbooks, covering detection, triage, mitigation, resolution, and communication.","cat_life_career","mod_other","sickn33,other","---\nname: incident-runbook-templates\ndescription: \"Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.\"\nrisk: critical\nsource: community\ndate_added: \"2026-02-27\"\n---\n\n# Incident Runbook Templates\n\nProduction-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.\n\n## Do not use this skill when\n\n- The task is unrelated to incident runbook templates\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources\u002Fimplementation-playbook.md`.\n\n## Use this skill when\n\n- Creating incident response procedures\n- Building service-specific runbooks\n- Establishing escalation paths\n- Documenting recovery procedures\n- Responding to active incidents\n- Onboarding on-call engineers\n\n## Core Concepts\n\n### 1. 
Incident Severity Levels\n\n| Severity | Impact | Response Time | Example |\n|----------|--------|---------------|---------|\n| **SEV1** | Complete outage, data loss | 15 min | Production down |\n| **SEV2** | Major degradation | 30 min | Critical feature broken |\n| **SEV3** | Minor impact | 2 hours | Non-critical bug |\n| **SEV4** | Minimal impact | Next business day | Cosmetic issue |\n\n### 2. Runbook Structure\n\n```\n1. Overview & Impact\n2. Detection & Alerts\n3. Initial Triage\n4. Mitigation Steps\n5. Root Cause Investigation\n6. Resolution Procedures\n7. Verification & Rollback\n8. Communication Templates\n9. Escalation Matrix\n```\n\n## Runbook Templates\n\n### Template 1: Service Outage Runbook\n\n```markdown\n# [Service Name] Outage Runbook\n\n## Overview\n**Service**: Payment Processing Service\n**Owner**: Platform Team\n**Slack**: #payments-incidents\n**PagerDuty**: payments-oncall\n\n## Impact Assessment\n- [ ] Which customers are affected?\n- [ ] What percentage of traffic is impacted?\n- [ ] Are there financial implications?\n- [ ] What's the blast radius?\n\n## Detection\n### Alerts\n- `payment_error_rate > 5%` (PagerDuty)\n- `payment_latency_p99 > 2s` (Slack)\n- `payment_success_rate \u003C 95%` (PagerDuty)\n\n### Dashboards\n- [Payment Service Dashboard](https:\u002F\u002Fgrafana\u002Fd\u002Fpayments)\n- [Error Tracking](https:\u002F\u002Fsentry.io\u002Fpayments)\n- [Dependency Status](https:\u002F\u002Fstatus.stripe.com)\n\n## Initial Triage (First 5 Minutes)\n\n### 1. Assess Scope\n```bash\n# Check service health\nkubectl get pods -n payments -l app=payment-service\n\n# Check recent deployments\nkubectl rollout history deployment\u002Fpayment-service -n payments\n\n# Check error rates\ncurl -s \"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fquery?query=sum(rate(http_requests_total{status=~'5..'}[5m]))\"\n```\n\n### 2. Quick Health Checks\n- [ ] Can you reach the service? 
`curl -I https:\u002F\u002Fapi.company.com\u002Fpayments\u002Fhealth`\n- [ ] Database connectivity? Check connection pool metrics\n- [ ] External dependencies? Check Stripe, bank API status\n- [ ] Recent changes? Check deploy history\n\n### 3. Initial Classification\n| Symptom | Likely Cause | Go To Section |\n|---------|--------------|---------------|\n| All requests failing | Service down | Section 4.1 |\n| High latency | Database\u002Fdependency | Section 4.2 |\n| Partial failures | Code bug | Section 4.3 |\n| Spike in errors | Traffic surge | Section 4.4 |\n\n## Mitigation Procedures\n\n### 4.1 Service Completely Down\n```bash\n# Step 1: Check pod status\nkubectl get pods -n payments\n\n# Step 2: If pods are crash-looping, check logs\nkubectl logs -n payments -l app=payment-service --tail=100\n\n# Step 3: Check recent deployments\nkubectl rollout history deployment\u002Fpayment-service -n payments\n\n# Step 4: ROLLBACK if recent deploy is suspect\nkubectl rollout undo deployment\u002Fpayment-service -n payments\n\n# Step 5: Scale up if resource constrained\nkubectl scale deployment\u002Fpayment-service -n payments --replicas=10\n\n# Step 6: Verify recovery\nkubectl rollout status deployment\u002Fpayment-service -n payments\n```\n\n### 4.2 High Latency\n```bash\n# Step 1: Check database connections\nkubectl exec -n payments deploy\u002Fpayment-service -- \\\n  curl localhost:8080\u002Fmetrics | grep db_pool\n\n# Step 2: Check slow queries (if DB issue)\npsql -h $DB_HOST -U $DB_USER -c \"\n  SELECT pid, now() - query_start AS duration, query\n  FROM pg_stat_activity\n  WHERE state = 'active' AND now() - query_start > interval '5 seconds'\n  ORDER BY duration DESC;\"\n\n# Step 3: Kill long-running queries if needed (set PID to a pid returned by Step 2)\npsql -h $DB_HOST -U $DB_USER -c \"SELECT pg_terminate_backend($PID);\"\n\n# Step 4: Check external dependency latency\ncurl -w \"@curl-format.txt\" -o \u002Fdev\u002Fnull -s https:\u002F\u002Fapi.stripe.com\u002Fv1\u002Fhealth\n\n# Step 5: Enable circuit breaker 
if dependency is slow\nkubectl set env deployment\u002Fpayment-service \\\n  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments\n```\n\n### 4.3 Partial Failures (Specific Errors)\n```bash\n# Step 1: Identify error pattern\nkubectl logs -n payments -l app=payment-service --tail=500 | \\\n  grep -i error | sort | uniq -c | sort -rn | head -20\n\n# Step 2: Check error tracking\n# Go to Sentry: https:\u002F\u002Fsentry.io\u002Fpayments\n\n# Step 3: If specific endpoint, enable feature flag to disable\ncurl -X POST https:\u002F\u002Fapi.company.com\u002Finternal\u002Ffeature-flags \\\n  -d '{\"flag\": \"DISABLE_PROBLEMATIC_FEATURE\", \"enabled\": true}'\n\n# Step 4: If data issue, check recent data changes\npsql -h $DB_HOST -c \"\n  SELECT * FROM audit_log\n  WHERE table_name = 'payment_methods'\n  AND created_at > now() - interval '1 hour';\"\n```\n\n### 4.4 Traffic Surge\n```bash\n# Step 1: Check current pod CPU\u002Fmemory usage (kubectl top does not report request rate)\nkubectl top pods -n payments\n\n# Step 2: Scale horizontally\nkubectl scale deployment\u002Fpayment-service -n payments --replicas=20\n\n# Step 3: Enable rate limiting\nkubectl set env deployment\u002Fpayment-service \\\n  RATE_LIMIT_ENABLED=true \\\n  RATE_LIMIT_RPS=1000 -n payments\n\n# Step 4: If attack, block suspicious IPs\nkubectl apply -f - \u003C\u003CEOF\napiVersion: networking.k8s.io\u002Fv1\nkind: NetworkPolicy\nmetadata:\n  name: block-suspicious\n  namespace: payments\nspec:\n  podSelector:\n    matchLabels:\n      app: payment-service\n  ingress:\n  - from:\n    - ipBlock:\n        cidr: 0.0.0.0\u002F0\n        except:\n        - 192.168.1.0\u002F24  # Suspicious range\nEOF\n```\n\n## Verification Steps\n```bash\n# Verify service is healthy\ncurl -s https:\u002F\u002Fapi.company.com\u002Fpayments\u002Fhealth | jq\n\n# Verify error rate is back to normal\ncurl -s \"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fquery?query=sum(rate(http_requests_total{status=~'5..'}[5m]))\" | jq '.data.result[0].value[1]'\n\n# Verify latency is 
acceptable\ncurl -s \"http:\u002F\u002Fprometheus:9090\u002Fapi\u002Fv1\u002Fquery?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))\" | jq\n\n# Smoke test critical flows\n.\u002Fscripts\u002Fsmoke-test-payments.sh\n```\n\n## Rollback Procedures\n```bash\n# Rollback Kubernetes deployment\nkubectl rollout undo deployment\u002Fpayment-service -n payments\n\n# Rollback database migration (if applicable)\n.\u002Fscripts\u002Fdb-rollback.sh $MIGRATION_VERSION\n\n# Rollback feature flag\ncurl -X POST https:\u002F\u002Fapi.company.com\u002Finternal\u002Ffeature-flags \\\n  -d '{\"flag\": \"NEW_PAYMENT_FLOW\", \"enabled\": false}'\n```\n\n## Escalation Matrix\n\n| Condition | Escalate To | Contact |\n|-----------|-------------|---------|\n| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |\n| Data breach suspected | Security Team | #security-incidents |\n| Financial impact > $10k | Finance + Legal | @finance-oncall |\n| Customer communication needed | Support Lead | @support-lead |\n\n## Communication Templates\n\n### Initial Notification (Internal)\n```\n🚨 INCIDENT: Payment Service Degradation\n\nSeverity: SEV2\nStatus: Investigating\nImpact: ~20% of payment requests failing\nStart Time: [TIME]\nIncident Commander: [NAME]\n\nCurrent Actions:\n- Investigating root cause\n- Scaling up service\n- Monitoring dashboards\n\nUpdates in #payments-incidents\n```\n\n### Status Update\n```\n📊 UPDATE: Payment Service Incident\n\nStatus: Mitigating\nImpact: Reduced to ~5% failure rate\nDuration: 25 minutes\n\nActions Taken:\n- Rolled back deployment v2.3.4 → v2.3.3\n- Scaled service from 5 → 10 replicas\n\nNext Steps:\n- Continuing to monitor\n- Root cause analysis in progress\n\nETA to Resolution: ~15 minutes\n```\n\n### Resolution Notification\n```\n✅ RESOLVED: Payment Service Incident\n\nDuration: 45 minutes\nImpact: ~5,000 affected transactions\nRoot Cause: Memory leak in v2.3.4\n\nResolution:\n- Rolled back to v2.3.3\n- 
Transactions auto-retried successfully\n\nFollow-up:\n- Postmortem scheduled for [DATE]\n- Bug fix in progress\n```\n```\n\n### Template 2: Database Incident Runbook\n\n```markdown\n# Database Incident Runbook\n\n## Quick Reference\n| Issue | Command |\n|-------|---------|\n| Check connections | `SELECT count(*) FROM pg_stat_activity;` |\n| Kill query | `SELECT pg_terminate_backend(pid);` |\n| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |\n| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |\n\n## Connection Pool Exhaustion\n```sql\n-- Check current connections\nSELECT datname, usename, state, count(*)\nFROM pg_stat_activity\nGROUP BY datname, usename, state\nORDER BY count(*) DESC;\n\n-- Identify long-running connections\nSELECT pid, usename, datname, state, query_start, query\nFROM pg_stat_activity\nWHERE state != 'idle'\nORDER BY query_start;\n\n-- Terminate idle connections\nSELECT pg_terminate_backend(pid)\nFROM pg_stat_activity\nWHERE state = 'idle'\nAND query_start \u003C now() - interval '10 minutes';\n```\n\n## Replication Lag\n```sql\n-- Check lag on replica\nSELECT\n  CASE\n    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0\n    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())\n  END AS lag_seconds;\n\n-- If lag > 60s, consider:\n-- 1. Check network between primary\u002Freplica\n-- 2. Check replica disk I\u002FO\n-- 3. 
Consider failover if unrecoverable\n```\n\n## Disk Space Critical\n```bash\n# Check disk usage\ndf -h \u002Fvar\u002Flib\u002Fpostgresql\u002Fdata\n\n# Find large tables\npsql -c \"SELECT relname, pg_size_pretty(pg_total_relation_size(relid))\nFROM pg_catalog.pg_statio_user_tables\nORDER BY pg_total_relation_size(relid) DESC\nLIMIT 10;\"\n\n# VACUUM FULL to reclaim space (rewrites the table: needs free disk for the copy and takes an ACCESS EXCLUSIVE lock)\npsql -c \"VACUUM FULL large_table;\"\n\n# If emergency, delete old data or expand disk\n```\n```\n\n## Best Practices\n\n### Do's\n- **Keep runbooks updated** - Review after every incident\n- **Test runbooks regularly** - Game days, chaos engineering\n- **Include rollback steps** - Always have an escape hatch\n- **Document assumptions** - What must be true for steps to work\n- **Link to dashboards** - Quick access during stress\n\n### Don'ts\n- **Don't assume knowledge** - Write for 3 AM brain\n- **Don't skip verification** - Confirm each step worked\n- **Don't forget communication** - Keep stakeholders informed\n- **Don't work alone** - Escalate early\n- **Don't skip postmortems** - Learn from every incident\n\n## Resources\n\n- [Google SRE Book - Incident Management](https:\u002F\u002Fsre.google\u002Fsre-book\u002Fmanaging-incidents\u002F)\n- [PagerDuty Incident Response](https:\u002F\u002Fresponse.pagerduty.com\u002F)\n- [Atlassian Incident Management](https:\u002F\u002Fwww.atlassian.com\u002Fincident-management)\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,160,1526,"2026-05-16 
13:23:39",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Other","other","mdi-page-next-outline","Other skill types",5,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":32,"skillCount":33,"createdAt":26},"Career Development","career","mdi-briefcase-outline","Interview preparation, resume optimization, career planning",4,575,[35],{"id":36,"skillId":4,"version":37,"fileName":38,"fileSize":39,"filePath":40,"fileHash":41,"manifest":42,"createdAt":19},"3c4810dd-08c4-47b9-807b-5289476b6745","1.0.0","incident-runbook-templates.zip",4762,"uploads\u002Fskills\u002Fd5bfa4ae-e175-41e7-a09c-6a00106a948a\u002Fincident-runbook-templates.zip","3be18fbf0a144a00e39bbd2657c7f81042f07ed293c56c97eb20bbf464b81bfa","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":11310}]",{"code":44,"message":45,"data":46},200,"success",{"items":47,"stats":48,"page":51},[],{"averageRating":49,"totalRatings":49,"ratingCounts":50},0,[49,49,49,49,49],{"limit":52,"offset":49,"hasMore":53,"nextOffset":52,"ratedOnly":16},15,false]