[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-c3169ced-1c7c-4660-9028-00ec9a647c58":3,"$fDoojuzpZMNr4hFqazdHjtM55zgwBa57nEmZATfWdbsM":43},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":34},"c3169ced-1c7c-4660-9028-00ec9a647c58","on-call-handoff-patterns","Effective on-call handoff patterns that ensure continuity, context transfer, and reliable incident response across shifts.","cat_life_career","mod_other","sickn33,other","---\nname: on-call-handoff-patterns\ndescription: \"Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.\"\nrisk: unknown\nsource: community\ndate_added: \"2026-02-27\"\n---\n\n# On-Call Handoff Patterns\n\nEffective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.\n\n## Do not use this skill when\n\n- The task is unrelated to on-call handoff patterns\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources\u002Fimplementation-playbook.md`.\n\n## Use this skill when\n\n- Transitioning on-call responsibilities\n- Writing shift handoff summaries\n- Documenting ongoing investigations\n- Establishing on-call rotation procedures\n- Improving handoff quality\n- Onboarding new on-call engineers\n\n## Core Concepts\n\n### 1. 
Handoff Components\n\n| Component | Purpose |\n|-----------|---------|\n| **Active Incidents** | What's currently broken |\n| **Ongoing Investigations** | Issues being debugged |\n| **Recent Changes** | Deployments, configs |\n| **Known Issues** | Workarounds in place |\n| **Upcoming Events** | Maintenance, releases |\n\n### 2. Handoff Timing\n\n```\nRecommended: 30 min overlap between shifts\n\nOutgoing:\n├── 15 min: Write handoff document\n└── 15 min: Sync call with incoming\n\nIncoming:\n├── 15 min: Review handoff document\n├── 15 min: Sync call with outgoing\n└── 5 min: Verify alerting setup\n```\n\n## Templates\n\n### Template 1: Shift Handoff Document\n\n```markdown\n# On-Call Handoff: Platform Team\n\n**Outgoing**: @alice (2024-01-15 to 2024-01-22)\n**Incoming**: @bob (2024-01-22 to 2024-01-29)\n**Handoff Time**: 2024-01-22 09:00 UTC\n\n---\n\n## 🔴 Active Incidents\n\n### None currently active\nNo active incidents at handoff time.\n\n---\n\n## 🟡 Ongoing Investigations\n\n### 1. Intermittent API Timeouts (ENG-1234)\n**Status**: Investigating\n**Started**: 2024-01-20\n**Impact**: ~0.1% of requests timing out\n\n**Context**:\n- Timeouts correlate with database backup window (02:00-03:00 UTC)\n- Suspect backup process causing lock contention\n- Added extra logging in PR #567 (deployed 01\u002F21)\n\n**Next Steps**:\n- [ ] Review new logs after tonight's backup\n- [ ] Consider moving backup window if confirmed\n\n**Resources**:\n- Dashboard: [API Latency](https:\u002F\u002Fgrafana\u002Fd\u002Fapi-latency)\n- Thread: #platform-eng (01\u002F20, 14:32)\n\n---\n\n### 2. 
Memory Growth in Auth Service (ENG-1235)\n**Status**: Monitoring\n**Started**: 2024-01-18\n**Impact**: None yet (proactive)\n\n**Context**:\n- Memory usage growing ~5% per day\n- No memory leak found in profiling\n- Suspect connection pool not releasing properly\n\n**Next Steps**:\n- [ ] Review heap dump from 01\u002F21\n- [ ] Consider restart if usage > 80%\n\n**Resources**:\n- Dashboard: [Auth Service Memory](https:\u002F\u002Fgrafana\u002Fd\u002Fauth-memory)\n- Analysis doc: [Memory Investigation](https:\u002F\u002Fdocs\u002Feng-1235)\n\n---\n\n## 🟢 Resolved This Shift\n\n### Payment Service Outage (2024-01-19)\n- **Duration**: 23 minutes\n- **Root Cause**: Database connection exhaustion\n- **Resolution**: Rolled back v2.3.4, increased pool size\n- **Postmortem**: [POSTMORTEM-89](https:\u002F\u002Fdocs\u002Fpostmortem-89)\n- **Follow-up tickets**: ENG-1230, ENG-1231\n\n---\n\n## 📋 Recent Changes\n\n### Deployments\n| Service | Version | Time | Notes |\n|---------|---------|------|-------|\n| api-gateway | v3.2.1 | 01\u002F21 14:00 | Bug fix for header parsing |\n| user-service | v2.8.0 | 01\u002F20 10:00 | New profile features |\n| auth-service | v4.1.2 | 01\u002F19 16:00 | Security patch |\n\n### Configuration Changes\n- 01\u002F21: Increased API rate limit from 1000 to 1500 RPS\n- 01\u002F20: Updated database connection pool max from 50 to 75\n\n### Infrastructure\n- 01\u002F20: Added 2 nodes to Kubernetes cluster\n- 01\u002F19: Upgraded Redis from 6.2 to 7.0\n\n---\n\n## ⚠️ Known Issues & Workarounds\n\n### 1. Slow Dashboard Loading\n**Issue**: Grafana dashboards slow on Monday mornings\n**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up\n**Ticket**: OPS-456 (P3)\n\n### 2. 
Flaky Integration Test\n**Issue**: `test_payment_flow` fails intermittently in CI\n**Workaround**: Re-run failed job (usually passes on retry)\n**Ticket**: ENG-1200 (P2)\n\n---\n\n## 📅 Upcoming Events\n\n| Date | Event | Impact | Contact |\n|------|-------|--------|---------|\n| 01\u002F23 02:00 | Database maintenance | 5 min read-only | @dba-team |\n| 01\u002F24 14:00 | Major release v5.0 | Monitor closely | @release-team |\n| 01\u002F25 | Marketing campaign | 2x traffic expected | @platform |\n\n---\n\n## 📞 Escalation Reminders\n\n| Issue Type | First Escalation | Second Escalation |\n|------------|------------------|-------------------|\n| Payment issues | @payments-oncall | @payments-manager |\n| Auth issues | @auth-oncall | @security-team |\n| Database issues | @dba-team | @infra-manager |\n| Unknown\u002Fsevere | @engineering-manager | @vp-engineering |\n\n---\n\n## 🔧 Quick Reference\n\n### Common Commands\n```bash\n# Check service health\nkubectl get pods -A | grep -v Running\n\n# Recent deployments\nkubectl get events --sort-by='.lastTimestamp' | tail -20\n\n# Database connections\npsql -c \"SELECT count(*) FROM pg_stat_activity;\"\n\n# Clear cache (emergency only)\nredis-cli FLUSHDB\n```\n\n### Important Links\n- [Runbooks](https:\u002F\u002Fwiki\u002Frunbooks)\n- [Service Catalog](https:\u002F\u002Fwiki\u002Fservices)\n- [Incident Slack](https:\u002F\u002Fslack.com\u002Fincidents)\n- [PagerDuty](https:\u002F\u002Fpagerduty.com\u002Fschedules)\n\n---\n\n## Handoff Checklist\n\n### Outgoing Engineer\n- [x] Document active incidents\n- [x] Document ongoing investigations\n- [x] List recent changes\n- [x] Note known issues\n- [x] Add upcoming events\n- [x] Sync with incoming engineer\n\n### Incoming Engineer\n- [ ] Read this document\n- [ ] Join sync call\n- [ ] Verify PagerDuty is routing to you\n- [ ] Verify Slack notifications working\n- [ ] Check VPN\u002Faccess working\n- [ ] Review critical dashboards\n```\n\n### Template 2: Quick Handoff 
(Async)\n\n```markdown\n# Quick Handoff: @alice → @bob\n\n## TL;DR\n- No active incidents\n- 1 investigation ongoing (API timeouts, see ENG-1234)\n- Major release tomorrow (01\u002F24) - be ready for issues\n\n## Watch List\n1. API latency around 02:00-03:00 UTC (backup window)\n2. Auth service memory (restart if > 80%)\n\n## Recent\n- Deployed api-gateway v3.2.1 yesterday (stable)\n- Increased rate limits to 1500 RPS\n\n## Coming Up\n- 01\u002F23 02:00 - DB maintenance (5 min read-only)\n- 01\u002F24 14:00 - v5.0 release\n\n## Questions?\nI'll be available on Slack until 17:00 today.\n```\n\n### Template 3: Incident Handoff (Mid-Incident)\n\n```markdown\n# INCIDENT HANDOFF: Payment Service Degradation\n\n**Incident Start**: 2024-01-22 08:15 UTC\n**Current Status**: Mitigating\n**Severity**: SEV2\n\n---\n\n## Current State\n- Error rate: 15% (down from 40%)\n- Mitigation in progress: scaling up pods\n- ETA to resolution: ~30 min\n\n## What We Know\n1. Root cause: Memory pressure on payment-service pods\n2. Triggered by: Unusual traffic spike (3x normal)\n3. Contributing: Inefficient query in checkout flow\n\n## What We've Done\n- Scaled payment-service from 5 → 15 pods\n- Enabled rate limiting on checkout endpoint\n- Disabled non-critical features\n\n## What Needs to Happen\n1. Monitor error rate - should reach \u003C1% in ~15 min\n2. If not improving, escalate to @payments-manager\n3. 
Once stable, begin root cause investigation\n\n## Key People\n- Incident Commander: @alice (handing off)\n- Comms Lead: @charlie\n- Technical Lead: @bob (incoming)\n\n## Communication\n- Status page: Updated at 08:45\n- Customer support: Notified\n- Exec team: Aware\n\n## Resources\n- Incident channel: #inc-20240122-payment\n- Dashboard: [Payment Service](https:\u002F\u002Fgrafana\u002Fd\u002Fpayments)\n- Runbook: [Payment Degradation](https:\u002F\u002Fwiki\u002Frunbooks\u002Fpayments)\n\n---\n\n**Incoming on-call (@bob) - Please confirm you have:**\n- [ ] Joined #inc-20240122-payment\n- [ ] Access to dashboards\n- [ ] Understand current state\n- [ ] Know escalation path\n```\n\n## Handoff Sync Meeting\n\n### Agenda (15 minutes)\n\n```markdown\n## Handoff Sync: @alice → @bob\n\n1. **Active Issues** (5 min)\n   - Walk through any ongoing incidents\n   - Discuss investigation status\n   - Transfer context and theories\n\n2. **Recent Changes** (3 min)\n   - Deployments to watch\n   - Config changes\n   - Known regressions\n\n3. **Upcoming Events** (3 min)\n   - Maintenance windows\n   - Expected traffic changes\n   - Releases planned\n\n4. 
**Questions** (4 min)\n   - Clarify anything unclear\n   - Confirm access and alerting\n   - Exchange contact info\n```\n\n## On-Call Best Practices\n\n### Before Your Shift\n\n```markdown\n## Pre-Shift Checklist\n\n### Access Verification\n- [ ] VPN working\n- [ ] kubectl access to all clusters\n- [ ] Database read access\n- [ ] Log aggregator access (Splunk\u002FDatadog)\n- [ ] PagerDuty app installed and logged in\n\n### Alerting Setup\n- [ ] PagerDuty schedule shows you as primary\n- [ ] Phone notifications enabled\n- [ ] Slack notifications for incident channels\n- [ ] Test alert received and acknowledged\n\n### Knowledge Refresh\n- [ ] Review recent incidents (past 2 weeks)\n- [ ] Check service changelog\n- [ ] Skim critical runbooks\n- [ ] Know escalation contacts\n\n### Environment Ready\n- [ ] Laptop charged and accessible\n- [ ] Phone charged\n- [ ] Quiet space available for calls\n- [ ] Secondary contact identified (if traveling)\n```\n\n### During Your Shift\n\n```markdown\n## Daily On-Call Routine\n\n### Morning (start of day)\n- [ ] Check overnight alerts\n- [ ] Review dashboards for anomalies\n- [ ] Check for any P0\u002FP1 tickets created\n- [ ] Skim incident channels for context\n\n### Throughout Day\n- [ ] Respond to alerts within SLA\n- [ ] Document investigation progress\n- [ ] Update team on significant issues\n- [ ] Triage incoming pages\n\n### End of Day\n- [ ] Hand off any active issues\n- [ ] Update investigation docs\n- [ ] Note anything for next shift\n```\n\n### After Your Shift\n\n```markdown\n## Post-Shift Checklist\n\n- [ ] Complete handoff document\n- [ ] Sync with incoming on-call\n- [ ] Verify PagerDuty routing changed\n- [ ] Close\u002Fupdate investigation tickets\n- [ ] File postmortems for any incidents\n- [ ] Take time off if shift was stressful\n```\n\n## Escalation Guidelines\n\n### When to Escalate\n\n```markdown\n## Escalation Triggers\n\n### Immediate Escalation\n- SEV1 incident declared\n- Data breach suspected\n- Unable 
to diagnose within 30 min\n- Customer or legal escalation received\n\n### Consider Escalation\n- Issue spans multiple teams\n- Requires expertise you don't have\n- Business impact exceeds threshold\n- You're uncertain about next steps\n\n### How to Escalate\n1. Page the appropriate escalation path\n2. Provide brief context in Slack\n3. Stay engaged until the escalation contact acknowledges\n4. Hand off cleanly; don't just disappear\n```\n\n## Best Practices\n\n### Do's\n- **Document everything** - Future you will thank you\n- **Escalate early** - Better safe than sorry\n- **Take breaks** - Alert fatigue is real\n- **Prefer synchronous handoffs** - Async handoffs lose context\n- **Test your setup** - Before incidents, not during\n\n### Don'ts\n- **Don't skip handoffs** - Context loss causes incidents\n- **Don't play the hero** - Escalate when needed\n- **Don't ignore alerts** - Even if they seem minor\n- **Don't work sick** - Swap shifts instead\n- **Don't disappear** - Stay reachable during shift\n\n## Resources\n\n- [Google SRE - Being On-Call](https:\u002F\u002Fsre.google\u002Fsre-book\u002Fbeing-on-call\u002F)\n- [PagerDuty On-Call Guide](https:\u002F\u002Fwww.pagerduty.com\u002Fresources\u002Flearn\u002Fon-call-management\u002F)\n- [Increment On-Call Issue](https:\u002F\u002Fincrement.com\u002Fon-call\u002F)\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,235,1709,"2026-05-16 13:32:47",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Other","other","mdi-page-next-outline","Other types of skills",5,"2026-05-16 
12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":32,"skillCount":33,"createdAt":26},"Career Development","career","mdi-briefcase-outline","Interview preparation, resume optimization, career planning",4,575,[35],{"id":36,"skillId":4,"version":37,"fileName":38,"fileSize":39,"filePath":40,"fileHash":41,"manifest":42,"createdAt":19},"78ef551e-6d70-4cb3-8f59-b743dfaa7339","1.0.0","on-call-handoff-patterns.zip",5127,"uploads\u002Fskills\u002Fc3169ced-1c7c-4660-9028-00ec9a647c58\u002Fon-call-handoff-patterns.zip","103128169e35022ece12e9f397e60a4bc8c1f156ac8da6141adde89ae7c50eb3","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":11839}]",{"code":44,"message":45,"data":46},200,"success",{"items":47,"stats":48,"page":51},[],{"averageRating":49,"totalRatings":49,"ratingCounts":50},0,[49,49,49,49,49],{"limit":52,"offset":49,"hasMore":53,"nextOffset":52,"ratedOnly":16},15,false]