[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-12c959ae-cfdd-440b-bf63-1587929d92b1":3,"$fxfGjULZEp_3q_I-QAqyn-uKeVt5ObkeYs4y76QVxV30":43},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":34},"12c959ae-cfdd-440b-bf63-1587929d92b1","self-eval","Honestly evaluate AI work quality using a two-axis scoring system. Use after completing a task, code review, or work session to get an unbiased assessment. Detects score inflation, forces devil's advocate reasoning, and persists scores across sessions.","cat_life_career","mod_other","alirezarezvani,other","---\nname: \"self-eval\"\ndescription: \"Honestly evaluate AI work quality using a two-axis scoring system. Use after completing a task, code review, or work session to get an unbiased assessment. Detects score inflation, forces devil's advocate reasoning, and persists scores across sessions.\"\nlicense: \"MIT\"\n---\n\n# Self-Eval: Honest Work Evaluation\n\nultrathink\n\n**Tier:** STANDARD\n**Category:** Engineering \u002F Quality\n**Dependencies:** None (prompt-only, no external tools required)\n\n## Description\n\nSelf-eval is a Claude Code skill that produces honest, calibrated work evaluations. It replaces the default AI tendency to rate everything 4\u002F5 with a structured two-axis scoring system, mandatory devil's advocate reasoning, and cross-session anti-inflation detection.\n\nThe core insight: AI self-assessment converges to \"everything is a 4\" because a single-axis score conflates task difficulty with execution quality. 
Self-eval separates these axes, then combines them via a fixed matrix that the model cannot override.\n\n## Features\n\n- **Two-axis scoring** — Independently rates task ambition (Low\u002FMedium\u002FHigh) and execution quality (Poor\u002FAdequate\u002FStrong), then combines via a lookup matrix\n- **Mandatory devil's advocate** — Before finalizing, must argue for both higher AND lower scores, then resolve the tension\n- **Score persistence** — Appends scores to `.self-eval-scores.jsonl` in the working directory, building history across sessions\n- **Anti-inflation detection** — Reads past scores and flags clustering (4+ of last 5 identical)\n- **Matrix-locked scoring** — The composite score comes from the matrix, not from direct selection. Low ambition caps at 2\u002F5 regardless of execution quality\n\n## Usage\n\nAfter completing work in a Claude Code session:\n\n```\n\u002Fself-eval\n```\n\nWith context about what to evaluate:\n\n```\n\u002Fself-eval the authentication refactor we just completed\n```\n\nThe skill reviews the conversation history, identifies what was accomplished, scores it on both axes, runs the devil's advocate check, and appends the result to the score history file.\n\n## Examples\n\n### Example 1: Feature Implementation\n\n```\n\u002Fself-eval added pagination to the user list API\n```\n\nOutput:\n```\n## Self-Evaluation\n\n**Task:** Added cursor-based pagination to the user list API endpoint\n**Ambition:** Medium — New feature with established patterns, but required careful backwards compatibility\n**Execution:** Strong — All tests pass, handles edge cases, existing clients unaffected\n\n**Devil's Advocate:**\n- Lower: Pagination is well-documented. The pattern was copied from another endpoint with minor changes. Could argue Low ambition.\n- Higher: The backwards-compat requirement made this harder than a greenfield implementation. 
Had to support both offset and cursor modes.\n- Resolution: Medium ambition holds — the backwards-compat constraint added real complexity beyond copy-paste.\n\n**Score: 4** — Solid execution on meaningful work with real constraints.\n```\n\n### Example 2: Bug Fix\n\n```\n\u002Fself-eval\n```\n\nOutput:\n```\n## Self-Evaluation\n\n**Task:** Fixed race condition in the checkout flow causing duplicate orders\n**Ambition:** Medium — Required debugging across 3 services to find the root cause\n**Execution:** Adequate — Fix works but the test only covers the exact scenario reported, not the general class of race conditions\n\n**Devil's Advocate:**\n- Lower: The fix was a single line (adding a mutex). Once found, it was trivial.\n- Higher: Finding the root cause across 3 services took real investigation. The mutex placement required understanding the full flow.\n- Resolution: Medium ambition is correct for the investigation, but execution drops to Adequate — a more thorough fix would address the pattern, not just the instance.\n\n**Score: 3** — Good debugging work but the fix is narrow.\n```\n\n---\n\n## What to Evaluate\n\n$ARGUMENTS\n\nIf no arguments are provided, review the full conversation history to identify what was accomplished this session. Summarize the work in one sentence before scoring.\n\n## How to Score — Two-Axis Model\n\nScore on two independent axes, then combine using the matrix. Do NOT pick a number first and rationalize it — rate each axis separately, then read the matrix.\n\n### Axis 1: Task Ambition (what was attempted)\n\nRate the difficulty and risk of what was worked on. NOT how well it was done.\n\n- **Low (1)** — Safe, familiar, routine. No real risk of failure. Examples: minor config changes, simple refactors, copy-paste with small modifications, tasks you were confident you'd complete before starting.\n- **Medium (2)** — Meaningful work with novelty or challenge. Partial failure was possible. 
Examples: new feature implementation, integrating an unfamiliar API, architectural changes, debugging a tricky issue.\n- **High (3)** — Ambitious, unfamiliar, or high-stakes. Real risk of complete failure. Examples: building something from scratch in an unfamiliar domain, complex system redesign, performance-critical optimization, shipping to production under pressure.\n\n**Self-check:** If you were confident of success before starting, ambition is Low or Medium, not High.\n\n### Axis 2: Execution Quality (how well it was done)\n\nRate the quality of the actual output, independent of how ambitious the task was.\n\n- **Poor (1)** — Major failures, incomplete, wrong output, or abandoned mid-task. The deliverable doesn't meet its own stated criteria.\n- **Adequate (2)** — Completed but with gaps, shortcuts, or missing rigor. Did the thing but left obvious improvements on the table.\n- **Strong (3)** — Well-executed, thorough, quality output. No obvious improvements left undone given the scope.\n\n### Composite Score Matrix\n\n|                        | Poor Exec (1) | Adequate Exec (2) | Strong Exec (3) |\n|------------------------|:---:|:---:|:---:|\n| **Low Ambition (1)**   |  1  |  2  |  2  |\n| **Medium Ambition (2)**|  2  |  3  |  4  |\n| **High Ambition (3)**  |  2  |  4  |  5  |\n\n**Read the matrix, don't override it.** The composite is your score. The devil's advocate below can cause you to re-rate an axis — but you cannot directly override the matrix result.\n\nKey properties:\n- Low ambition caps at 2. Safe work done perfectly is still safe work.\n- A 5 requires BOTH high ambition AND strong execution. It should be rare.\n- High ambition + poor execution = 2. Bold failure hurts.\n- The most common honest score for solid work is 3 (medium ambition, adequate execution).\n\n## Devil's Advocate (MANDATORY)\n\nBefore writing your final score, you MUST write all three of these:\n\n1. **Case for LOWER:** Why might this work deserve a lower score? 
What was easy, what was avoided, what was less ambitious than it appears? Would a skeptical reviewer agree with your axis ratings?\n2. **Case for HIGHER:** Why might this work deserve a higher score? What was genuinely challenging, surprising, or exceeded the original plan?\n3. **Resolution:** If either case reveals you mis-rated an axis, re-rate it and recompute the matrix result. Then state your final score with a 1-2 sentence justification that addresses at least one point from each case.\n\nIf your devil's advocate is fewer than 3 sentences total, you're not engaging with it — try harder.\n\n## Anti-Inflation Check\n\nCheck for a score history file at `.self-eval-scores.jsonl` in the current working directory.\n\nIf the file exists, read it and check the last 5 scores. If 4+ of the last 5 are the same number, flag it:\n> **Warning: Score clustering detected.** Last 5 scores: [list]. Consider whether you're anchoring to a default.\n\nIf the file doesn't exist, ask yourself: \"Would an outside observer rate this the same way I am?\"\n\n## Score Persistence\n\nAfter presenting your evaluation, append one line to `.self-eval-scores.jsonl` in the current working directory:\n\n```json\n{\"date\":\"YYYY-MM-DD\",\"score\":N,\"ambition\":\"Low|Medium|High\",\"execution\":\"Poor|Adequate|Strong\",\"task\":\"1-sentence summary\"}\n```\n\nThis enables the anti-inflation check to work across sessions. 
If the file doesn't exist, create it.\n\n## Output Format\n\nPresent your evaluation as:\n\n## Self-Evaluation\n\n**Task:** [1-sentence summary of what was attempted]\n**Ambition:** [Low\u002FMedium\u002FHigh] — [1-sentence justification]\n**Execution:** [Poor\u002FAdequate\u002FStrong] — [1-sentence justification]\n\n**Devil's Advocate:**\n- Lower: [why it might deserve less]\n- Higher: [why it might deserve more]\n- Resolution: [final reasoning]\n\n**Score: [1-5]** — [1-sentence final justification]\n","","imported","https:\u002F\u002Fgithub.com\u002Falirezarezvani\u002Fclaude-skills","user_system_seed","SkillOPIC",true,113,1484,"2026-05-16 13:55:11",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Other","other","mdi-page-next-outline","Other types of skills",5,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":32,"skillCount":33,"createdAt":26},"Career Development","career","mdi-briefcase-outline","Interview preparation, resume optimization, career planning",4,575,[35],{"id":36,"skillId":4,"version":37,"fileName":38,"fileSize":39,"filePath":40,"fileHash":41,"manifest":42,"createdAt":19},"3fc20d0b-cc70-479f-b43c-2186addb6dd1","1.0.0","self-eval.zip",3645,"uploads\u002Fskills\u002F12c959ae-cfdd-440b-bf63-1587929d92b1\u002Fself-eval.zip","a4ecff5cde00f4e7127b5128de5951e45c9f5ef457da6b4b40b7320920bced77","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":8457}]",{"code":44,"message":45,"data":46},200,"success",{"items":47,"stats":48,"page":51},[],{"averageRating":49,"totalRatings":49,"ratingCounts":50},0,[49,49,49,49,49],{"limit":52,"offset":49,"hasMore":53,"nextOffset":52,"ratedOnly":16},15,false]