[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-4c1b4133-b163-43c4-afe1-3a59f8d6ba5e":3,"$f1F0atAbbLB2radSbknwRKuWC2BgzXox6cbejUaqR8e0":42},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":33},"4c1b4133-b163-43c4-afe1-3a59f8d6ba5e","advanced-evaluation","This skill should be used when the user asks to \"implement LLM-as-judge\", \"compare model outputs\", \"create evaluation rubrics\", \"mitigate evaluation bias\", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.","cat_coding_backend","mod_coding","sickn33,coding","---\nname: advanced-evaluation\ndescription: This skill should be used when the user asks to \"implement LLM-as-judge\", \"compare model outputs\", \"create evaluation rubrics\", \"mitigate evaluation bias\", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.\nrisk: safe\nsource: community\ndate_added: 2026-03-18\n---\n\n# Advanced Evaluation\n\nThis skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.\n\n**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. 
Choosing the right approach and mitigating known biases is the core competency this skill develops.\n\n## When to Use\nActivate this skill when:\n\n- Building automated evaluation pipelines for LLM outputs\n- Comparing multiple model responses to select the best one\n- Establishing consistent quality standards across evaluation teams\n- Debugging evaluation systems that show inconsistent results\n- Designing A\u002FB tests for prompt or model changes\n- Creating rubrics for human or automated evaluation\n- Analyzing correlation between automated and human judgments\n\n## Core Concepts\n\n### The Evaluation Taxonomy\n\nEvaluation approaches fall into two primary categories with distinct reliability profiles:\n\n**Direct Scoring**: A single LLM rates one response on a defined scale.\n- Best for: Objective criteria (factual accuracy, instruction following, toxicity)\n- Reliability: Moderate to high for well-defined criteria\n- Failure mode: Score calibration drift, inconsistent scale interpretation\n\n**Pairwise Comparison**: An LLM compares two responses and selects the better one.\n- Best for: Subjective preferences (tone, style, persuasiveness)\n- Reliability: Higher than direct scoring for preferences\n- Failure mode: Position bias, length bias\n\nResearch from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.\n\n### The Bias Landscape\n\nLLM judges exhibit systematic biases that must be actively mitigated:\n\n**Position Bias**: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.\n\n**Length Bias**: Longer responses are rated higher regardless of quality. 
Mitigation: Explicit prompting to ignore length, length-normalized scoring.\n\n**Self-Enhancement Bias**: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.\n\n**Verbosity Bias**: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.\n\n**Authority Bias**: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.\n\n### Metric Selection Framework\n\nChoose metrics based on the evaluation task structure:\n\n| Task Type | Primary Metrics | Secondary Metrics |\n|-----------|-----------------|-------------------|\n| Binary classification (pass\u002Ffail) | Recall, Precision, F1 | Cohen's κ |\n| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |\n| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |\n| Multi-label | Macro-F1, Micro-F1 | Per-label precision\u002Frecall |\n\nThe critical insight: High absolute agreement matters less than systematic disagreement patterns. 
A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.\n\n## Evaluation Approaches\n\n### Direct Scoring Implementation\n\nDirect scoring requires three components: clear criteria, a calibrated scale, and structured output format.\n\n**Criteria Definition Pattern**:\n```\nCriterion: [Name]\nDescription: [What this criterion measures]\nWeight: [Relative importance, 0-1]\n```\n\n**Scale Calibration**:\n- 1-3 scales: Binary with neutral option, lowest cognitive load\n- 1-5 scales: Standard Likert, good balance of granularity and reliability\n- 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics\n\n**Prompt Structure for Direct Scoring**:\n```\nYou are an expert evaluator assessing response quality.\n\n## Task\nEvaluate the following response against each criterion.\n\n## Original Prompt\n{prompt}\n\n## Response to Evaluate\n{response}\n\n## Criteria\n{for each criterion: name, description, weight}\n\n## Instructions\nFor each criterion:\n1. Find specific evidence in the response\n2. Score according to the rubric (1-{max} scale)\n3. Justify your score with evidence\n4. Suggest one specific improvement\n\n## Output Format\nRespond with structured JSON containing scores, justifications, and summary.\n```\n\n**Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.\n\n### Pairwise Comparison Implementation\n\nPairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.\n\n**Position Bias Mitigation Protocol**:\n1. First pass: Response A in first position, Response B in second\n2. Second pass: Response B in first position, Response A in second\n3. Consistency check: If passes disagree, return TIE with reduced confidence\n4. 
Final verdict: Consistent winner with averaged confidence\n\n**Prompt Structure for Pairwise Comparison**:\n```\nYou are an expert evaluator comparing two AI responses.\n\n## Critical Instructions\n- Do NOT prefer responses because they are longer\n- Do NOT prefer responses based on position (first vs second)\n- Focus ONLY on quality according to the specified criteria\n- Ties are acceptable when responses are genuinely equivalent\n\n## Original Prompt\n{prompt}\n\n## Response A\n{response_a}\n\n## Response B\n{response_b}\n\n## Comparison Criteria\n{criteria list}\n\n## Instructions\n1. Analyze each response independently first\n2. Compare them on each criterion\n3. Determine overall winner with confidence level\n\n## Output Format\nJSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.\n```\n\n**Confidence Calibration**: Confidence scores should reflect position consistency:\n- Both passes agree: confidence = average of individual confidences\n- Passes disagree: confidence = 0.5, verdict = TIE\n\n### Rubric Generation\n\nWell-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.\n\n**Rubric Components**:\n1. **Level descriptions**: Clear boundaries for each score level\n2. **Characteristics**: Observable features that define each level\n3. **Examples**: Representative text for each level (optional but valuable)\n4. **Edge cases**: Guidance for ambiguous situations\n5. **Scoring guidelines**: General principles for consistent application\n\n**Strictness Calibration**:\n- **Lenient**: Lower bar for passing scores, appropriate for encouraging iteration\n- **Balanced**: Fair, typical expectations for production use\n- **Strict**: High standards, appropriate for safety-critical or high-stakes evaluation\n\n**Domain Adaptation**: Rubrics should use domain-specific terminology. A \"code readability\" rubric mentions variables, functions, and comments. 
A \"medical accuracy\" rubric references clinical terminology and evidence standards.\n\n## Practical Guidance\n\n### Evaluation Pipeline Design\n\nProduction evaluation systems require multiple layers:\n\n```\n┌─────────────────────────────────────────────────┐\n│                 Evaluation Pipeline              │\n├─────────────────────────────────────────────────┤\n│                                                   │\n│  Input: Response + Prompt + Context               │\n│           │                                       │\n│           ▼                                       │\n│  ┌─────────────────────┐                         │\n│  │   Criteria Loader   │ ◄── Rubrics, weights    │\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  ┌─────────────────────┐                         │\n│  │   Primary Scorer    │ ◄── Direct or Pairwise  │\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  ┌─────────────────────┐                         │\n│  │   Bias Mitigation   │ ◄── Position swap, etc. 
│\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  ┌─────────────────────┐                         │\n│  │ Confidence Scoring  │ ◄── Calibration         │\n│  └──────────┬──────────┘                         │\n│             │                                     │\n│             ▼                                     │\n│  Output: Scores + Justifications + Confidence     │\n│                                                   │\n└─────────────────────────────────────────────────┘\n```\n\n### Common Anti-Patterns\n\n**Anti-pattern: Scoring without justification**\n- Problem: Scores lack grounding, difficult to debug or improve\n- Solution: Always require evidence-based justification before score\n\n**Anti-pattern: Single-pass pairwise comparison**\n- Problem: Position bias corrupts results\n- Solution: Always swap positions and check consistency\n\n**Anti-pattern: Overloaded criteria**\n- Problem: Criteria measuring multiple things are unreliable\n- Solution: One criterion = one measurable aspect\n\n**Anti-pattern: Missing edge case guidance**\n- Problem: Evaluators handle ambiguous cases inconsistently\n- Solution: Include edge cases in rubrics with explicit guidance\n\n**Anti-pattern: Ignoring confidence calibration**\n- Problem: High-confidence wrong judgments are worse than low-confidence\n- Solution: Calibrate confidence to position consistency and evidence strength\n\n### Decision Framework: Direct vs. 
Pairwise\n\nUse this decision tree:\n\n```\nIs there an objective ground truth?\n├── Yes → Direct Scoring\n│   └── Examples: factual accuracy, instruction following, format compliance\n│\n└── No → Is it a preference or quality judgment?\n    ├── Yes → Pairwise Comparison\n    │   └── Examples: tone, style, persuasiveness, creativity\n    │\n    └── No → Consider reference-based evaluation\n        └── Examples: summarization (compare to source), translation (compare to reference)\n```\n\n### Scaling Evaluation\n\nFor high-volume evaluation:\n\n1. **Panel of LLMs (PoLL)**: Use multiple models as judges, aggregate votes\n   - Reduces individual model bias\n   - More expensive but more reliable for high-stakes decisions\n\n2. **Hierarchical evaluation**: Fast cheap model for screening, expensive model for edge cases\n   - Cost-effective for large volumes\n   - Requires calibration of screening threshold\n\n3. **Human-in-the-loop**: Automated evaluation for clear cases, human review for low-confidence\n   - Best reliability for critical applications\n   - Design feedback loop to improve automated evaluation\n\n## Examples\n\n### Example 1: Direct Scoring for Accuracy\n\n**Input**:\n```\nPrompt: \"What causes seasons on Earth?\"\nResponse: \"Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, \ndifferent hemispheres receive more direct sunlight at different times of year.\"\nCriterion: Factual Accuracy (weight: 1.0)\nScale: 1-5\n```\n\n**Output**:\n```json\n{\n  \"criterion\": \"Factual Accuracy\",\n  \"score\": 5,\n  \"evidence\": [\n    \"Correctly identifies axial tilt as primary cause\",\n    \"Correctly explains differential sunlight by hemisphere\",\n    \"No factual errors present\"\n  ],\n  \"justification\": \"Response accurately explains the cause of seasons with correct \nscientific reasoning. 
Both the axial tilt and its effect on sunlight distribution \nare correctly described.\",\n  \"improvement\": \"Could add the specific tilt angle (23.5°) for completeness.\"\n}\n```\n\n### Example 2: Pairwise Comparison with Position Swap\n\n**Input**:\n```\nPrompt: \"Explain machine learning to a beginner\"\nResponse A: [Technical explanation with jargon]\nResponse B: [Simple analogy-based explanation]\nCriteria: [\"clarity\", \"accessibility\"]\n```\n\n**First Pass (A first)**:\n```json\n{ \"winner\": \"B\", \"confidence\": 0.8 }\n```\n\n**Second Pass (B first)**:\n```json\n{ \"winner\": \"A\", \"confidence\": 0.6 }\n```\n(Note: In this pass the label \"A\" denotes the response shown in first position, which is the original Response B, so the verdict maps back to B.)\n\n**Mapped Second Pass**:\n```json\n{ \"winner\": \"B\", \"confidence\": 0.6 }\n```\n\n**Final Result**:\n```json\n{\n  \"winner\": \"B\",\n  \"confidence\": 0.7,\n  \"positionConsistency\": {\n    \"consistent\": true,\n    \"firstPassWinner\": \"B\",\n    \"secondPassWinner\": \"B\"\n  }\n}\n```\n\n### Example 3: Rubric Generation\n\n**Input**:\n```\ncriterionName: \"Code Readability\"\ncriterionDescription: \"How easy the code is to understand and maintain\"\ndomain: \"software engineering\"\nscale: \"1-5\"\nstrictness: \"balanced\"\n```\n\n**Output** (abbreviated):\n```json\n{\n  \"levels\": [\n    {\n      \"score\": 1,\n      \"label\": \"Poor\",\n      \"description\": \"Code is difficult to understand without significant effort\",\n      \"characteristics\": [\n        \"No meaningful variable or function names\",\n        \"No comments or documentation\",\n        \"Deeply nested or convoluted logic\"\n      ]\n    },\n    {\n      \"score\": 3,\n      \"label\": \"Adequate\",\n      \"description\": \"Code is understandable with some effort\",\n      \"characteristics\": [\n        \"Most variables have meaningful names\",\n        \"Basic comments present for complex sections\",\n        \"Logic is followable but could be cleaner\"\n      ]\n    },\n    {\n      \"score\": 5,\n      
\"label\": \"Excellent\",\n      \"description\": \"Code is immediately clear and maintainable\",\n      \"characteristics\": [\n        \"All names are descriptive and consistent\",\n        \"Comprehensive documentation\",\n        \"Clean, modular structure\"\n      ]\n    }\n  ],\n  \"edgeCases\": [\n    {\n      \"situation\": \"Code is well-structured but uses domain-specific abbreviations\",\n      \"guidance\": \"Score based on readability for domain experts, not general audience\"\n    }\n  ]\n}\n```\n\n## Guidelines\n\n1. **Always require justification before scores** - Chain-of-thought prompting improves reliability by 15-25%\n\n2. **Always swap positions in pairwise comparison** - Single-pass comparison is corrupted by position bias\n\n3. **Match scale granularity to rubric specificity** - Don't use 1-10 without detailed level descriptions\n\n4. **Separate objective and subjective criteria** - Use direct scoring for objective, pairwise for subjective\n\n5. **Include confidence scores** - Calibrate to position consistency and evidence strength\n\n6. **Define edge cases explicitly** - Ambiguous situations cause the most evaluation variance\n\n7. **Use domain-specific rubrics** - Generic rubrics produce generic (less useful) evaluations\n\n8. **Validate against human judgments** - Automated evaluation is only valuable if it correlates with human assessment\n\n9. **Monitor for systematic bias** - Track disagreement patterns by criterion, response type, model\n\n10. 
**Design for iteration** - Evaluation systems improve with feedback loops\n\n## Integration\n\nThis skill integrates with:\n\n- **context-fundamentals** - Evaluation prompts require effective context structure\n- **tool-design** - Evaluation tools need proper schemas and error handling\n- **context-optimization** - Evaluation prompts can be optimized for token efficiency\n- **evaluation** (foundational) - This skill extends the foundational evaluation concepts\n\n## References\n\nInternal reference:\n- LLM-as-Judge Implementation Patterns\n- Bias Mitigation Techniques\n- Metric Selection Guide\n\nExternal research:\n- [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https:\u002F\u002Feugeneyan.com\u002Fwriting\u002Fllm-evaluators\u002F)\n- [Judging LLM-as-a-Judge (Zheng et al., 2023)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05685)\n- [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.16634)\n- [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17926)\n\nRelated skills in this collection:\n- evaluation - Foundational evaluation concepts\n- context-fundamentals - Context structure for evaluation prompts\n- tool-design - Building evaluation tools\n\n---\n\n## Skill Metadata\n\n**Created**: 2024-12-24\n**Last Updated**: 2024-12-24\n**Author**: Muratcan Koylan\n**Version**: 1.0.0\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,121,1681,"2026-05-16 
13:01:02",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Programming & Development","coding","mdi-code-braces","Code generation, debugging, and review to improve development efficiency",2,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":25,"skillCount":32,"createdAt":26},"Backend Development","backend","mdi-server","APIs, databases, and server-side architecture",296,[34],{"id":35,"skillId":4,"version":36,"fileName":37,"fileSize":38,"filePath":39,"fileHash":40,"manifest":41,"createdAt":19},"bb969258-161e-4028-8206-9e850aadeec0","1.0.0","advanced-evaluation.zip",6324,"uploads\u002Fskills\u002F4c1b4133-b163-43c4-afe1-3a59f8d6ba5e\u002Fadvanced-evaluation.zip","3b7f7c74385c75c6a56c54fbaa446393c58b420f30b50982e30ca4ac8f400f9f","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":17839}]",{"code":43,"message":44,"data":45},200,"success",{"items":46,"stats":47,"page":50},[],{"averageRating":48,"totalRatings":48,"ratingCounts":49},0,[48,48,48,48,48],{"limit":51,"offset":48,"hasMore":52,"nextOffset":51,"ratedOnly":16},15,false]