[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-633c4298-66cc-4ecd-9a75-457c62c05d40":3,"$fC71Rfw2yU0gC6h0A116TZFJNu5RLssnRnnP_tp9OSvI":42},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":33},"633c4298-66cc-4ecd-9a75-457c62c05d40","llm-evaluation","Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A\u002FB testing.","cat_coding_backend","mod_coding","sickn33,coding","---\nname: llm-evaluation\ndescription: \"Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A\u002FB testing.\"\nrisk: unknown\nsource: community\ndate_added: \"2026-02-27\"\n---\n\n# LLM Evaluation\n\nMaster comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A\u002FB testing.\n\n## Do not use this skill when\n\n- The task is unrelated to LLM evaluation\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources\u002Fimplementation-playbook.md`.\n\n## Use this skill when\n\n- Measuring LLM application performance systematically\n- Comparing different models or prompts\n- Detecting performance regressions before deployment\n- Validating improvements from prompt changes\n- Building confidence in production systems\n- Establishing baselines and tracking progress over time\n- Debugging unexpected model behavior\n\n## Core Evaluation Types\n\n### 1. 
Automated Metrics\nFast, repeatable, scalable evaluation using computed scores.\n\n**Text Generation:**\n- **BLEU**: N-gram overlap (translation)\n- **ROUGE**: Recall-oriented (summarization)\n- **METEOR**: Semantic similarity\n- **BERTScore**: Embedding-based similarity\n- **Perplexity**: Language model confidence\n\n**Classification:**\n- **Accuracy**: Percentage correct\n- **Precision\u002FRecall\u002FF1**: Class-specific performance\n- **Confusion Matrix**: Error patterns\n- **AUC-ROC**: Ranking quality\n\n**Retrieval (RAG):**\n- **MRR**: Mean Reciprocal Rank\n- **NDCG**: Normalized Discounted Cumulative Gain\n- **Precision@K**: Relevant in top K\n- **Recall@K**: Coverage in top K\n\n### 2. Human Evaluation\nManual assessment for quality aspects difficult to automate.\n\n**Dimensions:**\n- **Accuracy**: Factual correctness\n- **Coherence**: Logical flow\n- **Relevance**: Answers the question\n- **Fluency**: Natural language quality\n- **Safety**: No harmful content\n- **Helpfulness**: Useful to the user\n\n### 3. LLM-as-Judge\nUse stronger LLMs to evaluate weaker model outputs.\n\n**Approaches:**\n- **Pointwise**: Score individual responses\n- **Pairwise**: Compare two responses\n- **Reference-based**: Compare to gold standard\n- **Reference-free**: Judge without ground truth\n\n## Quick Start\n\n```python\nfrom llm_eval import EvaluationSuite, Metric\n\n# Define evaluation suite\nsuite = EvaluationSuite([\n    Metric.accuracy(),\n    Metric.bleu(),\n    Metric.bertscore(),\n    Metric.custom(name=\"groundedness\", fn=check_groundedness)\n])\n\n# Prepare test cases\ntest_cases = [\n    {\n        \"input\": \"What is the capital of France?\",\n        \"expected\": \"Paris\",\n        \"context\": \"France is a country in Europe. Paris is its capital.\"\n    },\n    # ... 
more test cases\n]\n\n# Run evaluation\nresults = suite.evaluate(\n    model=your_model,\n    test_cases=test_cases\n)\n\nprint(f\"Overall Accuracy: {results.metrics['accuracy']}\")\nprint(f\"BLEU Score: {results.metrics['bleu']}\")\n```\n\n## Automated Metrics Implementation\n\n### BLEU Score\n```python\nfrom nltk.translate.bleu_score import sentence_bleu, SmoothingFunction\n\ndef calculate_bleu(reference, hypothesis):\n    \"\"\"Calculate BLEU score between reference and hypothesis.\"\"\"\n    smoothie = SmoothingFunction().method4\n\n    return sentence_bleu(\n        [reference.split()],\n        hypothesis.split(),\n        smoothing_function=smoothie\n    )\n\n# Usage\nbleu = calculate_bleu(\n    reference=\"The cat sat on the mat\",\n    hypothesis=\"A cat is sitting on the mat\"\n)\n```\n\n### ROUGE Score\n```python\nfrom rouge_score import rouge_scorer\n\ndef calculate_rouge(reference, hypothesis):\n    \"\"\"Calculate ROUGE scores.\"\"\"\n    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)\n    scores = scorer.score(reference, hypothesis)\n\n    return {\n        'rouge1': scores['rouge1'].fmeasure,\n        'rouge2': scores['rouge2'].fmeasure,\n        'rougeL': scores['rougeL'].fmeasure\n    }\n```\n\n### BERTScore\n```python\nfrom bert_score import score\n\ndef calculate_bertscore(references, hypotheses):\n    \"\"\"Calculate BERTScore using pre-trained BERT.\"\"\"\n    P, R, F1 = score(\n        hypotheses,\n        references,\n        lang='en',\n        model_type='microsoft\u002Fdeberta-xlarge-mnli'\n    )\n\n    return {\n        'precision': P.mean().item(),\n        'recall': R.mean().item(),\n        'f1': F1.mean().item()\n    }\n```\n\n### Custom Metrics\n```python\ndef calculate_groundedness(response, context):\n    \"\"\"Check if response is grounded in provided context.\"\"\"\n    # Use NLI model to check entailment\n    from transformers import pipeline\n\n    nli = pipeline(\"text-classification\", 
model=\"microsoft\u002Fdeberta-large-mnli\")\n\n    result = nli(f\"{context} [SEP] {response}\")[0]\n\n    # Return confidence that response is entailed by context\n    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0\n\ndef calculate_toxicity(text):\n    \"\"\"Measure toxicity in generated text.\"\"\"\n    from detoxify import Detoxify\n\n    results = Detoxify('original').predict(text)\n    return max(results.values())  # Return highest toxicity score\n\ndef calculate_factuality(claim, knowledge_base):\n    \"\"\"Verify factual claims against knowledge base.\"\"\"\n    # Implementation depends on your knowledge base\n    # Could use retrieval + NLI, or fact-checking API\n    raise NotImplementedError(\"Connect this to your knowledge base\")\n```\n\n## LLM-as-Judge Patterns\n\n### Single Output Evaluation\n```python\ndef llm_judge_quality(response, question):\n    \"\"\"Use GPT-5 to judge response quality.\"\"\"\n    import json\n    from openai import OpenAI  # client API of openai>=1.0\n\n    client = OpenAI()  # reads OPENAI_API_KEY from the environment\n\n    prompt = f\"\"\"Rate the following response on a scale of 1-10 for:\n1. Accuracy (factually correct)\n2. Helpfulness (answers the question)\n3. Clarity (well-written and understandable)\n\nQuestion: {question}\nResponse: {response}\n\nProvide ratings in JSON format:\n{{\n  \"accuracy\": \u003C1-10>,\n  \"helpfulness\": \u003C1-10>,\n  \"clarity\": \u003C1-10>,\n  \"reasoning\": \"\u003Cbrief explanation>\"\n}}\n\"\"\"\n\n    result = client.chat.completions.create(\n        model=\"gpt-5\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        temperature=0\n    )\n\n    # Raises json.JSONDecodeError if the judge does not return valid JSON\n    return json.loads(result.choices[0].message.content)\n```\n\n### Pairwise Comparison\n```python\ndef compare_responses(question, response_a, response_b):\n    \"\"\"Compare two responses using LLM judge.\"\"\"\n    prompt = f\"\"\"Compare these two responses to the question and determine which is better.\n\nQuestion: {question}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\nWhich response is better and why? 
Consider accuracy, helpfulness, and clarity.\n\nAnswer with JSON:\n{{\n  \"winner\": \"A\" or \"B\" or \"tie\",\n  \"reasoning\": \"\u003Cexplanation>\",\n  \"confidence\": \u003C1-10>\n}}\n\"\"\"\n\n    # Imported locally so the snippet stands alone; requires openai>=1.0\n    import json\n    from openai import OpenAI\n\n    # OpenAI() reads OPENAI_API_KEY from the environment\n    result = OpenAI().chat.completions.create(\n        model=\"gpt-5\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        temperature=0\n    )\n\n    return json.loads(result.choices[0].message.content)\n```\n\n## Human Evaluation Frameworks\n\n### Annotation Guidelines\n```python\nclass AnnotationTask:\n    \"\"\"Structure for human annotation task.\"\"\"\n\n    def __init__(self, response, question, context=None):\n        self.response = response\n        self.question = question\n        self.context = context\n\n    def get_annotation_form(self):\n        return {\n            \"question\": self.question,\n            \"context\": self.context,\n            \"response\": self.response,\n            \"ratings\": {\n                \"accuracy\": {\n                    \"scale\": \"1-5\",\n                    \"description\": \"Is the response factually correct?\"\n                },\n                \"relevance\": {\n                    \"scale\": \"1-5\",\n                    \"description\": \"Does it answer the question?\"\n                },\n                \"coherence\": {\n                    \"scale\": \"1-5\",\n                    \"description\": \"Is it logically consistent?\"\n                }\n            },\n            \"issues\": {\n                \"factual_error\": False,\n                \"hallucination\": False,\n                \"off_topic\": False,\n                \"unsafe_content\": False\n            },\n            \"feedback\": \"\"\n        }\n```\n\n### Inter-Rater Agreement\n```python\nfrom sklearn.metrics import cohen_kappa_score\n\ndef calculate_agreement(rater1_scores, rater2_scores):\n    \"\"\"Calculate inter-rater agreement.\"\"\"\n    kappa = cohen_kappa_score(rater1_scores, rater2_scores)\n\n    
# Pick the first band whose upper bound exceeds kappa\n    bands = [\n        (0.0, \"Poor\"),\n        (0.2, \"Slight\"),\n        (0.4, \"Fair\"),\n        (0.6, \"Moderate\"),\n        (0.8, \"Substantial\")\n    ]\n    interpretation = next(\n        (label for upper, label in bands if kappa \u003C upper),\n        \"Almost Perfect\"\n    )\n\n    return {\n        \"kappa\": kappa,\n        \"interpretation\": interpretation\n    }\n```\n\n## A\u002FB Testing\n\n### Statistical Testing Framework\n```python\nfrom scipy import stats\nimport numpy as np\n\nclass ABTest:\n    def __init__(self, variant_a_name=\"A\", variant_b_name=\"B\"):\n        self.variant_a = {\"name\": variant_a_name, \"scores\": []}\n        self.variant_b = {\"name\": variant_b_name, \"scores\": []}\n\n    def add_result(self, variant, score):\n        \"\"\"Add evaluation result for a variant, identified by its name.\"\"\"\n        if variant == self.variant_a[\"name\"]:\n            self.variant_a[\"scores\"].append(score)\n        elif variant == self.variant_b[\"name\"]:\n            self.variant_b[\"scores\"].append(score)\n        else:\n            raise ValueError(f\"Unknown variant: {variant!r}\")\n\n    def analyze(self, alpha=0.05):\n        \"\"\"Perform statistical analysis.\"\"\"\n        a_scores = self.variant_a[\"scores\"]\n        b_scores = self.variant_b[\"scores\"]\n\n        # Two-sample t-test (assumes equal variances)\n        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)\n\n        # Effect size (Cohen's d) from sample standard deviations\n        pooled_std = np.sqrt((np.std(a_scores, ddof=1)**2 + np.std(b_scores, ddof=1)**2) \u002F 2)\n        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) \u002F pooled_std\n\n        return {\n            \"variant_a_mean\": np.mean(a_scores),\n            \"variant_b_mean\": np.mean(b_scores),\n            \"difference\": np.mean(b_scores) - np.mean(a_scores),\n            \"relative_improvement\": (np.mean(b_scores) - np.mean(a_scores)) \u002F np.mean(a_scores),\n            \"p_value\": p_value,\n            \"statistically_significant\": p_value \u003C alpha,\n            \"cohens_d\": cohens_d,\n            \"effect_size\": self.interpret_cohens_d(cohens_d),\n            \"winner\": \"B\" if np.mean(b_scores) > 
np.mean(a_scores) else \"A\"\n        }\n\n    @staticmethod\n    def interpret_cohens_d(d):\n        \"\"\"Interpret Cohen's d effect size.\"\"\"\n        abs_d = abs(d)\n        if abs_d \u003C 0.2:\n            return \"negligible\"\n        elif abs_d \u003C 0.5:\n            return \"small\"\n        elif abs_d \u003C 0.8:\n            return \"medium\"\n        else:\n            return \"large\"\n```\n\n## Regression Testing\n\n### Regression Detection\n```python\nclass RegressionDetector:\n    \"\"\"Flag metrics that dropped by more than `threshold` (assumes higher is better).\"\"\"\n\n    def __init__(self, baseline_results, threshold=0.05):\n        self.baseline = baseline_results\n        self.threshold = threshold\n\n    def check_for_regression(self, new_results):\n        \"\"\"Detect if new results show regression.\"\"\"\n        regressions = []\n\n        for metric, baseline_score in self.baseline.items():\n            new_score = new_results.get(metric)\n\n            # Relative change is undefined for a missing score or a zero baseline\n            if new_score is None or baseline_score == 0:\n                continue\n\n            # Calculate relative change\n            relative_change = (new_score - baseline_score) \u002F baseline_score\n\n            # Flag if the decrease exceeds the threshold\n            if relative_change \u003C -self.threshold:\n                regressions.append({\n                    \"metric\": metric,\n                    \"baseline\": baseline_score,\n                    \"current\": new_score,\n                    \"change\": relative_change\n                })\n\n        return {\n            \"has_regression\": len(regressions) > 0,\n            \"regressions\": regressions\n        }\n```\n\n## Benchmarking\n\n### Running Benchmarks\n```python\nimport numpy as np\n\nclass BenchmarkRunner:\n    def __init__(self, benchmark_dataset):\n        self.dataset = benchmark_dataset\n\n    def run_benchmark(self, model, metrics):\n        \"\"\"Run model on benchmark and calculate metrics.\"\"\"\n        results = {metric.name: [] for metric in metrics}\n\n        for example in self.dataset:\n            # Generate 
prediction\n            prediction = model.predict(example[\"input\"])\n\n            # Calculate each metric\n            for metric in metrics:\n                score = metric.calculate(\n                    prediction=prediction,\n                    reference=example[\"reference\"],\n                    context=example.get(\"context\")\n                )\n                results[metric.name].append(score)\n\n        # Aggregate results\n        return {\n            metric: {\n                \"mean\": np.mean(scores),\n                \"std\": np.std(scores),\n                \"min\": min(scores),\n                \"max\": max(scores)\n            }\n            for metric, scores in results.items()\n        }\n```\n\n## Resources\n\n- **references\u002Fmetrics.md**: Comprehensive metric guide\n- **references\u002Fhuman-evaluation.md**: Annotation best practices\n- **references\u002Fbenchmarking.md**: Standard benchmarks\n- **references\u002Fa-b-testing.md**: Statistical testing guide\n- **references\u002Fregression-testing.md**: CI\u002FCD integration\n- **assets\u002Fevaluation-framework.py**: Complete evaluation harness\n- **assets\u002Fbenchmark-dataset.jsonl**: Example datasets\n- **scripts\u002Fevaluate-model.py**: Automated evaluation runner\n\n## Best Practices\n\n1. **Multiple Metrics**: Use diverse metrics for comprehensive view\n2. **Representative Data**: Test on real-world, diverse examples\n3. **Baselines**: Always compare against baseline performance\n4. **Statistical Rigor**: Use proper statistical tests for comparisons\n5. **Continuous Evaluation**: Integrate into CI\u002FCD pipeline\n6. **Human Validation**: Combine automated metrics with human judgment\n7. **Error Analysis**: Investigate failures to understand weaknesses\n8. 
**Version Control**: Track evaluation results over time\n\n## Common Pitfalls\n\n- **Single Metric Obsession**: Optimizing for one metric at the expense of others\n- **Small Sample Size**: Drawing conclusions from too few examples\n- **Data Contamination**: Testing on training data\n- **Ignoring Variance**: Not accounting for statistical uncertainty\n- **Metric Mismatch**: Using metrics not aligned with business goals\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,186,1807,"2026-05-16 13:26:54",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Programming & Development","coding","mdi-code-braces","Code generation, debugging, and review to boost development efficiency",2,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":25,"skillCount":32,"createdAt":26},"Backend Development","backend","mdi-server","APIs, databases, and server-side architecture",296,[34],{"id":35,"skillId":4,"version":36,"fileName":37,"fileSize":38,"filePath":39,"fileHash":40,"manifest":41,"createdAt":19},"7683d85a-c309-46f2-bc93-fa11315ccc57","1.0.0","llm-evaluation.zip",5232,"uploads\u002Fskills\u002F633c4298-66cc-4ecd-9a75-457c62c05d40\u002Fllm-evaluation.zip","f529623dba3c568c1b5170f8c40cdc02220215d5fd416ce3312024bcd27d6113","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":14401}]",{"code":43,"message":44,"data":45},200,"success",{"items":46,"stats":47,"page":50},[],{"averageRating":48,"totalRatings":48,"ratingCounts":49},0,[48,48,48,48,48],{"limit":51,"offset":48,"hasMore":52,"nextOffset":51,"ratedOnly":16},15,false]