[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-54de6f8e-eabc-428f-87ef-3e082c096c17":3,"$fOOKLFoQMw33d2KZXbXF2fzLaza-OKmNv0vNXgbkqDK4":43},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":34},"54de6f8e-eabc-428f-87ef-3e082c096c17","hugging-face-community-evals","Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval.","cat_coding_frontend","mod_coding","sickn33,coding","---\nsource: \"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fskills\u002Ftree\u002Fmain\u002Fskills\u002Fhuggingface-community-evals\"\nname: hugging-face-community-evals\ndescription: Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval.\nrisk: unknown\n---\n\n# Overview\n\n## When to Use\nUse this skill for local model evaluation, backend selection, and GPU smoke tests outside the Hugging Face Jobs workflow.\n\nThis skill is for **running evaluations against models on the Hugging Face Hub on local hardware**.\n\nIt covers:\n- `inspect-ai` with local inference\n- `lighteval` with local inference\n- choosing between `vllm`, Hugging Face Transformers, and `accelerate`\n- smoke tests, task selection, and backend fallback strategy\n\nIt does **not** cover:\n- Hugging Face Jobs orchestration\n- model-card or `model-index` edits\n- README table extraction\n- Artificial Analysis imports\n- `.eval_results` generation or publishing\n- PR creation or community-evals automation\n\nIf the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.\n\nIf the user wants to **publish results into the community evals workflow**, stop after generating the evaluation run and hand off that publishing step to `~\u002Fcode\u002Fcommunity-evals`.\n\n> All paths below are relative to 
the directory containing this `SKILL.md`.\n\n# When To Use Which Script\n\n| Use case | Script |\n|---|---|\n| Local `inspect-ai` eval on a Hub model via inference providers | `scripts\u002Finspect_eval_uv.py` |\n| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts\u002Finspect_vllm_uv.py` |\n| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts\u002Flighteval_vllm_uv.py` |\n| Extra command patterns | `examples\u002FUSAGE_EXAMPLES.md` |\n\n# Prerequisites\n\n- Prefer `uv run` for local execution.\n- Set `HF_TOKEN` for gated\u002Fprivate models.\n- For local GPU runs, verify GPU access before starting:\n\n```bash\nuv --version\nprintenv HF_TOKEN >\u002Fdev\u002Fnull\nnvidia-smi\n```\n\nIf `nvidia-smi` is unavailable, either:\n- use `scripts\u002Finspect_eval_uv.py` for lighter provider-backed evaluation, or\n- hand off to the `hugging-face-jobs` skill if the user wants remote compute.\n\n# Core Workflow\n\n1. Choose the evaluation framework.\n   - Use `inspect-ai` when you want explicit task control and inspect-native flows.\n   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.\n2. Choose the inference backend.\n   - Prefer `vllm` for throughput on supported architectures.\n   - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.\n3. Start with a smoke test.\n   - `inspect-ai`: add `--limit 10` or similar.\n   - `lighteval`: add `--max-samples 10`.\n4. Scale up only after the smoke test passes.\n5. 
If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.\n\n# Quick Start\n\n## Option A: inspect-ai with local inference providers path\n\nBest when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.\n\n```bash\nuv run scripts\u002Finspect_eval_uv.py \\\n  --model meta-llama\u002FLlama-3.2-1B \\\n  --task mmlu \\\n  --limit 20\n```\n\nUse this path when:\n- you want a quick local smoke test\n- you do not need direct GPU control\n- the task already exists in `inspect-evals`\n\n## Option B: inspect-ai on Local GPU\n\nBest when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.\n\nLocal GPU:\n\n```bash\nuv run scripts\u002Finspect_vllm_uv.py \\\n  --model meta-llama\u002FLlama-3.2-1B \\\n  --task gsm8k \\\n  --limit 20\n```\n\nTransformers fallback:\n\n```bash\nuv run scripts\u002Finspect_vllm_uv.py \\\n  --model microsoft\u002Fphi-2 \\\n  --task mmlu \\\n  --backend hf \\\n  --trust-remote-code \\\n  --limit 20\n```\n\n## Option C: lighteval on Local GPU\n\nBest when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.\n\nLocal GPU:\n\n```bash\nuv run scripts\u002Flighteval_vllm_uv.py \\\n  --model meta-llama\u002FLlama-3.2-3B-Instruct \\\n  --tasks \"leaderboard|mmlu|5,leaderboard|gsm8k|5\" \\\n  --max-samples 20 \\\n  --use-chat-template\n```\n\n`accelerate` fallback:\n\n```bash\nuv run scripts\u002Flighteval_vllm_uv.py \\\n  --model microsoft\u002Fphi-2 \\\n  --tasks \"leaderboard|mmlu|5\" \\\n  --backend accelerate \\\n  --trust-remote-code \\\n  --max-samples 20\n```\n\n# Remote Execution Boundary\n\nThis skill intentionally stops at **local execution and backend selection**.\n\nIf the user wants to:\n- run these scripts on Hugging Face Jobs\n- pick remote hardware\n- pass secrets to remote jobs\n- schedule recurring runs\n- inspect \u002F 
cancel \u002F monitor jobs\n\nthen switch to the **`hugging-face-jobs`** skill and pass it one of these scripts plus the chosen arguments.\n\n# Task Selection\n\n`inspect-ai` examples:\n- `mmlu`\n- `gsm8k`\n- `hellaswag`\n- `arc_challenge`\n- `truthfulqa`\n- `winogrande`\n- `humaneval`\n\n`lighteval` task strings use `suite|task|num_fewshot`:\n- `leaderboard|mmlu|5`\n- `leaderboard|gsm8k|5`\n- `leaderboard|arc_challenge|25`\n- `lighteval|hellaswag|0`\n\nMultiple `lighteval` tasks can be comma-separated in `--tasks`.\n\n# Backend Selection\n\n- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.\n- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.\n- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.\n- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.\n- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.\n\n# Hardware Guidance\n\n| Model size | Suggested local hardware |\n|---|---|\n| `\u003C 3B` | consumer GPU \u002F Apple Silicon \u002F small dev GPU |\n| `3B - 13B` | stronger local GPU |\n| `13B+` | high-memory local GPU or hand off to `hugging-face-jobs` |\n\nFor smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.\n\n# Troubleshooting\n\n- CUDA or vLLM OOM:\n  - reduce `--batch-size`\n  - reduce `--gpu-memory-utilization`\n  - switch to a smaller model for the smoke test\n  - if necessary, hand off to `hugging-face-jobs`\n- Model unsupported by `vllm`:\n  - switch to `--backend hf` for `inspect-ai`\n  - switch to `--backend accelerate` for `lighteval`\n- Gated\u002Fprivate repo access fails:\n  - verify `HF_TOKEN`\n- Custom model code required:\n  - add `--trust-remote-code`\n\n# Examples\n\nSee:\n- `examples\u002FUSAGE_EXAMPLES.md` for local command patterns\n- `scripts\u002Finspect_eval_uv.py`\n- `scripts\u002Finspect_vllm_uv.py`\n- 
`scripts\u002Flighteval_vllm_uv.py`\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,102,2054,"2026-05-16 13:22:31",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Programming","coding","mdi-code-braces","Code generation, debugging, and review to improve development efficiency",2,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":32,"skillCount":33,"createdAt":26},"Frontend Development","frontend","mdi-language-html5","HTML\u002FCSS\u002FJavaScript\u002Fframework related",1,96,[35],{"id":36,"skillId":4,"version":37,"fileName":38,"fileSize":39,"filePath":40,"fileHash":41,"manifest":42,"createdAt":19},"43c24d7a-209f-4a4f-99f0-c2b178704c56","1.0.0","hugging-face-community-evals.zip",10118,"uploads\u002Fskills\u002F54de6f8e-eabc-428f-87ef-3e082c096c17\u002Fhugging-face-community-evals.zip","86d76b306acfafb7e727f5ee52278b27ad5401f8c87285e8ce161ede0d1aec0d","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":6899},{\"path\":\"examples\u002F.env.example\",\"isDirectory\":false,\"size\":159},{\"path\":\"examples\u002FUSAGE_EXAMPLES.md\",\"isDirectory\":false,\"size\":2062},{\"path\":\"scripts\u002Finspect_eval_uv.py\",\"isDirectory\":false,\"size\":3004},{\"path\":\"scripts\u002Finspect_vllm_uv.py\",\"isDirectory\":false,\"size\":9119},{\"path\":\"scripts\u002Flighteval_vllm_uv.py\",\"isDirectory\":false,\"size\":9204}]",{"code":44,"message":45,"data":46},200,"success",{"items":47,"stats":48,"page":51},[],{"averageRating":49,"totalRatings":49,"ratingCounts":50},0,[49,49,49,49,49],{"limit":52,"offset":49,"hasMore":53,"nextOffset":52,"ratedOnly":16},15,false]