Name: hugging-face-community-evals
Rating: 5.0 (102 reviews)
Author: SkillOPIC

Description

Run local evaluations of Hugging Face Hub models with inspect-ai or lighteval.

---
source: "https://github.com/huggingface/skills/tree/main/skills/huggingface-community-evals"
name: hugging-face-community-evals
description: Run local evaluations for Hugging Face Hub models with inspect-ai or lighteval.
risk: unknown
---

# Overview

## When to Use
Use this skill for local model evaluation, backend selection, and GPU smoke tests outside the Hugging Face Jobs workflow.

This skill is for **running evaluations against models on the Hugging Face Hub on local hardware**.

It covers:
- `inspect-ai` with local inference
- `lighteval` with local inference
- choosing between `vllm`, Hugging Face Transformers, and `accelerate`
- smoke tests, task selection, and backend fallback strategy

It does **not** cover:
- Hugging Face Jobs orchestration
- model-card or `model-index` edits
- README table extraction
- Artificial Analysis imports
- `.eval_results` generation or publishing
- PR creation or community-evals automation

If the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill.

If the user wants to **publish results into the community evals workflow**, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`.

> All paths below are relative to the directory containing this `SKILL.md`.

# When To Use Which Script

| Use case | Script |
|---|---|
| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` |
| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` |
| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` |
| Extra command patterns | `examples/USAGE_EXAMPLES.md` |

# Prerequisites

- Prefer `uv run` for local execution.
- Set `HF_TOKEN` for gated/private models.
- For local GPU runs, verify GPU access before starting:

```bash
uv --version
printenv HF_TOKEN >/dev/null
nvidia-smi
```

If `nvidia-smi` is unavailable, either:
- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or
- hand off to the `hugging-face-jobs` skill if the user wants remote compute.
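
The GPU check above can be folded into a small helper that picks the `inspect-ai` script for you. This is a minimal sketch: `choose_inspect_script` is a hypothetical function, not part of the skill, and it assumes that a working `nvidia-smi` is a good enough signal for local GPU access.

```bash
# Hypothetical helper: choose an inspect-ai script based on GPU visibility.
choose_inspect_script() {
  # The probe command is injectable for testing; it defaults to nvidia-smi.
  local gpu_probe="${1:-nvidia-smi}"
  if command -v "$gpu_probe" >/dev/null 2>&1 && "$gpu_probe" >/dev/null 2>&1; then
    echo "scripts/inspect_vllm_uv.py"   # local GPU path
  else
    echo "scripts/inspect_eval_uv.py"   # provider-backed path, no local GPU needed
  fi
}

# Print the script this machine would use.
choose_inspect_script
```

If the user wants remote compute rather than the provider-backed path, the hand-off to `hugging-face-jobs` still applies.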

# Core Workflow

1. Choose the evaluation framework.
   - Use `inspect-ai` when you want explicit task control and inspect-native flows.
   - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks.
2. Choose the inference backend.
   - Prefer `vllm` for throughput on supported architectures.
   - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks.
3. Start with a smoke test.
   - `inspect-ai`: add `--limit 10` or similar.
   - `lighteval`: add `--max-samples 10`.
4. Scale up only after the smoke test passes.
5. If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args.
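
Steps 3 and 4 can be chained so the full run starts only if the smoke test exits successfully. The sketch below is one way to do that. `smoke_then_full` and the `RUNNER` variable are hypothetical names introduced here. By default the helper only prints the commands; set `RUNNER="uv run"` to execute them.

```bash
# Hypothetical wrapper: smoke test first, full run only if it passes.
# Usage: smoke_then_full <script> <limit-flag> [script args...]
smoke_then_full() {
  local script="$1" limit_flag="$2"
  shift 2
  # RUNNER defaults to a dry run that echoes the command instead of running it.
  ${RUNNER:-echo uv run} "$script" "$@" "$limit_flag" 10 &&
    ${RUNNER:-echo uv run} "$script" "$@"
}

# inspect-ai uses --limit; for lighteval pass --max-samples instead.
smoke_then_full scripts/inspect_vllm_uv.py --limit \
  --model meta-llama/Llama-3.2-1B --task gsm8k
```

Because the two commands share the same script and arguments, the same pair can be handed to `hugging-face-jobs` unchanged.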

# Quick Start

## Option A: inspect-ai via Inference Providers

Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead.

```bash
uv run scripts/inspect_eval_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--limit 20
```

Use this path when:
- you want a quick local smoke test
- you do not need direct GPU control
- the task already exists in `inspect-evals`

## Option B: inspect-ai on Local GPU

Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures.

Local GPU:

```bash
uv run scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task gsm8k \
--limit 20
```

Transformers fallback:

```bash
uv run scripts/inspect_vllm_uv.py \
--model microsoft/phi-2 \
--task mmlu \
--backend hf \
--trust-remote-code \
--limit 20
```

## Option C: lighteval on Local GPU

Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks.

Local GPU:

```bash
uv run scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-3B-Instruct \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \
--max-samples 20 \
--use-chat-template
```

`accelerate` fallback:

```bash
uv run scripts/lighteval_vllm_uv.py \
--model microsoft/phi-2 \
--tasks "leaderboard|mmlu|5" \
--backend accelerate \
--trust-remote-code \
--max-samples 20
```

# Remote Execution Boundary

This skill intentionally stops at **local execution and backend selection**.

If the user wants to:
- run these scripts on Hugging Face Jobs
- pick remote hardware
- pass secrets to remote jobs
- schedule recurring runs
- inspect / cancel / monitor jobs

then switch to the **`hugging-face-jobs`** skill and pass it one of these scripts plus the chosen arguments.

# Task Selection

`inspect-ai` examples:
- `mmlu`
- `gsm8k`
- `hellaswag`
- `arc_challenge`
- `truthfulqa`
- `winogrande`
- `humaneval`

`lighteval` examples:
- `leaderboard|mmlu|5`
- `leaderboard|gsm8k|5`

Multiple `lighteval` tasks can be comma-separated in `--tasks`.
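
For longer task lists, the comma-separated `--tasks` value can be assembled rather than typed by hand. This is a sketch only: `lighteval_tasks` is a hypothetical helper, and it assumes the three pipe-separated fields are suite, task name, and few-shot count, as the `leaderboard|mmlu|5` strings in this skill suggest.

```bash
# Hypothetical helper: build a --tasks value from a suite, a few-shot count,
# and one or more task names.
lighteval_tasks() {
  local suite="$1" fewshot="$2"
  shift 2
  local specs=() name
  for name in "$@"; do
    specs+=("${suite}|${name}|${fewshot}")
  done
  # "${specs[*]}" joins the elements with the first character of IFS.
  local IFS=,
  echo "${specs[*]}"
}

lighteval_tasks leaderboard 5 mmlu gsm8k
# prints: leaderboard|mmlu|5,leaderboard|gsm8k|5
```

Quote the result when passing it on, for example `--tasks "$(lighteval_tasks leaderboard 5 mmlu gsm8k)"`, so the shell does not interpret the `|` characters.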

# Backend Selection

- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures.
- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model.
- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models.
- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback.
- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control.
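
The vllm-first, fallback-second preference can be expressed as a retry. The sketch below is an illustration, not part of the skill: `run_with_fallback` and `RUNNER` are hypothetical names, and the helper falls back on any non-zero exit, not only on unsupported architectures. It prints the command by default; set `RUNNER="uv run"` to execute.

```bash
# Hypothetical wrapper: try --backend vllm, then retry with a fallback backend.
# Usage: run_with_fallback <script> <fallback-backend> [script args...]
run_with_fallback() {
  local script="$1" fallback="$2"
  shift 2
  # RUNNER defaults to a dry run that echoes the command.
  ${RUNNER:-echo uv run} "$script" "$@" --backend vllm ||
    ${RUNNER:-echo uv run} "$script" "$@" --backend "$fallback"
}

# inspect-ai falls back to hf; lighteval falls back to accelerate.
run_with_fallback scripts/inspect_vllm_uv.py hf \
  --model microsoft/phi-2 --task mmlu --limit 20
```

Since an out-of-memory failure would also trigger the retry, check the first error before trusting the fallback result.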

# Hardware Guidance

| Model size | Suggested local hardware |
|---|---|
| `< 3B` | consumer GPU / Apple Silicon / small dev GPU |
| `3B - 13B` | stronger local GPU |
| `13B+` | high-memory local GPU or hand off to `hugging-face-jobs` |

For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`.

# Troubleshooting

- CUDA or vLLM OOM:
  - reduce `--batch-size`
  - reduce `--gpu-memory-utilization`
  - switch to a smaller model for the smoke test
  - if necessary, hand off to `hugging-face-jobs`
- Model unsupported by `vllm`:
  - switch to `--backend hf` for `inspect-ai`
  - switch to `--backend accelerate` for `lighteval`
- Gated/private repo access fails:
  - verify `HF_TOKEN`
- Custom model code required:
  - add `--trust-remote-code`
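
For the OOM case, reducing `--batch-size` can be automated by halving it until a run succeeds. This is a rough sketch under stated assumptions: `retry_smaller_batches` and `RUNNER` are hypothetical names, the script in use is assumed to accept `--batch-size`, and the loop retries on any failure because it cannot tell an OOM apart from other errors. It prints the command by default; set `RUNNER="uv run"` to execute.

```bash
# Hypothetical wrapper: halve --batch-size until the command succeeds.
# Usage: retry_smaller_batches <start-batch-size> <script> [script args...]
retry_smaller_batches() {
  local batch="$1"
  shift
  while [ "$batch" -ge 1 ]; do
    # RUNNER defaults to a dry run that echoes the command.
    ${RUNNER:-echo uv run} "$@" --batch-size "$batch" && return 0
    batch=$((batch / 2))   # integer division: 8 -> 4 -> 2 -> 1 -> 0 (stop)
  done
  return 1   # every batch size failed; consider a smaller model or remote compute
}

retry_smaller_batches 8 scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B --task gsm8k --limit 20
```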

# Examples

See:
- `examples/USAGE_EXAMPLES.md` for local command patterns
- `scripts/inspect_eval_uv.py`
- `scripts/inspect_vllm_uv.py`
- `scripts/lighteval_vllm_uv.py`

# Limitations
- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.

Published: 5/16/2026
Provider: SkillOPIC
Source type: Imported

sickn33

coding


Source: https://github.com/sickn33/antigravity-awesome-skills


File structure: 6 files, 29.7 KB
- `examples/`
- `scripts/`
- `SKILL.md` (6.7 KB)
