[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-eb4fb73d-9819-407e-8c43-625131d9454f":3,"$fBPYraB2lopNX8ySay9bCIkZsY6RBqRcEUsBga4yGR54":42},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":33},"eb4fb73d-9819-407e-8c43-625131d9454f","senior-data-scientist","World-class senior data scientist specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A\u002FB testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking, using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments.","cat_coding_backend","mod_coding","alirezarezvani,coding","---\nname: \"senior-data-scientist\"\ndescription: World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A\u002FB testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking, using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.\n---\n\n# Senior Data Scientist\n\nWorld-class senior data scientist skill for production-grade AI\u002FML\u002FData systems.\n\n## Core Workflows\n\n### 1. Design an A\u002FB Test\n\n```python\nimport numpy as np\nfrom scipy import stats\n\ndef calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):\n    \"\"\"\n    Calculate required sample size per variant.\n    baseline_rate: current conversion rate (e.g. 
0.10)\n    mde: minimum detectable effect (relative, e.g. 0.05 = 5% lift)\n    \"\"\"\n    p1 = baseline_rate\n    p2 = baseline_rate * (1 + mde)\n    # Standardized effect: difference divided by the average per-group standard deviation.\n    effect_size = abs(p2 - p1) \u002F np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) \u002F 2)\n    z_alpha = stats.norm.ppf(1 - alpha \u002F 2)\n    z_beta = stats.norm.ppf(power)\n    # Two-sample comparison: n per variant = 2 * ((z_alpha + z_beta) \u002F effect_size)^2.\n    n = 2 * ((z_alpha + z_beta) \u002F effect_size) ** 2\n    return int(np.ceil(n))\n\ndef analyze_experiment(control, treatment, alpha=0.05):\n    \"\"\"\n    Run two-proportion z-test and return structured results.\n    control\u002Ftreatment: dicts with 'conversions' and 'visitors'.\n    \"\"\"\n    p_c = control[\"conversions\"] \u002F control[\"visitors\"]\n    p_t = treatment[\"conversions\"] \u002F treatment[\"visitors\"]\n    pooled = (control[\"conversions\"] + treatment[\"conversions\"]) \u002F (control[\"visitors\"] + treatment[\"visitors\"])\n    # Pooled SE is used for the test statistic under H0 (no difference).\n    se = np.sqrt(pooled * (1 - pooled) * (1 \u002F control[\"visitors\"] + 1 \u002F treatment[\"visitors\"]))\n    z = (p_t - p_c) \u002F se\n    p_value = 2 * (1 - stats.norm.cdf(abs(z)))\n    # The confidence interval for the difference uses the unpooled SE.\n    se_diff = np.sqrt(p_c * (1 - p_c) \u002F control[\"visitors\"] + p_t * (1 - p_t) \u002F treatment[\"visitors\"])\n    z_crit = stats.norm.ppf(1 - alpha \u002F 2)\n    ci_low = (p_t - p_c) - z_crit * se_diff\n    ci_high = (p_t - p_c) + z_crit * se_diff\n    return {\n        \"lift\": (p_t - p_c) \u002F p_c,\n        \"p_value\": p_value,\n        \"significant\": p_value \u003C alpha,\n        \"ci_95\": (ci_low, ci_high),  # CI for the absolute difference p_t - p_c at level 1 - alpha\n    }\n\n# --- Experiment checklist ---\n# 1. Define ONE primary metric and pre-register secondary metrics.\n# 2. Calculate sample size BEFORE starting: calculate_sample_size(0.10, 0.05)\n# 3. Randomise at the user (not session) level to avoid leakage.\n# 4. Run for at least 1 full business cycle (typically 2 weeks).\n# 5. Check for sample ratio mismatch: abs(n_control - n_treatment) \u002F expected \u003C 0.01\n# 6. Analyze with analyze_experiment() and report lift + CI, not just p-value.\n# 7. Apply Bonferroni correction if testing multiple metrics: alpha \u002F n_metrics\n```\n\n### 2. 
Build a Feature Engineering Pipeline\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.compose import ColumnTransformer\n\ndef build_feature_pipeline(numeric_cols, categorical_cols, date_cols=None):\n    \"\"\"\n    Returns an unfitted ColumnTransformer for structured tabular data.\n    date_cols is currently unused; derive date features with add_time_features() first.\n    \"\"\"\n    numeric_pipeline = Pipeline([\n        (\"impute\", SimpleImputer(strategy=\"median\")),\n        (\"scale\",  StandardScaler()),\n    ])\n    categorical_pipeline = Pipeline([\n        (\"impute\", SimpleImputer(strategy=\"most_frequent\")),\n        (\"encode\", OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)),\n    ])\n    transformers = [\n        (\"num\", numeric_pipeline, numeric_cols),\n        (\"cat\", categorical_pipeline, categorical_cols),\n    ]\n    return ColumnTransformer(transformers, remainder=\"drop\")\n\ndef add_time_features(df, date_col):\n    \"\"\"Add cyclical day-of-week and month encodings plus a weekend flag from a datetime column.\"\"\"\n    df = df.copy()\n    df[date_col] = pd.to_datetime(df[date_col])\n    df[\"dow_sin\"] = np.sin(2 * np.pi * df[date_col].dt.dayofweek \u002F 7)\n    df[\"dow_cos\"] = np.cos(2 * np.pi * df[date_col].dt.dayofweek \u002F 7)\n    df[\"month_sin\"] = np.sin(2 * np.pi * df[date_col].dt.month \u002F 12)\n    df[\"month_cos\"] = np.cos(2 * np.pi * df[date_col].dt.month \u002F 12)\n    df[\"is_weekend\"] = (df[date_col].dt.dayofweek >= 5).astype(int)\n    return df\n\n# --- Feature engineering checklist ---\n# 1. Never fit transformers on the full dataset: fit on train, then transform test.\n# 2. Log-transform right-skewed numeric features before scaling.\n# 3. For high-cardinality categoricals (>50 levels), use target encoding or embeddings.\n# 4. Build lag\u002Frolling features from past values only (shift before rolling) and split train\u002Ftest by time to avoid leakage.\n# 5. 
Document each feature's business meaning alongside its code.\n```\n\n### 3. Train, Evaluate, and Select a Prediction Model\n\n```python\nfrom sklearn.model_selection import StratifiedKFold, cross_validate\nfrom sklearn.metrics import roc_auc_score, average_precision_score\nimport xgboost as xgb\nimport mlflow\nimport mlflow.sklearn\n\n# Built-in scorer names; make_scorer(..., needs_proba=True) is deprecated since scikit-learn 1.4.\nSCORERS = {\n    \"roc_auc\":  \"roc_auc\",\n    \"avg_prec\": \"average_precision\",\n}\n\ndef evaluate_model(model, X, y, cv=5):\n    \"\"\"\n    Cross-validate and return mean ± std for each scorer.\n    Use StratifiedKFold for classification to preserve class balance.\n    \"\"\"\n    cv_results = cross_validate(\n        model, X, y,\n        cv=StratifiedKFold(n_splits=cv, shuffle=True, random_state=42),\n        scoring=SCORERS,\n        return_train_score=True,\n    )\n    summary = {}\n    for metric in SCORERS:\n        test_scores = cv_results[f\"test_{metric}\"]\n        summary[metric] = {\"mean\": test_scores.mean(), \"std\": test_scores.std()}\n        # Flag overfitting: large gap between train and test score\n        train_mean = cv_results[f\"train_{metric}\"].mean()\n        summary[metric][\"overfit_gap\"] = train_mean - test_scores.mean()\n    return summary\n\ndef train_and_log(model, X_train, y_train, X_test, y_test, run_name):\n    \"\"\"Train model and log all artefacts to MLflow.\"\"\"\n    with mlflow.start_run(run_name=run_name):\n        model.fit(X_train, y_train)\n        proba = model.predict_proba(X_test)[:, 1]\n        metrics = {\n            \"roc_auc\":  roc_auc_score(y_test, proba),\n            \"avg_prec\": average_precision_score(y_test, proba),\n        }\n        mlflow.log_params(model.get_params())\n        mlflow.log_metrics(metrics)\n        mlflow.sklearn.log_model(model, \"model\")\n        return metrics\n\n# --- Model evaluation checklist ---\n# 1. Always report AUC-PR alongside AUC-ROC for imbalanced datasets.\n# 2. 
Treat overfit_gap > 0.05 as a warning sign of overfitting.\n# 3. Calibrate probabilities (Platt scaling \u002F isotonic) before production use.\n# 4. Compute SHAP values to validate feature importance makes business sense.\n# 5. Run a baseline (e.g. DummyClassifier) and verify the model beats it.\n# 6. Log every run to MLflow; never rely on notebook output for comparison.\n```\n\n### 4. Causal Inference: Difference-in-Differences\n\n```python\nimport statsmodels.formula.api as smf\n\ndef diff_in_diff(df, outcome, treatment_col, post_col, controls=None):\n    \"\"\"\n    Estimate ATT via OLS DiD with optional covariates.\n    df must have: outcome, treatment_col (0\u002F1), post_col (0\u002F1).\n    Returns the interaction coefficient (treatment × post) and its p-value.\n    \"\"\"\n    covariates = \" + \".join(controls) if controls else \"\"\n    formula = (\n        f\"{outcome} ~ {treatment_col} * {post_col}\"\n        + (f\" + {covariates}\" if covariates else \"\")\n    )\n    result = smf.ols(formula, data=df).fit(cov_type=\"HC3\")\n    interaction = f\"{treatment_col}:{post_col}\"\n    return {\n        \"att\":     result.params[interaction],\n        \"p_value\": result.pvalues[interaction],\n        \"ci_95\":   result.conf_int().loc[interaction].tolist(),\n        \"summary\": result.summary(),\n    }\n\n# --- Causal inference checklist ---\n# 1. Validate parallel trends in pre-period before trusting DiD estimates.\n# 2. Use HC3 robust standard errors to handle heteroskedasticity.\n# 3. For panel data, cluster SEs at the unit level: fit(cov_type=\"cluster\", cov_kwds={\"groups\": df[unit_col]}), where unit_col is your unit identifier column.\n# 4. Consider propensity score matching if groups differ at baseline.\n# 5. 
Report the ATT with its confidence interval, not just statistical significance.\n```\n\n## Reference Documentation\n\n- **Statistical Methods:** `references\u002Fstatistical_methods_advanced.md`\n- **Experiment Design Frameworks:** `references\u002Fexperiment_design_frameworks.md`\n- **Feature Engineering Patterns:** `references\u002Ffeature_engineering_patterns.md`\n\n## Common Commands\n\n```bash\n# Testing & linting\npython -m pytest tests\u002F -v --cov=src\u002F\npython -m black src\u002F && python -m pylint src\u002F\n\n# Training & evaluation\npython scripts\u002Ftrain.py --config prod.yaml\npython scripts\u002Fevaluate.py --model best.pth\n\n# Deployment\ndocker build -t service:v1 .\nkubectl apply -f k8s\u002F\nhelm upgrade service .\u002Fcharts\u002F\n\n# Monitoring & health\nkubectl logs -f deployment\u002Fservice\npython scripts\u002Fhealth_check.py\n```\n","","imported","https:\u002F\u002Fgithub.com\u002Falirezarezvani\u002Fclaude-skills","user_system_seed","SkillOPIC",true,76,1968,"2026-05-16 13:57:24",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"Programming & Development","coding","mdi-code-braces","Code generation, debugging, and review to boost development efficiency",2,"2026-05-16 
12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":25,"skillCount":32,"createdAt":26},"Backend Development","backend","mdi-server","APIs, databases, and server-side architecture",296,[34],{"id":35,"skillId":4,"version":36,"fileName":37,"fileSize":38,"filePath":39,"fileHash":40,"manifest":41,"createdAt":19},"9bf2a3b3-9249-4193-8063-d19a2e946248","1.0.0","senior-data-scientist.zip",9990,"uploads\u002Fskills\u002Feb4fb73d-9819-407e-8c43-625131d9454f\u002Fsenior-data-scientist.zip","b2ca15c2e08b3cffbd515a0c6370fd1e5021677ffa48f2dec622e2f82e872e7a","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":9124},{\"path\":\"references\u002Fexperiment_design_frameworks.md\",\"isDirectory\":false,\"size\":1435},{\"path\":\"references\u002Ffeature_engineering_patterns.md\",\"isDirectory\":false,\"size\":1435},{\"path\":\"references\u002Fstatistical_methods_advanced.md\",\"isDirectory\":false,\"size\":1435},{\"path\":\"scripts\u002Fexperiment_designer.py\",\"isDirectory\":false,\"size\":2784},{\"path\":\"scripts\u002Ffeature_engineering_pipeline.py\",\"isDirectory\":false,\"size\":2827},{\"path\":\"scripts\u002Fmodel_evaluation_suite.py\",\"isDirectory\":false,\"size\":2797}]",{"code":43,"message":44,"data":45},200,"success",{"items":46,"stats":47,"page":50},[],{"averageRating":48,"totalRatings":48,"ratingCounts":49},0,[48,48,48,48,48],{"limit":51,"offset":48,"hasMore":52,"nextOffset":51,"ratedOnly":16},15,false]