[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"skill-43276a03-8e2f-4e56-b3ae-0e8883908e27":3,"$fwhfPghAieDF5Mrovn0i0Knbb_5w58EzKugIOVQyC-tg":43},{"id":4,"title":5,"description":6,"categoryId":7,"moduleId":8,"tags":9,"prompt":10,"icon":11,"source":12,"sourceUrl":13,"authorId":14,"authorName":15,"isPublic":16,"stars":17,"runs":18,"createdAt":19,"updatedAt":19,"module":20,"category":27,"packages":34},"43276a03-8e2f-4e56-b3ae-0e8883908e27","web-scraper","智能多策略网页抓取。从网页（表格、列表、价格）中提取结构化数据。分页、监控和导出CSV\u002FJSON。","cat_prod_data","mod_productivity","sickn33,productivity","---\nname: web-scraper\ndescription: Web scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV\u002FJSON.\nrisk: safe\nsource: community\ndate_added: '2026-03-06'\nauthor: renat\ntags:\n- scraping\n- data-extraction\n- automation\n- csv\ntools:\n- claude-code\n- antigravity\n- cursor\n- gemini-cli\n- codex-cli\n---\n\n# Web Scraper\n\n## Overview\n\nWeb scraping inteligente multi-estrategia. Extrai dados estruturados de paginas web (tabelas, listas, precos). Paginacao, monitoramento e export CSV\u002FJSON.\n\n## When to Use This Skill\n\n- When the user mentions \"scraper\" or related topics\n- When the user mentions \"scraping\" or related topics\n- When the user mentions \"extrair dados web\" or related topics\n- When the user mentions \"web scraping\" or related topics\n- When the user mentions \"raspar dados\" or related topics\n- When the user mentions \"coletar dados site\" or related topics\n\n## Do Not Use This Skill When\n\n- The task is unrelated to web scraper\n- A simpler, more specific tool can handle the request\n- The user needs general-purpose assistance without domain expertise\n\n## How It Works\n\nExecute phases in strict order. Each phase feeds the next.\n\n```\n1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT\n```\n\nNever skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.\n\n**Fast path**: If user provides URL + clear data target + the request is simple\n(single page, one data type), compress Phases 1-3 into a single action:\nfetch, classify, and extract in one WebFetch call. Still validate and format.\n\n---\n\n## Capabilities\n\n- **Multi-strategy**: WebFetch (static), Browser automation (JS-rendered), Bash\u002Fcurl (APIs), WebSearch (discovery)\n- **Extraction modes**: table, list, article, product, contact, FAQ, pricing, events, jobs, custom\n- **Output formats**: Markdown tables (default), JSON, CSV\n- **Pagination**: auto-detect and follow (page numbers, infinite scroll, load-more)\n- **Multi-URL**: extract same structure across sources with comparison and diff\n- **Validation**: confidence ratings (HIGH\u002FMEDIUM\u002FLOW) on every extraction\n- **Auto-escalation**: WebFetch fails silently -> automatic Browser fallback\n- **Data transforms**: cleaning, normalization, deduplication, enrichment\n- **Differential mode**: detect changes between scraping runs\n\n## Web Scraper\n\nMulti-strategy web data extraction with intelligent approach selection,\nautomatic fallback escalation, data transformation, and structured output.\n\n## Phase 1: Clarify\n\nEstablish extraction parameters before touching any URL.\n\n## Required Parameters\n\n| Parameter     | Resolve                              | Default        |\n|:--------------|:-------------------------------------|:---------------|\n| Target URL(s) | Which page(s) to scrape?             | *(required)*   |\n| Data Target   | What specific data to extract?       | *(required)*   |\n| Output Format | Markdown table, JSON, CSV, or text?  | Markdown table |\n| Scope         | Single page, paginated, or multi-URL?| Single page    |\n\n## Optional Parameters\n\n| Parameter     | Resolve                                | Default      |\n|:--------------|:---------------------------------------|:-------------|\n| Pagination    | Follow pagination? Max pages?          | No, 1 page   |\n| Max Items     | Maximum number of items to collect?    | Unlimited    |\n| Filters       | Data to exclude or include?            | None         |\n| Sort Order    | How to sort results?                   | Source order  |\n| Save Path     | Save to file? Which path?              | Display only |\n| Language      | Respond in which language?             | User's lang  |\n| Diff Mode     | Compare with previous run?             | No           |\n\n## Clarification Rules\n\n- If user provides a URL and clear data target, proceed directly to Phase 2.\n  Do NOT ask unnecessary questions.\n- If request is ambiguous (e.g. \"scrape this site\"), ask ONLY:\n  \"What specific data do you want me to extract from this page?\"\n- Default to Markdown table output. Mention alternatives only if relevant.\n- Accept requests in any language. Always respond in the user's language.\n- If user says \"everything\" or \"all data\", perform recon first, then present\n  what's available and let user choose.\n\n## Discovery Mode\n\nWhen user has a topic but no specific URL:\n1. Use WebSearch to find the most relevant pages\n2. Present top 3-5 URLs with descriptions\n3. Let user choose which to scrape, or scrape all\n4. Proceed to Phase 2 with selected URL(s)\n\nExample: \"find and extract pricing data for CRM tools\"\n-> WebSearch(\"CRM tools pricing comparison 2026\")\n-> Present top results -> User selects -> Extract\n\n---\n\n## Phase 2: Reconnaissance\n\nAnalyze the target page before extraction.\n\n## Step 2.1: Initial Fetch\n\nUse WebFetch to retrieve and analyze the page structure:\n\n```\nWebFetch(\n  url = TARGET_URL,\n  prompt = \"Analyze this page structure and report:\n    1. Page type: article, product listing, search results, data table,\n       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other\n    2. Main content structure: tables, ordered\u002Funordered lists, card grid, free-form text,\n       accordion\u002Fcollapsible sections, tabs\n    3. Approximate number of distinct data items visible\n    4. JavaScript rendering indicators: empty containers, loading spinners,\n       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS\n    5. Pagination: next\u002Fprev links, page numbers, load-more buttons,\n       infinite scroll indicators, total results count\n    6. Data density: how much structured, extractable data exists\n    7. List the main data fields\u002Fcolumns available for extraction\n    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags\n    9. Available download links: CSV, Excel, PDF, API endpoints\"\n)\n```\n\n## Step 2.2: Evaluate Fetch Quality\n\n| Signal                                      | Interpretation                    | Action                    |\n|:--------------------------------------------|:----------------------------------|:--------------------------|\n| Rich content with data clearly visible      | Static page                       | Strategy A (WebFetch)     |\n| Empty containers, \"loading...\", minimal text | JS-rendered                       | Strategy B (Browser)      |\n| Login wall, CAPTCHA, 403\u002F401 response       | Blocked                           | Report to user            |\n| Content present but poorly structured       | Needs precision                   | Strategy B (Browser)      |\n| JSON or XML response body                   | API endpoint                      | Strategy C (Bash\u002Fcurl)    |\n| Download links for CSV\u002FExcel available      | Direct data file                  | Strategy C (download)     |\n\n## Step 2.3: Content Classification\n\nClassify into an extraction mode:\n\n| Mode       | Indicators                                 | Examples                          |\n|:-----------|:-------------------------------------------|:----------------------------------|\n| `table`    | HTML `\u003Ctable>`, grid layout with headers   | Price comparison, statistics, specs|\n| `list`     | Repeated similar elements, card grids      | Search results, product listings  |\n| `article`  | Long-form text with headings\u002Fparagraphs    | Blog post, news article, docs     |\n| `product`  | Product name, price, specs, images, rating | E-commerce product page           |\n| `contact`  | Names, emails, phones, addresses, roles    | Team page, staff directory        |\n| `faq`      | Question-answer pairs, accordions          | FAQ page, help center             |\n| `pricing`  | Plan names, prices, features, tiers        | SaaS pricing page                 |\n| `events`   | Dates, locations, titles, descriptions     | Event listings, conferences       |\n| `jobs`     | Titles, companies, locations, salaries     | Job boards, career pages          |\n| `custom`   | User specified CSS selectors or fields     | Anything not matching above       |\n\nRecord: **page type**, **extraction mode**, **JS rendering needed (yes\u002Fno)**,\n**available fields**, **structured data present (JSON-LD etc.)**.\n\nIf user asked for \"everything\", present the available fields and let them choose.\n\n---\n\n## Phase 3: Strategy Selection\n\nChoose the extraction approach based on recon results.\n\n## Decision Tree\n\n```\nStructured data (JSON-LD, microdata) has what we need?\n |\n +-- YES --> STRATEGY E: Extract structured data directly\n |\n +-- NO: Content fully visible in WebFetch?\n      |\n      +-- YES: Need precise element targeting?\n      |    |\n      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction\n      |    +-- YES --> STRATEGY B: Browser automation\n      |\n      +-- NO: JavaScript rendering detected?\n           |\n           +-- YES --> STRATEGY B: Browser automation\n           +-- NO:  API\u002FJSON\u002FXML endpoint or download link?\n                |\n                +-- YES --> STRATEGY C: Bash (curl + jq)\n                +-- NO  --> Report access issue to user\n```\n\n## Strategy A: Webfetch With Ai Extraction\n\n**Best for**: Static pages, articles, simple tables, well-structured HTML.\n\nUse WebFetch with a targeted extraction prompt tailored to the mode:\n\n```\nWebFetch(\n  url = URL,\n  prompt = \"Extract [DATA_TARGET] from this page.\n    Return ONLY the extracted data as [FORMAT] with these columns\u002Ffields: [FIELDS].\n    Rules:\n    - If a value is missing or unclear, use 'N\u002FA'\n    - Do not include navigation, ads, footers, or unrelated content\n    - Preserve original values exactly (numbers, currencies, dates)\n    - Include ALL matching items, not just the first few\n    - For each item, also extract the URL\u002Flink if available\"\n)\n```\n\n**Auto-escalation**: If WebFetch returns suspiciously few items (less than\n50% of expected from recon), or mostly empty fields, automatically escalate\nto Strategy B without asking user. Log the escalation in notes.\n\n## Strategy B: Browser Automation\n\n**Best for**: JS-rendered pages, SPAs, interactive content, lazy-loaded data.\n\nSequence:\n1. Get tab context: `tabs_context_mcp(createIfEmpty=true)` -> get tabId\n2. Navigate to URL: `navigate(url=TARGET_URL, tabId=TAB)`\n3. Wait for content to load: `computer(action=\"wait\", duration=3, tabId=TAB)`\n4. Check for cookie\u002Fconsent banners: `find(query=\"cookie consent or accept button\", tabId=TAB)`\n   - If found, dismiss it (prefer privacy-preserving option)\n5. Read page structure: `read_page(tabId=TAB)` or `get_page_text(tabId=TAB)`\n6. Locate target elements: `find(query=\"[DESCRIPTION]\", tabId=TAB)`\n7. Extract with JavaScript for precise data via `javascript_tool`\n\n```javascript\n\u002F\u002F Table extraction\nconst rows = document.querySelectorAll('TABLE_SELECTOR tr');\nconst data = Array.from(rows).map(row => {\n  const cells = row.querySelectorAll('td, th');\n  return Array.from(cells).map(c => c.textContent.trim());\n});\nJSON.stringify(data);\n```\n\n```javascript\n\u002F\u002F List\u002Fcard extraction\nconst items = document.querySelectorAll('ITEM_SELECTOR');\nconst data = Array.from(items).map(item => ({\n  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,\n  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,\n  link: item.querySelector('a')?.href || null,\n}));\nJSON.stringify(data);\n```\n\n8. For lazy-loaded content, scroll and re-extract:\n   `computer(action=\"scroll\", scroll_direction=\"down\", tabId=TAB)`\n   then `computer(action=\"wait\", duration=2, tabId=TAB)`\n\n## Strategy C: Bash (Curl + Jq)\n\n**Best for**: REST APIs, JSON endpoints, XML feeds, CSV\u002FExcel downloads.\n\n```bash\n\n## Json Api\n\ncurl -s \"API_URL\" | jq '[.items[] | {field1: .key1, field2: .key2}]'\n\n## Csv Download\n\ncurl -s \"CSV_URL\" -o \u002Ftmp\u002Fscraped_data.csv\n\n## Xml Parsing\n\ncurl -s \"XML_URL\" | python3 -c \"\nimport xml.etree.ElementTree as ET, json, sys\ntree = ET.parse(sys.stdin)\n\n## ... Parse And Output Json\n\n\"\n```\n\n## Strategy D: Hybrid\n\nWhen a single strategy is insufficient, combine:\n1. WebSearch to discover relevant URLs\n2. WebFetch for initial content assessment\n3. Browser automation for JS-heavy sections\n4. Bash for post-processing (jq, python for data cleaning)\n\n## Strategy E: Structured Data Extraction\n\nWhen JSON-LD, microdata, or OpenGraph is present:\n1. Use Browser `javascript_tool` to extract structured data:\n```javascript\nconst scripts = document.querySelectorAll('script[type=\"application\u002Fld+json\"]');\nconst data = Array.from(scripts).map(s => {\n  try { return JSON.parse(s.textContent); } catch { return null; }\n}).filter(Boolean);\nJSON.stringify(data);\n```\n2. This often provides cleaner, more reliable data than DOM scraping\n3. Fall back to DOM extraction only for fields not in structured data\n\n## Pagination Handling\n\nWhen pagination is detected and user wants multiple pages:\n\n**Page-number pagination (any strategy):**\n1. Extract data from current page\n2. Identify URL pattern (e.g. `?page=N`, `\u002Fpage\u002FN`, `&offset=N`)\n3. Iterate through pages up to user's max (default: 5 pages)\n4. Show progress: \"Extracting page 2\u002F5...\"\n5. Concatenate all results, deduplicate if needed\n\n**Infinite scroll (Browser only):**\n1. Extract currently visible data\n2. Record item count\n3. Scroll down: `computer(action=\"scroll\", scroll_direction=\"down\", tabId=TAB)`\n4. Wait: `computer(action=\"wait\", duration=2, tabId=TAB)`\n5. Extract newly loaded data\n6. Compare count - if no new items after 2 scrolls, stop\n7. Repeat until no new content or max iterations (default: 5)\n\n**\"Load More\" button (Browser only):**\n1. Extract currently visible data\n2. Find button: `find(query=\"load more button\", tabId=TAB)`\n3. Click it: `computer(action=\"left_click\", ref=REF, tabId=TAB)`\n4. Wait and extract new content\n5. Repeat until button disappears or max iterations reached\n\n---\n\n## Phase 4: Extract\n\nExecute the selected strategy using mode-specific patterns.\nSee [references\u002Fextraction-patterns.md](references\u002Fextraction-patterns.md)\nfor CSS selectors and JavaScript snippets.\n\n## Table Mode\n\nWebFetch prompt:\n```\n\"Extract ALL rows from the table(s) on this page.\nReturn as a markdown table with exact column headers.\nInclude every row - do not truncate or summarize.\nPreserve numeric precision, currencies, and units.\"\n```\n\n## List Mode\n\nWebFetch prompt:\n```\n\"Extract each [ITEM_TYPE] from this page.\nFor each item, extract: [FIELD_LIST].\nReturn as a JSON array of objects with these keys: [KEY_LIST].\nInclude ALL items, not just the first few. Include link\u002FURL for each item if available.\"\n```\n\n## Article Mode\n\nWebFetch prompt:\n```\n\"Extract article metadata:\n- title, author, date, tags\u002Fcategories, word count estimate\n- Key factual data points, statistics, and named entities\nReturn as structured markdown. Summarize the content; do not reproduce full text.\"\n```\n\n## Product Mode\n\nWebFetch prompt:\n```\n\"Extract product data with these exact fields:\n- name, brand, price, currency, originalPrice (if discounted),\n  availability, description (first 200 chars), rating, reviewCount,\n  specifications (as key-value pairs), productUrl, imageUrl\nReturn as JSON. Use null for missing fields.\"\n```\n\nAlso check for JSON-LD `Product` schema (Strategy E) first.\n\n## Contact Mode\n\nWebFetch prompt:\n```\n\"Extract contact information for each person\u002Fentity:\n- name, title, role, email, phone, address, organization, website, linkedinUrl\nReturn as a markdown table. Only extract real contacts visible on the page.\"\n```\n\n## Faq Mode\n\nWebFetch prompt:\n```\n\"Extract all question-answer pairs from this page.\nFor each FAQ item extract:\n- question: the exact question text\n- answer: the answer text (first 300 chars if long)\n- category: the section\u002Fcategory if grouped\nReturn as a JSON array of objects.\"\n```\n\n## Pricing Mode\n\nWebFetch prompt:\n```\n\"Extract all pricing plans\u002Ftiers from this page.\nFor each plan extract:\n- planName, monthlyPrice, annualPrice, currency\n- features (array of included features)\n- limitations (array of limits or excluded features)\n- ctaText (call-to-action button text)\n- highlighted (true if marked as recommended\u002Fpopular)\nReturn as JSON. Use null for missing fields.\"\n```\n\n## Events Mode\n\nWebFetch prompt:\n```\n\"Extract all events\u002Fsessions from this page.\nFor each event extract:\n- title, date, time, endTime, location, description (first 200 chars)\n- speakers (array of names), category, registrationUrl\nReturn as JSON. Use null for missing fields.\"\n```\n\n## Jobs Mode\n\nWebFetch prompt:\n```\n\"Extract all job listings from this page.\nFor each job extract:\n- title, company, location, salary, salaryRange, type (full-time\u002Fpart-time\u002Fcontract)\n- postedDate, description (first 200 chars), applyUrl, tags\nReturn as JSON. Use null for missing fields.\"\n```\n\n## Custom Mode\n\nWhen user provides specific selectors or field descriptions:\n- Use Browser automation with `javascript_tool` and user's CSS selectors\n- Or use WebFetch with a prompt built from user's field descriptions\n- Always confirm extracted schema with user before proceeding to multi-URL\n\n## Multi-Url Extraction\n\nWhen extracting from multiple URLs:\n1. Extract from the **first URL** to establish the data schema\n2. Show user the first results and confirm the schema is correct\n3. Extract from remaining URLs using the same schema\n4. Add a `source` column\u002Ffield to every record with the origin URL\n5. Combine all results into a single output\n6. Show progress: \"Extracting 3\u002F7 URLs...\"\n\n---\n\n## Phase 5: Transform\n\nClean, normalize, and enrich extracted data before validation.\nSee [references\u002Fdata-transforms.md](references\u002Fdata-transforms.md) for patterns.\n\n## Automatic Transforms (Always Apply)\n\n| Transform              | Action                                               |\n|:-----------------------|:-----------------------------------------------------|\n| Whitespace cleanup     | Trim, collapse multiple spaces, remove `\\n` in cells |\n| HTML entity decode     | `&amp;` -> `&`, `&lt;` -> `\u003C`, `&#39;` -> `'`       |\n| Unicode normalization  | NFKC normalization for consistent characters          |\n| Empty string to null   | `\"\"` -> `null` (for JSON), `\"\"` -> `N\u002FA` (for tables)|\n\n## Conditional Transforms (Apply When Relevant)\n\n| Transform             | When                         | Action                                  |\n|:----------------------|:-----------------------------|:----------------------------------------|\n| Price normalization   | Product\u002Fpricing modes        | Extract numeric value + currency symbol |\n| Date normalization    | Any dates found              | Normalize to ISO-8601 (YYYY-MM-DD)      |\n| URL resolution        | Relative URLs extracted      | Convert to absolute URLs                |\n| Phone normalization   | Contact mode                 | Standardize to E.164 format if possible |\n| Deduplication         | Multi-page or multi-URL      | Remove exact duplicate rows             |\n| Sorting               | User requested or natural    | Sort by user-specified field            |\n\n## Data Enrichment (Only When Useful)\n\n| Enrichment             | When                         | Action                                |\n|:-----------------------|:-----------------------------|:--------------------------------------|\n| Currency conversion    | User asks for single currency| Note original + convert (approximate) |\n| Domain extraction      | URLs in data                 | Add domain column from full URLs      |\n| Word count             | Article mode                 | Count words in extracted text         |\n| Relative dates         | Dates present                | Add \"X days ago\" column if useful     |\n\n## Deduplication Strategy\n\nWhen combining data from multiple pages or URLs:\n1. Exact match: rows with identical values in all fields -> keep first\n2. Near match: rows with same key fields (name+source) but different details\n   -> keep most complete (fewer nulls), flag in notes\n3. Report: \"Removed N duplicate rows\" in delivery notes\n\n---\n\n## Phase 6: Validate\n\nVerify extraction quality before delivering results.\n\n## Validation Checks\n\n| Check                | Action                                              |\n|:---------------------|:----------------------------------------------------|\n| Item count           | Compare extracted count to expected count from recon |\n| Empty fields         | Count N\u002FA or null values per field                   |\n| Data type consistency| Numbers should be numeric, dates parseable           |\n| Duplicates           | Flag exact duplicate rows (post-dedup)               |\n| Encoding             | Check for HTML entities, garbled characters           |\n| Completeness         | All user-requested fields present in output          |\n| Truncation           | Verify data wasn't cut off (check last items)        |\n| Outliers             | Flag values that seem anomalous (e.g. $0.00 price)  |\n\n## Confidence Rating\n\nAssign to every extraction:\n\n| Rating     | Criteria                                                        |\n|:-----------|:----------------------------------------------------------------|\n| **HIGH**   | All fields populated, count matches expected, no anomalies      |\n| **MEDIUM** | Minor gaps (\u003C10% empty fields) or count slightly differs        |\n| **LOW**    | Significant gaps (>10% empty), structural issues, partial data  |\n\nAlways report confidence with specifics:\n> Confidence: **HIGH** - 47 items extracted, all 6 fields populated,\n> matches expected count from page analysis.\n\n## Auto-Recovery (Try Before Reporting Issues)\n\n| Issue              | Auto-Recovery Action                                  |\n|:-------------------|:------------------------------------------------------|\n| Missing data       | Re-attempt with Browser if WebFetch was used          |\n| Encoding problems  | Apply HTML entity decode + unicode normalization      |\n| Incomplete results | Check for pagination or lazy-loading, fetch more      |\n| Count mismatch     | Scroll\u002Fpaginate to find remaining items               |\n| All fields empty   | Page likely JS-rendered, switch to Browser strategy   |\n| Partial fields     | Try JSON-LD extraction as supplement                  |\n\nLog all recovery attempts in delivery notes.\nInform user of any irrecoverable gaps with specific details.\n\n---\n\n## Phase 7: Format And Deliver\n\nStructure results according to user preference.\nSee [references\u002Foutput-templates.md](references\u002Foutput-templates.md)\nfor complete formatting templates.\n\n## Delivery Envelope\n\nALWAYS wrap results with this metadata header:\n\n```markdown\n\n## Extraction Results\n\n**Source:** [Page Title](http:\u002F\u002Fexample.com)\n**Date:** YYYY-MM-DD HH:MM UTC\n**Items:** N records (M fields each)\n**Confidence:** HIGH | MEDIUM | LOW\n**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)\n**Format:** Markdown Table | JSON | CSV\n\n---\n\n[DATA HERE]\n\n---\n\n**Notes:**\n- [Any gaps, issues, or observations]\n- [Transforms applied: deduplication, normalization, etc.]\n- [Pages scraped if paginated: \"Pages 1-5 of 12\"]\n- [Auto-escalation if it occurred: \"Escalated from WebFetch to Browser\"]\n```\n\n## Markdown Table Rules\n\n- Left-align text columns (`:---`), right-align numbers (`---:`)\n- Consistent column widths for readability\n- Include summary row for numeric data when useful (totals, averages)\n- Maximum 10 columns per table; split wider data into multiple tables\n  or suggest JSON format\n- Truncate long cell values to 60 chars with `...` indicator\n- Use `N\u002FA` for missing values, never leave cells empty\n- For multi-page results, show combined table (not per-page)\n\n## Json Rules\n\n- Use camelCase for keys (e.g. `productName`, `unitPrice`)\n- Wrap in metadata envelope:\n  ```json\n  {\n    \"metadata\": {\n      \"source\": \"URL\",\n      \"title\": \"Page Title\",\n      \"extractedAt\": \"ISO-8601\",\n      \"itemCount\": 47,\n      \"fieldCount\": 6,\n      \"confidence\": \"HIGH\",\n      \"strategy\": \"A\",\n      \"transforms\": [\"deduplication\", \"priceNormalization\"],\n      \"notes\": []\n    },\n    \"data\": [ ... ]\n  }\n  ```\n- Pretty-print with 2-space indentation\n- Numbers as numbers (not strings), booleans as booleans\n- null for missing values (not empty strings)\n\n## Csv Rules\n\n- First row is always headers\n- Quote any field containing commas, quotes, or newlines\n- UTF-8 encoding with BOM for Excel compatibility\n- Use `,` as delimiter (standard)\n- Include metadata as comments: `# Source: URL`\n\n## File Output\n\nWhen user requests file save:\n- Markdown: `.md` extension\n- JSON: `.json` extension\n- CSV: `.csv` extension\n- Confirm path before writing\n- Report full file path and item count after saving\n\n## Multi-Url Comparison Format\n\nWhen comparing data across multiple sources:\n- Add `Source` as the first column\u002Ffield\n- Use short identifiers for sources (domain name or user label)\n- Group by source or interleave based on user preference\n- Highlight differences if user asks for comparison\n- Include summary: \"Best price: $X at store-b.com\"\n\n## Differential Output\n\nWhen user requests change detection (diff mode):\n- Compare current extraction with previous run\n- Mark new items with `[NEW]`\n- Mark removed items with `[REMOVED]`\n- Mark changed values with `[WAS: old_value]`\n- Include summary: \"Changes since last run: +5 new, -2 removed, 3 modified\"\n\n---\n\n## Rate Limiting\n\n- Maximum 1 request per 2 seconds for sequential page fetches\n- For multi-URL jobs, process sequentially with pauses\n- If a site returns 429 (Too Many Requests), stop and report to user\n\n## Access Respect\n\n- If a page blocks access (403, CAPTCHA, login wall), report to user\n- Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls\n- Do NOT scrape behind authentication unless user explicitly provides access\n- Respect robots.txt directives when known\n\n## Copyright\n\n- Do NOT reproduce large blocks of copyrighted article text\n- For articles: extract factual data, statistics, and structured info;\n  summarize narrative content\n- Always include source attribution (http:\u002F\u002Fexample.com) in output\n\n## Data Scope\n\n- Extract ONLY what the user explicitly requested\n- Warn user before collecting potentially sensitive data at scale\n  (emails, phone numbers, personal information)\n- Do not store or transmit extracted data beyond what the user sees\n\n## Failure Protocol\n\nWhen extraction fails or is blocked:\n1. Explain the specific reason (JS rendering, bot detection, login, etc.)\n2. Suggest alternatives (different URL, API if available, manual approach)\n3. Never retry aggressively or escalate access attempts\n\n---\n\n## Quick Reference: Mode Cheat Sheet\n\n| User Says...                         | Mode      | Strategy  | Output Default   |\n|:-------------------------------------|:----------|:----------|:-----------------|\n| \"extract the table\"                  | table     | A or B    | Markdown table   |\n| \"get all products\u002Fprices\"            | product   | E then A  | Markdown table   |\n| \"scrape the listings\"                | list      | A or B    | Markdown table   |\n| \"extract contact info \u002F team page\"   | contact   | A         | Markdown table   |\n| \"get the article data\"               | article   | A         | Markdown text    |\n| \"extract the FAQ\"                    | faq       | A or B    | JSON             |\n| \"get pricing plans\"                  | pricing   | A or B    | Markdown table   |\n| \"scrape job listings\"                | jobs      | A or B    | Markdown table   |\n| \"get event schedule\"                 | events    | A or B    | Markdown table   |\n| \"find and extract [topic]\"           | discovery | WebSearch | Markdown table   |\n| \"compare prices across sites\"        | multi-URL | A or B    | Comparison table |\n| \"what changed since last time\"       | diff      | any       | Diff format      |\n\n---\n\n## References\n\n- **Extraction patterns**: [references\u002Fextraction-patterns.md](references\u002Fextraction-patterns.md)\n  CSS selectors, JavaScript snippets, JSON-LD parsing, domain tips.\n\n- **Output templates**: [references\u002Foutput-templates.md](references\u002Foutput-templates.md)\n  Markdown, JSON, CSV templates with complete examples.\n\n- **Data transforms**: [references\u002Fdata-transforms.md](references\u002Fdata-transforms.md)\n  Cleaning, normalization, deduplication, enrichment patterns.\n\n## Best Practices\n\n- Provide clear, specific context about your project and requirements\n- Review all suggestions before applying them to production code\n- Combine with other complementary skills for comprehensive analysis\n\n## Common Pitfalls\n\n- Using this skill for tasks outside its domain expertise\n- Applying recommendations without understanding your specific context\n- Not providing enough project context for accurate analysis\n\n## Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.\n","","imported","https:\u002F\u002Fgithub.com\u002Fsickn33\u002Fantigravity-awesome-skills","user_system_seed","SkillOPIC",true,130,135,"2026-05-16 13:46:44",{"id":8,"name":21,"slug":22,"icon":23,"description":24,"sort":25,"createdAt":26},"效率工具","productivity","mdi-lightning-bolt-outline","文档处理、数据分析、自动化工作流",4,"2026-05-16 12:53:40",{"id":7,"name":28,"slug":29,"icon":30,"description":31,"moduleId":8,"sort":32,"skillCount":33,"createdAt":26},"数据分析","data-analysis","mdi-chart-bar","数据可视化、统计分析",2,30,[35],{"id":36,"skillId":4,"version":37,"fileName":38,"fileSize":39,"filePath":40,"fileHash":41,"manifest":42,"createdAt":19},"dfe6c6ab-23f3-4a98-b8f0-f8a0ada0cc83","1.0.0","web-scraper.zip",25177,"uploads\u002Fskills\u002F43276a03-8e2f-4e56-b3ae-0e8883908e27\u002Fweb-scraper.zip","0d6ae049d106bdeecc62563b4c4928ecd409179180131a24aace7af6b04c62e7","[{\"path\":\"SKILL.md\",\"isDirectory\":false,\"size\":28821},{\"path\":\"references\u002Fdata-transforms.md\",\"isDirectory\":false,\"size\":11666},{\"path\":\"references\u002Fextraction-patterns.md\",\"isDirectory\":false,\"size\":15720},{\"path\":\"references\u002Foutput-templates.md\",\"isDirectory\":false,\"size\":12263}]",{"code":44,"message":45,"data":46},200,"success",{"items":47,"stats":48,"page":51},[],{"averageRating":49,"totalRatings":49,"ratingCounts":50},0,[49,49,49,49,49],{"limit":52,"offset":49,"hasMore":53,"nextOffset":52,"ratedOnly":16},15,false]