Web Scraper Development Skill

Name: web-scraping-guide
Rating: 78
Author: cy5407

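When to Use This Skill

Use it when you need to:

• Add a scraper source (support for a new site)
• Change the cascade search logic (adjust source priority)
• Handle encoding problems (Japanese sites)
• Configure rate limiting (to avoid being blocked)
• Fix a failing scraper (after a site redesign)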

Cascade Search Order

code

AV-WIKI (primary)
    ↓ on failure
chiba-f.net (fallback)
    ↓ on failure
JAVDB (last resort)
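
The order above can be expressed as an ordered list of sources that is walked until one returns a result. The following is a minimal sketch, not the project's code: `StubSource` and `cascade` are hypothetical names introduced for illustration, and the only thing assumed about a real scraper is an async `search(code)` method that returns a dict or `None`.

```python
import asyncio
from typing import Any, Dict, List, Optional, Tuple


class StubSource:
    """Hypothetical stand-in for a real scraper; always returns a canned result."""

    def __init__(self, result: Optional[Dict[str, Any]]):
        self.result = result
        self.calls = 0

    async def search(self, code: str) -> Optional[Dict[str, Any]]:
        self.calls += 1
        return self.result


async def cascade(
    code: str, sources: List[Tuple[str, Any]]
) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Try each source in priority order; stop at the first non-empty result."""
    for name, source in sources:
        result = await source.search(code)
        if result:
            return name, result
    return None  # every source failed


# Priority order taken from the diagram; the results are made up for the demo.
sources = [
    ("AV-WIKI", StubSource(None)),                           # simulated failure
    ("chiba-f.net", StubSource({"title": "from chiba-f"})),  # first hit
    ("JAVDB", StubSource({"title": "from JAVDB"})),          # never reached
]
hit = asyncio.run(cascade("ABC-123", sources))
```

Keeping the order in a list means a priority change is a reorder of that list rather than an edit to control flow.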

Scraper Architecture

code

src/scrapers/
├── sources/
│   ├── avwiki_scraper.py     # AV-WIKI scraper
│   ├── chibaf_scraper.py     # chiba-f scraper
│   └── javdb_scraper.py      # JAVDB scraper
├── cache_manager.py          # cache management
└── unified_scraper.py        # unified interface
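
The contents of `cache_manager.py` and `unified_scraper.py` are not shown here, so the following is only a sketch of how a shared scraper interface and a cache layer could fit together. `Scraper`, `CachedScraper`, `ttl_seconds`, and `clock` are all hypothetical names; the `search` signature is the one used by the scraper class in Step 1.

```python
import asyncio
import time
from typing import Any, Callable, Dict, Optional, Protocol, Tuple


class Scraper(Protocol):
    """Interface every source is assumed to share (hypothetical)."""

    async def search(self, code: str) -> Optional[Dict[str, Any]]: ...


class CachedScraper:
    """Wraps any Scraper with an in-memory cache that expires after a TTL."""

    def __init__(
        self,
        inner: Scraper,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._inner = inner
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    async def search(self, code: str) -> Optional[Dict[str, Any]]:
        now = self._clock()
        entry = self._entries.get(code)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh cache hit, no network request
        result = await self._inner.search(code)
        if result:
            # Cache successes only, so a failed lookup is retried next time
            self._entries[code] = (now, result)
        return result


class EchoScraper:
    """Demo source that counts how often it is actually queried."""

    def __init__(self) -> None:
        self.calls = 0

    async def search(self, code: str) -> Optional[Dict[str, Any]]:
        self.calls += 1
        return {"title": code}


demo = CachedScraper(EchoScraper())
first = asyncio.run(demo.search("ABC-123"))
second = asyncio.run(demo.search("ABC-123"))  # served from the cache
```

Because the wrapper exposes the same `search` method as the scraper it wraps, it can be dropped into the cascade without the caller knowing a cache is involved.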

Adding a New Scraper Source

Step 1: Create the scraper class

File: src/scrapers/sources/example_scraper.py

python

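import logging
from typing import Any, Dict, Optional
from urllib.parse import quote

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class ExampleScraper:
    """Scraper for the Example site."""

    BASE_URL = "https://example.com"

    def __init__(self, safe_searcher):
        # Shared fetcher (e.g. SafeSearcher) that handles rate limiting
        self.safe_searcher = safe_searcher

    async def search(self, code: str) -> Optional[Dict[str, Any]]:
        """Search by product code; return parsed metadata, or None on failure."""
        try:
            # URL-encode the code so unusual characters cannot break the query string
            url = f"{self.BASE_URL}/search?q={quote(code)}"
            html = await self.safe_searcher.fetch(url)

            if not html:
                return None

            soup = BeautifulSoup(html, 'html.parser')
            # Parsing logic...

            return {
                'actresses': [...],
                'title': '...',
                'studio': '...'
            }
        except Exception as e:
            # Swallow the error so the cascade can move on to the next source
            logger.error(f"❌ Example search failed for {code}: {e}")
            return None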
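
The parsing step is left as a placeholder in `search()` because it depends entirely on the target site's markup. As an illustration only, here is what it might look like for an invented page layout. The selectors `h1.title`, `.actress a`, and `.studio`, the helper name `parse_result`, and the sample HTML are all assumptions and do not describe any real site.

```python
from typing import Any, Dict, Optional

from bs4 import BeautifulSoup


def parse_result(html: str) -> Optional[Dict[str, Any]]:
    """Extract metadata from a detail page with the hypothetical layout below."""
    soup = BeautifulSoup(html, 'html.parser')

    title = soup.select_one('h1.title')
    if title is None:
        # No title element: treat the page as "not found"
        return None

    studio = soup.select_one('.studio')
    return {
        'actresses': [a.get_text(strip=True) for a in soup.select('.actress a')],
        'title': title.get_text(strip=True),
        'studio': studio.get_text(strip=True) if studio else None,
    }


sample_html = """
<h1 class="title"> Sample Title </h1>
<div class="actress"><a>Name One</a><a>Name Two</a></div>
<span class="studio">Sample Studio</span>
"""
parsed = parse_result(sample_html)
```

Returning `None` when the key element is missing lets the cascade treat a "no results" page the same way as a failed request.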

Step 2: Integrate into the cascade search

File: src/services/web_searcher.py

python

from scrapers.sources.example_scraper import ExampleScraper

class WebSearcher:
    def __init__(self):
        # Excerpt: assumes self.safe_searcher and the existing avwiki, chibaf,
        # and javdb scrapers are already set up earlier in __init__ (not shown)
        self.example_scraper = ExampleScraper(self.safe_searcher)

    async def cascade_search(self, code: str):
        # 1. AV-WIKI (primary)
        result = await self.avwiki.search(code)
        if result:
            return result

        # 2. Example (newly added)
        result = await self.example_scraper.search(code)
        if result:
            return result

        # 3. chiba-f
        result = await self.chibaf.search(code)
        if result:
            return result

        # 4. JAVDB (last resort)
        return await self.javdb.search(code)

Japanese Encoding Handling

python

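import chardet

# Detect the encoding automatically
def decode_japanese_html(content: bytes) -> str:
    detected = chardet.detect(content)
    # chardet reports None when it cannot decide, so normalize to a string first
    encoding = (detected.get('encoding') or '').upper()

    # Common Japanese encodings
    if encoding in ['SHIFT_JIS', 'EUC-JP']:
        return content.decode(encoding, errors='ignore')
    else:
        # Fall back to UTF-8 for everything else, including failed detection
        return content.decode('utf-8', errors='ignore')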
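
Statistical detection can be unreliable on short pages, so one complementary approach is to trust the charset the page declares and only then fall back to trial decoding. The sketch below uses a different technique from the chardet-based function above and relies on the standard library only. `declared_charset`, `decode_with_fallback`, and the candidate order are assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Matches both <meta charset="..."> and the older http-equiv form
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(content: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag near the top of the page, if any."""
    match = _META_CHARSET.search(content[:2048])
    return match.group(1).decode('ascii') if match else None


def decode_with_fallback(
    content: bytes,
    candidates: Sequence[str] = ('utf-8', 'euc_jp', 'cp932'),
) -> str:
    """Try the declared charset first, then each candidate with strict decoding."""
    declared = declared_charset(content)
    order = ([declared] if declared else []) + list(candidates)
    for encoding in order:
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort: keep the page readable and mark undecodable bytes
    return content.decode('utf-8', errors='replace')


page = '<meta charset="Shift_JIS"><p>こんにちは</p>'.encode('shift_jis')
text = decode_with_fallback(page)
```

`euc_jp` is tried before `cp932` because EUC-JP bytes often decode without error as half-width katakana under CP932, which would silently produce garbled text.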

Rate Limiting

python

# SafeSearcher handles rate limiting automatically
from services.safe_searcher import SafeSearcher

searcher = SafeSearcher(
    min_delay=0.5,      # minimum delay (seconds)
    max_delay=1.5,      # maximum delay (seconds)
    timeout=20,         # request timeout (seconds)
    max_retries=3       # number of retries
)
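
SafeSearcher's internals are not shown, so the following is only a sketch of how parameters like these are commonly used: a random pause before each request, a per-request timeout, and a bounded number of attempts. `ThrottledFetcher`, `fetch_once`, and `sleep` are hypothetical names, and treating `max_retries` as the total number of attempts is an assumption.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


class ThrottledFetcher:
    """Hypothetical sketch of a rate-limited fetcher (not the real SafeSearcher)."""

    def __init__(
        self,
        fetch_once: Callable[[str], Awaitable[str]],
        min_delay: float = 0.5,
        max_delay: float = 1.5,
        timeout: float = 20,
        max_retries: int = 3,
        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    ):
        self._fetch_once = fetch_once  # coroutine that performs one HTTP request
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.timeout = timeout
        self.max_retries = max_retries
        self._sleep = sleep  # injectable so tests do not have to wait

    async def fetch(self, url: str) -> Optional[str]:
        for _ in range(self.max_retries):
            # Random pause before every attempt so requests are not evenly spaced
            await self._sleep(random.uniform(self.min_delay, self.max_delay))
            try:
                return await asyncio.wait_for(self._fetch_once(url), self.timeout)
            except Exception:
                continue  # timeout or transport error: try again
        return None  # all attempts failed


async def fake_request(url: str) -> str:
    """Demo transport that answers immediately without touching the network."""
    return f"<html>{url}</html>"


async def no_wait(seconds: float) -> None:
    """Demo sleep that returns at once."""


fetcher = ThrottledFetcher(fake_request, sleep=no_wait)
body = asyncio.run(fetcher.fetch("https://example.com"))
```

Returning `None` after the final attempt matches the `if not html: return None` check in the scraper class, so a source that is down simply hands over to the next one in the cascade.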
