Scraping Rule Helper - Feature Documentation¶
Overview¶
The Scraping Rule Helper is an interactive TUI tool that helps users discover optimal CSS selectors for extracting clean content from feed entries.
Key Features¶
- ✅ Auto-analyze entry URLs
- ✅ Smart suggestions ranked by quality
- ✅ Live preview of extracted content
- ✅ Custom testing for complex sites
- ✅ One-click save to feed settings
- ✅ Security validation on every fetch (see Security)
Architecture¶
See PR #405 for full technical details.
Components¶
- SecureFetcher: Safe HTTP fetching with security constraints
- ContentAnalyzer: Intelligent content detection and scoring
- ScrapingHelperScreen: Interactive TUI interface
Test Coverage¶
- 61 tests total
- 100% coverage
- All security scenarios validated
Usage¶
Quick Start¶
1. Navigate to an entry in the entry list
2. Press `Shift+X`
3. Review the suggestions (auto-ranked)
4. Select the best match or test a custom selector
5. Press `Ctrl+S` to save
6. Done! ✨
Keybindings¶
| Key | Action |
|---|---|
| `Shift+X` | Open from entry list |
| `↑`/`↓` | Navigate suggestions |
| `Enter` | Select suggestion |
| `t` | Test custom selector |
| `Ctrl+S` | Save rule |
| `Esc` | Close |
How It Works¶
Content Detection Heuristics¶
The analyzer uses multiple signals to rank selectors:
- Semantic tags: `<article>`, `<main>` → High score
- Semantic IDs: `#content`, `#article` → Medium-high
- Semantic classes: `.post`, `.entry` → Medium
- Text length: More content → Higher score
- Paragraph count: More paragraphs → Higher score
- Link density: High links/text ratio → Lower score (likely nav)
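The link-density signal can be illustrated with a short sketch. It uses only the standard library and is not the `ContentAnalyzer` code itself; the names `LinkDensityParser` and `link_density` are hypothetical, and measuring density as "characters inside `<a>` divided by all text characters" is an assumption about what "links/text ratio" means here.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters overall and text characters inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside one or more <a> elements
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())  # ignore surrounding whitespace
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def link_density(html: str) -> float:
    """Return the share of text that sits inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


nav = '<nav><a href="/">Home</a><a href="/about">About</a></nav>'
article = '<article><p>A long paragraph of text.</p><a href="#">more</a></article>'
print(link_density(nav), link_density(article))
```

A navigation block made only of links comes out at 1.0, while an article with one trailing link stays low, which is the distinction the heuristic relies on.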
Scoring Algorithm¶
score = (paragraphs × 5) + min(text_length / 10, 100) + tag_bonus - link_penalty
Tag bonuses:
- <article>: +50
- <main>: +40
- #content, #main-content: +30
- .content, .post: +20
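The formula and bonus table above can be written out as a minimal sketch. `Candidate`, `TAG_BONUS`, and `score` are hypothetical names, not the project's API. The formula does not say how `link_penalty` is derived, so the sketch takes it as an input, and it looks up bonuses by exact selector match, which is a simplification.

```python
from dataclasses import dataclass

# Bonuses copied from the list above; selectors not listed get no bonus.
TAG_BONUS = {
    "article": 50,
    "main": 40,
    "#content": 30,
    "#main-content": 30,
    ".content": 20,
    ".post": 20,
}


@dataclass
class Candidate:
    selector: str
    paragraphs: int
    text_length: int
    # Assumption: the penalty is computed elsewhere (for example from link density).
    link_penalty: float = 0.0


def score(c: Candidate) -> float:
    """score = (paragraphs * 5) + min(text_length / 10, 100) + tag_bonus - link_penalty"""
    tag_bonus = TAG_BONUS.get(c.selector, 0)
    return c.paragraphs * 5 + min(c.text_length / 10, 100) + tag_bonus - c.link_penalty


candidates = [
    Candidate("article", paragraphs=10, text_length=2000),
    Candidate(".post", paragraphs=2, text_length=300, link_penalty=15),
]
ranked = sorted(candidates, key=score, reverse=True)
```

Because the text-length term is capped at 100, a very long block cannot outrank a well-structured one on length alone; paragraph count and the tag bonus decide between long candidates.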
Security¶
Protections¶
- ✅ Blocks private IPs (10.x, 192.168.x, 172.16-31.x)
- ✅ Blocks localhost in all forms
- ✅ Only http/https allowed
- ✅ 5MB size limit
- ✅ 10s timeout
- ✅ HTML sanitization (XSS protection)
- ✅ No JavaScript execution
Threat Mitigation¶
- SSRF: URL validation blocks private/local addresses
- XSS: HTML sanitized with bleach before display
- Resource exhaustion: Size and timeout limits
- Injection: Custom selectors are validated as CSS selectors before use
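As an illustration of the SSRF check, the sketch below validates a URL's scheme and rejects hosts that resolve to private, loopback, or link-local addresses. It is written against the standard library only and is not the `SecureFetcher` implementation; `is_blocked_ip`, `is_url_allowed`, and the injectable `resolve` parameter are hypothetical names introduced for the example.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_blocked_ip(address: str) -> bool:
    """True for private, loopback, link-local, or unspecified addresses."""
    ip = ipaddress.ip_address(address.split("%")[0])  # drop any IPv6 zone id
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_unspecified


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Allow only http/https URLs whose host resolves to public addresses."""
    try:
        parts = urlsplit(url)
        port = parts.port
    except ValueError:  # malformed URL or port
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    default_port = 443 if parts.scheme == "https" else 80
    try:
        infos = resolve(parts.hostname, port or default_port)
    except (OSError, UnicodeError):  # resolution failed or invalid hostname
        return False
    # Every resolved address must be public; one private record is enough to reject.
    return bool(infos) and all(not is_blocked_ip(info[4][0]) for info in infos)


# Stand-in resolvers so the example runs without network access.
def fake_public(host, port):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("93.184.216.34", port))]


def fake_private(host, port):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.5", port))]
```

Checking the resolved addresses rather than the hostname string is what catches names that point at internal hosts. A production fetcher would also need to connect to the same address it validated, since a second DNS lookup could return a different answer.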
Examples¶
Example 1: Blog Post¶
URL: https://blog.example.com/post/123
Top suggestions:
1. ⭐150 - article.post-content
2. ⭐120 - #main-content
3. ⭐100 - .entry-content
→ Select #1, save → Done!
Example 2: News Site¶
URL: https://news.example.com/article/456
Challenge: Multiple <article> tags (teasers + main)
Solution: Analyzer scores main article highest (most paragraphs)
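A rough sketch of that idea: count the paragraphs inside each top-level `<article>` and pick the one with the most. This uses only the standard library; `ArticleParagraphCounter` and `main_article_index` are hypothetical names, and the real analyzer combines paragraph count with the other signals rather than using it alone.

```python
from html.parser import HTMLParser


class ArticleParagraphCounter(HTMLParser):
    """Counts <p> elements inside each top-level <article>, in document order."""

    def __init__(self):
        super().__init__()
        self.counts = []  # one paragraph count per top-level <article>
        self.depth = 0    # nesting depth of <article> elements

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            if self.depth == 0:
                self.counts.append(0)
            self.depth += 1
        elif tag == "p" and self.depth > 0:
            self.counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1


def main_article_index(html: str):
    """Return the index of the <article> with the most paragraphs, or None."""
    parser = ArticleParagraphCounter()
    parser.feed(html)
    parser.close()
    if not parser.counts:
        return None
    return max(range(len(parser.counts)), key=parser.counts.__getitem__)


page = (
    "<article><p>Teaser one</p></article>"
    "<article><p>Intro</p><p>Body</p><p>Conclusion</p></article>"
    "<article><p>Teaser two</p></article>"
)
print(main_article_index(page))
```
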
Example 3: Custom Selector¶
URL: https://difficult.site.com/page
Problem: Generic classes, no semantic tags
Solution:
1. Review suggestions (not great)
2. Enter custom: div.story-text
3. Test → Preview looks good
4. Save → Success!
Performance¶
| Operation | Time |
|---|---|
| Fetch URL | 0.1-2s |
| Analyze | <0.1s |
| Preview | <0.01s |
| Total | ~1-2s |
Future Enhancements¶
Potential improvements:
- Rule templates for popular sites
- Community rule database
- Batch testing on multiple entries
- XPath selector support
- Visual element highlighting
- A/B comparison mode
Dependencies¶
- `httpx` (0.28.1) - Async HTTP
- `beautifulsoup4` (4.14.2) - HTML parsing
- `html5lib` (1.1) - Secure parser
- `bleach` (6.3.0) - Sanitization
All actively maintained, security-focused libraries.
References¶
- PR #405 - Core modules
- PR #406 - Integration
- Issue #391 - Original request
- Miniflux Scraping Rules
This is a unique feature not found in other RSS readers! 🚀