Automated cybersecurity news aggregation with mosaic intelligence clustering.
Continuously monitors 40+ cybersecurity news sources, clusters related stories to reveal the bigger picture, and deploys a live intelligence brief via GitHub Pages.
Live Brief: arandomguyhere.github.io/Google-News-Scraper
- Multi-source scraping: 126 targeted search queries across mainstream media, security publications, threat intel vendors, and international sources
- Mosaic intelligence: Clusters related stories using multi-dimensional entity matching (countries, threat actors, sectors, techniques)
- Cluster confidence scoring: Academic-backed scoring (Silhouette-inspired) rates cluster quality as strong/reasonable/weak
- Source reliability weighting: 50+ sources rated using MBFC/NewsGuard methodology
- Syndication detection: Identifies echo/duplicate content to prevent false confirmation
- Threat actor tracking: 60+ named APT groups, ransomware gangs, and nation-state actors
- Early signal detection: Surfaces stories gaining traction internationally before US mainstream coverage
- Historical archives: Timestamped snapshots for trend analysis
Pipeline:
1. `scraper.py`: runs 126 queries (with `when:24h` freshness), collecting 350+ stories per run
2. `generate_mosaic.py`: story clustering + entity extraction via StoryCorrelator (v3.0)
   - Entity extraction (regex patterns)
   - Multi-dimensional matching (2+ dimensions required)
   - Cluster confidence scoring (5-factor weighted)
   - Source reliability weighting (MBFC/NewsGuard)
   - Syndication/echo detection (85% title similarity)
3. Output: `docs/index.html` (clustered intelligence brief) and `docs/feed.json` (structured data: clusters, connections, timeline)
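The syndication/echo step exists so that one wire story republished by dozens of outlets is not counted as independent confirmation. Below is a minimal sketch of title-based deduplication using only the standard library; the 85% threshold comes from the pipeline above, while the function and field names are illustrative rather than the repository's actual API:

```python
from difflib import SequenceMatcher

ECHO_THRESHOLD = 0.85  # 85% title similarity, per the pipeline above


def is_echo(title_a: str, title_b: str, threshold: float = ECHO_THRESHOLD) -> bool:
    """Return True when two headlines are similar enough to be treated as syndicated copies."""
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= threshold


def drop_echoes(stories: list[dict]) -> list[dict]:
    """Keep the first occurrence of each headline; drop later near-duplicates."""
    kept: list[dict] = []
    for story in stories:
        if not any(is_echo(story["title"], k["title"]) for k in kept):
            kept.append(story)
    return kept


if __name__ == "__main__":
    sample = [
        {"title": "Salt Typhoon breaches US telecom providers", "source": "Reuters"},
        {"title": "Salt Typhoon breaches U.S. telecom providers", "source": "Regional outlet"},
        {"title": "New LockBit variant targets healthcare", "source": "BleepingComputer"},
    ]
    print([s["source"] for s in drop_echoes(sample)])  # the second (echoed) headline is dropped
```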
Scheduled runs:
- Every 6 hours: `0 */6 * * *`
- Weekday mornings: `0 9 * * 1-5`
- Manual: GitHub Actions workflow dispatch
Repository layout:
.
├── scraper.py              # Main scraping engine
├── generate_mosaic.py      # Mosaic intelligence generator
├── requirements.txt        # Python dependencies
├── src/
│   └── processors/
│       ├── story_correlator.py   # Clustering engine
│       └── nlp_processor.py      # NLP utilities (spaCy optional)
├── data/                   # Scraper output (gitignored)
├── docs/                   # GitHub Pages content
│   ├── index.html          # Live newsletter
│   └── feed.json           # Clustered data
└── archives/               # Historical snapshots
Tracked threat actors include:
| Origin | Groups |
|---|---|
| Chinese | Salt/Volt/Flax Typhoon, Mustang Panda, Winnti, Hafnium, APT1/10/27/40/41 |
| Russian | Fancy/Cozy Bear, Sandworm, Turla, Star/Midnight Blizzard, APT28/29 |
| North Korean | Lazarus, Kimsuky, Andariel, BlueNoroff, APT37/38 |
| Iranian | Charming Kitten, MuddyWater, OilRig, Mint/Peach Sandstorm, APT33/34/35 |
| Ransomware | LockBit, BlackCat/ALPHV, Clop, Akira, Rhysida, Black Basta, Play |
| Financial | FIN7/11/12, Scattered Spider, LAPSUS$ |
- Countries: China, Russia, Iran, North Korea, Ukraine, Taiwan, Israel, + 10 more
- Sectors: Healthcare, financial, telecom, energy, defense, government, aerospace
- Techniques: Phishing, lateral movement, C2, credential stuffing, living off the land
- Vulnerabilities: CVE patterns, zero-day, RCE, privilege escalation
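A minimal sketch of the regex-based entity extraction behind these dimensions; the patterns shown are a small illustrative subset, not the full dictionaries in `story_correlator.py`:

```python
import re

# Illustrative subset of the entity dictionaries; the real correlator tracks 60+ actors, many more countries, etc.
ENTITY_PATTERNS = {
    "countries": re.compile(r"\b(China|Chinese|Russia|Russian|Iran|Iranian|North Korea|Ukraine|Taiwan|Israel)\b", re.I),
    "threat_actors": re.compile(r"\b(Salt Typhoon|Volt Typhoon|Lazarus|Sandworm|LockBit|Scattered Spider|APT\d{1,3})\b", re.I),
    "sectors": re.compile(r"\b(healthcare|financial|telecom|energy|defense|government|aerospace)\b", re.I),
    "techniques": re.compile(r"\b(phishing|lateral movement|C2|credential stuffing|living off the land)\b", re.I),
    "vulnerabilities": re.compile(r"\b(CVE-\d{4}-\d{4,7}|zero-day|RCE|privilege escalation)\b", re.I),
}


def extract_entities(text: str) -> dict[str, set[str]]:
    """Return the entity dimensions found in a headline/summary, keyed by dimension name."""
    found = {}
    for dimension, pattern in ENTITY_PATTERNS.items():
        matches = {m.lower() for m in pattern.findall(text)}
        if matches:
            found[dimension] = matches
    return found


print(extract_entities("Salt Typhoon exploits a telecom zero-day to hit US carriers"))
# {'threat_actors': {'salt typhoon'}, 'sectors': {'telecom'}, 'vulnerabilities': {'zero-day'}}
```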
Stories are grouped when they share 2+ entity dimensions:
Example: "China APT telecom" + "China APT infrastructure"
- Shared: countries (China), threat_actors (APT)
- Result: Clustered together
Example: "China scams" + "China rare earths"
- Shared: countries only (China)
- Result: NOT clustered (only 1 dimension)
This prevents overly broad groupings while connecting genuinely related stories.
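The same rule in code: a minimal sketch assuming each story carries entity sets like those produced by the extraction sketch above (names are illustrative):

```python
def shared_dimensions(entities_a: dict[str, set[str]], entities_b: dict[str, set[str]]) -> list[str]:
    """Dimensions where both stories mention at least one common entity."""
    return [dim for dim in entities_a
            if dim in entities_b and entities_a[dim] & entities_b[dim]]


def should_cluster(entities_a, entities_b, min_dimensions: int = 2) -> bool:
    """Cluster only when stories overlap on two or more entity dimensions."""
    return len(shared_dimensions(entities_a, entities_b)) >= min_dimensions


# "China APT telecom" vs "China APT infrastructure": shares countries + threat_actors -> clustered
a = {"countries": {"china"}, "threat_actors": {"apt41"}, "sectors": {"telecom"}}
b = {"countries": {"china"}, "threat_actors": {"apt41"}, "sectors": {"energy"}}
# "China scams" vs "China rare earths": shares countries only -> not clustered
c = {"countries": {"china"}}
d = {"countries": {"china"}}

print(should_cluster(a, b))  # True
print(should_cluster(c, d))  # False
```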
Each cluster receives a confidence score (0-1) based on five factors:
| Factor | Weight | Description |
|---|---|---|
| Entity overlap | 30% | Shared dimensions (countries, actors, sectors) |
| Source quality | 20% | Average reliability of sources in cluster |
| Text similarity | 20% | TF-IDF word overlap between stories |
| Source diversity | 15% | Number of unique sources confirming |
| Temporal coherence | 15% | Stories within tight time window |
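The text-similarity factor can be computed with scikit-learn (already a dependency); here is a minimal sketch using mean pairwise TF-IDF cosine similarity over a cluster's headlines, with the exact vectorizer settings being an assumption rather than the repository's configuration:

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def text_similarity(titles: list[str]) -> float:
    """Mean pairwise TF-IDF cosine similarity across a cluster's headlines (0-1)."""
    if len(titles) < 2:
        return 0.0
    vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
    sims = cosine_similarity(vectors)
    pairs = list(combinations(range(len(titles)), 2))
    return float(sum(sims[i, j] for i, j in pairs) / len(pairs))


print(text_similarity([
    "Salt Typhoon breached major US telecom providers",
    "US telecom providers breached by Salt Typhoon hackers",
]))
```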
Strength thresholds (per Silhouette methodology):
- > 0.7: strong (high-confidence cluster)
- > 0.5: reasonable
- > 0.25: weak
- < 0.25: noise
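Putting the weights and thresholds together, a minimal sketch of the weighted score and the strength labels; the factor values are assumed to be pre-computed on a 0-1 scale (e.g. by helpers like the similarity sketch above):

```python
WEIGHTS = {
    "entity_overlap": 0.30,
    "source_quality": 0.20,
    "text_similarity": 0.20,
    "source_diversity": 0.15,
    "temporal_coherence": 0.15,
}


def cluster_confidence(factors: dict[str, float]) -> float:
    """Weighted sum of the five 0-1 factor scores."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)


def strength_label(score: float) -> str:
    """Map a confidence score to the Silhouette-inspired strength bands."""
    if score > 0.7:
        return "strong"
    if score > 0.5:
        return "reasonable"
    if score > 0.25:
        return "weak"
    return "noise"


factors = {
    "entity_overlap": 0.8,
    "source_quality": 0.75,
    "text_similarity": 0.6,
    "source_diversity": 0.5,
    "temporal_coherence": 0.9,
}
score = cluster_confidence(factors)
print(round(score, 3), strength_label(score))  # 0.72 strong
```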
Sources rated 0.0-1.0 based on Media Bias/Fact Check and NewsGuard methodology:
| Tier | Score | Examples |
|---|---|---|
| 1 | 0.85-1.0 | Reuters, BBC, FT, WSJ, Bloomberg |
| 2 | 0.75-0.85 | Krebs, The Record, Bleeping Computer, CyberScoop |
| 3 | 0.70-0.85 | Mandiant, CrowdStrike, Unit 42 (vendor research) |
| 4 | 0.65-0.80 | SCMP, Nikkei, Al Jazeera (international) |
| 5 | 0.50-0.70 | Mixed reliability sources |
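A minimal sketch of how per-source weights feed the source-quality factor; the example scores fall inside the tier ranges above but are illustrative, not the repository's actual ratings:

```python
# Example weights drawn from the tier ranges above; illustrative, not the repo's actual table.
SOURCE_RELIABILITY = {
    "reuters.com": 0.95,
    "therecord.media": 0.80,
    "bleepingcomputer.com": 0.78,
    "crowdstrike.com": 0.78,
    "scmp.com": 0.70,
}
DEFAULT_RELIABILITY = 0.60  # unknown sources land in the mixed-reliability tier


def source_quality(domains: list[str]) -> float:
    """Average reliability of the sources in a cluster (the 20% 'source quality' factor)."""
    if not domains:
        return 0.0
    return sum(SOURCE_RELIABILITY.get(d, DEFAULT_RELIABILITY) for d in domains) / len(domains)


print(round(source_quality(["reuters.com", "bleepingcomputer.com", "unknown.example"]), 2))  # 0.78
```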
Core dependencies (requirements.txt):
- `requests>=2.28.0`
- `beautifulsoup4>=4.12.0`
- `pandas>=2.0.0`
- `lxml>=4.9.0`
- `scikit-learn>=1.3.0`
- `numpy>=1.20.0,<2.0.0`
Optional: `spacy` with `en_core_web_sm` for enhanced NER (falls back to regex when unavailable)
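A minimal sketch of that optional-spaCy pattern: use `en_core_web_sm` NER when it is installed, otherwise fall back to a regex approximation (the helper here is illustrative, not the actual `nlp_processor.py` interface):

```python
import re

try:
    import spacy
    _NLP = spacy.load("en_core_web_sm")  # pip install spacy && python -m spacy download en_core_web_sm
except (ImportError, OSError):
    _NLP = None  # spaCy or the model is missing; fall back to regex

# Crude multi-word proper-noun matcher used only when spaCy is unavailable.
ORG_PATTERN = re.compile(r"\b[A-Z][a-zA-Z]+(?:\s[A-Z][a-zA-Z]+)+\b")


def extract_orgs(text: str) -> set[str]:
    """Organization-like entities via spaCy NER when available, else a regex approximation."""
    if _NLP is not None:
        return {ent.text for ent in _NLP(text).ents if ent.label_ == "ORG"}
    return set(ORG_PATTERN.findall(text))


print(extract_orgs("Mandiant attributes the telecom intrusions to Salt Typhoon"))
```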
To run your own instance:
1. Fork or clone the repository
2. Enable GitHub Actions
3. Configure GitHub Pages (source: GitHub Actions)
4. Run the workflow manually or wait for a scheduled run
Debug mode (enabled via workflow dispatch) provides:
- Detailed scraper logs
- Extended artifact retention
- Additional diagnostics
Data retention:
| Type | Retention |
|---|---|
| Repository data | Permanent (git) |
| GitHub Pages | Until next deployment |
| Metrics artifacts | 365 days |
| Archives | 90 days |
| Debug logs | 7 days |
Academic research supporting the v3.0 implementation:
Cluster Confidence Scoring
- Rousseeuw, P.J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis." Journal of Computational and Applied Mathematics, 20, 53-65.
- scikit-learn Silhouette Score
Source Reliability Methodology
- "Reliability Estimation of News Media Sources: Birds of a Feather Flock Together" (arXiv, 2024)
- "Political audience diversity and news reliability in algorithmic ranking" (Nature Human Behaviour, 2021)
- Media Bias/Fact Check (MBFC) factuality ratings
- NewsGuard journalist credibility scores
Syndication/Echo Detection
- "Echo Chamber Detection and Analysis" (Social Network Analysis and Mining, 2021)
- "A Survey on Echo Chambers on Social Media" (arXiv, 2021)
News Event Clustering
- "LLM Enhanced Clustering for News Event Detection" (arXiv, 2024)
- "Event Detection in Finance Using Hierarchical Clustering" (PMC, 2021)
- "Temporal-Semantic Clustering of Newspaper Articles" (ResearchGate)
Powered by GitHub Actions and mosaic intelligence clustering