
Bob's Daily Brief

Automated cybersecurity news aggregation with mosaic intelligence clustering.


What It Does

Continuously monitors 40+ cybersecurity news sources, clusters related stories to reveal the bigger picture, and deploys a live intelligence brief via GitHub Pages.

Live Brief: arandomguyhere.github.io/Google-News-Scraper

Features

  • Multi-source scraping: 126 targeted search queries across mainstream media, security publications, threat intel vendors, and international sources
  • Mosaic intelligence: Clusters related stories using multi-dimensional entity matching (countries, threat actors, sectors, techniques)
  • Cluster confidence scoring: Academic-backed scoring (Silhouette-inspired) rates cluster quality as strong/reasonable/weak
  • Source reliability weighting: 50+ sources rated using MBFC/NewsGuard methodology
  • Syndication detection: Identifies echo/duplicate content to prevent false confirmation
  • Threat actor tracking: 60+ named APT groups, ransomware gangs, and nation-state actors
  • Early signal detection: Surfaces stories gaining traction internationally before US mainstream coverage
  • Historical archives: Timestamped snapshots for trend analysis

How It Works

scraper.py              126 queries (with when:24h freshness), 350+ stories per run
    |
    v
generate_mosaic.py      Story clustering + entity extraction
    |
    +-- StoryCorrelator (v3.0)
    |     - Entity extraction (regex patterns)
    |     - Multi-dimensional matching (2+ dimensions required)
    |     - Cluster confidence scoring (5-factor weighted)
    |     - Source reliability weighting (MBFC/NewsGuard)
    |     - Syndication/echo detection (85% title similarity)
    |
    v
docs/index.html         Clustered intelligence brief
docs/feed.json          Structured data (clusters, connections, timeline)
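
The syndication/echo step in the pipeline above flags near-duplicate headlines so that republished wire copy does not masquerade as independent confirmation. A minimal sketch of that check, using Python's standard-library difflib as the similarity measure and the 85% threshold from the diagram; the actual StoryCorrelator implementation may compute title similarity differently:

from difflib import SequenceMatcher

SYNDICATION_THRESHOLD = 0.85  # the 85% title-similarity rule noted above

def is_syndicated(title_a: str, title_b: str) -> bool:
    """Flag two stories as echo/duplicate when their titles are near-identical."""
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= SYNDICATION_THRESHOLD

# Example: a wire story republished under a lightly edited headline
print(is_syndicated(
    "Salt Typhoon breaches major US telecom providers",
    "Salt Typhoon breaches major U.S. telecom providers",
))  # True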

Schedule

  • Every 6 hours: 0 */6 * * *
  • Weekday mornings: 0 9 * * 1-5
  • Manual: GitHub Actions workflow dispatch

Repository Structure

.
├── scraper.py              # Main scraping engine
├── generate_mosaic.py      # Mosaic intelligence generator
├── requirements.txt        # Python dependencies
├── src/
│   └── processors/
│       ├── story_correlator.py   # Clustering engine
│       └── nlp_processor.py      # NLP utilities (spaCy optional)
├── data/                   # Scraper output (gitignored)
├── docs/                   # GitHub Pages content
│   ├── index.html          # Live newsletter
│   └── feed.json           # Clustered data
└── archives/               # Historical snapshots

Tracked Entities

Threat Actors (60+)

Origin        Groups
Chinese       Salt/Volt/Flax Typhoon, Mustang Panda, Winnti, Hafnium, APT1/10/27/40/41
Russian       Fancy/Cozy Bear, Sandworm, Turla, Star/Midnight Blizzard, APT28/29
North Korean  Lazarus, Kimsuky, Andariel, BlueNoroff, APT37/38
Iranian       Charming Kitten, MuddyWater, OilRig, Mint/Peach Sandstorm, APT33/34/35
Ransomware    LockBit, BlackCat/ALPHV, Clop, Akira, Rhysida, Black Basta, Play
Financial     FIN7/11/12, Scattered Spider, LAPSUS$

Other Patterns

  • Countries: China, Russia, Iran, North Korea, Ukraine, Taiwan, Israel, + 10 more
  • Sectors: Healthcare, financial, telecom, energy, defense, government, aerospace
  • Techniques: Phishing, lateral movement, C2, credential stuffing, living off the land
  • Vulnerabilities: CVE patterns, zero-day, RCE, privilege escalation
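
Entity extraction is regex-based (spaCy is optional; see Dependencies). A minimal sketch of that approach using small illustrative subsets of the patterns listed above; the project's real pattern lists are far longer and the dimension names here are assumptions:

import re

# Illustrative subsets of the tracked patterns; the real lists cover 60+ actors, 17+ countries, etc.
PATTERNS = {
    "countries": r"\b(China|Russia|Iran|North Korea|Ukraine|Taiwan|Israel)\b",
    "threat_actors": r"\b(Salt Typhoon|Volt Typhoon|Lazarus|Sandworm|LockBit|APT\d+)\b",
    "sectors": r"\b(healthcare|financial|telecom|energy|defense|government|aerospace)\b",
    "vulnerabilities": r"\b(CVE-\d{4}-\d{4,7}|zero-day|RCE|privilege escalation)\b",
}

def extract_entities(text: str) -> dict[str, set[str]]:
    """Return the set of matched entities for each dimension."""
    return {dim: {m.lower() for m in re.findall(pat, text, re.IGNORECASE)}
            for dim, pat in PATTERNS.items()}

print(extract_entities("China-linked Salt Typhoon exploits CVE-2024-3400 in telecom networks"))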

Clustering Algorithm

Stories are grouped when they share 2+ entity dimensions:

Example: "China APT telecom" + "China APT infrastructure"
  - Shared: countries (China), threat_actors (APT)
  - Result: Clustered together

Example: "China scams" + "China rare earths"
  - Shared: countries only (China)
  - Result: NOT clustered (only 1 dimension)

This prevents overly broad groupings while connecting genuinely related stories.
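
A minimal sketch of the 2-dimension rule, assuming each story has already been reduced to per-dimension entity sets (as in the extraction sketch above); the function names are illustrative rather than the actual StoryCorrelator API:

def shared_dimensions(a: dict[str, set[str]], b: dict[str, set[str]]) -> list[str]:
    """Dimensions in which the two stories share at least one entity."""
    return [dim for dim in a if a[dim] & b.get(dim, set())]

def should_cluster(a, b, min_dimensions: int = 2) -> bool:
    return len(shared_dimensions(a, b)) >= min_dimensions

telecom = {"countries": {"china"}, "threat_actors": {"apt41"}, "sectors": {"telecom"}}
infra   = {"countries": {"china"}, "threat_actors": {"apt41"}, "sectors": {"energy"}}
scams   = {"countries": {"china"}, "threat_actors": set(),     "sectors": set()}

print(should_cluster(telecom, infra))  # True  -- shares countries and threat_actors
print(should_cluster(telecom, scams))  # False -- shares only countries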

Cluster Confidence Scoring (v3.0)

Each cluster receives a confidence score (0-1) based on five factors:

Factor              Weight  Description
Entity overlap      30%     Shared dimensions (countries, actors, sectors)
Source quality      20%     Average reliability of sources in cluster
Text similarity     20%     TF-IDF word overlap between stories
Source diversity    15%     Number of unique sources confirming
Temporal coherence  15%     Stories within tight time window
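
The text-similarity factor measures TF-IDF word overlap. A minimal sketch using scikit-learn (a listed dependency); the vectorizer settings here are assumptions, not the project's actual configuration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(a: str, b: str) -> float:
    """TF-IDF cosine similarity between two story texts (the 20% factor)."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform([a, b])
    return float(cosine_similarity(tfidf)[0, 1])

print(text_similarity(
    "Volt Typhoon maintains access to US critical infrastructure",
    "Chinese hackers Volt Typhoon persist inside US critical infrastructure",
))  # prints a similarity in [0, 1]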

Strength thresholds (per Silhouette methodology):

  • > 0.7 = strong (high-confidence cluster)
  • 0.5-0.7 = reasonable
  • 0.25-0.5 = weak
  • < 0.25 = noise
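
A minimal sketch of how the five weighted factors and thresholds above could combine into a score and strength label, assuming each factor has already been normalised to 0-1; the names are illustrative, not the project's actual API:

WEIGHTS = {
    "entity_overlap":     0.30,
    "source_quality":     0.20,
    "text_similarity":    0.20,
    "source_diversity":   0.15,
    "temporal_coherence": 0.15,
}

def cluster_confidence(factors: dict[str, float]) -> float:
    """Weighted sum of the five 0-1 factor scores."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

def strength(score: float) -> str:
    if score > 0.7:
        return "strong"
    if score > 0.5:
        return "reasonable"
    if score > 0.25:
        return "weak"
    return "noise"

example = {"entity_overlap": 0.9, "source_quality": 0.8, "text_similarity": 0.6,
           "source_diversity": 0.7, "temporal_coherence": 0.9}
score = cluster_confidence(example)
print(round(score, 2), strength(score))  # 0.79 strong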

Source Reliability

Sources rated 0.0-1.0 based on Media Bias/Fact Check and NewsGuard methodology:

Tier  Score      Examples
1     0.85-1.0   Reuters, BBC, FT, WSJ, Bloomberg
2     0.75-0.85  Krebs, The Record, Bleeping Computer, CyberScoop
3     0.70-0.85  Mandiant, CrowdStrike, Unit 42 (vendor research)
4     0.65-0.80  SCMP, Nikkei, Al Jazeera (international)
5     0.50-0.70  Mixed reliability sources
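
A minimal sketch of how per-source ratings could feed the source-quality factor; the scores below are illustrative values within the tier ranges above, not the repository's actual table:

# Illustrative ratings within the tier ranges above (the real table covers 50+ sources)
SOURCE_RELIABILITY = {
    "reuters.com": 0.95,                   # Tier 1
    "therecord.media": 0.80,               # Tier 2
    "unit42.paloaltonetworks.com": 0.78,   # Tier 3 (vendor research)
    "scmp.com": 0.70,                      # Tier 4 (international)
}
DEFAULT_RELIABILITY = 0.55                 # unrated / mixed-reliability fallback

def source_quality(sources: list[str]) -> float:
    """Average reliability of the sources confirming a cluster (the 20% factor)."""
    if not sources:
        return 0.0
    return sum(SOURCE_RELIABILITY.get(s, DEFAULT_RELIABILITY) for s in sources) / len(sources)

print(round(source_quality(["reuters.com", "therecord.media", "scmp.com"]), 2))  # 0.82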

Dependencies

requests>=2.28.0
beautifulsoup4>=4.12.0
pandas>=2.0.0
lxml>=4.9.0
scikit-learn>=1.3.0
numpy>=1.20.0,<2.0.0

Optional: spacy with en_core_web_sm for enhanced NER (falls back to regex)
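
A minimal sketch of that optional-spaCy pattern: try to load en_core_web_sm and fall back to regex extraction when spaCy or the model is unavailable. The regex here is a crude stand-in, not the project's actual fallback logic:

import re

try:
    import spacy
    _NLP = spacy.load("en_core_web_sm")
except (ImportError, OSError):  # spaCy or the model is not installed
    _NLP = None

def named_entities(text: str) -> set[str]:
    """spaCy NER when available, otherwise a capitalised-phrase regex fallback."""
    if _NLP is not None:
        return {ent.text for ent in _NLP(text).ents}
    return set(re.findall(r"\b[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*\b", text))

print(named_entities("Lazarus Group targeted a European defense contractor"))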

Setup

  1. Fork/clone the repository
  2. Enable GitHub Actions
  3. Configure GitHub Pages (source: GitHub Actions)
  4. Run workflow manually or wait for scheduled run

Debug Mode

Enable via workflow dispatch for:

  • Detailed scraper logs
  • Extended artifact retention
  • Additional diagnostics

Data Retention

Type               Retention
Repository data    Permanent (git)
GitHub Pages       Until next deployment
Metrics artifacts  365 days
Archives           90 days
Debug logs         7 days

References

Academic research supporting the v3.0 implementation:

Cluster Confidence Scoring

  • Rousseeuw, P.J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis." Journal of Computational and Applied Mathematics, 20, 53-65.
  • scikit-learn Silhouette Score


Powered by GitHub Actions and mosaic intelligence clustering
