
Bob's Daily Brief

Automated cybersecurity news aggregation with mosaic intelligence clustering.


What It Does

Continuously monitors 40+ cybersecurity news sources, clusters related stories to reveal the bigger picture, and deploys a live intelligence brief via GitHub Pages.

Live Brief: arandomguyhere.github.io/Google-News-Scraper

Features

  • Multi-source scraping: 126 targeted search queries across mainstream media, security publications, threat intel vendors, and international sources
  • Mosaic intelligence: Clusters related stories using multi-dimensional entity matching (countries, threat actors, sectors, techniques)
  • Cluster confidence scoring: Academic-backed scoring (Silhouette-inspired) rates cluster quality as strong/reasonable/weak
  • Source reliability weighting: 50+ sources rated using MBFC/NewsGuard methodology
  • Syndication detection: Identifies echo/duplicate content to prevent false confirmation
  • Threat actor tracking: 60+ named APT groups, ransomware gangs, and nation-state actors
  • Early signal detection: Surfaces stories gaining traction internationally before US mainstream coverage
  • Historical archives: Timestamped snapshots for trend analysis

How It Works

scraper.py              126 queries (with when:24h freshness), 350+ stories per run
    |
    v
generate_mosaic.py      Story clustering + entity extraction
    |
    +-- StoryCorrelator (v3.0)
    |     - Entity extraction (regex patterns)
    |     - Multi-dimensional matching (2+ dimensions required)
    |     - Cluster confidence scoring (5-factor weighted)
    |     - Source reliability weighting (MBFC/NewsGuard)
    |     - Syndication/echo detection (85% title similarity)
    |
    v
docs/index.html         Clustered intelligence brief
docs/feed.json          Structured data (clusters, connections, timeline)
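
The syndication/echo step in the pipeline above flags near-duplicate headlines so that republished wire copy does not masquerade as independent confirmation. A minimal sketch of that check, using Python's standard-library difflib as the similarity measure and the 85% threshold from the diagram; the actual StoryCorrelator implementation may compute title similarity differently:

from difflib import SequenceMatcher

SYNDICATION_THRESHOLD = 0.85  # the 85% title-similarity rule noted above

def is_syndicated(title_a: str, title_b: str) -> bool:
    """Flag two stories as echo/duplicate when their titles are near-identical."""
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= SYNDICATION_THRESHOLD

# Example: a wire story republished under a lightly edited headline
print(is_syndicated(
    "Salt Typhoon breaches major US telecom providers",
    "Salt Typhoon breaches major U.S. telecom providers",
))  # True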

Schedule

  • Every 6 hours: 0 */6 * * *
  • Weekday mornings: 0 9 * * 1-5
  • Manual: GitHub Actions workflow dispatch

Repository Structure

.
├── scraper.py              # Main scraping engine
├── generate_mosaic.py      # Mosaic intelligence generator
├── requirements.txt        # Python dependencies
├── src/
│   └── processors/
│       ├── story_correlator.py   # Clustering engine
│       └── nlp_processor.py      # NLP utilities (spaCy optional)
├── data/                   # Scraper output (gitignored)
├── docs/                   # GitHub Pages content
│   ├── index.html          # Live newsletter
│   └── feed.json           # Clustered data
└── archives/               # Historical snapshots

Tracked Entities

Threat Actors (60+)

Origin        Groups
Chinese       Salt/Volt/Flax Typhoon, Mustang Panda, Winnti, Hafnium, APT1/10/27/40/41
Russian       Fancy/Cozy Bear, Sandworm, Turla, Star/Midnight Blizzard, APT28/29
North Korean  Lazarus, Kimsuky, Andariel, BlueNoroff, APT37/38
Iranian       Charming Kitten, MuddyWater, OilRig, Mint/Peach Sandstorm, APT33/34/35
Ransomware    LockBit, BlackCat/ALPHV, Clop, Akira, Rhysida, Black Basta, Play
Financial     FIN7/11/12, Scattered Spider, LAPSUS$

Other Patterns

  • Countries: China, Russia, Iran, North Korea, Ukraine, Taiwan, Israel, + 10 more
  • Sectors: Healthcare, financial, telecom, energy, defense, government, aerospace
  • Techniques: Phishing, lateral movement, C2, credential stuffing, living off the land
  • Vulnerabilities: CVE patterns, zero-day, RCE, privilege escalation
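
Entity extraction is regex-based (spaCy is optional; see Dependencies). A minimal sketch of that approach using small illustrative subsets of the patterns listed above; the project's real pattern lists are far longer and the dimension names here are assumptions:

import re

# Illustrative subsets of the tracked patterns; the real lists cover 60+ actors, 17+ countries, etc.
PATTERNS = {
    "countries": r"\b(China|Russia|Iran|North Korea|Ukraine|Taiwan|Israel)\b",
    "threat_actors": r"\b(Salt Typhoon|Volt Typhoon|Lazarus|Sandworm|LockBit|APT\d+)\b",
    "sectors": r"\b(healthcare|financial|telecom|energy|defense|government|aerospace)\b",
    "vulnerabilities": r"\b(CVE-\d{4}-\d{4,7}|zero-day|RCE|privilege escalation)\b",
}

def extract_entities(text: str) -> dict[str, set[str]]:
    """Return the set of matched entities for each dimension."""
    return {dim: {m.lower() for m in re.findall(pat, text, re.IGNORECASE)}
            for dim, pat in PATTERNS.items()}

print(extract_entities("China-linked Salt Typhoon exploits CVE-2024-3400 in telecom networks"))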

Clustering Algorithm

Stories are grouped when they share 2+ entity dimensions:

Example: "China APT telecom" + "China APT infrastructure"
  - Shared: countries (China), threat_actors (APT)
  - Result: Clustered together

Example: "China scams" + "China rare earths"
  - Shared: countries only (China)
  - Result: NOT clustered (only 1 dimension)

This prevents overly broad groupings while connecting genuinely related stories.
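
A minimal sketch of the 2-dimension rule, assuming each story has already been reduced to per-dimension entity sets (as in the extraction sketch above); the function names are illustrative rather than the actual StoryCorrelator API:

def shared_dimensions(a: dict[str, set[str]], b: dict[str, set[str]]) -> list[str]:
    """Dimensions in which the two stories share at least one entity."""
    return [dim for dim in a if a[dim] & b.get(dim, set())]

def should_cluster(a, b, min_dimensions: int = 2) -> bool:
    return len(shared_dimensions(a, b)) >= min_dimensions

telecom = {"countries": {"china"}, "threat_actors": {"apt41"}, "sectors": {"telecom"}}
infra   = {"countries": {"china"}, "threat_actors": {"apt41"}, "sectors": {"energy"}}
scams   = {"countries": {"china"}, "threat_actors": set(),     "sectors": set()}

print(should_cluster(telecom, infra))  # True  -- shares countries and threat_actors
print(should_cluster(telecom, scams))  # False -- shares only countries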

Cluster Confidence Scoring (v3.0)

Each cluster receives a confidence score (0-1) based on five factors:

Factor              Weight  Description
Entity overlap      30%     Shared dimensions (countries, actors, sectors)
Source quality      20%     Average reliability of sources in cluster
Text similarity     20%     TF-IDF word overlap between stories
Source diversity    15%     Number of unique sources confirming
Temporal coherence  15%     Stories within tight time window
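
The text-similarity factor measures TF-IDF word overlap. A minimal sketch using scikit-learn (a listed dependency); the vectorizer settings here are assumptions, not the project's actual configuration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def text_similarity(a: str, b: str) -> float:
    """TF-IDF cosine similarity between two story texts (the 20% factor)."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform([a, b])
    return float(cosine_similarity(tfidf)[0, 1])

print(text_similarity(
    "Volt Typhoon maintains access to US critical infrastructure",
    "Chinese hackers Volt Typhoon persist inside US critical infrastructure",
))  # prints a similarity in [0, 1]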

Strength thresholds (per Silhouette methodology):

  • > 0.7 = strong (high-confidence cluster)
  • 0.5-0.7 = reasonable
  • 0.25-0.5 = weak
  • < 0.25 = noise
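
A minimal sketch of how the five weighted factors and thresholds above could combine into a score and strength label, assuming each factor has already been normalised to 0-1; the names are illustrative, not the project's actual API:

WEIGHTS = {
    "entity_overlap":     0.30,
    "source_quality":     0.20,
    "text_similarity":    0.20,
    "source_diversity":   0.15,
    "temporal_coherence": 0.15,
}

def cluster_confidence(factors: dict[str, float]) -> float:
    """Weighted sum of the five 0-1 factor scores."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

def strength(score: float) -> str:
    if score > 0.7:
        return "strong"
    if score > 0.5:
        return "reasonable"
    if score > 0.25:
        return "weak"
    return "noise"

example = {"entity_overlap": 0.9, "source_quality": 0.8, "text_similarity": 0.6,
           "source_diversity": 0.7, "temporal_coherence": 0.9}
score = cluster_confidence(example)
print(round(score, 2), strength(score))  # 0.79 strong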

Source Reliability

Sources rated 0.0-1.0 based on Media Bias/Fact Check and NewsGuard methodology:

Tier  Score      Examples
1     0.85-1.0   Reuters, BBC, FT, WSJ, Bloomberg
2     0.75-0.85  Krebs, The Record, Bleeping Computer, CyberScoop
3     0.70-0.85  Mandiant, CrowdStrike, Unit 42 (vendor research)
4     0.65-0.80  SCMP, Nikkei, Al Jazeera (international)
5     0.50-0.70  Mixed reliability sources
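
A minimal sketch of how per-source ratings could feed the source-quality factor; the scores below are illustrative values within the tier ranges above, not the repository's actual table:

# Illustrative ratings within the tier ranges above (the real table covers 50+ sources)
SOURCE_RELIABILITY = {
    "reuters.com": 0.95,                   # Tier 1
    "therecord.media": 0.80,               # Tier 2
    "unit42.paloaltonetworks.com": 0.78,   # Tier 3 (vendor research)
    "scmp.com": 0.70,                      # Tier 4 (international)
}
DEFAULT_RELIABILITY = 0.55                 # unrated / mixed-reliability fallback

def source_quality(sources: list[str]) -> float:
    """Average reliability of the sources confirming a cluster (the 20% factor)."""
    if not sources:
        return 0.0
    return sum(SOURCE_RELIABILITY.get(s, DEFAULT_RELIABILITY) for s in sources) / len(sources)

print(round(source_quality(["reuters.com", "therecord.media", "scmp.com"]), 2))  # 0.82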

Dependencies

requests>=2.28.0
beautifulsoup4>=4.12.0
pandas>=2.0.0
lxml>=4.9.0
scikit-learn>=1.3.0
numpy>=1.20.0,<2.0.0

Optional: spacy with en_core_web_sm for enhanced NER (falls back to regex)
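
A minimal sketch of that optional-spaCy pattern: try to load en_core_web_sm and fall back to regex extraction when spaCy or the model is unavailable. The regex here is a crude stand-in, not the project's actual fallback logic:

import re

try:
    import spacy
    _NLP = spacy.load("en_core_web_sm")
except (ImportError, OSError):  # spaCy or the model is not installed
    _NLP = None

def named_entities(text: str) -> set[str]:
    """spaCy NER when available, otherwise a capitalised-phrase regex fallback."""
    if _NLP is not None:
        return {ent.text for ent in _NLP(text).ents}
    return set(re.findall(r"\b[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*\b", text))

print(named_entities("Lazarus Group targeted a European defense contractor"))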

Setup

  1. Fork/clone the repository
  2. Enable GitHub Actions
  3. Configure GitHub Pages (source: GitHub Actions)
  4. Run workflow manually or wait for scheduled run

Debug Mode

Enable via workflow dispatch for:

  • Detailed scraper logs
  • Extended artifact retention
  • Additional diagnostics

Data Retention

Type               Retention
Repository data    Permanent (git)
GitHub Pages       Until next deployment
Metrics artifacts  365 days
Archives           90 days
Debug logs         7 days

References

Academic research supporting the v3.0 implementation:

Cluster Confidence Scoring

  • Rousseeuw, P.J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis." Journal of Computational and Applied Mathematics, 20, 53-65.
  • scikit-learn Silhouette Score


Powered by GitHub Actions and mosaic intelligence clustering
