This project provides modular tools for fetching and analyzing WordPress content through the WordPress REST API.
Note: Replace
yourwebsite.comwith your actual website domain in all examples and code.
The project has been organized into the following modules:
-
api.py: Functions for interacting with the WordPress REST API- Handles all API requests to WordPress
- Fetches posts by date range
- Retrieves post analytics data
- Lists available post types
-
content.py: Tools for cleaning HTML content and analyzing reading level- Removes WordPress blocks and HTML tags from content
- Preserves meaningful text structure (lists, paragraphs)
- Calculates Flesch-Kincaid reading grade level
-
util.py: Utility functions like URL pattern generation for analytics- Creates regex patterns for URL matching in analytics platforms
- Processes URLs to extract meaningful patterns
-
dataframe.py: Data processing and manipulation functions- Builds structured DataFrames from WordPress content
- Handles error cases and exceptions
- Exports data to CSV with proper formatting
-
wordpress_fetcher.py: Main script for fetching WordPress data by date range- Provides command-line interface
- Manages credentials and authentication
- Orchestrates the data collection workflow
-
wordpress_topic_fetcher.py: Script for fetching WordPress posts by topic tags- Filters posts by specified topic tags
- Supports multiple topics with comma-separated list
- Lists all available topics in your WordPress site
- Works within a specified date range
- Python 3.6+
- WordPress site with REST API enabled
- User credentials with appropriate permissions
- Clone or download this repository
- Install the required dependencies:
pip install -r requirements.txtCreate a .env file in the project directory with your WordPress credentials:
WORDPRESS_USER=your_username
WORDPRESS_PASSWORD=your_password
No input files are required. The script fetches data directly from the WordPress API based on date parameters.
The main script allows you to fetch WordPress content published within a specific date range:
python wordpress_fetcher.py --start_date 2024-01-01 --end_date 2024-03-31To see all available post types from your WordPress site:
python wordpress_fetcher.py --list-typesThe topic fetcher script allows you to fetch WordPress content that matches specific topic tags:
python wordpress_topic_fetcher.py --start-date 2024-01-01 --end-date 2024-03-31 --topics science,technologyTo see all available topics/tags in your WordPress site:
python wordpress_topic_fetcher.py --list-topicsEach script has built-in help functionality:
# Main script help options for specific modules
python wordpress_fetcher.py --help-api # Show API functions help
python wordpress_fetcher.py --help-content # Show content processing functions help
python wordpress_fetcher.py --help-util # Show utility functions help
python wordpress_fetcher.py --help-dataframe # Show dataframe functions help
# Direct module help
python api.py --help # API module help
python content.py --help # Content processing module help
python util.py --help # Utility module help
python dataframe.py --help # DataFrame module helpThe script generates two files:
wordpress_URLs_<start_date>_to_<end_date>.csv: Contains just the URLs fetchedwordpress_analytics_<start_date>_to_<end_date>.csv: Contains detailed analytics for each URL
The analytics CSV contains the following columns:
url: The full URL of the postregex: A regex pattern for matching the URL in Google Analyticsdate_published: The publication date (YYYY-MM-DD)title: The post titlelabel: The content label (if applicable)topics: List of topics (semicolon-separated)word_count: Number of words in the contentreading_level: Flesch-Kincaid grade level score
Note: The content column is excluded from CSV output to avoid formatting issues with commas.
If you want to incorporate these functions in your own scripts, you can import the modules directly:
from wordpress_fetcher.api import fetch_posts_by_date_range
import base64
import os
# Create authentication header
wordpress_user = os.environ.get('WORDPRESS_USER')
wordpress_password = os.environ.get('WORDPRESS_PASSWORD')
wordpress_credentials = wordpress_user + ":" + wordpress_password
wordpress_token = base64.b64encode(wordpress_credentials.encode())
header = {'Authorization': 'Basic ' + wordpress_token.decode('utf-8')}
# Fetch URLs
urls = fetch_posts_by_date_range('2024-01-01', '2024-01-31', header)
print(f"Found {len(urls)} URLs")from wordpress_fetcher.content import clean_html_content
html = "<p>This is <strong>formatted</strong> content with <a href='#'>links</a>.</p>"
clean_text = clean_html_content(html)
print(clean_text)from wordpress_fetcher.content import analyze_reading_level
text = "This is a sample text to analyze for readability."
grade_level = analyze_reading_level(text)
print(f"Reading grade level: {grade_level}")from wordpress_fetcher.util import create_url_regex_pattern
url = "https://www.yourwebsite.com/2024/02/15/sample-article/"
pattern = create_url_regex_pattern(url)
print(f"Regex pattern: {pattern}")from wordpress_fetcher.dataframe import build_wordpress_analytics_dataframe, save_dataframe_to_csv
# Create authentication header as shown above
# Build DataFrame
df = build_wordpress_analytics_dataframe('2024-01-01', '2024-01-31', header)
# Save to CSV
output_file = save_dataframe_to_csv(df, '2024-01-01', '2024-01-31')
print(f"Data saved to {output_file}")This script relies on the following WordPress REST API endpoints:
-
WordPress Core REST API - for fetching posts by date:
- Path:
/wp-json/wp/v2/{post-type} - Methods: GET
- Parameters:
after,before,per_page,page - Example:
https://www.yourwebsite.com/wp-json/wp/v2/posts?after=2023-01-01T00:00:00Z
- Path:
-
Custom Analytics Endpoint - for fetching detailed content analytics:
- Path:
/wp-json/site-api/v3/analytics/query-by-url/ - Methods: POST
- Body: JSON object with URL parameter
- Example:
{"url": "https://www.yourwebsite.com/example-post/"}
- Path:
To use this script with your WordPress site, you need to either:
- Ensure these endpoints exist on your site, or
- Modify the code to use the endpoints available on your site
Edit the pub_types and endpoint_map variables in api.py to match the post types available on your WordPress site:
pub_types = ['post-type-1', 'post-type-2', 'post-type-3', 'post-type-4', 'post-type-5']
endpoint_map = {
'post-type-1': 'post-type-1',
'post-type-2': 'post-type-2',
'post-type-3': 'post-type-3',
'post-type-4': 'post-type-4',
'post-type-5': 'post-type-5',
}To see all available post types on your site, run:
python wordpress_fetcher.py --list-typesIf your WordPress site uses different API endpoints, modify the URLs in api.py:
queryurl = 'https://www.yourwebsite.com/wp-json/site-api/v3/analytics/query-by-url/'This script uses Basic Authentication for simplicity, but for production environments, consider using more secure authentication methods like OAuth.
- Store your credentials in a
.envfile (never commit this file to version control) - The script loads these credentials and creates an authentication header
- This header is passed to all API requests
For improved security:
- Create a dedicated API user with limited permissions
- Use HTTPS for all API requests
- Consider implementing a token-based authentication system
If you encounter connection errors:
- Verify your WordPress credentials
- Check that your WordPress site has REST API enabled
- Ensure your user has appropriate permissions
- Check network connectivity to your WordPress site
If the output CSV is missing expected data:
- Check that your WordPress site has the expected post types
- Verify that content was published within the specified date range
- Ensure your user has access to view all required content
- Check for API rate limiting issues
If the script runs slowly:
- Reduce the date range to process fewer posts
- Adjust the sleep time in API requests (currently 0.2-0.5 seconds)
- Consider using pagination to process posts in smaller batches