This Python script generates a synthetic dataset of unstructured logs in CSV format, simulating logs from various services over a one-year period. The logs include different severity levels, sources, and detailed messages with dynamic content, suitable for testing, analysis, or demonstration purposes.
Because the script relies on randomization without a fixed seed, every time you execute it:
- New Log Entries: You’ll get a fresh set of log entries, with different timestamps, sources, levels, and messages.
- Varied Data: Usernames, IP addresses, transaction details, and other data points will be unique.
- Different Patterns: While the overall structure and patterns (like error spikes) remain consistent, the specific details and timing will change.
This behavior is beneficial for:
- Testing: Allows you to test your log analysis tools or applications with varied data.
- Simulation: Helps simulate real-world scenarios where log data is never the same.
- Learning: Provides diverse datasets for practice in data analysis, machine learning, or cybersecurity.
- Simulates realistic log data from multiple services:
  - AuthService
  - PaymentService
  - DatabaseService
  - NotificationService
  - CacheService
- Generates over 10,000 log entries with timestamps spanning the past year.
- Includes various log levels:
  - INFO
  - WARN
  - ERROR
  - DEBUG
- Introduces spikes and patterns to mimic real-world scenarios:
  - Error spikes occurring monthly.
  - Security incidents and performance issues.
- Dynamic message content using the Faker library for realistic data.
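The features above can be sketched as a minimal generation loop. Everything in this sketch (variable names, the monthly-spike heuristic, the placeholder message) is an illustrative assumption rather than the actual script, and it omits the Faker-driven messages described later:

```python
import random
from datetime import datetime, timedelta

num_entries = 10000
sources = ['AuthService', 'PaymentService', 'DatabaseService',
           'NotificationService', 'CacheService']
log_levels = ['INFO', 'WARN', 'ERROR', 'DEBUG']

# Timestamps span the past year
end = datetime.utcnow()
start = end - timedelta(days=365)
span = 365 * 24 * 3600  # seconds in a year

rows = []
for _ in range(num_entries):
    ts = start + timedelta(seconds=random.randint(0, span))
    # Crude monthly "error spike": bias toward ERROR on the 1st of each month
    # (an assumption; the real script's spike logic may differ)
    weights = [0.3, 0.1, 0.5, 0.1] if ts.day == 1 else [0.7, 0.1, 0.1, 0.1]
    level = random.choices(log_levels, weights=weights)[0]
    rows.append((ts.strftime('%Y-%m-%dT%H:%M:%SZ'),
                 level, random.choice(sources), 'placeholder message'))
```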
- Python 3.6 or higher
- Faker library
If you're using a GitHub repository:
git clone https://github.com/your-username/log-data-generator.git
cd log-data-generator
It's best practice to use a virtual environment to manage dependencies.
python3 -m venv venv
source venv/bin/activate
Install the required Python packages using pip.
pip install Faker
Run the script to generate the log dataset.
python3 generate_logs.py
The script will create a file named logs_dataset.csv in the current directory.
- The dataset will contain log entries with the following fields:
- @timestamp
- log.level
- source
- message
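With the standard csv module, writing those four fields could look like the following sketch. Only the column names come from the list above; the example row is hand-written:

```python
import csv

# One hand-written example row using the documented fields
entries = [
    {'@timestamp': '2023-08-05T04:00:25Z', 'log.level': 'ERROR',
     'source': 'AuthService',
     'message': 'Critical security vulnerability detected'},
]

with open('logs_dataset.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=['@timestamp', 'log.level', 'source', 'message'])
    writer.writeheader()
    writer.writerows(entries)
```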
Change the num_entries variable in the script to generate more or fewer log entries.
num_entries = 10000 # Set to desired number of entries
Alter the weights in the random.choices function to change the frequency of each log level.
level = random.choices(
log_levels,
weights=[0.7, 0.1, 0.1, 0.1] # Adjust weights for INFO, WARN, ERROR, DEBUG
)[0]
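To sanity-check the weights, you can sample a large batch and inspect the proportions. This is illustrative only, not part of the script:

```python
import random
from collections import Counter

log_levels = ['INFO', 'WARN', 'ERROR', 'DEBUG']
sample = random.choices(log_levels, weights=[0.7, 0.1, 0.1, 0.1], k=100000)
counts = Counter(sample)
# With these weights, INFO should account for roughly 70% of the sample
info_share = counts['INFO'] / len(sample)
```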
Edit the messages dictionary in the script to add or modify message templates for different services and log levels.
messages = {
'AuthService': {
'INFO': [
"User '{user}' logged in successfully from IP address {ip_address}",
# Add more templates as needed
],
# ... other levels and services
},
# ... other services
}
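Each template is a plain str.format string; the script fills the placeholders with Faker-generated values such as fake.user_name() and fake.ipv4(). With hard-coded stand-ins for those values:

```python
template = "User '{user}' logged in successfully from IP address {ip_address}"
# In the script these values come from Faker; fixed values are used here
message = template.format(user='jane_doe', ip_address='192.168.1.100')
```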
An excerpt from the generated logs_dataset.csv file:
@timestamp,log.level,source,message
2022-10-02T03:12:45Z,INFO,AuthService,User 'jane_doe' logged in successfully from IP address 192.168.1.100
2022-11-05T10:45:37Z,ERROR,DatabaseService,Database connection failed: unable to reach primary database cluster 'db-cluster-1'
2023-08-05T04:00:25Z,ERROR,AuthService,Critical security vulnerability detected: unauthorized access attempt to admin panel from IP '10.0.0.5'
...
The repository includes an example_output_logs_dataset.csv file that you can download to see a sample of the output CSV.
Each time you run the script, it will generate a new CSV file with an incremented sequence number and place it in the output_csv directory.
- Output Files:
  - Files are named in the format logs_dataset_0001.csv, logs_dataset_0002.csv, etc.
  - Located in the output_csv directory.
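One way to compute the next sequence-numbered filename is sketched below. This is a plausible approach based on the naming scheme described, not the script's actual logic:

```python
import os
import re

def next_output_path(directory='output_csv'):
    """Return the next logs_dataset_NNNN.csv path inside `directory`."""
    os.makedirs(directory, exist_ok=True)
    pattern = re.compile(r'logs_dataset_(\d{4})\.csv$')
    numbers = []
    for name in os.listdir(directory):
        m = pattern.match(name)
        if m:
            numbers.append(int(m.group(1)))
    next_n = max(numbers, default=0) + 1
    return os.path.join(directory, 'logs_dataset_{:04d}.csv'.format(next_n))
```

On an empty output_csv directory this returns output_csv/logs_dataset_0001.csv; once that file exists, logs_dataset_0002.csv, and so on.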
- Ensure that you have activated your virtual environment (if using one).
- Install the Faker library:
pip install Faker
- If Python 3 is not installed, install it via Homebrew or from the official website: https://www.python.org/downloads/
Contributions are welcome! Please feel free to submit a pull request or open an issue if you have suggestions or find any bugs.
This project is licensed under the MIT License. See the LICENSE file for details.