ninoslavmiskovic/log-data-generator

Unstructured Log Data Generator

This Python script generates a synthetic dataset of unstructured logs in CSV format, simulating logs from various services over a one-year period. The logs include different severity levels, sources, and detailed messages with dynamic content, suitable for testing, analysis, or demonstration purposes.

Ensuring Unique Datasets on Each Execution

Because the script relies on randomization without a fixed seed, every time you execute it:

  • New Log Entries: You’ll get a fresh set of log entries, with different timestamps, sources, levels, and messages.
  • Varied Data: Usernames, IP addresses, transaction details, and other data points will be unique.
  • Different Patterns: While the overall structure and patterns (like error spikes) remain consistent, the specific details and timing will change.

This behavior is beneficial for:

  • Testing: Allows you to test your log analysis tools or applications with varied data.
  • Simulation: Helps simulate real-world scenarios where log data is never the same.
  • Learning: Provides diverse datasets for practice in data analysis, machine learning, or cybersecurity.
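If you ever need a repeatable dataset instead (for example, in a regression test), seeding the random number generator is the usual approach. The sketch below is not taken from generate_logs.py: `sample_entries` and its parameters are hypothetical, and it uses only the standard library to contrast seeded and unseeded runs. Values produced by Faker would need a seeded Faker instance as well.

```python
import random

LOG_LEVELS = ["INFO", "WARN", "ERROR", "DEBUG"]
SOURCES = ["AuthService", "PaymentService", "DatabaseService",
           "NotificationService", "CacheService"]

def sample_entries(count, seed=None):
    """Return `count` (level, source) pairs (hypothetical helper).

    With seed=None the generator is seeded from OS entropy, so each call
    yields different data; with a fixed seed the sequence is repeatable.
    """
    rng = random.Random(seed)
    return [(rng.choice(LOG_LEVELS), rng.choice(SOURCES)) for _ in range(count)]

# The same seed reproduces the same entries.
print(sample_entries(3, seed=42) == sample_entries(3, seed=42))  # True
```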

Features

  • Simulates realistic log data from multiple services:
    • AuthService
    • PaymentService
    • DatabaseService
    • NotificationService
    • CacheService
  • Generates over 10,000 log entries with timestamps spanning the past year.
  • Includes various log levels:
    • INFO
    • WARN
    • ERROR
    • DEBUG
  • Introduces spikes and patterns to mimic real-world scenarios:
    • Error spikes occurring monthly.
    • Security incidents and performance issues.
  • Dynamic message content using the Faker library for realistic data.
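As an illustration of how monthly error spikes could be layered on top of uniformly distributed background traffic, here is a minimal sketch. It is an assumption about the approach, not the script's actual code: the function names, the 30-day spacing used to approximate "monthly", and the spike sizes are all hypothetical.

```python
import random
from datetime import datetime, timedelta, timezone

def background_timestamp(rng, now, days=365):
    """Pick a uniformly random moment within the last `days` days."""
    return now - timedelta(seconds=rng.uniform(0, days * 24 * 3600))

def spike_timestamps(rng, now, months=12, per_spike=20, window_minutes=60):
    """Return timestamps clustered in one short window per 30-day step."""
    stamps = []
    for step in range(1, months + 1):
        window_start = now - timedelta(days=30 * step)
        for _ in range(per_spike):
            offset = timedelta(minutes=rng.uniform(0, window_minutes))
            stamps.append(window_start + offset)
    return stamps

rng = random.Random()
now = datetime.now(timezone.utc)
spikes = spike_timestamps(rng, now)
print(len(spikes), "spike entries, e.g.", spikes[0].strftime("%Y-%m-%dT%H:%M:%SZ"))
```

Entries generated from `spike_timestamps` would be emitted with the ERROR level, while `background_timestamp` would supply the timestamps for ordinary entries.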

Requirements

  • Python 3.6 or higher
  • Faker library

Installation

1. Clone the Repository (If Applicable)

If you're using a GitHub repository:

git clone https://github.com/your-username/log-data-generator.git
cd log-data-generator

2. Set Up a Virtual Environment (Recommended)

It's best practice to use a virtual environment to manage dependencies.

2.1 Create a virtual environment

python3 -m venv venv

2.2 Activate the virtual environment

source venv/bin/activate

3. Install Dependencies

Install the required Python packages using pip.

pip install Faker

4. Usage

Run the script to generate the log dataset.

python3 generate_logs.py

The script will create a file named logs_dataset.csv in the current directory.

  • The dataset will contain log entries with the following fields:
    • @timestamp
    • log.level
    • source
    • message
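To check a generated file quickly, the rows can be read back with the standard csv module. This is a minimal sketch that assumes the header row uses the field names listed above; `count_levels` is a hypothetical helper, and the two sample rows are inlined so the snippet runs without a generated file.

```python
import csv
import io
from collections import Counter

def count_levels(csv_file):
    """Count rows per log level in an open CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row["log.level"] for row in reader)

# Inline sample standing in for a generated file (assumed header names).
sample = io.StringIO(
    "@timestamp,log.level,source,message\n"
    "2022-10-02T03:12:45Z,INFO,AuthService,User 'jane_doe' logged in successfully from IP address 192.168.1.100\n"
    "2022-11-05T10:45:37Z,ERROR,DatabaseService,Database connection failed: unable to reach primary database cluster 'db-cluster-1'\n"
)
print(count_levels(sample))
```

For a real run, open the generated file instead, for example `open("logs_dataset.csv", newline="", encoding="utf-8")`, and pass the file object to `count_levels`.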

5. Customization

5.1. Adjust the Number of Log Entries

Change the num_entries variable in the script to generate more or fewer log entries.

num_entries = 10000  # Set to desired number of entries

5.2. Modify Log Levels Distribution

Alter the weights in the random.choices function to change the frequency of each log level.

level = random.choices(
    log_levels,
    weights=[0.7, 0.1, 0.1, 0.1]  # Adjust weights for INFO, WARN, ERROR, DEBUG
)[0]
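To see what a given set of weights does in practice, you can draw a large sample and compare the observed frequencies. The sketch below assumes `log_levels` is ordered INFO, WARN, ERROR, DEBUG, as the comment above suggests; the fixed seed is only there to make the demonstration repeatable. Note that `random.choices` treats weights as relative, so they do not have to sum to 1.

```python
import random
from collections import Counter

log_levels = ["INFO", "WARN", "ERROR", "DEBUG"]  # assumed order
weights = [0.7, 0.1, 0.1, 0.1]

rng = random.Random(0)  # seeded only for a repeatable demonstration
sample = rng.choices(log_levels, weights=weights, k=10_000)
counts = Counter(sample)

# Observed share of each level; INFO should land close to 70%.
for level in log_levels:
    print(f"{level}: {counts[level] / len(sample):.1%}")
```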

5.3. Update Message Templates

Edit the messages dictionary in the script to add or modify message templates for different services and log levels.

messages = {
    'AuthService': {
        'INFO': [
            "User '{user}' logged in successfully from IP address {ip_address}",
            # Add more templates as needed
        ],
        # ... other levels and services
    },
    # ... other services
}
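The placeholders in a template are filled in with `str.format`. The following sketch shows the mechanism with fixed values; `render_message` is a hypothetical helper, and in the actual script values such as `user` and `ip_address` would come from Faker rather than being passed in by hand.

```python
import random

messages = {
    "AuthService": {
        "INFO": [
            "User '{user}' logged in successfully from IP address {ip_address}",
        ],
    },
}

def render_message(rng, source, level, **fields):
    """Pick a random template for source/level and fill in its placeholders."""
    template = rng.choice(messages[source][level])
    # Unused keyword arguments are ignored; a missing one raises KeyError.
    return template.format(**fields)

print(render_message(random.Random(), "AuthService", "INFO",
                     user="jane_doe", ip_address="192.168.1.100"))
# User 'jane_doe' logged in successfully from IP address 192.168.1.100
```

When you add a template with a new placeholder, make sure the code that renders it supplies a value under the same name.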

6. Example Output

An excerpt from the generated logs_dataset.csv file:

@timestamp,log.level,source,message
2022-10-02T03:12:45Z,INFO,AuthService,User 'jane_doe' logged in successfully from IP address 192.168.1.100
2022-11-05T10:45:37Z,ERROR,DatabaseService,Database connection failed: unable to reach primary database cluster 'db-cluster-1'
2023-08-05T04:00:25Z,ERROR,AuthService,Critical security vulnerability detected: unauthorized access attempt to admin panel from IP '10.0.0.5'
...

The repository also includes example_output_logs_dataset.csv, a sample of the generated output that you can download and inspect.

7. Generating Multiple Datasets

Each time you run the script, it will generate a new CSV file with an incremented sequence number and place it in the output_csv directory.

  • Output Files:

    • Files are named in the format logs_dataset_0001.csv, logs_dataset_0002.csv, etc.
    • Located in the output_csv directory.
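The incrementing sequence number can be derived from the files already present in the output directory. This is a sketch of one way to do it, not necessarily how generate_logs.py does it; `next_output_path` is a hypothetical helper that follows the naming scheme described above.

```python
import re
from pathlib import Path

FILENAME_PATTERN = re.compile(r"logs_dataset_(\d+)\.csv")

def next_output_path(directory="output_csv"):
    """Return the path for the next dataset file, creating the directory if needed."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    highest = 0
    for path in out_dir.glob("logs_dataset_*.csv"):
        match = FILENAME_PATTERN.fullmatch(path.name)
        if match:
            highest = max(highest, int(match.group(1)))
    # Zero-pad to four digits: logs_dataset_0001.csv, logs_dataset_0002.csv, ...
    return out_dir / "logs_dataset_{:04d}.csv".format(highest + 1)
```

Calling `next_output_path()` on an empty directory yields `output_csv/logs_dataset_0001.csv`; once that file exists, the next call yields `logs_dataset_0002.csv`.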

8. Troubleshooting

8.1. ModuleNotFoundError: No module named 'faker'

  • Ensure that you have activated your virtual environment (if using one).
  • Install the Faker library:
pip install Faker
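If the error persists, it can help to confirm which interpreter is running and whether it can see the package. The following diagnostic is a small sketch using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None

# The interpreter path shows whether the virtual environment is active.
print("Interpreter:", sys.executable)
print("faker importable:", is_installed("faker"))
```

If the interpreter path does not point into your venv directory, the virtual environment is probably not activated in the current shell.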

9. Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue if you have suggestions or find any bugs.

10. License

This project is licensed under the MIT License. See the LICENSE file for details.
