json2splunk

json2splunk is a Python script that processes JSON-formatted log files and ingests them into Splunk. It uses multiprocessing to handle multiple files efficiently and pushes data through Splunk's HTTP Event Collector (HEC).

Features

CSV Support: CSV files are supported in addition to JSON.
Multiprocessing Support: Uses multiple CPUs to process files concurrently, via the mpire library (https://github.com/sybrenjansen/mpire).
Flexible File Matching: Configurable file matching rules based on file name/path patterns and path suffixes, allowing selective processing of files.
Splunk Integration: Automates the creation of Splunk indices and HEC tokens, ensuring that data is ingested smoothly and efficiently into Splunk.
Test Mode: Allows running the script in a test configuration where no data is actually sent to Splunk, useful for debugging and validation.
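
To make the Splunk Integration feature more concrete, here is a minimal sketch of the JSON envelope that Splunk's HEC event endpoint accepts. The helper name `build_hec_event` and all sample values are hypothetical; json2splunk itself sends data through the splunk_http_event_collector module, whose internals are not shown here.

```python
import json
from datetime import datetime, timezone


def build_hec_event(event, index, sourcetype, host, source, when):
    """Wrap one parsed log record in an HEC envelope (hypothetical helper)."""
    return {
        "time": when.timestamp(),  # epoch seconds; Splunk stores it as _time
        "host": host,
        "source": source,
        "sourcetype": sourcetype,
        "index": index,
        "event": event,  # the original record, kept as a JSON object
    }


# Sample values only; none of them come from a real Splunk instance.
payload = build_hec_event(
    event={"LastRun": "2024-01-01 00:00:00", "Executable": "TABBY.EXE"},
    index="my_index",
    sourcetype="_json",
    host="HOST-A",
    source="prefetch",
    when=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
body = json.dumps(payload)
```

A body like this is normally sent as an HTTP POST to the `/services/collector/event` endpoint with an `Authorization: Splunk <token>` header.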

Requirements

Linux host (the script runs only on Linux)
Python 3.7 or newer
Standard library modules: argparse, json, logging, os, re, time, datetime, functools, multiprocessing, pathlib
External libraries: polars, yaml (PyYAML), mpire, splunk_http_event_collector
splunk_http_event_collector has been modified to remove multithreading, because json2splunk already uses multiprocessing

Setup

Clone the repository:

git clone https://github.com/maxspl/json2splunk.git
cd json2splunk

Install required Python libraries:
```
pip install -r requirements.txt
```

Configure Splunk Settings: Update splunk_configuration.yml with your Splunk instance details:

splunk:
  host: {splunk_FQDN_or_IP}
  user: {splunk_user}
  password: {splunk_password}
  port: {splunk_port} # Default is 8000
  mport: {splunk_mport} # Default is 8089
  ssl: {splunk_enable_ssl} # Default is False

Set File Matching Rules: Edit indexer_patterns.yml to define the patterns for the files you want to ingest:

<source_name>:
  name_rex: # Regex matching the file name (optional if path_suffix or path_rex is set). Applied to the FILE PATH.
  path_suffix: # Path suffix of the files to index (optional if name_rex or path_rex is set). Matches the end of the path. Ex: "path_suffix: evtx" matches files ending with .jsonl under <any path>/evtx/
  path_rex: # Regex matching the file's parent directory (optional if name_rex or path_suffix is set). Applied to the FILE DIRECTORY (i.e. without the file name).
  sourcetype: # Splunk sourcetype (optional). Defaults to source_name.
  timestamp_path: # Path to the JSON key containing the event timestamp (optional). Several keys can be listed; the first one found in the event is used. Populates the Splunk _time field. Ex: "Event.System.TimeCreated.#attributes.SystemTime"
  timestamp_format: # Format of the extracted timestamp (optional). Ex: "%Y-%m-%dT%H:%M:%S.%fZ"
  host_path: # Path to the JSON key containing the event host (optional). Populates the Splunk host field. Ex: Event.System.Computer
  host_rex: # Regex extracting the host name from the file name or the file path (optional). Populates the Splunk host field.
  artifact: # Alternative to source_name (optional). Useful to define a global name such as EVTX when source_name is very specific, such as windows:evtx:powershell. Defaults to source_name.
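
The matching rules above can be sketched as follows. This is an illustration, not the script's actual code: the function name `matches_rule` is hypothetical, and two behaviours are assumptions, namely that every criterion set in a rule must match and that `path_suffix` is compared against the trailing components of the parent directory.

```python
import re
from pathlib import PurePosixPath


def matches_rule(file_path, rule):
    """Return True if file_path satisfies every criterion set in rule."""
    path = PurePosixPath(file_path)
    parent = path.parent
    if not any(rule.get(key) for key in ("name_rex", "path_suffix", "path_rex")):
        return False  # a rule needs at least one matching criterion
    # name_rex is applied to the full file path, as the comments above state.
    if rule.get("name_rex") and not re.search(rule["name_rex"], str(path)):
        return False
    if rule.get("path_suffix"):
        suffix_parts = PurePosixPath(rule["path_suffix"]).parts
        # Assumption: the parent directory must end with these components.
        if parent.parts[-len(suffix_parts):] != suffix_parts:
            return False
    # path_rex is applied to the parent directory only.
    if rule.get("path_rex") and not re.search(rule["path_rex"], str(parent)):
        return False
    return True


evtx_rule = {"name_rex": r"\.jsonl$", "path_suffix": "evtx"}
prefetch_rule = {"name_rex": r"\.jsonl$", "path_rex": ".*prefetch"}

evtx_hit = matches_rule("/input_sample/evtx/event1.jsonl", evtx_rule)
prefetch_hit = matches_rule("/input_sample/prefetch/HOST-A--prefetch1.jsonl", prefetch_rule)
```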

Usage

Run the script with the required parameters. Example usage:

python json2splunk.py --input /path/to/logs --index my_index
python json2splunk.py --input /path/to/logs --index my_index --config_spl /opt/json2splunk/splunk_configuration.yml --indexer_patterns /opt/json2splunk/indexer_patterns.yml
python json2splunk.py --input /path/to/logs --index my_index --nb_cpu 4
python json2splunk.py --input /path/to/logs --index my_index --ext ".csv,.jsonl"

Parameters

--input: Mandatory. Directory containing the log files to process.
--index: Mandatory. The name of the Splunk index to use.
--nb_cpu: Optional. Specifies the number of CPUs to use for processing. Defaults to the number of available CPUs.
--test: Optional. Enables test mode where no data is sent to Splunk. Useful for debugging.
--config_spl: Optional. Specifies the path to the Splunk configuration file. Defaults to splunk_configuration.yml.
--indexer_patterns: Optional. Specifies the path to the file patterns configuration. Defaults to indexer_patterns.yml.
--ext: Optional. Comma-separated list of file extensions used to prefilter the input directory. Defaults to None.
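
For reference, the parameters listed above could be declared with argparse roughly as below. This is a sketch based only on the descriptions in this section; the helper names `parse_ext_list` and `build_parser` are hypothetical, and splitting `--ext` on commas is inferred from the usage example.

```python
import argparse
import os


def parse_ext_list(value):
    """Split a comma-separated extension list such as '.csv,.jsonl'."""
    return [ext.strip() for ext in value.split(",") if ext.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="json2splunk.py")
    parser.add_argument("--input", required=True, help="Directory containing the log files")
    parser.add_argument("--index", required=True, help="Splunk index to use")
    parser.add_argument("--nb_cpu", type=int, default=os.cpu_count(), help="Number of CPUs to use")
    parser.add_argument("--test", action="store_true", help="Do not send data to Splunk")
    parser.add_argument("--config_spl", default="splunk_configuration.yml")
    parser.add_argument("--indexer_patterns", default="indexer_patterns.yml")
    parser.add_argument("--ext", type=parse_ext_list, default=None)
    return parser


# Parse the same arguments as one of the usage examples above.
args = build_parser().parse_args(
    ["--input", "/path/to/logs", "--index", "my_index", "--ext", ".csv,.jsonl"]
)
```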

Test Mode

Test mode is designed to validate the setup without pushing data to Splunk. It simulates the entire process, from file scanning to data preparation, without making any actual data transmissions to Splunk.

This mode also writes a dataframe of matched files and patterns to a file named test_files_to_index.json, which can be reviewed to confirm correct file handling before live deployment.

For example, the dataframe can be used to review the patterns matched by each file:

[
  {
    "file_path": "input_sample/prefetch/SRV-DA09DKL--prefetch-AA4646DB4646A841_2000000016FC0_D000000018CE8_4_TABBY.EXE-D326E1BD.pf_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "file_name": "SRV-DA09DKL--prefetch-AA4646DB4646A841_2000000016FC0_D000000018CE8_4_TABBY.EXE-D326E1BD.pf_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "source": [
      "prefetch",
      "all"
    ],
    "sourcetype": "_json",
    "timestamp_path": "",
    "timestamp_format": "",
    "host": "SRV-DA09DKL",
    "host_path": null
  },
  {
    "file_path": "input_sample/evtx/SRV-DA09DKL--evtx-AA4646DB4646A841_10000000014B3_E0000000249F8_4_Microsoft-Windows-StorageSettings%4Diagnostic.evtx_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "file_name": "SRV-DA09DKL--evtx-AA4646DB4646A841_10000000014B3_E0000000249F8_4_Microsoft-Windows-StorageSettings%4Diagnostic.evtx_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "source": [
      "evtx",
      "all"
    ],
    "sourcetype": "_json",
    "timestamp_path": [
      "Event.System.TimeCreated.#attributes.SystemTime"
    ],
    "timestamp_format": "%Y-%m-%dT%H:%M:%S.%fZ",
    "host": "Unknown", // Normal as host_path is extracted after the dataframe creation
    "host_path": "Event.System.Computer"
  }
]
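
A short script can help review this output on larger inputs. The sketch below assumes the file is a JSON array of records shaped like the sample above (without the inline annotation, which is not valid JSON); the function name `summarize` is hypothetical and the records are trimmed for brevity.

```python
import json
from collections import Counter


def summarize(records):
    """Count matched files per first source and list files lacking a timestamp path."""
    per_source = Counter(rec["source"][0] for rec in records if rec.get("source"))
    no_timestamp = [rec["file_name"] for rec in records if not rec.get("timestamp_path")]
    return per_source, no_timestamp


# Trimmed records in the same shape as the sample above (file names shortened).
sample = json.loads("""
[
  {"file_name": "HOST-A--prefetch1.jsonl", "source": ["prefetch", "all"],
   "timestamp_path": ""},
  {"file_name": "event1.jsonl", "source": ["evtx", "all"],
   "timestamp_path": ["Event.System.TimeCreated.#attributes.SystemTime"]}
]
""")
per_source, no_timestamp = summarize(sample)
```

With a real run, `sample` would instead be loaded from test_files_to_index.json using `json.load`.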

Example

Directory Structure Example

Let's ingest these files:

/input_sample
├── output
│   ├── app
│   │   ├── error
│   │   │   └── app_error.jsonl
│   │   ├── info
│   │   │   └── app_info.jsonl
│   │   └── debug
│   │       └── app_debug.jsonl
├── prefetch
│   ├── HOST-A--prefetch1.jsonl
│   ├── HOST-A--prefetch2.jsonl
│   └── HOST-A--prefetch3.jsonl
└── evtx
    ├── event1.jsonl
    ├── event2.jsonl
    └── event3.jsonl
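
As an illustration of how such a tree can be scanned and prefiltered by extension (the role of the --ext parameter), here is a minimal sketch. The function name `list_candidate_files` is hypothetical and the script's real scanning logic may differ; the example builds a throwaway directory so that it can run anywhere.

```python
import tempfile
from pathlib import Path


def list_candidate_files(input_dir, extensions=None):
    """Recursively list files under input_dir, optionally keeping only given extensions."""
    files = [p for p in Path(input_dir).rglob("*") if p.is_file()]
    if extensions:
        files = [p for p in files if p.suffix in extensions]
    return sorted(files)


# Build a tiny throwaway tree mirroring part of the layout above.
with tempfile.TemporaryDirectory() as root:
    for rel in ("evtx/event1.jsonl", "evtx/notes.txt", "prefetch/HOST-A--prefetch1.jsonl"):
        target = Path(root, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = [p.relative_to(root).as_posix() for p in list_candidate_files(root, {".jsonl"})]
```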

Patterns Configuration (`indexer_patterns.yml`)

This YAML file specifies which files json2splunk.py should process. Each entry defines one pattern, built from regexes on the file name (or file path) and from path suffixes, together with the processing rules to apply when the matching files are ingested into Splunk.

Warning: Required fields: sourcetype, and one of name_rex, path_suffix.

Warning: If a file matches several artifacts, the first one is selected.

windows:evtx:powershell:
    name_rex: Windows_PowerShell.*\.jsonl$
    path_suffix: evtx
    host_path: "Event.System.Computer" # Extract the host from the event
    timestamp_path:  # Extract the timestamp from the event
      - "Event.System.TimeCreated.#attributes.SystemTime"
      - "Event.Timestamp"
    timestamp_format: "%Y-%m-%dT%H:%M:%S.%fZ" # Specify the timestamp format
    artifact: EVTX
evtx:
    name_rex: \.jsonl$
    path_suffix: evtx
    sourcetype: _json
    host_path: "Event.System.Computer" # Extract the host from the event
    timestamp_path:  # Extract the timestamp from the event
      - "Event.System.TimeCreated.#attributes.SystemTime"
      - "Event.Timestamp"
    timestamp_format: "%Y-%m-%dT%H:%M:%S.%fZ" # Specify the timestamp format
prefetch:
    name_rex: \.jsonl$
    path_rex: ".*prefetch"
    sourcetype: _json
    host_rex: (^[\w-]+)-- # Extract the host from the filename
    timestamp_path: # Extract the timestamp from the event
      - LastRun
    timestamp_format: "%Y-%m-%d %H:%M:%S" # Specify the timestamp format
application:
    path_suffix: output/app
    sourcetype: _json
    host_rex: (^[\w-]+)--
hives:
    name_rex: \.csv$
    path_suffix: hives
    sourcetype: _json
    host_rex: (^[\w-]+)--
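
The timestamp_path, timestamp_format, host_path and host_rex settings used above can be sketched as follows. The function names are hypothetical, and two points are assumptions rather than documented behaviour: key paths are split on dots, and timestamps without an offset are treated as UTC.

```python
import re
from datetime import datetime, timezone


def get_by_path(event, dotted_path):
    """Follow a dot-separated key path through nested dicts; None if absent."""
    node = event
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def extract_time(event, timestamp_paths, timestamp_format):
    """Return epoch seconds from the first timestamp path present in the event."""
    for path in timestamp_paths:
        raw = get_by_path(event, path)
        if raw is not None:
            parsed = datetime.strptime(raw, timestamp_format)
            return parsed.replace(tzinfo=timezone.utc).timestamp()  # assumes UTC
    return None


def extract_host(file_name, host_rex):
    """Return the first capture group of host_rex, or None if it does not match."""
    match = re.search(host_rex, file_name)
    return match.group(1) if match else None


# Made-up event shaped like the evtx examples in this README.
event = {"Event": {"System": {
    "Computer": "SRV-DA09DKL",
    "TimeCreated": {"#attributes": {"SystemTime": "2024-01-01T00:00:00.000000Z"}},
}}}
paths = ["Event.System.TimeCreated.#attributes.SystemTime", "Event.Timestamp"]
fmt = "%Y-%m-%dT%H:%M:%S.%fZ"

event_time = extract_time(event, paths, fmt)
event_host = get_by_path(event, "Event.System.Computer")
file_host = extract_host("HOST-A--prefetch1.jsonl", r"(^[\w-]+)--")
```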

Run the script

python json2splunk.py --input /input_sample --index my_index
