Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADDON-77801 Added support for ingesting json event using conte… #376

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

spanchal-crest
Copy link

Jira: https://splunk.atlassian.net/browse/ADDON-77801

Added support for ingesting the json event using contentctl.

For the windows TA, the SPL2 transformed logs have additional fields along with _raw. So, in order to ingest those samples we need to use services/collector/event endpoint as we want to ingest json data.

For ingesting the events in json format, we should have the below format of the event.
{"event": "raw_event", "fields" : {"fields1" : "value1", "fields2" : "value2"}}

@spanchal-crest spanchal-crest force-pushed the ADDON-77801-support-for-ingesting-json-events branch from 1f21143 to 0f3be11 Compare March 5, 2025 05:14
@pyth0n1c
Copy link
Contributor

pyth0n1c commented Mar 9, 2025

Had some internal discussion directly with the PR author about this, so this PR may change. As such I have converted it to draft at this time.

@pyth0n1c pyth0n1c added WIP Draft and removed WIP labels Mar 9, 2025
@@ -11,3 +11,4 @@ class TestAttackData(BaseModel):
sourcetype: str = Field(...)
custom_index: str | None = None
host: str | None = None
endpoint: str | None = None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if it makes sense to add param, so that it would need to be set externally. It is doable for sure, on the other hand maybe it is generic enough for contentctl to spot that event format is hec format and decide as @spanchal-crest was proposing initally?

Mean there is very specific json format, requiring "event" field and some others: https://docs.splunk.com/Documentation/Splunk/latest/Data/FormateventsforHTTPEventCollector

On the other hand if assuming that having event and not having other fields than listed on above page seems to risky we can stick to the entry here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not exactly sure that I am following your suggestion. Could you provide some pseudocode, by any chance?
Are you saying that, for example, the JSON data that we want to replay, today, is actually JSONL and is a single file that looks like the following:

{"field1": "somevalue", "field2": 4}
{"field1": "someothervalue", "field2": 2}
{"field1": "somethirdvalue", "field2": 1}

But must actually be converted to the following format before it is replayed:

{"sourcetype": "some_sourcetype", "event": {"field1": "somevalue", "field2": 4}}
{"sourcetype": "some_sourcetype", "event": {"field1": "someothervalue", "field2": 2}}
{"sourcetype": "some_sourcetype", "event": {"field1": "somethirdvalue", "field2": 1}}

In line with the requirements for the HEC JSON endpoint?

Secondly, instead of the typing
endpoint: str | None = None
I would suggest

DEFAULT_HEC_ENDPOINT = "services/collector/raw"
`endpoint: str  = DEFAULT_HEC_ENDPOINT`

or, even better
from enum import StrEnum
class ReplayEndpoint(StrEnum):
HEC = "services/collector/raw"
JSON = "services/collector/event"

endpoint: ReplayEndpoint = ReplayEndpoint.HEC


given that we already have thousands of tests and all of them use the raw event collector endpoint.  The typing this way:
1. Ensures that endpoint is never `None` - since it is required it MUST be defined and Test Object creation time (giving it a reasonable default value means we don't need to go update 2,000 existing ymls).
2. Makes it explicit after the object is constructed so that we don't need a check in the replay logic for whether or not the value is None.
3. Means that if someone typos the endpoint, or tries to replay to an endpoint that contentctl does not support, they get a warning at validation time.


A third option could even be to make ReplayEndpoint its own Abstract Class and make HEC and JSON child classes that MUST implement the function `transform` so that if "raw" data (which comes directly from a file) must be "changed" in some way, the logic can be implemented in each of the ReplayEndpoint classes rather than in the replay logic itself.
OR, the JSON event file could be in the `{"sourcetype": "xyz", "event":{...}}` format on-disk.

@@ -1402,6 +1402,10 @@ def hec_raw_replay(
url_with_hec_path = urllib.parse.urljoin(
url_with_port, "services/collector/raw"
)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do if else above with this, not override later?

Is there any test where we can test this behavior?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants