[Algorithm] Drain3 raw log parsing potential enhancement #9
Background: Drain log parsing works best when it ingests only the log content, meaning we trim everything else (timestamp, host, process header) with a simple regex or rule. Slicing the content accurately from
Dec 10 07:28:08 LabSZ sshd[24247]: Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]
to the line below requires prior knowledge of the delimiter, which I am 99% sure users don't care to provide. So we need to adapt Drain to be more robust on raw logs.
Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]
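For reference, the "sliced with prior knowledge" preprocessing described above could look roughly like the sketch below. It assumes a classic syslog-style header (`<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]:`); the pattern and the `slice_content` helper are illustrative names, not part of Drain3, and other log formats would need a different rule, which is exactly the prior knowledge we want to avoid asking for.

```python
import re

# Assumed header layout (syslog style), not something Drain3 provides:
#   "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <content>"
SYSLOG_HEADER = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<content>.*)$"
)


def slice_content(raw_line: str) -> str:
    """Return only the log content; fall back to the raw line if the header does not match."""
    match = SYSLOG_HEADER.match(raw_line)
    return match.group("content") if match else raw_line


raw = "Dec 10 07:28:08 LabSZ sshd[24247]: Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]"
content = slice_content(raw)
print(content)
```

The fallback to the raw line matters here: any line whose header deviates from the assumed layout reaches Drain unsliced, which is the case the enhancement is meant to handle.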
I found a potentially(?) major enhancement to the algorithm for raw log parsing.
The current test, shown below, yields much better clustering than the original unreadable results (over-convergence), but it also requires a small adjustment to the global similarity threshold. So the idea is that each cluster should have its own standard for accepting new templates, rather than a single global constraint. (This is mentioned in the updated version of the research paper; it is not my invention.)
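To make the per-cluster idea concrete, here is a minimal sketch, not the actual patch and not Drain3's implementation. It assumes a simplified token-by-token similarity (matching positions divided by template length, wildcards not counted) and a hypothetical `sim_th` field stored on each cluster; `Cluster`, `similarity`, and `best_match` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional

WILDCARD = "<*>"


@dataclass
class Cluster:
    template: List[str]
    sim_th: float  # hypothetical per-cluster acceptance threshold


def similarity(template: List[str], tokens: List[str]) -> float:
    """Fraction of positions where the template token equals the log token."""
    if not template or len(template) != len(tokens):
        return 0.0
    same = sum(1 for t, tok in zip(template, tokens) if t == tok)
    return same / len(template)


def best_match(clusters: List[Cluster], tokens: List[str]) -> Optional[Cluster]:
    """Pick the most similar cluster whose OWN threshold is met, else None."""
    best, best_sim = None, -1.0
    for cluster in clusters:
        sim = similarity(cluster.template, tokens)
        if sim >= cluster.sim_th and sim > best_sim:
            best, best_sim = cluster, sim
    return best


tokens = "Received disconnect from 112.95.230.3: 11: Bye Bye [preauth]".split()
lenient = Cluster(f"Received disconnect from {WILDCARD} 11: Bye Bye [preauth]".split(), sim_th=0.5)
strict = Cluster(f"Received disconnect from {WILDCARD} 11: Bye Bye [preauth]".split(), sim_th=0.9)

print(best_match([lenient], tokens) is lenient)
print(best_match([strict], tokens) is None)
```

The only difference from a global threshold is that the comparison uses `cluster.sim_th` instead of one shared constant, so a cluster with many variable tokens can be lenient while a nearly constant one stays strict.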
I will attempt to submit a patch to the upstream IBM/Drain3 repo and see if it's accepted.
BUT! To yield the most accurate result, we still need to implement a dynamic threshold calculation and cluster merging for the similarity function.
Threshold 0.4 (Default, not best)
Threshold 0.3: compared to the baseline result below, this looks almost perfect
Baseline: original version without my patch, but sliced with prior knowledge, default threshold 0.4