Fixed a potential bug that may lead to positional misalignment between pred and gt on different devices. #111
In the existing version, when DivLog evaluates the current results, it reorders "gt" according to the DPP algorithm. However, the DPP algorithm is implemented using the `math` library, and we discovered that its computation results vary across different versions of Python, which in turn changes the order of "gt".

More specifically, if the parsed results are generated on device A and the comparison with "gt" is also run on device A, the evaluation results are accurate, because both steps use the same DPP ordering. However, if the results are generated on device A but compared with "gt" on device B, the DPP outcomes on the two devices may differ, so the order of the samples in the results and in "gt" may be misaligned, leading to incorrect evaluation results.
To resolve this issue, this update abandons the ordered one-to-one evaluation based on the post-DPP mapping. Instead, for each original log message, it checks whether the corresponding result and "gt" are consistent, so the evaluation no longer depends on the DPP ordering.
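The difference between the two evaluation strategies can be illustrated with a minimal sketch. This is not the code from this PR: the function names, the sample log messages, and the use of dictionaries keyed by the original log message are all assumptions made for illustration, and the sketch assumes each log message appears only once.

```python
from typing import Dict, List

def positional_accuracy(preds: List[str], gts: List[str]) -> float:
    """Order-dependent: compares the i-th prediction with the i-th ground truth."""
    matches = sum(p == g for p, g in zip(preds, gts))
    return matches / len(gts)

def keyed_accuracy(pred_by_log: Dict[str, str], gt_by_log: Dict[str, str]) -> float:
    """Order-independent: looks up the prediction by the original log message."""
    matches = sum(pred_by_log.get(log) == tmpl for log, tmpl in gt_by_log.items())
    return matches / len(gt_by_log)

# Hypothetical sample data: (original log message, ground-truth template).
records = [
    ("Connected to 10.0.0.1", "Connected to <*>"),
    ("Disk usage 91%", "Disk usage <*>"),
    ("User alice logged in", "User <*> logged in"),
]

# Pretend device A produced its results in one order (e.g. after DPP) ...
order_a = [2, 0, 1]
# ... while device B arranges "gt" in another order.
order_b = [0, 1, 2]

# All predictions are correct; only the ordering differs between devices.
preds_a = [records[i][1] for i in order_a]
gts_b = [records[i][1] for i in order_b]
pos_acc = positional_accuracy(preds_a, gts_b)

# Keyed by the original log message, the ordering no longer matters.
pred_by_log = {records[i][0]: records[i][1] for i in order_a}
gt_by_log = {records[i][0]: records[i][1] for i in order_b}
key_acc = keyed_accuracy(pred_by_log, gt_by_log)

print(pos_acc, key_acc)
```

In this sketch the positional comparison reports every sample as wrong even though each prediction is correct, while the keyed comparison is unaffected by the order in which either side was produced.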
Additionally, in this update, we've added a notice of GPT-3's deprecation to the README.md.