Add metric toggles to leaderboard + Remove failed FPR scores#50
Conversation
Eval run succeeded! Link to run: link

Here are the results of the submission(s). Detailed results of each detector's performance on the test set have been committed to this PR. The aggregate score is the TPR at FPR=5% on the RAID dataset as a whole (aggregated across all generation models, domains, decoding strategies, repetition penalties, and adversarial attacks).

| Detector | Release date | TPR@FPR=5% |
|---|---|---|
| e5-small-lora | 2024-11-07 | 85.69% |
| LLMDet | 2023-05-24 | 26.70% |
| Luminar | 2025-05-17 | 100.00% |
| SpeedAI | 2025-05-08 | 99.62% |
| It's AI | 2025-04-01 | 94.15% |
| RoBERTa-base (GPT2) | 2019-08-24 | 51.77% |
| RoBERTa (ChatGPT) | 2023-01-18 | 26.64% |
| Desklib | 2024-10-03 | 83.76% |
| RoBERTa-large (GPT2) | 2019-08-24 | 50.70% |
| SuperAnnotate AI Detector | 2024-10-27 | 64.87% |
| GLTR | 2019-06-10 | 51.48% |
| Desklib AI Text Detector v1.01 | 2025-02-16 | 91.17% |
| Binoculars | 2024-01-22 | — \* |
| RADAR | 2023-07-07 | 63.91% |
| Gaussian Extreme | 2025-05-17 | 97.10% |
| FastDetectGPT | 2023-10-08 | — † |

\* **Binoculars:** Warning: no aggregate score across all settings is reported, as some domains/generator models/decoding strategies/repetition penalties/adversarial attacks were not included in the submission. This submission will not appear in the main leaderboard; it will only be visible within the splits in which all samples were evaluated.

† **FastDetectGPT:** Warning: no aggregate score across all settings is reported, as some domains/generator models/decoding strategies/repetition penalties/adversarial attacks were not included in the submission; it will likewise only be visible within the splits in which all samples were evaluated. Additionally, no aggregate score across all non-adversarial settings is reported, as some domains/generator models/decoding strategies/repetition penalties were not included in the submission.
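For readers unfamiliar with the metric: TPR@FPR=5% is conventionally computed by choosing the score threshold at which at most 5% of human-written texts are flagged, then measuring the fraction of machine-generated texts caught at that threshold. A minimal sketch with made-up scores (not the actual RAID evaluation code):

```python
import numpy as np

def tpr_at_fpr(human_scores, machine_scores, target_fpr=0.05):
    """Pick the threshold whose FPR on human texts is at most
    target_fpr, then measure the TPR on machine texts."""
    # The (1 - target_fpr) quantile of human scores: at most
    # target_fpr of human-written texts score above it.
    threshold = np.quantile(human_scores, 1 - target_fpr)
    tpr = float(np.mean(np.asarray(machine_scores) > threshold))
    return threshold, tpr

# Toy example: a detector that separates the two classes fairly well.
rng = np.random.default_rng(0)
human = rng.normal(0.2, 0.1, 1000)
machine = rng.normal(0.7, 0.2, 1000)
thr, tpr = tpr_at_fpr(human, machine)
print(f"threshold={thr:.3f}  TPR@FPR=5%={tpr:.2%}")
```

With well-separated score distributions like these, the TPR lands near the high end; detectors whose human and machine scores overlap heavily (like the ~26% entries above) pay for the fixed 5% false-positive budget with a much lower TPR.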
This pull request adds support for multiple target FPRs to `run_evaluation`.

**Non-breaking interface changes**

`evaluate_cli.py` now takes multiple arguments for `target_fpr`. The default for `evaluate_cli.py` is both 0.05 FPR and 0.01 FPR.

**Potentially breaking changes**
The output format of `results.json` now has a slightly altered structure: instead of the `accuracy` field pointing directly to the TPR@FPR=5% value, it now points to a dictionary indexed by the FPR, containing the true positives, false negatives, and TPR for that particular FPR value.
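For illustration, a hypothetical before/after sketch of the structure (the inner key names and values here are guesses for illustration, not the actual schema):

```python
# Hypothetical old structure: `accuracy` held the TPR@FPR=5% directly.
old = {"accuracy": 0.8569}

# Hypothetical new structure: `accuracy` is indexed by the target FPR,
# and each entry carries the counts alongside the TPR for that FPR.
new = {
    "accuracy": {
        "0.05": {"true_positives": 857, "false_negatives": 143, "tpr": 0.8569},
        "0.01": {"true_positives": 702, "false_negatives": 298, "tpr": 0.7020},
    }
}

print(old["accuracy"])                 # old access pattern
print(new["accuracy"]["0.05"]["tpr"])  # new access pattern
```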
This is a BREAKING CHANGE for any code built off `evaluate_cli` or `run_evaluation` that directly accesses `results.json`. Please take care to catch these null values and index `accuracy` correctly.
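A defensive sketch of how downstream code might read either shape while handling null entries (key names follow the hypothetical structure above and are assumptions, not the actual schema):

```python
def get_tpr(result, fpr="0.05"):
    """Return the TPR at the given FPR, or None if unavailable."""
    acc = result.get("accuracy")
    if acc is None:                # failed or missing score
        return None
    if isinstance(acc, dict):      # new format: indexed by FPR
        entry = acc.get(fpr)
        return entry.get("tpr") if entry else None
    return acc                     # old format: bare TPR@FPR=5%

print(get_tpr({"accuracy": {"0.05": {"tpr": 0.92}}}))  # 0.92
print(get_tpr({"accuracy": None}))                     # None
print(get_tpr({"accuracy": 0.51}))                     # 0.51 (old format)
```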