This repository will contain the additional material (benchmarks, code, and experimental results) for the paper "What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns", in which we propose the Spotlight method for evaluating LLM outputs.
If you want to use the Premise pattern extraction method that we use in the paper, you can find it here. We will provide an updated version in the future that fits the Spotlight structure. For now, simply substitute "group 1/2" wherever Premise refers to "correct/incorrect classification".
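As a rough illustration of the "group 1/2" substitution, the sketch below labels two sets of LLM outputs with a binary group label in place of a correct/incorrect label. It is not Premise's actual interface: the function name `label_groups`, the label values, the whitespace tokenization, and the choice to put group 1 in the "correct" slot are all assumptions made for this example, so check the Premise code for the input format it really expects.

```python
from typing import List, Tuple

# Hypothetical label values; Premise's real input format may differ.
GROUP_1 = 0  # assumed to take the slot Premise calls "correct"
GROUP_2 = 1  # assumed to take the slot Premise calls "incorrect"


def label_groups(
    group_1_outputs: List[str],
    group_2_outputs: List[str],
) -> List[Tuple[List[str], int]]:
    """Split each output on whitespace and attach its group label."""
    labeled = [(text.split(), GROUP_1) for text in group_1_outputs]
    labeled += [(text.split(), GROUP_2) for text in group_2_outputs]
    return labeled


if __name__ == "__main__":
    # Example: outputs produced by two different prompts (made-up strings).
    outputs_prompt_a = ["the answer is four", "it equals four"]
    outputs_prompt_b = ["four"]
    for tokens, group in label_groups(outputs_prompt_a, outputs_prompt_b):
        print(group, tokens)
```

The two groups could be outputs from two prompts or two models; the labels only record which group each output came from.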
If you run into any issues or if you already need the additional material, do not hesitate to contact us.