SportsMetrics

Benchmark data to evaluate numerical reasoning and information fusion of LLMs.

SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs
Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Dong Yu, Fei Liu
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24), Bangkok, Thailand.
Arxiv Paper

Usage of Benchmark

Select the task from data/
Import GeneralTaskLoader from sportsmetrics.py

from sportsmetrics import GeneralTaskLoader

batch_size = False # by default
if not batch_size:
    # load the task instance one by one
    for i in task.iter_instance():
        yiled i['system_message'], i['user_message']
else:
    # load the task instance by batch
    for i in task.iter_batch(batch_size):
        yiled i['system_message'], i['user_message']

Instance from TaskLoader

{
    "id": str,
    "system_message": str,
    "user_message": str,
    "ground_truth": dict()
}

Benchmark Tasks

The LLM is mandatorily required to generate responses in JSON format.

Reasoning Task

reasoning-team_points_tracking: Tracking team points in one match.
reasoning-key_stats_tracking: Tracking the key statistics for sports analytics.

Conflicts Task

conflict-one_point_rule: All scoring actions in the competition are set to be worth only one point.
conflict-swap_{num}_players: Swap {num} of spalyer between two teams.

Robustness Task

robustness-duplicate_{prob}: Replicate the non-scoring move with a probability of {prob}.
robustness-remove_{prob}: Remove the non-scoring move with a probability of {prob}.
robustness-shuffled_pbp: Shuffle the order of all moves in play-by-play descriptions while maintain the original order of timestamps.
robustness-{num}_fiction_names: Randomly select {num} of players from both teams and replace them with names from fiction movies.

Run Benchmark On OpenAI models

Set <API-Key> in ./openai.yaml

api-key: <Your API>
parameters:
  temperature: 0
  max_tokens: 4096
  top_p: 1
  frequency_penalty: 0
  presence_penalty: 0

Customize the script evaluation_sample.py accordingly to generate responses.

Bibtex

@misc{hu2024sportsmetricsblendingtextnumerical,
      title={SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs}, 
      author={Yebowen Hu and Kaiqiang Song and Sangwoo Cho and Xiaoyang Wang and Hassan Foroosh and Dong Yu and Fei Liu},
      year={2024},
      eprint={2402.10979},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2402.10979}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data		data
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
evaluation_sample.py		evaluation_sample.py
openai.yaml		openai.yaml
sportsmetrics.py		sportsmetrics.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SportsMetrics

Usage of Benchmark

Benchmark Tasks

Reasoning Task

Conflicts Task

Robustness Task

Run Benchmark On OpenAI models

About

Releases

Packages

Languages

License

YebowenHu/SportsMetrics

Folders and files

Latest commit

History

Repository files navigation

SportsMetrics

Usage of Benchmark

Benchmark Tasks

Reasoning Task

Conflicts Task

Robustness Task

Run Benchmark On OpenAI models

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages