Commit

added laion-debate note post
robvanvolt committed Jun 28, 2024
1 parent 9c41471 commit 7afd2cb
Showing 2 changed files with 66 additions and 0 deletions.
66 changes: 66 additions & 0 deletions notes/laion-debate.md
@@ -0,0 +1,66 @@
---
title: "Call to Build Open Multi-Modal Models for Personal Assistants"
author: "LAION"
date: "June 28, 2024"
previewImg: "/images/blog/laion-debate.png"
---

We’re pleased to announce LAION-Debate, the world’s first large competitive debate dataset. LAION-Debate provides links to competitive debate championships, discussions, and interviews and conversations with prominent speakers, posted on YouTube by the University of Cambridge and the University of Oxford through their Cambridge Union and Oxford Union debate clubs on their affiliated channels.

Competitive debate datasets are scarce and hard to find in the public domain, because they are either gated by the individuals and institutions that generate them or not archived well enough to be assembled into a dataset. This hinders their use in artificial intelligence research.

In an era where datasets are becoming scarce and large AI models are exhausting the body of human knowledge and depleting known data sources, Debate 2B encourages the use of alternative, credible sources and other bodies of knowledge that offer a different outlook and understanding from the mainstream.

Today, a LAION community member (tawsif) released this novel competitive debate dataset for the field of natural language processing.

## What’s Competitive Debate?

Competitive debate is a sport in which speakers from widely different backgrounds discuss relevant motions (subject matter). Subjects include, but are not limited to, philosophy, politics, history, logical fallacies, morality and ethics, and science and technology.

Speakers take one of two sides, one in support of the motion and one against it. They use speculative language, tone, logical traps, well-constructed sentences that reflect their intent, and other strategies to win the judges and audience over to their school of thought.

Both sides include knowledgeable speakers who are well-versed in the subject matter and eloquent in their delivery, and who aim to convince the judges and audience that their school of thought is justified. In this sport, the most knowledgeable and convincing speakers tend to win, rather than those who merely state facts.

It’s a sport where logic and the art of speech meet in perfect harmony.

## Characteristics of Debate 2B

Debate 2B is largely a collection of YouTube links pointing to the championship and discussion videos posted by the University of Oxford and the University of Cambridge on their official affiliated channels. Most of these recordings are either British Parliamentary speeches or interviews of prominent figures conducted by students of those universities.

The interviews conducted at the Oxford Union and Cambridge Union differ widely from those the public sees on Sky News and CNN. They are conducted by individuals who are well-versed in the art of speech and who remain neutral throughout, making sure that relevant questions are addressed and that the interviewee’s truest opinions are drawn out without any intent to sensationalise them.

## Intended fields and research routes

Debate 2B is intended primarily for natural language processing, although we understand it can also be used in the context of computer vision and reinforcement learning.

Debate 2B provides two datasets in one: an audio form and a textual form. The audio dataset can be used to fine-tune large pretrained audio generation models to generate audio that sounds logical and emotional, because the speakers use emotion and a logical tone to convey their message and convince their audience of their school of thought.

Similarly, the textual dataset offers an in-depth look at a new kind of text generation data: text that is backed by facts and that shows how facts and sentences should be structured to provide logical reasoning. We believe Debate 2B is the first dataset with logical reasoning built into the dataset itself.

**Note**: We don’t provide the textual form of this dataset yet.

## Metadata and info of Debate 2B

We provide links to 2,700 hours of audio recordings, which amount to 130GB at the highest bitrate and 40GB at the lowest possible bitrate.
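
As a rough sanity check on these figures, the sketch below derives the average bitrate implied by each total size. It assumes 1GB means 10^9 bytes and that the 2,700 hours apply to both versions; the function name `average_bitrate_kbps` is purely illustrative and not part of the dataset tooling.

```python
def average_bitrate_kbps(total_gb: float, total_hours: float) -> float:
    """Average bitrate in kilobits per second for a given size and duration."""
    total_bits = total_gb * 1e9 * 8      # assumption: 1 GB = 10^9 bytes
    total_seconds = total_hours * 3600
    return total_bits / total_seconds / 1000


HOURS = 2700  # total duration stated in the post

high = average_bitrate_kbps(130, HOURS)  # highest-bitrate version
low = average_bitrate_kbps(40, HOURS)    # lowest-bitrate version

print(f"highest: {high:.1f} kbps")  # highest: 107.0 kbps
print(f"lowest:  {low:.1f} kbps")   # lowest:  32.9 kbps
```

Under these assumptions the two versions average roughly 107 kbps and 33 kbps respectively.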

Cambridge Union links date from 19th May, 2011 to 2nd of June, 2024.
Oxford Union links date from 6th of September to 12th of July, 2024.

## Licence

The dataset is released under the Apache 2.0 licence.

## Downloading the dataset

The Debate 2B links can be found on Hugging Face. Access is gated, and only academic and work email addresses are accepted at the moment to ensure safety. The audio recordings of Debate 2B can be found on Kaggle.

<https://huggingface.co/datasets/sleeping-ai/LAION-Debate>
<https://www.kaggle.com/datasets/sleepingcat4/cambridge-2b>
<https://www.kaggle.com/datasets/sleepingcat4/oxford-2b>

## Acknowledgement

We acknowledge our LAION community member tawsif, who created the dataset and made both its audio recordings and the links to them public.

<https://github.com/sleepingcat4>
Email: <[email protected]>
Binary file added public/images/blog/laion-debate.png
