VANTA Research

license

apache-2.0

VANTA Research

Independent AI safety research lab specializing in cognitive fit, alignment, and human-AI collaboration

Human-AI Collaboration-1

A curated, high-quality synthetically generated dataset of human-AI conversational exchanges designed for training and evaluating AI models on collaborative reasoning, problem-solving, and knowledge synthesis.

Dataset Size: 3,050 entries
License: Apache 2.0
Author: VANTA Research

Overview

This dataset captures authentic human-AI collaboration patterns across diverse domains including:

Conversational Problem-Solving: Multi-turn dialogues tackling complex questions and creative challenges
Knowledge Synthesis: Combining human expertise with AI capabilities to explore topics in depth
Reasoning Transparency: Clear explanation of thought processes, uncertainty acknowledgment, and collaborative exploration
Domain Diversity: Technical topics, creative writing, scientific reasoning, strategic thinking, and more

Each entry is a carefully curated example of effective human-AI collaboration, optimized for training models that can:

Engage in substantive dialogue
Acknowledge and explore uncertainty
Synthesize information from multiple perspectives
Maintain context and coherence across exchanges
Collaborate productively toward shared understanding

Dataset Structure

Each entry in the dataset follows a standard conversational format with thinking tags that reveal the AI's reasoning process:

{
  "messages": [
    {
      "role": "user",
      "content": "Can you help me optimize this Python function for processing large datasets?"
    },
    {
      "role": "assistant",
      "content": "<think>\nThe user needs performance optimization help. I should:\n1. Analyze their current approach for bottlenecks\n2. Suggest concrete improvements with explanations\n3. Ask clarifying questions about their use case\n4. Provide multiple solutions at different complexity levels\n</think>\n\nI'd be happy to help optimize that! To give you the best recommendations, could you share:\n- The current function code\n- Typical dataset size you're working with\n- Any specific performance bottlenecks you've noticed?\n\nIn the meantime, here are common optimization strategies..."
    }
  ]
}

Think Tags

A distinctive feature of this dataset is the inclusion of <think> tags that contain the AI's internal reasoning process before generating the visible response. These tags:

Show strategic planning and approach selection
Reveal uncertainty acknowledgment and decision-making
Demonstrate consideration of user context and needs
Provide insight into collaborative reasoning patterns

Training Tip: You can choose to keep or remove think tags depending on your use case:

Keep them to train models on transparent reasoning and planning
Remove them if you only want the final conversational outputs

Data Format

Format: JSONL (JSON Lines - one JSON object per line)
Character Encoding: UTF-8
Line Endings: Standard newlines (\n)

Field Descriptions

Field	Type	Description
`messages`	Array	Conversation turn containing user and assistant exchanges
`role`	String	Either "user" or "assistant"
`content`	String	The actual message content (may include `<think>` tags for assistant responses)

Key Characteristics

Quality Attributes

✓ Transparent Reasoning: Internal <think> tags reveal planning, strategy selection, and uncertainty
✓ Coherent: Responses maintain logical consistency and relevance
✓ Substantive: Meaningful depth in exploration and explanation
✓ Collaborative: Genuine dialogue advancing shared understanding
✓ Diverse: Wide range of topics, styles, and interaction patterns

Representativity

The dataset spans:

Multiple domains (STEM, humanities, creative, strategic)
Various conversation lengths (2-turn exchanges to extended dialogues)
Different reasoning styles (analytical, creative, exploratory, systematic)
Mixed levels of topic complexity (accessible to specialized)

Usage

Loading the Dataset

Python with Hugging Face Datasets:

from datasets import load_dataset

dataset = load_dataset("huggingface", "vanta-research/human-ai-1962")

Raw JSONL Loading:

import json

entries = []
with open("human-ai-1962.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        entries.append(json.loads(line))

Processing Think Tags:

import re

def extract_thinking(content):
    """Extract think tags and visible response separately."""
    think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else None
    visible = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    return thinking, visible

def remove_think_tags(content):
    """Remove think tags entirely for training without reasoning."""
    return re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()

# Example usage
with open("human-ai-1962.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        for msg in entry["messages"]:
            if msg["role"] == "assistant":
                thinking, response = extract_thinking(msg["content"])
                print(f"Internal reasoning: {thinking}")
                print(f"Visible response: {response}")

With PyTorch:

from torch.utils.data import Dataset

class HumanAIDataset(Dataset):
    def __init__(self, jsonl_path):
        self.data = []
        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                self.data.append(json.loads(line))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

Recommended Applications

Reasoning Transparency Training: Leverage <think> tags to train models that show their work
Language Model Fine-Tuning: Train models on substantive dialogue patterns (with or without reasoning tags)
Chain-of-Thought Learning: Use internal reasoning to improve model planning capabilities
Dialogue Evaluation: Benchmark conversational AI systems on collaborative interactions
RLHF Datasets: Use as reference examples for preference learning
Instruction Following: Learn nuanced response generation patterns

Statistics

Total Entries: conversations
Average Entry Length: ~200-400 tokens per exchange
Language: English
Minimum Quality Threshold: All entries reviewed for coherence and relevance

License

This dataset is released under the Apache License 2.0. See LICENSE file for full details.

Apache 2.0 Summary

The Apache License 2.0 permits:

✓ Commercial use
✓ Modification
✓ Distribution
✓ Private use

With the conditions:

License and copyright notice must be included
State significant changes made to the code/data

Ethical Considerations

Dataset Limitations

Reflects conversation patterns as of training data cutoff
May contain perspectives that are not universally shared
Represents primarily English-language interactions
Domain coverage is non-uniform (some topics more represented than others)

Responsible Use

We encourage users to:

Consider downstream application impacts
Test for biases relevant to your use case
Maintain transparency about dataset usage
Report concerns or improvements to maintainers

Citation

If you use this dataset in research or production, please cite:

@dataset{vanta-research-human-ai-collaboration-1,
  title={Human-AI Collaboration-1},
  author={VANTA Research},
  year={2025},
  url={https://huggingface.co/datasets/vanta-research/human-ai-collaboration-1}
}

Data Quality & Curation

Each entry in this dataset has been:

✓ Validated for JSON structural integrity
✓ Verified for conversation coherence
✓ Checked for substantive content
✓ Reviewed for diversity of interaction patterns

Entries are designed to exemplify high-quality human-AI collaboration, making them suitable for training models to engage in similarly productive exchanges.

Getting Started

Download: Access the dataset from Hugging Face Hub
Explore: Start with a small subset to understand structure
Integrate: Use provided code examples to load into your pipeline
Fine-tune: Apply to your specific task with domain-appropriate training procedures
Evaluate: Benchmark your results against baseline models

Contributing & Feedback

Reporting Issues

If you find data quality issues, format problems, or licensing concerns, please open an issue on the project repository.

Improvements & Suggestions

We welcome feedback on:

Data completeness
Format usability
Documentation clarity
Suggested use cases

Changelog

Version 1.0 (2025)

Initial release with 3,050 curated entries
JSONL format standardization
Comprehensive documentation

Support

For questions, suggestions, or support:

Open an issue on GitHub
Check existing documentation
Review example code snippets

Disclaimer: This dataset is provided as-is for research and development purposes. Users are responsible for evaluating the dataset's suitability for their specific applications.

License: Apache License 2.0 | Repository Owner: VANTA Research | Initial Commit: November 2025

VANTA Research: AI for humans.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md
vanta-research-human-ai-collaboration-1.jsonl		vanta-research-human-ai-collaboration-1.jsonl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VANTA Research

Human-AI Collaboration-1

Overview

Dataset Structure

Think Tags

Data Format

Field Descriptions

Key Characteristics

Quality Attributes

Representativity

Usage

Loading the Dataset

Recommended Applications

Statistics

License

Apache 2.0 Summary

Ethical Considerations

Dataset Limitations

Responsible Use

Citation

Data Quality & Curation

Getting Started

Contributing & Feedback

Reporting Issues

Improvements & Suggestions

Changelog

Version 1.0 (2025)

Support

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

VANTA Research

Human-AI Collaboration-1

Overview

Dataset Structure

Think Tags

Data Format

Field Descriptions

Key Characteristics

Quality Attributes

Representativity

Usage

Loading the Dataset

Recommended Applications

Statistics

License

Apache 2.0 Summary

Ethical Considerations

Dataset Limitations

Responsible Use

Citation

Data Quality & Curation

Getting Started

Contributing & Feedback

Reporting Issues

Improvements & Suggestions

Changelog

Version 1.0 (2025)

Support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages