Independent AI safety research lab specializing in cognitive fit, alignment, and human-AI collaboration
A curated, high-quality synthetically generated dataset of human-AI conversational exchanges designed for training and evaluating AI models on collaborative reasoning, problem-solving, and knowledge synthesis.
Dataset Size: 3,050 entries
License: Apache 2.0
Author: VANTA Research
This dataset captures authentic human-AI collaboration patterns across diverse domains including:
- Conversational Problem-Solving: Multi-turn dialogues tackling complex questions and creative challenges
- Knowledge Synthesis: Combining human expertise with AI capabilities to explore topics in depth
- Reasoning Transparency: Clear explanation of thought processes, uncertainty acknowledgment, and collaborative exploration
- Domain Diversity: Technical topics, creative writing, scientific reasoning, strategic thinking, and more
Each entry is a carefully curated example of effective human-AI collaboration, optimized for training models that can:
- Engage in substantive dialogue
- Acknowledge and explore uncertainty
- Synthesize information from multiple perspectives
- Maintain context and coherence across exchanges
- Collaborate productively toward shared understanding
Each entry in the dataset follows a standard conversational format with thinking tags that reveal the AI's reasoning process:
{
  "messages": [
    {
      "role": "user",
      "content": "Can you help me optimize this Python function for processing large datasets?"
    },
    {
      "role": "assistant",
      "content": "<think>\nThe user needs performance optimization help. I should:\n1. Analyze their current approach for bottlenecks\n2. Suggest concrete improvements with explanations\n3. Ask clarifying questions about their use case\n4. Provide multiple solutions at different complexity levels\n</think>\n\nI'd be happy to help optimize that! To give you the best recommendations, could you share:\n- The current function code\n- Typical dataset size you're working with\n- Any specific performance bottlenecks you've noticed?\n\nIn the meantime, here are common optimization strategies..."
    }
  ]
}

A distinctive feature of this dataset is the inclusion of <think> tags that contain the AI's internal reasoning process before generating the visible response. These tags:
- Show strategic planning and approach selection
- Reveal uncertainty acknowledgment and decision-making
- Demonstrate consideration of user context and needs
- Provide insight into collaborative reasoning patterns
Training Tip: You can choose to keep or remove think tags depending on your use case:
- Keep them to train models on transparent reasoning and planning
- Remove them if you only want the final conversational outputs
- Format: JSONL (JSON Lines - one JSON object per line)
- Character Encoding: UTF-8
- Line Endings: Standard newlines (\n)
| Field | Type | Description |
|---|---|---|
| messages | Array | Conversation turn containing user and assistant exchanges |
| role | String | Either "user" or "assistant" |
| content | String | The actual message content (may include <think> tags for assistant responses) |
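The field table above can be turned into a simple structural check. The following is a minimal sketch, not the maintainers' validation procedure: the function names `validate_entry` and `validate_jsonl_lines` are hypothetical, and the requirement that `messages` be non-empty is an assumption rather than something the card states.

```python
import json

# Roles listed in the field table above
ALLOWED_ROLES = {"user", "assistant"}

def validate_entry(entry):
    """Return a list of problems found in one parsed entry (empty if none)."""
    messages = entry.get("messages") if isinstance(entry, dict) else None
    # Assumption: an entry with no messages is treated as invalid
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty array"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    return problems

def validate_jsonl_lines(lines):
    """Yield (line_number, problems) for every line that fails a check."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            yield lineno, [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_entry(entry)
        if problems:
            yield lineno, problems
```

Because `validate_jsonl_lines` accepts any iterable of strings, it can be given an open file handle for the JSONL file directly, and it reports problems without stopping at the first bad line.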
- ✓ Transparent Reasoning: Internal <think> tags reveal planning, strategy selection, and uncertainty
- ✓ Coherent: Responses maintain logical consistency and relevance
- ✓ Substantive: Meaningful depth in exploration and explanation
- ✓ Collaborative: Genuine dialogue advancing shared understanding
- ✓ Diverse: Wide range of topics, styles, and interaction patterns
The dataset spans:
- Multiple domains (STEM, humanities, creative, strategic)
- Various conversation lengths (2-turn exchanges to extended dialogues)
- Different reasoning styles (analytical, creative, exploratory, systematic)
- Mixed levels of topic complexity (accessible to specialized)
Python with Hugging Face Datasets:
from datasets import load_dataset

# Pass the dataset's Hub repository ID as the first argument
dataset = load_dataset("vanta-research/human-ai-1962")

Raw JSONL Loading:
import json

# Read one JSON object per line
entries = []
with open("human-ai-1962.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        entries.append(json.loads(line))

Processing Think Tags:
import json
import re

def extract_thinking(content):
    """Extract think tags and visible response separately."""
    think_match = re.search(r'<think>(.*?)</think>', content, re.DOTALL)
    thinking = think_match.group(1).strip() if think_match else None
    visible = re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()
    return thinking, visible

def remove_think_tags(content):
    """Remove think tags entirely for training without reasoning."""
    return re.sub(r'<think>.*?</think>', '', content, flags=re.DOTALL).strip()

# Example usage
with open("human-ai-1962.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        for msg in entry["messages"]:
            if msg["role"] == "assistant":
                thinking, response = extract_thinking(msg["content"])
                print(f"Internal reasoning: {thinking}")
                print(f"Visible response: {response}")

With PyTorch:
import json
from torch.utils.data import Dataset

class HumanAIDataset(Dataset):
    """Map-style dataset that holds every JSONL entry in memory."""

    def __init__(self, jsonl_path):
        self.data = []
        with open(jsonl_path, "r", encoding="utf-8") as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Returns the raw entry dict; tokenization is left to the caller
        return self.data[idx]

- Reasoning Transparency Training: Leverage <think> tags to train models that show their work
- Language Model Fine-Tuning: Train models on substantive dialogue patterns (with or without reasoning tags)
- Chain-of-Thought Learning: Use internal reasoning to improve model planning capabilities
- Dialogue Evaluation: Benchmark conversational AI systems on collaborative interactions
- RLHF Datasets: Use as reference examples for preference learning
- Instruction Following: Learn nuanced response generation patterns
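For fine-tuning without reasoning tags, the tag removal has to be applied to every assistant turn of an entry, not just to a single string. The sketch below shows one way to do that; `strip_reasoning` is a hypothetical helper, and it assumes that only assistant messages should be modified.

```python
import copy
import re

# Non-greedy match so each <think>...</think> block is removed separately
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(entry):
    """Return a copy of an entry with <think> blocks removed from assistant turns."""
    cleaned = copy.deepcopy(entry)  # leave the caller's entry untouched
    for msg in cleaned["messages"]:
        if msg["role"] == "assistant":
            msg["content"] = THINK_BLOCK.sub("", msg["content"]).strip()
    return cleaned
```

Returning a copy keeps the original entries available, so the same loaded data can feed both a with-reasoning and a without-reasoning training run.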
- Total Entries: 3,050 conversations
- Average Entry Length: ~200-400 tokens per exchange
- Language: English
- Minimum Quality Threshold: All entries reviewed for coherence and relevance
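Figures like these can be spot-checked on a local copy of the data. The sketch below is a rough approximation only: the hypothetical `summarize` function counts whitespace-separated words, which is not the same as model tokens, so its output will not match the token figures above exactly.

```python
from statistics import mean

def summarize(entries):
    """Summarize a list of parsed entries using whitespace word counts."""
    # Number of messages in each conversation
    turns = [len(e["messages"]) for e in entries]
    # Total words per conversation; a crude stand-in for token counts
    words = [
        sum(len(m["content"].split()) for m in e["messages"])
        for e in entries
    ]
    return {
        "entries": len(entries),
        "mean_turns": mean(turns) if turns else 0,
        "mean_words": mean(words) if words else 0,
    }
```

For token-accurate numbers, the word count would need to be replaced with the tokenizer of the model being trained.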
This dataset is released under the Apache License 2.0. See LICENSE file for full details.
The Apache License 2.0 permits:
- ✓ Commercial use
- ✓ Modification
- ✓ Distribution
- ✓ Private use
With the conditions:
- License and copyright notice must be included
- Significant changes made to the code/data must be stated
Known limitations:
- Reflects conversation patterns as of training data cutoff
- May contain perspectives that are not universally shared
- Represents primarily English-language interactions
- Domain coverage is non-uniform (some topics more represented than others)
We encourage users to:
- Consider downstream application impacts
- Test for biases relevant to your use case
- Maintain transparency about dataset usage
- Report concerns or improvements to maintainers
If you use this dataset in research or production, please cite:
@dataset{vanta-research-human-ai-collaboration-1,
  title={Human-AI Collaboration-1},
  author={VANTA Research},
  year={2025},
  url={https://huggingface.co/datasets/vanta-research/human-ai-collaboration-1}
}

Each entry in this dataset has been:
- ✓ Validated for JSON structural integrity
- ✓ Verified for conversation coherence
- ✓ Checked for substantive content
- ✓ Reviewed for diversity of interaction patterns
Entries are designed to exemplify high-quality human-AI collaboration, making them suitable for training models to engage in similarly productive exchanges.
- Download: Access the dataset from Hugging Face Hub
- Explore: Start with a small subset to understand structure
- Integrate: Use provided code examples to load into your pipeline
- Fine-tune: Apply to your specific task with domain-appropriate training procedures
- Evaluate: Benchmark your results against baseline models
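For the "Explore" step, a small subset can be read without loading the whole file. This is a minimal sketch; `head` is a hypothetical helper and the default of five entries is arbitrary.

```python
import json
from itertools import islice

def head(lines, n=5):
    """Parse and return the first n non-empty JSONL lines."""
    non_empty = (line for line in lines if line.strip())
    # islice stops reading as soon as n lines have been consumed
    return [json.loads(line) for line in islice(non_empty, n)]
```

Passing an open file handle as `lines` reads only as much of the file as is needed for the sample.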
If you find data quality issues, format problems, or licensing concerns, please open an issue on the project repository.
We welcome feedback on:
- Data completeness
- Format usability
- Documentation clarity
- Suggested use cases
- Initial release with 3,050 curated entries
- JSONL format standardization
- Comprehensive documentation
For questions, suggestions, or support:
- Open an issue on GitHub
- Check existing documentation
- Review example code snippets
Disclaimer: This dataset is provided as-is for research and development purposes. Users are responsible for evaluating the dataset's suitability for their specific applications.
License: Apache License 2.0 | Repository Owner: VANTA Research | Initial Commit: November 2025
VANTA Research: AI for humans.
