Dive into building deep learning pipelines using Lance datasets! This repository contains examples to help you use Lance datasets in your deep learning projects.
- These are built using Lance, a free, open-source, columnar data format that requires no setup.
- High-performance random access: More than 1000x faster than Parquet.
- Zero-copy, automatic versioning: manage versions of your data automatically, and reduce redundancy with zero-copy logic built-in.
Join our community for support - Discord • Twitter
Convenience
The Lance columnar file format is designed for large-scale deep learning workloads. Its columnar layout lets you manage complex, unstructured multi-modal datasets easily and efficiently. Updates, filtering, and zero-copy versioning let you iterate faster on large datasets. It is designed to be used with images, videos, 3D point clouds, audio, and of course tabular data. It supports any POSIX file system, as well as cloud storage such as AWS S3 and Google Cloud Storage.
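To make the update, filter, and versioning workflow concrete, here is a minimal sketch. It assumes the `pylance` and `pyarrow` packages are installed; the function name `demo_versioning`, the column names, and the sample rows are hypothetical and not taken from the recipes in this repository. A delete followed by an append stands in for a typical update.

```python
def demo_versioning(uri):
    """Write a tiny dataset, modify it, and read back an older version."""
    # Imported inside the function so the module loads even if the
    # optional dependencies (pylance, pyarrow) are not installed.
    import lance
    import pyarrow as pa

    # Hypothetical toy data: four labelled rows.
    table = pa.table({"id": [0, 1, 2, 3],
                      "label": ["cat", "dog", "cat", "bird"]})
    lance.write_dataset(table, uri)  # first commit

    # Each modification is committed as a new dataset version.
    ds = lance.dataset(uri)
    ds.delete("label = 'bird'")
    extra = pa.table({"id": [4, 5], "label": ["dog", "dog"]})
    lance.write_dataset(extra, uri, mode="append")

    latest = lance.dataset(uri)             # newest version
    first = lance.dataset(uri, version=1)   # earlier version, still readable

    # Filter with an SQL-style predicate and read only one column.
    cats = latest.to_table(filter="label = 'cat'", columns=["id"])

    return {
        "rows_v1": first.count_rows(),
        "rows_latest": latest.count_rows(),
        "num_cats": cats.num_rows,
        "latest_version": latest.version,
    }
```

Calling `demo_versioning("/tmp/demo.lance")` on a fresh path returns the row counts of the first and latest versions, which is a quick way to check that earlier versions stay readable after later changes.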
Performance
The Lance format supports fast reads and writes, which makes data loading during training significantly faster.
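Fast random access matters most when a training loop pulls shuffled batches by row index. The sketch below is one way this could look, assuming `pylance` is installed and a dataset already exists at `uri`; the helper names `sample_batch_indices` and `read_random_batch` are hypothetical.

```python
import random


def sample_batch_indices(num_rows, batch_size, seed=0):
    """Pick `batch_size` distinct row indices in [0, num_rows), sorted."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), batch_size))


def read_random_batch(uri, batch_size, columns=None, seed=0):
    """Fetch a random batch of rows from a Lance dataset by index."""
    # Imported lazily so the sampling helper works without pylance.
    import lance

    ds = lance.dataset(uri)
    indices = sample_batch_indices(ds.count_rows(), batch_size, seed)
    # `take` reads only the requested rows (and columns, if given).
    return ds.take(indices, columns=columns)
```

`read_random_batch` returns a PyArrow table, which can then be converted to whatever tensor type your training framework expects.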
Examples on how to convert existing datasets to Lance format.
Example | Scripts | Read The Blog! |
---|---|---|
Creating text dataset for LLM pre-training | ||
Creating Instruction dataset for LLM fine-tuning | ||
Creating Image Captioning Dataset for Multi-Modal Model Training | ||
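
As a rough illustration of what such a conversion can involve, the sketch below streams already-tokenized text into a Lance dataset in fixed-size chunks. It is not the script from the examples above: the helper names, the `input_ids` column, and the default chunk size are assumptions, and it expects `pylance` and `pyarrow` to be installed.

```python
def chunk_tokens(token_ids, chunk_size):
    """Yield consecutive slices of `token_ids`, each at most `chunk_size` long."""
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


def write_tokens_to_lance(token_ids, uri, chunk_size=1024):
    """Write token ids to a Lance dataset, one record batch per chunk."""
    # Imported lazily so `chunk_tokens` is usable without these packages.
    import lance
    import pyarrow as pa

    # Hypothetical single-column schema: one token id per row.
    schema = pa.schema([pa.field("input_ids", pa.int64())])

    def batches():
        for chunk in chunk_tokens(token_ids, chunk_size):
            yield pa.RecordBatch.from_arrays(
                [pa.array(chunk, type=pa.int64())], names=["input_ids"]
            )

    # Streaming through a reader avoids building one large in-memory table.
    reader = pa.RecordBatchReader.from_batches(schema, batches())
    return lance.write_dataset(reader, uri)
```

For example, `write_tokens_to_lance(list(range(10_000)), "/tmp/tokens.lance")` would write ten thousand rows in batches of up to 1024 tokens each.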
Practical examples showcasing how to adapt your Lance dataset to popular deep learning projects.
If you're working on some cool deep learning examples using Lance that you'd like to add to this repo, please open a PR! More detailed instructions on contributing can be found on the CONTRIBUTING.md page.