- Change in perspective is necessary because some abilities only emerge at a certain scale. Some conclusions from the past are invalidated and we need to constantly unlearn intuitions built on top of such ideas.
- From first principles, scaling up the Transformer amounts to efficiently doing matrix multiplications with many, many machines.
- Further scaling (think 10000x GPT-4 scale) entails finding the inductive bias that is the bottleneck in further scaling.
- Twitter / Video / Slides [6 Oct 2023]
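The claim that scaling the Transformer comes down to spreading matrix multiplications over many machines can be shown with a toy example. This is a minimal sketch in pure Python, assuming each "machine" is just a column block of the weight matrix; the function names (`matmul`, `shard_columns`, `sharded_matmul`) are hypothetical and do not come from the talk.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def shard_columns(b, n_shards):
    """Split matrix b column-wise into at most n_shards contiguous blocks."""
    n_cols = len(b[0])
    step = -(-n_cols // n_shards)  # ceiling division
    return [[row[i:i + step] for row in b] for i in range(0, n_cols, step)]


def sharded_matmul(a, b, n_shards):
    """Multiply a by each column block independently, then concatenate the results.

    Each partial product could run on a separate device because the
    blocks do not depend on one another.
    """
    partials = [matmul(a, shard) for shard in shard_columns(b, n_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[5, 6, 7], [8, 9, 10]]
result = sharded_matmul(a, b, n_shards=2)
```

Real systems combine several such partitioning schemes and must also pay for communication between devices, which this sketch leaves out.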
- LLMprices.dev: Compare prices for models like GPT-4, Claude Sonnet 3.5, Llama 3.1 405b and many more.
- AI Model Review: Compare 75 AI Models on 200+ Prompts Side By Side.
- Artificial Analysis:💡Independent analysis of AI models and API providers.
- Inside language models (from GPT to Olympus)
- LLM Pre-training and Post-training Paradigms [17 Aug 2024]
- Evolutionary Graph of LLaMA Family
- LLM evolutionary tree
- Timeline of SLMs
- A Survey of Large Language Models: [cnt] / git [31 Mar 2023] contd.
- LLM evolutionary tree: [cnt]: A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers) git [26 Apr 2023]
- A Comprehensive Survey of Small Language Models in the Era of Large Language Models / git [4 Nov 2024]
- An overview of different fields of study and recent developments in NLP. doc / ref [24 Sep 2023]
  - Exploring the Landscape of Natural Language Processing Research ref [20 Jul 2023]
  - NLP taxonomy
  - Distribution of the number of papers by most popular fields of study from 2002 to 2022
- The Open Source AI Definition [28 Oct 2024]
- The LLM Index: A list of large language models (LLMs)
- Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings
- LLM Collection: promptingguide.ai
- Huggingface Open LLM Leaderboard
- ollama: ollama-supported models
- The mother of all spreadsheets for anyone into LLMs [17 Dec 2024]
- KoAlpaca: Alpaca for korean [Mar 2023]
- Pythia: How do large language models (LLMs) develop and evolve over the course of training and change as models scale? A suite of decoder-only autoregressive language models ranging from 70M to 12B parameters git [Apr 2023]
- OLMo:💡Truly open language model and framework to build, study, and advance LMs, along with the training data, training and evaluation code, intermediate model checkpoints, and training logs. git [Feb 2024]
- OLMoE: fully-open LLM leverages sparse Mixture-of-Experts [Sep 2024]
- OLMo 2 [26 Nov 2024]
- Open-Sora: Democratizing Efficient Video Production for All [Mar 2024]
- Jamba: AI21's SSM-Transformer Model. Mamba + Transformer + MoE [28 Mar 2024]
- TÜLU 3:💡Pushing Frontiers in Open Language Model Post-Training git / demo:ref [22 Nov 2024]
- Meta (aka. Facebook)
- Most OSS LLMs have been built on Llama / ref / git
- Llama 2: 1) 40% more data than Llama. 2) 7B, 13B, and 70B. 3) Trained on over 1 million human annotations. 4) Double the context length of Llama 1: 4K. 5) Uses Grouped Query Attention (new in Llama 2, in the larger models), KV Cache, and Rotary Positional Embedding [18 Jul 2023] demo
- Llama 3: 1) 7X more data than Llama 2. 2) 8B, 70B, and 400B. 3) 8K context length [18 Apr 2024]
- MEGALODON: Long Sequence Model. Unlimited context length. Outperforms Llama 2 model. [Apr 2024]
- Llama 3.1: 405B, context length extended to 128K, support across eight languages. First OSS model to outperform GPT-4o. [23 Jul 2024]
- Llama 3.2: Multimodal. Includes text-only models (1B, 3B) and text-image models (11B, 90B), with quantized versions of 1B and 3B [Sep 2024]
- NotebookLlama: An Open Source version of NotebookLM [28 Oct 2024]
- Llama 3.3: a text-only 70B instruction-tuned model. Llama 3.3 70B approaches the performance of Llama 3.1 405B. [6 Dec 2024]
- Google
- Foundation Models: Gemini, Veo, Gemma etc.
- Gemma: Open weights LLM from Google DeepMind. git / Pytorch git [Feb 2024]
- Gemma 2 2B, 9B, 27B ref: releases [Jun 2024]
- PaliGemma: a 3B VLM [10 Jul 2024]
- DataGemma [12 Sep 2024] / NotebookLM: LLM-powered notebook. free to use, not open-source. [12 Jul 2023]
- PaliGemma 2: VLMs at 3 different sizes (3B, 10B, 28B) [4 Dec 2024]
- Gemini: Rebranding: Bard -> Gemini [8 Feb 2024]
- Gemini 1.5: 1 million token context window, 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. [Feb 2024]
- Gemini 2 Flash: Multimodal LLM with multilingual inputs/outputs, real-time capabilities (Project Astra), complex task handling (Project Mariner), and developer tools (Jules) [11 Dec 2024]
- Gemini 2.0 Flash Thinking Experimental [19 Dec 2024]
- gemini/cookbook
- Anthropic
- Claude 3: Opus, the largest version of the new LLM, outperforms rivals GPT-4 and Google’s Gemini 1.0 Ultra. Three variants: Opus, Sonnet, and Haiku. [Mar 2024]
- Microsoft
- phi-series: cost-effective small language models (SLMs) ref
- phi-4: Specializing in Complex Reasoning ref [12 Dec 2024]
- phi-3.5-MoE-instruct: ref [Aug 2024]
- phi-3: Phi-3-mini, with 3.8 billion parameters, supports 4K and 128K context, instruction tuning, and hardware optimization. [22 Apr 2024] ref
- phi-3-vision (multimodal), phi-3-small, phi-3 (7b), Phi Silica (designed for NPUs in Copilot+ PCs)
- phi-2: open source, and 50% better at mathematical reasoning. git [Dec 2023]
- phi-1.5: [cnt]: Textbooks Are All You Need II. Phi 1.5 is trained solely on synthetic data. Despite having only 1.3 billion parameters, far fewer than Llama 7B, Phi 1.5 often performs better in benchmark tests. [11 Sep 2023]
- phi-1: [cnt]: Despite being small in size, phi-1 attained 50.6% on HumanEval and 55.5% on MBPP. Textbooks Are All You Need. ref [20 Jun 2023]
- NVIDIA
- Nemotron-4 340B: Synthetic Data Generation for Training Large Language Models [14 Jun 2024]
- Amazon
- Amazon Nova Foundation Models: Text only - Micro, Multimodal - Lite, Pro [3 Dec 2024]
- Mistral
- Founded in April 2023. French tech.
- Groq
- Founded in 2016. low-latency AI inference H/W. American tech.
- Llama-3-Groq-Tool-Use: a model optimized for function calling [Jul 2024]
- Alibaba
- Qwen series > Qwen2: 29 languages. 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. [Feb 2024]
- Cohere
- Founded in 2019. Canadian multinational tech.
- Command R+: The performant model for RAG capabilities, multilingual support, and tool use. [Aug 2024]
- Deepseek
- Founded in 2023, DeepSeek is a Chinese company dedicated to AGI.
- A list of models: git
- Tencent
- Founded in 1998, Tencent is a Chinese company dedicated to various technology sectors, including social media, gaming, and AI development.
- Hunyuan-Large: An open-source MoE model with open weights. [4 Nov 2024] git
- Qualcomm
- Qualcomm’s on-device AI models: Bring generative AI to mobile devices [Feb 2024]
- xAI
- xAI is an American AI company founded by Elon Musk in March 2023
- Grok: 314B parameter Mixture-of-Experts (MoE) model. Released under the Apache 2.0 license. Training code not included. Developed with JAX git [17 Mar 2024]
- Grok-2 and Grok-2 mini [13 Aug 2024]
- Databricks
- Apple
- OpenELM: Apple released a Transformer-based language model. Four sizes of the model: 270M, 450M, 1.1B, and 3B parameters. [April 2024]
- Apple Intelligence Foundation Language Models: 1. A 3B on-device model used for language tasks like summarization and Writing Tools. 2. A large Server model used for language tasks too complex to do on-device. [10 Jun 2024]
- IBM
- Granite Guardian: a collection of models designed to detect risks in prompts and responses [10 Dec 2024]
- GPT for Domain Specific x-ref
- MLLM (multimodal large language model) x-ref
- Large Language Models (in 2023) x-ref
- Llama variants emerged in 2023
- Falcon LLM: Apache 2.0 license [Mar 2023]
- Alpaca: Fine-tuned from the LLaMA 7B model [Mar 2023]
- vicuna: 90% ChatGPT Quality [Mar 2023]
- dolly: Databricks [Mar 2023]
- Cerebras-GPT: 7 GPT models ranging from 111m to 13b parameters. [Mar 2023]
- Koala: Focus on dialogue data gathered from the web. [Apr 2023]
- StableVicuna: First Open Source RLHF LLM Chatbot [Apr 2023]
- Upstage's 70B Language Model Outperforms GPT-3.5: ref [1 Aug 2023]
- AlphaFold3: Open source implementation of AlphaFold3 [Nov 2023] / OpenFold: PyTorch reproduction of AlphaFold 2 [Sep 2021]
- BioGPT: [cnt]: Generative Pre-trained Transformer for Biomedical Text Generation and Mining git [19 Oct 2022]
- Galactica: A Large Language Model for Science [16 Nov 2022]
- TimeGPT: The First Foundation Model for Time Series Forecasting git [Mar 2023]
- BloombergGPT: A Large Language Model for Finance [30 Mar 2023]
- Huggingface StarCoder: A State-of-the-Art LLM for Code: git [May 2023]
- FrugalGPT: LLM with budget constraints, requests are cascaded from low-cost to high-cost LLMs. git [9 May 2023]
- Code Llama: Built on top of Llama 2, free for research and commercial use. ref / git [24 Aug 2023]
- MechGPT: Language Modeling Strategies for Mechanics and Materials git [16 Oct 2023]
- MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers [27 Nov 2023]
- EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain [30 Jan 2024]
- SaulLM-7B: A pioneering Large Language Model for Law [6 Mar 2024]
- Devin AI: Devin is an AI software engineer developed by Cognition AI [12 Mar 2024]
- DeepSeek-Coder-V2: Open-source Mixture-of-Experts (MoE) code language model [17 Jun 2024]
- Qwen2-Math: math-specific LLM / Qwen2-Audio: large-scale audio-language model [Aug 2024] / Qwen 2.5-Coder [18 Sep 2024]
- Chai-1: a multi-modal foundation model for molecular structure prediction [Sep 2024]
- Prithvi WxC: In collaboration with NASA, IBM is releasing an open-source foundation model for Weather and Climate ref [20 Sep 2024]
- AlphaChip: Reinforcement learning-based model for designing physical chip layouts. [26 Sep 2024]
- OpenCoder: 1.5B and 8B base and open-source Code LLM, supporting both English and Chinese. [Oct 2024]
- Video LLMs for Temporal Reasoning in Long Videos: TemporalVLM, a video LLM excelling in temporal reasoning and fine-grained understanding of long videos, using time-aware features and validated on datasets like TimeIT and IndustryASM for superior performance. [4 Dec 2024]
- Understanding Multimodal LLMs:💡Two main approaches to building multimodal LLMs: 1. Unified Embedding Decoder Architecture approach; 2. Cross-modality Attention Architecture approach. [3 Nov 2024]
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants: [cnt]: A comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities. Specific-Purpose: 1. Visual understanding tasks, 2. Visual generation tasks. General-Purpose: 3. General-purpose interface. [18 Sep 2023]
- Awesome Multimodal Large Language Models: Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation. [Jun 2023]
- CLIP: [cnt]: CLIP (Contrastive Language-Image Pretraining), trained on a large number of internet text-image pairs and can be applied to a wide range of tasks with zero-shot learning. git [26 Feb 2021]
- BLIP-2 [30 Jan 2023]: [cnt]: Salesforce Research, Querying Transformer (Q-Former) / git / ref / 📺 / BLIP: [cnt]: git [28 Jan 2022]
  - Q-Former (Querying Transformer): A transformer model that consists of two submodules that share the same self-attention layers: an image transformer that interacts with a frozen image encoder for visual feature extraction, and a text transformer that can function as both a text encoder and a text decoder.
  - Q-Former is a lightweight transformer which employs a set of learnable query vectors to extract visual features from the frozen image encoder. It acts as an information bottleneck between the frozen image encoder and the frozen LLM.
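The bottleneck role of the learnable queries can be illustrated with a single cross-attention step. This is a deliberately simplified sketch, not the BLIP-2 implementation: it assumes one attention head, omits the key/value/query projections and the shared self-attention layers, and uses random vectors in place of learned queries and real image features. All names (`cross_attend`, `learned_queries`, `image_features`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Each query attends over all features and returns one weighted average."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
        )
    return outputs


rng = random.Random(0)
dim, num_patches, num_queries = 8, 50, 4  # illustrative sizes only
image_features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
learned_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

# However many patch features the frozen encoder emits, the output
# always has exactly num_queries vectors.
visual_tokens = cross_attend(learned_queries, image_features)
```

The number of output vectors depends only on the number of queries, which is what lets a small fixed set of tokens stand in for the full image when passed to the frozen LLM.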
- TaskMatrix, aka VisualChatGPT: [cnt]: Microsoft TaskMatrix git; GroundingDINO + SAM / git [8 Mar 2023]
- GroundingDINO: [cnt]: DINO with Grounded Pre-Training for Open-Set Object Detection git [9 Mar 2023]
- LLaVa: [cnt]: Large Language-and-Vision Assistant git [17 Apr 2023]
- Simple linear layer to connect image features into the word embedding space. A trainable projection matrix W is applied to the visual features Zv, transforming them into visual embedding tokens Hv. These tokens are then concatenated with the language embedding sequence Hq to form a single sequence. Note that Hv and Hq are not multiplied or added but concatenated; both have the same dimensionality.
- LLaVA-1.5: [cnt]: git: Changes the cross-modal connector from a linear projection to an MLP. [5 Oct 2023]
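The projection-then-concatenation step described for LLaVA can be sketched as follows. This is a minimal illustration in pure Python with made-up toy values. Only the symbols Zv, W, Hv and Hq come from the description above; the function names and the choice to place visual tokens before text tokens are assumptions.

```python
def project(visual_features, weight):
    """Compute H_v = Z_v W, mapping each row from d_vision to d_model."""
    weight_cols = list(zip(*weight))
    return [
        [sum(z * w for z, w in zip(row, col)) for col in weight_cols]
        for row in visual_features
    ]


def build_input_sequence(visual_features, weight, text_embeddings):
    """Project visual features, then concatenate along the sequence axis."""
    visual_tokens = project(visual_features, weight)
    d_model = len(text_embeddings[0])
    # Concatenation only makes sense if both token types share d_model.
    assert all(len(tok) == d_model for tok in visual_tokens)
    return visual_tokens + text_embeddings  # list concatenation, not addition


Z_v = [[1.0, 0.0], [0.0, 1.0]]            # 2 image patches, d_vision = 2
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # d_vision x d_model projection
H_q = [[7.0, 8.0, 9.0]]                   # 1 text token, d_model = 3

sequence = build_input_sequence(Z_v, W, H_q)
```

The resulting sequence has one row per visual token plus one row per text token, so its length is the sum of the two lengths while the embedding width stays at d_model.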
- MiniGPT-4 & MiniGPT-v2: [cnt]: Enhancing Vision-language Understanding with Advanced Large Language Models git [20 Apr 2023]
- openai/shap-e: Generate 3D objects conditioned on text or images [3 May 2023] git
- Drag Your GAN: [cnt]: Interactive Point-based Manipulation on the Generative Image Manifold git [18 May 2023]
- Video-ChatGPT: [cnt]: a video conversation model capable of generating meaningful conversation about videos. / git [8 Jun 2023]
- moondream: an OSS tiny vision language model. Built using SigLIP, Phi-1.5, LLaVA dataset. [Dec 2023]
- MiniCPM-V: MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone [Jan 2024]
- mini-omni2: ref: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. [15 Oct 2024]
- LLaVA-CoT: (FKA. LLaVA-o1) Let Vision Language Models Reason Step-by-Step. git [15 Nov 2024]
- Vision capability to an LLM ref [22 Aug 2023]
- Meta (aka. Facebook)
- facebookresearch/ImageBind: [cnt]: ImageBind One Embedding Space to Bind Them All git [9 May 2023]
- facebookresearch/segment-anything(SAM): [cnt]: The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model. git [5 Apr 2023]
- facebookresearch/SeamlessM4T: [cnt]: SeamlessM4T is the first all-in-one multilingual multimodal AI translation and transcription model. This single model can perform speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages depending on the task. ref [22 Aug 2023]
- Chameleon: Early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. The unified approach uses fully token-based representations for both image and textual modalities. [16 May 2024]
- Models and libraries
- Microsoft
- Language Is Not All You Need: Aligning Perception with Language Models Kosmos-1: [cnt] [27 Feb 2023]
- Kosmos-2: [cnt]: Grounding Multimodal Large Language Models to the World [26 Jun 2023]
- Kosmos-2.5: [cnt]: A Multimodal Literate Model [20 Sep 2023]
- BEiT-3: [cnt]: Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks [22 Aug 2022]
- TaskMatrix.AI: [cnt]: TaskMatrix connects ChatGPT and a series of Visual Foundation Models to enable sending and receiving images during chatting. [29 Mar 2023]
- Florence-2: Advancing a unified representation for various vision tasks, demonstrating specialized models like `CLIP` for classification, `GroundingDINO` for object detection, and `SAM` for segmentation. ref [10 Nov 2023]
- LLM2CLIP: Directly integrating LLMs into CLIP causes catastrophic performance drops. We propose LLM2CLIP, a caption contrastive fine-tuning method that leverages LLMs to enhance CLIP. [7 Nov 2024]
- Florence-VL: A multimodal large language model (MLLM) that integrates Florence-2. [5 Dec 2024]
- Apple
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities. [13 Jun 2024]
- Hugging Face
- Benchmarking Multimodal LLMs.
- LLaVA-1.5 achieves SoTA on a broad range of 11 tasks incl. SEED-Bench.
- SEED-Bench: [cnt]: Benchmarking Multimodal LLMs git [30 Jul 2023]
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models ref [25 Sep 2024]
- Optimizing Memory Usage for Training LLMs and Vision Transformers: When applying 10 techniques to a vision transformer, we reduced the memory consumption 20x on a single GPU. ref / git [2 Jul 2023]
- The Generative AI Revolution: Exploring the Current Landscape : doc [28 Jun 2023]
- Diffusion Models vs. GANs vs. VAEs: Comparison of Deep Generative Models [12 May 2023]
Model | Description | Strengths | Weaknesses |
---|---|---|---|
GANs | Two neural networks, a generator and a discriminator, work together. The generator creates synthetic samples, and the discriminator distinguishes between real and generated samples. | Unsupervised learning, able to mimic data distributions without labeled data, and are versatile in applications like image synthesis, super-resolution, and style transfer | Known for potentially unstable training and less diversity in generation. |
VAEs | Consists of an encoder and a decoder. The encoder maps input data into a low-dimensional representation, and the decoder reconstructs the original input data from this representation. e.g., DALLE | Efficient at learning latent representations and can be used for tasks like data denoising and anomaly detection, in addition to data generation. | Dependent on an approximate loss function. |
Diffusion Models | Consists of forward and reverse diffusion processes. Forward diffusion adds noise to input data until white noise is obtained. The reverse diffusion process removes the noise to recover the original data. e.g., Stable Diffusion | Capable of producing high-quality, step-by-step samples. | Multi-step (often 1000) generation process. |
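
The forward diffusion process in the table (adding noise until only white noise remains) can be sketched in a few lines. This is a minimal illustration that assumes a linear noise schedule with 1000 steps and endpoints 1e-4 and 0.02, as commonly used in DDPM-style models; none of these values or function names come from the linked article. It relies on the closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise, where abar_t is the running product of (1 - beta).

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    products, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        products.append(running)
    return products


def noise_sample(x0, t, abar, rng):
    """Jump directly to step t: scale the signal down and mix in Gaussian noise."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


abar = alpha_bars()
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                       # toy "data" vector
slightly_noisy = noise_sample(x0, 0, abar, rng)   # almost identical to x0
almost_white = noise_sample(x0, 999, abar, rng)   # signal weight is near zero
```

The reverse process, which the table lists as the costly multi-step part, would train a network to undo these steps one at a time and is not shown here.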