Prominent AI Foundation Models
Below is a curated list of notable foundation models, grouped by primary modality: large language models (LLMs), vision models, multimodal models, and models for other modalities such as audio and robotics. The list draws on several surveys and vendor reports, focuses on well-known examples, and names each developer where known; a short usage sketch follows each category's list.
Language Models
- BERT (Google): A bidirectional transformer for natural language understanding.[1][9][12]
- BLOOM (BigScience/Hugging Face): A 176B-parameter open-access multilingual LLM.[1][12]
- Claude (Anthropic): A family of LLMs focused on safety and helpfulness.[1][2][11]
- Cohere Command (Cohere): Enterprise-focused language models for generation and retrieval.[1][11]
- DBRX (Databricks): An open-source mixture-of-experts LLM.[11]
- DeepSeek series (DeepSeek-AI): Open-weight LLMs, including the reasoning-focused DeepSeek-R1 trained with reinforcement learning.[12]
- Gemini (Google): A natively multimodal model family handling text, images, audio, and video.[2][11]
- GPT series (OpenAI): Including GPT-2, GPT-3, GPT-4; generative pre-trained transformers for text tasks.[1][2][4][9][11][12]
- Granite (IBM): Open-source models for enterprise AI.[2][11]
- Jurassic (AI21 Labs): High-performance LLMs for natural language generation.[1]
- Llama series (Meta): Open and efficient foundation LLMs, from the original LLaMA through Llama 2 and Llama 3.[12]
- Mistral (Mistral AI): Efficient open-weight LLMs.[7][11]
- Nemotron (NVIDIA): Customizable models for enterprise use.[11]
- PaLM series (Google): Scaling language modeling with the Pathways system.[12]
- Phi (Microsoft): Small, efficient language models.[11]
- T5 (Google): Text-to-text transformer for transfer learning.[12]
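Many of the open-weight language models above (BERT, BLOOM, Llama, Mistral, Phi) are distributed through the Hugging Face Hub. As a minimal sketch of how such a model is loaded, assuming the `transformers` and `torch` packages and the `bert-base-uncased` checkpoint:

```python
from transformers import AutoModel, AutoTokenizer

# Any compatible Hub checkpoint could be substituted here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Foundation models learn broad representations.", return_tensors="pt")
outputs = model(**inputs)

# Contextual token embeddings: (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```

The same `AutoTokenizer`/`AutoModel` pattern works across most of the open models listed, which is part of why the Hub has become a de facto distribution channel for them.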
Vision Models
- BLIP-2 (Salesforce): Bootstrapping language-image pre-training.[0][12]
- CLIP (OpenAI): Contrastive language-image pre-training for visual representations.[4][12]
- DINO series (Meta/IDEA/NYU): A name shared by several models: Meta's self-supervised vision transformers (DINO, DINOv2), IDEA's DINO detection transformer, and NYU's DINO-WM world model.[12]
- EfficientNet (Google): Efficient convolutional networks for image classification.[0]
- Florence (Microsoft): Foundation model for computer vision tasks.[12]
- InternImage (Shanghai AI Lab): Large-scale vision models with deformable convolutions.[12]
- MobileNetV2 (Google): Lightweight models for mobile vision applications.[0]
- SAM series (Meta): Segment Anything models for image and video segmentation.[12]
- Sapiens (Meta): A family of foundation models for human-centric vision tasks such as pose estimation and body-part segmentation.[12]
- Swin Transformer (Microsoft): Hierarchical vision transformer.[12]
- ViT (Vision Transformer) (Google): Transformer-based image recognition.[0][12]
- YOLOv8 (Ultralytics): Real-time object detection model.[0]
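CLIP illustrates how several of these vision models are used in practice: its contrastive pre-training lets it score arbitrary image-label pairs with no task-specific fine-tuning. A minimal zero-shot classification sketch, assuming the `openai/clip-vit-base-patch32` checkpoint and a hypothetical local image file:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity logits, softmaxed into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```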
Multimodal Models
- DALL-E series (OpenAI): Text-to-image generation models, including DALL-E 3.[3][4][9][12]
- Flamingo (Google DeepMind): Visual language model for few-shot learning.[3][9][12]
- Imagen series (Google): Photorealistic text-to-image diffusion models.[12]
- LLaVA (University of Wisconsin-Madison/Microsoft): Visual instruction tuning for vision-language tasks.[12]
- MiniGPT-4 (KAUST): Enhancing vision-language understanding.[12]
- Stable Diffusion (Stability AI): Latent diffusion for image synthesis.[1][3][9][12]
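Text-to-image diffusion models such as Stable Diffusion are commonly run through the `diffusers` library. A minimal sketch, assuming a CUDA GPU and the `runwayml/stable-diffusion-v1-5` checkpoint:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

# Latent diffusion: the prompt is encoded, denoising runs in a compressed
# latent space, and the VAE decodes the result back to pixels.
image = pipe("a photorealistic astronaut riding a horse").images[0]
image.save("astronaut.png")
```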
Other Modalities (e.g., Audio, Robotics, Specialized)
- AstroLLaMA (UniverseTBD): A LLaMA-2-based model fine-tuned on astronomy abstracts.[9]
- Grok (xAI): A general-purpose LLM with multimodal capabilities.[7]
- LLark (Spotify): A multimodal instruction-following model for music understanding.[3][9]
- MusicGen (Meta): A text-to-music generation model.[3][9]
- Nova (Amazon): Multimodal foundation models.[1]
- Orbital (Orbital Materials): Models for chemistry and materials discovery.[9]
- RT-2 (Google DeepMind): A vision-language-action model for robotic control.[9]
- Sora (OpenAI): Video generation model.[4]
- StarCoder (BigCode): An open code-generation LLM from the BigCode project, led by Hugging Face and ServiceNow.[9]
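Audio models like MusicGen also ship with `transformers` support. A minimal text-to-music sketch, assuming the `facebook/musicgen-small` checkpoint and `scipy` for writing the output WAV:

```python
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi hip hop with a mellow piano"], padding=True, return_tensors="pt")

# ~256 audio tokens is roughly five seconds of audio for this model.
audio = model.generate(**inputs, max_new_tokens=256)

rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=rate, data=audio[0, 0].numpy())
```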
For a more exhaustive list, repositories like Awesome-Foundation-Models on GitHub provide hundreds of models with paper references.[5][12] Rankings, such as Forrester's top 10, highlight models like Google Gemini as leaders based on offering, strategy, and market presence.[2][11]