Vision Language Models (VLM)

Multimodal models that can reason over image and video inputs and generate descriptive language.

liuhaotian / llava-v1.6-34b
PREVIEW
language generation
vision assistant
Multi-modal vision-language model that understands text/images and generates informative responses.
liuhaotian / llava-v1.6-mistral-7b
PREVIEW
language generation
vision assistant
Multi-modal vision-language model that understands text/images and generates informative responses.
nvidia / neva-22b
PREVIEW
language generation
vision assistant
Multi-modal vision-language model that understands text/images and generates informative responses.
microsoft / phi-3-vision-128k-instruct
PREVIEW
language generation
vision assistant
Cutting-edge open multimodal model excelling in high-quality reasoning from images.
google / paligemma
PREVIEW
language generation
vision assistant
Vision language model adept at comprehending text and visual inputs to produce informative responses.
microsoft / kosmos-2
PREVIEW
image understanding
multimodal
Groundbreaking multimodal model designed to understand and reason about visual elements in images.
adept / fuyu-8b
PREVIEW
image understanding
language generation
Multi-modal model for a wide range of tasks, including image understanding and language generation.
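As a rough illustration of how one of the VLMs above could be queried, the Python sketch below sends an image together with a text prompt to a hosted chat completions endpoint. The endpoint URL, the NVIDIA_API_KEY environment variable, the inline image encoding, and the payload shape are assumptions modeled on an OpenAI-compatible API, not a confirmed schema; consult the model's page for the exact request format.

import base64
import os

import requests

# Assumed endpoint and schema, modeled on an OpenAI-compatible chat completions API.
API_URL = "https://integrate.api.nvidia.com/v1/chat/completions"
API_KEY = os.environ["NVIDIA_API_KEY"]  # hypothetical environment variable name

# Encode a local image so it can travel inside the JSON request body.
with open("street_scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "microsoft/phi-3-vision-128k-instruct",
    "messages": [
        {
            "role": "user",
            # Some hosted VLMs accept the image inline with the prompt text; others
            # expect a structured content list. This inline form is an assumption.
            "content": f'Describe this image. <img src="data:image/jpeg;base64,{image_b64}" />',
        }
    ],
    "max_tokens": 256,
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])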

Specialized Foundation Models

Computer vision models that excel at particular visual perception tasks.

nvidia / ocdrnet
PREVIEW

OCDNet and OCRNet are pre-trained models designed for optical character detection and recognition respectively.

Optical Character Detection
Optical Character Recognition
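Because detection and recognition are split across two models, a typical integration chains them: detect text regions first, then recognize the text inside each cropped region. The sketch below shows that flow with placeholder functions standing in for OCDNet and OCRNet; it is a minimal illustration of the two-stage split, not the models' real API.

from typing import List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # x, y, width, height of a detected text region


def detect_text_regions(image: np.ndarray) -> List[Box]:
    """Placeholder for the detection stage (OCDNet): return text bounding boxes."""
    # Dummy output so the sketch runs end to end; a real detector would infer these.
    return [(10, 10, 80, 20)]


def recognize_text(crop: np.ndarray) -> str:
    """Placeholder for the recognition stage (OCRNet): read the text in one crop."""
    return "<decoded text>"


def run_ocr(image: np.ndarray) -> List[str]:
    """Chain the two stages: detect regions, crop them, then recognize each crop."""
    results = []
    for x, y, w, h in detect_text_regions(image):
        crop = image[y:y + h, x:x + w]
        results.append(recognize_text(crop))
    return results


if __name__ == "__main__":
    page = np.zeros((200, 300, 3), dtype=np.uint8)  # stand-in for a document image
    print(run_ocr(page))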
nvidia / visual-changenet
PREVIEW

Visual ChangeNet detects pixel-level change maps between two images and outputs a semantic change segmentation mask.

Image Segmentation
NVIDIA NIM
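To make the input/output contract concrete, the sketch below computes a naive per-pixel change mask from two aligned images by raw intensity differencing. This baseline only illustrates the shape of the problem (two images in, a per-pixel mask out); Visual ChangeNet produces a semantic change segmentation, not a simple intensity difference.

import numpy as np


def naive_change_mask(img_a: np.ndarray, img_b: np.ndarray, threshold: int = 30) -> np.ndarray:
    """Return a binary HxW mask marking pixels whose intensity changed noticeably."""
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    # Collapse the channel axis so the mask is per pixel, not per channel.
    per_pixel = diff.max(axis=-1) if diff.ndim == 3 else diff
    return (per_pixel > threshold).astype(np.uint8)


if __name__ == "__main__":
    before = np.zeros((64, 64, 3), dtype=np.uint8)
    after = before.copy()
    after[20:40, 20:40] = 255  # simulate a changed region
    mask = naive_change_mask(before, after)
    print("changed pixels:", int(mask.sum()))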
nvidia / retail-object-detection
PREVIEW

EfficientDet-based object detection network to detect 100 specific retail objects from an input video.

NVIDIA NIM
Object Detection