
VILA: Advancing Visual Reasoning Across Images and Videos

May 06, 2024


Recently, NVIDIA unveiled VILA, a visual language model poised to advance the state of multi-modal AI. Developed with a comprehensive pretraining, instruction tuning, and deployment pipeline, VILA is designed to meet a diverse range of customer needs while delivering strong performance across a variety of benchmarks.

One of the key highlights of VILA is its state-of-the-art (SOTA) performance on image and video question-answering benchmarks. With strong multi-image reasoning and in-context learning capabilities, VILA sets a new standard for efficiency and accuracy in visual language models. Notably, VILA is optimized for speed, using only a fraction of the tokens that comparable models require.
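
To make the in-context learning claim concrete, the sketch below shows one common way a few-shot, multi-image prompt is assembled for a visual language model: image and text segments are interleaved so the model can infer the pattern from the exemplars before answering about a new image. The segment layout, file names, and placeholder structure are illustrative assumptions, not VILA's documented interface.

```python
# Hypothetical few-shot, multi-image prompt for a visual language model.
# File names and the segment structure are illustrative only; consult the
# VILA repository for its actual prompt format.
few_shot_examples = [
    ("traffic_cam_day.jpg",  "Question: Is the intersection congested? Answer: No."),
    ("traffic_cam_rush.jpg", "Question: Is the intersection congested? Answer: Yes."),
]
query_image = "traffic_cam_now.jpg"

segments = []
for image_path, exemplar_text in few_shot_examples:
    segments.append({"type": "image", "path": image_path})
    segments.append({"type": "text",  "value": exemplar_text})

# The final query follows the same pattern, leaving the answer for the model.
segments.append({"type": "image", "path": query_image})
segments.append({"type": "text",  "value": "Question: Is the intersection congested? Answer:"})

# `segments` would then be handed to the model's inference entry point.
print(segments)
```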

Moreover, VILA comes in multiple sizes, ranging from the high-performance 40B model down to the 3.5B version designed for edge devices, which can be deployed on platforms such as the NVIDIA Jetson Orin. This flexibility allows VILA to serve a wide range of applications and computing environments.
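
A rough back-of-the-envelope estimate helps explain why the smaller variant suits edge hardware: weight memory scales with parameter count and bits per weight. The sketch below uses the 3.5B parameter count from the article and ignores activations and the KV cache, so treat the numbers as an approximation only.

```python
# Rough weight-memory estimate for the edge-sized model.
params = 3.5e9  # 3.5B parameters, per the article

def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, ignoring activations and KV cache."""
    return num_params * bits_per_weight / 8 / 1e9

print(f"FP16 weights: {weight_footprint_gb(params, 16):.1f} GB")  # ~7.0 GB
print(f"INT4 weights: {weight_footprint_gb(params, 4):.1f} GB")   # ~1.8 GB
```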

The development of VILA was marked by an efficient training pipeline: the VILA-13B model was trained on 128 NVIDIA A100 GPUs in just two days. This rapid turnaround underscores the efficiency of the pipeline and suggests that the approach can scale further with more data and GPU resources.
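
For a sense of the compute budget that "two days on 128 A100s" implies, the total GPU-hours can be tallied directly from those two figures; the calculation below uses only the numbers stated in the article.

```python
# Compute budget implied by the reported training run.
gpus = 128   # NVIDIA A100 GPUs, per the article
days = 2     # reported wall-clock training time

gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")  # 6,144 GPU-hours
```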

When it comes to inference efficiency, VILA also shines: it is compatible with NVIDIA TensorRT-LLM (TRT-LLM), which further boosts its performance. By leveraging 4-bit AWQ quantization, the VILA-13B model runs at roughly 10 ms per token on a single NVIDIA RTX 4090 GPU. This efficiency makes VILA a strong choice for demanding computational tasks.
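
As an illustration of what 4-bit AWQ quantization looks like in practice, here is a minimal sketch using the community AutoAWQ library on a generic Hugging Face-format language-model checkpoint. The checkpoint path is hypothetical, this is not the TensorRT-LLM deployment path the article refers to, and whether AutoAWQ handles a given VILA checkpoint directly would need to be verified against the VILA repository.

```python
# Minimal 4-bit AWQ quantization sketch using the community AutoAWQ library.
# The checkpoint paths are hypothetical; VILA's deployment path described in the
# article goes through TensorRT-LLM, so treat this purely as an AWQ illustration.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/language-backbone-checkpoint"  # hypothetical
quant_path = "path/to/output-awq-4bit"               # hypothetical

quant_config = {
    "w_bit": 4,           # 4-bit weights, as mentioned in the article
    "q_group_size": 128,  # common AWQ group size
    "zero_point": True,
    "version": "GEMM",
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate, quantize the weights, and save the 4-bit checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```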
