CarLLaVA: Vision language models for camera-only closed-loop driving

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, Oleg Sinavski

2024-06-14

CARLA Leaderboard 2.0, Autonomous Driving, Language Modelling

Abstract

In this technical report, we present CarLLaVA, a Vision Language Model (VLM) for autonomous driving, developed for the CARLA Autonomous Driving Challenge 2.0. CarLLaVA uses the vision encoder of the LLaVA VLM and the LLaMA architecture as its backbone, achieving state-of-the-art closed-loop driving performance with only camera input and without the need for complex or expensive labels. Additionally, we show preliminary results on predicting language commentary alongside the driving output. CarLLaVA uses a semi-disentangled output representation of both path predictions and waypoints, combining the advantages of the path for better lateral control with those of the waypoints for better longitudinal control. We propose an efficient training recipe to train on large driving datasets without wasting compute on easy, trivial data. CarLLaVA ranks first in the sensor track of the CARLA Autonomous Driving Challenge 2.0, outperforming the previous state of the art by 458% and the best concurrent submission by 32.6%.
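To make the path/waypoint split concrete, the following is a minimal sketch of how the two outputs could feed separate controllers. It is not the controller from the report, which the abstract does not describe. It assumes points in the ego frame (x forward, y left, in metres), time-spaced waypoints for speed, and a standard pure-pursuit rule on the path for steering. The function names and the `dt`, `lookahead`, and `wheelbase` parameters are all hypothetical.

```python
import math

def target_speed_from_waypoints(waypoints, dt):
    """Longitudinal signal: average speed (m/s) implied by time-spaced waypoints.

    Assumes waypoints are (x, y) positions in the ego frame, one every dt seconds.
    """
    pts = [(0.0, 0.0)] + list(waypoints)  # start from the ego position
    travelled = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    return travelled / (dt * len(waypoints))

def steering_from_path(path, lookahead, wheelbase):
    """Lateral signal: pure-pursuit steering angle (radians, positive = left).

    Assumes path is a list of (x, y) points in the ego frame, spaced by distance.
    """
    # Pick the first path point at least `lookahead` metres away, else the last one.
    x, y = next((p for p in path if math.hypot(*p) >= lookahead), path[-1])
    dist_sq = x * x + y * y
    if dist_sq == 0.0:
        return 0.0
    curvature = 2.0 * y / dist_sq  # pure-pursuit arc through the target point
    return math.atan(wheelbase * curvature)

# Hypothetical predictions: waypoints 0.5 s apart, path curving gently left.
speed = target_speed_from_waypoints([(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)], dt=0.5)
steer = steering_from_path([(1.0, 0.1), (2.0, 0.4), (3.0, 0.9)], lookahead=2.0, wheelbase=2.7)
```

Because the path is indexed by distance rather than time, the steering target in this sketch does not collapse toward the ego position when the vehicle slows or stops, which is one plausible reading of why a path helps lateral control.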

Results

Task	Dataset	Metric	Value	Model
Autonomous Vehicles	Bench2Drive	Driving Score	85.94	SimLingo-Base (CarLLaVA)
Autonomous Vehicles	CARLA	Driving Score	6.87	CarLLaVA
Autonomous Vehicles	CARLA	Infraction Score	0.42	CarLLaVA
Autonomous Vehicles	CARLA	Route Completion	18.08	CarLLaVA
Autonomous Vehicles	CARLA	Driving Score	6.25	CarLLaVA (Map Track)
Autonomous Vehicles	CARLA	Infraction Score	0.39	CarLLaVA (Map Track)
Autonomous Vehicles	CARLA	Route Completion	18.89	CarLLaVA (Map Track)
Autonomous Driving	Bench2Drive	Driving Score	85.94	SimLingo-Base (CarLLaVA)
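
A note on reading the CARLA rows: the reported driving score is not the product of the reported route completion and infraction score (18.08 × 0.42 ≈ 7.59, not 6.87). This is consistent with the CARLA Leaderboard convention, as far as we understand it, of multiplying route completion by the infraction penalty per route and then averaging over routes. The sketch below illustrates the effect with made-up per-route values. The `driving_score` helper and the numbers are hypothetical and are not taken from the report.

```python
def driving_score(routes):
    """Mean over routes of route_completion (percent) * infraction_penalty (0..1)."""
    return sum(rc * penalty for rc, penalty in routes) / len(routes)

# Hypothetical per-route results: (route completion in percent, infraction penalty).
routes = [(100.0, 0.2), (10.0, 1.0)]

mean_completion = sum(rc for rc, _ in routes) / len(routes)
mean_penalty = sum(p for _, p in routes) / len(routes)

score = driving_score(routes)                         # average of per-route products
product_of_means = mean_completion * mean_penalty     # generally a different number
```

In this toy case the per-route average is lower than the product of the two averaged columns, the same direction as in the table above.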
