ActiveAnno3D - An Active Learning Framework for Multi-Modal 3D Object Detection

1Technical University of Munich (TUM), 2University of California San Diego (UCSD), 3Aalborg University
IEEE Intelligent Vehicles Symposium (IV'24)

*Indicates Equal Contribution
We propose a framework for efficient active learning across various 3D object detection techniques and modalities, demonstrating that active learning can reach comparable detection performance on benchmark datasets at a fraction of the annotation cost. The datasets cover roadside infrastructure sensors (top row) and onboard vehicle sensors (bottom row), with LiDAR-only and LiDAR+camera fusion methods, the two dominant strategies among state-of-the-art approaches to this safety-critical detection task. We use the entropy sampling strategy to select the most informative data.

Overview

ActiveAnno3D is the first active learning framework for multi-modal 3D object detection. With this framework, you can select the data samples for labeling that are most informative for training.

In summary:
  • We explore various continuous training methods and integrate the one that is most efficient in terms of computational demand and detection performance.
  • We perform extensive experiments and ablation studies with BEVFusion and PV-RCNN on the nuScenes and TUM Traffic Intersection datasets and show that entropy-based querying achieves almost the same performance with only half of the training data (77.25 mAP compared to 83.50 mAP); see the sketch after this list.
  • We integrate our active learning framework into the proAnno labeling tool to enable AI-assisted data selection and labeling and to minimize labeling costs.
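The entropy query strategy scores each unlabeled frame by the Shannon entropy of its predicted class distributions and picks the highest-scoring (most uncertain) frames for annotation. Below is a minimal NumPy sketch of this idea, assuming per-detection softmax class scores are available; `frame_entropy`, the mean aggregation, and all names are illustrative choices, not the framework's actual API.

```python
import numpy as np

def frame_entropy(class_probs: np.ndarray) -> float:
    """Mean Shannon entropy over the detections in one frame.

    class_probs: (num_detections, num_classes) softmax scores.
    """
    eps = 1e-12  # avoid log(0)
    per_detection = -(class_probs * np.log(class_probs + eps)).sum(axis=1)
    return float(per_detection.mean()) if len(per_detection) else 0.0

def entropy_query(frame_probs: dict[str, np.ndarray], k: int) -> list[str]:
    """Return the ids of the k most uncertain (most informative) frames."""
    scores = {fid: frame_entropy(p) for fid, p in frame_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The random-sampling baseline used in the benchmark below simply replaces `entropy_query` with a uniform draw from the unlabeled pool.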

Abstract

The curation of large-scale datasets is still costly and requires considerable time and resources. Data is often manually labeled, and the challenge of creating high-quality datasets remains. In this work, we fill the research gap by using active learning for multi-modal 3D object detection. We propose ActiveAnno3D, an active learning framework that selects the data samples for labeling that are of maximum informativeness for training. We explore various continuous training methods and integrate the one that is most efficient in terms of computational demand and detection performance. Furthermore, we perform extensive experiments and ablation studies with BEVFusion and PV-RCNN on the nuScenes and TUM Traffic Intersection (TUMTraf-I) datasets. We show that we can achieve almost the same performance with PV-RCNN and the entropy-based query strategy when using only half of the training data of the TUMTraf-I dataset (77.25 mAP compared to 83.50 mAP). BEVFusion achieved an mAP of 64.31 when using half of the training data and 52.88 mAP when using the complete nuScenes dataset. We integrate our active learning framework into the proAnno labeling tool to enable AI-assisted data selection and labeling and to minimize labeling costs. Finally, we provide code, weights, and visualization results on our website: https://active3d-framework.github.io/active3d-framework.

Architecture

The generalized active learning flow selects data from an unlabeled pool according to an acquisition function, which, for uncertainty-driven AL, uses the trained model and, for diversity-driven AL, may be independent of the training. The selected data is annotated by an oracle and aggregated with the previously labeled data. Whether all labeled data or only the new data is used in the next training step is determined by the choice of training strategy. The variety of possible acquisition and training techniques, together with the unique domain challenges of autonomous driving, makes active learning an opportune environment for innovation toward safe and accurate learning.
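As a self-contained illustration of this loop, here is a toy sketch; `train`, `oracle_annotate`, and the frame representation are placeholder stand-ins for the components in the figure, not the framework's actual API.

```python
import random

# Toy stand-ins for the components in the figure.
def train(labeled_frames):
    """Dummy trainer: returns a 'model' that scores frames by a
    placeholder uncertainty (random here, entropy in practice)."""
    return lambda frame: random.random()

def oracle_annotate(frame):
    """Stand-in for the human annotator / labeling tool."""
    return {**frame, "label": "annotated"}

unlabeled = [{"id": i} for i in range(1000)]   # unlabeled pool
labeled, budget, retrain_all = [], 50, True    # training-strategy switch

model = train(labeled)
for al_round in range(8):
    # 1. Acquisition: rank the pool by the model's uncertainty
    #    (uncertainty-driven AL); diversity-driven AL may rank
    #    independently of the model.
    queried = sorted(unlabeled, key=model, reverse=True)[:budget]

    # 2. The oracle annotates the queried frames; aggregate them
    #    with the previously labeled data.
    labeled += [oracle_annotate(f) for f in queried]
    unlabeled = [f for f in unlabeled if f not in queried]

    # 3. Training strategy: retrain on all labeled data, or continue
    #    training on only the newest batch (continuous training).
    model = train(labeled if retrain_all else labeled[-budget:])
```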

Evaluation

The graph illustrates the mAP scores achieved by the BEVFusion model on the nuScenes dataset as the training set expands in the active learning setting, shown separately for random and entropy querying.
The graph illustrates the mAP scores achieved by the PV-RCNN model on the TUM Traffic Intersection dataset as the training set expands in the active learning setting, shown separately for random and entropy querying.
The graph illustrates the mAP scores achieved by the PV-RCNN model on the TUM Traffic Intersection dataset as the training set expands in the active learning setting, compared across different query strategies.
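For reference, the PV-RCNN curve can be reproduced from the benchmark table below; this is a short matplotlib sketch with the values copied from that table (the plotting choices are ours, not the paper's figure style).

```python
import matplotlib.pyplot as plt

# (labeled pool %, mAP) pairs taken from the benchmark table below
pool = [10, 15, 20, 25, 30, 35, 40, 50]
random_map = [51.03, 61.98, 69.84, 74.82, 77.25, 75.40, 77.03, 79.09]
entropy_map = [54.32, 62.24, 68.23, 72.40, 76.56, 75.00, 75.48, 77.25]

plt.plot(pool, random_map, marker="o", label="Random query")
plt.plot(pool, entropy_map, marker="s", label="Entropy query")
plt.axhline(83.50, linestyle="--", color="gray", label="Full dataset (SOA, no AL)")
plt.xlabel("Labeled pool (% of training set)")
plt.ylabel("mAP")
plt.title("PV-RCNN on TUM Traffic Intersection")
plt.legend()
plt.show()
```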

Qualitative Results

Qualitative results are illustrated by two pairs of images. The left pair is from the TUM Traffic Intersection dataset, and the right pair is from nuScenes. In each pair, the left image shows the labels predicted by the actively trained model, with each class represented by a different color, while the right image shows the predictions of a model trained on the complete dataset. Both results are very similar, demonstrating the efficiency of the active learning approach.

Benchmark Results

| Round | Labeled Pool (%) | PV-RCNN (LiDAR-only) Random | PV-RCNN (LiDAR-only) Entropy | BEVFusion (LiDAR+camera) Random | BEVFusion (LiDAR+camera) Entropy |
|-------|------------------|-----------------------------|------------------------------|---------------------------------|----------------------------------|
| 1 | 10 | 51.03 | 54.32 (+3.29) | 30.95 | 31.06 (+0.11) |
| 2 | 15 | 61.98 | 62.24 (+0.26) | 34.19 | 36.39 (+2.20) |
| 3 | 20 | 69.84 | 68.23 (-1.61) | 38.00 | 40.41 (+2.41) |
| 4 | 25 | 74.82 | 72.40 (-2.42) | 42.36 | 42.17 (-0.19) |
| 5 | 30 | 77.25 | 76.56 (-0.69) | 44.94 | 45.57 (+0.63) |
| 6 | 35 | 75.40 | 75.00 (-0.40) | 44.74 | 46.76 (+2.02) |
| 7 | 40 | 77.03 | 75.48 (-1.55) | 46.93 | 49.24 (+2.31) |
| 8 | 50 | 79.09 | 77.25 (-1.84) | 49.90 | 64.31 (+14.41) |
| SOA (no AL) | 100 | 83.50 | | 52.88 | |
Evaluation of the PV-RCNN (LiDAR-only) model on the TUM Traffic Intersection dataset and the BEVFusion (camera+LiDAR) model on the nuScenes dataset, using the random-sampling baseline and the entropy querying method. Values in parentheses give the difference between entropy and random querying. These results are compared to the respective 100% results (SOA, no active learning) reported in the original works.