
From BEVDet to BEVFormer: Building a Vision-First Future for Autoware

Author: ADMIN

Autonomous perception is entering a new era, one where camera-first architectures are rapidly gaining ground over traditional LiDAR-centric designs. This transformation is driven by Bird’s-Eye-View (BEV) neural networks, which reconstruct a top-down 3D understanding of the environment from multiple camera feeds.

By leveraging BEV perception, vehicles can achieve robust object detection and motion tracking with lower hardware cost and greater deployment flexibility. For Autoware, this aligns perfectly with its commitment to AI-first, software-defined, and hardware-agnostic autonomy.

In collaboration with the Autoware Foundation, MulticoreWare Inc. — a global technology company specializing in AI optimization, computer vision, and embedded acceleration — has integrated two BEV-based detection models into Autoware’s Perception stack.


Fig. 1. Framework of the BEVDet paradigm

BEVDet – Real-Time 3D Detection from Multi-Camera Images

BEVDet (Bird’s-Eye-View Detection) serves as the foundation of Autoware’s camera-based perception. It applies the Lift–Splat–Shoot principle: lifting image features into 3D, splatting them into a unified BEV grid, and detecting objects in top-down space.

Key Features:

  • ROS 2 Integration: Runs as a native Autoware perception node.
  • Camera-Only Operation: Derives 3D geometry purely from multi-view camera inputs.
  • Unified BEV Representation: Provides consistent spatial context for downstream planning.
  • TensorRT Optimization: Upgraded to TensorRT 10.x with FP16 mixed-precision inference.
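The Lift–Splat step described above can be illustrated with a minimal NumPy sketch. All names, shapes, and the grid layout here are illustrative assumptions, not Autoware's or BEVDet's actual implementation: per-pixel features are "lifted" by taking the outer product with a predicted depth distribution, then "splatted" by accumulating each resulting 3D point's feature into its BEV cell.

```python
import numpy as np

def lift_splat(feats, depth_probs, pixel_rays, bev_shape=(8, 8), cell=1.0):
    """Toy Lift-Splat: lift image features into 3D, splat onto a BEV grid.

    feats:       (N_pix, C)    per-pixel image features
    depth_probs: (N_pix, D)    per-pixel categorical depth distribution
    pixel_rays:  (N_pix, D, 3) 3D point for each pixel at each depth bin,
                               already transformed into the ego frame
    """
    # "Lift": weight each feature by its depth probability -> (N_pix, D, C)
    lifted = depth_probs[:, :, None] * feats[:, None, :]

    # "Splat": accumulate every 3D point's feature into its (x, y) BEV cell
    c = feats.shape[1]
    bev = np.zeros((*bev_shape, c))
    xs = (pixel_rays[..., 0] / cell).astype(int)
    ys = (pixel_rays[..., 1] / cell).astype(int)
    valid = (xs >= 0) & (xs < bev_shape[0]) & (ys >= 0) & (ys < bev_shape[1])
    np.add.at(bev, (xs[valid], ys[valid]), lifted[valid])  # unbuffered add
    return bev
```

Because each pixel's depth distribution sums to one, the total feature mass on the grid equals the total input feature mass (for points that land inside the grid), which is a handy sanity check for this kind of projection.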

Fig. 5. Overall Architecture of BEVFormer: (a) The encoder layer of BEVFormer contains grid-shaped BEV queries, temporal self-attention, and spatial cross-attention. (b) In spatial cross-attention, each BEV query only interacts with image features in the regions of interest. (c) In temporal self-attention, each BEV query interacts with two features: the BEV queries at the current timestamp and the BEV features at the previous timestamp.

BEVFormer – Temporal Transformers for Advanced BEV Perception

BEVFormer builds upon BEVDet by introducing temporal reasoning, fusing features from multiple frames to handle motion, occlusion, and continuity over time.

Technical Highlights:

  • Spatial Cross-Attention: Selectively gathers visual features from all camera views.
  • Temporal Self-Attention: Maintains consistency of BEV features across frames.
  • C++ Inference Pipeline: Entirely reimplemented for ROS 2 with ONNX → TensorRT workflow.
  • RViz Visualization: Enables real-time 3D bounding box rendering and trajectory tracking.
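The temporal self-attention idea, where each BEV query attends jointly to the current queries and the previous frame's BEV features, reduces to ordinary scaled dot-product attention over a concatenated key/value set. The following NumPy sketch shows that mechanism in miniature; the shapes and function names are illustrative assumptions, not BEVFormer's actual code:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def temporal_self_attention(bev_queries, prev_bev):
    """Each current BEV query attends to both the current queries and
    the BEV features from the previous timestamp, fusing motion cues."""
    keys = np.concatenate([bev_queries, prev_bev], axis=0)
    return attention(bev_queries, keys, keys)
```

In the real model the previous BEV features are first aligned to the current ego pose before attention, so that a moving vehicle's history lines up spatially with its present position.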

Why It Matters

  • Demonstrating open collaboration between the Autoware Foundation and MulticoreWare to advance open-source, deployable AI.
  • Transitioning from LiDAR-heavy to scalable, vision-based perception.
  • Achieving real-time performance on embedded and edge devices.
  • Strengthening spatial and temporal understanding in complex driving scenes.

Access the Full Technical Report