Results

Link to the slides and videos used to present the project during the Cominlabs Days 2024.


Learning with appropriate datasets

The question of datasets

What datasets are available for event cameras on UAVs? We are not aware of any dataset suitable for training event-based DNNs that could improve the navigation autonomy of UAVs in SAR missions. The drone-racing dataset from the University of Zürich is limited to indoor scenes and specific to drone racing. Many native event-camera datasets exist, mainly for automotive applications, such as DDD20, Prophesee 1MEGAPIXEL, and DSEC. RGB-based UAV datasets such as VisDrone, UAVDT, and SARD also exist and can be converted into event datasets using tools such as v2e. However, none of these datasets is suitable for training deep neural networks (DNNs) for autonomous UAV navigation in the environments expected during SAR missions, such as forests, caves, and destroyed buildings.
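To give an idea of the principle behind such a conversion, the sketch below is a deliberately simplified Python version of what a tool like v2e models much more faithfully (it ignores noise, pixel bandwidth, and refractory periods): an event (t, x, y, polarity) is emitted whenever the log intensity of a pixel changes by more than a contrast threshold. The function name and threshold value are illustrative, not v2e parameters.

import numpy as np

def frames_to_events(frames, timestamps, threshold=0.2):
    """Naive frame-to-event conversion: emit (t, x, y, polarity) whenever the
    log intensity of a pixel has changed by more than `threshold` since the
    last event at that pixel. Real simulators such as v2e add noise, bandwidth
    and refractory-period modelling on top of this principle."""
    log_ref = np.log(frames[0].astype(np.float32) + 1e-3)  # per-pixel reference
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_now = np.log(frame.astype(np.float32) + 1e-3)
        diff = log_now - log_ref
        # Positive (brighter) and negative (darker) contrast crossings
        for polarity, mask in ((+1, diff >= threshold), (-1, diff <= -threshold)):
            ys, xs = np.nonzero(mask)
            events.extend((t, x, y, polarity) for x, y in zip(xs, ys))
            log_ref[mask] = log_now[mask]  # reset the reference where events fired
    return events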

Simulating Aerial Event-based Environment: Application to Car Detection, presented at ERF’24.

At the European Robotics Forum 2024, we presented our framework to simulate an aerial event-based environment, which can be used to train and evaluate artificial vision tasks, such as detection, in diverse conditions.
Several tools can be combined to build versatile datasets on demand for training deep neural networks (DNNs) for event cameras and UAVs. AirSim provides a complete simulation environment allowing different scenarios in realistic conditions, with diverse lighting and refined graphics thanks to Unreal Engine. Coupled with v2e, it produces realistic and rich event streams similar to those an event camera mounted on a UAV would provide. This allows DNN models to be tested, but not yet trained: training also requires ground truth, which we obtain with UnrealGT. We validated this framework by training an event-based YOLOv7 model.
More details are provided in our paper:
Ismail Amessegher, Hajer Fradi, Clémence Liard, Jean-Philippe Diguet, Panagiotis Papadakis, and Matthieu Arzel. Simulating Aerial Event-based Environment: Application to Car Detection. European Robotics Forum 2024, Mar 2024, Rimini, Italy. ⟨hal-04497648⟩
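As a minimal sketch of the capture side of this pipeline, the snippet below grabs rendered frames through AirSim's Python API and saves them as a video that v2e can then convert into an event stream. The camera name, frame count, output file, and the use of imageio (with ffmpeg support) are illustrative choices, not fixed parts of our framework.

import numpy as np
import airsim   # AirSim Python client
import imageio  # requires imageio-ffmpeg for mp4 output

client = airsim.MultirotorClient()  # connect to the running AirSim simulator
client.confirmConnection()

frames = []
for _ in range(300):  # the number of frames is an arbitrary example
    # "0" is the default front-facing camera; Scene is the rendered RGB view
    response = client.simGetImages(
        [airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)])[0]
    img = np.frombuffer(response.image_data_uint8, dtype=np.uint8)
    # Keep three channels; the channel count/order depends on the AirSim version
    frames.append(img.reshape(response.height, response.width, -1)[..., :3])

# Save a video that v2e can take as input to synthesize an event stream
imageio.mimwrite("airsim_scene.mp4", frames, fps=60)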

Our framework to generate datasets on demand.
Example of one of our test scenarios: events are used to build a high-frame-rate low-latency video stream that can feed an efficient artificial vision model.

Event-based control

UAV Object Tracking based on Deep Reinforcement Learning

We investigated how to design an efficient, low-latency controller that allows a UAV to follow a leader in full autonomy. The proposed tracking controller responds to visual feedback from the mounted event sensor, adjusting the drone's movements to follow the target. To leverage the full motion capabilities of a quadrotor and the unique properties of event sensors, we propose an end-to-end deep reinforcement learning (DRL) framework that maps raw event-stream data directly to control actions for the UAV. To learn an optimal policy under highly variable and challenging conditions, we train in a simulation environment (AirSim) with domain randomization for effective transfer to real-world environments such as forests, caves, or rooms with complex patterns.
We demonstrate the effectiveness of our approach through experiments in challenging scenarios, including fast-moving targets and changing lighting conditions, and show improved generalization capabilities.
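The exact architecture and action space are detailed in the paper; purely as an illustration of the end-to-end idea, the hypothetical PyTorch module below maps a short stack of event frames directly to normalized velocity and yaw-rate commands.

import torch
import torch.nn as nn

class EventPolicy(nn.Module):
    """Toy end-to-end policy: a stack of event frames in, velocity commands out.
    The layer sizes and the 4-D action (vx, vy, vz, yaw_rate) are illustrative."""
    def __init__(self, n_event_frames=5, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_event_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, n_actions), nn.Tanh(),  # normalized commands in [-1, 1]
        )

    def forward(self, event_frames):  # shape: (batch, n_event_frames, H, W)
        return self.head(self.encoder(event_frames))

policy = EventPolicy()
actions = policy(torch.zeros(1, 5, 260, 346))  # DVS-like 346x260 resolution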

Various training environments thanks to our simulation framework.
End-to-end DNN model fed by events to decide actions.
Result of the approach proposed in the project: a UAV is able to track a leader based only on an event stream feeding a DNN model that decides the actions to be applied by the flight controller.

This work was submitted for publication and is available here: ⟨hal-04714734⟩


Event-RGB sensor fusion

How and why to combine event and RGB cameras?

Real-time vision applications such as object detection for autonomous navigation have recently witnessed the emergence of neuromorphic, or event, cameras, thanks to their high dynamic range, high temporal resolution, and low latency. In this work, our objective is to leverage both the distinctive properties of asynchronous events and the static texture information of conventional frames. To this end, asynchronous events are first transformed into a 2D spatial-grid representation, carefully selected to harness the high temporal resolution of event streams while aligning with conventional image-based vision. Within a joint detection framework, detections from the RGB and event modalities are fused by probabilistically combining their scores and bounding boxes. The proposed method outperforms concurrent Event-RGB fusion methods on the DSEC-MOD and PKU-DDD17 datasets by a significant margin.

The proposed flowchart for very late fusion of event and RGB detections, with YOLOv7 as the baseline detector. The multi-range representation of the input event streams precedes the fusion step.
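The precise fusion rule is given in the paper; the sketch below only illustrates the general idea of probabilistic very late fusion: detections of the same class from the two modalities are matched by IoU, their scores are combined as complementary probabilities, and their boxes are averaged. The thresholds and helper names are ours.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_detections(rgb_dets, event_dets, iou_thr=0.5):
    """Each detection is (box, score, class_id). Matched pairs are merged with a
    complementary-probability score and an averaged box; unmatched detections
    are kept with their original score."""
    fused, used = [], set()
    for box_r, score_r, cls_r in rgb_dets:
        best_j, best_iou = None, iou_thr
        for j, (box_e, score_e, cls_e) in enumerate(event_dets):
            if j not in used and cls_e == cls_r and iou(box_r, box_e) > best_iou:
                best_j, best_iou = j, iou(box_r, box_e)
        if best_j is None:
            fused.append((box_r, score_r, cls_r))
        else:
            box_e, score_e, _ = event_dets[best_j]
            used.add(best_j)
            score = 1.0 - (1.0 - score_r) * (1.0 - score_e)  # probabilistic OR
            box = tuple((r + e) / 2.0 for r, e in zip(box_r, box_e))
            fused.append((box, score, cls_r))
    fused += [d for j, d in enumerate(event_dets) if j not in used]
    return fused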

This work was accepted for publication at the IEEE International Conference on Robotic Computing 2024, and is available here: ⟨hal-04746439⟩.


DNNs on FPGA

Rich RGB semantic segmentation on FPGA

Semantic segmentation is a complex task that has benefited from major improvements thanks to recent advances in machine learning, and more specifically in DNNs. It allows an accurate understanding of complex environments, making technologies like autonomous vehicles possible. Our first intuition was that RGB cameras would be required to provide enough information to run this task with acceptable accuracy, and that the models would be too complex to run with low latency. For instance, the time available to avoid an obstacle while moving at 20 m/s is around 50 ms.
We expected it would be fairly difficult to achieve such a low latency with semantic segmentation. However, we thought it would nicely complement detection, which could rely on simpler DNNs and benefit from the low latency of event-based vision. So, we decided to evaluate both RGB semantic segmentation and event-based detection (detailed in the next section).
Datasets like Cityscapes are available to train semantic-segmentation models for autonomous driving and are useful as a first benchmark of what can be achieved with efficient DNN implementations on FPGAs. These hardware targets have proven to be excellent for deploying highly parallel, low-latency, and low-power DNN architectures for embedded and cloud applications. Many FPGA implementations use recursive architectures based on Deep Learning Processor Units (DPUs) for fast, resource-efficient solutions, which usually come at the cost of a higher latency. Pipelined dataflow architectures, on the other hand, have the potential to offer scalable, low-latency implementations.
In this work, we explored implementing a semantic-segmentation network as a pipelined architecture and evaluated the achievable performance. Our model, a convolutional encoder-decoder based on U-Net, achieves 62.9% mIoU on the Cityscapes dataset with 4-bit integer quantization. Once deployed on a Xilinx Alveo U250 FPGA board, the implemented architecture outputs close to 23 images per second with a 44 ms latency per input. The code of this work is open source and was released publicly. This work was also published at the IEEE 30th International Conference on Electronics, Circuits and Systems (ICECS), Dec 2023, Istanbul, Turkey ⟨10.1109/ICECS58634.2023.10382715⟩ ⟨hal-04262138v2⟩.
With such results on a datacenter-class board, the aim of analyzing the environment with a latency compatible with a speed of 20 m/s was achieved. However, this model is still too complex (14M parameters) to be embedded on a smaller FPGA board fitting a UAV.
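As an illustration of how such low-bit layers can be expressed, the sketch below uses Brevitas to declare a convolution block with 4-bit weights and activations, the kind of building brick FINN can compile into a pipelined dataflow architecture. The channel sizes are arbitrary, and this is not our exact U-Net encoder.

import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantIdentity, QuantReLU

def quant_conv_block(in_ch, out_ch, weight_bits=4, act_bits=4):
    """A conv block with 4-bit weights and 4-bit activations."""
    return nn.Sequential(
        QuantConv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False,
                    weight_bit_width=weight_bits),
        nn.BatchNorm2d(out_ch),
        QuantReLU(bit_width=act_bits),
    )

encoder_stage = nn.Sequential(
    QuantIdentity(bit_width=4),  # quantize the network input
    quant_conv_block(3, 32),
    quant_conv_block(32, 64),
    nn.MaxPool2d(2),
)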

Therefore, we proposed a second version, based on ENet, with only 350k parameters. It achieves 70.3% mIoU on the Cityscapes dataset with 4-bit integers (512×256 inputs), runs at 226 FPS with a 4.2 ms latency per input on an AMD ZU19EG (an embedded-class board), and has a measured peak power consumption of 6.8 W.
So, our first intuition that semantic segmentation was too complex to keep up with low-latency event streams was wrong. But would it be beneficial from an information point of view? Could event cameras bring enough information for accurate semantic segmentation? These questions are still open.

Cityscapes dataset used to train and evaluate our models of semantic segmentation on FPGA.
Our first U-Net model to run semantic segmentation on FPGA.
ENet architecture. The output sizes are for 512×256 inputs and 19 classes, as in the Cityscapes dataset. Table adapted from the original ENet paper.

Our second, ENet-based model to run semantic segmentation on FPGA, with fewer resources and a reduced latency.

Low-latency event-based detection on FPGA

To detect obstacles in front of the UAV, we considered YOLO as a candidate to evaluate. Specifically, YOLOv5m, which has 21 million parameters, was tested on the Jetson Orin Nano. This model was trained on the DSEC-MOD dataset and quantized to 8 bits. However, its inference time is at least 70 ms, which limits the frame rate to 14 frames per second at most, making it a poor fit for real-world, low-latency applications.

Alternatively, TinyYOLOv3, which has 8 million parameters and 13 convolutional layers, was tested on an FPGA. Using Brevitas to quantize the model to 8 bits and FINN to map it to hardware layers, the first results show a frame rate of 18 frames per second with a latency of 50 ms at 100 MHz on a modest PYNQ-Z1 board.
We aim to further reduce the complexity of this TinyYOLOv3 model by lowering the bit width of the weights and activations, to fully benefit from the sparse, ternary information provided by event cameras.
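To illustrate why ternary precision is a natural match for event data, the sketch below accumulates the events of one time slice into a 2D grid and clamps it to {-1, 0, +1}, a sparse, image-like input for a network with very low-bit weights and activations. The representation we will finally retain may differ.

import numpy as np

def events_to_ternary_grid(events, height, width):
    """Accumulate events (t, x, y, polarity) of one time slice into a 2D grid
    and clamp it to {-1, 0, +1}: a sparse, ternary, image-like input."""
    grid = np.zeros((height, width), dtype=np.int32)
    for _, x, y, polarity in events:
        grid[y, x] += polarity  # +1 / -1 contributions accumulate per pixel
    return np.clip(grid, -1, 1).astype(np.int8)

# Example: three events in a 346x260 (DVS-like) slice
slice_events = [(0.001, 10, 20, +1), (0.002, 10, 20, +1), (0.003, 50, 60, -1)]
ternary = events_to_ternary_grid(slice_events, height=260, width=346)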

Test in the sports hall at IMT Atlantique.

YOLOv5 quantized to 8 bits and trained on the DSEC-MOD dataset.


UAV platform

What about the power consumption of the whole system?

We designed a UAV prototype combining a Pixhawk board, an NVIDIA Jetson Orin Nano, and an event camera (DVXplorer Mini by iniVation).
During a static flight, we measured the power consumption of the whole system without any DNN model running: 242 W are required by the motors, the flight controller (Pixhawk), and the on-board computer (Jetson Orin Nano, power mode capped at 15 W). After measuring the power consumption of the event camera and of the Jetson board running the YOLOv5m model, we can conclude that the few watts consumed by artificial vision are not significant compared with the 242 W required by a static flight.
The few watts used by on-board artificial intelligence are worth the cost if they allow the power budget associated with the motors to be reduced!
For instance, if the UAV can learn to avoid obstacles with the best trajectory, as shown in the video below, less energy will be spent flying and more can be used to inspect the environment during Search-and-Rescue missions.

Power consumption of our drone prototype, during a static flight, without any DNN model running.
Power measurements in different conditions.
Experiment with our drone prototype avoiding a pole in the middle of the arena.
