This article introduces a multi-modal indoor dataset for event-based monocular depth estimation by mobile robots. The dataset was recorded on a humanoid platform and includes synchronized RGB, depth, event streams, and IMU data from Intel RealSense D435i, DAVIS346, and Prophesee EVK4 sensors. To provide a baseline, we implement a CycleGAN model that learns bidirectional mappings between the event-representation domain and the depth domain. We evaluate multiple state-of-the-art event representations and show that event-based inputs can outperform frame-only inputs in accuracy, perceptual quality, and geometric reliability. Together, the dataset and baseline provide a reproducible testbed for event-based perception in indoor mobile robotics.
We present a multi-modal indoor dataset for mobile robots that includes synchronized RGB, depth, event streams, and IMU data. It targets the challenges of event-based monocular depth estimation, as well as other perception tasks, in realistic indoor environments.
The Cycle Generative Adversarial Network (CycleGAN) is adapted to learn a bidirectional mapping between event-based representations H and monocular depth images Z. This approach uses two generators along with discriminators. The adversarial objective enforces realism in each target domain. Meanwhile, cycle-consistency and identity losses preserve geometric and structural information for robust depth estimation in indoor navigation scenarios.
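The objective described above can be written out as a sketch, following the standard CycleGAN formulation rather than the exact losses used for this baseline. The symbols $G: H \to Z$ and $F: Z \to H$ (the two generators), $D_Z$ and $D_H$ (assumed here to be one discriminator per domain), and the weights $\lambda_{\mathrm{cyc}}$ and $\lambda_{\mathrm{id}}$ are notation introduced for illustration; the log-likelihood form of the adversarial term is also an assumption, since implementations often substitute a least-squares variant.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{adv}}(G, D_Z) &= \mathbb{E}_{z \sim Z}\big[\log D_Z(z)\big]
  + \mathbb{E}_{h \sim H}\big[\log\big(1 - D_Z(G(h))\big)\big] \\
\mathcal{L}_{\mathrm{cyc}}(G, F) &= \mathbb{E}_{h \sim H}\big[\lVert F(G(h)) - h \rVert_1\big]
  + \mathbb{E}_{z \sim Z}\big[\lVert G(F(z)) - z \rVert_1\big] \\
\mathcal{L}_{\mathrm{id}}(G, F) &= \mathbb{E}_{z \sim Z}\big[\lVert G(z) - z \rVert_1\big]
  + \mathbb{E}_{h \sim H}\big[\lVert F(h) - h \rVert_1\big] \\
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{adv}}(G, D_Z) + \mathcal{L}_{\mathrm{adv}}(F, D_H)
  + \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}(G, F)
  + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}}(G, F)
\end{aligned}
```

In this form, $\mathcal{L}_{\mathrm{adv}}(F, D_H)$ mirrors the first line with the roles of the two domains swapped. The identity term only makes sense when event representations and depth images share the same tensor shape, which is a further assumption of this sketch.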
The CycleGAN baseline was trained for 100 epochs with a batch size of 8 and the Adam optimizer. Table 1 summarizes the most relevant hyperparameters.
| Parameter | Value |
|---|---|
| Epochs | 100 |
| Batch size | 8 |
| Learning rate (G) | 2 × 10⁻⁴ |
| Learning rate (D) | 2 × 10⁻⁴ |
| β1 | 0.5 |
| β2 | 0.999 |
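
The values in Table 1 could be collected in a small configuration object, as in the minimal sketch below. The class name `TrainConfig`, its field names, and the helper `adam_kwargs` are hypothetical and introduced only for illustration; they are not part of the released code. Only the numeric values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in Table 1."""

    epochs: int = 100
    batch_size: int = 8
    lr_generator: float = 2e-4      # Learning rate (G)
    lr_discriminator: float = 2e-4  # Learning rate (D)
    beta1: float = 0.5              # Adam first-moment decay
    beta2: float = 0.999            # Adam second-moment decay

    def adam_kwargs(self, network: str) -> dict:
        """Return Adam keyword arguments for 'generator' or 'discriminator'."""
        if network not in ("generator", "discriminator"):
            raise ValueError(f"unknown network type: {network!r}")
        lr = self.lr_generator if network == "generator" else self.lr_discriminator
        return {"lr": lr, "betas": (self.beta1, self.beta2)}


config = TrainConfig()
generator_opts = config.adam_kwargs("generator")
discriminator_opts = config.adam_kwargs("discriminator")
```

Assuming a PyTorch training loop, these dictionaries could be passed on as `torch.optim.Adam(model.parameters(), **generator_opts)`, where `model` stands for whichever generator or discriminator is being optimized.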
@misc{Bugueno2025MMIDEventDepth,
author="Bugueno-Cordova, Ignacio
and Luna, Gava
and Verschae, Rodrigo
and Ruiz-del-Solar, Javier
and Navarro-Guerrero, Nicolas",
title="Human-Robot Navigation using Event-based Cameras and Reinforcement Learning",
year="2025",
}