Deep Learning | Tianxiong Zhang

Recent Gains from reading research papers[May 3rd]

Fri, 03 May 2024 00:00:00 +0000

[1] Li M, Liu S, Zhou H. SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM[J]. arXiv preprint arXiv:2402.03246, 2024.

This paper introduces the SGS-SLAM semantic visual SLAM system, which is based on Gaussian Splatting technology and is capable of integrating appearance, geometry, and semantic features through multi-channel optimization. The system compensates for the deficiencies of traditional depth and color losses in object optimization with a unique semantic feature loss and avoids reconstruction errors caused by cumulative errors through a semantic-guided keyframe selection strategy.

Note 1: One of the core innovations of the article is that SGS-SLAM is the first semantic dense visual SLAM system based on a 3D Gaussian representation. It uses 2D semantic maps to learn a 3D semantic representation carried by the Gaussians, providing higher-fidelity reconstruction and better segmentation accuracy than previous NeRF-based methods.

Note 2: Another innovation of the article is the adoption of a multi-channel parameter optimization strategy, where appearance, geometry, and semantic signals jointly contribute to camera tracking and scene reconstruction. During the tracking phase, these different channels are used for keyframe selection, focusing on actively identifying objects seen early in the trajectory, thus achieving efficient and high-quality map reconstruction.
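The multi-channel idea can be sketched as a weighted sum of per-channel losses. This is a minimal illustration only: the L1 and cross-entropy forms, the function names, and the weights `w_color`, `w_depth`, `w_sem` are assumptions made for the sketch, not values or definitions taken from the paper.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two equally sized flat lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def semantic_loss(class_probs, labels, eps=1e-12):
    """Mean negative log-probability of the labeled class for each pixel."""
    total = 0.0
    for probs, label in zip(class_probs, labels):
        total -= math.log(max(probs[label], eps))
    return total / len(labels)


def multi_channel_loss(color, color_gt, depth, depth_gt, class_probs, labels,
                       w_color=0.5, w_depth=1.0, w_sem=0.1):
    """Combine appearance, geometry, and semantic terms (weights are hypothetical)."""
    return (w_color * l1_loss(color, color_gt)
            + w_depth * l1_loss(depth, depth_gt)
            + w_sem * semantic_loss(class_probs, labels))
```

In this sketch, a pixel whose color and depth already match can still contribute a gradient through the semantic term, which is the kind of compensation the summary above describes.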

Note 3: Because of the nature of the 3D Gaussian representation, geometry that is not observed from multiple views along the trajectory will inevitably be missing, which may leave holes after scene operations such as object removal or transformation. Future research could address this unobserved geometry with 3D geometric priors or scene inpainting techniques to improve the completeness of the scene representation.

Recent Gains from reading research papers[Apr 26th]

Fri, 26 Apr 2024 00:00:00 +0000

[1] Yan C, Qu D, Wang D, et al. Gs-slam: Dense visual slam with 3d gaussian splatting[J]. arXiv preprint arXiv:2311.11700, 2023.

The paper presents a novel dense visual SLAM system, GS-SLAM, which for the first time utilizes a 3D Gaussian representation method in a SLAM system. GS-SLAM achieves a better balance between efficiency and accuracy, and compared to recent SLAM methods using neural implicit representations, it adopts a real-time differentiable splatting rendering pipeline, significantly accelerating map optimization and RGB-D rendering.

Note 1: One of the core innovations of the paper is the adaptive 3D Gaussian expansion strategy. This strategy reconstructs newly observed scene geometry by adding new 3D Gaussians and removing noisy ones, and it improves the mapping of previously observed areas. This is crucial for extending the 3D Gaussian representation to reconstruct an entire scene, rather than synthesizing a static object as existing methods do.

Note 2: Another innovation of the paper is the introduction of a coarse-to-fine tracking technique. During the pose tracking process, an efficient coarse-to-fine technique is designed for selecting reliable 3D Gaussian representations to optimize the camera pose, thereby reducing runtime and enhancing the robustness of the estimation.

Note 3: The method in the paper may be limited under certain conditions if the quality of the depth data is not high. Moreover, although the 3D Gaussian representation performs well in rendering and mapping, in some cases, such as in more challenging scenes in the TUM-RGBD dataset, further optimization may be required to improve tracking accuracy.

Recent Gains from reading research papers[Apr 19th]

Fri, 19 Apr 2024 00:00:00 +0000

[1] Yan G, Pi J, Guo J, et al. OASim: an Open and Adaptive Simulator based on Neural Rendering for Autonomous Driving[J]. arXiv preprint arXiv:2402.03830, 2024.

This paper proposes OASim, an open and adaptive autonomous driving simulator, which generates high-fidelity autonomous driving data based on implicit neural rendering technology. It addresses the high cost, time consumption, and safety risks associated with real-world data collection through high-quality scene reconstruction, trajectory editing, a rich library of vehicle and sensor models, and a highly customizable data generation system. OASim utilizes advanced implicit surface reconstruction and 3D Gaussian splatting techniques, combined with an interactive visualization interface, allowing users to edit and preview vehicle trajectories and sensor configurations in real-time, thereby generating customized data suitable for multiple downstream applications of autonomous driving, such as perception and planning. Additionally, OASim demonstrates its experimental results in photorealistic rendering, novel viewpoint synthesis, diverse sensor configurations, and traffic flow simulation, verifying its effectiveness and advancement in autonomous driving data generation and simulation.

Note 1: One of the core innovations of the article is that OASim uses implicit neural surface reconstruction technology to achieve high-fidelity scene reconstruction. This method captures complex scene details by training a multilayer perceptron (MLP) to simulate the radiance and depth of each point in the scene, resulting in high-quality rendering outcomes.

Note 2: Another innovation of the article is the provision of an interactive interface that allows users to edit the trajectories of their own vehicle and other vehicles, flexibly configure sensor suites, and preview the data generated based on the edited trajectories in real-time. This interactivity provides users with a high degree of customization ability to generate data and scenarios that meet specific needs.

Note 3: Although the platform currently offers a rich library of sensor models, there may still be some types of sensors not covered. The focus is primarily on the generation of visual data, and in the future, it could explore the integration of more modal data (such as sound, temperature, etc.) to provide a more comprehensive simulation environment.

Recent Gains from reading research papers[Apr 12th]

Fri, 12 Apr 2024 00:00:00 +0000

[1] Zhu Z, Chen Y, Wu Z, et al. Latitude: Robotic global localization with truncated dynamic low-pass filter in city-scale nerf[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 8326-8332.

Neural Radiance Fields (NeRF) have achieved significant success in representing complex 3D scenes; however, existing NeRF-based pose estimators are prone to local optima during optimization and lack an initial pose prediction. To address these issues, the authors propose the LATITUDE method, which uses a two-stage localization mechanism: the first is a place recognition stage that provides an initial global pose estimate by training a regressor; the second is a pose optimization stage that minimizes the residual between observed and rendered images by directly optimizing the pose on the tangent plane. To avoid local optima, the authors introduce a coarse-to-fine pose registration using a Truncated Dynamic Low-pass Filter (TDLF).

Note 1: One of the core innovations of the article is the introduction of a two-stage global localization mechanism. In the place recognition stage, an initial global pose estimate is provided by training a regressor based on NeRF. This approach leverages the large-scale image data generated by NeRF to provide reliable initial values for global localization. In the pose optimization stage, the pose is optimized on the tangent plane, achieving a coarse-to-fine adjustment of the pose, which helps to improve the accuracy of localization.

Note 2: Another innovation of the article is the introduction of TDLF to avoid local optima during the optimization process. TDLF applies a smooth mask to the positional encoding of NeRF during optimization, allowing for dynamic adjustment from non-zero to full across different frequency bands. This coarse-to-fine optimization strategy helps to avoid local optima caused by high-frequency information, ensuring the stability and accuracy of the optimization process.
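A rough sketch of such a frequency mask is given below. The cosine easing, the `floor` value that keeps every band non-zero from the start, and the function names are assumptions for illustration; the exact filter used in the paper may differ.

```python
import math


def band_weights(progress, num_bands, floor=0.1):
    """Per-band weights that rise smoothly from `floor` to 1 as progress goes 0 -> 1.

    Low-frequency bands open first; `floor` (hypothetical) keeps the
    high-frequency bands slightly active instead of fully masked.
    """
    weights = []
    for k in range(num_bands):
        x = min(max(progress * num_bands - k, 0.0), 1.0)
        smooth = 0.5 * (1.0 - math.cos(math.pi * x))
        weights.append(floor + (1.0 - floor) * smooth)
    return weights


def masked_positional_encoding(x, progress, num_bands, floor=0.1):
    """Sin/cos positional encoding of a scalar with the band weights applied."""
    features = []
    for k, w in enumerate(band_weights(progress, num_bands, floor)):
        freq = (2.0 ** k) * math.pi
        features.append(w * math.sin(freq * x))
        features.append(w * math.cos(freq * x))
    return features
```

Early in the pose optimization (`progress` near 0) the residual is dominated by low frequencies, which gives a smoother loss landscape; by `progress = 1` all bands are passed at full weight.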

Note 3: The method presented in the paper demonstrated high precision in experiments but did not discuss in detail its performance in real-time or near-real-time applications. Real-time capability is a critical factor for practical robotic navigation systems. Future work could focus on improving the computational efficiency of the algorithm. Moreover, the paper primarily focused on vision-based localization methods. In practical applications, combining data from multiple sensors could improve the accuracy and robustness of localization. Future work could explore how to effectively integrate data from different sensors to further enhance the performance of the system.

Recent Gains from reading research papers[Apr 5th]

Fri, 05 Apr 2024 00:00:00 +0000

[1] C. Jiang et al., “H2-Mapping: Real-Time Dense Mapping Using Hierarchical Hybrid Representation,” in IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6787-6794, Oct. 2023, doi: 10.1109/LRA.2023.3313051.

This paper introduces a novel real-time dense mapping method named H2-Mapping, which is based on Neural Radiance Field (NeRF) technology. The innovation lies in its ability to achieve high-quality 3D map reconstruction and real-time capabilities on edge computing devices. The method proposes a novel hierarchical hybrid representation that utilizes implicit multi-resolution hash encoding and explicit octree-based Signed Distance Function (SDF) priors to describe scenes at varying levels of detail. This representation allows for rapid initialization of scene geometry and facilitates easier learning of scene geometry. Additionally, the paper presents a coverage-maximizing keyframe selection strategy to address the forgetting issue and enhance mapping quality, especially in edge regions.

Note 1: The hierarchical hybrid representation employed in this paper combines explicit octree SDF priors and implicit multi-resolution hash encoding to describe different levels of scene detail. By efficiently capturing coarse geometric shapes using octree SDF priors, the multi-resolution hash encoding can focus on residual geometric shapes, which are simpler than the complete geometry, thus improving geometric precision and convergence rates. This representation also achieves real-time dense mapping and dynamic scalability through rapid scene geometry initialization and geometry shapes that are easier to learn.
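The "coarse prior plus learned residual" split can be illustrated in one dimension. In this sketch, linear interpolation between coarse vertices stands in for interpolation inside an octree voxel, and `residual_fn` is a hypothetical placeholder for the hash-encoding-plus-decoder branch; neither is the paper's actual implementation.

```python
def interpolate_prior(x, vertex_positions, vertex_sdf):
    """Linearly interpolate the coarse SDF stored at sorted grid vertices."""
    for i in range(len(vertex_positions) - 1):
        x0, x1 = vertex_positions[i], vertex_positions[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return (1.0 - t) * vertex_sdf[i] + t * vertex_sdf[i + 1]
    raise ValueError("x lies outside the coarse grid")


def hybrid_sdf(x, vertex_positions, vertex_sdf, residual_fn):
    """Final SDF = explicit coarse prior + implicit residual correction."""
    return interpolate_prior(x, vertex_positions, vertex_sdf) + residual_fn(x)
```

Because the prior already gives a usable surface estimate, the residual only has to model small corrections, which is why the representation is described as easier to learn.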

Note 2: To achieve higher mapping accuracy, the paper proposes a coverage-maximizing keyframe selection strategy to address the critical forgetting issue in online mapping tasks. This strategy avoids redundant computation of samples across all keyframes, ensuring the quality of edge regions without increasing the number of training samples. Through this strategy, all allocated voxels are covered, improving the mapping quality with the least number of iterations, especially in edge areas.
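One simple way to picture coverage-driven selection is a greedy set-cover loop: repeatedly pick the keyframe that sees the most not-yet-covered voxels. The greedy rule, the `budget` parameter, and the data layout are assumptions for this sketch and are not claimed to match the paper's procedure.

```python
def select_keyframes(visible_voxels, budget):
    """Greedily choose up to `budget` keyframes that maximize voxel coverage.

    visible_voxels maps a keyframe id to the set of voxel ids it observes.
    Returns the chosen keyframe ids (in pick order) and the covered voxel set.
    """
    covered = set()
    chosen = []
    remaining = dict(visible_voxels)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda k: len(remaining[k] - covered))
        if not remaining[best] - covered:
            break  # no remaining keyframe adds new voxels
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered
```

For example, with `{"kf0": {1, 2, 3}, "kf1": {3, 4}, "kf2": {1, 2}}` the loop picks `kf0` and then `kf1`, and skips `kf2` because it adds no new voxels.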

Note 3: Current methods may face challenges when dealing with dynamic scenes, as dynamic objects can move or change during the mapping process. Future research could explore how to effectively integrate real-time data of dynamic objects to achieve accurate mapping of dynamic environments.

Recent Gains from reading research papers[Mar 29th]

Fri, 29 Mar 2024 00:00:00 +0000

[1] Charatan D, Li S, Tagliasacchi A, et al. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction[J]. arXiv preprint arXiv:2312.12337, 2023.

The paper presents pixelSplat, a 3D reconstruction method capable of reconstructing a 3D radiance field parameterized by 3D Gaussian primitives from a pair of images. The main features of this method include the ability to achieve real-time rendering and memory-efficient training, as well as rapid 3D reconstruction during inference. pixelSplat overcomes local minima issues by predicting a dense probability distribution of 3D positions and sampling Gaussian means from this distribution. Additionally, the method employs a reparameterization trick to make the sampling operation differentiable, thus allowing gradients to backpropagate through the Gaussian splatting representation.

Note 1: In real-world datasets, due to the limitations of Structure from Motion (SfM) software, the camera poses are only recovered up to an arbitrary scale factor, leading to scale ambiguity. The paper addresses this by designing a multi-view epipolar transformer that can reliably infer the scale factor for each scene. This method finds correspondences between two reference views and combines them with depth values encoded with positional information, enabling the model to correctly predict the position of each Gaussian primitive without knowing the global scale.

Note 2: In generalizable 3D reconstruction, directly optimizing Gaussian primitive parameters through gradient descent can easily fall into local minima. To address this problem, pixelSplat proposes a novel parameterization that predicts a probability distribution over Gaussian positions instead of directly predicting depth values, and makes the sampling operation differentiable through reparameterization. This way, as gradient descent increases the opacity of a Gaussian at a certain 3D location, the model also increases the probability that the location will be sampled again in the future, thus avoiding local minima while maintaining gradient flow.
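The link between opacity and sampling probability can be sketched per pixel as follows. Uniform depth buckets, bucket-center depths, and the function names are simplifying assumptions for illustration; the sketch shows only the forward pass, not the gradient computation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_depth(logits, near, far, rng):
    """Sample a depth bucket and tie the Gaussian's opacity to its probability.

    Because opacity equals the probability of the sampled bucket, a gradient
    that raises the opacity also raises the chance of sampling that bucket.
    """
    probs = softmax(logits)
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    bucket_width = (far - near) / len(probs)
    depth = near + (index + 0.5) * bucket_width
    opacity = probs[index]
    return depth, opacity
```

A call such as `sample_depth([0.0, 0.0, 50.0, 0.0], 1.0, 5.0, random.Random(0))` almost always lands in the third bucket, since nearly all probability mass sits there.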

Note 3: The paper primarily addresses the problem of 3D reconstruction from a pair of images. Future research could explore how to extend this method to a wider range of applications, such as reconstruction from multiple viewpoints or using different types of image data. Moreover, there may still be certain limitations when dealing with complex scenes and objects rich in detail. Future work could further improve the accuracy and detail of 3D reconstruction by enhancing the network structure, introducing more advanced prior knowledge, or combining other types of sensor data.

3D Reconstruction of Chengde Airport, China

Wed, 27 Mar 2024 00:00:00 +0000

Recent Gains from reading research papers[Mar 22nd]

Fri, 22 Mar 2024 00:00:00 +0000

[1] Zhou X, Lin Z, Shan X, et al. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes[J]. arXiv preprint arXiv:2312.07920, 2023.

This paper introduces DrivingGaussian, a framework specifically designed for efficiently representing and modeling dynamic autonomous driving scenarios. By employing Composite Gaussian Splatting, it decomposes the scene into a static background and multiple dynamic objects, reconstructs them separately, and composes them for global rendering in complex driving scenarios. The framework uses incremental static 3D Gaussians to gradually build the static background of the entire scene and a composite dynamic Gaussian graph to handle multiple moving objects, restoring their accurate positions and occlusion relationships in the scene. Furthermore, DrivingGaussian leverages LiDAR data as a prior to achieve more detailed scene reconstruction and maintain panoramic consistency. The method surpasses existing approaches in dynamic driving scene reconstruction and achieves high-fidelity, multi-camera-consistent panoramic view synthesis.

Note 1: Composite Gaussian Splatting is a core concept of DrivingGaussian. It decomposes the entire scene into a static background and dynamic objects, reconstructs each part separately, and finally integrates them for global rendering. Specifically, the paper first uses incremental static 3D Gaussians to build the full scene sequentially from the multiple surrounding camera views. Then, it uses a composite dynamic Gaussian graph to reconstruct each moving object individually and integrates the objects into the static background according to the graph. This allows for efficient representation and rendering of dynamic objects and static backgrounds in complex driving scenarios.
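The composition step can be pictured as moving each object's Gaussians from its own local frame into the world frame with that object's pose at the current timestamp, then merging them with the static background. The sketch below handles only Gaussian centers and a yaw-plus-translation pose; the data layout and names are hypothetical, and a full implementation would also rotate covariances and handle occlusion during rendering.

```python
import math


def object_to_world(points, yaw, translation):
    """Rotate object-frame points about the z axis, then translate them."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]


def compose_scene(static_means, dynamic_objects, timestamp):
    """Merge static Gaussian centers with every object visible at `timestamp`.

    Each object is a dict with "means" (object-frame centers) and "poses",
    a mapping from timestamp to (yaw, translation).
    """
    scene = list(static_means)
    for obj in dynamic_objects:
        pose = obj["poses"].get(timestamp)
        if pose is None:
            continue  # object is not present at this timestamp
        yaw, translation = pose
        scene.extend(object_to_world(obj["means"], yaw, translation))
    return scene
```

Keeping each object in its own frame means the same Gaussians can be reused at every timestamp, with only the pose changing.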

Note 2: DrivingGaussian uses LiDAR data as prior knowledge for Gaussian Splatting to recover more accurate geometric structures and maintain multi-view consistency. Compared with traditional Structure from Motion (SfM) based initialization methods, LiDAR prior provides more accurate geometric shape priors and comprehensive scene descriptions, not just as depth supervision for images. This method enables DrivingGaussian to demonstrate the potential to reconstruct large-scale dynamic scenes even without LiDAR data, and provide more accurate geometric structures and better multi-view consistency when LiDAR data is available.

Note 3: The paper mentions that the model faces certain challenges when dealing with extremely small objects (such as roadside stones) and materials with full reflection characteristics (such as glass mirrors and water surfaces). Future work can focus on improving Gaussian Splatting technology to better represent these areas with complex lighting and reflection characteristics.

3D Reconstruction of TB-20 aircraft using 3D Gaussian Splatting and Unreal Engine 5

Fri, 15 Mar 2024 00:00:00 +0000

Recent Gains from reading research papers[Mar 15th]

Fri, 15 Mar 2024 00:00:00 +0000

[1] Yan Y, Lin H, Zhou C, et al. Street gaussians for modeling dynamic urban scenes[J]. arXiv preprint arXiv:2401.01339, 2024.

This paper introduces a novel scene representation method termed “Street Gaussians,” designed for modeling dynamic urban street scenes from monocular video data. This approach addresses the limitations of existing technologies characterized by slow training and rendering speeds, as well as a heavy reliance on precise tracking of vehicular poses. Street Gaussians overcome these constraints by explicitly representing dynamic urban streets with point clouds, semantic logits, and 3D Gaussians. The method is capable of completing training within half an hour and rendering high-quality images at a rate of 133 frames per second.

Note1: The three-dimensional reconstruction of autonomous driving scenes for simulation environment construction leverages the creation of realistic urban street scenes to provide a virtual testing ground for autonomous vehicles, thereby reducing the costs and risks associated with real-vehicle testing. By generating high-quality virtual images and point cloud data, this technology aids in the development and training of sensor fusion algorithms within autonomous driving systems, such as the integration of data from radar, LiDAR, and cameras. It also enhances the scene understanding of autonomous vehicles, including vehicle detection, pedestrian recognition, and traffic sign interpretation, thus supporting more accurate and reliable decision-making.

Note2: The innovation of the article lies in its adoption of an explicit scene representation, wherein dynamic urban streets are depicted as a collection of point clouds with semantic logits and 3D Gaussians, simulating foreground vehicles or the background. This explicit representation allows for the easy combination of object vehicles and the background, facilitating scene editing operations, and is capable of completing training in half an hour, achieving real-time rendering. Moreover, to simulate the dynamics of foreground object vehicles, each object point cloud is optimized through an adjustable tracking pose, combined with a dynamic spherical harmonics model to handle dynamic appearances. This representation method allows for the use of poses provided by off-the-shelf trackers, achieving performance comparable to that using precise ground truth poses, even in the presence of noisy data.

Note3: Applying this technology to the apron, we can not only create finely detailed and lifelike apron scenes but also dynamically generate traffic flow within the apron, realizing a realistic digital twin platform. This platform can be used for the verification of various algorithms during the mixed operation of manned/unmanned vehicles.

Recent Gains from reading research papers[Mar 8th]

Fri, 08 Mar 2024 00:00:00 +0000

[1] Chen Y, Gu C, Jiang J, et al. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering[J]. arXiv preprint arXiv:2311.18561, 2023.

This paper introduces a novel dynamic scene representation method called Periodic Vibration Gaussian (PVG), specifically designed for the reconstruction and real-time rendering of large-scale dynamic urban scenes. The PVG method effectively captures both static and dynamic elements in a scene by incorporating periodic vibration-based temporal dynamics.

Note1: PVG is a unified representation model that introduces periodic vibration-based temporal dynamics, enabling each Gaussian point to have a specific lifespan and dynamically adjust its position and opacity based on temporal changes. This representation not only elegantly unifies static and dynamic scene elements but also captures the motion characteristics of dynamic objects, such as velocity and staticness, through periodic vibrations. By integrating periodic vibrations into the conventional 3D Gaussian splatting formulation, the PVG model provides a new approach to dynamic scene modeling.
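One plausible reading of this idea is that each Gaussian's center oscillates around a base position while its opacity peaks at a "life peak" time and decays away from it. The sinusoidal displacement and Gaussian-shaped opacity decay below, along with all parameter names, are assumptions of this sketch and should be checked against the paper's formulation.

```python
import math


def pvg_state(mean, velocity, opacity, peak_time, period, lifespan, t):
    """Return the time-dependent center and opacity of one Gaussian.

    The center vibrates along `velocity` with the given period, and the
    opacity falls off as t moves away from `peak_time` at a rate set by
    `lifespan`. A zero velocity with a long lifespan behaves like a static point.
    """
    phase = 2.0 * math.pi * (t - peak_time) / period
    scale = period / (2.0 * math.pi) * math.sin(phase)
    mean_t = tuple(m + scale * v for m, v in zip(mean, velocity))
    opacity_t = opacity * math.exp(-0.5 * ((t - peak_time) / lifespan) ** 2)
    return mean_t, opacity_t
```

Near `peak_time` the displacement is approximately `velocity * (t - peak_time)`, so `velocity` acts as the instantaneous velocity of the point, which matches the note's remark that motion characteristics are captured by the vibration.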

Note2: To enhance temporal coherence in representation learning with typically sparse training data, the article proposes a scene flow-based temporal smoothing mechanism that estimates the state changes between adjacent timestamps to smooth the scene flow. Additionally, to more effectively represent unbounded urban scenes, the article introduces a position-aware adaptive control strategy that adaptively adjusts the size of the point cloud based on its location, thereby reducing the number of required points while maintaining scene accuracy.

Recent Gains from reading research papers[Mar 1st]

Fri, 01 Mar 2024 00:00:00 +0000

[1] Liu Y, Zhang K, Li Y, et al. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models[J]. arXiv preprint arXiv:2402.17177, 2024.

This paper provides a comprehensive review of Sora, a text-to-video generative AI model released by OpenAI. Sora has the capability to create videos of scenes that are either realistic or imagined based on textual instructions, demonstrating potential in simulating the physical world. The paper discusses in detail the background of Sora, its underlying technologies, applications, challenges faced, and future directions for development, drawing on publicly available technical reports and reverse engineering.

Note1: To effectively process videos of varying resolutions and frame rates, Sora compresses videos into a low-dimensional latent space and then decomposes them into spacetime patches. These patches function analogously to tokens in language models, providing Sora with detailed visual phrases for video construction. Moreover, Sora employs a video compression network, or visual encoder, to reduce the dimensionality of input data, particularly raw videos. This network, based on Variational Autoencoders (VAE) or its variants like Vector Quantized Variational Autoencoder (VQ-VAE), transforms video frames into fixed-size patches, which are then encoded into the latent space. This allows the model to handle videos of different resolutions and frame rates. By compressing videos into latent space representations and then extracting a series of latent spacetime patches, these patches encapsulate the visual appearance and motion dynamics over short time spans. These patches are subsequently used to generate videos through a diffusion transformer model. During training, Sora may employ a cascading diffusion model architecture, which includes a base model and multiple spacetime refinement models. This architecture enables the model to enhance the quality and frame rate of videos through cascading refinement while maintaining high resolution.
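The patchification step can be sketched on a toy latent tensor. Since Sora's implementation is not public, this only illustrates the general idea of cutting a (time, height, width) latent into spacetime blocks and flattening each block into one token; the function name and the single-channel nested-list layout are assumptions.

```python
def spacetime_patches(latent, pt, ph, pw):
    """Split a latent video indexed as latent[t][y][x] into flattened patches.

    Each patch spans pt frames, ph rows, and pw columns and becomes one token.
    """
    frames, height, width = len(latent), len(latent[0]), len(latent[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                token = [latent[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                tokens.append(token)
    return tokens
```

Videos of different durations or resolutions simply yield different numbers of tokens, which is what lets a transformer process them with one architecture.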

Note2: Sora currently also faces some limitations that need improvement. Sora may struggle with understanding textual instructions regarding the placement or arrangement of objects and characters, which can lead to generated videos that do not match the expected spatial layout and temporal sequence. Additionally, the model may insert irrelevant elements when dealing with complex scenes containing multiple characters or elements, thereby altering the original composition and atmosphere of the scene. When simulating complex scenarios, Sora might not accurately handle physical principles such as interactions between objects, kinematics, and dynamics. This could result in generated videos with flaws in physical authenticity, such as unnatural object deformations or incorrect physical interactions. These are challenges that need to be overcome in subsequent developments.

Recent Gains from reading research papers[Feb 23rd]

Fri, 23 Feb 2024 00:00:00 +0000

[1] Tian R, Zhang Y, Feng Y, et al. Accurate and robust object SLAM with 3D quadric landmark reconstruction in outdoors[J]. IEEE Robotics and Automation Letters, 2021, 7(2): 1534-1541.

Object-oriented SLAM is a cutting-edge technology in the fields of autonomous driving and robotics. This paper introduces a stereo vision SLAM system with a robust quadric landmark representation method. The system comprises four components: deep learning-based detection, quadric landmark initialization, object data association, and object pose optimization. Quadric-based SLAM algorithms face observation-related challenges and are sensitive to observation noise, which limits their application in outdoor scenes. To address this issue, the paper proposes a quadric initialization method based on separating the quadric parameters, enhancing robustness to observation noise. An adequate object data association algorithm and object-oriented optimization with multiple cues enable highly accurate object pose estimation and robustness to local observations.

Note1: The innovation of this paper lies in: 1. The introduction of an algorithm based on Separating Quadric Parameters (SQP) for the initialization of 3D ellipsoidal landmarks. This method independently estimates the translation and yaw rotation of the ellipsoid center, improving robustness to observation noise. 2. The adoption of a novel Object Data Association (ODA) algorithm that combines semantic inlier distribution, motion prediction based on Kalman filtering, and ellipsoidal projection to achieve precise data association. 3. The implementation of a real-time stereo vision SLAM system that utilizes precise and robust ellipsoids to represent objects, aiming to build an object-oriented and semantically enhanced map for outdoor navigation.
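The Kalman-filter motion prediction used in data association can be illustrated with the standard prediction step. The one-dimensional constant-velocity state and the simple diagonal process noise below are assumptions for the sketch; the summary above does not specify the paper's state vector or noise model.

```python
def kalman_predict(state, covariance, dt, process_noise):
    """Predict step for a 1D constant-velocity model.

    state is (position, velocity); covariance is a 2x2 nested list.
    Implements x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]]
    and Q = process_noise * I.
    """
    position, velocity = state
    (p00, p01), (p10, p11) = covariance
    new_state = (position + dt * velocity, velocity)
    new_covariance = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + process_noise, p01 + dt * p11],
        [p10 + dt * p11, p11 + process_noise],
    ]
    return new_state, new_covariance
```

The predicted position gives a gate for where a tracked object should reappear in the next frame, so a new detection far from the prediction can be rejected as a different object.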

Note2: The accuracy of data association in scenes with dynamic objects, such as moving vehicles and pedestrians, may require further improvement to prevent incorrect object associations leading to inaccurate ellipsoid initialization. Moreover, there may be room for enhancement in the system’s real-time performance and computational efficiency, especially when dealing with large-scale or complex scenes.

Recent Gains from reading research papers[Jan 26th]

Fri, 26 Jan 2024 00:00:00 +0000

[1] Nair G B, Daga S, Sajnani R, et al. Multi-object monocular SLAM for dynamic environments[C]//2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020: 651-657.

The paper addresses the problem of multi-object SLAM (Simultaneous Localization and Mapping) using a monocular camera. “Multi-object” signifies that the algorithm tracks both the camera’s motion and the motion of other dynamic entities within the scene. A typical challenge in dynamic environments is unobservability: a moving object cannot be triangulated unambiguously from a moving monocular camera. Existing solutions suffer from relative scale ambiguity, meaning that there are infinitely many solutions for each pair of motions within the scene. The innovation of this paper lies in resolving this issue by leveraging single-view metrology, advances in deep learning, and category-level shape estimation. The paper proposes a multi-pose graph optimization formulation to address the ambiguities associated with the relative and absolute scale factors involved.

Note1: The novelty of this paper is the introduction of a multi-object monocular SLAM system for dynamic environments, which is the first practical system of its kind. It can perform dynamic multi-object tracking and self-localization within a unified framework. By combining single-view metrology with deep-learning-based category-level shape estimation, the paper resolves the relative scale ambiguity present in dynamic scenes. Category-level shape estimation is a method worth emulating in future research.

Note2: The algorithm of the article may not accurately locate keypoints for objects at a greater distance, which could affect the overall performance of the system. Improving the detection and tracking of distant objects could be a direction for future work. Moreover, the current implementation may not be suitable for applications requiring real-time responses, such as autonomous driving. Further optimization of the algorithm to increase processing speed could be an important direction for improvement. Although the current method has made some progress in resolving scale ambiguity, there is still room for further in-depth research in this field, especially under various types of scenes and conditions.

Recent Gains from reading research papers[Jan 12th]

Fri, 12 Jan 2024 00:00:00 +0000

[1] Doherty K, Fourie D, Leonard J. Multimodal semantic slam with probabilistic data association[C]//2019 international conference on robotics and automation (ICRA). IEEE, 2019: 2419-2425.

Semantic SLAM problems can be decomposed into discrete inference problems, namely determining object class labels and measurement-landmark correspondences (the data association problem), and continuous inference problems, namely estimating the robot poses and the object locations in the environment. However, under ambiguous data association this becomes a non-Gaussian inference problem, while most previous work has focused on Gaussian inference. The paper proposes a solution that represents the association hypotheses as multiple modes of an equivalent non-Gaussian sensor model. Non-parametric belief propagation is then used to solve the resulting non-Gaussian inference problem.

Note1: The innovation of the paper is mainly reflected in the introduction of multi-modal sensor models. The paper proposes a new method that represents hypotheses of data association as multiple modes of an equivalent non-Gaussian sensor model, which can better handle SLAM problems in non-Gaussian environments. In addition, to address non-Gaussian inference problems, the paper adopts non-parametric belief propagation. This method can effectively approximate non-Gaussian posterior, dealing with the uncertainty of complex data association and landmark categories.
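The "hypotheses as modes" idea can be sketched as a mixture likelihood: instead of committing to one landmark, the measurement likelihood sums a Gaussian term per candidate landmark, weighted by the association probability. The one-dimensional range measurement, the shared `sigma`, and the function names are assumptions made for this illustration.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1D normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def multimodal_likelihood(measurement, robot_position, landmark_positions,
                          association_weights, sigma):
    """Likelihood of a measurement under uncertain data association.

    Each candidate landmark contributes one mode, centered on the measurement
    expected if that landmark were the true source.
    """
    total = 0.0
    for landmark, weight in zip(landmark_positions, association_weights):
        expected = landmark - robot_position
        total += weight * gaussian_pdf(measurement, expected, sigma)
    return total
```

With two equally likely landmarks the likelihood has two peaks, so the posterior over the robot position is multi-modal rather than Gaussian, which is why a non-parametric solver is needed.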

Note2: Although the proposed method can handle uncertain associations, it relies on hard decisions when adding new landmarks. Representing this uncertainty is an important step toward more closely integrating data association and SLAM. Furthermore, while the paper presents a method for handling complex data association, the computational cost of calculating association probabilities may be large. Future work could explore solutions to this computational complexity issue, such as using Dirichlet process priors or approximate matrix permanents. The paper assumes a simple geometric model, focusing on comparing data association methods. Another future direction is to apply new geometric representations, such as quadrics, to further improve the performance and accuracy of SLAM.

Recent Gains from reading research papers[Jan 5th]

Fri, 05 Jan 2024 00:00:00 +0000

[1] Zhou J, Elksnis A, Fu Z, et al. MultiMap3D: A Multi-Level Semantic Perceptual Map Construction Based on SLAM and Point Cloud Detection[C]//2023 28th International Conference on Automation and Computing (ICAC). IEEE, 2023: 1-6.

This paper presents a real-time multi-level semantic map generation approach based on the recent advancements in anchor-free methods for point cloud object detection in terms of computation time and accuracy. Combining visual SLAM, ground segmentation, point cloud object detection, and data association strategies for major semantic instances, a hierarchical graph is constructed in real-time. The multi-level maps consist of four layers: three-dimensional semantic metric layer, two-dimensional semantic grid layer, object layer, and room layer. A data association strategy is designed to track object instances across multiple frames and maintain a database of tracked objects. The paper also proposes a simple and effective method for room recognition using the SVC algorithm. Options for generating and blending training data are explored to address the limited real-world data for training SVC.

Note1: Few researchers in current studies have combined geometric reconstruction (such as SLAM/SFM and multi-view stereo) with deep learning-based semantic segmentation or object detection methods to build multi-level semantic maps. Navigation and path planning using metric maps alone lead to significant storage and redundancy issues. Therefore, it is necessary to integrate low-level obstacle avoidance and motion planning with high-level task planning by constructing maps with multi-level semantic information. This allows robots to quickly and efficiently capture different abstraction levels of reality and perform more complex tasks in indoor scenarios.

Note2: The highlight of this work lies in the fusion of target detection with SLAM, generating semantic SLAM maps with 3D object detection boxes. Additionally, it delves into understanding how the preprocessing of raw 3D data affects the performance of laser odometry. However, the method has not been tested in outdoor scenarios yet, and future work involves optimizing pose estimation and extending the algorithm to apron usage.

Recent Gains from reading research papers[Dec 22nd]

Fri, 22 Dec 2023 00:00:00 +0000

[1] Hosseinzadeh M, Li K, Latif Y, et al. Real-time monocular object-model aware sparse SLAM[C]//2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019: 7123-7129.

The article constructs a Semantic SLAM system based on sparse point clouds, integrating real-time deep learning object detection with a monocular SLAM framework. This achieves a quadric surface representation for general objects and ensures real-time performance. The work is primarily based on the traditional ORB-SLAM algorithm, incorporating CNNs to finely reconstruct identified objects and provide shape priors for the quadric surfaces. To capture the main structure of the scene, additional plane landmarks are detected through a CNN-based plane detector and modeled as independent landmarks in the map.

Note1: The innovation of the paper lies in the use of sparse point clouds, unlike conventional Object-SLAM that often relies on dense point clouds. By merging the object detector, the paper achieves dense reconstruction of identified objects, realizing Semantic SLAM while significantly reducing computational complexity. Additionally, the article utilizes a CNN network to extract plane information. Data association between planes is mainly obtained through shared keypoints, normals, and distance relationships. An added object point cloud generation network places specific object detection boxes into a CNN network, generating a point cloud to represent the 3D shape of the object, then estimating alignment parameters with the ellipsoid in SLAM after normalizing the bounding ellipsoid. The generated object point cloud can be used to provide object shape priors, introducing additional constraints in optimization.

Note2: Previous representations of dual quadric surfaces for general objects had certain constraints: (1) From the frontend perspective, such as a) dependence on the depth channel for plane segmentation and parameter regression, b) precomputation of object detection based on Faster R-CNN to allow real-time performance, and c) self-organized object and plane matching/tracking. (2) From the backend perspective: a) assuming conic observations are axis-aligned, limiting the robustness of quadric surface reconstruction, b) keeping all detected landmarks in a single global reference frame. This work not only addresses the aforementioned constraints but also introduces new factors applicable to real-time inclusion of planes and object detection while integrating detailed point cloud reconstruction from deep learning CNN into the map. The algorithm’s final output remains quadric surface bounding boxes, and future work may consider further constraint elimination using more precise expressions.

Recent Gains from reading research papers[Dec 14th]

Thu, 14 Dec 2023 00:00:00 +0000

[1] Moreno F M, Guindel C, Armingol J M, et al. Study of the Effect of Exploiting 3D Semantic Segmentation in LiDAR Odometry[J]. Applied Sciences, 2020, 10(16): 5657.

The paper investigates how preprocessing point clouds using 3D semantic segmentation affects the accuracy of laser (LiDAR) odometry. It analyzes the estimated trajectories when filtering the raw data with semantic information. Different filtering configurations are tested: raw (original point cloud), dynamic (removing dynamic obstacles from the point cloud), dynamic vehicles (removing vehicles), long-range (removing distant points), ground (removing ground points), and structural (keeping only structural and object points in the point cloud). The conclusions drawn from this work are significant for improving the efficiency of laser odometry algorithms in various scenarios.

Note1: The main contribution of the paper is to gain insights into how preprocessing of raw 3D data affects the performance of laser odometry. This analysis specifically focuses on a state-of-the-art method called LOAM, which is a high-performance technique that utilizes only laser scanner information and serves as the basis for many literature works.

Note2: The highlight of this work lies in exploring the behavior of laser odometry by proposing six different input configurations. For each configuration, a set of newly filtered data is generated and provided to the odometry algorithm for evaluation. The experimental results confirm the effectiveness of preprocessing the data before inputting it into the localization algorithm. Furthermore, the type of driving environment has a decisive impact on the applicability of each filtering method.
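
The six input configurations can be sketched as simple label-based filters. This is only an illustration under stated assumptions: the label names, the class groupings, and the 30 m range cutoff are hypothetical and depend on the segmentation network and dataset actually used.

```python
import math

# Hypothetical label groupings; the real class set depends on the segmentation network.
DYNAMIC = {"car", "truck", "person", "bicyclist"}
VEHICLES = {"car", "truck"}
GROUND = {"road", "sidewalk", "terrain"}
STRUCTURAL = {"building", "pole", "fence", "traffic_sign"}

def filter_points(points, config, max_range=30.0):
    """Return the points kept for one configuration.

    points: list of (x, y, z, label) tuples.
    max_range: assumed cutoff (in meters) for the long-range configuration.
    """
    def keep(point):
        x, y, z, label = point
        if config == "raw":
            return True
        if config == "dynamic":
            return label not in DYNAMIC
        if config == "dynamic_vehicles":
            return label not in VEHICLES
        if config == "long_range":
            return math.sqrt(x * x + y * y + z * z) <= max_range
        if config == "ground":
            return label not in GROUND
        if config == "structural":
            return label in STRUCTURAL
        raise ValueError(f"unknown configuration: {config}")

    return [p for p in points if keep(p)]

cloud = [
    (1.0, 0.0, 0.0, "road"),
    (5.0, 2.0, 0.5, "car"),
    (40.0, 0.0, 1.0, "building"),
    (3.0, -1.0, 1.0, "person"),
    (8.0, 4.0, 2.0, "pole"),
]
for name in ("raw", "dynamic", "dynamic_vehicles", "long_range", "ground", "structural"):
    print(name, len(filter_points(cloud, name)))
```

Each filtered cloud would then be passed to the odometry algorithm in place of the raw scan.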

Recent Gains from reading research papers[Dec 7th]

Thu, 07 Dec 2023 00:00:00 +0000

[1] Rosinol A, Leonard J J, Carlone L. Nerf-slam: Real-time dense monocular slam with neural radiance fields[J]. arXiv preprint arXiv:2210.13641, 2022.

The paper proposes a novel pipeline for accurate and real-time reconstruction of scenes from monocular images using geometric and photometric 3D mapping. The work leverages recent advancements in dense monocular SLAM and real-time hierarchical volume-based neural radiance fields (NeRF). Dense monocular SLAM refers to specific methods that solely utilize monocular images to provide accurate scene pose estimation and depth maps. Neural radiance fields are neural networks that model the 3D structure and appearance of a scene by modeling the radiance (light intensity) at each point in space. By combining information from dense monocular SLAM with neural radiance fields, it becomes possible to create real-time, photometrically accurate scene maps. The use of uncertainty-based depth loss helps improve the photometric and geometric accuracy of the maps by addressing the uncertainty in depth estimates provided by the dense monocular SLAM system. By considering this uncertainty, the proposed method achieves significantly better results in terms of photometric and geometric accuracy compared to competing methods, with a 179% improvement in peak signal-to-noise ratio (PSNR) and an 86% improvement in L1 depth, which is remarkable and indicates a new direction for SLAM research.
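
The effect of an uncertainty-based depth loss can be illustrated with a minimal sketch. It assumes a per-pixel standard deviation is available for each SLAM depth estimate; the function name and the exact normalization are illustrative choices, not the paper's formulation.

```python
def weighted_depth_loss(rendered, measured, sigma):
    """Mean squared depth residual, with each pixel scaled by 1 / sigma^2.

    rendered: depths rendered from the radiance field.
    measured: depths estimated by the dense SLAM front end.
    sigma:    assumed per-pixel standard deviation of the measured depths.
    """
    if not (len(rendered) == len(measured) == len(sigma)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for d_hat, d, s in zip(rendered, measured, sigma):
        total += ((d_hat - d) / s) ** 2
    return total / len(rendered)

# Same 0.5 m residual at both pixels, but the first depth is far more certain.
loss = weighted_depth_loss([2.0, 2.0], [2.5, 2.5], [0.1, 1.0])
print(loss)
```

A confident depth estimate (small sigma) dominates the loss, while a noisy one barely constrains the map, so unreliable depths do not corrupt the reconstruction.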

Note1: The breakthrough of this work lies in proposing the first 3D scene reconstruction pipeline that combines the advantages of dense monocular SLAM and hierarchical volume-based neural radiance fields. The core idea is to use a monocular dense SLAM method to estimate camera poses and dense depth maps along with their uncertainties and utilize this information as supervision to train the NeRF scene representation. This approach constructs accurate radiance fields from the image stream without requiring pose or depth as input and operates in real-time. It achieves state-of-the-art performance among monocular methods on the Replica dataset. Future work can build upon this research.

Note2: NeRF was initially developed for image rendering, i.e., generating an image from a given camera view. NeRF is built upon the assumption of known camera poses, but in most robotic applications, the camera poses are unknown. Consequently, more recent work applies NeRF techniques to simultaneously estimate camera poses and model the environment, known as NeRF-based SLAM.

Recent research has shown that, given sufficiently good initial estimates, known camera poses are not strictly necessary. Therefore, real-time pose-agnostic NeRF reconstruction can produce accurate 3D maps. Overall, the article leverages recent research advancements in dense monocular SLAM (Droid-SLAM), probabilistic volume fusion (Rosinol et al.), and hash-based hierarchical volume neural radiance fields (Instant-NGP) to estimate the geometry and photometry of scenes in real-time without the need for depth images or poses.

Note3: The fusion of deep learning and traditional geometry is a trend in SLAM development. In the past, some modules in SLAM that relied on single-point methods have been replaced by neural networks, such as feature extraction (SuperPoint), feature matching (SuperGlue), loop closure (NetVlad), and depth estimation (MonoDepth), among others. Compared to single-point replacements, NeRF-based methods present a completely new framework that can replace traditional SLAM end-to-end, both in terms of design methodology and implementation architecture.

Compared to traditional SLAM, NeRF-based methods have the following advantages:

-Directly operate on raw pixel values without feature extraction. The error is regressed to the pixels themselves, resulting in more direct information transfer and an optimization process that yields immediate visual results.

-Both implicit and explicit map representations can be differentiable, allowing for full-dense optimization of the map (traditional SLAM typically struggles to optimize dense maps and usually only optimizes a limited number of feature points or updates the map coverage).

Recent Gains from reading research papers[Dec 1st]

Fri, 01 Dec 2023 00:00:00 +0000

[1] Li R, Li S J, Chen X, et al. TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation[J]. arXiv preprint arXiv:2309.07849, 2023.

The paper presents a range-image-based LiDAR semantic segmentation method that utilizes temporal information to address the “many-to-one” problem caused by the limited horizontal and vertical angular resolution of range images. This problem can result in approximately 20% of 3D points being occluded, leaving LiDAR semantic segmentation unable to accurately and robustly understand the surrounding environment. Specifically, the article incorporates a temporal fusion layer to extract useful information from previous scans and integrate it with the current scan. Then, a post-processing technique based on maximum voting is designed to correct erroneous predictions, particularly those caused by the “many-to-one” problem.

Note1: The so-called “many-to-one” problem refers to the influence of boundary blurring effects on the range view representation. The main cause of this problem is the limited horizontal and vertical angular resolution: when several 3D points share (nearly) the same vertical and horizontal angles, they are all projected onto the same range image pixel. Handling such issues should be considered in subsequent work.
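
The “many-to-one” effect is easy to reproduce with a spherical projection. The sketch below is an illustration only: the image size and vertical field of view are assumed values typical of a 64-beam sensor, and the function name is hypothetical.

```python
import math

def project_to_range_image(points, width=2048, height=64,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map each 3D point to a (row, col) pixel via spherical projection.

    Returns a dict from pixel to the list of point indices that land on it.
    Image size and field of view are assumed values, not taken from the paper.
    """
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    pixels = {}
    for idx, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        yaw = math.atan2(y, x)
        pitch = math.asin(z / r)
        col = int(0.5 * (1.0 - yaw / math.pi) * width) % width
        row = int((1.0 - (pitch - fov_down) / fov) * height)
        row = min(max(row, 0), height - 1)
        pixels.setdefault((row, col), []).append(idx)
    return pixels

# Two points on the same ray (10 m and 20 m ahead) plus one point to the side.
pixels = project_to_range_image([(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 10.0, 0.0)])
hidden = sum(len(indices) - 1 for indices in pixels.values())
print(len(pixels), hidden)
```

Only one point per pixel can be stored in the range image, so every extra index in a pixel's list corresponds to a 3D point that receives no direct prediction.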

Note2: The highlight of this work lies in the incorporation of temporal information in segmentation or detection, inspired by human visual perception. Temporal information is crucial for understanding object motion and identifying occlusions. By combining temporal information, severely occluded points can be captured from adjacent range image scans. The post-processing during the inference stage can also benefit from this approach. The article proposes a post-processing scheme based on maximum voting, which effectively leverages predictions from past frames.
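
A minimal sketch of voting over past predictions follows. It assumes the predictions for the same 3D point across consecutive scans have already been associated, and the tie-breaking rule (prefer the most recent scan) is an assumption of this sketch rather than something stated in the paper.

```python
from collections import Counter

def max_vote(label_history):
    """Return the most frequent label for one 3D point.

    label_history: labels predicted for the same point in consecutive scans,
    oldest first. Ties are resolved in favor of the most recent prediction.
    """
    if not label_history:
        raise ValueError("label_history must not be empty")
    counts = Counter(label_history)
    best = max(counts.values())
    for label in reversed(label_history):
        if counts[label] == best:
            return label

# A point hidden in the current range image inherits the wrong label once.
print(max_vote(["car", "car", "road"]))
```

A single erroneous prediction caused by occlusion in the current scan is outvoted by the consistent predictions from earlier scans.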

Recent Gains from reading research papers[Nov 24th]

Fri, 24 Nov 2023 00:00:00 +0000

[1] Li P, Ding S, Chen X, et al. PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird’s-Eye View[J]. arXiv preprint arXiv:2306.10761, 2023.

The paper proposes an end-to-end framework, POWERBEV, for vehicle motion prediction based on bird’s-eye view (BEV). The framework differs in several design choices, reducing inherent redundancies in previous methods. Firstly, POWERBEV predicts the future not through auto-regressive modeling but with parallel multi-scale modules built using lightweight 2D convolutional networks. Secondly, the article employs segmentation and centripetal inverse flows for prediction, simplifying the previous multitask objectives by eliminating redundant output patterns. Based on this output representation, a simple flow-based post-processing method is proposed, which generates more stable instance associations.

Note1: Currently, multi-sensor data fusion and feature extraction on bird’s-eye view (BEV) are commonly used and have achieved good results. However, there is limited research on motion prediction in this regard. Traditional prediction methods based on multi-view surround cameras rely on multitask auto-regressive settings and complex post-processing to predict future instances in a spatiotemporally consistent manner. This article deviates from this paradigm.

Note2: The current primary approach to predicting vehicle motion trajectories decouples the task into separate modules. In this mode, objects of interest are detected and localized through complex perception models, and associations are made across multiple frames. Then, using parameterized trajectory models, the potential future motion of these detected objects is predicted based on their past motion. However, because perception and motion prediction are handled by separate models, the entire system is vulnerable to errors made in the first stage. Therefore, I personally believe that an end-to-end model design can be applied to future perception, localization, prediction, and other aspects. This approach can effectively alleviate the aforementioned issues and reduce data redundancy, presenting a feasible path forward.

Recent Gains from reading research papers[Nov 17th]

Fri, 17 Nov 2023 00:00:00 +0000

[1] Zhang H, Xie C, Toriya H, et al. Vehicle Localization in a Completed City-Scale 3D Scene Using Aerial Images and an On-Board Stereo Camera[J]. Remote Sensing, 2023, 15(15): 3871.

Navigation in autonomous driving cars currently relies heavily on high-precision maps. However, the production of high-precision maps is both complex and expensive, posing challenges for commercialization. Therefore, the paper proposes a global positioning system using low-precision urban-scale 3D scene maps reconstructed by UAVs to optimize the visual positioning of vehicles. To address the differences in image information due to different aerial and ground viewpoints, the paper introduces a wall complementary algorithm based on building geometries to refine the city-scale 3D scene. The paper also develops a 3D-to-3D feature alignment algorithm to determine the vehicle position by combining the optimized city-scale 3D scene with the local scene generated by the on-board stereo camera.

Note1: The innovation of the paper is that it proposes a SLAM system that fuses bird’s-eye view and ground view while reconstructing a 3D large-scale scene model. The UAV vision sensor provides an aerial view of the city. This aerial view is used to reconstruct a prior 3D large-scale scene model at the commercial district scale. The ground vision sensor consists of two pairs of stereo cameras mounted on the vehicle to capture a ground view. This ground view is used to localize the vehicle within the larger scene.

Note2: The construction of high-precision maps within the apron has been conceptualized before. The limiting factor is that the reconstruction requires a large number of photographs taken from different angles within the actual aircraft stand, which is difficult to achieve. The reconstruction and experimental validation in this paper were also not carried out on actual scenes, but in a computer graphics (CG) simulator. Scene acquisition and validation in a simulator provides ideas for the construction of high-precision maps within the apron.

Recent Gains from reading research papers[Nov 10th]

Fri, 10 Nov 2023 00:00:00 +0000

[1] Yang S, Scherer S. CubeSLAM: Monocular 3D object detection and SLAM without prior models[J]. arXiv preprint arXiv:1806.00557, 2018.

This paper proposes a single-image 3D cuboid object detection method and a multi-view object SLAM approach without prior object models, and demonstrates that the two aspects can benefit from each other. For 3D detection, the paper generates high-quality cuboid proposals from 2D bounding boxes and vanishing point sampling. The proposals are further scored and selected to align with image edges. Experiments on SUN RGBD and KITTI demonstrate its efficiency and accuracy over existing methods. Then in the second part, multi-view bundle adjustment with novel measurements is proposed to jointly optimize camera poses, objects and points using single-view detection results. Objects can provide more geometric constraints and scale consistency than points.

Note1: Most existing monocular approaches address object detection and SLAM separately and also rely on existing CAD models of objects that may not be applicable to general environments. The innovation of this work lies in focusing on prior-free 3D object mapping and jointly addressing 3D object detection and multi-view object SLAM in one system.

Note2: This approach demonstrates for the first time that semantic object detection and geometric SLAM can be mutually beneficial in a unified framework, and the inclusion of 3D LIDAR can be considered in the future and integrated into a single framework.

Recent Gains from reading research papers[Nov 3rd]

Fri, 03 Nov 2023 00:00:00 +0000

[1] Hariya, Keigo, Hiroki Inoshita, Ryo Yanase, Keisuke Yoneda, and Naoki Suganuma. 2023. “ExistenceMap-PointPillars: A Multifusion Network for Robust 3D Object Detection with Object Existence Probability Map” Sensors 23, no. 20: 8367. https://doi.org/10.3390/s23208367

This paper proposes “ExistenceMap-PointPillars”, a LiDAR-camera multilevel fusion framework for 3D object detection, which is mainly used to cope with challenging and unfavorable conditions, e.g., at night or in rainy weather. Its core concept revolves around the integration of pseudo-2D maps, which probabilistically depict the estimated object presence areas obtained from the fused sensor data. These maps are then merged into pseudo-images generated from 3D point clouds.

Note1:

The generation of pseudo-2D maps serves the following main purposes:

Reduce computational complexity: 3D point cloud data usually contains a large amount of information, and processing this data requires significant computational resources and time. By projecting 3D data onto a 2D plane, the complexity of the data can be greatly reduced, thus reducing computational resources and time.

Simplifying the problem: It is usually simpler to deal with problems on a 2D plane than in 3D space. 2D object detection is a problem that has been extensively studied in computer vision, with many well-established methods and algorithms available for reference and use. 3D object detection, on the other hand, is relatively new and methods and algorithms are still under development.

Effective utilization of information: Although pseudo-2D maps lose some 3D information (e.g., height), it still retains most of the important information, such as the position, shape and size of objects. In addition, by designing appropriate projection methods and feature extraction algorithms, we can encode useful 3D information into 2D maps.

Enhanced Visualization: Pseudo-2D maps can be easily visualized, which helps in understanding and parsing the data, as well as checking and debugging the algorithms.

Therefore, generating pseudo-2D maps is a commonly used technique when working with 3D data, and it can be used for in-situ vehicle detection, map construction, etc.
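
As a rough illustration of projecting 3D data onto a 2D plane, the sketch below counts points per ground-plane cell. It is a simplification: the paper's maps are probabilistic existence maps built from fused sensor data, whereas this only shows the basic 3D-to-2D projection step, and the grid extent and cell size are assumed values.

```python
def points_to_bev_grid(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), cell=0.5):
    """Count points per 2D ground-plane cell; the height (z) is discarded.

    x_range, y_range and cell (all in meters) are assumed values for this sketch.
    """
    cols = int((x_range[1] - x_range[0]) / cell)
    rows = int((y_range[1] - y_range[0]) / cell)
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # point lies outside the mapped area
        c = int((x - x_range[0]) / cell)
        r = int((y - y_range[0]) / cell)
        grid[r][c] += 1
    return grid

# Two points on the same object at different heights fall into one cell.
grid = points_to_bev_grid([(1.2, 0.1, 0.3), (1.3, 0.2, 1.8), (100.0, 0.0, 0.0)])
print(grid[40][2])
```

The height difference between the two points is lost, which is the information trade-off mentioned above, but the object's position on the ground plane is preserved.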

Note2: The method has a reduced number of true detections in the outer part of the error ellipse, which may result in failure to detect actually present objects in some cases and limited recall improvement for distant objects.

Ideas for improvement are considered as follows:

Enhancement of object detection in the outer part of the error ellipse: including improved feature extraction and classification algorithms, or the use of more sophisticated models to deal with complex backgrounds and object shapes.

Improve the pseudo-2D map generation process: consider using different projection methods or introducing more features to improve the ability of pseudo-2D maps in representing complex 3D world information.

Introducing more modalities: e.g. millimeter-wave radar, etc. This allows objects to be detected from more angles and depths, thus improving the comprehensiveness and accuracy of the detection.

Recent Gains from reading research papers[Oct 27th]

Fri, 27 Oct 2023 00:00:00 +0000

[1]K. Yoneda, N. Ichihara, H. Kawanishi, T. Okuno, L. Cao and N. Suganuma, “Sun-Glare region recognition using Visual explanations for Traffic light detection,” 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 2021, pp. 1464-1469, doi: 10.1109/IV48863.2021.9575631.

Recognition of traffic signals is an important task in autonomous vehicles, and extracting traffic light color information is crucial. This article develops a method to recognize sun glare regions in images using Convolutional Neural Network visual interpretation in order to address the situation where traffic lights and the sun overlap ahead and the images captured by the camera are overexposed, risking false detection. The CNN outputs an attention map using the Grad-CAM method, which can then be processed through time series to estimate the global orientation of the sun glare region. By estimating the direction of sunlight that reduces visibility such as direct sunlight and reflected light from buildings, it helps to achieve robust image recognition.

Note1: The estimated posterior probability distribution can be used not only to compute the sun direction, but also to understand the areas of reduced visibility in the current image. If the image is overexposed, the apparent color of the traffic light also changes, and even the human eye may have difficulty distinguishing the light color. In order to prevent false detections in this case, we need to recognize this situation in advance and handle it specially.

Note2: The authors have implemented only an upstream task. The algorithm still needs to be integrated with concrete applications, e.g., active route planning and intersection driving position planning to avoid overexposure, or invoking algorithms that remove overexposure. It could also guide the sensing devices of unmanned support vehicles on the apron when they encounter overexposure.

Recent Gains from reading research papers[Oct 20th]

Fri, 20 Oct 2023 00:00:00 +0000

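[1] Wu F, Jin S, Lin X. Grasping pose estimation method based on Mask R-CNN and keypoint extraction[J]. Journal of Hefei University of Technology (Natural Science), 2023, 46(09): 1178-1184.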

In this paper, a method based on the mask region-based convolutional neural network (Mask R-CNN) and keypoint extraction is proposed for estimating the grasping pose of target workpieces in industrial settings, where the workpieces are scattered, stacked, and occluded. Instance segmentation is applied to the workpieces to be grasped and the non-grasped workpieces in the grasping environment, and the surface point cloud of the grasping target is constructed. 3D scale-invariant feature keypoints are extracted from the target surface point cloud and 3D corner keypoints from the template point cloud. These keypoints serve as the initial values of the sample consensus initial alignment (SAC-IA) algorithm, which reduces the computation of point cloud registration and yields a coarse alignment between the target surface point cloud and the reference template point cloud. Then, the iterative closest point (ICP) algorithm is used for fine alignment.
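
The fine alignment step can be illustrated with a small iterative-closest-point sketch. To stay short it works in 2D and assumes the coarse alignment has already brought the two point sets close together; the paper works with 3D point clouds, and all function names here are hypothetical.

```python
import math

def best_rigid_2d(src, dst):
    """Closed-form 2D rotation and translation minimizing squared error for paired points."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def apply_rigid_2d(points, theta, tx, ty):
    """Rotate points by theta (radians) and translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp_2d(src, dst, iterations=10):
    """Align src to dst by alternating nearest-neighbor matching and rigid fitting."""
    current = list(src)
    for _ in range(iterations):
        matched = [
            min(dst, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
            for p in current
        ]
        theta, tx, ty = best_rigid_2d(current, matched)
        current = apply_rigid_2d(current, theta, tx, ty)
    return current

template = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (2.0, 5.0)]
target = apply_rigid_2d(template, math.radians(5.0), 0.3, -0.2)
aligned = icp_2d(template, target)
```

Nearest-neighbor matching is only reliable when the initial misalignment is small, which is why a keypoint-based coarse alignment before ICP matters.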

Note1: Extracting the surface point cloud with Mask R-CNN is relatively novel, and seeding the coarse alignment with keypoints before refining with the iterative closest point (ICP) algorithm greatly reduces the computation of point cloud registration. The approach is also relatively simple to implement and is worth learning from; in particular, it could be applied to scene flow estimation and inter-frame matching.

Note2: Mask R-CNN, as an instance segmentation algorithm, has poor real-time performance, and the article does not discuss this issue. It will be the focus of future improvement.

Recent Gains from reading research papers[Oct 13th]

Fri, 13 Oct 2023 00:00:00 +0000

[1] Miyama M. Robust inference of multi-task convolutional neural network for advanced driving assistance by embedding coordinates[C]//Proceedings of the 8th World Congress on Electrical Engineering and Computer Systems and Science, EECSS. 2022: 105-1.

This paper develops a multi-task CNN (Convolutional Neural Network) for advanced driver assistance. The network performs three tasks simultaneously: object detection, semantic segmentation and disparity (parallax) estimation. The innovation is that the three tasks share not only an encoder but also a decoder. The decoder uses a combination of depthwise and pointwise (DP) convolution and bilinear interpolation instead of the common transposed convolution. The number of multiply-accumulate operations can be reduced to 44.0% and the number of convolutional weight parameters to 38.2%. In multi-task CNN training, the loss weights for each task are automatically adjusted by backpropagation, and the three tasks are learned in a balanced manner. Reducing the complexity of the decoder did not reduce the recognition accuracy and in fact improved it.
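
The source of the savings can be seen by counting weights for a single layer. This is a back-of-the-envelope sketch with assumed channel counts and kernel size, ignoring biases; it does not reproduce the paper's 44.0% and 38.2% figures, which refer to the paper's own network. Bilinear interpolation itself has no learnable weights.

```python
def transposed_conv_params(c_in, c_out, k):
    """Weights of one k x k transposed convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_pointwise_params(c_in, c_out, k):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out

# Assumed layer shape for illustration: 256 -> 128 channels with a 3 x 3 kernel.
full = transposed_conv_params(256, 128, 3)
light = depthwise_pointwise_params(256, 128, 3)
print(full, light, round(light / full, 3))
```

The depthwise step filters each channel separately and the pointwise step mixes channels, so the k x k spatial cost is no longer multiplied by the number of output channels.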

Note1: Replacing the decoder's usual transposed convolution with a combination of depthwise and pointwise (DP) convolution and bilinear interpolation reduces the number of parameters while maintaining accuracy. This idea can be borrowed when building lightweight algorithms in the future.

Note2: In the experiments, without input coordinates, semantic segmentation and disparity (parallax) estimation were incorrectly estimated at the same locations. A possible reason is that both tasks are the result of the inference of one CNN, which uses a monocular image to estimate disparity. In this case, the disparity estimate depends on the type, size and location of the object. Therefore, there is a strong correlation between semantic segmentation and disparity estimation.

Note3: FlowNet for optical flow estimation can also be used for stereo vision disparity (parallax) estimation. The encoder extracts feature maps from a pair of images while gradually decreasing the resolution, and the decoder generates a per-pixel motion flow from the encoded feature maps while gradually increasing the resolution.

Recent Gains from reading research papers[Sep 29th]

Fri, 29 Sep 2023 00:00:00 +0000

[1]Rozsa Z, Sziranyi T. Object detection from a few LIDAR scanning planes[J]. IEEE Transactions on Intelligent Vehicles, 2019, 4(4): 548-560.

This paper presents a LiDAR object detection method that works with only a small number of scanning planes. It is also applicable to the single-line LiDARs currently used on ROS carts. Existing research has focused on close-range LiDAR object detection, typically targeting objects within a range of 10-20 m and no more than 30 m. At greater distances, point cloud segments become strongly discontinuous and the surface features of 3D LiDAR data are no longer present. The significance of this work is that the processing range can be increased to better support higher speed scenes. Current LiDAR sensors can still detect points at greater distances, but no surface information is available there, so this work proposes a solution that utilizes the available information as much as possible.

Note: In high-speed autonomous driving applications, a robust solution is needed to recognize far-field outdoor objects. Even in the best case (Velodyne VLS-128 with very high vertical resolution), at a range of 150 meters (which a vehicle traveling at 130 km/h on a highway covers in about 4 seconds), the LiDAR cannot see a 1.5-meter-high object in more than 5 planes, and may capture only fragments of it. The paper also mentions the difference between AGVs and self-driving vehicles: AGVs are slower, whereas autonomous vehicles need to consider safe operation at higher speeds. When studying autonomous driving in road traffic, looking more closely at AGV operation may therefore yield useful ideas.
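
The numbers in this note can be checked with a short calculation. The sketch assumes evenly spaced scanning planes and a vertical angular resolution of 0.11 degrees; that value is an assumption chosen for illustration, not a figure taken from the paper.

```python
import math

def time_to_reach(distance_m, speed_kmh):
    """Seconds needed to cover distance_m at a constant speed given in km/h."""
    return distance_m / (speed_kmh / 3.6)

def planes_on_object(height_m, distance_m, vertical_res_deg):
    """Number of whole scan-plane spacings that fit into an object of the given height."""
    spacing = distance_m * math.tan(math.radians(vertical_res_deg))
    return int(height_m / spacing)

print(round(time_to_reach(150.0, 130.0), 2))   # about 4 seconds
print(planes_on_object(1.5, 150.0, 0.11))      # 0.11 degrees is an assumed resolution
```

At 150 m the planes are roughly 0.29 m apart under this assumption, so a 1.5 m object is crossed by about five of them, consistent with the figure quoted above.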

Recent Gains from reading research papers[Sep 22nd]

Fri, 22 Sep 2023 00:00:00 +0000

[1]Duffhauss F, Baur S A. PillarFlowNet: A real-time deep multitask network for LiDAR-based 3D object detection and scene flow estimation[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 10734-10741.

This paper focuses on combining scene flow estimation and object detection for 3D point clouds. Previously, scene flow estimation and object detection had to be processed by separate networks, which requires a large amount of computational resources. This work proposes PillarFlowNet, a new approach for low-latency and high-precision LiDAR scene flow estimation and object detection based on a single network that achieves real-time performance.

Note1: The goal of the article is to solve the multi-task learning problem of 3D object detection and scene flow estimation using LiDAR data in a single deep network. 3D scene flow deserves a closer look: it assigns a displacement vector to each point in a point cloud, carrying the point forward to its corresponding location in the next consecutive point cloud. Estimating 3D scene flow using LiDAR data is a complex task because of the inherent sparsity of measured points in 3D space, where there is almost no one-to-one correspondence between points in two consecutive point clouds. The scene flow must be inferred from the potential motion of objects in the scene. There is still little research in this area, and the topic is worth exploring further.

Note2: The highlight of the paper is the use of a single network for simultaneous point cloud object detection and scene flow estimation. Networks in previous work relied on computing expensive 3D convolutions to find correspondences in the spatial and temporal domains, which makes it too slow for real-time applications. The paper proposes that the main feature of the network is a pillar-dependent feature representation combined with efficient 2D convolution. The increase in speed makes it the first LiDAR object detection and scene flow multitask prediction network that can run in real time. However, there is still room for improvement in its accuracy of detection and scene flow estimation.

Note3: I hope the authors will release the source code for us to study.

Recent Gains from reading research papers[Sep 15th]

Fri, 15 Sep 2023 00:00:00 +0000

[1]Rozsa Z, Sziranyi T. Optical Flow and Expansion Based Deep Temporal Up-Sampling of LIDAR Point Clouds[J]. Remote Sensing, 2023, 15(10): 2487.

This paper proposes a framework that generates a virtual lidar point cloud for the current frame from the current camera image and the previous lidar point cloud, making temporal upsampling of the point cloud possible. The motivation is that the lidar scans at a mere 5-20 Hz while the camera refreshes much faster, which leads to temporal desynchronization between the two. The only requirement for this system is a camera with a higher frame rate than the lidar. The general flow of the algorithm is to first estimate the optical flow from the available camera frames and then upgrade it to a 3D scene flow by optical expansion. Subsequently, a ground plane is fitted to the previous LIDAR point cloud. Finally, the estimated scene flow is applied to the previously measured object points to generate a new point cloud.
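The last two steps of this pipeline (separating ground points with the fitted plane, then moving object points by the scene flow) can be sketched as follows. This is only an illustration under stated assumptions: the plane is given as unit-normal coefficients, the per-point flow is already available, ground points are simply kept where they are, and the function names and the 0.2 m tolerance are hypothetical rather than taken from the paper.

```python
def height_above_plane(point, plane):
    """Signed distance of a point to the plane a*x + b*y + c*z + d = 0.

    Assumes (a, b, c) is a unit normal.
    """
    a, b, c, d = plane
    x, y, z = point
    return a * x + b * y + c * z + d

def upsample_point_cloud(prev_points, scene_flow, plane, ground_tol=0.2):
    """Build a virtual point cloud from the previous lidar frame.

    prev_points: list of (x, y, z) from the previous lidar scan
    scene_flow:  list of (dx, dy, dz), one displacement per point
    ground_tol:  assumed tolerance (meters) for calling a point "ground"
    """
    new_points = []
    for p, f in zip(prev_points, scene_flow):
        if abs(height_above_plane(p, plane)) <= ground_tol:
            # Simplification: ground points are left untouched.
            new_points.append(p)
        else:
            # Object points are displaced by their estimated 3D flow.
            new_points.append(tuple(pi + fi for pi, fi in zip(p, f)))
    return new_points

prev = [(1.0, 0.0, 0.0), (5.0, 0.0, 1.0)]
flow = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]
print(upsample_point_cloud(prev, flow, plane=(0.0, 0.0, 1.0, 0.0)))
```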

Note1: The highlight of the paper is that previous point cloud prediction methods require five or more previous frames to generate virtual measurements, while this method requires only one.

Note2: The paper addresses the lidar's sampling frequency being lower than the camera's by generating a virtual point cloud. The virtual point cloud generation is faster and more accurate, but static targets are not considered in the design and the ground point modeling affects the accuracy, both of which could be improved.

Note3: The optical flow method is another point that can be improved: it is strongly affected by lighting changes and object occlusion, so its robustness is limited. One option is to replace it with deep learning based trajectory prediction, but this would increase the amount of computation, which still needs to be considered.

Recent Gains from reading research papers[Sep 8th]

Fri, 08 Sep 2023 00:00:00 +0000

This paper presents a method for providing water hazard information to ground vehicles. The method does not impose any constraints on camera motion or require specific sensors, and it determines the 3D coordinates of underwater surface points in a least squares sense. Based on multi-view geometry and basic point cloud processing techniques, the method uses the principle of refraction to detect underwater hazards from matched points, estimate their surfaces, and calculate the true depths of underwater shapes. It helps to improve vehicle intelligence through water hazard depth estimation in both on-road and off-road situations.

Note1: The paper proposes a new method for estimating water depth with a binocular camera. The binocular camera simplifies the problem and allows the scene to be reconstructed, while detecting hazards in a timely manner at a distance and estimating the underwater depth at close range. Because the 3D reconstruction of the scene is continuous, segmenting the water hazard area in just one image is sufficient to mark the hazardous area in 3D.

Note2: This paper applies the method to autonomous ground vehicles in off-road or road pothole environments. However, it could also be extended to other vehicles and intelligent systems. For example, it could be applied to UAVs for terrain detection, water depth mapping, or search and rescue missions during flooding disasters.

The advantage of the proposed method is that there is no restriction on the camera pose during detection and no additional or specific sensors are required. The innovation is a calibration solution that defines the vehicle and camera poses in a universal coordinate system, which the paper claims has not been done before.

Recent Gains from reading research papers[Sep 1st]

Fri, 01 Sep 2023 00:00:00 +0000

[1]Steininger D, Kriegler A, Pointner W, et al. Towards Scene Understanding for Autonomous Operations on Airport Aprons[C]//Proceedings of the Asian Conference on Computer Vision. 2022: 147-163.

This work was published at ACCV 2022. It focuses on designing a dataset for autonomous operations on airport aprons, specializing in static and dynamic objects commonly found in the apron area. It also proposes a method for image acquisition and for annotating object instances and environmental parameters, which automatically extracts a representative set of samples from a large amount of image data while minimizing the manual annotation work. In addition, the authors produced several dataset variants on which baseline classification and detection experiments were performed, in order to evaluate the overall performance of the resulting models and their robustness to specific environmental conditions.

Note1:
The fact that the paper was accepted at ACCV illustrates the scarcity of, and need for, datasets for unmanned operations inside airports. The image data were acquired by working with airports to install cameras on multiple ground support vehicles, and cover multiple seasons of operation on the ramp and in the logistics area. Variations in environmental conditions such as time of day, seasonal and atmospheric effects, lighting conditions, and camera-related degradation effects are also included. The labeling focuses on multiple types of ramp vehicles and also includes other types of static and transient obstacles. Judging from the paper, the dataset covers most scenarios within the apron, but it does not seem to be publicly available.

Note2: Specific data collection methodology: recorded via Nextbase 612GW CarLog, using a resolution of 3860x2160 pixels, mounted on the inside of the windshield of two containerized delivery vehicles. One was modified to include a 90° field of view instead of using the 150° built-in lens. The majority of the data was captured in time-lapse mode at 5 fps to provide data variability that fully represents the environment. In addition, the recordings were supplemented with 30 fps sequences for future studies such as multi-target tracking.

Note3: The authors’ innovation is to give the dataset an additional set of defined parameters that specify and categorize the environmental factors during the recording period, as shown in the figure below. This is worth borrowing when producing our own datasets in the future.

Recent Gains from reading research papers[July 20th]

Thu, 20 Jul 2023 00:00:00 +0000

[1]Fang Z, López A M. Intention recognition of pedestrians and cyclists by 2d pose estimation[J]. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(11): 4773-4783.

The paper explores the use of 2D pose estimation from monocular images as the core information for recognizing the intent of pedestrians and cyclists, and for solving the pedestrian crossing/non-crossing (C/NC) classification task. For cyclists, the article assumes that they obey traffic rules and use arm signals to indicate future maneuvers. For pedestrians, it assumes that the walking pattern determines whether he/she intends to cross the street in the path of the ego vehicle. The paper compares favorably with other work of its kind, mainly by examining the effect of a noisy 2D skeleton on detection and by reporting the features that the C/NC classifier considers most relevant. It also shows how the same approach can be used to recognize cyclist arm signals.

Note: This approach could be extended to recognizing the movements of people on the ramp, drawing on the article’s selection of the 9 most stable human keypoints, which correspond to the legs and shoulders: the legs perform the start-stop motion of walking, and the shoulder and leg keypoints together give the overall body orientation. Specifically, features are computed from the selected keypoints by first normalizing the keypoint coordinates with a factor h proportional to the height of the pedestrian, determined as the vertical distance from the top to the bottom keypoint. Then, different features are computed from the distances and relative angles between pairs of keypoints, as well as the angles of the triangles formed by triples of keypoints (which convey partly redundant information). Finally, a random forest classifier performs the C/NC classification; it directly outputs a probability, so thresholds can be applied in a meaningful way.
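The normalization and pairwise features described in this note can be sketched as below. This is a simplified illustration, not the authors' code: the triangle angles and the random forest step are left out, keypoints are plain (x, y) pixel tuples, and the function names are hypothetical.

```python
import math
from itertools import combinations

def normalize_keypoints(keypoints):
    """Scale (x, y) keypoints by h, the vertical top-to-bottom extent."""
    ys = [y for _, y in keypoints]
    h = max(ys) - min(ys)
    if h == 0:
        raise ValueError("keypoints have no vertical extent")
    return [(x / h, y / h) for x, y in keypoints]

def pair_features(keypoints):
    """Distance and relative angle for every pair of normalized keypoints."""
    feats = []
    for (x1, y1), (x2, y2) in combinations(normalize_keypoints(keypoints), 2):
        feats.append(math.hypot(x2 - x1, y2 - y1))   # normalized distance
        feats.append(math.atan2(y2 - y1, x2 - x1))   # relative angle (rad)
    return feats

# Three made-up keypoints (e.g. shoulder, hip, ankle) in pixels.
skeleton = [(10.0, 20.0), (12.0, 60.0), (8.0, 100.0)]
print(pair_features(skeleton))
```

Because of the division by h, the same skeleton seen closer or farther away (a uniformly scaled copy) yields the same feature vector, which is the point of the normalization.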

Recent Gains from reading research papers[July 13th]

Thu, 13 Jul 2023 00:00:00 +0000

[1]Donadio, F., Frejaville, J., Larnier, S., & Vetault, S. (2016, October). Human-robot collaboration to perform aircraft inspection in working environment. In Proceedings of 5th International conference on Machine Control and Guidance (MCG).

The paper describes two projects. One is intelligent video surveillance for monitoring ramp operations. The second is a collaborative mobile robot. For intelligent surveillance, the project uses data fusion methods to further process and categorize data on the basis of predefined object categories (people, vehicles, aircraft, etc.) and to associate these objects with functions in a three-dimensional form. In addition, the paper describes the AIR-COBOT project, which aims to improve the efficiency of aircraft maintenance and the traceability of maintenance operations. The robot is capable of tracking, identifying and avoiding obstacles. Its main task is to analyze acquired images of the aircraft to determine whether the cabin doors are open or closed, whether protections are present on certain equipment (static ports, probes), the state of the turbine fan blades, the state of the probes, and the wear of the landing gear tires.

Note 1:

Flow chart

The paper’s motion understanding algorithm is worth studying. It builds a model in advance and then compares real-time data against it using a form of “knowledge base”. The basic idea is to first automatically report the ramp area information matching the relevant scene to the corresponding operator, then to fuse all the information related to the scene in the data fusion module, and finally to recognize the events occurring within the video stream. The spatial and temporal characteristics of the detected moving objects are modeled by spatial and temporal relationships graded by different intensities. The system consists of three modules: a trajectory velocity analysis module, a trajectory clustering module and an activity analysis module. The first module splits trajectories into segments of comparable velocity. The second module clusters the trajectories of the moving objects observed in the scene to obtain the displacement patterns between their origins and destinations and to discover the topology of the scene.

Note 2:

The robot is navigated by first teaching it the trajectory: the operator moves the robot in remote control mode or follow mode, and the robot then follows that trajectory automatically, relying heavily on GPS data to reach the ramp. The robot can localize itself using laser data or camera images. For aircraft detection, it mainly acquires a 3D point cloud with the LiDAR and matches the aircraft model against the measured point cloud to estimate the static relative pose of the aircraft and the robot.

Recent Gains from reading research papers[July 6th]

Thu, 06 Jul 2023 00:00:00 +0000

[1]Xu J, Ding M, Zhang Z Z, et al. Vision-Based Automatic Collection of Nodes of In/Off Block and Docking/Undocking in Aircraft Turnaround[J]. Applied Sciences, 2023, 13(13): 7832.

The paper uses an object detection algorithm to automatically recognize aircraft in-block/off-block and docking/undocking events and to record the times of the corresponding ground support nodes. It consists of two modules: a preprocessing module and a key node collection module. The preprocessing module extracts spatio-temporal information from the airport apron; the key node collection module models single-target nodes and dual-target interaction nodes, the latter represented by the docking and withdrawal of passenger elevator trucks, and designs a collection method for each of the two kinds of key nodes. The framework can replace the manual recording methods that are routinely used at present.

Note 1:

The dataset of this paper covers two neighboring aircraft stands, captured by two cameras with different viewpoints. An aircraft at a neighboring stand may appear in the same surveillance image and be hard to tell apart from the aircraft at the current stand, so the paper sets an a priori rule. By analyzing the dimensions of the aircraft at positions 734 and 939 in the image plane, it finds that the width or height of the BBOX of the aircraft at the current stand is at least half of the width or height of the whole image. Based on this rule, the aircraft is associated by searching for the aircraft bounding box with the largest area that satisfies the rule.
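The a priori rule above is easy to express in code. The following is a minimal sketch, assuming boxes are (x1, y1, x2, y2) tuples in pixels; the function name and the example numbers are made up for illustration.

```python
def associate_aircraft(boxes, image_w, image_h):
    """Pick the aircraft box that belongs to the current stand.

    A box qualifies if its width or its height is at least half of the
    image width or height; among qualifying boxes the largest area wins.
    Returns None when no box satisfies the rule.
    """
    def width(b):
        return b[2] - b[0]

    def height(b):
        return b[3] - b[1]

    candidates = [
        b for b in boxes
        if width(b) >= image_w / 2 or height(b) >= image_h / 2
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: width(b) * height(b))

detections = [
    (0, 0, 400, 200),        # small box: aircraft at the neighboring stand
    (100, 100, 1100, 500),   # wide enough, but smaller area
    (300, 200, 1500, 800),   # wide enough and largest area
]
print(associate_aircraft(detections, image_w=1920, image_h=1080))
```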

Note 2:

To decide whether a target is moving or stationary, the paper uses the standard deviation of the bounding box center coordinates over a fixed number of consecutive frames, together with the IoU between the previous and current bounding boxes. If the IoU is greater than a certain threshold, the target is judged to be stationary.
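A possible reading of this check is sketched below. How exactly the paper combines the two cues is not spelled out in this note, so requiring both at once is an assumption, as are the threshold values and the function names.

```python
from statistics import pstdev

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_stationary(boxes, iou_thresh=0.9, std_thresh=2.0):
    """Decide whether a tracked target is at rest.

    boxes: the target's boxes over a window of consecutive frames
    (at least two). Thresholds are illustrative values, not the paper's.
    """
    xs = [(b[0] + b[2]) / 2 for b in boxes]
    ys = [(b[1] + b[3]) / 2 for b in boxes]
    centers_steady = pstdev(xs) <= std_thresh and pstdev(ys) <= std_thresh
    boxes_overlap = iou(boxes[-2], boxes[-1]) > iou_thresh
    return centers_steady and boxes_overlap

parked = [(100, 100, 200, 200)] * 5
driving = [(50 * i, 0, 50 * i + 100, 100) for i in range(5)]
print(is_stationary(parked), is_stationary(driving))
```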

Note 3:

The paper determines whether docking is complete using two conditions. The first is that both the passenger elevator truck and the aircraft are at rest. The second depends on the specific viewing angle of the stand camera: if 90% of the passenger elevator truck’s BBOX overlaps with the aircraft’s BBOX, the docking process is judged to be complete.

Some of the decision rules in this paper are simple, but they make clever use of the prior knowledge of the stand camera's position, which is worth learning from.
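The two docking conditions can be combined as in the sketch below. It assumes the stationary flags come from a separate motion check and that boxes are (x1, y1, x2, y2) tuples; the function names are hypothetical.

```python
def overlap_ratio(inner, outer):
    """Fraction of the area of `inner` that lies inside `outer`."""
    iw = max(0.0, min(inner[2], outer[2]) - max(inner[0], outer[0]))
    ih = max(0.0, min(inner[3], outer[3]) - max(inner[1], outer[1]))
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return (iw * ih) / area if area > 0 else 0.0

def docking_complete(truck_box, aircraft_box, truck_still, aircraft_still,
                     min_overlap=0.9):
    """Both targets at rest and at least 90% of the truck box on the aircraft box."""
    if not (truck_still and aircraft_still):
        return False
    return overlap_ratio(truck_box, aircraft_box) >= min_overlap

truck = (100, 100, 200, 200)
print(docking_complete(truck, (0, 0, 500, 195), True, True))   # 95% overlap
print(docking_complete(truck, (0, 0, 500, 150), True, True))   # 50% overlap
```

Note that the ratio is taken relative to the truck box only, not as an IoU, since the aircraft box is much larger than the truck box.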

Recent Gains from reading research papers[June 30th]

Fri, 30 Jun 2023 00:00:00 +0000

[1]Zhang Y, Zhu L, Feng W, et al. Vil-100: A new dataset and a baseline model for video instance lane detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 15681-15690.

Current lane detection research focuses on individual images (frames) and ignores the information available in continuous video. In contrast, this paper establishes a new video instance lane detection dataset (VIL-100). In addition, it proposes a new baseline model whose main idea is to enhance the representation of the current frame by carefully aggregating local and global memory features from other frames.

Note 1:

Extending deep learning-based lane detection from the image level to the video level makes it possible to use temporal consistency to resolve many intra-frame blurring, occlusion, and misrecognition problems. The same idea can be applied to recognition inside airports. Specifically, past frames of the original video form a local memory, frames of a shuffled version of the video form a global memory, and the current video frame serves as the query to be segmented, with features extracted from the frames in both memories. These multi-level memory features are aggregated by Local and Global Memory Aggregation (LGMA) modules, and then all CNN features are integrated to produce the video detection results.

Note 2:

Existing memory networks sample one frame every N frames to include both near and far frames, but all sampled frames are ordered, so the extracted features may depend heavily on temporal information. In contrast, the LGMA module designed in this paper takes five frames from a shuffled video as the global memory, which removes the temporal ordering and enhances the global semantic information used to detect lanes. Since the content differs between video frames, memory features from different frames contribute differently to helping the current frame identify background objects. Therefore, an attention mechanism is used to learn attention maps that automatically assign different weights to the local and global memory features.

Note 3:

A lane annotation method: a series of points is placed along the centerline of each lane in each frame and stored in a JSON file, with the points on each lane stored as one group. Each group of points is then fitted to a curve by a third-order polynomial and expanded into a lane region with a certain width, which enables instance-level annotation.

Recent Gains from reading research papers[June 23rd]

Fri, 23 Jun 2023 00:00:00 +0000

[1]Furletov Y, Willert V, Adamy J. Auditory scene understanding for autonomous driving[C]//2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021: 697-702.

Experimental equipment

A necessary condition for autonomous driving is accurate and reliable perception of the vehicle’s surroundings. Current architectures rely on cameras and LIDAR to capture the visual environment and to locate and track other traffic participants. This paper instead proposes adding auditory perception. It focuses on recognizing the siren sound of emergency vehicles and using the recognition results to determine the correct driving strategy or plan a more rational path. The advantages of an auditory sensor are: (1) it increases the accuracy and richness of the environmental representation when other sensors are working; (2) if all other sensors fail, it still maintains some ability to perceive and interact with the environment.

Note 1: Sound localization and classification provide information that is not available to the other sensors in self-driving cars. Some extended ideas: during ramp operations, sensing the sound of a running engine (which is loud and distinctive) could help determine the type of aircraft, or plan an appropriate route depending on whether the engine is running (considering the engine danger zone). It may also be possible to sense tire sounds to infer vehicle traction or speed information.

Note 2: Different sound processing algorithms can detect and classify different objects, extract auditory cues specific to particular types of objects, and localize each object by combining time-delay-of-arrival and amplitude-based localization algorithms. These mainly rely on the absolute loudness levels at the microphones or on the interaural time and level differences between several microphones, and they can complement visual recognition and detection.
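As a small illustration of the time-delay cue, the textbook far-field model for two microphones gives the bearing of a source from the arrival time difference. This is a generic sketch, not the paper's localization pipeline; the speed of sound, the 0.5 m microphone spacing and the function name are assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 °C

def bearing_from_tdoa(delay_s, mic_spacing_m, c=SPEED_OF_SOUND_M_S):
    """Bearing (degrees from broadside) of a far-field source.

    delay_s is the arrival time difference between two microphones
    separated by mic_spacing_m; sin(theta) = c * delay / spacing.
    """
    s = c * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.degrees(math.asin(s))

# A siren reaching the second microphone about 0.73 ms later (0.5 m spacing).
print(round(bearing_from_tdoa(0.25 / 343.0, 0.5), 1))
```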

Recent Gains from reading research papers[June 16th]

Fri, 16 Jun 2023 00:00:00 +0000

[1]Wang S, Che Y, Zhao H, et al. Accurate tracking, collision detection, and optimal scheduling of airport ground support equipment[J]. IEEE Internet of Things Journal, 2020, 8(1): 572-584.

The paper aims to reduce the risk of accidents on the ramp. It designs a real-time, high-precision tracking algorithm for ground vehicles and equipment that reports position and speed in a timely manner and flags potential collisions between aircraft and ground targets. Specifically, the paper develops a real-time, high-precision tracking device for ground support vehicles, consisting of a real-time kinematic (RTK) unit and a heading unit. Most importantly, the device can be used on baggage carts with multiple trailers and tracks them to within centimeters, which is not studied in other papers.

Note 1: The tracking algorithm in this paper is implemented mainly by installing a tracking device on each trailer and obtaining the real-time position Pi and heading θi of each trailer through a coordinate conversion algorithm. In addition, in order to predict potential collisions, the moving speed of the ground support vehicle needs to be obtained, which is estimated from the collected position data. Since the time difference method of estimating the velocity from noisy position measurements (i.e., [x(n) - x(n-1)]/T) amplifies the noise, the paper uses a tracking algorithm, i.e., a linear Markov tracking model with a Kalman filter, to obtain more accurate positions and estimated velocities of the trailers on the x and y axes, respectively.
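To make the idea concrete, here is a minimal one-axis Kalman filter with a constant-velocity state [position, velocity]. It is a generic sketch, not the model from the paper: the process noise q, the measurement noise r, the initial covariance and the function name are all illustrative assumptions. The same filter would be run once for x and once for y.

```python
def kalman_cv_1d(measurements, T, q=1.0, r=1.0):
    """Filter noisy positions and return (position, velocity) estimates.

    measurements: positions sampled every T seconds
    q: assumed process noise intensity, r: assumed measurement variance
    """
    x, v = measurements[0], 0.0
    # Symmetric 2x2 covariance stored as p00, p01, p11 (velocity unknown at start).
    p00, p01, p11 = r, 0.0, 100.0
    estimates = []
    for z in measurements[1:]:
        # Predict with F = [[1, T], [0, 1]] and white-noise-acceleration Q.
        x = x + T * v
        p00, p01, p11 = (
            p00 + 2 * T * p01 + T * T * p11 + q * T ** 3 / 3,
            p01 + T * p11 + q * T ** 2 / 2,
            p11 + q * T,
        )
        # Update with the position measurement (H = [1, 0]).
        s = p00 + r
        k0, k1 = p00 / s, p01 / s
        residual = z - x
        x += k0 * residual
        v += k1 * residual
        p00, p01, p11 = (1 - k0) * p00, (1 - k0) * p01, p11 - k1 * p01
        estimates.append((x, v))
    return estimates

T = 0.1
track = [2.0 * T * n for n in range(200)]   # trailer moving at 2 m/s
pos, vel = kalman_cv_1d(track, T)[-1]
print(round(pos, 2), round(vel, 2))
```

Unlike the plain difference [x(n) - x(n-1)]/T, the velocity here is a state that is corrected a little at every step, so measurement noise is averaged out instead of being divided by a small T.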

Note 2: The collision detection of this system focuses on the corner points of the tractor or trailers, especially the last trailer, such as the red star position in the above figure. However, the paper only obtains the positions of the corner points by coordinate transformation and does not implement the actual collision detection, which leaves room for further research.
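The corner points follow from the tracked position and heading by a 2D rotation and translation. The sketch below assumes the tracking device sits at the geometric center of a rectangular trailer, which may not match the mounting used in the paper; the function name and dimensions are illustrative.

```python
import math

def corner_points(px, py, heading_rad, length, width):
    """World coordinates of the four corners of a rectangular trailer.

    (px, py): tracked position, assumed to be the trailer center
    heading_rad: heading, measured counterclockwise from the x axis
    """
    half_l, half_w = length / 2, width / 2
    # Corners in the trailer frame: front-left, front-right, rear-right, rear-left.
    local = [(half_l, half_w), (half_l, -half_w),
             (-half_l, -half_w), (-half_l, half_w)]
    c, s = math.cos(heading_rad), math.sin(heading_rad)
    return [(px + c * x - s * y, py + s * x + c * y) for x, y in local]

for corner in corner_points(10.0, 5.0, math.pi / 2, length=4.0, width=2.0):
    print(tuple(round(v, 2) for v in corner))
```

A collision check, which the paper leaves open, could then test these points against the aircraft outline or a safety zone around it.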

Recent Gains from reading research papers[June 9th]

Fri, 09 Jun 2023 00:00:00 +0000

[1]Elrayes A, Ali M H, Zakaria A, et al. Smart airport foreign object debris detection rover using LiDAR technology[J]. Internet of Things, 2019, 5: 1-11.

Experimental scenes

This paper proposes and implements an intelligent robot solution for FOD detection at airports. Millimeter wave sensors and LIDAR are installed on a vehicle that moves along a designated track to perform FOD ranging. When FOD is detected, a message with the FOD location is sent to the airport staff via Bluetooth or a Wi-Fi network. The system is capable of detecting FODs of various sizes at different distances from the vehicle. The vehicle is notable for its low cost and for a design that does not interfere with airside operations at the airport.

Note 1: The location of the FOD in the paper is identified using two coordinates: (i) the distance x0 between the vehicle and the FOD, measured by the LiDAR sensor; and (ii) the position y0 of the vehicle relative to the starting point along the track, measured using an odometer. If debris is detected, the vehicle’s communication system sends the operator a message containing the object’s location (x0, y0). Obstacle avoidance relies on two pairs of ultrasonic sensors fixed to the front and rear of the vehicle. If an obstacle is detected, the vehicle stops, sends a message, and then starts backing up. Because it runs on a fixed track, it cannot avoid obstacles autonomously, so the application scenarios of this vehicle are limited and it cannot be deployed at scale.

Note 2: The vehicle cannot identify the type of FOD because it has no camera. Adding camera-based identification and combining it with more mature FOD detection algorithms may achieve better results. In addition, the experiments in the article are split between an indoor laboratory and an outdoor parking lot, which suggests that our own subsequent experiments could also cover both indoor and outdoor scenarios to be more convincing.

Recent Gains from reading research papers[June 2nd]

Fri, 02 Jun 2023 00:00:00 +0000

[1]Lobo M J, Hurter C, Cousy M. A LIDAR interactive data visualization for ground aircraft detection at small airports[C]//SID 2019, 9th SESAR Innovation Days. 2019.

The paper implements the interaction and visualization of LiDAR detection data for small airports, which is used in the ENVISION project to enable the detection and localization of aircraft, vehicles and people. The implementation process is mainly done by coupling multiple sensors and processing modules, i.e. multiple cameras, a LiDAR and an ADS-B sensor, in order to calculate the target position. The data acquired in the sensors are provided to separate processing modules, which decode and process their data. The processing results are then integrated into a data fusion module, which tracks the target, calculates the target position and generates aircraft motion events. The LIDAR is placed on a high pole about 8 meters high, thus covering part of the ramp and the nearest taxiway.

Note 1: What sets the paper apart is that, in order to achieve fast point cloud segmentation, it runs detection on the 2D images generated from the LiDAR data instead of on the 3D point cloud, and uses a plane fitting method to remove the ground points. Whether two points belong to the same target is decided as follows. For each laser point not on the ground, up to four neighboring laser points are considered: the horizontally nearest points from the same laser scan and the vertically nearest points with the same azimuth. This constructs a graph with up to four neighbors for each point, depending on the distance between the points: for two points to be considered adjacent, the distance between them must be below a distance threshold. For each pair of neighboring points in the graph, an angle β is calculated, which is the angle between the line segment connecting the two points and the laser beam passing through the LIDAR origin and the farther of the two points. This angle reflects the depth difference between the two points: if it is small, the points are probably at different depths and belong to different objects. A threshold on β therefore decides whether the two laser points belong to the same object.

Note 2: The 2D detection method in this paper cannot detect people or small vehicles because it filters out objects whose bounding box has a longest diagonal of less than 1 meter. The reason is that, in the paper's experiments, including small targets produced more false detections, which is an area that can be improved. In addition, the algorithm does not classify targets, so obtaining class information requires further fusion with the information from the cameras.
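The β test can be written down with basic trigonometry. The sketch below uses the formulation that is common for range-image segmentation, with d1 the longer and d2 the shorter of the two measured ranges and α the angle between the two beams; whether the paper uses exactly this formula, and the 10° threshold, are assumptions.

```python
import math

def beta_angle(r1, r2, alpha_rad):
    """Angle between the beam to the farther point and the line joining both points."""
    d1, d2 = max(r1, r2), min(r1, r2)
    return math.atan2(d2 * math.sin(alpha_rad), d1 - d2 * math.cos(alpha_rad))

def same_object(r1, r2, alpha_rad, beta_thresh_deg=10.0):
    """Neighboring points are merged when beta exceeds the threshold."""
    return math.degrees(beta_angle(r1, r2, alpha_rad)) > beta_thresh_deg

alpha = math.radians(1.0)                 # angular step between two beams
print(same_object(10.0, 10.0, alpha))     # equal ranges: same surface
print(same_object(20.0, 10.0, alpha))     # 10 m depth jump: different objects
```

For two points at the same range the connecting line is almost perpendicular to the beam (β close to 90°), while a large depth jump makes the line nearly parallel to the beam (β close to 0°).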

Recent Gains from reading research papers[May 26th]

Fri, 26 May 2023 00:00:00 +0000

[1] Brassel H, Zouhar A, Fricke H. 3D Modeling of the airport environment for fast and accurate LiDAR semantic segmentation of apron operations[C]//2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC). IEEE, 2020: 1-10.

[2] Braßel H, Zouhar A, Fricke H. Adaptive point sampling for LiDAR-based detection and tracking of fast-moving vehicles using a virtual airport environment[J].

The two papers above combine LiDAR with semantic segmentation and with object detection and tracking, respectively, to carry out surveillance of non-cooperative objects. The first paper focuses on fast and accurate LIDAR semantic segmentation on the apron. The main reasons for the lack of research in this area are that model building depends on large-scale datasets and that point-by-point annotation of 3D point clouds is tedious. Therefore, the paper builds simulated datasets, using a virtual airport environment with integrated LiDAR sensor models to generate synthetic training data for the ramp. In addition, the model identifies aircraft arrival/departure gates, high poles, airport buildings and ground planes. The second paper builds on the semantic segmentation of airport scenes and proposes a method for real-time detection and tracking of fast-moving objects on the ramp in LIDAR scans. The method feeds newly generated points into a Kalman filter, selecting feature points as a function of the velocity of the moving target and its distance from the sensor.

Note 1: Taken together, the two papers show the importance of simulated datasets when field data are unavailable. A major advantage of a simulated dataset is that accurate data are easy to obtain for every time point and location, which makes it easier to verify the accuracy of the proposed model. In addition, some validations related to ramp operations can be carried out in simulated scenarios, combined with digital twins.

Note 2: Simulation scenarios can be built with CAD, and 3D models can be created from publicly available ground plans such as OpenStreetMap. However, freely available data sources usually provide only incomplete 3D information (e.g. height) and little detail, so building a simulation dataset also requires collecting real data to ensure that it matches reality. In the papers above, detailed construction drawings provided directly by the airport and measurements from real LiDAR scans of the ramp were used to build the scenario.

Recent Gains from reading research papers[May 19th]

Fri, 19 May 2023 00:00:00 +0000

[1]Wang Zheng,Zhao Xiao,She Hongjie,Liu Honghai,Zhao Yanwei. AGV obstacle detection and obstacle avoidance based on binocular vision[J]. Computer Integrated Manufacturing Systems,2018,24(02):400-409.DOI:10.13196/j.cims.2018.02.012.

This paper proposes a binocular vision-based obstacle detection and obstacle avoidance method for AGVs. In the process of obstacle detection, an obstacle determination algorithm based on depth detection is used to determine the existence of obstacles, and then the frame difference method is used to obtain the orientation and speed information of static and dynamic obstacles. In addition, the authors design obstacle avoidance strategies for different obstacles and analyze the angle and distance deviation during the movement of the AGV.

Note 1: For obstacle detection and motion state detection, this paper uses binocular vision to calculate the 3D coordinates of matched image points in the left camera coordinate system, obtains their height and horizontal distance, and compares them with set thresholds to determine whether an obstacle exists, rather than using a deep learning object detection method. The relative advantages and disadvantages of the two approaches are unclear and need further comparison.

Note 2: The obstacle avoidance strategy of this article is worth referencing: once the motion characteristics of an obstacle are known, a different strategy is chosen according to its state. For dynamic obstacles, the self-driving ground support vehicle should wait in place until the obstacle leaves; for static obstacles, it should execute the avoidance maneuver and then return to the normal path. The static obstacle avoidance strategy is as follows. First, the object detection algorithm gives the length of the obstacle in the pixel coordinate system and the offset of the obstacle boundary from the center of the vehicle, and the camera calibration converts these to the actual length L and offset p. The LIDAR gives the distance a between the vehicle and the obstacle ahead. According to the type of obstacle identified (two kinds of obstacles at the aircraft stand: car and human), the obstacle width W can be set to 5 m and 0.5 m respectively. Second, if the obstacle is to the right of the vehicle, the vehicle avoids to the left, and vice versa.
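The decision logic in this note can be summarized in a few lines. This is a sketch of the rules as stated above, not code from the article; the sign convention for the offset p (positive means the obstacle is to the right of the vehicle center) and all names are assumptions.

```python
# Assumed obstacle widths W in meters, by detected type (values from the note).
OBSTACLE_WIDTH_M = {"car": 5.0, "human": 0.5}

def plan_avoidance(is_dynamic, obstacle_type, offset_p_m):
    """Choose an action for an obstacle ahead of the vehicle.

    offset_p_m: lateral offset of the obstacle from the vehicle center;
    positive is taken to mean "obstacle on the right" (assumed convention).
    """
    if is_dynamic:
        # Dynamic obstacles: wait in place until the obstacle leaves.
        return {"action": "wait"}
    # Static obstacles: steer to the side opposite the obstacle.
    direction = "left" if offset_p_m > 0 else "right"
    return {
        "action": "avoid",
        "direction": direction,
        "assumed_width_m": OBSTACLE_WIDTH_M[obstacle_type],
    }

print(plan_avoidance(True, "human", 0.3))
print(plan_avoidance(False, "car", 1.2))
```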

Recent Gains from reading research papers[May 12th]

Fri, 12 May 2023 20:38:32 +0800

At present, there are three types of AGV navigation methods: magnetic navigation, visual navigation and laser navigation. Magnetic navigation is the most mature and the most widely used in the AGV industry; it relies on magnetic strips laid in the factory area, which makes installation expensive and fixes the operating route. Visual navigation uses image processing as its core technology, offers a high degree of autonomy, and has advantages in positioning accuracy and production cost. This article therefore designs a visual navigation system that localizes autonomously with high accuracy, using QR (quick response) codes and an on-board camera as the visual navigation component of the AGV.

Note 1: Consider whether the above approaches can be borrowed for the unmanned apron scenario. Laying magnetic strips on the ground fixes the routes, while aircraft docking locations vary, so this method is clearly unsuitable for apron operation. Laser SLAM navigation is considered applicable only to indoor scenes and unable to build maps of outdoor scenes. The visual method is therefore the most suitable.

Note 2: The mainstream visual method is to paste QR codes on the ground, here on the apron surface, to achieve autonomous navigation of the support vehicles. The QR codes on the apron store the coordinate information of each location: each QR code label contains its code information and the (x, y) coordinates of its point. When the support vehicle passes over a QR code, the camera quickly reads the information of that point to localize the vehicle and correct its path, so that the self-driving support vehicle is guided automatically along the predetermined path. The technical details of this method still need to be studied in depth.
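
As a rough illustration of how a decoded QR label could be used for correction, the sketch below parses a payload and computes the difference between the vehicle's own position estimate and the coordinates stored in the label. The payload layout `ID:x,y` and both function names are assumptions made for this example; the article does not specify the encoding.

```python
def parse_tag(payload):
    """Split an assumed 'ID:x,y' payload into (tag_id, x, y)."""
    tag_id, coords = payload.split(":")
    x_str, y_str = coords.split(",")
    return tag_id, float(x_str), float(y_str)


def position_correction(estimated_xy, tag_xy):
    """Offset to add to the vehicle's estimate so it matches the tag position.

    Assumes the vehicle is directly over the tag when the code is read.
    """
    return (tag_xy[0] - estimated_xy[0], tag_xy[1] - estimated_xy[1])


tag_id, x, y = parse_tag("A17:12.5,40.0")
dx, dy = position_correction((12.0, 40.5), (x, y))
print(tag_id, dx, dy)
```

A real system would also need the tag's position and orientation in the image to correct heading, which is left out here.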

Recent Gains from reading research papers[May 5th]

Fri, 05 May 2023 00:00:00 +0000

[1]Zeng Tuocheng,Wang Jiajun,Wang Xiaoling et al. Research on improved multi-objective multi-visual unloading recognition model for dam transporters under large scene video monitoring[J/OL]. Journal of Water Resources:1-12[2023-05-05]. https://doi.org/10.13243/j.cnki.slxb.20220910.

The paper addresses object detection in a large-range, large-scale application scenario. Analyzing the motion state of transport vehicles with GNSS alone suffers from high equipment cost and low recognition accuracy, and the paper solves this problem with vision technology. It adopts ByteTrack to detect and track multiple transport vehicles in large-scene surveillance video and record their driving trajectories, and combines it with HRNet, a key point detection network.

Note 1: ByteTrack is a multi-object tracking algorithm originally developed and evaluated mainly on pedestrian tracking. The authors migrate it to vehicles to track multiple transport vehicles in surveillance video in real time and record their driving trajectories, which also yields more accurate driving speeds. This is worth learning from.

Note 2: The point most worth learning from in this paper is the combination of ByteTrack and HRNet: the driving trajectory and the key point detection results are combined to determine whether a transport vehicle is moving forward, stopped, or moving backward, as follows:

Equation of state

Where 1 means backward, 0 means stop, and -1 means forward. a_i denotes the angle between the direction of travel of the i-th transport vehicle and the direction from that vehicle's front key point to its rear key point (when either the front or the rear key point is missing, the transport vehicle is treated as a rigid body and the center point of the tracking box is used instead).
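
Since the equation itself is not reproduced here, the following is only a hedged sketch of how such a state could be computed from the definitions above. The 90 degree split between forward and backward and the speed threshold for "stop" are assumptions of this example, not values from the paper, and `travel_state` is a hypothetical name.

```python
import math


def travel_state(travel_dir, front_kp, rear_kp, speed,
                 speed_eps=0.1, angle_split_deg=90.0):
    """Return 1 (backward), 0 (stop) or -1 (forward).

    travel_dir: (dx, dy) direction of travel taken from the tracked trajectory.
    front_kp, rear_kp: (x, y) key points at the front and rear of the vehicle.
    speed_eps and angle_split_deg are assumed thresholds.
    """
    if speed < speed_eps:
        return 0
    # Vector from the front key point to the rear key point.
    axis = (rear_kp[0] - front_kp[0], rear_kp[1] - front_kp[1])
    norm = math.hypot(*travel_dir) * math.hypot(*axis)
    if norm == 0:
        raise ValueError("travel direction and key point axis must be non-zero")
    cos_a = (travel_dir[0] * axis[0] + travel_dir[1] * axis[1]) / norm
    a_i = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    # Travelling along the front-to-rear direction means the vehicle is reversing.
    return 1 if a_i < angle_split_deg else -1
```

With the front key point ahead of the rear key point along the travel direction, the angle is close to 180 degrees and the state is -1 (forward).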

Note 3: To balance the 1080p resolution of large-scene surveillance video against the computation speed of the multiple visual recognition algorithms, a 960×544 image is used as the network input and decoupled heads at 3 scales are set at the network output to produce the anchor box information, which preserves accuracy while improving efficiency.
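
The arithmetic behind the 960×544 input can be checked with a short sketch. It assumes the 1920×1080 frame is scaled with its aspect ratio preserved and then padded, and that the three output scales use strides of 8, 16 and 32, which is common for this family of detectors but is not stated in the note; the function names are hypothetical.

```python
def letterbox_shape(src_w, src_h, dst_w, dst_h):
    """Return (scale, resized (w, h), padding (w, h)) for an aspect-preserving fit."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    return scale, (new_w, new_h), (dst_w - new_w, dst_h - new_h)


def grid_sizes(dst_w, dst_h, strides=(8, 16, 32)):
    """Feature map sizes at each assumed stride."""
    return [(dst_w // s, dst_h // s) for s in strides]


print(letterbox_shape(1920, 1080, 960, 544))
print(grid_sizes(960, 544))
```

Halving a 1080p frame gives 960×540, and 4 rows of padding bring the height to 544, which is divisible by 32 so that every assumed stride yields a whole-number grid.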

Recent Gains from reading research papers[May 3rd]

[1]Li M, Liu S, Zhou H. SGS-SLAM: Semantic Gaussian Splatting For Neural Dense SLAM[J]. arXiv preprint arXiv:2402.03246, 2024.

Recent Gains from reading research papers[Apr 26th]

[1]Yan C, Qu D, Wang D, et al. Gs-slam: Dense visual slam with 3d gaussian splatting[J]. arXiv preprint arXiv:2311.11700, 2023.

Recent Gains from reading research papers[Apr 19th]

[1] Yan G, Pi J, Guo J, et al. OASim: an Open and Adaptive Simulator based on Neural Rendering for Autonomous Driving[J]. arXiv preprint arXiv:2402.03830, 2024.

Recent Gains from reading research papers[Apr 12th]

[1] Zhu Z, Chen Y, Wu Z, et al. Latitude: Robotic global localization with truncated dynamic low-pass filter in city-scale nerf[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 8326-8332.

Recent Gains from reading research papers[Apr 5th]

[1] C. Jiang et al., “H2-Mapping: Real-Time Dense Mapping Using Hierarchical Hybrid Representation,” in IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6787-6794, Oct. 2023, doi: 10.1109/LRA.2023.3313051.

Recent Gains from reading research papers[Mar 29th]

[1] Charatan D, Li S, Tagliasacchi A, et al. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction[J]. arXiv preprint arXiv:2312.12337, 2023.

3D Reconstruction of Chengde Airport, China

Recent Gains from reading research papers[Mar 22nd]

[1] Zhou X, Lin Z, Shan X, et al. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes[J]. arXiv preprint arXiv:2312.07920, 2023.

3D Reconstruction of TB-20 aircraft using 3D Gaussian Splatting and Unreal Engine 5

Recent Gains from reading research papers[Mar 15th]

[1] Yan Y, Lin H, Zhou C, et al. Street gaussians for modeling dynamic urban scenes[J]. arXiv preprint arXiv:2401.01339, 2024.

Recent Gains from reading research papers[Mar 8th]

[1] Chen Y, Gu C, Jiang J, et al. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering[J]. arXiv preprint arXiv:2311.18561, 2023.

Recent Gains from reading research papers[Mar 1st]

[1] Liu Y, Zhang K, Li Y, et al. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models[J]. arXiv preprint arXiv:2402.17177, 2024.

Recent Gains from reading research papers[Feb 23rd]

[1] Tian R, Zhang Y, Feng Y, et al. Accurate and robust object SLAM with 3D quadric landmark reconstruction in outdoors[J]. IEEE Robotics and Automation Letters, 2021, 7(2): 1534-1541.

Recent Gains from reading research papers[Jan 26th]

[1] Nair G B, Daga S, Sajnani R, et al. Multi-object monocular SLAM for dynamic environments[C]//2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020: 651-657.

Recent Gains from reading research papers[Jan 12th]

[1] Doherty K, Fourie D, Leonard J. Multimodal semantic slam with probabilistic data association[C]//2019 international conference on robotics and automation (ICRA). IEEE, 2019: 2419-2425.

Recent Gains from reading research papers[Jan 5th]

[1] Zhou J, Elksnis A, Fu Z, et al. MultiMap3D: A Multi-Level Semantic Perceptual Map Construction Based on SLAM and Point Cloud Detection[C]//2023 28th International Conference on Automation and Computing (ICAC). IEEE, 2023: 1-6.

Recent Gains from reading research papers[Dec 22nd]

[1] Hosseinzadeh M, Li K, Latif Y, et al. Real-time monocular object-model aware sparse SLAM[C]//2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019: 7123-7129.

Recent Gains from reading research papers[Dec 14th]

[1] Moreno F M, Guindel C, Armingol J M, et al. Study of the Effect of Exploiting 3D Semantic Segmentation in LiDAR Odometry[J]. Applied Sciences, 2020, 10(16): 5657.

Recent Gains from reading research papers[Dec 7th]

[1] Rosinol A, Leonard J J, Carlone L. Nerf-slam: Real-time dense monocular slam with neural radiance fields[J]. arXiv preprint arXiv:2210.13641, 2022.

Recent Gains from reading research papers[Dec 1st]

[1] Li R, Li S J, Chen X, et al. TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation[J]. arXiv preprint arXiv:2309.07849, 2023.

Recent Gains from reading research papers[Nov 24th]

[1] Li P, Ding S, Chen X, et al. PowerBEV: A Powerful Yet Lightweight Framework for Instance Prediction in Bird’s-Eye View[J]. arXiv preprint arXiv:2306.10761, 2023.

Recent Gains from reading research papers[Nov 17th]

[1] Zhang H, Xie C, Toriya H, et al. Vehicle Localization in a Completed City-Scale 3D Scene Using Aerial Images and an On-Board Stereo Camera[J]. Remote Sensing, 2023, 15(15): 3871.

Recent Gains from reading research papers[Nov 10th]

[1] Yang S, Scherer S. CubeSLAM: Monocular 3D object detection and SLAM without prior models[J]. arXiv preprint arXiv:1806.00557, 2018.

Recent Gains from reading research papers[Nov 3rd]

[1] Hariya, Keigo, Hiroki Inoshita, Ryo Yanase, Keisuke Yoneda, and Naoki Suganuma. 2023. “ExistenceMap-PointPillars: A Multifusion Network for Robust 3D Object Detection with Object Existence Probability Map” Sensors 23, no. 20: 8367. https://doi.org/10.3390/s23208367

Recent Gains from reading research papers[Oct 27th]

[1]K. Yoneda, N. Ichihara, H. Kawanishi, T. Okuno, L. Cao and N. Suganuma, “Sun-Glare region recognition using Visual explanations for Traffic light detection,” 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 2021, pp. 1464-1469, doi: 10.1109/IV48863.2021.9575631.

Recent Gains from reading research papers[Oct 20th]

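[1] Wu Fei, Jin Shengjie, Lin Xiaochen. Grasp pose estimation method based on Mask R-CNN and keypoint extraction[J]. Journal of Hefei University of Technology (Natural Science), 2023, 46(09): 1178-1184.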

Recent Gains from reading research papers[Oct 13th]

[1] Miyama M. Robust inference of multi-task convolutional neural network for advanced driving assistance by embedding coordinates[C]//Proceedings of the 8th World Congress on Electrical Engineering and Computer Systems and Science, EECSS. 2022: 105-1.

Recent Gains from reading research papers[Sep 29th]

[1]Rozsa Z, Sziranyi T. Object detection from a few LIDAR scanning planes[J]. IEEE Transactions on Intelligent Vehicles, 2019, 4(4): 548-560.

Recent Gains from reading research papers[Sep 22nd]

[1]Duffhauss F, Baur S A. PillarFlowNet: A real-time deep multitask network for LiDAR-based 3D object detection and scene flow estimation[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 10734-10741.

Recent Gains from reading research papers[Sep 15th]

[1]Rozsa Z, Sziranyi T. Optical Flow and Expansion Based Deep Temporal Up-Sampling of LIDAR Point Clouds[J]. Remote Sensing, 2023, 15(10): 2487.

Recent Gains from reading research papers[Sep 8th]

[1] Rózsa Z, Golarits M, Szirányi T. Water Hazard Depth Estimation for Safe Navigation of Intelligent Vehicles[C]//VEHITS. 2021: 90-99.

Recent Gains from reading research papers[Sep 1st]

[1]Steininger D, Kriegler A, Pointner W, et al. Towards Scene Understanding for Autonomous Operations on Airport Aprons[C]//Proceedings of the Asian Conference on Computer Vision. 2022: 147-163.

Recent Gains from reading research papers[July 20th]

[1]Fang Z, López A M. Intention recognition of pedestrians and cyclists by 2d pose estimation[J]. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(11): 4773-4783.

Recent Gains from reading research papers[July 13th]

[1]Donadio, F., Frejaville, J., Larnier, S., & Vetault, S. (2016, October). Human-robot collaboration to perform aircraft inspection in working environment. In Proceedings of 5th International conference on Machine Control and Guidance (MCG).

Recent Gains from reading research papers[July 6th]

[1]Xu J, Ding M, Zhang Z Z, et al. Vision-Based Automatic Collection of Nodes of In/Off Block and Docking/Undocking in Aircraft Turnaround[J]. Applied Sciences, 2023, 13(13): 7832.

Recent Gains from reading research papers[June 30th]

[1]Zhang Y, Zhu L, Feng W, et al. Vil-100: A new dataset and a baseline model for video instance lane detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 15681-15690.

Recent Gains from reading research papers[June 23rd]

[1]Furletov Y, Willert V, Adamy J. Auditory scene understanding for autonomous driving[C]//2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021: 697-702.

Recent Gains from reading research papers[June 16th]

[1]Wang S, Che Y, Zhao H, et al. Accurate tracking, collision detection, and optimal scheduling of airport ground support equipment[J]. IEEE Internet of Things Journal, 2020, 8(1): 572-584.

Recent Gains from reading research papers[June 9th]

[1]Elrayes A, Ali M H, Zakaria A, et al. Smart airport foreign object debris detection rover using LiDAR technology[J]. Internet of Things, 2019, 5: 1-11.

Recent Gains from reading research papers[June 2nd]

[1]Lobo M J, Hurter C, Cousy M. A LIDAR interactive data visualization for ground aircraft detection at small airports[C]//SID 2019, 9th SESAR Innovation Days. 2019.

Recent Gains from reading research papers[May 26th]