From 2D Boxes to 3D Frustums: Simplifying Point Cloud Labeling for Object Detection | by Carlos Argueta | May 2023

From 2D detections to 3D box annotations

Labeling point clouds for object detection is a daunting task. The data is massive, and its 3D nature makes it harder for an annotator to find the objects of interest than when labeling images. To make point cloud labeling easier, I explored and developed a simple algorithm that takes advantage of the maturity of image object detection to simplify labeling 3D objects. The idea is that for a dataset consisting of image-point cloud pairs, like the one obtained with my robot, we can apply object detection to the images and then create 3D frustums from the 2D bounding boxes.

From 2D bounding boxes to 3D frustums

Given a 2D bounding box and the camera's intrinsic and extrinsic parameters, we can backproject the corners of the box to get 3D points. These points define the vertices of a 3D frustum in real-world space. Essentially, each frustum can be thought of as the 3D field of view of a camera. The bounding box in the 2D image acts as the aperture of this camera, and the frustum encloses all the 3D points that the camera can potentially observe. In other words, the algorithm metaphorically places multiple virtual cameras in 3D space, each focused on a specific object identified by the 2D object detector.
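To make the backprojection concrete, here is a minimal sketch for a single pixel. It assumes a pinhole camera with intrinsic matrix K, extrinsics that map world to camera coordinates as X_cam = R @ X_world + t, and depth measured along the optical axis. The function name and the example values are illustrative, not taken from my repository.

```python
import numpy as np

def backproject_pixel(u, v, depth, K, R, t):
    """Return the world-frame 3D point seen at pixel (u, v) at the given depth.

    Assumes X_cam = R @ X_world + t (world-to-camera extrinsics) and depth
    measured along the camera's optical (z) axis.
    """
    pixel_h = np.array([u, v, 1.0])         # homogeneous pixel coordinates
    ray_cam = np.linalg.inv(K) @ pixel_h    # camera-frame ray with z = 1
    point_cam = depth * ray_cam             # scale the ray to the chosen depth
    return R.T @ (point_cam - t)            # undo the world-to-camera transform

# Hypothetical camera: 500 px focal length, principal point at (320, 240),
# camera frame aligned with the world frame.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)

center = backproject_pixel(320, 240, 2.0, K, R, t)
corner = backproject_pixel(420, 240, 2.0, K, R, t)
```

A pixel alone only defines a ray, so a depth has to be chosen before a 3D point comes out. That is why the frustum needs a minimum and a maximum depth.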

Let’s break it down:

2D object detection: I used YOLOv8 (You Only Look Once), a real-time object detection model, to detect objects in the image. Each detected object is represented by a 2D bounding box, which is used to generate a 3D frustum in the next step.

Object detection using YOLO on images

3D frustum generation: A frustum is the portion of a solid that lies between one or two parallel planes cutting it. Here we use the 2D bounding boxes to generate 3D frustums in the point cloud. Each frustum corresponds to an object detected in the 2D image and contains the relevant 3D points of the point cloud.
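As a rough sketch of this step, the eight vertices can be computed by scaling the four corner rays of the box to a near and a far depth. The function name, the (x1, y1, x2, y2) box layout, and the depth limits are assumptions for illustration. The extrinsics are assumed to follow X_cam = R @ X_world + t.

```python
import numpy as np

def frustum_vertices(bbox, K, R, t, z_min, z_max):
    """Eight world-frame vertices of the frustum behind a 2D box.

    bbox is (x1, y1, x2, y2) in pixels. Rows 0-3 of the result are the
    near face, rows 4-7 the far face, in the same corner order.
    """
    x1, y1, x2, y2 = bbox
    corners_h = np.array([[x1, y1, 1.0],
                          [x2, y1, 1.0],
                          [x2, y2, 1.0],
                          [x1, y2, 1.0]])
    rays_cam = corners_h @ np.linalg.inv(K).T   # one z = 1 ray per corner
    near = rays_cam * z_min                     # corners on the near plane
    far = rays_cam * z_max                      # corners on the far plane
    verts_cam = np.vstack([near, far])          # shape (8, 3), camera frame
    return (verts_cam - t) @ R                  # row-wise R.T @ (x - t)

# Hypothetical camera and detection box.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
verts = frustum_vertices((220, 140, 420, 340), K, np.eye(3), np.zeros(3),
                         z_min=1.0, z_max=10.0)
```

The depth limits control how much of the scene behind the box is kept. A tight range gives cleaner pieces but risks cutting off the object.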

The frustums: the cyan ones are cars and the others are people.

Point cloud segmentation: The algorithm then filters the points inside each frustum to create smaller pieces of the point cloud. Each piece contains the 3D points corresponding to an object detected in the 2D image.

A person in a frustum

Viewing and saving: The final step of the algorithm is to visualize these small chunks and optionally save them for further processing or labeling.
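For the saving part, one option is to write each piece as an ASCII PCD file, assuming your annotation tool accepts that format. The sketch below is a minimal writer using only NumPy. The function name, file name, and sample points are hypothetical, and a point cloud library could handle this step just as well.

```python
import os
import tempfile

import numpy as np

def save_pcd_ascii(path, points):
    """Write an (N, 3) array of XYZ points as an ASCII PCD v0.7 file."""
    points = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    header = [
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z",
        "SIZE 4 4 4",
        "TYPE F F F",
        "COUNT 1 1 1",
        f"WIDTH {len(points)}",
        "HEIGHT 1",                      # unorganized cloud: a single row
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {len(points)}",
        "DATA ascii",
    ]
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        for x, y, z in points:
            f.write(f"{x:.6f} {y:.6f} {z:.6f}\n")

# Hypothetical piece with two points, written to a temporary folder.
piece = np.array([[0.0, 0.0, 5.0], [0.5, -0.5, 5.0]])
out_path = os.path.join(tempfile.mkdtemp(), "person_0.pcd")
save_pcd_ascii(out_path, piece)

with open(out_path) as f:
    lines = f.read().splitlines()
```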

Cars in a frustum

Once saved, the filtered point clouds can be uploaded to an annotation tool for easy labeling.

Labeling point clouds with CVAT

The algorithm involves several techniques that may require a basic understanding of image processing and 3D geometry. Let’s go through them:

  1. 2D to 3D backprojection: Backprojection is a technique for converting 2D image points into 3D points, given a depth for each point and the camera parameters.
  2. 3D frustum generation: I used the corners of the 2D bounding box and the minimum and maximum depths to generate the eight vertices of the 3D frustum. Then I backprojected those vertices into 3D space using the camera intrinsics and the rotation and translation between camera and world coordinates.
  3. Point cloud filtering: To segment the point cloud, we need to identify the points that lie inside each frustum. For this, I used the Delaunay triangulation of the frustum's vertices, then checked whether each point of the point cloud falls inside the triangulation. The points that do are part of the frustum and are used to create the smaller point cloud pieces.
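The filtering step can be sketched with SciPy's Delaunay class, whose find_simplex method returns -1 for points outside the triangulation. The function name and the hand-written frustum and test points below are illustrative assumptions. In practice the vertices would come from the backprojected bounding box.

```python
import numpy as np
from scipy.spatial import Delaunay

def points_in_frustum(points, vertices):
    """Keep the rows of `points` that fall inside the hull of `vertices`."""
    triangulation = Delaunay(vertices)                # tetrahedra filling the frustum
    inside = triangulation.find_simplex(points) >= 0  # -1 means outside
    return points[inside]

# Hypothetical frustum in camera coordinates: near face at z = 1,
# far face at z = 10, widening linearly with depth.
vertices = np.array([
    [-0.2, -0.2, 1.0], [0.2, -0.2, 1.0], [0.2, 0.2, 1.0], [-0.2, 0.2, 1.0],
    [-2.0, -2.0, 10.0], [2.0, -2.0, 10.0], [2.0, 2.0, 10.0], [-2.0, 2.0, 10.0],
])

cloud = np.array([
    [0.0, 0.0, 5.0],    # inside
    [0.0, 0.0, 0.5],    # in front of the near plane
    [0.0, 0.0, 11.0],   # behind the far plane
    [1.5, 0.0, 5.0],    # too far to the side at this depth
    [0.5, -0.5, 5.0],   # inside
])

kept = points_in_frustum(cloud, vertices)
```

Because a frustum is convex, testing membership in its triangulation is the same as testing membership in its convex hull, so the eight vertices are all the check needs.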

To access the full code and point cloud examples, check out my repository:

For a full demo, check out the video up top.





