Photogrammetry is the science of producing reliable measurements of surface geometry from photographs. This science is closely related to several important problems in computer vision such as:
- Estimating the relative positions of cameras, given the images they produce
- Finding regions across pictures that correspond to each other
- Computing the most consistent geometry for the scene, given the pictures
- Estimating the scale of the scene
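The second step, finding corresponding regions across pictures, can be illustrated with a deliberately tiny sketch: slide a patch along an image row and keep the offset with the lowest sum of squared differences. Real pipelines match feature descriptors in two dimensions, but the principle is the same (the rows, patch and offsets below are made-up toy data):

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_match(patch, row):
    """Slide `patch` over `row` and return the offset with the lowest SSD."""
    scores = [ssd(patch, row[i:i + len(patch)])
              for i in range(len(row) - len(patch) + 1)]
    return min(range(len(scores)), key=scores.__getitem__)

# Two 1D "image rows" of the same scene, the second shifted by 2 pixels.
left_row  = [0, 0, 1, 3, 7, 9, 4, 2, 0, 0, 0]
right_row = [0, 0, 0, 0, 1, 3, 7, 9, 4, 2, 0]
patch = left_row[3:7]          # [3, 7, 9, 4], cut at position 3
offset = best_match(patch, right_row)
disparity = offset - 3         # how far the patch moved between the rows
```

The disparity recovered this way is exactly the "apparent motion" that the geometry steps then turn into depth.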
Photogrammetry is commonly used in the VFX and video games industries to import real-world objects into virtual environments with a high level of detail. It replaces the manual labour of re-creating what already exists, freeing up time for design work. In the same way, photogrammetry may be a viable option for engineers to capture the geometry of an existing site with a regular camera, instead of expensive 3D scanning equipment.
In many real-world cases, using an expensive scanning station is not even considered as an option. Our experience is that initial on-site surveys may often have been done with a simple measuring tape and a few notes scribbled down on paper. It takes a wealth of experience, qualified guesses and time to convert such input into a workable representation of the state of the world, against which a design can be made. Whether photogrammetry is indeed the best option will typically depend on the requirements, as well as how well each of the four steps outlined above can be solved for a particular combination of cameras, viewpoints, scene structure, and illumination conditions. In some cases the complexity of the measurement problem leaves a total-station scanning solution as the only viable choice. Other times the geometry is so well explained by a length, width, and height, that more sophisticated measurements just aren’t worth the effort.
At Univrses, we have leveraged components from our 3DAI™ Engine to deliver photogrammetry solutions in applications ranging from automated sizing of goods to automated inspection and quality assurance. As an example of what different trade-offs imply when performing photogrammetry, let’s take a more complex example than the box.
Suppose this is the geometry we wish to analyze – the engine of a good old Volvo S40 with some miles on it. Maybe for archival purposes in an all-electric future, maybe to figure out how many potatoes could fit in there to bake during a road trip. This is just an example, let’s not get too carried away with the motivations behind it.
ONLINE PHOTOGRAMMETRY METHODS
For online scenarios in which one would like to have measurement results immediately, as images are being recorded, some form of SLAM system is likely to be required. SLAM, or ”Simultaneous Localization And Mapping”, is an active research topic in Robotics, Autonomous Driving and Computer Vision, and is a core component of the Univrses 3DAI™ Engine.
SLAM is a technology that simplifies the problem by making it incremental in nature. Instead of looking at all images used as input, SLAM makes it possible, in a sense, to only analyze the last one and incrementally update the virtual 3D representation of the world based on the latest information.
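A caricature of this incremental idea in a few lines of Python (purely illustrative: real SLAM also estimates camera poses, handles data association and closes loops). Each incoming frame only touches the landmarks it actually observes, fusing the new observation into a running average instead of re-processing all past frames:

```python
class IncrementalMap:
    """Toy incremental map: fuse each new observation of a landmark
    into a running average instead of re-processing all past frames."""

    def __init__(self):
        self.points = {}  # landmark id -> (mean position, observation count)

    def update(self, frame):
        """`frame` maps landmark ids to observed (x, y, z) positions."""
        for lid, obs in frame.items():
            if lid not in self.points:
                self.points[lid] = (obs, 1)
            else:
                mean, n = self.points[lid]
                new_mean = tuple((m * n + o) / (n + 1)
                                 for m, o in zip(mean, obs))
                self.points[lid] = (new_mean, n + 1)

world = IncrementalMap()
world.update({"corner": (1.0, 0.0, 2.0)})   # frame 1
world.update({"corner": (1.2, 0.0, 2.2)})   # frame 2: noisy re-observation
mean, count = world.points["corner"]        # averaged estimate after 2 frames
```

The cost of processing a frame stays constant regardless of how many frames came before it, which is exactly what makes the approach viable in real time.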
Online systems typically have to make trade-offs to enable real-time execution. Compromises may include using lower resolution images, extracting a limited number of geometric points, performing fewer comparisons between the new and past data, and permanently baking past estimates into the model.
Notice that the uniformly colored, smooth surfaces of the car’s engine have been lost. This occurs because pixels without texture are not unique, and finding their exact correspondences in other images is hard to do reliably.
- Results are available immediately, making feedback possible during data collection
- Low computational cost
- Can be used on mobile devices
- Limited processing per frame – compromise between quality and run-time performance
OFFLINE PHOTOGRAMMETRY METHODS
Given additional time and computational resources, more complex algorithms can be used. In the search for the best 3D surface and camera positions, one can consider all the pictures of the scene simultaneously, producing much denser surface representations. The offline approach also enables higher-resolution images and more expensive filters to deal with noise. If there is sufficient structure in the scene, and there are enough images, the parameters describing the optics of the cameras used to collect the images can also be estimated, driven by data. Since images can be paired in any given order, the offline method also enables multiple users to collaboratively collect data or collect data in multiple sessions, without worrying that the underlying SLAM system will lose track of the ”current” position and orientation of the camera.
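As a toy contrast with the incremental approach, here is a hypothetical sketch of solving for a single 3D point using all observations at once. We assume idealized pinhole cameras with focal length 1, all looking down the z-axis and translated along the x-axis, so an observation (u, v) from a camera at x = cx contributes two linear equations, x − u·z = cx and y − v·z = 0; stacking them and solving the normal equations gives the point most consistent with every image simultaneously (full bundle adjustment also refines the cameras and uses robust, non-linear optimization):

```python
def triangulate(cameras, observations):
    """Least-squares solve the stacked equations
        x - u*z = cx   and   y - v*z = 0
    contributed by every camera cx with observation (u, v)."""
    # Build normal equations A^T A p = A^T b for p = (x, y, z).
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for cx, (u, v) in zip(cameras, observations):
        for row, rhs in ([1.0, 0.0, -u], cx), ([0.0, 1.0, -v], 0.0):
            for i in range(3):
                atb[i] += row[i] * rhs
                for j in range(3):
                    ata[i][j] += row[i] * row[j]

    # Solve the 3x3 system with Cramer's rule.
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(ata)
    sol = []
    for k in range(3):
        mk = [row[:] for row in ata]
        for i in range(3):
            mk[i][k] = atb[i]
        sol.append(det(mk) / d)
    return sol

# Three cameras at x = -1, 0, 1 observing the true point (0.5, 0.2, 3.0).
cams = [-1.0, 0.0, 1.0]
obs = [((0.5 - cx) / 3.0, 0.2 / 3.0) for cx in cams]
x, y, z = triangulate(cams, obs)   # recovers the point from all views at once
```

Because every observation enters one joint system, adding more images keeps tightening the estimate instead of being baked in and forgotten.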
Now that we’ve taken the time to interpolate between all available measurements, applied some filters to reduce the amount of noise, and reasoned about what can be seen from where, we can infer which surface is really behind which. This takes quite some time, and in spite of these efforts we notice, for example, that the air inlet manifold (the five tubes connected to a part with ”VOLVO” printed on it) is still somewhat flattened out and pitted. This is likely due to linear interpolation being used as a guess for what lies between the raw measurements. Although this assumption works quite well to fill in flat panels and to draw sharp contours at discontinuities, it agrees poorly with the rounded shape of the inlets.
One way of overcoming the limitations of incomplete data is to learn the most likely shape, given the data. While this topic is somewhat beyond the scope of this introduction, interested readers are welcome to read our published paper on solving this problem using deep learning.
At Univrses we have developed distributed 3D mapping systems that can split the workload between edge and cloud computing. In some applications it makes sense to leverage the best of both worlds when it comes to online and offline processing.
- Sacrifice run-time performance for maximum quality
- Off-the-shelf commercial and open-source solutions exist
- Long processing time
- Diminishing returns for additional computation
Photogrammetry pipelines using monocular cameras are mostly based on the ”structure from motion” principle, i.e. that objects at different distances exhibit different apparent motion when one’s viewpoint changes. This principle is more reliable than methods that rely on prior assumptions about the environment, such as the direction of incident light or material properties. Reliable as it is, however, the principle leaves one thing undetermined: the 3D models produced by monocular photogrammetry systems have no sense of absolute scale.
In order to make measurements in actual meters rather than in arbitrary units, the absolute scale needs to be determined somehow. Coping with this inherent scale ambiguity is a challenge for which additional information has to be used. Some of this information is the experience we carry with us. After all, even with one eye closed, humans know that an average adult is typically between 1.5 and 2m tall, and we can understand an observed change in size as being due to distance and perspective. These general hunches are quite robust and tend to work well regardless of whether a person is sitting, standing, walking or lying down. We apparently have such mental models for most things in our natural environment. Mimicking this ability, we can use general statistical models trained with deep learning to give a corresponding ”hunch” for how far away a pixel is likely to be, given the patterns recognized around it. In more specific domains, such as an industrial plant, where the scale of things just doesn’t really make sense to the average person, we may need additional hints.
We can use domain-specific knowledge such as ”all pipes in this water treatment facility have a diameter of 80mm”, ”the camera is mounted at 1.5m above the ground plane”, and similar facts, to obtain a metric representation of the world. Lastly, an additional sensor, such as an accelerometer or wheel encoder, is often a valuable source of scale information in real-world systems.
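As a sketch of how such a hint fixes the scale (the numbers below are made up): if a pipe measures 0.04 units across in the reconstruction, and domain knowledge says real pipes here are 80 mm in diameter, a single scale factor converts every model coordinate to metres:

```python
def metric_scale(model_measurement, known_metric_size):
    """Scale factor that maps arbitrary model units to metric units."""
    return known_metric_size / model_measurement

# The pipe diameter measures 0.04 units in the reconstruction,
# but domain knowledge says all pipes here are 80 mm (0.08 m) across.
scale = metric_scale(0.04, 0.08)   # metres per model unit

# The same factor applies to every reconstructed point.
model_points = [(0.1, 0.2, 0.5), (1.0, 0.0, 0.25)]
metric_points = [tuple(scale * c for c in p) for p in model_points]
```

One reliable known dimension anywhere in the scene is enough, because the ambiguity is a single global factor, not a per-point unknown.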
The need to infer scale from information present in the scene can be avoided by using sensors that directly output a depth for each pixel in the images they produce. These sensors rely on different principles, for example:
- STEREO: uses a pair of cameras, a known distance apart, that capture images simultaneously; identifiable features are triangulated across the two images and related to the known separation between the cameras to recover metric scale.
- TIME OF FLIGHT: uses the known speed of light to determine distance by the time (or phase shift) of arrival of an emitted pulse of light as it returns to the sensor.
- STRUCTURED LIGHT: uses the known appearance of specific illumination patterns at different distances to infer at which distance the pattern is being seen.
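The geometry behind the first two principles fits in a few lines (idealized, ignoring noise, calibration and modulation schemes; the numbers are made up): a rectified stereo pair gives depth from disparity, and a time-of-flight sensor gives distance from the round-trip time of a light pulse:

```python
C = 299_792_458.0  # speed of light, m/s

def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth of a feature seen in a rectified stereo pair:
    Z = f * B / d (focal length in pixels, baseline in metres)."""
    return focal_px * baseline_m / disparity_px

def tof_depth(round_trip_s):
    """Distance to a surface from the round-trip time of a light pulse:
    the pulse travels out and back, so divide by two."""
    return C * round_trip_s / 2.0

z = stereo_depth(700.0, 0.12, 42.0)   # 700 * 0.12 / 42 = 2.0 m
d = tof_depth(20e-9)                  # a 20 ns round trip is roughly 3 m
```

The formulas also hint at the failure modes listed below: small disparities (far objects) and short round-trip times (near objects) both push the measurement into the sensor’s noise floor.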
While depth sensors are constantly improving, issues such as low image resolution, large minimum range, low maximum range, interference with external light sources and noisy output may be limiting factors depending on the target application.
A depth sensor alone helps with part of the problem, but it still has a limited field of view. The sensor is also not aware of its own position and orientation in space, and does not automatically register all of its measurements into a consistent frame of reference. Two components of our 3DAI™ Engine deal specifically with pose estimation (3DAI™ Odometry) and with surface reconstruction and meshing (3DAI™ Reconstruction).
We hope this introduction has sparked your imagination as to the breadth of different solutions that can be achieved using photogrammetry. While it is still a science, with an active research community driving development forward, in many respects it is already a mature technology.
If you have a product or an industrial application where knowing the shape of the world around you is a means to reach your goals, get in touch!