Multi-sensor fusion combines sensor data from multiple sources to estimate the state of the environment. Common applications include automated manufacturing, autonomous navigation, target detection and tracking, environment perception, and biometrics. Among these, object detection and tracking is central to robotics and computer vision, with applications in diverse areas such as video surveillance, person following, and autonomous navigation. In purely two-dimensional (2-D) camera-based tracking, erratic object motion, scene changes, and occlusions, along with noise and illumination changes, are impediments to successful tracking. Integrating information from range sensors with cameras helps alleviate some of these issues. This dissertation explores novel methods for a sensor fusion framework that combines depth information from radar, infrared, and Kinect sensors with an RGB camera to improve object detection and tracking accuracy.
In indoor robotics applications, the use of infrared sensors has mostly been limited to proximity sensing for obstacle avoidance. The first part of the dissertation extends these low-cost but extremely fast infrared sensors to tasks such as identifying a person's direction of motion and fusing the sparse range data they provide with a camera to develop a low-cost, efficient indoor tracking sensor system. A linear infrared array network classifies the direction of motion of a human being: a histogram-based iterative clustering algorithm segments the data into clusters, and features extracted from these clusters are fed to a classification algorithm. To handle an object that behaves unpredictably - making abrupt turns or stopping while moving along an irregular, wavy track, as when a personal robot assistant follows a shopper in a store, a tourist in a museum, or a child at play - the use of an adaptive motion model has been proposed to maintain the track. Because it delivers discrete data at a fast processing rate, an array of infrared sensors can be advantageous over a depth camera in such settings.
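The histogram-based clustering step described above can be illustrated with a minimal sketch: range readings from the infrared array are histogrammed along the range axis, and runs of adjacent populated bins are grouped into clusters. The bin width and maximum range used here are assumed values for illustration, not the dissertation's actual parameters, and the iterative refinement and feature extraction stages are omitted.

```python
import numpy as np

def histogram_clusters(ranges, bin_width=0.2, max_range=4.0):
    """Segment sparse IR range readings into clusters by histogramming
    the range axis and grouping adjacent populated bins.
    bin_width and max_range are illustrative, assumed values."""
    edges = np.arange(0.0, max_range + bin_width, bin_width)
    counts, edges = np.histogram(ranges, bins=edges)
    clusters, run = [], []
    for i, c in enumerate(counts):
        if c > 0:
            run.append(i)          # extend the current run of populated bins
        elif run:
            lo, hi = edges[run[0]], edges[run[-1] + 1]
            clusters.append(ranges[(ranges >= lo) & (ranges < hi)])
            run = []
    if run:                        # close a run that reaches max_range
        lo, hi = edges[run[0]], edges[run[-1] + 1]
        clusters.append(ranges[(ranges >= lo) & (ranges < hi)])
    return clusters
```

Features such as each cluster's centroid and width, tracked over successive scans, could then feed the direction-of-motion classifier.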
Research on 3-D tracking has proliferated in the last decade with the advent of low-cost Kinect sensors. Prior work on depth-based tracking with the Kinect focuses mostly on depth-based extraction of objects to aid tracking. The next part of the dissertation addresses object tracking in the x-z domain using a Kinect sensor, with an emphasis on occlusion handling. Particles used for tracking are propagated with a motion model in the horizontal-depth framework, and observations are obtained by extracting objects within a suitable depth range. Each particle, represented by a patch extracted in the x-z domain, is associated with the closest-matching observation under a likelihood model; a majority vote then selects a final observation, against which the particles are reweighted and a final estimate is made. An occluder tracking system has also been developed: it uses a part-based association of the partially visible occluded object with the whole object seen prior to occlusion, helping to keep track of the object as it recovers from occlusion.
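The propagate-reweight-estimate cycle described above can be sketched in a simplified form. This sketch assumes a constant-velocity motion model in the x-z plane and a Gaussian position likelihood against a single associated observation; the state layout, noise levels, and likelihood are assumptions, and the patch-based appearance matching and majority voting of the dissertation are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(particles, dt=1.0, noise_std=0.05):
    """Constant-velocity motion model in the horizontal-depth (x-z) plane.
    Each particle is [x, z, vx, vz]; noise_std is an assumed value."""
    particles = particles.copy()
    particles[:, 0] += particles[:, 2] * dt
    particles[:, 1] += particles[:, 3] * dt
    particles[:, :2] += rng.normal(0.0, noise_std, size=(len(particles), 2))
    return particles

def reweight(particles, observation, sigma=0.2):
    """Gaussian likelihood of each particle's (x, z) position given the
    associated observation; weights are normalized to sum to one."""
    d2 = np.sum((particles[:, :2] - observation) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()

def estimate(particles, weights):
    """Weighted-mean (x, z) state estimate."""
    return weights @ particles[:, :2]
```

A full tracker would also resample the particle set and, per the dissertation, match particles to depth-extracted observations before reweighting.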
The latter part of the dissertation addresses a classical data association problem: discrete range data from a depth sensor must be associated with 2-D objects detected by a camera. A vision sensor locates objects only in the 2-D image plane, and estimating distance with a single vision sensor has limitations. A radar sensor returns object ranges accurately; however, it does not indicate which range corresponds to which object. A sensor fusion approach for radar-vision integration has been proposed that uses a modified Hungarian algorithm with geometric constraints to associate data from a simulated radar with 2-D information from an image, establishing the three-dimensional (3-D) positions of vehicles around an ego vehicle on a highway. This information would help an autonomous vehicle maneuver safely.
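The association step can be illustrated with a minimal sketch built on the standard Hungarian (optimal assignment) algorithm. Here the geometric constraint is a simple bearing gate: camera detections are assumed to provide a bearing angle (recoverable from the image column via camera intrinsics), radar returns provide (range, azimuth) pairs, and pairs whose bearing mismatch exceeds a threshold are assigned an infeasible cost. The cost definition and gate are illustrative assumptions; the dissertation's modified algorithm and constraints may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(camera_bearings, radar_detections, max_bearing_err=0.1):
    """Match radar returns to camera detections by bearing agreement.

    camera_bearings: (N,) bearing of each 2-D detection in radians.
    radar_detections: (M, 2) array of (range, azimuth) returns.
    Returns a list of (camera_index, radar_index) pairs."""
    # Cost = absolute bearing mismatch between each detection pair.
    cost = np.abs(camera_bearings[:, None] - radar_detections[None, :, 1])
    cost[cost > max_bearing_err] = 1e6   # geometric gate: infeasible pairs
    rows, cols = linear_sum_assignment(cost)
    # Discard assignments that only exist because the matrix forced them.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e6]
```

Each surviving pair ties an image detection to a radar range, from which a 3-D vehicle position can be computed.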