
Experiment 2.

This experiment evaluated the performance of the proposed voting approaches, as well as of the individual cues, with respect to three sensor-object configurations typically used in visual-servoing systems: i) static sensor/moving object (``stand-alone camera system''), ii) moving sensor/static object (``eye-in-hand camera'' servoing toward a static object), and iii) moving sensor/moving object (camera system on a mobile platform, or an eye-in-hand camera servoing toward a moving object). The results are presented as in the previous experiment, with respect to accuracy and reliability. The two fusion approaches, as well as the individual cues, were tested for their ability to cope with occlusions of the target and to regain tracking after the target has left the field of view for a number of frames. Results are presented for correlation, color and image differencing, since the intensity variation cue cannot be used alone for tracking.
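To make the accuracy figures below concrete, the following is a minimal sketch of how the per-frame pixel deviation from ground truth could be summarized; the function name and the exact definition of mse (here simply the mean Euclidean deviation) are illustrative assumptions, not taken from the paper.

    import numpy as np

    def deviation_stats(tracked, ground_truth):
        # tracked, ground_truth: (n_frames, 2) arrays of (x, y) target positions.
        # Returns the mean and standard deviation of the per-frame Euclidean
        # deviation in pixels (illustrative stand-ins for the mse/std columns).
        tracked = np.asarray(tracked, dtype=float)
        ground_truth = np.asarray(ground_truth, dtype=float)
        deviation = np.linalg.norm(tracked - ground_truth, axis=1)
        return deviation.mean(), deviation.std()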
Accuracy (Table [*]) - The best accuracy is achieved using the response fusion approach. Although the mse is similar for the action fusion approach in the static sensor/moving object and moving sensor/static object configurations, the std is higher. The reason for this is, as in the previous experiment, the choice of the underlying voting space. Comparing the performance of the fusion approaches with that of the individual cues shows the necessity for fusion. Image differences alone cannot be used in the moving sensor/static object and moving sensor/moving object configurations, since this cue cannot distinguish the object from the background. During most of the sequences the target undergoes 3D motion, which results in scale changes and rotations not modeled by SSD. These factors affect this cue significantly, resulting in the large error shown in the table. This problem may be alleviated by using a better model (see [9]). It can also be seen that the color cue performed best among the individual cues.
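As a rough illustration of the difference between the two schemes, the sketch below contrasts response fusion (cues vote in image space and the fused response map is maximized) with action fusion (cues vote for discrete motion commands); the cue weights, response maps and action set are placeholders and do not reproduce the paper's exact voting spaces.

    import numpy as np

    def response_fusion(response_maps, weights):
        # Weighted superposition of per-cue response maps over the window of
        # attention; the fused target estimate is the location of the maximum.
        fused = sum(w * r for w, r in zip(weights, response_maps))
        y, x = np.unravel_index(np.argmax(fused), fused.shape)
        return x, y

    def action_fusion(action_votes, weights, actions):
        # Each cue distributes votes over a discrete set of actions
        # (e.g. shift the window left/right/up/down); the action with the
        # highest weighted vote is executed.
        totals = np.zeros(len(actions))
        for w, votes in zip(weights, action_votes):
            totals += w * np.asarray(votes, dtype=float)
        return actions[int(np.argmax(totals))]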

In the moving sensor/static object case, once tracking is initialized the color cue ``sticks'' to the object throughout the sequence and, since the background varies little, the best accuracy is achieved compared to the other configurations. In the other two configurations the background changes and at times contains the same color as the target. This distracts the color tracker, resulting in increased error. The error is larger for static sensor/moving object than for moving sensor/moving object since, in those test sequences, the background included the target's color more often.
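For reference, a color cue of the kind described above can be realized with histogram back-projection; the sketch below uses OpenCV, and the hue-saturation parameterization and bin counts are assumptions rather than the paper's exact settings.

    import cv2

    def target_histogram(target_patch_bgr, bins=(30, 32)):
        # Hue-saturation histogram of the initial target patch, normalized so
        # that back-projection values fall in [0, 255].
        hsv = cv2.cvtColor(target_patch_bgr, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return hist

    def color_response(frame_bgr, target_hist):
        # Back-project the target histogram onto the current frame, giving a
        # per-pixel likelihood map that peaks on target-colored regions; a
        # similarly colored background produces competing peaks, which is what
        # distracts the tracker in the experiments above.
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        return cv2.calcBackProject([hsv], [0, 1], target_hist, [0, 180, 0, 256], 1)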

Table: Quantitative results for various sensor-object configurations (in pixels).
              static sensor/      moving sensor/      moving sensor/
              moving object       static object       moving object
              mse      std        mse      std        mse      std
  RF            7        7          4        3          9       10
  AF            7        9          4       10         13       25
  Color        15       16         10        6         10       14
  Diff         23       26       failed   failed     failed   failed
  SSD          25       27         12       13         17       21



Reliability (Table [*]) - Since the accuracy results were obtained using texture-based weighting, the reliability for action and response fusion is the same as presented in Table [*] for this weighting technique. In Table [*], the obtained reliability results are ranked, showing that color performs most reliably among the individual cues. In certain cases, especially when the influence of the background is not significant, this cue performs satisfactorily. However, it is easily distracted if the background takes up a large portion of the window of attention and includes the target's color.
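One simple way to realize a texture-based weight is to use the gradient energy inside the window of attention, as in the sketch below; this particular measure and the normalization are assumptions for illustration and need not coincide with the weighting used in the paper.

    import numpy as np

    def texture_measure(window_gray):
        # Mean gradient magnitude inside the window of attention; low values
        # indicate little texture, so texture-dependent cues such as SSD
        # should receive a smaller weight in the voting.
        gy, gx = np.gradient(window_gray.astype(float))
        return float(np.mean(np.hypot(gx, gy)))

    def normalize_weights(raw_weights, eps=1e-6):
        # Scale the per-cue weights so they sum to one before voting.
        raw = np.asarray(raw_weights, dtype=float) + eps
        return raw / raw.sum()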

Image differencing depends on the size of the moving target relative to the window of attention and on variations in lighting. In structured environments, however, this cue may perform well and may be considered in cases of a single moving target whose size is small compared to the size of the image (or window of attention).
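A minimal image-differencing cue restricted to the window of attention is sketched below; the threshold value is an assumption. The sketch also makes the failure mode above explicit: when the sensor moves, the whole window changes and the difference mask no longer isolates the target.

    import numpy as np

    def difference_response(prev_gray, curr_gray, threshold=15):
        # Absolute frame difference inside the window of attention; pixels that
        # changed by more than the threshold are taken as evidence of a moving
        # target.  With a moving sensor nearly every pixel changes, so the cue
        # cannot separate object from background, as noted in the text.
        diff = np.abs(curr_gray.astype(int) - prev_gray.astype(int))
        return (diff > threshold).astype(np.uint8)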

Fig. [*] shows two example images and the tracking accuracy for the proposed fusion approaches and for each of the cues individually. The plots and the table show the deviation from the ground-truth value (in pixels). A significant instability is evident for image differencing, indicating that this cue cannot be used alone for tracking. It can also be seen that the color cue performed very well; this is not surprising, since many face- and people-tracking systems rely strongly on this cue. Very little texture and significant changes in scale imply that the correlation cue is very likely to fail. Similar results are shown in Fig. [*] for a case where the target is a package of raisins. During this sequence a number of occlusions occur (as shown in the images), but the plots demonstrate stable performance of the fusion approaches throughout the whole sequence. The color cue is, however, ``fooled'' by a box of the same color as the target: the plots show how this cue fails around frame 300 and never regains tracking after that. These two examples clearly demonstrate that tracking by fusion is superior to any of the individual cues.
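For completeness, the sketch below shows an SSD (correlation) cue as exhaustive template matching inside the window of attention, assuming a fixed grayscale template; since the template is never adapted, scale changes and rotations of the target degrade the match, consistent with the behavior reported above.

    import numpy as np

    def ssd_match(window_gray, template_gray):
        # Return the (x, y) offset inside the window of attention that minimizes
        # the sum of squared differences with the fixed template.
        wh, ww = window_gray.shape
        th, tw = template_gray.shape
        template = template_gray.astype(float)
        best_ssd, best_pos = None, (0, 0)
        for y in range(wh - th + 1):
            for x in range(ww - tw + 1):
                patch = window_gray[y:y + th, x:x + tw].astype(float)
                ssd = np.sum((patch - template) ** 2)
                if best_ssd is None or ssd < best_ssd:
                    best_ssd, best_pos = ssd, (x, y)
        return best_pos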

Table: Success rate for individual cues and fusion approaches.
              # success    # failure    success rate (%)
  RF Voting        27           3             90
  AF Voting        22           8             73.3
  Color            18          12             60
  SSD              12          18             40
  Diff.             7          23             23.3


