Next: Experimental Evaluation
Up: Implementation
Previous: Response Fusion Approach
Action Fusion Approach
Here, the action space is defined by a direction and speed ,
see Fig. . Both the direction and the speed are
represented by histograms of discrete values where the direction is
represented by eight values, see Fig. :

(14) 
Speed is represented by 20 values at 0.5-pixel intervals, which
means that the maximum allowed displacement between successive frames
is 10 pixels (this can easily be made adaptive based on the
estimated velocity). There are two reasons for choosing just eight
values for the direction: i) if the update rate is high or the
interframe motion is slow, this approach still gives
reasonable accuracy and hence smooth performance, and ii)
keeping the voting space rather small gives a higher chance that
the cues will vote for the same action. Accordingly, each cue
votes for a desired direction and a desired speed. As presented in
Fig. , a
neighborhood voting scheme is used to ensure that slight
differences between different cues do not result in an unstable
classification. (Eq. ) is modified so that:
and 
(15) 
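The discretization and neighborhood voting described above can be illustrated with a minimal sketch. The bin layout is an assumption, since the original equations are not reproduced here: direction bins are taken as 45-degree sectors starting at the positive x axis, speed bin k is taken to stand for (k + 1) * 0.5 pixels, and each vote is assumed to give half its weight to the two adjacent bins. The names `direction_bin`, `speed_bin`, `vote` and the `spread` parameter are hypothetical.

```python
import math

N_DIR = 8          # eight direction bins, as stated in the text
N_SPEED = 20       # twenty speed bins
SPEED_STEP = 0.5   # pixels per bin, so the largest speed is 10 pixels


def direction_bin(dx, dy):
    """Map a 2D displacement to the nearest of eight 45-degree sectors.

    Assumed layout: bin 0 points along +x, bins increase with the angle.
    """
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    sector = 2.0 * math.pi / N_DIR
    return int(round(angle / sector)) % N_DIR


def speed_bin(dx, dy):
    """Map the displacement magnitude to a speed bin, clamped to the range.

    Assumed layout: bin k represents a speed of (k + 1) * SPEED_STEP pixels.
    """
    speed = math.hypot(dx, dy)
    k = int(speed / SPEED_STEP + 0.5) - 1
    return max(0, min(k, N_SPEED - 1))


def vote(hist, index, weight, circular, spread=0.5):
    """Add a weighted vote to a bin and a reduced vote to its neighbors.

    The spread factor is an assumed value, not taken from the source.
    Direction histograms wrap around; speed histograms do not.
    """
    hist[index] += weight
    for neighbor in (index - 1, index + 1):
        if circular:
            hist[neighbor % len(hist)] += spread * weight
        elif 0 <= neighbor < len(hist):
            hist[neighbor] += spread * weight
```

With this layout, a displacement of (3, 4) pixels has a magnitude of 5 pixels and falls into speed bin 9. Spreading votes to neighboring bins means that two cues whose estimates differ by one bin still reinforce each other.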
In each frame, the following is estimated for each cue:
Color 
The response of the color cue is first estimated
according to (Eq. ) and followed by:

(16) 
where
represents the desired action and
is the predicted position of the tracked
region. The same approach is used to obtain
and
.
Correlation 
The minimum of the SSD surface is used as:

(17) 
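For the correlation cue, taking the minimum of the SSD surface can be sketched as an exhaustive search around the predicted position. This is only an illustration under assumptions: images are plain nested lists of gray values, the search window is a square of a hypothetical `radius`, and the function name `ssd_displacement` is not from the source.

```python
def ssd_displacement(template, image, top, left, radius):
    """Return the displacement (dx, dy) that minimizes the SSD.

    `top` and `left` give the predicted position of the template's upper-left
    corner in `image`; candidate shifts within `radius` pixels are tested and
    shifts that would leave the image are skipped.
    """
    h, w = len(template), len(template[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r0, c0 = top + dy, left + dx
            if r0 < 0 or c0 < 0 or r0 + h > len(image) or c0 + w > len(image[0]):
                continue
            # Sum of squared differences between the template and the patch.
            ssd = sum(
                (image[r0 + r][c0 + c] - template[r][c]) ** 2
                for r in range(h)
                for c in range(w)
            )
            if best is None or ssd < best[0]:
                best = (ssd, dx, dy)
    if best is None:
        raise ValueError("no candidate shift lies inside the image")
    return best[1], best[2]
```

The returned displacement plays the role of the desired action for this cue; it would then be quantized into a direction and a speed before voting.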
Fusion:
After the desired action,
, for a cue is estimated, the
cue produces the votes as follows:
direction speed 
(18) 
where
is a
scalar function that maps the two-dimensional direction vectors (see
(Eq. )) to one-dimensional values representing the
bins of the direction histogram. Now, the estimated direction, ,
and the speed, , of a cue, , with a weight, , are used to
update the direction and speed of the histograms according to
Fig. and (Eq.). The new
measurement is then estimated by multiplying the actions from each
histogram which received the maximum number of votes according to
(Eq. ):

(19) 
where
. The update and prediction steps are then performed using
(Eq. ) and (Eq. ). The
reason for choosing this representation instead of simply
using a weighted sum of the first moments of the responses of all cues is,
as pointed out in [8], that arbitration
via vector addition can result in commands that are not satisfactory
to any of the contributing cues.
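The difference between the two arbitration schemes can be made concrete with a small sketch. It assumes that each cue has already been reduced to a (direction bin, speed bin, weight) triple, that bin k of the direction histogram is the unit vector at k * 45 degrees, and that speed bin k stands for (k + 1) * 0.5 pixels. Neighborhood voting is left out for brevity, and the function names are hypothetical.

```python
import math

# Assumed bin layout: eight unit directions and twenty speeds in 0.5-pixel steps.
DIRECTIONS = [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)]
SPEEDS = [0.5 * (k + 1) for k in range(20)]


def fuse_by_voting(cue_votes):
    """Accumulate weighted votes and combine the winning direction and speed."""
    dir_hist = [0.0] * len(DIRECTIONS)
    speed_hist = [0.0] * len(SPEEDS)
    for d, s, w in cue_votes:
        dir_hist[d] += w
        speed_hist[s] += w
    d_best = max(range(len(dir_hist)), key=dir_hist.__getitem__)
    s_best = max(range(len(speed_hist)), key=speed_hist.__getitem__)
    ux, uy = DIRECTIONS[d_best]
    return ux * SPEEDS[s_best], uy * SPEEDS[s_best]


def fuse_by_vector_sum(cue_votes):
    """Weighted average of the cues' displacement vectors, for comparison."""
    total = sum(w for _, _, w in cue_votes)
    x = sum(w * DIRECTIONS[d][0] * SPEEDS[s] for d, s, w in cue_votes) / total
    y = sum(w * DIRECTIONS[d][1] * SPEEDS[s] for d, s, w in cue_votes) / total
    return x, y


# Two cues that disagree: one votes "right", the other "left", both at 4 pixels.
cues = [(0, 7, 0.6), (4, 7, 0.4)]
voted = fuse_by_voting(cues)
averaged = fuse_by_vector_sum(cues)
```

In this example the voting scheme selects a 4-pixel move to the right, which is what the more heavily weighted cue asked for, while the vector sum yields a move of roughly 0.8 pixels that neither cue proposed.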
Danica Kragic
20021206