Next: Experimental Evaluation Up: Implementation Previous: Response Fusion Approach

Action Fusion Approach

Here, the action space is defined by a direction $ d$ and a speed $ s$, see Fig. [*]. Both direction and speed are represented by histograms of discrete values, where the direction is represented by eight values, see Fig. [*]:

\begin{equation*}\begin{split} d \in &\left[\,\mathrm{R},\ \mathrm{RU},\ \mathrm{U},\ \mathrm{LU},\ \mathrm{L},\ \mathrm{LD},\ \mathrm{D},\ \mathrm{RD}\,\right]\\ &\text{with L-left, R-right, D-down, U-up} \end{split}\end{equation*} (14)

Speed is represented by 20 values with a 0.5-pixel interval, which means that the maximum allowed displacement between successive frames is 10 pixels (this is easily made adaptive based on the estimated velocity). There are two reasons for choosing just eight values for the direction: i) if the update rate is high or the inter-frame motion is slow, this approach still gives reasonable accuracy and hence a smooth performance, and ii) by keeping the voting space rather small, there is a higher chance that the cues will vote for the same action. Accordingly, each cue votes for a desired direction and a desired speed. As presented in Fig. [*], a neighborhood voting scheme is used to ensure that slight differences between cues do not result in an unstable classification. (Eq. [*]) is modified so that:

$\displaystyle \textbf{H} =\left[ \begin{smallmatrix}0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{smallmatrix}\right]$   and$\displaystyle \hspace{1mm} \textbf{W}=\left[ \begin{smallmatrix}\alpha \Delta T & 0 & \beta & 0\\ 0 & \alpha \Delta T & 0 & \beta \end{smallmatrix} \right]^T$ (15)
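The neighborhood voting into the direction and speed histograms can be sketched as follows. The exact neighborhood weights are given only in the figure, so the scheme below (full weight to the chosen bin, half weight to each adjacent bin, with direction bins wrapping around) and all names are assumptions:

```python
import numpy as np

N_DIR, N_SPEED = 8, 20          # eight direction bins, 20 speed bins
SPEED_STEP = 0.5                # pixels per speed bin (max 10 px/frame)

def cast_vote(hd, hs, d_bin, s_bin, w):
    """Cast one cue's weighted vote with neighborhood smoothing.

    Assumed scheme: the full weight w goes to the chosen bin and w/2
    to each adjacent bin (direction bins wrap around, speed bins do not).
    """
    hd[d_bin] += w
    hd[(d_bin - 1) % N_DIR] += 0.5 * w
    hd[(d_bin + 1) % N_DIR] += 0.5 * w

    hs[s_bin] += w
    if s_bin > 0:
        hs[s_bin - 1] += 0.5 * w
    if s_bin < N_SPEED - 1:
        hs[s_bin + 1] += 0.5 * w

# usage: three cues vote for slightly different but nearby actions
hd = np.zeros(N_DIR)
hs = np.zeros(N_SPEED)
cast_vote(hd, hs, d_bin=2, s_bin=6, w=1.0)   # e.g. color cue
cast_vote(hd, hs, d_bin=3, s_bin=6, w=0.8)   # e.g. motion cue
cast_vote(hd, hs, d_bin=2, s_bin=7, w=0.6)   # e.g. correlation cue
best_dir, best_speed = hd.argmax(), hs.argmax() * SPEED_STEP
```

Because the votes spill into neighboring bins, cues that prefer adjacent directions still reinforce a common winner instead of splitting the vote.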

In each frame, the following is estimated for each cue:
Color - The response of the color cue is first estimated according to (Eq. [*]), followed by:

\begin{displaymath}\begin{split} \textbf{a}_{color}(k) &= \frac{\sum_\textbf{x} O_{color}(\textbf{x},k)\,\textbf{x}}{\sum_\textbf{x} O_{color}(\textbf{x},k)} - [\hspace{1mm} \hat{\textbf{p}}_{k\vert k-1} + 0.5\textbf{x}_w] \end{split}\end{displaymath} (16)

where $ \textbf{a}_{color}(k)$ represents the desired action and $ \hat{\textbf{p}}_{k\vert k-1}$ is the predicted position of the tracked region. The same approach is used to obtain $ \textbf{a}_{motion}(k)$ and $ \textbf{a}_{var}(k)$.
Correlation - The position of the minimum of the SSD surface is used:

$\displaystyle \textbf{a}_{SSD}(k)= \operatornamewithlimits{arg min}_{\textbf{x}} (SSD(\textbf{x}, k)) - \hat{\textbf{p}}_{k\vert k-1}$ (17)
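The two kinds of per-cue actions can be sketched as follows, under the assumption that the search window of size $\textbf{x}_w$ is centered at the predicted position, so that positions relative to the window center are displacements relative to $\hat{\textbf{p}}_{k\vert k-1}$; the function and argument names are illustrative:

```python
import numpy as np

def centroid_action(response, x_w):
    """Response-map cue (color, motion, var): the response-weighted
    centroid of the search window, expressed relative to the window
    center (assumed to coincide with the predicted position)."""
    ys, xs = np.mgrid[0:response.shape[0], 0:response.shape[1]]
    total = response.sum()
    c = np.array([(response * xs).sum(), (response * ys).sum()]) / total
    return c - 0.5 * np.asarray(x_w)

def ssd_action(ssd, x_w):
    """Correlation cue: the position of the SSD minimum relative to
    the window center."""
    iy, ix = np.unravel_index(np.argmin(ssd), ssd.shape)
    return np.array([ix, iy]) - 0.5 * np.asarray(x_w)
```

For example, a single response peak at window position (15, 12) in a window of size (20, 20) yields the action (5, 2), i.e. "move right and slightly down" relative to the prediction.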

After the desired action, $ \textbf{a}_i(k)$, for a cue is estimated, the cue produces the votes as follows:

direction$\displaystyle \hspace{1mm} d_i={\mathcal{P}}(\operatorname{sgn} (\textbf{a}_i)),$   speed$\displaystyle \hspace{1mm} s_i=\vert\vert \textbf{a}_i \vert\vert$ (18)

where $ {\mathcal{P}}: \textbf{x} \rightarrow \{0,1, \dots, 7 \}$ is a scalar function that maps the two-dimensional direction vectors (see (Eq. [*])) to one-dimensional values representing the bins of the direction histogram. Now, the estimated direction $ d_i$ and speed $ s_i$ of a cue $ c_i$ with weight $ w_i$ are used to update the direction and speed histograms according to Fig. [*] and (Eq. [*]). The new measurement is then estimated by multiplying the winning entries of the two histograms, i.e., those that received the maximum number of votes, according to (Eq. [*]):

$\displaystyle \textbf{z}_k={\mathcal{S}}(\operatornamewithlimits{arg max}_d HD(d)) \operatornamewithlimits{arg max}_{s}HS(s)$ (19)

where $ \small {{\mathcal{S}}: x \rightarrow \{\left[ \begin{smallmatrix}1 \\ 0 \end{smallmatrix}\right], \dots, \left[ \begin{smallmatrix}-1 \\ 1 \end{smallmatrix}\right]\}}$ maps a direction bin back to a two-dimensional direction vector. The update and prediction steps are then performed using (Eq. [*]) and (Eq. [*]). The reason for choosing this particular representation, instead of simply using a weighted sum of the first moments of the responses of all cues, is that, as pointed out in [8], arbitration via vector addition can result in commands that are not satisfactory to any of the contributing cues.

Danica Kragic 2002-12-06