Manipulation usually requires an accurate estimate of the object's pose, for example to align the robot arm with the object or to generate and execute a feasible grasp. A representation that exploits prior knowledge about the object can further increase the robustness of the tracking system. In addition to the commonly used CAD (wire-frame) models, view- and appearance-based representations may be employed .
A recent study of human visually guided grasps in situations similar to those typically encountered in visual servoing control,  has shown that the human visuo-motor system takes into account the three-dimensional geometric features of the target objects, rather than their two-dimensional projected images, to plan and control the required movements. These computations are more complex than those typically carried out in visual servoing systems and permit humans to operate in a large range of environments.
After the object has been recognized and its position in the image is known, an appearance-based method is employed to estimate its initial pose. The method we have implemented was initially proposed in  where just three pose parameters were estimated and used to move a robotic arm to a predefined pose with respect to the object. In contrast to our approach, where the pose is expressed relative to the camera coordinate system, they express the pose relative to the current arm configuration, which makes their approach unsuitable for robots with a different number of degrees of freedom.
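The practical benefit of expressing the pose in the camera coordinate system can be illustrated with a minimal sketch. It assumes that a camera-to-robot-base transform is available from some calibration step, which the text does not describe. The names `T_cam_obj`, `T_base_cam` and `T_base_obj`, the helper functions, and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def rot_z(theta):
    """Return a 3x3 rotation matrix for a rotation of theta radians about z."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    return [
        R[0] + [t[0]],
        R[1] + [t[1]],
        R[2] + [t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(A, B):
    """Multiply two 4x4 homogeneous transforms (A applied after B)."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


# Hypothetical object pose estimated in the camera frame (metres, radians).
T_cam_obj = make_transform(rot_z(math.pi / 2), [0.1, 0.0, 0.5])

# Hypothetical camera pose in the robot base frame (assumed known from calibration).
T_base_cam = make_transform(rot_z(0.0), [0.0, 0.2, 1.0])

# Object pose in the robot base frame: only T_base_cam depends on the robot.
T_base_obj = compose(T_base_cam, T_cam_obj)
```

Under these assumptions, the estimate `T_cam_obj` does not depend on the arm configuration, so moving to a robot with a different number of degrees of freedom would only require a different `T_base_cam`.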
Compared to the system proposed in , where the network was trained entirely on simulated images, we use real images for training, with no particular background considered. As pointed out in , the illumination conditions (as well as the background) strongly affect the performance of their system, and these cannot be easily reproduced with simulated images. In addition, the idea of projecting just the wire-frame model to obtain training images cannot be employed in our case due to the objects' texture. The system proposed in  also employs a feature-based approach where lines, corners and circles are used to provide the initial pose estimate. However, this initialization approach is not applicable in our case since, due to the geometry and textural properties of our objects, these features are not easy to find with high certainty.
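To make the limitation of wire-frame training images concrete, the following sketch projects the edges of a wire-frame model into the image with an ideal pinhole camera. It is not the procedure of the cited system; the function names, the cube model, and the intrinsic parameters `f`, `cx`, `cy` are assumptions made for illustration only.

```python
def project_point(p, f, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (pinhole model)."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (f * x / z + cx, f * y / z + cy)


def project_wireframe(vertices, edges, f, cx, cy):
    """Return the 2D line segments obtained by projecting each model edge."""
    pts = [project_point(v, f, cx, cy) for v in vertices]
    return [(pts[i], pts[j]) for i, j in edges]


# Hypothetical unit cube placed 2 m in front of the camera (camera frame, metres).
vertices = [
    (x, y, z)
    for x in (-0.5, 0.5)
    for y in (-0.5, 0.5)
    for z in (2.0, 3.0)
]

# Connect vertices that differ in exactly one coordinate (the 12 cube edges).
edges = [
    (i, j)
    for i in range(len(vertices))
    for j in range(i + 1, len(vertices))
    if sum(a != b for a, b in zip(vertices[i], vertices[j])) == 1
]

# Assumed intrinsics: focal length in pixels and principal point.
segments = project_wireframe(vertices, edges, f=500.0, cx=320.0, cy=240.0)
```

The result is only a set of 2D line segments. Surface texture, illumination and background are absent from such an image, which is consistent with the observation that this kind of training data does not suit textured objects.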