The intensity of current research into image recognition reflects the potential ramifications of computers being able to make sense of the visual world, whether through neural networks, advances in database classification, or both. There is an order of magnitude of difference between an AI that can compare a real-world situation to the most prevalent features of a dataset query and one that can itself competently generate such datasets based on effective learning algorithms – and then use that knowledge.

At the political level, effective AI-based image recognition has huge significance for security infrastructure, whilst commercial applications, such as those currently being researched by Amazon, have significant economic consequences.

Researchers at Facebook AI Research (FAIR) believe that the classic challenges of image classification, edge detection, object detection and semantic segmentation are so near to being solved that the field should turn its sights to the next major challenge: occlusion. Objects in a photo must often be ‘guessed’, either because they are cropped by the image frame, hidden by other elements, further away from ‘adjacent’ objects than may be immediately obvious or, in certain instances, logically indistinguishable from non-contiguous elements in the frame.

In Semantic Amodal Segmentation [PDF], FAIR researchers Yan Zhu, Yuandong Tian and Piotr Dollár – together with Rutgers University Department of Computer Science fellow Dimitris Metaxas – set small groups of human volunteers the task of ‘completing’ a vector outline for objects in photographs that are not entirely visible.


In addition to suggesting outlines for the occluded parts of objects, the volunteers were also tasked with imposing a z-order on the classified objects, i.e. indicating which are nearer to the camera.

In the case of three huddled fox cubs, this information is more or less intrinsic, since the cub with no occlusion (i.e. completely shown) is almost certain to be at the front of the group. In the case of a stag in front of stag-like branches, foreshortened by a long lens, or of a musician holding an instrument (see image right), the distinction is far clearer to a human than to an AI.
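To make the relationship between full outlines and z-order concrete, here is a minimal Python sketch. It is not taken from the FAIR paper: the function names, the pixel-set representation and the toy rectangles standing in for two fox cubs are all assumptions made for illustration. It shows how, once each object has a complete (amodal) outline and a nearest-first ordering, the visible part of each object and the fraction that is hidden follow directly.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column); a hypothetical mask representation


def visible_masks(amodal: Dict[str, Set[Pixel]],
                  z_order: List[str]) -> Dict[str, Set[Pixel]]:
    """Derive visible regions from full-extent masks and a depth order.

    z_order lists object names from nearest to farthest from the camera.
    """
    covered: Set[Pixel] = set()          # pixels claimed by nearer objects
    visible: Dict[str, Set[Pixel]] = {}
    for name in z_order:
        visible[name] = amodal[name] - covered
        covered |= amodal[name]
    return visible


def occlusion_ratio(amodal_mask: Set[Pixel], visible_mask: Set[Pixel]) -> float:
    """Fraction of an object's full outline that is hidden from view."""
    if not amodal_mask:
        return 0.0
    return 1.0 - len(visible_mask) / len(amodal_mask)


def rect(top: int, left: int, height: int, width: int) -> Set[Pixel]:
    """Build a rectangular mask as a stand-in for an annotated outline."""
    return {(r, c)
            for r in range(top, top + height)
            for c in range(left, left + width)}


# Toy data (assumed): two overlapping 4x4 'cubs', the first one in front.
amodal = {"front_cub": rect(0, 0, 4, 4), "rear_cub": rect(2, 2, 4, 4)}
z_order = ["front_cub", "rear_cub"]

visible = visible_masks(amodal, z_order)
ratios = {name: occlusion_ratio(amodal[name], visible[name]) for name in z_order}
print(ratios)
```

In this toy case the unoccluded object is the one at the front, which mirrors the fox-cub example; the harder cases described above are those where the ordering cannot be read off from the outlines alone.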

Clutter and clusters in image recognition

At the same time as this paper’s release, another research group addresses [PDF] Amazon’s continuing efforts to get robots to accurately choose and pick items from shelves, noting the challenge of ‘clutter’, wherein the detection algorithms applied to the task can easily mistake other objects for the intended object. To this end the Amazon Picking Challenge provides extensive visual database resources, along with 3D CAD models that can help the algorithm to reproduce what it is seeing across a variety of potential matches and choose the match that scores highest in comparison.
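The idea of scoring a set of potential matches and keeping the best one can be sketched in a few lines of Python. This is an illustration only, not the method used in the paper or in the Amazon Picking Challenge: the feature vectors, item names, the use of cosine similarity and the `min_score` threshold are all assumptions chosen to keep the example small.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Similarity of two feature vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(observed: List[float],
               candidates: Dict[str, List[float]],
               min_score: float = 0.9) -> Tuple[Optional[str], float]:
    """Score every candidate against the observation and keep the highest.

    Returns (None, score) when even the best candidate falls below
    min_score, so that clutter is rejected rather than mislabelled.
    """
    scored = [(cosine_similarity(observed, feats), label)
              for label, feats in candidates.items()]
    score, label = max(scored)
    return (label, score) if score >= min_score else (None, score)


# Hypothetical reference features, e.g. derived from rendered item models.
candidates = {
    "crayon_box": [0.9, 0.1, 0.2],
    "glue_bottle": [0.1, 0.8, 0.3],
    "rubber_duck": [0.2, 0.2, 0.9],
}

label, score = best_match([0.85, 0.15, 0.25], candidates)
clutter_label, clutter_score = best_match([0.5, 0.5, 0.5], candidates)
print(label, clutter_label)
```

The threshold is the design choice that matters for clutter: without it, an ambiguous observation would still be assigned to whichever known item happened to score highest.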

Amazon is solving less abstract problems than FAIR, however. Though its work may develop principles and techniques that are more widely applicable, its task is primarily concerned with the recognition of ‘Amazon objects’ in an ‘Amazon environment’. Scenarios such as a thumb over a lens appearing to be a large pink balloon, or a 750lb gorilla needing to be distinguished from a toy that represents a gorilla, are unlikely to occur and therefore fall outside the challenge’s scope.

From ‘A Dataset for Improved RGBD-based Object Detection and Pose Estimation for Warehouse Pick-and-Place’ – Amazon objects within an experimental database and their real-world correlations in situ.

Five bananas or a bunch of bananas?

Facebook’s researchers have wider concerns, almost touching the philosophical at certain points: is a group in itself an ‘object’? A bunch of bananas is a distinct entity in language, for instance, though composed of sub-objects. With more complex subjects such as humans – surely to the fore of Facebook’s scientific interest – the identification of the ‘human object’ leads to immense granularity: gender, age and individual body parts, to begin with, and that’s without addressing contextual challenges such as location, weather and other identifiable objects in the image.

Both the database-driven and the neural-network approaches to image recognition have their limitations, the former of context and the latter of over-extended scope. Amazon seems likely to end up with a ‘baked’ system that works very well but will probably only offer developmental insight to industries that have similar or identical problems. At the same time, wider research into object recognition, particularly in the field of Advanced Driver Assistance Systems (ADAS) for self-driving cars, needs to take so many possible variables into account that manual annotation of an image-set database seems the only realistic route at present; even if self-learning neural networks could be trusted to learn important information about what they are seeing through their IoT sensors, adequate computing power for real-time responses in critical situations is not currently available.

Regarding the neural approach to image recognition, there is the additional possibility of developing rules that are usually correct but are so likely to fail in particular circumstances as to render them useless in important contexts. If an algorithm begins to understand that similar things can often be found together – such as kittens, bananas and people – it is likely to become better at recognising multiple instances of an object in an image, but may begin to create non-existent ‘groups’ based on the general success of the principle.

In their paper the Facebook researchers make note of the Kanizsa Triangle, one of many optical illusions likely to send current image recognition algorithms into a classic Star Trek-style ‘does not compute’ loop. Strictly speaking the image contains six objects, but it depicts anywhere between four and six, depending on your point of view – an interpretive conundrum which is often repeated across image-sets that are generated ad hoc rather than for the purposes of specific database experiments in controlled conditions.