Microsoft has released new research in an endeavour to develop automated captioning systems that can make value judgements and ‘tell a story’ from pictures in the traditional sense.

A selection of applied results can be seen at the company’s ‘Next’ blog, and the researchers have also released the core dataset developed to support the work. The Microsoft Sequential Image Narrative Dataset (SIND) contains 81,743 photos arranged into 20,211 sequences – material from which algorithms can attempt to derive interpretive descriptions and even value judgements.

One of the auto-captioned examples shows a sequence of photos from a baby’s first birthday, with captions such as ‘He was so excited to see the cake’, ‘The family had a great time’ and ‘They had a blast at the pool’.

[Image: auto-captioned SIND baby-narrative sequence]
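
The article doesn’t reproduce the dataset’s schema, but a SIND-style record presumably pairs an ordered photo sequence with narrative sentences like those above. A minimal sketch of such a structure in Python – every field name here is hypothetical, not taken from the released data:

```python
from dataclasses import dataclass, field

@dataclass
class StorySequence:
    """One SIND-style record: an ordered photo sequence plus narrative
    sentences. All field names are hypothetical -- the article does not
    reproduce the dataset's actual schema."""
    album_id: str                                              # source Flickr album
    photo_urls: list[str] = field(default_factory=list)       # ordered photos
    story_sentences: list[str] = field(default_factory=list)  # one per photo

# e.g. the birthday sequence described above (photo URLs invented)
example = StorySequence(
    album_id="flickr-album-0001",
    photo_urls=["cake.jpg", "family.jpg", "pool.jpg"],
    story_sentences=[
        "He was so excited to see the cake.",
        "The family had a great time.",
        "They had a blast at the pool.",
    ],
)
```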

The paper describing the work, entitled Visual Storytelling [PDF], explains how the SIND approach attempts to move beyond the merely descriptive AI-based captioning to which so many current projects are devoted:

‘There is a significant difference, yet unexplored, between remarking that a visual scene shows “sitting in a room” – typical of most image captioning work – and that the same visual scene shows “bonding”. The latter description is grounded in the visual signal, yet it brings to bear information about social relations and emotions that can be additionally inferred in context.’

The SIND photo dataset is derived from 10,117 CC-licensed Flickr albums, restricted to those containing 10-50 photos all taken within a continuous 48-hour period. Workers on Amazon Mechanical Turk then annotate each sequence in three stages, beginning with ‘traditional’ descriptive captions for the individual images, moving on to descriptions of the images in the context of their sequence, and finally producing a subjective, story-like interpretation of the whole image sequence:

[Image: Microsoft vision-to-caption example – garage sequence]

Flickr sets which are not ‘storyable’, such as a set of photos of coins, are ignored.
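
Taken together, the selection criteria described above amount to a simple filter over candidate albums. A minimal sketch, assuming a hypothetical `Album` record and treating ‘storyability’ as a flag supplied by the human annotators:

```python
from dataclasses import dataclass

@dataclass
class Album:
    """Hypothetical candidate Flickr album; field names are illustrative."""
    photo_timestamps: list[float]  # Unix time of each photo
    storyable: bool                # judged by the crowdsourced workers

MAX_SPAN_SECONDS = 48 * 60 * 60    # all photos within a 48-hour window

def is_eligible(album: Album) -> bool:
    """Apply the selection criteria described above: 10-50 photos,
    all taken within 48 hours, and judged 'storyable'."""
    n = len(album.photo_timestamps)
    if not 10 <= n <= 50:
        return False
    span = max(album.photo_timestamps) - min(album.photo_timestamps)
    if span > MAX_SPAN_SECONDS:
        return False
    return album.storyable  # e.g. a set of coin photos fails here
```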

The algorithm demonstrated in these experiments has not been outfitted to recognise objects and generate a subjective narrative entirely on its own; rather, it relies on further feedback from crowdsourced workers – who were asked to rate the statement ‘If these were my photos, I would like using a story like this to share my experience with my friends’ – before a photo sequence is deemed suitable fodder for a story.
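
The article doesn’t say how these worker judgements are aggregated. One plausible minimal scheme – the 1-5 Likert scale and the threshold value are both assumptions, not details from the paper – would average the ratings per sequence:

```python
def sequence_accepted(ratings: list[int], threshold: float = 4.0) -> bool:
    """Accept a photo sequence as story material if the mean worker
    rating clears a threshold. The 1-5 Likert scale and the threshold
    value are assumptions, not details from the paper."""
    return sum(ratings) / len(ratings) >= threshold
```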

Nonetheless, Microsoft must currently feel rather tender about using crowd-sourced sentiment to inform an AI’s value system. It was only recently that Microsoft researchers had to pull from the net a Twitter bot fashioned after a young girl, which learned from ‘the crowd’ and almost immediately became a misogynist racist.