Sometimes ‘sorry’ isn’t good enough – particularly when machine learning systems ‘try something new’ which they couldn’t have reasonably known would be disastrous. With this in mind, researchers from Google’s artificial intelligence unit have drawn up tentative guidelines for AI systems that address the possible areas of exploration which might put systems – and people – at risk.

In the paper Concrete Problems in AI Safety [PDF], Dario Amodei and Chris Olah from Google Brain, the company’s machine intelligence research department, join with researchers from Stanford and UC Berkeley to examine the areas in which self-learning systems might fall foul of either inadequate prior information, or of what’s euphemistically referred to as ‘common sense’.

The team uses the example of a hypothetical industrial cleaning robot which has adaptive capacity to calculate its way around obstacles in the pursuit of its goal:

‘Suppose a designer wants an RL agent (for example our cleaning robot) to achieve some goal, like moving a box from one side of a room to the other. Sometimes the most effective way to achieve the goal involves doing something unrelated and bad to the rest of the environment, like knocking over a vase of water that is in its path. If the agent is given reward only for moving the box, it will probably knock over the vase.’

The hypothetical bad robot, the researchers postulate, could do worse damage than household breakages, though. For instance, its routine could require it to bring cleaning liquids into highly sensitive technical environments – an undesirable happenstance even if it has been programmed not to use liquid in that particular environment.

Machine learning systems are designed to reward the agent for certain behaviours which the system will be inclined to ‘hack’. The paper posits that if the putative janitor-droid gets rewarded for not seeing any mess or disorder on its rounds, it may ‘earn reward points’ by simply turning off its visual monitoring system. The ‘informal intent’ of the systems designer, it would see, either has to take into account every possible method by which an AI might attempt to ‘cheat’, or else consider deeper-rooted foundations of logic – and even ethics – which are not only more comparable to Asimov’s famous three laws of robotics but would require additional contextual understanding far outside the scope of the AI’s objectives. A janitor-philosopher, bringing wider understanding to bear on very specific practical problems.

The intent of the paper is to develop frameworks through which artificial intelligence researchers can concretely address the unwanted possibilities of AI systems ‘branching’ into negative but innovative behaviour in support of their core objectives.

The paper is responding to a recent spate of popular articles speculating about the disastrous potential of AI to ‘adapt maliciously’ in critical systems such as infrastructure and healthcare, and observes that post-event fixes and chaotic parameter inventions may not be adequate to prevent future ‘logic glitches’ from having unintended and negative consequences:

‘The risk of larger accidents is more difficult to gauge, but we believe it is worthwhile and prudent to develop a principled and forward-looking approach to safety that continues to remain relevant as autonomous systems become more powerful. While many current-day safety problems can and have been handled with ad hoc fixes or case-by-case rules, we believe that the increasing trend towards end-to-end, fully autonomous systems points towards the need for a unified approach to prevent these systems from causing unintended harm.’