Video monitoring systems for traffic have been in use since the mid- 1960s, initially pioneered by Israel and the Netherlands in order to capture motorists violating traffic light regulations. But they would have been very differently-designed if computer evaluation, rather than human monitoring, had been anticipated – a problem which researchers from Carnegie Mellon are currently tackling.
The sheer scale and maturity of urban traffic-monitoring systems, many of which hail back to the 1980s, are a tantalising prospect for AI in terms of generating usable data about traffic flow, but are predicated on ‘common sense’ – currently an elusive goal for machine learning, even in terms of widely-agreed definition.
In the paper Understanding Traffic Density from Large-Scale Web Camera Data four researchers consider what Fully Convolutional Networks (FCNs) might be able to achieve in terms of leveraging all this extant, live data without the need to invest in expensive and experimental new monitoring systems.
City-wide traffic camera networks were originally developed as tools for municipal traffic authorities to make rough estimations of congestion, and to be apprised of serious blockages such as road accidents; any traffic flow data emerging from their use would be purely anecdotal – impossibly expensive or complicated to rationalise and analyse scientifically.
A machine system wanting to use such systems to gain a long-term understanding of traffic flow is faced with a number of issues: firstly, lenses deployed in them tend to be extremely wide-angle, in order to reduce the necessary camera-count for coverage. In terms of information that such images provide to machine systems, this means that the rear-most cars in a traffic jam may not occupy more than five pixels – an object-recognition challenge that seems science-fictional at the current state of the art.
The economies of scale in existence at the time these legacy networks were created are another challenging factor, since the low frame rate of video capture (sometimes as little as one frame every 2-3 seconds) that they are set at makes it difficult to continuously evaluate individual vehicles, and maintain some understanding of ongoing traffic flow.
Another ‘legacy’ economy poses an additional problem for image-recognition analysis – the frequently low video resolution of the feeds. A typical feed runs at a risible 352×240 pixels – mean, even by the standards of the early internet.
The issue of occlusion offers another challenge yet: as lanes recede into the distance, it can become practically impossible to distinguish individual vehicles from each other. Had these surveillance systems been designed with image recognition in mind, they would not have been placed at such an acute and distorting angle, but rather mounted as high as possible for vehicle individuation.
The researchers have adapted to these challenges by formulating individual approaches to each pixel sector of the legacy feeds, with highly creative approaches to perspective-based clustering, that promises to yield some consistency in traffic evaluation from systems which were not designed to provide this information.