Many small and medium businesses don’t have the luxury of big data. They need the tools to generate rational and informed insight from small data sets. Unless these small volumes can be exploited there is a risk that AI will fail to deliver on its promise.

The Stack recently spoke to Dario Garcia-Gasulla, senior researcher of the High Performance Artificial Intelligence research group at Barcelona Supercomputing Center, to explore why the AI and Big Data partnership is just scratching the surface.

Why big data is just the tip of the iceberg

Thanks to exponential improvements in computational power, storage, and the host of sensing devices available on the cheap, we have never had so much data at our disposal. Whether signals from sensors in smartphones and industrial equipment, photos and videos snapped from our mobile cameras, or the data deluge from social media, data is big and it’s getting bigger.

IBM famously estimated that 90% of all the digital data in the world had been created within the preceding two years. That estimate is itself five years old. To all those involved in artificial intelligence, and particularly machine learning – the subset of artificial intelligence that examines and compares large data sets to find common patterns – the big data age is a cause for celebration. Data is used to test and refine algorithms, which generate yet more data to test and refine, and so on in a virtuous cycle.

Feed in a torrent of images of benign and dangerous skin lesions, and with some testing, you can develop AI that is near-perfect at detecting skin cancer. So perfect that it outperforms dermatologists at staggering rates. By the time an application is achieving these results, it could have digested tens of millions of images.

Good things come in small packages

But from start to finish, this can be a lengthy, difficult, and expensive process. For starters, data has to be sourced, scrubbed, and seasoned to make it palatable to machine learning algorithms. Even after rounds and rounds of testing, these expensive projects can often fail to deliver results. Many researchers simply lack the time and money to source data. It may seem hard to believe, but biologists working at the cellular level have to outline the borders and structure of cells by hand.

In reality, most real-life problems have limited data to exploit, explains Garcia-Gasulla. Biological and medical ML applications focus their crosshairs on highly specific issues where ‘big data’ simply isn’t available. The same applies to most business data science applications, where only 100 or so data points exist for each class.

“If we consider companies that need to detect defects in a very specific manufacturing process, data here is typically scarce. There is no data of this sort on the internet, and typically only a few pieces are produced daily,” he adds.

Less is more

Taken together, this calls for a new form of data efficiency in ML that solves big problems with less data. Bradley Arsenault, CEO of Electric Brain, wrote this year that ‘for every dataset with one billion entries, there are 1,000 datasets with one million entries, and 1,000,000 datasets with only one thousand entries’.

If this resource can be exploited, a whole new range of beneficial applications will be brought within reach. For Garcia-Gasulla, this is the ‘process of democratisation of AI’ that will help lots of small and medium businesses. Big data is ‘just the tip of the iceberg’, he says.

Among the pioneers in this area is DARPA, which this year announced a new program called “learning with less labels”. DARPA’s aim is to research new learning algorithms that do not need a flood of information to train and update. It has articulated an ambitious objective of reducing the amount of data required to build a model by a million-fold.

To meet this challenge they are encouraging researchers to ‘create novel methods in the areas of meta-learning, transfer learning, active learning, k-shot learning, and supervised/unsupervised adaptation.’

Seize the data

While DARPA is calling for ‘novel methods’, Garcia-Gasulla is quick to underscore that AI itself is much broader than machine learning and was making an impact way before the data surge. It would be remiss to forget the useful contributions and tools available that can solve problems where data is scarce.

“Working with little data has been the reality of AI for over 40 years. There are lots of methods, like SVMs, decision trees, regressions, probabilistic models, and others, that work very well on “small” datasets,” he explains.
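As a rough illustration of one of the classical methods mentioned above, the sketch below fits a simple least-squares regression to eight data points and checks it with leave-one-out validation, a common way to estimate error when there is too little data to set aside a test set. The data (oven temperatures and defect rates) and the names `fit_line` and `leave_one_out_mae` are invented for this example and do not come from the interview.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def leave_one_out_mae(xs, ys):
    """Mean absolute error when each point is predicted by a line
    fitted to all the other points."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(abs(ys[i] - (slope * xs[i] + intercept)))
    return mean(errors)


# Hypothetical measurements: oven temperature vs. defects per 100 parts.
temps = [180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0, 215.0]
defects = [2.1, 2.4, 3.0, 3.2, 3.9, 4.1, 4.8, 5.0]

slope, intercept = fit_line(temps, defects)
loo_mae = leave_one_out_mae(temps, defects)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, LOO MAE={loo_mae:.3f}")
```

With only a handful of points, every one of them is used both for fitting and, in turn, for checking, which is the kind of frugality small datasets demand.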

Big data is regularly criticised for its overriding faith in ‘correlation is king’. Algorithms alone lack the causal inference and extrapolation capabilities to deliver informed and rational solutions from big data sets, although often throwing more data at a problem works in the end. This poses a problem for big data’s pint-sized variants. Surely relying on small data will just attenuate meaningful insight?

Garcia-Gasulla disagrees. He argues AI is capable of generating rational and informed insight from small data volumes if researchers bring outmoded data-analysis techniques back to the fore.

“The field has developed a wide variety of analysis for assessing data, which should not be disregarded. Correlation analysis, bias analysis, are rarely used these days, as people just expect to solve every problem by throwing more data at it. The first step on any data related problem should be properly understanding the data, its limitations and capabilities,” Garcia-Gasulla adds.
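A first look at the data along these lines might be sketched as follows. The example computes a Pearson correlation between two variables and the share of each class among the labels. The class-share check is only one narrow reading of "bias analysis", and the function names and sample values are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def class_shares(labels):
    """Fraction of the dataset that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical sample: two sensor readings and an inspection label per part.
pressure = [1.0, 1.2, 1.1, 1.5, 1.7, 1.6]
vibration = [0.30, 0.34, 0.33, 0.41, 0.47, 0.44]
labels = ["ok", "ok", "ok", "ok", "defect", "ok"]

print(f"pressure vs vibration: r = {pearson(pressure, vibration):.3f}")
print(f"class shares: {class_shares(labels)}")
```

A correlation close to 1 suggests the two readings carry much the same information, and a heavily lopsided class share is a warning that a model could score well simply by always predicting the majority class.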

Every instance counts

Working with small datasets creates its own technical challenges. Garcia-Gasulla explains that researchers with enough data do not have to worry about things like redundancy or missing values. With small data, every instance counts, calling for more complex and patient strategies.
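To make the two concerns named here concrete, the sketch below removes exact duplicate rows and fills missing values with the column median instead of discarding the affected rows. It is a minimal illustration under assumed conventions (rows as lists of numbers, `None` marking a missing value) and not a method described by Garcia-Gasulla; median imputation is just one simple option among many.

```python
from statistics import median


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows):
    """Replace None with the median of the observed values in that column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    medians = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        medians.append(median(observed))
    return [
        [medians[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical records: [thickness_mm, weight_g]; None marks a missing reading.
raw = [
    [1.0, 10.0],
    [1.0, 10.0],  # redundant copy of the first record
    [2.0, None],  # weight was not recorded
    [3.0, 30.0],
]

clean = impute_median(drop_duplicates(raw))
print(clean)
```

Imputing keeps the partially recorded row in play, which matters when the whole dataset is only a few dozen records long.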

“Every variable that adds noise to your model is a huge inconvenience. Models with small data must be aware of every aspect of the information being processed, and must find ways to exploit the advantages and ways to bypass the disadvantages. This usually means integrating different data mining and machine learning techniques.”

The looming problem of bias cannot go unexplored. By reducing the variety and scale of data fed into an algorithm, there is a risk of accentuating problematic human biases that already plague machine learning applications. AI may enhance our ability to detect visible cancerous skin lesions, but it also amplifies our hidden prejudices. We can all agree that whatever world artificial intelligence creates, none of us want a band of bot-bigots inhabiting it.

Garcia-Gasulla says AI practitioners can help alleviate bias by enacting data pre-processing and post-processing on every machine learning process. However, practitioners are currently skimping on these precautions by (again) relying on large volumes to paper over the cracks.

“In some cases, the most extremely voluminous, this works out. However, if these somewhat reckless approaches are applied to all problems, we may end up being disappointed by AI in most cases,” he says.

“Although in reality that would be us disappointing AI and the many researchers that came before us.”

Scratching the surface

While big data is only going to get bigger, most real-life problems don’t have the luxury of large data volumes. If small and medium businesses have the tools to exploit smaller datasets, their combined success may have a much larger impact than one big data problem solved. Big data is just the tip of the iceberg.

