A new technique for generating coresets within larger data sets has been discovered, which can help operators manage and analyze big data more efficiently.

Shrinking an extremely large data set to a smaller subset that maintains necessary relationships is a known method of efficient analysis of large data sets. The newly discovered technique uses coresets as s a practical solution to reduce the dimensionality for extremely large-scale, sparse matrices while maintaining computational efficiency. Application of the coreset algorithm results in creating a smaller, but still accurate representative data set that preserves mathematical relationships between variables.

The coreset algorithm proposed by the research team represents an improvement over state of the art analysis techniques including singular-value decomposition, non-negative matrix factorization, and principal-component analysis.

Daniela Rus, MIT Professor and senior author of the paper, said, “These are all very general algorithms that are used in so many applications. They’re fundamental to so many problems. By figuring out the coreset for a huge matrix for one of these tools, you can enable computations that at the moment are simply not possible.”

This technique, discovered by researchers at MIT and the University of Haifa, uses a merge-and-reduce procedure that can be applied to an extremely large data set, with millions of variables, to create an accurate representative data set with only thousands of variables.

It starts by taking a sparse data set (where most entries on the matrix are 0) and analyzing a cluster of points, for example, 20 non-0 entries. It reduces the number of entries by selecting 10 as representative of the whole. It then takes a separate cluster of 20, selects 10 representative entries, and merges the two subset clusters of 10 representative entries to make a new representative cluster of 20, which is then reduced to a new representative subset of 10.

Applied to a large-scale data grouping, this technique can be used to reduce a sparse data set by a full order of magnitude. At that point, a more common analysis technique like principal-component analysis may be applied to further reduce the number of variables to hundreds, or even fewer.

While the proposed merge-and-reduce procedure for generating coresets examines every point in a large data set, it maintains computational efficiency because it deals with the data in small packets, and analyzes different data points concurrently on distributed machines in a cloud, network, or graphics processor.

In the paper, which contains an experiment applied to synthetic data, the researchers show that their reduction method provides a very good approximation of the full data set.

The merge-and-reduce algorithm can be tailored for applications in natural-language processing, computer vision, signal processing, recommendation systems, weather prediction, finance, and neuroscience, among many others.