The petabyte—a quantity of digital information 12 orders of magnitude greater than the lowly kilobyte—looms large as a future standard for data. To glean knowledge from this deluge of data, a team of researchers at the Data Science Research Center (DSRC) at Rensselaer is combining the reach of cloud computing with the precision of supercomputers in a new approach to Big Data analysis.

“Advances in technology for medical imaging devices and sensors, and in powerful scientific simulations, are producing data that we must be able to access and mine,” said Bulent Yener, founding director of the DSRC, a professor of computer science within the School of Science, and a member of the research team. “The trend is heading toward petabyte data, and we need to develop algorithms and methods that can help us understand the knowledge contained within it.”

The team, led by Petros Drineas, associate professor of computer science at Rensselaer, has been awarded a four-year, $1 million grant from the National Science Foundation Division of Information & Intelligent Systems to explore new strategies for mining petabyte-scale data. The project will enlist key faculty from across the Institute, including Drineas and Yener; Christopher Carothers, professor of computer science and director of the Rensselaer supercomputing center, the Computational Center for Nanotechnology Innovations (CCNI); Mohammed Zaki, professor of computer science; and Angel Garcia, head of the Department of Physics, Applied Physics, and Astronomy and senior chaired professor in the Biocomputation and Bioinformatics Constellation.

Drineas said the team proposes a novel two-stage approach to harnessing the petabyte.

“This is a new paradigm in dealing with massive amounts of data,” Drineas said. “In the first stage, we will use cloud computing—which is cheap and easily accessible—to create a sketch or a statistical summary of the data. In the second stage, we feed those sketches to a more precise—but also more expensive—computational system, like those in the Rensselaer supercomputing center, to mine the data for information.”
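As a rough illustration of this sketch-then-refine idea (a minimal sketch of the general paradigm, not the team's actual pipeline), the Python snippet below compresses a tall data matrix with a random Gaussian projection, a standard device in randomized matrix computations, and then runs the expensive, exact step, here a full singular value decomposition, only on the small summary. The matrix sizes, the sketch() and refine() helpers, and the choice of a Gaussian projection are illustrative assumptions.

```python
import numpy as np

def sketch(data, k, rng):
    # Stage 1 (cheap, easily parallelized): compress the n-row data matrix
    # into a k-row statistical summary with a random Gaussian projection.
    # One of many possible sketches; chosen here for simplicity.
    n = data.shape[0]
    projection = rng.standard_normal((k, n)) / np.sqrt(k)
    return projection @ data              # k x d, far smaller than the input

def refine(summary):
    # Stage 2 (precise, expensive): run the exact computation, here a
    # full SVD, on the small summary instead of the raw data.
    return np.linalg.svd(summary, full_matrices=False)

rng = np.random.default_rng(0)
data = rng.standard_normal((20_000, 50))  # stand-in for a far larger data set
_, s_approx, _ = refine(sketch(data, 1_000, rng))
_, s_exact, _ = np.linalg.svd(data, full_matrices=False)

# How much accuracy did the cheap first stage give up?
print("worst relative deviation of singular values:",
      np.max(np.abs(s_approx - s_exact) / s_exact))
```

The split mirrors the quote: forming the summary needs only streaming passes over the raw data and parallelizes easily across inexpensive cloud nodes, while the exact decomposition runs once on a summary small enough for a tightly coupled machine.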

The problem, according to Yener, is that data on the petabyte scale is so large that scientists do not yet have the means to extract knowledge from the bounty.

“Scientifically, it is difficult to manage a petabyte of data,” said Yener. “It’s an enormous amount of data. If, for example, you wanted to transfer a petabyte of data from California to New York, you would need to hire an entire fleet of trucks to carry the disks. What we are trying to do is establish methods for mining and for extracting knowledge from this much data.”
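A quick back-of-the-envelope calculation puts the coast-to-coast example in perspective; the link speed and drive capacity below are illustrative assumptions rather than figures from the researchers.

```python
PETABYTE = 10**15                  # bytes
LINK_BYTES_PER_SEC = 1e9 / 8       # a sustained 1 gigabit/s network link (assumed)
DRIVE_CAPACITY = 2 * 10**12        # a 2 TB hard drive (assumed)

transfer_days = PETABYTE / LINK_BYTES_PER_SEC / 86_400
drives_needed = PETABYTE / DRIVE_CAPACITY

# Roughly 93 days over the wire versus about 500 physical drives to ship.
print(f"~{transfer_days:.0f} days over the network, or ~{drives_needed:.0f} drives")
```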

Although petabyte-scale data is still uncommon and not easily obtained (for this particular research project, Angel Garcia will generate and provide a petabyte-scale simulation of atomic-level motions), it is a visible frontier, and, given current computing power, standard approaches to data analysis will be too costly, too time-consuming, and not powerful enough to do the job.

“Having a supercomputer process a petabyte of data is not a feasible model, but cloud computing cannot do the job alone either,” Yener said. “In this way, we do some pre-processing with the cloud, and then we do more precise computing with CCNI. So it is finding this balance between how much you are going to execute, and how accurately you can execute it.”

To read more, go to cs.rpi.edu/~yener/DSRC/about.html.