10/18/2016 Laura Schmitt
When CS @ ILLINOIS Professor Kevin Chang needed a way to manage the business data for Cazoodle, a startup company he founded for developing vertical search services, he found it impractical to use a database, so he improvised an invoice and payment solution by customizing a spreadsheet. In the classroom, Chang and his faculty colleague Aditya Parameswaran taught students enrolled in their database systems classes that databases are the 'right' choice for data management. However, they both managed the grade book data for those very classes in a spreadsheet rather than a database system.
Why the disconnect? Spreadsheets are easy to use and can interactively support many common data processing operations without requiring programming knowledge. Databases, on the other hand, can handle vast amounts of data, but they are difficult to use interactively for exploring and manipulating data: users must either rely on specific applications created by programmers or learn the structured query language (SQL). Consequently, it’s not surprising that spreadsheet software is immensely popular across the globe in a variety of sectors, including science, finance, commerce, and government, as well as at home.
In today’s era of big data, though, spreadsheets can no longer satisfy user needs because they are reaching their effective limits when dealing with large amounts of data. As an example, biologists from the NIH-funded Big Data 2 Knowledge (BD2K) center at Illinois, who used to be able to open and examine their data in spreadsheets, are no longer able to do so. Instead, they ship the data to their bioinformatics colleagues just to check whether the data has been correctly generated.
To address this issue, Chang, Parameswaran, and CS @ ILLINOIS faculty colleague Karrie Karahalios recently landed a prestigious NSF BIGDATA grant—$1.8 million over four years—to develop DataSpread, a system that holistically unifies spreadsheets and database systems.
“Although they both manage tabular data, database and spreadsheet are two very distinct software paradigms,” explained Chang, who serves as the grant’s principal investigator. “We’ll bring together the ease of use and interactivity of spreadsheets with the scalability, expressiveness, and collaboration capabilities of databases. While only programmers can directly use a database, people using spreadsheets are running into trouble because they can’t handle the entire dataset in the memory of their computer. By marrying the two paradigms into one, we hope to make it easy for everyone to deal with data of scale.”
“What’s amazing is that the database systems community, despite being hugely successful commercially, has largely ignored the fact that the majority of the planet don’t use database systems for data management,” noted Parameswaran. “I call this underserved majority ‘the 99%’: those with access to big data but neither the tools nor the skills to analyze it. These are the people we’re targeting with DataSpread.”
The Illinois team, along with their students, will develop new models, algorithms, and architectures that compactly represent spreadsheet data and computation, provide positionally aware indexing structures, and efficiently propagate updates to the user viewport. The project will also study the design of new interaction primitives that can replace SQL yet be expressed effortlessly within a spreadsheet interface.
Marrying interactive spreadsheets with scalable databases is not an easy task. According to Karahalios, the spreadsheet evolved from a series of opaque operations on a mainframe to a graphical table on a personal computer. The graphical nature of the spreadsheet allowed everyday users to see the layout of their data and manipulate it directly. As a result, the use of spreadsheets increased dramatically.
“If, however, we suggest that users operate on millions or billions of cells on a spreadsheet, this process again becomes opaque due to the size of the data,” said Karahalios, a computer interface expert. “It is not clear that our current spreadsheet metaphor will work. How will people make sense of millions or billions of cells on a spreadsheet? How can they see an overview of their data? What operators will they use and why? These are just some of the usability questions that arise in the design of DataSpread.”
Another challenge the project will address is spreadsheet transaction management, which involves avoiding conflicts when multiple users are accessing and perhaps updating data at the same time. “The underlying challenges are not trivial to address,” Parameswaran said. “Spreadsheets and database systems have such fundamentally different modes of operation that unifying spreadsheets and database systems is much like gluing an apple to a pancake.”
The Illinois team is collaborating with several partners on DataSpread’s development. For example, Yahoo software engineers in the University of Illinois Research Park are helping develop open-source software that will be available via the cloud. Interestingly, this Yahoo office manages much of the company’s advertisement sales data, which users access by checking it out and exporting it to a spreadsheet.
“Data companies like Yahoo are running into the very same issues,” said Chang. “There’s no way users can interact with data as it exists on the back end in a database. They’ve run into similar issues with data being too big for a spreadsheet.”
Another DataSpread partner is the NIH-funded Big Data 2 Knowledge (BD2K) center headed by CS Professor Saurabh Sinha, of which Parameswaran is also a part. Biologists, like many scientists in other domains, are comfortable with spreadsheets but not databases, and they struggle both to import their data (e.g., millions of gene-gene or protein-protein interactions) into Excel and to express simple operations like joins. The DataSpread project will tackle these challenges to provide a database-powered spreadsheet that lets scientists interactively explore and analyze data.
Learn more about the DataSpread project by visiting http://dataspread.github.io.