8/1/2016 Laura Schmitt & Colin Robertson, CS @ ILLINOIS
Note: this was the feature article in Click! Magazine, 2016, volume I. Later that year, CS @ ILLINOIS launched a new track in Data Science for our professional Master of Computer Science degree program, in partnership with Coursera.
Every day, Twitter processes 500 million tweets in real time to determine trending topics, Match.com identifies thousands of potential relationships, PayPal’s anti-fraud measures sift through $300 million in payments, while medical researchers use genomic data to develop new cancer treatments. The common link? Data science.
Influential media outlets like Forbes, Harvard Business Review, and The New York Times have all touted data science—the process of extracting meaningful and actionable information from massive and varied sets of data (“big data”)—as the hot new discipline that is transforming business and society.
The excitement revolves around the promise of new statistical and computational tools capable of extracting knowledge from the Digital Revolution’s massive deluge of data, especially since much of the data generated today is unstructured, messy, and possibly untrustworthy.
“Lots of people in lots of areas knew how valuable data was, but they didn’t necessarily have the tools to do something with it,” said CS Professor David Forsyth, an expert in computer vision. “What is wonderful now is our ability to do things we couldn’t do.”
Unsurprisingly, CS @ ILLINOIS alumni have been helping drive the data science revolution. Advances have included Siebel Systems’ customer relationship management software and C3 Energy’s smart energy platforms (Tom Siebel, BA History ’75, MBA ’83, MS CS ’85, Honorary ’06), YouTube’s enormous online video archive (Steve Chen, attended, and Jawed Karim, BS CS ’04), Yelp’s authentic user-generated local reviews (Russel Simmons, BS CS ’98), PayPal’s online payment system (Max Levchin, BS ’97), and Informatica’s enterprise data integration and management applications (Sohaib Abbasi, BS CS ’78, MS ’80).
It’s estimated that 80-90% of all the data that corporations and other entities deal with is unstructured, meaning that it comes in the form of text and images in email, blogs, event logs, product reviews, social media, news outlets, and dozens of other sources.
Researchers at Illinois are developing the theories, algorithms, and tools to transform raw data into useful and understandable information. Here are just a few examples.
Internationally known for his work in big structured data, data mining pioneer Jiawei Han is now developing novel techniques for mining information from unstructured data. His approach is to mine latent entity structures from massive unstructured and interconnected data.
His statistics-based algorithms automatically grab meaningful phrases from text and determine if they refer to a person, place, or thing in a scalable way. One of his phrase-mining algorithms, which works on multiple languages, was a grand prize winner of the 2015 Yelp Dataset Challenge. Three of Han’s text mining software packages are being used by the Army Research Lab.
“Essentially, it watches patterns and [word] combinations and learns from them so it can figure things out,” said Han, who demonstrated his system on a variety of domains including Yelp restaurant reviews and scientific research publications. Further, Han’s method can infer relationships between entities and its runtime is much faster than alternative methods used on large datasets.
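To make the idea of statistics-based phrase mining concrete, here is a minimal sketch in Python. It is not Han's algorithm: it swaps in a much simpler, well-known heuristic, pointwise mutual information (PMI), to rank adjacent word pairs that co-occur more often than chance would predict. The function name `score_bigrams`, the `min_count` threshold, and the sample reviews are all assumptions made up for illustration.

```python
import math
from collections import Counter


def score_bigrams(docs, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative only: a simple stand-in for statistics-based phrase
    mining, not the method described in the article.
    """
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    scores = {}
    for (a, b), n in bigrams.items():
        if n < min_count:  # ignore pairs seen too rarely to trust
            continue
        p_ab = n / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))

    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical restaurant-review snippets
reviews = [
    "the pad thai was great",
    "great pad thai and great service",
    "the service was slow but the pad thai was worth it",
]
ranked = score_bigrams(reviews)
top_phrase = ranked[0][0]
```

On these three made-up reviews, the pair `("pad", "thai")` ranks first because the two words appear together every time either one appears. A real system would also need to handle longer phrases, entity typing, and much larger corpora.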
In the medical arena, Han and CS Professor Saurabh Sinha are part of a $9.3 million NIH-funded collaboration with Mayo Clinic to create a revolutionary analytical tool—Knowledge Engine for Genomics (KnowEnG)—that will allow biomedical researchers to place a gene-based data set in the context of all previously published gene-related data. This broad context for individual data sets will offer new functional insights for the genes being studied.
CS Professor Dan Roth is helping ICG Solutions, a real-time analytics company, draw insights on the 2016 presidential debates from tens of thousands of Twitter messages. ICG is employing some of Roth’s natural language understanding tools in its LUX streaming analytics platform. These tools identify entities (names of people, organizations, and locations), analyze sentiment (people’s feelings about a candidate), and infer demographic parameters of those sending the tweets.
Another aspect of Roth’s research investigates the trustworthiness of big data. “We not only want to know what people are saying about a topic, but can we believe it?” said Roth. “Algorithmically, you can determine if a source is trustworthy and if the claims are credible or not. We use this same technology for [debate] sentiment analysis.”
Roth co-founded NexLP to commercialize his text understanding and analytics tools. The Chicago-based company’s core technology is Story Engine, which automatically extracts and organizes facts from vast collections of documents like email messages and helps users understand key themes and connections within the data.
This story-telling aspect is a critical element of data science, said Illinois alumnus Aditya Singh (BS CompE ’01, MS EE ’04), a partner with Foundation Capital. “The best data scientists in the world will be story tellers who use technology and domain expertise to provide a compelling so-what and who then can communicate that effectively across all the stakeholders,” Singh said.
Text isn’t the only form of unstructured data—images are too. Computer vision faculty David Forsyth and Derek Hoiem have developed a data-driven method to find the location of small but important things (aka “little landmarks”) in pictures. One example is a car door handle, which by itself is indistinct and hard for a computer to locate.
“You can’t find the handle directly because it doesn’t have any distinctive pattern, so you’ve got to find something else that tells you where it is,” explained Forsyth.
“We’ve created a system that learns the context automatically and learns a sequence of steps to find it,” said Hoiem, who anticipates their method could have applications in robotics, helping a robot find a door handle or turn on a light switch. “This method is much more accurate at finding these hard to locate parts than something that is looking for the parts directly.”
Hoiem is also developing a learning system that knows where to look in an image in order to answer a text-based question about the image. For example, his system can examine a photo of a stoplight and correctly answer a query about which color is lit up—a significant improvement over other models that also used the Microsoft Common Objects in Context (COCO) dataset.
According to Hoiem, his model takes a fairly simple approach, mapping natural language onto the image and scoring the various regions in the image for relevance in order to answer the question correctly. Ideally, he’d like to enhance the model so it can learn to perform specialized tasks like counting, reading, and recognizing activities in order to answer more complicated questions.
While advances in deep learning and vision are making it possible to automatically attach descriptions to images, many challenges still remain. CS Professor Julia Hockenmaier has built a probabilistic model that exploits certain knowledge from photo captions. According to Hockenmaier, captions contain a lot of common-sense knowledge about everyday events. For example, the computer can learn that if a person is holding a shovel, then he/she is probably digging a hole.
“This basic concept or world knowledge can be a bottleneck for the computer,” she said. “We’ve shown that this can be really useful for solving certain kinds of semantic tasks that require inference.”
Hockenmaier and fellow CS faculty member Svetlana Lazebnik have also created richer models of image captioning through their Flickr30K Entities project, which explicitly pairs the mention of objects in the caption to their corresponding image regions.
The process of creating and training new algorithms to make sense of unstructured data relies on humans to first sort, filter, label, or otherwise annotate the images, text, or video. Big companies like Facebook, Google, and Amazon hire tens of thousands of people each year through online crowdsourcing sites to complete these tasks. The companies then apply their machine-learning algorithms to the human-annotated data and generate machine-learning models that are applied to the rest of their datasets.
“These companies spend an inordinate amount of money on crowdsourcing,” said CS Assistant Professor Aditya Parameswaran, noting that humans tend to be costly, slow, and error prone compared to computers. “It’s hard to figure out the best way to have humans help analyze unstructured data.”
Parameswaran has developed algorithms that rate workers based on their expertise, efficiency, and accuracy. “Our optimization algorithms could lead to significant reductions in cost, error, and latency,” he said.
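As a rough illustration of what rating workers can buy you, the Python sketch below estimates each worker's accuracy from a few items with known answers and then weights each worker's vote by that estimate. This is a generic textbook-style approach, not Parameswaran's optimization algorithms; the worker names, labels, smoothing choice, and function names are all hypothetical.

```python
from collections import defaultdict


def worker_accuracy(answers, gold):
    """Estimate per-worker accuracy from items whose true label is known.

    answers: list of (worker, item, label) tuples
    gold:    dict mapping item -> true label
    Uses add-one smoothing so a worker with few graded answers is not
    rated exactly 0 or 1.
    """
    correct, seen = defaultdict(int), defaultdict(int)
    for worker, item, label in answers:
        if item in gold:
            seen[worker] += 1
            correct[worker] += int(label == gold[item])
    return {w: (correct[w] + 1) / (seen[w] + 2) for w in seen}


def weighted_vote(answers, accuracy, item, default=0.5):
    """Pick the label for `item` with the largest accuracy-weighted support."""
    totals = defaultdict(float)
    for worker, it, label in answers:
        if it == item:
            totals[label] += accuracy.get(worker, default)
    return max(totals, key=totals.get)


# Hypothetical annotations: img1 and img2 have known answers, img3 does not
gold = {"img1": "cat", "img2": "dog"}
answers = [
    ("ann", "img1", "cat"), ("ann", "img2", "dog"), ("ann", "img3", "cat"),
    ("bob", "img1", "dog"), ("bob", "img2", "cat"), ("bob", "img3", "dog"),
    ("cy", "img1", "dog"), ("cy", "img2", "cat"), ("cy", "img3", "dog"),
]
accuracy = worker_accuracy(answers, gold)
label = weighted_vote(answers, accuracy, "img3")
```

In this made-up data, two of three workers label `img3` as a dog, but both were wrong on every graded item, so the weighted vote sides with the one reliable worker and returns `"cat"`. A plain majority vote would have chosen the other label.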
Not only is big data unstructured, but it can potentially be inaccurate. There are millions of living species and each has its own set of genes that number in the thousands. CS and Bioengineering Professor Tandy Warnow encounters these big datasets in determining the evolution of species.
“The problem isn’t just the volume of data, but when you try to understand evolution by looking at modern day species and working backwards in time, each of these genes has its own story,” said Warnow. “This heterogeneity across the genome makes it very hard to figure out the species’ tree.”
A species tree describes how different species evolved from a common ancestor. However, the conventional methods of creating the tree generate some errors. “The problem is that on a given gene, this information is usually wrong because gene trees aren’t always accurately computed,” Warnow said. “Often the error is small, but there is some error.”
Over the last few years, Warnow developed a method to create more reliable gene trees. Known as statistical binning, her approach sorts all the genes into sets, which are combined to create supergene trees. These new trees, in turn, are combined to form a more accurate species tree.
Her statistical binning technique enables researchers to construct more accurate species trees detailing the lineage of genes and the relationships between species. In fact, Warnow’s technique helped an international team of researchers produce the most reliable evolutionary tree of 48 species of birds in 2014.
Another development that is key to data science’s emergence, according to CS Associate Professor Indranil Gupta, is the ability to seamlessly run very large datasets on multiple machines. CS @ ILLINOIS systems-area faculty are exploring ways to make industry-standard programming frameworks like Hadoop and Storm run faster and be more tolerant of server failures. They also are finding ways to increase the efficiency of servers’ run time as jobs are scaled up to take advantage of the processing power of more machines.
Hadoop and Storm, which are used to optimize data storage and workflow solutions, typically process both research/batch jobs and time-sensitive production jobs simultaneously. For example, a production job that counts the number of clicks per ad on a web site needs to run quickly and frequently. If the results are delayed for any reason, it could mean lost revenue. However, a research job that is trying to discover better ways to place ads might run on the same dataset, but its results aren’t nearly as time sensitive.
Some organizations run separate research and production clusters and restrict the jobs that can be run on the latter. However, this can actually lead to longer run times during periods when many jobs are running.
In collaboration with Yahoo, Gupta has created a system that addresses this inefficiency by enhancing the Hadoop stack to support production jobs that have priorities. “Our system allows the higher priority jobs to get the resources in the same class,” he said.
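The general idea of letting higher-priority jobs claim shared cluster resources first can be sketched with a simple priority queue. The Python below is a toy model only: it does not reflect how Gupta's system or the Hadoop stack is implemented, and the class name, the notion of "slots," and the job names are assumptions invented for this example.

```python
import heapq
from itertools import count


class PriorityScheduler:
    """Toy scheduler: jobs with a lower priority number get slots first."""

    def __init__(self, slots):
        self.free_slots = slots
        self._queue = []
        self._order = count()  # tie-breaker: first submitted, first served

    def submit(self, name, priority, slots_needed):
        heapq.heappush(
            self._queue, (priority, next(self._order), name, slots_needed)
        )

    def schedule(self):
        """Start every queued job that fits, in priority order.

        Jobs that do not fit stay queued; a smaller, lower-priority job
        may still start if enough slots remain for it.
        """
        started, deferred = [], []
        while self._queue:
            entry = heapq.heappop(self._queue)
            _, _, name, need = entry
            if need <= self.free_slots:
                self.free_slots -= need
                started.append(name)
            else:
                deferred.append(entry)
        for entry in deferred:
            heapq.heappush(self._queue, entry)
        return started


# Hypothetical shared cluster with 4 slots
cluster = PriorityScheduler(slots=4)
cluster.submit("research-ad-placement", priority=5, slots_needed=3)
cluster.submit("click-count", priority=1, slots_needed=2)
cluster.submit("billing", priority=1, slots_needed=2)
started = cluster.schedule()
```

Although the research job was submitted first, the two production jobs start ahead of it and use all four slots, so the research job waits in the queue until slots are freed.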
Illinois systems faculty are also addressing cloud computing research, in part, through the Air Force Research Lab-funded Assured Cloud Computing center, which develops technology for mission-critical cloud computing across secure military and insecure private networks. ACC is also ensuring the confidentiality and integrity of data and communications, job completion in the presence of cyber attacks and failures, and timely completion of jobs.
According to ACC Director and CS Professor Roy Campbell, Illinois researchers in the last four years have established that you can build mission-critical cloud computing elements, deliver real-time results to secure the cloud, and make the cloud reliable.
For example, Gupta has improved the functioning of NoSQL databases, which cloud systems frequently employ, and developed more advanced scheduling algorithms. His method efficiently makes configuration changes on the servers in the background while handling the reads, writes, and transactions in the foreground.
“The clients don’t know that the reconfiguration changes have happened,” said Gupta, noting that the current reconfiguration state of the art requires the database be shut down temporarily. “They just send the queries as normal.”
Gupta has implemented his method into two industry systems: MongoDB, a document storage database used by the New York Times, and Cassandra, an open-source system used by Facebook and Netflix. “These are very impactful systems, so being able to make changes in them that directly improve the experience of system administrators and developers is really good,” Gupta said.
While supercomputers are key to helping scientists solve complicated problems like predicting the weather, finding new oil reserves, and discovering new drugs, running an application on a petascale machine is expensive, at more than $1,000 per hour. Ideally, that time would be used mainly for computation. In reality, a great deal of time is wasted while the machine reads, writes, and stores data.
CS Professors Bill Gropp and Marianne Winslett want to help scientists get the most out of their supercomputer time. In a recent study, they analyzed the behavior of over a million jobs from four leading supercomputers, including the U of I’s Blue Waters machine, to look for ways to improve I/O performance. When they mined the performance data routinely collected on these supercomputers, they found common patterns of behavior that severely limit applications' I/O performance.
To address this problem, the researchers created an I/O analytics tool called Dashboard, which visualizes the high-level I/O behavior of an application across all of its runs. "Scientists usually run their applications hundreds to thousands of times, at many different scales, but I/O performance analysis tools weren't taking advantage of this," Winslett said.
With so many runs, scientific codes could essentially serve as their own benchmarks, the researchers realized. "With Dashboard, scientists and platform administrators get so excited when they literally see what's going on with I/O, across all the jobs of their application or platform," Winslett said. "Flagship applications used to get all the attention in parallel I/O research, but the Dashboard's data science techniques bring high-end performance analytics within the reach of all supercomputer users."