FlashRelate: Extracting Relational Data from Semi-Structured Spreadsheets Using Examples
With hundreds of millions of users, spreadsheets are one of the most important end-user applications. Spreadsheets are easy to use and allow users great flexibility in storing data. This flexibility comes at a price: users often treat spreadsheets as a poor man’s database, leading to creative solutions for storing high-dimensional data in a two-dimensional grid. Trouble arises when users need to query their data. Data manipulation tools make strong assumptions about data layouts and cannot read these ad hoc databases. Converting data into the appropriate layout requires programming skills or a major investment in manual reformatting. As a result, vast amounts of real-world data are “locked in” to a proliferation of one-off formats.
We introduce FlashRelate, a synthesis engine that lets ordinary users extract structured data from spreadsheets without programming. Instead, users drive the extraction process by specifying output examples, which FlashRelate uses to synthesize a program in Flare. Flare is a novel extraction language that extends regular expressions with geometric constructs. We built an interactive user interface on top of FlashRelate that lets end users generate Flare programs by point-and-click. We demonstrate that correct extraction programs can be synthesized in seconds from a small number of examples for 43 real-world scenarios. Finally, our case study shows that FlashRelate addresses the widespread problem of data trapped in corporate and government formats.
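To make the idea of combining regular expressions with geometric constructs concrete, the following Python sketch hand-writes the kind of extraction such a program might perform. It is not Flare syntax and it performs no synthesis from examples; the sample grid, the three patterns, and the helper names `nearest` and `extract` are all hypothetical, chosen only for illustration. Each value cell is paired with the nearest cell to its left that looks like a name and the nearest cell above it that looks like a year, yielding one relational tuple per value.

```python
import re

# Hypothetical sample sheet: years across the top, names down the side.
GRID = [
    ["",        "2010", "2011"],
    ["Albania", "3.1",  "3.4"],
    ["Austria", "8.2",  "8.5"],
]

# Content constraints on cells (assumed patterns, not taken from the paper).
YEAR = re.compile(r"\d{4}")
NAME = re.compile(r"[A-Za-z ]+")
VALUE = re.compile(r"\d+\.\d+")


def nearest(grid, row, col, d_row, d_col, pattern):
    """Walk from (row, col) in direction (d_row, d_col) and return the
    first cell whose whole content matches pattern, or None."""
    r, c = row + d_row, col + d_col
    while 0 <= r < len(grid) and 0 <= c < len(grid[r]):
        if pattern.fullmatch(grid[r][c]):
            return grid[r][c]
        r, c = r + d_row, c + d_col
    return None


def extract(grid):
    """Return (name, year, value) tuples, one per value cell."""
    tuples = []
    for r, line in enumerate(grid):
        for c, cell in enumerate(line):
            if not VALUE.fullmatch(cell):
                continue
            name = nearest(grid, r, c, 0, -1, NAME)   # geometric constraint: look left
            year = nearest(grid, r, c, -1, 0, YEAR)   # geometric constraint: look up
            if name is not None and year is not None:
                tuples.append((name, year, cell))
    return tuples


if __name__ == "__main__":
    for row in extract(GRID):
        print(row)
```

In this sketch the regular expressions decide which cells are admissible for each output column, while the direction of the walk encodes the spatial relationship between columns. The hard-coded patterns and directions stand in for what a user would otherwise supply through output examples.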
A video demonstration is available at: http://tinyurl.com/mh3bo3a