New system could help researchers preserve, share data

October 2, 2014  


WEST LAFAYETTE, Ind. - Researchers are developing a standardized system to solve a recurring problem: experimental data are routinely forgotten or lost outright because researchers lack a reliable and convenient way to preserve and share them.

"When a student graduates or a researcher moves after years of research and work in the laboratory, that person takes with him or her quite a bit of information. Digital records representing years of very expensive experiments are often lost," said Santiago Pujol, a professor of civil engineering at Purdue University.

He is leading a team working at Purdue in research funded by the National Science Foundation to develop a method "for the systematic collection, curation and preservation of engineering and science data."

The focus will initially be on civil engineering applications including bridges, highways, buildings, pipelines and power-distribution networks.

"The civil engineering flavor of this is only for a pilot study," he said. "It's a proof of concept. But we intend for this system to be useful to the broader engineering and scientific communities in the future."

The cost of not creating such a system is too high, Pujol said.

"People end up doing things over again," he said. "Say I have an idea. So I go to the lab and I run a dozen experiments. That takes me years, and a lot of money: tens or hundreds of thousands of dollars. Through my lens, I interpret the data I collected and test my idea. But that's just my idea. Someone else might have a better idea, and yet they don't have access to my data. Theses and papers contain only snapshots of what I did. The complete digital records, including video, photos, sensor records, etcetera, usually vanish on media that become obsolete in someone's office, just as floppy and zip disks did not long ago."

The work is funded with a three-year grant for $1.5 million.

One of NSF's priority goals is to improve the nation's capacity in data science by investing in data infrastructure, building multi-institutional partnerships to increase the number of U.S. data scientists, and making data more useful and easier to use.

As part of that effort, NSF this week announced $31 million in new funding to support 17 projects under the Data Infrastructure Building Blocks (DIBBs) program.

"Collecting metadata during the archiving process is a major challenge for science and engineering communities," said Amy Walton, program director for the recent DIBBs solicitation at NSF. "This project would provide a platform for data sharing and archiving, and incorporate the automatic collection of metadata. The work builds upon existing infrastructure at Purdue University, where several testbeds are already in operation, and could have a significant impact on a broad community."

Pujol is working with Ann Christine Catlin, a senior research scientist in Purdue's Rosen Center for Advanced Computing; Michael McLennan, a senior research scientist and director of the HUBzero Platform for Scientific Collaboration; Ayhan Irfanoglu, an associate professor of civil engineering; Lisa Zilinski, an assistant professor of library science; and Chungwook Sim, a postdoctoral research associate in the Lyles School of Civil Engineering.

The concept grew out of work for NSF's George E. Brown Jr. Network for Earthquake Engineering Simulation (NEES), based at Purdue. NEES includes 14 laboratories for earthquake engineering and tsunami research, tied together by cyberinfrastructure that provides information technology for the network.

"We were supposed to act as curators of all the data produced by these 14 labs, and this became quite a challenge," Pujol said.

As with research data in general, not all of the data files were well organized; some lacked measurement units or the column headers explaining what had been recorded. Files of sensor readings often do not specify exactly where each sensor was located on the test site or specimen, which is a problem because sensor placement is essential to interpreting the measurements correctly.
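To give a sense of the kind of metadata that goes missing, the sketch below pairs a sensor recording with its units, column headers and physical location in a self-describing pair of files. This is a minimal, hypothetical illustration in Python; the file layout, field names and "write_recording" helper are assumptions made here for clarity, not the format the Purdue team is developing.

```python
import csv
import json

def write_recording(csv_path, meta_path, samples, metadata):
    """Save sensor samples plus a sidecar metadata file so the
    measurements remain interpretable years later."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Explicit column headers with units, so the numbers are self-explanatory.
        writer.writerow(["time_s", "acceleration_g"])
        writer.writerows(samples)
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2)

# Hypothetical accelerometer record from a structural test specimen.
samples = [(0.00, 0.001), (0.01, 0.032), (0.02, -0.018)]
metadata = {
    "sensor_id": "ACC-07",
    "quantity": "acceleration",
    "units": "g",
    "location": {"specimen": "column C3", "height_m": 1.2, "face": "north"},
    "sampling_rate_hz": 100,
}
write_recording("acc07.csv", "acc07.json", samples, metadata)
```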

The new research will add to two efforts related to NEES: One is a "project warehouse" that provides a place for researchers to upload project data, video, audio, documents, papers, dissertations and other files. The other effort is a system that makes it easier for people conducting similar experiments to share and explore data. Called DataStore, the system automatically turns spreadsheets into searchable databases that compile data from and for researchers with common interests around the world.
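For readers unfamiliar with the general idea behind a tool like DataStore, the short sketch below shows one generic way a spreadsheet of experimental results can be loaded into a searchable database and queried. The file name, column names and the pandas/SQLite approach are assumptions for illustration only; they are not the DataStore implementation.

```python
import sqlite3
import pandas as pd

# Load a spreadsheet of experiment results (hypothetical file and columns,
# e.g. specimen, drift_percent, peak_load_kN).
df = pd.read_excel("column_tests.xlsx")

# Store it as a table in a shared database so it can sit alongside
# records contributed by other researchers.
con = sqlite3.connect("shared_experiments.db")
df.to_sql("column_tests", con, if_exists="append", index=False)

# The pooled records can now be searched with ordinary queries.
rows = con.execute(
    "SELECT specimen, peak_load_kN FROM column_tests WHERE drift_percent > 2.0"
).fetchall()
print(rows)
con.close()
```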

The new work takes the previous research a step further by standardizing the technology and eventually expanding it to other areas of science and engineering beyond earthquake engineering.

"The idea is to create a flexible system that people in other disciplines can use," Pujol said. "The ultimate goal is to have a 'cloud' repository that is easy to use, that is intuitive and takes you through the right steps. The system has to be based on standards, but it's got to be simple and intuitive. Otherwise, no one is going to use it."  

Writer: Emil Venere, 765-494-4709, venere@purdue.edu

Source: Santiago Pujol, 765-496-8368, spujol@purdue.edu

Note to Journalists: Information about the NSF DIBBs program is available from Aaron Dubrow, adubrow@nsf.gov, 703-292-4489, in the NSF Office of Legislative and Public Affairs.
