Purdue expands open-source research in the cloud with Amazon Web Services

Researchers sit around a conference room table with their laptops while two professors stand in front of a TV monitor showing a data visualization.
From left to right: Benjamin Hancock, Minyoung Jung, Zhen Yu Qian, Jinha Jung, Sungchan Oh, Benjamin Goller and Ziqian Gong. (Purdue University photo)

Scientists and engineers thrive on data: tons of it, along with customizable ways to package and examine it. That’s what Purdue University’s Data to Science initiative (D2S) provides. It is an open-source platform where a community of multidisciplinary researchers can access, and contribute to, a centralized repository of globally sourced geospatial datasets.

Purdue is pleased to announce that D2S has been accepted into the Amazon Web Services (AWS) Open Data Sponsorship Program, which encourages development of communities that benefit from access to shared datasets.

The AWS Open Data program covers the cost of storage for publicly available, high-value cloud-optimized datasets, and the data transfer costs for end users accessing the data. It works with collaborators who seek to democratize access to data for analysis on the cloud, and who develop new cloud-native techniques, formats, and tools to lower the cost of working with that data.

“AWS is the industry leader when it comes to public cloud services, providing the most comprehensive and reliable cloud services worldwide,” said Jinha Jung, associate professor in the Lyles School of Civil and Construction Engineering.

“If we want to develop an ecosystem that is scalable and accessible worldwide, AWS will be the best choice in my opinion. We are deeply grateful to be accepted into the AWS Open Data Sponsorship Program, the result of a competitive selection process whose notable awardees include the EPA, NASA, NOAA, USGS, and the Indiana Geographic Information Office.”

Currently, all data on the D2S platform is hosted on servers managed by Purdue, which has sufficient computational resources to provide D2S services at the moment. As the D2S user community grows, Jung envisions that the on-premises services may reach a tipping point where more scalable computing infrastructure is required.

“This is where AWS Open Data program will be very valuable,” he said. “AWS works with data providers to democratize access to data by making it available for analysis on AWS; develop new cloud-native techniques, formats, and tools that lower the cost of working with data; and encourage the development of communities that benefit from access to shared datasets. Through the program, AWS has democratized access to petabytes of data, including satellite imagery, climate and weather data, genomic data, and data used for natural language processing. The full list of publicly available datasets is available on the Registry of Open Data on AWS.”

Researchers stand shoulder to shoulder while looking at a laptop computer.
From left to right: Minyoung Jung, Jinha Jung and Benjamin Hancock. (Purdue University photo)

Focus on UAV Datasets

D2S is the brainchild of Jung, whose expertise is in geomatics, the discipline of working with spatial (geographic) data. His research group, the Geospatial Data Science Lab, aims to innovate, enrich and synthesize geospatial data science for solving challenging problems by leveraging its know-how in remote sensing, geographic information systems, unmanned aerial systems and high-performance computing.

“My group collaborates with research scientists with diverse backgrounds, including, but not limited to, agriculture, forestry and transportation engineering,” he said. “The D2S platform is an effort to build an open-source ecosystem where these research scientists can collaborate openly and sustainably, centered around big geospatial data.”

Initial funding for D2S was provided by Purdue's Plant Sciences 2.0 Initiative and the Institute for Digital Forestry. The Institute for Digital Forestry focuses on developing digital platforms and strategies that will revolutionize forestry. "The goal was to facilitate data sharing among researchers and provide easy access to information that could help managers, policymakers and others make data-driven decisions," said Songlin Fei, director of the Institute and Dean's Chair of Remote Sensing in the College of Agriculture’s Department of Forestry and Natural Resources. "D2S is an important resource in our work, and it’s gratifying to see it continue to grow, making more data available to many more people," Fei added.

The D2S platform initially focused on data from unmanned aerial vehicles (UAVs), or drones, and on field-sampled data for crops and forestry. It stands out from other data-sharing offerings in a number of ways. Unlike general platforms, it is specifically designed to manage and share data from UAVs. It is open source, providing free access. And it incorporates user input so that its tools and features meet user needs, while offering collaborators training and support.

D2S is designed for ease of self-deployment. “It can be deployed in any environment that supports the open-source platform Docker, providing researchers with the flexibility to integrate the platform into their existing infrastructure,” Jung said. “This ensures that the platform can be customized and scaled according to specific research requirements.”

D2S currently hosts the USDA WheatCAP project, with UAV data from 41 wheat breeding programs across 22 institutions in 20 states. The Tippecanoe County Sheriff’s Office uses it to process, visualize and analyze UAV 3D mapping data from crash scenes; Purdue’s Agriculture and Natural Resources Extension similarly uses it to process, visualize, share and analyze its UAV data.

D2S is also a central geospatial data repository for the Institute for Digital Forestry, which is developing various applications on the ecosystem, including urban 3D mapping for Cook County, New York City, Denver and others. It is also compatible with Breedbase, the breeding management and analysis software used by many breeding programs worldwide, and that compatibility extends innovation opportunities to agricultural research communities.

Community-Driven Approach

D2S aligns closely with Purdue Computes, an initiative that encompasses Purdue’s research and programs in physical AI, computing, semiconductors, and quantum science. Purdue Computes is focused at the intersection of the virtual and physical — between the bits and bytes of AI and the atoms of growing, making, and moving things in the real world.

“To unlock the true potential of computing and AI, it is crucial to move beyond individual efforts to a collective, community-driven approach to data infrastructure,” Jung said. “This is in keeping with the Purdue Computes initiative — empowering researchers across different disciplines through a centralized repository of valuable datasets from projects worldwide.”

D2S not only fits Purdue’s strategy like a glove; it also serves national policy, especially in an era when unmanned autonomous vehicles are increasingly assuming center stage.

“D2S aligns with the White House Office of Science and Technology Policy mandates on openness in scientific enterprise, ensuring that federally funded research and supporting data are disclosed to the public at no cost,” Jung said. “While existing open science data repositories can serve as alternatives for open UAV data, they are often not well-equipped to handle the complex, voluminous and spatiotemporally rich nature of UAV data. As far as I know, D2S is the only open-source platform specialized for big geospatial data.”

As users proliferate and the datasets grow, Jung likens the effect to that which large language models (LLMs) have on AI.

“Advances in artificial intelligence are based on the enormous amounts of training data,” he said. “Many of the foundational techniques that underpin LLMs are applicable to geospatial artificial intelligence. These large-scale, high-quality D2S datasets hold enormous potential to help unlock new AI-powered frontiers in many disciplines.”