Although data-driven analysis has been heralded as a new paradigm in fundamental materials science, including X-ray diffraction (XRD) analysis, high-value materials datasets are often not made public and remain underutilized. This project designs and develops CRUX, a crowdsourced data infrastructure and set of services to curate, discover, share, and recommend unpublished XRD data and analytical results. CRUX promotes underutilized, high-quality materials science data by supporting the sharing and exploration of unpublished data with state-of-the-art crowdsourcing, knowledge harvesting, and machine learning techniques. CRUX provides a crowdsourced knowledge base through which scientists and the general public can share and access unpublished data resources. It also provides (a) a novel search engine that supports simple keyword search, returns relevant data resources even when no exact keyword match exists, and self-evolves to improve search quality, and (b) a "data feed" service that lets users easily receive and track updates to specific data resources of interest. The developed infrastructure and tools enable an open, collaborative, and sustainable platform that can facilitate the exchange of unpublished XRD data and discoveries, unlock new research problems (e.g., predictive analysis of materials compositions with multi-phase data), and inspire novel designs of machine learning pipelines (e.g., deep neural networks) for data-driven materials science. CRUX will make materials data resources available and shareable for a broad community including materials scientists, data analysts, software developers, and the general public, and thus promote long-term collaborative research, software development, and education.

The developed CRUX system enables (1) coherent representation of materials data, metadata, and knowledge in a three-tier knowledge graph model; (2) scalable XRD metadata curation and information extraction techniques to promote high-value unpublished XRD data sources for data-driven materials research; (3) adaptive, self-improving search and recommendation techniques that recommend relevant datasets in response to user requests and feedback, and that remain sustainable beyond the duration of the project; and (4) interactive and exploratory search techniques to explain and recommend relevant datasets beyond the scope of the initial queries. CRUX will be evaluated with established human-in-the-loop knowledge bases and active machine learning algorithms through cornerstone materials research such as the discovery of new high-temperature ferroelectrics. The research community will be able to share XRD data resources (analytical results, machine learning models, processing data) via "one-click" upload, search for high-quality data resources, and (re)discover new resources for machine learning pipelines. CRUX provides several components to advance data-driven materials research, including a materials knowledge graph model, automatic data integration, and an exploratory query engine that supports "Why" and "What-if" analysis for XRD data. The developed solutions will benefit data-driven materials science in general. For example, researchers can make use of unpublished two-phase data to predict new materials compositions, identify solubility limits through parameterization by machine learning tools, and refine machine learning models with more sophisticated techniques such as deep neural networks.
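
To make the idea of a keyword search that still returns relevant resources when no exact keyword match exists more concrete, the following is a minimal sketch only, not the CRUX implementation. The `Resource` class, its `keywords` and `concepts` fields, the `search` function, the similarity threshold, and the sample records are all hypothetical names and values chosen for illustration. String similarity from Python's standard `difflib` stands in for whatever matching technique CRUX actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass
class Resource:
    """Hypothetical record for one shared XRD data resource."""
    resource_id: str
    keywords: List[str]   # assumed metadata-level terms attached to the data
    concepts: List[str]   # assumed knowledge-level terms linked to the data


def search(resources: List[Resource], query: str, threshold: float = 0.6) -> List[str]:
    """Return resource ids for a query: exact keyword hits first,
    otherwise the closest approximate matches above `threshold`."""
    q = query.lower()

    # Step 1: exact keyword match (case-insensitive).
    exact = [r.resource_id for r in resources
             if q in (k.lower() for k in r.keywords)]
    if exact:
        return exact

    # Step 2: no exact match, so fall back to string similarity over
    # both keywords and concepts, ranked from best to worst score.
    scored = []
    for r in resources:
        terms = r.keywords + r.concepts
        best = max((SequenceMatcher(None, q, t.lower()).ratio() for t in terms),
                   default=0.0)
        if best >= threshold:
            scored.append((best, r.resource_id))
    return [rid for _, rid in sorted(scored, reverse=True)]


# Illustrative, made-up records.
catalog = [
    Resource("xrd-001", ["ferroelectric", "perovskite"], ["phase transition"]),
    Resource("xrd-002", ["solubility limit"], ["two-phase"]),
]

print(search(catalog, "perovskite"))      # exact keyword path
print(search(catalog, "ferroelectrics"))  # approximate fallback path
```

A self-improving engine of the kind described above would additionally adjust its ranking from user feedback; that step is deliberately left out of this sketch.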


This research is funded by NSF award CSSI OAC-2104007.