Research data and software publications have become a regular output of scientific work. Yet unlike more traditional text publications, widely established processes to assess and evaluate their quality are still missing. This prevents researchers from getting the credit they deserve, as common performance indicators often simply omit this part of their scientific contributions.
As part of the Helmholtz Association, the Task Group Helmholtz Quality Indicators for Data and Software Publications has been set up to develop a quality indicator to be used within the Association. The goal is to define a set of quality dimensions and attributes suitable for all branches represented in Helmholtz and to raise awareness and appreciation of research data and software publications as equally important scientific outputs. We base our work on already well-established frameworks like the FAIR principles and the COBIT Maturity Model and aim to define a graded model accounting for the multifaceted nature of contemporary research.
In our talk, we will present the vision of the Task Group as well as the current state of discussions. As the definition of these criteria is a continuous and dynamic process, we welcome feedback from the audience and want to encourage further dialogue within the community.
There is a gap between current responsible research and innovation (RRI) and open science (OS) practices on the one hand and assessment practices on the other. While research practices and their modes of publication and dissemination have diversified, assessment practices have remained narrow, focusing on criteria of publication quantity and reputation. In my talk, I will discuss two projects. The first project is the MERIT portal, an application and assessment software for professorial appointments. The MERIT portal introduces structured CVs that include RRI and OS criteria as well as strategies to reduce the risk of bias during assessments. The focus is on strengthening quality- and content-oriented assessments with the support of science-based metrics. The second project is the Open Data LOM project. In 2019, the Charité introduced an open science indicator in its institutional performance-oriented funding system.
The use of commonly agreed terminologies is an elementary component of database systems. They have an impact on data consistency, querying and retrieval, and interoperability. Creating, searching for, and agreeing on a terminology to be used is a non-trivial problem, as it requires specialised knowledge and coordination processes. This presentation introduces the terminology service that addresses some of these issues.
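To illustrate how such a terminology service is typically consumed, the sketch below queries a hypothetical REST endpoint for a concept label from Python. The base URL, endpoint path, parameters, and response fields are illustrative assumptions and do not describe the API of the service presented here.

```python
import requests

# Hypothetical example: look up a term in a terminology service via a REST API.
# The base URL, endpoint, and response fields are placeholders, not a real service.
BASE_URL = "https://terminology.example.org/api"

def search_term(label, ontology=None):
    """Search the terminology service for concepts matching a label."""
    params = {"q": label}
    if ontology:
        params["ontology"] = ontology
    response = requests.get(f"{BASE_URL}/search", params=params, timeout=10)
    response.raise_for_status()
    return response.json().get("results", [])

if __name__ == "__main__":
    for hit in search_term("blood pressure", ontology="snomed"):
        print(hit.get("id"), hit.get("label"), hit.get("definition"))
```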
Within the MeDaX project we study bioMedical Data eXploration using graph technologies. We design and implement efficient concepts and tools for the integration, enrichment, scoring, retrieval, and analysis of biomedical data. Interested in data similarity and quality measures, we initiated an international community project for biomedical provenance standardisation and cooperate within the Medical Informatics Initiative (MII) to FAIRify the MII core data set. These and other projects form the basis for the development of a pipeline for knowledge graph (KG) creation from diverse data sources, for automated semantic enrichment, and for data scoring and analysis. For the MeDaX-KG prototype, we build on existing tools such as CyFHIR (generic conversion of FHIR to Neo4j) and BioCypher (a harmonising framework for KG creation) and optimise graph complexity and structure with our own methods and code.
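As a rough illustration of how a FHIR-derived property graph can be explored once it is loaded into Neo4j, the following sketch uses the official neo4j Python driver. The node labels, relationship type, and properties are assumptions made for the example and do not reflect the actual MeDaX-KG schema.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Illustrative sketch only: the labels and properties (Patient, Observation,
# HAS_OBSERVATION, code, value) are assumed for a FHIR-derived graph and are
# not the actual MeDaX-KG schema.
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "password")

CYPHER = """
MATCH (p:Patient)-[:HAS_OBSERVATION]->(o:Observation)
WHERE o.code = $code
RETURN p.id AS patient, o.value AS value
LIMIT 25
"""

def observations_by_code(code):
    """Return patients and values for all observations with the given code."""
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            return [dict(record) for record in session.run(CYPHER, code=code)]

if __name__ == "__main__":
    for row in observations_by_code("blood-pressure"):
        print(row)
```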
Changes occur frequently, especially in data-driven long-term studies. Changing databases lead to the accumulation of many schemas and instances over time. However, any scientific application must be able to reconstruct the historical data to ensure the reproducibility or at least the explainability of the research results. A method is needed that allows each database version to be easily reconstructed at both the schema and data level, and data to be migrated between the different versions. Storing all versions over time is not a feasible solution, as it is often too expensive and storage-consuming. In contrast, a method that allows backward processing to earlier versions of the database guarantees the recoverability of the stored information without keeping different versions. This is the subject of our current research, where we use schema evolution with provenance and additional information to facilitate the reproducibility of scientific results over long periods of time. In this way, information loss can be avoided or at least reduced.
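As a conceptual sketch of the underlying idea (not the method developed in this research), the following Python snippet records schema changes as invertible operations: replaying them forwards migrates data to the newest version, replaying them backwards reconstructs an earlier version without storing it explicitly. The operation and table names are purely illustrative.

```python
from dataclasses import dataclass

# Conceptual sketch: each schema change is an operation with a forward and a
# backward direction, so earlier database versions can be reconstructed on
# demand instead of being stored explicitly.

@dataclass
class AddColumn:
    table: str
    column: str
    default: object = None

    def forward(self, rows):
        return [{**r, self.column: self.default} for r in rows]

    def backward(self, rows):
        # This sketch simply drops the column again; avoiding the resulting
        # information loss via provenance is exactly the hard part.
        return [{k: v for k, v in r.items() if k != self.column} for r in rows]


@dataclass
class RenameColumn:
    table: str
    old: str
    new: str

    def forward(self, rows):
        return [{(self.new if k == self.old else k): v for k, v in r.items()} for r in rows]

    def backward(self, rows):
        return [{(self.old if k == self.new else k): v for k, v in r.items()} for r in rows]


# Usage: replay the operation log forwards to migrate, backwards to reconstruct.
history = [AddColumn("measurements", "unit", "mmHg"),
           RenameColumn("measurements", "val", "value")]
v1 = [{"id": 1, "val": 120}]
latest = v1
for op in history:
    latest = op.forward(latest)
reconstructed_v1 = latest
for op in reversed(history):
    reconstructed_v1 = op.backward(reconstructed_v1)
print(latest, reconstructed_v1)
```

The backward direction of AddColumn also makes the core difficulty visible: naively dropping the column loses information, which is where provenance and additional information come in.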
Galaxy is an open-source platform that allows researchers to analyze and share scientific data using interoperable APIs and various user-friendly web-based interfaces. The Galaxy project was launched in 2005 and has since become a powerful tool for researchers across a wide range of research fields, including *omics, biodiversity, machine learning, cheminformatics, NLP, materials science, and climate research.
One of the key features of the Galaxy platform is its emphasis on transparency, reproducibility, and reusability. Galaxy is a multi-user environment that facilitates the sharing of, e.g., tools, workflows, notebooks, visualizations, and data with others. This makes it particularly easy to reproduce results in order to verify their correctness and enable other researchers to build upon them in future studies. All provenance information of a dataset, including the versions of the tools used, their parameters, and the execution environment, is captured and can be reused or exported to public archives using standards like BCO or RO-Crate.
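As a minimal sketch of programmatic access to a Galaxy server, the snippet below uses BioBlend, a community Python client for the Galaxy API, to list shared histories and workflows. The server URL and API key are placeholders, and the example only scratches the surface of the provenance and export functionality described above.

```python
from bioblend.galaxy import GalaxyInstance  # community Python client for the Galaxy API

# Minimal sketch: connect to a Galaxy server and list shared resources.
# URL and API key are placeholders.
gi = GalaxyInstance(url="https://usegalaxy.eu", key="YOUR_API_KEY")

# Histories group datasets together with the tools and parameters that produced them.
for history in gi.histories.get_histories():
    print(history["id"], history["name"])

# Workflows capture multi-step analyses that can be shared and re-run by others.
for workflow in gi.workflows.get_workflows():
    print(workflow["id"], workflow["name"])
```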
Does research data management as we know it in the context
of database research or data science need platforms like Hugging Face?
Or are platforms and services such as Kaggle or GESIS sufficient? In
this talk, after giving a brief overview of the core features of
Hugging Face, we claim that the data research community would benefit
a lot from a platform similar to Hugging Face, in particular when
considering the support of the FAIR principles. We will also stress
that proper infrastructures for research data management should go
beyond just managing datasets and making them accessible to the
research community. In particular, in view of large-scale data
management, processing, and analysis, it would be extremely helpful to
provide researchers with a platform that offers various tools and APIs to
easily interact with and explore diverse forms of data.
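As a small illustration of the kind of low-friction, FAIR-supporting access we have in mind, the sketch below uses the Hugging Face huggingface_hub and datasets libraries to find and load a public dataset; the search term and dataset name are arbitrary examples.

```python
from datasets import load_dataset        # Hugging Face datasets library
from huggingface_hub import HfApi        # programmatic access to the Hub

# Find datasets on the Hub matching a search term (metadata makes them findable).
api = HfApi()
for info in api.list_datasets(search="imdb", limit=5):
    print(info.id)

# Load a published dataset in a single call (accessible and reusable).
ds = load_dataset("imdb", split="train")
print(ds.column_names, len(ds))
```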
In this talk, Snowflake is briefly introduced and challenges in the area of databases that we are currently working on are highlighted. The Snowflake Academia Program is also briefly presented.
The current biodiversity crisis has triggered an extreme need for a better understanding of the network of life on Earth. Efficient data management is crucial in biodiversity research and is the backbone for a digital twin of past, present, and future life. The Research Data Commons (RDC) is the central cloud-based information system architecture of NFDI4Biodiversity, a consortium of the NFDI (Nationale Forschungsdateninfrastruktur) offering reliable biodiversity data and services to improve the conservation of global biodiversity.
This talk introduces the essential components of the RDC and provides an overview of research problems and issues we faced during its first development phase. As biodiversity is a data-intensive discipline with many heterogeneous small and large data sources, following various metadata formats and collected from different research communities, the RDC faces massive data integration problems. Moreover, the derived data products must also comply with specific criteria such as the FAIR data principles. In summary, we see plenty of opportunities for the database community to address challenging research questions in an area highly relevant to society.
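To make the integration problem concrete, the following toy sketch maps records from two hypothetical source formats onto a common schema. The field names are loosely Darwin Core-inspired placeholders and do not correspond to the actual RDC data model.

```python
# Hypothetical sketch of the kind of metadata harmonisation such a system has
# to perform: records arriving in different formats are mapped onto one common
# schema. Field names are illustrative placeholders.

FIELD_MAPPINGS = {
    "darwin_core": {"scientificName": "species", "eventDate": "date", "decimalLatitude": "lat"},
    "custom_csv": {"species_name": "species", "sampling_date": "date", "latitude": "lat"},
}

def harmonise(record, source_format):
    """Map a source-specific metadata record onto the common schema."""
    mapping = FIELD_MAPPINGS[source_format]
    return {target: record[source] for source, target in mapping.items() if source in record}

print(harmonise({"scientificName": "Parus major", "eventDate": "2021-05-03"}, "darwin_core"))
print(harmonise({"species_name": "Parus major", "sampling_date": "2021-05-03"}, "custom_csv"))
```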
1-Minute Teasers presenting the posters
The problem of generating synthetic data is almost as old as modern research itself. However, with the advent of generative AI, new possibilities for synthesizing tabular data have emerged that go far beyond the capabilities of traditional statistical or rule-based approaches. Most of this new research comes from the ML community, where ML models need to be fed with useful training data. Since many data management use cases also require synthetic data, it makes sense to adapt these research results. Nevertheless, those use cases, such as query optimization, have different requirements than ML use cases, requirements that are currently not met by such modern synthesizers. In this talk, we will give an overview of the current state of the art in the field of tabular data synthesis and discuss open challenges in the context of generating synthetic tabular data for data management.
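To illustrate how the requirements differ, the toy sketch below compares predicate selectivities on a real and a synthetic table with pandas: a synthesizer that is good enough for ML training may still distort exactly the statistics a cost-based query optimizer relies on. The tables and predicates are made up for illustration.

```python
import pandas as pd

# Illustrative check (not tied to any specific synthesizer): for query
# optimization, a synthetic table is only useful if it preserves predicate
# selectivities, which is a different requirement than the ML-oriented
# fidelity metrics most modern synthesizers are tuned for.

def selectivity(df, predicate):
    """Fraction of rows satisfying a predicate, i.e. its selectivity."""
    return predicate(df).mean()

real = pd.DataFrame({"price": [10, 20, 20, 35, 50, 80], "qty": [1, 3, 2, 5, 4, 1]})
synthetic = pd.DataFrame({"price": [12, 18, 25, 30, 55, 75], "qty": [2, 2, 3, 4, 4, 2]})

predicates = {
    "price < 30": lambda df: df["price"] < 30,
    "price < 30 AND qty > 2": lambda df: (df["price"] < 30) & (df["qty"] > 2),
}

for name, pred in predicates.items():
    print(f"{name}: real={selectivity(real, pred):.2f}, "
          f"synthetic={selectivity(synthetic, pred):.2f}")
```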
Reproducible research emphasizes the importance of documenting and publishing scientific results in a manner that enables others to verify and extend them. In this talk, we explore computational reproducibility within the context of Jupyter notebooks, presenting insights and challenges from our study. We will present the key steps of the pipeline we used for assessing the reproducibility of Jupyter Notebooks. In our study, we analyzed the notebooks extracted from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. Our process involved identifying the notebooks by mining the full text of publications, locating them on GitHub, and attempting to rerun them in an environment closely resembling the original. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications, including results related to programming languages, notebook structure, naming conventions, modules, dependencies, etc. Furthermore, we will discuss the common issues and practices, identify emerging trends, and explore potential enhancements to Jupyter-centric workflows. Through this comprehensive examination, we aim to provide actionable insights and practical strategies for researchers striving to enhance the reproducibility of their work within the Jupyter notebook ecosystem and contribute to the ongoing dialogue surrounding reproducibility and computational methodologies in scientific research.
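As a simplified illustration of one step in such a pipeline (not the exact code used in our study), the snippet below re-executes a notebook in a fresh kernel with nbformat and nbclient and records whether it runs to completion or raises an exception.

```python
import nbformat
from nbclient import NotebookClient
from nbclient.exceptions import CellExecutionError

# Simplified sketch: re-execute a notebook in a fresh kernel and record the outcome.

def rerun_notebook(path, timeout=600):
    nb = nbformat.read(path, as_version=4)
    client = NotebookClient(nb, timeout=timeout, kernel_name="python3")
    try:
        client.execute()
        return {"notebook": path, "status": "success"}
    except CellExecutionError as err:
        # A cell raised an exception during re-execution.
        return {"notebook": path, "status": "exception", "error": str(err).splitlines()[0]}
    except Exception as err:
        # E.g. missing kernel or unresolved dependencies.
        return {"notebook": path, "status": "failed", "error": repr(err)}

if __name__ == "__main__":
    print(rerun_notebook("analysis.ipynb"))
```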
TIRA is a platform to organize shared tasks with software
submissions, mostly in information retrieval and natural language
processing. Due to the software submissions, TIRA allows blinded
experimentation on (confidential) datasets to which participants have no
access. After a shared task, its artifacts, i.e., research data in the
form of submitted software, inputs to and outputs of systems, or
ground-truth labels, can be made publicly accessible if desired.
Archiving software and data artifacts in TIRA aims to improve the
reproducibility of experimental results and to simplify comparisons
against strong baselines in future research.