Frühjahrstreffen der FG Datenbanken

HS 024 (Universitätshauptgebäude)

Fürstengraben 1 07743 Jena
Birgitta König-Ries (Heinz Nixdorf Chair for Distributed Information Systems)
Registration
Registration for the 2024 Spring Symposium of the GI Fachgruppe Database Systems
    • Talks HS 024

      • 1
        Welcome
        Speaker: Birgitta König-Ries (Heinz Nixdorf Chair for Distributed Information Systems)
      • 2
        On the Path to a Quality Indicator for Software and Data Publications for the Helmholtz

        Research data and software publications have become a regular output of scientific work. Yet unlike more traditional text publications, widely established processes to assess and evaluate their quality are still missing. This prevents researchers from getting the credit they deserve, as common performance indicators often simply omit this part of scientific contributions.
        As part of the Helmholtz Association, the Task Group Helmholtz Quality Indicators for Data and Software Publications has been set up to develop a quality indicator to be used within the Association. The goal is to define a set of quality dimensions and attributes suitable for all branches represented in Helmholtz and to raise the awareness and appreciation of research data and software publications as equally important scientific outputs. We base our work on well-established frameworks such as the FAIR principles and the COBIT Maturity Model and aim to define a graded model that accounts for the multifaceted nature of contemporary research.
        In our talk, we will present the vision of the Task Group as well as the current state of discussions. As the definition of these criteria is a continuous and dynamic process, we welcome feedback from the audience and want to encourage further dialogue within the community.

        Speaker: Marcel Meistring (Helmholtz Open Science Office)
      • 3
        From theory to practice - Advancing Research Assessment for Incentives at Charité and BIH through infrastructure

        There is a gap between current responsible research and innovation (RRI) and open science (OS) practices on the one hand and assessment practices on the other. While research practices and their ways of publication and dissemination have diversified, assessment practices have remained narrow, focusing on criteria of publication quantity and reputation. In my talk, I will discuss two projects. The first project is the MERIT portal, an application and assessment software for appointments of professors. The MERIT portal introduces structured CVs that include RRI and OS criteria, as well as strategies to reduce the risk of bias during assessments. The focus is to strengthen quality- and content-oriented assessments with the support of science-based metrics. The second project is the Open Data LOM project. In 2019, the Charité introduced an open science indicator in the institutional performance-oriented funding system.

        Speaker: Miriam Kip (BIH Charité)
    • Breaks HS 024

    • Talks HS 024

      • 4
        Terminologies in database systems

        The use of commonly agreed terminologies is an elementary component of database systems. They affect data consistency, querying, retrieval, and interoperability. Creating, searching for, and agreeing on a terminology to be used is a non-trivial problem, as it requires specialised knowledge and coordination processes. This presentation introduces the terminology service that deals with some of these issues.

        Speaker: Felix Engel (TIB)
      • 5
        MeDaX - a knowledge graph for biomedicine

        Within the MeDaX project, we study bioMedical Data eXploration using graph technologies. We design and implement efficient concepts and tools for integration, enrichment, scoring, retrieval, and analysis of biomedical data. Interested in data similarity and quality measures, we initiated an international community project for biomedical provenance standardisation and cooperate within the Medical Informatics Initiative (MII) to FAIRify the MII core data set. These and other projects form the basis for developing a pipeline for knowledge graph (KG) creation from diverse data sources, for automated semantic enrichment, and for data scoring and analysis. For the MeDaX-KG prototype, we build on existing tools such as CyFHIR (generic conversion of FHIR to Neo4j) and BioCypher (harmonising framework for KG creation) and optimise graph complexity and structure with our own methods and code.

        Speaker: Judith Wodke (U Greifswald)
      • 6
        Schema Evolution in Research Data

        Changes occur frequently, especially in data-driven long-term studies. Changing databases lead to the accumulation of many schemas and instances over time. However, any scientific application must be able to reconstruct the historical data to ensure the reproducibility, or at least the explainability, of the research results. A method is needed that allows each database version to be easily reconstructed at both the schema and data level, and data to be migrated between the different versions. Storing all versions over time is not a feasible solution, as it is often too expensive and storage-consuming. In contrast, a method that allows backward processing to earlier versions of the database guarantees the recoverability of the stored information without keeping different versions. This is the subject of our current research, where we use evolution with provenance and additional information to facilitate the reproducibility of scientific results over long periods of time. In this way, information loss can be avoided or at least reduced.

        Speaker: Tanja Auge (U Regensburg)
    • Breaks HS 024

    • Talks HS 024

      • 7
        Democratising data analysis with Galaxy

        Galaxy is an open-source platform that allows researchers to analyze and share scientific data using interoperable APIs and various user-friendly web-based interfaces. The Galaxy project was launched in 2005 and has since become a powerful tool for researchers across a wide range of research fields, including *omics, biodiversity, machine learning, cheminformatics, NLP, materials science, and climate research.

        One of the key features of the Galaxy platform is its emphasis on transparency, reproducibility, and reusability. Galaxy is a multi-user environment that facilitates sharing of, e.g., tools, workflows, notebooks, visualizations, and data with others. This makes it particularly easy to reproduce results in order to verify their correctness and enables other researchers to build upon them in future studies. All provenance information of a dataset, including the versions of the tools used, the parameters, and the execution environment, is captured and can be reused or exported to public archives using standards like BCO or RO-Crate.

        Speaker: Björn Grüning
      • 8
        From Research Data Management to Data Platforms: A Hugging Face Approach

        Does research data management as we know it in the context
        of database research or data science need platforms like Hugging Face?
        Or are platforms and services such as Kaggle or GESIS sufficient? In
        this talk, after giving a brief overview of the core features of
        Hugging Face, we claim that the data research community would benefit
        a lot from a platform similar to Hugging Face, in particular when
        considering the support of the FAIR principles. We will also stress
        that proper infrastructures for research data management should go
        beyond just managing datasets and making them accessible to the
        research community. In particular, in view of large-scale data
        management, processing and analysis, it would be extremely helpful to
        provide researchers with a platform that offers various tools and APIs to
        easily interact with and explore diverse forms of data.

        Speaker: Michael Gertz (U Heidelberg)
      • 9
        Snowflake Berlin

        The talk briefly introduces Snowflake and outlines the database challenges we are currently working on. It also briefly presents the Snowflake Academia Program.

        Speaker: Dirk Junghanns (Snowflake)
    • Breaks: Symposium Dinner Fritz Mitte

      Schlossgasse 20
    • Talks HS 024

      • 10
        Problems and Issues in Biodiversity Data Infrastructures

        The current biodiversity crisis has triggered an urgent need for a better understanding of the network of life on Earth. Efficient data management is crucial in biodiversity research and is the backbone for a digital twin of past, present, and future life. The Research Data Commons (RDC) is the central cloud-based information system architecture of NFDI4Biodiversity, the consortium of the NFDI (Nationale Forschungsdateninfrastruktur) that offers reliable biodiversity data and services for improving the conservation of global biodiversity.

        This talk introduces the essential components of the RDC and provides an overview of research problems and issues we faced during its first development phase. As biodiversity is a data-intensive discipline with many heterogeneous small and large data sources that follow various metadata formats and are collected by different research communities, the RDC faces massive data integration problems. Moreover, the derived data products must also meet specific criteria such as the FAIR data principles. In summary, we see plenty of opportunities for the database community to address challenging research questions in an area highly relevant to society.

        Speaker: Bernhard Seeger (U Marburg)
      • 11
        Flashtalks

        One-minute teasers presenting the posters

    • Special interest group meeting: Session of the Fachgruppe HS 024

    • Poster: Poster Session with Coffee Foyer

      Universitätshauptgebäude

    • Talks HS 024

      • 12
        Tabular Data Synthesis for Data Management

        The problem of generating synthetic data is almost as old as modern research itself. However, with the advent of generative AI, new possibilities for synthesizing tabular data have emerged that go far beyond the capabilities of traditional statistical or rule-based approaches. Most of this new research comes from the ML community, where ML models need to be fed with useful training data. Since many data management use cases also require synthetic data, it makes sense to adapt these research results. Nevertheless, those use cases, such as query optimization, have different requirements than ML use cases, and these requirements are currently not met by such modern synthesizers. In this talk, we will give an overview of the current state of the art in the field of tabular data synthesis and discuss open challenges in the context of generating synthetic tabular data for data management.

        Speaker: Fabian Panse (HPI)
      • 13
        Exploring Computational Reproducibility in Jupyter Notebooks: Insights and Challenges

        Reproducible research emphasizes the importance of documenting and publishing scientific results in a manner that enables others to verify and extend them. In this talk, we explore computational reproducibility within the context of Jupyter notebooks, presenting insights and challenges from our study. We will present the key steps of the pipeline we used for assessing the reproducibility of Jupyter Notebooks. In our study, we analyzed the notebooks extracted from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. Our process involved identifying the notebooks by mining the full text of publications, locating them on GitHub, and attempting to rerun them in an environment closely resembling the original. We documented reproduction success and exceptions and explored relationships between notebook reproducibility and variables related to the notebooks or publications, including results related to programming languages, notebook structure, naming conventions, modules, dependencies, etc. Furthermore, we will discuss the common issues and practices, identify emerging trends, and explore potential enhancements to Jupyter-centric workflows. Through this comprehensive examination, we aim to provide actionable insights and practical strategies for researchers striving to enhance the reproducibility of their work within the Jupyter notebook ecosystem and contribute to the ongoing dialogue surrounding reproducibility and computational methodologies in scientific research.

        Speaker: Sheeba Samuel (Friedrich Schiller University)
      • 14
        Research Data Management in TIRA for Reproducible Shared Tasks

        TIRA is a platform to organize shared tasks with software
        submissions, mostly in information retrieval and natural language
        processing. Due to the software submissions, TIRA allows blinded
        experimentation on (confidential) datasets to which participants have no
        access. After a shared task, its artifacts, i.e., research data in
        the form of submitted software, system inputs and outputs, or
        ground-truth labels, can be made publicly accessible if
        desired. Archiving of software and data artifacts in TIRA aims to
        improve experimental results' reproducibility and simplify comparisons
        against strong baselines in future research.

        Speaker: Maik Fröbe (U Jena)