
Keynotes

We are very pleased to announce the GvDB 2017 keynote speakers:



Big Data Management and Apache Flink: Key Challenges and (Some) Solutions (by Volker Markl)

Abstract: The shortage of qualified data scientists is effectively limiting Big Data from fully realizing its potential to deliver insight and provide value for scientists, business analysts, and society as a whole. In order to remedy this situation, we believe that novel technologies that draw on the concepts of declarative languages, query optimization, automatic parallelization and hardware adaptation are necessary. In this talk, we will discuss several aspects of our research in this area, including results in how to optimize iterative data flow programs, optimistic fault-tolerance, and steps toward a deep language embedding of advanced data analysis programs. We will also discuss how our research activities have led to Apache Flink, an open-source big data analytics system, which by now has become a major data processing engine in the Apache Big Data Stack, used in a variety of applications by academia and industry. 

Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin), director of the research group “Intelligent Analysis of Massive Data” at the German Research Center for Artificial Intelligence (DFKI), and speaker of the Berlin Big Data Center (BBDC). Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 19 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been speaker and principal investigator of the Stratosphere research project that resulted in the "Apache Flink" big data analytics system. Dr. Markl currently serves as the secretary of the VLDB Endowment and was elected as one of Germany's leading "digital minds" (Digitale Köpfe) by the German Informatics Society (GI). Volker Markl and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on implicit parallelism through deep language embedding.

Website: http://www.dima.tu-berlin.de

Query Processing on modern CPUs (by Johann-Christoph Freytag)

Abstract: This talk consists of two parts, both relating to the challenge of how to better exploit the fine-grained parallelism that comes with today’s CPUs. After a general introduction to the changes in hardware over the last two decades, we show in the first part of the talk how to accelerate the processing of tree-based index structures by using SIMD instructions. We adapt the B+-Tree and the prefix B-Tree (trie) by changing the search algorithm on inner nodes from binary search to k-ary search. We develop adaptations of the tree structures that satisfy the specific constraints of SIMD instructions, and we present algorithms for transforming the original tree layout into a SIMD-friendly layout. Our adapted B+-Tree speeds up searches by a factor of up to eight for small data types compared to the original B+-Tree using binary search. Furthermore, our adapted prefix B-Tree delivers high search performance even for larger data types. The second part of this talk focuses on the problem of how to best partition a given query execution plan and its data into tasks in order to achieve an optimal (query) execution on multiple CPU cores. To this end, we first present a classification of existing approaches in various DBMSs as a basis for developing a generic Query Task Model (QTM) for query execution. This model opens up a design space for scheduling parallel task execution, thus making different existing approaches comparable. Based on QTM, we show that existing execution schedules do not guarantee the fastest execution on multiple cores. At the same time, the model allows us to characterize the best (fastest) execution schedules based on (the ratio of) data locality and instruction locality.
This work was done together with Steffen Zeuch and Frank Huber.
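The switch from binary to k-ary search mentioned in the abstract can be illustrated with a minimal scalar sketch. This is not the speakers' implementation: the function name `kary_search` and the parameter `k` are hypothetical, and the k-1 separator comparisons that a SIMD instruction would evaluate at once are emulated here with an ordinary loop.

```python
def kary_search(keys, target, k=4):
    """Return how many entries of the sorted list `keys` are <= target.

    In a B+-Tree inner node this count is the index of the child to follow.
    Each round compares up to k-1 separators; a SIMD version would perform
    these comparisons with a single instruction (emulated here by a loop).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = 0, len(keys)  # invariant: the result lies in [lo, hi]
    while lo < hi:
        step = max(1, (hi - lo) // k)
        # Up to k-1 separator positions that split [lo, hi) into partitions.
        seps = list(range(lo + step - 1, hi, step))[: k - 1]
        # Because keys are sorted, the separators <= target form a prefix.
        hits = sum(1 for s in seps if keys[s] <= target)
        if hits > 0:
            lo = seps[hits - 1] + 1  # result is right of the last hit
        if hits < len(seps):
            hi = seps[hits]          # result is not beyond the first miss
    return lo


keys = [2, 3, 5, 7, 11, 13, 17, 19, 23]
slot = kary_search(keys, 12, k=4)  # 5: five keys are <= 12
```

With k = 2 the sketch degenerates to a binary-style search; larger k reduces the number of rounds at the cost of more comparisons per round, which is the trade-off SIMD comparisons make attractive.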

Bio: Johann-Christoph Freytag is currently full professor for Databases and Information Systems (DBIS) at the Computer Science Department of the Humboldt-Universität zu Berlin, Germany. Before joining the department in 1994, he was a research staff member at the IBM Almaden Research Center (1985-1987), a researcher at the European Computer-Industry-Research Centre (ECRC, in Munich, Germany, 1987-1989), and the head of Digital's Database Technology Center (also in Munich, 1990-1993). He holds a Ph.D. in Applied Mathematics/Computer Science from Harvard University, MA. Freytag's research interests include all aspects of query processing and query optimization in object-relational database systems, new developments in the database area (such as semi-structured data, data quality, databases and security), privacy in database systems, and the application of database technology to domains such as GIS, genomics, and bioinformatics/life science. In recent years he received the IBM Faculty Award four times for collaborative work in the areas of databases, middleware, and bioinformatics/life science, as well as the HP Innovation Award for excellent cooperation in the area of databases and workflow systems. He organized the VLDB conference in Berlin in 2003 and was a member of the VLDB Endowment (2001-2007). From 2009 to 2015 Freytag headed the German database interest group (Fachbereich DBIS) of the Gesellschaft für Informatik (GI). Since 2015 he has been a member of the Extended Executive Board of the GI.

Application of database technology to manage, preserve and analyse plant genomics and phenomics data (by Matthias Lange)

Abstract: The Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) is committed to the conservation and valorization of plant genetic resources. Its research agenda comprises upstream and downstream analyses in the fields of genetics, physiology and cell biology, aiming at a broad understanding of plants at the molecular, cellular and organismic levels. The “big data” challenge has long since reached the IPK and the life science community in general. It creates the need for capable infrastructures to document, share, publish, integrate and explore research data. In this talk, we give an overview of IPK’s research projects in the fields of Lab Data Management, Data Citation and Information Retrieval.
Data documentation (LIMS): Handling research data is an important task in IPK’s research strategy and, in particular, a central component of the value-added chain towards scientific publications, patents and biotechnological innovations. Consequently, we are convinced that it is essential to implement an intuitive and seamless data storage and documentation infrastructure that can be easily embedded into existing workflows and will be readily accepted by scientists. In practice, this aim is very challenging to meet. Bioinformaticians implemented individual, isolated and heterogeneous systems to manage experimental data, laboratory workflows and biological samples. Those considerations were the driving force for establishing a central Laboratory Information Management System (LIMS). In the talk, we will present experiences from the last six years of LIMS data management, including pitfalls and successfully implemented lab processes.
Data publication (e!DAL): Besides the publication of scientific findings, it is important to preserve the investment in data and to ensure that the data can be processed in the future. This implies a guarantee of long-term preservation and the prevention of data loss. Condensed and enriched with metadata, primary data would be a more valuable resource than the “re-extraction” from articles. In this context it becomes essential to change the handling and the acceptance of primary data within the scientific community. Data publications should receive high attention, and data publishers should gain reputation from them. Here, we present e!DAL [2] (http://edal.ipk-gatersleben.de) as a lightweight software framework for the publication and sharing of research data. Its main features are version tracking, management of metadata, information retrieval, registration of persistent identifiers (DOIs), an embedded HTTP(S) server for public data access, access as a network file system, and a scalable storage backend. e!DAL is available as a local API for non-shared storage and as a remote API for distributed applications. IPK is an approved data center in the international DataCite consortium and applies e!DAL as its data submission and registration system. e!DAL is the software infrastructure for the Plant Genomics and Phenomics Research Data Repository (PGP) [3], a repository to comprehensively publish plant research data. This covers in particular cross-domain datasets that are not being published in central repositories because of their volume or unsupported data scope, such as image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry, as well as software and documents. PGP is registered as a research data repository at BioSharing.org, re3data.org and OpenAIRE as a valid EU Horizon 2020 open data archive.
The features above, together with the programmatic interface and the support of standard metadata formats, enable PGP to fulfil the FAIR data principles: findable, accessible, interoperable, reusable.
Information retrieval (LAILAPS): Due to advances in high-throughput technologies, the amount of data available from life science web resources is growing rapidly. It is becoming an increasingly difficult and time-consuming task for scientists to derive information from those resources and to keep up to date even within their own field of research. For example, the correct identification of causative genes for an important agronomic trait can be very valuable for effective marker-assisted breeding. However, even well-defined QTL often span genomic regions that can contain hundreds of positional candidate genes. Evaluating potential functional candidates from such long lists is often time-consuming and requires the integration of information from many different sources. In this context, information retrieval (IR) is evolving into a key technology. It is increasingly popular for data exploration because users need no knowledge of complex query languages, underlying data structures or data formats. Here we will present how to use the LAILAPS [4] integrative search engine for plant genomics data (http://lailaps.ipk-gatersleben.de), which is developed within the EU transPLANT consortium. LAILAPS supports integrative search over distributed genome annotation (traits, gene functions, agronomic factors). For this, 50 million records from the most widely used genome annotation repositories, such as UniProt, BioModels, OBO ontologies and PDB, are indexed. Moreover, 80 million gene annotations of plant genomic resources are linked by reverse identifier mapping. In order to select the most relevant candidate genes for queried traits, LAILAPS uses context-based relevance ranking. The order of the search hits is computed by an artificial-intelligence-driven relevance ranking, which has been trained by domain experts and evaluated for QTL candidate gene prediction.
Literature
[1] Daniel Arend, Christian Colmsee, Helmut Knüpffer, Markus Oppermann, Uwe Scholz, Danuta Schüler, Stephan Weise, Matthias Lange. Data management experiences and best practices from the perspective of a plant research institute. In: Galhardas H, Rahm E (Eds.): Data integration in the life sciences: 10th international conference, DILS 2014
[2] Daniel Arend, Matthias Lange, Jinbo Chen, Christian Colmsee, Steffen Flemming, Denny Hecht, Uwe Scholz. e!DAL - a framework to store, share and publish research data. BMC Bioinformatics. 2014 Jun 24;15(1):214
[3] Daniel Arend, Astrid Junker, Uwe Scholz, Danuta Schüler, Juliane Wylie, Matthias Lange. PGP repository: a plant phenomics and genomics data publication infrastructure. Database. 2016
[4] Maria Esch, Jinbo Chen, Christian Colmsee, Matthias Klapperstück, Eva Grafahrend-Belau, Uwe Scholz, Matthias Lange. LAILAPS - The Plant Science Search Engine. Plant and Cell Physiology. 2014 Dec 24;55(1):pcu185


Bio: Dr. Lange has worked for more than ten years in the field of bioinformatics and data management in the life sciences. His primary research topics are information retrieval, search engine technology, and research data management for different data domains, e.g. sequence, marker, metabolic and especially phenotypic information. His special focus is on standards and infrastructures for data sharing and publication. Here, he contributed to the MIAPPE and ISATAB standards for plant phenotyping data. Furthermore, Dr. Lange coordinates the central lab information systems at the IPK and is responsible for IPK data center activities within the DataCite consortium. As a core service activity, he is deputy administrator of IPK's ORACLE enterprise data management backend. Dr. Lange also supervises work packages in the German Plant Phenotyping Network (DPPN) and the German Network for Bioinformatics Infrastructure (de.NBI), and contributes to the EU transPLANT research project to build up a trans-national data infrastructure for plant genomics data.
(ORCID http://orcid.org/0000-0002-4316-078X)
