Geometric Analysis of Deep Learning
Background Modern deep neural networks, especially those in the overparameterized regime, where the number of parameters far exceeds the amount of training data, perform impressively well. These empirical results are at odds with traditional learning theory, which fails to explain the phenomenon and has motivated new approaches that seek to understand why deep learning generalizes. A common belief is that flat minima [1] of the loss in parameter space correspond to models with good generalization properties. For instance, such models may learn to extract high-quality features, or representations, from the data. However, it has also been shown that models with equivalent performance can lie at sharp minima [2, 3]. These contradictory findings motivate us to study optimization, learned representations, and their impact on generalization from a geometric perspective. ...
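Since the discussion above hinges on the notion of flat versus sharp minima, the sketch below illustrates one common way such flatness is quantified in practice: estimating the largest eigenvalue of the training-loss Hessian by power iteration on Hessian-vector products (a larger value indicates a sharper minimum). The toy model, data, and iteration count are illustrative assumptions, not part of the original text, and the measure shown is only one of several sharpness proxies used in the literature.

```python
# Minimal sketch (assumed setup, not from the original text): estimate the top
# Hessian eigenvalue of a training loss as a sharpness proxy for a minimum.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data stand in for a trained network and its training set.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(256, 10), torch.randn(256, 1)
loss_fn = nn.MSELoss()

params = [p for p in model.parameters() if p.requires_grad]
dim = sum(p.numel() for p in params)

def hvp(vec):
    """Hessian-vector product of the loss w.r.t. the parameters (double backprop)."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_v = torch.dot(flat_grad, vec)
    hv = torch.autograd.grad(grad_dot_v, params)
    return torch.cat([h.reshape(-1) for h in hv])

# Power iteration: converges to the largest-magnitude Hessian eigenvalue,
# a standard sharpness measure (larger value = sharper minimum).
v = torch.randn(dim)
v /= v.norm()
eigenvalue = 0.0
for _ in range(50):
    hv = hvp(v)
    eigenvalue = torch.dot(hv, v).item()  # Rayleigh quotient estimate
    v = hv / (hv.norm() + 1e-12)

print(f"estimated top Hessian eigenvalue (sharpness proxy): {eigenvalue:.4f}")
```

In this illustrative setup the eigenvalue would typically be evaluated at a trained parameter vector, so that minima found by different optimizers or hyperparameters can be compared by their sharpness.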