**Modeling data with energy-based generative models: applications to genomics/proteomics**

by Prof. Beatriz Seoane,

LISN - Laboratoire Interdisciplinaire des Sciences du Numérique, Université Paris-Saclay.

Campus Universitaire bâtiment 507, Rue du Belvédère, 91405 Orsay - FRANCE

**Summary:**

Energy-based generative models have proven to be a powerful tool for capturing the complexity of many types of data, including biological sequences. When properly trained, these models can generate new, synthetic data that closely resembles the original dataset; for proteins, they can even produce synthetic sequences that prove to be perfectly functional in in vitro experiments. Beyond data generation, these models also provide an effective interaction model that, if not too complicated, can be analyzed with the standard tools of statistical physics. In other words, energy-based models can be used to systematically extract interpretable information from data. The crucial question is where and how to find this information in the neural network.

This methodology is linked to a well-established concept in statistical physics known as the "inverse Ising problem". Traditionally, statistical physics has focused on deriving the statistical properties of a model from its predefined parameters, such as temperature or interaction strength. The inverse approach instead aims to infer these parameters from statistically independent equilibrium samples of that model. In computational biology, the Direct Coupling Analysis technique, based on the principles of inverse Potts models, has proven to be particularly successful. This technique has played an important role in identifying direct pairwise interactions from datasets of evolutionarily related sequences. Such interactions are crucial for predicting physically contacting residues in proteins, unveiling interaction regions, elucidating sequence motifs essential for function, and revealing fitness landscapes. These landscapes in turn serve as reliable models of the effects of new mutations, as shown by deep mutational scanning experiments and by predictions of protein evolution during pandemics such as COVID-19, among other applications.
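The inverse problem described above can be illustrated on a toy system. The following is a minimal sketch, not Direct Coupling Analysis itself: it assumes a 4-spin Ising model with spins in {-1, +1}, small enough that all 16 configurations can be enumerated exactly, and it fits the fields and couplings by plain gradient ascent on the log-likelihood (moment matching). The function names, the learning rate and the number of steps are illustrative choices, and the "data" are the exact moments of a known model, standing in for averages over independent equilibrium samples.

```python
import itertools
import numpy as np

def all_states(n):
    """All 2**n spin configurations with entries in {-1, +1}."""
    return np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)

def model_moments(h, J, states):
    """Exact <s_i> and <s_i s_j> under P(s) ~ exp(-E(s)).

    E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j, with J symmetric and
    zero on the diagonal (the factor 0.5 counts each pair once).
    """
    energies = -(states @ h) - 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    log_w = -energies
    p = np.exp(log_w - np.logaddexp.reduce(log_w))  # normalized probabilities
    mean = p @ states
    corr = np.einsum("k,ki,kj->ij", p, states, states)
    return mean, corr

def fit_ising(data_mean, data_corr, n_steps=5000, lr=0.1):
    """Infer (h, J) by matching model moments to data moments."""
    n = data_mean.shape[0]
    states = all_states(n)
    h = np.zeros(n)
    J = np.zeros((n, n))
    for _ in range(n_steps):
        mean, corr = model_moments(h, J, states)
        # Log-likelihood gradient: data moment minus model moment.
        h += lr * (data_mean - mean)
        dJ = lr * (data_corr - corr)
        np.fill_diagonal(dJ, 0.0)  # no self-couplings
        J += dJ
    return h, J

# Forward problem: moments of a known (hypothetical) model.
rng = np.random.default_rng(0)
n = 4
h_true = rng.normal(scale=0.3, size=n)
A = rng.normal(scale=0.3, size=(n, n))
J_true = (A + A.T) / 2
np.fill_diagonal(J_true, 0.0)
data_mean, data_corr = model_moments(h_true, J_true, all_states(n))

# Inverse problem: recover the parameters from the moments alone.
h_fit, J_fit = fit_ising(data_mean, data_corr)
```

Exact enumeration is only feasible for a handful of spins. For realistic sequence lengths the model averages have to be approximated or estimated by Monte Carlo sampling.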

However, the simplicity and direct interpretability of pairwise models, including inverse Potts models, come at the expense of expressiveness. By design, these models capture at most the pairwise correlations present in the dataset, i.e. the co-evolution of pairs of sites in the sequence. Integrating higher-order interactions into generalized multi-body Ising/Potts models comes with a prohibitive growth in the number of parameters, which makes such extensions impractical. An alternative way to circumvent these limitations is to use bipartite models with latent variables, such as the Restricted Boltzmann Machine. These neural networks, whose energy function is remarkably similar to that of Ising/Potts models, are universal approximators and achieve great generative performance while keeping the number of trained parameters within bounds.
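To make the comparison of energy functions concrete, here is a minimal sketch assuming binary units in {0, 1} and the standard Restricted Boltzmann Machine energy E(v, h) = -a·v - b·h - vᵀWh. The names `W`, `a`, `b` and both helper functions are illustrative and not taken from the course material. Summing out the hidden units gives an effective energy on the visible units alone. Its softplus terms are nonlinear in the visible units, so they encode interactions of all orders between sites, while the number of parameters grows only with the product of the numbers of visible and hidden units.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, a, b):
    """Joint energy E(v, h) = -a.v - b.h - v^T W h for binary units."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def effective_energy(v, W, a, b):
    """Energy of the visible units after summing out the hidden units.

    Each hidden unit contributes -log(1 + exp(b_j + sum_i v_i W_ij)),
    i.e. minus a softplus of its total input.
    """
    return -(a @ v) - np.sum(np.logaddexp(0.0, b + v @ W))

def effective_energy_bruteforce(v, W, a, b):
    """Same quantity, obtained by explicit enumeration of hidden states."""
    n_hidden = W.shape[1]
    energies = np.array([
        rbm_energy(v, np.array(h, dtype=float), W, a, b)
        for h in itertools.product([0, 1], repeat=n_hidden)
    ])
    return -np.logaddexp.reduce(-energies)

# Small hypothetical model: 4 visible units, 3 hidden units.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
a = rng.normal(size=4)
b = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])

closed_form = effective_energy(v, W, a, b)
enumerated = effective_energy_bruteforce(v, W, a, b)
```

The two values agree up to floating-point error, which is a quick check that the closed-form marginalization is implemented correctly.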

In the upcoming course, I will discuss how useful these models are for modeling biological data, highlighting their ability to infer complex interaction networks and showing how the model's free energy landscape can be analyzed for applications such as clustering. Remarkably, these models retain their interpretability and effectiveness even when data availability is limited. Although they are a general, powerful and interpretable tool for modeling data, these types of models are notoriously difficult to train. In my lectures, I will give an overview of the training procedures, illustrate what we understand about the learning mechanism, and discuss the role of Monte Carlo simulations and how to overcome the training difficulties in order to safely apply these methods to general problems.

**Preliminary program:**

- Introduction: sequence data and multiple sequence alignment, principal component analysis, inference of Boltzmann distributions from sequence data, application examples
- From pairwise models to Restricted Boltzmann Machines: inference of multi-body interaction networks
- On the correct training of Restricted Boltzmann Machines: learning dynamics, out-of-equilibrium effects
- Mean-field analysis of trained models and applications for hierarchical clustering or motif identification
- Synthetic generation of sequences