# scVAR

**scVAR** is a computational framework for extracting and integrating genetic variants from single-cell RNA sequencing (scRNA-seq) data. It uses a coupled variational autoencoder (VAE) to merge transcriptomic and variant-derived information into a unified latent representation, enabling deeper characterization of cellular heterogeneity in diseases such as leukemia.

## 🔍 Motivation

AML and B-ALL display extensive genetic and transcriptional heterogeneity, making clonal identification difficult when relying solely on gene expression. While scRNA-seq is routinely used to quantify transcriptional states, it also contains information on expressed genetic variants. **scVAR** is designed to simultaneously analyze both sources of information from a *single* scRNA-seq assay, without requiring matched DNA sequencing.

## 👉 Key Features

- Extracts expressed genetic variants from scRNA-seq BAMs
- Produces a variant-by-cell matrix using VarTrix
- Processes transcriptomic data using Scanpy-compatible workflows
- Integrates both modalities through a dual-encoder VAE
- Fuses RNA and variant embeddings via cross-attention
- Generates a unified latent space for clustering and visualization
- Robust under sparse and noisy 3′ scRNA-seq coverage
- Scales up to datasets with tens of thousands of cells

## 🛠️ Installation

To install **scVAR**, create a new environment using `mamba` and install the package from source:

```
mamba create -n scvar_env python=3.10
mamba activate scvar_env
git clone http://www.bioinfotiget.it/gitlab/custom/scvar.git
cd scvar
pip install .
```

**Note:** scVAR requires **Python == 3.10**.
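As a rough illustration of the cross-attention fusion listed under Key Features, the NumPy sketch below shows generic scaled dot-product cross-attention between two per-cell embeddings. It is not scVAR's implementation: the function names (`cross_attention`, `fuse_cell`), the token layout, the residual-plus-mean pooling, and all dimensions are assumptions chosen for illustration, and the random arrays stand in for encoder outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def fuse_cell(rna_tokens, var_tokens):
    """Hypothetical fusion: RNA tokens query variant tokens, then pool to one vector."""
    attended, _ = cross_attention(rna_tokens, var_tokens, var_tokens)
    # Residual connection followed by mean pooling over tokens (an assumption).
    return (rna_tokens + attended).mean(axis=0)


# Random stand-ins for the outputs of the RNA and variant encoders.
rng = np.random.default_rng(0)
n_cells, n_tokens, d = 5, 4, 8
rna = rng.normal(size=(n_cells, n_tokens, d))
var = rng.normal(size=(n_cells, n_tokens, d))

# One fused latent vector per cell.
latent = np.stack([fuse_cell(r, v) for r, v in zip(rna, var)])
print(latent.shape)  # (5, 8)
```

In a trained model the queries, keys, and values would pass through learned projections, and the fused vector would parameterize the latent distribution of the VAE; both are omitted here to keep the sketch short.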
## 📁 Data Availability

All datasets used in the scVAR manuscript, including the AML, B-ALL, and synthetic benchmarking datasets, are publicly available at:

**https://www.dropbox.com/scl/fo/kc49b6y47hjf2zdle1zz2/AA-UA7lKpLpdHOTldAhasds?rlkey=4dkx4t5yxc407twomwqjte65p&dl=0**

The repository contains:

- 10x matrices
- VarTrix genotype matrices
- metadata
- synthetic datasets
- files required to reproduce manuscript figures

## 🚀 Getting Started

Two Jupyter notebooks are included in the `notebooks` directory:

- **Leukemia notebook:** full application of scVAR to a public AML dataset
- **Synthetic notebook:** benchmarking of scVAR on the in silico datasets

## 🧪 Synthetic Dataset Generator

scVAR provides an in silico simulator that generates paired single-cell datasets containing both transcriptomic and variant-derived information for each simulated cell. These synthetic datasets were used to benchmark the integration performance of scVAR under controlled noise, sparsity, and coverage conditions.

The simulator produces:

- gene expression matrices
- variant-by-cell genotype matrices
- configurable cell types and genotypes
- realistic dropout, sparsity, and allelic imbalance
- optional cross-modal label mismatches
- datasets ranging from 5,000 to 50,000+ cells

## 📜 License

Distributed under the MIT License. See the `LICENSE` file for more information.