# scVAR

**scVAR** is a computational framework for extracting and integrating genetic variants from single-cell RNA sequencing (scRNA-seq) data. It uses a coupled variational autoencoder (VAE) to merge transcriptomic and variant-derived information into a unified latent representation, enabling deeper characterization of cellular heterogeneity in diseases such as leukemia.


## 🔍 Motivation

Acute myeloid leukemia (AML) and B-cell acute lymphoblastic leukemia (B-ALL) display extensive genetic and transcriptional heterogeneity, making clonal identification difficult when relying solely on gene expression.  
While scRNA-seq is routinely used to quantify transcriptional states, it also contains information on expressed genetic variants.  
**scVAR** is designed to simultaneously analyze both sources of information from a *single* scRNA-seq assay, without requiring matched DNA sequencing.


## 👉 Key Features

- Extracts expressed genetic variants from scRNA-seq BAMs  
- Produces a variant-by-cell matrix using VarTrix  
- Processes transcriptomic data using Scanpy-compatible workflows  
- Integrates both modalities through a dual-encoder VAE  
- Fuses RNA and variant embeddings via cross-attention  
- Generates a unified latent space for clustering and visualization  
- Robust under sparse and noisy 3′ scRNA-seq coverage  
- Scales up to datasets with tens of thousands of cells
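
To illustrate the cross-attention fusion step listed above, here is a minimal NumPy sketch of single-head scaled dot-product cross-attention between two modalities. It is not scVAR's implementation: the token layout (several tokens per cell per encoder), the shared projection matrices, the mean pooling, and all names (`cross_attention`, `rna_tokens`, `var_tokens`) are assumptions made for illustration, and random arrays stand in for encoder outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query_tokens:   (n_cells, n_query_tokens, d)
    context_tokens: (n_cells, n_context_tokens, d)
    Returns the attended output and the attention weights.
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ v, weights

# Hypothetical stand-ins for the outputs of the RNA and variant encoders.
rng = np.random.default_rng(0)
n_cells, n_tokens, d = 5, 4, 8
rna_tokens = rng.normal(size=(n_cells, n_tokens, d))
var_tokens = rng.normal(size=(n_cells, n_tokens, d))

# Random projections shared by both directions to keep the sketch short;
# a trained model would learn separate projections for each direction.
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

# RNA queries attend to variant tokens, and vice versa.
rna_ctx, attn = cross_attention(rna_tokens, var_tokens, w_q, w_k, w_v)
var_ctx, _ = cross_attention(var_tokens, rna_tokens, w_q, w_k, w_v)

# Pool over tokens and concatenate into one fused vector per cell.
fused = np.concatenate([rna_ctx.mean(axis=1), var_ctx.mean(axis=1)], axis=1)
print(fused.shape)  # (5, 16)
```

In this sketch each modality queries the other, so the fused vector carries variant context for the RNA view and RNA context for the variant view.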


## 🛠️ Installation

To install **scVAR**, create a new environment using `mamba` and install the package from source:

```bash
mamba create -n scvar_env python=3.10
mamba activate scvar_env
git clone http://www.bioinfotiget.it/gitlab/custom/scvar.git
cd scvar
pip install .
```

**Note:** scVAR requires **Python == 3.10**.


## 📁 Data Availability

All datasets used in the scVAR manuscript, including the AML, B-ALL, and synthetic benchmarking datasets, are publicly available at:

**https://www.dropbox.com/scl/fo/kc49b6y47hjf2zdle1zz2/AA-UA7lKpLpdHOTldAhasds?rlkey=4dkx4t5yxc407twomwqjte65p&dl=0**

The repository contains:

- 10x matrices
- VarTrix genotype matrices
- metadata
- synthetic datasets
- files required to reproduce manuscript figures
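
As a rough guide to inspecting the VarTrix genotype matrices, the sketch below parses a Matrix Market coordinate file using only the standard library and tallies genotype codes. It assumes the matrices were produced in VarTrix's consensus scoring mode (0 = no call, 1 = ref/ref, 2 = alt/alt, 3 = ref/alt), with variants as rows and cells as columns. This README does not state the scoring mode or the file names in the archive, so check both before reusing the code. The helper names and the inline example matrix are hypothetical.

```python
from collections import Counter

# Genotype codes for VarTrix consensus mode (assumed for this sketch).
CONSENSUS_LABELS = {0: "no call", 1: "ref/ref", 2: "alt/alt", 3: "ref/alt"}

def read_coordinate_mtx(lines):
    """Parse Matrix Market coordinate text into (n_rows, n_cols, entries).

    entries maps 0-based (row, col) pairs to integer values;
    pairs that are not stored are implicitly 0.
    """
    body = [ln.strip() for ln in lines]
    body = [ln for ln in body if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, entries

def genotype_counts(n_rows, n_cols, entries):
    """Count cells per genotype label, treating unstored pairs as no call."""
    by_code = Counter(entries.values())
    by_code[0] += n_rows * n_cols - len(entries)
    return {label: by_code[code] for code, label in CONSENSUS_LABELS.items()}

# Tiny made-up matrix: 3 variants x 4 cells, 5 stored genotype calls.
example = """%%MatrixMarket matrix coordinate integer general
% rows are variants, columns are cell barcodes
3 4 5
1 1 1
1 3 3
2 2 2
3 1 1
3 4 2
""".splitlines()

n_variants, n_cells, entries = read_coordinate_mtx(example)
counts = genotype_counts(n_variants, n_cells, entries)
print(counts)  # {'no call': 7, 'ref/ref': 2, 'alt/alt': 2, 'ref/alt': 1}
```

For real files, pass an open file handle instead of `example`; for large matrices a sparse-matrix reader is a better fit than a dictionary.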


## 🚀 Getting Started
 
Two Jupyter notebooks are included in the `notebooks` directory:

- **Leukemia notebook:** full application of scVAR to a public AML dataset
- **Synthetic notebook:** benchmarking scVAR using the in silico datasets


## 🧪 Synthetic Dataset Generator

scVAR provides an in silico simulator designed to generate paired single-cell datasets containing both transcriptomic and variant-derived information for each simulated cell.  
These synthetic datasets were used to benchmark the integration performance of scVAR under controlled noise, sparsity, and coverage conditions.

The simulator produces:

- gene expression matrices
- variant-by-cell genotype matrices
- configurable cell types and genotypes
- realistic dropout, sparsity, and allelic imbalance
- optional cross-modal label mismatches
- datasets ranging from 5,000 to 50,000+ cells
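
The outputs listed above can be sketched with a toy generator. This is not the scVAR simulator: the function name `simulate_paired_cells`, its parameters, the Poisson/gamma expression model, and the genotype coding (0 = no call, 1 = ref/ref, 2 = alt/alt, 3 = ref/alt) are all assumptions chosen to show how paired matrices with dropout, allelic imbalance, and cross-modal label mismatches might be produced.

```python
import numpy as np

def simulate_paired_cells(n_cells=300, n_genes=50, n_variants=20, n_clones=3,
                          dropout=0.6, allelic_dropout=0.3,
                          mismatch_rate=0.0, seed=0):
    """Toy paired simulation: one expression row and one genotype row per cell."""
    if n_clones < 2:
        raise ValueError("n_clones must be at least 2")
    rng = np.random.default_rng(seed)

    # Transcriptomic label and genotype label start out identical.
    cell_type = rng.integers(n_clones, size=n_cells)
    clone = cell_type.copy()

    # Optional cross-modal mismatch: shift the genotype label of a random
    # subset of cells to a different clone.
    flip = rng.random(n_cells) < mismatch_rate
    shift = rng.integers(1, n_clones, size=flip.sum())
    clone[flip] = (clone[flip] + shift) % n_clones

    # Expression: per-type Poisson rates, then random dropout to zero.
    rates = rng.gamma(shape=2.0, scale=2.0, size=(n_clones, n_genes))
    expression = rng.poisson(rates[cell_type])
    expression[rng.random(expression.shape) < dropout] = 0

    # Genotypes: each clone carries a random subset of heterozygous variants.
    carries = rng.random((n_clones, n_variants)) < 0.3
    genotypes = np.where(carries[clone], 3, 1)

    # Allelic imbalance: some heterozygous sites show only one allele.
    mono = (genotypes == 3) & (rng.random(genotypes.shape) < allelic_dropout)
    genotypes[mono] = rng.choice([1, 2], size=mono.sum())

    # Coverage dropout: uncovered sites become no call.
    genotypes[rng.random(genotypes.shape) < dropout] = 0

    return {"expression": expression, "genotypes": genotypes,
            "cell_type": cell_type, "clone": clone}

sim = simulate_paired_cells(n_cells=200, mismatch_rate=0.1)
print(sim["expression"].shape, sim["genotypes"].shape)  # (200, 50) (200, 20)
```

Raising `dropout` or `mismatch_rate` makes the two modalities harder to reconcile, which is the kind of controlled stress a benchmark of this sort is meant to apply.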


## 📜 License

Distributed under the MIT License. See the `LICENSE` file for more information.
