Commit 2c405146 authored by Ivan Merelli

Initial commit
MIT License
Copyright (c) 2022 scVAR
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# scVAR
**scVAR** is a computational framework for extracting and integrating genetic variants from single-cell RNA sequencing (scRNA-seq) data. It uses a coupled variational autoencoder (VAE) to merge transcriptomic and variant-derived information into a unified latent representation, enabling deeper characterization of cellular heterogeneity in diseases such as leukemia.
## 🔍 Motivation
Acute myeloid leukemia (AML) and B-cell acute lymphoblastic leukemia (B-ALL) display extensive genetic and transcriptional heterogeneity, making clonal identification difficult when relying solely on gene expression.
While scRNA-seq is routinely used to quantify transcriptional states, it also contains information on expressed genetic variants.
**scVAR** is designed to simultaneously analyze both sources of information from a *single* scRNA-seq assay, without requiring matched DNA sequencing.
## 👉 Key Features
- Extracts expressed genetic variants from scRNA-seq BAMs
- Produces a variant-by-cell matrix using VarTrix
- Processes transcriptomic data using Scanpy-compatible workflows
- Integrates both modalities through a dual-encoder VAE
- Fuses RNA and variant embeddings via cross-attention
- Generates a unified latent space for clustering and visualization
- Robust under sparse and noisy 3′ scRNA-seq coverage
- Scales up to datasets with tens of thousands of cells
## 🛠️ Installation
To install **scVAR**, create a new environment using `mamba` and install the package from source:
```
mamba create -n scvar_env python=3.10
mamba activate scvar_env
git clone http://www.bioinfotiget.it/gitlab/custom/scvar.git
cd scvar
pip install .
```
**Note:** scVAR requires **Python == 3.10**.
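A quick sanity check after installation is to import the package and list its exported functions (this assumes the installed package imports as `scVAR`, matching the module layout of this repository):
```
import scVAR
print(scVAR.__all__)  # should include transcriptomicAnalysis, variantAnalysis, omicsIntegration, ...
```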
## 📁 Data Availability
All datasets used in the scVAR manuscript, including the AML, B-ALL, and synthetic benchmarking datasets, are publicly available at:
**https://www.dropbox.com/scl/fo/kc49b6y47hjf2zdle1zz2/AA-UA7lKpLpdHOTldAhasds?rlkey=4dkx4t5yxc407twomwqjte65p&dl=0**
The repository contains:
- 10x matrices
- VarTrix genotype matrices
- metadata
- synthetic datasets
- files required to reproduce manuscript figures
## 🚀 Getting Started
Two Jupyter notebooks are included in the `notebooks` directory:
- **Leukemia notebook:** full application of scVAR to a public AML dataset
- **Synthetic notebook:** benchmarking scVAR using the in silico datasets
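
For orientation, a minimal end-to-end run with the functions exported by the package might look like the sketch below. File paths are placeholders, the import name `scVAR` is assumed from the module layout, and the notebooks remain the reference workflows:
```
import scVAR as sv

# RNA view: 10x matrix folder plus the cell-barcode list used for variant calling
adata = sv.transcriptomicAnalysis(
    path_10x="data/filtered_feature_bc_matrix",
    bcode_variants="data/barcodes.tsv",
)

# Variant view: VarTrix outputs (variant-by-cell matrix, barcodes, variant names)
adata = sv.variantAnalysis(
    adata,
    matrix_path="data/vartrix_matrix.mtx",
    bcode_path="data/barcodes.tsv",
    variants_path="data/variants.txt",
)

# Joint latent space, Leiden clustering on it, and UMAP export
adata = sv.omicsIntegration(adata)
adata = sv.calcOmicsClusters(adata, omic_key="int", res=0.5)
sv.save_all_umaps(adata, prefix="output")
```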
## 🧪 Synthetic Dataset Generator
scVAR provides an in silico simulator designed to generate paired single-cell datasets containing both transcriptomic and variant-derived information for each simulated cell.
These synthetic datasets were used to benchmark the integration performance of scVAR under controlled noise, sparsity, and coverage conditions.
The simulator produces:
- gene expression matrices
- variant-by-cell genotype matrices
- configurable cell types and genotypes
- realistic dropout, sparsity, and allelic imbalance
- optional cross-modal label mismatches
- datasets ranging from 5,000 to 50,000+ cells
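
The simulator itself is exercised in the synthetic notebook. Purely as an illustration of the kind of genotype data it produces (a generic numpy sketch, not the package's simulator or its API), clone-structured variant-by-cell data with heavy dropout can be mocked up as follows:
```
import numpy as np

rng = np.random.default_rng(42)
n_cells, n_variants, n_clones = 5000, 200, 3

# assign each cell to a clone; each clone carries its own set of variants
clone = rng.integers(0, n_clones, size=n_cells)
clone_genotypes = rng.random((n_clones, n_variants)) < 0.15

# true genotype per cell, observed through heavy dropout (sparse 3' coverage)
truth = clone_genotypes[clone]
observed = truth & (rng.random((n_cells, n_variants)) < 0.3)

print(observed.shape, int(observed.sum()))
```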
## 📜 License
Distributed under the MIT License. See the `LICENSE` file for more information.
../aml
../bll
"""
scVAR package initialization
============================
Expose main analysis functions for transcriptomic and variant integration.
"""
from .scVAR import (
transcriptomicAnalysis,
variantAnalysis,
calcOmicsClusters,
weightsInit,
save_all_umaps,
omicsIntegration,
pairedIntegrationTrainer,
distributionClusters,
)
__all__ = [
"transcriptomicAnalysis",
"variantAnalysis",
"calcOmicsClusters",
"weightsInit",
"omicsIntegration",
"save_all_umaps",
"pairedIntegrationTrainer",
"distributionClusters",
]
# import packs
import numpy as np
import pandas as pd
import scanpy as sc
import gc
import torch
import torch.nn.functional as F
import umap
import anndata
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
from sklearn.feature_extraction.text import TfidfTransformer
from scipy import io
from scipy import sparse
import matplotlib.pyplot as plt
### FUNCTION ###
################
def transcriptomicAnalysis(
path_10x,
bcode_variants,
high_var_genes_min_mean=0.015,
high_var_genes_max_mean=3,
high_var_genes_min_disp=0.75,
min_genes=200,
min_cells=3,
max_pct_mt=100,
n_neighbors=20,
n_pcs=30,
transcript_key='trans',
manual_matrix_reading=False,
force_float64=True,
):
"""
Preprocess transcriptomic (RNA) data from 10x Genomics format.
Normalization and scaling now replicate the Muon pipeline exactly:
normalize_total → log1p → scale(max_value=10)
"""
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy import io
from sklearn.preprocessing import StandardScaler
np.set_printoptions(precision=6, suppress=False, linewidth=140)
print("[INFO] === TRANSCRIPTOMIC ANALYSIS START ===")
# === Read RNA data ===
if manual_matrix_reading:
rna = path_10x + '/matrix.mtx.gz'
barcode = path_10x + '/barcodes.tsv.gz'
feat = path_10x + '/features.tsv.gz'
M = io.mmread(rna).astype(np.float64 if force_float64 else np.float32)
B = M.T # (cells x genes)
barcode = pd.read_csv(barcode, sep='\t', header=None, names=['bcode'])
barcode.index = barcode['bcode']
feat = pd.read_csv(feat, sep='\t', header=None,
names=['gene_ids', 'gene_symbol', 'feature_types'])
feat.index = feat['gene_symbol']
adata = anndata.AnnData(X=B, obs=barcode, var=feat)
adata.X = sp.csr_matrix(adata.X)
else:
adata = sc.read_10x_mtx(path_10x, var_names='gene_symbols', cache=True)
adata.X = adata.X.astype(np.float64 if force_float64 else np.float32)
# === Save a raw copy ===
if not sp.issparse(adata.X):
adata.X = sp.csr_matrix(adata.X)
adata.uns[f"{transcript_key}_raw"] = adata.X.copy()
adata.uns[f"{transcript_key}_raw_obs_names"] = np.array(adata.obs_names)
adata.uns[f"{transcript_key}_raw_var_names"] = np.array(adata.var_names)
print(f"[DEBUG] Raw RNA matrix shape: {adata.X.shape}")
raw_block = adata.X[:5, :5].toarray().astype(float)
print(f"[DEBUG] Prime 5×5 celle RAW (counts):\n{np.array2string(raw_block, precision=5, floatmode='maxprec_equal')}")
# === Keep only barcodes also present in the variant file ===
bcode_var = pd.read_csv(bcode_variants, sep='\t', header=None)[0]
adata = adata[adata.obs.index.isin(bcode_var)].copy()
# === Filtering and QC ===
sc.pp.filter_cells(adata, min_genes=min_genes)
sc.pp.filter_genes(adata, min_cells=min_cells)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.pct_counts_mt < max_pct_mt, :].copy()
# === Muon-style normalization ===
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.scale(adata, max_value=10)
# === Remove any NaN/inf ===
if sp.issparse(adata.X):
adata.X.data[np.isnan(adata.X.data)] = 0
adata.X.data[np.isinf(adata.X.data)] = 0
else:
adata.X = np.nan_to_num(adata.X, nan=0.0, posinf=0.0, neginf=0.0)
print(f"[DEBUG] RNA normalizzata e scalata (Muon-style). Shape: {adata.X.shape}")
norm_block = adata.X[:5, :5].toarray().astype(float) if sp.issparse(adata.X) else adata.X[:5, :5].astype(float)
print(f"[DEBUG] Prime 5×5 celle RNA (normalizzate):\n{np.array2string(norm_block, precision=4, floatmode='maxprec_equal')}")
# === Optional PCA ===
if n_pcs is not None:
print(f"[INFO] PCA con {n_pcs} componenti...")
sc.tl.pca(adata, n_comps=n_pcs, random_state=42)
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs, random_state=42)
sc.tl.umap(adata, random_state=42)
# Save PCA and UMAP embeddings
adata.obsm[f"{transcript_key}_pca"] = adata.obsm["X_pca"].copy()
adata.obsm[f"{transcript_key}_umap"] = adata.obsm["X_umap"].copy()
# Also store in uns (for the later integration step)
adata.uns[f"{transcript_key}_X"] = adata.obsm["X_pca"].copy()
# Numeric debug
print(f"[INFO] Saved adata.uns['{transcript_key}_X'] with shape {adata.uns[f'{transcript_key}_X'].shape}")
print(f"[DEBUG] First 5×5 PCA cells:\n{np.round(adata.uns[f'{transcript_key}_X'][:5, :5], 4)}")
print(f"[DEBUG] PCA variance ratio (first 5): {np.round(adata.uns['pca']['variance_ratio'][:5], 4)}")
else:
print("[INFO] PCA disabilitata → uso matrice normalizzata intera.")
adata.uns[f"{transcript_key}_X"] = adata.X.toarray() if sp.issparse(adata.X) else adata.X
print(f"[INFO] === TRANSCRIPTOMIC ANALYSIS DONE ({adata.n_obs} cells, {adata.n_vars} genes) ===")
return adata
def _prep_embedding_for_training(X):
"""
Z-score each feature, then L2-normalize each cell. Returns a float32 numpy array.
"""
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
X = np.asarray(X, dtype=np.float32)
X = StandardScaler(with_mean=True, with_std=True).fit_transform(X)
X = normalize(X, norm="l2", axis=1)
return X
class PairedAE(torch.nn.Module):
"""
Two encoders (RNA/VAR) and two decoders with a shared bottleneck (latent_dim).
Loss: reconstruction + alignment (cosine) between z_rna and z_var.
"""
def __init__(self, dim_a, dim_b, latent_dim=32, hidden=256):
super().__init__()
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(dim_a, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(dim_b, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.decoder_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_a),
)
self.decoder_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_b),
)
def forward(self, xa, xb):
za = self.encoder_a(xa)
zb = self.encoder_b(xb)
ra = self.decoder_a(za)
rb = self.decoder_b(zb)
return za, zb, ra, rb
def train_paired_ae(
X_a,
X_b,
latent_dim=32,
num_epoch=300,
lr=1e-3,
lam_align=0.5,
lam_recon_a=1.0,
lam_recon_b=1.0,
lam_cross=2.0,
use_self_attention=True,
verbose=True,
):
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# === Encoder/Decoder ===
def make_encoder(in_dim, latent_dim):
return nn.Sequential(
nn.Linear(in_dim, 256),
nn.BatchNorm1d(256),
nn.LeakyReLU(0.1),
nn.Dropout(0.2),
nn.Linear(256, latent_dim),
nn.BatchNorm1d(latent_dim),
)
def make_decoder(latent_dim, out_dim):
return nn.Sequential(
nn.Linear(latent_dim, 256),
nn.BatchNorm1d(256),
nn.LeakyReLU(0.1),
nn.Linear(256, out_dim),
)
in_dim_a, in_dim_b = X_a.shape[1], X_b.shape[1]
enc_a, dec_a = make_encoder(in_dim_a, latent_dim).to(device), make_decoder(latent_dim, in_dim_a).to(device)
enc_b, dec_b = make_encoder(in_dim_b, latent_dim).to(device), make_decoder(latent_dim, in_dim_b).to(device)
# === Fusion & optional self-attention ===
fusion = nn.Sequential(
nn.Linear(latent_dim, latent_dim),
nn.LayerNorm(latent_dim),
nn.LeakyReLU(0.1),
).to(device)
if use_self_attention:
self_attn = nn.MultiheadAttention(embed_dim=latent_dim, num_heads=4, dropout=0.1, batch_first=True).to(device)
else:
self_attn = None
opt = optim.Adam(
list(enc_a.parameters()) + list(dec_a.parameters()) +
list(enc_b.parameters()) + list(dec_b.parameters()) +
list(fusion.parameters()) +
([] if self_attn is None else list(self_attn.parameters())),
lr=lr,
)
loss_fn = nn.MSELoss()
Xa, Xb = torch.tensor(X_a, dtype=torch.float32).to(device), torch.tensor(X_b, dtype=torch.float32).to(device)
loader = DataLoader(TensorDataset(Xa, Xb), batch_size=128, shuffle=True)
best_loss, patience, patience_counter = float("inf"), 15, 0
for epoch in range(num_epoch):
enc_a.train(), dec_a.train(), enc_b.train(), dec_b.train(), fusion.train()
if self_attn is not None:
self_attn.train()
total_loss = 0.0
for batch_a, batch_b in loader:
za, zb = enc_a(batch_a), enc_b(batch_b)
# --- Top-k soft-attention + gating (cross-domain fusion) ---
za_n = torch.nn.functional.normalize(za, dim=1)
zb_n = torch.nn.functional.normalize(zb, dim=1)
sim = torch.matmul(za_n, zb_n.T) # (B,B) similarity RNA–VAR
k = min(10, sim.size(1)) # top-k local neighborhood
vals, idx = torch.topk(sim, k=k, dim=1)
mask = torch.full_like(sim, float('-inf'))
mask.scatter_(1, idx, vals)
attn = torch.softmax(mask / 0.8, dim=1) # softer temperature
zb_agg = torch.matmul(attn, zb) # weighted local fusion
# adaptive gating between the RNA latent and the aggregated VAR representation
gate = torch.sigmoid(((za * zb_agg).sum(dim=1, keepdim=True)) / (latent_dim ** 0.5))
zf = fusion(gate * za + (1.0 - gate) * zb_agg) # balanced fusion
# --- Optional residual self-attention ---
if self_attn is not None:
zf_sa, _ = self_attn(zf.unsqueeze(0), zf.unsqueeze(0), zf.unsqueeze(0))
zf = zf + 0.3 * zf_sa.squeeze(0) # weighted residual
# --- Reconstruction losses ---
ra, rb = dec_a(zf), dec_b(zf)
loss_recon_a, loss_recon_b = loss_fn(ra, batch_a), loss_fn(rb, batch_b)
rcross_a, rcross_b = dec_a(zb), dec_b(za)
loss_cross = 0.5 * (loss_fn(rcross_a, batch_a) + loss_fn(rcross_b, batch_b))
# --- Alignment (MSE + contrastive soft) ---
loss_align_mse = loss_fn(za, zb) + 0.5 * (loss_fn(za, zf) + loss_fn(zb, zf))
za_sim = torch.nn.functional.normalize(za, dim=1)
zb_sim = torch.nn.functional.normalize(zb, dim=1)
sim_contrast = torch.mm(za_sim, zb_sim.T) / 0.5 # milder temperature
labels = torch.arange(sim_contrast.size(0)).to(device)
loss_align_contrast = nn.CrossEntropyLoss()(sim_contrast, labels)
loss_align = loss_align_mse + 0.4 * loss_align_contrast # strengthened contrastive term
# --- Variance and directional alignment ---
var_loss = torch.var(zf, dim=0).mean()
cos_sim = torch.nn.functional.cosine_similarity(za, zb, dim=1).mean()
# --- Total loss ---
loss = (
lam_recon_a * loss_recon_a
+ lam_recon_b * loss_recon_b
+ lam_cross * loss_cross
+ lam_align * loss_align
+ 0.001 * var_loss
- 0.05 * cos_sim
)
opt.zero_grad()
loss.backward()
opt.step()
total_loss += loss.item()
total_loss /= len(loader)
if verbose and epoch % 10 == 0:
print(f"[EPOCH {epoch:03d}] loss={total_loss:.5f}")
# --- Early stopping ---
if total_loss + 1e-6 < best_loss:
best_loss, patience_counter = total_loss, 0
else:
patience_counter += 1
if patience_counter > patience:
if verbose:
print(f"[EARLY STOP] epoch={epoch}, loss={total_loss:.5f}")
break
# === Inference ===
enc_a.eval(), enc_b.eval(), fusion.eval()
if self_attn is not None:
self_attn.eval()
with torch.no_grad():
zA = fusion(enc_a(Xa))
zB = fusion(enc_b(Xb))
if self_attn is not None:
zA = zA + 0.3 * self_attn(zA.unsqueeze(0), zA.unsqueeze(0), zA.unsqueeze(0))[0].squeeze(0)
zB = zB + 0.3 * self_attn(zB.unsqueeze(0), zB.unsqueeze(0), zB.unsqueeze(0))[0].squeeze(0)
zA, zB = zA.cpu().numpy(), zB.cpu().numpy()
return None, zA, zB
def variantAnalysis(
adata,
matrix_path=None,
bcode_path=None,
variants_path=None,
min_cells=5,
max_cell_fraction=0.95,
variant_filter_level="norm",
n_pcs=50,
variant_rep="muon", # "muon", "tfidf", "lognorm"
variant_key="variant",
):
"""
Preprocess variant (DNA) data; the symmetric counterpart of transcriptomicAnalysis.
Stores:
- adata.uns['variant_raw'] raw matrix
- adata.uns['variant_X'] numeric embedding (PCA/scaled)
- adata.obsm['variant_pca'] PCA (identical to uns['variant_X'])
- adata.obsm['variant_umap'] 2D embedding for QC
"""
import scanpy as sc
import numpy as np
import pandas as pd
from scipy import sparse, io
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfTransformer
import umap
print("[INFO] === VARIANT ANALYSIS START ===")
print(f"[INFO] Lettura file: {matrix_path}")
# --- Lettura matrice ---
var_mtx = io.mmread(matrix_path)
X = sparse.csr_matrix(var_mtx)
barcodes = pd.read_csv(bcode_path, sep="\t", header=None)[0].astype(str).values
variants = pd.read_csv(variants_path, sep="\t", header=None)[0].astype(str).values
# --- Fix orientation ---
if X.shape[0] == len(variants) and X.shape[1] == len(barcodes):
print(f"[WARN] Transposing variant matrix {X.shape} → expected (cells × variants)")
X = X.T
elif X.shape[1] != len(variants):
print(f"[WARN] Dimensioni inconsuete {X.shape}, provo a trasporre per sicurezza")
X = X.T
print(f"[INFO] Matrice varianti caricata → {X.shape[0]} celle × {X.shape[1]} varianti")
# === Debug blocco RAW ===
raw_block = X[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle RAW (counts):\n{np.round(raw_block, 2)}")
# --- Allineamento con RNA ---
rna_barcodes = np.array([b.split('-')[0] for b in adata.obs_names])
var_barcodes = np.array([b.split('-')[0] for b in barcodes])
common = np.intersect1d(var_barcodes, rna_barcodes)
if len(common) == 0:
raise ValueError("[ERROR] Nessun barcode comune tra RNA e VAR.")
order_idx = np.array([np.where(var_barcodes == b)[0][0] for b in rna_barcodes if b in var_barcodes])
X = X[order_idx, :]
barcodes = barcodes[order_idx]
print(f"[INFO] Celle comuni con RNA: {len(barcodes)}")
# --- Salva GREZZI ---
adata.uns[f"{variant_key}_raw"] = X.copy()
adata.uns[f"{variant_key}_raw_obs_names"] = barcodes
adata.uns[f"{variant_key}_raw_var_names"] = variants
# === Choose the representation mode ===
rep = variant_rep.lower().strip()
print(f"[INFO] Selected representation: {rep.upper()}")
# ===============================
# MUON / BINARY (default)
# ===============================
if rep in ["muon", "binary"]:
print("[INFO] Uso pipeline Muon-style (binarizzazione + scaling + PCA + UMAP)")
X_bin = (X > 0).astype(float)
dna = sc.AnnData(X_bin)
dna.obs_names = barcodes
dna.var_names = variants
# Filter out variants that are too rare or too common
sc.pp.filter_genes(dna, min_cells=min_cells)
freq = np.array(dna.X.sum(axis=0)).flatten() / dna.n_obs
keep = freq < max_cell_fraction
dna = dna[:, keep].copy()
print(f"[INFO] Varianti mantenute: {dna.n_vars}/{len(variants)}")
# Scala
sc.pp.scale(dna, max_value=10)
norm_block = dna.X[:5, :5].toarray() if sparse.issparse(dna.X) else dna.X[:5, :5]
print(f"[DEBUG] Prime 5×5 celle DNA (scalate):\n{np.round(norm_block, 4)}")
# PCA e UMAP
sc.tl.pca(dna, n_comps=n_pcs, random_state=42)
sc.pp.neighbors(dna, n_pcs=min(30, n_pcs), random_state=42)
sc.tl.umap(dna, random_state=42)
adata.obsm[f"{variant_key}_pca"] = dna.obsm["X_pca"].copy()
adata.uns[f"{variant_key}_X"] = dna.obsm["X_pca"].copy()
adata.obsm[f"{variant_key}_umap"] = dna.obsm["X_umap"].copy()
print(f"[DEBUG] PCA variance ratio (prime 5): {np.round(dna.uns['pca']['variance_ratio'][:5], 4)}")
print(f"[DEBUG] Prime 5×5 celle PCA:\n{np.round(adata.uns[f'{variant_key}_X'][:5, :5], 4)}")
# ===============================
# TF-IDF
# ===============================
elif rep == "tfidf":
print("[INFO] Uso rappresentazione TF-IDF + L2 norm")
tfidf = TfidfTransformer(norm=None, use_idf=True, smooth_idf=True)
X_rep = tfidf.fit_transform(X)
X_rep = normalize(X_rep, norm="l2", axis=1)
rep_block = X_rep[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle TFIDF:\n{np.round(rep_block, 4)}")
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_rep)
adata.uns[f"{variant_key}_X"] = embedding.astype(np.float32)
adata.obsm[f"{variant_key}_pca"] = embedding.astype(np.float32)
reducer = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42)
adata.obsm[f"{variant_key}_umap"] = reducer.fit_transform(embedding)
# ===============================
# LOGNORM
# ===============================
elif rep == "lognorm":
print("[INFO] Uso rappresentazione log1p-normalized")
lib = np.asarray(X.sum(axis=1)).ravel()
lib[lib == 0] = 1.0
X_norm = X.multiply(1.0 / lib[:, None])
X_norm.data = np.log1p(X_norm.data)
rep_block = X_norm[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle LOGNORM:\n{np.round(rep_block, 4)}")
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_norm)
adata.uns[f"{variant_key}_X"] = embedding.astype(np.float32)
adata.obsm[f"{variant_key}_pca"] = embedding.astype(np.float32)
reducer = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42)
adata.obsm[f"{variant_key}_umap"] = reducer.fit_transform(embedding)
# ===============================
# ERROR
# ===============================
else:
raise ValueError(f"[ERROR] variant_rep '{variant_rep}' not recognized. Use 'muon', 'tfidf' or 'lognorm'.")
print(f"[INFO] DNA PCA shape: {adata.uns[f'{variant_key}_X'].shape}")
print(f"[INFO] === VARIANT ANALYSIS DONE ({X.shape[0]} cells, {X.shape[1]} variants) ===")
return adata
def calcOmicsClusters(adata, omic_key="int", res=0.5, n_neighbors=15, n_pcs=None):
"""
Compute Leiden clustering on the integrated embedding (e.g. 'int').
Logic:
- if n_cells < 100 → use the concatenated PCA stored in adata.uns['int_X'] (Muon-style)
- if n_cells >= 100 → use the AE embedding stored in adata.uns['int_X']
No UMAP is used. Clustering is deterministic (igraph).
"""
import scanpy as sc
import numpy as np
n_cells = adata.n_obs
print(f"\n[INFO] === CALC OMiCS CLUSTERS ({omic_key}) ===")
print(f"[INFO] Numero celle: {n_cells}")
# ==========================================================
# 1️⃣ Retrieve the integrated embedding (concatenated PCA or AE)
# ==========================================================
if f"{omic_key}_X" not in adata.uns:
raise ValueError(f"[ERROR] uns['{omic_key}_X'] is missing; run omicsIntegration() first.")
X = np.asarray(adata.uns[f"{omic_key}_X"], dtype=np.float32)
adata.obsm[f"{omic_key}_X"] = X.copy()
if n_cells < 100:
print("[INFO] Dataset piccolo → uso PCA concatenata (Muon-style)")
else:
print("[INFO] Dataset grande → uso embedding autoencoder (AE-style)")
print(f"[INFO] Embedding caricato da adata.uns['{omic_key}_X'] ({X.shape[1]} dimensioni)")
# 🔍 Debug numerico
debug_block = np.round(X[:5, :5], 4)
print(f"[DEBUG] Prime 5×5 celle embedding:\n{debug_block}")
print(f"[DEBUG] Somma prime 5 righe: {[round(X[i,:].sum(), 4) for i in range(min(5, X.shape[0]))]}")
print(f"[DEBUG] Varianza media embedding: {X.var():.4f}")
# (Opzionale) verifica coerenza con obsm["X_concat_pca"]
if "X_concat_pca" in adata.obsm:
same = np.allclose(X[:5, :5], adata.obsm["X_concat_pca"][:5, :5])
print(f"[DEBUG] Coerenza con obsm['X_concat_pca']: {same}")
# ==========================================================
# 2️⃣ Costruisci neighbors
# ==========================================================
dims = X.shape[1]
sc.pp.neighbors(
adata,
n_neighbors=n_neighbors,
n_pcs=min(n_pcs or dims, dims),
use_rep=f"{omic_key}_X",
key_added=f"{omic_key}_neighbors"
)
conn = adata.obsp[f"{omic_key}_neighbors_connectivities"]
dist = adata.obsp[f"{omic_key}_neighbors_distances"]
print(f"[DEBUG] Neighbors graph: {conn.shape}, mean_conn={conn.mean():.6f}, mean_dist={dist.mean():.6f}")
# ==========================================================
# 3️⃣ Deterministic Leiden (igraph)
# ==========================================================
key = f"{omic_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
flavor="igraph",
n_iterations=2,
directed=False,
random_state=0, # deterministic
neighbors_key=f"{omic_key}_neighbors",
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Leiden completato su {omic_key} (res={res}) → {n_clusters} cluster")
print("[INFO] === CLUSTERING DONE ===")
return adata
def weightsInit(m):
if isinstance(m, torch.nn.Linear):
torch.nn.init.xavier_uniform_(m.weight)
torch.nn.init.zeros_(m.bias)
def omicsIntegration(
adata,
transcript_key="trans",
variant_key="variant",
integration_key="int",
latent_dim=32,
num_epoch=300,
lam_align=2.0,
lam_recon_a=1.0,
lam_recon_b=1.0,
lam_cross=5.0,
seed=42,
res=None,
balance_var=False,
):
"""
RNA+VAR integration:
- If n_cells < 100 → concatenated PCA (Muon-style, no rescaling)
- If n_cells >= 100 → asymmetric autoencoder (RNA=teacher, VAR=student)
If res is given, compute the integrated Leiden clustering at that resolution.
"""
import numpy as np
import scanpy as sc
print("[INFO] === OMICS INTEGRATION START ===")
# --- check inputs ---
assert transcript_key + "_X" in adata.uns, f"Missing adata.uns['{transcript_key}_X']"
assert variant_key + "_X" in adata.uns, f"Missing adata.uns['{variant_key}_X']"
X_a = np.asarray(adata.uns[transcript_key + "_X"], dtype=np.float32)
X_b = np.asarray(adata.uns[variant_key + "_X"], dtype=np.float32)
assert X_a.shape[0] == X_b.shape[0], "RNA/VAR rows (cells) must match"
n_cells = X_a.shape[0]
# --- balance variances (only if requested) ---
if balance_var:
print("[INFO] Variance balancing enabled (X /= std).")
X_a /= np.std(X_a)
X_b /= np.std(X_b)
else:
print("[INFO] No rescaling: using plain PCA (Muon-style).")
# ==========================================================
# === BLOCK 1: CONCATENATED PCA (small datasets, < 100 cells) ===
# ==========================================================
if n_cells < 100:
print(f"[INFO] n_cells={n_cells} < 100 → using Muon-style concatenated PCA")
from sklearn.decomposition import PCA
# --- Use PCA results already stored in adata.uns when available ---
if f"{transcript_key}_X" in adata.uns and f"{variant_key}_X" in adata.uns:
print(f"[INFO] Using pre-existing PCA from adata.uns['{transcript_key}_X'] and adata.uns['{variant_key}_X']")
pca_rna = np.asarray(adata.uns[f"{transcript_key}_X"])
pca_var = np.asarray(adata.uns[f"{variant_key}_X"])
# === Detailed debug ===
print(f"[DEBUG] PCA RNA shape: {pca_rna.shape} | PCA VAR shape: {pca_var.shape}")
print(f"[DEBUG] First 5×5 RNA PCA:\n{np.round(pca_rna[:5, :5], 4)}")
print(f"[DEBUG] First 5×5 VAR PCA:\n{np.round(pca_var[:5, :5], 4)}")
print(f"[DEBUG] RNA PCA mean/variance: {np.mean(pca_rna):.4f} / {np.var(pca_rna):.4f}")
print(f"[DEBUG] VAR PCA mean/variance: {np.mean(pca_var):.4f} / {np.var(pca_var):.4f}")
else:
print("[WARN] PCA non trovate in adata.uns — le ricomputo localmente da X grezze.")
from sklearn.decomposition import PCA
X_a = np.asarray(adata.uns[f"{transcript_key}_raw"].todense() if hasattr(adata.uns[f"{transcript_key}_raw"], "todense") else adata.uns[f"{transcript_key}_raw"])
X_b = np.asarray(adata.uns[f"{variant_key}_raw"].todense() if hasattr(adata.uns[f"{variant_key}_raw"], "todense") else adata.uns[f"{variant_key}_raw"])
n_comp = min(50, X_a.shape[1], X_b.shape[1])
pca_rna = PCA(n_components=n_comp, random_state=seed).fit_transform(X_a)
pca_var = PCA(n_components=n_comp, random_state=seed).fit_transform(X_b)
adata.uns[f"{transcript_key}_X"] = pca_rna
adata.uns[f"{variant_key}_X"] = pca_var
print(f"[INFO] Create nuove PCA locali: RNA={pca_rna.shape}, VAR={pca_var.shape}")
print(f"[DEBUG] Prime 5×5 PCA RNA (nuova):\n{np.round(pca_rna[:5, :5], 4)}")
print(f"[DEBUG] Prime 5×5 PCA VAR (nuova):\n{np.round(pca_var[:5, :5], 4)}")
# --- concatenazione PCA pura ---
X_concat = np.concatenate([pca_rna, pca_var], axis=1)
adata.uns[integration_key + "_X"] = X_concat.copy()
adata.obsm["X_concat_pca"] = X_concat.copy()
print(f"[INFO] Concatenazione PCA completata: {X_concat.shape}")
concat_block = X_concat[:5, :5]
print(f"[DEBUG] Prime 5×5 celle CONCATENATE:\n{np.round(concat_block, 4)}")
print(f"[DEBUG] Somma prime 5 righe concatenata: {[round(X_concat[i,:].sum(), 4) for i in range(5)]}")
print(f"[DEBUG] Varianza PCA RNA / VAR: {pca_rna.var():.3f} / {pca_var.var():.3f}")
# --- Neighbors + UMAP integrato ---
sc.pp.neighbors(adata, use_rep="X_concat_pca", key_added=f"{integration_key}_neighbors")
sc.tl.umap(adata)
adata.obsm[f"{integration_key}_umap"] = adata.obsm["X_umap"].copy()
# --- Optional Leiden ---
if res is not None:
key = f"{integration_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
print(f"[INFO] Computing integrated Leiden at res={res}")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
neighbors_key=f"{integration_key}_neighbors",
flavor="igraph",
n_iterations=2,
directed=False,
random_state=seed,
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Creato {key} ({n_clusters} cluster)")
print("[INFO] === MUON-STYLE INTEGRATION COMPLETATA ===")
return adata
# ========================================================
# === BLOCK 2: ASYMMETRIC AUTOENCODER (large datasets) ===
# ========================================================
else:
print(f"[INFO] n_cells={n_cells} ≥ 100 → using asymmetric autoencoder (RNA=teacher, VAR=student)")
from sklearn.preprocessing import StandardScaler
import umap
# --- Joint normalization (RNA+VAR scaled together) ---
scaler = StandardScaler(with_mean=True, with_std=True)
X_concat = np.concatenate([X_a, X_b], axis=1)
X_concat = scaler.fit_transform(X_concat)
# split RNA and VAR again, now normalized on the same scale
X_a_scaled = X_concat[:, :X_a.shape[1]]
X_b_scaled = X_concat[:, X_a.shape[1]:]
# --- Train the improved AE ---
_, zA, zB = train_paired_ae(
X_a_scaled,
X_b_scaled,
latent_dim=latent_dim,
num_epoch=num_epoch,
lam_align=lam_align,
lam_recon_a=lam_recon_a,
lam_recon_b=lam_recon_b,
lam_cross=lam_cross,
verbose=True,
)
# --- Detailed embedding diagnostics ---
print("\n[DEBUG] === AE DIAGNOSTIC ===")
print(f"Shape zA: {zA.shape}, zB: {zB.shape}")
print(f"Variance zA: {np.mean(np.var(zA, axis=0)):.6f}")
print(f"Variance zB: {np.mean(np.var(zB, axis=0)):.6f}")
print(f"Mean zA/zB covariance: {np.mean(np.cov(np.concatenate([zA, zB], axis=1).T)):.6f}")
print(f"SimAE (dot): {float(np.mean(np.sum(zA * zB, axis=1))):.6f}")
# Check correlations between latent features
corr = np.corrcoef(zA.T)
print(f"Mean absolute correlation between zA features: {np.mean(np.abs(corr[np.triu_indices_from(corr,1)])):.4f}")
# Value ranges
print(f"zA value range: {zA.min():.4f} – {zA.max():.4f}")
print(f"zB value range: {zB.min():.4f} – {zB.max():.4f}")
print("[DEBUG] First 5x5 zA cells:\n", np.round(zA[:5, :5], 4))
print("[DEBUG] First 5x5 zB cells:\n", np.round(zB[:5, :5], 4))
print("================================\n")
simAE = float(np.mean(np.sum(zA * zB, axis=1)))
zAE = 0.5 * (zA + zB)
print(f"[DEBUG] Varianza media zA: {zA.var():.4f}, zB: {zB.var():.4f}, zAE: {zAE.var():.4f}")
print(f"[DEBUG] Prime 5×5 celle embedding AE:\n{np.round(zAE[:5, :5], 4)}")
if zAE.var() < 1e-3:
print("[WARN] Varianza AE molto bassa — embedding probabilmente collassato")
# --- Salva risultato integrato ---
adata.uns[integration_key + "_X"] = zAE.astype(np.float32)
adata.uns[integration_key + "_metrics"] = {
"simAE": simAE,
"latent_dim": int(latent_dim),
"num_epoch": int(num_epoch),
"lam_align": float(lam_align),
"lam_recon_a": float(lam_recon_a),
"lam_recon_b": float(lam_recon_b),
"lam_cross": float(lam_cross),
}
# --- UMAP and optional clustering ---
um = umap.UMAP(n_neighbors=15, min_dist=0.05, random_state=seed).fit_transform(zAE)
adata.obsm[f"{integration_key}_umap"] = um
if res is not None:
key = f"{integration_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
print(f"[INFO] Calcolo Leiden integrato per res={res}")
sc.pp.neighbors(adata, use_rep=f"{integration_key}_X", key_added=f"{integration_key}_neighbors")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
neighbors_key=f"{integration_key}_neighbors",
flavor="igraph",
n_iterations=2,
directed=False,
random_state=seed,
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Creato {key} ({n_clusters} cluster)")
print(f"[INFO] AE similarity={simAE:.3f}")
print("[INFO] === AUTOENCODER INTEGRATION COMPLETATA ===")
return adata
class pairedIntegration(torch.nn.Module):
def __init__(self,input_dim_a=100,input_dim_b=100,clf_out=10):
super(pairedIntegration, self).__init__()
self.input_dim_a = input_dim_a
self.input_dim_b = input_dim_b
self.clf_out = clf_out
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_a, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_b, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.clf = torch.nn.Sequential(
torch.nn.Linear(128, self.clf_out),
torch.nn.Softmax(dim=1))
self.feature = torch.nn.Sequential(
torch.nn.Linear(128, 32))
def forward(self, x_a,x_b):
out_a = self.encoder_a(x_a)
f_a = self.feature(out_a)
y_a = self.clf(out_a)
out_b = self.encoder_b(x_b)
f_b = self.feature(out_b)
y_b = self.clf(out_b)
return f_a,y_a,f_b,y_b
def pairedIntegrationTrainer(X_a, X_b, model, batch_size = 512, num_epoch=5,
f_temp = 0.1, p_temp = 1.0):
device = torch.device("cpu")
f_con = contrastiveLoss(batch_size = batch_size,temperature = f_temp)
p_con = contrastiveLoss(batch_size = model.clf_out,temperature = p_temp)
opt = torch.optim.SGD(model.parameters(),lr=0.01, momentum=0.9,weight_decay=5e-4)
for k in range(num_epoch):
model.to(device)
n = X_a.shape[0]
r = np.random.permutation(n)
X_train_a = X_a[r,:]
X_tensor_A=torch.tensor(X_train_a).float()
X_train_b = X_b[r,:]
X_tensor_B=torch.tensor(X_train_b).float()
losses = 0
for j in range(n//batch_size):
inputs_a = X_tensor_A[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_a2 = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_a = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_b = X_tensor_B[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_b = inputs_b + torch.normal(0,1,inputs_b.shape).to(device)
feas,o,nfeas,no = model(inputs_a,inputs_b)
feas2,o2,_,_ = model(inputs_a2,inputs_b)
fea_mi = f_con(feas,nfeas)+f_con(feas,feas2)
p_mi = p_con(o.T,no.T)+p_con(o.T,o2.T)
loss = fea_mi + p_mi
opt.zero_grad()
loss.backward()
opt.step()
losses += loss.data.tolist()
print("Total loss: "+str(round(losses,4)))
gc.collect()
class contrastiveLoss(torch.nn.Module):
def __init__(self, batch_size, temperature=0.5):
super().__init__()
self.batch_size = batch_size
self.register_buffer("temperature", torch.tensor(temperature))
self.register_buffer("negatives_mask", (~torch.eye(batch_size * 2, batch_size * 2,
dtype=bool)).float())
def forward(self, emb_i, emb_j):
# """
# emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
# z_i, z_j as per SimCLR paper
# """
z_i = F.normalize(emb_i, dim=1,p=2)
z_j = F.normalize(emb_j, dim=1,p=2)
representations = torch.cat([z_i, z_j], dim=0)
similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
sim_ij = torch.diag(similarity_matrix, self.batch_size)
sim_ji = torch.diag(similarity_matrix, -self.batch_size)
positives = torch.cat([sim_ij, sim_ji], dim=0)
nominator = torch.exp(positives / self.temperature)
denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)
loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))
loss = torch.sum(loss_partial) / (2 * self.batch_size)
return loss
def distributionClusters(adata, control_cl, group_cl, perc_cell_to_show=0.1, figsize = (12,8), dpi=100, save_path=None):
df = adata.obs.groupby([group_cl, control_cl]).size().unstack()
df = df.loc[df.sum(axis=1)/df.values.sum()>=perc_cell_to_show,:] # drop rows whose group cluster accounts for a smaller fraction of cells than perc_cell_to_show
df_rel = df
df = df.div(df.sum(axis=1), axis=0)
df[group_cl] = df.index
plt.rcParams["figure.figsize"] = figsize
plt.rcParams['figure.dpi'] = dpi
plt.rcParams['figure.facecolor'] = '#FFFFFF'
df.plot(
x = group_cl,
kind = 'barh',
stacked = True,
mark_right = True,
)
leg = plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
leg.get_frame().set_edgecolor('black')
plt.xlabel('perc_'+control_cl, loc='center')
for n in df_rel:
for i, (pos_x, ab, value) in enumerate(zip(df.iloc[:, :-1].cumsum(1)[n], df[n], df_rel[n])):
if (value == 0) | (ab <=0.05):
value = ''
plt.text(pos_x-ab/2, i, str(value), va='center', ha='center')
plt.grid(False)
if save_path is not None:
plt.savefig(save_path, bbox_inches='tight')
return(plt.show())
# ================================================================
# 🔧 OUTPUT HANDLER: save UMAP figures to the output folder
# ================================================================
import os
import matplotlib.pyplot as plt
import scanpy as sc
def save_all_umaps(adata, prefix="output", color_by=None, dpi=300):
"""
Save every UMAP found in adata.obsm as a PNG image.
The AnnData object is not saved; no h5ad file is written.
"""
import os
import scanpy as sc
import matplotlib.pyplot as plt
os.makedirs(prefix, exist_ok=True)
print(f"[INFO] Cartella di output: {prefix}")
# === Trova tutte le UMAP disponibili ===
umap_keys = [k for k in adata.obsm.keys() if k.endswith("_umap") or k == "X_umap"]
if not umap_keys:
print("[WARN] Nessuna UMAP trovata in adata.obsm")
return
print(f"[INFO] UMAP trovate: {umap_keys}")
# === Determina cosa usare come colore ===
if color_by is None:
cluster_cols = [c for c in adata.obs.columns if "clust" in c.lower()]
color_by = cluster_cols if cluster_cols else ["n_genes"]
elif isinstance(color_by, str):
color_by = [color_by]
print(f"[INFO] Colorazioni da usare: {color_by}")
# === Salva ogni combinazione UMAP × colore ===
for key in umap_keys:
for color in color_by:
fig_path = os.path.join(prefix, f"{key}_{color}.png")
try:
sc.pl.embedding(
adata,
basis=key,
color=color,
frameon=False,
show=False,
)
plt.savefig(fig_path, dpi=dpi, bbox_inches="tight")
plt.close()
print(f"[OK] Salvata {fig_path}")
except Exception as e:
print(f"[WARN] Errore nel salvare {fig_path}: {e}")
print("[✅] Tutte le UMAP salvate con successo.")
# import packs
import numpy as np
import pandas as pd
import scanpy as sc
import gc
import torch
import torch.nn.functional as F
import umap
import anndata
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
from sklearn.feature_extraction.text import TfidfTransformer
from scipy import io
from scipy import sparse
import matplotlib.pyplot as plt
### FUNCTION ###
################
def transcriptomicAnalysis(
path_10x,
bcode_variants,
high_var_genes_min_mean=0.015,
high_var_genes_max_mean=3,
high_var_genes_min_disp=0.75,
min_genes=200,
min_cells=3,
max_pct_mt=100,
n_neighbors=20,
n_pcs=30, # if None → skip PCA
transcript_key='trans',
manual_matrix_reading=False
):
"""
Preprocess transcriptomic (RNA) data from 10x Genomics format.
If n_pcs=None, skip PCA and use full (normalized) expression matrix as embedding.
"""
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy import io, sparse
from sklearn.preprocessing import StandardScaler
print("[INFO] === TRANSCRIPTOMIC ANALYSIS START ===")
# === Read RNA data ===
if manual_matrix_reading:
rna = path_10x + '/matrix.mtx.gz'
barcode = path_10x + '/barcodes.tsv.gz'
feat = path_10x + '/features.tsv.gz'
rna = io.mmread(rna)
B = rna.todense().T
barcode = pd.read_csv(barcode, sep='\t', header=None, names=['bcode'])
barcode.index = barcode.iloc[:, 0]
barcode = barcode.drop('bcode', axis=1)
feat = pd.read_csv(feat, sep='\t', header=None, names=['gene_ids', 'gene_symbol', 'feature_types'])
feat.index = feat['gene_symbol']
feat = feat.drop('gene_symbol', axis=1)
adata = anndata.AnnData(X=B, obs=barcode, var=feat)
adata.X = sparse.csr_matrix(adata.X)
else:
adata = sc.read_10x_mtx(path_10x, var_names='gene_symbols', cache=True)
# === Keep only barcodes also present in the variant file ===
bcode_var = pd.read_csv(bcode_variants, sep='\t', header=None)[0]
adata = adata[adata.obs.index.isin(bcode_var)]
# === Filtering and QC ===
sc.pp.filter_cells(adata, min_genes=min_genes)
sc.pp.filter_genes(adata, min_cells=min_cells)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.pct_counts_mt < max_pct_mt, :]
# === Normalization and log ===
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# FIX: remove any NaN generated by log1p
if sp.issparse(adata.X):
adata.X.data[np.isnan(adata.X.data)] = 0
else:
adata.X = np.nan_to_num(adata.X, nan=0.0, posinf=0.0, neginf=0.0)
# === Highly variable gene selection ===
sc.pp.highly_variable_genes(
adata,
min_mean=high_var_genes_min_mean,
max_mean=high_var_genes_max_mean,
min_disp=high_var_genes_min_disp
)
# === Save the original matrix ===
adata.layers["trans_raw"] = adata.X.copy()
# === Optional PCA ===
if n_pcs is not None:
print(f"[INFO] PCA with {n_pcs} components...")
sc.pp.pca(adata, n_comps=n_pcs, use_highly_variable=True)
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs, random_state=42)
sc.tl.umap(adata, random_state=42)
adata.obsm[transcript_key + '_umap'] = adata.obsm['X_umap']
embedding = adata.obsm['X_pca']
else:
print("[INFO] PCA disabilitata → uso matrice intera (tutte le feature normalizzate).")
embedding = adata.X.toarray() if sp.issparse(adata.X) else adata.X
# === Scaling to harmonize the numeric range ===
scaler = StandardScaler(with_mean=False) # avoids densifying sparse matrices
adata.uns[transcript_key + '_X'] = scaler.fit_transform(embedding)
print(f"[INFO] Saved embedding → adata.uns['{transcript_key}_X'] shape={adata.uns[transcript_key + '_X'].shape}")
print("[INFO] === TRANSCRIPTOMIC ANALYSIS DONE ===")
return adata
def _prep_embedding_for_training(X):
"""
Z-score each feature, then L2-normalize each cell. Returns a float32 numpy array.
"""
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
X = np.asarray(X, dtype=np.float32)
X = StandardScaler(with_mean=True, with_std=True).fit_transform(X)
X = normalize(X, norm="l2", axis=1)
return X
import torch
import torch.nn as nn
import torch.nn.functional as F
class PairedAE(nn.Module):
"""
Dual autoencoder (RNA↔VAR) with separate encoders and decoders.
Supports forward(xa, xb) -> (za, zb, ra, rb)
and independent decoding: decode_a(zb), decode_b(za).
"""
def __init__(self, input_dim_a, input_dim_b, latent_dim=32, hidden=256):
super(PairedAE, self).__init__()
# Encoder RNA
self.enc_a = nn.Sequential(
nn.Linear(input_dim_a, hidden),
nn.ReLU(),
nn.Linear(hidden, latent_dim)
)
# Encoder VAR
self.enc_b = nn.Sequential(
nn.Linear(input_dim_b, hidden),
nn.ReLU(),
nn.Linear(hidden, latent_dim)
)
# Decoder RNA
self.dec_a = nn.Sequential(
nn.Linear(latent_dim, hidden),
nn.ReLU(),
nn.Linear(hidden, input_dim_a)
)
# Decoder VAR
self.dec_b = nn.Sequential(
nn.Linear(latent_dim, hidden),
nn.ReLU(),
nn.Linear(hidden, input_dim_b)
)
def forward(self, xa, xb):
za = self.enc_a(xa)
zb = self.enc_b(xb)
ra = self.dec_a(za)
rb = self.dec_b(zb)
return za, zb, ra, rb
# 🔹 These two methods are needed for cross-reconstruction
def decode_a(self, z):
"""Decode a latent vector into the RNA domain"""
return self.dec_a(z)
def decode_b(self, z):
"""Decode a latent vector into the VAR domain"""
return self.dec_b(z)
def train_paired_ae(
Xa, Xb,
latent_dim=32,
hidden=256,
batch_size=256,
num_epoch=300,
lr=1e-3,
lam_align=0.5, # strength of the cosine alignment
lam_recon_a=1.0, # RNA reconstruction weight
lam_recon_b=1.0, # VAR reconstruction weight
lam_cross=2.0, # cross-reconstruction weight
lam_corr=1.0, # 🔥 new: strength of the latent correlation term
use_bernoulli_b=False,
verbose=True,
):
"""
Paired AutoEncoder with:
- RNA/VAR reconstruction
- cosine alignment
- cross-reconstruction (A→B, B→A)
- latent correlation (DeepCCA-style)
"""
import torch
import torch.nn.functional as F
import numpy as np
device = torch.device("cpu")
Xa = np.asarray(Xa, dtype=np.float32)
Xb = np.asarray(Xb, dtype=np.float32)
n, da = Xa.shape
_, db = Xb.shape
# === Define model ===
class PairedAE(torch.nn.Module):
def __init__(self, da, db, hidden=256, latent_dim=32):
super().__init__()
self.enc_a = torch.nn.Sequential(
torch.nn.Linear(da, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.enc_b = torch.nn.Sequential(
torch.nn.Linear(db, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.dec_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, da),
)
self.dec_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, db),
)
def forward(self, xa, xb):
za = self.enc_a(xa)
zb = self.enc_b(xb)
ra = self.dec_a(za)
rb = self.dec_b(zb)
return za, zb, ra, rb
model = PairedAE(da, db, hidden, latent_dim).to(device)
opt = torch.optim.Adam(model.parameters(), lr=lr)
Xa_t = torch.tensor(Xa, dtype=torch.float32).to(device)
Xb_t = torch.tensor(Xb, dtype=torch.float32).to(device)
cos = torch.nn.CosineEmbeddingLoss(margin=0.0, reduction="mean")
def corr_loss(z1, z2):
"""correlation-based alignment (DeepCCA-style)"""
z1 = z1 - z1.mean(0)
z2 = z2 - z2.mean(0)
cov = (z1 * z2).mean(0)
std1 = z1.std(0) + 1e-8
std2 = z2.std(0) + 1e-8
corr = cov / (std1 * std2)
return -corr.mean() # want to maximize corr
for epoch in range(1, num_epoch + 1):
perm = torch.randperm(n)
total_loss = 0.0
for i in range(0, n, batch_size):
idx = perm[i:i + batch_size]
xa = Xa_t[idx]
xb = Xb_t[idx]
za, zb, ra, rb = model(xa, xb)
# reconstruction losses
recon_a = torch.mean((ra - xa) ** 2)
if use_bernoulli_b:
recon_b = F.binary_cross_entropy(torch.sigmoid(rb), xb.clamp(0, 1))
else:
recon_b = torch.mean((rb - xb) ** 2)
# cross reconstruction
rb_from_a = model.dec_b(za)
ra_from_b = model.dec_a(zb)
cross_ab = torch.mean((rb_from_a - xb) ** 2)
cross_ba = torch.mean((ra_from_b - xa) ** 2)
# cosine alignment
y = torch.ones(za.size(0), device=device)
align = cos(za, zb, y)
# correlation loss
corr = corr_loss(za, zb)
# total loss
loss = (
lam_recon_a * recon_a +
lam_recon_b * recon_b +
lam_cross * (cross_ab + cross_ba) +
lam_align * align +
lam_corr * corr
)
opt.zero_grad()
loss.backward()
opt.step()
total_loss += loss.item()
if verbose and (epoch % max(1, num_epoch // 10) == 0 or epoch == 1):
print(f"[AE] Epoch {epoch:04d}/{num_epoch} "
f"| loss={total_loss:.4f} "
f"| rec_a={recon_a.item():.4f} rec_b={recon_b.item():.4f} "
f"| align={align.item():.4f} corr={-corr.item():.4f}")
# final embeddings
with torch.no_grad():
zA, zB, _, _ = model(Xa_t, Xb_t)
zA = F.normalize(zA, dim=1).cpu().numpy()
zB = F.normalize(zB, dim=1).cpu().numpy()
return model, zA, zB
def variantAnalysis(
adata,
matrix_path=None,
bcode_path=None,
variants_path=None,
min_cells=10,
min_counts=20,
variant_filter_level="Norm",
n_pcs=50, # optional PCA to reduce noise
variant_rep="tfidf", # 'tfidf' | 'binary' | 'lognorm'
):
"""
Build the VAR view starting from adata.X (counts/binary).
Applies the optional filter, the chosen representation, scaling and (optional) PCA.
Stores the embedding in adata.uns['variant_X'] and generates variant_umap for QC.
"""
import scanpy as sc
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer
print("[INFO] === VARIANT ANALYSIS START ===")
print(f"[INFO] Filtraggio: {variant_filter_level} | Rappresentazione: {variant_rep} | PCA: {n_pcs}")
X = adata.X.copy()
n_total = X.shape[1]
# --- filter thresholds ---
if variant_filter_level.lower() == "none":
min_cells_eff, min_counts_eff = 0, 0
elif variant_filter_level.lower() == "low":
min_cells_eff = max(1, int(min_cells * 0.3))
min_counts_eff = max(1, int(min_counts * 0.3))
else: # "Norm"
min_cells_eff, min_counts_eff = min_cells, min_counts
# --- filter out rare/weak variants ---
if sparse.issparse(X):
counts_per_var = np.asarray(X.sum(axis=0)).ravel()
cells_per_var = np.asarray((X > 0).sum(axis=0)).ravel()
else:
counts_per_var = X.sum(axis=0)
cells_per_var = (X > 0).sum(axis=0)
mask = (cells_per_var >= min_cells_eff) & (counts_per_var >= min_counts_eff)
removed = int((~mask).sum())
if removed > 0:
adata = adata[:, mask].copy()
X = adata.X
print(f"[INFO] Varianti filtrate: {removed}/{n_total} → kept {adata.n_vars}")
# --- rappresentazione varianti ---
if variant_rep == "binary":
if sparse.issparse(X):
X_rep = (X > 0).astype("float32")
else:
X_rep = (X > 0).astype("float32")
elif variant_rep == "lognorm":
# counts → log1p(normalized)
if sparse.issparse(X):
lib = np.asarray(X.sum(axis=1)).ravel()
lib[lib == 0] = 1.0
X_norm = X.multiply(1.0 / lib[:, None])
X_rep = X_norm
else:
lib = X.sum(axis=1, keepdims=True)
lib[lib == 0] = 1.0
X_rep = X / lib
# log1p
if sparse.issparse(X_rep):
X_rep = X_rep.tocsr(copy=True)
X_rep.data = np.log1p(X_rep.data)
else:
X_rep = np.log1p(X_rep)
else: # "tfidf" (default, consigliato per binario/sparse)
tfidf = TfidfTransformer(norm=None, use_idf=True, smooth_idf=True, sublinear_tf=False)
if sparse.issparse(X):
X_rep = tfidf.fit_transform(X)
else:
X_rep = tfidf.fit_transform(sparse.csr_matrix(X))
# normalize each cell (L2) for comparability
from sklearn.preprocessing import normalize
X_rep = normalize(X_rep, norm="l2", axis=1)
# --- scaling & PCA (optional) ---
if n_pcs is not None:
sc.pp.scale(adata, zero_center=False) # only to set .var and keep compatibility
# arpack PCA does not accept sparse input directly: use TruncatedSVD for sparse matrices
from sklearn.decomposition import PCA, TruncatedSVD
if sparse.issparse(X_rep):
# truncated SVD is more stable on sparse data
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_rep)
else:
pca = PCA(n_components=n_pcs, random_state=42)
embedding = pca.fit_transform(X_rep)
else:
embedding = X_rep.toarray() if sparse.issparse(X_rep) else X_rep
# --- save the embedding for integration + QC UMAP ---
from sklearn.preprocessing import StandardScaler
adata.uns["variant_X"] = StandardScaler(with_mean=True, with_std=True).fit_transform(
np.asarray(embedding, dtype=np.float32)
)
# QC UMAP (2D), for inspection only
sc.pp.neighbors(adata, use_rep=None, n_neighbors=15, n_pcs=None) # not used for variant_X
import umap
emb2d = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42).fit_transform(adata.uns["variant_X"])
adata.obsm["variant_umap"] = emb2d
print(f"[INFO] Salvato adata.uns['variant_X'] shape={adata.uns['variant_X'].shape}")
print("[INFO] === VARIANT ANALYSIS DONE ===")
return adata
def calcOmicsClusters(adata, omic_key, res=0.5, n_neighbors=None, n_pcs=None):
"""
Compute clustering (Leiden) on a given omic embedding (trans, variant, int).
Automatically adapts to 2D UMAPs or high-dimensional embeddings.
"""
import scanpy as sc
import numpy as np
# Default n_neighbors
if n_neighbors is None:
n_neighbors = 15
# --- Detect representation type ---
use_rep = omic_key + "_umap"
if use_rep in adata.obsm:
# UMAP is 2D -> skip n_pcs
print(f"[INFO] Using 2D UMAP embedding: {use_rep}")
sc.pp.neighbors(adata, n_neighbors=n_neighbors, use_rep=use_rep, key_added=f"{omic_key}_neighbors")
elif omic_key + "_X" in adata.uns:
# High-dimensional representation saved in uns
X = adata.uns[omic_key + "_X"]
adata.obsm[omic_key + "_X"] = X
dims = X.shape[1]
print(f"[INFO] Using high-dimensional embedding: {omic_key}_X ({dims} dims)")
sc.pp.neighbors(
adata,
n_neighbors=n_neighbors,
n_pcs=min(n_pcs or dims, dims),
use_rep=omic_key + "_X",
key_added=f"{omic_key}_neighbors"
)
else:
raise ValueError(f"No valid representation found for omic_key='{omic_key}'")
# --- Leiden clustering ---
sc.tl.leiden(
adata,
resolution=res,
key_added=f"{omic_key}_clust_{res}",
neighbors_key=f"{omic_key}_neighbors"
)
return adata
def weightsInit(m):
if isinstance(m, torch.nn.Linear):
torch.nn.init.xavier_uniform_(m.weight)
torch.nn.init.zeros_(m.bias)
def omicsIntegration(
adata,
transcript_key='trans',
variant_key='variant',
integration_key='int',
latent_dim=32,
num_epoch=300,
lam_align=1.0,
lam_recon_a=1.0,
lam_recon_b=1.0,
lam_cross=2.0,
seed=42,
):
"""
Adaptive omics integration with automatic regularization:
- Small datasets → weighted PCA fusion (ConcatPCA-like) + AE refinement
- Large datasets → full asymmetric AE integration
- No CCA, stable scaling between modalities
"""
import numpy as np
import umap, scanpy as sc
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
print("[INFO] === OMICS INTEGRATION (Regularization mode) START ===")
Xa = np.asarray(adata.uns[transcript_key + "_X"], dtype=np.float32)
Xb = np.asarray(adata.uns[variant_key + "_X"], dtype=np.float32)
n_cells, n_featA = Xa.shape
_, n_featB = Xb.shape
assert Xa.shape[0] == Xb.shape[0], "RNA/VAR rows (cells) must match"
# --- Global normalization ---
Xa = StandardScaler(with_mean=True, with_std=True).fit_transform(Xa)
Xb = StandardScaler(with_mean=True, with_std=True).fit_transform(Xb)
# --- Adaptive selection ---
if n_cells < 5000 or n_featB < 2000:
print("[INFO] Regularization mode → adaptive integration")
# Weighted PCA fusion (ConcatPCA-style)
scale_b = np.std(Xa) / (np.std(Xb) + 1e-9)
Xb *= scale_b * 1.2
X_concat = np.concatenate([Xa, Xb], axis=1)
pca = PCA(n_components=min(latent_dim * 2, X_concat.shape[1]), random_state=seed)
z_init = pca.fit_transform(X_concat)
z_init = StandardScaler().fit_transform(z_init)
z_init = normalize(z_init, norm="l2", axis=1)
# Light AE refinement
try:
_, zA, zB = train_paired_ae(
Xa, Xb,
latent_dim=latent_dim,
num_epoch=max(30, num_epoch // 3),
lam_align=lam_align * 0.5,
lam_recon_a=lam_recon_a,
lam_recon_b=lam_recon_b,
lam_cross=lam_cross * 0.5,
verbose=False,
)
z = 0.5 * (z_init + 0.5 * (zA + zB))
except Exception as e:
print(f"[WARN] AE refinement skipped ({e})")
z = z_init
else:
print("[INFO] Full AE mode → asymmetric autoencoder integration")
_, zA, zB = train_paired_ae(
Xa, Xb,
latent_dim=latent_dim,
num_epoch=num_epoch,
lam_align=lam_align,
lam_recon_a=lam_recon_a,
lam_recon_b=lam_recon_b,
lam_cross=lam_cross,
verbose=True,
)
z = 0.5 * (zA + zB)
# --- Latent smoothing and normalization ---
z = z + 0.05 * np.random.randn(*z.shape).astype(np.float32)
z = StandardScaler().fit_transform(z)
z = normalize(z, norm="l2", axis=1)
# --- UMAP computation ---
um = umap.UMAP(n_neighbors=15, min_dist=0.05, random_state=seed).fit_transform(z)
adata.uns[integration_key + "_X"] = z.astype(np.float32)
adata.obsm[integration_key + "_umap"] = um
adata.uns[integration_key + "_metrics"] = {
"latent_dim": int(latent_dim),
"num_epoch": int(num_epoch),
"lam_align": float(lam_align),
"lam_cross": float(lam_cross),
"mode": "Regularization" if n_cells < 5000 else "AE",
}
print("[INFO] === OMICS INTEGRATION DONE ===")
return adata
class pairedIntegration(torch.nn.Module):
def __init__(self,input_dim_a=2000,input_dim_b=2000,clf_out=10):
super(pairedIntegration, self).__init__()
self.input_dim_a = input_dim_a
self.input_dim_b = input_dim_b
self.clf_out = clf_out
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_a, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_b, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.clf = torch.nn.Sequential(
torch.nn.Linear(128, self.clf_out),
torch.nn.Softmax(dim=1))
self.feature = torch.nn.Sequential(
torch.nn.Linear(128, 32))
def forward(self, x_a,x_b):
out_a = self.encoder_a(x_a)
f_a = self.feature(out_a)
y_a = self.clf(out_a)
out_b = self.encoder_b(x_b)
f_b = self.feature(out_b)
y_b = self.clf(out_b)
return f_a,y_a,f_b,y_b
def pairedIntegrationTrainer(X_a, X_b, model, batch_size = 512, num_epoch=5,
f_temp = 0.1, p_temp = 1.0):
device = torch.device("cpu")
f_con = contrastiveLoss(batch_size = batch_size,temperature = f_temp)
p_con = contrastiveLoss(batch_size = model.clf_out,temperature = p_temp)
opt = torch.optim.SGD(model.parameters(),lr=0.01, momentum=0.9,weight_decay=5e-4)
for k in range(num_epoch):
model.to(device)
n = X_a.shape[0]
r = np.random.permutation(n)
X_train_a = X_a[r,:]
X_tensor_A=torch.tensor(X_train_a).float()
X_train_b = X_b[r,:]
X_tensor_B=torch.tensor(X_train_b).float()
losses = 0
for j in range(n//batch_size):
inputs_a = X_tensor_A[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_a2 = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_a = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_b = X_tensor_B[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_b = inputs_b + torch.normal(0,1,inputs_b.shape).to(device)
feas,o,nfeas,no = model(inputs_a,inputs_b)
feas2,o2,_,_ = model(inputs_a2,inputs_b)
fea_mi = f_con(feas,nfeas)+f_con(feas,feas2)
p_mi = p_con(o.T,no.T)+p_con(o.T,o2.T)
loss = fea_mi + p_mi
opt.zero_grad()
loss.backward()
opt.step()
losses += loss.data.tolist()
print("Total loss: "+str(round(losses,4)))
gc.collect()
class contrastiveLoss(torch.nn.Module):
def __init__(self, batch_size, temperature=0.5):
super().__init__()
self.batch_size = batch_size
self.register_buffer("temperature", torch.tensor(temperature))
self.register_buffer("negatives_mask", (~torch.eye(batch_size * 2, batch_size * 2,
dtype=bool)).float())
def forward(self, emb_i, emb_j):
# """
# emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
# z_i, z_j as per SimCLR paper
# """
z_i = F.normalize(emb_i, dim=1,p=2)
z_j = F.normalize(emb_j, dim=1,p=2)
representations = torch.cat([z_i, z_j], dim=0)
similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
sim_ij = torch.diag(similarity_matrix, self.batch_size)
sim_ji = torch.diag(similarity_matrix, -self.batch_size)
positives = torch.cat([sim_ij, sim_ji], dim=0)
nominator = torch.exp(positives / self.temperature)
denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)
loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))
loss = torch.sum(loss_partial) / (2 * self.batch_size)
return loss
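# --- Usage sketch (illustrative only, kept as comments so the module stays import-safe) ---
# A minimal example of wiring together pairedIntegration, contrastiveLoss and
# pairedIntegrationTrainer on random data; the array shapes and hyperparameters
# below are assumptions made purely for illustration.
#
#   import numpy as np
#   Xa = np.random.rand(2048, 2000).astype(np.float32)   # RNA view (cells × features)
#   Xb = np.random.rand(2048, 2000).astype(np.float32)   # variant view (cells × features)
#   model = pairedIntegration(input_dim_a=Xa.shape[1], input_dim_b=Xb.shape[1], clf_out=10)
#   model.apply(weightsInit)
#   pairedIntegrationTrainer(Xa, Xb, model, batch_size=512, num_epoch=5)
#
# Note that the trainer builds two contrastiveLoss instances internally: one over
# feature embeddings (batch_size pairs) and one over cluster assignments (clf_out pairs).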
def distributionClusters(adata, control_cl, group_cl, perc_cell_to_show=0.1, figsize = (12,8), dpi=100, save_path=None):
df = adata.obs.groupby([group_cl, control_cl]).size().unstack()
    df = df.loc[df.sum(axis=1)/df.values.sum()>=perc_cell_to_show,:] # drop rows whose group_cl cluster accounts for a smaller fraction of all cells than perc_cell_to_show
df_rel = df
df = df.div(df.sum(axis=1), axis=0)
df[group_cl] = df.index
plt.rcParams["figure.figsize"] = figsize
plt.rcParams['figure.dpi'] = dpi
plt.rcParams['figure.facecolor'] = '#FFFFFF'
df.plot(
x = group_cl,
kind = 'barh',
stacked = True,
mark_right = True,
)
leg = plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
leg.get_frame().set_edgecolor('black')
plt.xlabel('perc_'+control_cl, loc='center')
for n in df_rel:
for i, (pos_x, ab, value) in enumerate(zip(df.iloc[:, :-1].cumsum(1)[n], df[n], df_rel[n])):
if (value == 0) | (ab <=0.05):
value = ''
plt.text(pos_x-ab/2, i, str(value), va='center', ha='center')
plt.grid(False)
if save_path is not None:
plt.savefig(save_path, bbox_inches='tight')
return(plt.show())
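# --- Usage sketch (illustrative only) ---
# distributionClusters plots, for each cluster in `group_cl`, the fraction of its
# cells falling into each category of `control_cl` as a horizontal stacked bar.
# The column names below ("int_clust_0.5", "trans_clust_0.5") are assumptions made
# for illustration; any two categorical columns of adata.obs can be used.
#
#   distributionClusters(adata, control_cl="trans_clust_0.5", group_cl="int_clust_0.5",
#                        perc_cell_to_show=0.1, save_path="clusters_distribution.png")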
# ================================================================
# 🔧 OUTPUT HANDLER: save UMAP plots as PNG files in the output folder
# ================================================================
import os
import matplotlib.pyplot as plt
import scanpy as sc
def save_all_umaps(adata, prefix="output", color_by=None, dpi=300):
"""
    Save every UMAP embedding found in adata.obsm as a PNG image.
    The AnnData object itself is not saved; no h5ad file is written.
"""
import os
import scanpy as sc
import matplotlib.pyplot as plt
os.makedirs(prefix, exist_ok=True)
print(f"[INFO] Cartella di output: {prefix}")
# === Trova tutte le UMAP disponibili ===
umap_keys = [k for k in adata.obsm.keys() if k.endswith("_umap") or k == "X_umap"]
if not umap_keys:
print("[WARN] Nessuna UMAP trovata in adata.obsm")
return
print(f"[INFO] UMAP trovate: {umap_keys}")
# === Determina cosa usare come colore ===
if color_by is None:
cluster_cols = [c for c in adata.obs.columns if "clust" in c.lower()]
color_by = cluster_cols if cluster_cols else ["n_genes"]
elif isinstance(color_by, str):
color_by = [color_by]
print(f"[INFO] Colorazioni da usare: {color_by}")
# === Salva ogni combinazione UMAP × colore ===
for key in umap_keys:
for color in color_by:
fig_path = os.path.join(prefix, f"{key}_{color}.png")
try:
sc.pl.embedding(
adata,
basis=key,
color=color,
frameon=False,
show=False,
)
plt.savefig(fig_path, dpi=dpi, bbox_inches="tight")
plt.close()
print(f"[OK] Salvata {fig_path}")
except Exception as e:
print(f"[WARN] Errore nel salvare {fig_path}: {e}")
print("[✅] Tutte le UMAP salvate con successo.")
# import packs
import numpy as np
import pandas as pd
import scanpy as sc
import gc
import torch
import torch.nn.functional as F
import umap
import anndata
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
from sklearn.feature_extraction.text import TfidfTransformer
from scipy import io
from scipy import sparse
import matplotlib.pyplot as plt
### FUNCTION ###
################
def transcriptomicAnalysis(
path_10x,
bcode_variants,
high_var_genes_min_mean=0.015,
high_var_genes_max_mean=3,
high_var_genes_min_disp=0.75,
min_genes=200,
min_cells=3,
max_pct_mt=100,
n_neighbors=20,
n_pcs=30,
transcript_key='trans',
manual_matrix_reading=False,
force_float64=True,
):
"""
Preprocess transcriptomic (RNA) data from 10x Genomics format.
    Normalization and scaling now replicate the Muon pipeline exactly:
    normalize_total → log1p → scale(max_value=10)
"""
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy import io
from sklearn.preprocessing import StandardScaler
np.set_printoptions(precision=6, suppress=False, linewidth=140)
print("[INFO] === TRANSCRIPTOMIC ANALYSIS START ===")
    # === Read RNA data ===
if manual_matrix_reading:
rna = path_10x + '/matrix.mtx.gz'
barcode = path_10x + '/barcodes.tsv.gz'
feat = path_10x + '/features.tsv.gz'
M = io.mmread(rna).astype(np.float64 if force_float64 else np.float32)
B = M.T # (cells x genes)
barcode = pd.read_csv(barcode, sep='\t', header=None, names=['bcode'])
barcode.index = barcode['bcode']
feat = pd.read_csv(feat, sep='\t', header=None,
names=['gene_ids', 'gene_symbol', 'feature_types'])
feat.index = feat['gene_symbol']
adata = anndata.AnnData(X=B, obs=barcode, var=feat)
adata.X = sp.csr_matrix(adata.X)
else:
adata = sc.read_10x_mtx(path_10x, var_names='gene_symbols', cache=True)
adata.X = adata.X.astype(np.float64 if force_float64 else np.float32)
    # === Save a raw copy ===
if not sp.issparse(adata.X):
adata.X = sp.csr_matrix(adata.X)
adata.uns[f"{transcript_key}_raw"] = adata.X.copy()
adata.uns[f"{transcript_key}_raw_obs_names"] = np.array(adata.obs_names)
adata.uns[f"{transcript_key}_raw_var_names"] = np.array(adata.var_names)
print(f"[DEBUG] Raw RNA matrix shape: {adata.X.shape}")
raw_block = adata.X[:5, :5].toarray().astype(float)
print(f"[DEBUG] Prime 5×5 celle RAW (counts):\n{np.array2string(raw_block, precision=5, floatmode='maxprec_equal')}")
    # === Keep only barcodes also present in the variant file ===
bcode_var = pd.read_csv(bcode_variants, sep='\t', header=None)[0]
adata = adata[adata.obs.index.isin(bcode_var)].copy()
    # === Filtering and QC ===
sc.pp.filter_cells(adata, min_genes=min_genes)
sc.pp.filter_genes(adata, min_cells=min_cells)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.pct_counts_mt < max_pct_mt, :].copy()
    # === Muon-style normalization ===
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.scale(adata, max_value=10)
    # === Remove any NaN/inf values ===
if sp.issparse(adata.X):
adata.X.data[np.isnan(adata.X.data)] = 0
adata.X.data[np.isinf(adata.X.data)] = 0
else:
adata.X = np.nan_to_num(adata.X, nan=0.0, posinf=0.0, neginf=0.0)
print(f"[DEBUG] RNA normalizzata e scalata (Muon-style). Shape: {adata.X.shape}")
norm_block = adata.X[:5, :5].toarray().astype(float) if sp.issparse(adata.X) else adata.X[:5, :5].astype(float)
print(f"[DEBUG] Prime 5×5 celle RNA (normalizzate):\n{np.array2string(norm_block, precision=4, floatmode='maxprec_equal')}")
    # === Optional PCA ===
if n_pcs is not None:
print(f"[INFO] PCA con {n_pcs} componenti...")
sc.tl.pca(adata, n_comps=n_pcs, random_state=42)
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs, random_state=42)
sc.tl.umap(adata, random_state=42)
        # Save the PCA and UMAP embeddings
adata.obsm[f"{transcript_key}_pca"] = adata.obsm["X_pca"].copy()
adata.obsm[f"{transcript_key}_umap"] = adata.obsm["X_umap"].copy()
        # Also store in uns (for downstream integration)
adata.uns[f"{transcript_key}_X"] = adata.obsm["X_pca"].copy()
        # Numeric debug output
        print(f"[INFO] Saved adata.uns['{transcript_key}_X'] with shape {adata.uns[f'{transcript_key}_X'].shape}")
        print(f"[DEBUG] First 5×5 PCA cells:\n{np.round(adata.uns[f'{transcript_key}_X'][:5, :5], 4)}")
        print(f"[DEBUG] PCA variance ratio (first 5): {np.round(adata.uns['pca']['variance_ratio'][:5], 4)}")
else:
print("[INFO] PCA disabilitata → uso matrice normalizzata intera.")
adata.uns[f"{transcript_key}_X"] = adata.X.toarray() if sp.issparse(adata.X) else adata.X
print(f"[INFO] === TRANSCRIPTOMIC ANALYSIS DONE ({adata.n_obs} cells, {adata.n_vars} genes) ===")
return adata
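# --- Usage sketch (illustrative only) ---
# Typical call on a 10x run; the paths below are placeholders, not files shipped
# with the repository. bcode_variants should point to the barcode list used by
# VarTrix so that RNA and variant cells can be matched downstream.
#
#   adata = transcriptomicAnalysis(
#       path_10x="data/sample/filtered_feature_bc_matrix",
#       bcode_variants="data/sample/vartrix_barcodes.tsv",
#       n_pcs=30,
#       transcript_key="trans",
#   )
#   # adata.uns["trans_X"] now holds the PCA embedding used by omicsIntegration().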
def _prep_embedding_for_training(X):
"""
    Z-score each feature, then L2-normalize each cell. Returns a float32 numpy array.
"""
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
X = np.asarray(X, dtype=np.float32)
X = StandardScaler(with_mean=True, with_std=True).fit_transform(X)
X = normalize(X, norm="l2", axis=1)
return X
class PairedAE(torch.nn.Module):
"""
    Two encoders (RNA/VAR) and two decoders sharing a common bottleneck (latent_dim).
    Loss: reconstruction + (cosine) alignment between z_rna and z_var.
"""
def __init__(self, dim_a, dim_b, latent_dim=32, hidden=256):
super().__init__()
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(dim_a, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(dim_b, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.decoder_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_a),
)
self.decoder_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_b),
)
def forward(self, xa, xb):
za = self.encoder_a(xa)
zb = self.encoder_b(xb)
ra = self.decoder_a(za)
rb = self.decoder_b(zb)
return za, zb, ra, rb
def train_paired_ae(
Xa, Xb,
latent_dim=32,
hidden=256,
batch_size=256,
num_epoch=300,
lr=1e-3,
    lam_align=0.5,       # alignment strength
    lam_recon_a=1.0,     # RNA reconstruction
    lam_recon_b=3.0,     # VAR reconstruction
    lam_cross=2.0,       # new: RNA→VAR consistency
verbose=True,
):
"""
    Asymmetric paired autoencoder:
    RNA = teacher, VAR = student. Loss = recA + recB + cross(A→B) + cosine alignment.
"""
import torch, torch.nn.functional as F, numpy as np
Xa = _prep_embedding_for_training(Xa)
Xb = _prep_embedding_for_training(Xb)
Xa, Xb = np.asarray(Xa, np.float32), np.asarray(Xb, np.float32)
n, da = Xa.shape
_, db = Xb.shape
class AsymAE(torch.nn.Module):
def __init__(self, da, db, latent_dim, hidden):
super().__init__()
self.enc_a = torch.nn.Sequential(
torch.nn.Linear(da, hidden), torch.nn.ReLU(),
torch.nn.Dropout(0.1),
torch.nn.Linear(hidden, latent_dim)
)
self.enc_b = torch.nn.Sequential(
torch.nn.Linear(db, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim)
)
self.dec_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, da)
)
self.dec_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, db)
)
def forward(self, xa, xb):
za = self.enc_a(xa)
zb = self.enc_b(xb)
ra = self.dec_a(za)
rb = self.dec_b(zb)
return za, zb, ra, rb
device = torch.device("cpu")
model = AsymAE(da, db, latent_dim, hidden).to(device)
opt = torch.optim.Adam(model.parameters(), lr=lr)
cos = torch.nn.CosineEmbeddingLoss(margin=0.0)
Xa_t, Xb_t = torch.tensor(Xa), torch.tensor(Xb)
for ep in range(1, num_epoch + 1):
perm = torch.randperm(n)
total = 0
for i in range(0, n, batch_size):
idx = perm[i:i+batch_size]
xa, xb = Xa_t[idx].to(device), Xb_t[idx].to(device)
za, zb, ra, rb = model(xa, xb)
rec_a = F.mse_loss(ra, xa)
rec_b = F.mse_loss(rb, xb)
            cross = F.mse_loss(zb, za.detach())  # VAR must follow RNA (teacher)
y = torch.ones(za.shape[0], device=device)
align = cos(za, zb, y)
loss = lam_recon_a*rec_a + lam_recon_b*rec_b + lam_cross*cross + lam_align*align
opt.zero_grad()
loss.backward()
opt.step()
total += loss.item()
if verbose and (ep % max(1, num_epoch//10) == 0 or ep==1):
print(f"[AE] Epoch {ep:04d}/{num_epoch} | loss={total:.3f} | recA={rec_a:.3f} recB={rec_b:.3f} cross={cross:.3f}")
with torch.no_grad():
zA, zB, _, _ = model(Xa_t, Xb_t)
zA = F.normalize(zA, dim=1).cpu().numpy()
zB = F.normalize(zB, dim=1).cpu().numpy()
return model, zA, zB
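# --- Usage sketch (illustrative only) ---
# train_paired_ae expects two cell-matched embeddings (e.g. the RNA and variant
# PCAs stored in adata.uns). The random arrays below are assumptions used only to
# show the call signature and the returned values.
#
#   import numpy as np
#   Xa = np.random.rand(3000, 30).astype(np.float32)   # RNA PCA (cells × PCs)
#   Xb = np.random.rand(3000, 50).astype(np.float32)   # variant PCA (cells × PCs)
#   model, zA, zB = train_paired_ae(Xa, Xb, latent_dim=32, num_epoch=50, verbose=True)
#   z = 0.5 * (zA + zB)   # fused latent space, as used by omicsIntegration()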
def variantAnalysis(
adata,
matrix_path=None,
bcode_path=None,
variants_path=None,
min_cells=5,
max_cell_fraction=0.95,
variant_filter_level="norm",
n_pcs=50,
variant_rep="muon", # "muon", "tfidf", "lognorm"
variant_key="variant",
):
"""
    Preprocess variant (DNA) data — symmetric counterpart of transcriptomicAnalysis.
    Stores:
        - adata.uns['variant_raw']   raw matrix
        - adata.uns['variant_X']     numeric embedding (PCA/scaled)
        - adata.obsm['variant_pca']  PCA (identical to uns['variant_X'])
        - adata.obsm['variant_umap'] 2D embedding for QC
"""
import scanpy as sc
import numpy as np
import pandas as pd
from scipy import sparse, io
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.feature_extraction.text import TfidfTransformer
import umap
print("[INFO] === VARIANT ANALYSIS START ===")
print(f"[INFO] Lettura file: {matrix_path}")
# --- Lettura matrice ---
var_mtx = io.mmread(matrix_path)
X = sparse.csr_matrix(var_mtx)
barcodes = pd.read_csv(bcode_path, sep="\t", header=None)[0].astype(str).values
variants = pd.read_csv(variants_path, sep="\t", header=None)[0].astype(str).values
    # --- Fix orientation ---
if X.shape[0] == len(variants) and X.shape[1] == len(barcodes):
print(f"[WARN] Transposing variant matrix {X.shape} → expected (cells × variants)")
X = X.T
elif X.shape[1] != len(variants):
print(f"[WARN] Dimensioni inconsuete {X.shape}, provo a trasporre per sicurezza")
X = X.T
print(f"[INFO] Matrice varianti caricata → {X.shape[0]} celle × {X.shape[1]} varianti")
    # === RAW block debug ===
    raw_block = X[:5, :5].toarray()
    print(f"[DEBUG] First 5×5 RAW cells (counts):\n{np.round(raw_block, 2)}")
    # --- Align with RNA ---
rna_barcodes = np.array([b.split('-')[0] for b in adata.obs_names])
var_barcodes = np.array([b.split('-')[0] for b in barcodes])
common = np.intersect1d(var_barcodes, rna_barcodes)
if len(common) == 0:
raise ValueError("[ERROR] Nessun barcode comune tra RNA e VAR.")
order_idx = np.array([np.where(var_barcodes == b)[0][0] for b in rna_barcodes if b in var_barcodes])
X = X[order_idx, :]
barcodes = barcodes[order_idx]
print(f"[INFO] Celle comuni con RNA: {len(barcodes)}")
# --- Salva GREZZI ---
adata.uns[f"{variant_key}_raw"] = X.copy()
adata.uns[f"{variant_key}_raw_obs_names"] = barcodes
adata.uns[f"{variant_key}_raw_var_names"] = variants
    # === Select the representation mode ===
    rep = variant_rep.lower().strip()
    print(f"[INFO] Selected representation: {rep.upper()}")
# ===============================
# MUON / BINARY (default)
# ===============================
if rep in ["muon", "binary"]:
print("[INFO] Uso pipeline Muon-style (binarizzazione + scaling + PCA + UMAP)")
X_bin = (X > 0).astype(float)
dna = sc.AnnData(X_bin)
dna.obs_names = barcodes
dna.var_names = variants
        # Filter out variants that are too rare or too common
sc.pp.filter_genes(dna, min_cells=min_cells)
freq = np.array(dna.X.sum(axis=0)).flatten() / dna.n_obs
keep = freq < max_cell_fraction
dna = dna[:, keep].copy()
print(f"[INFO] Varianti mantenute: {dna.n_vars}/{len(variants)}")
# Scala
sc.pp.scale(dna, max_value=10)
norm_block = dna.X[:5, :5].toarray() if sparse.issparse(dna.X) else dna.X[:5, :5]
print(f"[DEBUG] Prime 5×5 celle DNA (scalate):\n{np.round(norm_block, 4)}")
# PCA e UMAP
sc.tl.pca(dna, n_comps=n_pcs, random_state=42)
sc.pp.neighbors(dna, n_pcs=min(30, n_pcs), random_state=42)
sc.tl.umap(dna, random_state=42)
adata.obsm[f"{variant_key}_pca"] = dna.obsm["X_pca"].copy()
adata.uns[f"{variant_key}_X"] = dna.obsm["X_pca"].copy()
adata.obsm[f"{variant_key}_umap"] = dna.obsm["X_umap"].copy()
print(f"[DEBUG] PCA variance ratio (prime 5): {np.round(dna.uns['pca']['variance_ratio'][:5], 4)}")
print(f"[DEBUG] Prime 5×5 celle PCA:\n{np.round(adata.uns[f'{variant_key}_X'][:5, :5], 4)}")
# ===============================
# TF-IDF
# ===============================
elif rep == "tfidf":
print("[INFO] Uso rappresentazione TF-IDF + L2 norm")
tfidf = TfidfTransformer(norm=None, use_idf=True, smooth_idf=True)
X_rep = tfidf.fit_transform(X)
X_rep = normalize(X_rep, norm="l2", axis=1)
rep_block = X_rep[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle TFIDF:\n{np.round(rep_block, 4)}")
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_rep)
adata.uns[f"{variant_key}_X"] = embedding.astype(np.float32)
adata.obsm[f"{variant_key}_pca"] = embedding.astype(np.float32)
reducer = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42)
adata.obsm[f"{variant_key}_umap"] = reducer.fit_transform(embedding)
# ===============================
# LOGNORM
# ===============================
elif rep == "lognorm":
print("[INFO] Uso rappresentazione log1p-normalized")
lib = np.asarray(X.sum(axis=1)).ravel()
lib[lib == 0] = 1.0
X_norm = X.multiply(1.0 / lib[:, None])
X_norm.data = np.log1p(X_norm.data)
rep_block = X_norm[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle LOGNORM:\n{np.round(rep_block, 4)}")
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_norm)
adata.uns[f"{variant_key}_X"] = embedding.astype(np.float32)
adata.obsm[f"{variant_key}_pca"] = embedding.astype(np.float32)
reducer = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42)
adata.obsm[f"{variant_key}_umap"] = reducer.fit_transform(embedding)
# ===============================
    # ERROR
    # ===============================
    else:
        raise ValueError(f"[ERROR] variant_rep '{variant_rep}' not recognized. Use 'muon', 'tfidf' or 'lognorm'.")
print(f"[INFO] DNA PCA shape: {adata.uns[f'{variant_key}_X'].shape}")
print(f"[INFO] === VARIANT ANALYSIS DONE ({X.shape[0]} cells, {X.shape[1]} variants) ===")
return adata
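# --- Usage sketch (illustrative only) ---
# variantAnalysis reads the VarTrix output and aligns it with the RNA barcodes
# already present in `adata`. The file names below are placeholders.
#
#   adata = variantAnalysis(
#       adata,
#       matrix_path="data/sample/vartrix_matrix.mtx",
#       bcode_path="data/sample/vartrix_barcodes.tsv",
#       variants_path="data/sample/vartrix_variants.tsv",
#       variant_rep="muon",
#       variant_key="variant",
#   )
#   # adata.uns["variant_X"] now holds the variant PCA used by omicsIntegration().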
def calcOmicsClusters(adata, omic_key="int", res=0.5, n_neighbors=15, n_pcs=None):
"""
    Compute Leiden clustering on the integrated embedding (e.g. 'int').
    Logic:
        - if n_cells < 2000  → use the concatenated PCA stored in adata.uns['int_X'] (Muon-style)
        - if n_cells >= 2000 → use the AE embedding stored in adata.uns['int_X']
    UMAP is never used here. Clustering is deterministic (igraph).
"""
import scanpy as sc
import numpy as np
n_cells = adata.n_obs
print(f"\n[INFO] === CALC OMiCS CLUSTERS ({omic_key}) ===")
print(f"[INFO] Numero celle: {n_cells}")
# ==========================================================
    # 1️⃣ Retrieve the integrated embedding (concatenated PCA or AE)
# ==========================================================
if f"{omic_key}_X" not in adata.uns:
raise ValueError(f"[ERROR] uns['{omic_key}_X'] mancante — devi prima eseguire omicsIntegration().")
X = np.asarray(adata.uns[f"{omic_key}_X"], dtype=np.float32)
adata.obsm[f"{omic_key}_X"] = X.copy()
    if n_cells < 2000:
        print("[INFO] Small dataset → using concatenated PCA (Muon-style)")
    else:
        print("[INFO] Large dataset → using the autoencoder embedding (AE-style)")
    print(f"[INFO] Embedding loaded from adata.uns['{omic_key}_X'] ({X.shape[1]} dimensions)")
    # 🔍 Numeric debug output
    debug_block = np.round(X[:5, :5], 4)
    print(f"[DEBUG] First 5×5 embedding cells:\n{debug_block}")
    print(f"[DEBUG] Sum of the first 5 rows: {[round(X[i,:].sum(), 4) for i in range(min(5, X.shape[0]))]}")
    print(f"[DEBUG] Overall embedding variance: {X.var():.4f}")
    # (Optional) consistency check against obsm["X_concat_pca"]
if "X_concat_pca" in adata.obsm:
same = np.allclose(X[:5, :5], adata.obsm["X_concat_pca"][:5, :5])
print(f"[DEBUG] Coerenza con obsm['X_concat_pca']: {same}")
# ==========================================================
    # 2️⃣ Build the neighbors graph
# ==========================================================
dims = X.shape[1]
sc.pp.neighbors(
adata,
n_neighbors=n_neighbors,
n_pcs=min(n_pcs or dims, dims),
use_rep=f"{omic_key}_X",
key_added=f"{omic_key}_neighbors"
)
conn = adata.obsp[f"{omic_key}_neighbors_connectivities"]
dist = adata.obsp[f"{omic_key}_neighbors_distances"]
print(f"[DEBUG] Neighbors graph: {conn.shape}, mean_conn={conn.mean():.6f}, mean_dist={dist.mean():.6f}")
# ==========================================================
    # 3️⃣ Deterministic Leiden (igraph)
# ==========================================================
key = f"{omic_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
flavor="igraph",
n_iterations=2,
directed=False,
        random_state=0,  # deterministic
neighbors_key=f"{omic_key}_neighbors",
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Leiden completato su {omic_key} (res={res}) → {n_clusters} cluster")
print("[INFO] === CLUSTERING DONE ===")
return adata
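# --- Usage sketch (illustrative only) ---
# calcOmicsClusters runs Leiden on the integrated embedding stored in
# adata.uns["int_X"]; the resolution below is an arbitrary example value.
#
#   adata = calcOmicsClusters(adata, omic_key="int", res=0.5, n_neighbors=15)
#   # cluster labels end up in adata.obs["int_clust_0.5"]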
def weightsInit(m):
if isinstance(m, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)  # in-place initializer; xavier_uniform is deprecated
m.bias.data.zero_()
def omicsIntegration(
adata,
transcript_key="trans",
variant_key="variant",
integration_key="int",
latent_dim=32,
num_epoch=300,
lam_align=0.5,
lam_recon_a=1.0,
lam_recon_b=1.0,
lam_cross=2.0,
seed=42,
res=None,
    balance_var=False,  # 👈 new argument: default = False (as in Muon)
):
"""
    RNA+VAR integration:
        - if n_cells < 2000  → concatenated PCA (Muon-style, no rescaling)
        - if n_cells >= 2000 → asymmetric autoencoder (RNA = teacher, VAR = student)
    If res is given, also compute the integrated Leiden clustering at that resolution.
"""
import numpy as np
import scanpy as sc
print("[INFO] === OMICS INTEGRATION START ===")
    # --- validate input ---
assert transcript_key + "_X" in adata.uns, f"Missing adata.uns['{transcript_key}_X']"
assert variant_key + "_X" in adata.uns, f"Missing adata.uns['{variant_key}_X']"
X_a = np.asarray(adata.uns[transcript_key + "_X"], dtype=np.float32)
X_b = np.asarray(adata.uns[variant_key + "_X"], dtype=np.float32)
assert X_a.shape[0] == X_b.shape[0], "RNA/VAR rows (cells) must match"
n_cells = X_a.shape[0]
    # --- balance variances (only if requested) ---
    if balance_var:
        print("[INFO] Variance balancing enabled (X /= std).")
        X_a /= np.std(X_a)
        X_b /= np.std(X_b)
    else:
        print("[INFO] No rescaling: using plain PCA (Muon-style).")
# ==========================================================
    # === BLOCK 1: CONCATENATED PCA (small dataset, < 2000 cells) ===
    # ==========================================================
    if n_cells < 2000:
        print(f"[INFO] n_cells={n_cells} < 2000 → using Muon-style concatenated PCA")
from sklearn.decomposition import PCA
        # --- Use the PCAs already stored in adata.uns when available ---
        if f"{transcript_key}_X" in adata.uns and f"{variant_key}_X" in adata.uns:
            print(f"[INFO] Using pre-existing PCAs from adata.uns['{transcript_key}_X'] and adata.uns['{variant_key}_X']")
pca_rna = np.asarray(adata.uns[f"{transcript_key}_X"])
pca_var = np.asarray(adata.uns[f"{variant_key}_X"])
            # === Detailed debug output ===
            print(f"[DEBUG] PCA RNA shape: {pca_rna.shape} | PCA VAR shape: {pca_var.shape}")
            print(f"[DEBUG] First 5×5 PCA RNA:\n{np.round(pca_rna[:5, :5], 4)}")
            print(f"[DEBUG] First 5×5 PCA VAR:\n{np.round(pca_var[:5, :5], 4)}")
            print(f"[DEBUG] Mean/variance PCA RNA: {np.mean(pca_rna):.4f} / {np.var(pca_rna):.4f}")
            print(f"[DEBUG] Mean/variance PCA VAR: {np.mean(pca_var):.4f} / {np.var(pca_var):.4f}")
else:
print("[WARN] PCA non trovate in adata.uns — le ricomputo localmente da X grezze.")
from sklearn.decomposition import PCA
X_a = np.asarray(adata.uns[f"{transcript_key}_raw"].todense() if hasattr(adata.uns[f"{transcript_key}_raw"], "todense") else adata.uns[f"{transcript_key}_raw"])
X_b = np.asarray(adata.uns[f"{variant_key}_raw"].todense() if hasattr(adata.uns[f"{variant_key}_raw"], "todense") else adata.uns[f"{variant_key}_raw"])
n_comp = min(50, X_a.shape[1], X_b.shape[1])
pca_rna = PCA(n_components=n_comp, random_state=seed).fit_transform(X_a)
pca_var = PCA(n_components=n_comp, random_state=seed).fit_transform(X_b)
adata.uns[f"{transcript_key}_X"] = pca_rna
adata.uns[f"{variant_key}_X"] = pca_var
print(f"[INFO] Create nuove PCA locali: RNA={pca_rna.shape}, VAR={pca_var.shape}")
print(f"[DEBUG] Prime 5×5 PCA RNA (nuova):\n{np.round(pca_rna[:5, :5], 4)}")
print(f"[DEBUG] Prime 5×5 PCA VAR (nuova):\n{np.round(pca_var[:5, :5], 4)}")
        # --- plain PCA concatenation ---
X_concat = np.concatenate([pca_rna, pca_var], axis=1)
adata.uns[integration_key + "_X"] = X_concat.copy()
adata.obsm["X_concat_pca"] = X_concat.copy()
print(f"[INFO] Concatenazione PCA completata: {X_concat.shape}")
concat_block = X_concat[:5, :5]
print(f"[DEBUG] Prime 5×5 celle CONCATENATE:\n{np.round(concat_block, 4)}")
print(f"[DEBUG] Somma prime 5 righe concatenata: {[round(X_concat[i,:].sum(), 4) for i in range(5)]}")
print(f"[DEBUG] Varianza PCA RNA / VAR: {pca_rna.var():.3f} / {pca_var.var():.3f}")
        # --- Neighbors + integrated UMAP ---
sc.pp.neighbors(adata, use_rep="X_concat_pca", key_added=f"{integration_key}_neighbors")
sc.tl.umap(adata)
adata.obsm[f"{integration_key}_umap"] = adata.obsm["X_umap"].copy()
        # --- Optional Leiden ---
        if res is not None:
            key = f"{integration_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
            print(f"[INFO] Computing integrated Leiden at res={res}")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
neighbors_key=f"{integration_key}_neighbors",
flavor="igraph",
n_iterations=2,
directed=False,
random_state=seed,
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Creato {key} ({n_clusters} cluster)")
print("[INFO] === MUON-STYLE INTEGRATION COMPLETATA ===")
return adata
# ========================================================
    # === BLOCK 2: ASYMMETRIC AUTOENCODER (large dataset) ===
    # ========================================================
    else:
        print(f"[INFO] n_cells={n_cells} ≥ 2000 → using the asymmetric autoencoder (RNA→VAR)")
from sklearn.preprocessing import StandardScaler
import umap
_, zA, zB = train_paired_ae(
X_a,
X_b,
latent_dim=latent_dim,
num_epoch=num_epoch,
lam_align=lam_align,
lam_recon_a=lam_recon_a,
lam_recon_b=lam_recon_b,
lam_cross=lam_cross,
verbose=True,
)
simAE = float(np.mean(np.sum(zA * zB, axis=1)))
zAE = 0.5 * (zA + zB)
adata.uns[integration_key + "_X"] = zAE.astype(np.float32)
adata.uns[integration_key + "_metrics"] = {
"simAE": simAE,
"latent_dim": int(latent_dim),
"num_epoch": int(num_epoch),
"lam_align": float(lam_align),
"lam_recon_b": float(lam_recon_b),
"lam_cross": float(lam_cross),
}
um = umap.UMAP(n_neighbors=15, min_dist=0.05, random_state=seed).fit_transform(zAE)
adata.obsm[f"{integration_key}_umap"] = um
        # --- Optional Leiden ---
        if res is not None:
            key = f"{integration_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
            print(f"[INFO] Computing integrated Leiden at res={res}")
sc.pp.neighbors(adata, use_rep=f"{integration_key}_X", key_added=f"{integration_key}_neighbors")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
neighbors_key=f"{integration_key}_neighbors",
flavor="igraph",
n_iterations=2,
directed=False,
random_state=seed,
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Creato {key} ({n_clusters} cluster)")
print(f"[INT] AE similarity={simAE:.3f}")
print("[INFO] === AUTOENCODER INTEGRATION COMPLETATA ===")
return adata
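# --- Usage sketch (illustrative only) ---
# omicsIntegration fuses the two embeddings stored in adata.uns["trans_X"] and
# adata.uns["variant_X"]; with res set it also computes an integrated Leiden
# clustering. The parameter values below are illustrative defaults.
#
#   adata = omicsIntegration(
#       adata,
#       transcript_key="trans",
#       variant_key="variant",
#       integration_key="int",
#       latent_dim=32,
#       num_epoch=300,
#       res=0.5,
#   )
#   # adata.uns["int_X"] and adata.obsm["int_umap"] are now available for plotting.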
class pairedIntegration(torch.nn.Module):
def __init__(self,input_dim_a=2000,input_dim_b=2000,clf_out=10):
super(pairedIntegration, self).__init__()
self.input_dim_a = input_dim_a
self.input_dim_b = input_dim_b
self.clf_out = clf_out
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_a, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_b, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.clf = torch.nn.Sequential(
torch.nn.Linear(128, self.clf_out),
torch.nn.Softmax(dim=1))
self.feature = torch.nn.Sequential(
torch.nn.Linear(128, 32))
def forward(self, x_a,x_b):
out_a = self.encoder_a(x_a)
f_a = self.feature(out_a)
y_a = self.clf(out_a)
out_b = self.encoder_b(x_b)
f_b = self.feature(out_b)
y_b = self.clf(out_b)
return f_a,y_a,f_b,y_b
def pairedIntegrationTrainer(X_a, X_b, model, batch_size = 512, num_epoch=5,
f_temp = 0.1, p_temp = 1.0):
device = torch.device("cpu")
f_con = contrastiveLoss(batch_size = batch_size,temperature = f_temp)
p_con = contrastiveLoss(batch_size = model.clf_out,temperature = p_temp)
opt = torch.optim.SGD(model.parameters(),lr=0.01, momentum=0.9,weight_decay=5e-4)
for k in range(num_epoch):
model.to(device)
n = X_a.shape[0]
r = np.random.permutation(n)
X_train_a = X_a[r,:]
X_tensor_A=torch.tensor(X_train_a).float()
X_train_b = X_b[r,:]
X_tensor_B=torch.tensor(X_train_b).float()
losses = 0
for j in range(n//batch_size):
inputs_a = X_tensor_A[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_a2 = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_a = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_b = X_tensor_B[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_b = inputs_b + torch.normal(0,1,inputs_b.shape).to(device)
feas,o,nfeas,no = model(inputs_a,inputs_b)
feas2,o2,_,_ = model(inputs_a2,inputs_b)
fea_mi = f_con(feas,nfeas)+f_con(feas,feas2)
p_mi = p_con(o.T,no.T)+p_con(o.T,o2.T)
loss = fea_mi + p_mi
opt.zero_grad()
loss.backward()
opt.step()
losses += loss.data.tolist()
print("Total loss: "+str(round(losses,4)))
gc.collect()
class contrastiveLoss(torch.nn.Module):
def __init__(self, batch_size, temperature=0.5):
super().__init__()
self.batch_size = batch_size
self.register_buffer("temperature", torch.tensor(temperature))
self.register_buffer("negatives_mask", (~torch.eye(batch_size * 2, batch_size * 2,
dtype=bool)).float())
def forward(self, emb_i, emb_j):
# """
# emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
# z_i, z_j as per SimCLR paper
# """
z_i = F.normalize(emb_i, dim=1,p=2)
z_j = F.normalize(emb_j, dim=1,p=2)
representations = torch.cat([z_i, z_j], dim=0)
similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
sim_ij = torch.diag(similarity_matrix, self.batch_size)
sim_ji = torch.diag(similarity_matrix, -self.batch_size)
positives = torch.cat([sim_ij, sim_ji], dim=0)
nominator = torch.exp(positives / self.temperature)
denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)
loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))
loss = torch.sum(loss_partial) / (2 * self.batch_size)
return loss
def distributionClusters(adata, control_cl, group_cl, perc_cell_to_show=0.1, figsize = (12,8), dpi=100, save_path=None):
df = adata.obs.groupby([group_cl, control_cl]).size().unstack()
    df = df.loc[df.sum(axis=1)/df.values.sum()>=perc_cell_to_show,:] # drop rows whose group_cl cluster accounts for a smaller fraction of all cells than perc_cell_to_show
df_rel = df
df = df.div(df.sum(axis=1), axis=0)
df[group_cl] = df.index
plt.rcParams["figure.figsize"] = figsize
plt.rcParams['figure.dpi'] = dpi
plt.rcParams['figure.facecolor'] = '#FFFFFF'
df.plot(
x = group_cl,
kind = 'barh',
stacked = True,
mark_right = True,
)
leg = plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
leg.get_frame().set_edgecolor('black')
plt.xlabel('perc_'+control_cl, loc='center')
for n in df_rel:
for i, (pos_x, ab, value) in enumerate(zip(df.iloc[:, :-1].cumsum(1)[n], df[n], df_rel[n])):
if (value == 0) | (ab <=0.05):
value = ''
plt.text(pos_x-ab/2, i, str(value), va='center', ha='center')
plt.grid(False)
if save_path is not None:
plt.savefig(save_path, bbox_inches='tight')
return(plt.show())
# ================================================================
# 🔧 OUTPUT HANDLER: save UMAP plots as PNG files in the output folder
# ================================================================
import os
import matplotlib.pyplot as plt
import scanpy as sc
def save_all_umaps(adata, prefix="output", color_by=None, dpi=300):
"""
    Save every UMAP embedding found in adata.obsm as a PNG image.
    The AnnData object itself is not saved; no h5ad file is written.
"""
import os
import scanpy as sc
import matplotlib.pyplot as plt
os.makedirs(prefix, exist_ok=True)
print(f"[INFO] Cartella di output: {prefix}")
# === Trova tutte le UMAP disponibili ===
umap_keys = [k for k in adata.obsm.keys() if k.endswith("_umap") or k == "X_umap"]
if not umap_keys:
print("[WARN] Nessuna UMAP trovata in adata.obsm")
return
print(f"[INFO] UMAP trovate: {umap_keys}")
# === Determina cosa usare come colore ===
if color_by is None:
cluster_cols = [c for c in adata.obs.columns if "clust" in c.lower()]
color_by = cluster_cols if cluster_cols else ["n_genes"]
elif isinstance(color_by, str):
color_by = [color_by]
print(f"[INFO] Colorazioni da usare: {color_by}")
# === Salva ogni combinazione UMAP × colore ===
for key in umap_keys:
for color in color_by:
fig_path = os.path.join(prefix, f"{key}_{color}.png")
try:
sc.pl.embedding(
adata,
basis=key,
color=color,
frameon=False,
show=False,
)
plt.savefig(fig_path, dpi=dpi, bbox_inches="tight")
plt.close()
print(f"[OK] Salvata {fig_path}")
except Exception as e:
print(f"[WARN] Errore nel salvare {fig_path}: {e}")
print("[✅] Tutte le UMAP salvate con successo.")
# import packs
import numpy as np
import pandas as pd
import scanpy as sc
import gc
import torch
import torch.nn.functional as F
import umap
import anndata
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
from sklearn.feature_extraction.text import TfidfTransformer
from scipy import io
from scipy import sparse
import matplotlib.pyplot as plt
### FUNCTION ###
################
def transcriptomicAnalysis(
path_10x,
bcode_variants,
high_var_genes_min_mean=0.015,
high_var_genes_max_mean=3,
high_var_genes_min_disp=0.75,
min_genes=200,
min_cells=3,
max_pct_mt=100,
n_neighbors=20,
    n_pcs=30,  # if None → skip PCA
transcript_key='trans',
manual_matrix_reading=False
):
"""
Preprocess transcriptomic (RNA) data from 10x Genomics format.
If n_pcs=None, skip PCA and use full (normalized) expression matrix as embedding.
"""
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy import io, sparse
from sklearn.preprocessing import StandardScaler
print("[INFO] === TRANSCRIPTOMIC ANALYSIS START ===")
    # === Read RNA data ===
if manual_matrix_reading:
rna = path_10x + '/matrix.mtx.gz'
barcode = path_10x + '/barcodes.tsv.gz'
feat = path_10x + '/features.tsv.gz'
rna = io.mmread(rna)
B = rna.todense().T
barcode = pd.read_csv(barcode, sep='\t', header=None, names=['bcode'])
barcode.index = barcode.iloc[:, 0]
barcode = barcode.drop('bcode', axis=1)
feat = pd.read_csv(feat, sep='\t', header=None, names=['gene_ids', 'gene_symbol', 'feature_types'])
feat.index = feat['gene_symbol']
feat = feat.drop('gene_symbol', axis=1)
adata = anndata.AnnData(X=B, obs=barcode, var=feat)
adata.X = sparse.csr_matrix(adata.X)
else:
adata = sc.read_10x_mtx(path_10x, var_names='gene_symbols', cache=True)
    # === Keep only barcodes also present in the variant file ===
bcode_var = pd.read_csv(bcode_variants, sep='\t', header=None)[0]
adata = adata[adata.obs.index.isin(bcode_var)]
    # === Filtering and QC ===
sc.pp.filter_cells(adata, min_genes=min_genes)
sc.pp.filter_genes(adata, min_cells=min_cells)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.pct_counts_mt < max_pct_mt, :]
    # === Normalization and log transform ===
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
    # FIX: remove any NaN values produced by log1p
if sp.issparse(adata.X):
adata.X.data[np.isnan(adata.X.data)] = 0
else:
adata.X = np.nan_to_num(adata.X, nan=0.0, posinf=0.0, neginf=0.0)
    # === Highly variable gene selection ===
sc.pp.highly_variable_genes(
adata,
min_mean=high_var_genes_min_mean,
max_mean=high_var_genes_max_mean,
min_disp=high_var_genes_min_disp
)
    # === Save the original matrix ===
adata.layers["trans_raw"] = adata.X.copy()
    # === Optional PCA ===
    if n_pcs is not None:
        print(f"[INFO] PCA with {n_pcs} components...")
sc.pp.pca(adata, n_comps=n_pcs, use_highly_variable=True)
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs, random_state=42)
sc.tl.umap(adata, random_state=42)
adata.obsm[transcript_key + '_umap'] = adata.obsm['X_umap']
embedding = adata.obsm['X_pca']
else:
print("[INFO] PCA disabilitata → uso matrice intera (tutte le feature normalizzate).")
embedding = adata.X.toarray() if sp.issparse(adata.X) else adata.X
    # === Scaling to harmonize the numeric range ===
    scaler = StandardScaler(with_mean=False)  # avoids densifying sparse matrices
    adata.uns[transcript_key + '_X'] = scaler.fit_transform(embedding)
    print(f"[INFO] Saved embedding → adata.uns['{transcript_key}_X'] shape={adata.uns[transcript_key + '_X'].shape}")
print("[INFO] === TRANSCRIPTOMIC ANALYSIS DONE ===")
return adata
def _prep_embedding_for_training(X):
"""
    Z-score each feature, then L2-normalize each cell. Returns a float32 numpy array.
"""
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
X = np.asarray(X, dtype=np.float32)
X = StandardScaler(with_mean=True, with_std=True).fit_transform(X)
X = normalize(X, norm="l2", axis=1)
return X
class PairedAE(torch.nn.Module):
"""
    Two encoders (RNA/VAR) and two decoders sharing a common bottleneck (latent_dim).
    Loss: reconstruction + (cosine) alignment between z_rna and z_var.
"""
def __init__(self, dim_a, dim_b, latent_dim=32, hidden=256):
super().__init__()
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(dim_a, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(dim_b, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.decoder_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_a),
)
self.decoder_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_b),
)
def forward(self, xa, xb):
za = self.encoder_a(xa)
zb = self.encoder_b(xb)
ra = self.decoder_a(za)
rb = self.decoder_b(zb)
return za, zb, ra, rb
def train_paired_ae(
Xa, Xb,
latent_dim=32,
hidden=256,
batch_size=256,
num_epoch=300,
lr=1e-3,
    lam_align=0.5,       # alignment strength
    lam_recon_a=1.0,     # RNA reconstruction
    lam_recon_b=3.0,     # VAR reconstruction
    lam_cross=2.0,       # new: RNA→VAR consistency
verbose=True,
):
"""
    Asymmetric paired autoencoder:
    RNA = teacher, VAR = student. Loss = recA + recB + cross(A→B) + cosine alignment.
"""
import torch, torch.nn.functional as F, numpy as np
Xa = _prep_embedding_for_training(Xa)
Xb = _prep_embedding_for_training(Xb)
Xa, Xb = np.asarray(Xa, np.float32), np.asarray(Xb, np.float32)
n, da = Xa.shape
_, db = Xb.shape
class AsymAE(torch.nn.Module):
def __init__(self, da, db, latent_dim, hidden):
super().__init__()
self.enc_a = torch.nn.Sequential(
torch.nn.Linear(da, hidden), torch.nn.ReLU(),
torch.nn.Dropout(0.1),
torch.nn.Linear(hidden, latent_dim)
)
self.enc_b = torch.nn.Sequential(
torch.nn.Linear(db, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim)
)
self.dec_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, da)
)
self.dec_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, db)
)
def forward(self, xa, xb):
za = self.enc_a(xa)
zb = self.enc_b(xb)
ra = self.dec_a(za)
rb = self.dec_b(zb)
return za, zb, ra, rb
device = torch.device("cpu")
model = AsymAE(da, db, latent_dim, hidden).to(device)
opt = torch.optim.Adam(model.parameters(), lr=lr)
cos = torch.nn.CosineEmbeddingLoss(margin=0.0)
Xa_t, Xb_t = torch.tensor(Xa), torch.tensor(Xb)
for ep in range(1, num_epoch + 1):
perm = torch.randperm(n)
total = 0
for i in range(0, n, batch_size):
idx = perm[i:i+batch_size]
xa, xb = Xa_t[idx].to(device), Xb_t[idx].to(device)
za, zb, ra, rb = model(xa, xb)
rec_a = F.mse_loss(ra, xa)
rec_b = F.mse_loss(rb, xb)
            cross = F.mse_loss(zb, za.detach())  # VAR must follow RNA (teacher)
y = torch.ones(za.shape[0], device=device)
align = cos(za, zb, y)
loss = lam_recon_a*rec_a + lam_recon_b*rec_b + lam_cross*cross + lam_align*align
opt.zero_grad()
loss.backward()
opt.step()
total += loss.item()
if verbose and (ep % max(1, num_epoch//10) == 0 or ep==1):
print(f"[AE] Epoch {ep:04d}/{num_epoch} | loss={total:.3f} | recA={rec_a:.3f} recB={rec_b:.3f} cross={cross:.3f}")
with torch.no_grad():
zA, zB, _, _ = model(Xa_t, Xb_t)
zA = F.normalize(zA, dim=1).cpu().numpy()
zB = F.normalize(zB, dim=1).cpu().numpy()
return model, zA, zB
def variantAnalysis(
adata,
matrix_path=None,
bcode_path=None,
variants_path=None,
min_cells=10,
min_counts=20,
variant_filter_level="Norm",
    n_pcs=50,  # optional PCA to reduce noise
variant_rep="tfidf", # 'tfidf' | 'binary' | 'lognorm'
):
"""
    Build the VAR view starting from adata.X (counts/binary).
    Applies an optional filter, the chosen representation, scaling and an optional PCA.
    Stores the embedding in adata.uns['variant_X'] and generates variant_umap for QC.
"""
import scanpy as sc
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer
print("[INFO] === VARIANT ANALYSIS START ===")
print(f"[INFO] Filtraggio: {variant_filter_level} | Rappresentazione: {variant_rep} | PCA: {n_pcs}")
X = adata.X.copy()
n_total = X.shape[1]
    # --- filter thresholds ---
if variant_filter_level.lower() == "none":
min_cells_eff, min_counts_eff = 0, 0
elif variant_filter_level.lower() == "low":
min_cells_eff = max(1, int(min_cells * 0.3))
min_counts_eff = max(1, int(min_counts * 0.3))
else: # "Norm"
min_cells_eff, min_counts_eff = min_cells, min_counts
    # --- filter out rare/weak variants ---
if sparse.issparse(X):
counts_per_var = np.asarray(X.sum(axis=0)).ravel()
cells_per_var = np.asarray((X > 0).sum(axis=0)).ravel()
else:
counts_per_var = X.sum(axis=0)
cells_per_var = (X > 0).sum(axis=0)
mask = (cells_per_var >= min_cells_eff) & (counts_per_var >= min_counts_eff)
removed = int((~mask).sum())
if removed > 0:
adata = adata[:, mask].copy()
X = adata.X
print(f"[INFO] Varianti filtrate: {removed}/{n_total} → kept {adata.n_vars}")
# --- rappresentazione varianti ---
if variant_rep == "binary":
if sparse.issparse(X):
X_rep = (X > 0).astype("float32")
else:
X_rep = (X > 0).astype("float32")
elif variant_rep == "lognorm":
# counts → log1p(normalized)
if sparse.issparse(X):
lib = np.asarray(X.sum(axis=1)).ravel()
lib[lib == 0] = 1.0
X_norm = X.multiply(1.0 / lib[:, None])
X_rep = X_norm
else:
lib = X.sum(axis=1, keepdims=True)
lib[lib == 0] = 1.0
X_rep = X / lib
# log1p
if sparse.issparse(X_rep):
X_rep = X_rep.tocsr(copy=True)
X_rep.data = np.log1p(X_rep.data)
else:
X_rep = np.log1p(X_rep)
    else:  # "tfidf" (default, recommended for binary/sparse data)
tfidf = TfidfTransformer(norm=None, use_idf=True, smooth_idf=True, sublinear_tf=False)
if sparse.issparse(X):
X_rep = tfidf.fit_transform(X)
else:
X_rep = tfidf.fit_transform(sparse.csr_matrix(X))
        # L2-normalize each cell for comparability
from sklearn.preprocessing import normalize
X_rep = normalize(X_rep, norm="l2", axis=1)
# --- scaling & PCA (facoltativa) ---
if n_pcs is not None:
sc.pp.scale(adata, zero_center=False) # solo per settare .var e compatibilità
# PCA con arpack non accetta direttamente sparse: convertiamo
from sklearn.decomposition import PCA, TruncatedSVD
if sparse.issparse(X_rep):
            # Truncated SVD is more stable on sparse input
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_rep)
else:
pca = PCA(n_components=n_pcs, random_state=42)
embedding = pca.fit_transform(X_rep)
else:
embedding = X_rep.toarray() if sparse.issparse(X_rep) else X_rep
    # --- save the embedding for integration + QC UMAP ---
from sklearn.preprocessing import StandardScaler
adata.uns["variant_X"] = StandardScaler(with_mean=True, with_std=True).fit_transform(
np.asarray(embedding, dtype=np.float32)
)
    # QC UMAP (2D), for inspection only
    sc.pp.neighbors(adata, use_rep=None, n_neighbors=15, n_pcs=None)  # not used for variant_X
import umap
emb2d = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42).fit_transform(adata.uns["variant_X"])
adata.obsm["variant_umap"] = emb2d
print(f"[INFO] Salvato adata.uns['variant_X'] shape={adata.uns['variant_X'].shape}")
print("[INFO] === VARIANT ANALYSIS DONE ===")
return adata
def calcOmicsClusters(adata, omic_key, res=0.5, n_neighbors=None, n_pcs=None):
"""
Compute clustering (Leiden) on a given omic embedding (trans, variant, int).
Automatically adapts to 2D UMAPs or high-dimensional embeddings.
"""
import scanpy as sc
import numpy as np
# Default n_neighbors
if n_neighbors is None:
n_neighbors = 15
# --- Detect representation type ---
use_rep = omic_key + "_umap"
if use_rep in adata.obsm:
# UMAP is 2D -> skip n_pcs
print(f"[INFO] Using 2D UMAP embedding: {use_rep}")
sc.pp.neighbors(adata, n_neighbors=n_neighbors, use_rep=use_rep, key_added=f"{omic_key}_neighbors")
elif omic_key + "_X" in adata.uns:
# High-dimensional representation saved in uns
X = adata.uns[omic_key + "_X"]
adata.obsm[omic_key + "_X"] = X
dims = X.shape[1]
print(f"[INFO] Using high-dimensional embedding: {omic_key}_X ({dims} dims)")
sc.pp.neighbors(
adata,
n_neighbors=n_neighbors,
n_pcs=min(n_pcs or dims, dims),
use_rep=omic_key + "_X",
key_added=f"{omic_key}_neighbors"
)
else:
raise ValueError(f"No valid representation found for omic_key='{omic_key}'")
# --- Leiden clustering ---
sc.tl.leiden(
adata,
resolution=res,
key_added=f"{omic_key}_clust_{res}",
neighbors_key=f"{omic_key}_neighbors"
)
return adata
def weightsInit(m):
if isinstance(m, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)  # in-place initializer; xavier_uniform is deprecated
m.bias.data.zero_()
def omicsIntegration(
adata,
transcript_key='trans',
variant_key='variant',
integration_key='int',
latent_dim=32,
num_epoch=300,
lam_align=0.5,
lam_recon_a=1.0,
lam_recon_b=1.0,
lam_cross=2.0,
seed=42,
):
"""
    RNA+VAR integration with an asymmetric autoencoder (RNA = teacher, VAR = student)
    plus an optional CCA comparison; the better embedding is selected automatically.
"""
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler
import umap, scanpy as sc
print("[INFO] === OMICS INTEGRATION (Asymmetric AE + CCA) START ===")
assert transcript_key + "_X" in adata.uns, f"Missing adata.uns['{transcript_key}_X']"
assert variant_key + "_X" in adata.uns, f"Missing adata.uns['{variant_key}_X']"
X_a = np.asarray(adata.uns[transcript_key + "_X"], dtype=np.float32)
X_b = np.asarray(adata.uns[variant_key + "_X"], dtype=np.float32)
assert X_a.shape[0] == X_b.shape[0], "RNA/VAR rows (cells) must match"
    # --- Balance global variances ---
X_a /= np.std(X_a)
X_b /= np.std(X_b)
    # --- 1️⃣ Asymmetric autoencoder ---
_, zA, zB = train_paired_ae(
X_a, X_b,
latent_dim=latent_dim,
num_epoch=num_epoch,
lam_align=lam_align,
lam_recon_a=lam_recon_a,
lam_recon_b=lam_recon_b,
lam_cross=lam_cross,
verbose=True,
)
simAE = float(np.mean(np.sum(zA * zB, axis=1)))
zAE = 0.5 * (zA + zB)
    # --- 2️⃣ LINEAR CCA ---
Xa_s = StandardScaler(with_mean=True, with_std=True).fit_transform(X_a)
Xb_s = StandardScaler(with_mean=True, with_std=True).fit_transform(X_b)
cca = CCA(n_components=min(latent_dim, Xa_s.shape[1], Xb_s.shape[1]))
Za, Zb = cca.fit_transform(Xa_s, Xb_s)
Za = (Za / np.linalg.norm(Za, axis=1, keepdims=True).clip(1e-9)).astype(np.float32)
Zb = (Zb / np.linalg.norm(Zb, axis=1, keepdims=True).clip(1e-9)).astype(np.float32)
simCCA = float(np.mean(np.sum(Za * Zb, axis=1)))
zCCA = 0.5 * (Za + Zb)
    # --- 3️⃣ Choose the better embedding ---
if simCCA >= simAE:
chosen, z, sim = "CCA", zCCA, simCCA
else:
chosen, z, sim = "AE", zAE, simAE
    # --- 4️⃣ Save embedding and metrics ---
adata.uns[integration_key + "_X"] = z.astype(np.float32)
adata.uns[integration_key + "_metrics"] = {
"simAE": simAE,
"simCCA": simCCA,
"chosen": chosen,
"latent_dim": int(latent_dim),
"num_epoch": int(num_epoch),
"lam_align": float(lam_align),
"lam_recon_b": float(lam_recon_b),
"lam_cross": float(lam_cross),
}
    # --- 5️⃣ Compute the final UMAP ---
um = umap.UMAP(n_neighbors=15, min_dist=0.05, random_state=seed).fit_transform(z)
adata.obsm[integration_key + "_umap"] = um
print(f"[INT] AE={simAE:.3f} | CCA={simCCA:.3f} → chosen: {chosen}")
print("[INFO] === OMICS INTEGRATION DONE ===")
return adata
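# --- Usage sketch (illustrative only) ---
# This variant of omicsIntegration trains the asymmetric AE, computes a linear CCA
# on the same inputs, and keeps whichever embedding shows the higher mean cosine
# similarity between the paired views. The call below mirrors the defaults; the
# chosen method is recorded in adata.uns["int_metrics"]["chosen"].
#
#   adata = omicsIntegration(adata, latent_dim=32, num_epoch=300)
#   print(adata.uns["int_metrics"])   # {"simAE": ..., "simCCA": ..., "chosen": "AE" or "CCA", ...}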
class pairedIntegration(torch.nn.Module):
def __init__(self,input_dim_a=2000,input_dim_b=2000,clf_out=10):
super(pairedIntegration, self).__init__()
self.input_dim_a = input_dim_a
self.input_dim_b = input_dim_b
self.clf_out = clf_out
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_a, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_b, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.clf = torch.nn.Sequential(
torch.nn.Linear(128, self.clf_out),
torch.nn.Softmax(dim=1))
self.feature = torch.nn.Sequential(
torch.nn.Linear(128, 32))
def forward(self, x_a,x_b):
out_a = self.encoder_a(x_a)
f_a = self.feature(out_a)
y_a = self.clf(out_a)
out_b = self.encoder_b(x_b)
f_b = self.feature(out_b)
y_b = self.clf(out_b)
return f_a,y_a,f_b,y_b
def pairedIntegrationTrainer(X_a, X_b, model, batch_size = 512, num_epoch=5,
f_temp = 0.1, p_temp = 1.0):
device = torch.device("cpu")
f_con = contrastiveLoss(batch_size = batch_size,temperature = f_temp)
p_con = contrastiveLoss(batch_size = model.clf_out,temperature = p_temp)
opt = torch.optim.SGD(model.parameters(),lr=0.01, momentum=0.9,weight_decay=5e-4)
for k in range(num_epoch):
model.to(device)
n = X_a.shape[0]
r = np.random.permutation(n)
X_train_a = X_a[r,:]
X_tensor_A=torch.tensor(X_train_a).float()
X_train_b = X_b[r,:]
X_tensor_B=torch.tensor(X_train_b).float()
losses = 0
for j in range(n//batch_size):
inputs_a = X_tensor_A[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_a2 = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_a = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_b = X_tensor_B[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_b = inputs_b + torch.normal(0,1,inputs_b.shape).to(device)
feas,o,nfeas,no = model(inputs_a,inputs_b)
feas2,o2,_,_ = model(inputs_a2,inputs_b)
fea_mi = f_con(feas,nfeas)+f_con(feas,feas2)
p_mi = p_con(o.T,no.T)+p_con(o.T,o2.T)
loss = fea_mi + p_mi
opt.zero_grad()
loss.backward()
opt.step()
losses += loss.data.tolist()
print("Total loss: "+str(round(losses,4)))
gc.collect()
class contrastiveLoss(torch.nn.Module):
def __init__(self, batch_size, temperature=0.5):
super().__init__()
self.batch_size = batch_size
self.register_buffer("temperature", torch.tensor(temperature))
self.register_buffer("negatives_mask", (~torch.eye(batch_size * 2, batch_size * 2,
dtype=bool)).float())
def forward(self, emb_i, emb_j):
# """
# emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
# z_i, z_j as per SimCLR paper
# """
z_i = F.normalize(emb_i, dim=1,p=2)
z_j = F.normalize(emb_j, dim=1,p=2)
representations = torch.cat([z_i, z_j], dim=0)
similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
sim_ij = torch.diag(similarity_matrix, self.batch_size)
sim_ji = torch.diag(similarity_matrix, -self.batch_size)
positives = torch.cat([sim_ij, sim_ji], dim=0)
nominator = torch.exp(positives / self.temperature)
denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)
loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))
loss = torch.sum(loss_partial) / (2 * self.batch_size)
return loss
def distributionClusters(adata, control_cl, group_cl, perc_cell_to_show=0.1, figsize = (12,8), dpi=100, save_path=None):
df = adata.obs.groupby([group_cl, control_cl]).size().unstack()
    df = df.loc[df.sum(axis=1)/df.values.sum()>=perc_cell_to_show,:] # drop rows whose group_cl cluster accounts for a smaller fraction of all cells than perc_cell_to_show
df_rel = df
df = df.div(df.sum(axis=1), axis=0)
df[group_cl] = df.index
plt.rcParams["figure.figsize"] = figsize
plt.rcParams['figure.dpi'] = dpi
plt.rcParams['figure.facecolor'] = '#FFFFFF'
df.plot(
x = group_cl,
kind = 'barh',
stacked = True,
mark_right = True,
)
leg = plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
leg.get_frame().set_edgecolor('black')
plt.xlabel('perc_'+control_cl, loc='center')
for n in df_rel:
for i, (pos_x, ab, value) in enumerate(zip(df.iloc[:, :-1].cumsum(1)[n], df[n], df_rel[n])):
if (value == 0) | (ab <=0.05):
value = ''
plt.text(pos_x-ab/2, i, str(value), va='center', ha='center')
plt.grid(False)
if save_path is not None:
plt.savefig(save_path, bbox_inches='tight')
return(plt.show())
# ================================================================
# 🔧 OUTPUT HANDLER: save UMAP plots as PNG files in the output folder
# ================================================================
import os
import matplotlib.pyplot as plt
import scanpy as sc
def save_all_umaps(adata, prefix="output", color_by=None, dpi=300):
"""
    Save every UMAP embedding found in adata.obsm as a PNG image.
    The AnnData object itself is not saved; no h5ad file is written.
"""
import os
import scanpy as sc
import matplotlib.pyplot as plt
os.makedirs(prefix, exist_ok=True)
print(f"[INFO] Cartella di output: {prefix}")
# === Trova tutte le UMAP disponibili ===
umap_keys = [k for k in adata.obsm.keys() if k.endswith("_umap") or k == "X_umap"]
if not umap_keys:
print("[WARN] Nessuna UMAP trovata in adata.obsm")
return
print(f"[INFO] UMAP trovate: {umap_keys}")
# === Determina cosa usare come colore ===
if color_by is None:
cluster_cols = [c for c in adata.obs.columns if "clust" in c.lower()]
color_by = cluster_cols if cluster_cols else ["n_genes"]
elif isinstance(color_by, str):
color_by = [color_by]
print(f"[INFO] Colorazioni da usare: {color_by}")
# === Salva ogni combinazione UMAP × colore ===
for key in umap_keys:
for color in color_by:
fig_path = os.path.join(prefix, f"{key}_{color}.png")
try:
sc.pl.embedding(
adata,
basis=key,
color=color,
frameon=False,
show=False,
)
plt.savefig(fig_path, dpi=dpi, bbox_inches="tight")
plt.close()
print(f"[OK] Salvata {fig_path}")
except Exception as e:
print(f"[WARN] Errore nel salvare {fig_path}: {e}")
print("[✅] Tutte le UMAP salvate con successo.")
# import packs
import numpy as np
import pandas as pd
import scanpy as sc
import gc
import torch
import torch.nn.functional as F
import umap
import anndata
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.metrics import normalized_mutual_info_score
from sklearn.feature_extraction.text import TfidfTransformer
from scipy import io
from scipy import sparse
import matplotlib.pyplot as plt
### FUNCTION ###
################
def transcriptomicAnalysis(
path_10x,
bcode_variants,
high_var_genes_min_mean=0.015,
high_var_genes_max_mean=3,
high_var_genes_min_disp=0.75,
min_genes=200,
min_cells=3,
max_pct_mt=100,
n_neighbors=20,
n_pcs=30,
transcript_key='trans',
manual_matrix_reading=False,
force_float64=True,
):
"""
Preprocess transcriptomic (RNA) data from 10x Genomics format.
    Normalization and scaling now replicate the Muon pipeline exactly:
    normalize_total → log1p → scale(max_value=10)
"""
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy import io
from sklearn.preprocessing import StandardScaler
np.set_printoptions(precision=6, suppress=False, linewidth=140)
print("[INFO] === TRANSCRIPTOMIC ANALYSIS START ===")
    # === Read RNA data ===
if manual_matrix_reading:
rna = path_10x + '/matrix.mtx.gz'
barcode = path_10x + '/barcodes.tsv.gz'
feat = path_10x + '/features.tsv.gz'
M = io.mmread(rna).astype(np.float64 if force_float64 else np.float32)
B = M.T # (cells x genes)
barcode = pd.read_csv(barcode, sep='\t', header=None, names=['bcode'])
barcode.index = barcode['bcode']
feat = pd.read_csv(feat, sep='\t', header=None,
names=['gene_ids', 'gene_symbol', 'feature_types'])
feat.index = feat['gene_symbol']
adata = anndata.AnnData(X=B, obs=barcode, var=feat)
adata.X = sp.csr_matrix(adata.X)
else:
adata = sc.read_10x_mtx(path_10x, var_names='gene_symbols', cache=True)
adata.X = adata.X.astype(np.float64 if force_float64 else np.float32)
    # === Keep a raw copy ===
if not sp.issparse(adata.X):
adata.X = sp.csr_matrix(adata.X)
adata.uns[f"{transcript_key}_raw"] = adata.X.copy()
adata.uns[f"{transcript_key}_raw_obs_names"] = np.array(adata.obs_names)
adata.uns[f"{transcript_key}_raw_var_names"] = np.array(adata.var_names)
print(f"[DEBUG] Raw RNA matrix shape: {adata.X.shape}")
raw_block = adata.X[:5, :5].toarray().astype(float)
print(f"[DEBUG] Prime 5×5 celle RAW (counts):\n{np.array2string(raw_block, precision=5, floatmode='maxprec_equal')}")
# === Filtra barcode presenti anche nel file varianti ===
bcode_var = pd.read_csv(bcode_variants, sep='\t', header=None)[0]
adata = adata[adata.obs.index.isin(bcode_var)].copy()
    # === Filtering and QC ===
sc.pp.filter_cells(adata, min_genes=min_genes)
sc.pp.filter_genes(adata, min_cells=min_cells)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
adata = adata[adata.obs.pct_counts_mt < max_pct_mt, :].copy()
    # === Muon-style normalization ===
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.scale(adata, max_value=10)
    # === Remove any NaN/inf values ===
if sp.issparse(adata.X):
adata.X.data[np.isnan(adata.X.data)] = 0
adata.X.data[np.isinf(adata.X.data)] = 0
else:
adata.X = np.nan_to_num(adata.X, nan=0.0, posinf=0.0, neginf=0.0)
print(f"[DEBUG] RNA normalizzata e scalata (Muon-style). Shape: {adata.X.shape}")
norm_block = adata.X[:5, :5].toarray().astype(float) if sp.issparse(adata.X) else adata.X[:5, :5].astype(float)
print(f"[DEBUG] Prime 5×5 celle RNA (normalizzate):\n{np.array2string(norm_block, precision=4, floatmode='maxprec_equal')}")
# === PCA opzionale ===
if n_pcs is not None:
print(f"[INFO] PCA con {n_pcs} componenti...")
sc.pp.pca(adata, n_comps=n_pcs)
sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs, random_state=42)
sc.tl.umap(adata, random_state=42)
adata.obsm[f"{transcript_key}_umap"] = adata.obsm["X_umap"]
embedding = adata.obsm["X_pca"]
else:
print("[INFO] PCA disabilitata → uso matrice normalizzata intera.")
embedding = adata.X.toarray() if sp.issparse(adata.X) else adata.X
    # === Scale to a uniform range (without centering) ===
scaler = StandardScaler(with_mean=False)
adata.uns[f"{transcript_key}_X"] = scaler.fit_transform(embedding)
print(f"[INFO] Salvato adata.uns['{transcript_key}_X'] shape={adata.uns[f'{transcript_key}_X'].shape}")
print(f"[INFO] === TRANSCRIPTOMIC ANALYSIS DONE ({adata.n_obs} cells, {adata.n_vars} genes) ===")
return adata
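# Illustrative usage sketch for transcriptomicAnalysis (assumption: the paths below
# are placeholders for a 10x matrix folder and the barcode list used for variant
# calling; they are not files shipped with the repository).
def _example_rna_preprocessing():
    adata = transcriptomicAnalysis(
        path_10x="path/to/filtered_feature_bc_matrix",
        bcode_variants="path/to/vartrix_barcodes.tsv",
        min_genes=200,
        min_cells=3,
        n_pcs=30,
    )
    # adata.uns['trans_X'] now holds the scaled PCA embedding used downstream.
    return adata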
def _prep_embedding_for_training(X):
"""
Z-score per feature, poi L2 per cella. Restituisce float32 numpy.
"""
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
X = np.asarray(X, dtype=np.float32)
X = StandardScaler(with_mean=True, with_std=True).fit_transform(X)
X = normalize(X, norm="l2", axis=1)
return X
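# Minimal self-check sketch for _prep_embedding_for_training on random data
# (assumption: purely illustrative, not part of the package API).
def _example_prep_embedding():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    Xp = _prep_embedding_for_training(X)
    # After the L2 step every row has (approximately) unit norm.
    assert np.allclose(np.linalg.norm(Xp, axis=1), 1.0, atol=1e-4)
    return Xp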
class PairedAE(torch.nn.Module):
"""
Due encoder (RNA/VAR) e due decoder con bottleneck condiviso (latent_dim).
Loss: ricostruzione + allineamento (cosine) tra z_rna e z_var.
"""
def __init__(self, dim_a, dim_b, latent_dim=32, hidden=256):
super().__init__()
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(dim_a, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(dim_b, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim),
)
self.decoder_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_a),
)
self.decoder_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, dim_b),
)
def forward(self, xa, xb):
za = self.encoder_a(xa)
zb = self.encoder_b(xb)
ra = self.decoder_a(za)
rb = self.decoder_b(zb)
return za, zb, ra, rb
def train_paired_ae(
Xa, Xb,
latent_dim=32,
hidden=256,
batch_size=256,
num_epoch=300,
lr=1e-3,
    lam_align=0.5,       # alignment strength
    lam_recon_a=1.0,     # RNA reconstruction
    lam_recon_b=3.0,     # VAR reconstruction
    lam_cross=2.0,       # RNA→VAR consistency
verbose=True,
):
"""
Paired Autoencoder asimmetrico:
RNA = teacher, VAR = student. Loss = recA + recB + cross(A→B) + cosine alignment.
"""
import torch, torch.nn.functional as F, numpy as np
Xa = _prep_embedding_for_training(Xa)
Xb = _prep_embedding_for_training(Xb)
Xa, Xb = np.asarray(Xa, np.float32), np.asarray(Xb, np.float32)
n, da = Xa.shape
_, db = Xb.shape
class AsymAE(torch.nn.Module):
def __init__(self, da, db, latent_dim, hidden):
super().__init__()
self.enc_a = torch.nn.Sequential(
torch.nn.Linear(da, hidden), torch.nn.ReLU(),
torch.nn.Dropout(0.1),
torch.nn.Linear(hidden, latent_dim)
)
self.enc_b = torch.nn.Sequential(
torch.nn.Linear(db, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, latent_dim)
)
self.dec_a = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, da)
)
self.dec_b = torch.nn.Sequential(
torch.nn.Linear(latent_dim, hidden), torch.nn.ReLU(),
torch.nn.Linear(hidden, db)
)
def forward(self, xa, xb):
za = self.enc_a(xa)
zb = self.enc_b(xb)
ra = self.dec_a(za)
rb = self.dec_b(zb)
return za, zb, ra, rb
device = torch.device("cpu")
model = AsymAE(da, db, latent_dim, hidden).to(device)
opt = torch.optim.Adam(model.parameters(), lr=lr)
cos = torch.nn.CosineEmbeddingLoss(margin=0.0)
Xa_t, Xb_t = torch.tensor(Xa), torch.tensor(Xb)
for ep in range(1, num_epoch + 1):
perm = torch.randperm(n)
total = 0
for i in range(0, n, batch_size):
idx = perm[i:i+batch_size]
xa, xb = Xa_t[idx].to(device), Xb_t[idx].to(device)
za, zb, ra, rb = model(xa, xb)
rec_a = F.mse_loss(ra, xa)
rec_b = F.mse_loss(rb, xb)
            cross = F.mse_loss(zb, za.detach())  # VAR is pulled towards the (detached) RNA code
y = torch.ones(za.shape[0], device=device)
align = cos(za, zb, y)
loss = lam_recon_a*rec_a + lam_recon_b*rec_b + lam_cross*cross + lam_align*align
opt.zero_grad()
loss.backward()
opt.step()
total += loss.item()
if verbose and (ep % max(1, num_epoch//10) == 0 or ep==1):
print(f"[AE] Epoch {ep:04d}/{num_epoch} | loss={total:.3f} | recA={rec_a:.3f} recB={rec_b:.3f} cross={cross:.3f}")
with torch.no_grad():
zA, zB, _, _ = model(Xa_t, Xb_t)
zA = F.normalize(zA, dim=1).cpu().numpy()
zB = F.normalize(zB, dim=1).cpu().numpy()
return model, zA, zB
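# Illustrative sketch: run train_paired_ae on random paired matrices for a few epochs
# (assumption: shapes and epoch count are arbitrary, chosen only to show the call
# signature and the returned, L2-normalized latent codes).
def _example_train_paired_ae():
    rng = np.random.default_rng(0)
    Xa = rng.normal(size=(300, 40)).astype(np.float32)  # stand-in for the RNA embedding
    Xb = rng.normal(size=(300, 60)).astype(np.float32)  # stand-in for the variant embedding
    model, zA, zB = train_paired_ae(Xa, Xb, latent_dim=16, num_epoch=5, verbose=False)
    return zA, zB  # both (300, 16); rows of zB are trained to follow the matching rows of zA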
def variantAnalysis(
adata,
matrix_path=None,
bcode_path=None,
variants_path=None,
min_cells=5,
max_cell_fraction=0.95,
variant_filter_level="norm",
n_pcs=50,
variant_rep="muon", # default comportamentale come Muon
variant_key="variant",
):
"""
Preprocess variant (DNA) data.
Supporta più rappresentazioni:
- muon/binary: binaria + scalata (default)
- tfidf: TF-IDF + L2 norm
- lognorm: log1p normalizzato
In tutti i casi salva:
- adata.uns['variant_raw'] matrice grezza
- adata.uns['variant_X'] embedding numerico (PCA/scaled)
- adata.obsm['variant_umap'] embedding 2D per QC
"""
import scanpy as sc
import numpy as np
import pandas as pd
from scipy import sparse, io
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
import umap
np.set_printoptions(precision=6, suppress=False, linewidth=140)
print("[INFO] === VARIANT ANALYSIS START ===")
print(f"[INFO] Lettura file: {matrix_path}")
# --- Lettura matrice ---
var_mtx = io.mmread(matrix_path)
X = sparse.csr_matrix(var_mtx)
barcodes = pd.read_csv(bcode_path, sep="\t", header=None)[0].astype(str).values
variants = pd.read_csv(variants_path, sep="\t", header=None)[0].astype(str).values
    # --- Fix orientation ---
if X.shape[0] == len(variants) and X.shape[1] == len(barcodes):
print(f"[WARN] Transposing variant matrix {X.shape} → expected (cells × variants)")
X = X.T
elif X.shape[1] != len(variants):
print(f"[WARN] Dimensioni inconsuete {X.shape}, provo a trasporre per sicurezza")
X = X.T
print(f"[INFO] Matrice varianti caricata → {X.shape[0]} celle × {X.shape[1]} varianti")
# === Debug blocco RAW ===
raw_block = X[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle RAW (counts):\n{np.round(raw_block, 2)}")
    # --- Align with RNA ---
rna_barcodes = np.array([b.split('-')[0] for b in adata.obs_names])
var_barcodes = np.array([b.split('-')[0] for b in barcodes])
common = np.intersect1d(var_barcodes, rna_barcodes)
if len(common) == 0:
raise ValueError("[ERROR] Nessun barcode comune tra RNA e VAR.")
order_idx = np.array([np.where(var_barcodes == b)[0][0] for b in rna_barcodes if b in var_barcodes])
X = X[order_idx, :]
barcodes = barcodes[order_idx]
print(f"[INFO] Celle comuni con RNA: {len(barcodes)}")
# --- Salva GREZZI ---
adata.uns[f"{variant_key}_raw"] = X.copy()
adata.uns[f"{variant_key}_raw_obs_names"] = barcodes
adata.uns[f"{variant_key}_raw_var_names"] = variants
    # === Choose the representation mode ===
    rep = variant_rep.lower().strip()
    print(f"[INFO] Selected representation: {rep.upper()}")
# ===============================
# MUON / BINARY (default)
# ===============================
if rep in ["muon", "binary"]:
print("[INFO] Uso pipeline Muon-style (binary + scale)")
import scanpy as sc
X_bin = (X > 0).astype(float)
dna = sc.AnnData(X_bin)
dna.obs_names = barcodes
dna.var_names = variants
        # Filter out rare and overly common variants
sc.pp.filter_genes(dna, min_cells=min_cells)
freq = np.array(dna.X.sum(axis=0)).flatten() / dna.n_obs
keep = freq < max_cell_fraction
dna = dna[:, keep].copy()
print(f"[INFO] Varianti mantenute: {dna.n_vars}/{len(variants)}")
# Scala
sc.pp.scale(dna, max_value=10)
norm_block = dna.X[:5, :5].toarray() if sparse.issparse(dna.X) else dna.X[:5, :5]
print(f"[DEBUG] Prime 5×5 celle DNA (normalizzate):\n{np.array2string(norm_block, precision=4, floatmode='maxprec_equal')}")
sc.tl.pca(dna, n_comps=n_pcs)
sc.pp.neighbors(dna, n_pcs=min(30, n_pcs))
sc.tl.umap(dna, random_state=42)
adata.uns[f"{variant_key}_X"] = dna.obsm["X_pca"]
adata.obsm[f"{variant_key}_umap"] = dna.obsm["X_umap"]
# ===============================
# TF-IDF
# ===============================
elif rep == "tfidf":
print("[INFO] Uso rappresentazione TF-IDF + L2 norm")
tfidf = TfidfTransformer(norm=None, use_idf=True, smooth_idf=True)
X_rep = tfidf.fit_transform(X)
X_rep = normalize(X_rep, norm="l2", axis=1)
rep_block = X_rep[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle TFIDF:\n{np.round(rep_block, 4)}")
# PCA o SVD
if sparse.issparse(X_rep):
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_rep)
else:
pca = PCA(n_components=n_pcs, random_state=42)
embedding = pca.fit_transform(X_rep)
scaler = StandardScaler(with_mean=True, with_std=True)
adata.uns[f"{variant_key}_X"] = scaler.fit_transform(np.asarray(embedding, dtype=np.float32))
# ===============================
# LOGNORM
# ===============================
elif rep == "lognorm":
print("[INFO] Uso rappresentazione log1p-normalized")
lib = np.asarray(X.sum(axis=1)).ravel()
lib[lib == 0] = 1.0
X_norm = X.multiply(1.0 / lib[:, None])
X_norm.data = np.log1p(X_norm.data)
rep_block = X_norm[:5, :5].toarray()
print(f"[DEBUG] Prime 5×5 celle LOGNORM:\n{np.round(rep_block, 4)}")
if sparse.issparse(X_norm):
svd = TruncatedSVD(n_components=n_pcs, random_state=42)
embedding = svd.fit_transform(X_norm)
else:
pca = PCA(n_components=n_pcs, random_state=42)
embedding = pca.fit_transform(X_norm)
scaler = StandardScaler(with_mean=True, with_std=True)
adata.uns[f"{variant_key}_X"] = scaler.fit_transform(np.asarray(embedding, dtype=np.float32))
    # ===============================
    # ERROR ON INVALID MODE
    # ===============================
    else:
        raise ValueError(f"[ERROR] variant_rep '{variant_rep}' not recognized. Use 'muon', 'binary', 'tfidf' or 'lognorm'.")
    # --- QC UMAP (only if not already computed) ---
if f"{variant_key}_umap" not in adata.obsm:
reducer = umap.UMAP(n_neighbors=10, min_dist=0.05, random_state=42)
emb2d = reducer.fit_transform(adata.uns[f"{variant_key}_X"])
adata.obsm[f"{variant_key}_umap"] = emb2d
print(f"[INFO] DNA PCA shape: {adata.uns[f'{variant_key}_X'].shape}")
print(f"[INFO] === VARIANT ANALYSIS DONE ({X.shape[0]} cells, {X.shape[1]} variants) ===")
return adata
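# Illustrative usage sketch for variantAnalysis (assumption: the three paths are
# placeholders for the variant-by-cell matrix, its barcode list and its variant list;
# adata is the object returned by transcriptomicAnalysis).
def _example_variant_preprocessing(adata):
    adata = variantAnalysis(
        adata,
        matrix_path="path/to/vartrix_matrix.mtx",
        bcode_path="path/to/vartrix_barcodes.tsv",
        variants_path="path/to/variants.txt",
        variant_rep="muon",   # or "tfidf" / "lognorm"
        n_pcs=50,
    )
    # adata.uns['variant_X'] and adata.obsm['variant_umap'] are now available.
    return adata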
def calcOmicsClusters(adata, omic_key, res=0.5, n_neighbors=None, n_pcs=None):
"""
Compute clustering (Leiden) on a given omic embedding (trans, variant, int).
Automatically adapts to 2D UMAPs or high-dimensional embeddings.
"""
import scanpy as sc
import numpy as np
# Default n_neighbors
if n_neighbors is None:
n_neighbors = 15
# --- Detect representation type ---
use_rep = omic_key + "_umap"
if use_rep in adata.obsm:
        # Case: 2D UMAP
print(f"[INFO] Using 2D UMAP embedding: {use_rep}")
sc.pp.neighbors(
adata,
n_neighbors=n_neighbors,
use_rep=use_rep,
key_added=f"{omic_key}_neighbors"
)
elif omic_key + "_X" in adata.uns:
        # Case: high-dimensional embedding stored in uns
        X = adata.uns.get(omic_key + "_X")
        if X is None or not isinstance(X, (np.ndarray, list)):
            raise ValueError(f"adata.uns['{omic_key}_X'] is invalid or empty.")
adata.obsm[omic_key + "_X"] = np.asarray(X, dtype=np.float32)
dims = adata.obsm[omic_key + "_X"].shape[1]
print(f"[INFO] Using high-dimensional embedding: {omic_key}_X ({dims} dims)")
sc.pp.neighbors(
adata,
n_neighbors=n_neighbors,
n_pcs=min(n_pcs or dims, dims),
use_rep=omic_key + "_X",
key_added=f"{omic_key}_neighbors"
)
else:
raise ValueError(f"No valid representation found for omic_key='{omic_key}'")
# --- Leiden clustering ---
sc.tl.leiden(
adata,
resolution=res,
key_added=f"{omic_key}_clust_{res}",
neighbors_key=f"{omic_key}_neighbors"
)
print(f"[INFO] Leiden clustering completato su {omic_key} con res={res}")
return adata
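# Illustrative sketch: Leiden clustering on each stored representation (assumption:
# the 'trans', 'variant' and 'int' keys have been populated by the functions above;
# resolution 0.5 is an arbitrary example).
def _example_leiden_per_omic(adata):
    for key in ("trans", "variant", "int"):
        adata = calcOmicsClusters(adata, omic_key=key, res=0.5)
    # Cluster labels are stored as adata.obs['<key>_clust_0.5'].
    return adata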
def weightsInit(m):
if isinstance(m, torch.nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight.data)  # in-place initializer (xavier_uniform is deprecated)
m.bias.data.zero_()
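# Illustrative sketch: weightsInit is meant to be used with Module.apply, which
# re-initializes every Linear layer (assumption: the PairedAE dimensions are arbitrary).
def _example_init_weights():
    model = PairedAE(dim_a=50, dim_b=40, latent_dim=8, hidden=64)
    model.apply(weightsInit)  # Xavier-uniform weights, zero biases for all Linear layers
    return model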
def omicsIntegration(
adata,
transcript_key="trans",
variant_key="variant",
integration_key="int",
latent_dim=32,
num_epoch=300,
lam_align=0.5,
lam_recon_a=1.0,
lam_recon_b=1.0,
lam_cross=2.0,
seed=42,
    res=None,   # 👈 optional: if given, also compute the integrated Leiden clustering
):
"""
Integrazione RNA+VAR:
- Se n_cells < 2000 → PCA concatenata (Muon-style)
- Se n_cells >= 2000 → Autoencoder asimmetrico (RNA=teacher, VAR=student)
Se res è specificato, calcola il Leiden integrato a quella risoluzione.
"""
import numpy as np
import scanpy as sc
print("[INFO] === OMICS INTEGRATION START ===")
    # --- check inputs ---
assert transcript_key + "_X" in adata.uns, f"Missing adata.uns['{transcript_key}_X']"
assert variant_key + "_X" in adata.uns, f"Missing adata.uns['{variant_key}_X']"
X_a = np.asarray(adata.uns[transcript_key + "_X"], dtype=np.float32)
X_b = np.asarray(adata.uns[variant_key + "_X"], dtype=np.float32)
assert X_a.shape[0] == X_b.shape[0], "RNA/VAR rows (cells) must match"
n_cells = X_a.shape[0]
    # --- balance variances ---
X_a /= np.std(X_a)
X_b /= np.std(X_b)
    # ==========================================================
    # === BLOCK 1: CONCATENATED PCA (small dataset < 2000) ===
    # ==========================================================
    if n_cells < 2000:
        print(f"[INFO] n_cells={n_cells} < 2000 → using Muon-style concatenated PCA")
from sklearn.decomposition import PCA
        # --- reuse existing PCAs when available ---
        if f"{transcript_key}_pca" in adata.obsm and f"{variant_key}_pca" in adata.obsm:
            print(f"[INFO] Using pre-existing PCAs from adata.obsm['{transcript_key}_pca'] and adata.obsm['{variant_key}_pca']")
pca_rna = adata.obsm[f"{transcript_key}_pca"]
pca_var = adata.obsm[f"{variant_key}_pca"]
else:
print(f"[WARN] PCA non trovate in adata.obsm, le ricomputo localmente.")
n_comp = min(50, X_a.shape[1], X_b.shape[1])
pca_rna = PCA(n_components=n_comp, random_state=seed).fit_transform(X_a)
pca_var = PCA(n_components=n_comp, random_state=seed).fit_transform(X_b)
adata.obsm[f"{transcript_key}_pca"] = pca_rna
adata.obsm[f"{variant_key}_pca"] = pca_var
        # --- consistency check ---
        if pca_rna.shape[0] != pca_var.shape[0]:
            raise ValueError(f"RNA and VAR PCAs have different numbers of cells: {pca_rna.shape[0]} vs {pca_var.shape[0]}")
        # --- PCA concatenation ---
X_concat = np.concatenate([pca_rna, pca_var], axis=1)
adata.uns[integration_key + "_X"] = X_concat.copy()
adata.obsm["X_concat_pca"] = X_concat.copy()
print(f"[INFO] Concatenazione PCA completata: {X_concat.shape}")
concat_block = X_concat[:5, :5]
print(f"[DEBUG] Prime 5×5 celle CONCATENATE:\n{np.round(concat_block, 4)}")
print(f"[DEBUG] Somma prime 5 righe concatenata: {[round(X_concat[i,:].sum(), 4) for i in range(5)]}")
print(f"[DEBUG] Media varianza PCA RNA vs VAR: {pca_rna.var():.3f} / {pca_var.var():.3f}")
# --- Neighbors + UMAP integrato ---
sc.pp.neighbors(adata, use_rep="X_concat_pca", key_added=f"{integration_key}_neighbors")
sc.tl.umap(adata)
adata.obsm[f"{integration_key}_umap"] = adata.obsm["X_umap"].copy()
        # --- Optional Leiden ---
        if res is not None:
            key = f"{integration_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
            print(f"[INFO] Computing integrated Leiden for res={res}")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
neighbors_key=f"{integration_key}_neighbors",
flavor="igraph",
n_iterations=2,
directed=False,
random_state=seed,
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Creato {key} ({n_clusters} cluster)")
print("[INFO] === MUON-STYLE INTEGRATION COMPLETATA ===")
return adata
    # ==========================================================
    # === BLOCK 2: ASYMMETRIC AUTOENCODER (large dataset) ===
    # ==========================================================
    else:
        print(f"[INFO] n_cells={n_cells} ≥ 2000 → using asymmetric autoencoder (RNA→VAR)")
from sklearn.preprocessing import StandardScaler
import umap
_, zA, zB = train_paired_ae(
X_a,
X_b,
latent_dim=latent_dim,
num_epoch=num_epoch,
lam_align=lam_align,
lam_recon_a=lam_recon_a,
lam_recon_b=lam_recon_b,
lam_cross=lam_cross,
verbose=True,
)
simAE = float(np.mean(np.sum(zA * zB, axis=1)))
zAE = 0.5 * (zA + zB)
adata.uns[integration_key + "_X"] = zAE.astype(np.float32)
adata.uns[integration_key + "_metrics"] = {
"simAE": simAE,
"latent_dim": int(latent_dim),
"num_epoch": int(num_epoch),
"lam_align": float(lam_align),
"lam_recon_b": float(lam_recon_b),
"lam_cross": float(lam_cross),
}
um = umap.UMAP(n_neighbors=15, min_dist=0.05, random_state=seed).fit_transform(zAE)
adata.obsm[f"{integration_key}_umap"] = um
        # --- Optional Leiden ---
        if res is not None:
            key = f"{integration_key}_clust_{res:.2f}".rstrip("0").rstrip(".")
            print(f"[INFO] Computing integrated Leiden for res={res}")
sc.pp.neighbors(adata, use_rep=f"{integration_key}_X", key_added=f"{integration_key}_neighbors")
sc.tl.leiden(
adata,
resolution=res,
key_added=key,
neighbors_key=f"{integration_key}_neighbors",
flavor="igraph",
n_iterations=2,
directed=False,
random_state=seed,
)
n_clusters = adata.obs[key].nunique()
adata.obs[key] = adata.obs[key].astype("category")
adata.uns[key] = {"resolution": res, "n_clusters": n_clusters}
print(f"[INFO] → Creato {key} ({n_clusters} cluster)")
print(f"[INT] AE similarity={simAE:.3f}")
print("[INFO] === AUTOENCODER INTEGRATION COMPLETATA ===")
return adata
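# Illustrative usage sketch for omicsIntegration (assumption: adata has already been
# processed by transcriptomicAnalysis and variantAnalysis so that adata.uns['trans_X']
# and adata.uns['variant_X'] exist; res=0.5 is an arbitrary example resolution).
def _example_omics_integration(adata):
    adata = omicsIntegration(
        adata,
        transcript_key="trans",
        variant_key="variant",
        integration_key="int",
        latent_dim=32,
        num_epoch=300,
        res=0.5,   # also compute the integrated Leiden clustering
    )
    # adata.uns['int_X'] and adata.obsm['int_umap'] hold the integrated embedding.
    return adata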
class pairedIntegration(torch.nn.Module):
def __init__(self,input_dim_a=2000,input_dim_b=2000,clf_out=10):
super(pairedIntegration, self).__init__()
self.input_dim_a = input_dim_a
self.input_dim_b = input_dim_b
self.clf_out = clf_out
self.encoder_a = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_a, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.encoder_b = torch.nn.Sequential(
torch.nn.Linear(self.input_dim_b, 1000),
torch.nn.BatchNorm1d(1000),
torch.nn.ReLU(),
torch.nn.Linear(1000, 512),
torch.nn.BatchNorm1d(512),
torch.nn.ReLU(),
torch.nn.Linear(512, 128),
torch.nn.BatchNorm1d(128),
torch.nn.ReLU())
self.clf = torch.nn.Sequential(
torch.nn.Linear(128, self.clf_out),
torch.nn.Softmax(dim=1))
self.feature = torch.nn.Sequential(
torch.nn.Linear(128, 32))
def forward(self, x_a,x_b):
out_a = self.encoder_a(x_a)
f_a = self.feature(out_a)
y_a = self.clf(out_a)
out_b = self.encoder_b(x_b)
f_b = self.feature(out_b)
y_b = self.clf(out_b)
return f_a,y_a,f_b,y_b
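# Illustrative sketch: output shapes of a pairedIntegration forward pass on random
# input (assumption: dimensions are arbitrary; eval() is used so BatchNorm runs on a
# small batch with its default running statistics).
def _example_paired_integration_forward():
    model = pairedIntegration(input_dim_a=2000, input_dim_b=2000, clf_out=10)
    model.eval()
    xa, xb = torch.randn(4, 2000), torch.randn(4, 2000)
    f_a, y_a, f_b, y_b = model(xa, xb)
    # f_a / f_b: (4, 32) shared feature embeddings; y_a / y_b: (4, 10) soft cluster assignments
    return f_a.shape, y_a.shape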
def pairedIntegrationTrainer(X_a, X_b, model, batch_size = 512, num_epoch=5,
f_temp = 0.1, p_temp = 1.0):
device = torch.device("cpu")
f_con = contrastiveLoss(batch_size = batch_size,temperature = f_temp)
p_con = contrastiveLoss(batch_size = model.clf_out,temperature = p_temp)
opt = torch.optim.SGD(model.parameters(),lr=0.01, momentum=0.9,weight_decay=5e-4)
for k in range(num_epoch):
model.to(device)
n = X_a.shape[0]
r = np.random.permutation(n)
X_train_a = X_a[r,:]
X_tensor_A=torch.tensor(X_train_a).float()
X_train_b = X_b[r,:]
X_tensor_B=torch.tensor(X_train_b).float()
losses = 0
for j in range(n//batch_size):
inputs_a = X_tensor_A[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_a2 = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_a = inputs_a + torch.normal(0,1,inputs_a.shape).to(device)
inputs_b = X_tensor_B[j*batch_size:(j+1)*batch_size,:].to(device)
inputs_b = inputs_b + torch.normal(0,1,inputs_b.shape).to(device)
feas,o,nfeas,no = model(inputs_a,inputs_b)
feas2,o2,_,_ = model(inputs_a2,inputs_b)
fea_mi = f_con(feas,nfeas)+f_con(feas,feas2)
p_mi = p_con(o.T,no.T)+p_con(o.T,o2.T)
loss = fea_mi + p_mi
opt.zero_grad()
loss.backward()
opt.step()
losses += loss.data.tolist()
print("Total loss: "+str(round(losses,4)))
gc.collect()
class contrastiveLoss(torch.nn.Module):
def __init__(self, batch_size, temperature=0.5):
super().__init__()
self.batch_size = batch_size
self.register_buffer("temperature", torch.tensor(temperature))
self.register_buffer("negatives_mask", (~torch.eye(batch_size * 2, batch_size * 2,
dtype=bool)).float())
def forward(self, emb_i, emb_j):
# """
# emb_i and emb_j are batches of embeddings, where corresponding indices are pairs
# z_i, z_j as per SimCLR paper
# """
z_i = F.normalize(emb_i, dim=1,p=2)
z_j = F.normalize(emb_j, dim=1,p=2)
representations = torch.cat([z_i, z_j], dim=0)
similarity_matrix = F.cosine_similarity(representations.unsqueeze(1), representations.unsqueeze(0), dim=2)
sim_ij = torch.diag(similarity_matrix, self.batch_size)
sim_ji = torch.diag(similarity_matrix, -self.batch_size)
positives = torch.cat([sim_ij, sim_ji], dim=0)
nominator = torch.exp(positives / self.temperature)
denominator = self.negatives_mask * torch.exp(similarity_matrix / self.temperature)
loss_partial = -torch.log(nominator / torch.sum(denominator, dim=1))
loss = torch.sum(loss_partial) / (2 * self.batch_size)
return loss
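# Illustrative sketch of the SimCLR-style loss above (assumption: batch size,
# temperature and embedding width are arbitrary; paired rows act as positives).
def _example_contrastive_loss():
    torch.manual_seed(0)
    loss_fn = contrastiveLoss(batch_size=8, temperature=0.1)
    emb_i = torch.randn(8, 32)
    emb_j = emb_i + 0.05 * torch.randn(8, 32)  # noisy copies of emb_i → low loss expected
    return loss_fn(emb_i, emb_j)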
def distributionClusters(adata, control_cl, group_cl, perc_cell_to_show=0.1, figsize = (12,8), dpi=100, save_path=None):
df = adata.obs.groupby([group_cl, control_cl]).size().unstack()
    df = df.loc[df.sum(axis=1)/df.values.sum()>=perc_cell_to_show,:]  # drop group clusters whose share of all cells is below perc_cell_to_show
df_rel = df
df = df.div(df.sum(axis=1), axis=0)
df[group_cl] = df.index
plt.rcParams["figure.figsize"] = figsize
plt.rcParams['figure.dpi'] = dpi
plt.rcParams['figure.facecolor'] = '#FFFFFF'
df.plot(
x = group_cl,
kind = 'barh',
stacked = True,
mark_right = True,
)
leg = plt.legend(bbox_to_anchor=(1, 1), loc="upper left")
leg.get_frame().set_edgecolor('black')
plt.xlabel('perc_'+control_cl, loc='center')
for n in df_rel:
for i, (pos_x, ab, value) in enumerate(zip(df.iloc[:, :-1].cumsum(1)[n], df[n], df_rel[n])):
if (value == 0) | (ab <=0.05):
value = ''
plt.text(pos_x-ab/2, i, str(value), va='center', ha='center')
plt.grid(False)
if save_path is not None:
plt.savefig(save_path, bbox_inches='tight')
return(plt.show())
# ================================================================
# 🔧 OUTPUT HANDLER: save every UMAP in adata.obsm as a PNG under the output folder
# ================================================================
import os
import matplotlib.pyplot as plt
import scanpy as sc
def save_all_umaps(adata, prefix="output", color_by=None, dpi=300):
"""
Salva tutte le UMAP presenti in adata.obsm come immagini PNG.
Non salva l'AnnData, nessun h5ad viene scritto.
"""
import os
import scanpy as sc
import matplotlib.pyplot as plt
os.makedirs(prefix, exist_ok=True)
print(f"[INFO] Cartella di output: {prefix}")
# === Trova tutte le UMAP disponibili ===
umap_keys = [k for k in adata.obsm.keys() if k.endswith("_umap") or k == "X_umap"]
if not umap_keys:
print("[WARN] Nessuna UMAP trovata in adata.obsm")
return
print(f"[INFO] UMAP trovate: {umap_keys}")
# === Determina cosa usare come colore ===
if color_by is None:
cluster_cols = [c for c in adata.obs.columns if "clust" in c.lower()]
color_by = cluster_cols if cluster_cols else ["n_genes"]
elif isinstance(color_by, str):
color_by = [color_by]
print(f"[INFO] Colorazioni da usare: {color_by}")
# === Salva ogni combinazione UMAP × colore ===
for key in umap_keys:
for color in color_by:
fig_path = os.path.join(prefix, f"{key}_{color}.png")
try:
sc.pl.embedding(
adata,
basis=key,
color=color,
frameon=False,
show=False,
)
plt.savefig(fig_path, dpi=dpi, bbox_inches="tight")
plt.close()
print(f"[OK] Salvata {fig_path}")
except Exception as e:
print(f"[WARN] Errore nel salvare {fig_path}: {e}")
print("[✅] Tutte le UMAP salvate con successo.")
{
"cells": [
{
"cell_type": "code",
"execution_count": 11,
"id": "2fe6811b-dc90-4869-880c-3686f5da116f",
"metadata": {},
"outputs": [],
"source": [
"# !pip install muon scanpy mofapy2 # se servisse\n",
"import os\n",
"out_dir = \"../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/\"\n",
"os.makedirs(out_dir, exist_ok=True)\n",
"\n",
"# Config Scanpy\n",
"import scanpy as sc\n",
"sc.settings.figdir = os.path.join(out_dir, \"figures\")\n",
"os.makedirs(sc.settings.figdir, exist_ok=True)\n",
"sc.settings.autoshow = False"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "09df59a3-8faf-48f7-bbc5-102b750134ab",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import scanpy as sc\n",
"import muon as mu\n",
"from anndata import AnnData\n",
"from sklearn.metrics import adjusted_rand_score\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "8fa241bd-d809-4c1b-a18d-90462ec6e255",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RNA shape (genes × cells): (10000, 5000)\nDNA shape (variants × cells): (20000, 5000)",
"\n[DEBUG] Prime 5×5 celle RNA (righe=celle, colonne=geni):\n[[1.5445 0. 3.065 1.8446 2.3458]\n [0. 1.4837 0. 0. 1.6266]\n [1.3126 2.223 4.4167 4.5424 0. ]\n [5.8761 5.6104 4.0811 0. 4.1624]\n [2.2925 1.86 0. 0. 2.099 ]]\n\n[DEBUG] Prime 5×5 celle DNA (righe=celle, colonne=varianti):\n[[0 0 1 0 0]\n [0 0 0 0 0]\n [1 0 1 0 0]\n [0 1 0 0 0]\n [0 0 1 0 0]]",
"\n[INFO] RNA — valore medio: 2.1562, std: 1.9827",
"[INFO] DNA — valore medio: 0.1973, std: 0.3979"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from anndata import AnnData\n",
"\n",
"rna_csv = \"../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/gene_expression.csv\"\n",
"dna_csv = \"../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/variant_matrix.csv\"\n",
"\n",
"rna_df = pd.read_csv(rna_csv, index_col=0)\n",
"dna_df = pd.read_csv(dna_csv, index_col=0)\n",
"\n",
"print(f\"RNA shape (genes × cells): {rna_df.shape}\")\n",
"print(f\"DNA shape (variants × cells): {dna_df.shape}\")\n",
"\n",
"# Crea AnnData\n",
"rna_adata = AnnData(rna_df.T.copy())\n",
"dna_adata = AnnData(dna_df.T.copy())\n",
"\n",
"rna_adata.var_names.name = \"gene\"\n",
"dna_adata.var_names.name = \"variant\"\n",
"\n",
"# Verifica coerenza dei barcode\n",
"assert (rna_adata.obs_names == dna_adata.obs_names).all(), \"Cell names don't match!\"\n",
"\n",
"# === Debug: prime 5×5 celle per ciascuna omica ===\n",
"print(\"\\n[DEBUG] Prime 5×5 celle RNA (righe=celle, colonne=geni):\")\n",
"print(np.round(rna_adata.X[:5, :5], 4))\n",
"\n",
"print(\"\\n[DEBUG] Prime 5×5 celle DNA (righe=celle, colonne=varianti):\")\n",
"print(np.round(dna_adata.X[:5, :5], 4))\n",
"\n",
"# === Piccolo riassunto dei valori ===\n",
"print(f\"\\n[INFO] RNA — valore medio: {rna_adata.X.mean():.4f}, std: {rna_adata.X.std():.4f}\")\n",
"print(f\"[INFO] DNA — valore medio: {dna_adata.X.mean():.4f}, std: {dna_adata.X.std():.4f}\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b7824e6e-e6c4-4281-a4da-5a1b827fa000",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MuData object with n_obs × n_vars = 5000 × 30000\n 2 modalities\n rna:\t5000 x 10000\n dna:\t5000 x 20000"
]
}
],
"source": [
"mdata = mu.MuData({\"rna\": rna_adata, \"dna\": dna_adata})\n",
"mdata.write(os.path.join(out_dir, \"mdata_raw.h5mu\"))\n",
"print(mdata)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "90ce219f-7613-4e6d-8675-d81b1b6cc6a7",
"metadata": {},
"outputs": [],
"source": [
"#sc.pp.normalize_total(mdata[\"rna\"])\n",
"#sc.pp.log1p(mdata[\"rna\"])\n",
"#sc.pp.highly_variable_genes(mdata[\"rna\"], flavor=\"cell_ranger\", n_top_genes=2000)\n",
"#mdata.mod[\"rna\"] = mdata[\"rna\"][:, mdata[\"rna\"].var.highly_variable].copy()\n",
"sc.pp.scale(mdata[\"rna\"], max_value=10)\n",
"sc.tl.pca(mdata[\"rna\"], n_comps=50)\n",
"sc.pp.neighbors(mdata[\"rna\"], n_pcs=30)\n",
"sc.tl.umap(mdata[\"rna\"])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "c44f00fe-fb8e-4fb3-8a85-0766d262c03e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Preprocessing DNA binario completato."
]
}
],
"source": [
"import numpy as np\n",
"import scanpy as sc\n",
"\n",
"# Assicurati che i dati siano binari float\n",
"mdata[\"dna\"].X = (mdata[\"dna\"].X > 0).astype(float)\n",
"\n",
"# Filtra varianti troppo rare\n",
"sc.pp.filter_genes(mdata[\"dna\"], min_cells=5)\n",
"\n",
"# Filtra varianti troppo frequenti (>95% delle cellule)\n",
"to_keep = np.array((mdata[\"dna\"].X.sum(axis=0) < 0.95 * mdata[\"dna\"].n_obs)).flatten()\n",
"mdata.mod[\"dna\"] = mdata[\"dna\"][:, to_keep].copy()\n",
"\n",
"# Scala per PCA\n",
"sc.pp.scale(mdata[\"dna\"], max_value=10)\n",
"\n",
"# PCA + neighbors + UMAP\n",
"sc.tl.pca(mdata[\"dna\"], n_comps=50)\n",
"sc.pp.neighbors(mdata[\"dna\"], n_pcs=30)\n",
"sc.tl.umap(mdata[\"dna\"])\n",
"\n",
"print(\"[INFO] Preprocessing DNA binario completato.\")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9a590752-5c73-4bdb-b55d-fb8932b7face",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Carico metadati da: ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/cell_metadata.csv\n[INFO] Trovate colonne cell_type (TrueCellType) e genotype (TrueGenotype)\n→ Colonna etichetta vera utilizzata: true_label"
]
}
],
"source": [
"import pandas as pd\n",
"import os\n",
"\n",
"# Percorso opzionale del file di metadati (modificalo se serve)\n",
"meta_csv = \"../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/cell_metadata.csv\"\n",
"\n",
"true_label_col = None # nome della colonna con etichetta vera\n",
"\n",
"if os.path.exists(meta_csv):\n",
" print(f\"[INFO] Carico metadati da: {meta_csv}\")\n",
" metadata = pd.read_csv(meta_csv)\n",
"\n",
" # Tenta di riconoscere la colonna delle cellule\n",
" if \"Cell\" in metadata.columns:\n",
" metadata = metadata.set_index(\"Cell\")\n",
" elif \"cell_id\" in metadata.columns:\n",
" metadata = metadata.set_index(\"cell_id\")\n",
" else:\n",
" print(\"[WARN] Nessuna colonna 'Cell' trovata nei metadati — controlla il file.\")\n",
" print(\"Colonne disponibili:\", metadata.columns.tolist())\n",
"\n",
" # Aggiungi i metadati alle modalità RNA, DNA e alla vista combinata\n",
" for mod in [\"rna\", \"dna\"]:\n",
" mdata.mod[mod].obs = mdata.mod[mod].obs.join(metadata, how=\"left\")\n",
" mdata.obs = mdata.obs.join(metadata, how=\"left\")\n",
"\n",
" # Cerca possibili colonne che rappresentano la “verità” (label ground truth)\n",
" ct_cols = [c for c in [\"TrueCellType\", \"CellType\", \"cell_type\"] if c in mdata.obs.columns]\n",
" gt_cols = [c for c in [\"TrueGenotype\", \"Genotype\", \"genotype\"] if c in mdata.obs.columns]\n",
"\n",
" # Combina in una singola colonna \"true_label\"\n",
" if ct_cols and gt_cols:\n",
" mdata.obs[\"true_label\"] = (\n",
" mdata.obs[ct_cols[0]].astype(str) + \"_\" + mdata.obs[gt_cols[0]].astype(str)\n",
" )\n",
" true_label_col = \"true_label\"\n",
" print(f\"[INFO] Trovate colonne cell_type ({ct_cols[0]}) e genotype ({gt_cols[0]})\")\n",
" elif ct_cols:\n",
" mdata.obs[\"true_label\"] = mdata.obs[ct_cols[0]].astype(str)\n",
" true_label_col = \"true_label\"\n",
" print(f\"[INFO] Trovata colonna cell_type ({ct_cols[0]})\")\n",
" elif gt_cols:\n",
" mdata.obs[\"true_label\"] = mdata.obs[gt_cols[0]].astype(str)\n",
" true_label_col = \"true_label\"\n",
" print(f\"[INFO] Trovata colonna genotype ({gt_cols[0]})\")\n",
" else:\n",
" print(\"[WARN] Nessuna colonna adatta trovata per 'true_label'.\")\n",
"\n",
"else:\n",
" print(f\"[WARN] Nessun file di metadati trovato in {meta_csv}\")\n",
"\n",
"print(\"→ Colonna etichetta vera utilizzata:\", true_label_col)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "6ae58343-bc13-4cbc-b618-20ad608eff0d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Costruisco grafo multimodale WNN...",
"[INFO] Calcolo UMAP su WNN...",
"[INFO] Leiden WNN alle risoluzioni: [0.5, 1.0, 1.5, 2.0]",
"[INFO] Salvato plot → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/figures_wnn/wnn_umap_leiden_res0.5.pdf",
"[INFO] Salvato plot → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/figures_wnn/wnn_umap_leiden_res1.0.pdf",
"[INFO] Salvato plot → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/figures_wnn/wnn_umap_leiden_res1.5.pdf",
"[INFO] Salvato plot → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/figures_wnn/wnn_umap_leiden_res2.0.pdf"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>method</th>\n",
" <th>resolution</th>\n",
" <th>ARI</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>WNN</td>\n",
" <td>0.5</td>\n",
" <td>0.264106</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>WNN</td>\n",
" <td>1.0</td>\n",
" <td>0.263178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>WNN</td>\n",
" <td>1.5</td>\n",
" <td>0.271365</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>WNN</td>\n",
" <td>2.0</td>\n",
" <td>0.311895</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" method resolution ARI\n",
"0 WNN 0.5 0.264106\n",
"1 WNN 1.0 0.263178\n",
"2 WNN 1.5 0.271365\n",
"3 WNN 2.0 0.311895"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] ARI salvati → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/ARI_WNN_summary.csv",
"[INFO] Salvato MuData → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mdata_after_wnn.h5mu"
]
}
],
"source": [
"# --- BLOCCO 8: Integrazione WNN (RNA+DNA) + UMAP + Leiden + ARI ---\n",
"\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import muon as mu\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Cartelle output (riusa out_dir se già definita, altrimenti usa una di default)\n",
"if \"out_dir\" not in globals():\n",
" out_dir = \"./wnn_out/\"\n",
"os.makedirs(out_dir, exist_ok=True)\n",
"fig_dir = os.path.join(out_dir, \"figures_wnn\")\n",
"os.makedirs(fig_dir, exist_ok=True)\n",
"\n",
"# 0) Controlli di base sulle modalità\n",
"assert \"rna\" in mdata.mod and \"dna\" in mdata.mod, \"Mi servono le modalità 'rna' e 'dna' dentro mdata.mod\"\n",
"\n",
"print(\"[INFO] Costruisco grafo multimodale WNN...\")\n",
"mu.pp.neighbors(mdata, key_added=\"wnn\") # crea: mdata.uns['wnn'], mdata.obsp['wnn_connectivities'], mdata.obsp['wnn_distances']\n",
"\n",
"# 1) UMAP sul grafo WNN\n",
"print(\"[INFO] Calcolo UMAP su WNN...\")\n",
"mu.tl.umap(mdata, neighbors_key=\"wnn\")\n",
"\n",
"# Salva anche con un nome esplicito\n",
"mdata.obsm[\"X_wnn_umap\"] = mdata.obsm[\"X_umap\"].copy()\n",
"\n",
"# 2) Clustering Leiden su WNN a più risoluzioni\n",
"wnn_res_list = [0.5, 1.0, 1.5, 2.0]\n",
"print(f\"[INFO] Leiden WNN alle risoluzioni: {wnn_res_list}\")\n",
"for res in wnn_res_list:\n",
" key = f\"leiden_wnn_{res}\"\n",
" sc.tl.leiden(\n",
" mdata,\n",
" neighbors_key=\"wnn\",\n",
" resolution=res,\n",
" key_added=key,\n",
" flavor=\"igraph\",\n",
" n_iterations=2,\n",
" directed=False,\n",
" )\n",
"\n",
" # Plot UMAP colorata per questo clustering\n",
" fig, ax = plt.subplots(figsize=(6, 5))\n",
" sc.pl.embedding(\n",
" mdata,\n",
" basis=\"X_wnn_umap\",\n",
" color=key,\n",
" title=f\"WNN UMAP — Leiden res={res}\",\n",
" frameon=False,\n",
" ax=ax,\n",
" show=False,\n",
" )\n",
" out_path = os.path.join(fig_dir, f\"wnn_umap_leiden_res{res}.pdf\")\n",
" plt.savefig(out_path, dpi=300, bbox_inches=\"tight\")\n",
" plt.close(fig)\n",
" print(f\"[INFO] Salvato plot → {out_path}\")\n",
"\n",
"# (Opzionale) Plot per “modality” se esiste in mdata.obs\n",
"if \"modality\" in mdata.obs.columns:\n",
" fig, ax = plt.subplots(figsize=(6, 5))\n",
" sc.pl.embedding(\n",
" mdata,\n",
" basis=\"X_wnn_umap\",\n",
" color=\"modality\",\n",
" title=\"WNN UMAP — modality\",\n",
" frameon=False,\n",
" ax=ax,\n",
" show=False,\n",
" )\n",
" out_path = os.path.join(fig_dir, \"wnn_umap_modality.pdf\")\n",
" plt.savefig(out_path, dpi=300, bbox_inches=\"tight\")\n",
" plt.close(fig)\n",
" print(f\"[INFO] Salvato plot → {out_path}\")\n",
"\n",
"# 3) Calcolo ARI (se è stata preparata la colonna true_label nel Blocco 7)\n",
"def _maybe_ari(obs, pred_col, label_col=None):\n",
" if label_col is None or label_col not in obs.columns:\n",
" return np.nan\n",
" valid = obs[[label_col, pred_col]].dropna()\n",
" if valid.empty:\n",
" return np.nan\n",
" return adjusted_rand_score(valid[label_col].astype(str), valid[pred_col].astype(str))\n",
"\n",
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"results = []\n",
"label_col = None\n",
"if \"true_label_col\" in globals() and true_label_col is not None:\n",
" label_col = true_label_col\n",
"elif \"true_label\" in mdata.obs.columns:\n",
" label_col = \"true_label\"\n",
"\n",
"if label_col is not None:\n",
" for res in wnn_res_list:\n",
" col = f\"leiden_wnn_{res}\"\n",
" ari = _maybe_ari(mdata.obs, col, label_col)\n",
" results.append({\"method\": \"WNN\", \"resolution\": res, \"ARI\": float(ari)})\n",
" res_df = pd.DataFrame(results).sort_values([\"resolution\"])\n",
" display(res_df)\n",
" csv_path = os.path.join(out_dir, \"ARI_WNN_summary.csv\")\n",
" res_df.to_csv(csv_path, index=False)\n",
" print(f\"[INFO] ARI salvati → {csv_path}\")\n",
"else:\n",
" print(\"[WARN] Nessuna colonna di etichette vere trovata (true_label). Salto il calcolo ARI.\")\n",
"\n",
"# 4) Salva il MuData aggiornato\n",
"mdata.write(os.path.join(out_dir, \"mdata_after_wnn.h5mu\"))\n",
"print(f\"[INFO] Salvato MuData → {os.path.join(out_dir, 'mdata_after_wnn.h5mu')}\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3d0f7df4-f85a-473b-a844-d9b65cbeac2e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Inizio integrazione per concatenazione PCA (DEBUG 5×5 MODE)...\n\n[DEBUG] --- RNA ---\nShape (cells × features): (5000, 10000)\nPrime 5×5 celle rna (normalizzate e scalate):\n[[-4.5930e-01 -1.0819e+00 4.8890e-01 -3.9620e-01 1.5960e-01]\n [-1.0020e+00 -2.8540e-01 -1.4451e+00 -1.3384e+00 -3.0990e-01]\n [-5.4080e-01 1.1150e-01 1.3419e+00 9.8180e-01 -1.3718e+00]\n [ 1.0625e+00 1.9301e+00 1.1301e+00 -1.3384e+00 1.3454e+00]\n [-1.9660e-01 -8.3400e-02 -1.4451e+00 -1.3384e+00 -1.5000e-03]]\nPCA shape: (5000, 50), varianza media: 51.6740\nPrime 5x5 celle della PCA:\n[[-36.2213 28.7004 16.1661 -0.816 -3.7326]\n [-36.9042 26.9355 16.6614 -1.1801 3.8297]\n [ 1.8618 -3.6369 -18.9392 45.1308 1.4271]\n [ 6.9455 -32.0362 35.4417 -1.3034 -1.8827]\n [-37.3183 28.8054 16.4231 -1.7821 5.1176]]\nPCA variance ratio (prime 5): [0.0619 0.0602 0.0595 0.0569 0.0004]\n\n[DEBUG] --- DNA ---\nShape (cells × features): (5000, 20000)\nPrime 5×5 celle dna (normalizzate e scalate):\n[[-0.4074 -0.3807 1.7872 -0.3896 -0.395 ]\n [-0.4074 -0.3807 -0.5594 -0.3896 -0.395 ]\n [ 2.4538 -0.3807 1.7872 -0.3896 -0.395 ]\n [-0.4074 2.6263 -0.5594 -0.3896 -0.395 ]\n [-0.4074 -0.3807 1.7872 -0.3896 -0.395 ]]\nPCA shape: (5000, 50), varianza media: 54.3697\nPrime 5x5 celle della PCA:\n[[-23.7148 2.554 54.0891 -0.8168 0.8134]\n [-32.1915 -40.0711 -16.1456 2.0053 2.1533]\n [-21.7155 2.5589 52.835 3.8058 0.8215]\n [-22.7593 2.0298 53.5315 -0.1129 -0.8372]\n [-31.3252 -38.9129 -18.8992 -0.4091 -0.2577]]\nPCA variance ratio (prime 5): [0.0513 0.0347 0.0315 0.0004 0.0004]\n\n[INFO] Celle allineate: 5000\n[INFO] Concatenazione PCA completata: (5000, 100)\n[DEBUG] Prime 5×5 celle CONCATENATE:\n[[-36.2213 28.7004 16.1661 -0.816 -3.7326]\n [-36.9042 26.9355 16.6614 -1.1801 3.8297]\n [ 1.8618 -3.6369 -18.9392 45.1308 1.4271]\n [ 6.9455 -32.0362 35.4417 -1.3034 -1.8827]\n [-37.3183 28.8054 16.4231 -1.7821 5.1176]]\n[DEBUG] Neighbors graph: shape=(5000, 5000), mean_conn=0.003063, nnz=111560\n[DEBUG] Neighbors graph: shape=(5000, 5000), mean_conn=0.081779, nnz=70000",
"[INFO] Calcolo Leiden per risoluzioni: [0.5, 1.0, 1.5, 2.0]\n[DEBUG] → leiden_concat_0.5: 20 clusters | mean=250.0 | min=40 | max=746",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/concat_pca/figures_concat/concat_umap_leiden_res0.5.pdf\n[DEBUG] → leiden_concat_1: 20 clusters | mean=250.0 | min=40 | max=746",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/concat_pca/figures_concat/concat_umap_leiden_res1.0.pdf\n[DEBUG] → leiden_concat_1.5: 20 clusters | mean=250.0 | min=40 | max=746",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/concat_pca/figures_concat/concat_umap_leiden_res1.5.pdf\n[DEBUG] → leiden_concat_2: 20 clusters | mean=250.0 | min=40 | max=746",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/concat_pca/figures_concat/concat_umap_leiden_res2.0.pdf\n\n=== RISULTATI ARI CONCAT (DEBUG 5×5) ===\n method resolution ARI\nConcatPCA 0.5 0.569959\nConcatPCA 1.0 0.569959\nConcatPCA 1.5 0.569959\nConcatPCA 2.0 0.569959\n[INFO] ARI salvati → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/concat_pca/ARI_ConcatPCA_summary.csv",
"[INFO] Salvato MuData → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/concat_pca/mdata_after_concat.h5mu"
]
}
],
"source": [
"# --- BLOCCO 9: Integrazione concatenazione PCA (RNA+DNA) + Leiden + ARI (DEBUG 5×5 MODE) ---\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import adjusted_rand_score\n",
"import os\n",
"\n",
"concat_dir = os.path.join(out_dir, \"concat_pca\")\n",
"os.makedirs(concat_dir, exist_ok=True)\n",
"fig_dir = os.path.join(concat_dir, \"figures_concat\")\n",
"os.makedirs(fig_dir, exist_ok=True)\n",
"\n",
"print(\"[INFO] Inizio integrazione per concatenazione PCA (DEBUG 5×5 MODE)...\")\n",
"\n",
"# === 1️⃣ RNA e DNA PCA check ===\n",
"for mod in [\"rna\", \"dna\"]:\n",
" if \"X_pca\" not in mdata[mod].obsm:\n",
" raise ValueError(f\"[ERROR] Manca PCA per {mod}! Riesegui sc.tl.pca.\")\n",
" \n",
" print(f\"\\n[DEBUG] --- {mod.upper()} ---\")\n",
" print(f\"Shape (cells × features): {mdata[mod].shape}\")\n",
"\n",
" # Prime 5×5 celle della matrice normalizzata\n",
" X = mdata[mod].X\n",
" block_raw = X[:5, :5].A if hasattr(X, \"A\") else np.asarray(X[:5, :5])\n",
" print(f\"Prime 5×5 celle {mod} (normalizzate e scalate):\\n{np.round(block_raw, 4)}\")\n",
"\n",
" # PCA info# PCA info\n",
" X_pca = mdata[mod].obsm[\"X_pca\"]\n",
" print(f\"PCA shape: {X_pca.shape}, varianza media: {np.var(X_pca):.4f}\")\n",
"\n",
" # Mostra le prime 5 righe e 5 colonne della PCA\n",
" print(\"Prime 5x5 celle della PCA:\")\n",
" print(np.round(X_pca[:5, :5], 4))\n",
"\n",
" # Varianza spiegata (se disponibile)\n",
" if \"pca\" in mdata[mod].uns and \"variance_ratio\" in mdata[mod].uns[\"pca\"]:\n",
" vr = mdata[mod].uns[\"pca\"][\"variance_ratio\"]\n",
" print(f\"PCA variance ratio (prime 5): {np.round(vr[:5], 4)}\")\n",
" \n",
"# === 2️⃣ Estrai PCA e concatena ===\n",
"pca_rna = mdata[\"rna\"].obsm[\"X_pca\"]\n",
"pca_dna = mdata[\"dna\"].obsm[\"X_pca\"]\n",
"\n",
"if pca_rna.shape[0] != pca_dna.shape[0]:\n",
" raise ValueError(f\"[ERROR] RNA e DNA hanno celle diverse: RNA={pca_rna.shape[0]} DNA={pca_dna.shape[0]}\")\n",
"\n",
"print(f\"\\n[INFO] Celle allineate: {pca_rna.shape[0]}\")\n",
"X_concat = np.concatenate([pca_rna, pca_dna], axis=1)\n",
"mdata.obsm[\"X_concat_pca\"] = X_concat\n",
"\n",
"print(f\"[INFO] Concatenazione PCA completata: {X_concat.shape}\")\n",
"print(f\"[DEBUG] Prime 5×5 celle CONCATENATE:\\n{np.round(X_concat[:5, :5], 4)}\")\n",
"\n",
"# === 3️⃣ Neighbors, UMAP, Leiden ===\n",
"sc.pp.neighbors(mdata, use_rep=\"X_concat_pca\")\n",
"conn = mdata.obsp[\"connectivities\"]\n",
"dist = mdata.obsp[\"distances\"]\n",
"print(f\"[DEBUG] Neighbors graph: shape={conn.shape}, mean_conn={conn.mean():.6f}, nnz={conn.nnz}\")\n",
"print(f\"[DEBUG] Neighbors graph: shape={dist.shape}, mean_conn={dist.mean():.6f}, nnz={dist.nnz}\")\n",
"\n",
"\n",
"sc.tl.umap(mdata)\n",
"mdata.obsm[\"X_concat_umap\"] = mdata.obsm[\"X_umap\"].copy()\n",
"\n",
"concat_res_list = [0.5, 1.0, 1.5, 2.0]\n",
"print(f\"[INFO] Calcolo Leiden per risoluzioni: {concat_res_list}\")\n",
"\n",
"for res in concat_res_list:\n",
" key = f\"leiden_concat_{res:.2f}\".rstrip(\"0\").rstrip(\".\")\n",
" sc.tl.leiden(\n",
" mdata,\n",
" resolution=res,\n",
" key_added=key,\n",
" flavor=\"igraph\",\n",
" n_iterations=2,\n",
" directed=False,\n",
" random_state=0, # deterministic\n",
" )\n",
"\n",
" # Debug cluster info\n",
" n_clusters = mdata.obs[key].nunique()\n",
" sizes = mdata.obs[key].value_counts().sort_index().values\n",
" print(f\"[DEBUG] → {key}: {n_clusters} clusters | mean={np.mean(sizes):.1f} | min={np.min(sizes)} | max={np.max(sizes)}\")\n",
"\n",
" # UMAP plot\n",
" fig, ax = plt.subplots(figsize=(6, 5))\n",
" sc.pl.embedding(\n",
" mdata,\n",
" basis=\"X_concat_umap\",\n",
" color=key,\n",
" title=f\"Concat PCA UMAP — Leiden res={res}\",\n",
" frameon=False,\n",
" ax=ax,\n",
" show=False,\n",
" )\n",
" out_path = os.path.join(fig_dir, f\"concat_umap_leiden_res{res}.pdf\")\n",
" plt.savefig(out_path, dpi=300, bbox_inches=\"tight\")\n",
" plt.close(fig)\n",
" print(f\"[INFO] Salvato → {out_path}\")\n",
"\n",
"# === 4️⃣ ARI (se presente true_label) ===\n",
"label_col = None\n",
"if \"true_label_col\" in globals() and true_label_col is not None:\n",
" label_col = true_label_col\n",
"elif \"true_label\" in mdata.obs.columns:\n",
" label_col = \"true_label\"\n",
"\n",
"def _maybe_ari(obs, pred_col, label_col=None):\n",
" if label_col is None or label_col not in obs.columns:\n",
" return np.nan\n",
" valid = obs[[label_col, pred_col]].dropna()\n",
" if valid.empty:\n",
" return np.nan\n",
" return adjusted_rand_score(valid[label_col].astype(str), valid[pred_col].astype(str))\n",
"\n",
"results_concat = []\n",
"if label_col is not None:\n",
" for res in concat_res_list:\n",
" col = f\"leiden_concat_{res:.2f}\".rstrip(\"0\").rstrip(\".\")\n",
" ari = _maybe_ari(mdata.obs, col, label_col)\n",
" results_concat.append({\"method\": \"ConcatPCA\", \"resolution\": res, \"ARI\": float(ari)})\n",
" df_concat = pd.DataFrame(results_concat).sort_values(\"resolution\")\n",
" print(\"\\n=== RISULTATI ARI CONCAT (DEBUG 5×5) ===\")\n",
" print(df_concat.to_string(index=False))\n",
" csv_path = os.path.join(concat_dir, \"ARI_ConcatPCA_summary.csv\")\n",
" df_concat.to_csv(csv_path, index=False)\n",
" print(f\"[INFO] ARI salvati → {csv_path}\")\n",
"else:\n",
" print(\"[WARN] Nessuna colonna true_label trovata, salto ARI per Concat PCA.\")\n",
"\n",
"# === 5️⃣ Salvataggio MuData ===\n",
"out_path = os.path.join(concat_dir, \"mdata_after_concat.h5mu\")\n",
"mdata.write(out_path)\n",
"print(f\"[INFO] Salvato MuData → {out_path}\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "05047a95-cc55-4cbb-abad-a336c2074c6e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Likelihood per vista: {'rna': 'gaussian', 'dna': 'gaussian'}\n[INFO] Likelihoods usate in ordine: ['gaussian', 'gaussian']\n[INFO] Avvio MOFA con n_factors=10...\n\n #########################################################\n ### __ __ ____ ______ ### \n ### | \\/ |/ __ \\| ____/\\ _ ### \n ### | \\ / | | | | |__ / \\ _| |_ ### \n ### | |\\/| | | | | __/ /\\ \\_ _| ###\n ### | | | | |__| | | / ____ \\|_| ###\n ### |_| |_|\\____/|_|/_/ \\_\\ ###\n ### ### \n ######################################################### \n \n \n ",
"Loaded view='rna' group='group1' with N=5000 samples and D=10000 features...\nLoaded view='dna' group='group1' with N=5000 samples and D=20000 features...\n\n",
"Model options:\n- Automatic Relevance Determination prior on the factors: True\n- Automatic Relevance Determination prior on the weights: True\n- Spike-and-slab prior on the factors: False\n- Spike-and-slab prior on the weights: True\nLikelihoods:\n- View 0 (rna): gaussian\n- View 1 (dna): gaussian\n\n",
"\n\n######################################\n## Training the model with seed 0 ##\n######################################\n\n",
"\nConverged!\n\n\n\n#######################\n## Training finished ##\n#######################\n\n\nSaving model in /tmp/mofa_20251125-223303.hdf5...",
"Saved MOFA embeddings in .obsm['X_mofa'] slot and their loadings in .varm['LFs'].\n[INFO] Calcolo neighbors e UMAP su fattori MOFA...",
"[INFO] Calcolo Leiden per risoluzioni: [0.5, 1.0, 1.5, 2.0]",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mofa/figures_mofa/mofa_umap_leiden_res0.5.pdf",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mofa/figures_mofa/mofa_umap_leiden_res1.0.pdf",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mofa/figures_mofa/mofa_umap_leiden_res1.5.pdf",
"[INFO] Salvato → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mofa/figures_mofa/mofa_umap_leiden_res2.0.pdf"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>method</th>\n",
" <th>resolution</th>\n",
" <th>ARI</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MOFA</td>\n",
" <td>0.5</td>\n",
" <td>0.569959</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>MOFA</td>\n",
" <td>1.0</td>\n",
" <td>0.569959</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>MOFA</td>\n",
" <td>1.5</td>\n",
" <td>0.569959</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>MOFA</td>\n",
" <td>2.0</td>\n",
" <td>0.420227</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" method resolution ARI\n",
"0 MOFA 0.5 0.569959\n",
"1 MOFA 1.0 0.569959\n",
"2 MOFA 1.5 0.569959\n",
"3 MOFA 2.0 0.420227"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] ARI salvati → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mofa/ARI_MOFA_summary.csv",
"[INFO] Salvato MuData finale → ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/mofa/mdata_after_mofa.h5mu"
]
}
],
"source": [
"# --- BLOCCO 10: Integrazione MOFA(+) + UMAP + Leiden + ARI ---\n",
"\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scanpy as sc\n",
"import muon as mu\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.metrics import adjusted_rand_score\n",
"\n",
"mofa_dir = os.path.join(out_dir, \"mofa\")\n",
"os.makedirs(mofa_dir, exist_ok=True)\n",
"fig_dir = os.path.join(mofa_dir, \"figures_mofa\")\n",
"os.makedirs(fig_dir, exist_ok=True)\n",
"\n",
"# 0️⃣ Verifica che mofapy2 sia installato\n",
"try:\n",
" import mofapy2 # noqa: F401\n",
"except ImportError:\n",
" raise ImportError(\"Devi installare mofapy2 prima di usare MOFA → pip install mofapy2\")\n",
"\n",
"# 1️⃣ Imposta likelihood per ciascuna vista\n",
"likelihoods = {}\n",
"for view in mdata.mod.keys():\n",
" X = mdata.mod[view].X\n",
" try:\n",
" vals = np.unique(X.data if hasattr(X, \"data\") else np.asarray(X).ravel())\n",
" is_binary = np.all(np.isin(vals, [0, 1]))\n",
" except Exception:\n",
" is_binary = False\n",
"\n",
" if view.lower() in (\"rna\", \"gex\", \"gene\", \"transcriptome\"):\n",
" likelihoods[view] = \"gaussian\"\n",
" elif is_binary:\n",
" likelihoods[view] = \"bernoulli\"\n",
" else:\n",
" likelihoods[view] = \"gaussian\"\n",
"\n",
"print(\"[INFO] Likelihood per vista:\", likelihoods)\n",
"\n",
"# ➕ Converti in LISTA nell'ordine delle viste in mdata.mod\n",
"likelihoods_list = [likelihoods[view] for view in mdata.mod.keys()]\n",
"print(\"[INFO] Likelihoods usate in ordine:\", likelihoods_list)\n",
"\n",
"# 2️⃣ Esegui MOFA\n",
"n_factors = 10\n",
"print(f\"[INFO] Avvio MOFA con n_factors={n_factors}...\")\n",
"mu.tl.mofa(\n",
" mdata,\n",
" n_factors=n_factors,\n",
" likelihoods=likelihoods_list, # ✅ usa lista, non dict\n",
" convergence_mode=\"medium\",\n",
" seed=0,\n",
")\n",
"\n",
"# I fattori vengono salvati in mdata.obsm[\"X_mofa\"]\n",
"assert \"X_mofa\" in mdata.obsm, \"MOFA non ha prodotto X_mofa; controlla i log di mofapy2\"\n",
"\n",
"# 3️⃣ Costruisci grafo e UMAP nello spazio MOFA\n",
"print(\"[INFO] Calcolo neighbors e UMAP su fattori MOFA...\")\n",
"sc.pp.neighbors(mdata, use_rep=\"X_mofa\")\n",
"sc.tl.umap(mdata)\n",
"mdata.obsm[\"X_mofa_umap\"] = mdata.obsm[\"X_umap\"].copy()\n",
"\n",
"# 4️⃣ Leiden multi-risoluzione\n",
"mofa_res_list = [0.5, 1.0, 1.5, 2.0]\n",
"print(f\"[INFO] Calcolo Leiden per risoluzioni: {mofa_res_list}\")\n",
"for res in mofa_res_list:\n",
" key = f\"leiden_mofa_{res}\"\n",
" sc.tl.leiden(\n",
" mdata,\n",
" resolution=res,\n",
" key_added=key,\n",
" flavor=\"igraph\",\n",
" n_iterations=2,\n",
" directed=False,\n",
" )\n",
"\n",
" # Plot\n",
" fig, ax = plt.subplots(figsize=(6, 5))\n",
" sc.pl.embedding(\n",
" mdata,\n",
" basis=\"X_mofa_umap\",\n",
" color=key,\n",
" title=f\"MOFA UMAP — Leiden res={res}\",\n",
" frameon=False,\n",
" ax=ax,\n",
" show=False,\n",
" )\n",
" out_path = os.path.join(fig_dir, f\"mofa_umap_leiden_res{res}.pdf\")\n",
" plt.savefig(out_path, dpi=300, bbox_inches=\"tight\")\n",
" plt.close(fig)\n",
" print(f\"[INFO] Salvato → {out_path}\")\n",
"\n",
"# 5️⃣ Calcolo ARI (se esiste una colonna di verità)\n",
"def _maybe_ari(obs, pred_col, label_col=None):\n",
" if label_col is None or label_col not in obs.columns:\n",
" return np.nan\n",
" valid = obs[[label_col, pred_col]].dropna()\n",
" if valid.empty:\n",
" return np.nan\n",
" return adjusted_rand_score(valid[label_col].astype(str), valid[pred_col].astype(str))\n",
"\n",
"label_col = None\n",
"if \"true_label_col\" in globals() and true_label_col is not None:\n",
" label_col = true_label_col\n",
"elif \"true_label\" in mdata.obs.columns:\n",
" label_col = \"true_label\"\n",
"\n",
"results_mofa = []\n",
"if label_col is not None:\n",
" for res in mofa_res_list:\n",
" col = f\"leiden_mofa_{res}\"\n",
" ari = _maybe_ari(mdata.obs, col, label_col)\n",
" results_mofa.append({\"method\": \"MOFA\", \"resolution\": res, \"ARI\": float(ari)})\n",
" df_mofa = pd.DataFrame(results_mofa).sort_values(\"resolution\")\n",
" display(df_mofa)\n",
" csv_path = os.path.join(mofa_dir, \"ARI_MOFA_summary.csv\")\n",
" df_mofa.to_csv(csv_path, index=False)\n",
" print(f\"[INFO] ARI salvati → {csv_path}\")\n",
"else:\n",
" print(\"[WARN] Nessuna colonna true_label trovata: salto ARI per MOFA.\")\n",
"\n",
"# 6️⃣ Salvataggio finale del MuData aggiornato\n",
"out_path = os.path.join(mofa_dir, \"mdata_after_mofa.h5mu\")\n",
"mdata.write(out_path)\n",
"print(f\"[INFO] Salvato MuData finale → {out_path}\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "40b9c845-6a75-4167-8d0d-21a5b5b99d04",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 20,
"id": "a77cd0f2-4b08-44e2-9422-6fb664613677",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5710a4fe-6d4d-4894-a5c0-c620160836ff",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "127da0bc-d1c6-4c5a-9e0d-c601b1b9b9be",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "f605a6d7-c6e2-48e4-8510-f75266de9660",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c31cbe4-e48a-4e23-abfc-6a4cff80842b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:scvar_env]",
"language": "python",
"name": "conda-env-scvar_env-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
{
"cells": [
{
"cell_type": "code",
"execution_count": 14,
"id": "04c96386-1480-4ab3-a7f3-2aed794832e3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'calcOmicsClusters', 'distributionClusters', 'omicsIntegration', 'pairedIntegrationTrainer', 'save_all_umaps', 'scVAR', 'transcriptomicAnalysis', 'variantAnalysis', 'weightsInit']"
]
}
],
"source": [
"#Import all dependencies from the scVAR enviroment (see installation instructions)\n",
"import importlib\n",
"import scVAR\n",
"importlib.reload(scVAR)\n",
"print(dir(scVAR))\n",
"import sys\n",
"import pickle\n",
"import os\n",
"import scanpy as sc\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "be60befd-a7b5-4e3a-84c5-a80aa06ccc4b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Start Analysis sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3"
]
}
],
"source": [
"sample = \"sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3\"\n",
"out_path = '../tests/' + sample\n",
"in_path = '../tests/' + sample\n",
"\n",
"# Crea\n",
"if not os.path.exists(out_path):\n",
" os.makedirs(out_path, exist_ok=True)\n",
"\n",
"print('Start Analysis', sample)\n",
"\n",
"# Specify genomic file path\n",
"var_mat = in_path + '/' +'consensus_filtered_markdup.mtx'\n",
"barcode_var = in_path + '/' + 'barcodes_var.tsv'\n",
"snv = in_path + '/' +'variants_filtered_markdup.txt'"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "93db82d6-fd94-4494-86a1-844b90d05287",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] === TRANSCRIPTOMIC ANALYSIS START ===[DEBUG] Raw RNA matrix shape: (5000, 10000)\n",
"[DEBUG] Prime 5×5 celle RAW (counts):\n",
"[[1.54455 0.00000 3.06500 1.84460 2.34581]\n",
" [0.00000 1.48369 0.00000 0.00000 1.62663]\n",
" [1.31263 2.22300 4.41671 4.54242 0.00000]\n",
" [5.87613 5.61039 4.08110 0.00000 4.16244]\n",
" [2.29248 1.85998 0.00000 0.00000 2.09903]]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/tools/deg/miniforge3/envs/scvar_env/lib/python3.10/site-packages/scanpy/preprocessing/_simple.py:392: RuntimeWarning: invalid value encountered in log1p\n",
" np.log1p(X, out=X)\n",
"/opt/tools/deg/miniforge3/envs/scvar_env/lib/python3.10/functools.py:889: UserWarning: zero-centering a sparse array/matrix densifies it.\n",
" return dispatch(args[0].__class__)(*args, **kw)"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[DEBUG] RNA normalizzata e scalata (Muon-style). Shape: (5000, 10000)\n",
"[DEBUG] Prime 5×5 celle RNA (normalizzate):\n",
"[[-0.1368 -1.3475 0.5900 -0.0227 0.3918]\n",
" [-1.2973 0.0203 -1.5114 -1.4452 0.0200]\n",
" [-0.2579 0.4090 1.0154 0.8799 -1.4613]\n",
" [ 1.1003 1.4959 0.9280 -1.4452 1.0607]\n",
" [ 0.1765 0.2256 -1.5114 -1.4452 0.2665]]\n",
"[INFO] PCA con 50 componenti..."
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/tools/deg/miniforge3/envs/scvar_env/lib/python3.10/site-packages/sklearn/manifold/_spectral_embedding.py:328: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.\n",
" warnings.warn("
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Salvato adata.uns['trans_X'] con shape (5000, 50)\n",
"[DEBUG] Prime 5×5 celle PCA:\n",
"[[ 27.5802 12.5513 -13.206 -0.4545 2.8701]\n",
" [ 28.8187 10.3438 -13.1964 -1.2082 -4.3537]\n",
" [ -1.777 2.2861 12.3413 30.4629 -2.2863]\n",
" [ -6.7155 -26.169 -18.2502 -0.6037 0.9084]\n",
" [ 28.8888 12.6758 -13.6591 -1.4595 -4.2337]]\n",
"[DEBUG] PCA variance ratio (prime 5): [0.0285 0.0276 0.0272 0.0257 0.0005]\n",
"[INFO] === TRANSCRIPTOMIC ANALYSIS DONE (5000 cells, 10000 genes) ===\n",
"[INFO] === VARIANT ANALYSIS START ===\n",
"[INFO] Lettura file: ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/consensus_filtered_markdup.mtx\n",
"[WARN] Transposing variant matrix (20000, 5000) → expected (cells × variants)\n",
"[INFO] Matrice varianti caricata → 5000 celle × 20000 varianti\n",
"[DEBUG] Prime 5×5 celle RAW (counts):\n",
"[[0 0 1 0 0]\n",
" [0 0 0 0 0]\n",
" [1 0 1 0 0]\n",
" [0 1 0 0 0]\n",
" [0 0 1 0 0]][INFO] Celle comuni con RNA: 5000\n",
"[INFO] Rappresentazione scelta: MUON\n",
"[INFO] Uso pipeline Muon-style (binarizzazione + scaling + PCA + UMAP)[INFO] Varianti mantenute: 20000/20000"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/tools/deg/miniforge3/envs/scvar_env/lib/python3.10/functools.py:889: UserWarning: zero-centering a sparse array/matrix densifies it.\n",
" return dispatch(args[0].__class__)(*args, **kw)"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[DEBUG] Prime 5×5 celle DNA (scalate):\n",
"[[-0.4074 -0.3807 1.7872 -0.3896 -0.395 ]\n",
" [-0.4074 -0.3807 -0.5594 -0.3896 -0.395 ]\n",
" [ 2.4538 -0.3807 1.7872 -0.3896 -0.395 ]\n",
" [-0.4074 2.6263 -0.5594 -0.3896 -0.395 ]\n",
" [-0.4074 -0.3807 1.7872 -0.3896 -0.395 ]][DEBUG] PCA variance ratio (prime 5): [0.0513 0.0347 0.0315 0.0004 0.0004]\n",
"[DEBUG] Prime 5×5 celle PCA:\n",
"[[-23.7148 2.554 54.0891 -0.8168 0.8134]\n",
" [-32.1915 -40.0711 -16.1456 2.0053 2.1533]\n",
" [-21.7155 2.5589 52.835 3.8058 0.8215]\n",
" [-22.7593 2.0298 53.5315 -0.1129 -0.8372]\n",
" [-31.3252 -38.9129 -18.8992 -0.4091 -0.2577]]\n",
"[INFO] DNA PCA shape: (5000, 30)\n",
"[INFO] === VARIANT ANALYSIS DONE (5000 cells, 20000 variants) ===\n",
"[INFO] === OMICS INTEGRATION START ===\n",
"[INFO] Nessuna riscalatura: uso PCA pura (Muon-style).\n",
"[INFO] n_cells=5000 ≥ 2000 → uso Autoencoder asimmetrico (RNA=teacher, VAR=student)[EPOCH 000] loss=12.20041[EPOCH 010] loss=9.84364[EPOCH 020] loss=9.29281[EPOCH 030] loss=8.89979[EPOCH 040] loss=8.66576[EPOCH 050] loss=8.47935[EPOCH 060] loss=8.37573[EPOCH 070] loss=8.21203[EPOCH 080] loss=8.10949[EPOCH 090] loss=8.02636[EPOCH 100] loss=7.95381[EPOCH 110] loss=7.89721[EPOCH 120] loss=7.86011[EPOCH 130] loss=7.75689[EPOCH 140] loss=7.68363[EPOCH 150] loss=7.67342[EPOCH 160] loss=7.59935[EPOCH 170] loss=7.59624[EPOCH 180] loss=7.59375[EPOCH 190] loss=7.53767[EPOCH 200] loss=7.50628[EPOCH 210] loss=7.48779[EPOCH 220] loss=7.46081[EPOCH 230] loss=7.45913[EPOCH 240] loss=7.40112[EPOCH 250] loss=7.40735[EPOCH 260] loss=7.36739[EPOCH 270] loss=7.37908[EPOCH 280] loss=7.33565[EPOCH 290] loss=7.33232[EPOCH 300] loss=7.29641[EPOCH 310] loss=7.29751[EPOCH 320] loss=7.27481[EPOCH 330] loss=7.31685[EPOCH 340] loss=7.28791[EPOCH 350] loss=7.25487[EARLY STOP] epoch=353, loss=7.25588\n",
"\n",
"[DEBUG] === AE DIAGNOSTIC ===\n",
"Shape zA: (5000, 400), zB: (5000, 400)\n",
"Varianza zA: 0.003693\n",
"Varianza zB: 0.000615\n",
"Covarianza media zA/zB: 0.000003\n",
"SimAE (dot): 35.875084\n",
"Media correlazione assoluta tra feature zA: 0.0882\n",
"Range valori zA: -1.6930 → 2.9349\n",
"Range valori zB: -1.6362 → 2.8093\n",
"[DEBUG] Prime 5x5 celle zA:\n",
" [[ 0.0551 -0.0612 -0.0695 -0.0238 -0.0324]\n",
" [-0.0339 -0.0796 -0.0597 0.0498 0.039 ]\n",
" [-0.0385 0.0456 0.0597 0.0698 0.0962]\n",
" [-0.0265 0.0119 0.0295 0.0565 0.1366]\n",
" [ 0.0226 0.0446 0.0062 0.0481 -0.0308]]\n",
"[DEBUG] Prime 5x5 celle zB:\n",
" [[-0.0421 -0.0021 0.0417 0.0299 0.0093]\n",
" [-0.0069 -0.0332 0.0173 0.0387 0.0122]\n",
" [-0.0257 0.0129 0.0199 -0.0048 0.0483]\n",
" [-0.0359 -0.0138 0.0808 0.0518 -0.0027]\n",
" [-0.0778 -0.0142 0.058 0.0141 -0.0141]]\n",
"================================\n",
"\n",
"[DEBUG] Varianza media zA: 0.0921, zB: 0.0869, zAE: 0.0884\n",
"[DEBUG] Prime 5×5 celle embedding AE:\n",
"[[ 0.0065 -0.0316 -0.0139 0.0031 -0.0115]\n",
" [-0.0204 -0.0564 -0.0212 0.0442 0.0256]\n",
" [-0.0321 0.0293 0.0398 0.0325 0.0722]\n",
" [-0.0312 -0.001 0.0551 0.0541 0.067 ]\n",
" [-0.0276 0.0152 0.0321 0.0311 -0.0224]]"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/tools/deg/miniforge3/envs/scvar_env/lib/python3.10/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.\n",
" warn("
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] AE similarity=35.875\n",
"[INFO] === AUTOENCODER INTEGRATION COMPLETATA ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (variant) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['variant_X'] (30 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[-23.7148 2.554 54.0891 -0.8168 0.8134]\n",
" [-32.1915 -40.0711 -16.1456 2.0053 2.1533]\n",
" [-21.7155 2.5589 52.835 3.8058 0.8215]\n",
" [-22.7593 2.0298 53.5315 -0.1129 -0.8372]\n",
" [-31.3252 -38.9129 -18.8992 -0.4091 -0.2577]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(41.214), np.float32(-75.0844), np.float32(36.6839), np.float32(14.1632), np.float32(-99.744)]\n",
"[DEBUG] Varianza media embedding: 85.4580\n",
"[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002743, mean_dist=0.040279\n",
"[INFO] → Leiden completato su variant (res=0.5) → 4 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (trans) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['trans_X'] (50 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 27.5802 12.5513 -13.206 -0.4545 2.8701]\n",
" [ 28.8187 10.3438 -13.1964 -1.2082 -4.3537]\n",
" [ -1.777 2.2861 12.3413 30.4629 -2.2863]\n",
" [ -6.7155 -26.169 -18.2502 -0.6037 0.9084]\n",
" [ 28.8888 12.6758 -13.6591 -1.4595 -4.2337]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(24.3581), np.float32(26.9298), np.float32(35.7606), np.float32(-48.3718), np.float32(31.0977)]\n",
"[DEBUG] Varianza media embedding: 26.3937[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002376, mean_dist=0.046996\n",
"[INFO] → Leiden completato su trans (res=0.5) → 5 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (int) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['int_X'] (400 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 0.0065 -0.0316 -0.0139 0.0031 -0.0115]\n",
" [-0.0204 -0.0564 -0.0212 0.0442 0.0256]\n",
" [-0.0321 0.0293 0.0398 0.0325 0.0722]\n",
" [-0.0312 -0.001 0.0551 0.0541 0.067 ]\n",
" [-0.0276 0.0152 0.0321 0.0311 -0.0224]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(19.3927), np.float32(18.6286), np.float32(19.5757), np.float32(18.5747), np.float32(19.2562)]\n",
"[DEBUG] Varianza media embedding: 0.0884[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.001461, mean_dist=0.001989\n",
"[INFO] → Leiden completato su int (res=0.5) → 4 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (variant) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['variant_X'] (30 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[-23.7148 2.554 54.0891 -0.8168 0.8134]\n",
" [-32.1915 -40.0711 -16.1456 2.0053 2.1533]\n",
" [-21.7155 2.5589 52.835 3.8058 0.8215]\n",
" [-22.7593 2.0298 53.5315 -0.1129 -0.8372]\n",
" [-31.3252 -38.9129 -18.8992 -0.4091 -0.2577]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(41.214), np.float32(-75.0844), np.float32(36.6839), np.float32(14.1632), np.float32(-99.744)]\n",
"[DEBUG] Varianza media embedding: 85.4580[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002743, mean_dist=0.040279\n",
"[INFO] → Leiden completato su variant (res=1.0) → 4 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (trans) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['trans_X'] (50 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 27.5802 12.5513 -13.206 -0.4545 2.8701]\n",
" [ 28.8187 10.3438 -13.1964 -1.2082 -4.3537]\n",
" [ -1.777 2.2861 12.3413 30.4629 -2.2863]\n",
" [ -6.7155 -26.169 -18.2502 -0.6037 0.9084]\n",
" [ 28.8888 12.6758 -13.6591 -1.4595 -4.2337]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(24.3581), np.float32(26.9298), np.float32(35.7606), np.float32(-48.3718), np.float32(31.0977)]\n",
"[DEBUG] Varianza media embedding: 26.3937[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002376, mean_dist=0.046996\n",
"[INFO] → Leiden completato su trans (res=1.0) → 5 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (int) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['int_X'] (400 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 0.0065 -0.0316 -0.0139 0.0031 -0.0115]\n",
" [-0.0204 -0.0564 -0.0212 0.0442 0.0256]\n",
" [-0.0321 0.0293 0.0398 0.0325 0.0722]\n",
" [-0.0312 -0.001 0.0551 0.0541 0.067 ]\n",
" [-0.0276 0.0152 0.0321 0.0311 -0.0224]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(19.3927), np.float32(18.6286), np.float32(19.5757), np.float32(18.5747), np.float32(19.2562)]\n",
"[DEBUG] Varianza media embedding: 0.0884[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.001461, mean_dist=0.001989\n",
"[INFO] → Leiden completato su int (res=1.0) → 5 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (variant) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['variant_X'] (30 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[-23.7148 2.554 54.0891 -0.8168 0.8134]\n",
" [-32.1915 -40.0711 -16.1456 2.0053 2.1533]\n",
" [-21.7155 2.5589 52.835 3.8058 0.8215]\n",
" [-22.7593 2.0298 53.5315 -0.1129 -0.8372]\n",
" [-31.3252 -38.9129 -18.8992 -0.4091 -0.2577]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(41.214), np.float32(-75.0844), np.float32(36.6839), np.float32(14.1632), np.float32(-99.744)]\n",
"[DEBUG] Varianza media embedding: 85.4580[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002743, mean_dist=0.040279\n",
"[INFO] → Leiden completato su variant (res=1.5) → 5 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (trans) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['trans_X'] (50 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 27.5802 12.5513 -13.206 -0.4545 2.8701]\n",
" [ 28.8187 10.3438 -13.1964 -1.2082 -4.3537]\n",
" [ -1.777 2.2861 12.3413 30.4629 -2.2863]\n",
" [ -6.7155 -26.169 -18.2502 -0.6037 0.9084]\n",
" [ 28.8888 12.6758 -13.6591 -1.4595 -4.2337]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(24.3581), np.float32(26.9298), np.float32(35.7606), np.float32(-48.3718), np.float32(31.0977)]\n",
"[DEBUG] Varianza media embedding: 26.3937[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002376, mean_dist=0.046996\n",
"[INFO] → Leiden completato su trans (res=1.5) → 5 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (int) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['int_X'] (400 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 0.0065 -0.0316 -0.0139 0.0031 -0.0115]\n",
" [-0.0204 -0.0564 -0.0212 0.0442 0.0256]\n",
" [-0.0321 0.0293 0.0398 0.0325 0.0722]\n",
" [-0.0312 -0.001 0.0551 0.0541 0.067 ]\n",
" [-0.0276 0.0152 0.0321 0.0311 -0.0224]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(19.3927), np.float32(18.6286), np.float32(19.5757), np.float32(18.5747), np.float32(19.2562)]\n",
"[DEBUG] Varianza media embedding: 0.0884[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.001461, mean_dist=0.001989\n",
"[INFO] → Leiden completato su int (res=1.5) → 12 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (variant) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['variant_X'] (30 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[-23.7148 2.554 54.0891 -0.8168 0.8134]\n",
" [-32.1915 -40.0711 -16.1456 2.0053 2.1533]\n",
" [-21.7155 2.5589 52.835 3.8058 0.8215]\n",
" [-22.7593 2.0298 53.5315 -0.1129 -0.8372]\n",
" [-31.3252 -38.9129 -18.8992 -0.4091 -0.2577]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(41.214), np.float32(-75.0844), np.float32(36.6839), np.float32(14.1632), np.float32(-99.744)]\n",
"[DEBUG] Varianza media embedding: 85.4580[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002743, mean_dist=0.040279\n",
"[INFO] → Leiden completato su variant (res=2.0) → 12 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (trans) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['trans_X'] (50 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 27.5802 12.5513 -13.206 -0.4545 2.8701]\n",
" [ 28.8187 10.3438 -13.1964 -1.2082 -4.3537]\n",
" [ -1.777 2.2861 12.3413 30.4629 -2.2863]\n",
" [ -6.7155 -26.169 -18.2502 -0.6037 0.9084]\n",
" [ 28.8888 12.6758 -13.6591 -1.4595 -4.2337]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(24.3581), np.float32(26.9298), np.float32(35.7606), np.float32(-48.3718), np.float32(31.0977)]\n",
"[DEBUG] Varianza media embedding: 26.3937[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.002376, mean_dist=0.046996\n",
"[INFO] → Leiden completato su trans (res=2.0) → 5 cluster\n",
"[INFO] === CLUSTERING DONE ===\n",
"\n",
"[INFO] === CALC OMiCS CLUSTERS (int) ===\n",
"[INFO] Numero celle: 5000\n",
"[INFO] Dataset grande → uso embedding autoencoder (AE-style)\n",
"[INFO] Embedding caricato da adata.uns['int_X'] (400 dimensioni)\n",
"[DEBUG] Prime 5×5 celle embedding:\n",
"[[ 0.0065 -0.0316 -0.0139 0.0031 -0.0115]\n",
" [-0.0204 -0.0564 -0.0212 0.0442 0.0256]\n",
" [-0.0321 0.0293 0.0398 0.0325 0.0722]\n",
" [-0.0312 -0.001 0.0551 0.0541 0.067 ]\n",
" [-0.0276 0.0152 0.0321 0.0311 -0.0224]]\n",
"[DEBUG] Somma prime 5 righe: [np.float32(19.3927), np.float32(18.6286), np.float32(19.5757), np.float32(18.5747), np.float32(19.2562)]\n",
"[DEBUG] Varianza media embedding: 0.0884[DEBUG] Neighbors graph: (5000, 5000), mean_conn=0.001461, mean_dist=0.001989\n",
"[INFO] → Leiden completato su int (res=2.0) → 21 cluster\n",
"[INFO] === CLUSTERING DONE ==="
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# RNA - OK\n",
"adata = scVAR.transcriptomicAnalysis(\n",
" path_10x=in_path,\n",
" bcode_variants=barcode_var,\n",
" n_pcs=50, # oppure None se vuoi usare tutto (più lento/rumoroso)\n",
")\n",
"\n",
"# VAR - OK\n",
"adata = scVAR.variantAnalysis(adata, matrix_path=var_mat, bcode_path=barcode_var,\n",
" variants_path=snv, n_pcs=30, variant_filter_level=\"medium\",variant_rep=\"muon\")\n",
"\n",
"\n",
"adata = scVAR.omicsIntegration(\n",
" adata,\n",
" latent_dim=400,\n",
" num_epoch=3500,\n",
" lam_align=0.5,\n",
" lam_cross=7.7,\n",
" lam_recon_a=1.0,\n",
" lam_recon_b=0.8,\n",
")\n",
"\n",
"# Compute transcriptomics, genomics and integrated clusters at different resolutions\n",
"res_list = [0.5, 1.0, 1.5, 2.0]\n",
"for res in res_list:\n",
" adata = scVAR.calcOmicsClusters(adata, omic_key='variant', res=res)\n",
" adata = scVAR.calcOmicsClusters(adata, omic_key='trans', res=res)\n",
" adata = scVAR.calcOmicsClusters(adata, omic_key='int', res=res)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "4e52f021-0ca9-444a-a7b8-df7cd1fcb13d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AnnData object with n_obs × n_vars = 5000 × 10000\n",
" obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'variant_clust_0.5', 'trans_clust_0.5', 'int_clust_0.5', 'variant_clust_1', 'trans_clust_1', 'int_clust_1', 'variant_clust_1.5', 'trans_clust_1.5', 'int_clust_1.5', 'variant_clust_2', 'trans_clust_2', 'int_clust_2'\n",
" var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'mean', 'std'\n",
" uns: 'trans_raw', 'trans_raw_obs_names', 'trans_raw_var_names', 'log1p', 'pca', 'neighbors', 'umap', 'trans_X', 'variant_raw', 'variant_raw_obs_names', 'variant_raw_var_names', 'variant_X', 'int_X', 'int_metrics', 'variant_neighbors', 'variant_clust_0.5', 'trans_neighbors', 'trans_clust_0.5', 'int_neighbors', 'int_clust_0.5', 'variant_clust_1', 'trans_clust_1', 'int_clust_1', 'variant_clust_1.5', 'trans_clust_1.5', 'int_clust_1.5', 'variant_clust_2', 'trans_clust_2', 'int_clust_2'\n",
" obsm: 'X_pca', 'X_umap', 'trans_pca', 'trans_umap', 'variant_pca', 'variant_umap', 'int_umap', 'variant_X', 'trans_X', 'int_X'\n",
" varm: 'PCs'\n",
" obsp: 'distances', 'connectivities', 'variant_neighbors_distances', 'variant_neighbors_connectivities', 'trans_neighbors_distances', 'trans_neighbors_connectivities', 'int_neighbors_distances', 'int_neighbors_connectivities'\n",
"Index(['n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt',\n",
" 'pct_counts_mt', 'variant_clust_0.5', 'trans_clust_0.5',\n",
" 'int_clust_0.5', 'variant_clust_1', 'trans_clust_1', 'int_clust_1',\n",
" 'variant_clust_1.5', 'trans_clust_1.5', 'int_clust_1.5',\n",
" 'variant_clust_2', 'trans_clust_2', 'int_clust_2'],\n",
" dtype='object')"
]
}
],
"source": [
"print(adata)\n",
"print(adata.obs.columns)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "746bbe89-e96f-46e3-b35b-69b89951d8cd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'variant_clust_0.5', 'trans_clust_0.5', 'int_clust_0.5', 'variant_clust_1', 'trans_clust_1', 'int_clust_1', 'variant_clust_1.5', 'trans_clust_1.5', 'int_clust_1.5', 'variant_clust_2', 'trans_clust_2', 'int_clust_2']\n",
"variant_clust_0.5 4\n",
"trans_clust_0.5 5\n",
"int_clust_0.5 4\n",
"variant_clust_1 4\n",
"trans_clust_1 5\n",
"int_clust_1 5\n",
"variant_clust_1.5 5\n",
"trans_clust_1.5 5\n",
"int_clust_1.5 12\n",
"variant_clust_2 12\n",
"trans_clust_2 5\n",
"int_clust_2 21\n",
"Name: n_cluster, dtype: int64"
]
}
],
"source": [
"print(adata.obs.columns.tolist())\n",
"import pandas as pd\n",
"\n",
"cluster_cols = [c for c in adata.obs.columns if \"clust\" in c]\n",
"cluster_summary = {\n",
" c: adata.obs[c].nunique(dropna=True)\n",
" for c in cluster_cols\n",
"}\n",
"\n",
"print(pd.Series(cluster_summary, name=\"n_cluster\"))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "757bed63-e59b-4cb4-a614-d099a59f8dad",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== 🔍 ISPEZIONE DETTAGLIATA DI adata.uns ===\n",
" • trans_raw: shape=(5000, 10000), dtype=float64\n",
" • trans_raw_obs_names: shape=(5000,), dtype=object\n",
" • trans_raw_var_names: shape=(10000,), dtype=object\n",
" • log1p: type=dict, len=1\n",
" • pca: type=dict, len=3\n",
" • neighbors: type=dict, len=3\n",
" • umap: type=dict, len=1\n",
" • trans_X: shape=(5000, 50), dtype=float32\n",
" • variant_raw: shape=(5000, 20000), dtype=int64\n",
" • variant_raw_obs_names: shape=(5000,), dtype=object\n",
" • variant_raw_var_names: shape=(20000,), dtype=object\n",
" • variant_X: shape=(5000, 30), dtype=float32\n",
" • int_X: shape=(5000, 400), dtype=float32\n",
" • int_metrics: type=dict, len=7\n",
" • variant_neighbors: type=dict, len=3\n",
" • variant_clust_0.5: type=dict, len=2\n",
" • trans_neighbors: type=dict, len=3\n",
" • trans_clust_0.5: type=dict, len=2\n",
" • int_neighbors: type=dict, len=3\n",
" • int_clust_0.5: type=dict, len=2\n",
" • variant_clust_1: type=dict, len=2\n",
" • trans_clust_1: type=dict, len=2\n",
" • int_clust_1: type=dict, len=2\n",
" • variant_clust_1.5: type=dict, len=2\n",
" • trans_clust_1.5: type=dict, len=2\n",
" • int_clust_1.5: type=dict, len=2\n",
" • variant_clust_2: type=dict, len=2\n",
" • trans_clust_2: type=dict, len=2\n",
" • int_clust_2: type=dict, len=2"
]
}
],
"source": [
"print(\"=== 🔍 ISPEZIONE DETTAGLIATA DI adata.uns ===\")\n",
"\n",
"if hasattr(adata, \"uns\") and len(adata.uns.keys()) > 0:\n",
" for key in adata.uns.keys():\n",
" obj = adata.uns[key]\n",
" # prova a estrarre shape e dtype se è array-like o matrice\n",
" shape = getattr(obj, \"shape\", None)\n",
" dtype = getattr(obj, \"dtype\", type(obj))\n",
" if shape is not None:\n",
" print(f\" • {key}: shape={shape}, dtype={dtype}\")\n",
" else:\n",
" t = type(obj).__name__\n",
" size = len(obj) if hasattr(obj, \"__len__\") else \"-\"\n",
" print(f\" • {key}: type={t}, len={size}\")\n",
"else:\n",
" print(\"⚠️ Nessun elemento trovato in adata.uns.\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "a8627c7e-1641-483a-817e-7836fc039d83",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Carico metadati da: ../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/cell_metadata.csv[INFO] Create colonne combinate da 'TrueCellType' + 'TrueGenotype'\n",
"[INFO] Colonna 'TrueCombo' aggiunta ad adata.obs (5000 celle, 0 mancanti)"
]
}
],
"source": [
"# === BLOCCO RICARICA GROUND TRUTH (per ARI/NMI) ===\n",
"import pandas as pd\n",
"import os\n",
"\n",
"meta_csv = \"../tests/sim_c5000_g10000_v20000_ct5_gt4_es0.3_vs0.1_ee0.3_ve0.3_ln0.3/cell_metadata.csv\"\n",
"\n",
"if not os.path.exists(meta_csv):\n",
" raise FileNotFoundError(f\"[ERRORE] File metadati non trovato: {meta_csv}\")\n",
"\n",
"print(f\"[INFO] Carico metadati da: {meta_csv}\")\n",
"meta = pd.read_csv(meta_csv)\n",
"\n",
"# --- Identifica colonna delle celle ---\n",
"if \"Cell\" in meta.columns:\n",
" meta = meta.set_index(\"Cell\")\n",
"elif \"cell_id\" in meta.columns:\n",
" meta = meta.set_index(\"cell_id\")\n",
"else:\n",
" raise ValueError(f\"[ERRORE] Nessuna colonna 'Cell' o 'cell_id' trovata in {meta.columns.tolist()}\")\n",
"\n",
"# --- Identifica colonne di tipo e genotipo ---\n",
"celltype_col = next((c for c in [\"TrueCellType\", \"CellType\", \"cell_type\"] if c in meta.columns), None)\n",
"genotype_col = next((c for c in [\"TrueGenotype\", \"Genotype\", \"genotype\"] if c in meta.columns), None)\n",
"\n",
"if celltype_col is None and genotype_col is None:\n",
" raise ValueError(f\"[ERRORE] Nessuna colonna di tipo cellula o genotipo trovata in {meta.columns.tolist()}\")\n",
"\n",
"# --- Crea la colonna combinata TrueCombo ---\n",
"if celltype_col and genotype_col:\n",
" meta[\"TrueCombo\"] = meta[celltype_col].astype(str) + \"_\" + meta[genotype_col].astype(str)\n",
" print(f\"[INFO] Create colonne combinate da '{celltype_col}' + '{genotype_col}'\")\n",
"elif celltype_col:\n",
" meta[\"TrueCombo\"] = meta[celltype_col].astype(str)\n",
" print(f\"[INFO] Usata sola colonna cell_type '{celltype_col}' come TrueCombo\")\n",
"elif genotype_col:\n",
" meta[\"TrueCombo\"] = meta[genotype_col].astype(str)\n",
" print(f\"[INFO] Usata sola colonna genotype '{genotype_col}' come TrueCombo\")\n",
"\n",
"# --- Unisci con adata.obs ---\n",
"adata.obs = adata.obs.join(meta[[\"TrueCombo\"]], how=\"left\")\n",
"\n",
"# --- Controlli ---\n",
"if \"TrueCombo\" not in adata.obs.columns:\n",
" raise ValueError(\"[ERRORE] Non è stata creata correttamente la colonna TrueCombo.\")\n",
"n_missing = adata.obs[\"TrueCombo\"].isna().sum()\n",
"print(f\"[INFO] Colonna 'TrueCombo' aggiunta ad adata.obs ({len(adata.obs)} celle, {n_missing} mancanti)\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "baf95f33-2fe8-4766-a855-37959d1163e4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[INFO] Calcolo ARI/NMI Muon-style per le colonne int_clust_*\n",
"[INFO] Risoluzioni trovate (ordinate): [0.5, 1.0, 1.5, 2.0]\n",
"[INFO] res=0.5 (int_clust_0.5) → ARI=0.414944 | NMI=0.473148\n",
"[INFO] res=1.0 (int_clust_1) → ARI=0.532132 | NMI=0.562063[INFO] res=1.5 (int_clust_1.5) → ARI=0.535991 | NMI=0.555448\n",
"[INFO] res=2.0 (int_clust_2) → ARI=0.297338 | NMI=0.428204\n",
"\n",
"=== RISULTATI FINALI (Muon-style) ===\n",
" method resolution colname ARI_MuonCombo NMI_MuonCombo\n",
"ConcatPCA 0.5 int_clust_0.5 0.414944 0.473148\n",
"ConcatPCA 1.0 int_clust_1 0.532132 0.562063\n",
"ConcatPCA 1.5 int_clust_1.5 0.535991 0.555448\n",
"ConcatPCA 2.0 int_clust_2 0.297338 0.428204"
]
}
],
"source": [
"# === BLOCCO UNICO: ARI MUON-STYLE ADATTATO A SCVAR (fix colnames) ===\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score\n",
"\n",
"print(\"[INFO] Calcolo ARI/NMI Muon-style per le colonne int_clust_*\")\n",
"\n",
"# --- Trova tutte le colonne int_clust_* ---\n",
"int_cols = [c for c in adata.obs.columns if c.startswith(\"int_clust_\")]\n",
"if not int_cols:\n",
" raise ValueError(\"Nessuna colonna int_clust_* trovata in adata.obs!\")\n",
"\n",
"# --- Verifica presenza colonna TrueCombo ---\n",
"if \"TrueCombo\" not in adata.obs.columns:\n",
" raise ValueError(\"Colonna 'TrueCombo' mancante in adata.obs — impossibile calcolare ARI/NMI.\")\n",
"\n",
"# --- Costruisci mappa colonna -> risoluzione (float), evitando startswith ---\n",
"pairs = []\n",
"for c in int_cols:\n",
" suf = c.replace(\"int_clust_\", \"\", 1)\n",
" try:\n",
" res = float(suf)\n",
" except ValueError:\n",
" # gestisce casi tipo int_clust_XYZ: skip con warning\n",
" print(f\"[WARN] Suffix non numerico per colonna {c} — salto.\")\n",
" continue\n",
" pairs.append((res, c))\n",
"\n",
"if not pairs:\n",
" raise ValueError(\"Nessuna colonna int_clust_* con suffisso numerico valida.\")\n",
"\n",
"# Ordina per risoluzione (e poi per nome colonna per stabilità)\n",
"pairs = sorted(pairs, key=lambda x: (x[0], x[1]))\n",
"res_values = [p[0] for p in pairs]\n",
"print(f\"[INFO] Risoluzioni trovate (ordinate): {res_values}\")\n",
"\n",
"results_concat = []\n",
"\n",
"# --- Loop diretto su (res, colname) già accoppiati correttamente ---\n",
"for res, colname in pairs:\n",
" if colname not in adata.obs.columns:\n",
" print(f\"[WARN] Colonna {colname} non presente in adata.obs — salto.\")\n",
" continue\n",
"\n",
" valid = adata.obs.dropna(subset=[colname, \"TrueCombo\"])\n",
" if valid.empty:\n",
" print(f\"[WARN] Nessun dato valido per {colname}.\")\n",
" continue\n",
"\n",
" # Calcolo ARI e NMI rispetto a TrueCombo\n",
" y_true = valid[\"TrueCombo\"].astype(str)\n",
" y_pred = valid[colname].astype(str)\n",
" ari = adjusted_rand_score(y_true, y_pred)\n",
" nmi = normalized_mutual_info_score(y_true, y_pred)\n",
"\n",
" results_concat.append({\n",
" \"method\": \"ConcatPCA\",\n",
" \"resolution\": float(res),\n",
" \"colname\": colname,\n",
" \"ARI_MuonCombo\": float(ari),\n",
" \"NMI_MuonCombo\": float(nmi),\n",
" })\n",
" print(f\"[INFO] res={res} ({colname}) → ARI={ari:.6f} | NMI={nmi:.6f}\")\n",
"\n",
"# --- Output finale ---\n",
"if results_concat:\n",
" df_concat = pd.DataFrame(results_concat).sort_values([\"resolution\", \"colname\"]).reset_index(drop=True)\n",
" print(\"\\n=== RISULTATI FINALI (Muon-style) ===\")\n",
" print(df_concat.to_string(index=False))\n",
"else:\n",
" print(\"[WARN] Nessun risultato ARI calcolato.\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:scvar_env]",
"language": "python",
"name": "conda-env-scvar_env-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Metadata-Version: 2.4
Name: scVAR
Version: 0.0.1
Summary: A tool to integrate genomics and transcriptomics in scRNA-seq data.
Author: Samuele Manessi
Author-email: samuele.manessi@itb.cnr.it
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scanpy
Requires-Dist: torch
Requires-Dist: umap
Requires-Dist: leidenalg
Requires-Dist: igraph
Requires-Dist: anndata
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: matplotlib
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary
# scVAR
**scVAR** is a computational tool for extracting and integrating genetic variants from single-cell RNA-seq (scRNA-seq) data. It uses variational autoencoders to construct a latent space that combines transcriptional and genetic signals, helping to resolve cellular heterogeneity — particularly in complex diseases such as leukemia.
## 🔍 Motivation
Leukemias like AML and B-ALL exhibit high genetic and transcriptomic heterogeneity, making clonal analysis particularly challenging. Although scRNA-seq is widely used to study gene expression, it also contains valuable information on genetic variants. **scVAR** leverages this dual information to jointly analyze transcriptional and genetic signals from the same dataset, without requiring matched DNA sequencing.
## 🧠 What It Does
- Detects expressed genetic variants directly from scRNA-seq data
- Integrates transcriptomic and variant information using multi-input variational autoencoders
- Builds a shared latent space capturing both omics layers
- Enhances detection of rare subclones and subtle transcriptional states
- Recovers structure often missed when analyzing transcriptomic or genomic data in isolation
## 📊 Use Cases
- Clonal architecture analysis in AML and B-ALL
- Interpretation of relapse samples
- Joint modeling of gene expression and mutational signals
- Effective utilization of sparse variant data from 10x Genomics 5′ scRNA-seq
## 📁 Data & Results
In AML samples, **scVAR** identified subclones with distinct transcriptional programs that were not detectable using gene expression or variant data alone. In B-ALL, it revealed fine-grained cellular structures and helped disentangle overlapping transcriptional and genetic signals.
## 🚀 Getting Started
An example workflow is provided in the `example/` folder, and a Jupyter notebook is provided in the `notebooks/` folder.
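The minimal sketch below mirrors the calls used in the synthetic-dataset notebook; the input file names and hyperparameter values are taken from that example, and `path/to/10x` is a placeholder for your 10x matrix folder, so adapt them to your own data.
```
import scVAR

# Transcriptomic analysis: 10x matrix folder plus the cell barcodes of the variant matrix
adata = scVAR.transcriptomicAnalysis(path_10x="path/to/10x", bcode_variants="barcodes_var.tsv", n_pcs=50)

# Variant analysis: variant-by-cell matrix, matching barcodes and variant list
adata = scVAR.variantAnalysis(adata, matrix_path="consensus_filtered_markdup.mtx",
                              bcode_path="barcodes_var.tsv", variants_path="variants_filtered_markdup.txt",
                              n_pcs=30, variant_filter_level="medium", variant_rep="muon")

# Integrate the two modalities and cluster the joint latent space
adata = scVAR.omicsIntegration(adata, latent_dim=400, num_epoch=3500,
                               lam_align=0.5, lam_cross=7.7, lam_recon_a=1.0, lam_recon_b=0.8)
adata = scVAR.calcOmicsClusters(adata, omic_key="int", res=1.0)
```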
## 🛠️ Installation
To install **scVAR**, create a new environment using `mamba` and install the package from source:
```
mamba create -n scvar_env python=3.10
mamba activate scvar_env
git clone http://www.bioinfotiget.it/gitlab/custom/scvar.git
cd scvar
pip install .
```
**Note:** scVAR requires **Python == 3.10**.
## 📜 License
Distributed under the MIT License. See the `LICENSE` file for more information.
LICENSE
README.md
setup.py
scVAR/__init__.py
scVAR/scVAR.py
scVAR/scVAR_muon.py
scVAR.egg-info/PKG-INFO
scVAR.egg-info/SOURCES.txt
scVAR.egg-info/dependency_links.txt
scVAR.egg-info/requires.txt
scVAR.egg-info/top_level.txt
numpy
pandas
scanpy
torch
umap
leidenalg
igraph
anndata
scikit-learn
scipy
matplotlib
"""
scVAR package initialization
============================
Expose main analysis functions for transcriptomic and variant integration.
"""
from .scVAR import (
transcriptomicAnalysis,
variantAnalysis,
calcOmicsClusters,
weightsInit,
save_all_umaps,
omicsIntegration,
pairedIntegrationTrainer,
distributionClusters,
)
__all__ = [
"transcriptomicAnalysis",
"variantAnalysis",
"calcOmicsClusters",
"weightsInit",
"omicsIntegration",
"save_all_umaps",
"pairedIntegrationTrainer",
"distributionClusters",
]