# Single cell analyses

This repository contains the key scripts used to generate the final tables and figures featured in the accompanying manuscript:  
**Zonari E, Naldini MM, Barcella M, Volpin M, et al**, _Ex Vivo Expansion of Hematopoietic Stem and Progenitor Cells from Human Mobilized Peripheral Blood for Gene Therapy Applications_, 2026.  

## Abstract 

Ex vivo expansion of mobilized peripheral blood (mPB) hematopoietic stem cells
(HSCs) represents a promising approach to advance cell and gene therapy strategies
yet is hampered by loss of stem cell function when applying commonly used culture protocols. We performed in-depth characterization of mPB expansion cultures by
single cell RNA sequencing, which highlighted differentiation trajectories with
preservation of lineage fidelity in committed progenitors. Defining a putative HSC
cluster allowed an estimation of transduction efficiency in ex vivo cultures, which
correlated with long-term gene marking in xenografts and patients enrolled in a gene
therapy study. We then developed a clinically translatable, GMP-compliant process to
expand lentivirus (LV)-transduced HSCs from mPB of pediatric patients and adult
donors, by biologically informed protocol improvements of cytokine supplementation,
media choice, timing of LV transduction and combinations of small molecules
preventing the activation of differentiation programs. Our optimized process
outperforms validated state-of-the-art cord blood expansion protocols when applied to
mPB. LV integration site analysis and genomic barcode-based clonal tracking provided
definitive proof for symmetric HSC self-renewal divisions occurring during ex vivo
culture. These results warrant clinical testing of this HSC transduction/expansion
process in an upcoming clinical gene therapy trial for autosomal recessive
osteopetrosis (EU CT 2024-518972-30).  

PubMed:  
DOI:  
GEO: [GSE292604](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE292604) and [GSE292606](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE292606) for Fig. 2H and Fig. 2C datasets respectively.

## Data analysis workflow
### Dataset Fig2C

![Fig2C](https://www.dropbox.com/scl/fi/lzb8u6oj4bvd0xf6977aw/Dataset2C.png?rlkey=t0g2g067pr7ukcdludv3rgh8j&dl=1)

Figures 2C,D,E,F : [Figures_2CDEF.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Figures_2CDEF.R).   

Figure 2G: [Figures_2G.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Figures_2G.R). 

#### Basic analysis
The **scRNAseq basic** analysis was conducted according to the workflow outlined below:

1. QC: min.cells = 20; min_feature: between 200 and 6000; max pct.mito = 20.
Normalization (default seurat settings)
Scaling (with following variables to regress out: percent.mt + nCount_RNA and CC.Difference calculated as show in [vignette](https://satijalab.org/seurat/articles/cell_cycle_vignette.html#alternate-workflow-1))
3. Dimensionality reduction: PCA (top 25PCs using 15% most variable genes of whole gene list)
4. Harmony batch removal (sample/ orig.ident)
5. Clustering (Louvain improved: algorithm number 2 in FindCluster Seurat function)
6. Markers identification (resolution 0.6 + additional subclustering of clusters 5,7 and 9 with FindSubCluster Seurat function)
7. Automatic cell type annotation using singleR (v1.8.1) and Sakurai et al (23) as reference dataset.

Custom annotation of cells was then performed according to Sakurai classifcation and markers inspection providing two different annotations levels: Classification variable (more refined) and Population variable - less granular as shown in Fig2C.  
The complete list of markers produced by FindAllMarkers seurat function for all resolutions and annotation varibles (logfc.threshold = 0, min.pct = 0.2) is included in Data file S2.  
see: [FindAllMarkers_Population_dataset_Fig2C.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/FindAllMarkers_Population_dataset_Fig2C.R) and result in .rds format: [Full_GSEA_markers_Population.rds](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Full_GSEA_markers_Population.rds)

#### Evaluation of expansion culture on mPB CD34+ cells

To assess the effect of expansion culture on mPB CD34+ cells we identified genes
significantly deregulated across culture timepoints starting from uncultured (day0) cells to day4
and day8 expansion timepoints. To correct for differences in cell population abundances, we
performed all comparisons across timepoints within each cell population (Classification label).
First, we identified differentially expressed genes between timepoints (day4 vs day0; day8 vs day4
and day8 vs day0) by the FindMarker function from Seurat R package. According to fold change
direction we defined different patterns of modulation that could be simplified in up or down
regulated across culture (consistently up/down, early or late up/down according to day4 and day8
significance and direction). We next looked for a more comprehensive set of genes which could
summarize a global culture effect across most populations, and selected sets of common up and
down regulated genes shared in at least 4 populations out of 7 identified by our Population label.
Lists of common up/down regulated genes across culture were used as input for ORA using
clusterprofiler R package (v4.7.1) 

Code details and input for Supplementary Figure 2 can be found in:
- [Supplemental_fig2_part1.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Supplemental_fig2_part1.R)
- [Supplemental_fig2_part2.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Supplemental_fig2_part2.R)
- [Supplemental_fig2_input.rds](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Supplemental_fig2_input.rds)

### Dataset Fig2H

![Fig2H](https://www.dropbox.com/scl/fi/h3jhl7q67vw74nfh72epx/UMAP_dataset2H.png?rlkey=jkznuqwbs5fqty8wkprbclheq&dl=1) 

Figure 2H: [scGate_Fig2C_dataset_to_Fig2H_dataset.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/scGate_Fig2C_dataset_to_Fig2H_dataset.R). 

Figure 2I: [scGate_Fig2C_dataset_to_Fig2I_dataset.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/scGate_Fig2C_dataset_to_Fig2I_dataset.R)

#### Basic analysis
The **scRNAseq basic** analysis was conducted according to the workflow outlined below:

1. QC: min.cells = 20; min_feature: between 200 and 6000; max pct.mito = 20.
Normalization (default seurat settings)
2. ALRA (Imputation with default parameters in RunALRA function)
2. Scaling (with following variables to regress out: percent.mt + nCount_RNA and CC.Difference calculated as show in [vignette](https://satijalab.org/seurat/articles/cell_cycle_vignette.html#alternate-workflow-1))
3. Dimensionality reduction: PCA (top 25PCs using 15% most variable genes of whole gene list)
4. Harmony batch removal (timepoint (tp variable) + patient(pt variable))
5. Clustering (Louvain improved: algorithm number 2 in FindCluster Seurat function)
6. Markers identification (resolution 0.6 + RefinedAnnotation variable post scGATE annotation with refinement - see scripts below)

#### scGate annotation using Fig2C as reference dataset

Cell classification was performed by employing scGATE (v1.6.2) R package (61) using the Fig.2C datasetas reference. In particular we build a gating model by
using the top 15 positive markers for each population (level 1). For details see [markersdir_scGate_input](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/tree/main/markersdir_scGate_input) folder.     

The workflow consists of two steps:
- building of the model and the raw annotation using scGATE
- refinement of multiple annotation labels into final label according to custom rules as shown in the code. 

All details can be found in the script [scGate_Fig2C_dataset_to_Fig2H_dataset.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/scGate_Fig2C_dataset_to_Fig2H_dataset.R).  


#### WPRE quantification and mapping 

To evaluate the transduction efficiency of Fig2H patients samples, we assessed for expression
of WPRE (LV specific gene) transcripts as surrogate of the IDUA LV-derived transgene, which
itself could not be discriminated by IDUA transcripts originating from the native gene. Alignment
and quantification of WPRE transcripts was ensured by adding the WPRE sequence to the
reference human genome in cellranger (10X Genomics). Transduction efficiency (TE) was hence 
calculated as the fraction of cells bearing the WPRE transcripts within each sample and/or cell
population.

Code for performing quantification and mapping can be found in:  

- [WPRE_quantification.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/WPRE_quantification.R)
- [Fig2I.R](http://www.bioinfotiget.it/gitlab/custom/zonari_mpbhscexp_2025/zonari_mpbhscexp_2025_scrnaseq/-/blob/main/Fig2I.R)