Revolutionizing Single-Cell RNA Sequencing with CSI-GEP
CSI-GEP is a breakthrough computational tool that improves single-cell RNA-seq analysis using scalable, consensus-based NMF. It overcomes limitations of traditional methods by efficiently identifying reproducible gene expression programs, outperforming existing approaches in accuracy and speed.

Over the past decade, single-cell RNA sequencing (scRNA-seq) has transformed biological research, uncovering groundbreaking insights—from the diversity of cell types in the brain to drug-resistant cancer states and previously unknown immune cell functions. As this technology advances, generating large-scale datasets has become faster and more affordable. However, analyzing these massive datasets demands computational methods that can handle their growing complexity.
Currently, most scRNA-seq analyses start with principal component analysis (PCA), followed by nonlinear dimensionality reduction techniques like t-SNE or UMAP to visualize data in 2D. Clustering algorithms such as Louvain or Leiden then identify cell types. But these methods have a major drawback: compressing high-dimensional data into two dimensions can distort biological signals, sometimes leading to conflicting interpretations of similar datasets.
While neural network-based models (like variational autoencoders and transformers) offer scalability, their nonlinearity often produces hard-to-interpret results and risks overfitting. Recent benchmarks even suggest that simpler models sometimes outperform these advanced approaches.
The Promise—and Challenges—of NMF
Non-negative matrix factorization (NMF) presents a compelling alternative. Unlike clustering-based methods, NMF models scRNA-seq data as a combination of gene expression programs (GEPs), where each GEP represents a set of co-expressed genes. This approach captures shared biological processes across cell types rather than just rigid classifications. Its non-negativity constraint also aligns well with gene expression data, making results more interpretable.
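As a rough illustration of the idea above, the sketch below factors a toy cells-by-genes matrix into non-negative usage weights (W) and programs (H). It uses plain NumPy with Lee-Seung multiplicative updates, chosen only for brevity. The function name `nmf`, the toy data, and all parameter values are assumptions made for this example and are not taken from any of the tools named in this article.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate X (cells x genes) as W @ H with W, H >= 0.

    W (cells x k) holds how strongly each cell uses each program;
    H (k x genes) holds the gene weights of each program.
    Uses multiplicative updates for the squared-error loss.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    W = rng.random((n_cells, k)) + eps
    H = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Toy data (hypothetical): 60 cells, 20 genes, two programs on disjoint gene sets.
rng = np.random.default_rng(42)
true_H = np.zeros((2, 20))
true_H[0, :10] = 1.0   # program 1 uses genes 0-9
true_H[1, 10:] = 1.0   # program 2 uses genes 10-19
true_W = rng.random((60, 2))  # each cell mixes both programs
X = true_W @ true_H + 0.01 * rng.random((60, 20))

W, H = nmf(X, k=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Highest-weighted genes of each inferred program (indices into the gene axis).
top_genes = np.argsort(H, axis=1)[:, ::-1][:, :5]
```

Because every cell is a non-negative mixture of the rows of H, a cell can use several programs at once, which is the contrast with hard cluster assignments drawn above.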
Specialized NMF tools like cellHarmony and integrative NMF (iNMF) have been used to uncover key transcriptional programs in diseases like medulloblastoma. Yet, NMF has three major limitations:
- Non-unique solutions – Different runs can yield varying results.
- No reliable way to determine the optimal number of GEPs – Existing methods often disagree on dimensionality.
- Poor scalability – Large datasets or high-rank decompositions become computationally prohibitive.
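The first limitation is easy to see on toy data. The sketch below, again plain NumPy with a hypothetical `nmf` helper and made-up data, runs the same factorization from two random starting points and also rescales one solution by hand. Both steps give factors that differ while fitting the data about equally well.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-9):
    """Multiplicative-update NMF: X (cells x genes) ~ W @ H, with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def rel_error(X, W, H):
    """Relative reconstruction error of the factorization."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Toy matrix (hypothetical): 80 cells x 30 genes built from 3 overlapping programs.
rng = np.random.default_rng(0)
X = rng.random((80, 3)) @ rng.random((3, 30))

# Same data, same rank, two different random initializations.
W_a, H_a = nmf(X, k=3, seed=1)
W_b, H_b = nmf(X, k=3, seed=2)
err_a, err_b = rel_error(X, W_a, H_a), rel_error(X, W_b, H_b)
factors_differ = not np.allclose(H_a, H_b)

# Scaling ambiguity: (W D, D^-1 H) reconstructs X exactly as well as (W, H)
# for any positive diagonal matrix D.
D = np.diag([2.0, 0.5, 4.0])
W_scaled, H_scaled = W_a @ D, np.linalg.inv(D) @ H_a
same_fit = np.allclose(W_a @ H_a, W_scaled @ H_scaled)
```

Part of the run-to-run difference is harmless reordering and rescaling of programs, so comparisons are usually made after normalizing each program and matching programs between runs, for example by cosine similarity.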
Introducing CSI-GEP: A Scalable, Reliable Solution
To overcome these challenges, we developed Consensus and Scalable Inference of Gene Expression Programs (CSI-GEP). Unlike traditional NMF, CSI-GEP:
- Uses GPU acceleration for efficient, large-scale analysis.
- Identifies reproducible GEPs across multiple rank values, avoiding arbitrary dimensionality choices.
- Outperforms leading methods (including iNMF, scVI, and scGPT) in accuracy, scalability, and efficiency.
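To give a feel for what "reproducible across multiple rank values" could mean in practice, here is a minimal CPU-only sketch. It is not CSI-GEP's actual algorithm, which this article does not spell out. It runs a toy NMF at several ranks and seeds, then scores each inferred program by the fraction of other runs that contain a near-identical program (cosine similarity above a threshold). The helper names `nmf` and `program_support`, the 0.9 threshold, and the toy data are all assumptions made for this illustration.

```python
import numpy as np

def nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Multiplicative-update NMF: X (cells x genes) ~ W @ H, with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def program_support(X, ranks, seeds, threshold=0.9):
    """Score each program by how many other runs reproduce it.

    Returns the stacked, length-normalized programs and, for each one,
    the fraction of other (rank, seed) runs holding a program whose
    cosine similarity to it is at least `threshold`.
    """
    spectra, run_ids = [], []
    runs = [(k, s) for k in ranks for s in seeds]
    for run_id, (k, seed) in enumerate(runs):
        _, H = nmf(X, k, seed=seed)
        norms = np.maximum(np.linalg.norm(H, axis=1, keepdims=True), 1e-12)
        spectra.append(H / norms)
        run_ids.extend([run_id] * k)
    spectra = np.vstack(spectra)
    run_ids = np.array(run_ids)
    sim = spectra @ spectra.T  # cosine similarity between all program pairs
    support = np.zeros(len(spectra))
    for i in range(len(spectra)):
        other_run = run_ids != run_ids[i]
        matched_runs = np.unique(run_ids[(sim[i] >= threshold) & other_run])
        support[i] = len(matched_runs) / (len(runs) - 1)
    return spectra, support

# Toy data (hypothetical): 100 cells, 30 genes, 3 programs on disjoint gene sets.
rng = np.random.default_rng(7)
true_H = np.zeros((3, 30))
for p in range(3):
    true_H[p, 10 * p:10 * (p + 1)] = 1.0
X = rng.random((100, 3)) @ true_H + 0.01 * rng.random((100, 30))

spectra, support = program_support(X, ranks=[2, 3, 4], seeds=[0, 1, 2])
reproducible = support >= 0.5  # programs seen again in at least half of the other runs
```

The point of scoring across ranks is that no single rank has to be chosen up front: programs that keep reappearing whatever the rank are retained, while those that show up in only one configuration are treated as unstable.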
In benchmark tests, CSI-GEP was the only method that reliably recovered true GEPs in simulated atlas-scale datasets. It also uncovered biologically meaningful programs in real-world data that other approaches missed.