A Coding Guide to Building a Complete Pipeline for Single-Cell Sequencing Analysis Using Scanpy for Clustering and Cell Type Annotation

In this tutorial, we build a complete pipeline for single-cell RNA sequencing using Scanpy. We start by installing the necessary libraries and loading the PBMC 3k dataset, then perform quality control, sorting, and normalization to prepare the data for upstream analysis. We then identified the most mutated genes, performed dimensionality reduction PCA, and constructed a spatial graph to generate UMAP embeddings and Leiden clusters. Through marker gene detection and visualization, we examine how clusters correspond to biological cells and use a simple rule-based annotation strategy to understand cell types.
import sys
import subprocess
import importlib
def pip_install(*packages):
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])
required = [
"scanpy",
"anndata",
"leidenalg",
"igraph",
"harmonypy",
"seaborn"
]
pip_install(*required)
import os
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
import anndata as ad
sc.settings.verbosity = 2
sc.settings.set_figure_params(dpi=110, facecolor="white", frameon=False)
np.random.seed(42)
print("Scanpy version:", sc.__version__)
adata = sc.datasets.pbmc3k()
adata.var_names_make_unique()
print("nInitial AnnData:")
print(adata)
adata.layers["counts"] = adata.X.copy()
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
print("nQC summary:")
display(
adata.obs[["n_genes_by_counts", "total_counts", "pct_counts_mt"]].describe().T
)
We install all the necessary dependencies and import the computer science libraries needed for the analysis. We configure the Scanpy settings, start the environment, and load the PBMC 3k single-cell RNA-seq dataset. We then calculated quality control metrics, including percentage of mitochondrial genes, total counts, and number of detected genes, for each cell.
fig, axs = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.violin(adata, ["n_genes_by_counts"], jitter=0.4, ax=axs[0], show=False)
sc.pl.violin(adata, ["total_counts"], jitter=0.4, ax=axs[1], show=False)
sc.pl.violin(adata, ["pct_counts_mt"], jitter=0.4, ax=axs[2], show=False)
plt.tight_layout()
plt.show()
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")
adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
adata = adata[adata.obs["n_genes_by_counts"] <= 5000].copy()
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
sc.pp.filter_genes(adata, min_cells=3)
print("nAfter filtering:")
print(adata)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(
adata,
flavor="seurat",
min_mean=0.0125,
max_mean=3,
min_disp=0.5
)
print("nHighly variable genes selected:", int(adata.var["highly_variable"].sum()))
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var["highly_variable"]].copy()
We visualize quality control metrics using plots to assess the distribution of gene counts and mitochondrial content. We use filtering steps to remove low-quality cells and genes that do not meet basic expression parameters. We then normalized the data, applied log transformation, and identified the most variable genes for downstream analysis.
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True)
sc.pl.pca(adata, color=None)
sc.pp.neighbors(adata, n_neighbors=12, n_pcs=30, metric="euclidean")
sc.tl.umap(adata, min_dist=0.35, spread=1.0)
sc.tl.leiden(adata, resolution=0.6, key_added="leiden")
print("nCluster counts:")
display(adata.obs["leiden"].value_counts().sort_index().rename("cells_per_cluster").to_frame())
sc.pl.umap(adata, color=["leiden"], legend_loc="on data", title="PBMC 3k - Leiden clusters")
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
marker_table = sc.get.rank_genes_groups_df(adata, group=None)
print("nTop marker rows:")
display(marker_table.head(20))
We reverse engineer aggregates and scale the dataset to prepare it for dimensionality reduction. We performed principal component analysis to capture the structure of the large variance of the dataset. We then construct a local graph, calculate UMAP embeddings, perform a Leiden cluster, and identify marker genes for each cluster.
top_markers_per_cluster = (
marker_table.groupby("group")
.head(10)
.loc[:, ["group", "names", "logfoldchanges", "pvals_adj"]]
.reset_index(drop=True)
)
print("nTop 10 markers per cluster:")
display(top_markers_per_cluster)
candidate_markers = [
"IL7R", "LTB", "MALAT1", "CCR7",
"NKG7", "GNLY", "PRF1",
"MS4A1", "CD79A", "CD79B",
"LYZ", "S100A8", "FCER1A", "CST3",
"PPBP", "FCGR3A", "LGALS3", "CTSS",
"CD3D", "TRBC1", "TRAC"
]
candidate_markers = [g for g in candidate_markers if g in adata.var_names]
if candidate_markers:
sc.pl.dotplot(
adata,
var_names=candidate_markers,
groupby="leiden",
standard_scale="var",
dendrogram=True
)
sc.pl.matrixplot(
adata,
var_names=candidate_markers,
groupby="leiden",
standard_scale="var",
dendrogram=True
)
cluster_marker_reference = {
"T_cells": ["IL7R", "LTB", "CCR7", "CD3D", "TRBC1", "TRAC"],
"NK_cells": ["NKG7", "GNLY", "PRF1"],
"B_cells": ["MS4A1", "CD79A", "CD79B"],
"Monocytes": ["LYZ", "FCGR3A", "LGALS3", "CTSS", "S100A8", "CST3"],
"Dendritic_cells": ["FCER1A", "CST3"],
"Platelets": ["PPBP"]
}
We examine the most important marker genes found in each cluster and summarize the top markers. We visualize gene expression patterns across clusters using dot plots and matrix plots of known immune markers. We also describe a reference map of marker genes associated with immune cell types for later interpretation.
available_reference = {
celltype: [g for g in genes if g in adata.var_names]
for celltype, genes in cluster_marker_reference.items()
}
available_reference = {k: v for k, v in available_reference.items() if len(v) > 0}
for celltype, genes in available_reference.items():
sc.tl.score_genes(adata, gene_list=genes, score_name=f"{celltype}_score", use_raw=False)
score_cols = [f"{ct}_score" for ct in available_reference.keys()]
cluster_scores = adata.obs.groupby("leiden")[score_cols].mean()
display(cluster_scores)
cluster_to_celltype = {}
for cluster in cluster_scores.index:
best = cluster_scores.loc[cluster].idxmax().replace("_score", "")
cluster_to_celltype[cluster] = best
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).astype("category")
print("nCluster to cell-type mapping:")
display(pd.DataFrame.from_dict(cluster_to_celltype, orient="index", columns=["assigned_cell_type"]))
sc.pl.umap(
adata,
color=["leiden", "cell_type"],
legend_loc="on data",
wspace=0.45
)
sc.tl.rank_genes_groups(adata, groupby="cell_type", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)
celltype_markers = sc.get.rank_genes_groups_df(adata, group=None)
print("nTop markers by annotated cell type:")
display(
celltype_markers.groupby("group").head(8)[["group", "names", "logfoldchanges", "pvals_adj"]]
)
cluster_prop = (
adata.obs["cell_type"]
.value_counts(normalize=True)
.mul(100)
.round(2)
.rename("percent")
.to_frame()
)
print("nCell-type proportions (%):")
display(cluster_prop)
plt.figure(figsize=(7, 4))
cluster_prop["percent"].sort_values().plot(kind="barh")
plt.xlabel("Percent of cells")
plt.ylabel("Cell type")
plt.title("Estimated cell-type composition")
plt.tight_layout()
plt.show()
output_dir = "scanpy_pbmc3k_outputs"
os.makedirs(output_dir, exist_ok=True)
adata.write(os.path.join(output_dir, "pbmc3k_scanpy_advanced.h5ad"))
marker_table.to_csv(os.path.join(output_dir, "cluster_markers.csv"), index=False)
celltype_markers.to_csv(os.path.join(output_dir, "celltype_markers.csv"), index=False)
cluster_scores.to_csv(os.path.join(output_dir, "cluster_score_matrix.csv"))
print(f"nSaved outputs to: {output_dir}")
print("Files:")
for f in sorted(os.listdir(output_dir)):
print(" -", f)
summary = {
"n_cells_final": int(adata.n_obs),
"n_genes_final": int(adata.n_vars),
"n_clusters": int(adata.obs["leiden"].nunique()),
"clusters": sorted(adata.obs["leiden"].unique().tolist()),
"cell_types": sorted(adata.obs["cell_type"].unique().tolist()),
}
print("nAnalysis summary:")
for k, v in summary.items():
print(f"{k}: {v}")
We score each cell using sets of known marker genes and assign possible cell types to clusters based on expression patterns. We visualize the annotated cell types in the UMAP embedding and perform differential gene expression analysis for all predicted cells. Also, we calculate cell type ratios, generate summarized observations, and save the processed dataset and analysis results for further research.
In conclusion, we developed a full end-to-end workflow for analyzing single-cell transcriptomic data using Scanpy. We performed preprocessing, clustering, marker gene analysis, and cell type annotation, and visualized the data structure using UMAP and gene expression plots. By saving AnnData’s processed object and analysis results, we created a reusable dataset for further biological interpretation and advanced modeling. This workflow shows how Scanpy enables large-scale, reproducible single-cell analysis with a structured, modular Python pipeline.
Check it out Full Codes here. Also, feel free to follow us Twitter and don’t forget to join our 120k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.
The post Coding Guide to Building a Complete Pipeline for Single-Cell RNA Sequencing Analysis Using Scanpy for Visual Integration and Cell Type Annotation appeared first on MarkTechPost.



