View on GitHub

Transcriptomics

Fundamentals of Transcriptomics — from RNA sequencing basics to advanced expression analysis.

Computational Methods in Transcriptomics

1. Data Preprocessing

1.1 Quality Control of Sequencing Reads

Why?

Sequencers make mistakes — bad reads = wrong answers

Tools: FastQC, MultiQC

Step	MultiQC
1	Run `FastQC` on `.fastq` files
2	Check Per Base Quality (green = good, red = bad)
3	Trim low-quality ends and adapters with `Trim Galore!` or any other tool

Real Example:

Lung Cancer RNA-Seq

100 patient samples

FastQC showed adapter contamination in 30 samples

Trimmed → recovered 15% more reads → found EGFR mutation missed before

1.2 Read Alignment and Mapping

Goal: Match each 150-letter read to the human genome map.

Tools:

Tool	Type	Speed
STAR	Spliced Alignment	Fast
HISTA2	Graph-based	Memory-light
Bowtie2	Short reads	Ultra-fast

Step-by-step (STAR):

STAR --genomeDir hg38_index --readFilesIn sample_R1.fastq sample_R2.fastq --outSAMtype BAM

Real Example:

Alzheimer’s Brain Study

Used STAR to align 1 billion reads

Found 90% mapped → 10% unmapped = novel RNAs

→ Discovered new lncRNA near APP gene

1.3 Handling Alternative Splicing & Novel Transcripts

Problem: One gene → multiple RNA versions (isoforms)

Tools:

Tool	What it does
StringTie	Assemble transcripts
Salmon (quasi-mapping)	Quantifies isoforms

Real Example:

Muscular Dystrophy

Patient had normal DMD gene DNA but weak muscles

StringTie found exon 45 skipped → short protein

→ Exon-skipping therapy (Exondys 51) saved walking ability

2. Quantification of Gene Expression

2.1 RPKM, FPKM, TPM – What’s the Difference?

Step	Formula
RPKM/FPKM	$\frac{reads * 10^9}{\text {gene length} * \text {total reads}}$
TPM	$\frac{\frac{reads}{\text{gene length}}} * {\frac{10^6}{sum(\text{all TPM)}}$

Why TPM wins:

Gene A: 1000 reads, 1000 bp
Gene B: 1000 reads, 2000 bp → RPKM says equal, but TPM says Gene A is 2x higher

Real Example:

Liver vs. Brain Comparison

Used TPM → ALB (liver protein) 100 x higher in liver

RPKM gave wrong ratio due to library size

2.2 Tools for Quantification

Tools	Method	Speed
Salmon	Quasi-mapping (no alignment)	<10 min
Kallisto	Pseudoalignment	<5 min
featureCounts	After STAR	Accurate

Real Example:

COVID-19 Drug Screen

10,000 compounds, 96-well plates

Used Kallisto → quantified 20,000 genes in 3 minutes per plate

→ Found remdesivir blocks viral RNA polymerase

3. Differential Expression Analysis

3.1 Statistical Tools: DESeq2, edgeR

Tool	Input	Output
DESeq2	Raw counts	log2FoldChange, p-value, padj
edgeR	Raw counts	Same + better for small n

DESeq2 Formula (simplified):

dds <- DESeqDataSetFromMatrix(countData, colData, design = ~condition)
dds <- DESeq(dds)
results <- results(dds, contrast=c("condition","disease","control"))

3.2 Fold Change, p-value, FDR

Term	Meaning	Threshold
log2FC > 1	2x upregulated
p < 0.05	Statistically significant
FDR < 0.05	Controls false positives	Use this!

Volcano plot

x = log2FC
y = -log10(padj)

→ Top-right = up in disease, significant #### Real Example:

Parkinson’s vs. Healthy Brain

DESeq2 → 1,200 genes changed

Top hit: SNCA (alpha-synuclein) ↑ 4-fold, FDR = 10⁻15

→ New drug targeting SNCA RNA in clinical trial

4. Functional Annotation & Pathway Analysis

4.1 Gene Ontology (GO) & KEGG

Database	What it tells you
GO	Biological Process, Cellular Component, Molecular Function
KEGG	Metabolic pathways, diseases

Real Example:

Obesity RNA-Seq

500 upregulated genes → input to Enrichr

Top GO: “Lipid metabolic process” (p = 10⁻25)

KEGG: “PPAR signaling pathway”

→ Drug pioglitazone activates PPAR → weight loss in trial

4.2 Tools: DAVID, Enrichr, GSEA

Tool	Best For
Enrichr	Fast, web-based
GSEA	Pathway ranking (uses all genes)
clusterProfiler (R)	Custom plots

GSEA Example (2025):

Cancer vs. Normal

GSEA showed “Hypoxia pathway” enriched (NES = 2.8)

→ Tumors resist chemotherapy in low oxygen

→ New hypoxia-activated prodrug in Phase II

5. Visualization of Transcriptomic Data

5.1 Heatmaps, Volcano, MA Plots

Plot	What it shows
Heatmap	Expression patterns across samples
Volcano	Significance vs. fold change
MA PLot	log ratio vs. mean expression

Real Example:

5 Cancer Types

Heatmap → clear clusters: breast, lung, colon

→ AI model 99% accurate in diagnosing cancer type from RNA

5.2 Dimensionality Reduction: PCA, t-SNE, UMAP

Method	Use
PCA	Find main sources of variation
UMAP	Visualize clusters (better than t-SNE)

PCA Example:

scRNA-Seq of Immune Cells

PCA → PC1 = T cells vs. B cells

PC2 = activated vs. resting

→ Found new exhausted T-cell state in chronic infection

5.3 Tools: R, Python, Tableau

Tool	Best For
R (ggplot2, pheatmap)	Publication-quality
Python (scanpy, seaborn)	scRNA-Seq
Tableau	Interactive dashboards

Real Example:

Hospital Dashboard

Tableau + RNA-Seq → live cancer subtype predictor

Surgeon uploads biopsy → result in 2 hours

First Analysis Pipeline

# 1. QC
fastqc *.fastq
multiqc .

# 2. Trim
trim_galore --paired R1.fastq R2.fastq

# 3. Align & Quantify
salmon quant -i hg38_index -l A -1 R1_trimmed.fq -2 R2_trimmed.fq -o quant

# 4. Differential Expression (R)
library(DESeq2)
counts <- tximport("quant.sf", type="salmon")
dds <- DESeqDataSetFromTximport(counts, colData, ~condition)
dds <- DESeq(dds)
res <- results(dds)

# 5. Visualize
volcanoPlot(res)

Summary

Step	Tool
QC	FastQC
Align	STAR
Quantify	Salmon
DE	DESeq2
Pathway	GSEA
Plot	UMAP