View on GitHub

Transcriptomics

Fundamentals of Transcriptomics — from RNA sequencing basics to advanced expression analysis.

Computational Methods in Transcriptomics

1. Data Preprocessing

1.1 Quality Control of Sequencing Reads

Why?

Sequencers make mistakes — bad reads = wrong answers

Tools: FastQC, MultiQC

Step MultiQC
1 Run FastQC on .fastq files
2 Check Per Base Quality (green = good, red = bad)
3 Trim low-quality ends and adapters with Trim Galore! or any other tool

Real Example:

Lung Cancer RNA-Seq

  • 100 patient samples
  • FastQC showed adapter contamination in 30 samples
  • Trimmed → recovered 15% more reads → found EGFR mutation missed before

1.2 Read Alignment and Mapping

Goal: Match each 150-letter read to the human genome map.

Tools:

Tool Type Speed
STAR Spliced Alignment Fast
HISTA2 Graph-based Memory-light
Bowtie2 Short reads Ultra-fast

Step-by-step (STAR):

1
STAR --genomeDir hg38_index --readFilesIn sample_R1.fastq sample_R2.fastq --outSAMtype BAM

Real Example:

Alzheimer’s Brain Study

  • Used STAR to align 1 billion reads
  • Found 90% mapped10% unmapped = novel RNAs
  • → Discovered new lncRNA near APP gene

1.3 Handling Alternative Splicing & Novel Transcripts

Problem: One gene → multiple RNA versions (isoforms)

Tools:

Tool What it does
StringTie Assemble transcripts
Salmon (quasi-mapping) Quantifies isoforms

Real Example:

Muscular Dystrophy

  • Patient had normal DMD gene DNA but weak muscles
  • StringTie found exon 45 skipped → short protein
  • → Exon-skipping therapy (Exondys 51) saved walking ability

2. Quantification of Gene Expression

2.1 RPKM, FPKM, TPM – What’s the Difference?

Step Formula
RPKM/FPKM $\frac{reads * 10^9}{\text {gene length} * \text {total reads}}$
TPM $\frac{\frac{reads}{\text{gene length}}} * {\frac{10^6}{sum(\text{all TPM)}}$

Why TPM wins:

Real Example:

Liver vs. Brain Comparison

  • Used TPM → ALB (liver protein) 100 x higher in liver
  • RPKM gave wrong ratio due to library size

2.2 Tools for Quantification

Tools Method Speed
Salmon Quasi-mapping (no alignment) <10 min
Kallisto Pseudoalignment <5 min
featureCounts After STAR Accurate

Real Example:

COVID-19 Drug Screen

  • 10,000 compounds, 96-well plates
  • Used Kallisto → quantified 20,000 genes in 3 minutes per plate
  • → Found remdesivir blocks viral RNA polymerase

3. Differential Expression Analysis

3.1 Statistical Tools: DESeq2, edgeR

Tool Input Output
DESeq2 Raw counts log2FoldChange, p-value, padj
edgeR Raw counts Same + better for small n

DESeq2 Formula (simplified):

1
2
3
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~condition)
dds <- DESeq(dds)
results <- results(dds, contrast=c("condition","disease","control"))

3.2 Fold Change, p-value, FDR

Term Meaning Threshold
log2FC > 1 2x upregulated
p < 0.05 Statistically significant
FDR < 0.05 Controls false positives Use this!

Volcano plot

1
2
x = log2FC
y = -log10(padj)

→ Top-right = up in disease, significant #### Real Example:

Parkinson’s vs. Healthy Brain

  • DESeq2 → 1,200 genes changed
  • Top hit: SNCA (alpha-synuclein) ↑ 4-fold, FDR = 1015
  • → New drug targeting SNCA RNA in clinical trial

4. Functional Annotation & Pathway Analysis

4.1 Gene Ontology (GO) & KEGG

Database What it tells you
GO Biological Process, Cellular Component, Molecular Function
KEGG Metabolic pathways, diseases

Real Example:

Obesity RNA-Seq

  • 500 upregulated genes → input to Enrichr
  • Top GO: “Lipid metabolic process” (p = 1025)
  • KEGG: “PPAR signaling pathway”
  • → Drug pioglitazone activates PPAR → weight loss in trial

4.2 Tools: DAVID, Enrichr, GSEA

Tool Best For
Enrichr Fast, web-based
GSEA Pathway ranking (uses all genes)
clusterProfiler (R) Custom plots

GSEA Example (2025):

Cancer vs. Normal

  • GSEA showed “Hypoxia pathway” enriched (NES = 2.8)

→ Tumors resist chemotherapy in low oxygen

→ New hypoxia-activated prodrug in Phase II

5. Visualization of Transcriptomic Data

5.1 Heatmaps, Volcano, MA Plots

Plot What it shows
Heatmap Expression patterns across samples
Volcano Significance vs. fold change
MA PLot log ratio vs. mean expression

Real Example:

5 Cancer Types

  • Heatmap → clear clusters: breast, lung, colon
  • → AI model 99% accurate in diagnosing cancer type from RNA

5.2 Dimensionality Reduction: PCA, t-SNE, UMAP

Method Use
PCA Find main sources of variation
UMAP Visualize clusters (better than t-SNE)

PCA Example:

scRNA-Seq of Immune Cells

  • PCA → PC1 = T cells vs. B cells
  • PC2 = activated vs. resting

→ Found new exhausted T-cell state in chronic infection

5.3 Tools: R, Python, Tableau

Tool Best For
R (ggplot2, pheatmap) Publication-quality
Python (scanpy, seaborn) scRNA-Seq
Tableau Interactive dashboards

Real Example:

Hospital Dashboard

  • Tableau + RNA-Seq → live cancer subtype predictor
  • Surgeon uploads biopsy → result in 2 hours

First Analysis Pipeline

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# 1. QC
fastqc *.fastq
multiqc .

# 2. Trim
trim_galore --paired R1.fastq R2.fastq

# 3. Align & Quantify
salmon quant -i hg38_index -l A -1 R1_trimmed.fq -2 R2_trimmed.fq -o quant

# 4. Differential Expression (R)
library(DESeq2)
counts <- tximport("quant.sf", type="salmon")
dds <- DESeqDataSetFromTximport(counts, colData, ~condition)
dds <- DESeq(dds)
res <- results(dds)

# 5. Visualize
volcanoPlot(res)

Summary

Step Tool
QC FastQC
Align STAR
Quantify Salmon
DE DESeq2
Pathway GSEA
Plot UMAP