Computational Methods in Transcriptomics
1. Data Preprocessing
1.1 Quality Control of Sequencing Reads
Why?
Sequencers make mistakes — bad reads = wrong answers
Tools: FastQC, MultiQC
| Step | MultiQC |
|---|---|
| 1 | Run FastQC on
.fastq files |
| 2 | Check Per Base Quality (green = good, red = bad) |
| 3 | Trim low-quality ends and
adapters with Trim Galore! or any other
tool |
Real Example:
Lung Cancer RNA-Seq
- 100 patient samples
FastQCshowed adapter contamination in 30 samples
- Trimmed → recovered 15% more reads → found EGFR mutation missed before
1.2 Read Alignment and Mapping
Goal: Match each 150-letter read to the human genome map.
Tools:
| Tool | Type | Speed |
|---|---|---|
| STAR | Spliced Alignment | Fast |
| HISTA2 | Graph-based | Memory-light |
| Bowtie2 | Short reads | Ultra-fast |
Step-by-step (STAR):
1
STAR --genomeDir hg38_index --readFilesIn sample_R1.fastq sample_R2.fastq --outSAMtype BAM
Real Example:
Alzheimer’s Brain Study
- Used STAR to align 1 billion reads
- Found 90% mapped → 10% unmapped = novel RNAs
- → Discovered new lncRNA near APP gene
1.3 Handling Alternative Splicing & Novel Transcripts
Problem: One gene → multiple RNA versions (isoforms)
Tools:
| Tool | What it does |
|---|---|
| StringTie | Assemble transcripts |
| Salmon (quasi-mapping) | Quantifies isoforms |
Real Example:
Muscular Dystrophy
- Patient had normal DMD gene DNA but weak muscles
- StringTie found exon 45 skipped → short protein
- → Exon-skipping therapy (Exondys 51) saved walking ability
2. Quantification of Gene Expression
2.1 RPKM, FPKM, TPM – What’s the Difference?
| Step | Formula |
|---|---|
| RPKM/FPKM | $\frac{reads * 10^9}{\text {gene length} * \text {total reads}}$ |
| TPM | $\frac{\frac{reads}{\text{gene length}}} * {\frac{10^6}{sum(\text{all TPM)}}$ |
Why TPM wins:
- Gene A: 1000 reads, 1000 bp
- Gene B: 1000 reads, 2000 bp → RPKM says equal, but TPM says Gene A is 2x higher
Real Example:
Liver vs. Brain Comparison
- Used TPM → ALB (liver protein) 100 x higher in liver
- RPKM gave wrong ratio due to library size
2.2 Tools for Quantification
| Tools | Method | Speed |
|---|---|---|
| Salmon | Quasi-mapping (no alignment) | <10 min |
| Kallisto | Pseudoalignment | <5 min |
| featureCounts | After STAR | Accurate |
Real Example:
COVID-19 Drug Screen
- 10,000 compounds, 96-well plates
- Used Kallisto → quantified 20,000 genes in 3 minutes per plate
- → Found remdesivir blocks viral RNA polymerase
3. Differential Expression Analysis
3.1 Statistical Tools: DESeq2, edgeR
| Tool | Input | Output |
|---|---|---|
| DESeq2 | Raw counts | log2FoldChange, p-value, padj |
| edgeR | Raw counts | Same + better for small n |
DESeq2 Formula (simplified):
1
2
3
dds <- DESeqDataSetFromMatrix(countData, colData, design = ~condition)
dds <- DESeq(dds)
results <- results(dds, contrast=c("condition","disease","control"))
3.2 Fold Change, p-value, FDR
| Term | Meaning | Threshold |
|---|---|---|
| log2FC > 1 | 2x upregulated | |
| p < 0.05 | Statistically significant | |
| FDR < 0.05 | Controls false positives | Use this! |
Volcano plot
1
2
x = log2FC
y = -log10(padj)
→ Top-right = up in disease, significant #### Real Example:
Parkinson’s vs. Healthy Brain
- DESeq2 → 1,200 genes changed
- Top hit: SNCA (alpha-synuclein) ↑ 4-fold, FDR = 10−15
- → New drug targeting SNCA RNA in clinical trial
4. Functional Annotation & Pathway Analysis
4.1 Gene Ontology (GO) & KEGG
| Database | What it tells you |
|---|---|
| GO | Biological Process, Cellular Component, Molecular Function |
| KEGG | Metabolic pathways, diseases |
Real Example:
Obesity RNA-Seq
- 500 upregulated genes → input to Enrichr
- Top GO: “Lipid metabolic process” (p = 10−25)
- KEGG: “PPAR signaling pathway”
- → Drug pioglitazone activates PPAR → weight loss in trial
4.2 Tools: DAVID, Enrichr, GSEA
| Tool | Best For |
|---|---|
| Enrichr | Fast, web-based |
| GSEA | Pathway ranking (uses all genes) |
| clusterProfiler (R) | Custom plots |
GSEA Example (2025):
Cancer vs. Normal
- GSEA showed “Hypoxia pathway” enriched (NES = 2.8)
→ Tumors resist chemotherapy in low oxygen
→ New hypoxia-activated prodrug in Phase II
5. Visualization of Transcriptomic Data
5.1 Heatmaps, Volcano, MA Plots
| Plot | What it shows |
|---|---|
| Heatmap | Expression patterns across samples |
| Volcano | Significance vs. fold change |
| MA PLot | log ratio vs. mean expression |
Real Example:
5 Cancer Types
- Heatmap → clear clusters: breast, lung, colon
- → AI model 99% accurate in diagnosing cancer type from RNA
5.2 Dimensionality Reduction: PCA, t-SNE, UMAP
| Method | Use |
|---|---|
| PCA | Find main sources of variation |
| UMAP | Visualize clusters (better than t-SNE) |
PCA Example:
scRNA-Seq of Immune Cells
- PCA → PC1 = T cells vs. B cells
- PC2 = activated vs. resting
→ Found new exhausted T-cell state in chronic infection
5.3 Tools: R, Python, Tableau
| Tool | Best For |
|---|---|
| R (ggplot2, pheatmap) | Publication-quality |
| Python (scanpy, seaborn) | scRNA-Seq |
| Tableau | Interactive dashboards |
Real Example:
Hospital Dashboard
- Tableau + RNA-Seq → live cancer subtype predictor
- Surgeon uploads biopsy → result in 2 hours
First Analysis Pipeline
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# 1. QC
fastqc *.fastq
multiqc .
# 2. Trim
trim_galore --paired R1.fastq R2.fastq
# 3. Align & Quantify
salmon quant -i hg38_index -l A -1 R1_trimmed.fq -2 R2_trimmed.fq -o quant
# 4. Differential Expression (R)
library(DESeq2)
counts <- tximport("quant.sf", type="salmon")
dds <- DESeqDataSetFromTximport(counts, colData, ~condition)
dds <- DESeq(dds)
res <- results(dds)
# 5. Visualize
volcanoPlot(res)
Summary
| Step | Tool |
|---|---|
| QC | FastQC |
| Align | STAR |
| Quantify | Salmon |
| DE | DESeq2 |
| Pathway | GSEA |
| Plot | UMAP |