Transcriptomics

Fundamentals of Transcriptomics — from RNA sequencing basics to advanced expression analysis.

Introduction to Biology and Molecular Biology

This section gives an overview of molecular biology: the central dogma, the structures and functions of key biomolecules, and the role of gene expression in cellular function. It then covers genes and genomes, types of RNA, and the mechanisms that regulate gene expression.

Overview of Molecular Biology

Molecular biology is the branch of biology that seeks to understand the molecular basis of biological activity, particularly how genetic information is stored, replicated, and expressed to drive cellular processes. At its core lies the central dogma of molecular biology, a foundational theory proposed by Francis Crick in 1958 and refined over decades. This dogma describes the unidirectional flow of genetic information: from DNA to RNA to protein. In essence, DNA serves as the blueprint, RNA as the intermediary messenger, and proteins as the functional workhorses of the cell.

To elaborate, during transcription, a segment of DNA is copied into RNA by the enzyme RNA polymerase, producing a complementary RNA strand. This RNA, if it’s messenger RNA (mRNA), then undergoes translation in the cytoplasm, where ribosomes read the RNA sequence in groups of three nucleotides (codons) to assemble amino acids into proteins. While the central dogma is generally unidirectional, exceptions exist, such as reverse transcription in retroviruses, where RNA is converted back to DNA by reverse transcriptase. This framework is crucial because it explains how genetic information is preserved and utilized, forming the basis for fields like genomics and transcriptomics.
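The two steps above can be sketched in a few lines of Python. This is a toy model, not a bioinformatics tool: the codon table lists only a handful of the 64 codons, the template sequence is made up for illustration, and the function names `transcribe` and `translate` are our own.

```python
# Toy model of the central dogma: DNA -> mRNA -> protein.
# Only a few codons are listed; the real genetic code has 64 entries.
CODON_TABLE = {
    "AUG": "M",                          # start codon (methionine)
    "UUU": "F", "UUC": "F",              # phenylalanine
    "GGA": "G", "GGC": "G",              # glycine
    "AAA": "K", "AAG": "K",              # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

# Pairing rules used by RNA polymerase: the RNA carries U where DNA would carry T.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_strand: str) -> str:
    """Return the mRNA (5'->3') complementary to a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_strand)


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X = codon missing from toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


template = "TACAAACCTTTCATT"  # hypothetical template strand, written 3'->5'
mrna = transcribe(template)
print(mrna)             # AUGUUUGGAAAGUAA
print(translate(mrna))  # MFGK
```

Reading starts at the first AUG and stops at UAA, so the five codons yield a four-residue peptide.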

Next, let’s examine the structure and function of DNA, RNA, and proteins. DNA, or deoxyribonucleic acid, is a double-stranded helical molecule composed of nucleotides—each consisting of a deoxyribose sugar, a phosphate group, and one of four nitrogenous bases: adenine (A), thymine (T), cytosine (C), or guanine (G). The strands are held together by hydrogen bonds between complementary bases (A-T and C-G), allowing for stable storage and faithful replication during cell division. DNA’s primary function is to store genetic information, which can be passed from one generation to the next.
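Complementary base pairing is easy to express in code. The sketch below, with function names of our own choosing, derives the partner strand of a short made-up sequence and counts the hydrogen bonds holding the duplex together, using the standard values of two bonds per A-T pair and three per C-G pair.

```python
# Watson-Crick pairing rules for DNA.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5'->3' like the input."""
    return "".join(PAIRS[base] for base in reversed(strand))


def hydrogen_bonds(strand: str) -> int:
    """Count hydrogen bonds in the duplex: 2 per A-T pair, 3 per C-G pair."""
    return sum(3 if base in "CG" else 2 for base in strand)


strand = "ATGC"  # hypothetical four-base example
print(reverse_complement(strand))  # GCAT
print(hydrogen_bonds(strand))      # 10
```

Because each base determines its partner, either strand is enough to rebuild the other, which is what makes faithful replication possible.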

RNA, or ribonucleic acid, differs in several ways: it’s typically single-stranded, uses ribose sugar, and replaces thymine with uracil (U). This single-stranded nature allows RNA to fold into complex three-dimensional structures, enabling diverse functions beyond mere information transfer. Proteins, on the other hand, are polymers of amino acids linked by peptide bonds. Their structure is described at four levels: primary (sequence of amino acids), secondary (alpha helices and beta sheets), tertiary (3D folding of a single chain), and quaternary (multi-subunit assemblies). Proteins execute a vast array of functions, including catalysis (enzymes), structural support (collagen), transport (hemoglobin), and signaling (peptide hormones such as insulin).

The role of gene expression in cellular function cannot be overstated. Gene expression is the process by which the genetic code in DNA is converted into functional products, primarily proteins, that dictate a cell’s identity, behavior, and response to its environment. In a multicellular organism, all cells share the same genome, yet they differentiate into specialized types—such as neurons, muscle cells, or immune cells—through selective gene expression. For instance, in response to environmental signals like hormones or stress, cells can upregulate or downregulate specific genes to adapt, such as activating heat shock proteins during temperature stress. Dysregulated gene expression underlies many diseases, including cancer, where oncogenes are overexpressed. In transcriptomics, we study the RNA products of gene expression to map these dynamic patterns, revealing insights into health and disease.
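In practice, upregulation and downregulation are often summarized as a log2 fold change between two conditions. The following is a minimal sketch under stated assumptions: the read counts are invented, `GENE_X` is a placeholder name, and the pseudocount of 1 and the threshold of 1 (a two-fold change) are common conventions rather than fixed rules. Real analyses also normalize counts and test for statistical significance.

```python
import math


def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of expression; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))


def classify(lfc: float, threshold: float = 1.0) -> str:
    """Label a gene by the direction of change (threshold is an assumption)."""
    if lfc >= threshold:
        return "upregulated"
    if lfc <= -threshold:
        return "downregulated"
    return "unchanged"


# Hypothetical counts as (heat-stressed, control); not real data.
counts = {"HSPA1A": (799, 99), "ACTB": (1000, 1000), "GENE_X": (24, 99)}

for gene, (treated, control) in counts.items():
    lfc = log2_fold_change(treated, control)
    print(f"{gene}: log2FC = {lfc:+.1f} ({classify(lfc)})")
```

With these made-up numbers the heat shock gene comes out at +3.0 (an eight-fold increase), the housekeeping gene at 0.0, and the placeholder gene at -2.0.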

Genes and Genomes

Moving on, let’s define what a gene is and explore its components. A gene is a functional unit of heredity, essentially a segment of DNA that encodes the information necessary to produce a specific RNA molecule, which may then be translated into a protein. Eukaryotic genes are more complex than their prokaryotic counterparts and comprise several key elements. The promoter is a regulatory region upstream of the coding sequence where RNA polymerase and transcription factors bind to initiate transcription. Exons are the segments retained in the mature mRNA, including the protein-coding sequence and the untranslated regions at either end, while introns are intervening sequences that are spliced out during RNA processing. This exon-intron structure allows for flexibility in gene expression, as we’ll discuss later. For example, the human beta-globin gene has three exons and two introns and encodes the beta subunit of hemoglobin.
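To make the exon-intron layout concrete, here is a small sketch of splicing as a string operation. The sequence and coordinates are invented; they only mimic the three-exon, two-intron arrangement described above, with each intron starting with GU and ending with AG as most real introns do. The function name `splice` is our own.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) and discard everything between them."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Hypothetical pre-mRNA: exon 1, intron 1, exon 2, intron 2, exon 3.
pre_mrna = "AUGGCC" + "GUAAGUCCAG" + "UUUAAA" + "GUGAGUUUAG" + "GGCUAA"
exons = [(0, 6), (16, 22), (32, 38)]

mature = splice(pre_mrna, exons)
print(mature)                      # AUGGCCUUUAAAGGCUAA
print(len(pre_mrna), len(mature))  # 38 18
```

Here more than half of the primary transcript is removed, which is typical in direction, if not in scale, of real eukaryotic genes where introns are often far longer than exons.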

Now, consider the structure of eukaryotic and prokaryotic genomes. Eukaryotic genomes are large, linear, and organized into multiple chromosomes within the nucleus, with extensive non-coding regions, repetitive sequences, and introns making up the bulk of the DNA. The human genome, for instance, spans about 3 billion base pairs across 23 chromosome pairs, with only about 1-2% coding for proteins. In contrast, prokaryotic genomes are compact, typically consisting of a single circular chromosome in the cytoplasm, with high coding density (up to 90%) and operons—clusters of genes transcribed together for coordinated expression. Bacterial genomes like that of Escherichia coli are around 4-5 million base pairs, lacking introns and focusing on efficiency. These structural differences reflect evolutionary adaptations: eukaryotes prioritize complex regulation for multicellularity, while prokaryotes emphasize rapid response in single-celled environments.
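A quick calculation puts these figures side by side. The sketch below uses the numbers quoted above; the 4.5 million base pair value is simply the midpoint of the 4-5 million range given for E. coli, and 90% is the upper bound stated for prokaryotic coding density, so the results are rough estimates only.

```python
def coding_bp(genome_bp: int, coding_percent: int) -> int:
    """Approximate number of protein-coding base pairs (integer arithmetic)."""
    return genome_bp * coding_percent // 100


HUMAN_GENOME_BP = 3_000_000_000  # about 3 billion bp, as stated in the text
ECOLI_GENOME_BP = 4_500_000      # assumed midpoint of the 4-5 million bp range

human_low = coding_bp(HUMAN_GENOME_BP, 1)
human_high = coding_bp(HUMAN_GENOME_BP, 2)
ecoli = coding_bp(ECOLI_GENOME_BP, 90)

print(f"Human coding sequence: {human_low:,} to {human_high:,} bp")
print(f"E. coli coding sequence: {ecoli:,} bp")
print(f"Ratio (human / E. coli): {human_low / ecoli:.1f}x to {human_high / ecoli:.1f}x")
```

Under these assumptions the human genome is several hundred times larger than the bacterial one, yet carries only about 7 to 15 times as much protein-coding sequence.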

An essential layer beyond DNA sequence is epigenetics and its influence on gene expression. Epigenetics refers to heritable modifications that affect gene activity without changing the underlying DNA sequence. Key mechanisms include DNA methylation, where methyl groups are added to cytosine residues, often silencing genes by blocking transcription factor access, and histone modifications, such as acetylation (which loosens chromatin for activation) or methylation (which can repress or activate depending on the site). These changes are dynamic and responsive to environmental factors, like diet or stress, influencing development, aging, and disease. For example, aberrant DNA methylation in promoter regions can lead to tumor suppressor gene silencing in cancer. Epigenetics adds a regulatory dimension to the transcriptome, as it determines which genes are accessible for transcription.

Types of RNA

RNA is not a monolithic entity; it encompasses various types, each with specialized roles in the central dogma and beyond. The three primary classes involved in protein synthesis are messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA).

Beyond these, non-coding RNAs (ncRNAs) expand the regulatory landscape. Most of the human genome is transcribed at some level, much of it into ncRNAs, which influence metabolism, disease, and evolution.

Regulation of Gene Expression

Finally, gene expression is not a passive process; it’s meticulously regulated at multiple levels to ensure precision and adaptability. We distinguish between transcriptional regulation, which controls the initiation and rate of transcription, and post-transcriptional regulation, which modulates RNA processing, stability, and translation.

Transcriptional regulation involves transcription factors (TFs), proteins that bind DNA to activate or repress genes. Enhancers are distal DNA sequences that boost transcription by looping to interact with promoters, often in a tissue-specific manner. Conversely, silencers repress transcription by recruiting repressive complexes. Both enhancers and silencers can function independently of their orientation and, within limits, their distance from the target promoter, which adds regulatory flexibility. For example, under cellular stress the TF p53 binds regulatory elements of DNA repair and cell-cycle arrest genes and activates their transcription.

Post-transcriptional regulation includes mRNA capping, splicing, polyadenylation, export, and degradation, with mRNA stability and translation often modulated by miRNAs or RNA-binding proteins. A key mechanism is alternative splicing, where exons are selectively included or excluded to generate multiple mRNA isoforms from one gene, vastly increasing proteomic diversity. Over 95% of human multi-exon genes undergo alternative splicing, enabling adaptations like isoform switches in development or disease. For instance, alternative splicing of the FN1 gene produces fibronectin isoforms involved in wound healing. Dysregulation can lead to pathologies: in spinal muscular atrophy, SMN1 is lost and the nearly identical SMN2 gene cannot compensate because most of its transcripts skip exon 7. Alternative splicing thus shows how post-transcriptional control lets a limited set of genes produce a much larger set of products.
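The combinatorial effect of alternative splicing can be illustrated with a short enumeration. This is a simplified sketch that models only cassette exons (exons that are either included or skipped); real splicing also involves alternative splice sites, retained introns, and mutually exclusive exons. The exon labels and the `isoforms` function are hypothetical.

```python
from itertools import product


def isoforms(exons: list[str], cassette: set[str]) -> list[list[str]]:
    """Enumerate isoforms: constitutive exons are always kept, cassette exons are optional."""
    optional = [exon for exon in exons if exon in cassette]
    result = []
    for choices in product([True, False], repeat=len(optional)):
        included = {exon for exon, keep in zip(optional, choices) if keep}
        result.append([e for e in exons if e not in cassette or e in included])
    return result


exons = ["E1", "E2", "E3", "E4"]  # hypothetical gene with four exons
for isoform in isoforms(exons, cassette={"E2", "E3"}):
    print("-".join(isoform))
# E1-E2-E3-E4
# E1-E2-E4
# E1-E3-E4
# E1-E4
```

With k independent cassette exons the model yields 2^k isoforms, so even a few optional exons multiply the number of possible transcripts from a single gene.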

These concepts (the central dogma, biomolecular structures, genes and genomes, RNA types, and gene regulation, including epigenetics) form the foundation of molecular biology. They are essential for understanding transcriptomics, which profiles RNA to reveal gene expression dynamics. The epigenetic examples above, from tumor suppressor silencing in cancer to responses to diet and stress, highlight how strongly chromatin state shapes the transcriptome.