The Human Genome Project

The Human Genome Project (HGP) is one of the greatest scientific feats in history. The project was a voyage of biological discovery led by an international group of researchers looking to comprehensively study all of the DNA (known as a genome) of a select set of organisms. Launched in October 1990 and completed in April 2003, the Human Genome Project’s signature accomplishment – generating the first sequence of the human genome – provided fundamental information about the human blueprint, which has since accelerated the study of human biology and improved the practice of medicine.

Learn more about the Human Genome Project through the resources below:

  • A virtual exhibit exploring the 1990 letter-writing campaign to oppose the HGP.
  • G5 Reunion: a virtual discussion with the leaders of the five genome-sequencing centers, telling the untold story of how they got the HGP across the finish line in 2003.
  • A fact sheet detailing how the project began and how it shaped the future of research and technology.
  • An interactive timeline listing key moments from the history of the project.
  • A downloadable poster containing major scientific landmarks before and throughout the project.
  • Reflections by prominent scientists involved in the project on the lessons learned.
  • Commentary in the journal Nature written by NHGRI leaders discussing the legacies of the project.
  • Lecture-oriented slides telling the story of the project by a front-line participant.


Last updated: May 14, 2024


The Human Genome Project—discovering the human blueprint

The human genome is the complete set of instructions required to create a human being.

Expert reviewers

Professor Jenny Graves AO FAA
School of Life Sciences
La Trobe University

Professor John Shine AO FAA
Garvan Institute of Medical Research

  • The blueprint for any living organism is contained in its DNA. DNA is a long molecule made up of many smaller units, called nucleotide bases
  • The order, or sequence, of the bases within the DNA provides the instructions for creating that organism—the genetic code
  • Functional chunks of DNA with particular combinations of base pairs are called genes
  • A genome is an organism's complete set of DNA—all of its genes and other non-genic DNA
  • The human genome is the complete set of instructions required to build a human being

Although every person on our planet is built from the same blueprint, no two people are exactly the same. While we are similar enough to readily distinguish ourselves from other living creatures, we also celebrate our individual uniqueness. So what is it that makes us all human, yet unique? Our DNA.

The stuff that makes us who we are

Our DNA (deoxyribonucleic acid) is found in the nucleus of every cell in our body (apart from red blood cells, which don’t have a nucleus). DNA is a long molecule, made up of lots of smaller units. To make a DNA molecule you need:

  • nitrogenous bases—there are four of these: adenine (A), thymine (T), cytosine (C), guanine (G)
  • carbon sugar molecules
  • phosphate molecules

If you take one of the four nitrogenous bases, and put it together with a sugar molecule and a phosphate molecule, you get a nucleotide base. The sugar and phosphate molecules connect the nucleotide bases together to form a single strand of DNA.

Two of these strands then wind around each other, making the twisted ladder shape of the DNA double helix. The nucleotide bases pair up to make rungs of the ladder, and the sugar and phosphate molecules make the sides. The bases pair up together in specific combinations: A always pairs with T, and C always pairs with G to make base pairs.
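These pairing rules are simple enough to capture in a few lines of code. Here is a minimal Python sketch (not part of the original article; the function names are illustrative) that builds the partner strand for any sequence:

```python
# Watson-Crick pairing rules: A always pairs with T, C always with G
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_strand(sequence: str) -> str:
    """Return the strand that pairs with `sequence`, base by base."""
    return "".join(PAIRS[base] for base in sequence)

def reverse_complement(sequence: str) -> str:
    """The two strands run in opposite directions, so the biological
    partner strand is read in reverse."""
    return complement_strand(sequence)[::-1]

print(complement_strand("ATCG"))     # -> TAGC
print(reverse_complement("ATCG"))    # -> CGAT
```

Because each base determines its partner, one strand fully specifies the other, which is also what makes DNA copying possible.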


Put three billion of these base pairs together in the right order, and you have a complete set of human DNA—the human genome. This amounts to a DNA molecule about a metre long.

It’s the order in which the base pairs are arranged—their sequence—in our DNA that provides the blueprint for all living things and makes us what we are. The sequence of base pairs in a fish’s DNA is different from that in a monkey’s.

The base pair sequence of all people is nearly identical—that’s what makes us all humans. However, there are small differences in the order of the three billion base pairs in everyone’s DNA that cause the variations we see in hair colour, eye colour, nose shape etc. No two people have exactly the same DNA sequence (except for identical twins, because they came from a single egg that split into two, forming two copies of the same DNA).

We get our DNA from our parents. The DNA of the human genome is broken up into 23 pairs of chromosomes (46 in total). We receive 23 from our mother and 23 from our father. Egg and sperm cells have only one copy of each chromosome so that when they come together to form a baby, the baby has the normal 2 copies.

Three billion is a lot of cats to herd

Three billion is a lot of base pairs, and together they contain an enormous amount of information. If they were all written out as a list, they would fill around 10,000 epic fantasy-novel-sized (think Game of Thrones thickness) books. They aren’t just random lists of information though. Rather, within this long string, there are distinct sections of DNA that affect a particular characteristic or condition. These stretches of DNA are known as  genes.  Their base pair sequence is used to create the amino acids that join together to make a protein. Some genes are small, only around 300 base pairs, and others contain over one million.

Genes make up only around 1.5 per cent of our DNA—the rest is extra that initially didn’t appear to have any specific purpose, and was dubbed ‘junk DNA’. Turns out, though, that at least some of this ‘junk’ is actually pretty useful—it’s used to define where some genes start and finish, and to regulate how the genes behave. While most of the junk DNA comes from copies of virus genomes that invaded our distant ancestors, new studies suggest much of this DNA may have also gained functions during our evolution.

Genes contain information to make proteins

Within a gene, the base pairs are read in sets of three, and those sets are called codons. These are triplets of base pairs that provide a ‘code’ for the production of a particular amino acid. Amino acids then combine to build proteins. Proteins build all living structures as well as acting as catalysts (enzymes) that control biochemical reactions. Proteins build tissues, and tissues build the organs that make up our body. The genes that determine that you will have brown eyes contain instructions for the cells in the iris of your eye to make a brown-coloured protein. A different sequence of bases would spell a different message, making different proteins and giving blue eyes—rather like spelling out a different sentence using the same letters of the alphabet.
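The three-base reading scheme can be sketched in code. The snippet below (illustrative only, using just four entries from the standard genetic code rather than the full 64-codon table) walks a DNA string codon by codon until it hits a stop codon:

```python
# A small, illustrative subset of the standard genetic code
CODON_TABLE = {
    "ATG": "Met",   # methionine, the usual start codon
    "TTT": "Phe",   # phenylalanine
    "GGC": "Gly",   # glycine
    "TAA": "Stop",  # one of the three stop codons
}

def translate(dna: str) -> list:
    """Read the DNA in codons (sets of three bases) until a stop codon."""
    amino_acids = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        aa = CODON_TABLE.get(codon, "?")  # "?" marks codons outside our subset
        if aa == "Stop":
            break
        amino_acids.append(aa)
    return amino_acids

print(translate("ATGTTTGGCTAA"))  # -> ['Met', 'Phe', 'Gly']
```

The resulting list of amino acids is the recipe for one protein chain; the real cellular machinery (ribosomes reading messenger RNA) does essentially this lookup.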

Genes can be switched on or off

So if every cell in our body contains the same DNA, how do we end up with the complex arrangement of different cells that is the human (or any other creature’s, for that matter) body?

The secret is that although every cell contains the same sequence of genes, not every gene is ‘switched on’ or expressed in every cell. The cells that make pigment in the eye also contain the genes for making tooth enamel or liver cell proteins, but fortunately don’t do so because those genes are inactive in the eye cells. There are stretches of DNA that do not code for proteins, but rather act as the ‘punctuation’ within the genome that controls the functioning of genes and other processes.

It’s all of this—the genes plus the 'punctuation' plus the ‘junk’—that makes up our genome.

Why study our genome?

Working out the sequence of the base pairs in all our genes enables us to understand the code that makes us who we are. This knowledge can then give us clues on how we develop as embryos, why humans have more brainpower than other animals and plants, and what happens in the body to cause cancer. But establishing the sequence of three billion base pairs is a BIG task. The great and ambitious research program that sought to do this was called the Human Genome Project.

The idea of the Human Genome Project was born in the 1970s, when scientists learned how to ‘clone’ small bits of DNA, around the size of a gene. To clone DNA, scientists cut a fragment of human DNA from the long strand and then incorporate it into the genome of a bacterium, or a bacterial virus. The fragment is then replicated within the bacterial cell many times, and every time the bacterial cell divides, the new cells also contain the introduced DNA fragment. Bacterial cells reproduce prolifically, and so this process ends up making millions of cells that all contain the introduced DNA fragment, enough that researchers can study it in detail and figure out the sequence of the base pairs.

With time, researchers have been able to study an ever greater number of different DNA fragments, that is, different genes. It became clear that certain variant DNA sequences were associated with particular conditions: diseases such as cystic fibrosis or breast cancer, or normal, non-harmful variants like red hair.

There was initially a lot of opposition to the Human Genome Project, even from some scientists. Considering only around 1.5 per cent of our genome is actual genes that code for proteins, it was thought that much of the $3 billion cost to sequence the entire human genome would be wasted on the ‘junk’ DNA that scientists thought didn’t get used. The important role the ‘junk’ DNA plays in gene regulation wasn’t yet appreciated.

Research groups in many countries, including Australia, began to sequence different genes, providing the beginnings of a total human gene map. In 1989, the Human Genome Organisation (HUGO) was founded by leading scientists to coordinate the massive international effort involved in collecting sequence data to unravel the secrets of our genes.


Francis Collins, former director of the National Human Genome Research Institute, led the Human Genome Project. Image credit: World Economic Forum on Flickr .

The Human Genome Project

So complex that at first it seemed unachievable.

The Human Genome Project aimed to map the entire genome, including the position of every human gene along the DNA strand, and then to determine the sequence of each gene’s base pairs. At the time, sequencing even a small gene could take months, so this was seen as a stupendous and very costly undertaking. Fortunately, biotechnology was advancing rapidly, and by the time the project was finishing it was possible to sequence the DNA of a gene in a few hours. Even so, producing the first draft of the human genome took ten years; it was announced in June 2000.

Humans surprisingly simple?

In February 2001, the publicly funded Human Genome Project and the private company Celera both announced that they had mapped virtually all of the human genome, and had begun the task of working out the functions of the many new genes that were identified. Scientists were surprised to find that humans only have around 25,000 genes, not much more than the roundworm Caenorhabditis elegans, and fewer than a tiny water crustacean called Daphnia, which has around 30,000. However, genome sequencing was making it clear that an organism's complexity is not necessarily related to its number of genes.

Also, while we might have a surprisingly small number of genes, they are often expressed in multiple and complex ways. Numerous genes have as many as a dozen different functions and may be translated into several different versions active in different tissues. We also have a lot of extra DNA that doesn’t make up specific genes. So even though the puffer fish Tetraodon nigroviridis has more genes than we do—nearly 28,000—the size of its entire genome is actually only around one tenth of ours as it has much less of the non-coding DNA.

In April 2003, the 50th anniversary of the publication of the structure of DNA, the complete final map of the Human Genome was announced. The DNA from a large number of donors, women and men from different nations and of different races, contributed to this ‘typical’ Human Genome Sequence.

Of the 25,000 or so human genes that have been identified as coding for proteins, most exist in several sequence variants, called alleles. Sometimes these variations are harmless. The gene that codes for eye colour has several alleles—one for blue eyes, another for brown eyes. Sometimes these genetic variations can cause a disease. For example, a mutation in the gene that transports ions across the membrane of lung cells can cause cystic fibrosis.

So, although our alleles may be different, all humans mostly share the same genes. The Human Genome Project identified the full set of human genes, sequenced them all, and identified some of the alleles, particularly those that can cause disease when they get mutated.

Genes can be mapped relative to physical features of the chromosome, or relative to other genes. When different genes are close together on the same chromosome they are said to be linked because they will usually be passed on together (‘co-inherited’) to a child. However, chromosomes break and re-join when eggs and sperm are formed (‘meiosis’), so even genes that are close together can sometimes become separated. The closer together the genes are, the more likely they are to stay together. Analysing how often genes become separated from each other can help establish the distance between genes, and produce a genetic linkage map. In the Human Genome Project, the first task was to make a genetic linkage map for each chromosome.
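The linkage reasoning above can be made concrete with a small numerical sketch (all counts here are invented for illustration). The fraction of offspring in which two genes were separated estimates their genetic map distance, where 1 per cent recombination corresponds to roughly 1 centimorgan (cM):

```python
def recombination_fraction(recombinant: int, total: int) -> float:
    """Fraction of offspring in which two linked genes were separated.
    For closely linked genes, 1% recombination is roughly 1 centimorgan (cM)."""
    return recombinant / total

# Hypothetical cross: of 1,000 offspring, 80 show the parental
# combination of the two genes broken apart by meiosis.
fraction = recombination_fraction(80, 1000)
print(f"{fraction:.1%} recombination, roughly {fraction * 100:.0f} cM apart")
```

Genes separated in very few offspring map close together; genes separated in many offspring map far apart (or sit on different chromosomes), which is exactly how the HGP's genetic linkage maps ordered markers along each chromosome.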

A genetic linkage map is made from studying patterns in gene separation, and shows the relative locations of genes on a chromosome. It does not tell us anything about the actual physical distances between the genes. A physical map, made by hybridizing a fluorescent-tagged probe to chromosomes, can be aligned with the linkage map. Molecular-scale maps, constructed from sequence markers in the DNA molecule, quantify these distances, usually in terms of how many base pairs there are between genes. Together, the genetic linkage map, the physical map, the molecular maps and the sequence give us the complete picture of the genome.


A technician extracts DNA for tests at the AIDS Vaccine Design and Development Laboratory in Brooklyn, New York. Image credit: © 2008, Getty Images for International AIDS Vaccine Initiative .

It's all about me

It’s all very nice to make a map of these three billion pairs and figure out how they all fit together in order to understand the fundamental essence of a human being. But how much significance does this have for our everyday lives?

Actually, quite a lot. As the cost of sequencing a genome plummets—the first human genome sequenced in 2003 cost somewhere in the order of US$2.7 billion, while it can now be done for less than US$1,000—doctors have a new and extremely powerful tool at their disposal. Identifying how our genes interact and which parts of our genome affect certain diseases and conditions has meant that doctors and scientists are able to better understand how these conditions work and how to treat them. Combine this with the exact knowledge of a particular person’s genes and their mutations, and we are embarking on a new age of personalised medicine.

Doctors can tailor a patient’s medical treatment to be an exact fit, the way a tailor adjusts a suit or a dress for each individual. Drug treatments can be developed that are based on specific genetic mutations, and doctors may be able to diagnose a disease in a patient who is not showing typical symptoms. Scientists anticipate that soon we will move from a “one drug fits all” style of treatment to a more effective, highly personalised, targeted approach. For example, based on a patient’s genome, doctors may be able to predict if they will respond to certain cancer therapies. This can help avoid putting the patient through devastating chemotherapy treatments unnecessarily.

Mapping an individual’s genome can also provide doctors with the ability to predict or anticipate any diseases that the individual may be predisposed to. These conditions could then be addressed with a preventative approach, before they take serious hold.

A researcher reviews a DNA sequence. Image credit:  University of Michigan School of Natural Resources on Flickr .

Ethical controversies

There is no doubt that information from the Human Genome Project provides huge benefits to human health in helping to understand and treat genetic diseases (such as breast cancer, cystic fibrosis and sickle cell anaemia). However, some people see ethical issues, and wonder if scientists are “playing God” with our genomes. 

Could genetic information be misused; for example, through genetic discrimination by employers or insurance companies?  Most people agree that gene testing can be used ethically to prevent serious diseases such as cancer, or during pregnancy to avoid the birth of someone with a severe handicap, but should we allow gene testing to choose a child who will be able to be better at sports, or more intelligent? What about sex selection, already a  problem in some countries ? And will it become possible to use genetic information to change genes in children or adults for the better? Do we  really want to know  if we run the risk of developing a particular disease that may or may not be treatable? What are the  privacy issues  regarding genome screening on a population scale?

All of these ethical, legal and social issues associated with genetic information are being considered worldwide by scientists and ethicists. The potential for medical advancement is immense, but as with so many other great scientific advances, new knowledge brings huge new responsibilities.

A printed copy of a human genome. Image credit:  Adam Nieman on Flickr .


Human Genome Project

Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others. This website details HGP history.



The Human Genome Project

The 20th century opened with rediscoveries of Gregor Mendel’s studies on patterns of inheritance in peas and closed with a research project in molecular biology that was heralded as the initial and necessary step for attaining a complete understanding of the hereditary nature of humankind. Both basic science and technological feat, the Human Genome Project (HGP) sought to map and sequence the haploid human genome’s 22 autosomes and 2 sex chromosomes, bringing to biology a “big science” model previously confined to physics. Officially launched in October 1990, the project’s official date of completion was timed to coincide with celebrations of the 50th anniversary of James D. Watson and Francis Crick’s discovery of the double-helical structure of DNA. On 12 April 2003, heads of government of the six countries that contributed to the sequencing efforts (the U.S., the U.K., Japan, France, Germany, and China) issued a joint proclamation that the “essential sequence of three billion base pairs of DNA of the Human Genome, the molecular instruction book of human life,” had been achieved (Dept. of Trade 2003). HGP researchers compared their feat to the Apollo moon landing and splitting the atom. They foresaw the dawn of a new era, “the era of the genome,” in which the genome sequence would provide a “tremendous foundation on which to build the science and medicine of the 21st century” (NHGRI 2003).

This article begins by providing a brief history of the Human Genome Project. An overview of various scientific developments that unfolded in the aftermath of the HGP follows; these developments came to be referred to as “postgenomics” to distinguish them from activities labelled “genomics” that are associated specifically with the mapping and sequencing of the genomes of humans and other organisms. The article then discusses some of the conceptual and social and ethical issues that gained the attention of philosophers during the project’s planning stages and as it unfolded, and which remain salient today. Novel with the HGP was the decision of its scientific leadership to set aside funds to study the project’s ethical, legal, and social implications (ELSI). Today, from a vantage point more than two-and-a-half decades after that decision was made, it is possible to reflect on the ELSI model and its relevance for ongoing biomedical research in genetics and genomics/postgenomics.

1. The Human Genome Project: From Genomics to Postgenomics

The idea of sequencing the entire human genome arose in the U.S. in the mid-1980s and is attributed to University of California at Santa Cruz chancellor Robert Sinsheimer, Salk Institute researcher Renato Dulbecco, and the Department of Energy’s (DOE’s) Charles DeLisi. While the idea found supporters among prominent molecular biologists and human geneticists such as Walter Bodmer, Walter Gilbert, Leroy Hood, Victor McKusick, and James D. Watson, many of their colleagues expressed misgivings. There were concerns among molecular biologists about the routine nature of sequencing and the amount of “junk DNA” that would be sequenced, that the expense and big science approach would drain resources from smaller and more worthy projects, and that knowledge of gene sequence was inadequate to yield knowledge of gene function (Davis and Colleagues 1990).

Committees established to study the feasibility of a publicly funded project to sequence the human genome released reports in 1988 that responded to these concerns. The Office of Technology Assessment report, Mapping Our Genes: Genome Projects: How Big, How Fast? downplayed the concerns of scientist critics by emphasizing that there was not one but many genome projects, that these were not on the scale of the Manhattan or Apollo projects, that no agency was committed to massive sequencing, and that the study of other organisms was needed to understand human genes. The National Research Council report, Mapping and Sequencing the Human Genome, sought to accommodate the scientists’ concerns by formulating recommendations that genetic and physical mapping and the development of cheaper, more efficient sequencing technologies precede large-scale sequencing, and that funding be provided for the mapping and sequencing of nonhuman (“model”) organisms as well. Genome projects were underway even before the Office of Technology Assessment and National Research Council reports were released. The DOE made the first push toward a “big science” genome project, with DeLisi advancing a five-year plan in 1986. The DOE undertaking produced consternation among biomedical researchers who were traditionally supported by the National Institutes of Health’s (NIH’s) intramural and extramural programs, and James Wyngaarden, head of the NIH, was persuaded to lend his agency’s support to the project in 1987. Congressional funding for both agencies was in place in time for fiscal year 1988. The National Research Council report estimated the total cost of the HGP at $3 billion.

The DOE and NIH coordinated their efforts with a Memorandum of Understanding in 1988 that agreed on an official launch of the Human Genome Project on October 1, 1990 and an expected date of completion of 2005. The DOE established three genome centers in 1988–89: at Lawrence Berkeley, Lawrence Livermore, and Los Alamos National Laboratories. David Smith led the DOE-HGP at the outset; he was followed by David Galas from 1990 to 1993, and Ari Patrinos for the remainder of the project. The NIH instituted a university grant-based program for human genome research and placed Watson, co-discoverer of the structure of DNA and director of Cold Spring Harbor Laboratory, in charge in 1988. In October 1989, Watson assumed the helm of the newly established National Center for Human Genome Research (NCHGR) at the NIH. During 1990 and 1991, Watson expanded the grants-based program to fund seven genome centers for five-year periods to work on large-scale mapping projects: Washington University, St. Louis; University of California, San Francisco; Massachusetts Institute of Technology; University of Michigan; University of Utah; Baylor College of Medicine; and Children’s Hospital of Philadelphia. Francis Collins succeeded Watson in 1993, establishing an intramural research program at the NCHGR to complement the extramural program of grants for university-based research that already existed. In 1997, the NCHGR was elevated to the status of a research institute and renamed the National Human Genome Research Institute (NHGRI).

Although the HGP’s inceptions were in the U.S., it did not take long for mapping and sequencing the human genome to become an international venture (see Cook-Deegan 1994). France began to fund genome research in 1988 and had developed a more centralized, although not very well-funded, program by 1990. More significant were the contributions of Centre d’Etudes du Polymorphisme Humain (CEPH) and Généthon. CEPH, founded in 1983 by Jean Dausset, maintained a collection of DNA donated by intergenerational families to help in the study of hereditary disease; in 1991, with funding from the French muscular dystrophy association, CEPH director Daniel Cohen oversaw the launching of Généthon as an industrial-sized mapping and sequencing operation. The U.K.’s genome project received its official start in 1989, though Sydney Brenner had commenced genome research at the Medical Research Council laboratory several years before this. Medical Research Council funding was supplemented with private monies from the Imperial Cancer Research Fund and, later, the Wellcome Trust. The Sanger Centre, led by John Sulston and funded by Wellcome and the Medical Research Council, opened in October 1993. Japan, ahead of the U.S. in having funded the development of automated sequencing technologies since the early 1980s, was the major genome player outside the U.S. and Europe with several government agencies beginning small-scale genome projects in the late-1980s and early-1990s (Swinbanks 1991). Germany and China subsequently joined the U.S., France, U.K., and Japan in the publicly funded international consortium that was ultimately responsible for sequencing the genome.

The NIH and DOE released a joint five-year plan in 1990 that set specific benchmarks for mapping, sequencing, and technological development. The plan was updated in 1993 to accommodate progress that had been made, with the new five-year plan in effect through 1998 (Collins and Galas 1993). As the National Research Council report had recommended, priority at the outset of the project was given to mapping rather than sequencing the human genome. HGP scientists sought to construct two kinds of maps: genetic maps and physical maps. Genetic maps order polymorphic markers linearly on chromosomes; the aim is to have these markers densely enough situated that linkage relations can be used to locate chromosomal regions containing genes of interest to researchers. Physical maps order collections (or “libraries”) of cloned DNA fragments that cover an organism’s genome; these fragments can then be replicated in quantity for sequencing. Technological progress was needed to make sequencing more efficient and less costly for any significant progress to be made. In the meantime, efforts would focus on sequencing the smaller genomes of less complex model organisms (Watson 1990). The model organisms selected for the project were the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the roundworm Caenorhabditis elegans, the fruitfly Drosophila melanogaster, and the mouse Mus musculus.

As 1998, the last year of the revised five-year plan and midpoint of the project’s projected 15-year span, approached, many mapping goals had been met. In 1994, Généthon completed a genetic map with more than 2,000 microsatellite markers at an average spacing of 2.9 centimorgans (cM) and only one gap larger than 20 cM (Gyapay et al. 1994); the goal was a resolution of 2 to 5 cM by 2005. The genetic mapping phase of the project came to a final close in March 1996 with Généthon’s completion of a genetic map containing 5,264 microsatellite markers located to 2,335 positions with an average spacing of 1.6 cM (Dib et al. 1996). In 1995, a physical map with 94 percent coverage of the genome and 15,086 sequence-tagged site (STS) markers at average intervals of 199 kilobases (kb) was published (Hudson et al. 1995); the initial goal was STS markers spaced approximately 100 kb apart by 1995, a deadline the revised plan extended to 1998. In 1998, a physical map of 41,664 STS markers was published (Deloukas et al. 1998). Sequencing presented more of a challenge, despite ramped-up sequencing efforts over the previous several years at the U.K.’s Wellcome Trust-funded Sanger Centre in Cambridge and the NHGRI (previously NCHGR)-funded centers at Houston’s Baylor College of Medicine, Stanford University, The Institute for Genomic Research (TIGR), University of Washington-Seattle, Washington University School of Medicine in St. Louis, and Whitehead Institute for Biomedical Research/MIT Genome Center. The genomes of the smallest model organisms had been sequenced. In April 1996, an international consortium of mostly European laboratories published the sequence for S. cerevisiae which was the first eukaryote completed, with 12 million base pairs and 5,885 genes and at a cost of $40 million (Goffeau et al. 1996). In January 1997, University of Wisconsin researchers completed the sequence of E. coli with 4,638,858 base pairs and 4,286 genes (Blattner et al. 1997). 
However, with only three percent of the human genome sequenced, sequencing costs hovering at $0.40 per base, the desired high output not yet achieved by the sequencing centers, and about $1.8 billion already spent, doubts existed about whether the HGP’s target date of 2005 could be met.

Suddenly, the publicly funded HGP faced a challenge from the private sector. In May 1998, TIGR’s J. Craig Venter announced a partnership with Applied Biosystems to sequence the entire genome in just three years and for a fraction of the cost. The new company, based in Rockville, MD and later named Celera Genomics, planned to use “whole-genome shotgun” (WGS) sequencing, an approach different from the HGP’s. The HGP confined the shotgun method to cloned fragments already mapped to specific chromosomal regions: these fragments are broken down into smaller bits and amplified in bacterial clones; sequences of the bits are generated randomly by automated machines; and computational resources are used to reassemble the sequence from the bits’ overlapping regions. Shotgunning is followed by painstaking “finishing” to fill in gaps, correct mistakes, and resolve ambiguities. What Celera proposed was to break the organism’s entire genome into millions of pieces of DNA with high-frequency sound waves, sequence these pieces using hundreds of Applied Biosystems’ new capillary sequencing machines, and reassemble the sequences with one of the world’s largest civilian supercomputers, without the assistance provided by the preliminary mapping of clones to chromosomes. When WGS sequencing had been considered as a possibility by the HGP, it was rejected because of the risk that repeat sequences would yield mistakes in reassembly (Green 1997; Venter et al. 1996; Weber and Myers 1997). But Venter by this time had successfully used the method to sequence the 1.83 million nucleotide bases of the bacterium Haemophilus influenzae—the first free-living organism to be completely sequenced—in a year’s time (Fleischmann et al. 1995).
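The reassembly step described above can be illustrated with a toy sketch. This is a hypothetical teaching example, not Celera’s actual assembler: reads are merged greedily by their longest suffix–prefix overlaps, and it ignores the sequencing errors and repeat-induced ambiguities that made WGS controversial.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` matching a prefix of `b`."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate match position
        if start == -1:
            return 0
        if b.startswith(a[start:]):         # suffix of a is a prefix of b
            return len(a) - start
        start += 1

def greedy_assemble(reads: list[str]) -> str:
    """Merge reads pairwise, longest overlap first, until one remains."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:  # no overlaps left: a gap that "finishing" would close
            return "".join(reads)
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

# Shear a known sequence into overlapping fragments, then reassemble.
genome = "ATGCGTACGTTAGCCGATT"
reads = [genome[k:k + 8] for k in range(0, len(genome), 4)]
print(greedy_assemble(reads))  # prints ATGCGTACGTTAGCCGATT
```

The greedy strategy works on this toy example because every true overlap is unique; with the repeat-rich human genome, identical reads can come from different chromosomal locations, which is precisely the reassembly risk the HGP cited.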

Over the next couple of years, HGP scientists often downplayed the media image of a race to sequence the genome, but they were certainly propelled by worries that, given the private sector’s willingness to take over, funding would dry up before the sequence was complete and that the sequence data would become proprietary information. Wellcome more than doubled its funds to the Sanger Centre (to £205 million), and the center changed its goal from sequencing one-sixth of the genome to one-third, and possibly one-half (Dickson 1998). The NHGRI and DOE published a new five-year plan for 1998–2003 (Collins et al. 1998). The plan moved the final completion date forward from 2005 to 2003 and aimed for a “working draft” of the human genome sequence to be completed by December 2001. This would be achieved by delaying the finishing process, no longer going clone-by-clone to shotgun, reassemble, and finish the sequence of one clone before proceeding to the next. With only six percent of the human genome sequence completed, the plan called for new and improved sequencing technologies that could increase sequencing capacity from 90 Mb per year at about $0.50 per base to 500 Mb per year at no more than $0.25 per base. Goals for completing the sequencing of the remaining model organisms were also set: December 1998 for C. elegans, which was 80 percent complete; 2002 for D. melanogaster, which was nine percent complete; and 2005 for M. musculus, which was still at the physical mapping stage.

An interim victory for the publicly funded project followed when, on schedule, the first animal sequence, that of C. elegans with 97 million bases and 19,099 genes, was published in Science in December 1998 (The C. elegans Sequencing Consortium 1998). This was the product of a 10-year collaboration between scientists at Washington University in St. Louis (headed by Bob Waterston) and the Sanger Centre (headed by John Sulston), carried out at a semi-industrial scale with more than 200 people employed in each lab working around the clock. In March 1999, the main players—the NHGRI, Sanger Centre, and DOE—advanced the date of completion of the “working draft”: five-fold coverage of at least 90 percent of the genome was to be completed by the following spring (Pennisi 1999; Wadman 1999). This change reflected the improved output of the new model of automated sequencing machines, diminished sequencing costs of $0.20 to $0.30 per base, and the desire to speed up the release of medically relevant data. NHGRI would take responsibility for 60 percent of the sequence, concentrating these efforts at Baylor, Washington University, and Whitehead/MIT; 33 percent of the sequence would be the responsibility of the Sanger Centre; and the remaining sequence would be supplied by the DOE’s Joint Genome Institute (JGI) in Walnut Creek, CA, into which its three centers had merged in January 1997.

The first chromosomes to be completed (this was to finished, not working draft, standards) were the two smallest: the sequence for chromosome 22 was published by scientists at the Sanger Centre and partners at the University of Oklahoma, Washington University in St. Louis, and Keio University in Japan in December 1999 (Dunham et al. 1999); the sequence for chromosome 21 was published by an international consortium of mostly Japanese and German labs—with half the sequencing carried out at Japan’s RIKEN—in May 2000 (Hattori et al. 2000). The remaining chromosomes lagged behind. On 26 June 2000, when Collins, Venter, and the DOE’s Patrinos joined U.S. President Bill Clinton (and British Prime Minister Tony Blair by satellite link) at a White House press conference (see Clinton et al. 2000) to announce that the human genome had been sequenced, this was more an arranged truce than a tie for the prize. An editorial in Nature described the fanfare of 26 June as an “extravagant” example—one reaching “an all-out zenith or nadir, according to taste”—of scientists making public announcements not linked to peer-reviewed publication, here to bolster share prices (Celera) and for political effect (the HGP) given the “months to go before even a draft sequence will be scientifically useful” (Anonymous 2000, p. 981). Neither of the two sequence maps was complete (Pennisi 2000). The HGP had not met its previous year’s goal of a working draft covering 90 percent of the genome. Assisted by its researchers’ access to HGP data stored on public databases,[1] Celera’s efforts were accepted as being further along: the company’s press release that day announced 99 percent coverage of the genome.

Peer-reviewed publications came almost eight months later. Plans for joint publication in Science broke down when terms of agreement over data release could not be reached, the journal’s editors being willing to publish Celera’s findings without Venter meeting the standard requirement that sequence data be submitted to GenBank. Press conferences in London and Washington, D.C. on 12 February preceded publications that week—by HGP scientists in Nature on 15 February 2001 and by Venter’s team in Science on 16 February 2001. The HGP draft genome sequence covered about 94 percent of the genome, with about 25 percent in the finished form already attained for chromosomes 21 and 22. Indeed, the authors themselves described it as “an incomplete, intermediate product” which “contains many gaps and errors” (International Human Genome Sequencing Consortium 2001, p. 871). The results published by Celera had 84–90 percent of the genome covered by scaffolds at least 100 kb in length, with the composition of the scaffolds averaging 91–92 percent sequence and 8–9 percent gaps (Venter et al. 2001). In the end, Celera’s published genome assembly made significant use of the HGP’s publicly available map and sequence data, which left open for debate the question of whether WGS sequencing alone would have worked (see Waterston et al. 2002; Green 2002; and Myers et al. 2002).

Since the gaps in the sequence were unlikely to contain genes, and only genes as functional segments of DNA have potential commercial value, Celera was happy to leave the gaps for the HGP scientists to fill in. Despite being timed to coincide with celebrations of the 50th anniversary of the Watson–Crick discovery of the double-helical structure of DNA, there was less fanfare surrounding the official date of completion of the HGP in April 2003, two years earlier than had been anticipated at the time of its official launch in October 1990, and several months earlier than called for in the most recent five-year plan. In the end, sequencing—the third phase of the publicly funded project—was carried out at 16 centers in six countries by divvying up sections of chromosomes among them for sequencing. Eighty-five percent of the sequencing, however, was done at the five major sequencing centers (Baylor, Washington University, Whitehead/MIT, the Sanger Centre, and the DOE’s JGI), with the Sanger Centre responsible for nearly one-third. The cost was lower than anticipated, with $2.7 billion spent by U.S. agencies and £150 million spent by the Wellcome Trust. The “finished” reference DNA sequence for Homo sapiens was made publicly accessible on the Internet. However, for various technical reasons, the human genome’s 3.1 billion nucleotide bases had not yet been completely sequenced at the close of the HGP in 2003.[2]

Public support was won for the HGP through scientists’ promises of the revolutionary benefits of genome-based research for pharmaceutical and other biomedical applications. At the outset of the HGP, these promises were sometimes alarmingly deterministic, reductionistic, and overblown, such as when Science editor Daniel Koshland (1989) submitted that genes are responsible not only for manic-depression and schizophrenia but also poverty and homelessness, and that sequencing the genome represented “a great new technology to aid the poor, the infirm, and the underprivileged” (p. 189). More circumspect claims by scientist-proponents of the HGP were no less optimistic. Leroy Hood expressed the belief that “we will learn more about human development and pathology in the next twenty-five years than we have in the past two thousand” (1992, p. 163). Hood expected the HGP to facilitate movement from a reactive to preventive mode of medicine, which would “enable most individuals to live a normal, healthy, and intellectually alert life without disease” (p. 158). Francis Collins predicted that sequencing the genome would “dramatically accelerate the development of new strategies for the diagnosis, prevention, and treatment of disease, not just for single-gene disorders but for the host of more common complex diseases (e.g., diabetes, heart disease, schizophrenia, and cancer)” (1999, p. 29). 
Collins envisioned that, by 2010, genetically-based “individualized medicine” would be a reality: physicians would routinely take cheek swabs from patients and send their DNA out for testing; based on results of genetic testing (returned within a week), physicians would be able to advise their patients about their absolute and relative risks for contracting various adult-onset diseases; by taking preventive measures (e.g., quitting smoking, having an annual colonoscopy, etc.), patients would be able to prevent the onset of any such diseases or minimize their effects; and the field of pharmacogenomics would have “blossomed” sufficiently for physicians to be able to prescribe prophylactic medications tailored precisely to the genetic make-up of their patients, so to promote efficacy and prevent adverse reactions.

Genome-wide association studies (GWAS), in which single nucleotide polymorphisms (SNPs) across the genome are compared in case–control fashion, are the main approach used to investigate the genetic bases of complex traits. The importance of developing rapid, inexpensive methods of genome sequencing and building a database of SNPs to support the investigation of complex traits was recognized by the project’s leadership even before completion of the HGP. Worried about the private sector’s efforts to patent SNPs, which would make them costly to use for research, the NHGRI-DOE five-year plan for 1998–2003 included the goal of mapping 100,000 SNPs by 2003 (Collins et al. 1998). The development of a public database of SNPs received a $138 million push from the International HapMap Project, a three-year public-private partnership completed in 2005 that mapped variation in four population groups (The International HapMap Consortium 2005). The 1000 Genomes Project, which ran from 2008 to 2015 (Birney and Soranzo 2015), sought to identify genetic variants that occur with a frequency of at least one percent in the populations studied, with the final data set consisting of 2,504 individuals from 26 populations across five continental regions.
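The case–control logic at the heart of a GWAS can be sketched for a single SNP: allele counts in cases are compared with those in controls using a 2x2 chi-square statistic and an odds ratio. The counts below are made up for illustration; an actual study repeats this test across hundreds of thousands of SNPs and applies a stringent genome-wide significance threshold.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio(a, b, c, d):
    """Odds of carrying the risk allele in cases relative to controls."""
    return (a * d) / (b * c)

# Hypothetical counts: (risk allele, other allele) on 2,000 case and
# 2,000 control chromosomes at one SNP.
cases = (1200, 800)
controls = (1000, 1000)

chi2 = chi_square_2x2(cases[0], cases[1], controls[0], controls[1])
orat = odds_ratio(cases[0], cases[1], controls[0], controls[1])
print(f"chi-square = {chi2:.1f}, odds ratio = {orat:.2f}")
# prints: chi-square = 40.4, odds ratio = 1.50
```

A modest odds ratio such as this is typical of GWAS hits for complex traits, which is one face of the “missing heritability” problem discussed below: many variants of individually small effect.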

The first genome-wide association study was published in 2005; by 2010, 500 genome-wide association studies had been published (Green et al. 2011); and by 2018, more than 5,000 had been published and their results added to the GWAS Catalog (Buniello et al. 2019). Despite these efforts, variants isolated by GWAS for complex traits account for a low percentage of the heritability associated with those traits, a phenomenon known as “missing heritability” (Maher 2008). Although knowledge of the pathogenesis of common complex diseases such as diabetes, heart disease, schizophrenia, and cancer, which arise from the interaction of numerous genetic and nongenetic factors, remains lacking despite the many GWAS completed, the pathogenesis of so-called single-gene disorders is far better understood. Aided by the HGP’s dense map of genetic markers for use in positional mapping and by the subsequent development of genome-wide sequencing technologies, the number of Mendelian diseases with a known genetic basis has increased from 1,257 to 4,377 in the 20 years since the draft sequence was published (Alkuraya 2021). This progress speeds up clinical diagnosis and facilitates prenatal genetic testing. Prevention and treatment remain challenging, however. There are only 59 “actionable genes” on the American College of Medical Genetics and Genomics’ most recent list of genes that are highly penetrant and associated with established interventions (Kalia et al. 2017). Gene therapy, touted as a potential cure for such disorders, yielded discouraging results in trials; only with the discovery of CRISPR, a novel gene-editing technology, has optimism been restored, though how well this basic science translates into clinical applications remains to be seen (Doudna and Sternberg 2017; Baylis 2019).

The confident claims of leading genome scientists such as Hood and Collins proved overly optimistic. Commentary at the time of the 10-year anniversary of the completion of the HGP showed scientists united in recognizing that the promised revolution had not yet arrived. The HGP has been advanced as a case study in support of the “social bubble” hypothesis: the hypothesis that people will dive into new opportunities without due regard for potential risks because they are carried along by social interactions driven by enthusiasts who generate high expectations of returns—for the HGP, projections of commercially lucrative pharmaceutical and other biomedical applications advanced by project proponents (Gisler et al. 2010). The scientific consensus appears to be that although the HGP failed, at least in the short term, to fulfill proponents’ overly optimistic prognostications for clinical applications, it has been a boon for basic science (Evans 2010). In the HGP’s early years, Norton Zinder, who chaired the NIH’s Program Advisory Committee on the Human Genome, characterized it this way: “This Project is creating an infrastructure for doing science; it’s not the doing of the science per se. It will provide the biological community with the basic materials for doing research in human biology” (in Cooper 1994, p. 74).

Indeed, the infrastructure of mapping and sequencing technologies and bioinformatics that was developed as part of the HGP—especially the ability to sequence entire genomes of organisms and traffic in big data—has changed the way biology, not just human biology, is done (Stevens 2013). It is recognized that genome structure by itself tells us only so much. Functional genomics asks how entire genomes—not just individual genes—function. A surprising discovery of the HGP was that the number of coding genes in humans is far smaller than scientists had assumed at the outset of the project—around 20,000, as in other vertebrates, rather than 80,000–100,000—though the final number remains an open question (Salzberg 2018). The majority of the genome’s DNA is transcribed but not translated and serves a regulatory function, with causal processes at the molecular level associated with interactive networks rather than linear pathways. The deterministic and reductionistic assumptions underlying the HGP that portrayed the genome as a blueprint for organismal development have been undermined by the research in molecular biology the project made possible (Keller 2000). Systems biology has emerged as a new discipline that seeks to understand this complexity using computational methods (Hood 2003). In fact, since completion of the HGP, discovery of non-protein-coding elements of the genome and their contributions to regulatory networks has far exceeded the discovery of protein-coding genes (Gates et al. 2021). Evolutionary studies are aided by the ability of scientists to compare the human genome reference sequence to reference sequences for close relatives, such as Neandertals (Green et al. 2010), bonobos (Mao et al. 2021), and chimpanzees (The Chimpanzee Sequencing and Analysis Consortium 2005).

For scientists to deliver on their promises of a revolution in medicine, genome sequencing would need to become far faster, easier, and cheaper for its use to become routine in both research and clinical settings. In the closing years of the HGP, the cost of sequencing a human genome using existing Sanger technology was about $100 million. With high-throughput next-generation sequencing technology, Venter’s diploid genome was sequenced in 2007 at a cost of $10 million, and by 2013, the cost of sequencing an average genome had been lowered to $5,000. These developments were aided by an NHGRI grant program that set its sights on a $1,000 genome, a price point believed to place the “personal genome” within reach for routine use (Check Hayden 2014). Early in 2014, Illumina, a California-based company, claimed victory in the contest with the availability of its HiSeq X Ten system for population-scale whole-genome sequencing initiatives (Sheridan 2014); in March 2016, Veritas Genetics, a company cofounded by Harvard medical geneticist George M. Church, announced commercial availability of whole-genome sequencing for individuals, including interpretation and counseling, for $999 (Veritas Genetics 2016). Church had initiated the Personal Genome Project in 2005 (Church 2005); there are now Personal Genome Projects in Canada, the U.K., Austria, and China as well. The projects recruit volunteers who are willing to support research by releasing their genomes and health and physical information publicly. Research that takes this longitudinal approach, combining genetic and clinical data, is considered crucial for the promise of genomics to be fulfilled: with faster, easier, and cheaper genome sequencing technologies, genetic data are readily obtainable, but analysis of the data, without which there can be no revolution in medicine, remains challenging.

The National Research Council’s 2011 report saw the route to “personalized medicine,” or “precision medicine” as it was renamed (see Juengst et al. 2016 and Ferryman and Pitcan 2018 for an account of that change), proceeding via a “New Taxonomy” of disease informed by two data repositories: an “Information Commons” that stores molecular data (genome, transcriptome, proteome, metabolome, lipidome, and epigenome) and additional information (phenotypes, treatment outcomes, test results, etc.) gleaned from the electronic health records of millions of individuals; and a “Knowledge Network of Disease” that integrates this information with “fundamental biological knowledge.” Disease generalizations would be “built up from” this large number of individuals, a departure from studies that group individuals based on particular characteristics (e.g., GWAS). Research efforts approaching the scale called for include the All of Us Research Program in the U.S. and the UK Biobank. The data-intensive approach to molecular biology made possible by information technology need not stop at electronic health records but could also include other electronic records such as credit card purchases and social media postings (Weber et al. 2014), and biometric measurements from mobile apps and fitness trackers (Shi and Wu 2017). The increased importance of “big data” is illustrated by contrasting futuristic scenarios envisioned by Collins (1999) and Hood and Rowen (2013) almost 15 years apart. The “individualized medicine” circa 2010 forecasted by Collins is centered in the physician’s office and assumes a traditional view of the doctor–patient relationship. Hood and Rowen foresee individual genome sequences playing a larger role in medical practice and a changed doctor–patient relationship, driven by “patients” who are likely to bring consumer genetic data to their appointments and understand themselves to be active participants in their medical care. 
The new “P4 medicine” will be not only predictive, preventive, and personalized, but participatory, and based on a data-driven systems approach to disease. Write Hood and Rowen: “We envision a time in the future when all patients will be surrounded by a virtual cloud of billions of data points, and when we will have the analytical tools to reduce this enormous data dimensionality to simple hypotheses to optimize wellness and minimize disease for each individual” (p. 83). In the meantime, as Jenny Reardon (2017) tells us, the vast expanse between data and meaning characterizes “the postgenomic condition.”

Physicians are likely to encounter patients who bring consumer genetic data to their appointments because of a development largely unanticipated by HGP proponents and critics alike: “personal genomics” or “recreational genomics.” Efforts to compile SNP databases and develop rapid, inexpensive, whole-genome sequencing technologies have not yet ushered in a new era of personalized medicine and drug development guided by pharmacogenomics, but direct-to-consumer (DTC) genomics has taken off as an industry, with profit-making seemingly unhampered by the lack of treatments for diseases based on knowledge of DNA sequences. The first DTC whole-genome test was marketed in 2006 (Green et al. 2011). By 2018, more than 10 million people had ordered DTC personal genomics tests (Khan and Mittelman 2018), and, that year, the NHGRI, celebrating the HGP’s 15th anniversary, identified DTC genetic testing as one of the “15 for 15” ways in which genomics is influencing the world. In 2019, the global DTC genetic testing market was valued at over $1 billion and forecast to climb to $3.4 billion by 2028 (Ugalmugle and Swain 2020). Although health and ancestry are the most common genetic tests sought, there is a broad range of tests available, and DTC companies usually offer more than one service (Phillips 2016). 23andMe and Ancestry.com, for example, offer both health and ancestry tests. The family match function offered by these tests allows biological parentage to be discovered in cases of adoption and gamete donorship/sale. For people who want to confirm paternity, out a cheating spouse, ascertain athletic ability, identify nutritional needs, or find a romantic partner, there are genetic tests and companies for those interests too.

2. Philosophy and the Human Genome Project

At an October 1988 news conference called to announce his appointment, Watson, in an apparently off-the-cuff response to a reporter who asked about the social implications of the project, promised that a portion of the funding would be set aside to study such issues (Marshall 1996b). The result was the NIH/DOE Joint Working Group on Ethical, Legal, and Social Implications (ELSI) of Human Genome Research, chaired by Nancy Wexler, which began to meet in September 1989. The Joint Working Group identified four areas of high priority: “quality and access in the use of genetic tests; fair use of genetic information by employers and insurers; privacy and confidentiality of genetic information; and public and professional education” (Wexler in Cooper 1994, p. 321). The NIH and DOE each established ELSI programs: philosopher Eric T. Juengst served as the first director of the NIH-NCHGR ELSI program from 1990 to 1994. ELSI was funded initially to the tune of three percent of the HGP budget for both agencies; this was increased to four and later five percent at the NIH, a huge boost in bioethics funding, on the order of tens of millions of dollars each year.

Ethical issues such as genetic privacy, access to genetic testing, and genetic discrimination were not the only considerations of interest to philosophers, and besides ethicists, philosophers of science, political theorists and philosophers working in other areas benefited from ELSI-related funding. There is now a vast literature on human genome-related topics. From among these topics, this section attempts to provide a synopsis of those that are most directly associated with the HGP itself, of greatest concern and enduring interest to philosophers, and not covered in other SEP entries. Since there is interest in exporting the ELSI model to other biomedical contexts, such as neuroscience, consideration is also given to its legacy.

Various HGP proponents told us that we would discover our human essence in the genome. According to Dulbecco (1986), “the sequence of the human DNA is the reality of our species” (p. 1056); Gilbert was quoted as saying “sequencing the human genome is like pursuing the holy grail” (in Lee 1991, p. 9); on the topic of his decision to dedicate three percent of HGP funds to ELSI, Watson wrote: “The Human Genome Project is much more than a vast roll call of As, Ts, Gs, and Cs: it is as precious a body of knowledge as humankind will ever acquire, with a potential to speak to our most basic philosophical questions about human nature, for purposes of good and mischief alike” (with Berry 2003, p. 172).

“Geneticization” is a term used to describe the phenomenon characterized by an increasing tendency to reduce human differences to genetic ones (Lippman 1991). The several billion dollars of funding for the HGP was justified by the belief that genes are key determinants of not only rare Mendelian diseases like Huntington’s disease or cystic fibrosis but common multi-factorial conditions like cancer, depression, and heart disease. Wrote an early critic of the HGP: “Without question, it was the technical prowess that molecular biology had achieved by the early 1980s that made it possible even to imagine a task as formidable as that of sequencing what has come to be called ‘the human genome.’ But it was the concept of genetic disease that created the climate in which such a project could appear both reasonable and desirable” (Keller 1992, p. 293). Given that the development of any trait involves the interaction of both genetic and nongenetic factors, on what bases can genes be privileged as causes to claim that a particular disease or nondisease trait is “genetic” or caused by a “genetic susceptibility” or “genetic predisposition”? This question has led philosophers of science to grapple with appropriate definitions for terms such as “genetic disease” and “genetic susceptibility” and how best to conceptualize genetic causation and gene–environment interaction (e.g., Kitcher 1996; Gannett 1999; Kronfeldner 2009). Closely related to the concepts of geneticization and genetic disease/susceptibility/predisposition are assumptions about genetic reductionism and genetic determinism.

Genetic reductionism can be understood as governing the whole–part relation in which organismal properties are explained solely in terms of genes and organisms are identified with their genomes. However, definitions of health and disease attach to organisms and their physiological and developmental processes in particular contexts (provided by populations and environments) and cannot simply be relocated to the level of the genome (Griesemer 1994; Limoges 1994; Lloyd 1994). Nor do diseases become more objectively defined entities once they receive a genetic basis, since social and cultural values implicated in designations of health and disease can become incorporated at the level of the genome, in what counts as a normal or abnormal gene (Gannett 1998). In contrast to physical reductionism, which does not privilege DNA but considers it on par with proteins, lipids, and other molecules (Sarkar 1998), genetic reductionism assumes that genes are in some sense more causally efficacious. Genetic determinism concerns such assumptions about the causal efficacy of genes.

In a public lecture held to celebrate completion of the HGP, Collins characterized the project as “an amazing adventure into ourselves, to understand our own DNA instruction book, the shared inheritance of all humankind” (see National Human Genome Research Institute, 2003). At the cellular level, the book is said to contain “the genetic instructions for the entire repertoire of cellular components” (Collins et al. 2003, p. 3). At this level, genetic determinism is sustained by metaphors of Weismannism and DNA as “code” or “master molecule” (Griesemer 1994; Keller 1994), which accord DNA causal priority over other cellular components. This may be in a physical sense: Weismannism assumes (falsely) that intergenerational continuity exists only for germ cell nuclei whereas somatic cells and germ cell cytoplasm arise anew in each generation. It may also be in the sense of a point of origin for the transfer of information: the central dogma of molecular biology, which represents a 1950s reformulation of Weismannism in terms of information theory, asserts that information travels unidirectionally from nucleic acids to protein, and never vice versa. It is contentious, however, whether amongst the cell’s components only nucleic acids can be said to transmit information: for some philosophers, genetic coding plays a theoretical role at least at this cellular level (Godfrey-Smith 2000); for others, genetic coding is merely (and misleadingly) metaphorical, and all cellular components are potential bearers of information (Griffiths 2001; Griffiths and Gray 1994; Sarkar 1996).

At the organismal level, new research in functional genomics may lead to less deterministic accounts even of so-called single-gene disorders. For these, the concepts of penetrance and expressivity operate to reconcile the one–one genetic determinist model, in which the mutation is necessary and/or sufficient for the presence of the condition, with confounding patterns of phenotypic variability. But the severity of even a fully penetrant condition like Huntington’s disease seems to depend not just on genetic factors like the number of DNA repeats in the mutation but on epigenetic factors like the sex of the parent who transmitted the mutation (Ridley et al. 1991). For complex conditions to which both genetic and environmental differences contribute—for example, psychiatric disorders or behavioral differences—genetic determinism is denied, and everyone is an interactionist these days, in some sense of “interaction.” Both genes and environment are recognized to be necessary for development: by themselves, genes cannot determine or do anything. Yet theorists still seem to give the nod to one or the other, suggesting that it is mostly genes or mostly the environment, mostly nature or mostly nurture, that make us what we are. This implies that it is possible to apportion the relative contributions of each. Gilbert (1992) suggests this in his dismissal of a more simplistic version of genetic determinism: “We must see beyond a first reaction that we are the consequences of our genes; that we are guilty of a crime because our genes made us do it; or that we are noble because our genes made us so. This shallow genetic determinism is unwise and untrue. But society will have to wrestle with the questions of how much of our makeup is dictated by the environment, how much is dictated by our genetics, and how much is dictated by our own will and determination” (pp. 96–97).
However, the assertion that the relative contributions of genes and environment can be apportioned in this way is misleading if not outright false. Building on R. C. Lewontin’s (1974) classic paper on heritability, work in developmental systems theory (DST) undermines any such attempts to apportion causal responsibility in organismal development: traits are jointly determined by multiple causes, each context-sensitive and contingent (Griffiths and Gray 1994; Griffiths and Knight 1998; Oyama 1985; Oyama et al. 2001; Robert 2004).

Geneticization, genetic reductionism, and genetic determinism helped to sell the HGP. Gilbert (1992) endorsed the reduction of individual humans to their genes: “The information carried on the DNA, that genetic information passed down from our parents,” he wrote, “is the most fundamental property of the body” (p. 83), so much so, in fact, that “one will be able to pull a CD out of one’s pocket and say, ‘Here is a human being; it’s me!’” (p. 96). Cancers that we consider to be environmental in their origins were recast as genetically determined. In Watson’s words: “Some call New Jersey the Cancer State because of all the chemical companies there, but in fact, the major factor is probably your genetic constitution” (in Cooper 1994, p. 326). In Bodmer’s words: “Cancer, scientists have discovered, is a genetic condition in which cells spread uncontrollably, and cigarette smoke contains chemicals which stimulate those molecular changes” (Bodmer and McKie 1994, p. 89). (See Proctor 1992 and Plutynski 2018 for discussions of cancer as a genetic disease.) Marking the 20th anniversary of the release of the draft sequences, Richard Gibbs, director of the sequencing center at Baylor, admits that “there was plenty of hype that was shared with the media and the wider community” and that such “outlandish visions” as personalizing therapies, revealing the “mysteries of the architecture of common complex diseases,” and predicting criminality have not been realized. But Gibbs excuses the hype as necessary for generating support for the project: “The hyperbole that we look back on did not, however, come from the front line. It came from those who championed the programme, mindful of its long-term benefits. Thanks to them, they generated the enthusiasm to fund this transformative work” (2020, p. 575).

Like Gibbs, bioethicist Timothy Caulfield (2018) finds the hype and hyperbole of scientists understandable: “Enthusiasm and optimistic predictions of near-future applications are required in order to mobilize the scientific community and potential funders, both public and private. This is particularly so in areas like genomics, where large amounts of sustained funding are required in order to achieve the hoped for scientific and translational goals” (p. 561). However, unlike Gibbs, Caulfield details possibilities of “real harm,” which include “potentially eroding public trust and support for science; inappropriately skewing research priorities and the allocation of resources and funding; creating unrealistic expectations of benefit for patients; facilitating the premature uptake of expensive and potentially harmful emerging technologies by health systems; misinforming policy and ethics debates; and accelerating the marketing and utilization of unproven therapies” (p. 567). The hype and hyperbole used to promote personalized (or precision) medicine carry the risks Caulfield mentions.

Approaching 20 years since completion of the HGP, genome science has not revolutionized medicine or markedly improved human health. Progress has been made on rare diseases (e.g., spinal muscular atrophy) and some forms of cancer (e.g., non-small cell lung cancer), though these interventions can be prohibitively expensive (Tabery 2023). For most complex diseases, however, predictions based on “family history, neighborhood, socioeconomic circumstances, or even measurements made with nothing more than a tape measure and a bathroom scale” outperform predictions based on the possession of genetic variants identified by GWAS. Pressing public health problems such as increasing obesity, the opiate epidemic, and mental illness fail to be addressed by the “human genome-driven research agenda” to which the lion’s share of resources go (Joyner and Paneth 2019). So even though the deterministic and reductionistic assumptions underlying the HGP have been undermined by the research in molecular biology the project made possible (Keller 2000), the critics’ worries about geneticization, genetic reductionism, and genetic determinism remain relevant, in particular their belief that embracing a reductionist approach to medicine that conceives of human health and disease in wholly molecular or genetic terms individualizes them and diverts attention from risk factors associated with our shared social and physical environments (Nelkin and Tancredi 1989; Hubbard and Wald 1993; Tabery 2023).

Genetic testing is carried out for a range of purposes: diagnostic, predictive, and reproductive. Genetic testing carried out at the population level for any of these purposes is referred to as genetic screening. Diagnostic genetic testing is performed on individuals already experiencing signs and symptoms of disease as part of their clinical care. Newborn screening programs to diagnose conditions such as PKU and hemoglobinopathies based on blood components and circulating metabolites (thus providing indirect genetic tests) have been carried out for many decades. Predictive genetic testing is performed on individuals who are at risk for inheriting a familial condition, such as cystic fibrosis or Huntington’s disease, but do not yet show any signs or symptoms. Reproductive genetic testing is carried out through carrier screening, prenatal testing of the fetus in utero, and preimplantation genetic diagnosis (PGD) of embryos created by in vitro fertilization (IVF). In carrier screening, prospective parents find out whether they are at risk for passing on disease-related genes to their offspring. Prenatal genetic testing of fetuses in utero is conducted using blood tests early in a woman’s pregnancy, chorionic villus sampling (CVS) at 10–12 weeks, and amniocentesis at 15–18 weeks. Testing is increasingly offered to all women who are pregnant, not just those for whom risk is elevated because of age or family history; based on the results, women can elect to continue the pregnancy or abort the fetus. In PGD, a single cell is removed from the 8-cell embryo for testing; based on the results, a decision is made about which embryo(s) to implant in the woman’s uterus.

There are significant ethical issues associated with genetic testing. These issues are informed by empirical studies of the psychosocial effects of testing (Wade 2019). Increased knowledge that comes from predictive genetic testing is not an unmitigated good: denial may be a coping mechanism; individuals may feel guilty for passing on harmful mutations to their offspring or stigmatized as having the potential to do so; survivor guilt may arise in those who find out they are not at risk for a disease such as Huntington’s after all, or they may be at a loss about how to live their lives differently; those who find out they are destined to develop Huntington’s or early-onset Alzheimer’s disease may become depressed or even suicidal; paternity may not be what it is assumed to be; decisions about disclosing results have implications for family members. During debates about the HGP, many authors appealed to the history of eugenics to warn about the dangers of reproductive genetic testing and urge caution as we move forward—so much so that historian Diane Paul (1994) characterized eugenics as the “‘approved’ project anxiety” (p. 143). Paul noted that attempts to draw lessons from the history of eugenics are confounded by disagreements about how to define “eugenics”—whether to characterize eugenics according to a program’s intentions or effects, its use of coercive rather than voluntary means, or its appeals to social and political aims that extend beyond the immediate concerns of individual families. The label “liberal eugenics” has become increasingly accepted for characterizing offspring selection based on parental choice. Reproductive rights are no longer just about the right not to have a child (to use contraception, to have an abortion) or the right to bear a child (to refuse population control measures). Reproductive rights have come to encompass the right to access technological assistance to procreate and to have a certain kind of child (Callahan 1998).

Concerns about genetic discrimination resulting from genetic testing were frequently expressed at the outset of the HGP. Concerns focused mostly on insurance companies and employers, but possibilities for genetic discrimination occurring in other institutional settings were raised as well (Nelkin 1992; Nelkin and Tancredi 1989). A number of general arguments have been made against institutional forms of genetic discrimination: we don’t choose our genes and ought not be punished for what is outside our control (Gostin 1991); the social costs of creating a “biologic” or “genetic underclass” of people who lack health care and are unemployed or stuck in low-wage jobs are too great (Lee 1993; Nelkin and Tancredi 1989); people’s fears of genetic discrimination, whether realistic or not, may lead them to forego genetic testing that might benefit their lives and be less inclined to participate in genetic research (Kass 1997); people have the right not to know their genetic risk status (Kass 1997). Genetic discrimination may also occur in less formal circumstances. Mate choice could increasingly proceed based on genetic information, with certain people being labeled as undesirable. As more and more fetuses are aborted on genetic grounds, families of children born with similar conditions, and people with disabilities and their advocates more broadly, worry that increased stigmatization will result. In addition, group-based genetic research into diseases or behavioral differences risks stigmatizing people based on racial, ethnic, and gender differences, with such risks informed by the troubling history of the study of the genetics of intelligence (Tabery 2015).

In the U.S., where unlike other industrialized countries there is no publicly funded system of universal health care, genetic discrimination by insurance companies and employers has been a particularly serious worry; existing or prospective employees found to be at genetic risk could be fired or not hired by employers to reduce costs of providing health care coverage. ELSI research relating to genetic privacy and the risk of genetic discrimination is credited with bringing about changes in federal law with “far reaching” effects on society (McEwen et al. 2014)—in particular, passage of the Genetic Information Nondiscrimination Act (GINA) in May 2008, which prohibits U.S. health insurance companies and employers from discriminating based on genetic information, defined to include genetic test results and family history but not manifest disease. The Affordable Care Act, passed in 2010, by prohibiting discrimination by health insurers based on preexisting conditions, which include genetic test results and manifest disease, fills in that gap and negates the need for GINA in the context of health insurance. As for employment, there remains a gap: employees who are substantially impaired are covered by the Americans with Disabilities Act and employees who are asymptomatic with genetic tests showing a predisposition for disease are covered by GINA, but employees with manifest disease who are not substantially impaired are covered by neither (Green et al. 2015). GINA does not prohibit use of genetic information in underwriting for life, disability, or mortgage insurance. Discrimination takes the form of refusing coverage on the basis that the genetic susceptibility counts as a “preexisting condition,” charging high premiums for the policy, limiting benefits, or excluding certain conditions. In 2020, Florida became the first state to prohibit use of genetic test results by life insurance companies.

The insurance industry argues that there is no principled reason to treat genetic information any differently from other medical information used in underwriting. They point to the problem of “adverse selection”: people who know themselves to be at high risk are more likely to seek insurance than people who know themselves to be at low risk, which threatens the market when insurers are deprived of the same information (Meyer 2004; Pokorski 1994). Taking an approach to legislation and policy that singles out genetic information for protection has also been criticized philosophically for being based on “misconceptions [that] include the presumption that a clear distinction exists between genetic and nongenetic information, tests, and diseases and the genetic essentialist belief that genetic information is more definitive, has greater predictive value, and is a greater threat to our privacy than is nongenetic medical information” (Beckwith and Alper 1998, p. 208; see also Rothstein 2005). The approach has been dubbed “genetic exceptionalism” and is criticized for drawing from, and in turn fostering, myths of genetic determinism and genetic reductionism (Murray 1997, p. 61; see also O’Neill 2001). Rather than assuming the binarism implicit here—that genetic information is unique and targeted policies are necessary or that genetic information is not unique and targeted policies are unnecessary—“genomic contextualism” has been recommended as an alternative approach (Garrison et al. 2019a). This approach recognizes that there are both similarities and differences between genomic and other types of clinically relevant information and that the specific context in which the policy or practice is implemented determines how best to proceed. Since completion of the HGP, the contexts in which genomics is practiced have changed sufficiently that while privacy concerns remain pressing, the debate has been largely recast.

At the outset of the HGP, concerns about genetic privacy focused on how to protect the public from intrusive governments, employers, and insurance companies. But the explosion of DTC genomics has raised privacy concerns generated by the very public in need of protection. In DTC genomics, Y-chromosomal, mitochondrial, and autosomal DNA ancestry tests are used to provide familial matches. These matches enable adoptees in closed adoptions and offspring of anonymous gamete donors to track down biological parents and other family members, raising obvious privacy concerns for those who gave up children for adoption or agreed to donate or sell eggs or sperm expecting that their anonymity would be protected. Additional privacy concerns arise as the result of genetic genealogy’s use of matches to cousins of various degrees to fill out missing branches in family trees. Genome scientists rely on cell lines, DNA sequence data, and clinical data sets that have been de-identified to protect the anonymity of volunteers; however, access to genetic genealogy databases that combine genetic and traditional genealogical information makes re-identification possible. In genetic genealogy, surname projects based on the co-inheritance of surnames and Y-chromosomal haplotypes furnish candidate surnames when sequence information is available; using Internet searches to match surnames with year of birth and U.S. state of residence, researchers were able to identify individuals who had participated in the 1000 Genomes Project; by extension, they also identified family members who had not participated in the project or consented to share information (Gymrek et al. 2013). Based on population genetic modelling, researchers suggest that with a genetic genealogy database that covers two percent of a target population, a third cousin match can be obtained for 99 percent of the population.
With this match, family trees constructed using traditional genealogical methods and additional sources of information can be used to identify an unknown individual for whom DNA is available: this is the “long range familial search” approach that law enforcement is using in an increasing number of active as well as cold cases (Erlich et al. 2018).
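The coverage arithmetic behind such estimates can be illustrated with a back-of-envelope sketch. This is not the model of Erlich et al. (2018), only a simplified independence assumption, and the figure of roughly 800 third cousins per person is a hypothetical number chosen for illustration:

```python
# Back-of-envelope sketch of why modest database coverage yields
# near-universal familial matches. The ~800 third cousins figure is a
# hypothetical illustrative assumption, not a number from the source.

def match_probability(db_coverage: float, n_relatives: int) -> float:
    """Probability that at least one of n_relatives appears in a database
    covering a fraction db_coverage of the target population, assuming
    relatives enter the database independently and uniformly at random."""
    return 1.0 - (1.0 - db_coverage) ** n_relatives

# With 2 percent coverage and ~800 third cousins, a match is nearly certain.
print(f"{match_probability(0.02, 800):.4f}")  # prints a value above 0.99
```

Even under this crude model, the probability that none of several hundred relatives appears in a database covering two percent of the population is vanishingly small, which is why the "long range familial search" strategy scales so quickly.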

A response to the inability to guarantee anonymity for participants in genomic research is to consider concerns about genetic privacy and privacy more generally to be passé. Such concerns are increasingly seen to stand in the way of scientific progress. Watson and Venter have promoted the idea that there is nothing to fear by making one’s sequence public rather than protecting it as private. Venter’s diploid genome was fully sequenced and the findings published in the October 2007 issue of PLoS Biology (Levy et al. 2007); this was followed by the publication of Watson’s “complete” genome in the 17 April 2008 issue of Nature (Wheeler et al. 2008). Relevant to the privacy question, Watson did not bare all: at Watson’s request, the APOE gene, which is linked to Alzheimer’s disease, was omitted from his sequence. Along similar lines, the volunteers Church recruits for the Personal Genome Project agree to release their genomes and health and physical information publicly, a model of “open consent” replacing genetic privacy (Lunshof et al. 2008). Says Church: “Ideally, everybody on the planet would share their medical and genomic information” (in Dizikes 2007). Large-scale biobanking initiatives such as the All of Us Research Program, originally called the Precision Medicine Initiative or PMI, appeal to collective altruism. The Internet has radically changed people’s expectations of privacy, the boundary between their personal and public lives, and their expectations of accessing information, and these changes are welcomed by genome scientists. To ensure “the free flow of research data,” the 2011 National Research Council report calls for the “[g]radual elimination of institutional, cultural, and regulatory barriers to widespread sharing of the molecular profiles and health histories of individuals, while still protecting patients’ rights” (p. 60).

With this free flow of research data, across institutions and globally, with varying degrees of oversight, people’s ability to consent to use of their biospecimens and data for some but not other purposes becomes impossible (Zarate et al. 2016). Privacy concerns are amplified insofar as genomics is a science driven by big data. The promise of personalized and precision medicine is premised, for some, on amassing all obtainable data on individuals, whether mined from electronic health records, government databases, DTC genomics, genealogy sites, mobile devices, social media, credit card transactions, Fitbits, etc. Data-driven biology forgoes hypotheses for algorithms, but these algorithms are not innocuous. Hallam Stevens (2021) describes “an emerging medical-industrial complex” that presents “substantial challenges for privacy, data ownership, and algorithmic bias,” which, if not addressed, will lead to a genomic science that operates in the interests of “surveillance capitalism” and the corporate tech giants (pp. 565–566). Governments, especially authoritarian ones, are building databases that supplement DNA with biometrics and social media posts to carry out genomic surveillance that often targets minorities (Moreau 2019).

Of course, significant social privilege attaches to some people’s ability not to worry that genetic testing offered as an employee benefit (Singer 2018) or their virtual cloud of billions of data points will cause them harm. Or that they will regret spitting into a tube and sending it off to 23andMe. As Anna Jabloner (2019) argues, “molecular identification technologies … tell a tale of two molecular Californias: one is a tale of an unchanged biological determinism that continues to mark some bodies as risky and criminal, the other tale is of individual empowerment through the consumption of molecular knowledge” (p. 15). Black and Latino men are overrepresented in CAL-DNA, which is one of the largest criminological DNA databases in the world, while 23andMe’s even larger database contains the DNA of mostly wealthy white Americans. Similarly, Reardon (2017) comments on the tension between the democratizing impulse of the open-data model of Church’s Personal Genome Project and the overwhelming Whiteness, affluence, and maleness of the tech-savvy volunteers who have contributed their genomes.

Indigenous groups resist this open consent model that appeals to collective altruism and seek to maintain control over biospecimens contributed and data generated. As participants in scientific studies, they have experienced lack of support for their interests and priorities, failure to share benefits of research, disrespect for cultural and spiritual beliefs, theft of traditional knowledge, conduct of unapproved secondary research, and opportunistic commercialization (Garrison et al. 2019b). Genetics and genomics bring specific concerns, as “genomic data are commonly seen by Indigenous communities as more sensitive than other types of health data, particularly with regard to genealogy and ancestry research that can influence traditionally held beliefs, cultural histories and identity claims affecting rights to land and other resources” (Hudson et al. 2020, p. 378). Given distrust of funding agencies, universities, and researchers arising from these experiences and concerns, Indigenous underrepresentation in biobanks, DNA sequence databases, and clinical datasets is not surprising. Genome scientists desire access to biospecimens and data of Indigenous peoples for a range of purposes: geographical isolation of populations over tens of thousands of years can yield genetic variants of physiological and clinical interest associated with adaptive responses to environments; comparative genomics and ancient DNA studies contribute to knowledge of human evolutionary history; and efforts to identify genetic contributions to complex traits using GWAS and admixture mapping depend on access to populations that incorporate the genetic diversity of the species. Any progress in the prevention and treatment of disease that arises through precision medicine will be weighted towards populations most studied. Indigenous peoples make up only 0.022% of participants in GWAS conducted worldwide (Mills and Rahal 2019).

In recognition of the overrepresentation of people of European descent in biobanks, DNA sequence databases, and clinical datasets, the NIH’s All of Us initiative seeks to include at least 50 percent underrepresented minorities, motivated by the goal that precision medicine benefit everyone. Keolu Fox (2020) points out that NIH plans to include Indigenous communities in the All of Us initiative fail to appreciate that the open-source data approach used for previous government-funded, large-scale human genome sequencing efforts such as the International HapMap Project and 1000 Genomes Project facilitates the commodification of data by pharmaceutical and ancestry-testing companies. Nanibaa’ A. Garrison et al. (2019b) suggest that development of alternative models for genetic and genomic research involving Indigenous peoples begin by recognizing Indigenous sovereignty, which is “the inherent right and capacity of Indigenous peoples to develop culturally, socially, and economically along lines consistent with their respective histories and values” (pp. 496–497). There should be tangible benefits for Indigenous communities (e.g., support for health promotion, meaningful results), with equitable sharing of profits should commercialization occur (Hudson et al. 2020). Community engagement is crucial; this may extend to community-based participatory research that views Indigenous communities as partners in research, not merely subjects of research. Individual consent is insufficient and should be preceded by collective consent, and consent needs to be an ongoing process for any subsequent research contemplated (Garrison et al. 2019b; Tsosie et al. 2019). Indigenous control over the use and disposal of DNA samples is needed: under the “DNA on loan” approach developed in Canada (Arbour and Cook 2006), the participant or community retains ownership of biological materials and entrusts these to researchers or research institutions as stewards.
As for data gleaned from biological materials, the model of open consent is counter to the concept of Indigenous data sovereignty, which is “the inherent and inalienable rights and interests of indigenous peoples relating to the collection, ownership and application of data about their people, lifeways and territories” (Kukutai and Taylor 2016, p. 2).

Early in the debates surrounding plans for the HGP, questions arose concerning what it means to map and sequence the human genome—“get the genome,” as Watson (1992) put it. About these concerns, McKusick (1989) wrote: “The question often asked, especially by journalists, is ‘Whose genome will be sequenced?’ The answer is that it need not, and surely will not, be the genome of any one person. Keeping track of the origin of the DNA that is studied will be important, but the DNA can come from different persons chosen for study for particular parts of the genome” (p. 913). The HGP and Celera reference sequences are indeed composites based on chromosomal segments that originate from different individuals: the sequence in any given region of the genome belongs to a single individual, but sequences in different regions of the genome belong to different individuals. However, in both cases, the majority of the sequence originates from just one person. As HGP sequencing efforts accelerated, concerns arose that only four genomes, a couple of which belonged to known laboratory personnel, were being used for physical mapping and sequencing (Marshall 1996a). The decision was made to construct 10 new clone libraries for sequencing with each library contributing about 10 percent of the total DNA. In the end, 74.3 percent of the total number of bases sequenced was derived from a single clone library—that of a male, presumably from the Buffalo area; seven other clone libraries contributed an additional 17.3 percent of the sequence (International Human Genome Sequencing Consortium 2001, p. 866). A similar proportion—close to 71 percent—of the Celera sequence belongs to just one male even though five ethnically diverse donors were selected; incredibly enough, rumors were eventually confirmed that this individual is Venter himself (McKie 2002).

The deeper question, of course, is how we might understand a single human genome sequence, a composite that belongs to no actual individual in its entirety and only a handful of individuals in its parts, to be representative of the entire species. This seems to ignore the extensive genetic variability that exists. Early critics of the HGP pointed out numerous faults with the concept of a representative or putatively normal genome: many DNA polymorphisms are functionally equivalent (Sarkar and Tauber 1991); the genome sequence will contain unknown defective genes (since no one, including donors, is free of these), and it is impossible to identify the genetic basis of a disorder simply by comparing the sequences of sick and well people since there will be many differences between them (Lewontin 2000 [1992]); and from an evolutionary viewpoint, mutations are not “errors” in the genetic code or “damage” to the genome’s structure, but the genetic variants that provide the raw materials that make it possible for new species to arise (Limoges 1994, p. 124). There were related worries that the human genome reference sequence would arbitrate a standard of genetic normality; for example, the application of concepts like “genetic error” and “damage” to the genome institutes a call for correction or repair (Limoges 1994; also Murphy 1994). Indeed, the 1988 Office for Technology Assessment report on the HGP recommended the “eugenic use of genetic information … to ensure … that each individual has at least a modicum of normal genes” (p. 85).

Science named “human genetic variation” as “Breakthrough of the Year” for 2007. Humans have been found to be 99.9 percent alike genetically, but notable for the magazine was the extent to which individuals had been found to differ genetically from one another—in SNPs, insertions, deletions, and other structural elements—and the promise this apparently unexpected amount of variation holds for using genome-wide association studies (GWAS) to discover the genetic bases for complex traits, both disease and non-disease traits, to which multiple genetic and nongenetic factors contribute. The human genome reference sequence has been useful as a tool for discovering and cataloguing that genetic variation by providing a standard shared by the scientific community: “The current reference genome assembly works as the foundation for all genomic data and databases. It provides a scaffold for genome assembly, variant calling, RNA or other sequencing read alignment, gene annotation, and functional analysis. Genes are referred to by their loci, with their base positions defined by reference genome coordinates. Variants and alleles are labeled as such when compared to the reference (i.e., reference (REF) versus alternative (ALT)). Diploid and personal genomes are assembled using the reference as a scaffold, and RNA-seq reads are typically mapped to the reference genome” (Ballouz et al. 2019, p. 159). However, what had been portrayed as a journalist’s or philosopher’s question is now being asked by genome scientists too. Even as a tool, there are challenges to overcome. If alleles included in the reference sequence are relatively rare, “reference bias” is introduced: genomes that resemble the reference genome are easier to assemble and align, and variants are missed or misidentified (Ballouz et al. 2019).
And when entire stretches of sequence are missing from the reference sequence, reads from those regions are discarded and the sequences missed entirely, because of the reliance on the genome reference sequence for assembling and aligning sequenced genomes (Sherman and Salzberg 2020).
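The reference sequence's role as a shared coordinate system can be sketched with a toy example. The sequences and the naive position-by-position comparison below are invented for illustration; real variant calling involves alignment, quality filtering, and much more:

```python
# Toy illustration (not a production pipeline) of how the linear reference
# genome serves as a coordinate system: variants are labeled by position
# relative to the reference, with REF and ALT alleles, as in VCF records.

def call_snvs(reference: str, sample: str):
    """Report single-nucleotide differences between a pre-aligned sample
    sequence and the reference as (position, REF, ALT) tuples, using
    1-based coordinates as VCF does."""
    assert len(reference) == len(sample), "toy model assumes equal lengths"
    return [(pos + 1, ref_base, alt_base)
            for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
            if ref_base != alt_base]

ref    = "ACGTACGTAC"
sample = "ACGAACGTGC"
print(call_snvs(ref, sample))  # → [(4, 'T', 'A'), (9, 'A', 'G')]
```

The limitation the text describes falls out of this picture directly: a sample base can only be reported relative to a reference position, so sequence absent from the reference has no coordinate at which to be recorded.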

Since 2003, there have been ongoing efforts to update the human genome reference sequence by filling gaps, correcting errors, and replacing minor alleles (the current version of the reference sequence is GRCh38). Diversity has been incorporated in the reference sequence by tacking on additional sequences, but because the representation remains linear, location information is lost (Kaye and Wasserman 2021). Suggestions have been made for ways to further improve the genome reference sequence. Recently developed long-read sequencing technologies facilitate the discovery of large structural variants (SVs) and not just genetic variants (i.e., SNPs and smaller insertions and deletions, or “indels”). One suggestion calls for “reconstructing a more precise canonical human reference genome” by using those SVs to correct misassemblies in the reference sequence and adding the more common SVs to improve variant detection (Yang et al. 2019). Another suggestion recommends adopting a “consensus sequence” approach, in which the most common alleles and variants in the population are chosen for inclusion (Ballouz et al. 2019). Problems remain, however: these remain composite genomes and may contain sequences that would not be found together in any individual, and ongoing updates undermine the stability of the reference sequence as a reference (Kaye and Wasserman 2021). Suggestions have also been made for the replacement of the genome reference sequence. Long-read sequencing allows the “de novo assembly” of genomes, which obviates the need for the reference sequence for scaffolding (Chaisson et al. 2015). A human “pan-genome” would accommodate variation by serving as a collection of all the DNA sequences found in the species, both SVs and genetic variants, and replace a linear representation of the genome with more complex genome graphs (Sherman and Salzberg 2020; Miga and Wang 2021).
Another possibility is “The Genome Atlas,” which foregoes use of the reference sequence even as a coordinate system; entries in the atlas are instead features of the genome, each assigned a unique feature object identifier (FOI). The database generates blueprint genomes based on selected features for use in analyzing sequencing reads (Kaye and Wasserman 2021).
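The “consensus sequence” proposal of Ballouz et al. (2019) can be sketched in miniature (hypothetical toy data; the actual proposal operates over population-scale variant catalogues, not short strings):

```python
from collections import Counter

def consensus_sequence(aligned_genomes):
    """At each position, choose the allele most common among the sampled
    genomes (toy version: equal-length, pre-aligned sequences)."""
    return "".join(
        Counter(genome[i] for genome in aligned_genomes).most_common(1)[0][0]
        for i in range(len(aligned_genomes[0]))
    )

# Majority alleles at different sites come from different individuals,
# so the consensus matches no single sampled genome exactly: this is the
# "composite genome" worry noted by Kaye and Wasserman (2021).
sample = ["ACGTA", "ACGAG", "TCGTG"]
print(consensus_sequence(sample))  # ACGTG, identical to no genome in the sample
```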

Philosophical concerns about whether a human genome reference sequence arbitrates a standard of genetic normality remain, though these may be mitigated by the pan-genome and genome atlas approaches. An empirically validated consensus sequence approach that includes the most common alleles and variants in the population in the genome reference sequence does not imply that those alleles and variants are of biomedical significance because they are conducive to health or of evolutionary significance because they are ancestral. Science’s 18 February 2011 issue in celebration of the 10-year anniversary of publication of the draft human genome sequence contains an essay by genome scientist Maynard V. Olson, which asks, in its title, “What Does a ‘Normal’ Human Genome Look Like?” Although the HGP was criticized as anti-evolutionary, pre-Darwinian, typological, and essentialist for seemingly instituting a standard of genetic normality, it was also argued that the HGP might be seen instead as incorporating a specific set of evolutionary assumptions (Gannett 2003). Indeed, from an evolutionary perspective, Olson contends that genetic variability among relatively healthy humans is largely composed of deleterious mutations, rather than adaptive mutations due to balancing or diversifying selection, and that, consequently, “there actually is a ‘wild-type’ human genome—one in which most genes exist in an evolutionarily optimized form” (p. 872), though individual humans inevitably “fall short of this Platonic ideal.” Judgments about what constitutes a “normal,” “wild-type,” or “ideal” human genome do not escape the socio-cultural contexts in which they arise. Social and cultural values that attach to judgments at the phenotypic level are simply embedded in the genome, where they are less visible as such (Gannett 1998).

In promoting the HGP, Gilbert (1992) suggested that we will find answers to the age-old question about human nature in our genome: “At the end of the genome project, we will want to be able to identify all the genes that make up a human being…. So by comparing a human to a primate, we will be able to identify the genes that encode the features of primates and distinguish them from other mammals. Then, by tweaking our computer programs, we will finally identify the regions of DNA that differ between the primate and the human—and understand those genes that make us uniquely human” (p. 94). Although philosophers challenge the species essentialism that defines species in terms of genetic properties shared by all and only their members (Gannett 2003; Robert and Baylis 2003), genome scientists are indeed comparing the human genome reference sequence to chimpanzee, bonobo, and Neandertal genome reference sequences to explore questions about human nature. Already in 1969, Sinsheimer foresaw the promise of molecular biology to remake human nature: “For the first time in all time, a living creature understands its origin and can undertake to design its future” (in Kevles 1992, p. 18). Remaking human nature is likely to begin with genetic modifications that convey the possibility of resistance to a serious disease, like HIV/AIDS, or minimize the effects of aging to extend lifespan, but transhumanists who view human nature as “a work-in-progress, a half-baked beginning that we can learn to remold in desirable ways” welcome improvements in memory, intelligence, and emotional capacities as well (Bostrom 2003, p. 493). Theories of justice are typically based on conceptions of human nature; with the capacity to remake human nature, that foundation disappears (Fukuyama 2002; Habermas 2003), even as the new field of sociogenomics continues to favor nature over nurture (Bliss 2018).

Concerns about race, ethnicity, and the genome were raised in the early years of the HGP. Racial profiling in the legal system was one such concern: if people belonging to particular racial and ethnic groups are more likely to be arrested, charged, or convicted of a criminal offense, they are more likely to be required to provide DNA samples to forensic databases, and therefore more likely to come back into the system with future offenses (Kitcher 1996). Another concern was that if genetic discrimination by insurers and employers creates a “genetic underclass” (Lee 1993; Nelkin and Tancredi 1989), then to the extent that race and ethnicity correlate with socioeconomic status, some groups—already affected by disparities in health outcomes unrelated to genetic differences—will be disproportionately represented among this “genetic underclass.” A further concern was the social stakes involved when group-based differences are identified, whether these involve sequences localized to particular groups or varying in frequency among groups (Lappé 1994). And given the history of using biological explanations to provide ideological justification for social inequalities associated with oppressive power structures, the prospective use of molecular genetics to explain race differences was met with caution (Hubbard 1994). These concerns were dismissed by HGP proponents, who argued that mapping and sequencing the human genome celebrate our common humanity. At the June 26, 2000 White House press conference announcing completion of “a working draft” of the sequence of the human genome, Venter announced that the results show that “the concept of race has no genetic or scientific basis” (see Clinton, et al. 2000).

Post-HGP genetics and genomics have not lived up to the mantra that because we are 99.9 percent the same, there is no such thing as race. Indeed, a predominantly African American racial identity has been ascribed to the human genome reference sequence itself (Reich et al. 2009), and the de novo assembly of genomes permitted by long read sequencing has led several countries, including China, Korea, and Denmark, to produce their own “ethnicity-specific reference genomes” (Kowal and Llamas 2019). The International HapMap Project, which was initiated in 2002 with the goal of compiling a haplotype map adequately dense with SNP markers to permit the identification of genes implicated in common diseases and drug responses, sampled the DNA of four populations (European-Americans in Utah; Yoruba in Ibadan, Nigeria; Japanese in Tokyo; and Han Chinese in Beijing). This reproduction of racial categories (here, European, African, and Asian) at the level of the genome has been characterized as the “molecular reinscription of race” (Duster 2015). DTC ancestry testing appeals to a range of group categories, which are defined by geography, nationality, ethnicity, and race: Ancestry.com advertises tests for Irish ancestry in early March each year; FamilyTreeDNA confirms Jewish ancestry, whether Ashkenazi or Sephardi; African Ancestry, Inc. finds ancestral ties to present-day African countries and ethnic groups dating back more than 500 years; and DNAPrint’s panels of ancestry informative markers determine proportions of (Indo)European, East Asian, sub-Saharan African, and Native American heritage.

The resurgence in biological thinking about race and ethnicity since the HGP is due in large part to the postgenomic use of racial and ethnic categories of difference to try to capture patterns of group genetic differences in various fields of research. The revolutionary benefits of postgenomic “personalized” or “precision” medicine were supposed to focus on individual genetic differences within populations, not group genetic differences across populations. Pharmaceuticals, a powerful engine driving post-HGP research into human genetic differences, were supposed to be tailored to individual genomes. In 2003, Venter opposed the U.S. Food and Drug Administration (FDA) proposal to carry out pharmaceutical testing using the Office of Management and Budget (OMB) racial and ethnic classification system, arguing that these are “social” not “scientific” categories of race and ethnicity and that the promise of pharmacogenetics lies in its implementation as individualized medicine, given the likelihood that drug responses will vary more within racial and ethnic groups than among them (Haga and Venter 2003). However, en route to a “personalized” or “precision” medicine based on individual genetic differences and pharmaceuticals tailored to individual genomes, a detour via research into group genetic differences has been taken. Now that group genetic differences have become of interest to more than just evolutionary biologists and population geneticists, impetus has been given to debates to which philosophers of science have contributed: longstanding debates about whether race is biologically real or socially constructed (Andreasen 2000; Pigliucci and Kaplan 2003; Gannett 2010; Hochman 2013; Spencer 2014) and more recent ones concerning the appropriateness of the use of racial categories in biomedical research (Root 2003; Gannett 2005; Kaplan 2010; Hardimon 2013).

Despite the detour via group genetic differences en route to “personalized” or “precision” medicine, genome-based research into common diseases and drug responses has focused predominantly on Europeans. For GWAS, a 2009 analysis showed that 96 percent of participants were of European descent; by 2016, though the proportion of participants of European descent had decreased to 81 percent, the change was mostly accounted for by a greater number of studies being carried out in Asian countries (Popejoy and Fullerton 2016). Even though African populations are the most genetically diverse in the world, by 2018, only 2 percent of GWAS participants were of African origin (Sirugo et al. 2019). This European bias makes it more difficult to isolate rare genetic variants contributing to disease, to provide accurate and informative genetic test results to non-Europeans, and to ensure that any clinical benefits that arise from genomic research will be equitably distributed within the U.S. and globally. Sociologist Dorothy E. Roberts (2021) calls for genetic researchers to “stop using a white, European standard for human genetics and instead study a fuller range of human genetic variation,” which will “give scientists a richer resource to understand human biology” as well as promote equitable access to the benefits of research. That study of human genetic variation, Roberts argues, should abandon use of race “as a biological variable that can explain differences in health, disease, or responses to therapies,” as it obscures “how structural racism has biological effects and produces health disparities in racialized populations” (p. 566).
Structural racism is advanced as a key determinant of population health: for example, the racial and economic segregation of neighborhoods contributes to health disparities because of differences in quality of housing, exposure to pollutants and toxins, good education and employment opportunities, and access to decent health care (Bailey et al. 2017).

The NHGRI has affirmed its commitment to improve the inclusion of participants from diverse populations in research and begun to appreciate that genomic studies are designed in ways that fail to consider the contribution of social and physical environments to disease (Hindorff et al. 2018). However, as sociologist Steven Epstein (2007) has argued more generally, the “inclusion-and-difference paradigm” that has prevailed in U.S. research over the past couple of decades, though correcting researchers’ previously held default assumption of the middle-aged, white male as the normative standard, serves to amplify the role of biology in health and disease while drawing attention away from society. Genome scientists understand the ramifications attached to their use of racial and ethnic categories in research, and despite longstanding, well-considered ELSI-funded research that urges care be attached to the use of these categories (Sankar and Cho 2002; Sankar et al. 2007), problems are ongoing and attended to by NHGRI leaders concerned about “the misuse of social categories of race and ethnicity as a proxy for genomic variation” (Bonham et al. 2018, p. 1534). The “weaponization” of genomics by White nationalists and the alt-right has raised the stakes for genome scientists. The self-described fascist, White supremacist, racist, and anti-Semite who murdered 10 African Americans at a Buffalo supermarket in 2022 posted writings that cited dozens of scientific studies, including use of a GWAS of educational attainment to support hereditarian views of racial differences in intelligence and use of a principal components analysis (PCA) that resolved human genetic diversity into continental-level clusters to support realism about biological race (Carlson 2022).

White supremacists chug milk to celebrate their origins in European populations that evolved the ability to digest lactose in adulthood (Harmon 2018), and they appeal to traces of Neanderthal DNA in their genomes to celebrate their origins in populations that evolved outside Africa (Wolinsky 2019). While it seems incumbent on genome scientists to confront racist misuses of their research, they face challenges in doing so. Misuse may result from misunderstanding science, but not always: lay experts among White nationalists capably use cutting-edge genomics research to justify hereditarian views, thereby building a counter-knowledge (Doron, in press) or citizen science of sorts (Panofsky and Donovan 2019). In the U.S., White nationalists take DTC ancestry tests to prove their genetic purity, that they are 100% European/White/non-Jewish (Panofsky and Donovan 2019), while in Europe, this assumed homogeneity of Whiteness comes into question with Nordicists and Mediterraneanists using genetic admixture mapping as they vie to prove themselves the most European of Europeans (Doron, in press). When scientists engage with racists, even ones who are scientifically literate, there is the risk of unintentionally helping their cause. “Furthermore,” as Aaron Panofsky et al. (2021) argue, “many of the findings about human evolution and variation are genuinely complex, ambiguous, contested, changing, and involve historically contingent judgments” (p. 396); hence, it may be difficult to claim that research has been misconstrued. As Claude-Olivier Doron notes, insofar as lay experts exploit ambiguities constitutive of scientific discourse in population genetics as they transfer scientific findings to a White supremacist ideological framework, these findings operate within the framework without distortion.

Genome scientists have made recommendations about sampling protocols and standards for visualizations in population genomics to discourage the misappropriation of research by racists who draw conclusions inconsistent with the intentions of scientists (Carlson et al. 2022). However, these recommendations portray population genetic structure, unlike race, as wholly biological, and as science studies scholarship suggests, the challenge of accessing the biological without recourse to the social—and the interests, biases, and imaginaries associated with the social—may be impossible to overcome. Historians, sociologists, and anthropologists of science have insightfully documented how social, political, and cultural constructions of identity are incorporated in, and become defined by, genetics and genomics research in ways specific to their locations: reenactment of continentally-defined races as biogeographical ancestry in the U.S. (Fullwiley 2008; Gannett 2014); influence of population genetics on Irish origin stories and genealogy of the Irish Travellers (Nash 2008; Nash 2017); naturalization and even pathologization of caste and regional differences in India (Egorova 2010); geneticization/genomicization of Mestizo identity in the context of Mexico’s own genome mapping project (López Beltrán 2011); post-apartheid South Africa’s “genomic archive” bound to apartheid’s racialized subjectivities despite advancing nonracial unity through common origins (Schramm 2021); genetic ancestry testing as a basis for Jewishness (El-Haj 2012); Native American DNA as proof of tribal identity (TallBear 2013); and nationalism’s role in interpreting differences between the Korean Reference Genome (KOREF) and HGP’s genome reference sequence as occurring at the population rather than individual level (Kowal and Llamas 2019).

Use of “ancestry” as a category for genetics and genomics research is considered a means of averting problems associated with the use of “race” and “ethnicity”; for example, for the mapping of complex traits, rather than relying on self-identified race, it has been recommended that population structure be assessed empirically by genotyping individuals to determine their “continental ancestry” proportions (Shields et al. 2005). Critics contend, however, that “continental ancestry” belies the continuous pattern with which genetic variation is distributed across the species and reenacts race as it has been traditionally defined, thus contributing to its reification (Fullwiley 2008; Gannett 2014; Lewis et al. 2022). In 2023, the National Academy of Sciences published a Consensus Study Report, “Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field,” in response to a request by the National Institutes of Health to assess the status of use of race, ethnicity, ancestry, and other population descriptors. The report recommends against use of racial labels in genetics and genomics research, a possible exception being studies of health disparities with genomic data, in which race serves as a proxy for environmental variables (e.g., racism). A distinction is drawn between genetic ancestry (paths through which an individual’s DNA is inherited from specific ancestors, known as the “ancestral recombination graph”) and genetic similarity (a quantitative measure of genetic resemblance among individuals that reflects shared genetic ancestry). The report recognizes geographic origins, ethnicity, and genetic ancestry as appropriate categories for reconstructing human evolutionary history, but advocates use of genetic similarity to constitute groups in most other research contexts, including gene discovery for complex traits.

Relying on genetic similarity does not necessarily lead researchers away from race, ethnicity, and ancestry. In an ethnographic study, sociologists Joan H. Fujimura and Ramya Rajagopalan (2011) found that although the statistical machinery associated with GWAS allows researchers to avoid race and ethnic categories by analyzing samples based wholly on genetic similarity, there was “slippage” from genetic similarity to shared ancestry, which, in turn, since mediated by geography and genealogy, became interpreted as racial or ethnic. While ancestral recombination graphs situate individuals in the context of their genealogical relations without assigning them to geographically or culturally defined populations or groups, in practice, these categories are almost always incorporated in genetic ancestry estimation (Lewis et al. 2022). Population geneticist Graham Coop (2022) favors replacing genetic ancestry with genetic similarity, arguing that describing a sampled individual’s genetic similarity to a reference panel (e.g., “X is genetically similar to the GBR 1000 Genome samples”) is preferable to attributing genetic ancestry to that individual (e.g., “X has Northwestern European genetic ancestry”), as it recognizes the conventionalism of the reference panel and continuity of genetic variation (pp. 11–12). However, given that the 1000 Genomes Project’s 26 populations across five continental regions are named using language that reflects “both the ancestral geography or ethnicity of each population and the geographic location where the samples from that population were collected” (Coriell Institute)—e.g., “British from England and Scotland”—the slippage remarked upon by Fujimura and Rajagopalan is encouraged. Nevertheless, as sociologists Aaron Panofsky and Catherine Bliss (2017) observe, “Geneticists face a complex set of pressures regarding population labeling” (p. 75).
Ambiguous labels for populations that conflate geography, race, and ethnicity may offer geneticists the flexibility to fulfill their own research goals while accessing repositories of preclassified biospecimens and data, collaborating with researchers pursuing quite different agendas, and maintaining goodwill with populations studied. Such pressures compete with labeling requirements imposed by funders and journals, which may provoke skepticism and resistance.
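Coop's (2022) contrast between attributing ancestry and reporting similarity can be illustrated with a deliberately simple measure (hypothetical genotypes coded as 0/1/2 alternate-allele counts; real analyses use kinship matrices or principal components computed over millions of SNPs):

```python
import math

def genetic_similarity(genotype, panel):
    """Toy similarity: inverse distance between an individual's genotype
    vector and the mean genotype of a reference panel (higher is closer)."""
    means = [sum(col) / len(col) for col in zip(*panel)]
    distance = math.sqrt(sum((g - m) ** 2 for g, m in zip(genotype, means)))
    return 1.0 / (1.0 + distance)

# Two hypothetical reference panels, each holding three genotyped individuals.
panel_a = [[0, 1, 2, 0], [0, 2, 2, 1], [0, 1, 1, 0]]
panel_b = [[2, 0, 0, 2], [1, 0, 0, 2], [2, 1, 0, 1]]
individual = [0, 1, 2, 0]

# Reporting "the individual is genetically similar to panel A" states only
# which panel's mean genotype lies nearer; it makes no claim about ancestry.
print(genetic_similarity(individual, panel_a) > genetic_similarity(individual, panel_b))  # True
```

Any such statement depends entirely on which reference panels happen to be available, which is the conventionalism of the panel that Coop emphasizes.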

Although ELSI may have had an arguably inauspicious beginning in Watson’s apparent off-the-cuff remarks at a 1988 news conference, the research program has outlasted the HGP itself. ELSI funding is mandated through the National Institutes of Health Revitalization Act of 1993, which calls for a minimum of 5 percent of the NIH budget for the HGP—the monies directed to the NCHGR-NHGRI—to be set aside to study the ethical, legal, and social implications of the science of genomics (McEwen et al. 2014). The Division of Genomics and Society at the NIH’s NHGRI, created in 2012, maintains the Ethical, Legal and Social Implications (ELSI) Research Program as an extramural grant funding initiative to the present day; the division also includes an intramural bioethics program. Input concerning the ELSI program is provided by the NHGRI Genomics and Society Working Group through the National Advisory Council for Human Genome Research.

A review article by NIH staff (McEwen et al. 2014) characterizes the ELSI program as “an ongoing experiment.” Since it was established in 1990, the ELSI program has supported empirical and conceptual research carried out by researchers from a broad range of disciplines: “genetics and genomics, clinical medicine, bioethics, the social sciences (e.g., psychology, sociology, anthropology, political science, and communication science), history, philosophy, literature, law, economics, health services, and public policy” (p. 485). This research is considered to have had impacts on genomics studies (e.g., requirements for informed consent, protection of the privacy of subjects, and nomenclature for socially defined groups), genomic medicine (e.g., personal impacts of acquiring genetic information from screening and testing carried out in clinical, research, and direct-to-consumer settings), and wider society (e.g., federal legislation prohibiting genetic discrimination in health insurance and employment, increased awareness about DNA forensics, and policies on gene patenting). The experimental aspect remarked upon refers to the organizational and physical situation of ELSI, a program charged with critically evaluating the implications of genomics research, within the very agency that funds that research. While this institutional arrangement supports the growing trend to integrate ELSI research with genomics research and policy formulation, ensuring that ELSI research is scientifically informed and practically relevant, excessive proximity also risks compromising “the autonomy, objectivity, and intellectual independence of ELSI investigators” (McEwen et al. 2014).

Since the early years of the HGP, bioethicists have criticized ELSI on various institutional grounds: for a lack of independence from scientist-overseers (Murray 1992; Yesley 2008), for an absence of structure conducive to providing guidance regarding policy (Hanna 1995), and for a negative impact on bioethics in narrowing the range of topics covered and creating an isolated subspecialty within the field (Annas and Elias 1992; Hanna et al. 1993). And from history and philosophy of science quarters, broader philosophical concerns have been raised about ELSI’s focus on the ethical, legal, and social implications of genetic research. Such a focus promotes a “downstream” rather than “upstream” framework for understanding the relationship between science and ethics that fails to appreciate that foundational concepts in genetic research such as normality and mutation are themselves evaluative and operate as directives to action (Limoges 1994). ELSI’s European counterpart, Ethical, Legal and Social Aspects or ELSA, chose to use the term “aspects” in order to avoid the connotations of narrowness, linearity, and determinism attached to the term “implications” (Hilgartner et al. 2016).

The “ongoing experiment” at the NIH’s NHGRI has come to define a particular model for doing research in bioethics that is being exported to other rapidly developing scientific fields, as expressed in the title of a target article published in AJOB Neuroscience, “To ELSI or Not to ELSI Neuroscience: Lessons for Neuroethics from the Human Genome Project” (Klein 2010). Indeed, although “ELSI” entered the lexicon as an acronym for the specific extramural research program supported by US government funding set aside for the HGP (elsewhere in the world, similar programs received their own appellations and acronyms—e.g., Genomics and its Ethical, Environmental, Economic, Legal, and Social Aspects or GE³LS in Canada), with that program offering a possible model for other emerging sciences, the term has come to receive a broader meaning that refers instead to a field of research, defined by its “research and scholarship content, rather than a particular set of funding sources” (Morrissey and Walker 2012, p. 52). Given the interest in exporting the ELSI model, its merits as a field of research, as bioethicists Clair Morrissey and Rebecca L. Walker argue (Morrissey and Walker 2012; Walker and Morrissey 2014), need to be examined. Investigating the content and methods of ELSI as a field of research, Morrissey and Walker combed through hundreds of articles and book chapters published between 2003 and 2008 (Morrissey and Walker 2012; Walker and Morrissey 2014). They found that funding sources influenced what research is carried out: though only 17 percent of all publications involved empirical research, for publications whose authors received US government funding, 30 percent of those with non-NHGRI support and 52 percent of those with NHGRI support were empirically based.
They found that institutional and professional forces, irrespective of funding sources, promoted the coverage of topics of greatest interest to affluent populations (e.g., “genomics and clinical practice,” “intellectual property,” “genetic enhancement,” and “biorepositories”). They found that the vast majority (89 percent) of publications were prescriptive, recommending to diverse actors (scientists, clinicians, bioethicists, government, etc.) that certain policies or practices be pursued. Given this overwhelmingly prescriptive posture, they were dismayed to find that publications made use of multiple bioethical methods in piecemeal fashion with little depth. For the most part (77 percent), publications did not reflect on methods.

ELSI itself has become an object of research in science and technology studies (STS) scholarship. The ELSI model, as incorporated more recently in areas such as nano and synthetic biology, is understood as serving as “a new governance tool built on the prior institutionalization of ‘bioethics’ as a way to manage problems of moral ambiguity and disagreement in biomedicine” (Hilgartner et al. 2016, p. 824). At the outset of the HGP, ELSI relied on a governance model in which ethicists and social scientists lent their expertise by producing a body of scholarship that could inform public policy; subsequently, especially in Europe, social scientists were expected to facilitate mechanisms for making public policy in more democratic ways by engaging stakeholders and the broader public. Hilgartner et al. (2016) argue that STS scholarship sits somewhat uneasily alongside ELSI scholarship inasmuch as STS scholarship problematizes elements of the “traditional imaginaries of orderly science-society relations” to which ELSI subscribes, such as the fact/value distinction, the “neutrality” of science and technology, and “the self-evidence of power relations” (p. 832). Criticisms are also made that as a tool of governance, bioethics exercises power in ways often unseen, thereby foreclosing questions asked and debates had—for example, taking the boundary between facts and values as given not made, presenting rational moral arguments as outside politics to dismiss issues of public concern, or circumventing legislation by justifying the extension of existing regulations.

  • Alkuraya, Fowzan S., 2021, “A Genetic Revolution in Rare-Disease Medicine,” Nature 590 (11 Feb): 218–219.
  • Andreasen, Robin O., 2000, “Race: Biological Reality or Social Construct?” Philosophy of Science , 67: S653–S666.
  • Annas, G. J., and S. Elias, 1992, “Social Policy Research Priorities for the Human Genome Project,” in Gene Mapping: Using Law and Ethics as Guides , edited by G. J. Annas and S. Elias, 269–275, New York: Oxford University Press.
  • Anonymous, 2000, “Human Genome Projects: Work in Progress,” Nature , 405 (29 June): 981.
  • Anonymous, 2003, “International Consortium Completes Human Genome Project,” Genomics & Genetics Weekly (9 May): 32.
  • Arbour, Laura, and Doris Cook, 2006, “DNA on Loan: Issues to Consider when Carrying Out Genetic Research with Aboriginal Families and Communities,” Community Genetics 9: 153–160.
  • Bailey, Zinzi D., Nancy Krieger, Madina Agénor, Jasmine Graves, Natalia Linos, Mary T. Bassett, 2017, “Structural Racism and Health Inequities in the USA: Evidence and Interventions,” The Lancet , 389: 1453–63.
  • Ballouz, Sara, Alexander Dobin, and Jesse A. Gillis, 2019, “Is It Time To Change the Reference Genome?” Genome Biology , 20: 159 [9pp]. doi:10.1186/s13059-019-1774-4
  • Baylis, Françoise, 2019, Altered Inheritance: CRISPR and the Ethics of Human Genome Editing , Harvard University Press.
  • Beckwith, Jon, and Joseph S. Alper, 1998, “Reconsidering Genetic Antidiscrimination Legislation,” Journal of Law, Medicine & Ethics , 26: 205–210.
  • Birney, Ewan, and Nicole Soranzo, 2015, “The End of the Start for Population Sequencing,” Nature , 526 (30 Sep): 52–53.
  • Blattner, Frederick R. et al., 1997, “The Complete Genome Sequence of Escherichia coli K-12,” Science , 277 (5 Sep): 1453–1462.
  • Bliss, Catherine, 2018, Social by Nature: The Promise and Peril of Sociogenomics , Stanford University Press.
  • Bodmer, Walter, and Robin McKie, 1994, The Book of Man: The Quest to Discover Our Genetic Heritage , Toronto: Viking Press.
  • Bonham, Vence L., Eric D. Green, and Eliseo J. Pérez-Stable, 2018, “Examining How Race, Ethnicity, and Ancestry Data Are Used in Biomedical Research,” JAMA , 320: 1533–1534. doi:10.1001/jama.2018.13609
  • Bostrom, Nick, 2003, “Human Genetic Enhancements: A Transhumanist Perspective,” Journal of Value Inquiry , 37: 493–506.
  • Buniello, Annalisa, Jacqueline A.L. MacArthur, Maria Cerezo, Laura W. Harris, James Hayhurst, Cinzia Malangone, Aoife McMahon, Joannella Morales, Edward Mountjoy, Elliot Sollis, Daniel Suveges, Olga Vrousgou, Patricia L. Whetzel, Ridwan Amode, Jose A. Guillen, Harpreet S. Riat, Stephen J. Trevanion, Peggy Hall, Heather Junkins, Paul Flicek, Tony Burdett, Lucia A. Hindorff, Fiona Cunningham, and Helen Parkinson, 2019, “The NHGRI-EBI GWAS Catalog of Published Genome-Wide Association Studies, Targeted Arrays and Summary Statistics 2019,” Nucleic Acids Research 47: D1005–D1012. doi:10.1093/nar/gky1120
  • Callahan, Daniel, 1998, “Cloning: Then and Now,” Cambridge Quarterly of Healthcare Ethics 7: 141–144.
  • Carlson, Jedidiah, 2022, “Spread This Like Wildfire!” Science for the People , [ available online ]
  • Carlson, Jedidiah, Brenna M. Henn, Dana R. Al-Hindi, and Sohini Ramachandran, 2022, “Counter the Weaponization of Genetics Research by Extremists,” Nature 610 (20 Oct): 444–447.
  • Caulfield, Timothy, 2018, “Spinning the Genome: Why Science Hype Matters,” Perspectives in Biology and Medicine , 61: 560–571. doi:10.1353/pbm.2018.0065
  • Chaisson, M.J.P., R.K. Wilson, and E.E. Eichler, 2015, “Genetic Variation and the De Novo Assembly of Human Genomes,” Nature Reviews Genetics , 16: 627–640.
  • Check Hayden, Erika, 2014, “Technology: The $1,000 Genome,” Nature , 507 (19 Mar): 294–295. doi:10.1038/507294a
  • Church, G.M., 2005, “The Personal Genome Project,” Molecular Systems Biology , 1: article no. 0030. doi:10.1038/msb4100040
  • Clinton, Bill, Tony Blair, Francis S. Collins, and J. Craig Venter, 2000, “White House Remarks on Decoding of Genome,” transcript of June 26, 2000 news conference, New York Times , June 27, 2000. [ Clinton, et al. 2000 available online ]
  • Collins, Francis S., 1999, “Medical and Societal Consequences of the Human Genome Project,” New England Journal of Medicine , 341: 28–37.
  • Collins, Francis and David Galas, 1993, “A New Five-Year Plan for the U.S. Human Genome Project,” Science , 262 (1 Oct): 43–46.
  • Collins, Francis S., Ari Patrinos, Elke Jordan, Aravinda Chakravarti, Raymond Gesteland, LeRoy Walters, and the members of the DOE and NIH planning groups, 1998, “New Goals for the U.S. Human Genome Project: 1998–2003,” Science , 282 (23 Oct): 682–689.
  • Collins, Francis S., Eric D. Green, Alan E. Guttmacher, and Mark S. Guyer, 2003, “A Vision for the Future of Genomics Research: A Blueprint for the Genomic Era,” Nature , 422 (24 April): 1–13.
  • Cook-Deegan, Robert, 1994, The Gene Wars: Science, Politics, and the Human Genome , New York: W. W. Norton.
  • Coop, Graham, 2022, “Genetic Similarity versus Genetic Ancestry Groups as Sample Descriptors in Human Genetics,” [ available online ]
  • Cooper, Necia Grant, 1994, The Human Genome Project: Deciphering the Blueprint of Heredity , Mill Valley, CA: University Science Books.
  • Coriell Institute for Medical Research, n.d., “Guidelines for Referring to Populations,” [ available online ]
  • Cranor, Carl F. (ed.), 1994, Are Genes Us? : The Social Consequences of the New Genetics , New Brunswick, NJ: Rutgers University Press.
  • Davis, Bernard D. and Colleagues, 1990, “The Human Genome and Other Initiatives,” Science 249 (27 July): 342–343.
  • Deloukas, P., et al., 1998, “A Physical Map of 30,000 Human Genes,” Science , 282 (23 Oct): 744–746.
  • Department of Trade and Industry (U.K.), 2003, “Heads of Government Congratulate Scientists on Completion of Human Genome Project,” Hermes Database (12 April); LexisNexis Academic.
  • Dib, Colette et al., 1996, “A Comprehensive Genetic Map of the Human Genome Based on 5,264 Microsatellites,” Nature , 380 (14 Mar): 152–154.
  • Dickson, David, 1998, “British Funding Boost is Wellcome News,” Nature , 393 (21 May): 201.
  • Dizikes, Peter, 2007, “Gene Information Opens New Frontier in Privacy Debate,” Boston Globe , 24 September 2007.
  • Doron, Claude-Olivier, in press, “Who is the Most European of Us All? Occidentalism, White Supremacy and the Counter-Knowledge on Race and Genetics,” in Ordering People, Naming Populations: Critical Perspective on Biological Diversity and the Classification Concepts in the Life Sciences , edited by N. Ellebrecht, T. Plümeke, V. Lipphardt, J. Reardon.
  • Doudna, Jennifer A., and Samuel H. Sternberg, 2017, A Crack in Creation: Gene Editing and the Unthinkable Power to Control Evolution , Boston: Houghton Mifflin Harcourt.
  • Dulbecco, Renato, 1986, “A Turning Point in Cancer Research: Sequencing the Human Genome,” Science , 231 (7 Mar): 1055–1056.
  • Dunham, I. et al., 1999, “The DNA Sequence of Human Chromosome 22,” Nature , 402 (2 Dec): 489–495.
  • Duster, Troy, 2015, “A Post-genomic Surprise: The Molecular Reinscription of Race in Science, Law and Medicine,” The British Journal of Sociology , 66: 1–27. doi:10.1111/1468-4446.12118
  • Egorova, Yulia, 2010, “Castes of Genes? Representing Human Genetic Diversity in India,” Genomics, Society and Policy , 6(3): 32–49.
  • El-Haj, Nadia Abu, 2012, The Genealogical Science: The Search for Jewish Origins and the Politics of Epistemology , University of Chicago Press.
  • Epstein, Steven, 2007, Inclusion: The Politics of Difference in Medical Research , University of Chicago Press.
  • Erlich, Yaniv, Tal Shor, Itsik Pe’er, and Shai Carmi, 2018, “Identity Inference of Genomic Data Using Long-range Familial Searches,” Science , 362 (9 Nov): 690–694.
  • Evans, James P., 2010, “The Human Genome Project at 10 Years: A Teachable Moment,” Genetics in Medicine , 12: 477. doi:10.1097/GIM.0b013e3181ef16b6
  • Ferryman, Kadija, and Mikaela Pitcan, 2018, “What Is Precision Medicine,” Data & Society , Report, February 26, 2018. [ Ferryman & Pitcan 2018 available online ]
  • Fleischmann, Robert D. et al., 1995, “Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd,” Science , 269 (28 Jul): 496–512.
  • Fox, Keolu, 2020, “The Illusion of Inclusion—The ‘All of Us’ Research Program and Indigenous Peoples’ DNA,” New England Journal of Medicine , 383: 411–414. doi:10.1056/NEJMp1915987
  • Fujimura, Joan H., and Ramya Rajagopalan, 2011, “Different Differences: The Use of ‘Genetic Ancestry’ versus Race in Biomedical Human Genetic Research,” Social Studies of Science 41: 5–30. doi:10.1177/0306312710379170
  • Fukuyama, Francis, 2002, Our Posthuman Future: Consequences of the Biotechnology Revolution , New York: Picador.
  • Fullwiley, Duana, 2008, “The Biologistical Construction of Race: ‘Admixture’ Technology and the New Genetic Medicine,” Social Studies of Science , 38: 695–735. doi:10.1177/0306312708090796.
  • Gannett, Lisa, 1998, “Genetic Variation: Difference, Deviation, or Deviance?” Ph.D. Dissertation, University of Western Ontario. [ Gannett 1998 available online ]
  • –––, 1999, “What’s in a Cause? The Pragmatic Dimensions of Genetic Explanations,” Biology and Philosophy , 14: 349–374.
  • –––, 2003, “The Normal Genome in Twentieth-Century Evolutionary Thought,” Studies in History and Philosophy of Biological and Biomedical Sciences 34: 143–185.
  • –––, 2005, “Group Categories in Pharmacogenetics Research,” Philosophy of Science 72: 1232–1247.
  • –––, 2010, “Questions Asked and Unasked: How by Worrying Less about the ‘Really Real’ Philosophers of Science Might Better Contribute to Debates about Genetics and Race,” Synthese , 177: 363–385.
  • –––, 2014, “Biogeographical Ancestry and Race,” Studies in History and Philosophy of Biological and Biomedical Sciences , 47: 173–184. doi:10.1016/j.shpsc.2014.05.017
  • Garrison, Nanibaa’ A., Kyle B. Brothers, Aaron J. Goldenberg, and John A. Lynch, 2019a, “Genomic Contextualism: Shifting the Rhetoric of Genetic Exceptionalism,” The American Journal of Bioethics , 19: 51–63. doi:10.1080/15265161.2018.1544304
  • Garrison, Nanibaa’ A., Māui Hudson, Leah L. Ballantyne, Ibrahim Garba, Andrew Martinez, Maile Taualii, Laura Arbour, Nadine R. Caron, and Stephanie Carroll Rainie, 2019b, “Genomic Research Through an Indigenous Lens: Understanding the Expectations,” Annual Review of Genomics and Human Genetics 20: 495–517. doi:10.1146/annurev-genom-083118-015434
  • Gates, Alexander J., Deisy Morselli Gysi, Manolis Kellis, and Albert-László Barabási, 2021, “A Wealth of Discovery Built on the Human Genome Project — By the Numbers,” Nature , 590 (11 Feb): 212–215.
  • Gibbs, Richard A., 2020, “The Human Genome Project Changed Everything,” Nature Reviews Genetics , 21: 575–576. doi: 10.1038/s41576-020-0275-3
  • Gilbert, Walter, 1992, “A Vision of the Grail,” in The Code of Codes: Scientific and Social Issues in the Human Genome Project , edited by Daniel J. Kevles and Leroy Hood, 83–97, Cambridge, MA and London: Harvard University Press.
  • Gisler, Monika, Didier Sornette, and Ryan Woodard, 2010, “Exuberant Innovation: The Human Genome Project,” Swiss Finance Institute Research Paper No. 10–12. doi:10.2139/ssrn.1573682
  • Godfrey-Smith, Peter, 2000, “On the Theoretical Role of ‘Genetic Coding,’” Philosophy of Science , 67: 26–44.
  • Goffeau, A. et al., 1996, “Life with 6000 Genes,” Science , 274 (25 Oct): 546–567.
  • Gostin, Larry, 1991, “Genetic Discrimination: The Use of Genetically Based Diagnostic and Prognostic Tests by Employers and Insurers,” American Journal of Law and Medicine , 17(1–2): 109–144.
  • Green, Eric D., Mark S. Guyer, and National Human Genome Research Institute, 2011, “Charting a Course for Genomic Medicine from Base Pairs to Bedside,” Nature , 470 (10 Feb): 204–213. doi:10.1038/nature09764
  • Green, Philip, 1997, “Against a Whole-Genome Shotgun,” Genome Research , 7: 410–417.
  • –––, 2002, “Whole-Genome Disassembly,” Proceedings of the National Academy of Sciences , 99: 4143–4144.
  • Green, Richard E., et al., 2010, “A Draft Sequence of the Neandertal Genome,” Science , 328 (7 May): 710–722. doi: 10.1126/science.1188021
  • Green, Robert C., Denise Lautenbach, and Amy L. McGuire, 2015, “GINA, Genetic Discrimination, and Genomic Medicine,” The New England Journal of Medicine 372: 397–399. doi:10.1056/NEJMp1404776
  • Griesemer, James R., 1994, “Tools for Talking: Human Nature, Weismannism, and the Interpretation of Genetic Information,” in Cranor (ed.) 1994, 69–88.
  • Griffiths, Paul E., 2001, “Genetic Information: A Metaphor in Search of a Theory,” Philosophy of Science , 68: 394–412.
  • Griffiths, P.E. and R.D. Gray, 1994, “Developmental Systems and Evolutionary Explanation,” Journal of Philosophy , 91: 277–304.
  • Griffiths, Paul E. and Robin D. Knight, 1998, “What Is the Developmentalist Challenge?” Philosophy of Science , 65: 253–258.
  • Gyapay, Gabor et al., 1994, “The 1993–94 Généthon Human Genetic Linkage Map,” Nature Genetics , 7: 246–339.
  • Gymrek, Melissa, Amy L. McGuire, David Golan, Eran Halperin, and Yaniv Erlich, 2013, “Identifying Personal Genomes by Surname Inference,” Science , 339 (18 Jan): 321–324. doi:10.1126/science.1229566
  • Habermas, Jürgen, 2003, The Future of Human Nature , Polity Press.
  • Haga, Susanne B., and J. Craig Venter, 2003, “FDA Races in Wrong Direction,” Science , 301 (25 Jul): 466. doi:10.1126/science.1087004
  • Hanna, K. E., 1995, “The Ethical, Legal, and Social Implications Program of the National Center for Human Genome Research: A Missed Opportunity?” in Society’s Choices: Social and Ethical Decision Making in Biomedicine , edited by R. E. Bulger, E. M. Bobby, and H. V. Fineberg, 432–457, Washington, DC: National Academy Press.
  • Hanna, K. E., R. M. Cook-Deegan, and R. Y. Nishimi, 1993, “Finding a Forum for Bioethics in U.S. Public Policy,” Politics and the Life Sciences: The Journal of the Association for Politics and the Life Sciences , 12: 205–219.
  • Hardimon, Michael O., 2013, “Race Concepts in Medicine,” Journal of Medicine and Philosophy , 38: 6–31.
  • Harmon, Amy, 2018, “Why White Supremacists Are Chugging Milk (and Why Geneticists Are Alarmed),” New York Times , 17 October 2018. [ available online ]
  • Hattori, M., et al., 2000, “The DNA Sequence of Human Chromosome 21,” Nature , 405 (18 May): 311–319.
  • Hilgartner Stephen, Barbara Prainsack, and J. Benjamin Hurlbut, 2016, “Ethics as Governance in Genomics and Beyond,” in Handbook of Science and Technology Studies , edited by Ulrike Felt, Rayvon Fouche, Clark A. Miller, and Laurel Smith-Doerr, Cambridge, MA: MIT Press.
  • Hindorff, Lucia A., Vence L. Bonham, Jr, Lawrence C. Brody, Margaret E. C. Ginoza, Carolyn M. Hutter, Teri A. Manolio, and Eric D. Green, 2018, “Prioritizing Diversity in Human Genomics Research,” Nature Reviews Genetics 19: 175–185. doi:10.1038/nrg.2017.89
  • Hochman, Adam, 2013, “Against the New Racial Naturalism,” The Journal of Philosophy , 110: 331–351.
  • Hood, Leroy, 1992, “Biology and Medicine in the Twenty-First Century,” in The Code of Codes , 136–163.
  • –––, 2003, “Systems Biology: Integrating Technology, Biology, and Computation,” Mechanisms of Ageing and Development , 124: 9–16. doi:10.1016/S0047-6374(02)00164-1.
  • Hood, Leroy, and Lee Rowen, 2013, “The Human Genome Project: Big Science Transforms Biology and Medicine,” Genome Medicine , 5: 79 (8pp). doi:10.1186/gm483
  • Hubbard, Ruth, 1994, “Constructs of Genetic Difference: Race and Sex,” in Genes and Human Self-Knowledge: Historical and Philosophical Reflections on Modern Genetics , edited by Robert F. Weir, Susan C. Lawrence, and Evan Fales, Iowa City, IA: University of Iowa Press.
  • Hubbard, Ruth and Elijah Wald, 1993, Exploding the Gene Myth: How Genetic Information is Produced and Manipulated by Scientists, Physicians, Employers, Insurance Companies, Educators, and Law Enforcers , Boston: Beacon Press.
  • Hudson, Māui, Nanibaa’ A. Garrison, Rogena Sterling, Nadine R. Caron, Keolu Fox, Joseph Yracheta, Jane Anderson, Phil Wilcox, Laura Arbour, Alex Brown, Maile Taualii, Tahu Kukutai, Rodney Haring, Ben Te Aika, Gareth S. Baynam, Peter K. Dearden, David Chagné, Ripan S. Malhi, Ibrahim Garba, Nicki Tiffin, Deborah Bolnick, Matthew Stott, Anna K. Rolleston, Leah L. Ballantyne, Ray Lovett, Dominique David-Chavez, Andrew Martinez, Andrew Sporle, Maggie Walter, Jeff Reading, and Stephanie Russo Carroll, 2020, “Rights, Interests and Expectations: Indigenous Perspectives on Unrestricted Access to Genomic Data,” Nature Reviews Genetics 21: 377–384.
  • Hudson, Thomas J., et al., 1995, “An STS-Based Map of the Human Genome,” Science (22 Dec): 1945–1954.
  • International Human Genome Sequencing Consortium, 2001, “Initial Sequencing and Analysis of the Human Genome,” Nature , 409 (15 Feb): 860–921.
  • Jabloner, Anna, 2019, “A Tale of Two Molecular Californias,” Science as Culture , 28: 1–24. doi: 10.1080/09505431.2018.1524863
  • Jones, Kathryn Maxson, Rachel A. Ankeny, and Robert Cook-Deegan, 2018, “The Bermuda Triangle: The Pragmatics, Policies, and Principles for Data Sharing in the History of the Human Genome Project,” Journal of the History of Biology , 51: 693–805.
  • Joyner, Michael J., and Nigel Paneth, 2019, “Promises, Promises, and Precision Medicine,” The Journal of Clinical Investigation , 129: 946–948. doi:10.1172/JCI126119
  • Juengst, Eric, Michelle L. McGowan, Jennifer R. Fishman, and Richard A. Settersten, Jr., 2016, “From ‘Personalized’ to ‘Precision’ Medicine: The Ethical and Social Implications of Rhetorical Reform in Genomic Medicine,” Hastings Center Report , 46: 21–33. doi:10.1002/hast.614.
  • Kalia, Sarah S., Kathy Adelman, Sherri J. Bale, Wendy K. Chung, Christine Eng, James P. Evans, Gail E. Herman, Sophia B. Hufnagel, Teri E. Klein, Bruce R. Korf, Kent D. McKelvey, Kelly E. Ormond, C. Sue Richards, Christopher N. Vlangos, Michael Watson, Christa L. Martin, and David T. Miller, 2017, “Recommendations for Reporting of Secondary Findings in Clinical Exome and Genome Sequencing, 2016 Update (ACMG SF v2.0): A Policy Statement of the American College of Medical Genetics and Genomics,” Genetics in Medicine , 19: 249–255. doi:10.1038/gim.2016.190
  • Kaplan, Jonathan M., 2010, “When Socially Determined Categories Make Biological Realities: Understanding Black/White Health Disparities in the U.S.,” Monist , 93: 281–297.
  • Kass, Nancy E., 1997, “The Implications of Genetic Testing for Health and Life Insurance,” in Genetic Secrets: Protecting Privacy and Confidentiality in the Genetic Era , edited by Mark A. Rothstein, 299–316, New Haven and London: Yale University Press.
  • Kaye, Alice M., and Wyeth W. Wasserman, 2021, “The Genome Atlas: Navigating a New Era of Reference Genomes,” Trends in Genetics , in press. doi:10.1016/j.tig.2020.12.002
  • Keller, Evelyn Fox, 1992, “Nature, Nurture, and the Human Genome Project,” in Code of Codes , 281–299.
  • –––, 1994, “Master Molecules,” in Cranor (ed.) 1994, 89–98.
  • –––, 2000, The Century of the Gene , Cambridge, MA and London: Harvard University Press.
  • Kevles, Daniel J., 1992, “Out of Eugenics: The Historical Politics of the Human Genome,” in Code of Codes , 3–36.
  • Khan, Razib, and David Mittelman, 2018, “Consumer Genomics Will Change Your Life, Whether You Get Tested or Not,” Genome Biology , 19: 120–123. doi:10.1186/s13059-018-1506-1
  • Kitcher, Philip, 1996, The Lives to Come: The Genetic Revolution and Human Possibilities , New York: Simon & Schuster.
  • Klein, Eran, 2010, “To ELSI or Not to ELSI Neuroscience: Lessons for Neuroethics from the Human Genome Project,” AJOB Neuroscience , 1 (4): 3–8. doi:10.1080/21507740.2010.510821
  • Koshland, Daniel E. Jr., 1989, “Sequences and Consequences of the Human Genome,” Science , 246 (13 Oct): 189.
  • Kowal, Emma, and Bastien Llamas, 2019, “Race in a Genome: Long Read Sequencing, Ethnicity-Specific Reference Genomes and the Shifting Horizon of Race,” Journal of Anthropological Sciences , 97: 91–106. doi: 10.4436/jass.97004
  • Kronfeldner, Maria E., 2009, “Genetic Determinism and the Innate-Acquired Distinction in Medicine,” Medicine Studies , 1: 167–181. doi:10.1007/s12376-009-0014-8
  • Kukutai, Tahu, and John Taylor, 2016, “Data Sovereignty for Indigenous Peoples: Current Practice and Future Needs,” in Indigenous Data Sovereignty: Toward an Agenda , edited by Tahu Kukutai and John Taylor, 1–22, Australia National University Press. [ Kukutai and Taylor 2016 available online ]
  • Lappé, Marc A., 1994, “Justice and the Limitations of Genetic Knowledge,” in Justice and the Human Genome Project , 153–168.
  • Lee, Carol, 1993, “Creating A Genetic Underclass: The Potential for Genetic Discrimination by the Health Insurance Industry,” Pace Law Review , 13: 189–228.
  • Lee, Thomas F., 1991, The Human Genome Project: Cracking the Genetic Code of Life , New York: Plenum Press.
  • Levy, Samuel et al., 2007, “The Diploid Genome Sequence of an Individual Human,” PLoS Biology , 5: 2113–2144.
  • Lewis, Anna C. F., Santiago J. Molina, Paul S. Appelbaum, Bege Dauda, Anna Di Rienzo, Agustin Fuentes, Stephanie M. Fullerton, Nanibaa’ A. Garrison, Nayanika Ghosh, Evelynn M. Hammonds, David S. Jones, Eimear E. Kenny, Peter Kraft, Sandra S.-J. Lee, Madelyn Mauro, John Novembre, Aaron Panofsky, Mashaal Sohail, Benjamin M. Neale, and Danielle S. Allen, 2022, “Getting Genetic Ancestry Right for Science and Society,” Science 376 (15 Apr): 250–252.
  • Lewontin, R. C., 1974, “The Analysis of Variance and the Analysis of Causes,” American Journal of Human Genetics , 26: 400–411.
  • –––, 2000, It Ain’t Necessarily So: The Dream of the Human Genome and Other Illusions , New York: New York Review of Books; chapter 5, “The Dream of the Human Genome,” was originally published on May 28, 1992 in The New York Review of Books .
  • Limoges, Camille, 1994, “ Errare Humanum Est : Do Genetic Errors Have a Future?” in Cranor (ed.) 1994, 113–124.
  • Lippman, Abby, 1991, “Prenatal Genetic Testing and Screening: Constructing Needs and Reinforcing Inequities,” American Journal of Law and Medicine , 42: 15–50.
  • Lloyd, Elisabeth A., 1994, “Normality and Variation: The Human Genome Project and the Ideal Human Type,” in Cranor (ed.) 1994, 99–112.
  • López Beltrán, Carlos (ed.), 2011, Genes (&) Mestizos: Genómica y Raza en la Biomedicina Mexicana , Mexico: Ficticia.
  • Lunshof, Jeantine, Ruth Chadwick, Daniel B. Vorhaus, and George M. Church, 2008, “From Genetic Privacy to Open Consent,” Nature Reviews Genetics , 9: 406–411. doi: 10.1038/nrg2360
  • Maher, Brendan, 2008, “The Case of the Missing Heritability,” Nature , 456 (6 Nov): 18–21.
  • Mao, Yafei, Claudia R. Catacchio, LaDeana W. Hillier, et al., 2021, “A High-Quality Bonobo Genome Refines the Analysis of Hominid Evolution,” Nature , 594: 77–81. doi:10.1038/s41586-021-03519-x
  • Marshall, Eliot, 1996a, “Whose Genome Is It, Anyway?” Science , 273 (27 Sept): 1788–1789.
  • –––, 1996b, “The Genome Project’s Conscience,” Science , 274 (25 Oct): 488–490.
  • McEwen, Jean E., Joy T. Boyer, Kathie Y. Sun, Karen R. Rothenberg, Nicole C. Lockhart, and Mark S. Guyer, 2014, “The Ethical, Legal, and Social Implications Program of the National Human Genome Research Institute: Reflections on an Ongoing Experiment,” Annual Review of Genomics and Human Genetics 15: 481–505. doi: 10.1146/annurev-genom-090413-025327
  • McKie, Robin, 2002, “I’m the Human Genome, says ‘Darth Venter’ of Genetics,” Observer , 28 April 2002.
  • McKusick, Victor A., 1989, “Mapping and Sequencing the Human Genome,” The New England Journal of Medicine , 320 (6 Apr): 910–915.
  • Meyer, Roberta A., 2004, “The Insurer Perspective,” in Genetics and Life Insurance: Medical Underwriting and Social Policy , edited by Mark A. Rothstein, 27–47, Cambridge and London: MIT Press.
  • Miga, Karen H., 2021, “Bridging the Gaps,” Nature , 590 (11 Feb): 217–218.
  • Miga, Karen H., and Ting Wang, 2021, “The Need for a Human Pangenome Reference Sequence,” Annual Review of Genomics and Human Genetics , 22: 11.1–11.22. doi: 10.1146/annurev-genom-120120-081921
  • Mills, Melinda C., and Charles Rahal, 2019, “A Scientometric Review of Genome-Wide Association Studies,” Communications Biology 2: 9. doi:10.1038/s42003-018-0261-x
  • Moreau, Yves, 2019, “Crack Down on Genomic Surveillance,” Nature , 576 (5 Dec): 36–38.
  • Morrissey, Clair, and Rebecca L. Walker, 2012, “Funding and Forums for ELSI Research: Who (or What) Is Setting the Agenda?” AJOB Primary Research , 3(3): 51–60. doi: 10.1080/21507716.2012.678550
  • Murphy, Timothy F., 1994, “The Genome Project and the Meaning of Difference,” in Justice and the Human Genome Project , 1–13.
  • Murray, T. H., 1992, “Speaking Unsmooth Things about the Human Genome Project,” in Gene Mapping: Using Law and Ethics as Guides , edited by G. J. Annas and S. Elias, 246–254, New York: Oxford University Press.
  • –––, 1997, “Genetic Exceptionalism and ‘Future Diaries’: Is Genetic Information Different from Other Medical Information?” in Genetic Secrets , 60–73.
  • Myers, Eugene W., Granger G. Sutton, Hamilton O. Smith, Mark D. Adams, and J. Craig Venter, 2002, “On the Sequencing and Assembly of the Human Genome,” Proceedings of the National Academy of Sciences , 99: 4145–4146.
  • Nash, Catherine, 2008, Of Irish Descent: Origin Stories, Genealogy, and the Politics of Belonging , Syracuse University Press.
  • –––, 2017, “The Politics of Genealogical Incorporation: Ethnic Difference, Genetic Relatedness and National Belonging,” Ethnic and Racial Studies , 40: 2539–2557. doi: 10.1080/01419870.2016.1242763
  • National Academies of Sciences, Engineering, and Medicine, 2023, “Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field,” Washington, DC: The National Academies Press. doi:10.17226/26902
  • National Human Genome Research Institute, 2003, “International Consortium Completes Human Genome Project” (14 April). [ available online ]
  • National Research Council Committee on Mapping and Sequencing the Human Genome, 1988, Mapping and Sequencing the Human Genome , Washington, D.C.: National Academy Press.
  • National Research Council Committee on A Framework for Developing a New Taxonomy of Disease, 2011, Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease , Washington, DC: The National Academies Press. doi:10.17226/13284
  • Nelkin, Dorothy, 1992, “The Social Power of Genetic Information,” in Code of Codes , pp. 177–190.
  • Nelkin, Dorothy, and Laurence Tancredi, 1989, Dangerous Diagnostics: The Social Power of Biological Information , New York: Basic Books.
  • Office of Technology Assessment, 1988, Mapping Our Genes: Genome Projects: How Big, How Fast? Baltimore and London: Johns Hopkins University Press.
  • Olson, Maynard V., 2011, “What Does a ‘Normal’ Human Genome Look Like?” Science , 331 (18 Feb): 872.
  • O’Neill, Onora, 2001, “Informed Consent and Genetic Information,” Studies in History and Philosophy of Biological and Biomedical Sciences 32: 689–704.
  • Oyama, Susan, 1985, The Ontogeny of Information: Developmental Systems and Evolution , Cambridge: Cambridge University Press.
  • Oyama, Susan, Paul E. Griffiths, and Russell Gray, editors, 2001, Cycles of Contingency: Developmental Systems and Evolution , Cambridge, MA: MIT Press.
  • Panofsky, Aaron, and Catherine Bliss, 2017, “Ambiguity and Scientific Authority: Population Classification in Genomic Science,” American Sociological Review 82: 59–87. doi:10.1177/0003122416685812
  • Panofsky, Aaron, and Joan Donovan, 2019, “Genetic Ancestry Testing among White Nationalists: From Identity Repair to Citizen Science,” Social Studies of Science 49: 653–681. doi:10.1177/0306312719861434
  • Panofsky, Aaron, Kushan Dasgupta, and Nicole Iturriaga, 2021, “How White Nationalists Mobilize Genetics: From Genetic Ancestry and Human Biodiversity to Counterscience and Metapolitics,” American Journal of Physical Anthropology 175: 387–398. doi:10.1002/ajpa.24150
  • Paul, Diane B., 1994, “Eugenic Anxieties, Social Realities, and Political Choices,” in Cranor (ed.) 1994, 142–154.
  • Pennisi, Elizabeth, 1999, “Academic Sequencers Challenge Celera in a Sprint to the Finish,” Science , (18 Mar): 1822–1823.
  • –––, 2000, “Finally, the Book of Life and Instructions for Navigating It,” Science , 288 (30 Jun): 2304–2307.
  • Phillips, Andelka M., 2016, “Only a Click Away—DTC Genetics for Ancestry, Health, Love…and More: A View of the Business and Regulatory Landscape,” Applied & Translational Genomics , 8: 16–22. doi:10.1016/j.atg.2016.01.001
  • Pigliucci, Massimo, and Jonathan Kaplan, 2003, “On the Concept of Biological Race and Its Applicability to Humans,” Philosophy of Science , 70: 1161–1172.
  • Plutynski, Anya, 2018, Explaining Cancer , Oxford University Press.
  • Pokorski, Robert J., 1994, “Use of Genetic Information by Private Insurers,” in Justice and the Human Genome Project , 91–109.
  • Popejoy, Alice B., and Stephanie M. Fullerton, 2016, “Genomics Is Failing on Diversity,” Nature , 538 (12 Oct): 161–164. doi: 10.1038/538161a
  • Proctor, Robert N., 1992, “Genomics and Eugenics: How Fair Is the Comparison?” in Gene Mapping: Using Law and Ethics as Guides , edited by George J. Annas and Sherman Elias, 57–93, New York and Oxford: Oxford University Press.
  • Reardon, Jenny, 2004, Race to the Finish: Identity and Governance in an Age of Genomics , Princeton: Princeton University Press.
  • –––, 2017, The Postgenomic Condition: Ethics, Justice, and Knowledge after the Genome , University of Chicago Press.
  • Reich, David, Michael A. Nalls, W.H. Linda Kao, Ermeg L. Akylbekova, Arti Tandon, Nick Patterson, James Mullikin, Wen-Chi Hsueh, Ching-Yu Cheng, Josef Coresh, Eric Boerwinkle, Man Li, Alicja Waliszewska, Julie Neubauer, Rongling Li, Tennille S. Leak, Lynette Ekunwe, Joe C. Files, Cheryl L. Hardy, Joseph M. Zmuda, Herman A. Taylor, Elad Ziv, Tamara B. Harris, James G. Wilson, 2009, “Reduced Neutrophil Count in People of African Descent Is Due to a Regulatory Variant in the Duffy Antigen Receptor for Chemokines Gene,” PLOS Genetics , 5 (1): 1–14. doi:10.1371/journal.pgen.1000360
  • Ridley, R.M., C.D. Frith, L.A. Farrer, and P.M. Conneally, 1991, “Patterns of Inheritance of the Symptoms of Huntington’s Disease Suggestive of an Effect of Genomic Imprinting,” Journal of Medical Genetics , 28: 224–231.
  • Robert, Jason Scott, 2004, Embryology, Epigenesis, and Evolution , Cambridge: Cambridge University Press.
  • Robert, Jason Scott, and Françoise Baylis, 2003, “Crossing Species Boundaries,” American Journal of Bioethics , 3(3): 1–13.
  • Roberts, Dorothy E., 2021, “End the Entanglement of Race and Genetics,” Science , 371 (5 Feb): 566.
  • Root, Michael, 2003, “The Use of Race in Medicine as a Proxy for Genetic Differences,” Philosophy of Science , 70: 1173–1183.
  • Rothstein, Mark A., 2005, “Genetic Exceptionalism and Legislative Pragmatism,” Hastings Center Report , 35(4): 27–33.
  • Salzberg, Steven L., 2018, “Open Questions: How Many Genes Do We Have?” BMC Biology , 16: 94–96.
  • Sankar, Pamela, and Mildred K. Cho, 2002, “Toward a New Vocabulary of Human Genetic Variation,” Science , 298 (15 Nov): 1337–1338. doi:10.1126/science.1074447
  • Sankar, Pamela, Mildred K. Cho, and Joanna Mountain, 2007, “Race and Ethnicity in Genetic Research,” American Journal of Medical Genetics , Part A, 143A: 961–970. doi:10.1002/ajmg.a.31575
  • Sarkar, Sahotra, 1996, “Biological Information: A Sceptical Look at Some Central Dogmas of Molecular Biology,” in The Philosophy and History of Molecular Biology: New Perspectives , edited by Sahotra Sarkar, 187–232, Dordrecht: Kluwer.
  • –––, 1998, Genetics and Reductionism , Cambridge: Cambridge University Press.
  • Sarkar, Sahotra and Alfred I. Tauber, 1991, “Fallacious Claims for the HGP,” Nature , 353 (24 Oct): 691.
  • Schneider, Valerie A., et al., 2017, “Evaluation of GRCh38 and De Novo Haploid Genome Assemblies Demonstrates the Enduring Quality of the Reference Assembly,” Genome Research , 27: 849–864.
  • Schramm, Katharina, 2021, “Race, Genealogy, and the Genomic Archive in Post-apartheid South Africa,” Social Analysis: The International Journal of Anthropology 65(4): 49–69.
  • Sheridan, Cormac, 2014, “Illumina Claims $1,000 Genome Win,” Nature Biotechnology , 32: 115. doi:10.1038/nbt0214-115a
  • Sherman, Rachel M. and Steven L. Salzberg, 2020, “Pan-genomics in the Human Genome,” Nature Reviews Genetics , 21: 243–254. doi:10.1038/s41576-020-0210-7
  • Shi, Xinghua, and Xintao Wu, 2017, “An Overview of Human Genetic Privacy,” Annals of the New York Academy of Sciences , 1387: 61–72. doi:10.1111/nyas.13211
  • Singer, Natasha, 2018, “Employees Jump at Genetic Testing. Is That a Good Thing?” New York Times , 15 April 2018. [ Singer 2018 available online ]
  • Sirugo, Giorgio, Scott M. Williams, and Sarah A. Tishkoff, 2019, “The Missing Diversity in Human Genetic Studies,” Cell , 177: 26–31. doi:10.1016/j.cell.2019.02.048
  • Spencer, Quayshawn, 2014, “A Radical Solution to the Race Problem,” Philosophy of Science , 81: 1025–1038.
  • Stevens, Hallam, 2013, Life Out of Sequence: A Data-Driven History of Bioinformatics , University of Chicago Press. doi: 10.7208/9780226080345
  • –––, 2021, “Algorithmic Biology Unleashed,” Science , 371 (5 Feb): 565–566.
  • Swinbanks, David, 1991, “Japan’s Human Genome Project Takes Shape,” Nature , 351 (20 Jun): 593.
  • Tabery, James, 2015, “Why Is Studying the Genetics of Intelligence So Controversial?” Hastings Center Report , 45(5): S9–S14. doi: 10.1002/hast.492
  • –––, 2023, Tyranny of the Gene: Personalized Medicine and Its Threat to Public Health , Penguin Random House.
  • TallBear, Kimberly, 2013, Native American DNA: Tribal Belonging and the False Promise of Genetic Science , University of Minnesota Press.
  • The C. elegans Sequencing Consortium, 1998, “Genome Sequence of the Nematode C. elegans : A Platform for Investigating Biology,” Science , 282 (11 Dec): 2012–2018.
  • The Chimpanzee Sequencing and Analysis Consortium, 2005, “Initial Sequence of the Chimpanzee Genome and Comparison with the Human Genome,” Nature , 437 (1 Sep): 69–87. doi:10.1038/nature04072
  • The International HapMap Consortium, 2005, “A Haplotype Map of the Human Genome,” Nature , 437 (27 Oct): 1299–1320. doi:10.1038/nature04226
  • Tsosie, Krystal S., Joseph M. Yracheta, and Donna Dickenson, 2019, “Overvaluing Individual Consent Ignores Risks to Tribal Participants,” Nature Reviews Genetics 20: 497–498.
  • Ugalmugle, Sumant, and Rupali Swain, 2020, “Direct-To-Consumer (DTC) Genetic Testing Market Size By Test Type (Carrier Testing, Predictive Testing, Ancestry & Relationship Testing, Nutrigenomics Testing), By Distribution Channel (Online Platforms, Over-the-Counter), By Technology (Targeted Analysis, Single Nucleotide Polymorphism (SNP) Chips, Whole Genome Sequencing (WGS)), Industry Analysis Report, Regional Outlook, Application Potential, Price Trends, Competitive Market Share & Forecast, 2022–2028,” Global Market Insights , Report GMI3033. [ Ugalmugle & Swain 2020 available online ]
  • Venter, J. Craig, Hamilton O. Smith, and Leroy Hood, 1996, “A New Strategy for Genome Sequencing,” Nature , 381 (30 May): 364–366.
  • Venter, J. Craig, et al., 2001, “The Sequence of the Human Genome,” Science , 291 (16 Feb): 1304–1351.
  • Veritas Genetics, 2016, “Veritas Genetics Launches $999 Whole Genome And Sets New Standard For Genetic Testing,” PRNewswire , 3 March 2016. [ available online ]
  • Wade, Christopher H., 2019, “What Is the Psychosocial Impact of Providing Genetic and Genomic Health Information to Individuals? An Overview of Systematic Reviews,” Hastings Center Report , 49 (3): S88–S96. doi:10.1002/hast.1021
  • Wade, Nicholas, 2003, “Once Again, Scientists Say Human Genome is Complete,” New York Times , 15 April 2003, F1; LexisNexis Academic.
  • Wadman, Meredith, 1999, “Human Genome Project Aims to Finish ‘Working Draft’ Next Year,” Nature , 398 (18 Mar): 177.
  • Walker, Rebecca L., and Clair Morrissey, 2014, “Bioethics Methods in the Ethical, Legal, and Social Implications of the Human Genome Project Literature,” Bioethics , 28: 481–490. doi:10.1111/bioe.12023
  • Waterston, Robert H., Eric S. Lander, and John E. Sulston, 2002, “On the Sequencing of the Human Genome,” Proceedings of the National Academy of Sciences , 99: 3712–3716.
  • Watson, James D. 1990. “The Human Genome Project: Past, Present, and Future.” Science , 248 (6 Apr): 44–49.
  • Watson, James D., 1992, “A Personal View of the Project,” in Code of Codes , pp. 164–173.
  • –––, with Andrew Berry, 2003, DNA: The Secret of Life , New York: Alfred A. Kopf.
  • Weber, Griffin M., Kenneth D. Mandl, and Isaac S. Kohane, 2014, “Finding the Missing Link for Big Biomedical Data,” The Journal of the American Medical Association , 311: 2479–2480. doi:10.1001/jama.2014.4228
  • Weber, James L., and Eugene W. Myers, 1997, “Human Whole-Genome Shotgun Sequencing,” Genome Research 7: 401–409.
  • Wheeler, David A., et al., 2008, “The Complete Genome of an Individual by Massively Parallel DNA Sequencing,” Nature , 452 (17 April): 872–876.
  • Wolinsky, Howard, 2019, “Ancient DNA and Contemporary Politics,” EMBO Reports 20: e49507. doi:10.15252/embr.201949507
  • Yang, Xiaofei, Wan-Ping Lee, Kai Ye, and Charles Lee, 2019, “One Reference Genome Is Not Enough,” Genome Biology , 20: 104 [3pp]. doi:10.1186/s13059-019-1717-0
  • Yesley, M. S., 2008, “What’s ELSI Got to Do with It? Bioethics and the Human Genome Project,” New Genetics and Society , 27(1): 1–6.
  • Zarate, Oscar A., Julia Green Brody, Phil Brown, Mónica D. Ramírez-Andreotta, Laura Perovich, and Jacob Matz, 2016, “Balancing Benefits and Risks of Immortal Data: Participants’ Views of Open Consent in the Personal Genome Project,” Hastings Center Report , 46(1): 36–45. doi:10.1002/hast.523
  • Zhao, Tingting, Zhongqu Duan, Georgi Z Genchev, and Hui Lu, 2020, “Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences,” G3 Genes|Genomes|Genetics , 10: 2801–2809. doi:10.1534/g3.120.401280
  • About | National Institutes of Health (NIH) — All of Us
  • Council for Responsible Genetics
  • Department of Energy (DOE) Human Genome Project Information
  • Ensembl Human Genome Browser
  • ESRC Centre for Genomics in Society (Egenis), University of Exeter
  • Human Genome Organization (HUGO)
  • Human Genome News (DOE/NHGRI publication)
  • HumGen International (ELSI resources)
  • National Center for Biotechnology Information (NCBI) Human Genome Resources
  • National Human Genome Research Institute (NHGRI) All About the Human Genome Project (HGP)
  • Native BioData Consortium
  • NHGRI ELSI Research Program
  • Nature Human Genome Collection
  • Nuffield Council on Bioethics
  • Science Human Genome Special Issue 16 February 2001
  • The Human Genome Project: An Annotated & Interactive Scholarly Guide to the Project in the United States (cshl.edu)
  • UK Biobank
  • Wellcome Trust Sanger Institute Human Genome Project
  • World Health Organization Genomic Resource Centre

biological development: theories of | developmental biology | developmental biology: evolution and development | disability: critical disability theory | disability: definitions and models | donation and sale of human eggs and sperm | ethics, biomedical: privacy and medicine | eugenics | feminist philosophy, interventions: bioethics | feminist philosophy, interventions: philosophy of biology | feminist philosophy, topics: perspectives on disability | feminist philosophy, topics: perspectives on reproduction and the family | gene | genetics | genetics: genotype/phenotype distinction | genetics: molecular | genetics: population | genomics and postgenomics | health | heritability | human enhancement | human nature | information: biological | medicine, philosophy of | molecular biology | parenthood and procreation | race | reduction, scientific: in biology | scientific research and big data | systems and synthetic biology, philosophy of

Acknowledgment

For the original entry, I am grateful for the assistance of a California State University Faculty Development Grant and the help of three very capable student research assistants: Isabel Casimiro at California State University, Chico and Andrew Inkpen and Ashley Pringle at Saint Mary’s University. This revised entry has benefited from the excellent advice and thoughtful encouragement provided by Jim Griesemer and Jim Tabery.

Copyright © 2023 by Lisa Gannett <lisa.gannett@smu.ca>


The Stanford Encyclopedia of Philosophy is copyright © 2023 by The Metaphysics Research Lab , Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054


Human Genome Project

Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others.


  • Published: 13 September 2013

The Human Genome Project: big science transforms biology and medicine

  • Leroy Hood & Lee Rowen

Genome Medicine, volume 5, Article number: 79 (2013)


The Human Genome Project has transformed biology through its integrated big science approach to deciphering a reference human genome sequence along with the complete sequences of key model organisms. The project exemplifies the power, necessity and success of large, integrated, cross-disciplinary efforts - so-called ‘big science’ - directed towards complex major objectives. In this article, we discuss the ways in which this ambitious endeavor led to the development of novel technologies and analytical tools, and how it brought the expertise of engineers, computer scientists and mathematicians together with biologists. It established an open approach to data sharing and open-source software, thereby making the data resulting from the project accessible to all. The genome sequences of microbes, plants and animals have revolutionized many fields of science, including microbiology, virology, infectious disease and plant biology. Moreover, deeper knowledge of human sequence variation has begun to alter the practice of medicine. The Human Genome Project has inspired subsequent large-scale data acquisition initiatives such as the International HapMap Project, 1000 Genomes, and The Cancer Genome Atlas, as well as the recently announced Human Brain Project and the emerging Human Proteome Project.

Origins of the Human Genome Project

The Human Genome Project (HGP) has profoundly changed biology and is rapidly catalyzing a transformation of medicine [ 1 – 3 ]. The idea of the HGP was first publicly advocated by Renato Dulbecco in an article published in 1984, in which he argued that knowing the human genome sequence would facilitate an understanding of cancer [ 4 ]. In May 1985 a meeting focused entirely on the HGP was held, with Robert Sinsheimer, the Chancellor of the University of California, Santa Cruz (UCSC), assembling 12 experts to debate the merits of this potential project [ 5 ]. The meeting concluded that the project was technically possible, although very challenging. However, there was controversy as to whether it was a good idea, with six of those assembled declaring themselves for the project, six against (and those against felt very strongly). The naysayers argued that big science is bad science because it diverts resources from the ‘real’ small science (such as single investigator science); that the genome is mostly junk that would not be worth sequencing; that we were not ready to undertake such a complex project and should wait until the technology was adequate for the task; and that mapping and sequencing the genome was a routine and monotonous task that would not attract appropriate scientific talent. Throughout the early years of advocacy for the HGP (mid- to late 1980s) perhaps 80% of biologists were against it, as was the National Institutes of Health (NIH) [ 6 ]. The US Department of Energy (DOE) initially pushed for the HGP, partly using the argument that knowing the genome sequence would help us understand the radiation effects on the human genome resulting from exposure to atom bombs and other aspects of energy transmission [ 7 ]. This DOE advocacy was critical to stimulating the debate and ultimately the acceptance of the HGP. Curiously, there was more support from the US Congress than from most biologists. 
Those in Congress understood the appeal of international competitiveness in biology and medicine, the potential for industrial spin-offs and economic benefits, and the potential for more effective approaches to dealing with disease. A National Academy of Sciences committee report endorsed the project in 1988 [ 8 ] and the tide of opinion turned: in 1990, the program was initiated, with the finished sequence published in 2004, ahead of schedule and under budget [ 9 ].

What did the Human Genome Project entail?

This 3-billion-dollar, 15-year program evolved considerably as genomics technologies improved. Initially, the HGP set out to determine a human genetic map, then a physical map of the human genome [ 10 ], and finally the sequence map. Throughout, the HGP was instrumental in pushing the development of high-throughput technologies for preparing, mapping and sequencing DNA [ 11 ]. At the inception of the HGP in the early 1990s, there was optimism that the then-prevailing sequencing technology would be replaced. This technology, now called ‘first-generation sequencing’, relied on gel electrophoresis to create sequencing ladders, and radioactive- or fluorescent-based labeling strategies to perform base calling [ 12 ]. It was considered to be too cumbersome and low throughput for efficient genomic sequencing. As it turned out, the initial human genome reference sequence was deciphered using a 96-capillary (highly parallelized) version of first-generation technology. Alternative approaches such as multiplexing [ 13 ] and sequencing by hybridization [ 14 ] were attempted but not effectively scaled up. Meanwhile, thanks to the efforts of biotech companies, successive incremental improvements in the cost, throughput, speed and accuracy of first-generation automated fluorescent-based sequencing strategies were made throughout the duration of the HGP. Because biologists were clamoring for sequence data, the goal of obtaining a full-fledged physical map of the human genome was abandoned in the later stages of the HGP in favor of generating the sequence earlier than originally planned. This push was accelerated by Craig Venter’s bold plan to create a company (Celera) for the purpose of using a whole-genome shotgun approach [ 15 ] to decipher the sequence instead of the piecemeal clone-by-clone approach using bacterial artificial chromosome (BAC) vectors that was being employed by the International Consortium. 
Venter’s initiative prompted government funding agencies to endorse production of a clone-based draft sequence for each chromosome, with the finishing to come in a subsequent phase. These parallel efforts accelerated the timetable for producing a genome sequence of immense value to biologists [ 16 , 17 ].
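The contrast between the two strategies comes down to how the fragments are stitched back together. A minimal sketch of the shotgun idea - greedily merging the pair of reads with the longest suffix-prefix overlap - is shown below; the function names are illustrative, and real assemblers use far more sophisticated graph-based methods to cope with repeats and sequencing errors:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (at least min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap into one contig."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)] + [merged]
    return reads
```

For example, the three overlapping fragments `["ATGCGTA", "GTACGTT", "CGTTA"]` of the toy sequence ATGCGTACGTTA reassemble into a single contig. The clone-by-clone BAC approach sidesteps the hardest part of this computation by localizing each assembly problem to one clone, at the cost of first constructing the physical map.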

As a key component of the HGP, it was wisely decided to sequence the smaller genomes of significant experimental model organisms such as yeast, a small flowering plant ( Arabidopsis thaliana ), worm and fruit fly before taking on the far more challenging human genome. The efforts of multiple centers were integrated to produce these reference genome sequences, fostering a culture of cooperation. There were originally 20 centers mapping and sequencing the human genome as part of an international consortium [ 18 ]; in the end five large centers (the Wellcome Trust Sanger Institute, the Broad Institute of MIT and Harvard, The Genome Institute of Washington University in St Louis, the Joint Genome Institute, and the Whole Genome Laboratory at Baylor College of Medicine) emerged from this effort, with these five centers continuing to provide genome sequence and technology development. The HGP also fostered the development of mathematical, computational and statistical tools for handling all the data it generated.

The HGP produced a curated and accurate reference sequence for each human chromosome, with only a small number of gaps, and excluding large heterochromatic regions [ 9 ]. In addition to providing a foundation for subsequent studies in human genomic variation, the reference sequence has proven essential for the development and subsequent widespread use of second-generation sequencing technologies, which began in the mid-2000s. Second-generation cyclic array sequencing platforms produce, in a single run, up to hundreds of millions of short reads (originally approximately 30 to 70 bases, now up to several hundred bases), which are typically mapped to a reference genome at highly redundant coverage [ 19 ]. A variety of cyclic array sequencing strategies (such as RNA-Seq, ChIP-Seq, bisulfite sequencing) have significantly advanced biological studies of transcription and gene regulation as well as genomics, progress for which the HGP paved the way.
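Mapping short reads to a reference is, at its core, an indexing problem. The sketch below illustrates the idea with a simple exact k-mer seed index and a verification step (the helper names are hypothetical; production mappers use compressed indexes and also handle insertions and deletions):

```python
from collections import defaultdict

def build_index(reference, k=4):
    """Index every k-mer in the reference genome by its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k=4, max_mismatch=1):
    """Seed with the read's first k-mer, then verify the full read at each candidate locus."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatch:
                hits.append((pos, mismatches))
    return hits
```

Highly redundant coverage then means that each reference position is spanned by many independent reads, so sequencing errors at any one position can be voted down.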

Impact of the Human Genome Project on biology and technology

First, the human genome sequence initiated the comprehensive discovery and cataloguing of a ‘parts list’ of most human genes [ 16 , 17 ], and by inference most human proteins, along with other important elements such as non-coding regulatory RNAs. Understanding a complex biological system requires knowing the parts, how they are connected, their dynamics and how all of these relate to function [ 20 ]. The parts list has been essential for the emergence of ‘systems biology’, which has transformed our approaches to biology and medicine [ 21 , 22 ].

As an example, the ENCODE (Encyclopedia Of DNA Elements) Project, launched by the NIH in 2003, aims to discover and understand the functional parts of the genome [ 23 ]. Using multiple approaches, many based on second-generation sequencing, the ENCODE Project Consortium has produced voluminous and valuable data related to the regulatory networks that govern the expression of genes [ 24 ]. Large datasets such as those produced by ENCODE raise challenging questions regarding genome functionality. How can a true biological signal be distinguished from the inevitable biological noise produced by large datasets [ 25 , 26 ]? To what extent is the functionality of individual genomic elements only observable (used) in specific contexts (for example, regulatory networks and mRNAs that are operative only during embryogenesis)? It is clear that much work remains to be done before the functions of poorly annotated protein-coding genes will be deciphered, let alone those of the large regions of the non-coding portions of the genome that are transcribed. What is signal and what is noise is a critical question.

Second, the HGP also led to the emergence of proteomics, a discipline focused on identifying and quantifying the proteins present in discrete biological compartments, such as a cellular organelle, an organ or the blood. Proteins - whether they act as signaling devices, molecular machines or structural components - constitute the cell-specific functionality of the parts list of an organism’s genome. The HGP has facilitated the use of a key analytical tool, mass spectrometry, by providing the reference sequences and therefore the predicted masses of all the tryptic peptides in the human proteome - an essential requirement for the analysis of mass-spectrometry-based proteomics [ 27 ]. This mass-spectrometry-based accessibility to proteomes has driven striking new applications such as targeted proteomics [ 28 ]. Proteomics requires extremely sophisticated computational techniques, examples of which are PeptideAtlas [ 29 ] and the Trans-Proteomic Pipeline [ 30 ].
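The role the reference sequence plays here is concrete: an in-silico trypsin digest of every predicted protein yields the peptide masses against which observed spectra are matched. A minimal sketch, with the residue-mass table truncated to a few amino acids for brevity and trypsin modeled by the common rule "cleave after K or R, but not before P":

```python
# Monoisotopic residue masses in daltons (deliberately truncated table);
# a peptide's mass is the sum of its residue masses plus one water molecule.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'N': 114.04293,
    'D': 115.02694, 'K': 128.09496, 'E': 129.04259, 'R': 156.10111,
}
WATER = 18.01056

def tryptic_peptides(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in 'KR' and (i + 1 == len(protein) or protein[i + 1] != 'P'):
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of a peptide, in daltons."""
    return WATER + sum(RESIDUE_MASS[aa] for aa in peptide)
```

Doing this for every protein predicted from the reference genome gives the lookup table that makes peptide identification from mass spectra tractable.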

Third, our understanding of evolution has been transformed. Since the completion of the HGP, over 4,000 finished or quality draft genome sequences have been produced, mostly from bacterial species but including 183 eukaryotes [ 31 ]. These genomes provide insights into how diverse organisms from microbes to human are connected on the genealogical tree of life - clearly demonstrating that all of the species that exist today descended from a single ancestor [ 32 ]. Questions of longstanding interest with implications for biology and medicine have become approachable. Where do new genes come from? What might be the role of stretches of sequence highly conserved across all metazoa? How much large-scale gene organization is conserved across species and what drives local and global genome reorganization? Which regions of the genome appear to be resistant (or particularly susceptible) to mutation or highly susceptible to recombination? How do regulatory networks evolve and alter patterns of gene expression [ 33 ]? The latter question is of particular interest now that the genomes of several primates and hominids have been or are being sequenced [ 34 , 35 ] in hopes of shedding light on the evolution of distinctively human characteristics. The sequence of the Neanderthal genome [ 36 ] has had fascinating implications for human evolution; namely, that a few percent of Neanderthal DNA and hence the encoded genes are intermixed in the human genome, suggesting that there was some interbreeding while the two species were diverging [ 36 , 37 ].

Fourth, the HGP drove the development of sophisticated computational and mathematical approaches to data and brought computer scientists, mathematicians, engineers and theoretical physicists together with biologists, fostering a more cross-disciplinary culture [ 1 , 21 , 38 ]. It is important to note that the HGP popularized the idea of making data available to the public immediately in user-friendly databases such as GenBank [ 39 ] and the UCSC Genome Browser [ 40 ]. Moreover, the HGP also promoted the idea of open-source software, in which the source code of programs is made available to and can be edited by those interested in extending their reach and improving them [ 41 , 42 ]. The open-source operating system of Linux and the community it has spawned have shown the power of this approach. Data accessibility is a critical concept for the culture and success of biology in the future because the ‘democratization of data’ is critical for attracting available talent to focus on the challenging problems of biological systems with their inherent complexity [ 43 ]. This will be even more critical in medicine, as scientists need access to the data cloud available from each individual human to mine for the predictive medicine of the future - an effort that could transform the health of our children and grandchildren [ 44 ].

Fifth, the HGP, as conceived and implemented, was the first example of ‘big science’ in biology, and it clearly demonstrated both the power and the necessity of this approach for dealing with its integrated biological and technological aims. The HGP was characterized by a clear set of ambitious goals and plans for achieving them; a limited number of funded investigators typically organized around centers or consortia; a commitment to public data/resource release; and a need for significant funding to support project infrastructure and new technology development. Big science and smaller-scope individual-investigator-oriented science are powerfully complementary, in that the former generates resources that are foundational for all researchers while the latter adds detailed experimental clarification of specific questions, and analytical depth and detail to the data produced by big science. There are many levels of complexity in biology and medicine; big science projects are essential to tackle this complexity in a comprehensive and integrative manner [ 45 ].

The HGP benefited biology and medicine by creating a sequence of the human genome; sequencing model organisms; developing high-throughput sequencing technologies; and examining the ethical and social issues implicit in such technologies. It was able to take advantage of economies of scale and the coordinated effort of an international consortium with a limited number of players, which rendered the endeavor vastly more efficient than would have been possible if the genome were sequenced on a gene-by-gene basis in small labs. It is also worth noting that one aspect that attracted governmental support to the HGP was its potential for economic benefits. The Battelle Institute published a report on the economic impact of the HGP [ 46 ]. For an initial investment of approximately $3.5 billion, the return, according to the report, has been about $800 billion - a staggering return on investment.

Even today, as budgets tighten, there is a cry to withdraw support from big science and focus our resources on small science. This would be a drastic mistake. In the wake of the HGP there are further valuable biological resource-generating projects and analyses of biological complexity that require a big science approach, including the HapMap Project to catalogue human genetic variation [ 47 , 48 ], the ENCODE project, the Human Proteome Project (described below) and the European Commission’s Human Brain Project, as well as another brain-mapping project recently announced by President Obama [ 49 ]. Similarly to the HGP, significant returns on investment will be possible for other big science projects that are now under consideration if they are done properly. It should be stressed that discretion must be employed in choosing big science projects that are fundamentally important. Clearly funding agencies should maintain a mixed portfolio of big and small science - and the two are synergistic [ 1 , 45 ].

Last, the HGP ignited the imaginations of unusually talented scientists - Jim Watson, Eric Lander, John Sulston, Bob Waterston and Sydney Brenner, to mention only a few. Virtually every argument initially posed by the opponents of the HGP turned out to be wrong. The HGP is a wonderful example of a fundamental paradigm change in biology: initially fiercely resisted, it was ultimately far more transformational than expected by even the most optimistic of its proponents.

Impact of the Human Genome Project on medicine

Since the conclusion of the HGP, several big science projects specifically geared towards a better understanding of human genetic variation and its connection to human health have been initiated. These include the HapMap Project aimed at identifying haplotype blocks of common single nucleotide polymorphisms (SNPs) in different human populations [ 47 , 48 ], and its successor, the 1000 Genomes project, an ongoing endeavor to catalogue common and rare single nucleotide and structural variation in multiple populations [ 50 ]. Data produced by both projects have supported smaller-scale clinical genome-wide association studies (GWAS), which correlate specific genetic variants with disease risk of varying statistical significance based on case–control comparisons. Since 2005, over 1,350 GWAS have been published [ 51 ]. Although GWAS analyses give hints as to where in the genome to look for disease-causing variants, the results can be difficult to interpret because the actual disease-causing variant might be rare, the sample size of the study might be too small, or the disease phenotype might not be well stratified. Moreover, most of the GWAS hits are outside of coding regions - and we do not have effective methods for easily determining whether these hits reflect the mis-functioning of regulatory elements. The question as to what fraction of the thousands of GWAS hits are signal and what fraction are noise is a concern. Pedigree-based whole-genome sequencing offers a powerful alternative approach to identifying potential disease-causing variants [ 52 ].
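The statistical core of a case-control comparison at a single SNP is simple: test whether allele counts differ between cases and controls. A sketch using a Pearson chi-square test on the 2x2 allele-count table (real GWAS pipelines add covariates, population-structure correction, and a genome-wide multiple-testing threshold, conventionally near 5×10⁻⁸, which is one reason so many nominal hits are noise):

```python
def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence: (row total * column total) / grand total
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

Equal allele frequencies in cases and controls give a statistic of zero; the statistic grows as the case and control distributions diverge.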

Five years ago, a mere handful of personal genomes had been fully sequenced (for example, [ 53 , 54 ]). Now there are thousands of exome and whole-genome sequences (soon to be tens of thousands, and eventually millions), which have been determined with the aim of identifying disease-causing variants and, more broadly, establishing well-founded correlations between sequence variation and specific phenotypes. For example, the International Cancer Genome Consortium [ 55 ] and The Cancer Genome Atlas [ 56 ] are undertaking large-scale genomic data collection and analyses for numerous cancer types (sequencing both the normal and cancer genome for each individual patient), with a commitment to making their resources available to the research community.

We predict that individual genome sequences will soon play a larger role in medical practice. In the ideal scenario, patients or consumers will use the information to improve their own healthcare by taking advantage of prevention or therapeutic strategies that are known to be appropriate for real or potential medical conditions suggested by their individual genome sequence. Physicians will need to educate themselves on how best to advise patients who bring consumer genetic data to their appointments, which may well be a common occurrence in a few years [ 57 ].

In fact, the application of systems approaches to disease has already begun to transform our understanding of human disease and the practice of healthcare and push us towards a medicine that is predictive, preventive, personalized and participatory: P4 medicine. A key assumption of P4 medicine is that in diseased tissues biological networks become perturbed - and change dynamically with the progression of the disease. Hence, knowing how the information encoded by disease-perturbed networks changes provides insights into disease mechanisms, new approaches to diagnosis and new strategies for therapeutics [ 58 , 59 ].

Let us provide some examples. First, pharmacogenomics has identified more than 70 genes for which specific variants cause humans to metabolize drugs ineffectively (too fast or too slow). Second, there are hundreds of ‘actionable gene variants’ - variants that cause disease but whose consequences can be avoided by available medical strategies with knowledge of their presence [ 60 ]. Third, in some cases, cancer-driving mutations in tumors, once identified, can be counteracted by treatments with currently available drugs [ 61 ]. And last, a systems approach to blood protein diagnostics has generated powerful new diagnostic panels for human diseases such as hepatitis [ 62 ] and lung cancer [ 63 ].

These latter examples portend a revolution in blood diagnostics that will lead to early detection of disease, the ability to follow disease progression and responses to treatment, and the ability to stratify a disease type (for instance, breast cancer) into its different subtypes for proper impedance match against effective drugs [ 59 ]. We envision a time in the future when all patients will be surrounded by a virtual cloud of billions of data points, and when we will have the analytical tools to reduce this enormous data dimensionality to simple hypotheses to optimize wellness and minimize disease for each individual [ 58 ].

Impact of the Human Genome Project on society

The HGP challenged biologists to consider the social implications of their research. Indeed, it devoted 5% of its budget to considering the social, ethical and legal aspects of acquiring and understanding the human genome sequence [ 64 ]. That process continues as different societal issues arise, such as genetic privacy, potential discrimination, justice in apportioning the benefits from genomic sequencing, human subject protections, genetic determinism (or not), identity politics, and the philosophical concept of what it means to be human beings who are intrinsically connected to the natural world.

Strikingly, we have learned from the HGP that there are no race-specific genes in humans [ 65 – 68 ]. Rather, an individual’s genome reveals his or her ancestral lineage, which is a function of the migrations and interbreeding among population groups. We are one race and we honor our species’ heritage when we treat each other accordingly, and address issues of concern to us all, such as human rights, education, job opportunities, climate change and global health.

What is to come?

There remain fundamental challenges for fully understanding the human genome. For example, at least 5% of the human genome has yet to be successfully sequenced or assembled, for technical reasons that relate to eukaryotic islands being embedded in heterochromatic repeats, copy number variations, and unusually high or low GC content [ 69 ]. The question of what information these regions contain is a fascinating one. In addition, there are highly conserved regions of the human genome whose functions have not yet been identified; presumably they are regulatory, but why they should be strongly conserved over half a billion years of evolution remains a mystery.

There will continue to be advances in genome analysis. Developing improved analytical techniques to identify biological information in genomes and decipher what this information relates to functionally and evolutionarily will be important. Developing the ability to rapidly analyze complete human genomes with regard to actionable gene variants is essential. It is also essential to develop software that can accurately fold genome-predicted proteins into three dimensions, so that their functions can be predicted from structural homologies. Likewise, it will be fascinating to determine whether we can make predictions about the structures of biological networks directly from the information of their cognate genomes. Indeed, the idea that we can decipher the ‘logic of life’ of an organism solely from its genome sequence is intriguing. While we have become relatively proficient at determining static and stable genome sequences, we are still learning how to measure and interpret the dynamic effects of the genome: gene expression and regulation, as well as the dynamics and functioning of non-coding RNAs, metabolites, proteins and other products of genetically encoded information.

The HGP, with its focus on developing the technology to enumerate a parts list, was critical for launching systems biology, with its concomitant focus on high-throughput ‘omics’ data generation and the idea of ‘big data’ in biology [ 21 , 38 ]. The practice of systems biology begins with a complete parts list of the information elements of living organisms (for example, genes, RNAs, proteins and metabolites). The goals of systems biology are comprehensive yet open ended because, as seen with the HGP, the field is experiencing an infusion of talented scientists applying multidisciplinary approaches to a variety of problems. A core feature of systems biology, as we see it, is to integrate many different types of biological information to create the ‘network of networks’ - recognizing that networks operate at the genomic, the molecular, the cellular, the organ, and the social network levels, and that these are integrated in the individual organism in a seamless manner [ 58 ]. Integrating these data allows the creation of models that are predictive and actionable for particular types of organisms and individual patients. These goals require developing new types of high-throughput omic technologies and ever more powerful analytical tools.

The HGP infused a technological capacity into biology that has resulted in enormous increases in the range of research, for both big and small science. Experiments that were inconceivable 20 years ago are now routine, thanks to the proliferation of academic and commercial wet lab and bioinformatics resources geared towards facilitating research. In particular, rapid increases in throughput and accuracy of the massively parallel second-generation sequencing platforms with their correlated decreases in cost of sequencing have resulted in a great wealth of accessible genomic and transcriptional sequence data for myriad microbial, plant and animal genomes. These data in turn have enabled large- and small-scale functional studies that catalyze and enhance further research when the results are provided in publicly accessible databases [ 70 ].

One descendant of the HGP is the Human Proteome Project, which is beginning to gather momentum, although it is still poorly funded. This exciting endeavor has the potential to be enormously beneficial to biology [ 71 – 73 ]. The Human Proteome Project aims to create assays for all human and model organism proteins, including the myriad protein isoforms produced from the RNA splicing and editing of protein-coding genes, chemical modifications of mature proteins, and protein processing. The project also aims to pioneer technologies that will achieve several goals: enable single-cell proteomics; create microfluidic platforms for thousands of protein enzyme-linked immunosorbent assays (ELISAs) for rapid and quantitative analyses of, for example, a fraction of a droplet of blood; develop protein-capture agents that are small, stable, easy to produce and can be targeted to specific protein epitopes and hence avoid extensive cross-reactivity; and develop the software that will enable the ordinary biologist to analyze the massive amounts of proteomics data that are beginning to emerge from human and other organisms.

Newer generations of DNA sequencing platforms will be introduced that will transform how we gather genome information. Third-generation sequencing [ 74 ] will employ nanopores or nanochannels, utilize electronic signals, and sequence single DNA molecules for read lengths of 10,000 to 100,000 bases. Third-generation sequencing will solve many current problems with human genome sequences. First, contemporary short-read sequencing approaches make it impossible to assemble human genome sequences de novo; hence, they are usually compared against a prototype reference sequence that is itself not fully accurate, especially with respect to variations other than SNPs. This makes it extremely difficult to precisely identify the insertion-deletion and structural variations in the human genome, both for our species as a whole and for any single individual. The long reads of third-generation sequencing will allow for the de novo assembly of human (and other) genomes, and hence delineate all of the individually unique variability: nucleotide substitutions, indels, and structural variations. Second, we do not have global techniques for identifying the 16 different chemical modifications of human DNA (epigenetic marks, reviewed in [ 75 ]). It is increasingly clear that these epigenetic modifications play important roles in gene expression [ 76 ]. Single-molecule analyses should be able to identify all the epigenetic marks on DNA. Third, single-molecule sequencing will facilitate the full-length sequencing of RNAs, thereby enhancing interpretation of the transcriptome by enabling the identification of RNA editing, alternative splice forms of a given transcript, and different start and termination sites. Last, it is exciting to contemplate that the ability to parallelize this process (for example, by generating millions of nanopores that can be used simultaneously) could enable the sequencing of a human genome in 15 minutes or less [ 77 ].
The high-throughput nature of this sequencing may eventually lead to human genome costs of $100 or under. The interesting question is how long it will take to make third-generation sequencing a mature technology.

The HGP has thus opened many avenues in biology, medicine, technology and computation that we are just beginning to explore.

Abbreviations

BAC: Bacterial artificial chromosome

DOE: Department of Energy

ELISA: Enzyme-linked immunosorbent assay

GWAS: Genome-wide association studies

HGP: Human Genome Project

NIH: National Institutes of Health

SNP: Single nucleotide polymorphism

UCSC: University of California, Santa Cruz

Hood L: Acceptance remarks for Fritz J. and Delores H. Russ Prize. The Bridge. 2011, 41: 46-49.

Collins FS, McKusick VA: Implications of the Human Genome Project for medical science. JAMA. 2001, 285: 540-544. 10.1001/jama.285.5.540.

Green ED, Guyer MS, National Human Genome Research Institute: Charting a course for genomic medicine from base to bedside. Nature. 2011, 470: 204-213. 10.1038/nature09764.

Dulbecco R: A turning point in cancer research: sequencing the human genome. Science. 1986, 231: 1055-1056.

Sinsheimer RL: The Santa Cruz workshop - May 1985. Genomics. 1989, 5: 954-956. 10.1016/0888-7543(89)90142-0.

Cook-Deegan RM: The Gene Wars: Science, Politics and the Human Genome. 1994, New York: WW Norton

Report on the Human Genome Initiative for the Office of Health and Environmental Research. http://www.ornl.gov/sci/techresources/Human_Genome/project/herac2.shtml

National Academy of Sciences: Report of the Committee on Mapping and Sequencing the Human Genome. 1988, Washington DC: National Academy Press

Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.

Understanding Our Genetic Inheritance. The United States Human Genome Project, The First Five Years: Fiscal Years 1991–1995. http://www.genome.gov/10001477

Collins FS, Galas D: A new five-year plan for the U.S. Human Genome Program. Science. 1993, 262: 43-46. 10.1126/science.8211127.

Smith LM, Sanders JZ, Kaiser RJ, Hughes P, Dodd C, Connell CR, Heiner C, Kent SBH, Hood LE: Fluorescence detection in automated DNA sequence analysis. Nature. 1986, 321: 674-679. 10.1038/321674a0.

Church G, Kieffer-Higgins S: Multiplex DNA sequencing. Science. 1988, 240: 185-188. 10.1126/science.3353714.

Strezoska Z, Paunesku T, Radosavljević D, Labat I, Drmanac R, Crkvenjakov R: DNA sequencing by hybridization: 100 bases read by a non-gel-based method. Proc Natl Acad Sci USA. 1991, 88: 10089-10093. 10.1073/pnas.88.22.10089.

Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M: Shotgun sequencing of the human genome. Science. 1998, 280: 1540-1542. 10.1126/science.280.5369.1540.

International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Miklos GLG, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.

International Human Genome Sequencing Consortium. http://www.genome.gov/11006939

Shendure J, Aiden ER: The expanding scope of DNA sequencing. Nat Biotechnol. 2012, 30: 1084-1094. 10.1038/nbt.2421.

Hood L: A personal journey of discovery: developing technology and changing biology. Annu Rev Anal Chem. 2008, 1: 1-43. 10.1146/annurev.anchem.1.031207.113113.

Committee on a New Biology for the 21st Century: A New Biology for the 21st Century. 2009, Washington DC: The National Academies Press

Ideker T, Galitski T, Hood L: A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001, 2: 343-372. 10.1146/annurev.genom.2.1.343.

Encyclopedia of DNA Elements. http://encodeproject.org/ENCODE/

ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.

Editorial: Form and function. Nature. 2013, 495: 141-142.

ENCODE Project Consortium: A user’s guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol. 2011, 9: e1001046-10.1371/journal.pbio.1001046.

Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.

Picotti P, Aebersold R: Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat Methods. 2012, 9: 555-566. 10.1038/nmeth.2015.

Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R: The PeptideAtlas Project. Nucleic Acids Res. 2006, 34: D655-D658. 10.1093/nar/gkj040.

Deutsch ED, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii A, Aebersold R: A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010, 10: 1150-1159. 10.1002/pmic.200900375.

Genomes Online Database: complete genome projects. http://www.genomesonline.org/cgi-bin/GOLD/index.cgi?page_requested=Complete+Genome+Projects

Theobald DL: A formal test of the theory of universal common ancestry. Nature. 2010, 465: 219-222. 10.1038/nature09014.

Wolfe KH, Li W-H: Molecular evolution meets the genomics revolution. Nat Genet. 2003, Suppl 33: 255-265.

Marques-Bonet T, Ryder OA, Eichler EE: Sequencing primate genomes: what have we learned?. Annu Rev Genomics Hum Genet. 2009, 10: 355-386. 10.1146/annurev.genom.9.081307.164420.

Noonan JP: Neanderthal genomics and the evolution of modern humans. Genome Res. 2010, 20: 547-553. 10.1101/gr.076000.108.

Stoneking M, Krause J: Learning about human population history from ancient and modern genomes. Nat Rev Genet. 2011, 12: 603-614.

Sankararaman S, Patterson N, Li H, Paabo S, Reich D: The date of interbreeding between Neanderthals and Modern Humans. PLoS Genet. 2012, 8: e1002947-10.1371/journal.pgen.1002947.

Schatz MC: Computational thinking in the era of big data biology. Genome Biol. 2012, 13: 177-10.1186/gb-2012-13-11-177.

Mizrachi I: GenBank: the Nucleotide Sequence Database. The NCBI Handbook. Edited by: McEntyre J, Ostell J. 2002, Bethesda: National Center for Biotechnology Information

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.

SourceForge. http://sourceforge.net/

Bioconductor: open source software for bioinformatics. http://www.bioconductor.org/

Field D, Sansone S-A, Collina A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, Millard S, Mugabushaka M, Perrin N, Remacle JE, Remington K, Rocca-Serra P, Taylor CF, Thorley M, Tiwari B, Wilbanks J: Omics data sharing. Science. 2009, 326: 234-236. 10.1126/science.1180598.

Knoppers BM, Harris JR, Tasse AM, Budin-Ljosne I, Kaye J, Deschenes M, Zawati M: Towards a data-sharing Code of Conduct for international genomic research. Genome Med. 2011, 3: 46-10.1186/gm262.

Hood L: Biological complexity under attack: a personal view of systems biology and the coming of “big science”. Genet Eng Biotechnol News. 2011, 31: 17.

Tripp S, Grueber M: Economic Impact of the Human Genome Project. 2011, Columbus: Battelle Memorial Institute

International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.

The International HapMap3 Consortium: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467: 52-58. 10.1038/nature09298.

Abbott A: Neuroscience: solving the brain. Nature. 2013, 499: 272-274. 10.1038/499272a.

The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.

A Catalog of Published Genome-wide Association Studies. http://www.genome.gov/gwastudies/

Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, Shendure J, Drmanac R, Jorde LB, Hood L, Galas DJ: Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010, 328: 636-639. 10.1126/science.1186802.

Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, et al: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254-10.1371/journal.pbio.0050254.

Wheeler DA, Srinivasian M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen Y-J, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song X, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872-876. 10.1038/nature06884.

International Cancer Genome Consortium. http://icgc.org/

The Cancer Genome Atlas. http://cancergenome.nih.gov/

Pandey A: Preparing for the 21st century patient. JAMA. 2013, 309: 1471-1472. 10.1001/jama.2012.116971.

Hood L, Flores M: A personal view on systems medicine and the emergence of proactive P4 medicine: predictive, preventive, personalized and participatory. N Biotechnol. 2012, 29: 613-624.

Price ND, Edelman LB, Lee I, Yoo H, Hwang D, Carlson G, Galas DJ, Heath JR, Hood L: Systems biology and the emergence of systems medicine. Genomic and Personalized Medicine: From Principles to Practice. Volume 1. Edited by: Ginsburg G, Willard H. 2009, Philadelphia: Elsevier, 131-141.

Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, McGuire A, Nussbaum RL, O’Daniel JM, Ormond KE, Rehm HL, Watson MS, Williams MS, Biesecker LG: ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing. 2013, Bethesda: American College of Medical Genetics and Genomics

Meyerson M, Gabriel S, Getz G: Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010, 11: 685-696. 10.1038/nrg2841.

Qin S, Zhou Y, Lok AS, Tsodikov A, Yan X, Gray L, Yuan M, Moritz RL, Galas D, Omenn GS, Hood L: SRM targeted proteomics in search for biomarkers of HCV-induced progression of fibrosis to cirrhosis in HALT-C patients. Proteomics. 2012, 12: 1244-1252. 10.1002/pmic.201100601.

Li X-J, Hayward C, Fong P-Y, Dominguez M, Hunsucker SW, Lee LW, McClean M, Law S, Butler H, Schirm M, Gingras O, Lamontague J, Allard R, Chelsky D, Price ND, Lam S, Massion PP, Pass H, Rom WN, Vachani A, Fang KC, Hood L, Kearney P: A blood-based proteomic classifier for the molecular characterization of pulmonary nodules. Sci Transl Med. in press

Knoppers BM, Thorogood A, Chadwick R: The Human Genome Organisation: towards next-generation ethics. Genome Med. 2013, 5: 38-10.1186/gm442.

Hood L: Who we are: the book of life. Commencement Address. Whitman College Magazine. 2002, 4-7.

Foster MW, Sharp RR: Beyond race: towards a whole-genome perspective on human populations and genetic variation. Nat Rev Genet. 2004, 5: 790-796. 10.1038/nrg1452.

Royal CDM, Dunston GM: Changing the paradigm from ‘race’ to human genetic variation. Nat Genet. 2004, 36: S5-S7. 10.1038/ng1454.

Witherspoon DJ, Wooding S, Rogers AR, Marchani EE, Watkins WS, Batzer MA, Jorde LB: Genetic similarities within and between populations. Genetics. 2007, 176: 351-359. 10.1534/genetics.106.067355.

Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K, Pasaniuk B, Price AL, Reich D, Morton CC, Pollak MR, Wilson JG, McCarroll SA: Using population admixture to help complete maps of the human genome. Nat Genet. 2013, 45: 406-414. 10.1038/ng.2565.

Fernandez-Suarez XM, Galperin MY: The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2013, 41: D1-D7.

Human Proteome Project. http://www.hupo.org/research/hpp/

Hood LE, Omenn GS, Moritz RL, Aebersold R, Yamamoto KR, Amos M, Hunter-Cevera J, Locascio L, Workshop Participants: New and improved proteomics technologies for understanding complex biological systems: addressing a grand challenge in the life sciences. Proteomics. 2012, 12: 2773-2783. 10.1002/pmic.201270086.

Editorial: The call of the human proteome. Nat Methods. 2010, 7: 661.

Schadt E, Turner S, Kasarskis A: A window into third-generation sequencing. Hum Mol Genet. 2010, 19: R227-R240. 10.1093/hmg/ddq416.

Kim JK, Samaranayake M, Pradhan S: Epigenetic mechanisms in mammals. Cell Mol Life Sci. 2009, 66: 596-612. 10.1007/s00018-008-8432-4.

Hon G, Ren B, Wang W: ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008, 4: e1000201-10.1371/journal.pcbi.1000201.

Hayden EC: Nanopore genome sequencer makes its debut. Nature News. 2012. 10.1038/nature.2012.10051.

Acknowledgements

The authors gratefully acknowledge support from the Luxembourg Centre for Systems Biomedicine and the University of Luxembourg; from the NIH, through award 2P50GM076547-06A; and the US Department of Defense (DOD), through award W911SR-09-C-0062. LH receives support from NIH P01 NS041997; 1U54CA151819-01; and DOD awards W911NF-10-2-0111 and W81XWH-09-1-0107.

Author information

Authors and affiliations.

Institute for Systems Biology, 401 Terry Ave N., Seattle, WA, 98109, USA

Leroy Hood & Lee Rowen

Corresponding authors

Correspondence to Leroy Hood or Lee Rowen.

Additional information

Competing interests.

The authors declare that they have no competing interests.

About this article

Cite this article.

Hood, L., Rowen, L. The Human Genome Project: big science transforms biology and medicine. Genome Med 5, 79 (2013). https://doi.org/10.1186/gm483

Published: 13 September 2013

DOI: https://doi.org/10.1186/gm483

Genome Medicine

ISSN: 1756-994X

Scientists Finish the Human Genome at Last

The complete genome uncovered more than 100 new genes that are probably functional, and many new variants that may be linked to diseases.

By Carl Zimmer

Two decades after the draft sequence of the human genome was unveiled to great fanfare, a team of 99 scientists has finally deciphered the entire thing. They have filled in vast gaps and corrected a long list of errors in previous versions, giving us a new view of our DNA.

The consortium has posted six papers online in recent weeks in which they describe the full genome. These hard-sought data, now under review by scientific journals, will give scientists a deeper understanding of how DNA influences risks of disease, the scientists say, and how cells keep it in neatly organized chromosomes instead of molecular tangles.

For example, the researchers have uncovered more than 100 new genes that may be functional, and have identified millions of genetic variations between people. Some of those differences probably play a role in diseases.

For Nicolas Altemose, a postdoctoral researcher at the University of California, Berkeley, who worked on the team, the view of the complete human genome feels something like the close-up pictures of Pluto from the New Horizons space probe.

“You could see every crater, you could see every color, from something that we only had the blurriest understanding of before,” he said. “This has just been an absolute dream come true.”

Experts who were not involved in the project said it would enable scientists to explore the human genome in much greater detail. Large chunks of the genome that had been simply blank are now deciphered so clearly that scientists can start studying them in earnest.

“The fruit of this sequencing effort is amazing,” said Yukiko Yamashita, a developmental biologist at the Whitehead Institute for Biomedical Research at the Massachusetts Institute of Technology.

While scientists have known for decades that genes were spread across 23 pairs of chromosomes, these strange, wormlike microscopic structures remained largely a mystery.

By the late 1970s, scientists had gained the ability to pinpoint a few individual human genes and decode their sequence. But their tools were so crude that hunting down a single gene could take up an entire career.

Toward the end of the 20th century, an international network of geneticists decided to try to sequence all the DNA in our chromosomes. The Human Genome Project was an audacious undertaking, given how much there was to sequence. Scientists knew that the twin strands of DNA in our cells contained roughly three billion pairs of letters — a text long enough to fill hundreds of books.

When that team began its work, the best technology the scientists could use sequenced bits of DNA just a few dozen letters, or bases, long. Researchers were left to put them together like the pieces of a vast jigsaw puzzle. To assemble the puzzle, they looked for fragments with identical ends, meaning that they came from overlapping portions of the genome. It took years for them to gradually assemble the sequenced fragments into larger swaths.
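
The jigsaw strategy described above (merging fragments whose ends match) can be sketched as a toy greedy overlap assembler in Python. This is a deliberate simplification: real assemblers must cope with sequencing errors, repeats, and reads from both DNA strands, and the reads below are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (>= min_len, else 0)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Toy greedy assembly: repeatedly merge the read pair with the largest end overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_i is None:  # no overlaps left: the remaining fragments stay as a gap
            break
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Four overlapping fragments of the made-up sequence ATTAGACCTGCCGGAATAC:
print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]))
# → ['ATTAGACCTGCCGGAATAC']
```

The "gap" branch hints at the article's central difficulty: when no unambiguous overlap exists (as with long repeats), the fragments cannot be joined and holes remain in the assembly.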

The White House announced in 2000 that scientists had finished the first draft of the human genome, and details of the project were published the following year. But long stretches of the genome remained unknown, while scientists struggled to figure out where millions of other bases belonged.

It turned out that the genome was a very hard puzzle to put together from small pieces. Many of our genes exist as multiple copies that are nearly identical to each other. Sometimes the different copies carry out different jobs. Other copies — known as pseudogenes — are disabled by mutations. A short fragment of DNA from one gene might fit just as well into the others.

And genes only make up a small percentage of the genome. The rest of it can be even more baffling. Much of the genome is made up of virus-like stretches of DNA that exist largely just to make new copies of themselves that get inserted back into the genome.

In the early 2000s, scientists got a little better at putting together the genome puzzle from its tiny pieces. They made more fragments, read them more accurately, and developed new computer programs to assemble them into bigger chunks of the genome.

Periodically, researchers would unveil the latest, best draft of the human genome — known as the reference genome. Scientists used the reference genome as a guide for their own sequencing efforts. For example, clinical geneticists would catalog disease-causing mutations by comparing genes from patients to the reference genome.

The newest reference genome came out in 2013. It was a lot better than the first draft, but it was a long way from complete. Eight percent of it was simply blank.

“There’s basically an entire human chromosome that had gone missing,” said Michael Schatz, a computational biologist at Johns Hopkins University.

In 2019, two scientists — Adam Phillippy, a computational biologist at the National Human Genome Research Institute, and Karen Miga, a geneticist at the University of California, Santa Cruz — founded the Telomere-to-Telomere Consortium to complete the genome.

Dr. Phillippy admitted that part of his motivation for such an audacious project was that the missing gaps annoyed him. “They were just really bugging me,” he said. “You take a beautiful landscape puzzle, pull out a hundred pieces, and look at it — that’s very bothersome to a perfectionist.”

Dr. Phillippy and Dr. Miga put out a call for scientists to join them to finish the puzzle. They ended up with 99 scientists working directly on sequencing the human genome, and dozens more pitching in to make sense of the data. The researchers worked remotely through the pandemic, coordinating their efforts over Slack, a messaging app.

“It was a surprisingly nice ant colony,” Dr. Miga said.

The consortium took advantage of new machines that can read stretches of DNA reaching tens of thousands of bases long. The researchers also invented techniques to figure out where particularly mysterious repeating sequences belonged in a genome.

All told, the scientists added or fixed more than 200 million base pairs in the reference genome. They can now say with confidence that the human genome measures 3.05 billion base pairs long.

Within those new sequences of DNA, the scientists discovered more than 2,000 new genes. Most appear to be disabled by mutations, but 115 of them look as if they can produce proteins — the function of which scientists may need years to figure out. The consortium now estimates that the human genome contains 19,969 protein-coding genes.

With a complete genome finally assembled, the researchers could take a better look at the variation in DNA from one person to the next. They discovered more than two million new spots in the genome where people differ. Using the new genome also helped them to avoid identifying disease-linked mutations where none actually exist.

“It’s a great advance for the field,” said Dr. Midhat Farooqi, the director of molecular oncology at Children’s Mercy, a hospital in Kansas City, Mo., who was not involved in the project.

Dr. Farooqi has started using the genome for his research into rare childhood diseases, aligning DNA from his patients against the newly filled gaps to search for mutations.

Switching to the new genome may be a challenge for many clinical labs, however. They’ll have to shift all of their information about the links between genes and diseases to a new map of the genome. “There will be a big effort, but it will take a couple years,” said Dr. Sharon Plon, a medical geneticist at Baylor College of Medicine in Houston.

Dr. Altemose plans on using the complete genome to explore a particularly mysterious region in each chromosome known as the centromere. Instead of storing genes, centromeres anchor proteins that move chromosomes around a cell as it divides. The centromere region contains thousands of repeated segments of DNA.

In their first look, Dr. Altemose and his colleagues were struck by how different centromere regions can be from one person to another. That observation suggests that centromeres have been evolving rapidly, as mutations insert new pieces of repeating DNA into the regions or cut other pieces out.

While some of this repeating DNA may play a role in pulling chromosomes apart, the researchers have also found new segments — some of them millions of bases long — that don’t appear to be involved. “We don’t know what they’re doing,” Dr. Altemose said.

But now that the empty zones of the genome are filled in, Dr. Altemose and his colleagues can study them up close. “I’m really excited moving forward to see all the things we can discover,” he said.

An earlier version of this article misstated when scientists first arrived at the correct number of human chromosomes. It was in the 1960s, not a century ago.

Carl Zimmer writes the “Matter” column. He is the author of fourteen books, including “Life's Edge: The Search For What It Means To Be Alive.”

Alcohol Health Res World, v.19(3), 1995

The Human Genome Project

The Human Genome Project is an ambitious research effort aimed at deciphering the chemical makeup of the entire human genetic code (i.e., the genome). The primary work of the project is to develop three research tools that will allow scientists to identify genes involved in both rare and common diseases. Another project priority is to examine the ethical, legal, and social implications of new genetic technologies and to educate the public about these issues. Although it has been in existence for less than 6 years, the Human Genome Project already has produced results that are permeating basic biological research and clinical medicine. For example, researchers have successfully mapped the mouse genome, and work is well under way to develop a genetic map of the rat, a useful model for studying complex disorders such as hypertension, diabetes, and alcoholism.

The Human Genome Project is an international research project whose primary mission is to decipher the chemical sequence of the complete human genetic material (i.e., the entire genome), identify all 50,000 to 100,000 genes contained within the genome, and provide research tools to analyze all this genetic information. This ambitious project is based on the fact that the isolation and analysis of the genetic material contained in the DNA (figure 1) can provide scientists with powerful new approaches to understanding the development of diseases and to creating new strategies for their prevention and treatment. Nearly all human medical conditions, except physical injuries, are related to changes (i.e., mutations) in the structure and function of DNA. These disorders include the 4,000 or so heritable “Mendelian” diseases that result from mutations in a single gene; complex and common disorders that arise from heritable alterations in multiple genes; and disorders, such as many cancers, that result from DNA mutations acquired during a person’s lifetime. (For more information on the genetics of alcoholism, see the articles by Goate, pp. 217–220, and Grisel and Crabbe, pp. 220–227.)


Artist’s rendering of the DNA molecule from a single cell.

Although scientists have performed many of these tasks and experiments for decades, the Human Genome Project is unique and remarkable for the enormity of its effort. The human genome contains 3 billion DNA building blocks (i.e., nucleotides), enough to fill approximately one thousand 1,000-page telephone books if each nucleotide is represented by one letter. Given the size of the human genome, researchers must develop new methods for DNA analysis that can process large amounts of information quickly, cost-effectively, and accurately. These techniques will characterize DNA for family studies of disease, create genomic maps, determine the nucleotide sequence of genes and other large DNA fragments, identify genes, and enable extensive computer manipulations of genetic data.
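The telephone-book comparison above reduces to simple arithmetic. The sketch below checks it under one assumption of ours, not stated in the text: that roughly 3,000 letters fit on a printed page.

```python
# Back-of-the-envelope check of the genome-size comparison.
# Assumption (ours): about 3,000 letters fit on one telephone-book page.
genome_nucleotides = 3_000_000_000
letters_per_page = 3_000
pages_per_book = 1_000

pages_needed = genome_nucleotides // letters_per_page  # 1,000,000 pages
books_needed = pages_needed // pages_per_book          # 1,000 books
print(books_needed)
```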

Focus of the Human Genome Project

The primary work of the Human Genome Project has been to produce three main research tools that will allow investigators to identify genes involved in normal biology as well as in both rare and common diseases. Together, these tools form the basis of a gene-hunting strategy known as positional cloning ( Collins 1992 ). This approach enables researchers to search for disease-linked genes directly in the genome without first having to identify the gene’s protein product or function. (See the article by Goate, pp. 217–220.) Since 1986, when researchers first found the gene for chronic granulomatous disease 2 through positional cloning, this technique has led to the isolation of more than 40 disease-linked genes and will allow the identification of many more in the future ( table 1 ).

Table 1. Disease Genes Identified Using Positional Cloning

Each of the three tools being developed by the Human Genome Project helps bring the specific gene being sought into better focus (see sidebar , pp. 192–193). The first of these tools, the genetic map, consists of thousands of landmarks—short, distinctive pieces of DNA—more or less evenly spaced along the chromosomes. With this tool, researchers can narrow the location of a gene to a region of the chromosome. Once this region has been identified, investigators turn to a second tool, the physical map, to further pinpoint the specific gene. Physical maps are sets of overlapping DNA fragments that may span an entire chromosome. These sets are cloned and frozen for future research. Once the physical map is complete, investigators will simply be able to go to the freezer and pick out the actual piece of DNA needed, rather than search through the chromosomes all over again. The final tool will be the creation of a complete sequence map of the DNA nucleotides, which will contain the exact sequence of all the DNA that makes up the human genome.

Genetic Maps Provide Blueprint for Human Genome

A primary focus of the Human Genome Project is to develop tools that will enable investigators to analyze large amounts of hereditary material quickly and efficiently. The success of this project hinges on the accurate mapping of each chromosome. The Human Genome Project is using primarily three levels of maps, each of which helps to increase understanding not only of the construction of individual genes but also of their relation to each other and to the entire chromosomal structure.

Genetic Mapping

Genetic mapping, also called linkage mapping, provides the first evidence that a disease or trait (i.e., a characteristic) is linked to the gene(s) inherited from one’s parents. Through genetic mapping, researchers can approximate the location of a gene to a specific region on a specific chromosome; the process is like establishing towns on a road map ( figure 1 ). For example, Interstate 10 runs from Florida to California. It would be difficult to find a landmark along that highway if the only cities mapped were Jacksonville and Los Angeles. It would be much easier, however, to pinpoint the landmark if one knew that it was located between markers that are closer together (e.g., El Paso and San Antonio).

Genetic mapping begins with the collection of blood or tissue samples from families in which a disease or trait is prevalent. After extracting the DNA from the samples, researchers track linearly the frequency of a recurring set of nucleotides (represented, for example, by the letters “CACACA”) along a region of a chromosome. If this sequence is shared among family members who have the disease, the scientists may have identified a marker for the disease-linked gene. Mapping additional DNA samples from other people with and without the disease allows researchers to determine the statistical probability that the marker is linked to the development of the disease.
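The repeat-tracking idea described above can be illustrated with a toy sketch: measuring how many "CA" units occur in a run within a DNA string. This is only an illustration of the concept; the sequences and the function name below are invented, and real genotyping compares marker lengths across many family members using statistical linkage analysis.

```python
import re

def longest_ca_repeat(seq: str) -> int:
    """Return the length, in 'CA' repeat units, of the longest CA run in seq."""
    runs = re.findall(r"(?:CA)+", seq.upper())
    return max((len(r) // 2 for r in runs), default=0)

# Invented toy sequences around a candidate marker in two family members.
affected = "GGTTCACACACACATTGA"   # carries a 5-unit CA repeat
unaffected = "GGTTCACATTGA"       # carries a 2-unit CA repeat

print(longest_ca_repeat(affected), longest_ca_repeat(unaffected))
```

If the longer allele consistently travels with the disease through a family, the marker may lie near a disease-linked gene.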


Genetic Map. Just as locating a landmark on a particular highway is easier if one can narrow the area of the search to between two nearby points, or markers (e.g., El Paso and San Antonio on Interstate 10), researchers first try to narrow their search for particular genes to a segment of chromosome denoted by a specific sequence of nucleotides (e.g., CACACA).

Physical Mapping

Physical mapping generates sets of overlapping DNA fragments that span regions of—or even whole—chromosomes. These DNA fragments, which can be isolated and stored for future analysis ( figure 2 ), serve as a resource for investigators who want to isolate a gene after they have mapped it to a particular chromosome or chromosomal region. The physical map allows scientists to limit the gene search to a particular subregion of a chromosome and thus zero in on their target more rapidly.

One early goal of the physical mapping component of the Human Genome Project was to isolate contiguous DNA fragments that spanned at least 2 million nucleotides. Considerable progress has been made in this area, with sets of contiguous DNA fragments (“contigs”) now frequently ranging from 20 to 50 million nucleotides in length. Because the order of DNA fragments in a physical map should reflect their actual order on a chromosome, correct alignment of contigs also requires a set of markers to serve as mileposts, similar to those of an interstate highway. Genome scientists have developed a physical map that currently contains about 23,000 markers, called sequence tagged sites (STS’s). Scientists likely will meet their ultimate goal of establishing 30,000 STS markers on the physical map—one every 100,000 nucleotides—within the next year or two. This detailed STS map will allow researchers to pinpoint the exact location of any gene within 50,000 nucleotides of an STS marker.
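The spacing arithmetic implied by this goal is easy to verify: 30,000 evenly spaced STS markers across a 3-billion-nucleotide genome put one marker every 100,000 nucleotides, so no position lies more than half that spacing from its nearest marker. A minimal check:

```python
# Marker-spacing arithmetic for the STS physical map.
genome_size = 3_000_000_000
sts_markers = 30_000

spacing = genome_size // sts_markers  # nucleotides between adjacent markers
max_distance = spacing // 2           # farthest any position can be from a marker

print(spacing, max_distance)  # 100000 50000
```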


Physical Map. Using various methods, A) whole chromosomes are B) snipped into large fragments of DNA (i.e., sequences of nucleotides) and then cloned. C) These cloned DNA pieces then are realigned in the order in which they originally occurred in the chromosomes and stored. The stored pieces can be used for further studies such as D) finding specific genes.


Part of the DNA sequence map of a virus containing 10,000 nucleotide bases. For comparison, the human genome contains approximately 3 billion nucleotide bases.

Researchers also are attempting to use fragments of expressed genes known as expressed sequence tags (EST’s), which are made from complementary DNA, as markers on the physical genome map. By using EST’s, they hope to increase the power of maps for finding specific genes. A recent collaboration between Merck and Co. (a major pharmaceutical corporation) and researchers at Washington University in St. Louis, Missouri, will provide a resource for placing tens of thousands of such markers derived from actual genes on the physical map.

Marker development to be used in creating both the linkage and the physical maps also takes into account the need for connectivity between these two types of maps. Information learned from one stage of the gene-finding process must be easily translatable to the next.

The DNA Sequence Map

The Human Genome Project’s most challenging goal is to determine the order (i.e., sequence), unit by unit, of all 3 billion nucleotides that make up the human genome. Once the genetic and physical maps are completed, a sequence map can be constructed, which will allow scientists to find genes, characterize DNA regions that control gene activity, and link DNA structure to its function.

To date, the technology for this work has been developed and implemented primarily in model organisms. For example, researchers now have sequenced 25 million DNA nucleotides from the roundworm—about 25 percent of the animal’s genome—and, in the process, have increased their annual sequencing rate to 11 million nucleotide bases ( figure 3 ). The investigators expect to finish sequencing the roundworm genome by the end of 1998. The complete DNA sequence of yeast and E. coli genomes will be determined even sooner.

—Francis S. Collins and Leslie Fink

To make all this information available to researchers worldwide, the project has the additional goal of developing computer methods for easy storage, retrieval, and manipulation of data. Moreover, because researchers often can obtain valuable information about human genes and their functions by comparing them with the corresponding genes of other species, the project has set goals for mapping and sequencing the genomes of several important model organisms, such as the mouse, rat, fruit fly, roundworm, yeast, and the common intestinal bacterium E. coli .

Technological Advances in Genomic Research

The need for large-scale approaches to DNA sequencing has pushed technology toward both increasing capacity and decreasing instrument size. This demand has led, for example, to the development of automated machines that reduce the time and cost of the biochemical processes involved in sequencing, improve the analysis of these reactions, and facilitate entering the information obtained into databases. Robotic instruments also have been developed that expedite repetitive tasks inherent in large-scale research and reduce the chance for error in several sequencing and mapping steps.

Miniaturization technology is facilitating the sequencing of more—and longer—DNA fragments in less time and increasing the portability of sequencing processes, a capability that is particularly important in clinical or field work. In 1994, for example, the National Institutes of Health (NIH), through its National Center for Human Genome Research (NCHGR), began a new initiative for the development of microtechnologies to reduce the size of sequencing instrumentation and thereby increase the speed of the sequencing process. NCHGR also is exploring new strategies for minimizing time-consuming sequencing bottlenecks by developing integrated, matched components that will help ensure that each step in the sequencing process proceeds as efficiently as possible. The overall sequencing rate is only as fast as its slowest step.

Other developments in DNA sequencing have aimed to reduce the costs associated with the technology. Through refinements in current sequencing methods, the cost has been lowered to about $0.50 per nucleotide. Research on new DNA sequencing techniques is addressing the need for rapid, inexpensive, large-scale sequencing processes for comparison of complex genomes and clinical applications. Further improvements in the efficiency of current processes, along with the development of entirely new approaches, will enable researchers to determine the complete sequence of the human genome perhaps before the year 2005.
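The per-nucleotide figure above implies a rough total for a single pass over the genome. This is a back-of-the-envelope estimate, not a statement of the project's actual budget:

```python
# Rough cost implication of the per-nucleotide sequencing price quoted above.
cost_per_nucleotide = 0.50          # dollars per nucleotide
genome_nucleotides = 3_000_000_000  # nucleotides in the human genome

total_cost = cost_per_nucleotide * genome_nucleotides
print(f"${total_cost:,.0f}")  # $1,500,000,000
```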

Applications of the Human Genome Project

The detailed genetic, physical, and sequence maps developed by the Human Genome Project also will be critical to understanding the biological basis of complex disorders resulting from the interplay of multiple genetic and environmental influences, such as diabetes; heart disease; cancer; and psychiatric illnesses, including alcoholism. In 1994, for example, researchers used genetic maps to discover at least five different chromosome regions that appear to play a role in insulin-dependent (i.e., type 1) diabetes ( Davies et al. 1994 ). Analyses to identify the genetic components of these complex diseases require high-resolution genetic maps and must be conducted on a scale much larger than was previously possible. Automated microsatellite marker technology 3 now makes it possible to determine the genetic makeup (i.e., the genotype) of enough subjects so that genes for common diseases can be mapped reliably in a reasonable amount of time. NCHGR is planning a technologically advanced genotyping facility to assist investigators in designing research studies; performing genetic analyses; and developing new techniques for analyzing common, multigene diseases.

Molecular Medicine

Efforts to understand and treat disease processes at the DNA level are becoming the basis for a new molecular medicine. The discovery of disease-associated genes provides scientists with the foundation for understanding the course of disease, treating disorders with synthetic DNA or gene products, and assessing the risk for future disease. Thus, by going directly to the genetic source of human illness, molecular medicine strategies will offer a more customized health management based on the unique genetic constitution of each person. Molecular medicine also will increase clinicians’ focus on prevention by enabling them to predict a person’s risk for future disease and offer prevention or early treatment strategies. This approach will apply not only to classical, single-gene hereditary disorders but also to more common, multi-gene disorders, such as alcoholism.

During the past 3 years, positional cloning has led to the isolation of more than 30 disease-associated genes. Although this number has increased dramatically, compared with the years predating the Human Genome Project, it is still a small fraction of the 50,000 to 100,000 genes that await discovery in the entire genome. NCHGR has helped develop efficient biological and computer techniques to identify all the genes in large regions of the genome. One technique was used successfully last year to isolate BRCA1 , the first major gene linked to inherited breast cancer. The location of BRCA1 first was narrowed to a DNA fragment of several hundred thousand nucleotides containing many genes. A process that isolates the protein-coding sequences of a gene (i.e., exon trapping) allowed researchers to identify and examine not only the correct BRCA1 gene in that region but also several new genes that now serve as disease-gene candidates for future investigations.

Diagnostics

Clinical tests that detect disease-causing mutations in DNA are the most immediate commercial application of gene discovery. These tests may positively identify the genetic origin of an active disease, foreshadow the development of a disease later in life, or identify healthy carriers of recessive diseases such as cystic fibrosis. 4 Genetic tests can be performed at any stage of the human life cycle with increasingly less invasive sampling procedures. Although DNA testing offers a powerful new tool for identifying and managing disease, it also poses several medical and technical challenges. The number and type of mutations for a particular disease may be few, as in the case of achondroplasia, 5 or many, as in the case of cystic fibrosis and hereditary breast cancer. Thus, it is essential to establish for each potential DNA test how often it detects disease-linked mutations and how often and to what degree detection of mutations correlates with the development of disease.

Therapeutics

Gene discovery also provides opportunities for developing gene-based treatment for hereditary and acquired diseases. These treatment approaches range from the mass production of natural substances (e.g., blood-clotting factors, growth factors and hormones, and interleukins and interferons 6 ) that are effective in treating certain diseases to gene-therapy strategies. Gene therapy is designed to deliver DNA carrying a functional gene to a patient’s cells or tissues and thereby correct a genetic alteration.

Currently, more than 100 companies conduct human clinical trials on DNA-based therapies ( Pharmaceutical Research and Manufacturers of America [PRMA] 1995 ). The top U.S. public biotechnology companies have an estimated 2,000 drugs in early development stages ( Ernst and Young 1993 ). Since 1988, NIH’s Recombinant DNA Advisory Committee has approved more than 100 human gene-therapy or gene-transfer protocols (Office of Recombinant DNA Activities, NIH, personal communication, April 1995). Seventeen gene-therapy products are now in commercial development for hereditary disorders, cancer, and AIDS ( PRMA 1995 ).

Ethical, Legal, and Social Concerns of the Human Genome Project

Implications for disease detection.

The translation of human genome technologies into patient care brings with it special concerns about how these tools will be applied. A principal arena in which psychosocial issues related to these technologies are being raised is the testing of people who may be at risk for a genetically transmitted disease but who do not yet show the disease’s symptoms (i.e., are asymptomatic). These concerns stem largely from the delay between scientists’ technical ability to develop DNA-based diagnostic tests that can identify a person’s risk for future disease and their ability to develop effective prevention or treatment strategies for the disorders those tests portend. In the meantime, people who undergo genetic tests run the risk of discrimination in health insurance and may have difficulty adapting to test results—particularly in families in which hereditary disease is common—regardless of whether a test indicates future disease. When no treatment is available and when no other medical course of action can be taken on the basis of such tests, the negative social, economic, and psychological consequences of knowing one’s medical fate must be carefully evaluated in light of the meager medical benefits of such knowledge.

To help ensure that medical benefits are maximized without jeopardizing psychosocial and economic well-being, the Human Genome Project, from its beginning, has allocated a portion of its research dollars to study the ethical, legal, and social implications (ELSI) of the new genetic technologies. A diverse funding program supports research in four priority areas: the ethical issues surrounding the conduct of genetic research, the responsible integration of new genetic technologies into the clinic, the privacy and fair use of genetic information, and professional and public education about these issues.

Because of the many unresolved questions surrounding DNA testing in asymptomatic patients, in 1994 NCHGR’s advisory body released a statement urging health care professionals to offer DNA testing for the predisposition to breast, ovarian, and colon cancers only within approved pilot research programs until more is known about the science, psychology, and sociology of genetic testing for some diseases ( National Advisory Council for Human Genome Research 1994 ). The American Society of Human Genetics and the National Breast Cancer Coalition have issued similar statements. More recently, the NIH–DOE [Department of Energy] Working Group on ELSI launched a task force to perform a comprehensive, 2-year evaluation of the current state of genetic testing technologies in the United States. The task force will examine safety, accuracy, predictability, quality assurance, and counseling strategies for the responsible use of genetic tests.

In a related project, NCHGR’s ELSI branch spearheaded a new group of pilot studies shortly after researchers isolated BRCA1 and several genes for colon cancer predisposition. These 3-year studies are examining the psychosocial and patient-education issues related to testing healthy members of families with high incidences of cancer for the presence of mutations that greatly increase the risk of developing cancer. The results will provide a thorough base of knowledge on which to build plans for introducing genetic tests for cancer predisposition into medical practice.

Implications for Complex Traits

Research in human genetics focuses not only on the causes of disease and disability but also on genes and genetic markers that appear to be associated with other human characteristics, such as height, weight, metabolism, learning ability, sexual orientation, and various behaviors ( Hamer et al. 1993 ; Brunner et al. 1993 ). Associating genes with human traits that vary widely in the population raises unique and potentially controversial social issues. Genetic studies elucidate only one component of these complex traits. The findings of these studies, however, may be interpreted to mean that such characteristics can be reduced to the expression of particular genes, thus excluding the contributions of psychosocial or environmental factors. Genetic studies can also be interpreted in a way that narrows the range of variation considered “normal” or “healthy.”

Both reducing complex human characteristics to the role of genes and restricting the definition of what is normal can have harmful—even devastating—consequences, such as the devaluation of human diversity and social discrimination based on a person’s genetic makeup. The Human Genome Project must therefore foster a better understanding of human genetic variation among the general public and health care professionals as well as offer research policy options to prevent genetic stigmatization, discrimination, and other misuses and misinterpretations of genetic information.

Progress on Genetic and Physical Maps

In the United States, NCHGR and DOE, through its Office of Health and Environmental Research, are the primary public supporters of major genome research programs. In 1990, when the 15-year Human Genome Project began, NCHGR and DOE established ambitious goals to guide the research through its first years ( U.S. Department of Health and Human Services and U.S. Department of Energy 1990 ). After nearly 6 years, scientists involved in the Human Genome Project have met or exceeded most of those goals—some ahead of time and all under budget. Because scientific advances may rapidly make the latest technologies obsolete, a second 5-year plan was published in 1993 ( Collins and Galas 1993 ) to keep ahead of the project’s progress. Already, further technological advances make it likely that a new plan will be needed, perhaps as early as this year.

In 1994, an international consortium headed by the Cooperative Human Linkage Center in Iowa published a genetic map of the human genome containing almost 6,000 markers spaced less than 1 million nucleotides apart ( Cooperative Human Linkage Center et al. 1994 ). This map was completed more than 1 year ahead of schedule, and its density of markers is four to six times greater than that called for by the 1990 goals. This early achievement is largely a result of the discovery and development of microsatellite DNA markers and of large-scale methods for marker isolation and analysis.

In a related project, technology developed so quickly that a high resolution genetic map of the mouse genome was completed in just 2 years. NCHGR is now helping to coordinate an initiative with other NIH institutes, particularly the National Heart, Lung, and Blood Institute and the National Institute on Alcohol Abuse and Alcoholism, to develop a high-resolution genetic map of the rat, a useful model for studying complex disorders such as hypertension, diabetes, and alcoholism.

The original 5-year goal to isolate contiguous DNA fragments that span at least 2 million nucleotides was met early on; soon, more than 90 percent of the human genome will be accounted for using sets of overlapping DNA fragments, each of which is at least 10 million nucleotides long. Complete physical maps now exist for human chromosomes 21, 22, and Y. Nearly complete maps have been developed for chromosomes 3, 4, 7, 11, 12, 16, 19, and X. 7

As the end of the first phase of the Human Genome Project draws near, its impact already is rippling through basic biological research and clinical medicine. From deciphering information in genes, researchers have gained new knowledge about the nature of mutations and how they cause disease. Even after someday identifying all human genes, scientists will face the daunting task of elucidating the genes’ functions. Furthermore, new paradigms will emerge as researchers and clinicians understand interactions between genes, the molecular basis of multigene disorders, and even tissue and organ function.

The translation of this increasing knowledge into improved health care already is under way; however, the value of gene discovery to the promising new field of molecular medicine will be fully realized only when the public is secure in the use of genetic technologies.

1 For a definition of this and other technical terms used in this article, see central glossary, pp. 182–183.

2 Chronic granulomatous disease is an inherited disease of the immune system.

3 Microsatellite markers are short DNA sequences that vary in length from person to person. The length of a particular marker is inherited from one’s parents, allowing researchers to track the markers through several generations of the same family.

4 For a recessive disease to develop, a person must inherit two altered gene copies, one from each parent. People who inherit only one altered gene copy usually are healthy (i.e., they do not show symptoms of the disease); these people are called asymptomatic carriers.

5 Achondroplasia is a disorder that results in defective skeletal development in the fetus and dwarfism. Affected children often die before or within their first year of life.

6 Interleukins and interferons are substances that stimulate and regulate the immune system.

7 Of the 23 chromosome pairs in human cells, 22 pairs are numbered according to their size, with chromosome 1 being the largest and chromosome 22 being the smallest chromosome. The gender-determining chromosomes are referred to as X and Y.

  • Brunner HG, Nelen M, Breakefield XO, Ropers HH, van Oost BA. Abnormal behavior associated with a point mutation in the structural gene for monoamine oxidase A. Science. 1993;261:321–327.
  • Collins FS. Let’s not call it reverse genetics. Nature Genetics. 1992;1:3–6.
  • Collins FS, Galas D. A new five-year plan for the U.S. human genome project. Science. 1993;262:43–46.
  • Cooperative Human Linkage Center (CHLC): Murray JC, Buetow KH, Weber JL, Ludwigsen S, Scherpbier-Heddema T, Manion F, Quillen J, Sheffield VC, Sunden S, Duyk GM, et al. A comprehensive human linkage map with centimorgan density. Science. 1994;265:2049–2054.
  • Davies JL, Kawaguchi Y, Bennett ST, Copeman JB, Cordell HJ, Pritchard LE, Reed PW, Gough SC, Jenkins SC, Palmer SM, et al. A genome-wide search for human type 1 diabetes susceptibility genes. Nature. 1994;371:130–136.
  • Ernst and Young. Biotech 94: Long-Term Value, Short-Term Hurdles. The Industry Annual Report. 1993. p. 31.
  • Hamer DH, Hu S, Magnuson VL, Hu N, Pattatucci AML. A linkage between DNA markers on the X chromosome and male sexual orientation. Science. 1993;262:578–580.
  • National Advisory Council for Human Genome Research. Statement on use of DNA testing for presymptomatic identification of cancer risk. Journal of the American Medical Association. 1994;271(10):785.
  • Pharmaceutical Research and Manufacturers of America. Biotechnology Medicines in Development. March 1995.
  • U.S. Department of Health and Human Services and U.S. Department of Energy. Understanding Our Genetic Inheritance. The U.S. Human Genome Project: The First Five Years. Bethesda, MD: National Institutes of Health; 1990. NIH Publication No. 90–1590.

Introduction to Life Sciences – Week 4 Assignment

Human Genome Project

Research the purpose and history of the Human Genome Project. Present your findings on the Human Genome Project and discuss the benefits and potential drawbacks of the project. Provide an analysis of the implications of understanding the human genome in its entirety.

Your assignment should be 500 words in length.



Human Genome Project

The Human Genome Project (HGP) was an international scientific research project, completed in 2003, that sequenced the entire human genome of 3.3 billion base pairs. The HGP spurred the growth of bioinformatics, which is now a vast field of research. The successful sequencing of the human genome has helped solve the mystery of many human disorders and given us ways to cope with them.

Goals of the human genome project

Goals of the human genome project include:

  • Sequencing the entire human genome.
  • Identifying all the genes in the human genome.
  • Creating genome sequence databases to store the data.
  • Improving tools for data analysis.
  • Addressing the legal, ethical, and social issues that the project may raise.

Methods of the human genome project

Two major approaches were used in this project.

  • Expressed sequence tags (ESTs), in which only those parts of the genome that are expressed as RNA were identified and sequenced.
  • Sequence annotation, in which the entire genome was sequenced first and functional tags were assigned to its regions later.

The process of the human genome project

  • The complete DNA was isolated from a cell.
  • It was then split into small fragments.
  • These fragments were amplified using vectors, most commonly BACs (bacterial artificial chromosomes) and YACs (yeast artificial chromosomes).
  • The amplified fragments were sequenced using automated DNA sequencers.
  • The sequences were then arranged in order on the basis of their overlapping regions.
  • The assembled genome sequence was stored in computer databases.
  • In this way the entire genome was sequenced and stored as a genome database. Genome mapping was the next goal, achieved with the help of microsatellites (short repetitive DNA sequences).
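The overlap-based ordering step above can be sketched as a toy greedy assembler: repeatedly merge the two fragments with the largest suffix–prefix overlap. This is a simplified illustration only; real genome assembly must handle sequencing errors, repeats, and both DNA strands, and the fragments below are invented.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments: list[str]) -> str:
    """Merge the pair with the largest overlap until one sequence remains."""
    frags = fragments[:]
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    k = overlap(frags[i], frags[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0]

# Three invented fragments with 3-nucleotide overlaps.
print(greedy_assemble(["ATTAGAC", "GACCTG", "CTGAAA"]))  # ATTAGACCTGAAA
```

On the toy fragments, the two three-letter overlaps ("GAC" and "CTG") are found and merged to reconstruct a single contiguous sequence.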

Salient features of the human genome revealed by the project include:

  • The entire genome is made up of 3,164.7 million base pairs.
  • On average, a gene consists of about 3,000 nucleotides.
  • The function of more than 50 percent of the discovered genes is yet to be determined.
  • Less than 2 percent of the genome codes for proteins.
  • Most of the genome is made up of repetitive sequences with no specific coding purpose; such sequences may nevertheless help us better understand how the human genome evolved.
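The proportions in the list above imply an upper bound on the total protein-coding sequence. This is a rough estimate derived only from the figures given:

```python
# Upper bound on protein-coding DNA implied by the figures above.
genome_bp = 3_164_700_000  # total base pairs in the genome
coding_fraction = 0.02     # "less than 2 percent" codes for protein

max_coding_bp = genome_bp * coding_fraction
print(round(max_coding_bp / 1e6, 1))  # at most ~63.3 million base pairs
```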

Applications of HGP

As the goals of the human genome project were achieved, research advanced greatly. Today, if a disease arises from an alteration in a certain gene, the change can be traced and compared against the existing genome database. In this way, a more rational approach can be taken to diagnose the problem and treat it with greater ease.


Human Genome Project

Human Genome Project: An Introduction

The Human Genome Project is based on the fact that isolating and analysing the genetic material contained in DNA can provide scientists with powerful new approaches to understanding disease development and developing new strategies for disease prevention and treatment. Except for physical injuries, nearly all human medical conditions are linked to changes (i.e., mutations) in the structure and function of DNA. The HGP accelerated the growth of bioinformatics, a vast field of study.

The project's primary goal was to create research tools that enable scientists to identify genes involved in rare and common diseases. In this article, we will study the various features of this megaproject, its applications in various fields, and the steps scientists took to sequence the whole genome.

What is the Human Genome Project?

The Human Genome Project is an international research project whose primary goal was to decipher the chemical sequence of the entire human genetic material (i.e., the entire genome). It set out to identify all of the genes contained within the genome (initially estimated at 50,000 to 100,000) and to provide research tools to analyse all of this genetic information.

After the US government picked up the idea in 1984 and began planning, the project was formally launched in 1990 and completed in 2003.

The National Institutes of Health (NIH) of the United States, as well as numerous other organisations from around the world, provided funding.

The Human Genome Project (HGP) aims to determine the sequence of chemical base pairs that comprise human DNA, map the entire human genome, and identify its complex structures and functions.

Differences in the genetic make-up are caused by differences in DNA nucleotide sequences. The goal of scientists has always been to map the human genome. Advances in genetic engineering techniques have made it possible to isolate and clone DNA fragments and determine their nucleotide sequences.

The HGP has transformed biology with its multidisciplinary approach to deciphering a reference human genome sequence.

This audacious endeavour resulted in the creation of novel technologies and analytical tools.

Finally, the HGP has inspired several other exciting projects that have the potential to open up new avenues in biology, medicine, and psychology.

Aim and Objective of the Human Genome Project

To sequence the whole genome of about 3 billion base pairs.

To create a physical map of the human genome.

To store this information in the database.

To improve the tools for data analysis.

To transfer this information to the other related industries.

To solve any ethical, legal, or social issues regarding this project.

To make the information available to all the researchers.

Steps of the Human Genome Project

The whole DNA of the cell was isolated and randomly broken into fragments.

The fragments were inserted into special vectors like BAC (Bacterial Artificial Chromosomes) and YAC (Yeast Artificial Chromosomes).

These fragments were then cloned in suitable hosts like bacteria and yeast.

The polymerase chain reaction (PCR) was used to make copies of the DNA fragments.

The fragments were sequenced using Sanger sequencing.

The sequences were then arranged based on the overlapping regions.

The sequences were then annotated and assigned to different chromosomes.

The genetic and physical maps were also made with the help of the polymorphism of microsatellites and restriction endonucleases.
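
The PCR step above amplifies each fragment exponentially: every cycle roughly doubles the number of template copies. A back-of-the-envelope sketch, assuming an idealized reaction (real reactions fall short of 100% efficiency):

```python
# Idealized PCR yield: each cycle multiplies the template count by
# (1 + efficiency). With perfect efficiency (1.0), copies double per cycle.

def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected copy number after `cycles` rounds of amplification."""
    return initial_copies * (1 + efficiency) ** cycles

print(pcr_copies(1, 30))        # one template, 30 cycles, perfect doubling: 2**30 copies
print(pcr_copies(1, 30, 0.9))   # a more realistic 90% per-cycle efficiency
```

Even a single template molecule yields over a billion copies after 30 ideal cycles, which is why PCR made fragment sequencing practical.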

Steps used in Human Genome Project Image

Salient Features of the Human Genome Project

The human genome is made up of 3164.7 million nucleotides.

The average gene is 3000 base pairs long. The largest known gene, dystrophin (mutated in Duchenne muscular dystrophy), lies on the X-chromosome and spans 2.4 million base pairs (2,400 kilobases). The genes for β-globin and insulin are less than 10 kilobases long.

The human genome contains approximately 30,000 genes. It was previously estimated to contain 80,000 to 100,000 genes. The number of genes in humans is roughly equal to that of mice.

More than half of the discovered genes' functions are unknown.

Proteins are coded for in less than 2% of the genome.

Repetitive sequences are nucleotide sequences that are repeated hundreds or thousands of times. They do not code for proteins but provide information about chromosome structure, dynamics, and evolution.

Approximately 1 million copies of short 5-8 base pair repeated sequences are clustered around centromeres and near the ends of chromosomes. They represent junk DNA.

Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).

In humans, there are approximately 1.4 million locations where single-base DNA differences (SNPs, single nucleotide polymorphisms) occur.
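
Taken together, the figures above imply roughly one SNP site per couple of thousand bases. A quick check of that arithmetic:

```python
# Back-of-the-envelope SNP density from the figures quoted above.
genome_bp = 3164.7e6   # 3164.7 million base pairs
snp_sites = 1.4e6      # ~1.4 million single-base difference (SNP) locations

bases_per_snp = genome_bp / snp_sites
print(round(bases_per_snp))  # roughly one SNP site per ~2,260 bases
```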

Structure of DNA

The Technique Used in HGP

The Human Genome Project used Sanger sequencing to determine the sequences of relatively small fragments of human DNA (900 bp or less).

These fragments were then used to piece together larger DNA fragments and, eventually, entire chromosomes.

The advancement of next-generation sequencing (NGS) technologies has accelerated genomics research.

Applications of Human Genome Project

Gene discovery opens up the possibility of developing gene-based treatments for both hereditary and acquired diseases.

Its detailed genetic, physical, and sequence maps will also be critical in understanding the biological basis of complex disorders caused by the interaction of multiple genetic and environmental influences, such as diabetes, heart disease, cancer, and psychiatric illnesses such as alcoholism.

It helps in the identification of mutations linked to different forms of cancer.

It also helps in advancing research in Forensic Sciences.

Agriculture, environmental science, and biotechnology are other fields that have benefitted from the Human Genome Project.

Diversity of Genomic Applications to Various Fields

The Human Genome Project (HGP) is an international scientific research project that aimed to identify, map, and sequence all of the genes in the human genome from both a physical and functional standpoint. Each individual's "genome" is unique; mapping the "human genome" requires sequencing a small number of individuals and then assembling these to obtain a complete sequence for each chromosome. As a result, the completed human genome is a combination that does not represent any single individual.

FAQs on Human Genome Project

1. What is a draft genome sequence vs a finished genome sequence?

A draft and a finished genome sequence differ in coverage, number of gaps, and error rate. The draft sequence covered about 90% of the genome with an error rate of 1 in 1,000 base pairs, contained over 150,000 gaps, and only 28% of the genome had reached finished quality. By April 2003, the finished sequence had around 400 gaps, 99% of the genome at finished quality, and an error rate of 1 in 10,000 base pairs.

2. What is a genome?

A genome is an organism's complete set of deoxyribonucleic acid (DNA). This chemical compound carries the genetic information needed to build and maintain every organism. DNA molecules are composed of two paired, twisted strands. Each strand is made up of four chemical bases: adenine (A), thymine (T), guanine (G), and cytosine (C). A always pairs with T, and C always pairs with G, with each base bonding to its partner on the opposite strand.
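
The base-pairing rule described above (A with T, C with G, read from the opposite, antiparallel strand) can be expressed as a tiny illustrative function:

```python
# Watson-Crick base pairing: A<->T, C<->G. The partner strand runs
# antiparallel, so its sequence is the complement read in reverse.

PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Sequence of the partner strand, read 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(strand))

print(reverse_complement("ATGC"))  # -> GCAT
```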

3. What are the medical benefits of the Human Genome Project?

The Human Genome Project has been very beneficial for the field of molecular medicine. It contributed to better diagnosis and earlier detection of diseases that can be very harmful to the human body, and it also advanced gene therapy. Thanks to the project, molecular medicine of the future may treat not just the symptoms but the underlying cause of a disease. Thus, the Human Genome Project has proved to be very beneficial for medicine.

4. What is Whole Genome sequencing?

Whole-genome sequencing, also known as complete, full, or entire genome sequencing, is the process of determining the entire DNA sequence of an organism's genome. This includes sequencing all of the chromosomal DNA as well as the DNA contained in the mitochondria and, in plants, the chloroplast. It is the most comprehensive form of sequencing and underpins many related methods.

5. How is the Human Genome profile calculated?

The human genome profile is based on the distribution of introns and exons. Introns are the sequences that separate a gene's protein-coding sequences; exons are the protein-coding sequences themselves. According to the Human Genome Project, the percentage of introns ranged between 24% and 37%, and 80% of the exons on each chromosome are smaller than 200 bp in length. The total amount of intergenic and intronic DNA on each chromosome is correlated with the size of that chromosome. In this way, scientists characterized the sequence of chemical bases that forms human DNA.

6. What was an economic drawback of the Human Genome program?

One of the main drawbacks of this project concerned insurance. People feared that employers and health insurance companies would refuse coverage because of a health risk indicated by someone's genes. In response, the government passed the Health Insurance Portability and Accountability Act to protect people from the unauthorized release of their health information. For the Human Genome Program, the government set a starting budget of USD 1.57 million in 1990, which was later increased to around USD 18 million by 2014.

7. What is the difference between physical mapping and genetic mapping?

There are two types of maps used in genome mapping: genetic and physical. The differences include:

Genetic maps are based on genetic linkage information, whereas physical maps are based on actual physical distances as measured by the number of base pairs. 

The two most important factors in genetic mapping are genetic markers and the size of the mapping population. However, physical mapping necessitates the fragmentation of the genome, either through restriction digestion or physical shattering.

Genetic maps frequently provide insights into the nature of different chromosomal regions, whereas physical maps provide a more accurate representation of the genome.

8. What are the future challenges of the Human Genome Project?

Various challenges faced by scientists are:

It is a massive task that will necessitate the expertise and creativity of many people from various disciplines in both the public and private sectors worldwide.

New high-throughput technologies and a large sum of money will be required.

Moving, analysing, interpreting, and storing large amounts of genetic data requires significant resources and costs, many of which are currently beyond the capabilities of the majority of routine diagnostic laboratories.

Scientists must keep in mind the ELSI (Ethical, Legal and Social Issues).

9. Which topics are frequently asked about from the Human Genome Project in the examination?

The Human Genome Project is one of the most interesting topics in molecular biology, and you can expect one or two questions on it in the examination. Questions generally cover the process and key steps of the project, the Human Genome Project diagram, and the roles of different enzymes and tools: PCR, sequencing, restriction endonucleases, and vectors such as bacterial artificial chromosomes and yeast artificial chromosomes. Students are advised to make proper Human Genome Project notes for easy revision and recall.



  • Open access
  • Published: 22 May 2024

Mapping medically relevant RNA isoform diversity in the aged human frontal cortex with deep long-read RNA-seq

Bernardo Aguzzoli Heberle, J. Anthony Brandon, Madeline L. Page, Kayla A. Nations, Ketsile I. Dikobe, Brendan J. White, Lacey A. Gordon, Grant A. Fox, Mark E. Wadsworth, Patricia H. Doyle, Brittney A. Williams, Edward J. Fox, Anantharaman Shantaraman, Mina Ryten, Sara Goodwin, Elena Ghiban, Robert Wappel, Senem Mavruk-Eskipehlivan, Justin B. Miller, Nicholas T. Seyfried, Peter T. Nelson, John D. Fryer & Mark T. W. Ebbert

Nature Biotechnology (2024)

  • Alzheimer's disease
  • Neural ageing
  • RNA sequencing
  • RNA splicing

Determining whether the RNA isoforms from medically relevant genes have distinct functions could facilitate direct targeting of RNA isoforms for disease treatment. Here, as a step toward this goal for neurological diseases, we sequenced 12 postmortem, aged human frontal cortices (6 Alzheimer disease cases and 6 controls; 50% female) using one Oxford Nanopore PromethION flow cell per sample. We identified 1,917 medically relevant genes expressing multiple isoforms in the frontal cortex where 1,018 had multiple isoforms with different protein-coding sequences. Of these 1,018 genes, 57 are implicated in brain-related diseases including major depression, schizophrenia, Parkinson’s disease and Alzheimer disease. Our study also uncovered 53 new RNA isoforms in medically relevant genes, including several where the new isoform was one of the most highly expressed for that gene. We also reported on five mitochondrially encoded, spliced RNA isoforms. We found 99 differentially expressed RNA isoforms between cases with Alzheimer disease and controls.


Human protein-coding genes average more than eight RNA isoforms, resulting in almost four distinct protein-coding sequences 1 , 2 . As a result of practical limitations in standard short-read sequencing technologies, researchers have historically been forced to collapse all isoforms into a single gene expression measurement, a major oversimplification of the underlying biology. Many unique isoforms from a single gene body appear to have unique interactomes at the protein level 3 . Distinct functions for individual isoforms from a single gene body have already been demonstrated for a handful of genes 4 , 5 , 6 . Notably, isoforms can play entirely different, or even opposite, roles within a given cell; a classic example includes two well-studied BCL-X ( BCL2L1 ) transcripts with opposite functions, where BCL-X L is anti-apoptotic and BCL-X S is pro-apoptotic 6 . Changes in the expression ratio between the BCL-X isoforms are implicated in cancer and are being studied as therapeutic targets 7 , demonstrating the importance of understanding individual RNA isoform function rather than treating them as a ‘single’ gene.

Knowing which tissues and cell types express each isoform is an important first step in understanding isoform function. The limitations of using short-read sequencing for studying differential RNA isoform expression/usage 8 , 9 include relying on heuristics to assemble and quantify isoforms 10 , 11 , 12 . As a result of these limitations, detailed analysis of individual isoforms has been limited to highly studied genes. In principle, long reads can sequence the entire isoforms directly 12 . However, the imperfections of long-read data 13 still require some heuristics to estimate the expression of each isoform 13 , 14 . Recent long-read RNA sequencing (RNA-seq) studies used targeted approaches to uncover aberrant splicing events in sporadic Alzheimer disease (AD) 15 , dystrophinopathies 16 and cancers 17 , 18 . Two other studies demonstrated that long-read sequencing can discover new RNA isoforms across several human tissues, including the brain 19 , 20 . Although both studies revealed important biology, including reporting new RNA isoforms, they had limited sequencing coverage (averaging <6 million aligned reads per sample). Read depth is essential to accurately quantify individual RNA isoforms, given that a total of >250,000 annotated RNA isoforms have been reported, as of July 2023 (ref. 2 ). In addition, neither of the studies focused on the medical relevance of using long-read RNA-seq. Although long-read sequencing does not resolve all challenges related to isoform sequencing (for example, those related to RNA degradation), our goal is to demonstrate the utility and importance of using long-read sequencing for both academic research and clinical diagnostics in the context of RNA isoforms (for example, reporting newly discovered RNA isoforms in medically relevant genes and variant interpretation in genes expressing multiple RNA isoforms).

In the present study, we demonstrate that RNA isoform quantification through deep long-read sequencing can be a step toward understanding the function of individual RNA isoforms, and provide insights into how they may impact human health and disease. Specifically, in addition to discovering new (that is, unannotated) RNA isoforms in known medically relevant genes, we also discovered new spliced mitochondria-encoded RNA isoforms and entirely new gene bodies in nuclear DNA and demonstrated the complexity of RNA isoform diversity for medically relevant genes within a single tissue (human frontal cortex from patients with AD and controls). Last, we showed the potential of differential RNA isoform expression analysis to reveal disease-relevant transcriptomic signatures unavailable at the gene level (that is, when collapsing all isoforms into a single expression measurement). Summary data from the present study are readily explorable through a public web application to visualize individual RNA isoform expression in aged human frontal cortex tissue ( https://ebbertlab.com/brain_rna_isoform_seq.html ).

Methodological and results overview

Traditional RNA-seq studies relied on short-read sequencing approaches that excel at quantifying gene-level expression, but cannot accurately assemble and quantify a large proportion of RNA isoforms 11 , 21 (Fig. 1a ). Thus, we sequenced 12 postmortem, aged, dorsolateral prefrontal cortex (Brodmann area 9/46) brain samples individually from six patients with AD and six cognitively unimpaired controls (50% female; Fig. 1b ). All samples had postmortem intervals <5 h and an RNA integrity score (RIN) ≥ 9.0; demographics, summary sequencing statistics and read length distributions are shown in Supplementary Table 1 and Supplementary Figs. 1 – 4 . Poly(A)-enriched complementary DNA from each sample was sequenced using one PromethION flow cell. Sequencing yielded a median of 35.5 million aligned reads per sample after excluding reads lacking the primer on either end and those with a mapping quality <10 (Extended Data Fig. 1a ). By excluding all reads missing primers, reads included in the present study should closely represent the RNA as it was at extraction.

figure 1

a , Background explaining the improvements long-read sequencing brings to the study of RNA isoforms. b , Details for experimental design, methods and a summary of the topics explored in this article. MS, mass spectrometry. Created with BioRender.com .

We performed RNA isoform quantification and discovery (including new gene bodies) using bambu 14 (Fig. 1b )—a tool with emphasis on reducing false-positive RNA isoform discovery compared with other commonly used tools 14 . Bambu was highlighted as a top performer in a recent benchmark study 13 . However, as a tradeoff for higher precision, bambu is unable to discover new RNA isoforms that only differ from annotated RNA isoforms at the transcription start and/or end site (for example, shortened 5′-UTR). When it comes to quantification, the increasing complexity of annotations can impact quantification owing to non-unique reads being split between multiple transcripts. For example, if a read maps equally well to two RNA isoforms, each isoform will receive credit for 0.5 reads.
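
The fractional-counting rule described here, where a read mapping equally well to two isoforms credits each with 0.5 reads, can be sketched as follows. This is an illustration of the idea only, not bambu's actual quantification algorithm:

```python
from collections import defaultdict

def fractional_counts(read_assignments):
    """Split each read's single count evenly across the isoforms it maps
    to equally well (illustrative sketch, not bambu's implementation)."""
    counts = defaultdict(float)
    for isoforms in read_assignments:
        share = 1.0 / len(isoforms)
        for iso in isoforms:
            counts[iso] += share
    return dict(counts)

# Three reads: one unique to isoform A, one unique to B, one ambiguous.
print(fractional_counts([["A"], ["B"], ["A", "B"]]))  # {'A': 1.5, 'B': 1.5}
```

This makes concrete why richer annotations can dilute quantification: every added compatible isoform shrinks the share each one receives from a non-unique read.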

For our 12 samples, bambu reported an average of 42.4% reads uniquely assigned to an RNA isoform and 17.5% reads spanning a full-length RNA isoform (Extended Data Fig. 1c ). We considered an isoform to be expressed above noise levels only if its median counts per million (CPM) was >1 (that is, at least half of the samples had a CPM > 1); this threshold is dependent on overall depth, because lower depths will require a higher, more stringent CPM threshold. Using this threshold, we observed 28,989 expressed RNA isoforms from 18,041 gene bodies in our samples (Extended Data Fig. 2a–c ). Of the RNA isoforms expressed with median CPM > 1, exactly 20,183 were classified as protein coding, 2,303 as long noncoding RNAs, 3,213 as having a retained intron and the remaining 3,290 were scattered across other biotypes—including new transcripts (Extended Data Fig. 3 ).
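
The CPM normalization and median-based noise filter used here can be sketched as follows (illustrative only, with made-up counts and isoform names):

```python
import statistics

def cpm(counts):
    """Counts per million: normalize a sample's raw counts by its total depth."""
    total = sum(counts.values())
    return {iso: c / total * 1e6 for iso, c in counts.items()}

def expressed_above_noise(per_sample_counts, threshold=1.0):
    """Keep isoforms whose median CPM across samples exceeds `threshold`."""
    per_sample_cpm = [cpm(c) for c in per_sample_counts]
    isoforms = per_sample_cpm[0].keys()
    return [iso for iso in isoforms
            if statistics.median(s.get(iso, 0.0) for s in per_sample_cpm) > threshold]

# Two toy samples (the study used 12 samples at far greater depth).
samples = [
    {"tx1": 900_000, "tx2": 100_000, "tx3": 0},
    {"tx1": 950_000, "tx2": 50_000, "tx3": 1},
]
print(expressed_above_noise(samples))  # tx3's median CPM is ~0.5, so it is filtered out
```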

We used publicly available mass spectrometry (MS) data from aged, human dorsolateral prefrontal cortex tissue (Brodmann area 9) 22 , 23 and human cell lines 24 to validate new RNA isoforms at the protein level, resulting in a small number of successful validations. We also leveraged existing short-read RNA-seq data from the Religious Orders Study Memory and Aging Project (ROSMAP) 25 , 26 and long-read RNA-seq data from Glinos et al. 19 to validate our newly discovered RNA isoforms and gene bodies.

Discovery of new RNA isoforms from known gene bodies

Our first goal was to identify and quantify new RNA isoforms expressed in human frontal cortex. In total, bambu discovered 1,534 new transcripts from known (that is, annotated) nuclear gene bodies. Of these 1,534 new RNA isoforms, exactly 1,106 had a median CPM ≤ 1. Although we expect that many of these new RNA isoforms with a median CPM ≤ 1 are legitimate, we consider them low-confidence discoveries and exclude them throughout the remainder of our analyses, except where explicitly noted.

After excluding all isoforms with a median CPM ≤ 1, 428 isoforms remained that we consider high confidence (Fig. 2a,b ), where 303 were from protein-coding genes (Fig. 2a ). We report substantially fewer new isoforms compared with Glinos et al. 19 (~70,000) and Leung et al. 20 (~12,000) because of: (1) differences in the reference database; (2) the discovery tool employed 13 , 27 (that is, bambu 14 versus FLAIR 28 versus Cupcake 29 ); and (3) sequencing depth and stringency in what constitutes a new isoform. Specifically, Glinos et al. 19 used gene annotations from 2016 when determining new isoforms. This is likely because they were trying to maintain consistency with previous Genotype-Tissue Expression (GTEx) releases, but approximately 50,000 new isoforms have already been annotated since then 2 . We also set a stricter threshold for high-confidence isoforms, using a median CPM > 1. Given the depth of our data, a CPM = 1 corresponds to an average of 24 observed copies (that is, counts) per sample. Exactly 297 (69.4%) of our newly discovered isoforms are unique to our data, when compared with Ensembl v.107, Glinos et al. 19 and Leung et al. 20 (Supplementary Tables 2 and 3 ).

figure 2

a – f , New transcripts from annotated gene bodies. a , Number of newly discovered transcripts across the median CPM threshold. The cutoff is shown as the dashed line set at median CPM = 1. b , Distribution of log 10 (median CPM values) for newly discovered transcripts. The dashed line shows the cutoff point of median CPM = 1. c – f , Data only from transcripts above this expression cutoff. c , Histogram showing distribution of transcript length for new transcripts from annotated gene bodies. d , Bar plot showing the distribution of the number of exons for newly discovered transcript. e , Bar plot showing the kinds of events that gave rise to new transcripts (in part created with BioRender.com ). f , Bar plot showing the prevalence of canonical splice site motifs for annotated exons from transcripts with median CPM > 1 versus new exons from new transcripts. g , Gel electrophoresis validation using PCR amplification for a subset of new RNA isoforms from known genes. This is an aggregate figure showing bands for several different gels. Each gel electrophoresis PCR experiment was independently performed once with similar results. Individual gel figures are available in Supplementary Figs. 5 – 26 . h , Protein level validation using publicly available MS proteomics data. The y axis shows the number of spectral counts from uniquely matching peptides (unique spectral counts). New transcripts from known gene bodies were considered validated at the protein level when reaching more than five unique spectral counts. i , RNA isoform structure and expression for OAZ2 transcripts (cellular growth/proliferation). The new isoform Tx572 was most expressed and validated at the protein level (highlighted with the green box). Boxplot format: median (center line), quartiles (box limits), 1.5 × interquartile range (IQR) (whiskers) ( n  = 12 biologically independent samples).

We performed a down-sampling analysis to assess the importance of depth on our discoveries. Including all discoveries (even those with median CPM ≤ 1), we discovered only 490 new isoforms from known genes with 20% of our aligned reads compared with 1,534 using 100% of our aligned reads (difference of 1,044; Extended Data Fig. 4a ). Looking only at high-confidence discoveries in known genes, we discovered 238 and 428 at 20% and 100% of reads, respectively (Extended Data Fig. 4b ), showing the importance of depth in our data. Although both annotations and read depth were important factors impacting new RNA isoform discovery, these do not explain the dramatic difference in reported discoveries between our work and that of Glinos et al. 19 . Thus, we conclude that the primary driver of these differences is the discovery tool employed. We observed a 33.8% increase in transcript discovery overlap between our dataset and GTEx when using the same tools and annotation, supporting the idea that these are large drivers of differences between our findings (Extended Data Fig. 5 ). We analyzed data from all tissue types from Glinos et al. 19 to ensure consistency between our approaches. The discovery of new isoforms unique to GTEx when using the identical pipeline and annotations from our study probably results from tissue-specific isoforms that do not occur in the brain.

New high-confidence isoforms had a median of 761.5 nucleotides in length, ranging from 179 nt to 3,089 nt (Fig. 2c ) and the number of exons ranged between 2 and 14, with most isoforms falling on the lower end of the distribution (Fig. 2d ). Our data were enriched for new RNA isoforms containing all new exons and exon–exon boundaries (that is, exon junctions; Fig. 2e ). The 428 new high-confidence isoforms contained 737 new exon–intron boundaries, where 94.9% (356/370) and 100% (367/367) of the 5′- and 3′-splice sites matched canonical splice site motifs, respectively, supporting their biological feasibility (Fig. 2f ). We successfully validated 9 of 17 attempts for new high-confidence isoforms through PCR and gel electrophoresis (Fig. 2g , Supplementary Figs. 5 – 26 and Supplementary Table 4 ). Of the eight RNA isoforms that failed via standard PCR (no visible band on gel), six were validated through real-time quantitative PCR (RT–qPCR) using a conservative cutoff of C t  < 35 (ref. 30 ) (Supplementary Table 5 ). Of the 15 transcripts successfully validated through PCR and gel electrophoresis or RT–qPCR, 11 were unique to the present study. For additional validation, we compared relative abundance for known and new RNA isoforms between long-read sequencing and RT–qPCR for MAOB , SLC26A1 and MT-RNR2 . The expression patterns were concordant for all three genes tested (Extended Data Fig. 6 and Supplementary Tables 6 and 7 ).
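
The splice-site check referred to here can be illustrated with a small function that tests an intron for the canonical GT...AG motif. This is a simplification: real pipelines score full donor/acceptor motifs and also allow rarer variants such as GC...AG.

```python
# Canonical splice-site check: an intron conventionally begins with GT
# (5' donor site) and ends with AG (3' acceptor site).

def has_canonical_splice_sites(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron (0-based, end-exclusive coordinates) starts with GT and ends with AG."""
    intron = genome[intron_start:intron_end]
    return intron.startswith("GT") and intron.endswith("AG")

genome = "AAAGTCCCCAGAAA"  # toy sequence: exon - GT...AG intron - exon
print(has_canonical_splice_sites(genome, 3, 11))  # -> True
```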

We further attempted to validate our new high-confidence transcripts from known genes using long-read RNA-seq data from five GTEx 19 brain samples (Brodmann area 9) and short-read RNA-seq data from 251 ROSMAP 25 brain samples (Brodmann area 9/46). Approximately 98.8% of the new high-confidence transcripts from known gene bodies had at least one uniquely mapped read in either GTEx or ROSMAP data and 69.6% had at least 100 uniquely mapped reads in either dataset (Extended Data Fig. 7 and Supplementary Table 8 ).

Out of interest, we also validated 6 RNA isoforms from the 99 newly predicted protein-coding genes reported in Nurk et al. 31 using the new telomere-to-telomere (T2T) CHM13 reference genome (Extended Data Fig. 8 ). Our validation threshold for the CHM13 analysis was at least 10 uniquely mapped reads in total across our 12 frontal cortex samples.

Using MS data from the same brain region and human cell lines, we validated 11 of the new high-confidence isoforms from known genes at the protein level (Fig. 2h,i ). Three of the eleven that we validated were unique to our study (BambuTx1879, BambuTx1758 and BambuTx2189).

Medically relevant genes

Identification and quantification of all isoforms are especially important for known medically relevant genes because, for example, when clinicians interpret the consequence of a genetic mutation, it is interpreted in the context of a single isoform of the parent gene body. That isoform may not even be expressed in the relevant tissue or cell type, however. Thus, knowledge about which tissues and cell types express each isoform will allow clinicians and researchers to better interpret the consequences of genetic mutations in human health and disease. To assess RNA isoform expression for medically relevant genes in the frontal cortex, we used the list of medically relevant genes defined in ref. 32 , also adding genes relevant to brain-related diseases 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 .

Of the 428 new high-confidence isoforms, 53 originated from 49 medically relevant genes and we quantified the proportion of total expression for the gene that came from the new isoform(s) (Fig. 3a and Supplementary Fig. 27). The genes with the largest percentage of reads from a newly discovered isoform include SLC26A1 (86%; kidney stones 43 and musculoskeletal health 44), CAMKMT (61%; hypotonia–cystinuria syndrome, neonatal seizures, severe developmental delay and so on 45) and WDR4 (61%; microcephaly 46 and Galloway–Mowat syndrome-6 (ref. 47)). Other notable genes with new high-confidence isoforms include MTHFS (25%; major depression, schizophrenia and bipolar disorder 48), CPLX2 (10%; schizophrenia, epilepsy and synaptic vesicle pathways 49) and MAOB (9%; currently targeted for Parkinson’s disease treatment 50; Fig. 3c). We also found an unannotated RNA isoform for TREM2 (16%; Fig. 3b), one of the top AD risk genes 51, which skips exon 2. This isoform was reported as new in our data because it remains unannotated by Ensembl as of June 2023 (ref. 2), but it has previously been reported by two groups 52,53. The articles identifying this TREM2 isoform reported a relative abundance of around 10%, corroborating our long-read sequencing results 52,53. The new isoform for POLB (a gene implicated in base-excision repair for nuclear and mitochondrial genomes 54,55) accounted for 28% of the gene’s expression (Fig. 3d). We discovered an additional 66 new transcripts from medically relevant genes with median CPM ≤ 1, including new RNA isoforms for SMN1 and SMN2 (spinal muscular atrophy 56; Supplementary Figs. 28 and 29). Medically relevant genes with new RNA isoforms that did not meet our high-confidence threshold are shown in Supplementary Fig. 30.

figure 3

a , Gene names for medically relevant genes where we discovered a new RNA isoform that was not annotated in Ensembl v.107. Only new RNA isoforms with a median CPM > 1 are included. The size of the gene name is proportional to the relative abundance of the new RNA isoform. Relative abundance values relevant to this figure can be found in Supplementary Fig. 27. b – d , RNA isoform structure and CPM expression for isoforms from TREM2 ( b ), MAOB ( c ) and POLB ( d ). For TREM2 and MAOB all isoforms are shown (four each). For POLB only the five most highly expressed isoforms in human frontal cortex are shown. e – g , New spliced, mitochondrially encoded transcripts. We included only new mitochondrial transcripts with median full-length counts >40. e , Structure for new spliced mitochondrial transcripts in red/coral, denoted by ‘Tx’. MT-RNR2 ribosomal RNA is represented in green (overlapping four out of five spliced mitochondrial isoforms) and known protein-coding transcripts in blue. f , Bar plot showing the number of full-length counts (log 10 ) for new spliced mitochondrial transcripts and known protein-coding transcripts. g , Bar plot showing the prevalence of canonical splice site motifs for annotated exons from nuclear transcripts with median CPM > 1 versus new exons from spliced mitochondrial transcripts. All boxplots in this panel show the median (center line), quartiles (box limits) and 1.5 × IQR (whiskers) ( n  = 12 biologically independent samples).

Spliced, mitochondrially encoded isoforms

We identified a new set of spliced, mitochondrially encoded isoforms containing two exons (Fig. 3e), a highly unexpected result given that annotated mitochondrial transcripts contain only one exon. New mitochondrial isoforms were filtered using a count threshold based on full-length reads rather than a median CPM threshold owing to technical difficulties in quantification arising from the polycistronic nature of mitochondrial transcription. Bambu identified a total of 34 new spliced mitochondrial isoforms, but, after filtering using a strict median full-length count threshold of 40, only 5 high-confidence isoforms remained. Four of the new high-confidence isoforms span the MT-RNR2 transcript. Not only does MT-RNR2 encode the mitochondrial 16S rRNA, but part of it is also translated into humanin, a purported anti-apoptotic, 24-amino-acid peptide that acts by inhibiting the Bax protein 57. The fifth new high-confidence isoform spans the MT-ND1 and MT-ND2 genes, but on the opposite strand. Our results support previous important work by Herai et al. demonstrating splicing events in mitochondrial RNA 58.

For context, although expression for the new mitochondrial isoforms was low compared with known mitochondrial genes (Fig. 3f ), their expression was relatively high when compared with all nuclear isoforms. All five exons from new high-confidence mitochondrial isoforms contained the main nucleotides from the canonical 3′-splice site motif (AG), whereas three out of five (60%) contained the main nucleotides from the canonical 5′-splice site motif (GT) (Fig. 3g ).

We attempted to validate three new high-confidence mitochondrially encoded isoforms through PCR and successfully validated two of them (Supplementary Figs. 25 and 26). It was not possible to design specific primers for the other two new high-confidence mitochondrial isoforms because of low sequence complexity or overlap with other lowly expressed (low-confidence) mtRNA isoforms found in our data. However, we were able to validate all five high-confidence spliced mitochondrial transcripts in the data from Glinos et al. 19 because each had at least 100 uniquely aligned counts in each of the five GTEx brain samples (Extended Data Fig. 7). Mitochondria are essential to human cells (and to most eukaryotes) and have been implicated in a range of human diseases, including seizure disorders 59, ataxias 60, neurodegeneration 61 and other age-related diseases 62. Thus, although the function of these new isoforms is unclear, determining it is important because they could have biological roles or serve as biomarkers for mitochondrial function.

Discovery of transcripts from new gene bodies

RNA isoforms from new gene bodies refer to polyadenylated RNA species coming from regions of the genome where transcription was unexpected (that is, unannotated). Bambu identified a total of 1,860 isoforms from 1,676 new gene bodies. We observed a total of 1,593 potential new gene body isoforms with a median CPM ≤ 1. We considered these potential discoveries as low confidence and excluded them from the remainder of our analyses, leaving 267 high-confidence isoforms from 245 gene bodies (Fig. 4a,b). Glinos et al. 19 did not specifically report on new gene bodies, but Leung et al. 20 reported 54 new gene bodies in human cortex, 5 of which overlapped with our high-confidence isoforms from new genes. The new isoforms from new gene bodies had a median length of 1,529 nt, ranging between 109 nt and 5,291 nt (Fig. 4c). The number of exons ranged between 2 and 4, with 96.6% of isoforms having only 2 exons (Fig. 4d). Given the large proportion of transcripts containing only two exons, it is possible that we sequenced only a fragment of larger RNA molecules.

figure 4

a , Number of newly discovered transcripts from new gene bodies represented across the median CPM threshold. The cutoff is shown as the dashed line set at the median CPM = 1. b , Distribution of log 10 (median CPM values) for new transcripts from new gene bodies. The dashed line shows the cutoff point of the median CPM = 1. c – g , Data from transcripts above this expression cutoff. c , Histogram showing length distribution for new transcripts from new gene bodies. d , Bar plot showing the distribution of the number of exons for new transcripts from new gene bodies. Given the large proportion of transcripts containing only two exons, it is possible that we sequenced only a fragment of larger RNA molecules. e , Bar plot showing the kinds of events that gave rise to new transcripts from new gene bodies (in part created with BioRender.com ). f , Bar plot showing the prevalence of canonical splice site motifs for annotated exons from transcripts with a median CPM > 1 versus new exons from new gene bodies. g , RNA isoform structure and CPM expression for isoforms from new gene body ( BambuGene290099 ). Boxplot format: median (center line), quartiles (box limits), 1.5 × IQR (whiskers) ( n  = 12 biologically independent samples). h , Gel electrophoresis validation using PCR amplification for a subset of new isoforms from new genes. This is an aggregate figure showing bands for several different gels. Each gel electrophoresis PCR experiment was independently performed once with similar results. Individual gel figures are available in Supplementary Figs. 5 – 26 . i , Protein level validation using publicly available MS proteomics data. The y axis shows the number of spectral counts from uniquely matching peptides (unique spectral counts); new transcripts from new genes were considered to be validated at the protein level if they had more than five unique spectral counts.

Of the 267 new high-confidence isoforms from new gene bodies, 130 overlapped a known gene body on the opposite strand, 97 came from a completely new locus and 40 came from within a known gene body but did not overlap a known exon (Fig. 4e). These 170 new transcripts from new gene bodies located in intragenic regions could be a result of leaky transcription and splicing. A recent article 63 suggests that spurious intragenic transcription may result from aging in mammalian tissues. In new isoforms from new gene bodies, 82.5% (222 of 269) of exons contained the primary ‘GT’ nucleotides from the canonical 5′-splice site motif, whereas 90.7% (244 of 269) contained the primary ‘AG’ nucleotides from the canonical 3′-splice site motif (Fig. 4f). It is interesting that one new gene body ( BambuGene290099 ) had three high-confidence RNA isoforms (Fig. 4g). We successfully validated 11 of 12 attempts for new high-confidence RNA isoforms from new gene bodies through PCR and gel electrophoresis (Fig. 4h, Supplementary Figs. 5–26 and Supplementary Table 4), and the twelfth was validated through RT–qPCR (mean Ct = 23.2; Supplementary Table 5). All 12 new RNA isoforms from new gene bodies validated through PCR were unique to the present study.

Over 94.4% of the new high-confidence transcripts from new gene bodies had at least one uniquely mapped read in either GTEx or ROSMAP data and >44.2% had at least 100 uniquely mapped reads in either dataset (Extended Data Fig. 7 and Supplementary Table 8). The validation rate for new transcripts from known gene bodies was higher than that for new transcripts from new gene bodies, indicating that some of our newly discovered genes could be aging related. Whether these newly discovered gene bodies are biologically meaningful or ‘biological noise’ is unclear. We validated three RNA isoforms from new gene bodies at the protein level using MS data from the same brain region and human cell lines (Fig. 4i); all three were unique to the present study.

During isoform discovery, we identified a new low-abundance RNA isoform (median CPM < 1) with two exons for the External RNA Controls Consortium (ERCC) RNA spike-ins (Supplementary Figs. 31 and 32 ). We were skeptical about this discovery because ERCCs contain only one exon, but we validated these results by PCR across two different batches of ERCC (Supplementary Figs. 33 and 34 ).

Medically relevant genes expressing multiple RNA isoforms

We found 7,042 genes expressing two or more RNA isoforms with a median CPM > 1, where 3,387 genes expressed ≥2 isoforms with distinct protein sequences (Fig. 5a,b ). Of the 5,035 medically relevant genes included in the present study 32 , 1,917 expressed multiple isoforms and 1,018 expressed isoforms with different protein-coding sequences (Fig. 5c ), demonstrating the isoform diversity of medically relevant genes in a single tissue and the importance of interpreting genetic variants in the proper context of tissue-specific isoforms. Of the 7,418 transcripts from medically relevant genes expressed with median CPM > 1, 5,695 are longer than 2,000 nt (Supplementary Fig. 35 ). Given the length of these 5,695 RNA isoforms, it is likely that their quantification is less accurate, despite the advantages that long-read sequencing offers.
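The tallies above amount to grouping isoforms by parent gene and counting genes with at least two isoforms passing the expression filter. A minimal sketch of that logic, with made-up gene and isoform names and CPM values (not data from the study):

```python
from collections import defaultdict
from statistics import median

def genes_with_multiple_isoforms(cpm, min_cpm=1.0, min_isoforms=2):
    """cpm: dict mapping (gene, isoform) -> list of per-sample CPM values.

    Returns the set of genes with at least `min_isoforms` isoforms whose
    median CPM across samples exceeds `min_cpm`.
    """
    per_gene = defaultdict(int)
    for (gene, _isoform), values in cpm.items():
        if median(values) > min_cpm:
            per_gene[gene] += 1
    return {g for g, n in per_gene.items() if n >= min_isoforms}

# Toy expression table (three samples per isoform; values are invented)
cpm = {
    ("GENE_A", "iso1"): [5.0, 6.0, 4.0],
    ("GENE_B", "iso1"): [9.0, 8.0, 10.0],
    ("GENE_A", "iso2"): [2.0, 3.0, 2.5],
    ("GENE_B", "iso2"): [0.2, 0.1, 0.3],   # below the median CPM > 1 filter
}
print(genes_with_multiple_isoforms(cpm))  # {'GENE_A'}
```

The same grouping, restricted to isoforms with distinct protein-coding sequences, yields the protein-level counts reported above.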

figure 5

a , Gene bodies with multiple transcripts across the median CPM threshold. b – i , Gene bodies with multiple transcripts at median CPM > 1. b , Gene bodies expressing multiple transcripts. c , Medically relevant gene bodies expressing multiple transcripts. d , Brain disease-relevant gene bodies expressing multiple transcripts. e , Transcripts expressed in the frontal cortex for a subset of genes implicated in AD. f , APP transcript expression. g , MAPT transcript expression. h , BIN1 transcript expression. i , Same as e but for genes implicated in other neurodegenerative diseases. LATE, limbic-predominant, age-related TDP-43 encephalopathy. j , TARDBP transcript expression. k , Same as e but for genes implicated in neuropsychiatric disorders. In i and k , the dashed lines are delimiters, separating the genes that are associated with different brain-related disorders. l , SHANK3 transcript expression. Boxplot format for entire panel: median (center line), quartiles (box limits), 1.5 × IQR (whiskers) ( n  = 12 biologically independent samples).

It is interesting that 98 genes implicated in brain-related diseases expressed multiple RNA isoforms in human frontal cortex, including AD genes such as APP (Aβ-precursor protein) with 5, MAPT (tau protein) with 4 and BIN1 with 8 (Fig. 5d–h). Notably, we observed only four MAPT isoforms with a median CPM > 1, where two were expressed at levels many times greater than the others, whereas substantial previous research suggests that there are six tau proteins expressed in the central nervous system 64,65,66. Similarly, several genes implicated in other neurodegenerative diseases and neuropsychiatric disorders expressed multiple isoforms in human frontal cortex, including SOD1 (amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD); Fig. 5i) with two isoforms expressed with a median CPM > 1, SNCA (Parkinson’s disease (PD); Fig. 5i) with four, TARDBP (TDP-43 protein; involved in several neurodegenerative diseases; Fig. 5i,j) with four and SHANK3 (autism spectrum disorder; Fig. 5k,l) with three.

RNA isoform expression reveals patterns hidden at gene level

Perhaps the most compelling value in long-read RNA-seq is the ability to perform differential isoform expression analyses. Through these analyses, we can begin to distinguish which isoforms are expressed in specific cell types and tissue types and ultimately determine their associations with human health and disease. Thus, as proof of principle, we performed differential gene and isoform expression analyses comparing six pathologically confirmed cases of AD and six cognitively unimpaired controls. The dataset is not large enough to draw firm disease-specific conclusions, but it does demonstrate the need for larger studies.

We found 176 differentially expressed genes and 105 differentially expressed RNA isoforms (Fig. 6a,b and Supplementary Tables 9 and 10 ). Of these 105 isoforms, 99 came from genes that were not differentially expressed when collapsing all isoforms into a single gene measurement (Fig. 6a,b ), demonstrating the utility of differential isoform expression analyses. It is interesting that there were two differentially expressed isoforms from the same gene ( TNFSF12 ), with opposite trends. The TNFSF12-219 isoform was upregulated in cases with AD whereas TNFSF12-203 was upregulated in controls (Fig. 6c–e ), even though the TNFSF12 gene was not differentially expressed when collapsing all transcripts into a single gene measurement (Fig. 6c ).
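We used DESeq2 for these analyses (see the figure legend and Methods); purely to illustrate the thresholding step, the sketch below applies a Benjamini–Hochberg correction and the q value < 0.05, |log2(fold-change)| > 1 cutoffs to toy results. It is not the DESeq2 implementation.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR correction; returns q values in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonic q values
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end              # 1-based rank of p-value i
        prev = min(prev, pvalues[i] * m / rank)
        q[i] = prev
    return q

def call_differential(pvalues, log2fc, alpha=0.05, lfc=1.0):
    """Indices passing q < alpha and |log2 fold-change| > lfc."""
    qvals = benjamini_hochberg(pvalues)
    return [i for i in range(len(pvalues))
            if qvals[i] < alpha and abs(log2fc[i]) > lfc]

# Toy isoform-level test results (invented, not from the study)
p = [0.001, 0.04, 0.2, 0.0005]
fc = [2.1, 0.4, -1.5, -2.2]
print(call_differential(p, fc))  # [0, 3]
```

Isoform 1 has a small raw p value but a fold-change below the cutoff, and isoform 2 survives the fold-change filter but not the FDR filter, so only isoforms 0 and 3 are called differentially expressed.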

figure 6

a , Differential gene expression between cases with AD and cognitively unimpaired controls. The horizontal line is at the FDR-corrected P value ( q value) = 0.05. Vertical lines are at log 2 (fold-change) = −1 and +1. The threshold for differential gene expression was set at q value < 0.05 and |log 2 (fold-change)| > 1. The names displayed represent a subset of genes that are not differentially expressed but have at least one RNA isoform that is differentially expressed. FC, fold-change; NS, not significant. b , Same as a but for differential RNA isoform expression analysis. We used the DESeq2 R package with the two-sided Wald test for statistical comparisons and the Benjamini–Hochberg correction for multiple comparisons in the differential expression analyses presented in a and b . c , Expression for TNFSF12 between cases with AD and controls (CT). The TNFSF12 gene does not meet the differential expression threshold. d , TNFSF12-219 transcript expression between AD and CT. TNFSF12-219 is upregulated in AD. e , Expression for the TNFSF12-203 transcript between AD and CT. TNFSF12-203 is upregulated in CT. All boxplots in this panel show the median (center line), upper and lower quartiles (box limits) and 1.5 × IQR (whiskers). All figures come from n  = 12 biologically independent samples (AD, n  = 6; CT, n  = 6).

Out of interest, we measured the expression patterns for the TNFSF12-203 and TNFSF12-219 isoforms in the five GTEx long-read RNA-seq samples from Brodmann area 9 to assess whether the expression pattern matched what we observed in our cognitively unimpaired controls (Extended Data Fig. 9 ). We found that the expression for both TNFSF12 isoforms shows greater variability than either of our groups, but arguably more closely resembles the pattern in our controls.

Out of interest, we also provided plots from a principal component analysis at both the gene and the isoform level, where we observed a potential separation between cases and controls (Supplementary Fig. 36). We caution against overinterpreting this separation given the small sample size.

By applying deep long-read RNA-seq, we identified new gene bodies and RNA isoforms expressed in human frontal cortex, demonstrating that substantial gaps remain in our understanding of RNA isoform diversity (Figs. 2a , 3e and 4a ). We quantified the individual RNA isoform expression levels in human frontal cortex as a step toward functional analysis of these isoforms. We found 7,042 genes expressing multiple RNA isoforms, with 1,917 being medically relevant genes (that is, implicated in human disease; Fig. 5a–c ). Some of these medically relevant genes expressing multiple RNA isoforms in human frontal cortex are implicated in brain-related diseases, including AD, PD, autism spectrum disorder, substance use disorder and others (Fig. 5d ). Together, these findings highlight the importance of measuring individual RNA isoform expression accurately to discern the possible roles of each isoform within human health and disease, and to interpret the effects of a given genetic variant.

We performed differential RNA isoform expression analysis to reveal expression patterns associated with disease that were hidden when performing gene-level analysis (Fig. 6a,b). Given the 99 isoforms that were differentially expressed where the gene as a whole was not, we demonstrated that differential gene-level expression analysis is important but may be insufficient in many cases if we want to truly understand the biological complexities afforded by alternative splicing. We further suggest that deep long-read RNA-seq is necessary to understand the full complexity of transcriptional changes during disease. The gene TNFSF12 is a key example because, although the gene itself is not differentially expressed in our data, the TNFSF12-219 isoform is significantly upregulated in cases with AD whereas the TNFSF12-203 isoform is significantly upregulated in controls (Fig. 6c–e).

We also identified five new high-confidence, spliced mitochondrially encoded RNA isoforms with two exons each. This is a surprising finding given that all annotated human mitochondrial transcripts have only one exon (Fig. 3e,f). Previous work in human cell cultures corroborates our findings 58. To our knowledge, no previous study has identified spliced mtRNA isoform expression directly in human tissue. Given the involvement of mitochondria in many age-related diseases 62, it would be of interest to determine the function, if any, of these spliced mtRNA isoforms.

Long reads present an improvement over short-read RNA-seq, but it remains challenging to accurately quantify RNA isoforms in genes with many large and similar isoforms (Extended Data Fig. 10 ). Thus, although this work is a substantial improvement over short-read sequencing, the data are not perfect and future improvements in sequencing, transcriptome annotation and bioinformatic quantification will continue to improve the accuracy of long-read RNA-seq. Our data showed a pronounced 3′ bias that can hinder RNA isoform quantification, especially for genes where the exon diversity is closer to the 5′-end (Supplementary Fig. 37 ).

The small sample size limits the generalizability of the differential RNA isoform expression results, serving primarily as a proof of concept for the value of measuring individual RNA isoform expression in disease tissue. We refrained from performing differential isoform usage analysis and pathway analysis to avoid overinterpreting results from only 12 samples; however, these analyses could provide valuable insights in larger studies. In addition, the present study is based on ‘bulk’ RNA-seq, rather than single-cell sequencing; bulk sequencing is likely to obscure critical cell type-specific expression patterns that single-cell sequencing can elucidate, although the cost of single-cell sequencing combined with long-read sequencing is still a major hurdle in making a large study of this kind feasible.

In conclusion, we demonstrate that a large proportion of medically relevant genes express multiple RNA isoforms in human frontal cortex, with many encoding different protein-coding sequences that could potentially perform different functions. We also demonstrate that differential RNA isoform analysis can reveal transcriptomic signatures in AD that are not available at the gene level. Our study highlights the advantage of long-read RNA-seq in assessing RNA expression patterns in complex human diseases to identify new molecular targets for treatment and diagnosis.

Sample collection, RNA extraction and quality control

Frozen postmortem human frontal cortex samples were collected from the University of Kentucky Alzheimer’s Disease Research Center autopsy cohort 67, snap-frozen in liquid nitrogen at autopsy and stored at −80 °C. The postmortem interval (from death to autopsy) was <5 h for all samples. All samples came from white individuals. Approximately 25 mg of gray matter from the frontal cortex was chipped on dry ice into prechilled, 1.5-ml low-bind tubes (Eppendorf, cat. no. 022431021), kept frozen throughout the process and stored at −80 °C. RNA was extracted with the Lexogen SPLIT RNA extraction kit (cat. no. 008.48) using protocol v.008UG005V0320 ( Supplementary Information , pp. 51–75).

Briefly, ~25 mg of tissue was removed from −80 °C storage and kept on dry ice until processing began. Then, 400 μl of chilled isolation buffer (4 °C; Lexogen SPLIT RNA kit) was added to each tube and the tissue was homogenized using a plastic pestle (Kontes Pellet Pestle, VWR, cat. no. KT749521-1500). Samples remained on ice to maintain RNA integrity while other samples were homogenized. Samples were then decanted into room-temperature, phase-lock gel tubes, 400 μl of chilled phenol (4 °C) was added and the tube was inverted 5× by hand. Acidic buffer (AB, Lexogen), 150 μl, was added to each sample and the tube inverted 5× by hand; 200 μl of chloroform was then added and the tube inverted for 15 s. After a 2-min incubation at room temperature, samples were centrifuged for 2 min at 12,000 g and 18–20 °C and the upper phase (approximately 600 μl) was decanted into a new 2-ml tube. Total RNA was precipitated by adding 1.75× the sample volume of isopropanol and then loaded onto a silica column by centrifugation (12,000 g , 18 °C for 20 s; flow-through discarded). The column was then washed twice with 500 μl of isopropanol and 3× with 500 μl of wash buffer (Lexogen), centrifuging each time (12,000 g , 18 °C for 20 s; flow-through discarded). The column was transferred to a new low-bind tube and the RNA was eluted by adding 30 μl of elution buffer (1-min incubation, then centrifugation at 12,000 g , 18 °C for 60 s); the eluted RNA was immediately placed on ice to prevent degradation.

RNA quality was determined initially by NanoDrop (A260:A280 and A260:A230 absorbance ratios) and then using an Agilent Fragment Analyzer 5200 with the RNA (15 nt) DNF-471 kit (Agilent). All samples achieved NanoDrop ratios >1.8 and Fragment Analyzer RIN > 9.0 before sequencing (Supplementary Figs. 38–49 and Supplementary Table 1).

RNA spike-ins

ERCC RNA spike-in controls (Thermo Fisher Scientific, cat. no. 4456740) were added to the RNA at the point of starting cDNA sample preparation at a final dilution of 1:1,000.

Library preparation, sequencing and base calling

Isolated RNA was kept on ice until quality control testing was completed as described above. Long-read cDNA libraries were prepared using the Oxford Nanopore Technologies PCR-amplified cDNA kit (cat. no. SQK-PCS111). The protocol was performed according to the manufacturer’s specifications, with two notable modifications: the cDNA PCR extension time was 6 min and we performed 14 PCR amplification cycles. Poly(A) enrichment is inherent to this protocol and happens at the start of the cDNA synthesis. The cDNA quality was determined using an Agilent Fragment Analyzer 5200 and Genomic DNA (50 kb) kit (Agilent DNF-467) (see Supplementary Figs. 50–61 for cDNA traces). The cDNA libraries were sequenced continuously for 60 h on the PromethION P24 platform with flow cell R9.4.1 (one sample per flow cell). Data were collected using MinKNOW v.23.04.5. The .fast5 files obtained were base called using the Guppy graphics processing unit (GPU) base caller v.3.9 with the configuration dna_r9.4.1_450bps_hac_prom.cfg.

Read preprocessing, genomic alignment and quality control

Nanopore long-read sequencing reads were preprocessed using pychopper 68 v.2.7.2 with the PCS111 sequencing kit setting. Pychopper filters out any reads not containing primers on both ends and rescues fused reads containing primers in the middle. Pychopper then orients the reads to their genomic strand and trims the adapters and primers off the reads.

The preprocessed reads were then aligned to the GRCh38 human reference genome (without alternative contigs and with added ERCC sequences) using minimap2 (ref. 69 ) v.2.22-r1101 with parameters ‘-ax splice -uf’. Full details and scripts are available on our GitHub (‘Code availability’). Aligned reads with a mapping quality (MAPQ) score <10 were excluded using SAMtools 70 v.1.6. Secondary and supplementary alignments were also excluded using SAMtools v.1.6. The resulting bam alignment files were sorted by genomic coordinate and indexed before downstream analysis. Quality control reports and statistics were generated using PycoQC 71 v.2.5.2. Information about mapping rate and read length and other sequencing statistics can be found in Supplementary Table 1 and Supplementary Figs. 1 – 4 .
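The MAPQ and secondary/supplementary filtering was performed with SAMtools; purely to make the criteria explicit, the sketch below mirrors that filter (roughly equivalent to `samtools view -q 10 -F 0x900`) on raw SAM text lines, using made-up records.

```python
SECONDARY, SUPPLEMENTARY = 0x100, 0x800  # SAM FLAG bits

def keep_alignment(sam_line: str, min_mapq: int = 10) -> bool:
    """Keep an alignment only if MAPQ >= min_mapq and it is neither a
    secondary nor a supplementary alignment. Header lines are kept."""
    if sam_line.startswith("@"):
        return True
    fields = sam_line.split("\t")
    flag, mapq = int(fields[1]), int(fields[4])  # FLAG and MAPQ columns
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False
    return mapq >= min_mapq

# Minimal invented SAM records: FLAG and MAPQ are columns 2 and 5
primary_good = "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tFFFF"
low_mapq     = "read2\t0\tchr1\t200\t5\t50M\t*\t0\t0\tACGT\tFFFF"
secondary    = "read3\t256\tchr1\t300\t60\t50M\t*\t0\t0\tACGT\tFFFF"
print([keep_alignment(r) for r in (primary_good, low_mapq, secondary)])
# [True, False, False]
```

In the actual pipeline this step is of course done on BAM files with SAMtools, not in Python; the sketch only documents the filtering logic.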

Transcript discovery and quantification

Filtered BAM files were utilized for transcript quantification and discovery using bambu 14 v.3.0.5. We ran bambu using the Ensembl v.107 (ref. 2) gene transfer format (GTF) annotation file, with added annotations for the ERCC spike-in RNAs, and the GRCh38 human reference genome sequence with added ERCC sequences. The BAM file for each sample was individually preprocessed with bambu and the resulting 12 RDS (R data serialization) files were provided as input all at once to perform transcript discovery and quantification using bambu. The new discovery rate (NDR) was determined based on the recommendation by the bambu machine learning model (NDR = 0.288). Bambu outputs three transcript-level count matrices: total counts (all counts, including reads that were partially assigned to multiple transcripts), unique counts (only counts from reads that were assigned to a single transcript) and full-length counts (only counts from reads containing all exon–exon boundaries of their respective transcript). Except where specified otherwise, expression values reported in this article come from the total count matrix.

We used full-length reads for quantification in the mitochondria because the newly discovered spliced mitochondrial transcripts caused issues in quantification. Briefly, owing to polycistronic mitochondrial transcription, many nonspliced reads were partially assigned to spliced mitochondrial transcripts, resulting in a gross overestimation of spliced mitochondrial transcript expression values. We bypassed this issue by using only full-length counts (that is, counting only reads that match the exon–exon boundaries of newly discovered spliced mitochondrial transcripts).
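A minimal sketch of this full-length counting scheme, with hypothetical transcript identifiers and toy junction coordinates: a read contributes to a transcript only if its chain of exon–exon junctions matches the transcript's chain exactly, so unspliced polycistronic reads contribute nothing.

```python
def full_length_counts(transcript_junctions, read_junctions):
    """Count reads whose junction chain exactly matches each transcript's.

    transcript_junctions: dict transcript_id -> tuple of (donor, acceptor)
        genomic junction coordinates, in order.
    read_junctions: list of junction-chain tuples, one per read
        (an empty tuple means an unspliced read).
    """
    counts = {tx: 0 for tx in transcript_junctions}
    chains = {junctions: tx for tx, junctions in transcript_junctions.items()}
    for chain in read_junctions:
        tx = chains.get(chain)
        if tx is not None:
            counts[tx] += 1
    return counts

# Hypothetical spliced transcripts and reads (coordinates are invented)
tx = {"BambuTx_mt1": ((100, 200),),
      "BambuTx_mt2": ((100, 200), (300, 400))}
reads = [((100, 200),), ((100, 200),), ((100, 200), (300, 400)), ()]
print(full_length_counts(tx, reads))  # {'BambuTx_mt1': 2, 'BambuTx_mt2': 1}
```

The unspliced read (empty junction chain) matches no transcript, which is exactly why this counting mode avoids the overestimation described above.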

We included only newly discovered (that is, unannotated) transcripts with a median CPM > 1 in downstream analyses (that is, high-confidence new transcripts) unless explicitly stated otherwise. New transcripts from mitochondrial genes were the exception; these were filtered using a median full-length read count >40 threshold.
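The filtering rule can be sketched as follows (toy counts, not from the study): compute CPM per sample from raw counts and per-sample library sizes, then keep transcripts whose median CPM across samples exceeds 1.

```python
from statistics import median

def cpm_matrix(counts):
    """counts: dict transcript -> list of per-sample raw counts.
    Returns dict transcript -> list of per-sample CPM values
    (counts per million: count / library size * 1e6)."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(values[s] for values in counts.values())
                 for s in range(n_samples)]
    return {tx: [c / lib * 1e6 for c, lib in zip(values, lib_sizes)]
            for tx, values in counts.items()}

def high_confidence(counts, min_median_cpm=1.0):
    """Transcripts whose median CPM across samples exceeds the threshold."""
    cpm = cpm_matrix(counts)
    return [tx for tx, vals in cpm.items() if median(vals) > min_median_cpm]

# Toy counts for two samples; both library sizes sum to 2,000,000 reads
counts = {
    "known_tx": [1_999_996, 1_999_995],
    "new_tx_a": [3, 4],   # ~1.5 and ~2.0 CPM -> median above 1, kept
    "new_tx_b": [1, 1],   # ~0.5 CPM -> filtered out as low confidence
}
print(high_confidence(counts))  # ['known_tx', 'new_tx_a']
```

The mitochondrial exception described above would use full-length counts with an absolute threshold instead of this CPM-based rule.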

Data from transcriptomic analysis can be visualized in the web application we created using R v.4.2.1 and Rshiny v.1.7.4: https://ebbertlab.com/brain_rna_isoform_seq.html .

Analysis using CHM13 reference

We processed the RNA-seq data from the 12 dorsolateral prefrontal cortex samples (Brodmann area 9/46) from the present study using the same computational pipeline described above and below, except for two changes: (1) we used the CHM13 reference genome rather than GRCh38 and (2) we set bambu to quantification-only mode rather than quantification and discovery. The reference fasta and gff3 files were retrieved from the T2T-CHM13 GitHub ( https://github.com/marbl/CHM13 ). The following are the links to the reference genome sequence ( https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz ) and the GFF3 annotation ( https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v2.0.gene_annotation.gff3 ). We then quantified expression for the extra 99 predicted protein-coding genes from CHM13 reported in Nurk et al. 31 .

Subsampling discovery analysis

Nanopore long-read sequencing data were randomly subsampled at 20% increments, generating the following subsamples for each sample: 20%, 40%, 60% and 80%. The 12 subsampled datasets for each increment were run through our long-read RNA-seq discovery and quantification pipeline described above and below. We compared the number of discovered transcripts between the subsamples and the full samples to assess the effect of read depth on the number of transcripts discovered using bambu. The CPM values were recalculated based on the new sequencing depth for each subsampling increment, so the absolute count threshold required to reach a median CPM > 1 became lower as the sequencing depth decreased.
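Because CPM normalizes by library size, the raw-count equivalent of the CPM = 1 cutoff scales linearly with depth. A short worked sketch (the 40-million-read library size is hypothetical, chosen only for illustration):

```python
def counts_for_cpm(target_cpm: float, library_size: int) -> float:
    """Raw read count corresponding to a given CPM at a given depth."""
    return target_cpm * library_size / 1_000_000

full_depth = 40_000_000  # hypothetical full library size, not from the study
for fraction in (1.0, 0.8, 0.6, 0.4, 0.2):
    depth = int(full_depth * fraction)
    # e.g. at 20% depth, CPM = 1 corresponds to only 8 reads
    print(f"{int(fraction * 100)}% depth: CPM=1 ~ {counts_for_cpm(1, depth):.0f} reads")
```

This is why the absolute count needed to pass the median CPM > 1 filter drops with each subsampling increment.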

Transcript discovery GTEx data with bambu

We obtained the long-read RNA-seq data from 90 GTEx samples across 15 human tissues and cell lines sequenced with the Oxford Nanopore Technologies PCR-amplified cDNA protocol (PCS109), generated by Glinos et al. 19 . We then processed these data through our long-read RNA-seq discovery and quantification pipeline described above and below. We used the same Ensembl v.88 annotations originally used by Glinos et al. 19 and compared their original results with the results from our pipeline to assess the effect of the isoform discovery tool (that is, bambu 14 versus FLAIR 28 ) on the number of newly discovered transcripts. We also compared the number of newly discovered transcripts when running the GTEx data through our computational pipeline with the Ensembl v.88 annotation versus the Ensembl v.107 annotation to assess the effect of different annotations on the number of transcripts discovered. Last, we compared the overlap between the new transcripts from known genes discovered in our study using 12 brain samples, the original results 19 and the results we obtained from running the GTEx data through our computational pipeline using the Ensembl v.107 annotations.

Validation of new transcripts using GTEx data

We obtained publicly available GTEx nanopore long-read RNA-seq data from six brain samples (Brodmann area 9). One sample was excluded because it had <50,000 total reads, so five samples were used for all downstream analyses. These data had been previously analyzed in Glinos et al. 19 . Fastq files were preprocessed using pychopper 68 v.2.7.2 with the PCS109 sequencing kit setting. The files were then processed as described above and below, with two changes: (1) we set bambu to quantification-only mode and (2) we used a GTF annotation file containing all transcripts from Ensembl v.107, the ERCC spike-in RNAs and all the new transcripts discovered in the present study. The transcript-level unique count matrix output by bambu was used to validate the newly discovered transcripts in the present study.

Validation of new transcripts using ROSMAP data

We obtained publicly available ROSMAP Illumina 150-bp paired-end RNA-seq data from 251 brain samples (Brodmann area 9/46). These data had been previously analyzed in ref. 25 and described in ref. 26 . Fastq files were preprocessed and quality controlled using Trim Galore v.0.6.6. We generated the reference transcriptome for alignment from the GTF annotation file containing all transcripts from Ensembl v.107, the ERCC spike-in RNAs and all the new transcripts discovered in the present study, in combination with the GRCh38 reference genome and gffread v.0.12.7. The preprocessed reads were then aligned to this reference transcriptome using STAR 72 v.2.7.10b. Full details and scripts are available on our GitHub (‘Code availability’). Aligned reads with a MAPQ score <255 were excluded using SAMtools 70 v.1.6, keeping only reads that uniquely aligned to a single transcript. We quantified the number of uniquely aligned reads using salmon 73 v.0.13.1. The count matrix of uniquely aligned read counts output by salmon was used to validate the newly discovered transcripts in the present study.

Splice site motif analysis

We utilized the online MEME Suite tool 74 v.5.5.3 ( https://meme-suite.org/meme/tools/meme ) to create canonical 5′- and 3′-splice site motifs and estimated the percentage of exons containing these motifs. For known genes, we included only exons from multi-exonic transcripts that were expressed with a median CPM > 1 in our samples. If two exons shared a start or an end site, one of them was excluded from the analysis. For new high-confidence transcripts, we filtered out any exon start or end sites contained in the Ensembl annotation. If two or more exons shared a start or an end site, we used only one of those sites for downstream analyses. For the 5′-splice site analysis, we included the last 3 nt of the exon and the first 6 nt of the intron. For the 3′-splice site analysis, we included the last 10 nt of the intron and the first 3 nt of the exon. The coordinates for the 5′- and 3′-splice site motifs were chosen based on previous studies 75 , 76 . The percentage of exons containing the canonical 5′-splice site motif was calculated as the proportion of 5′-splice site sequences with GT as the first two nucleotides of the intron. The percentage of exons containing the canonical 3′-splice site motif was calculated as the proportion of 3′-splice site sequences with AG as the last 2 nt of the intron. Fasta files containing 5′-splice site sequences from each category of transcript ((1) known transcript from known gene body, (2) new transcript from known gene, (3) new transcript from new gene body and (4) transcript from mitochondrial gene body) were individually submitted to the online MEME Suite tool to generate splice site motifs. The same process was repeated for the 3′-splice site sequences. Owing to the small number of transcripts, it was not possible to generate reliable splice site motifs for new transcripts from mitochondrial genes; instead, we used the 5′-GT sequence and 3′-AG sequence to represent them in Fig. 2g .
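
The window extraction and canonical-motif check can be sketched as follows (a minimal illustration with 0-based half-open coordinates and a toy sequence of our own; the pipeline's actual implementation may differ):

```python
# Extract splice-site windows around one intron on the plus strand.
def five_prime_window(genome: str, exon_end: int) -> str:
    """Last 3 exonic nt + first 6 intronic nt around the donor site."""
    return genome[exon_end - 3: exon_end + 6]

def three_prime_window(genome: str, intron_end: int) -> str:
    """Last 10 intronic nt + first 3 exonic nt around the acceptor site."""
    return genome[intron_end - 10: intron_end + 3]

def is_canonical(donor: str, acceptor: str) -> bool:
    """GT as the first two intronic nt; AG as the last two intronic nt."""
    return donor[3:5] == "GT" and acceptor[8:10] == "AG"

# Toy genome: exon (0-10) | intron (10-30) | exon (30-36).
genome = "CCCCCCCAAG" "GTAAGTCCCCCCCCCCCCAG" "GTTCCC"
donor = five_prime_window(genome, 10)       # "AAGGTAAGT"
acceptor = three_prime_window(genome, 30)   # "CCCCCCCCAGGTT"
print(is_canonical(donor, acceptor))        # True
```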

Comparison between annotations

Annotations from new high-confidence transcripts discovered in the present study were compared with annotations from previous studies using gffcompare 77 v.0.11.2. Transcripts were considered to overlap when gffcompare found a complete match of the exon–exon boundaries (that is, intron chain) between two transcripts. The annotation from Glinos et al. 19 was retrieved from https://storage.googleapis.com/gtex_analysis_v9/long_read_data/flair_filter_transcripts.gtf.gz . The annotation from Leung et al. 20 was retrieved from https://zenodo.org/record/7611814/preview/Cupcake_collapse.zip#tree_item12/HumanCTX.collapsed.gff .
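
The complete intron-chain match used for overlap calls can be sketched like this (our own simplified illustration of the idea behind gffcompare's complete match, not its code; exons are (start, end) tuples on one chromosome and strand):

```python
def intron_chain(exons):
    """Ordered exon-exon junctions: (donor, acceptor) pairs between exons."""
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))

def same_intron_chain(exons_a, exons_b):
    """True when the two transcripts share an identical intron chain."""
    return intron_chain(exons_a) == intron_chain(exons_b)

# Two transcripts with different terminal exon ends but identical junctions
# still count as a complete match, since only the intron chain is compared.
tx_a = [(100, 200), (300, 400), (500, 600)]
tx_b = [(150, 200), (300, 400), (500, 650)]
print(same_intron_chain(tx_a, tx_b))  # True
```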

Differential gene expression analysis

Although bambu outputs a gene-level count matrix, this matrix includes intronic reads. To create a gene-level count matrix without intronic reads, we summed the transcript counts for each gene using a customized Python script (v.3.10.8). This gene-level count matrix without intronic reads was used for all gene-level analyses in the present study. We performed differential gene expression analysis only on genes with a median CPM > 1 (20,448 genes included in the analysis). The count matrix for genes with CPM > 1 was loaded into R v.4.2.2. We performed differential gene expression analysis with DESeq2 (ref. 78 ) v.1.38.3 using default parameters, comparing samples from patients with AD and cognitively unimpaired controls. We set the thresholds for differential expression at log 2 (fold-change) > 1 and false discovery rate (FDR)-corrected P value ( q value) <0.05. DESeq2 uses the Wald test for statistical comparisons. Detailed descriptions of statistical analysis results can be found in Supplementary Table 9 .
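
The transcript-to-gene summation is a simple aggregation; a minimal sketch (the toy counts and the transcript-to-gene map are ours, not the actual bambu output) could look like:

```python
from collections import defaultdict

# Hypothetical input: transcript-level counts keyed by transcript ID, plus
# a transcript -> gene map derived from the annotation.
transcript_counts = {"tx1": 120.0, "tx2": 30.0, "tx3": 55.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

def gene_counts_from_transcripts(transcript_counts, tx2gene):
    """Sum transcript counts per gene. Intronic reads are excluded by
    construction: only reads assigned to a transcript contribute."""
    totals = defaultdict(float)
    for tx, count in transcript_counts.items():
        totals[tx2gene[tx]] += count
    return dict(totals)

print(gene_counts_from_transcripts(transcript_counts, tx2gene))
# {'geneA': 150.0, 'geneB': 55.0}
```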

Differential isoform expression analysis

For differential isoform expression analysis, we used the transcript count matrix output by bambu. We performed differential isoform expression analysis only on transcripts with a median CPM > 1 coming from genes expressing two or more transcripts with median CPM > 1 (19,423 transcripts from 7,042 genes included in the analysis). This filtered count matrix was loaded into R v.4.2.2. We performed differential isoform expression analysis with DESeq2 v.1.38.3 using default parameters and the same methods as the gene-level analysis, comparing samples from patients with AD and cognitively unimpaired controls, with the same significance thresholds (log 2 (fold-change) > 1 and FDR-corrected P  < 0.05). DESeq2 uses the Wald test for statistical comparisons. Detailed descriptions of statistical analysis results can be found in Supplementary Table 10 .
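
The two-part filter (transcript passes median CPM > 1, and its gene has at least two passing transcripts) can be sketched as follows (toy CPM values and gene map are ours):

```python
from statistics import median

# Toy input: transcript -> per-sample CPM values, plus a transcript -> gene map.
cpm = {
    "tx1": [2.0, 3.0, 1.5], "tx2": [1.2, 2.2, 4.0],   # geneA: both pass
    "tx3": [5.0, 6.0, 7.0], "tx4": [0.1, 0.2, 0.0],   # geneB: only one passes
}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB", "tx4": "geneB"}

def isoform_test_set(cpm, tx2gene, threshold=1.0):
    """Transcripts eligible for differential isoform expression testing."""
    passing = {tx for tx, vals in cpm.items() if median(vals) > threshold}
    per_gene = {}
    for tx in passing:
        per_gene.setdefault(tx2gene[tx], []).append(tx)
    # keep only transcripts from genes with >= 2 passing transcripts
    return sorted(tx for txs in per_gene.values() if len(txs) >= 2 for tx in txs)

print(isoform_test_set(cpm, tx2gene))  # ['tx1', 'tx2']
```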

Figures and tables

Figures and tables were generated using customized R (v.4.2.2) scripts and customized Python (v.3.10.8) scripts. We used the following R libraries: tidyverse (v.1.3.2), EnhancedVolcano (v.1.18.0), DESeq2 (v.1.38.3) and ggtranscript 79 (v.0.99.3). We used the following Python libraries: numpy (v.1.24.1), pandas (v.1.5.2), regex (v.2022.10.31), matplotlib (v.3.6.2), seaborn (v.0.12.2), matplotlib_venn (v.0.11.7), wordcloud (v.1.8.2.2), plotly (v.5.11.0) and notebook (v.6.5.2). See ‘Code availability’ for access to the customized scripts used to generate figures and tables.

PCR primer design

We used the extended annotation output by bambu to create a reference transcriptome for primer design. This extended annotation contained information for all transcripts contained in Ensembl v.107 with the addition of all newly discovered transcripts by bambu (without applying a median CPM filter) and the ERCC spike-in transcripts. This annotation was converted into a transcriptome sequence fasta file using gffread (v.0.12.7) and the GRCh38 human reference genome. We used the online National Center for Biotechnology Information (NCBI) primer design tool ( https://www.ncbi.nlm.nih.gov/tools/primer-blast ) to design primers. We utilized default settings for the tool; however, we provided the transcriptome described above as the customized database to check for primer pair specificity. We moved forward with validation only when we could generate a primer pair specific to a single new high-confidence transcript. Detailed information about the primers—including primer sequence—used for gel electrophoresis PCR and RT–qPCR validations can be found in Supplementary Tables 4 and 5 .

PCR and gel electrophoresis validations

New isoform and gene validations were conducted using PCR and gel electrophoresis. For this purpose, 2 μg of RNA was reverse transcribed into cDNA using the High-Capacity cDNA Reverse Transcription kit (Applied Biosystems, cat. no. 4368814) following the published protocol. The resulting cDNA was quantified using a NanoDrop and its quality was assessed using the Agilent Fragment Analyzer 5200 with the DNA (50 kb) kit (Agilent, DNF-467). Next, 500 ng of the cDNA was combined with primers specific to the newly identified isoforms and genes (Supplementary Table 4 ). Amplification was performed using Invitrogen Platinum II Taq Hot-Start DNA Polymerase (Invitrogen, cat. no. 14966-005) in the Applied Biosystems ProFlex PCR system. The specific primer sequences, annealing temperatures and numbers of PCR cycles are detailed in Supplementary Table 4 . After PCR amplification, the resulting products were analyzed on a 1% agarose Tris-acetate-EDTA gel containing 0.5 μg ml −1 of ethidium bromide. The gel was run for 30 min at 125 V and the amplified cDNA was visualized using an ultraviolet light source. Gels from the PCR validation for each transcript can be found in Supplementary Figs. 5 – 26 , 33 and 34 . Some gels contain data from all 12 samples, whereas others contain data from only 8 of the 12 samples because we ran out of brain tissue for 4 of the samples.

RT–qPCR validations

The RT–qPCR assays were performed using the QuantStudio 5 Real-Time PCR System (Applied Biosystems). Amplifications were carried out in 25 μl of reaction solution containing 12.5 μl of 2× PerfeCTa SYBR green SuperMix (Quantabio, cat. no. 95054-500), 1.0 μl of first-strand cDNA, 1 μl of each specific primer (10 mM; Supplementary Table 5 ) and 9.0 μl of ultra-pure, nuclease-free water. RT–qPCR conditions comprised an initial hold stage of 50 °C for 2 min followed by 95 °C for 3 min with a ramp of 1.6 °C s −1 , followed by a PCR stage of 95 °C for 15 s and 60 °C for 60 s, for a total of 50 cycles. The MIQE guidelines in ref. 30 suggest C t  < 40 as a cutoff for RT–qPCR validation, but we used a more stringent cutoff of C t  < 35 to be conservative: we considered a new RNA isoform to be validated by RT–qPCR only if the mean C t value for our samples was <35. We attempted to validate new RNA isoforms through RT–qPCR only if they first failed to be validated through standard PCR and gel electrophoresis, because RT–qPCR is a more sensitive method, allowing us to validate RNA isoforms that are less abundant or harder to amplify through PCR. We performed RT–qPCR using only 8 of the 12 samples included in the present study because we ran out of brain tissue for 4 of the samples.

In addition, we performed quantification of new and known RNA isoforms from the following genes: SLC26A1 , MT-RNR2 and MAOB (Supplementary Tables 6 and 7 ). Following the recommendations in ref. 80 , we used CYC1 as the reference gene for C t  value normalization in our human postmortem brain samples. To allow for comparison between different isoforms from the same gene, we used 2 −Δ Ct as the expression estimate instead of the more common 2 −ΔΔ Ct expression estimate, because 2 −ΔΔ Ct is optimized for comparisons between samples within the same gene/isoform and does not work well for comparisons between different genes/isoforms, whereas 2 −Δ Ct allows such comparisons. RNA isoform relative abundance for RT–qPCR and long-read RNA-seq was calculated as follows:
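
The equation itself did not survive extraction from the source. A plausible reconstruction, inferred only from the surrounding description (2 −Δ Ct as the per-isoform expression estimate for RT–qPCR, CPM for long-read RNA-seq; our inference, not the verbatim formula), is:

```latex
\mathrm{relative\ abundance}_i
  = \frac{2^{-\Delta C_{t,i}}}{\sum_{j \in \mathrm{gene}} 2^{-\Delta C_{t,j}}}
  \quad \text{(RT–qPCR)};
\qquad
\mathrm{relative\ abundance}_i
  = \frac{\mathrm{CPM}_i}{\sum_{j \in \mathrm{gene}} \mathrm{CPM}_j}
  \quad \text{(long-read RNA-seq)}
```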

Proteomics analysis

We utilized publicly available tandem MS data from round 2 of the ROSMAP brain proteomics study, previously analyzed in refs. 22 and 23 . We also utilized publicly available deep tandem MS data from six human cell lines, processed with six different proteases and three tandem MS fragmentation methods, previously analyzed in ref. 24 . As of 2023, this cell-line dataset represents one of the largest human proteomes with the highest sequence coverage reported. We started the analysis by creating a protein database containing the predicted protein sequences from all three reading frames for the 700 new high-confidence RNA isoforms that we discovered, totaling 2,100 protein sequences. We translated each high-confidence RNA isoform in three reading frames using pypGATK 81 v.0.0.23. We also included the protein sequences for known protein-coding transcripts that came from genes represented among the 700 new high-confidence RNA isoforms and had a median CPM > 1 in our RNA-seq data. We used this reference protein fasta file to process the brain and cell-line proteomics data separately using FragPipe 82 , 83 , 84 , 85 , 86 , 87 , 88 v.20.0, a Java-based graphical user interface that facilitates the analysis of MS-based proteomics data by providing a suite of computational tools. Detailed parameters used for running FragPipe can be found on GitHub and Zenodo (‘Code availability’ and ‘Data availability’).
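
Three-frame translation of each isoform can be sketched as follows (a standalone approximation of the step performed with pypGATK, not pypGATK's own code; the standard genetic code is encoded compactly in TCAG codon order):

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

def translate(seq: str) -> str:
    """Translate a DNA sequence, dropping any trailing partial codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(0, len(seq) - len(seq) % 3, 3))

def three_frame_translation(seq: str):
    """Protein sequences from forward-strand reading frames 0, 1 and 2."""
    return [translate(seq[frame:]) for frame in range(3)]

print(three_frame_translation("ATGGCTTGA"))  # ['MA*', 'WL', 'GL']
```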

MS suffers from a similar limitation to short-read RNA-seq: it detects only relatively short peptides that do not cover the entire length of most proteins, which makes it challenging to accurately distinguish RNA isoforms from the same gene. To avoid false discoveries, we considered an RNA isoform validated at the protein level only if it had peptide hits unique to it (that is, not contained in other known human proteins). We started by taking the FragPipe output and keeping only peptide hits that mapped to a single protein in the database. We then ran the sequences of those peptides against the database we provided to FragPipe to confirm that they were truly unique. Surprisingly, a small percentage of peptide hits that FragPipe reported as unique were contained in two or more proteins in our database; these hits were excluded from downstream analysis. We then summed the number of unique peptide spectral counts for every protein coming from a new high-confidence RNA isoform and filtered out any proteins with fewer than six spectral counts. We took the peptide hits for proteins that had more than five spectral counts and used the online protein–protein NCBI BLAST tool (blastp: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins ) 89 to search them against the human RefSeq protein database. We used loose thresholds for our BLAST search to ensure that even short peptide matches would be reported. A detailed description of the BLAST search parameters can be found on Zenodo. Spectral counts coming from peptides that had a BLAST match with 100% query coverage and 100% identity to a known human protein were removed from downstream analysis. We summed the remaining spectral counts by protein ID. Proteins from high-confidence RNA isoforms that had more than five spectral counts after the BLAST search filter were considered validated at the protein level. This process was repeated to separately analyze the brain MS data and the cell-line MS data.
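
The unique-peptide filter and spectral-count threshold can be sketched as follows (the hit tuples mimic, in highly simplified form, filtered FragPipe output; all names and values here are hypothetical):

```python
from collections import defaultdict

# (peptide sequence, set of database proteins it maps to, spectral count)
hits = [
    ("PEPTIDEA", {"novel_iso1"}, 4),
    ("PEPTIDEB", {"novel_iso1"}, 3),
    ("PEPTIDEC", {"novel_iso1", "known_prot"}, 10),  # not unique: dropped
    ("PEPTIDED", {"novel_iso2"}, 2),
]

def validated_proteins(hits, min_spectral_counts=6):
    """Keep peptides unique to one database protein, sum spectral counts per
    protein, and keep proteins meeting the threshold (>5 counts in the paper)."""
    totals = defaultdict(int)
    for _, proteins, count in hits:
        if len(proteins) == 1:  # unique to a single database protein
            totals[next(iter(proteins))] += count
    return {p for p, c in totals.items() if c >= min_spectral_counts}

print(validated_proteins(hits))  # {'novel_iso1'}  (4 + 3 = 7 counts)
```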

Rigor and reproducibility

The present study was done under the ethics oversight of the University of Kentucky Institutional Review Board. Read preprocessing, alignment, filtering, transcriptome quantification and discovery, and quality control steps for Nanopore and Illumina data were implemented using customized NextFlow pipelines. NextFlow enables scalable and reproducible scientific workflows using software containers 90 . We used NextFlow v.23.04.1.5866. Singularity containers were used for most of the analyses in the present study, except for website creation and proteomics analysis, owing to feasibility issues. Singularity enables the creation and use of containers that package up pieces of software in a way that is portable and reproducible 91 . We used Singularity v.3.8.0-1.el8. Instructions on how to access the Singularity containers can be found in the GitHub repository for this project. Any changes to standard manufacturer protocols are detailed in Methods . All code used for analysis in this article is publicly available on GitHub. All raw data, output from the long-read RNA-seq and proteomics pipelines, references and annotations are publicly available. Long-read RNA-seq results from this article can be easily visualized through this web application: https://ebbertlab.com/brain_rna_isoform_seq.html .

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Raw long-read RNA-seq data generated and utilized in the present study are publicly available in Synapse 92 : https://www.synapse.org/#!Synapse:syn52047893 , and in the NIH Sequence Read Archive (SRA; accession no. SRP456327 ) 93 : https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP456327 . Output from the long-read RNA-seq and proteomics pipelines, reference files and annotations are publicly available on Zenodo 94 : https://doi.org/10.5281/zenodo.8180677 . Long-read RNA-seq results from this article can be easily visualized through this web application: https://ebbertlab.com/brain_rna_isoform_seq.html . Raw cell-line deep proteomics data utilized in this article are publicly available at https://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD024364 . Raw brain proteomics data from round 2 of the ROSMAP TMT study are publicly available at https://www.synapse.org/#!Synapse:syn17015098 . GTEx long-read RNA-seq data used for validation of our study results are available at https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_GTEx_V9_hg38 . ROSMAP short-read RNA-seq data used for validation of our study results are available at https://www.synapse.org/#!Synapse:syn21589959 . The CHM13 reference genome sequence can be found at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz and the CHM13 reference GFF3 annotation at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v2.0.gene_annotation.gff3 . The transcript annotation from Glinos et al. 19 was retrieved from https://storage.googleapis.com/gtex_analysis_v9/long_read_data/flair_filter_transcripts.gtf.gz . The transcript annotation from Leung et al. 20 was retrieved from https://zenodo.org/record/7611814/preview/Cupcake_collapse.zip#tree_item12/HumanCTX.collapsed.gff .

Code availability

All code used in the manuscript is publicly available at https://github.com/UK-SBCoA-EbbertLab/brain_cDNA_discovery (ref. 95 ).

Park, E., Pan, Z., Zhang, Z., Lin, L. & Xing, Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 102 , 11–26 (2018).


Martin, F. J. et al. Ensembl 2023. Nucleic Acids Res. 51 , D933–D941 (2023).


Yang, X. et al. Widespread expansion of protein interaction capabilities by alternative splicing. Cell 164 , 805–817 (2016).

Oberwinkler, J., Lis, A., Giehl, K. M., Flockerzi, V. & Philipp, S. E. Alternative splicing switches the divalent cation selectivity of TRPM3 channels. J. Biol. Chem. 280 , 22540–22548 (2005).

Végran, F. et al. Overexpression of caspase-3s splice variant in locally advanced breast carcinoma is associated with poor response to neoadjuvant chemotherapy. Clin. Cancer Res. 12 , 5794–5800 (2006).


Warren, C. F. A., Wong-Brown, M. W. & Bowden, N. A. BCL-2 family isoforms in apoptosis and cancer. Cell Death Dis. 10 , 177 (2019).


Dou, Z. et al. Aberrant Bcl-x splicing in cancer: from molecular mechanism to therapeutic modulation. J. Exp. Clin. Cancer Res. 40 , 194 (2021).

Vitting-Seerup, K. & Sandelin, A. The landscape of isoform switches in human cancers. Mol. Cancer Res. 15 , 1206–1220 (2017).

Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4 , 1521 (2015).

Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28 , 511–515 (2010).

Tilgner, H. et al. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events. Nat. Biotechnol. 33 , 736–742 (2015).

Ringeling, F. R. et al. Partitioning RNAs by length improves transcriptome reconstruction from short-read RNA-seq data. Nat. Biotechnol. 40 , 741–750 (2022).

Evaluating long-read RNA-sequencing analysis tools with in silico mixtures. Nat. Methods 20 , 1643–1644 (2023).

Chen, Y. et al. Context-aware transcript quantification from long-read RNA-seq data with Bambu. Nat. Methods https://doi.org/10.1038/s41592-023-01908-w (2023).

Course, M. M. et al. Aberrant splicing of PSEN2, but not PSEN1, in individuals with sporadic Alzheimer’s disease. Brain J. Neurol. 146 , 507–518 (2023).


Okubo, M. et al. RNA-seq analysis, targeted long-read sequencing and in silico prediction to unravel pathogenic intronic events and complicated splicing abnormalities in dystrophinopathy. Hum. Genet. 142 , 59–71 (2023).

Liu, M. et al. Long-read sequencing reveals oncogenic mechanism of HPV-human fusion transcripts in cervical cancer. Transl. Res. J. Lab. Clin. Med. 253 , 80–94 (2023).


Schwenk, V. et al. Transcript capture and ultradeep long-read RNA sequencing (CAPLRseq) to diagnose HNPCC/Lynch syndrome. J. Med. Genet. 60 , 747–759 (2023).

Glinos, D. A. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 608 , 353–359 (2022).

Leung, S. K. et al. Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing. Cell Rep. 37 , 110022 (2021).

Tilgner, H. et al. Microfluidic isoform sequencing shows widespread splicing coordination in the human transcriptome. Genome Res. 28 , 231–242 (2018).

Johnson, E. C. B. et al. Large-scale deep multi-layer analysis of Alzheimer’s disease brain reveals strong proteomic disease-related changes not observed at the RNA level. Nat. Neurosci. 25 , 213–225 (2022).

Higginbotham, L. et al. Unbiased classification of the elderly human brain proteome resolves distinct clinical and pathophysiological subtypes of cognitive impairment. Neurobiol. Dis. 186 , 106286 (2023).

Sinitcyn, P. et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat. Biotechnol . https://doi.org/10.1038/s41587-023-01714-x (2023).

Mostafavi, S. et al. A molecular network of the aging human brain provides insights into the pathology and cognitive decline of Alzheimer’s disease. Nat. Neurosci. 21 , 811–819 (2018).

De Jager, P. L. et al. A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Sci. Data 5 , 180142 (2018).

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification. Preprint at bioRxiv https://doi.org/10.1101/2023.07.25.550582 (2023).

Tang, A. D. et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat. Commun. 11 , 1438 (2020).

Tseng, E. et al. cDNA Cupcake. GitHub https://github.com/Magdoll/cDNA_Cupcake (2023).

Bustin, S. A. et al. The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments. Clin. Chem. 55 , 611–622 (2009).

Nurk, S. et al. The complete sequence of a human genome. Science 376 , 44–53 (2022).

Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40 , 672–680 (2022).

Singh, T. et al. Rare coding variants in ten genes confer substantial risk for schizophrenia. Nature 604 , 509–516 (2022).

Palmer, D. S. et al. Exome sequencing in bipolar disorder identifies AKAP11 as a risk gene shared with schizophrenia. Nat. Genet. 54 , 541–547 (2022).

Billingsley, K. J., Bandres-Ciga, S., Saez-Atienzar, S. & Singleton, A. B. Genetic risk factors in Parkinson’s disease. Cell Tissue Res. 373 , 9–20 (2018).

Perrone, F., Cacace, R., van der Zee, J. & Van Broeckhoven, C. Emerging genetic complexity and rare genetic variants in neurodegenerative brain diseases. Genome Med. 13 , 59 (2021).

Shadrina, M., Bondarenko, E. A. & Slominsky, P. A. Genetics factors in major depression disease. Front. Psychiatry 9 , 334 (2018).

Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180 , 568–584.e23 (2020).

Stein, M. B. et al. Genome-wide association analyses of post-traumatic stress disorder and its symptom subdomains in the Million Veteran Program. Nat. Genet. 53 , 174–184 (2021).

Maihofer, A. X. et al. Enhancing discovery of genetic variants for posttraumatic stress disorder through integration of quantitative phenotypes and trauma exposure information. Biol. Psychiatry 91 , 626–636 (2022).

Hatoum, A. S. et al. Multivariate genome-wide association meta-analysis of over 1 million subjects identifies loci underlying multiple substance use disorders. Nat. Ment. Health 1 , 210–223 (2023).

Bellenguez, C. et al. New insights into the genetic etiology of Alzheimer’s disease and related dementias. Nat. Genet. 54 , 412–436 (2022).

Gee, H. Y. et al. Mutations in SLC26A1 cause nephrolithiasis. Am. J. Hum. Genet. 98 , 1228–1234 (2016).

Pfau, A. et al. SLC26A1 is a major determinant of sulfate homeostasis in humans. J. Clin. Invest. 133 , e161849 (2023).

Parvari, R. et al. A recessive contiguous gene deletion of chromosome 2p16 associated with cystinuria and a mitochondrial disease. Am. J. Hum. Genet. 69 , 869–875 (2001).

Shaheen, R. et al. Mutation in WDR4 impairs tRNA m(7)G46 methylation and causes a distinct form of microcephalic primordial dwarfism. Genome Biol. 16 , 210 (2015).

Braun, D. A. et al. Mutations in WDR4 as a new cause of Galloway–Mowat syndrome. Am. J. Med. Genet. A 176 , 2460–2465 (2018).

Gilbody, S., Lewis, S. & Lightfoot, T. Methylenetetrahydrofolate reductase (MTHFR) genetic polymorphisms and psychiatric disorders: a HuGE review. Am. J. Epidemiol. 165 , 1–13 (2007).

Lee, H. J. et al. Association study of polymorphisms in synaptic vesicle-associated genes, SYN2 and CPLX2, with schizophrenia. Behav. Brain Funct. 1 , 15 (2005).

Tan, Y.-Y., Jenner, P. & Chen, S.-D. Monoamine oxidase-B inhibitors for the treatment of Parkinson’s disease: past, present, and future. J. Park. Dis. 12 , 477–493 (2022).

Guerreiro, R. et al. TREM2 variants in Alzheimer’s disease. N. Engl. J. Med. 368 , 117–127 (2013).

Kiianitsa, K. et al. Novel TREM2 splicing isoform that lacks the V-set immunoglobulin domain is abundant in the human brain. J. Leukoc. Biol. 110 , 829–837 (2021).

Shaw, B. C. et al. An alternatively spliced TREM2 isoform lacking the ligand binding domain is expressed in human brain. J. Alzheimers Dis. 87 , 1647–1657 (2022).

Tsegay, P. S. et al. Incorporation of 5′,8-cyclo-2′-deoxyadenosines by DNA repair polymerases via base excision repair. DNA Repair 109 , 103258 (2022).

Kaufman, B. A. & Van Houten, B. POLB: a new role of DNA polymerase beta in mitochondrial base excision repair. DNA Repair 60 , A1–A5 (2017).

Butchbach, M. E. R. Genomic variability in the survival motor neuron genes (SMN1 and SMN2): implications for spinal muscular atrophy phenotype and therapeutics development. Int. J. Mol. Sci. 22 , 7896 (2021).

Guo, B. et al. Humanin peptide suppresses apoptosis by interfering with Bax activation. Nature 423 , 456–461 (2003).

Herai, R. H., Negraes, P. D. & Muotri, A. R. Evidence of nuclei-encoded spliceosome mediating splicing of mitochondrial RNA. Hum. Mol. Genet. 26 , 2472–2479 (2017).

Rahman, S. Mitochondrial disease and epilepsy. Dev. Med. Child Neurol. 54 , 397–406 (2012).

Delatycki, M. B. & Bidichandani, S. I. Friedreich ataxia- pathogenesis and implications for therapies. Neurobiol. Dis. 132 , 104606 (2019).

Lin, M. T. & Beal, M. F. Mitochondrial dysfunction and oxidative stress in neurodegenerative diseases. Nature 443 , 787–795 (2006).

Amorim, J. A. et al. Mitochondrial and metabolic dysfunction in ageing and age-related diseases. Nat. Rev. Endocrinol. 18 , 243–258 (2022).

Sen, P. et al. Spurious intragenic transcription is a feature of mammalian cellular senescence and tissue aging. Nat. Aging 3 , 402–417 (2023).

Goedert, M., Wischik, C. M., Crowther, R. A., Walker, J. E. & Klug, A. Cloning and sequencing of the cDNA encoding a core protein of the paired helical filament of Alzheimer disease: identification as the microtubule-associated protein tau. Proc. Natl Acad. Sci. USA 85 , 4051–4055 (1988).

Goedert, M., Spillantini, M. G., Potier, M. C., Ulrich, J. & Crowther, R. A. Cloning and sequencing of the cDNA encoding an isoform of microtubule-associated protein tau containing four tandem repeats: differential expression of tau protein mRNAs in human brain. EMBO J. 8 , 393–399 (1989).

Andreadis, A., Brown, W. M. & Kosik, K. S. Structure and novel exons of the human tau gene. Biochemistry 31 , 10626–10633 (1992).

Schmitt, F. A. et al. University of Kentucky Sanders-Brown healthy brain aging volunteers: donor characteristics, procedures and neuropathology. Curr. Alzheimer Res. 9 , 724–733 (2012).

Sipos, B. et al. epi2me-labs/pychopper: cDNA read preprocessing. GitHub https://github.com/epi2me-labs/pychopper (2023).

Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34 , 3094–3100 (2018).

Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25 , 2078–2079 (2009).

Leger, A. & Leonardi, T. pycoQC, interactive quality control for Oxford Nanopore Sequencing. J. Open Source Softw. 4 , 1236 (2019).

Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29 , 15–21 (2013).

Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon: fast and bias-aware quantification of transcript expression using dual-phase inference. Nat. Methods 14 , 417–419 (2017).

Bailey, T. L. et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37 , W202–W208 (2009).

Roca, X., Sachidanandam, R. & Krainer, A. R. Determinants of the inherent strength of human 5′ splice sites. RNA 11 , 683–698 (2005).

Carranza, F., Shenasa, H. & Hertel, K. J. Splice site proximity influences alternative exon definition. RNA Biol. 19 , 829–840 (2022).

Pertea, G. & Pertea, M. GFF Utilities: GffRead and GffCompare. F1000Research 9 , 304 (2020).

Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15 , 550 (2014).

Gustavsson, E. K., Zhang, D., Reynolds, R. H., Garcia-Ruiz, S. & Ryten, M. ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2. Bioinformatics 38 , 3844–3846 (2022).

Penna, I. et al. Selection of candidate housekeeping genes for normalization in human postmortem brain samples. Int. J. Mol. Sci. 12 , 5461–5470 (2011).

Perez-Riverol, Y. et al. ProteoGenomics Analysis Toolkit. https://pgatk.readthedocs.io/en/latest/ (2023).

Yu, F. et al. FragPipe. https://fragpipe.nesvilab.org/ (2023).

Chang, H.-Y. et al. Crystal-C: a computational tool for refinement of open search results. J. Proteome Res. 19 , 2511–2515 (2020).

Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat. Methods 14 , 513–520 (2017).

da Veiga Leprevost, F. et al. Philosopher: a versatile toolkit for shotgun proteomics data analysis. Nat. Methods 17 , 869–870 (2020).

Yu, F., Haynes, S. E. & Nesvizhskii, A. I. IonQuant enables accurate and sensitive label-free quantification with FDR-controlled match-between-runs. Mol. Cell. Proteomics 20 , 100077 (2021).

Teo, G. C., Polasky, D. A., Yu, F. & Nesvizhskii, A. I. Fast deisotoping algorithm and its implementation in the MSFragger search engine. J. Proteome Res. 20 , 498–505 (2021).

Tsou, C.-C. et al. DIA-Umpire: comprehensive computational framework for data-independent acquisition proteomics. Nat. Methods 12 , 258–264 (2015).

Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35 , 1026–1028 (2017).

Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35 , 316–319 (2017).

Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: scientific containers for mobility of compute. PLoS ONE 12 , e0177459 (2017).

Heberle, B. A. et al. Ebbert ebbert_lab_brain_long_read_cDNA_discovery_project. Synapse synapse.org/#!Synapse:syn52047893 (2023).

Heberle, B. A. et al. Ebbert ebbert_lab_brain_long_read_cDNA_discovery_project. Sequence Read Archive (SRA) https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP456327 (2023).

Heberle, B. A. et al. Ebbert Lab Nanopore PCS111 brain cDNA discovery (12 samples—AD vs controls). Zenodo https://doi.org/10.5281/zenodo.8180677 (2023).

Heberle, B. A. et al. Brain cDNA Discovery. GitHub https://github.com/UK-SBCoA-EbbertLab/brain_cDNA_discovery (2023).


Acknowledgements

This work was supported by: the National Institutes of Health (NIH; grant nos. R35GM138636, R01AG068331 to M.T.W.E. and 5R50CA243890 to S.G.); the BrightFocus Foundation (grant no. A2020161S to M.T.W.E.), Alzheimer’s Association (grant no. 2019-AARG-644082 to M.T.W.E.), PhRMA Foundation (grant no. RSGTMT17 to M.T.W.E.); the Ed and Ethel Moore Alzheimer’s Disease Research Program of Florida Department of Health (grant nos. 8AZ10 and 9AZ08 to M.T.W.E. and 6AZ06 to J.D.F.); and the Muscular Dystrophy Association (to M.T.W.E.). We appreciate the contributions of the Sanders-Brown Center on Aging at the University of Kentucky. We are deeply grateful to the research participants and their families who made this research possible. We thank S. L. Anderson from the University of Kentucky brain bank for preparing the brain samples used in the present study. We thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and use of the Morgan Compute Cluster and associated research computing resources. We thank Singularity Sylabs for providing support and extra cloud storage for our software containers. We are grateful for support from the Goeke lab members who quickly and thoroughly answered our numerous questions about bambu on GitHub. We thank T. Wendt Viola, R. Grassi-Oliveira and C. Walss-Bass for guidance and help in the early stages of the proteomics analysis. We thank the reviewers for their sincere and meaningful contributions to improving the quality of the manuscript. The results published in the present study are in part based on data obtained from the AD Knowledge Portal. Short-read RNA-seq data used for cross-validation of results in the present study were provided by the Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago. Rush Alzheimer’s Disease Center data collection was supported through funding by the National Institute on Aging (grant nos.
P30AG10161 (ROS), R01AG15819 (ROSMAP; genomics and RNA-seq), R01AG17917 (MAP) and RC2AG0365 (RNA-seq)).

Author information

These authors contributed equally: Bernardo Aguzzoli Heberle, J. Anthony Brandon.

Authors and Affiliations

Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY, USA

Bernardo Aguzzoli Heberle, J. Anthony Brandon, Madeline L. Page, Kayla A. Nations, Ketsile I. Dikobe, Brendan J. White, Lacey A. Gordon, Grant A. Fox, Mark E. Wadsworth, Patricia H. Doyle, Justin B. Miller, Peter T. Nelson & Mark T. W. Ebbert

Department of Neuroscience, College of Medicine, University of Kentucky, Lexington, KY, USA

Bernardo Aguzzoli Heberle, Grant A. Fox, Patricia H. Doyle & Mark T. W. Ebbert

Department of Pharmacology and Nutritional Sciences, College of Medicine, University of Kentucky, Lexington, KY, USA

Brittney A. Williams

Department of Biochemistry, Emory University School of Medicine, Atlanta, GA, USA

Edward J. Fox & Nicholas T. Seyfried

Department of Neurology, Emory University School of Medicine, Atlanta, GA, USA

Anantharaman Shantaraman

UK Dementia Research Institute at The University of Cambridge, Cambridge, UK

Department of Clinical Neurosciences, School of Clinical Medicine, University of Cambridge, Cambridge, UK

Department of Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London, UK

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA

Sara Goodwin, Elena Ghiban, Robert Wappel & Senem Mavruk-Eskipehlivan

Division of Biomedical Informatics, Internal Medicine, College of Medicine, University of Kentucky, Lexington, KY, USA

Justin B. Miller & Mark T. W. Ebbert

Department of Pathology and Laboratory Medicine, University of Kentucky, Lexington, KY, USA

Justin B. Miller

Microbiology, Immunology and Molecular Genetics, College of Medicine, University of Kentucky, Lexington, KY, USA

Department of Neuroscience, Mayo Clinic, Scottsdale, AZ, USA

John D. Fryer


Contributions

B.A.H., J.A.B. and M.T.W.E. developed and designed the study and wrote the paper. B.A.H., M.L.P., B.A.W., B.J.W., K.I.D., M.E.W., E.J.F. and A.S. performed all analyses. M.L.P. developed the RShiny app. K.I.D. embedded the RShiny app into ebbertlab.com. J.A.B., K.A.N., L.A.G., G.A.F., P.H.D., S.G., E.G., R.W. and S.M.-E. helped generate sequencing and supporting data. N.T.S., E.J.F. and A.S. generated and advised on proteomics analyses. P.T.N. provided the invaluable brain samples and pathology. J.D.F., M.R. and J.B.M. provided important intellectual contributions.

Corresponding author

Correspondence to Mark T. W. Ebbert.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Biotechnology thanks Stefan Canzar, Sandra T. Cooper and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Basic sequencing metrics.

AD = Alzheimer’s disease cases, CT = Cognitively unimpaired aged controls. a, Number of reads per sample after each step of the analysis. All downstream analyses were done with Mapped pass reads with both primers and MAPQ > 10. b, N50 and median read length for Mapped pass reads with both primers and MAPQ > 10. c, Percentage of reads that are full-length or unique as determined by bambu. Full-length counts = reads containing all exon-exon boundaries (that is, intron chain) from its respective transcript. Unique counts = reads that were assigned to a single transcript. All boxplots in this panel come from n = 12 biologically independent samples. Male AD n = 3, Female AD n = 3, Male CT n = 3, Female CT n = 3. All boxplots in this panel follow this format: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range.

Extended Data Fig. 2 Expression distribution and diversity for genes and transcripts.

a, Number of genes and transcripts represented across median CPM thresholds. Cutoff shown as the dotted line set at median CPM = 1. b, Distribution of log10 median CPM values for gene bodies; dotted line shows the cutoff point of median CPM = 1. c, Same as b, but for transcripts.
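The median CPM = 1 cutoff used throughout these panels can be sketched as follows. This is a minimal illustration with made-up counts, not the study's actual pipeline (which derives expression estimates with bambu):

```python
import numpy as np

# Toy count matrix (hypothetical numbers, not the study's data):
# rows = transcripts, columns = sequenced samples.
counts = np.array([
    [1200,  900, 1500],  # transcript A
    [   0,    1,    0],  # transcript B (barely detected)
    [5000, 4000, 4500],  # transcript C
])

def cpm(counts):
    """Counts-per-million: scale each sample (column) so its counts sum to 1e6."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0) * 1e6

cpm_matrix = cpm(counts)
median_cpm = np.median(cpm_matrix, axis=1)  # per-transcript median across samples

# Keep only transcripts whose median CPM exceeds 1 (the dotted-line cutoff).
high_confidence = median_cpm > 1
```

In real data the per-sample scaling factor would be the total read count over all transcripts in that sample's library, not just the few rows shown here.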

Extended Data Fig. 3 Expression of different transcript biotypes in aged human frontal cortex tissue using long-read RNAseq data.

a, Lineplot showing the number of transcripts from different biotypes expressed above different median CPM thresholds in long-read RNAseq data from aged human dorsolateral prefrontal cortex postmortem tissue. b, Barplot showing the number of transcripts from different biotypes expressed at or above different median CPM thresholds in long-read RNAseq data from aged human dorsolateral prefrontal cortex postmortem tissue.

Extended Data Fig. 4 Number of newly discovered transcripts across subsampling range.

a, Barplot showing the subsampling percentage on the Y-axis and the number of new transcripts discovered with Bambu without filtering by expression estimates (no filter) on the X-axis. b, Barplot showing the subsampling percentage on the Y-axis and the number of new transcripts discovered with Bambu when filtering by expression estimates (high-confidence; median CPM > 1) on the X-axis. Nuclear-encoded transcripts were filtered by median CPM > 1 and mitochondrially encoded transcripts were filtered by median full-length counts > 40. We used a different filter for mitochondrial transcripts because of read-assignment issues arising from the polycistronic nature of mitochondrial transcription. The decline in identified new transcripts at lower sequencing depths was mostly due to Bambu’s filtering criteria, which demand enough evidence from unique and full-length reads to call a new transcript.

Extended Data Fig. 5 Difference in transcript discovery overlap based on annotation and computational tool used.

a, Venn diagram showing the overlap between all our new transcripts from known gene bodies and new transcripts from known gene bodies in the original GTEx long-read RNAseq article published by Glinos et al. 20 using FLAIR for transcript discovery and ENSEMBL 88 annotation. b, Same as a but showing the comparison only for new high-confidence transcripts from known gene bodies in our data. We used 70,000 as the number of new transcripts from known gene bodies in GTEx since they report just over 70,000 novel transcripts for annotated genes in their abstract. c, Venn diagram showing the overlap between all our new transcripts from known gene bodies and new transcripts from known gene bodies found when running GTEx long-read RNAseq data from the article published by Glinos et al. 20 using bambu for transcript discovery and ENSEMBL 107 annotation. d, Same as c but showing the comparison only for new high-confidence transcripts from known gene bodies in our data. We analyzed data from all tissue types from the original Glinos et al. article to ensure consistency between our approaches. The discovery of new isoforms unique to GTEx when using the identical pipeline and annotations from our study likely results from tissue-specific isoforms that do not occur in the brain. Venn diagrams are not to scale to improve readability.

Extended Data Fig. 6 RT-qPCR validations: new RNA isoforms from MAOB, SLC26A1 and MT-RNR2 match long-read sequencing data.

a, Comparison of relative abundance between long-read sequencing and RT-qPCR for RNA isoforms in MAOB. b, Same as a, but for MT-RNR2. c, Same as a, but for SLC26A1. Relative abundance was calculated as: relative abundance = (expression estimate for a given RNA isoform / Σ(expression estimates for RNA isoforms from the given gene)) × 100. We used CPM (counts per million) as the expression estimate for long-read sequencing and 2^(-ΔCt) for RT-qPCR. We used 2^(-ΔCt) as the expression estimate instead of the more common 2^(-ΔΔCt) because 2^(-ΔΔCt) is optimized for comparisons between samples within the same gene/isoform and does not work well for comparisons between genes/isoforms, whereas 2^(-ΔCt) allows comparison between different genes/isoforms. The housekeeping gene for RT-qPCR was CYC1. For all figures in this panel, the data labeled as technology long-reads come from n = 12 biologically independent samples while the data labeled as technology RT-qPCR come from n = 8 biologically independent samples. The eight RT-qPCR samples are a subset of the 12 long-read samples. We only used eight samples for RT-qPCR because we ran out of brain tissue for four of our samples. All boxplots in this panel follow this format: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range.
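A minimal sketch of the relative-abundance calculation for the RT-qPCR side, using invented ΔCt values rather than the study's measurements (the 2^(-ΔCt) transform puts Ct values on a linear expression scale):

```python
import numpy as np

def relative_abundance(expression):
    """Share (%) of each isoform within its gene: isoform expression divided
    by the summed expression of all isoforms of that gene, times 100."""
    expression = np.asarray(expression, dtype=float)
    return expression / expression.sum() * 100

# Hypothetical delta-Ct values (isoform Ct minus housekeeping-gene Ct)
# for two isoforms of one gene; illustrative only, not measured data.
delta_ct = np.array([6.0, 8.0])
expr = 2.0 ** (-delta_ct)          # 2^(-dCt) expression estimates

abund = relative_abundance(expr)   # percentages summing to 100
```

Because each isoform is normalized only against the other isoforms of the same gene, the long-read (CPM) and RT-qPCR (2^(-ΔCt)) estimates become directly comparable despite their different absolute scales.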

Extended Data Fig. 7 External validation of new high-confidence transcripts using publicly available data from 5 GTEx brain samples (Brodmann area 9) sequenced with long-read RNAseq and 251 ROSMAP brain samples (Brodmann area 9/46) sequenced with Illumina 150 bp paired-end RNAseq reads.

a, Histogram showing total unique counts for new high-confidence transcripts across long-read RNAseq data from five GTEx brain samples. Total unique counts are shown on a log2(total unique counts + 1) scale to avoid stretching generated by outliers. b, Barplot showing the number of new high-confidence transcripts that meet different total unique counts thresholds in cross-validation using long-read RNAseq data from five GTEx brain samples. The ‘≥ 0’ Y-axis label shows the total number of high-confidence transcripts before any filtering. Legend colors: new from known denotes new transcripts from known gene bodies, new from new denotes new transcripts from newly discovered gene bodies, and new from mito denotes new mitochondrially encoded spliced transcripts. c, Same as a but for 251 ROSMAP brain samples sequenced with 150 bp paired-end Illumina RNAseq. d, Same as b but for 251 ROSMAP brain samples sequenced with 150 bp paired-end Illumina RNAseq. We observed that 98.8% of the new high-confidence transcripts from known gene bodies had at least one uniquely mapped read in either GTEx or ROSMAP data and 69.6% had at least 100 uniquely mapped reads in either dataset. Over 94.4% of the new high-confidence transcripts from new gene bodies had at least one uniquely mapped read in either GTEx or ROSMAP data and over 44.2% had at least 100 uniquely mapped reads in either dataset.

Extended Data Fig. 8 Expression of 197 transcripts from the extra 99 predicted protein-coding genes in CHM13 reported by Nurk et al.

a, Lineplot with the number of transcripts from the extra 99 protein-coding genes that are expressed across the total counts threshold for our 12 brain samples. The red line indicates all counts (including partial assignments), the mint green line indicates full-length reads and the purple line indicates unique reads. b, Barplot showing the number of transcripts from the extra 99 protein-coding genes expressed at or above different counts thresholds. The top Y-axis label shows all 197 annotated RNA isoforms from the extra 99 predicted protein-coding genes in CHM13 reported by Nurk et al.

Extended Data Fig. 9 Attempt at validation of TNFSF12 RNA isoform expression pattern in healthy controls.

a, Boxplot showing the relative transcript abundance (percentage) for TNFSF12 RNA isoforms that are differentially expressed between Alzheimer’s disease cases and controls in this study. On the X-axis, the ‘OURS AD’ label represents data from six (n = 6) biologically independent Alzheimer’s disease brain samples sequenced in this study. The ‘OURS CT’ label represents data from six (n = 6) biologically independent cognitively unimpaired aged control brain samples sequenced in this study. The ‘GTEx CT’ label represents data from five (n = 5) biologically independent GTEx brain samples (Brodmann area 9) sequenced with PCR-amplified long-read nanopore RNAseq by Glinos et al. b, Boxplot showing the CPM for TNFSF12 RNA isoforms that are differentially expressed between Alzheimer’s disease cases and controls in this study. X-axis labels follow the same pattern and represent the same groups as in a. All boxplots in this panel follow this format: center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range.

Extended Data Fig. 10 Percentage of unique and full-length reads per transcript.

a, Scatterplot showing the percentage of uniquely aligned reads for each transcript with a median CPM > 1 on the X-axis and the log10 transcript length on the Y-axis. b, Scatterplot showing the percentage of full-length reads for each transcript with a median CPM > 1 on the X-axis and the log10 transcript length on the Y-axis. c, Violin plot showing the percentage of uniquely aligned reads for each transcript with median CPM > 1 on the Y-axis and the number of annotated transcripts per gene on the X-axis. d, Violin plot showing the percentage of full-length reads for each transcript with median CPM > 1 on the Y-axis and the number of annotated transcripts per gene on the X-axis. The percentage of full-length reads is more affected by increases in transcript length, whereas the percentage of unique reads is more affected by increases in the number of annotated transcripts for a given gene.

Supplementary information

Supplementary information.

Supplementary Figs. 1–61 and Lexogen SPLIT RNA extraction kit user guide.

Reporting Summary

Supplementary Tables 1–10.

This Excel workbook contains 10 sheets, Supplementary Tables 1–10. Supplementary Table 1: Sample characteristics and sequencing information. Supplementary Table 2: Overlap between high-confidence transcripts discovered in our study and transcripts discovered in refs. 19 and 20. Supplementary Table 3: Summary statistics for the overlap between high-confidence transcripts discovered in our study and transcripts discovered in refs. 19 and 20. Supplementary Table 4: Information for gel electrophoresis PCR experiments. Supplementary Table 5: Information for three batches of RT–qPCR experiments. Supplementary Table 6: Quantification of new and known RNA isoforms from MAOB, SLC26A1 and MT-RNR2 using RT–qPCR. Supplementary Table 7: Summary statistics from quantification of new and known RNA isoforms from MAOB, SLC26A1 and MT-RNR2 using RT–qPCR. Supplementary Table 8: Number of counts from ROSMAP and GTEx data that were uniquely aligned to new transcripts discovered in our study. Supplementary Table 9: Gene-level differential expression results between AD cases (n = 6) and cognitively unimpaired controls (n = 6). We used the DESeq2 R package with a two-sided Wald test for statistical comparisons and Benjamini–Hochberg correction for multiple comparisons in this differential expression analysis. Raw P values are in the ‘pvalue’ column and Benjamini–Hochberg FDR-adjusted P values are in the ‘padj’ column. Supplementary Table 10: Transcript-level differential expression results between AD cases (n = 6) and cognitively unimpaired controls (n = 6). We used the DESeq2 R package with a two-sided Wald test for statistical comparisons and Benjamini–Hochberg correction for multiple comparisons in this differential expression analysis. Raw P values are in the ‘pvalue’ column and Benjamini–Hochberg FDR-adjusted P values are in the ‘padj’ column.
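The Benjamini–Hochberg FDR correction behind the adjusted P values can be illustrated with a small stand-alone implementation; DESeq2 performs this step internally, and the example P values below are invented:

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjustment: scale the i-th smallest P value by n/i,
    then take a running minimum from the largest rank down so adjusted
    values stay monotone, capping everything at 1."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Running minimum from the right enforces monotonicity of adjusted values.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.clip(monotone, 0.0, 1.0)
    return adjusted

pvals = np.array([0.001, 0.02, 0.03, 0.5])
padj = bh_adjust(pvals)  # array([0.004, 0.04, 0.04, 0.5])
```

Note how the third P value (0.03) inherits the adjusted value of the second (0.04) through the monotonicity step, which is why adjacent ranks can share an identical FDR-adjusted value.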

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article.

Aguzzoli Heberle, B., Brandon, J.A., Page, M.L. et al. Mapping medically relevant RNA isoform diversity in the aged human frontal cortex with deep long-read RNA-seq. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02245-9


Received : 06 August 2023

Accepted : 15 April 2024

Published : 22 May 2024

DOI : https://doi.org/10.1038/s41587-024-02245-9




MIT News | Massachusetts Institute of Technology


Noubar Afeyan PhD ’87 gives new MIT graduates a special assignment


Noubar Afeyan speaks at a podium with the MIT seal on the front. Faculty and administrators in academic regalia are seated next to him.


Biotechnology leader Noubar Afeyan PhD ’87 urged the MIT Class of 2024 to “accept impossible missions” for the betterment of the world, in a rousing keynote speech at the OneMIT Commencement ceremony this afternoon.

Afeyan is chair and co-founder of the biotechnology firm Moderna, whose groundbreaking Covid-19 vaccine has been distributed to billions of people in over 70 countries. In his remarks, Afeyan briefly discussed Moderna’s rapid development of the vaccine but focused the majority of his thoughts on this year’s graduating class — while using the “Mission: Impossible” television show and movies, a childhood favorite of his, as a motif.

“What I do want to talk about is what it takes to accept your own impossible missions and why you, as graduates of MIT, are uniquely prepared to do so,” Afeyan said. “Uniquely prepared — and also obligated. At a time when the world is beset by crises, your mission is nothing less than to salvage what seems lost, reverse what seems inevitable, and save the planet. And just like the agents in the movies, you need to accept the mission — even if it seems impossible.”

Afeyan spoke before an audience of thousands on MIT’s Killian Court, where graduates gathered in attendance along with family, friends, and MIT community members, during an afternoon of brightening weather that followed morning rain.


“Welcome long odds,” Afeyan told the graduates. “Embrace uncertainty, and lead with imagination.”

Afeyan’s speech was followed by an address from MIT President Sally Kornbluth, who described the Institute’s graduating class as a “natural wonder,” in a portion of her remarks directed to family and friends.

“You know how delightful and inspiring and thoughtful they are,” Kornbluth said of this year’s graduates. “It has been our privilege to teach them, and to learn together with them. And we share with you the highest hopes for what they will do next.”

The OneMIT Commencement ceremony is an Institute-wide event serving as a focal point for three days of graduation activities, from May 29 through May 31.

A group of graduates wearing caps and gowns cheer on Killian Court.


MIT’s Class of 2024 encompasses 3,666 students, earning a total of 1,386 undergraduate and 2,715 graduate degrees. (Some students are receiving more than one degree at a time.) Undergraduate and graduate students also have separate ceremonies, organized by academic units, in which their names are read as they walk across a stage.

Afeyan is a founder and the CEO of Flagship Pioneering, a venture firm started in 2000 that has developed more than 100 companies in the biotechnology industry, which combined have more than 60 drugs in clinical development.

A member of the MIT Corporation who earned his PhD from the Institute in biochemical engineering, Afeyan also served as a senior lecturer at the MIT Sloan School of Management for 16 years. He is currently on the advisory board of the MIT Abdul Latif Jameel Clinic for Machine Learning and has been a featured speaker at events such as MIT Solve. Afeyan is the co-founder of the Aurora Prize for Awakening Humanity, among other philanthropic efforts.

“You already have a head start, quite a significant one,” Afeyan told MIT’s graduates. “You graduate today from MIT, and that says volumes about your knowledge, talent, vision, passion, and perseverance — all essential attributes of the elite 21st-century agent.” He then drew laughs by quipping, “Oh, and I forgot to mention our relaxed, uncompetitive nature, outstanding social skills, and the overall coolness that characterizes us MIT grads.”

Afeyan also heralded the Institute itself, citing it as a place crucial to the development of the “telephone, digital circuits, radar, email, internet, the Human Genome Project, controlled drug delivery, magnetic confinement fusion energy, artificial intelligence and all it is enabling — these and many more breakthroughs emerged from the work of extraordinary change agents tied to MIT.”

Long before Afeyan himself came to MIT, he grew up in an immigrant Armenian family in Beirut. After civil war came to Lebanon in 1975, he spent long hours in the family apartment watching “Mission: Impossible” re-runs on television.

As Afeyan noted, the special agents in the show always received a message beginning, “Your mission, should you choose to accept it … ” He added: “No matter how long the odds, or how great the risk, the agents always took the assignment. In the 50 years since, I have been consistently drawn to impossible missions, and today I hope to convince each and every one of you that you should be too.”

To accomplish difficult tasks, Afeyan said, people often do three things: imagine, innovate, and immigrate, with the latter defined broadly, not just as a physical relocation but an intellectual exploration.

“Imagination, to my mind, is the foundational building block of breakthrough science,” Afeyan said. “At its best, scientific research is a profoundly creative endeavor.”

Breakthroughs also deploy innovation, which Afeyan defined as “imagination in action.” To make innovative leaps, he added, requires a kind of “paranoid optimism. This means toggling back and forth between extreme optimism and deep-seated doubt,” in a way that “often starts with an act of faith.”

Beyond that, Afeyan said, “you will also need the courage of your convictions. Make no mistake, you leave MIT as special agents in demand. As you consider your many options, I urge you to think hard about what legacy you want to leave, and to do this periodically throughout your life. … You are far more than a technologist. You are a moral actor. The choice to maximize solely for profits and power will in the end leave you hollow. To forget this is to fail the world — and ultimately to fail yourself.”

Finally, Afeyan noted, to make great innovative leaps, it is often necessary to “immigrate,” something that can take many forms. Afeyan himself, as an Armenian from Lebanon who came to the U.S., has experienced it as geographic and social relocation, and also as the act of changing things while remaining in place.

“Here’s the really interesting thing I’ve learned over the years,” Afeyan said. “You don’t need to be from elsewhere to immigrate. If the immigrant experience can be described as leaving familiar circumstances and being dropped into unknown territory, I would argue that every one of you also arrived at MIT as an immigrant, no matter where you grew up. And as MIT immigrants, you are all at an advantage when it comes to impossible missions. You’ve left your comfort zone, you’ve entered uncharted territory, you’ve forgone the safety of the familiar.”

Synthesizing these points, Afeyan suggested, “If you imagine, innovate, and immigrate, you are destined to a life of uncertainty. Being surrounded by uncertainty can be unnerving, but it’s where you need to be. This is where the treasure lies. It’s ground zero for breakthroughs. Don’t conflate uncertainty and risk — or think of it as extreme risk. Uncertainty isn’t high risk; it’s unknown risk. It is, in essence, opportunity.”

Afeyan also noted that many people are “deeply troubled by the conflicts and tragedies we are witnessing” in the world today.

“I wish I had answers for all of us, but of course, I don’t,” Afeyan said. “But I do know this: Having conviction should not be confused with having all the answers. Over my many years engaged in entrepreneurship and humanitarian philanthropy, I have learned that there is enormous benefit in questioning what you think you know, listening to people who think differently, and seeking common ground,” a remark that drew an ovation from the audience.

In conclusion, Afeyan urged the Class of 2024 to face up to the world’s many challenges while getting used to a life defined by tackling tough tasks.

“Graduates, set forth on your impossible missions,” Afeyan said. “Accept them. Embrace them. The world needs you, and it’s your turn to star in the action-adventure called your life.”

Next, Kornbluth, issuing the president’s traditional “charge to the graduates,” lauded the Class of 2024 for being “a community that runs on an irrepressible combination of curiosity and creativity and drive. A community in which everyone you meet has something important to teach you. A community in which people expect excellence of themselves — and take great care of one another.”

As Kornbluth noted, most of the seniors in the undergraduate Class of 2024 had to study through, and work around, the Covid-19 pandemic. MIT, Kornbluth said, is a place where people “fought the virus with the tools of measurement and questioning and analysis and self-discipline — and was therefore able to pursue its mission almost undeterred.”

The campus community, she added, “understands, in a deep way, that the vaccines, as Noubar just said, were not some ‘overnight miracle’ — but rather the final flowering of decades of work by thousands of people, pushing the boundaries of fundamental science.”

And while the Class of 2024 has acquired a great deal of knowledge in the classroom and lab, Kornbluth thanked its members for what they have given to MIT, as well.

“The Institute you are graduating from is — thanks in part to you — always reflecting and always changing,” Kornbluth said. “And I take that as your charge to us.”

The OneMIT Commencement event started with a parade for alumni from the Class of 1974, back on campus for their 50th anniversary reunion. The MIT Police Honor Guard entered next as part of the ceremonial procession, followed by administration and faculty. The MIT Wind Ensemble, conducted by Fred Harris, Jr., provided the accompanying music.

Mark Gorenberg ’76, chair of the MIT Corporation, formally opened the ceremony, and Thea Keith-Lucas, chaplain to the Institute, gave an invocation. The Chorallaries of MIT sang the national anthem.

Afeyan’s remarks followed, but were delayed for several minutes by protesters holding signs. After his speech, Lieutenant Mikala Nicole Molina, president of the Graduate Student Council, delivered remarks as well.

“Let us step forward from today with a commitment not only to further our own goals, but also to use our skills and knowledge to contribute positively to our communities and the world,” Molina said. “Our actions reflect the excellence and integrity that MIT has instilled in us.”

Penny Brant, president of the undergraduate Class of 2024, then offered a salute to her classmates, saying “I know I would not be graduating here today if not for all of you who have helped me along the way. You all have had such a profound and positive impact on me, our community, and the world.”

Kornbluth’s speech, which followed, was momentarily interrupted by shouting from an audience member, before students and other audience members gave Kornbluth a sustained ovation and ceremonies resumed as planned.

R. Robert Wickham ’93, SM ’95, president of the MIT Alumni Association and chief marshal of the Commencement ceremony, also offered a traditional greeting to graduates, saying he was “welcoming you into our alumni family, your infinite connection to MIT.” There are now almost 147,000 MIT alumni worldwide.

The Chorallaries sang the school song, “In Praise of MIT,” as well as another Institute anthem, “Take Me Back to Tech,” moments after Gorenberg formally closed the ceremony.

Preceding Afeyan, recent MIT Commencement speakers have been engineer and YouTuber Mark Rober, in 2023; Director-General of the World Trade Organization Ngozi Okonjo-Iweala, in 2022; lawyer and activist Bryan Stevenson, in 2021; and retired U.S. Navy four-star admiral William McRaven, in 2020.

Sea urchins made to order: Scripps scientists make transgenic breakthrough

A fluorescent blue transgenic sea urchin is seen through a microscope


Consider the sea urchin. Specifically, the painted urchin: Lytechinus pictus, a prickly Ping-Pong ball from the eastern Pacific Ocean.

The species is a smaller and shorter-spined cousin of the purple urchins devouring kelp forests. Painted urchins produce massive numbers of sperm and eggs that fertilize outside their bodies, allowing scientists to watch the process of urchin creation up close and at scale. One generation gives rise to the next in four to six months. They share more genetic material with humans than fruit flies do and can’t fly away — in short, an ideal lab animal for the developmental biologist.

Scientists have been using sea urchins to study cell development for roughly 150 years. Despite urchins’ status as super reproducers, practical concerns often compel scientists to focus their work on more easily accessible animals: mice, fruit flies, worms.

Scientists working with mice, for example, can order animals online with the specific genetic properties they are hoping to study — transgenic animals, whose genes have been artificially tinkered with to express or repress certain traits.

Researchers working with urchins typically have to spend part of their year collecting them from the ocean.

“Can you imagine if mouse researchers were setting a mousetrap every night, and whatever it is they caught is what they studied?” said Amro Hamdoun, a professor at UC San Diego’s Scripps Institution of Oceanography.

Professor Amro Hamdoun holds a sea urchin specimen in a lab

Marine invertebrates represent about 40% of the animal world’s biological diversity yet appear in a scant fraction of a percent of animal-based studies. What if researchers could access sea urchins as easily as mice? What if it were possible to make and raise lines of transgenic urchins?

How much more could we learn about how life works?

“You know how during the pandemic, everyone was making sourdough? I’m not good at making sourdough,” Hamdoun said recently at his office in Scripps’ Hubbs Hall. He set his sights instead on a project of a different sort: a new transgenic lab animal, “a fruit fly from the sea.”

In March, Hamdoun’s lab published a paper on the bioRxiv preprint server demonstrating the successful insertion of a piece of foreign DNA — specifically, a fluorescent protein from a jellyfish — into the genome of a painted urchin that passed the change down to its offspring.

The result is the first transgenic sea urchin, one that happens to glow like a Christmas bulb under a fluorescent light. (The paper has been submitted for peer review.)

The animals are the first transgenic echinoderms, the phylum that includes starfish, sea cucumbers and other marine animals. Hamdoun’s mission is to make genetically modified urchins available to researchers anywhere, not just those who happen to work in research facilities at the edge of the Pacific Ocean.

Elliot Jackson, a postdoctoral researcher, works with sea urchin eggs in a lab at Scripps

“If you look at some of the other model organisms, like Drosophila [fruit flies], zebrafish and mouse, there are well-established resource centers,” said Elliot Jackson, a postdoctoral researcher at Scripps and lead author of the paper. “If you want a transgenic line that labels the nervous system, you could probably get that. You could order it. And that’s what we hope we can be for sea urchins.”

Being able to genetically modify an animal supercharges what scientists can learn from it, with implications far beyond any individual species.

“It will transform sea urchins as a model for understanding neurobiology, for understanding developmental biology, for understanding toxicology,” said Christopher Lowe, a Stanford professor of biology who was not involved in the research.


The lab’s breakthrough, and its focus on making the animals freely available to fellow scientists, will “allow us to explore how evolution has solved a lot of really complicated life problems,” he said.

Researchers tend to study mice, flies and the like not because the animals’ biology is best suited to answer their questions but because “all the tools that were necessary to get at your questions were built up in just a few species,” said Deirdre Lyons, an associate professor of biology at Scripps who worked with Hamdoun on early research related to the project.

Expanding the range of animals available for sophisticated lab work is like adding colors to an artist’s palette, Lyons said: “Now you can go get the color that you really want, that best fits your vision, rather than being stuck with a few models.”

Hamdoun holds two painted urchins in an outstretched hand

On the ground floor of Hamdoun’s office building is the Hubbs Hall experimental aquarium, a garage-like space crammed with tanks full of recirculating seawater and a motley assortment of marine life.

On a recent visit, Hamdoun reached into a tank and gently dislodged a painted urchin. It scooched with surprising speed across an outstretched palm, as if exploring alien terrain.

The last common ancestor of L. pictus and Homo sapiens lived at least 550 million years ago. Despite the different evolutionary paths we’ve since traveled, our genomes reveal a shared biological heritage.

The genetic instructions that drive the transformation of a single zygote into a living body are strikingly similar in our two species. Specialized systems differentiating from a single fertilized egg and the translation of a jumble of proteins into a singular living thing — on the cellular level, all of that proceeds in much the same way for urchins and people.
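The conservation described above is often summarized as percent identity between aligned DNA sequences. A minimal sketch of that calculation, using made-up fragments purely for illustration (neither sequence is real human or urchin data):

```python
# Toy illustration: percent identity between two pre-aligned DNA fragments,
# the kind of crude similarity measure behind claims like "urchins share
# more genetic material with humans than fruit flies do."

def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching bases between two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Hypothetical aligned fragments of a conserved developmental gene.
human_fragment  = "ATGGCTGAGCTGACCAAAG"
urchin_fragment = "ATGGCTGAACTGACGAAAG"

print(f"{percent_identity(human_fragment, urchin_fragment):.1f}% identical")
# prints: 89.5% identical
```

Real comparisons, of course, first require aligning the sequences (handling insertions and deletions), which tools such as BLAST or minimap2 — mentioned elsewhere in genomics workflows — are built to do.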

These animals are “really fundamental to our understanding of all of life,” Hamdoun said, placing the urchin back in its tank. “And historically, very inaccessible genetically.”

The experimental aquarium was built in the 1970s, when scooping life from the sea was the only way to acquire research specimens. A few floors up in Hubbs Hall, Hamdoun led the way into the urchin nursery — the first large-scale effort to raise successive generations of the animals in a laboratory. At any given moment, the team has 1,000 to 2,000 sea urchins in various stages of development.

Hamdoun points at rows of greenish tanks holding sea urchins

Row upon row of tiny plastic tanks stood against a wall, each containing a lentil-size juvenile urchin. A strip of tape on each tank noted the animal’s genetic modification and date of fertilization. On some, a second bit of tape indicated animals that had the modification in their sex cells’ DNA, meaning it could be passed down to offspring. (For this reason, the lab keeps its urchins scrupulously separate from the wild population.)

“One of the big questions in all of biology is to understand how the series of instructions in the genome gives you whatever phenotype you want to study,” Hamdoun said — essentially, how the string of nucleotides that is an animal’s genetic code gives rise to the characteristics of the living, respiring creature. “One of the fundamental things you have to do is be able to modify that genome, and then study what the outcome is.”


He pointed to a tank containing a tiny urchin whose genome has had the gene for the protein ABCD1 snipped out.

ABCD1 acts like a bouncer, Hamdoun explained, parking along the cell membrane and ejecting foreign molecules. The protein’s action can protect the cell from harmful substances but can sometimes work against an organism’s best interests, as when it prevents the cell from absorbing a necessary medication.

Researchers using urchins in which that protein no longer works can study the movement of a molecule through an organism — DDT, for example — and measure how much of the substance ends up in the cell without the confounding interference of ABCD1. They can reverse-engineer how big a role ABCD1 plays in preventing a cell from absorbing a drug.

A transgenic fluorescent sea urchin glows green through a microscope

And then there are the fluorescent urchins.

“The magic happens in this room,” Jackson said, walking into a narrow office with $1 million worth of microscopes at one end and a decades-old hand-cranked centrifuge bolted to a table at another.

He placed a petri dish containing three pencil-eraser-size transgenic urchins under a microscope. At 120 times its size, each looked like the Times Square New Year’s Eve ball come to life — a glowing, wiggling creature of pentamerous radial symmetry.

Fluorescence is not just an echinoderm party trick. Lighting up the cells makes it easier for researchers to track their movement in a developing organism. Researchers can watch as the early cells of a blastula divide and reorganize into neural or cardiac tissue. Eventually, scientists will be able to turn off individual genes and see how that affects development. It will help us understand how our own species develops, and why that development doesn’t always proceed according to plan.


The lab has “done a great job. It’s really been welcomed by the community,” said Marko Horb, senior scientist and director of the National Xenopus Resource at the University of Chicago’s Marine Biological Laboratory.

Horb runs the national clearinghouse for genetically modified species of Xenopus, a clawed frog used in lab research. Funded in part by the National Institutes of Health, the center develops lines of transgenic frogs for scientific use and distributes them to researchers.

Hamdoun envisions a similar resource center for his lab’s urchins. They’ve already started sending tiny vials of transgenic urchin sperm to interested scientists, who can grow bespoke urchins with eggs acquired from Hamdoun’s lab or another source.

Hamdoun vividly recalls the time he spent earlier in his career trying to track down random snippets of DNA necessary for his research, the disappointment and frustration of writing to professors and former postdocs only to find that the material had long been lost. He’d rather future generations of scientists spend their time on discovery.

“Biology is really interesting,” he said. “The more people can get access to it, the more we’re going to learn.”

Sea urchins in a microscope dish



Corinne Purtill is a science and medicine reporter for the Los Angeles Times. Her writing on science and human behavior has appeared in the New Yorker, the New York Times, Time Magazine, the BBC, Quartz and elsewhere. Before joining The Times, she worked as the senior London correspondent for GlobalPost (now PRI) and as a reporter and assignment editor at the Cambodia Daily in Phnom Penh. She is a native of Southern California and a graduate of Stanford University.

