Data Analysis for Genomics

Drive your career forward.

This HarvardX professional certificate program gives learners the necessary skills and knowledge to tackle real-world data analysis challenges.

What You'll Learn

Advances in genomics have triggered fundamental changes in medicine and research. Genomic datasets are driving the next generation of discovery and treatment, and this series will enable you to analyze and interpret data generated by modern genomics technology.

Using open-source software, including R and Bioconductor, you will acquire skills to analyze and interpret genomic data. These courses are perfect for those who seek advanced training in the analysis of data from high-throughput technologies. Problem sets will require coding in the R language to ensure mastery of key concepts. In the final course, you’ll investigate data analysis for several experimental protocols in genomics.

Enroll now to unlock the wealth of opportunities in modern genomics.

The courses will be delivered via edX, connecting learners around the world. After completing this series, you will understand how to:

  • Bridge diverse genomic assay and annotation structures to data analysis and research presentations via innovative approaches to computing.
  • Use advanced techniques to analyze genomic data.
  • Structure, annotate, normalize, and interpret genome-scale assays.
  • Analyze data from several experimental protocols, using open-source software, including R and Bioconductor (a minimal worked example follows).
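
To make the kind of analysis these courses involve concrete, here is a minimal, self-contained sketch of a differential-expression run with the Bioconductor package DESeq2. The simulated counts and sample labels are illustrative assumptions, not course material; in the courses themselves the count matrix would come from real sequencing experiments.

    ## Toy differential-expression sketch using R and Bioconductor (DESeq2).
    ## The simulated counts below stand in for a real RNA-Seq count matrix.
    library(DESeq2)

    set.seed(1)
    counts <- matrix(rnbinom(100 * 6, mu = 50, size = 10), nrow = 100,
                     dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))
    coldata <- data.frame(condition = factor(rep(c("control", "treated"), each = 3)),
                          row.names = colnames(counts))

    ## Build the dataset, fit the negative-binomial model, and extract results
    dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                                  design = ~ condition)
    dds <- DESeq(dds)
    head(results(dds))   # log2 fold changes and adjusted p-values per gene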

Courses in this Program

Introduction to Bioconductor (2–4 hours per week, for 4 weeks): The structure, annotation, normalization, and interpretation of genome-scale assays.

Case Studies in Functional Genomics (2–4 hours per week, for 5 weeks): Perform RNA-Seq, ChIP-Seq, and DNA methylation data analyses, using open-source software, including R and Bioconductor.

Advanced Bioconductor (2–4 hours per week, for 4 weeks): Learn advanced approaches to genomic visualization, reproducible analysis, data architecture, and exploration of cloud-scale consortium-generated genomic data.

Your Instructor

Rafael Irizarry

Professor of Biostatistics at Harvard University

Michael Love

Assistant Professor, Departments of Biostatistics and Genetics at UNC Gillings School of Global Public Health

Vincent Carey

Professor of Medicine at Harvard Medical School

Job Outlook

  • R is listed as a required skill in 64% of data science job postings, and data scientist was named Glassdoor’s Best Job in America in 2016 and 2017. (source: Glassdoor)
  • Companies are leveraging the power of data analysis to drive innovation. Google data analysts use R to track trends in ad pricing and illuminate patterns in search data. Pfizer created customized packages for R so scientists can manipulate their own data.
  • 32% of full-time data scientists started learning machine learning or data science through a MOOC, while 27% were self-taught. (source: Kaggle, 2017)
  • Data scientists are few in number and in high demand. (source: TechRepublic)

Ways to take this program

When you enroll in this program, you will register for a Verified Certificate for all 3 courses in the Professional Certificate Series. 

Alternatively, learners can audit the individual courses for free, with access to select course materials, activities, tests, and forums. Please note that auditing the courses does not confer course or program certificates, even for learners who earn a passing grade.

Genomics Data Analysis

Learn advanced techniques to analyze genomics data

Associated Schools

Harvard T.H. Chan School of Public Health

What you'll learn.

  • Advanced techniques to analyze genomic data
  • How to structure, annotate, normalize, and interpret genome-scale assays
  • How to bridge diverse genomic assay and annotation structures to data analysis and research presentations via innovative approaches to computing
  • How to analyze data from several experimental protocols, using open-source software, including R and Bioconductor

About this series

The Genomics Data Analysis XSeries is an advanced series that will enable students to analyze and interpret data generated by modern genomics technology.

Using open-source software, including R and Bioconductor, you will acquire skills to analyze and interpret genomic data.

This XSeries is perfect for those who seek advanced training in the analysis of data from high-throughput technologies. Problem sets will require coding in the R language to ensure learners fully grasp and master key concepts. The final course investigates data analysis for several experimental protocols in genomics.

This series includes

Case Studies in Functional Genomics

Perform RNA-Seq, ChIP-Seq, and DNA methylation data analyses, using open-source software, including R and Bioconductor.

Introduction to Bioconductor

The structure, annotation, normalization, and interpretation of genome-scale assays.

Advanced Bioconductor

Learn advanced approaches to genomic visualization, reproducible analysis, data architecture, and exploration of cloud-scale consortium-generated genomic data.

Instructors

Rafael Irizarry

Michael Love

Vincent Carey

The value of genomic analysis

Genetic heritability is responsible for 30% of individual health outcomes, yet it is rarely used to guide disease prevention and care. Each individual carries 4–5 million genetic variants, each with varying influence on traits related to our health. The cost to sequence a genome has fallen drastically in recent years, and sequence data shows potential for ubiquitous use. However, the ability to read the sequence accurately and to interpret it meaningfully remains an obstacle to broad adoption.

Improving the accuracy of genomic analysis

Sequencing genomes enables us to identify variants in a person’s DNA that indicate genetic disorders or an elevated risk of diseases such as breast cancer.

Highly accurate genomes with deep neural networks

Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. As published in Nature Biotechnology, DeepVariant, an open-source variant caller that uses a deep neural network to call genetic variants from next-generation DNA sequencing data, significantly improves the accuracy in identifying variant locations, reducing the error rate by more than 50%.

Winner in PrecisionFDA V2 Truth Challenge

DeepVariant won awards for Best Overall accuracy in 3 of 4 instrument categories in the PrecisionFDA V2 Truth Challenge. Compared to previous state-of-the-art models, DeepVariant v1.0 significantly reduces the errors for widely used sequencing data types, including Illumina and Pacific Biosciences.

Identifying disease-causing variants in cancer patients

Researchers wanted to understand whether incorporating automated deep learning technology would improve the detection of disease-causing variants in patients with cancer. In a cross-sectional study of 2,367 prostate cancer and melanoma patients in the US and Europe, published in JAMA, DeepVariant found disease-causing variants in 14% more individuals than prior state-of-the-art methods.

Building large-scale cohorts for genetic discovery research

Large cohorts of sequenced individuals are the foundations for discovery of novel genetic associations with disease. We developed best practices for generating cohorts that substantially improve on previous methods and have been adopted by the UK Biobank for its large-scale sequencing efforts.

Improving genetic association discovery with machine learning

Discovering genetic variants associated with a trait of interest requires a large cohort of individuals with both genetic and trait information. As published in AJHG, we demonstrate that using a machine learning model to predict eye-disease-related traits from fundus images significantly improves discovery of genetic variants influencing those traits.

Our partners in genomics research

Because genomic data is highly personal, to the greatest extent possible we use datasets that are fully public or are broadly available to qualified researchers. We also partner with trusted organizations that contribute scientific and technology development to improve standards in genomic analysis and enhance the utility of sequencing data.

DeepVariant’s precisionFDA Truth Challenge V2 submission using PacBio HiFi reads achieved the highest single-technology accuracy, which has been featured on the PacBio blog and in a Nature Biotechnology retrospective. The collaboration also successfully launched DeepConsensus, which improves HiFi yield and read quality compared to existing consensus basecalling methods.

The Regeneron Genetics Center, one of the world’s largest human genomic research efforts, has adopted DeepVariant and re-trained specialized models for both internal projects and the delivery of 200,000 exomes to UK Biobank.

Benedict Paten’s lab at UC Santa Cruz collaborated with Google on PEPPER-DeepVariant, which won best accuracy in the Oxford Nanopore Technologies category of the PrecisionFDA Truth Challenge V2. The paper was also published in Nature Methods.

NVIDIA Clara Parabricks Pipelines software provides a suite of accelerated bioinformatic tools to support DNA and RNA applications, running on a GPU. Their implementation of DeepVariant processes a 30x whole human genome in less than 25 minutes from fastq to vcf using their latest A100 GPU.

GenapSys trained a custom DeepVariant model to provide a highly accurate variant caller for their new high accuracy, low cost, benchtop sequencing instrument.

ATGenomix builds a Spark framework that efficiently parallelizes DeepVariant for its work with several clinical partners.

DNAnexus provides a secure and collaborative fit-for-purpose bioinformatics system that integrates cutting-edge tools like DeepVariant. They work with industry leaders like Google, the FDA, and UK Biobank to provide solutions to the scientific community.

DNAstack enables researchers to organize, share, and analyze genomics and biomedical data, using tools like DeepVariant, in an easy to use cloud environment. DNAstack's software products use open standards developed by the Global Alliance for Genomics & Health.

The Genomic Data Analysis Network

The Genomic Data Analysis Network (GDAN) serves to help the cancer research community leverage the genomic data and resources produced by CCG and other NCI programs.

About the Genomic Data Analysis Network

While large genomic datasets are invaluable to the cancer research community, translating genomic data into biological insights into the development and treatment of cancer is not a straightforward task. Over a decade of experience from The Cancer Genome Atlas (TCGA) program demonstrated the power and necessity of “team science”—that successful analyses of large-scale genomic datasets require the coordination of a large body of researchers with a wide range of expertise in computational genomics, tumor biology, and clinical oncology. 

CCG’s Genomic Data Analysis Network (GDAN) was formed out of the need to harness TCGA data and to meet a broader, growing need for computational genomics. For TCGA, the network created standardized data formats and processing protocols, generated bioinformatics tools for the community, and performed a range of analyses on the data, notably generating clinically meaningful molecular subgroups of cancer and producing the PanCancer Atlas.

In the post-TCGA era, the GDAN continues to conduct key large-scale studies and generate genomic resources to support the genomic research community. The GDAN’s overall goal is to help the cancer research community leverage the genomic data and resources produced by CCG and other NCI programs for the benefit of cancer patients, largely by: 

  • developing and implementing new bioinformatic and computational tools to capture key biological insights about cancer (e.g., pathway analysis, data integration with visualization, and integrated cancer biology);   
  • developing data processing and quality control methods for working with large-scale genomic characterization data;   
  • processing and integrating a variety of analytical data types to generate disease-level findings and perform cross-disease analyses. 

The GDAN comprises individual Genome Data Analysis Centers (GDACs), each specializing in a unique set of computational analyses, molecular platforms, data integration, or visualization techniques. The GDACs are tasked with cooperatively performing molecular analyses on new and existing data from CCG programs and working with the other components of CCG’s Genome Characterization Pipeline. Areas of expertise and examples of their utility include:

  • DNA Mutations – Identifying mutations in coding and non-coding regions of the genome, classifying mutations as driver or passenger mutations, identifying chromosomal rearrangement events leading to fusion proteins, and determining potential enhancer or suppressor functionality of mutations. 
  • Gene Expression – Identifying mRNA expression patterns and correlating them with relevant clinical parameters, identifying translocation or rearrangement events (a toy clustering sketch follows this list). 
  • Copy number and tumor purity – Clustering cases according to copy number alteration or loss-of-heterozygosity events, identifying candidate drivers of copy number alterations, estimating tumor purity of the samples. 
  • miRNA analysis – Analyzing miRNA expression to correlate with patterns of mRNA expression and identify expression regulation networks, correlating miRNA data with relevant clinical parameters. 
  • Long non-coding RNA (lncRNA) – Analyzing lncRNA expression patterns and correlating them with patterns of mRNA expression or expression regulation networks. 
  • Batch effects and data integration – Identifying batch effects that might have been accrued during processing of samples, devising bioinformatics methods to correct such effects, determining biologically relevant groups that can subsequently be analyzed in the context of clinical data. 
  • Methylation analysis – Identifying DNA methylation patterns of interest and correlating patterns with relevant clinical parameters, correlating patterns with mRNA expression data to propose gene regulation mechanisms. 
  • Pathway analysis – Identifying biological pathways that have been altered, performing multi-omic data analyses to identify altered pathways and potential clinical relevance. 
  • Single cell RNA sequencing – Identifying cell clusters according to gene expression patterns, extracting expression levels and correlating with relevant clinical parameters, identifying translocation/rearrangement events, and identifying cell clusters or subclones of interest. 
  • Circulating cell-free DNA (cfDNA) and circulating tumor DNA (ctDNA) – Analyzing “liquid biopsies,” or blood samples for cfDNA or ctDNA to establish correlations between mutations in tumor tissue and ctDNA, developing methods to utilize the technology as a diagnostic and prognostic tool, and creating models of disease burden and progression in cancer development. 
  • Long-read sequencing – Assembling genomes, identifying structural variants, sequencing through repetitive regions, phasing critical variants. 
  • Spatial genomics – Analyzing gene expression data with spatial information, produced from different emerging spatial genomics platforms.  
  • Digital Imaging – Mining histopathology images for elements that aid in diagnostic or prognostic efforts, applying machine learning to learn relevant features. 
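
As a concrete, self-contained illustration of the expression-based grouping mentioned in the gene-expression and copy-number items above, the base-R sketch below clusters simulated tumor profiles into candidate molecular subgroups. The toy data and the choice of two clusters are assumptions for illustration only; this is not a GDAC pipeline.

    ## Toy illustration: cluster simulated tumor expression profiles into
    ## candidate molecular subgroups with hierarchical clustering (base R only).
    set.seed(42)
    expr <- rbind(matrix(rnorm(20 * 500, mean = 0), nrow = 20),  # subgroup A-like tumors
                  matrix(rnorm(20 * 500, mean = 2), nrow = 20))  # subgroup B-like tumors
    rownames(expr) <- paste0("tumor", 1:40)

    hc <- hclust(dist(expr), method = "ward.D2")  # sample-to-sample distances
    subgroups <- cutree(hc, k = 2)                # cut the tree into two candidate subgroups
    table(subgroups)   # membership counts, which would then be related to clinical data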

New Molecular Profiling Platforms to Explore New Facets of Cancer 

CCG continues to expand and develop new genomic data and analysis resources for the cancer research community. Through the GDAN and other CCG programs, CCG is exploring new ways to mine the data and learn new things about cancer from the massive dataset.  

Additionally, as new molecular platforms become available, CCG explores utilizing these platforms to complement existing datasets. New platforms may be utilized in new structural genomics projects or in some cases to further characterize existing samples. Existing or new Genome Characterization Centers may be sought out to provide these capabilities. These newer platforms include: 

  • Assay for transposase-accessible chromatin using sequencing (ATAC-seq) 
  • Single-cell RNA 
  • Single-cell DNA 
  • Spatial genomics 

CCG considers how these new technologies may be applied to enhance what we can learn about cancer. For example, can single-cell or spatial technologies provide much needed insights into the tumor microenvironments of tumors that don’t respond to treatment? How can the technologies be used to further what we can learn from TCGA or other existing datasets? 

For example, GDAN researchers applied the ATAC-seq chromatin accessibility assay to 410 TCGA tumor samples, getting an unprecedented systematic look at gene dysregulation in cancer. With this low-cost assay, the researchers were able to discover new DNA regulatory elements and a new class of mutations falling within these elements that may play a key role in cancer.  

In addition to applying new molecular platforms to TCGA samples, CCG is also working to perform whole-genome sequencing for the complete set of TCGA samples. These rich datasets, along with analyses and methods developed by the GDAN, could help facilitate the discovery of new diagnostic and prognostic markers, new targets for pharmaceutical interventions, and new cancer prevention and treatment strategies.  

Current GDAN Centers 

The GDAN comprises individual Genome Data Analysis Centers (GDACs), each contributing distinct functions, capabilities, and analytical components. Each GDAC works collaboratively within the network and also with other components of CCG’s Genome Characterization Pipeline. The current GDACs and their areas of expertise in computational genomics are described below.

Data Science Journal

  • Collection: Open Data and Africa

Research Papers

Genomic Research Data Generation, Analysis and Sharing – Challenges in the African Setting

  • Nicola Mulder
  • Clement A. Adebamowo
  • Sally N. Adebamowo
  • Oladimeji Adebayo
  • Osimhiarherhuo Adeleye
  • Mohamed Alibi
  • Shakuntala Baichoo
  • Alia Benkahla
  • Faisal M. Fadlelmola
  • Hassan Ghazal
  • Kais Ghedira
  • Alice Matimba
  • Ahmed Moussa
  • Zahra Mungloo-Dilmohamud
  • Mayowa O. Owolabi
  • Fouzia Radouani
  • Charles N. Rotimi
  • Dan J. Stein
  • Oussama Souiai

Genomics is the study of the genetic material that constitutes the genomes of organisms. This genetic material can be sequenced, and it provides a powerful tool for the study of human, plant and animal evolutionary history and diseases. Genomics research is becoming increasingly commonplace due to significant advances in, and the falling costs of, technologies such as sequencing. This has led to new challenges, including the increasing cost and complexity of the data. There is, therefore, an increasing need for computing infrastructure and skills to manage, store, analyze and interpret the data. In addition, there is a significant cost associated with recruitment of participants and collection and processing of biological samples, particularly for large human genetics studies on specific diseases. As a result, researchers are often reluctant to share the data due to the effort and associated cost. In Africa, where researchers are most commonly involved at the study recruitment, phenotype determination and biological sample collection end of the genomic research spectrum, rather than in the generation of genomic data, data sharing without adequate safeguards for the interests of the primary data generators is a concern. There are substantial ethical considerations in the sharing of human genomics data. The broad consent for data sharing preferred by genomics researchers and funders does not necessarily align with the expectations of researchers, research participants, legal authorities and bioethicists. In Africa, this is complicated by concerns about comprehension of genomics research studies, quality of research ethics reviews and understanding of the implications of broad consent, secondary analyses of shared data, return of results and incidental findings. Additional challenges with genomics research in Africa include the inability to transfer, store, process and analyze large-scale genomics data on the continent, because this requires highly specialized skills and expensive computing infrastructure which are often unavailable. Recently, initiatives such as H3Africa and H3ABioNet, which aim to build capacity for large-scale genomics projects in Africa, have emerged. Here we describe such initiatives, including the challenges faced in the generation, analysis and sharing of genomic data and how these challenges are being overcome.

Keywords: bioinformatics

Introduction

Broadly speaking, genomics is the study of the DNA that makes up the genomes of organisms, including sequencing and analysis of the structure and function of these molecules. Genomics is a powerful tool for the study of human, pathogen, plant and animal evolutionary history, pharmacogenomics and diseases through the analysis of genetic variations. Genomic data generation has achieved significant economies of scale in the context of genotyping and sequencing. This success story of decreasing costs of genomic technologies and the resulting increased data size and complexity is posing new challenges to scientists. The infrastructure and skills to manage, store, analyze and interpret genomic data are not keeping pace with the ever-increasing data generation capabilities. Phenotypic characterization and collection of biological samples require enormous effort and input from multiple stakeholders. As a result, once data is generated, researchers are reluctant to share it immediately, because they want the opportunity to exploit the data given their efforts. The concern about sharing research data is particularly strong among African scientists, who often lack the capacity to generate and analyze the genomics data arising from their samples. The outcome is dissatisfaction about the balance of recognition and benefits between primary data collectors and primary and secondary genomics analysts in authorship, patents and other outcomes of research projects. For example, some of the major publications on genome sequences from African individuals have been led from abroad (Tishkoff et al. 2009; Lambert and Tishkoff 2009; Schuster et al. 2010).

Given the sensitive nature of human genetic data and the need to share samples and data across countries in large collaborative research projects, the nature and scope of informed consent raise several ethical, legal and social concerns. This is particularly so in the African context where novel and unique circumstances and opportunities may arise from the processes of informing and receiving consent for genomic research. The consent preferred by genomics researchers and funders, including the stipulation to share data may be problematic for other researchers, research participants, legal authorities and bioethicists. Comprehension of genomics research projects by participants, understanding the implications of broad consent and secondary analyses of shared data and samples, quality of research ethics reviews, return of genomics research results in the face of uncertainty about the meaning and lack of resources for verification, and incidental genomics findings pose additional ethical challenges for genomics research in Africa. For non-human genomic data, the ethical challenges are less, but there are potential commercial opportunities to be gained for example in the design of new drugs and vaccines against pathogens, which also form a barrier to open data sharing.

The infrastructural challenges posed by the increasing data generation capabilities of genomic technologies is particularly acute for researchers across the African continent. The highly specialized skills, extensive and expensive computing infrastructure, broadband internet access, secure cloud computing and uninterrupted power supply are not readily available across the continent. New initiatives are being established which aim to enable large-scale genomics projects in Africa to study the genetic basis of human history and diseases, but also have strong capacity building elements. One such effort is the Human Heredity and Health in Africa initiative (H3Africa: www.h3africa.org ) designed to generate large and complex genomics datasets from multiple ethnic groups across Africa ( H3Africa consortium 2014 ). At the core of the informatics activities of H3Africa is the H3ABioNet ( www.h3abionet.org ) ( Mulder et al. 2015 ). This Pan African bioinformatics network is building bioinformatics capacity to enable genomics data analysis on the continent. Here, as members of H3Africa projects and the H3ABioNet network, we describe the challenges associated with data generation, analysis and sharing in Africa, and how these initiatives and networks are working towards overcoming these challenges.

Genomics data generation

Genomics, a Big Data science, is generating ever increasing amounts of data, mostly from sequencing thousands of human and non-human genomes. In the mid 2000’s several new sequencing technologies based on massive parallel DNA sequencing approaches, referred to as next generation sequencing (NGS) emerged, which allow genomic scientists to sequence faster and cheaper. This has revolutionized our ability to interrogate the genomes of thousands of humans and other organisms, facilitating novel insights into biology, medicine and evolutionary history ( Goodwin, McPherson and McCombie 2016 ). These efforts require efficient data acquisition, storage, distribution, and analysis. Storage, distribution and analysis of data are core activities associated with any large amount of data, but data acquisition in genomics is highly distributed and involves heterogeneous formats. There is a need to systematically organize this data and to disseminate the resulting information through technically sound means to provide opportunities to academics and researchers globally. This information dissemination can facilitate advances in biomedical research for the improvement of health ( Stephens et al. 2015 ). Much of the data generated by NGS applications and other omics (including genomics, transcriptomics, proteomics and other high-throughput methods) technologies are housed in public databases ( Rung and Brazma 2013 ). These databases are either general (e.g. Sequence Read Archive, Array express, Gene Expression Omnibus) or dedicated to one research area (e.g. Oncomine for cancer research, plexDB for plant research). They are openly accessible for use by any researcher, though some, notably those containing human data, require access approval from specialist committees.

Genomic technologies are still relatively unaffordable in Africa. Generation of genomic datasets for human health and population research has been limited, with South Africa, Ghana and Kenya being the most studied (Adedokun et al. 2016; HuGE database). More recently, however, various African institutions have been equipping themselves with next-generation sequencing platforms and setting up systems for genomic data generation. H3Africa is the largest project to date with a focus on generating large-scale genomic sequence data by scientists in Africa for human health research. Different data types are being generated by H3Africa projects, including genotyping by arrays, whole human genome or exome sequencing, targeted sequencing of human or pathogen genes, and 16S rRNA sequencing for microbiome analysis. The capacity to store and analyze these data is the question at hand, and it underscores the urgent need to improve resources and capacity for data sharing and analysis in Africa.

Data transfer and storage

Until recently, African institutions were not generating large genomics datasets on their own, due to the high cost and unavailability of equipment. Most were not involved in genomics research at all, and the few that were generally outsourced their sequencing work or downloaded data from publicly available datasets. In both cases, a new challenge arose with the transfer of large datasets in environments with limited infrastructure. Big data transfer is an essential service, particularly when collaborating across multiple sites. For many African countries, data transfer is expensive, unreliable, insecure, and difficult to monitor. For example, it took 5 months to transfer 140 TB of sequence data from the USA to South Africa using currently available transfer resources. With reliable internet, sufficient storage in place and constant monitoring, this should have taken 2 months, but delays were caused by low bandwidth and internet down times. Using File Transfer Africa ( http://filetransferafrica.co.za/3-5tb-and-2-6-million-files-uploaded-to-europe/ ), 3.5 terabytes were transferred at approximately 1 gigabyte per minute, which equates to approximately 60 gigabytes per hour, or roughly 1 terabyte every 12 hours. This meant the file transfer took a combined time of 40 hours ‘on the wire’. Based on this, if we use File Transfer Africa, the transfer of 140 TB should take approximately 66.7 days, or about 2 months. However, many African institutions do not have reliable internet connections, and in some cases the power supply is erratic because of weather-related damage to municipal infrastructure and low levels of investment in public services such as electricity supply. There is also a substantial lack of computing infrastructure and IT personnel to handle such large amounts of data.
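
The arithmetic behind these estimates can be rechecked directly from the figures quoted above. The short sketch below (R used simply as a calculator) reproduces the 66.7-day estimate from the reported 3.5 TB transferred in roughly 40 hours; it introduces no new measurements.

    ## Recompute the quoted transfer-time estimate from the reported throughput
    ## of the 3.5 TB File Transfer Africa upload (~40 hours "on the wire").
    tb_per_hour    <- 3.5 / 40               # ~0.0875 TB/hour, i.e. about 1 TB every ~11-12 hours
    days_for_140tb <- 140 / tb_per_hour / 24
    round(days_for_140tb, 1)                 # ~66.7 days, roughly the two months stated above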

H3ABioNet has invested a lot of effort in infrastructure and skills development to enable big data transfer in Africa. There were challenges in trying to identify a single data transfer solution that can be used regardless of the operating system and the networking status of the institution. H3ABioNet started the NetMap project to map the internet speeds between its nodes (member institutions located throughout the continent), and implemented a unified transfer solution using Globus Online ( Foster 2011 ). Technical staff at nodes were trained to maintain their computing infrastructure and support their researchers with data transfer needs. How-To guides were developed to help technical staff and ensure the sustainability of the infrastructure.

Genomic data storage has also been a challenge. This includes the primary data, temporary files created during analysis, and the final dataset generated from research projects. Storing this large amount of data in federated but connected resources is a challenge in its own right. Many African and other Low and Middle Income Country (LMIC) institutions lack a well maintained and organized storage unit with built in redundancy that can ensure the security of research data. Apart from technical challenges, data organization and tracking is also a challenge. Genomic data can vary between several huge files and thousands of small files. Organizing them in ways that make them easily accessible and identifiable to researchers can be difficult and requires careful curation of the metadata.

There is enormous potential for scientific discovery when datasets from multiple projects are merged, particularly with genomic research that requires large sample sizes. In order to facilitate this, datasets need to be harmonized to ensure the metadata is consistent (the same terms mean the same things) and the data within files are easily searchable and findable. Within the H3Africa consortium alone, the clinical data has proven to be quite heterogeneous. Even when similar clinical observations have been captured, they are often not comparable due to differences in the phrasing of questions or measurements. Therefore, special effort is being invested in harmonizing data by mapping them to existing biomedical ontologies, which is essential for efficient data sharing. For example, curators have assessed case report forms for H3Africa projects to determine the compatibility of data collected with similar but non-identical questions. The data is then being mapped to the phenotype, disease and experimental factor ontologies. Additionally, a recommended minimal set of questions, based largely on PhenX measures ( https://www.phenxtoolkit.org/ ), has been drafted for new projects starting recruitment of participants. More effort should be invested in future projects to ensure that better data harmonization strategies and rules are adopted and enforced.

Physical and infrastructural challenges of large scale genomic studies in Africa

There are few high performance computing centers in Africa outside of South Africa, although this is slowly changing. Additionally, the use of Cloud computing has not been practical for large datasets located on the continent due to slow internet speeds. Data security is also an issue for human data. Another challenge is the heterogeneous nature of the multitude of bioinformatics tools available. Many bioinformatics algorithms and software have been written by students who move on to other projects, so they remain unmaintained and unsupported. The field moves so rapidly that tools and pipelines constantly need to be updated. One advantage though is that most bioinformatics developers keep their code open source and available in github for the community to use and develop further. Nevertheless, using these tools for big data analysis in biomedical research requires familiarity with computer programming for data manipulation and bug fixing, biostatistics and use of analysis software. There is currently limited availability of skills to process and interpret big data in Africa. In anticipation of the development of an African Research Cloud, H3ABioNet has attempted to ease the barrier to accessing tools and computing by developing workflows for common data types that can be deployed on local high performance computing clusters or clouds. The network is also using courses and hackathons to build skills in the development and running of these workflows.

Sharing of genomic data

An ongoing challenge in the context of genomics research is the relative novelty of data sharing practices in Africa, partly because African scientists have not been large generators of genomic data in the past. This novelty frequently translates into considerable reluctance among ethics committees to approve genomics research protocols (Ramsay et al. 2014; De Vries et al. 2015a; De Vries et al. 2016) – although admittedly this is more pertinent to the sharing of genomic samples than of genomic data. Data sharing with a wide range of users is increasingly required by funding agencies, to maximize the use of data generated from individual research institutions or consortia. Sharing genomics data provides an opportunity to verify original analyses, improve reproducibility, test new hypotheses and combine datasets from different sources to achieve higher statistical power. Despite these potential benefits, sharing genomic data has provoked concerns for research participants and researchers. These concerns include endangering the privacy of the data subjects (Gymrek et al. 2013; Homer et al. 2008) and downstream uses of data (Shabani et al. 2016) that were not addressed by the initial informed consent process.

To address some of these concerns, several resources have been put in place, including setting up Data Access Committees (DACs), which are responsible for data release to external requestors based on legal, ethical and scientific eligibility. In a survey of DAC members involved in reviewing access requests for genomic data available in the database of Genotypes and Phenotypes (dbGaP) and the European Genome-phenome Archive (EGA), which examined their experiences and attitudes to the tools and mechanisms for access review and their adequacy in fulfilling the goals of the controlled-access model of data sharing, the researchers concluded that DAC members and experts were ambivalent about the effectiveness and consistency of the review procedures and oversight process (Shabani et al. 2016). Therefore, data sharing policies need to be structured in ways that address the ethical and access control concerns of all research stakeholders. H3Africa has developed a data sharing, access and release policy and established a Data and Biospecimen Access Committee (DBAC) with guidelines on its composition and role (H3Africa guidelines and policy documents are available at: http://www.h3africa.org/consortium/documents ). The data access policy stipulates that genomic and phenotype data must be submitted to public repositories in a timely manner. These timelines are discussed later.

Reluctance to share data, whether for academic or commercial reasons, can also be a major obstacle. Genomic data from the health or agricultural sectors may have commercial potential if, for example, it can lead to the development of novel vaccines, therapies or pesticides (genome analysis can identify potential novel targets for drugs or vaccines). While open access to data accelerates science, ownership of intellectual property and the benefits derivable from it may be barriers to sharing. African researchers are aware of the financial rewards and recognition that have accrued to researchers in other parts of the world from the outcomes of their research, and are concerned about not receiving similar benefits from their research or credit for sharing their hard-won data (van Panhuis et al. 2014). African researchers are particularly concerned about being perceived and treated as mere data collectors who do not make sufficient intellectual contributions worthy of the levels of recognition and benefits received by other members of the research consortia. Many African institutions lack personnel and technical resources that can match those in High Income Countries (HIC) and enable them to quickly mine their own data and generate publications, patents and other benefits before the data become publicly available. Consideration therefore needs to be given to extended periods of protected access for African institutions, cognizant of this inequity in resources, when timelines for data sharing are being determined.

Challenges to re-use of publicly available data

Even though several pan-African consortia, such as H3Africa, have aimed to strengthen the ability of African institutions to generate their own experimental data, these efforts remain limited by available funding. The concept of open data or open science that can be shared, freely used and reused by anyone for any purpose ( The Open Knowledge Foundation : http://opendefinition.org/ ) provides a useful alternative that helps bioinformaticians and other scientists to overcome the lack of access to their own experimental data. Public data reuse facilitates addressing of new biological questions ( Poldrack, and Gorgolewski 2014 ) and exploring secondary hypotheses that were not investigated in the original studies ( Chokshi et al. 2006 ; Goodman 2015 ; Van Horn and Ball 2008 ); developing and evaluating new methods; producing new products and services enabling inter- and transdisciplinary research ( Tenopir et al. 2011 ; Gaheen et al. 2013 ); implementing meta-analysis of data with the flexibility to include data and samples from different platforms (Rung and Brazma 2012) for making new observations that could not be detected using individual data sets; and integrating and analyzing several primary datasets in order to acquire new knowledge and/or build a new data resource (Rung and Brazma 2012). However, the reuse of data requires availability of sufficient information on the data generation process and experimental design to ensure its use in a scientifically appropriate way, as well as careful interpretation of the results.

Data repositories provide infrastructural solutions that enable scientists to share and reuse public data. The Wellcome Trust and the National Institutes of Health (NIH) have made large investments in sustainable infrastructures for genomic data (Pisani and AbouZahr 2010). Several online repositories and archives offer the possibility to store, access, use and reuse research and scientific inputs and outputs. Such platforms speed up the transfer of knowledge among researchers and across scientific fields, and open up new ways of collaborating that can produce new knowledge, products and services. In addition, a growing number of funding agencies and publishers around the world are advocating and enforcing data management for open access. Despite the efforts made by data providers to make ‘omics’ data obtainable by everyone and simple to access, there are still several challenges that prevent African researchers from properly exploiting the data. Technically, there is the issue of having a sustainable transfer channel to get the data on site, and there is generally a lack of skills and of African infrastructural support for sharing, storing, managing, archiving and retrieving data (Robinson 2014; Miller 2015). Shared data in an inappropriate presentation format are also less useful (Ferguson et al. 2014). Most scientists would like to access data from others, but are rarely willing to disseminate their own data, except where intellectual property is recognized and rewarded. In addition, few grant-funded investigators want to spend precious time and funds preparing data for someone else’s research, even in exchange for authorship credit. There are a number of ethical and legal issues: data from some older studies may not be suitable for widespread reuse, as they may not have had consent forms that allow free sharing of data with other researchers. Furthermore, even for new data with consent, anonymization of patients is not guaranteed, leaving researchers vulnerable to failure to preserve patients’ privacy.

Ethical challenges in sharing of genomic data

Whilst there are a number of key ethical challenges relating to the sharing of genomic data, one of the most pertinent of these relates to how to promote fairness and equity in sharing. Whilst several international policies advocate for release of data immediately after curation, other policies recognize that this may disadvantage investigators based in LMICs who do not (yet) have the resources or capacity to analyze data rapidly (Bull 2016; Bull et al. 2015a). The concern is that rapid sharing could lead to unfair collaboration practices and promote experiences of exploitation. Indeed, past collaborations may have offered little opportunity for African scientists to intellectually engage in or lead African health research. Adedokun and colleagues (Adedokun et al. 2016) performed an analysis of 508 articles published between January 2004 and December 2013 using data from Sub-Saharan Africa (SSA) to assess the contributions of SSA scientists to genomics research involving African participants. While the majority of the publications (91.1%) had at least one author affiliated with an African institution, 8.9% did not include any author affiliated with an African institution. Less than half (46.9%) of the publications had a first author from an African institution, while the remainder had a foreign scientist as first author (Adedokun et al. 2016). Data sharing that does not meaningfully engage African researchers risks being ill-suited to the needs of African populations and may even lead to inventions that are not relevant to Africans (Ramsay et al. 2011). Another concern is that researchers who are not based in Africa may not be able to meaningfully interpret research findings, which may aggravate existing stigma (Bull et al. 2015b; De Vries et al. 2014).

In order to address this problem, H3Africa adopted three mechanisms to ensure that H3Africa researchers have at least a fair chance to meaningfully use genomic resources ( De Vries et al. 2015b , De Vries et al. 2015c ). The first is to only make resources available after nine months to allow investigators time to analyze data and submit manuscripts for publication. A second mechanism involves adding an additional twelve month embargo period for shared data so that H3Africa investigators can reserve research questions that they will address using data that they have generated within a reasonable timeframe while other investigators refrain from publishing on those topics. A third mechanism is to outline specific requirements for secondary use of samples, and in some cases data. For instance, some data can only be used for studies on specific diseases, such as for cardiovascular research. For samples, secondary use must involve meaningful capacity building and involvement of African researchers for the first two years of access. New data generated off the samples are required to be submitted to the EGA under the H3Africa project.

Protection of participants

Protection of participants is very important in genomic research and data sharing. Informed consent is an important mechanism to avoid potential exploitation of research participants and protect their rights and well-being. Valid consent is a process rather than a simple one-off matter of signing a form, and participants in genomics research may need to be engaged multiple times during the research project to ensure their continued voluntary consent to all parts of the research. The consent process needs to be designed in a way that is culturally appropriate and understandable (Lemmens 2015). Investigators examining consent to genomic research in African settings have identified a number of challenges in communicating study goals, methods and procedures (Traore 2015). These include linguistic and conceptual barriers to comprehension of the research process, voluntariness, the relationship between research and clinical practice, broad consent and the potential consequences of future unspecified research using samples and data (Chokshi 2007; McGuire and Beskow 2010). Empirical evidence from across the continent shows that the majority of research participants do not grasp these concepts of genomics or are unable to explain what they mean (Tindana and De Vries 2016), although there is notable variation with regard to the location (urban/rural) and demographics of research participants (Marshall et al. 2014). There is often difficulty in finding local equivalent words for genomics terms and explaining these novel, unfamiliar and highly technical subjects in local languages (Munung et al. 2016; Traore et al. 2015; Tindana et al. 2012). The process of working out comprehensible language is crucial, whether in writing or verbally. Chokshi et al. (2007) suggest that this process should involve researchers, institutional review bodies (IRBs), funders and communities jointly determining commonly accepted language, oral and written, for particular concepts. In addition to its classical elements, it is recommended that informed consent for genomic research should address four major elements associated with explanations of data and sample sharing: (i) the authorities (e.g. DACs and IRB/RECs) deciding on reuse of samples, (ii) restrictions on secondary use (e.g. when providing conditions for broad consent), (iii) reasons for storage and sharing of data and samples, and (iv) the role of biobanks, which may include timelines for sample storage (Munung et al. 2016).

Community engagement (CE) plays an important role in extending the ethical principle of respect for persons to entire communities, avoiding exploitation, and building trust between researchers and the communities involved in research ( Lakes et al. 2014 ; Tindana et al. 2011 ). CE in the context of genomic research provides opportunities for informing and educating communities about genomics and genomic research, and exchanging information between the research team and potential research participants about the research process over a period of time ( Tindana et al. 2015 ). Drawing an example from ongoing CE and outreach activities within the phenomics core of the Stroke Investigative Research and Education Network, SIREN (a member of the H3Africa Initiative), the CE framework design includes i) development and implementation of a Community Advisory Board (CAB) within each site to guide the ongoing research activities and research dissemination within communities, and ii) public outreach to communities and engagement with a focus on explaining the study, its objectives, expectations and to invite active participation. This community-based participatory research, in addition to other advantages, has allowed for the development of trusting community-researcher relationships and guided the researcher and team in disseminating findings and translating research into practice and policy ( Jenkins et al. 2016 ). In some research settings, CE may be conducted prior to data collection, while in others it may continue throughout the duration of a study. CE has been shown to enhance understanding of research goals and procedures particularly with the complexities involved in genomic studies. It also provides an avenue for feeding back research findings to participants and communities ( Rotimi et al. 2007 ; De Vries et al. 2011 ). Engaging the community in the course of the research grants members of the community an input into the research through their leaders. Therefore, ideally, CE activities should occur prior to, during and after a research project.

Guidelines and requirements of different ethics boards

International best practices and the laws and statutes of many African countries require independent ethics committees (institutional review boards (IRBs) or research ethics committees (RECs)) to provide third-party review of research activities involving human subjects (Kass et al. 2007a). Although most African countries have some form of research review process, some still do not have national RECs (Nyika et al. 2009; Kirigia et al. 2005). Furthermore, IRB/REC requirements and guidelines vary across African borders, with local challenges (McWilliams et al. 2003). Recently, De Vries and collaborators (De Vries et al. 2017) conducted a comprehensive analysis of 30 existing ethics guidelines, policies and other similar sources from 22 African countries involved in H3Africa, in order to better understand the ethics regulatory landscape around genomic research and biobanking in Africa. It was shown that the types of ethics guidance sources varied tremendously across African countries, including standard operating procedures (SOPs); national guidelines for health research; national guidelines for genetic research; and ministerial decrees and laws. Among the countries included, only Malawi, Nigeria and South Africa had specific national or local guidelines for genomic and/or biobanking research. While the informed consent topic is discussed in all reviewed guidelines, the way in which it is described ranges from rather abstract, as in Kenya’s guidelines, to very detailed, as in Malawi’s guidelines. The use of broad consent for future unspecified uses seems to be prohibited only in Zambia, Malawi and Tanzania, while the majority of countries (Benin, Ghana, Guinea, Kenya, Lesotho, Mauritius, Namibia, Swaziland, Togo and Zimbabwe) neither prevent nor promote broad consent. The guidelines of the remaining countries (Botswana, Sierra Leone, Senegal, Uganda, Cameroon, South Africa, Sudan, Rwanda, Nigeria and Ethiopia) allow consent for future unspecified research, but with conditions attached. The storage of samples is allowed in all the countries studied; however, few of them offer specific guidance on the timeframe for storage. In Zambia, for example, samples can only be stored for a period of 10 years, and permission is needed for a longer period. In Malawi, samples can only be stored for five years, and in Zimbabwe, extraterritorial storage of samples beyond the study period is prohibited.

In contrast to sample storage, the export of samples is tightly controlled in many African countries. Indeed, guidelines from twelve countries (Ethiopia, Lesotho, Nigeria, Rwanda, Botswana, Malawi, South Africa, Zambia, Cameroon, Kenya, Uganda and Zimbabwe) address the export of samples and require approval from one or more national agencies. Concerning sample re-use, only 14 countries (Botswana, Ghana, Ethiopia, Rwanda, Uganda, Kenya, Nigeria, Senegal, Sudan and Tanzania) out of 22 require approval from an ethics committee. The other countries are silent on whether approval from an ethics committee is required. Nine countries (Botswana, Cameroon, Ethiopia, Ghana, Kenya, Rwanda, Uganda, Zambia and Zimbabwe) specifically endorse international collaboration. Guidelines from Botswana, Kenya and Uganda stipulate that export of samples is only allowed when there is no capacity to conduct the same analyses in the home country, while in Tanzania, if the local technology for the analysis exists, the researcher must explain why the samples need to be sent out of the country. Finally, with respect to data sharing, only guidelines from Cameroon, Tanzania and Ethiopia mention data sharing and require a data sharing agreement to be submitted as part of the ethics application. Cameroon and Tanzania require review of all secondary studies by the ethics committee (De Vries et al. 2017).

General research ethics challenges in Africa include concerns about independence of RECs, inadequate funding and inadequate qualified staff in a background of weak health system infrastructures ( Kass et al. 2007a ; Nyika et al. 2009 ; Coleman and Bouësseau 2008 ). There is widespread poor capacity to handle ethical issues with serious implications for efficient functioning of RECs ( Marzouk et al. 2014 ; Shabrawy et al. 2014 ). This poor capacity among RECs on the continent casts some doubt on their capacity to act as partners in genomics research in some places, in addition to the grave implications of making poorly informed decisions on applications ( Tindana et al. 2012 ; Wright et al. 2013 ). There are also issues of poor representation in committees, poor understanding of the role of RECs, an inadequate number of properly constituted RECs, low standard of the RECs, and inadequate compensation for time on the committees for members ( Coleman & Bouësseau 2008 ; Kirigia et al. 2005 ; Kass et al. 2007b; Ikingura et al. 2007 ). Therefore there is an urgent need for strengthening capacity for research ethics committees and for ethics research which could aid in development of national and regional policies which can support data sharing in Africa.

Although sharing of biospecimens is well established and enabled through material transfer agreements (MTAs), guidelines and mechanisms for sharing of genomics data are still evolving. Data protection laws differ across countries ( https://www.dlapiperdataprotection.com ) and could pose challenges if gaps identified in these laws are not recognized and addressed. The laws need to provide guidelines and policies that ensure the success of consortium and collaborative research projects while ensuring that genomics data sharing is implemented in ways that are compatible with national laws and interests. Studies among H3Africa researchers, such as that of De Vries et al. (2017) mentioned above, are underway to understand these gaps and improve alignment with national laws and guidelines. Unsurprisingly, there is no uniform informed consent template across different countries for genomic studies, because there are wide cultural and ethnic disparities that make broad consent for data sharing difficult to apply across multinational studies (Wright et al. 2014). There are legal and ethical challenges to the transition from narrow consent, which is the usual practice, to broad consent, which is necessary in emerging fields such as genomics research (Munung et al. 2016; Tindana et al. 2014; De Vries et al. 2015c). There are additional layers of bureaucratic obstacles in some countries, due to strict requirements for MTAs. These are in place to protect against the loss and exploitation of national samples, but can affect the ability to send genetic materials outside the country (Federal Ministry of Health 2007; Staunton & Moodley 2013). Since data generation for many genetic studies is likely to be done outside Africa, MTAs can thus affect data generation and transfer. In some cases, MTAs are not clear about ownership of data or samples and how these would be governed, especially where the funding belongs to an external institution. Anecdotally, it may be stated that all materials belong to the government of the country where samples are obtained; however, the international collaborator, who is often well-resourced, may dispute this ownership, citing their significant funding contributions. There is therefore a need for the development of guidelines on how these issues may be addressed. The Ethics and Governance Framework for Best Practice in Genomic Research and Biobanking in Africa recommends that MTAs should outline directions for handling commercializable outputs, including benefit-sharing arrangements.

Skills development for Big Data

Sustainable human resources.

It is important to emphasize the challenge of sustaining ongoing big data analysis. In particular, sustainable big data analysis requires a cohort of well-supported researchers at senior and junior levels, as well as a pipeline of graduate students and post-doctoral fellows. A crucial consideration is that not only are African science, health, and education budgets relatively small in real terms, but also that those who hold posts in these sectors have other duties (including teaching and clinical duties) that leave little protected time for research ( Bezuidenhout et al. 2017 ). More senior researchers are likely to have significant administrative obligations, while more junior researchers are likely to have significant teaching obligations. Many African countries and institutions have not developed the concept of protected time for specific activities, or of aligning salary components with funding sources and proportional effort. Additionally, many funding agencies fund only research consumables, and very few provide salary support or funds for developing general research infrastructure.

While there is a growing pipeline of postgraduate students in Africa, the flow is not uniform across countries, and it is relatively slow in some regions. Most postgraduate programs do not have postdoctoral components, which truncates the training, mentoring and development of African scientists. In fact, there are very few well-developed post-doctoral programs on the continent. Additionally, training of staff and students does not necessarily incorporate training in data management, curation or access. Even if researchers have easy access to data, many in poorly resourced institutions lack the capability to make effective use of it. Therefore it is not only training in access that is important, but also training in how to use what can be accessed ( Bezuidenhout et al. 2017 ). H3Africa has provided key resources for data collection and analysis in the short term, but the longer-term view, in which sustained big data analysis takes place, may well require a different set of resources. An open ecosystem of genomic data requires bioinformatics skills and data curators, positions which may receive low priority for funding in academic institutions. In the absence of sustainable resources for big data analysis in Africa, there is every likelihood that publicly shared data will be analyzed by researchers in high-income countries. This would perpetuate the “research gap”, whereby 90% of the world’s research is done in geographic regions where only 10% of the world’s population lives. This again raises the ethical issues of fairness and equity in sharing, discussed above, and, more importantly, underscores the need to develop genomics skills through training.

Training in genomics data analysis

DNA sequencing is rapidly becoming a core part of medicine for millions of patients as sequencing costs fall. The data sources being generated by researchers, hospitals, and mobile devices around the world are diverse, complex, disorganized, massive, and multimodal ( BD2K, 2016 ). It is predicted that by 2025, genomics will produce between 2 and 40 exabytes of data annually ( Stephens et al. 2015 ). This deluge of data presents new opportunities as well as new challenges. If researchers can make sense of this wealth of information, our understanding of human health and disease could advance immensely. However, apart from the lack of appropriate tools and poor data accessibility, a major barrier to this rapid translational impact is insufficient training in Big Data analytics in Africa. New bioinformatics curricula can prepare students to address the challenges raised by big data in the areas of data unification, computational and storage limitations, and multiple hypothesis testing ( Greene et al. 2016 ). Data unification refers to the challenge of resolving inconsistencies in data so that the necessary data are obtained in the appropriate format and normalized to be comparable across sources; this is essential for effective data sharing. H3ABioNet has run short courses on data management and analysis and has developed a recommended curriculum and guidelines for developing new degree programs in bioinformatics ( https://training.h3abionet.org/curriculum_development_wg/ ). These were used to establish a new Masters degree at the University of Bamako, Mali, which completed its first successful program this year (2017).
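To make the projected scale concrete, a rough back-of-the-envelope calculation is sketched below. The per-genome size and annual sequencing volume are assumed, illustrative figures, not values taken from Stephens et al. ( 2015 ):

```python
# Illustrative, assumed figures (not from the cited study)
gb_per_genome = 100              # assumed raw read volume for one 30x human genome, in GB
genomes_per_year = 100_000_000   # assumed number of genomes sequenced worldwide per year

total_exabytes = gb_per_genome * genomes_per_year / 1e9   # 1 EB = 1e9 GB
print(f"~{total_exabytes:.0f} EB of raw reads per year")  # ~10 EB, within the 2-40 EB range
```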

The field is moving rapidly, and the challenges, and thus the solutions, will keep changing. Scientists with skills in big data will need to understand the current computing environment (processor, storage, memory and network costs) and how to mine large-scale data most effectively within that environment to derive useful insights. Significant resources are being allocated to training scientists in the analysis of large-scale data in the United States and worldwide. In Africa, various training courses on big data have been provided to scientists and students under both the Square Kilometer Array Africa ( www.ska.ac.za/ ) and IBM MEA (Middle East and Africa) initiatives ( www.ibm.com/services/weblectures/meap ). IBM has partnered with academia in Africa to promote big data and analytics training programs through its MEA program. In another example, the United Genomes Project ( http://www.unitedgenomes.org/ ) aims to help researchers advance genomic medicine by (1) developing methods to assemble genomic data across multiple African ethnicities, (2) building capacity across the continent and (3) facilitating scientific discovery through crowdsourcing and open innovation ( Siwo et al. 2015 ). The United Genomes Project collaborates with existing educational programs at universities and with programs offered by projects such as H3Africa.

A few organizations in Africa, namely H3ABioNet nodes, the African Society for Bioinformatics and Computational Biology, the African Society for Human Genetics, and Teaching and Research in Natural Sciences for Development in Africa (TReND in Africa; http://trendinafrica.org ), are actively involved in offering training programs in genomics and big data for African scientists ( Karikari & Aleksic 2015 ). The Wellcome Trust also regularly runs genomics-related courses in Africa through its extramural program. H3ABioNet has run a number of courses across all areas of bioinformatics and systems administration to prepare researchers for the deluge of genomics data. The African Genomic Medicine Training Initiative, a spin-off from H3ABioNet and other initiatives, has designed and run genomic medicine training for African-based healthcare professionals, based on collaboratively developed curricula.

Overcoming obstacles/challenges to facilitate translational genomics in Africa

Research output and authorship can serve as a useful proxy for assessing the research capacity of universities and research institutions. A recent study used evidence from genomics publications across Sub-Saharan Africa (SSA) to assess the genomic epidemiology research capacity of scientists in the region from 2004 to 2013. Significant disparities currently exist among SSA countries in genomics research capacity ( Adedokun et al. 2016 ). South Africa has the highest genomics research output, which reflects the investments made in its genomics and biotechnology sector ( Hardy et al. 2008 ; Warnich et al. 2011 ; Ndimba and Thomas 2008 ; Motari et al. 2004 ). The study findings call for African governments to increase their investments in building local capacity, provide a sustainable research environment, and encourage joint genomics research among those affiliated with SSA universities. Genomics has huge potential to improve the diagnosis and treatment of several medical conditions in Africa, and integrated translational research that includes basic functional genomics could add value to the genomics data being generated. While the focus is currently on improving capacity for the storage and transfer of genomics data, interpreting those data is a critical step towards the research and development of genomic products and personalized medicine, and towards understanding population diversity in health and disease. However, the infrastructure and the pool of scientists available to conduct genomics research in the region remain suboptimal. Although the H3Africa efforts hold great promise for transforming genomics research in Africa through capacity building and better research facilities, there is a need to document the state of local and regional genomics research productivity in order to guide the equitable distribution of resources ( Adedokun et al. 2016 ). Foreign research investments, such as those made by the NIH and Wellcome Trust through H3Africa, have given genomics research in Africa a major boost, but the funding is not infinite, and research groups in African countries need to work towards long-term research sustainability.

As outlined above, one key challenge in data sharing relates to promoting fairness and equity. A crucially important aspect of ensuring fairness lies in empowering African researchers to take intellectual leadership roles in genomic research, which includes leading data analysis. The governance framework plays a critical role in creating conditions that give African researchers a fair chance to analyze their data first. The H3Africa Data Release Policy incorporates several elements to promote fairness, including a lengthy period before data must be released (submission to the EGA is required only after nine months) and a further twelve-month publication embargo on certain identified topics after the data are released. Whilst it remains to be proven that such conditions promote equitable sharing ( Bull 2016 ), the fact that they were developed with the endorsement of both the NIH and the Wellcome Trust is a significant achievement for H3Africa researchers, and sets an important target for other, future genomic research initiatives that seek to use African samples. Since the first dataset from H3Africa has only recently been submitted to the EGA and other projects are just starting to analyze their own data, it is too early to tell what the impact of the policy will be, but the H3Africa consortium has already collectively published more than 70 papers since 2014.

Although some of the other obstacles discussed in this paper cannot be overcome easily, and require improvements in basic infrastructure and service provision by local governments, H3Africa has had a large impact on the development of capacity for genomics research and data generation. H3ABioNet has provided support for overcoming challenges in data transfer, storage and processing, and together with the larger H3Africa consortium, has played a major role in developing the necessary skills. Thus, the consortium is working toward a scenario in which genomics data can be generated, stored and analyzed in Africa for the benefit of African scientists and ultimately translated to improve the health of Africans.

Acknowledgements

H3ABioNet is supported by the National Institutes of Health Common Fund (grant number U41HG006941). SNA and CAA are funded by H3Africa’s African Collaborative Center for Microbiome and Genomics Research (ACCME) grant supported by the National Institutes of Health (1U54HG006947). OA, OA and MOO are funded by H3Africa’s Stroke Investigative Research and Educational Network (U54HG007479). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Competing Interests

The authors have no competing interests to declare.

Adedokun, B O, Olopade, C O and Olopade, O I (2016). Building local capacity for genomics research in Africa: recommendations from analysis of publications in Sub-Saharan Africa from 2004 to 2013 Global Health Action 9: 31026. DOI:  https://doi.org/10.3402/gha.v9.31026  

BD2K (2016).  https://datascience.nih.gov [Accessed, December, 2016].  

Bezuidenhout, L M, Leonelli, S, Kelly, A H and Rappert, B (2017). Beyond the digital divide: Towards a situated approach to open data Science and Public Policy 44(4): 464–475, DOI:  https://doi.org/10.1093/scipol/scw036  

Bull, S (2016). Ensuring global equity in open research In: London: Wellcome Trust, DOI:  https://doi.org/10.6084/m9.figshare.4055181.v1  

Bull, S, Cheah, P Y, Denny, S, Jao, I, Marsh, V, Merson, L, Shah More, N, Nhan, L N T, Osrin, D, Tangseefa, D, Wassenaar, D and Parker, M (2015a). Best Practices for Ethical Sharing of Individual-Level Health Research Data From Low- and Middle-Income Settings Journal of Empirical Research on Human Research Ethics 10: 302–313, DOI:  https://doi.org/10.1177/1556264615594606  

Bull, S, Roberts, N and Parker, M (2015b). Views of Ethical Best Practices in Sharing Individual-Level Data From Medical and Public Health Research: A Systematic Scoping Review Journal of Empirical Research on Human Research Ethics 10: 225–38, DOI:  https://doi.org/10.1177/1556264615594767  

Chokshi, D A, Parker, M and Kwiatkowski, D P (2006). Data sharing and intellectual property in a genomic epidemiology network: policies for large-scale research collaboration Bulletin of the World Health Organisation 84(5): 382–7, DOI:  https://doi.org/10.2471/BLT.06.029843  

Chokshi, D A, Thera, M A, Parker, M, Diakite, M, Makani, J, Kwiatkowski, D P and Doumbo, O K (2007). Valid consent for genetic epidemiology in developing countries PLOS Medicine 4(4): e95. 636–41.  

Coleman, C H and Bouësseau, M-C (2008). How do we know that research ethics committees are really working? The neglected role of outcomes assessment in research ethics review BMC medical ethics 9: 6. DOI:  https://doi.org/10.1186/1472-6939-9-6  

De Vries, J, Abayomi, A, Littler, K, Madden, E, McCurdy, S, Ouwe Missi Oukem-Boyer, O, Seeley, J, Staunton, C, Tangwa, G, Tindana, P, Troyer, J and The H3Africa Working Group on Ethics (2015a). Addressing ethical issues in H3Africa research – the views of research ethics committee members The HUGO Journal 9: 1. DOI:  https://doi.org/10.1186/s11568-015-0006-6  

De Vries, J, Abayomi, A, Littler, K, Madden, E, McCurdy, S, Ouwe Missi Oukem-Boyer, O, Seeley, J, Staunton, C, Tangwa, G, Tindana, P, Troyer, J and The H3Africa Working Group on Ethics (2015c). Addressing ethical issues in H3Africa research – the views of research ethics committee members The HUGO Journal 9(1): 1. DOI:  https://doi.org/10.1186/s11568-015-0006-6  

De Vries, J, Bull, S J, Doumbo, O, Ibrahim, M, Mercereau-Puijalon, O, Kwiatkowski, D and Parker, M (2011). Ethical issues in human genomics research in developing countries BMC Medical Ethics 12: 5. DOI:  https://doi.org/10.1186/1472-6939-12-5  

De Vries, J, Littler, K, Matimba, A, McCurdy, S, Ouwe Missi Oukem-Boyer, O, Seeley, J and Tindana, P (2016). Evolving perspectives on broad consent for genomics research and biobanking in Africa. Report of the Second H3Africa Ethics Consultation Meeting, 11th May 2015 Global Health, Epidemiology and Genomics 1: 1–3, DOI:  https://doi.org/10.1017/gheg.2016.5  

De Vries, J, Munung, S N, Matimba, A, McCurdy, S, Ouwe Missi Oukem-Boyer, O, Staunton, C, Yakubu, A, Tindana, P and H3Africa Consortium (2017). Regulation of genomic and biobanking research in Africa: a content analysis of ethics guidelines, policies and procedures from 22 African countries BMC Med Ethics 18(1): 8. DOI:  https://doi.org/10.1186/s12910-016-0165-6  

De Vries, J, Tindana, P, Littler, K, Ramsay, M, Rotimi, C, Abayomi, A, Mulder, N and Mayosi, B M (2015b). The H3Africa policy framework: negotiating fairness in genomics Trends in Genetics 31: 117–9, DOI:  https://doi.org/10.1016/j.tig.2014.11.004  

De Vries, J, Williams, T, Bojang, K, Kwiatkowski, D, Fitzpatrick, R and Parker, M (2014). Knowing who to trust: exploring the role of ‘ethical metadata’ in mediating risk of harm in collaborative genomics research in Africa BMC Medical Ethics 15: 62. DOI:  https://doi.org/10.1186/1472-6939-15-62  

Federal Ministry of Health (2007). National Code of Health Research Ethics  

Ferguson, A R, Nielson, J L, Cragin, M H, Bandrowski, A E and Martone, M E (2014). Big data from small data: data-sharing in the ‘long tail’ of neuroscience Nature Neuroscience 17(11): 1442–7, DOI:  https://doi.org/10.1038/nn.3838  

Foster, I (2011). Globus Online: Accelerating and Democratizing Science through Cloud-Based Services Internet Computing, IEEE 15(3): 70–3. http://doi.ieeecomputersociety.org/10.1109/MIC.2011.64  

Gaheen, S, Hinkal, G W, Morris, S A, Lijowski, M, Heiskanen, M and Klemm, J D (2013). caNanoLab: data sharing to expedite the use of nanotechnology in biomedicine Computational Science and Discovery 6(1): 014010. DOI:  https://doi.org/10.1088/1749-4699/6/1/014010  

Goodman, S N (2015). Clinical trial data sharing: what do we do now? Annals of Internal Medicine 162(4): 308–9, DOI:  https://doi.org/10.7326/M15-0021  

Goodwin, S, McPherson, J D and McCombie, W R (2016). Coming of age: ten years of next-generation sequencing technologies Nature Reviews Genetics 17(6): 333–51, DOI:  https://doi.org/10.1038/nrg.2016.49  

Greene, A C, Giffin, K A, Greene, C S and Moore, J H (2016). Adapting bioinformatics curricula for big data Briefings in Bioinformatics 17(1): 43–50, DOI:  https://doi.org/10.1093/bib/bbv018  

Gymrek, M, McGuire, A L, Golan, D, Halperin, E and Erlich, Y (2013). Identifying personal genomes by surname inference Science 339: 321–4, DOI:  https://doi.org/10.1126/science.1229566  

Hardy, B J, Seguin, B, Ramesar, R, Singer, P A and Daar, A S (2008). South Africa: from species cradle to genomic applications Nature Reviews Genetics 9: S19–23, DOI:  https://doi.org/10.1038/nrg2441  

Homer, N, Szelinger, S, Redman, M, Duggan, D, Tembe, W, Muehling, J, Pearson, J V, Stephan, D A, Nelson, S F and Craig, D W (2008). Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays PLoS Genetics 4: e1000167. DOI:  https://doi.org/10.1371/journal.pgen.1000167  

Ikingura, J K B, Kruger, M and Zeleke, W (2007). Health research ethics review and needs of institutional ethics committee in Tanzania Tanzania Health Research Bulletin 9(3): 154–8.  

Jenkins, C, Arulogun, O S, Singh, A, Mande, A T, Ajayi, E, Benedict, C T, Ovbiagele, B, Lackland, D T, Sarfo, F S, Akinyemi, R, Akpalu, A, Obiako, R, Melikam, E S, Laryea, R, Shidali, V, Sagoe, K, Ibinaiye, P, Fakunle, A G, Owolabi, L F, Owolabi, M O and The SIREN team (2016). Stroke Investigative Research and Education Network: Community Engagement and outreach within Phenomic core Health Education & Behavior 43(1S): 82S–92S, http://journals.sagepub.com/home/heb DOI:  https://doi.org/10.1177/1090198116634082  

Karikari, T K and Aleksic, J (2015). Neurogenomics: An opportunity to integrate neuroscience, genomics and bioinformatics research in Africa Applied and Translational Genomics 5: 3–10, DOI:  https://doi.org/10.1016/j.atg.2015.06.004  

Kass, N E, Hyder, A A, Ajuwon, A, Appiah-Poku, J, Barsdorf, N, Elsayed, D E, Mokhachane, M, Mupenda, B, Ndebele, P, Ndossi, G, Sikateyo, B, Tangwa, G and Tindana, P (2007a). The structure and function of research ethics committees in Africa: A case study PLoS Medicine 4(1): e3. DOI:  https://doi.org/10.1371/journal.pmed.0040003  

Kirigia, J M, Wambebe, C and Baba-Moussa, A (2005). Status of national research bioethics committees in the WHO African region BMC Medical Ethics 6: E10. DOI:  https://doi.org/10.1186/1472-6939-6-10  

Lakes, K D, Vaughan, E, Pham, J, Tran, T, Jones, M, Baker, D, Swanson, J M and Olshansky, E (2014). Community member and faith leader perspectives on the process of building trusting relationships between communities and researchers Clinical and Translational Science 7(1): 20–8, DOI:  https://doi.org/10.1111/cts.12136  

Lambert, C A and Tishkoff, S A (2009). Genetic structure in African populations: Implications for human demographic history Cold Spring Harb Symp Quant Biol 74: 395–402, DOI:  https://doi.org/10.1101/sqb.2009.74.053  

Lemmens, T (2015). Informed consent In: Joly, Y and Knoppers, B M eds.   Routledge handbook of medical law and ethics . Abingdon, UK: Routledge, pp. 27–51.  

Marshall, P, Adebamowo, C, Adeyemo, A, Ogundiran, T, Strenski, T, Zhou, J and Rotimi, C (2014). Voluntary participation and comprehension of informed consent in a genetic epidemiological study of breast cancer in Nigeria BMC Medical Ethics 15: 38. DOI:  https://doi.org/10.1186/1472-6939-15-38  

Marzouk, D, Abd El Aal, W, Saleh, A, Sleem, H, Khyatti, M, Mazini, L, Hemminki, K and Anwar, W A (2014). Overview on health research ethics in Egypt and North Africa European Journal of Public Health 24(Suppl 1): 87–91, DOI:  https://doi.org/10.1093/eurpub/cku110  

McGuire, A L and Beskow, L M (2010). Informed consent in genomics and genetic research Annual Review of Genomics and Human Genetics 11: 361. DOI:  https://doi.org/10.1146/annurev-genom-082509-141711  

McWilliams, R, Hoover-Fong, J, Hamosh, A, Beck, S, Beaty, T and Cutting, G (2003). Problematic variation in local institutional review of a multicenter genetic epidemiology study Jama 290(3): 360–6, DOI:  https://doi.org/10.1001/jama.290.3.360  

Miller, GW (2015). Data sharing in toxicology: beyond show and tell Toxicological Sciences 143(1): 3–5, DOI:  https://doi.org/10.1093/toxsci/kfu237  

Motari, M, Quach, U, Thorsteinsdottie, H, Martin, D K, Daar, A S and Singer, P A (2004). South Africa blazing a trail for African biotechnology Nature Biotechnology 22: DC37–42, DOI:  https://doi.org/10.1038/nbt1204supp-DC37  

Mulder, N J Adebiyi, E Alami, R Benkahla, A Brandful, J Doumbia, S et al. (2015). H3ABioNet, a Sustainable Pan African Bioinformatics Network for Human Heredity and Health in Africa Genome Research 26(2): 271–7, DOI:  https://doi.org/10.1101/gr.196295  

Munung, N S, Marshall, P, Campbell, M, Littler, K, Masiye, F, Ouwe-Missi-Oukem-Boyer, O, Seeley, J, Stein, D J, Tindana, P and de Vries, J (2016). Obtaining informed consent for genomics research in Africa: analysis of H3Africa consent documents Journal of Medical Ethics 42(2): 132–7, DOI:  https://doi.org/10.1136/medethics-2015-102796  

Ndimba, B K and Thomas, L A (2008). Proteomics in South Africa: Current status, challenges and prospects Biotechnology Journal 3: 1368–74, DOI:  https://doi.org/10.1002/biot.200800236  

Nyika, A, Kilama, W, Chilengi, R, Tangwa, G, Tindana, P, Ndebele, P and Ikingura, J (2009). Composition, training needs and independence of ethics review committees across Africa: are the gate-keepers rising to the emerging challenges? Journal of Medical Ethics 35(3): 189–93, DOI:  https://doi.org/10.1136/jme.2008.025189  

Open Knowledge Foundation (). The Open Definition http://opendefinition.org/ [Accessed, December, 2016].  

Pisani, E and AbouZahr, C (2010). Sharing health data: good intentions are not enough Bulletin of the World Health Organisation 88(6): 462–6, DOI:  https://doi.org/10.2471/BLT.09.074393  

Poldrack, R A and Gorgolewski, K J (2014). Making big data open: data sharing in neuroimaging Nature Neuroscience 17(11): 1510–7, DOI:  https://doi.org/10.1038/nn.3818  

Ramsay, M, De Vries, J, Soodyall, H, Norris, S and Sankoh, O (2014). Ethical issues in genomic research on the African continent: experiences and challenges to ethics review committees Human Genomics 8: 15. DOI:  https://doi.org/10.1186/s40246-014-0015-x  

Ramsay, M, Tiemessen, C T, Choudhury, A and Soodyall, H (2011). Africa: the next frontier for human disease gene discovery? Human Molecular Genetics 20: R214–20, DOI:  https://doi.org/10.1093/hmg/ddr401  

Robinson, P N (2014). Genomic data sharing for translational research and diagnostics Genome Medicine 6(9): 78. DOI:  https://doi.org/10.1186/s13073-014-0078-2  

Rotimi, C, Leppert, M, Matsuda, I, Zeng, C, Zhang, H, Adebamowo, C, Ajayi, I, Aniagwu, T, Dixon, M, Fukushima, Y, Macer, D, Marshall, P, Nkwodimmah, C, Peiffer, A, Royal, C, Suda, E, Zhao, H, Wang, VO, McEwen, J and International HapMap Consortium (2007). Community engagement and informed consent in the International HapMap project Community Genetics 10(3): 186–98, DOI:  https://doi.org/10.1159/000101761  

Rung, J and Brazma, A (2013). Reuse of public genome-wide gene expression data Nature Reviews Genetics 14(2): 89–99, DOI:  https://doi.org/10.1038/nrg3394  

Schuster, S C Miller, W Ratan, A Tomsho, L P Giardine, B Kasson, L R et al. (2010). Complete Khoisan and Bantu genomes from southern Africa Nature 463(7283): 943–7, DOI:  https://doi.org/10.1038/nature08795  

Shabani, M, Thorogood, A and Borry, P (2016). Who should have access to genomic data and how should they be held accountable? Perspectives of Data Access Committee members and experts European Journal of Human Genetics 24: 1671–5, DOI:  https://doi.org/10.1038/ejhg.2016.111  

Shabrawy, E, El Hefnawy, T and Reda, H (2014). Applying Ethical Guidelines in Clinical Researches among Academic Medical Staff: An Experience from South Egypt British Journal of Medicine and Medical Research 4(10): 2014–24, DOI:  https://doi.org/10.9734/BJMMR/2014/7071  

Siwo, G H, Williams, S M and Moore, J H (2015). The future of genomic medicine education in Africa Genome Medicine 7(1): 47. http://genomemedicine.com/content/7/1/47 DOI:  https://doi.org/10.1186/s13073-015-0175-x  

Staunton, C and Moodley, K (2013). Challenges in biobank governance in Sub-Saharan Africa BMC medical Ethics 14(1): 35. DOI:  https://doi.org/10.1186/1472-6939-14-35  

Stephens, Z D Lee, S Y Faghri, F Campbell, R H Zhai, C Efron, M J et al. (2015). Big Data: Astronomical or Genomical? PLoS Biology 13(7): e1002195. DOI:  https://doi.org/10.1371/journal.pbio.1002195  

Tenopir, C, Allard, S, Douglass, K, Aydinoglu, A U, Wu, L, Read, E, Manoff, M and Frame, M (2011). Data Sharing by Scientists: Practices and Perceptions PLoS One 6(6): e21101. DOI:  https://doi.org/10.1371/journal.pone.0021101  

The H3Africa Consortium (2014). Enabling African Scientists to Engage Fully in the Genomic Revolution Science 344(6190): 1346–8, DOI:  https://doi.org/10.1126/science.1251546  

Tindana, P, Bull, S, Amenga-Etego, L, de Vries, J, Aborigo, R, Koram, K, Kwiatkowski, D and Parker, M (2012). Seeking consent to genetic and genomic research in a rural Ghanaian setting: a qualitative study of the MalariaGEN experience BMC Medical Ethics 13(1): 15. DOI:  https://doi.org/10.1186/1472-6939-13-15  

Tindana, P and De Vries, J (2016). Broad Consent for Genomic Research and Biobanking: Perspectives from Low- and Middle-Income Countries Annual Review of Genomics and Human Genetics 17: 2.1–2.19, DOI:  https://doi.org/10.1146/annurev-genom-083115-022456  

Tindana, P, De Vries, J, Campbell, M, Littler, K, Seeley, J, Marshall, P, Troyer, J, Ogundipe, M, Alibu, V P, Yakubu, A, Parker, M and as members of the H3A Working Group on Ethics (2015). Community engagement strategies for genomic studies in Africa: a review of the literature BMC Medical Ethics 16: 24. DOI:  https://doi.org/10.1186/s12910-015-0014-z  

Tindana, P, Molyneux, C S, Bull, S and Parker, M (2014). Ethical issues in the export, storage and reuse of human biological samples in biomedical research: perspectives of key stakeholders in Ghana and Kenya BMC Medical Ethics 15(76): 1–11, DOI:  https://doi.org/10.1186/1472-6939-15-76  

Tindana, P O, Rozmovits, L, Boulanger, R F, Bandewar, S V, Aborigo, R A, Hodgson, A V, Kolopack, P and Lavery, J V (2011). Aligning community engagement with traditional authority structures in global health research: A case study from northern Ghana American Journal of Public Health 101(10): 1857–67, DOI:  https://doi.org/10.2105/AJPH.2011.300203  

Tishkoff, S A Reed, F A Friedlaender, F R Ehret, C Ranciaro, A Froment, A et al. (2009). The genetic structure and history of Africans and African Americans Science 324(5930): 1035–44, DOI:  https://doi.org/10.1126/science.1172257  

Traore, K, Bull, S, Niare, A, Konate, S, Thera, M A, Kwiatkowski, D, Parker, M and Doumbo, O K (2015). Understandings of genomic research in developing countries: a qualitative study of the views of MalariaGEN participants in Mali BMC Medical Ethics 16(42): 42. DOI:  https://doi.org/10.1186/s12910-015-0035-7  

Van Horn, J D and Ball, C A (2008). Domain-Specific Data Sharing in Neuroscience: what do we have to learn from each other? Neuroinformatics 6(2): 117–21, DOI:  https://doi.org/10.1007/s12021-008-9019-9  

van Panhuis, W G, Paul, P, Emerson, C, Grefenstette, J, Wilder, R, Herbst, A J, Heymann, D and Burke, D S (2014). A systematic review of barriers to data sharing in public health BMC Public Health 14: 1144. DOI:  https://doi.org/10.1186/1471-2458-14-1144  

Warnich, L, Drogemoller, B I, Pepper, M S, Dandara, C and Wright, E B (2011). Pharmacogenomic research in South Africa: Lessons learned and future opportunities in the Rainbow Nation Current Pharmacogenomics Personalised Medicine 9: 191–207, DOI:  https://doi.org/10.2174/187569211796957575  

Wright, G E B, Adeyemo, A A and Tiffin, N (2014). Informed consent and ethical re-use of African genomic data Human Genomics 8(18): 1–3, DOI:  https://doi.org/10.1186/s40246-014-0018-7  

Wright, G E B, Koornhof, P G, Adeyemo, A A and Tiffin, N (2013). Ethical and legal implications of whole genome and whole exome sequencing in African populations BMC Medical Ethics 14(1): 21. DOI:  https://doi.org/10.1186/1472-6939-14-21  


A Brief Guide to Genomics

Genomics is the study of all of a person's genes (the genome), including interactions of those genes with each other and with the person's environment.

What is DNA?

Deoxyribonucleic acid (DNA) is the chemical compound that contains the instructions needed to develop and direct the activities of nearly all living organisms. DNA molecules are made of two twisting, paired strands, often referred to as a double helix.

Each DNA strand is made of four chemical units, called nucleotide bases, which comprise the genetic "alphabet." The bases are adenine (A), thymine (T), guanine (G), and cytosine (C). Bases on opposite strands pair specifically: an A always pairs with a T, and a C always pairs with a G. The order of the As, Ts, Cs and Gs determines the meaning of the information encoded in that part of the DNA molecule, just as the order of letters determines the meaning of a word.
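Because these pairing rules are fixed, the sequence of one strand fully determines the sequence of its partner. The following minimal Python sketch (an illustration added here, not part of the original guide) derives the complementary strand:

```python
# Base-pairing rules: A<->T, C<->G
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(strand))

print(reverse_complement("ATGCGT"))  # ACGCAT
```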

What is a genome?

An organism's complete set of DNA is called its genome. Virtually every single cell in the body contains a complete copy of the approximately 3 billion DNA base pairs, or letters, that make up the human genome.

With its four-letter language, DNA contains the information needed to build the entire human body. A gene traditionally refers to the unit of DNA that carries the instructions for making a specific protein or set of proteins. Each of the estimated 20,000 to 25,000 genes in the human genome codes for an average of three proteins.

Located on 23 pairs of chromosomes packed into the nucleus of a human cell, genes direct the production of proteins with the assistance of enzymes and messenger molecules. Specifically, an enzyme copies the information in a gene's DNA into a molecule called messenger ribonucleic acid (mRNA). The mRNA travels out of the nucleus and into the cell's cytoplasm, where the mRNA is read by a tiny molecular machine called a ribosome, and the information is used to link together small molecules called amino acids in the right order to form a specific protein.
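As a toy illustration of this flow of information (a simplified sketch using only a tiny subset of the genetic code, not a biologically complete model), the snippet below transcribes a short coding-strand DNA sequence into mRNA and translates it codon by codon:

```python
# Toy model of transcription and translation (tiny subset of the genetic code)
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

def transcribe(coding_strand: str) -> str:
    # mRNA mirrors the coding strand, with uracil (U) in place of thymine (T)
    return coding_strand.replace("T", "U")

def translate(mrna: str) -> list:
    protein = []
    for i in range(0, len(mrna) - 2, 3):              # read one codon (three bases) at a time
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("ATGTTTGGCTAA")
print(mrna)             # AUGUUUGGCUAA
print(translate(mrna))  # ['Met', 'Phe', 'Gly']
```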

Proteins make up body structures such as organs and tissue; they also control chemical reactions and carry signals between cells. If a cell's DNA is mutated, an abnormal protein may be produced, which can disrupt the body's usual processes and lead to a disease such as cancer.


What is DNA sequencing?

Sequencing simply means determining the exact order of the bases in a strand of DNA. Because bases exist as pairs, and the identity of one base determines its partner on the opposite strand, researchers do not have to report both bases of the pair.

In the most common type of sequencing used today, called sequencing by synthesis, DNA polymerase (the enzyme in cells that synthesizes DNA) is used to generate a new strand of DNA from a strand of interest. In the sequencing reaction, the enzyme incorporates into the new DNA strand individual nucleotides that have been chemically tagged with a fluorescent label. As this happens, the nucleotide is excited by a light source, and a fluorescent signal is emitted and detected. The signal is different depending on which of the four nucleotides was incorporated. This method can generate 'reads' of 125 nucleotides in a row and billions of reads at a time.

To assemble the sequence of all the bases in a large piece of DNA such as a gene, researchers need to read the sequence of overlapping segments. This allows the longer sequence to be assembled from shorter pieces, somewhat like putting together a linear jigsaw puzzle. In this process, each base has to be read not just once, but at least several times in the overlapping segments to ensure accuracy.
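A toy sketch of this idea appears below: a naive greedy merge of overlapping reads, far simpler than a real assembler, but enough to show how overlaps let short pieces be stitched into a longer sequence (illustrative reads only):

```python
# Naive greedy merge of overlapping reads (toy illustration, not a real assembler)
def merge(left: str, right: str, min_overlap: int = 3) -> str:
    """Append `right` to `left` using the longest suffix/prefix overlap found."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return left + right  # no usable overlap found

reads = ["ATGCGTAC", "GTACCTTA", "CTTAGGCA"]   # overlapping fragments of one sequence
assembly = reads[0]
for read in reads[1:]:
    assembly = merge(assembly, read)

print(assembly)  # ATGCGTACCTTAGGCA
```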

Researchers can use DNA sequencing to search for genetic variations and/or mutations that may play a role in the development or progression of a disease. The disease-causing change may be as small as the substitution, deletion, or addition of a single base pair or as large as a deletion of thousands of bases.

What is the Human Genome Project?

The Human Genome Project, which was led at the National Institutes of Health (NIH) by the National Human Genome Research Institute, produced a very high-quality version of the human genome sequence that is freely available in public databases. That international project was successfully completed in April 2003, under budget and more than two years ahead of schedule.

The sequence is not that of one person, but is a composite derived from several individuals. Therefore, it is a "representative" or generic sequence. To ensure anonymity of the DNA donors, more blood samples (nearly 100) were collected from volunteers than were used, and no names were attached to the samples that were analyzed. Thus, not even the donors knew whether their samples were actually used.

The Human Genome Project was designed to generate a resource that could be used for a broad range of biomedical studies. One such use is to look for the genetic variations that increase risk of specific diseases, such as cancer, or to look for the type of genetic mutations frequently seen in cancerous cells. More research can then be done to fully understand how the genome functions and to discover the genetic basis for health and disease.

What are the implications for medical science?

Virtually every human ailment has some basis in our genes. Until recently, doctors were able to take the study of genes, or genetics, into consideration only in cases of birth defects and a limited set of other diseases. These were conditions, such as sickle cell anemia, which have very simple, predictable inheritance patterns because each is caused by a change in a single gene.

With the vast trove of data about human DNA generated by the Human Genome Project and other genomic research, scientists and clinicians have more powerful tools to study the role that multiple genetic factors acting together and with the environment play in much more complex diseases. These diseases, such as cancer, diabetes, and cardiovascular disease, constitute the majority of health problems in the United States. Genome-based research is already enabling medical researchers to develop improved diagnostics, more effective therapeutic strategies, evidence-based approaches for demonstrating clinical efficacy, and better decision-making tools for patients and providers. Ultimately, it appears inevitable that treatments will be tailored to a patient's particular genomic makeup. Thus, the role of genetics in health care is starting to change profoundly, and the first examples of the era of genomic medicine are upon us.

It is important to realize, however, that it often takes considerable time, effort, and funding to move discoveries from the scientific laboratory into the medical clinic. Most new drugs based on genome-based research are estimated to be at least 10 to 15 years away, though recent genome-driven efforts in lipid-lowering therapy have considerably shortened that interval. According to biotechnology experts, it usually takes more than a decade for a company to conduct the kinds of clinical studies needed to receive approval from the Food and Drug Administration.

Screening and diagnostic tests, however, are here. Rapid progress is also being made in the emerging field of pharmacogenomics, which involves using information about a patient's genetic make-up to better tailor drug therapy to their individual needs.

Clearly, genetics remains just one of several factors that contribute to people's risk of developing most common diseases. Diet, lifestyle, and environmental exposures also come into play for many conditions, including many types of cancer. Still, a deeper understanding of genetics will shed light on more than just hereditary risks by revealing the basic components of cells and, ultimately, explaining how all the various elements work together to affect the human body in both health and disease.


Genomic Data and Big Data Analytics

Conference paper by Hiren Kumar Deva Sarma (Department of Information Technology, Sikkim Manipal Institute of Technology, Majitar, Sikkim, India), first published online 01 December 2021. Part of the book series Lecture Notes in Networks and Systems (LNNS, volume 281).

Genomic research has been highly prominent in recent times, and society has witnessed huge progress in genomic research over the last decade. The amount of data generated by activities such as genome sequencing is enormous. It is important to analyse these data to acquire meaningful insight so that the resulting knowledge can be applied in real-life scenarios. However, analysing such a huge volume of data is extremely difficult because of its unique characteristics and complexity. Big data analytics approaches are a natural direction to explore, and there have been research efforts along these lines. In this paper, the relationship between genomic data and big data analytics is explored, challenges in processing genomic data are analysed, the question of how big data analytics concepts can be applied to genomic data processing is addressed, and future trends at the intersection of genomics and big data analytics are outlined.
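As one concrete illustration of how a big data pattern maps onto genomic data (an illustrative sketch, not an example taken from the paper), counting k-mers across sequencing reads fits naturally into a map/reduce-style computation: each read is mapped to (k-mer, 1) pairs, and the counts are then reduced by key.

```python
# Map/reduce-style k-mer counting over sequencing reads (illustrative sketch)
from collections import Counter
from itertools import chain

def map_read(read: str, k: int = 4):
    """Map step: emit every k-mer in the read as a (kmer, 1) pair."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts

reads = ["ATGCGTACGT", "GCGTACGTTA", "TTACGGATGC"]
kmer_counts = reduce_counts(chain.from_iterable(map_read(r) for r in reads))
print(kmer_counts.most_common(3))
```

In a genuine big data setting the same map and reduce functions would be distributed across many machines (for example with Hadoop or Spark), which is what makes this pattern attractive for terabyte-scale read sets.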



Sarma, H.K.D. (2022). Genomic Data and Big Data Analytics. In: Sarma, H.K.D., Balas, V.E., Bhuyan, B., Dutta, N. (eds) Contemporary Issues in Communication, Cloud and Big Data Analytics. Lecture Notes in Networks and Systems, vol 281. Springer, Singapore. https://doi.org/10.1007/978-981-16-4244-9_15


Genomic benchmarks: a collection of datasets for genomic sequence classification

Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček and Panagiotis Alexiou

BMC Genomic Data, volume 24, article number 25 (2023). Open access, published 01 May 2023.

Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks similar to the protein folding competition.

Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed by mining publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, and open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks .

Conclusions

Deep learning techniques have revolutionized many biological fields, largely thanks to carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences, with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository of shared datasets that will make machine learning for genomics more comparable and reproducible, while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.

Recently, deep neural networks have been successfully applied to identify functional elements in the genomes of humans and other organisms, such as promoters [ 1 ], enhancers [ 2 ], transcription factor binding sites [ 3 ], and others. Neural network models have been shown to be capable of predicting histone accessibility [ 4 ] and RNA-protein binding [ 5 ], and of accurately identifying short non-coding RNA loci within the genomic background [ 6 ].

However, deep neural network models are highly dependent on large amounts of high-quality training data [ 7 ]. Comparing the quality of various deep learning models can be challenging, as the authors often use different datasets for evaluation, and quality metrics can be heavily influenced by data preprocessing techniques and other technical differences [ 8 ].

Many computational fields have developed established benchmarks, for example, SQuAD for question answering [ 9 ], IMDB Sentiment for text classification [ 10 ], and ImageNet for image recognition [ 11 ]. Benchmarks are crucial in driving innovation. The annual competition for object identification [ 12 ] catalyzed the boom in AI, leading in just seven years to models that exceed human capabilities.

In biology, a great challenge over the past 50 years has been the protein folding problem . To compare different protein folding algorithms, the community introduced the Critical Assessment of protein Structure Prediction (CASP) [ 13 ] challenge benchmark, which gives research groups the opportunity to objectively test their methods. In 2021, AlphaFold [ 14 ] won this competition, producing predicted structures within the error tolerance of experimental methods. This carefully curated benchmark led to the solution of the most prominent bioinformatic challenge of the past 50 years.

In Genomics, we have similar challenges in the annotation of genomes and the identification and classification of functional elements, but currently we lack benchmarks similar to CASP. Practically, machine learning tasks in Genomics commonly involve the classification of genomic sequences into several categories and/or contrasting them with a genomic background (a negative set). For example, a well-studied question in Genomics is the prediction of enhancer loci on a genome. For this question, the benchmark situation is highly fragmented. As an example, [ 15 ] proposed a benchmark dataset based on the chromatin state of multiple cell lines. Both enhancer and non-enhancer sequences were retrieved from experimental chromatin information. The CD-HIT software [ 16 ] was used to filter similar sequences, and the benchmark dataset was made available as a PDF file. Information stored in a PDF file is suitable for human reading, but computers cannot easily extract data from such files. Despite not being easily machine readable, the dataset was used by many subsequent publications ([ 2 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 ] or [ 27 ]) as a gold standard for enhancer prediction, highlighting the need for benchmark datasets in this field. Other common sources of enhancer data are the VISTA Enhancer Browser [ 28 ], FANTOM5 [ 29 ], the ENCODE project [ 30 ], and the Roadmap Epigenomics Project [ 31 ], which provide a wealth of positive samples but no negatives. A researcher would need to implement their own method of negative selection, thus introducing individual selection biases into the samples.

Another highly studied question in Genomics is the prediction of promoters. The benchmark situation in this field has its own problems. For example, [ 32 ] extracted positive samples from EPD [ 33 ], while non-promoter sequences were randomly extracted from coding and non-coding regions and used as two negative sets. This method for creating a negative set is not an established one. Other authors used only coding sequences or only non-coding sequences as a negative set [ 34 ], or combined coding and non-coding sequences into a single negative set [ 35 , 36 , 37 ]. Even [ 32 ] already pointed to the problem of missing benchmarks and reproducibility, noting that it is difficult to compare their results with other published results because of differences in data and experimental protocol. Several years later, [ 38 ] created their own dataset and reported similar problems: they were unable to compare their results with other published tools because the datasets were derived from different sources, used different preprocessing procedures, or were not made available at all.

In this paper, we propose a collection of benchmark datasets for the classification of genomic sequences, focusing on ease of use for machine learning purposes. The datasets are distributed as a Python package ’genomic-benchmarks’, available on GitHub Footnote 1 and through the Python Package Index (PyPI) Footnote 2 . The package provides an interface that allows the user to easily work with the benchmarks using Python. Included are utilities for data processing, cleaning procedures, and summary reporting. Additionally, it contains functions that make training a neural network classifier easier, such as PyTorch [ 39 ] and TensorFlow [ 40 ] data loaders, and notebooks containing basic deep learning architectures that can be used as templates for prototyping new methods. Importantly, every dataset presented here comes with an associated notebook that fully reproduces the dataset generation process, ensuring transparency and reproducibility of benchmark generation in the future.

Construction and content

Overview of datasets

The currently selected datasets are divided into three categories. The first group of datasets focuses on human regulatory functional elements, either produced by mining the Ensembl database or taken from published datasets used in multiple articles. For promoters, we have imported human non-TATA promoters [ 41 ]. For enhancers, we used human enhancers from [ 42 ], Ensembl human enhancers from the FANTOM5 Project [ 29 ], and Drosophila enhancers [ 43 ]. We have also included open chromatin regions and a multiclass dataset composed of three regulatory elements (enhancers, promoters, and open chromatin regions), both constructed from the Ensembl regulatory build [ 44 ]. The second category consists of ’demo’ datasets that were computationally generated for this project and focus on the classification of genomic sequences between different species or types of transcripts (protein coding vs non-coding). Finally, the third category, ’dummy’, has a single dataset which, thanks to its small size, can be used for quick prototyping of methods. In terms of model organisms, our datasets include primarily human data, but also mouse ( Mus musculus ), roundworm ( Caenorhabditis elegans ), and fruit fly ( Drosophila melanogaster ). An overview of available datasets is given in Table 1, and simple code for listing all currently available datasets in Fig. 1 . Additional usage examples can be found in the project’s README (dataset info, downloading a dataset, getting a dataset loader), TensorFlow/PyTorch workflows are provided in the ‘notebooks‘ folder, and the ‘experiments‘ folder contains papermill runs for each combination of a dataset and a framework.

Figure 1: Python code for listing all available datasets in the Genomic benchmarks package
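For orientation, a minimal sketch of the listing shown in Fig. 1 follows. The import path and the example dataset names are taken from the project’s README at the time of writing and should be treated as assumptions to verify against the current repository.

# Sketch of Fig. 1: list the datasets shipped with the package.
# The import path below follows the project's README; verify it against the
# installed version of genomic-benchmarks.
from genomic_benchmarks.data_check import list_datasets

print(list_datasets())
# e.g. ['demo_coding_vs_intergenomic_seqs', 'human_enhancers_cohn',
#       'human_nontata_promoters', ...]  (names illustrative)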

The Human enhancers Cohn dataset was adapted from [ 42 ]. Enhancers are genomic regulatory functional elements that can be bound by specific DNA-binding proteins to regulate the transcription of a particular gene. Unlike promoters, enhancers do not need to be in close proximity to the affected gene and may be up to several million bases away, making their detection a difficult task.

The Drosophila enhancers Stark dataset was adapted from [ 43 ]. These enhancers were experimentally validated, and we excluded the weak ones. The original coordinates referred to the dm3 [ 45 ] assembly of the D. melanogaster genome; we used the pyliftover tool Footnote 3 to map the coordinates to the dm6 assembly [ 46 ]. Negative sequences were randomly generated from the Drosophila genome (dm6) to match the lengths of the positive sequences and not overlap them.
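As an illustration of this lift-over step, the following sketch shows how pyliftover can map a single dm3 coordinate to dm6; the chromosome and position are made up for the example.

from pyliftover import LiftOver

# LiftOver downloads the corresponding UCSC chain file (dm3 -> dm6) on first use.
lo = LiftOver('dm3', 'dm6')

# convert_coordinate() expects a 0-based position and returns a list of
# (chromosome, position, strand, score) tuples; an empty list means the
# position does not lift over to the target assembly.
hits = lo.convert_coordinate('chr2L', 100000)
if hits:
    new_chrom, new_pos, new_strand, _ = hits[0]
    print(new_chrom, new_pos, new_strand)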

The Human enhancers Ensembl dataset was constructed from human enhancers from the FANTOM5 project [ 29 ], accessed through the Ensembl database [ 47 ]. Negative sequences were randomly generated from the Human genome GRCh38 to match the lengths of the positive sequences and not overlap them.

The Human non-TATA promoters dataset was adapted from [ 41 ]. These sequences are 251 bp long, spanning -200 to +50 bp around the transcription start site (TSS). To create non-promoter sequences of the same length, the authors of the original paper used random fragments of human genes located after the first exon.

The Human ocr Ensembl dataset was constructed from the Ensembl database [ 47 ]. Positive sequences are Human Open Chromatin Regions (OCRs) from The Ensembl Regulatory Build [ 44 ]. Open chromatin regions are regions of the genome that can be preferentially accessed by DNA regulatory elements because of their open chromatin structure. In the Ensembl Regulatory Build, this label is assigned to open chromatin regions, which were experimentally observed through DNase-seq, but covered by none of the other annotations (enhancer, promoter, gene, TSS, CTCF, etc.). Negative sequences were generated from the Human genome GRCh38 to match the lengths of positive sequences and not overlap them.

The Human regulatory Ensembl dataset was constructed from the Ensembl database [ 47 ]. This dataset has three classes: enhancer, promoter, and open chromatin region from The Ensembl Regulatory Build [ 44 ]. The open chromatin region sequences are the same as the positive sequences in the Human ocr Ensembl dataset.

Reproducibility

The pre-processing and data cleaning process we followed is fully reproducible. For each dataset, we provide a Jupyter notebook that can be used to recreate it; these notebooks can be found in the docs folder of the GitHub repository Footnote 4 . All dependencies are provided, and a fixed random seed is set so that the notebooks always produce the same data splits.

Each dataset is divided into training and testing subsets. For some datasets that contain only positive samples (dummy mouse enhancers Ensembl, drosophila enhancers Stark, human enhancers Ensembl, and human open chromatin region Ensembl), we had to generate appropriate negative samples. Negative samples were selected from the same genome as the positive samples: for each positive sample, we generated a random interval in the genome of the same length and kept only those intervals that do not overlap any of the positive samples.
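A simplified sketch of this negative-sampling idea is given below; it is not the package’s exact implementation, and the chromosome sizes and intervals are toy values.

import random

def sample_negatives(positives, chrom_sizes, seed=42):
    # positives: list of (chrom, start, end); chrom_sizes: {chrom: length}
    rng = random.Random(seed)                       # fixed seed, as in our notebooks
    negatives = []
    for chrom, start, end in positives:
        length = end - start
        while True:
            c = rng.choice(list(chrom_sizes))
            s = rng.randrange(0, chrom_sizes[c] - length)
            e = s + length
            # accept the candidate only if it overlaps no positive interval
            if all(not (c == pc and s < pe and e > ps) for pc, ps, pe in positives):
                negatives.append((c, s, e))
                break
    return negatives

# toy example: one positive interval, two chromosomes
print(sample_negatives([("chr2L", 100, 600)], {"chr2L": 10_000, "chr2R": 12_000}))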

Data format

All samples are stored as genomic coordinates; datasets originally provided as sequences (human enhancers Cohn, human nonTATA promoters) were mapped to the reference using the ‘seq2loc‘ tool included in the package. Data are stored as compressed (gzipped) CSV tables of genomic coordinates, containing all information typically found in a BED format table. The column names are id , region , start , end , and strand . Each dataset has train and test subfolders and a separate table for each class. Furthermore, each dataset contains a YAML information file with metadata such as its version, the names of the included classes, and links to the sequence files of the reference genome. The stored coordinates and the linked sequence files are used to produce the final datasets, ensuring the reproducibility of our method. For more information, visit the datasets folder of the GitHub repository Footnote 5 . To speed up the conversion from a list of genomic coordinates to a locally stored folder of nucleotide sequences, we provide a cloud-based cache of the full sequence datasets which can be used simply by setting the use_cloud_cache=True option.
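The coordinate tables can also be inspected directly; the sketch below assumes a hypothetical path to one of the gzipped class tables and relies only on the column names listed above.

import pandas as pd

# Hypothetical path: one gzipped CSV per class and per train/test split.
df = pd.read_csv("datasets/human_enhancers_ensembl/train/positive.csv.gz")
print(df.columns.tolist())   # expected: ['id', 'region', 'start', 'end', 'strand']
print(df.head())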

Utility and discussion

Easy data access tools

The Python package with the data is installed using a single command: pip install genomic-benchmarks . The installed package contains ready-to-use data loaders for the two most commonly used deep learning frameworks, TensorFlow and PyTorch. This feature is important for reproducibility and for the adoption of the package, particularly by people with limited knowledge of genomics. The data loaders allow the user to load any of the provided datasets with a single line of code. Full examples, including imports and accessing one sample of the data, are shown in Figs. 2 and 3 for PyTorch and TensorFlow, respectively. However, our data are not bound to any particular library or tool. We provide an interface to the two most commonly used deep learning frameworks, but the data are easily accessible even from plain Python, as shown in Fig. 4 . Furthermore, we have made Genomic benchmarks available as Hugging Face datasets Footnote 6 , expanding their accessibility.

Figure 2: Python code for loading a dataset as a PyTorch Dataset object using the get_dataset() function. This function takes three arguments: the name of the dataset, the train or test split, and the version of the dataset
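Since Fig. 2 is reproduced here only as a caption, a sketch of the corresponding code follows; the get_dataset() arguments match the caption, while the import path is an assumption based on the project’s README.

from genomic_benchmarks.dataset_getters.pytorch_datasets import get_dataset

# name of the dataset, train/test split, and dataset version (see caption above)
train_dset = get_dataset("human_enhancers_cohn", split="train", version=0)

seq, label = train_dset[0]      # one sample: a nucleotide string and its class label
print(len(train_dset), label)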

Figure 3: Python code for loading a dataset as a TensorFlow Dataset object. First, we download the dataset to our local machine and then use the TensorFlow function text_dataset_from_directory() to create a Dataset object
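A sketch of the workflow described in the Fig. 3 caption follows; the download helper and the default cache location (~/.genomic_benchmarks) are assumptions based on the project’s README, and text_dataset_from_directory() lives under tf.keras.preprocessing in older TensorFlow releases.

from pathlib import Path
import tensorflow as tf
from genomic_benchmarks.loc2seq import download_dataset

# download the sequences locally, then point TensorFlow at the per-class text files
download_dataset("human_enhancers_cohn", version=0)
train_dir = Path.home() / ".genomic_benchmarks" / "human_enhancers_cohn" / "train"

train_ds = tf.keras.utils.text_dataset_from_directory(str(train_dir), batch_size=64)
for sequences, labels in train_ds.take(1):
    print(sequences.shape, labels.shape)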

Figure 4: Python code for downloading and accessing a dataset as raw text files. First, we download the dataset to our local machine and then sequentially read all files, storing the samples in a dictionary. A full example can be found at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Train_BERT_Classifier_With_HF.ipynb
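For completeness, a plain-Python sketch matching the Fig. 4 caption is shown below; the cache path is the same assumption as in the previous example.

from pathlib import Path

data_dir = Path.home() / ".genomic_benchmarks" / "human_enhancers_cohn" / "train"

# one folder per class, one sequence per text file
samples = {}
for class_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    samples[class_dir.name] = [f.read_text().strip() for f in sorted(class_dir.iterdir())]

print({cls: len(seqs) for cls, seqs in samples.items()})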

Baseline model

On top of the ready-to-use data loaders, we provide tools for training neural networks and a simple convolutional neural network (CNN) architecture (adapted from [ 48 ]). A demonstrative Jupyter notebook is provided in the notebooks folder of the GitHub repository Footnote 7 ; the PyTorch version is also shown in Fig. 5 and can be used as a starting point for further research and experimentation with genomic benchmark data. A CNN is an architecture that can learn input features without manual feature engineering and has a relatively small number of parameters due to weight sharing (see [ 49 ] for more). Our implementation consists of three convolutional layers with 16, 8, and 4 filters and a kernel size of 8. The output of each convolutional layer goes through a batch normalization layer and a max-pooling layer. The output of the last set of layers is flattened and goes through two dense layers. The last layer is designed to predict the probabilities that the input sample belongs to each of the given classes. The architecture of the model is shown in Fig. 6 . To give researchers using these benchmarks a baseline estimate, we fit the CNN model described above to each dataset included in our collection. Training notebooks are provided in the experiments folder of the GitHub repository Footnote 8 . The models were trained for 10 epochs with a batch size of 64. The accuracy and F1 score of the PyTorch and TensorFlow CNN models on all genomic benchmark datasets are shown in Table 2 . In addition, we provide an example notebook showing how to train a DNABERT model [ 50 ] using Genomic Benchmarks Footnote 9 .

Figure 5: Python code showing the whole process of getting the dataset and tools, building the model, and training the CNN on the dataset. Thanks to our package, the necessary code is only a few lines long and is easy to understand and extend

Figure 6: CNN architecture. The neural network consists of three convolutional layers with 16, 8, and 4 filters and a kernel size of 8. The output of each convolutional layer goes through a batch normalization layer and a max-pooling layer. The output is then flattened and passes through two dense layers. The last layer predicts the probabilities that the input sample belongs to each of the given classes
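For readers who prefer code to a diagram, a minimal PyTorch sketch of a model matching this description is given below. It is not the exact implementation shipped with the package; the ReLU activations, the pooling width, and the hidden dense layer size are our own choices.

import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # three convolutional blocks: 16, 8, and 4 filters, kernel size 8,
        # each followed by batch normalization and max pooling (see Fig. 6)
        self.features = nn.Sequential(
            nn.Conv1d(4, 16, kernel_size=8), nn.BatchNorm1d(16), nn.MaxPool1d(2), nn.ReLU(),
            nn.Conv1d(16, 8, kernel_size=8), nn.BatchNorm1d(8), nn.MaxPool1d(2), nn.ReLU(),
            nn.Conv1d(8, 4, kernel_size=8), nn.BatchNorm1d(4), nn.MaxPool1d(2), nn.ReLU(),
        )
        # flatten and pass through two dense layers; the last one scores the classes
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(64), nn.ReLU(),     # hidden size 64 is an arbitrary choice
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        # x: one-hot encoded DNA, shape (batch, 4, sequence_length)
        return self.classifier(self.features(x))

model = BaselineCNN(num_classes=2)
logits = model(torch.randn(8, 4, 251))     # e.g. a batch of 251 bp sequences
probs = torch.softmax(logits, dim=-1)      # class probabilities, as in the figure
print(probs.shape)                         # torch.Size([8, 2])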

Future development

We are aware of the limitations of the current repository. While we strive to include diverse data, most of our benchmark datasets are balanced or close to balanced, contain sequences of similar lengths, and have a limited number of classes. Our main datasets all come from the human genome, and all deal with regulatory features. In the future, we would like to increase the diversity of our datasets to be able to diagnose a model’s sensitivity to these factors. Many machine learning tasks in Genomics consist of binary classification of a class of genomic functional elements against a background. However, it can be beneficial to start expanding the field into multi-class classification problems, especially for functional elements that have similar characteristics to each other compared with the background. We will therefore expand our benchmark collection to include more imbalanced datasets and more multi-class datasets.

Machine learning, and especially deep learning, has recently started revolutionizing the field of genomics. Deep learning methods depend on large amounts of high-quality training data, and benchmark data are needed to accurately compare the performance of different models. Here, we propose a collection of Genomic Benchmarks, produced with the aim of being easily accessible and reproducible. Our intention is to lower the barrier to entry into machine learning for Genomics for researchers who may not have extensive knowledge of genomics but want to apply their machine learning expertise in this field. Such an approach worked well for protein folding, where benchmark-based competitions helped revolutionize the field.

The nine genomics datasets currently included are a first step towards a large repository of Genomic Benchmarks. Beyond making access to these datasets easy for users, we have ensured that adding more datasets in a reproducible way is straightforward, enabling further development of the repository. We encourage users to propose datasets or subfields of interest that would be useful in future releases. We have provided guidelines and tools to unify access to any genomic data, and we will happily host submitted genomic datasets of sufficient quality and interest.

In this manuscript, we have implemented a simple convolutional neural network as a baseline model, trained and evaluated on all of our datasets. Improvements on this baseline will certainly be achieved by using different architectures and training schemes. We have an open call for users who outperform the baseline to submit their solution via our GitHub repository and be added to a ’Leaderboard’ of methods for each dataset. We hope that this will create healthy competition on this set of reproducible datasets and promote machine learning research in Genomics.

Availability of data and materials

The datasets generated and/or analysed during the current study are available in the GitHub repository, https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks .

Footnotes

1. https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks
2. https://pypi.org/project/genomic-benchmarks/
3. https://github.com/konstantint/pyliftover
4. https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/docs
5. https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/datasets
6. https://huggingface.co/katarinagresova
7. https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/notebooks
8. https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/tree/main/experiments
9. https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks/blob/main/notebooks/How_To_Train_BERT_Classifier_With_HF.ipynb

Abbreviations

  • CNN: Convolutional neural network
  • OCR: Open chromatin region
  • TSS: Transcription start site

References

1. Oubounyt M, Louadi Z, Tayara H, Chong KT. DeePromoter: robust promoter predictor using deep learning. Front Genet. 2019;10:286.
2. Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform. 2021;22(5).
3. Quang D, Xie X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods. 2019;166:40–7.
4. Yin Q, Wu M, Liu Q, Lv H, Jiang R. DeepHistone: a deep learning approach to predicting histone modifications. BMC Genomics. 2019;20(2):11–23.
5. Shen Z, Zhang Q, Han K, Huang DS. A deep learning model for RNA-protein binding preference prediction based on hierarchical LSTM and attention network. IEEE/ACM Trans Comput Biol Bioinforma. 2020;19(2):753–62.
6. Georgakilas GK, Grioni A, Liakos KG, Chalupova E, Plessas FC, Alexiou P. Multi-branch convolutional neural network for identification of small non-coding RNA genomic loci. Sci Rep. 2020;10(1):1–10.
7. Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision. Institute of Electrical and Electronics Engineers Inc., United States. 2017. p. 843–852.
8. Nawi NM, Atomi WH, Rehman MZ. The effect of data pre-processing on optimized training of artificial neural networks. Procedia Technol. 2013;11:32–9.
9. Rajpurkar P, Zhang J, Lopyrev K, Liang P. SQuAD: 100,000+ questions for machine comprehension of text. 2016. arXiv preprint arXiv:1606.05250.
10. Maas A, Daly RE, Pham PT, Huang D, Ng AY, Potts C. Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, Portland, Oregon, USA. 2011. p. 142–150.
11. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009. p. 248–255.
12. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis. 2015;115(3):211–52. https://doi.org/10.1007/s11263-015-0816-y.
13. Moult J, Pedersen JT, Judson R, Fidelis K. A large-scale experiment to assess protein structure prediction methods. Wiley Online Library; 1995.
14. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9.
15. Liu B, Fang L, Long R, Lan X, Chou KC. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics. 2016;32(3):362–9.
16. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
17. Liu B, Li K, Huang DS, Chou KC. iEnhancer-EL: identifying enhancers and their strength with ensemble learning approach. Bioinformatics. 2018;34(22):3835–42.
18. Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY. iEnhancer-5Step: identifying enhancers using hidden information of DNA sequences via Chou’s 5-step rule and word embedding. Anal Biochem. 2019;571:53–61.
19. Tahir M, Hayat M, Kabir M. Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou’s trinucleotide composition. Comput Methods Prog Biomed. 2017;146:69–75.
20. Jia C, He W. EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features. Sci Rep. 2016;6(1):1–7.
21. He W, Jia C. EnhancerPred2.0: predicting enhancers and their strength based on position-specific trinucleotide propensity and electron–ion interaction potential feature selection. Mol BioSyst. 2017;13(4):767–74.
22. Nguyen QH, Nguyen-Vo TH, Le NQK, Do TT, Rahardja S, Nguyen BP. iEnhancer-ECNN: identifying enhancers and their strength using ensembles of convolutional neural networks. BMC Genomics. 2019;20(9):1–10.
23. Khanal J, Tayara H, Chong KT. Identifying enhancers and their strength by the integration of word embedding and convolution neural network. IEEE Access. 2020;8:58369–76.
24. Zhang TH, Flores M, Huang Y. ES-ARCNN: predicting enhancer strength by using data augmentation and residual convolutional neural network. Anal Biochem. 2021;618:114120.
25. Inayat N, Khan M, Iqbal N, Khan S, Raza M, Khan DM, et al. iEnhancer-DHF: identification of enhancers and their strengths using optimize deep neural network with multiple features extraction methods. IEEE Access. 2021;9:40783–96.
26. Mu X, Wang Y, Duan M, Liu S, Li F, Wang X, et al. A novel position-specific encoding algorithm (SeqPose) of nucleotide sequences and its application for detecting enhancers. Int J Mol Sci. 2021;22(6):3079.
27. Yang R, Wu F, Zhang C, Zhang L. iEnhancer-GAN: a deep learning framework in combination with word embedding and sequence generative adversarial net to identify enhancers and their strength. Int J Mol Sci. 2021;22(7):3589.
28. Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer Browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 2007;35(suppl_1):88–92.
29. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507(7493):455–61.
30. ENCODE Project Consortium, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.
31. Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–30.
32. Lin H, Li QZ. Eukaryotic and prokaryotic promoter prediction using hybrid approach. Theory Biosci. 2011;130(2):91–100.
33. Schmid CD, Perier R, Praz V, Bucher P. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 2006;34(suppl_1):82–5.
34. Gordon L, Chervonenkis AY, Gammerman AJ, Shahmuradov IA, Solovyev VV. Sequence alignment kernel for recognition of promoter regions. Bioinformatics. 2003;19(15):1964–71.
35. Ohler U. Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res. 2006;34(20):5943–50.
36. Yang JY, Zhou Y, Yu ZG, Anh V, Zhou LQ. Human Pol II promoter recognition based on primary sequences and free energy of dinucleotides. BMC Bioinformatics. 2008;9(1):1–13.
37. Rani TS, Bhavani SD, Bapi RS. Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics. 2007;23(5):582–8.
38. Lai HY, Zhang ZY, Su ZD, Su W, Ding H, Chen W, et al. iProEP: a computational predictor for predicting promoter. Mol Ther Nucleic Acids. 2019;17:337–46.
39. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32:8026–37.
40. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16). USENIX Association, Savannah, GA, USA. 2016. p. 265–283.
41. Umarov RK, Solovyev VV. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS ONE. 2017;12(2):e0171410.
42. Cohn D, Zuk O, Kaplan T. Enhancer identification using transfer and adversarial deep learning of DNA sequences. BioRxiv. 2018:264200.
43. Kvon EZ, Kazmar T, Stampfel G, Yáñez-Cuna JO, Pagani M, Schernhuber K, et al. Genome-scale functional characterization of Drosophila developmental enhancers in vivo. Nature. 2014;512(7512):91–5.
44. Zerbino DR, Wilder SP, Johnson N, Juettemann T, Flicek PR. The Ensembl regulatory build. Genome Biol. 2015;16(1):1–8.
45. Hoskins RA, Carlson JW, Kennedy C, Acevedo D, Evans-Holm M, Frise E, et al. Sequence finishing and mapping of Drosophila melanogaster heterochromatin. Science. 2007;316(5831):1625–8.
46. dos Santos G, Schroeder AJ, Goodman JL, Strelets VB, Crosby MA, Thurmond J, et al. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res. 2015;43(D1):690–7.
47. Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. Ensembl 2021. Nucleic Acids Res. 2021;49(D1):884–91.
48. Klimentova E, Polacek J, Simecek P, Alexiou P. PENGUINN: precise exploration of nuclear G-quadruplexes using interpretable neural networks. Front Genet. 2020;11:1287.
49. Albawi S, Mohammed TA, Al-Zawi S. Understanding of a convolutional neural network. In: 2017 international conference on engineering and technology (ICET). IEEE. 2017. p. 1–6.
50. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–20.

Acknowledgements

We are thankful to Google Cloud for providing P. Simecek and V. Martinek free research credits. Additional computational resources were provided by the e-INFRA CZ project (ID:90140), supported by the Ministry of Education, Youth and Sports of the Czech Republic.

The work of P. Simecek was supported by the H2020 MSCA IF LanguageOfDNA (nb. 896172) and funding from Czech Science Foundation, project no. 23-04260L. The work of P. Alexiou was supported by grant H2020-WF-01-2018: 867414. The work of K. Gresova, V. Martinek, and D. Cechak was supported by EMBO Installation Grant 4431 “Deep Learning for Genomic and Transcriptomic Pattern Identification” to P. Alexiou. The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and affiliations.

Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia

Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček & Panagiotis Alexiou

National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia

Katarína Grešová, Vlastimil Martinek & David Čechák

Contributions

KG did current state of the field research. KG and PS created and collected datasets. VM implemented data loaders. DC, PS and KG implemented baseline models. KG, PS and PA prepared the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Petr Šimeček .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Grešová, K., Martinek, V., Čechák, D. et al. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genom Data 24 , 25 (2023). https://doi.org/10.1186/s12863-023-01123-8

Received : 18 August 2022

Accepted : 31 March 2023

Published : 01 May 2023

DOI : https://doi.org/10.1186/s12863-023-01123-8


Keywords: Deep learning

BMC Genomic Data

ISSN: 2730-6844

  • Review Article
  • Published: 30 January 2018

Cloud computing for genomic data analysis and collaboration

  • Ben Langmead 1 &
  • Abhinav Nellore 2  

Nature Reviews Genetics volume  19 ,  pages 208–219 ( 2018 ) Cite this article


  • Computational biology and bioinformatics
  • Genetic databases
  • Next-generation sequencing
  • Research data

A Corrigendum to this article was published on 12 February 2018

This article has been updated

Cloud computing is a paradigm whereby computational resources such as computers, storage and bandwidth can be rented on a pay-for-what-you-use basis.

The cloud's chief advantages are elasticity and convenience. Elasticity refers to the ability to rent and pay for the exact resources needed, and convenience refers to the fact that the user need not deal with the disadvantages of owning or maintaining the resources.

Archives of sequencing data are vast and rapidly growing. Cloud computing is an important enabler for recent efforts to reanalyse large cross-sections of archived sequencing data.

The cloud is becoming a popular venue for hosting large international collaborations, which benefit from the ability to hold data securely in a single location and proximate to the computational infrastructure that will be used to analyse it.

Funders of genomics research are increasingly aware of the cloud and its advantages and are beginning to allocate funds and create cloud-based resources accordingly.

Cloud clusters can be configured with security measures needed to adhere to privacy standards, such as those from the Database of Genotypes and Phenotypes (dbGaP).

Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.



Change history: 12 February 2018

The above article originally stated “FireCloud and CGC rely on AWS and the Google Cloud Platform for computing and data storage. In addition to charges for these commercial services, users pay convenience surcharges.” The second sentence was incorrect, as pointed out to and independently verified by the authors, and has been removed. Also, an incorrect citation was given for reference 66. The citation should have been: Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotech . 34 , 525–527 (2016). Finally, reference 67 referred to an older version of the CWL specification and has been updated. The article has been corrected online. The authors apologize for these errors.

Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17 , 333–351 (2016).

Stephens, Z. D. et al. Big data: astronomical or genomical? PLOS Biol. 13 , e1002195 (2015). This perspective puts the genomic data deluge in context with other sciences and shows how growth of archived genomics data is tracking improvements in technology.

Kodama, Y. et al. The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 40 , D54–D56 (2012).

Leinonen, R. et al. The sequence read archive. Nucleic Acids Res. 39 , D19–D21 (2010).

Toribio, A. L. et al. European Nucleotide Archive in 2016. Nucleic Acids Res. 45 , D32–D36 (2017).

Denk, F. Don't let useful data go to waste. Nature 543 , 7 (2017).

Kuo, W. P., Jenssen, T.-K., Butte, A. J., Ohno-Machado, L. & Kohane, I. S. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18 , 405–412 (2002).

Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11 , 733–739 (2010).

McCall, M. N., Bolstad, B. M. & Irizarry, R. A. Frozen robust multiarray analysis (fRMA). Biostatistics 11 , 242–253 (2010).

Rhodes, D. R. et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl Acad. Sci. USA 101 , 9309–9314 (2004).

Zeggini, E. et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 40 , 638–645 (2008).

Marchionni, L., Afsari, B., Geman, D. & Leek, J. T. A simple and reproducible breast cancer prognostic test. BMC Genomics 14 , 336 (2013).

Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536 , 285–291 (2016).

International Cancer Genome Consortium et al. International network of cancer genome projects. Nature 464 , 993–998 (2010).

GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550 , 204–213 (2017).

Melé, M. et al. Human genomics. The human transcriptome across tissues and individuals. Science 348 , 660–665 (2015).

Trans-Omics for Precision Medicine (TOPMed) Program. National Heart, Lung, and Blood Institute https://www.nhlbi.nih.gov/science/trans-omics-precision-medicine-topmed-program (2017).

Collins, F. S. & Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 372 , 793–795 (2015).

Gaziano, J. M. et al. Million Veteran Program: A mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70 , 214–223 (2016).

Foster, I. G. & Dennis, B. Cloud Computing for Science and Engineering (MIT Press, 2017). This book describes the public and private cloud offerings available and how to use APIs for both commercial and OpenStack clouds to automate cloud tasks. It also describes Globus Auth and other important ideas related to identity federation, authentication and authorization.

International Cancer Genes Consortium. PCAWG Data Portal and Visualizations. ICGC http://docs.icgc.org/pcawg/ (2017).

Birger, C. et al. FireCloud, a scalable cloud-based platform for collaborative genome analysis: strategies for reducing and controlling costs. bioRxiv , https://doi.org/10.1101/209494 (2017).

Lau, J. W. et al. The Cancer Genomics Cloud: collaborative, reproducible, and democratized – a new paradigm in large-scale computational research. Cancer Res. 77 , e3–e6 (2017).

Reynolds, S. M. et al. The ISB Cancer Genomics Cloud: a flexible cloud-based platform for cancer genomics research. Cancer Res. 77 , e7–e10 (2017).

Celniker, S. E. et al. Unlocking the secrets of the genome. Nature 459 , 927–930 (2009).

The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489 , 57–74 (2012).

Mell, P. M. & Grance, T. SP 800–145. The NIST definition of cloud computing. National Institute of Standards and Technology http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf (2011).

Wingfield, N., Streitfeld, D. & Lohr, S. Cloud produces sunny earnings at Amazon, Microsoft and Alphabet. New York Times https://www.nytimes.com/2017/04/27/technology/quarterly-earnings-cloud-computing-amazon-microsoft-alphabet.html (27 April 2017).

Mathews, L. Just how big is Amazon's AWS business? (hint: it's absolutely massive). Geek.com https://www.geek.com/chips/just-how-big-is-amazons-aws-business-hint-its-absolutely-massive-1610221/ (2014).

Sefraoui, O., Aissaoui, M. & Eleuldj, M. OpenStack: toward an open-source solution for cloud computing. Int. J. Comput. Appl. Technol. 55 , 38–42 (2012).

Moreno-Vozmediano, R., Montero, R. S. & Llorente, I. M. IaaS cloud architecture: from virtualized datacenters to federated cloud infrastructures. Computer 45 , 65–72 (2012).

Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48 , 1284–1287 (2016).

Stewart, C. A. et al. in Proc. 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure https://dl.acm.org/citation.cfm?id=2792745 (2015).

European Open Science Cloud [Editorial]. Nat. Genet. 48 , 821 (2016).

Madduri, R. K. et al. Experiences building Globus Genomics: a next-generation sequencing analysis service using Galaxy, Globus, and Amazon web services. Concurr. Comput. 26 , 2266–2279 (2014).

Yakneen, S., Waszak, S., Gertz, M. & Korbel, J. O. Enabling rapid cloud-based analysis of thousands of human genomes via Butler. bioRxiv https://doi.org/10.1101/185736 (2017).

Yung, C. K. et al. Large-scale uniform analysis of cancer whole genomes in multiple computing environments. bioRxiv https://doi.org/10.1101/161638 (2017).

Baggerly, K. A. & Coombes, K. R. Deriving chemosensitivity from cell lines: forensic bioinformatics and reproducible research in high-throughput biology. Ann. Appl. Statist. 3 , 1309–1334 (2009).

Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33 , e175 (2005).

Ioannidis, J. P. et al. Repeatability of published microarray gene expression analyses. Nat. Genet. 41 , 149–155 (2009).

Nekrutenko, A. & Taylor, J. Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat. Rev. Genet. 13 , 667–672 (2012).

Piccolo, S. R. & Frampton, M. B. Tools and techniques for computational reproducibility. Gigascience 5 , 30 (2016).

Angiuoli, S. V. et al. CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics 12 , 356 (2011).

Krampis, K. et al. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics 13 , 42 (2012).

Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014 , 2 (2014).

Kurtzer, G. M., Sochat, V. & Bauer, M. W. Singularity: scientific containers for mobility of compute. PLOS One 12 , e0177459 (2017).

The Clinical Cancer Genome Task Team of the Global Alliance for Genomics and Health. Sharing clinical and genomic data on cancer – the need for global solutions. N. Engl. J. Med. 376 , 2006–2009 (2017).

Bonazzi, V. R. & Bourne, P. E. Should biomedical research be like Airbnb? PLOS Biol. 15 , e2001818 (2017). The authors of this paper describe the NIH Data Commons and suggest cloud computing as a means for making large-scale genomics data sets available and associated analyses reproducible.

Bourne, P. E., Lorsch, J. R. & Green, E. D. Perspective: sustaining the big-data ecosystem. Nature 527 , S16–17 (2015).

Tryka, K. A. et al. NCBI's database of genotypes and phenotypes: dbGaP. Nucleic Acids Res. 42 , D975–D979 (2014).

Iyer, M. K. et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47 , 199–208 (2015).

Brown, J. B. et al. Diversity and dynamics of the Drosophila transcriptome. Nature 512 , 393–399 (2014).

Graveley, B. The developmental transcriptome of Drosophila melanogaster . Genome Biol. 11 , I11 (2010).

Gutzwiller, F. et al. Dynamics of Wolbachia pipientis gene expression across the Drosophila melanogaster life cycle. G3 5 , 2843–2856 (2015).

Bernstein, M. N., Doan, A. & Dewey, C. N. MetaSRA: normalized human sample-specific metadata for the sequence read archive. Bioinformatics 33 , 2914–2923 (2017).

Yung, C. K. et al. The Cancer Genome Collaboratory [abstract]. Cancer Res. 77 , 378 (2017).

Nellore, A. et al. Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the sequence read archive. Genome Biol. 17 , 266 (2016).

Frazee, A. C., Langmead, B. & Leek, J. T. ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12 , 449 (2011).

Langmead, B., Hansen, K. D. & Leek, J. T. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 11 , R83 (2010).

Nellore, A., Wilks, C., Hansen, K. D., Leek, J. T. & Langmead, B. Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics 32 , 2551–2553 (2016). This work reports the use of cloud computing and MapReduce software to study tens of thousands of human RNA sequencing data sets, showing that many splice junctions that are well represented in public data are not present in popular gene annotations.

Collado-Torres, L. et al. Reproducible RNA-seq analysis using recount2. Nat. Biotechnol. 35 , 319–321 (2017).

Nellore, A. et al. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics 33 , 4003–4040 (2017).

Vivian, J. et al. Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35 , 314–316 (2017).

Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29 , 15–21 (2013).

Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12 , 323 (2011).

Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotech. 34 , 525–527 (2016).

Amstutz, P. et al. Common workflow language, v1.0. Figshare https://doi.org/10.6084/m9.figshare.3115156.v2 (2016).

Tatlow, P. J. & Piccolo, S. R. A cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Sci. Rep. 6 , 39259 (2016). This study shows how cloud computing can be used to reanalyse over 12,000 human cancer RNA sequencing data sets for as little as US$0.09 per sample.

Foster, I. K., Carl. The Grid 2: Blueprint for a New Computing Infrastructure (Morgan Kaufmann, 2003).

Drew, K. et al. The Proteome Folding Project: proteome-scale prediction of structure and function. Genome Res. 21 , 1981–1994 (2011).

Rahman, M. et al. Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics 31 , 3666–3672 (2015).

Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11 , 207 (2010).

Bais, P., Namburi, S., Gatti, D. M., Zhang, X. & Chuang, J. H. CloudNeo: a cloud pipeline for identifying patient-specific tumor neoantigens. Bioinformatics 33 , 3110–3112 (2017).

Afgan, E. et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 44 , W3–W10 (2016).

Towns, J. et al. XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16 , 62–74 (2014).

Galaxy Community Hub. Publicly accessible Galaxy servers. Galaxy Project https://galaxyproject.org/public-galaxy-servers/ (2017).

Afgan, E. et al. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics 11 (Suppl. 12), S4 (2010).

Liu, B. et al. Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses. J. Biomed. Inform. 49 , 119–133 (2014).

Foster, I. Globus Online: accelerating and democratizing science through cloud-based services. IEEE Internet Comput. 15 , 70–73 (2011).

Dana-Farber Cancer Institute. Dana-Farber Cancer Institute and Ontario Institute for Cancer Research join Collaborative Cancer Cloud http://www.dana-farber.org/newsroom/news-releases/2016/dana-farber-cancer-institute-and-ontario-institute-for-cancer-research-join-collaborative-cancer-cloud/ (2016).

Hawkins, T. The Collaborative Cancer Cloud: Intel and OHSU team up for cancer research. siliconANGLE http://siliconangle.com/blog/2016/12/16/collaborative-cancer-cloud-intel-ohsu-team-cancer-research-thecube/ (2016).

Global Alliance for Genomics and Health. A federated ecosystem for sharing genomic, clinical data. Science 352 , 1278–1280 (2016).

Amazon Web Services. AWS case study: DNAnexus. Amazon https://aws.amazon.com/solutions/case-studies/dnanexus/ (2017).

ICGC Data Coordination Center. About cloud partners. ICGC http://docs.icgc.org/cloud/about/ (2017).

modENCODE Project. modENCODE on the EC2 cloud. modENCODE http://data.modencode.org/modencode-cloud.html (2017).

Dean, J. & Ghemawat, S. MapReduce. Commun. ACM 51 , 107 (2008).

Kelly, B. J. et al. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics. Genome Biol. 16 , 6 (2015).

Langmead, B., Schatz, M. C., Lin, J., Pop, M. & Salzberg, S. L. Searching for SNPs with cloud computing. Genome Biol. 10 , R134 (2009).

Feng, X., Grossman, R. & Stein, L. PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics 12 , 139 (2011).

McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20 , 1297–1303 (2010).

GA4GH-DREAM. GA4GH-DREAM Workflow Execution Challenge. Synapse https://www.synapse.org/WorkflowChallenge (2017).

Franke, A. et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn's disease susceptibility loci. Nat. Genet. 42 , 1118–1125 (2010).

Petryszak, R. et al. The RNASeq-er API—a gateway to systematically updated analysis of public RNA-seq data. Bioinformatics 33 , 2218–2220 (2017).

Goldman, M., Craft, B., Zhu, J. & Haussler, D. The UCSC Xena system for cancer genomics data visualization and interpretation [Abstr. 2584]. Cancer Res. 77 , 2584 (2017).

Kolesnikov, N. et al. ArrayExpress update—simplifying data submissions. Nucleic Acids Res. 43 , D1113–D1116 (2015).

Google Compute Engine. Google Compute Engine pricing. Google Cloud Platform https://cloud.google.com/compute/pricing (2017).

Chard, R. et al. in 2015 IEEE 11th International Conference on e-Science , 136–144 (IEEE, 2015).

Barr, J. Natural Language Processing at Clemson University – 1.1 Million vCPUs & EC2 Spot Instances. Amazon https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/ (2017).

NIH Commons. Commons Credits Pilot Portal. Commons Credits Pilot Portal https://www.commons-credit-portal.org/ (2017).

National Science Foundation. Amazon Web Services, Google Cloud, and Microsoft Azure join NSF's Big Data Program. National Science Foundation https://www.nsf.gov/news/news_summ.jsp?cntn_id=190830&WT.mc_ev=click (2017).

National Institute of Mental Health. Welcome to the NIMH Data Archive. NDA https://data-archive.nimh.nih.gov/ (2017).

Genomes Project Consortium. A global reference for human genetic variation. Nature 526 , 68–74 (2015).

Lappalainen, I. et al. The European Genome-Phenome Archive of human data consented for biomedical research. Nat. Genet. 47 , 692–695 (2015).

National Institutes of Health. NIH security best practices for controlled-access data subject to the NIH genomic data sharing (GDS) policy. NIH Office of Science Policy https://osp.od.nih.gov/wp-content/uploads/NIH_Best_Practices_for_Controlled-Access_Data_Subject_to_the_NIH_GDS_Policy.pdf (2015).

Stein, L. D., Knoppers, B. M., Campbell, P., Getz, G. & Korbel, J. O. Data analysis: Create a cloud commons. Nature 523 , 149–151 (2015). In this paper, the authors argue for the use of cloud computing in large consortia and describe plans for its use in the ICGC.

Deutsche Telekom. Deutsche Telekom launches highly secure public cloud based on Cisco platform. Deutsche Telekom https://www.telekom.com/en/media/media-information/archive/deutsche-telekom-launches-highly-secure-public-cloud-based-on-cisco-platform------362100 (2015).

Datta, S., Bettinger, K. & Snyder, M. Secure cloud computing for genomic data. Nat. Biotechnol. 34 , 588–591 (2016).

Dove, E. S. et al. Genomic cloud computing: legal and ethical points to consider. Eur. J. Hum. Genet. 23 , 1271–1278 (2015).

Francis, L. P. Genomic knowledge sharing: a review of the ethical and legal issues. Appl. Transl Genom. 3 , 111–115 (2014).

Seven Bridges Genomics. API Overview. Seven Bridges Genomics https://docs.sevenbridges.com/v1.0/docs/the-api (2017).

Ananthakrishnan, R., Chard, K., Foster, I. & Tuecke, S. Globus platform-as-a-service for collaborative science applications. Concurrency Comput. Pract. Exp. 27 , 290–305 (2015).

Chaterji, S. et al. Federation in genomics pipelines: techniques and challenges. Brief Bioinform. https://doi.org/10.1093/bib/bbx102 (2017).

Campbell, S. Teaching cloud computing. Computer 49 , 91–93 (2016).

Dudley, J. T. & Butte, A.J. In silico research in the era of cloud computing. Nat. Biotech. 28 , 1181–1185 (2010).

Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483 , 603–607 (2012).

Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45 , 1113–1120 (2013).

Heath, A. P. et al. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets. J. Am. Med. Inform. Assoc. 21 , 969–975 (2014).

Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35 , 316–319 (2017).

Fisch, K. M. et al. Omics Pipe: a community-based framework for reproducible multi-omics data analysis. Bioinformatics 31 , 1724–1728 (2015).

Allcock, W. et al. in Proceedings of the 2005 ACM/IEEE conference on Supercomputing 54 (Seattle, 2005).

Petryszak, R. et al. Expression Atlas update — a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments. Nucleic Acids Res. 42 , D926–D932 (2014).

Acknowledgements

The authors thank J. Taylor, E. Afgan, M. Schatz, J. Goecks and A. Margolin for reading through a draft of this work and providing helpful comments. B.L. was supported by the US National Institutes of Health/National Institute of General Medical Sciences grant 1R01GM118568.

Author information

Authors and affiliations.

Department of Computer Science, Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA

Ben Langmead

Department of Biomedical Engineering, Department of Surgery, Computational Biology Program, Oregon Health and Science University, Portland, OR, USA

Abhinav Nellore

Contributions

The authors contributed equally to all aspects of this manuscript.

Corresponding authors

Correspondence to Ben Langmead or Abhinav Nellore .

Ethics declarations

Competing interests.

The authors declare no competing financial interests.

Related links

Further information.

European Genome–Phenome Archive

European Nucleotide Archive

Database of Genotypes and Phenotypes (dbGaP)

Sequence Read Archive

Galaxy cloud

Supplementary information

Supplementary Information S1 (Methods): Cloud computing for genomic data analysis and collaboration (PDF 228 kb)

Glossary

  • Reads: Snippets of DNA sequence as reported by a DNA sequencer.
  • Storage: A component of a computer that stores data.
  • Processor: A central component of a computer in which the computation takes place.
  • Cluster: A collection of connected computers that are able to work in a coordinated fashion to analyse data.
  • Metadata: Information about a data set, often pertaining to how and from where it was collected. For example, for a human data set, metadata might include sex, age, cause of death and sequencing protocol used.
  • Containers: Similar to 'virtual machines', containers are 'virtual computers' that enable the use of multiple, isolated services on a single platform. They can run in the context of another computer, using a portion of the host computer's resources. Docker and Singularity are two container management systems.
  • Firewalls: Barriers that prevent unwanted, perhaps insecure network traffic from reaching a protected network.
  • Application programming interfaces (APIs): Formal specifications of the ways in which a user or program can interface with a system, for example, a cloud.

Rights and permissions

About this article

Cite this article.

Langmead, B., Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 19 , 208–219 (2018). https://doi.org/10.1038/nrg.2017.113

Published : 30 January 2018

Issue Date : April 2018

DOI : https://doi.org/10.1038/nrg.2017.113




Researchers aim to use quantum computing to assemble and analyse pangenomes 

Experts in quantum computing and genomics to develop new methods and algorithms to process biological data


Quantum computing has the potential to overhaul how information is processed and to offer computational powers beyond current computing capabilities. In the life sciences, quantum computing could be useful for many applications, from drug discovery and protein 3D structure prediction to genome analysis and beyond. 

Now, a new collaboration brings together a world-leading interdisciplinary team with skills across quantum computing, genomics, and advanced algorithms. They aim to tackle one of the most challenging computational problems in genomic science: building, augmenting, and analysing pangenomic datasets for large population samples. Their project sits at the frontiers of research in both biomedical science and quantum computing.  

The project, which involves researchers based at the University of Cambridge, the Wellcome Sanger Institute and EMBL’s European Bioinformatics Institute (EMBL-EBI), has been awarded up to US $3.5 million to explore the potential of quantum computing for improvements in human health.

The team aims to develop quantum computing algorithms with the potential to speed up the production and analysis of pangenomes – new representations of DNA sequences that capture population diversity. Their methods will be designed to run on emerging quantum computers. The project is one of 12 selected worldwide for the Wellcome Leap Quantum for Bio (Q4Bio) Supported Challenge Program.

Genomics that represents population diversity

Since the initial sequencing of the human genome over 20 years ago, genomics has revolutionised science and medicine. Less than one per cent of the 6.4 billion letters of DNA code differs from one human to the next, but those genetic differences are what make us unique. Our genetic code can provide insights into our health, help to diagnose disease, or guide medical treatments.

However, the reference human genome sequence, which most subsequently sequenced human DNA is compared to, is based on data from only a few people and doesn’t represent human diversity. Scientists have been working to address this problem since the publication of the original human genome, and in 2023, the first human pangenome reference was produced.

What is a pangenome?

A pangenome is a collection of many different genome sequences that capture the genetic diversity in a population. Pangenomes could potentially be produced for all species.

The human pangenome data are freely accessible on the Ensembl human pangenome project page and through Ensembl Rapid Release.

Pangenomics, a new domain of science, demands high levels of computational power. While the existing human reference genome structure is linear, pangenome data can be represented and analysed as a network, called a sequence graph. This graph stores the shared structure of genetic relationships between many genomes. Comparing subsequent individual genomes to the pangenome then involves matching sequences to map a route through the graph. 

In this new project, the team aims to develop quantum computing approaches with the potential to speed up both key processes: mapping data to graph nodes and finding good routes through the graph. 
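
To make the idea of a sequence graph concrete, the small R sketch below represents a toy pangenome graph as an adjacency list and enumerates the routes whose node segments spell out a query sequence. All node names, segments, and the query are invented for illustration; real pangenome graphs and mapping algorithms are far larger and more sophisticated.

    # A toy pangenome graph: each node carries a short DNA segment, and edges
    # record which segment can follow which.
    nodes <- c(n1 = "ACGT", n2 = "T", n3 = "C", n4 = "GGA")
    edges <- list(n1 = c("n2", "n3"), n2 = "n4", n3 = "n4", n4 = character(0))

    # Walk the graph, keeping only routes whose concatenated segments spell out
    # the query sequence (a crude stand-in for mapping reads to the graph).
    match_paths <- function(node, built, query, path = node) {
      built <- paste0(built, nodes[node])
      if (!startsWith(query, built)) return(list())   # mismatch: dead end
      if (built == query) return(list(path))          # full-length match
      unlist(lapply(edges[[node]], function(nxt)
        match_paths(nxt, built, query, c(path, nxt))), recursive = FALSE)
    }

    match_paths("n1", "", "ACGTCGGA")   # the query routes through n1 -> n3 -> n4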

Quantum computing in a nutshell

Quantum technologies are poised to revolutionise high-performance computing. Classical computing stores information as bits, which are binary – with a value of either 0 or 1. However, a quantum computer works with particles that can be in a superposition of different states simultaneously. Rather than bits, information in a quantum computer is represented by qubits (quantum bits), which could take on the values 0 or 1, or be in a superposition state between 0 and 1. It takes advantage of quantum mechanics to enable solutions to problems that are not practical to solve using classical computers.
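
As a purely illustrative aside (textbook material, not anything specific to this project), a single qubit can be written as a vector of two complex amplitudes. The short R sketch below builds an equal superposition, applies the Born rule to obtain measurement probabilities, and applies a Hadamard gate as a 2×2 unitary matrix.

    # A single qubit |psi> = a|0> + b|1>, stored as two complex amplitudes.
    psi <- c(`0` = 1 / sqrt(2) + 0i, `1` = 1 / sqrt(2) + 0i)   # equal superposition

    probs <- Mod(psi)^2   # Born rule: probabilities of measuring 0 or 1
    probs                 # 0.5 and 0.5
    sum(probs)            # amplitudes are normalised, so this is 1

    # A quantum gate is a unitary matrix acting on the amplitudes; the Hadamard
    # gate sends this equal superposition back to the |0> state.
    H <- matrix(c(1, 1, 1, -1), nrow = 2) / sqrt(2)
    H %*% psi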

However, current quantum computer hardware is inherently sensitive to noise and decoherence, so scaling it up presents an immense technological challenge. While there have been exciting proof of concept experiments and demonstrations, today’s quantum computers remain limited in size and computational power, which restricts their practical application. But significant quantum hardware advances are expected to emerge in the next three to five years.

Laying the foundations for quantum computing in genomics

The Wellcome Leap Q4Bio Challenge is based on the premise that the early days of any new computational method will advance and benefit most from the co-development of applications, software, and hardware – allowing optimisations with not-yet-generalisable, early systems. 

Building on state-of-the-art computational genomics methods, the team will develop, simulate, and then implement new quantum algorithms, using real data. The algorithms and methods will be tested and refined in existing, powerful High Performance Compute (HPC) environments initially, which will be used as simulations of the expected quantum computing hardware. The team will test algorithms first using small stretches of DNA sequence, working up to processing relatively small genome sequences like that of SARS-CoV-2, before moving to the much larger human genome.

The project is a first step in exploring and conceptualising what quantum computing could bring to pangenomics. Expressing such scientific questions using quantum computing frameworks could itself yield benefits and new insights for researchers, even if practical application to quantum computers is not feasible. 

“On the one hand, we’re starting from scratch because we don’t even know yet how to represent a pangenome in a quantum computing environment,” explained David Yuan, Project Lead at EMBL-EBI. “If you compare it to the first moon landings, this project is the equivalent of designing a rocket and training the astronauts. On the other hand, we’ve got solid foundations, building on decades of systematically annotated genomic data generated by researchers worldwide and made available by EMBL-EBI. The fact that we’re using this knowledge to develop the next generation of tools for the life sciences is a testament to the importance of open data and collaborative science.”

EMBL-EBI is contributing data wrangling expertise to the project, as well as some of the technical infrastructure that will allow the project to run simulations of how quantum computing could work in the future, using existing technologies. 

“Currently it’s routine for genome sequencing from an individual to be compared to the linear reference genome to call variants and predict impact on functional elements,” explained Sarah Hunt, Variation Resources Coordinator at EMBL-EBI, who is not involved in the project. “As more and more full individual genomes are sequenced we want to be able to analyse them against the human pangenome reference. Being able to map data as quickly and efficiently as possible is critical. The hope is that projects like Wellcome Leap Q4Bio will one day help us leverage the information held in the pangenome and translate it into better clinical outcomes.”

“We’ve only just scratched the surface of both quantum computing and pangenomics,” said David Holland, Principal Systems Administrator at the Wellcome Sanger Institute, who is working to create a High Performance Compute environment to simulate a quantum computer. “So to bring these two worlds together is incredibly exciting. We don’t know exactly what’s coming, but we expect that all of a sudden, the heights of what is possible will come so much closer. We are doing things today that we hope will make tomorrow better.” 

The potential benefits of this work are huge. Comparing a specific human genome against the human pangenome – instead of the existing human reference genome – gives better insights into its unique composition. This will be important in driving forwards personalised medicine. Similar approaches for bacterial and viral genomes will underpin the tracking and management of pathogen outbreaks.

This article is based on a Wellcome Sanger Institute press release.

Related links

  • The Human Pangenome project
  • Ensembl human pangenome data
  • Ensembl Rapid Release




Big Data Analytics for Genomic Medicine

Karen Y. He

1 Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, USA; kyh9@case.edu

Dongliang Ge

2 BioSciKin Co., Ltd., Nanjing 210042, China

Max M. He

3 Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, WI 53706, USA

Abstract

Genomic medicine attempts to build individualized strategies for diagnostic or therapeutic decision-making by utilizing patients’ genomic information. Big Data analytics uncovers hidden patterns, unknown correlations, and other insights by examining large-scale, diverse data sets. While integrating and manipulating diverse genomic data and comprehensive electronic health records (EHRs) on a Big Data infrastructure presents challenges, it also offers a feasible opportunity to develop an efficient and effective approach to identify clinically actionable genetic variants for individualized diagnosis and therapy. In this paper, we review the challenges of manipulating large-scale next-generation sequencing (NGS) data and diverse clinical data derived from EHRs for genomic medicine. We introduce possible solutions for the different challenges in manipulating, managing, and analyzing genomic and clinical data to implement genomic medicine. Additionally, we present a practical Big Data toolset for identifying clinically actionable genetic variants using high-throughput NGS data and EHRs.

1. Introduction

Next-generation sequencing (NGS) technologies, such as whole-genome sequencing (WGS), whole-exome sequencing (WES), and targeted sequencing, are increasingly applied in biomedical research and medical practice to identify disease- and/or drug-associated genetic variants and advance precision medicine [ 1 , 2 ]. Precision medicine allows scientists and clinicians to predict more accurately which therapeutic and preventive approaches to a specific illness will work effectively in subgroups of patients based on their genetic make-up, lifestyle, and environmental factors [ 3 ]. To date, over 6000 Mendelian disorders have been studied at the genetic level [ 4 , 5 ] and over 1500 clinically relevant complex traits have been studied with genome-wide association study (GWAS) approaches [ 6 ]. Clinical research leveraging electronic health records (EHRs) has become feasible as EHRs have been widely implemented [ 7 ]. Additionally, a number of studies have been designed to combine genomic and EHR data to improve clinical research and/or healthcare outcomes ( Table 1 ).

Table 1. Studies and efforts leveraging genomic data and EHRs for genomic research and medicine.

Leveraging large-scale genomic data together with comprehensive clinical data derived from EHRs can implicate disease- and/or drug-associated variants for individualized diagnosis and therapy. NGS technological advancements in clinical genome sequencing and the adoption of EHRs will very likely bring patient-centered precision medicine into clinical practice. Genomic data generated by NGS technologies are a vital component in supporting genomic medicine, but the volume and complexity of the data raise challenges for their use in clinical practice [ 8 ]. For instance, sequencing a single whole genome generates more than 100 gigabytes of data. Therefore, the development of novel bioinformatics infrastructures is required to implement NGS in clinical practice.

Big Data is a term used to describe data sets of such volume or complexity that conventional data processing methods cannot handle them adequately. Big Data has been defined in different ways by different authors [ 9 ]. The most popular definition is the 5Vs: Volume, Velocity, Variety, Verification/Veracity, and Value [ 10 ]. The definition of Big Data may change as technology advances. A Big Data infrastructure is a framework, covering components such as Hadoop (hadoop.apache.org), NoSQL databases, and massively parallel processing (MPP), that is used for storing, processing, and analyzing Big Data. Big Data analytics covers the collection, manipulation, and analysis of massive, diverse data sets, including genomic data and EHRs, to reveal hidden patterns, cryptic correlations, and other insights on a Big Data infrastructure [ 11 ]. Owing to its effectiveness, Big Data analytics is widely used across research fields [ 12 ]. In this review, we describe how one type of Big Data, genomic data, is applied to improve clinical research and healthcare. We give an overview of the challenges in processing genomic data and EHRs, provide possible solutions to overcome these challenges using approaches that ensure the security of genomic data, and present a Big Data solution for identifying clinically actionable variants in sequence data. We also discuss the requirements for the efficient integration of genomic information into EHRs.

2. Challenges of Handling Genomic and Clinical Data

2.1. Challenges in Manipulating Genomic Data

Although more than 6000 Mendelian disorders have been studied at the genetic level so far, we still do not have a clear understanding of the roles most of them play in health and disease [ 25 ]. Over the past eight years, the size of the NIH sequence read archive (SRA) database has grown exponentially ( Figure 1 ). While the development of NGS technologies has made it increasingly easy to sequence a whole genome or exome, there continue to be considerable challenges in handling, analyzing, and interpreting the genomic information generated by NGS. Since there are over three billion base pairs (sites) in a human genome, sequencing a whole genome generates more than 100 gigabytes of data in BAM (the binary version of sequence alignment/map) and VCF (Variant Call Format) file formats. The actual size of a BAM file is determined by the coverage (the average number of times each base is read; read depth) and the read length of a sequencing experiment. For 30× WGS data from a single sample, the FASTQ file can be approximately 250 GB, the BAM file approximately 100 GB, the VCF file about 1 GB, and the annotated files approximately 1 GB as well. The approximate file sizes of different NGS data formats and the running times to generate them are shown in Figure 2 .

Big Data infrastructures can greatly facilitate the analysis of these data. For example, a Big Data-based Burrows-Wheeler Aligner (BWA) can increase alignment speed 36-fold compared to the original BWA [ 26 ]. Currently, most analytical methods for sequencing data use VCF files and assume that all “no-call sites” are the same as reference alleles. In fact, many “no-call sites” may be caused by low-quality coverage. Therefore, data quality information, such as coverage and Phred-scaled base quality scores for every site, needs to be utilized to determine whether “no-call sites” are reference-consistent with high coverage or reference-inconsistent because of low coverage in the downstream data analysis [ 27 ]. A number of toolsets for data compression, cloud computing, variant prioritization, copy number variation (CNV) detection, data sharing, and phenotyping on exome sequencing data have been reviewed by Lelieveld et al. [ 28 ]. Because VCFs are much smaller than BAM files, analytical tools operating on VCFs may not always require a Big Data infrastructure. However, researchers currently face substantial challenges in storing, managing, manipulating, analyzing, and interpreting WGS data for even moderate numbers of individuals if they need to take into account the data quality information stored in BAM files. These challenges will be exacerbated when millions of individuals are sequenced, which is the goal of the precision medicine initiative (PMI) in the U.S. and of similar efforts elsewhere in the world. By leveraging the distribution and scalability inherent in Big Data infrastructures, it is feasible to develop a Big Data system to manage and analyze such extensive genomic data in a way that is compatible with clinical workflows.
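
As one small illustration of using per-site coverage to interrogate “no-call sites”, the sketch below uses the Bioconductor packages Rsamtools and GenomicAlignments to read aligned reads from a BAM file and report depth at chosen positions, so that low-coverage sites can be flagged. The file name, chromosome naming, positions, and depth threshold are all assumptions made for the example.

    library(Rsamtools)          # BAM file access
    library(GenomicAlignments)  # readGAlignments(), coverage()

    bam <- BamFile("sample.bam")    # hypothetical indexed WGS alignment
    aln <- readGAlignments(bam)     # in practice, restrict to regions of
                                    # interest with ScanBamParam(which = ...)
    cov <- coverage(aln)            # per-base read depth, one Rle per chromosome

    # Depth at a few example positions on chromosome 1 (made-up coordinates).
    sites <- c(10177, 10352, 11012)
    depth <- as.vector(cov[["chr1"]][sites])

    # A "no-call" site with ample depth is plausibly reference-consistent;
    # one with very low depth may simply lack data.
    data.frame(pos = sites, depth = depth, low_coverage = depth < 10)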

Figure 1. Growth of the NIH sequence read archive (SRA) database over the past eight years.

Figure 2. Approximate file sizes of different NGS data formats and the running times to generate them. BWA: Burrows-Wheeler Aligner; GATK: Genome Analysis Toolkit; BAM: binary version of sequence alignment/map; FASTQ: text-based format for nucleotide or peptide sequences; VCF: Variant Call Format.

2.2. Challenges in Manipulating Clinical Data

Until about a decade ago, approximately 90% of clinicians in the U.S. routinely recorded patient medical records by hand and stored them in color-coded files. In the past five years, the percentage of clinicians using certified EHR systems has grown dramatically [ 29 ]. Clinical data extracted from the EHRs for each patient can include international classification of diseases (ICD) codes, drugs, treatments, procedure (CPT) codes, laboratory values, clinician notes, as well as self-reported dietary and physical activity data. The ICD code is a clinical cataloging system used by clinics and hospitals in the U.S. to classify and code diagnoses, symptoms, procedures, and treatments. ICD codes are used not only for disease classification but also as medical billing codes [ 30 ]. The volume of clinical data extracted from EHRs can be considerable. For example, the EHR data of ~20,000 patients enrolled in the Personalized Medicine Research Project (PMRP) at Marshfield Clinic is approximately 3.3 GB. The elements in clinical data can be used to classify patients and to measure associations between environmental exposures and clinical consequences. An important application of mining clinical data is patient classification [ 31 , 32 , 33 , 34 ]. Without stringent and appropriate phenotyping approaches, classification cannot be measured appropriately, resulting in false positive or negative associations [ 31 ]. Machine learning (ML) involves training an algorithm to systematically classify patients into phenotypic groups [ 35 ]. To do this, the ML classifier needs to learn which elements in the clinical data provide useful signals for distinguishing the different phenotypic groups. With the proliferation of EHR adoption, computational phenotyping has shown its advantages in classifying research subjects [ 36 ]. In addition, millions of data points spanning tens of thousands of clinical elements within the EHRs are available for EHR-based phenotyping. As with sequence data, it will become a significant challenge to store, manage, manipulate, and mine the complete clinical data of millions of individuals. Therefore, it is necessary to develop advanced and efficient ML approaches for subject characterization and better phenotyping. Meanwhile, some ML tasks may take several days to run specific data mining algorithms. For example, when mining large-scale literature, ML approaches running on a Big Data infrastructure can be roughly 100 times faster than existing ML tools that do not use a Big Data infrastructure [ 37 ].
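
As a minimal sketch of what EHR-based computational phenotyping can look like, the R snippet below simulates binary diagnosis-code and lab-flag indicators and fits a logistic regression that classifies patients into case and control groups. The feature names, prevalences, and effect sizes are all invented; real phenotyping pipelines use far richer features, chart-reviewed labels, and careful validation.

    set.seed(1)
    n <- 500
    # Simulated EHR-derived features: 1 = code or lab flag present for the patient.
    ehr <- data.frame(
      icd_E11  = rbinom(n, 1, 0.3),   # diabetes diagnosis code (invented prevalence)
      icd_I10  = rbinom(n, 1, 0.4),   # hypertension diagnosis code
      high_a1c = rbinom(n, 1, 0.2)    # abnormal HbA1c laboratory flag
    )
    # Simulated phenotype label, more likely when code and lab flag co-occur.
    ehr$case <- rbinom(n, 1, plogis(-2 + 2.5 * ehr$icd_E11 + 1.5 * ehr$high_a1c))

    # A simple ML-style classifier for phenotyping.
    fit <- glm(case ~ icd_E11 + icd_I10 + high_a1c, data = ehr, family = binomial)
    summary(fit)$coefficients

    # Confusion table of predicted versus simulated phenotype labels.
    table(truth = ehr$case, predicted = predict(fit, type = "response") > 0.5)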

3. Big Data on the Cloud

3.1. Cloud Computing

Cloud computing providers offer services that supply infrastructure, software, and programming platforms to clients, and the providers are responsible for the cost of developing and maintaining them [ 38 ]. Compared to creating and maintaining an in-house database, cloud computing is an economical approach to genomic data management because clients pay only for the services that they need. An example of an open-source framework used to build infrastructure for processing genomic data in a cloud computing environment is Hadoop. It breaks the data into small fragments, distributes them across many data nodes, delivers the computational code to the nodes so that the fragments are processed in parallel, and assembles the results at the end. This parallel processing of many small pieces of data, known as MapReduce, greatly shortens computing time. Challenges of using cloud computing for genomic data include lengthy data transfers when uploading NGS data to the cloud, the perceived lack of information security in cloud computing, and the need for developers with advanced programming skills to write programs on Hadoop [ 38 ].
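
To illustrate the map-and-reduce pattern that Hadoop distributes across data nodes, the short R sketch below counts variants per chromosome over chunks of a hypothetical VCF-like table. Hadoop streaming (or frameworks built on it) would run the same two phases in parallel on many machines rather than sequentially in a single R session.

    # Pretend the variant records arrived as independent chunks on different nodes.
    chunks <- list(
      data.frame(chrom = c("chr1", "chr1", "chr2"), pos = c(101, 202, 303)),
      data.frame(chrom = c("chr2", "chr3"), pos = c(404, 505))
    )

    # Map phase: each node independently emits (key = chromosome, value = count).
    mapped <- lapply(chunks, function(chunk) table(chunk$chrom))

    # Reduce phase: counts sharing a key are combined across nodes.
    reduced <- Reduce(function(acc, tab) {
      for (k in names(tab)) acc[k] <- sum(acc[k], tab[k], na.rm = TRUE)
      acc
    }, mapped, init = integer(0))

    reduced   # total variant count per chromosome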

3.2. Privacy and Security Challenges of Cloud Computing

Cloud computing infrastructures can be deployed on miscellaneous platforms and configurations, in which each platform can be configured with diverse security, confidentiality, and authentication settings [ 39 ]. These unique aspects can exacerbate security and privacy challenges [ 40 ]. Some cloud service providers, including Microsoft Azure [ 41 ] and Amazon Web Services [ 42 ], provide health insurance portability and accountability act (HIPAA)-compliant services for analyzing biomedical data. In addition, data security and privacy in cloud computing is an active research field involving the use of virtual machines [ 43 ] and sandboxing techniques [ 44 ] for biomedical data management on the cloud.

4. Big Data Analytics in Genomic Studies

4.1. NGS Read Alignment

NGS involves breaking DNA into large numbers of segments. Each segment is called a ‘read’. Because of biases in sample processing, library preparation, sequencing-platform chemistry, and the bioinformatics methods used for genomic alignment and assembly of the reads, the distribution and/or length of reads across the genome can be uneven [ 45 , 46 ]. Therefore, some genomic regions are covered by more reads and others by fewer. As mentioned previously, read depth denotes the average number of times each base is read. For instance, a 10× read depth means that each base is present in an average of 10 reads. For RNA-seq, read depth is more often expressed as a number of millions of reads. Read alignment involves lining up the sequence reads against a reference sequence [ 47 , 48 ] to allow comparison of the sequenced sample with the reference genome. A number of alignment tools, including CloudBurst [ 49 ], Crossbow [ 50 ], and SEAL [ 51 ], have been developed on Big Data infrastructures. More programs designed for short-read sequence alignment are shown in Table S1 . Alignment enables a number of quality control (QC) measures to be computed, such as the proportion of all reads aligned to the reference sequence, the proportion of uniquely aligned reads, and the number of reads aligned at a specific locus. These QC measures affect the accuracy of variant calling.
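
A quick way to compute such QC measures in R/Bioconductor is sketched below, assuming an indexed BAM file: idxstatsBam() from Rsamtools tabulates mapped and unmapped read counts per reference sequence (much like samtools idxstats), from which an overall alignment rate follows. The file name is hypothetical.

    library(Rsamtools)

    # Per-chromosome mapped/unmapped read counts from the BAM index
    # (analogous to `samtools idxstats`); the file name is hypothetical.
    stats <- idxstatsBam("sample.bam")
    head(stats)   # columns: seqnames, seqlength, mapped, unmapped

    total_mapped   <- sum(stats$mapped)
    total_unmapped <- sum(stats$unmapped)

    # Proportion of reads aligned to the reference, one of the QC measures above.
    total_mapped / (total_mapped + total_unmapped)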

4.2. Calling Variants

Variant calling is more reliable with higher read depth, which is especially valuable for detecting rare genetic variants with higher confidence. The read depth needed to call variants accurately depends on various factors, including the presence of repetitive genomic regions, the error rate of the sequencing platform, and the algorithm used to assemble reads into a genomic sequence. Different read depths may be required, such as 100× for heterozygous single nucleotide variant (SNV) detection by WES [ 52 ], 35× for genotype detection by WGS [ 53 ], and 60× for detecting insertions/deletions (INDELs) by WGS [ 54 ]. Widely used programs for germline variant calling include SAMtools [ 55 ], GATK [ 56 ], FreeBayes [ 57 ], and Atlas2 [ 58 ]. SAMtools comprises a number of utilities for manipulating aligned sequence reads and calling SNV and/or INDEL variants. GATK is an NGS analysis suite designed to identify SNVs and INDELs in germline DNA and RNA-seq data. It estimates the likelihood of a genotype from the observed sequence reads at a locus using a Bayesian model. In addition, it employs a MapReduce infrastructure to accelerate the processing of large numbers of aligned sequence reads in parallel [ 59 , 60 ]. It has since been expanded to include somatic variant calling by incorporating MuTect [ 61 ], and to tackle CNVs and structural variations (SVs) as well. The major difference between SAMtools and GATK is how the genotype likelihoods of SNVs and INDELs are estimated when calling variants. Regarding the filtering steps, SAMtools uses predefined filters while GATK learns the filters from the data. FreeBayes is a haplotype-based tool that concurrently discovers SNVs, INDELs, multiallelic sites, polyploidy, and CNVs in a single sample, pooled multiple samples, or mixed populations [ 62 ]. Atlas2 [ 58 ] can analyze data generated by the SOLiD platform via logistic regression models trained on validated WES data to detect SNVs and INDELs. It can also analyze data generated by the Illumina platform, using logistic regression models to call INDELs and a mixture of logistic regression and a Bayesian model to call SNVs [ 63 ]. To evaluate the various programs/tools, Hwang et al. systematically examined 13 variant calling programs using gold-standard personal exome variants [ 64 ].
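
To make the genotype-likelihood idea concrete, the sketch below implements a simplified diploid model in R (a toy version of what callers such as GATK and SAMtools do, not their actual code): for each read base, the probability under a genotype averages a base-error model over the genotype's two alleles, and a posterior combines the likelihoods with illustrative genotype priors.

    read_bases <- c("A", "G", "A", "G", "A", "G")   # 6 reads covering one site
    e <- 0.01                                       # per-base sequencing error rate

    # Probability of observing base b from a single allele a.
    p_base_given_allele <- function(b, a) if (b == a) 1 - e else e / 3

    # Likelihood of each diploid genotype: average the two alleles per read,
    # then multiply across reads (reads treated as independent).
    genotypes <- list(AA = c("A", "A"), AG = c("A", "G"), GG = c("G", "G"))
    lik <- sapply(genotypes, function(gt)
      prod(sapply(read_bases, function(b)
        mean(c(p_base_given_allele(b, gt[1]), p_base_given_allele(b, gt[2]))))))

    prior <- c(AA = 0.999, AG = 0.0009, GG = 0.0001)  # toy genotype priors
    post  <- lik * prior / sum(lik * prior)           # Bayesian posterior
    round(post, 4)   # the heterozygous genotype is by far the most probable here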

4.3. Variant Annotation

Large amounts of sequence data are being generated by NGS. To pinpoint the small subset of functional variants, many annotation programs have been developed. As one of the most widely used annotation programs, ANNOVAR [ 65 ] annotates SNVs, INDELs, and CNVs by exploring their functional consequences on genes, inferring cytogenetic bands, and reporting biological functions and various functional scores, including the PolyPhen-2 score [ 66 ], the Sorting Intolerant From Tolerant (SIFT) score [ 67 ], the Combined Annotation Dependent Depletion (CADD) score [ 68 ], and others. It also identifies variants in conserved regions and variants present in dbSNP [ 69 ], the 1000 Genomes Project [ 70 ], the NHLBI ESP6500 project [ 71 ], and ExAC [ 72 ]. Furthermore, ANNOVAR can employ annotation databases from the UCSC Genome Browser or any other data resources conforming to Generic Feature Format version 3 (GFF3). Other commonly used annotation programs include snpEff [ 73 ] and the Ensembl Variant Effect Predictor (VEP) [ 74 ]. Xin et al. have developed a web-based annotation service that can be run on the cloud [ 75 ]. In order to annotate WGS data in a short period of time, we are currently developing a cloud-based version of ANNOVAR, built on a Hadoop framework and a Cassandra NoSQL database. Additional variant annotation programs are shown in Table S2 . Variant annotation depends on biological knowledge to provide information on the known or likely impact of variants on gene regulation and protein function [ 65 , 73 ]. To produce a patient report, annotated variants are interpreted in a disease-specific context and are often classified based on their known or expected clinical impact. For instance, the ClinVar [ 76 ] variant database release of 5 July 2016 from the National Center for Biotechnology Information (NCBI) contains 126,315 unique genetic variants with clinical interpretations.
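
As a small sketch of working with annotated variants in R/Bioconductor, the snippet below reads an annotated VCF with the VariantAnnotation package and keeps variants above a deleteriousness threshold. The file name, the genome label, the assumption that the pipeline wrote a numeric CADD score into an INFO field named "CADD", and the CADD >= 20 cutoff are all illustrative choices rather than fixed conventions.

    library(VariantAnnotation)

    # Read an annotated VCF (hypothetical file; "hg19" is the genome label).
    vcf <- readVcf("annotated.vcf.gz", "hg19")

    # Assume the annotation pipeline wrote a numeric CADD score into an INFO
    # field called "CADD"; other pipelines use different field names.
    cadd <- info(vcf)$CADD

    # Keep candidate deleterious variants (CADD >= 20 is a common rule of thumb)
    # and inspect their genomic positions.
    candidates <- vcf[!is.na(cadd) & cadd >= 20, ]
    rowRanges(candidates)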

4.4. Statistical Analysis of Genomic Data

Family-based analysis: Family-based NGS data enable the discovery of disease-contributing de novo mutations [ 77 , 78 , 79 ]. Family-based research strategies can also uncover mutations contributing to recessive diseases inherited as homozygous or compound heterozygous genotypes. SeqHBase [ 27 ] is a reliable and scalable computational program that manipulates genome-wide variants, functional annotations, and per-site coverage, and analyzes WGS/WES data to identify disease-contributing genes effectively. It is a Big Data-based toolset designed to analyze large-scale family-based sequencing data to quickly discover de novo, inherited homozygous, and/or compound heterozygous mutations.
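
A stripped-down illustration of the family-based logic (not SeqHBase itself): given genotype calls for a father-mother-child trio, candidate de novo variants are those where the child carries an alternate allele seen in neither parent. The toy data frame below stands in for a merged multi-sample VCF, and the sites and genotypes are invented.

    # Toy trio genotypes at a few sites (0/0 = hom-ref, 0/1 = het, 1/1 = hom-alt).
    trio <- data.frame(
      chrom  = c("chr1", "chr2", "chr7", "chrX"),
      pos    = c(12345, 67890, 55500, 101010),
      father = c("0/0", "0/1", "0/0", "0/0"),
      mother = c("0/0", "0/0", "0/0", "0/1"),
      child  = c("0/1", "0/1", "0/0", "1/1")
    )

    # Candidate de novo: the child carries an alternate allele while both
    # parents are homozygous reference.
    has_alt <- function(gt) gt %in% c("0/1", "1/1")
    trio[has_alt(trio$child) & trio$father == "0/0" & trio$mother == "0/0", ]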

Population-based analysis: A number of large-scale population-based sequencing studies are underway. For example, the PMI cohort program aims to sequence one million or more American participants to improve our ability to prevent and cure diseases based on individual differences in genetic make-up, lifestyle, and environmental factors. By 2025, over 100 million human genomes could be sequenced [ 80 ]. Therefore, it is critical to develop statistical toolsets on a Big Data infrastructure for analyzing the genomic data of millions of people.

4.5. Security of Genomic Data

Genomic data need to be protected, and their privacy and confidentiality should be preserved in the same way as other protected health information. Privacy safeguards include data encryption, password protection, secure data transmission, audits of data-transfer methods, and institutional strategies against data breaches and malicious abuse of the data [ 81 ]. The Fair Information Practices Principles (FIPPs) offer a framework for enabling data sharing and use based on the guidelines adopted by the U.S. Department of Health and Human Services [ 82 ]. These principles include: individual access, data correction, data transparency, individual choice, data collection and disclosure limitation, data quality and integrity, safeguards, and accountability. The Workgroup for Electronic Data Interchange (WEDI) has released a report outlining the challenges with regard to the infrastructure, workflows, and coordination of health IT integration [ 83 ]. These challenges include data access and integration, data exchange, and data governance. Advances in cloud computing technology offer easier solutions for storing large genomic data files and consolidating data to make them more easily accessible. The use of cloud computing presents additional security concerns because data storage and/or processing services are provided by an entity external to the healthcare organization. Cloud service providers qualify as business associates and must sign a business associate agreement (BAA) in order to adhere to the modifications to the HIPAA privacy, security, enforcement, and breach notification rules [ 84 ]. Cloud service providers can address these concerns by providing controlled access to the data and building a role-based access system. Additional security measures should be taken, such as protecting the security of the computer network with warning alarms that monitor when stored data are changed, and guaranteeing the complete removal of data from the provider's servers if the cloud storage service is no longer used [ 39 ].

5. Analysis of Genomic and Clinical Data

5.1. Clinically Actionable Genetic Variants

In clinical practice, the identification and return of incidental findings (IFs) for clinically disease-contributing variants in a set of 56 "highly medically actionable" genes associated with 24 inherited conditions have been recommended by the American College of Medical Genetics and Genomics (ACMG) [ 85 , 86 ]. A web-based tool for detecting clinically actionable variants in the 56 ACMG genes was developed by Daneshjou et al. [ 87 ], and a variant characterization framework for targeted analysis of relevant reads from high-throughput sequencing data was developed by Zhou et al. [ 88 ]. SeqHBase [ 27 ] is a bioinformatics toolset for analyzing family-based WGS/WES data on a Big Data infrastructure. To derive biological insights from large amounts of NGS data and comprehensive clinical data, we have expanded the analysis functions within SeqHBase ( Figure 3 ) to detect disease- and/or drug-associated genetic variants quickly.

Figure 3. The basic framework of SeqHBase for identifying clinically actionable genetic variants.

Even though many variant prioritization tools are available, it remains a challenge to detect clinically actionable variants. Additional efforts are required to distinguish the truly clinically actionable variants that can be used to guide clinical decisions. Because a single variant can be assigned different pathogenicity classifications by different clinical laboratories [ 89 ], more stringent criteria [ 90 ] and the latest ACMG guidelines [ 91 ] should be followed when reporting pathogenic variants [ 92 ]. To classify the pathogenicity of new variants that are not recorded in the ClinVar database [ 76 ], and to reach some level of concordance in clinical variant interpretation, assessments from experts, such as medical geneticists, and/or further biological functional studies are needed. To apply actionable results in clinical practice, genetic findings need to be complemented with strong pathological evidence and reviewed by clinical geneticists.
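
As a minimal sketch of the reporting step described above, annotated variants can be cross-referenced against the ACMG gene list and a ClinVar-style classification before expert review; in the R snippet below the gene list is truncated and the variant table is entirely illustrative.

    # A few genes from the ACMG actionable-findings list (truncated for illustration).
    acmg_genes <- c("BRCA1", "BRCA2", "MLH1", "MSH2", "LDLR", "MYH7")

    # Toy annotated variant table, as might come from ANNOVAR/VEP plus ClinVar.
    variants <- data.frame(
      gene    = c("BRCA2", "TTN", "LDLR", "APOE"),
      hgvs    = c("c.5946delT", "c.2T>C", "c.1060+1G>A", "c.388T>C"),
      clinvar = c("Pathogenic", "Uncertain significance", "Likely pathogenic", "Benign")
    )

    # Report only variants in ACMG genes with a (likely) pathogenic classification;
    # in practice these still require review by clinical geneticists.
    subset(variants, gene %in% acmg_genes &
                     clinvar %in% c("Pathogenic", "Likely pathogenic"))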

5.2. Clinically Actionable Pharmacogenetic Variants

Substantial efforts have been made to identify clinically actionable pharmacogenetic variants, and it is instructive to review the approaches being used. The Coriell Personalized Medicine Collaborative [ 93 ], the Clinical Pharmacogenetics Implementation Consortium [ 94 ], the Pharmacogenetics Working Group established by the Royal Dutch Association for the Advancement of Pharmacy [ 95 ], and the Evaluation of Genomic Applications in Practice and Prevention initiative sponsored by the Centers for Disease Control and Prevention [ 96 ] have individually developed similar processes for selecting candidate drugs, reviewing published literature to identify drug-gene associations, scoring the evidence supporting associations between genetic variants and drug response, and interpreting the evidence to provide therapeutic guidelines. This approach, in which an expert committee reviews and interprets the scientific literature, can be considered the gold standard for determining whether a variant is clinically relevant or actionable, but it can also be costly and labor-intensive. It will not be feasible for experts, either individually or in committees, to review the large numbers of genetic variants identified in NGS data. Tools such as PolyPhen-2 [ 66 ], VEP [ 97 ], Mutation Assessor [ 98 ], and SIFT [ 99 ] can be used to predict variant effects. However, because these tools are sometimes inaccurate [ 100 ] and often differ in their predictions for the same variant [ 101 , 102 ], there will likely be many variants with no clear predicted clinical interpretation. New methods and toolsets need to be developed to accurately predict the pathogenicity of genetic variants identified by NGS. More importantly, these methods should comply with the U.S. Food and Drug Administration (FDA) guidelines [ 103 ] and the Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines [ 104 ].

6. Big Data Analytics in Health Research

Clinical data derived from EHRs have expanded from digital versions of individual patient medical records into high-dimensional data of enormous complexity. Big Data refers to the novel technological tools that deliver scalable capabilities for managing and processing immense and diverse data sets. At the level of individual records, approaches such as natural language processing (NLP) allow textual data to be incorporated and explored alongside structured sources. At the population level, Big Data makes it possible to conduct large-scale exploration of clinical outcomes to uncover hidden patterns and correlations [ 105 ]. The large volumes of EHRs now available have enabled us to overcome previously challenging obstacles, such as analyses of rare conditions, more sophisticated analyses, and in-depth analyses of specific data elements [ 106 ]. Big Data analytics moves healthcare from description and record-keeping toward prediction and optimized decision-making.
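
As a very simplified stand-in for NLP over clinician notes, the R sketch below uses a small dictionary and regular expressions to turn free-text medication mentions into a structured table. Real clinical NLP systems handle negation, abbreviations, misspellings, and context far more carefully, and the notes and drug names here are invented.

    notes <- c(
      "Patient started on metformin 500 mg twice daily.",
      "Denies chest pain; continues lisinopril.",
      "No current medications."
    )

    # Crude medication extraction with a tiny dictionary and a regular expression.
    meds_pattern <- "metformin|lisinopril|atorvastatin"
    hits <- regmatches(tolower(notes), gregexpr(meds_pattern, tolower(notes)))

    # One row per note: which medications were mentioned, if any.
    data.frame(note = seq_along(notes),
               medications = vapply(hits, function(m)
                 if (length(m)) paste(m, collapse = "; ") else NA_character_,
                 character(1)))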

6.1. Health Informatics

In clinical practice, disease characterization is routinely collected from a number of different streams, such as imaging, pathology, genomics, and electrophysiology. However, many of the deeper insights into disease processes and mechanisms remain to be uncovered and interpreted from routinely acquired clinical data. The clinical data of millions of patients at a clinic/hospital or in a large study (e.g., PMI) exhibit many of the features of Big Data. Volume comes from the large number of records that can be derived from the EHRs; for example, medical images such as magnetic resonance imaging (MRI) or neuroimaging data for each patient can be large, and social media data gathered from a population can be large-scale as well. Velocity occurs when data accumulate at high speed, as when monitoring a patient’s real-time condition through medical sensors for sleep apnea ( http://www.sleepapnea.org/ ), for instance. Variety refers to data sets with many varying types of independent attributes, such as data sets gathered from many different resources. Veracity is a concern when working with possibly noisy, incomplete, or erroneous data, which need to be properly evaluated against other reliable evidence. Value describes the usefulness of the data for improving healthcare outcomes. Advances in health informatics are a vital driving force in the expansion of Big Data, owing both to the volume of clinical information produced and to the complexity and variety of biomedical data, which encompasses discoveries from basic science, translational research, medical systems, and population-based studies of the determinants of health. It is essential to develop novel data analytics tools with scalable, accessible, and sustainable data infrastructure to effectively manage large, multiscale, heterogeneous data sets and convert them into knowledge that can be used for cost-effective decision support, disease management, and healthcare delivery. It is also necessary to develop Big Data infrastructures and systems to store, manage, manipulate, and analyze large-scale clinical data.

6.2. Medical Imaging Analysis

Medical imaging data are a type of Big Data in medical research. Imaging genomics is a rapidly growing field that has emerged from recent advances in multimodal imaging data and high-throughput omics data. The remarkable complexity of these datasets presents critical computational challenges. Kitchen et al. reviewed methods for overcoming the challenges associated with integrating genomic, transcriptomic, and neuroproteomic data [ 107 ]. There is increasing interest in integrating neuroimaging data into frameworks that support data mining and meta-analyses [ 108 ]. For the past century, a central interest in cognitive neuroscience has been understanding the human brain [ 109 , 110 ]. Recently, a group of neuroscience researchers used ML methods to map the human brain in order to understand the incredibly complex human cerebral cortex [ 111 ]. Human brain mapping is another monumental step toward precision medicine. For example, advances in the operational management and treatment of neurological and psychiatric disease now enable researchers to collect and explore data from the brain's approximately 100 billion neurons at far greater scale and speed. Because the human brain operates across multiple spatial and temporal scales, such data can be used to understand how the brain works by combining all relevant information [ 112 ]. Therefore, developing high-performance computing tools based on a Big Data framework is becoming critical to neuroscience and to improving healthcare [ 113 , 114 ].

6.3. Data Sharing

In order to share EHRs across multiple healthcare providers, several key components need to be taken into account: (1) functional interoperability, which allows data (e.g., medical records) to be exchanged from one EHR system to other EHR systems without restriction; (2) structural interoperability, which permits the data structure to be exchangeable across all systems; (3) semantic interoperability, which allows multiple systems to exchange data and readily make use of the data exchanged; and (4) interpretation, which allows clinicians to interpret the exchanged health records (e.g., symptoms) as carrying the same meaning. Several major EHR vendors, including Epic, Cerner, MEDITECH, and Allscripts, have joined to establish an interoperability initiative ( http://www.beckershospitalreview.com/healthcare-information-technology/epic-cerner-meditech-dozens-more-make-interoperability-pledge-at-himss16-5-things-to-know.html ). All of these vendors have agreed to overcome the challenges of medical record interchangeability, information sharing, and positive patient engagement. This agreement is a critical step towards constructing an allied healthcare system in which information is shared smoothly and securely across various EHR systems [ 115 ].

7. Discussion

In order to maximize the prevention of serious but preventable diseases, it is critical to understand as much about the patient as early as possible. Generally, precautionary health interventions are simpler and more cost-effective than therapies implemented at a later stage. In addition, knowing patients’ individual characteristics is often helpful in providing effective and individualized therapy for a disease because individual patients can respond to the same treatment differently. Genomic medicine could change the path for preventing and treating human diseases. However, the translation of these advances into healthcare will rely critically on our ability to identify disease- and/or drug-associated clinically actionable variants and on our knowledge of the roles of these genetic alterations in the disease process.

To conduct pilot studies on incorporating genomic data into clinical care, a number of healthcare systems have developed bioinformatics infrastructures to process NGS data through a group of databases supplementary to the EHRs [ 116 , 117 , 118 ]. Most of these infrastructures are locally developed and proprietary, because these centers are among the first healthcare providers to use genomic data in clinical care and there are no established infrastructures to meet their bioinformatics requirements. Developing and deploying an efficient bioinformatics infrastructure for incorporating NGS data into clinical care requires substantial investment in resources and personnel. Thus, healthcare providers might want to consider cooperatively establishing a cloud computing service designed to store and process genomic data securely for the healthcare community. The cost of sequencing instruments may need to be taken into account as part of the infrastructure cost by clinical laboratories. Targeted sequencing instruments are less expensive and generate less data than those that perform WGS/WES. Therefore, more laboratories are likely to implement targeted sequencing before attempting to build a framework to support WGS/WES. The value of such integration is already apparent: a study conducted by the Regeneron Genetics Center and the Geisinger Health System highlighted the value of integrating genomic data and EHRs by uncovering a genetic variant that results in reduced levels of triglycerides and a lower risk of coronary artery disease [ 119 ]. In addition, a large-scale analysis of more than 50,000 patient exomes and their EHRs by Regeneron and Geisinger found clinically actionable genetic variants in 3.5% of individuals, as well as several known and/or potential drug targets [ 120 ]. However, there are still challenges in integrating genomic data into EHRs in clinical practice, including reliable bioinformatics systems/pipelines that translate raw genomic data into meaningful and actionable variants, the role of human curation in the interpretation of genetic variants, and the requirement for consistent standards for genomic and clinical data [ 121 ].

A vital challenge in incorporating genomic data into clinical practice is the lack of standards for generating NGS data, bioinformatics processing, data storage, and clinical decision support. Standards could promote interoperability and data quality, and adherence to standards would enable the routine use of genomic data in clinical care. However, it is challenging to build standards when NGS technology and bioinformatics tools are evolving so rapidly. Furthermore, approaches to clinical decision support differ among healthcare institutions [ 116 ]. Appropriately integrating genomic data with EHRs for the discovery of clinically actionable variants can generate novel insights into disease mechanisms and lead to better treatments. To improve our understanding of the nature of disease from comprehensive EHRs, new methods such as ML, NLP, and other artificial intelligence approaches are needed. However, not all patients are likely to benefit from the use of Big Data in healthcare, owing to our current knowledge gaps on how to extract useful information from large-scale genomic and clinical data and how to interpret discovered variants properly. In the meantime, targeted therapies are not yet available for many important genes, and regulatory issues need to be resolved before some useful bioinformatics tools can be applied in clinical settings.

8. Conclusions

In conclusion, because EHRs are exceptionally private, methods of protecting patient data need to ensure that patient information is shared only with those who have authorized access. Even with the existing challenges, the prospective advantages that genomic data can bring to healthcare outweigh the potential disadvantages. The increasing integration of genomic data with EHRs may raise concerns, but genomic data will play an important role in advancing genomic medicine only if patient privacy and data security are strictly protected.

Acknowledgments

Max M. He is greatly appreciative of the support from the leaderships in the Center for Human Genetics and Biomedical Informatics Research Center at Marshfield Clinic Research Foundation. This work was supported by the National Institutes of Health (HL007567 to Karen Y. He, UL1TR000427 to Max M. He); and the MCRF to Max M. He.

Supplementary Materials

Supplementary materials can be found at www.mdpi.com/1422-0067/18/2/412/s1 .

Author Contributions

Karen Y. He, Dongliang Ge, and Max M. He contributed to designing, drafting, and performing critical review of the manuscript. Karen Y. He, Dongliang Ge, and Max M. He are the guarantors of the manuscript.

Conflicts of Interest

Dongliang Ge and Max M. He are employed and may hold stock of and/or stock options with BioSciKin Co., Ltd. This does not alter our adherence to the journal’s policies. The other authors declare no conflict of interest.


Quantum Computing Meets Genomics: The Dawn of Hyper-Fast DNA Analysis

By Wellcome Trust Sanger Institute April 24, 2024

A pioneering collaboration has been established to focus on using quantum computing to enhance genomics. The team will develop algorithms to accelerate the analysis of pangenomic datasets, which could revolutionize personalized medicine and pathogen management. Credit: SciTechDaily.com

A new project unites world-leading experts in quantum computing and genomics to develop new methods and algorithms to process biological data.

Researchers aim to harness quantum computing to speed up genomics, enhancing our understanding of DNA and driving advancements in personalized medicine


Perspectives From the Team

Dr. Sergii Strelchuk, Principal Investigator of the project from the Department of Applied Mathematics and Theoretical Physics, University of Cambridge, said: “The structure of many challenging problems in computational genomics and pangenomics in particular make them suitable candidates for speedups promised by quantum computing. We are on a thrilling journey to develop and deploy quantum algorithms tailored to genomic data to gain new insights, which are unattainable using classical algorithms.”

David Holland, Principal Systems Administrator at the Wellcome Sanger Institute, who is working to create the High Performance Compute environment to simulate a quantum computer, said: “We’ve only just scratched the surface of both quantum computing and pangenomics. So to bring these two worlds together is incredibly exciting. We don’t know exactly what’s coming, but we see great opportunities for major new advances. We are doing things today that we hope will make tomorrow better.”


This project is funded by the Wellcome Leap Quantum for Bio (Q4Bio) Supported Challenge Program.


1 Comment on "Quantum Computing Meets Genomics: The Dawn of Hyper-Fast DNA Analysis"


Very promising initiative. How often guest pathogens become integrated into our genome and remain hidden is not known. As our immune systems weaken, they show up with symptoms as disease; we see hope in countering them with mRNA vaccines, although these need to become more widely applicable against different kinds of viruses. Some human endogenous retroviruses (HERVs) are still active in our genomes, producing viral proteins in healthy tissue and causing cancer. Understanding their role in disease remains one of the challenges for genome sequence analysis. Our genome contains ancient viral remnants, and DNA sequence analysis would enormously help to uncover those hidden viruses. The interplay between pathogens and our genetic heritage continues to intrigue science. Host-guest interactions will remain a very hot area in which rapid DNA and deep RNA analysis with AI-assisted bioinformatic tools are very welcome.



Bioinformatics services offered by the Genomics Core at WSU Spokane

The Genomics Core at WSU Spokane now offers cutting-edge bioinformatics services to support WSU research. Spearheading this initiative is Dr. Daniel Beck (Biosketch (PDF)), an accomplished bioinformatician with over a decade of experience, particularly in next-generation sequencing data analysis. Our standard services include data quality control and processing, differential expression analysis of RNA-Seq data, de novo genome assembly, and genomic variant analysis. Additionally, we offer custom data analyses tailored to your specific research needs (visit our website for more information).

Our team will work with you to design custom methods for library preparation, sequencing, and data analysis for your research projects. We are committed to helping you access advanced tools and resources for genomic analysis, ensuring your work remains at the forefront of scientific discovery. For a free consultation or more information, please reach out to us at 509-368-6668 or via email at [email protected] .



  • Open access
  • Published: 19 April 2024

Single Cell Atlas: a single-cell multi-omics human cell encyclopedia

  • Paolo Parini 2,3,
  • Roman Tremmel 4,5,
  • Joseph Loscalzo 6,
  • Volker M. Lauschke 4,5,7,
  • Bradley A. Maron 6,
  • Paola Paci 8,
  • Ingemar Ernberg 9,
  • Nguan Soon Tan 10,11,
  • Zehuan Liao 10,9,
  • Weiyao Yin 1,
  • Sundararaman Rengarajan 12,
  • Xuexin Li ORCID: orcid.org/0000-0001-5824-9720 13,14 on behalf of

The SCA Consortium

Genome Biology volume 25, Article number: 104 (2024)


Single-cell sequencing datasets are key in biology and medicine for unraveling insights into heterogeneous cell populations with unprecedented resolution. Here, we construct a single-cell multi-omics map of human tissues through in-depth characterizations of datasets from five single-cell omics, spatial transcriptomics, and two bulk omics across 125 healthy adult and fetal tissues. We construct its complementary web-based platform, the Single Cell Atlas (SCA, www.singlecellatlas.org ), to enable vast interactive exploration of deep multi-omics signatures across human fetal and adult tissues. The atlas resources and database queries aspire to serve as a one-stop, comprehensive, and time-effective resource for various omics studies.

The human body is a highly complex system with dynamic cellular infrastructures and networks of biological events. Thanks to the rapid evolution of single-cell technologies, we are now able to describe and quantify different aspects of single-cell activities using various omics techniques [1, 2, 3, 4]. Observing or integrating multiple molecular layers of single cells has promoted profound discoveries in cellular mechanisms [5, 6, 7, 8]. To accommodate the exponential growth of single-cell data [9, 10] and to provide comprehensive reference catalogs of human cells [11], many efforts have been dedicated to constructing single-cell databases and repositories [9, 11, 12, 13, 14, 15]. These databases vary in purpose and scope: some serve as data repositories for raw/processed data retrieval [11, 12, 14]; some as quick references to cell type compositions and cellular molecular phenotypes across tissues [11, 16, 17]; some summarize published study findings for global cellular queries across tissues or diseases [9, 13, 18]; and others simply web-index published results [19]. The aim of these resources is to provide immediate information sharing among scientific communities and real-time queries of diverse cellular phenotypes, which, in turn, accelerates research progress and creates additional research opportunities.

However, the majority of these databases provide only simple cellular overviews or signature profiles, largely based on single-cell RNA-sequencing (scRNA-seq) data and confined to a limited multi-omics landscape [9, 11, 13, 20]. The need for a database capable of conducting in-depth, real-time queries of several single-cell omics at a time across almost all human tissues has not yet been met. This limitation motivated us to build a one-stop, queryable single-cell multi-omics database on top of constructing the multi-tissue and multi-omics human atlas.

Here, we present the Single Cell Atlas (SCA), a single-cell multi-omics map of human tissues, through a comprehensive characterization of molecular phenotypic variations across 125 healthy adult and fetal tissues and eight omics, including five single-cell (sc) omics modalities, i.e., scRNA-seq [21], scATAC-seq [22], scImmune profiling [23], mass cytometry (CyTOF) [24, 25], and flow cytometry [26, 27]; alongside spatial transcriptomics [28]; and two bulk omics, i.e., RNA-seq [29] and whole-genome sequencing (WGS) [30]. Prior to quality control (QC) filtering, we collected 67,674,775 cells from scRNA-seq, 1,607,924 cells from scATAC-seq, 526,559 clonotypes from scImmune profiling, and 330,912 cells from multimodal scImmune profiling with scRNA-seq, 95,021,025 cells from CyTOF, and 334,287,430 cells from flow cytometry; 13 tissues from spatial transcriptomics; and 17,382 samples from RNA-seq and 837 samples from WGS. We demonstrated through case studies the inter-/intra-tissue and cell-type variabilities in molecular phenotypes between adult and fetal tissues, immune repertoire variations across different T and B cell types in various tissues, and the interplay between multiple omics in adult and fetal colon tissues. We also exemplified the extensive effects of monocyte chemoattractant family ligands (i.e., the CCL family) [31] on interactions between fibroblasts and other cell types, demonstrating their key regulatory role in immune cell recruitment for localized immunity [32, 33].

Construction and content

An overview of the multi-omics healthy human map

We conducted integrative assessments of eight omics types from 125 adult and fetal tissues from published resources and constructed a comprehensive single-cell multi-omics healthy human map termed SCA (Fig. 1). Each tissue consisted of at least two omics types, with the colon having the full spectrum of omics layers, which allowed us to investigate extensively the key mechanisms in each molecular layer of colonic tissue. Organs and tissues with at least five omics layers included colon, blood (whole blood and PBMCs), skin, bone marrow, lung, lymph node, muscle, spleen, and uterus (Additional file 2: Table S1). Overall, the scRNA-seq data set contained the highest number of matching tissues between the adult and fetal groups, which allowed us to study the developmental differences between their cell types. For the scRNA-seq data, the majority of the sample matrices retrieved from published studies had already undergone filtering to eliminate background noise, including low-quality cells that are most probably empty droplets. However, some downloaded samples retained their raw matrix form, which contained a significant amount of background noise. Consequently, before proceeding with any additional QC filtering, we standardized all scRNA-seq data inputs to the filtered matrix format, ensuring that all samples underwent background-noise removal before further processing (Additional file 2: Table S2). This preprocessing step removed 61,774,307 of the original 67,674,775 cells in the downloaded scRNA-seq dataset, leaving 5,900,468 cells for subsequent QC filtering. Strict QC was then carried out to filter debris, damaged cells, low-quality cells, and doublets for single-cell omics data [34], as well as low-quality samples for bulk omics data. After QC filtering, 3,881,472 high-quality cells were obtained for scRNA-seq; 773,190 cells for scATAC-seq; 209,708 cells for multimodal scImmune profiling with scRNA-seq data; 2,278,550 cells for CyTOF; and 192,925,633 cells for flow cytometry data. For scImmune profiling alone, clonotypes with missing CDR3 sequences and amino acid information were filtered out, leaving 167,379 unique clonotypes across 21 tissues in the TCR repertoires and 16 tissues in the BCR repertoires. For RNA-seq and WGS, 163 samples with severe autolysis were removed, leaving 16,704 samples for RNA-seq and 837 for genotyping data.
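The study's exact thresholds are given in the supplementary tables; as a rough illustration of the kind of per-cell QC filtering described above, a minimal Seurat-based sketch is shown below. The object name, count matrix, and cutoffs are illustrative assumptions, not the atlas's actual parameters.

```r
# Minimal scRNA-seq QC sketch (illustrative thresholds, not the study's exact values)
library(Seurat)

# 'counts' is assumed to be a filtered (background-removed) gene x cell count matrix
seu <- CreateSeuratObject(counts = counts, project = "SCA_demo",
                          min.cells = 3, min.features = 200)

# Mitochondrial content as a proxy for damaged or dying cells
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")

# Remove putative debris, damaged cells, and likely doublets with simple cutoffs
seu <- subset(seu,
              subset = nFeature_RNA > 200 &
                       nFeature_RNA < 6000 &
                       percent.mt < 20)
```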

Figure 1.

A multi-omics healthy human single-cell atlas. Circos plot depicting the tissues present in the atlas. Tissues belonging to the same organ were placed under the same cluster and marked with the same color. Circles and stars represent adult and fetal tissues, respectively. The size of a circle or a star indicates the number of its omics data sets present in the atlas. The intensity of the heatmap in the middle of the Circos plot represents the cell count for single-cell omics or the sample count for bulk omics. The bar plots on the outer surface of the Circos represent the number of cell types in the scRNA-seq tissues (in blue) or the number of samples in bulk RNA-seq tissues (in red)

Single-cell RNA-sequencing analysis of adult and fetal tissues revealed cell-type-specific developmental differences

In total, out of the 125 adult and fetal tissues from all omics types, the scRNA-seq molecular layer in the SCA consisted of 92 adult and fetal tissues (Additional file 1: Fig. S1, Additional file 2: Table S1), spanning almost all organs and tissues of the human body. We profiled all cells from the scRNA-seq data and annotated 417 cell types at fine granularity, which we categorized into 17 major cell type classes (Fig. 2A). Comparing across tissues, most contained stromal cells, endothelial cells, monocytes, epithelial cells, and T cells (Fig. 2A). Comparing across the cell type classes, epithelial cells constituted the highest cell count proportions, followed by stromal cells, neurons, and immune cells (Fig. 2A). For adult tissues, most of the cells were epithelial cells, immune cells, and endothelial cells, whereas in fetal tissues, stromal cells, epithelial cells, and hematocytes constituted the largest cell type class proportions. We then carried out integrative assessments of these 92 scRNA-seq tissues (Figs. 2 and 3) to study cellular heterogeneities at different developmental stages.
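Annotation at this scale typically follows a standard normalization, dimensionality reduction, and clustering pass before markers are inspected. A generic Seurat sketch of that pass is given below; the parameter values and marker genes are illustrative assumptions rather than the atlas's exact settings.

```r
library(Seurat)

# 'seu' is the QC-filtered object from the previous step
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu, nfeatures = 2000)
seu <- ScaleData(seu)
seu <- RunPCA(seu, npcs = 30)
seu <- FindNeighbors(seu, dims = 1:30)
seu <- FindClusters(seu, resolution = 1.0)
seu <- RunUMAP(seu, dims = 1:30)

# Clusters are then assigned to fine-grained cell types (and grouped into major
# classes) using canonical markers, e.g. EPCAM for epithelial cells, PTPRC for
# immune cells, PECAM1 for endothelial cells
```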

Figure 2.

scRNA-seq integrative analysis revealed similarity and heterogeneity between adult and fetal tissues. A Clustering of the 417 cell types from scRNA-seq data, consisting of 92 tissues based on their cell type proportion within each tissue group. Cell types were colored based on the cell type class indicated in the legend. The numbers in the bracket represent the cell number within the tissue group. B UMAP of the cells present in the 94 adult and fetal tissues from scRNA-seq data, colored based on their cell type class. C Phylogenetic tree of the adult (left) and fetal (right) cell types. Clustering was performed based on their top regulated genes. The color represents the cell type class. Distinct clusters are outlined in black and labeled

Figure 3.

In-depth assessment of the integrated scRNA-seq further revealed inter-and intra-group similarities between adult and fetal tissues. A Chord diagrams of the highly correlated (AUROC > 0.9) adult and fetal cell types. Each connective line in the middle of the diagrams represents the correlation between two cell types. The color represents the cell type class. B Top receptor-ligand interactions between cell type classes in adult tissues (left) and fetal tissues (right). Color blocks on the outer circle represent the cell type class, and the color in the inner circle represents the receptor (blue) and ligand (red). Arrows indicate the direction of receptor-ligand interactions. C 3D tSNE of the integrative analysis between scRNA-seq and bulk RNA-seq tissues. The colors of the solid dots represent cell types in scRNA-seq data, and the colors of the spheres represent tissues of the bulk data. T indicates the T cell cluster, and B indicates the B cell cluster. D Heatmap showing the top DE genes in each cell type class of the adult and fetal tissues. Scaled expression values were used. Color blocks on the top of the heatmap represent cell type classes. Red arrows indicate the selected cell type classes for subsequent analyses. E Top significant GO BP and KEGG pathways for the cell type classes in adult and fetal tissues. The size of the dots represents the significance level. The color represents the cell type class

For each cell type, we performed differential expression (DE) analysis for each tissue to obtain the DE gene (DEG) signature for each cell type. We assessed the global gene expression patterns between cell types across the tissues based on their upregulated genes (Additional file 2 : Table S3) for adult and fetal tissues (Fig.  2 C, Additional file 1 : Fig. S2). In adult tissues, immune cells (i.e., B, T, monocytes, and NK cells) with hematocytes, stromal cells, neurons, endothelial cells, and epithelial cells formed distinct cellular clusters (Fig.  2 C, Additional file 1 : Fig. S2A), demonstrating highly similar DEG signatures within each of these cell type classes, consistent with the clustering patterns in the previous scRNA-seq atlas [ 35 ]. In fetal tissues, segregation is comparatively less distinctive such that only a subgroup of epithelial cells formed a distinct cell type cluster, cells from the immune cell type classes as well as hematocytes coalesced to form another cluster, and stromal cells formed small clusters between other fetal cell types (Fig.  2 C, Additional file 1 : Fig. S2B), which could represent the similarity in gene expression with other cell types during lineage commitment of stromal cell differentiation [ 36 ].
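One common way to derive such per-cell-type DEG signatures is a one-vs-rest test within each tissue; the hedged Seurat/dplyr sketch below illustrates this, though the study's exact DE framework and thresholds may differ. The metadata column name and cutoffs are assumptions.

```r
library(Seurat)
library(dplyr)

# One-vs-rest differential expression per annotated cell type within one tissue
Idents(seu) <- "cell_type"          # assumes a 'cell_type' metadata column
markers <- FindAllMarkers(seu, only.pos = TRUE,
                          logfc.threshold = 0.25, min.pct = 0.1)

# Keep significantly upregulated genes as each cell type's DEG signature
deg_signatures <- markers %>%
  filter(p_val_adj < 0.05) %>%
  group_by(cluster) %>%
  slice_max(avg_log2FC, n = 100)
```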

We next investigated the underlying gene regulatory network (GRN) of the transcriptional activities of cell types across adult and fetal tissues [ 37 ]. We identified active transcription factors (TFs) detected for cell types within each tissue (AUROC > 0.1), and based on these TF signatures, we measured similarities between cell types for adult and fetal tissues (Additional file 1 : Fig. S3). For adult tissues, clustering patterns similar to Additional file 1 : Fig. S1A were observed (Fig.  2 C, Additional file 1 : Fig. S3A). In fetal tissues, two unique clusters, including immune cells with hematocytes and stromal cells, were observed (Additional file 1 : Fig. S3B). Higher similarity in transcription regulatory patterns of stromal cells was observed compared to their gene expression patterns. The concordance between gene expression and transcription regulatory patterns within adult and fetal tissues demonstrated a direct and uniform interplay between the two molecular activities. In terms of the varying TF and DEG clustering patterns between adult and fetal tissues, the adult cell types demonstrated more similar transcriptional activities within the cell type classes than the less-differentiated fetal cell types, which shared more common transcriptional activities.
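The paper cites SCENIC [37] for this step. As a rough sketch of how per-cell regulon activity scores of the kind thresholded here can be computed, the AUCell package (on which SCENIC builds) ranks each gene set against each cell's expression ranking; the regulon list below is a hypothetical placeholder, not a result from the study.

```r
library(AUCell)

# 'expr_mat' is a genes x cells expression matrix for one tissue;
# 'regulons' is a named list of TF target-gene sets from a GRN inference step
regulons <- list(ELF3_regulon = c("KRT8", "KRT18", "CLDN4"))   # hypothetical example set

rankings <- AUCell_buildRankings(expr_mat)
auc      <- AUCell_calcAUC(regulons, rankings)

# Per-cell AUC scores can then be averaged per cell type and thresholded
# before cell types are compared on their TF activity signatures
auc_mat <- getAUC(auc)
```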

We dissected the correlation pattern of the clusters shown in Fig.  2 C by drawing inferences from their highly correlated (AUROC > 0.9) cell-type pairs (Fig.  3 A). Specifically, for the immune cluster in adult tissues, monocytes accounted for most of the high correlations within the immune cell cluster, followed by T cells (Fig.  3 A). For fetal tissues, a high number of correlations was observed between the immune cells (i.e., mostly monocytes and T cells) and hematocytes (Fig.  3 A), which explained the clustering pattern observed in fetal tissues (Fig.  2 C). For fetal stromal cells, other than with their own cell types, large coexpression patterns were observed with the hematocytes and the epithelial cells, and a smaller proportion of correlations with other clusters (Fig.  3 A), which accounted for the small clusters of stromal cells formed between other cell types (Fig.  2 C, Additional file 1 : Fig. S2B).

To describe possible cellular networking between the cell type class clusters in Fig. 2C, we inferred cell–cell interactions [38] based on their gene expression (Additional file 2: Table S4), and variations between adult and fetal tissues were observed (Fig. 3B). In adult tissues, many cell type classes displayed interactions with the neurons, which networked with epithelial cells through the UNC5D/NTN1 interaction; with stromal cells through SORCS3/NGF; with T cells through LRRC4C/NTNG2; etc. (Fig. 3B). Among the top interactions in fetal tissues, monocytes actively networked with other cells, such as via CCR1/CCL7 with hematocytes, CSF1R/CSF1 with stromal cells, and FPR1/SSA1 with epithelial cells.
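The specific interaction-scoring tool is cited above [38]; as a purely hypothetical minimal alternative for intuition, candidate ligand-receptor pairs can be ranked by the product of the ligand's mean expression in a sender cell type and the receptor's mean expression in a receiver cell type. All names below (`avg_expr`, `lr_pairs`, the cell type labels) are assumptions for illustration.

```r
library(dplyr)

# 'avg_expr': genes x cell-type matrix of average normalized expression
# 'lr_pairs': data frame with columns 'ligand' and 'receptor'
score_lr <- function(avg_expr, lr_pairs, sender, receiver) {
  lr_pairs %>%
    filter(ligand %in% rownames(avg_expr),
           receptor %in% rownames(avg_expr)) %>%
    mutate(score = avg_expr[ligand, sender] * avg_expr[receptor, receiver]) %>%
    arrange(desc(score))
}

# Example: candidate interactions from neurons (sender) to epithelial cells (receiver)
# top_hits <- score_lr(avg_expr, lr_pairs, sender = "Neurons", receiver = "Epithelial cells")
```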

We performed a pseudobulk integrative analysis of the cell types of the scRNA-seq data from 19 tissues found in both adult and fetal tissues, with the 54 tissues from the bulk RNA-seq data (Fig.  3 C) to compare single-cell tissues with the corresponding tissues in the bulk datasets. For cell types of scRNA-seq data, adult cell types formed distinct clusters of T cells, B cells, hematocytes, stromal cells, epithelial cells, endothelial cells, and neurons (Fig.  3 C). Fetal cell types, by comparison, formed a unique cluster of cell types separating themselves from adult cell types. Internally, a gradient of cell types from brain tissues to cell types from the digestive system was observed in this fetal cluster. Fusing the bulk tissue-specific RNA-seq data sets with the pseudobulk scRNA-seq cell types gave close proximities of the bulk brain tissues with the pseudobulk brain-specific cell types, such as neurons and astrocytes (Fig.  3 C). Bulk whole blood clustered with pseudobulk hematocytes, and bulk EBV-transformed lymphocytes clustered with pseudobulk B cells. Other distinctive clusters included bulk colon and small intestine clustered with pseudobulk colon- and small intestine-specific epithelial cells, and bulk heart clustered with pseudobulk cardiomyocytes and other muscle cells (Fig.  3 C).
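A pseudobulk profile of this kind is typically built by summing counts over all cells of each annotated cell type and then normalizing it to the same scale as the bulk data before joint embedding. The sketch below assumes a Seurat object `seu` with a `cell_type` column and a gene x sample bulk count matrix `bulk_counts`; it is an illustration of the aggregation step, not the study's exact pipeline.

```r
library(Seurat)

# Pseudobulk: sum counts over cells of each annotated cell type
counts <- GetAssayData(seu, slot = "counts")
pseudo <- t(rowsum(t(as.matrix(counts)), group = seu$cell_type))

# Bring pseudobulk and bulk RNA-seq onto a comparable scale (log CPM here)
log_cpm  <- function(m) log2(t(t(m) / colSums(m)) * 1e6 + 1)
combined <- cbind(log_cpm(pseudo), log_cpm(bulk_counts))

# 'combined' (genes x [cell types + bulk samples]) can then be embedded with
# PCA/tSNE to compare pseudobulk cell types against bulk tissues, as in Fig. 3C
```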

Next, we conducted gene ontology (GO) biological process (BP) and KEGG pathway analyses [39, 40, 41, 42] of the top upregulated genes of each cell type class cluster (Fig. 3D) found in Fig. 2C. Multiple testing correction for each cell type class was performed using the Benjamini–Hochberg (BH) false discovery rate (FDR) [43]. At 5% FDR and average log2-fold-change > 0.25 (ranked by decreasing fold-change), the top three most significant genes of the remaining cell type classes were each scanned against the phenotypic traits from 442 genome-wide association studies (GWAS) and the UK Biobank [44, 45] to seek significant genotypic associations of the top genes with diseases and traits. Notably, for GO pathways, the most significant BPs for B and T cells in both adult and fetal tissues were similar (Fig. 3E). In contrast, epithelial cells and neurons differed in their associated BPs between adult and fetal tissues. For KEGG pathways, adult and fetal tissues shared common top pathways in T cells and in epithelial cells (Fig. 3E). Among the top genotype–phenotype association results of the top genes (Additional file 1: Fig. S4), SNP rs2239805 in HLA-DRA of adult monocytes has a high-risk association with primary biliary cholangitis, which is consistent with previous studies showing associations of HLA-DRA or monocytes with the disease [46, 47, 48, 49, 50].
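For readers who want to reproduce this style of enrichment on their own signatures, a common (though not necessarily the authors') route in R is clusterProfiler against the GO and KEGG resources cited above. The gene list, thresholds, and object names below are assumptions for illustration.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Upregulated genes of one cell type class (5% BH FDR, log2FC > 0.25)
genes_up <- subset(markers, p_val_adj < 0.05 & avg_log2FC > 0.25)$gene

# GO biological process enrichment with BH correction
ego <- enrichGO(gene          = genes_up,
                OrgDb         = org.Hs.eg.db,
                keyType       = "SYMBOL",
                ont           = "BP",
                pAdjustMethod = "BH")

# KEGG enrichment expects Entrez IDs, so convert gene symbols first
entrez <- bitr(genes_up, fromType = "SYMBOL", toType = "ENTREZID",
               OrgDb = org.Hs.eg.db)$ENTREZID
ekegg  <- enrichKEGG(gene = entrez, organism = "hsa", pAdjustMethod = "BH")
```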

Multimodal analysis of scImmune profiling with scRNA-sequencing in multiple tissues

To decipher the immune landscape at the cell type level in the scImmune profiling data, we carried out an integrative in-depth analysis of the immune repertoires with their corresponding scRNA-seq data. The overall landscape of the cell types mainly included clusters of naïve and memory B cells, naïve T/helper T/cytotoxic T cells, NK cells, monocytes, and dendritic cells (Fig. 4A) and mainly comprised immune repertoires from the blood, cervix, colon, esophagus, and lung (Additional file 1: Fig. S5). On a global scale, we examined clonal expansions [51, 52] in both T and B cells across all tissues. Here, we defined unique clonal types as unique combinations of the VDJ genes of the T cell receptor (TCR) chains (i.e., alpha and beta chains) on T cells and of the immunoglobulin (Ig) chains on B cells. Integrating clonal type information from both the T and B cell repertoires with their scRNA-seq data revealed sites of differential clonal expansion in various cell types (Fig. 4B and C, Additional file 1: Fig. S5). In T cell repertoires, high proportions of large or hyperexpanded clones were found in terminally differentiated effector memory cells re-expressing CD45RA (Temra) CD8 T cells [53, 54] and cytotoxic T cells, and a large proportion of them was found in the lung (Fig. 4C, Additional file 1: Fig. S5), which interplays with the highly immune-regulatory environment of the lungs to defend against pathogen or microbiota infections [55, 56]. MAIT cells [57, 58] also demonstrated large or high expansions across tissues, especially in the blood, colon, and cervix (Additional file 1: Fig. S5A), with their main function being to protect the host from microbial infections and to maintain mucosal barrier integrity [58, 59]. In contrast, single clones were present mostly in naïve helper T cells and naïve cytotoxic T cells (Additional file 1: Fig. S5B) and were distributed almost homogeneously across tissues (Fig. 4C). This observation ensures the availability of high TCR diversity to trigger a sufficient immune response to new pathogens [60]. For the B cell repertoire in blood, most of these immunocytes remained as single clones or small clones, with a small subset of naïve B cells and memory B cells exhibiting medium clonal expansion (Additional file 1: Fig. S5B).
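The study cites scRepertoire [52] for repertoire handling; as a minimal base-R/dplyr sketch of the clonotype definition described above (a VDJ gene combination as the clonotype key, followed by binning cells per clonotype into expansion categories), see below. The column names and bin boundaries are illustrative assumptions.

```r
library(dplyr)

# 'tcr' is assumed to be a per-cell clonotype table with V/D/J/C gene calls
# for the alpha and beta chains
tcr <- tcr %>%
  mutate(clonotype = paste(TRAV, TRAJ, TRAC, TRBV, TRBJ, TRBD, TRBC, sep = "."))

# Count cells per clonotype and bin into expansion categories
expansion <- tcr %>%
  count(clonotype, name = "n_cells") %>%
  mutate(expansion_group = cut(n_cells,
                               breaks = c(0, 1, 5, 20, 100, Inf),
                               labels = c("Single", "Small", "Medium",
                                          "Large", "Hyperexpanded")))
```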

Figure 4.

Multi-modal analysis of scImmune profiling with scRNA-seq revealed a clonotype expansion landscape in six tissues. A tSNE of cell types from the multi-modal tissues of the scImmune-profiling data. Colors represent cell types. Cell clusters were outlined and labeled. B tSNE of cell types from the multi-modal tissues of the scImmune-profiling data. Colors indicate clonal-type expansion groups of the cells. Cells not present in the T or B repertoires are shown in gray (NA group). C Stacked bar plots revealing the clonal expansion landscapes of the T and B cell repertoires across 6 tissues. Colors represent clonal type groups. D Alluvial plot showing the top clonal types in T cell repertoires and their proportions shared across the cell types. Colors represent clonotypes. E Alluvial plot showing the top clonal types in B cell repertoires and their proportions shared across the cell types. Colors represent clonotypes

Among the top clones (Fig.  4 D), TRAV17.TRAJ49.TRAC_TRBV6-5.TRBJ1-1.TRBD1.TRBC1 was present mostly in Temra CD8 T cells and shared the same clonal type sequence with cytotoxic T and helper T cells (Additional file 2 : Table S5). This top clone was found to be highly represented in the lung, and comparatively, other large clones of CD8 T cells were found in the blood (Additional file 1 : Fig. S5C). The top ten clones were found in Temra CD8 T cells of blood and lung tissues and cytotoxic T cells and helper T cells from blood, cervix, and lung tissues (Additional file 1 : Fig. S5C). Some of them exhibited a high prevalence of cell proportions in Temra CD8 T cells (Fig.  4 D). In the B cell repertoire of blood, the top clones were found only in naïve and memory B cells, with similar proportions for each of the top clones (Fig.  4 E).

Multi-omics analysis of colon tissues across five omics data sets

To examine the phenotypic landscapes across omics layers and the interplays and transitions between them, we carried out an interrogative analysis of colon tissue across five omics data sets, including scRNA-seq, scATAC-seq, spatial transcriptomics, RNA-seq, and WGS. In the overview of the transcriptome landscapes in adult and fetal colons (Fig. 5A and B), the adult colon consisted of a large proportion of immune cells (such as B cells, T cells, and macrophages) and epithelial cells (such as mucin-secreting goblet cells and enterocytes) (Fig. 5A). In contrast, the fetal colon contained a substantial proportion of mesenchymal stem cells (MSCs), fibroblasts, smooth muscle cells, neurons, and enterocytes and a very small proportion of immune cells (Fig. 5B).

Figure 5.

In-depth scRNA-seq analysis revealed distinct variations between adult and fetal colons. A tSNE of the adult colon; colors represent cell types. B tSNE of the fetal colon; colors represent cell types. C Heatmap showing the correlations of the cell types of the MSC lineage from adult and fetal colons based on their top upregulated genes. The intensity of the heatmap shows the AUROC level between cell types. Color blocks on the top of the heatmap represent classes (first row from the top), cell types (second row), and cell type classes (third row). D Heatmap showing the correlations of the cell types of the MSC lineage from adult and fetal colons based on the expression of the TFs. The intensity of the heatmap shows the AUROC level between cell types. Color blocks on the top of the heatmap represent classes (first row from the top), cell types (second row), and cell type classes (third row). E Pseudotime trajectory of the MSC lineage in the adult colon. The color represents the cell type, and the violin plots represent the density of cells across pseudo-time. F Pseudo-time trajectory of the MSC lineage in the fetal colon. The color represents the cell type, and the violin plots represent the density of cells across pseudotime. G Heatmap showing the pseudotemporal expression patterns of TFs in the lineage transition of MSCs to enterocytes in both adult and fetal colons. Intensity represents scaled expression data. The top 25 TFs for MSCs or their differentiated cells are labeled. H Pseudotemporal expression transitions of the top TFs in the MSC-to-enterocyte transitions for both adult and fetal colons. I Heatmap showing the pseudotemporal expression patterns of TFs in the lineage transition of MSCs to fibroblasts in both adult and fetal colons. Intensity represents scaled expression data. The top 25 TFs for MSCs or their differentiated cells are labeled. J Pseudotemporal expression transitions of the top TFs in the MSC-to-fibroblast transitions for both adult and fetal colons

As there were fewer immune cells in the fetal colon than in the adult colon, we compared the MSC-lineage cell types between the two groups. Based on their differential gene expression signatures (Fig. 5C) and their TF expression (Fig. 5D), the highly specialized columnar epithelial cells, enterocytes, correlated well between adult and fetal colons in both molecular layers, unlike other cell types, which did not demonstrate high correlations between their adult and fetal cells. Other than the enterocytes, adult and fetal fibroblasts were highly similar to MSCs in both transcriptomic and regulatory patterns (Fig. 5C and D). We modeled pseudo-temporal transitions of MSC-lineage cells, and similar phenomena were observed (Fig. 5E and F). Both adult and fetal fibroblasts were pseudotemporally closer to MSCs, and their transitions occurred much earlier than those of other cells. Analysis across regulatory, gene expression, and pseudotemporal patterns showed that, in both adult and fetal colons, fibroblasts were phenotypically more similar to MSCs, as shown in prior literature reports [61, 62, 63] and recently with therapeutic implications [64, 65]. In addition, transient phases of cells along the MSC-lineage trajectory were observed for enterocytes and goblet cells (Fig. 5E and F), which demonstrated that these high-plasticity cells were at different cell-state transitions before their full maturation, as evident in the literature [66, 67]. By contrast, the fetal intestine was more primitive than the adult intestine during development, and as a key cell type in extracellular matrix (ECM) construction [68], fetal fibroblasts displayed transitional cell stages along the pseudotime trajectory (Fig. 5F).
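The trajectory tool used for these pseudo-temporal orderings is not restated here; as one common way to root and order an MSC lineage of this kind, a hedged Slingshot sketch is shown below. The embedding name, cluster column, and root label are assumptions for illustration rather than the study's settings.

```r
library(slingshot)
library(SingleCellExperiment)

# 'sce' is a SingleCellExperiment of the MSC-lineage cells, with a UMAP embedding
# in reducedDims(sce)$UMAP and cell type labels in colData(sce)$cell_type
sce <- slingshot(sce,
                 clusterLabels = sce$cell_type,
                 reducedDim    = "UMAP",
                 start.clus    = "MSC")      # root the trajectory at MSCs (assumed label)

# Per-cell pseudotime along each inferred lineage (e.g. MSC -> enterocyte)
pt <- slingPseudotime(sce)
```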

Comparing regulatory elements of these transitions demonstrated both similarities and differences (Fig. 5G–J, Additional file 1: Fig. S6). For MSC-to-enterocyte transitions (Fig. 5G, Additional file 2: Table S6), the leading TFs with significant pseudotemporal changes were labeled. The expression of E74-like ETS transcription factor 3 (ELF3), which belongs to the epithelium-specific ETS (ESE) subfamily [69], increased during the transition for both adult and fetal enterocytes (Fig. 5H, Additional file 2: Table S6); ELF3 has previously been shown to be important in intestinal epithelial differentiation during embryonic development in mice [69, 70]. Conversely, high mobility group box 1 (HMGB1) [71] decreased pseudotemporally for both adult and fetal enterocytes (Fig. 5H, Additional file 2: Table S6) and has been shown to inhibit enterocyte migration [72]. The nuclear orphan receptor NR2F6, a non-redundant negative regulator of adaptive immunity [73, 74], displayed a comparative decline in expression halfway through the pseudotime transition for adult enterocytes but continued to increase for fetal enterocytes (Fig. 5H, Additional file 2: Table S6). Another TF from the ETS family, Spi-B transcription factor (SPIB), also showed differential expression during the transition between adult and fetal enterocytes (Fig. 5H); it was up-regulated in fetal enterocytes and down-regulated in adult enterocytes, suggesting a potential bi-functional role in enterocyte differentiation during the fetal-to-adult transition.

For MSC-to-fibroblast transitions (Fig. 5I, Additional file 2: Table S6), TFs such as ARID5B, FOS, FOSB, JUN, and JUNB displayed almost identical trajectory patterns between adult and fetal fibroblasts (Fig. 5J, Additional file 2: Table S6). Of these TFs, FOS, FOSB, JUN, and JUNB have been shown to be absent from healthy mucosa transcriptional networks [75], in line with the observations in Fig. 5J. By contrast, Bcl-2-associated transcription factor 1 (BCLAF1) was pseudotemporally up-regulated in fetal fibroblasts but down-regulated in adult fibroblasts. Prior studies showed that knocking out BCLAF1 is embryonic lethal [76, 77] and yet the gene could be oncogenic in colon cancer [78], which could explain its different trajectories in fetal and adult fibroblasts. Other cell types also displayed varying degrees of similarities and differences (Additional file 1: Fig. S5, Additional file 2: Table S6).

In the scATAC-seq data, we examined the contributions of cis-regulatory elements in the adult colon. We identified differentially accessible (DA) peaks for cell clusters and identified the genes closest to these DA peak regions. Cell type identities were postulated based on the gene activities of the scATAC-seq data (GSEA) [79, 80] (Fig. 6A). Common cell types were detected in scATAC-seq and scRNA-seq (Figs. 5A and 6A). We performed sequence motif analysis to detect regulatory sequences unique to each cell type based on their leading DA peaks; among the top enriched motifs, many of the myocyte enhancer factors, such as MEF2B, MEF2C, and MEF2D from cells such as smooth muscle cells and pericytes, were found to be significantly enriched (Fig. 6B) and were also up-regulated in the scRNA-seq findings shown earlier (Additional file 2: Table S6).
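The "closest gene" assignment described above can be illustrated with GenomicRanges, independent of whichever peak-calling and motif tools were actually used. The sketch below assumes two GRanges objects (`da_peaks`, `genes`) and a `gene_name` metadata column, all hypothetical names.

```r
library(GenomicRanges)

# 'da_peaks': differentially accessible peaks for one cell cluster (GRanges)
# 'genes'   : annotated gene (e.g. TSS) coordinates with a 'gene_name' column (GRanges)
hits <- distanceToNearest(da_peaks, genes)

peak_to_gene <- data.frame(
  peak     = as.character(da_peaks[queryHits(hits)]),
  gene     = genes$gene_name[subjectHits(hits)],
  distance = mcols(hits)$distance
)
# Each DA peak is thereby linked to its nearest annotated gene for interpretation
```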

Figure 6.

Multi-omics analysis of adult and fetal colon tissues revealed distinct variations between adults and fetuses as well as across omics. A UMAP of cell types present in the scATAC-Seq of the adult colon. Colors represent cell types. B Top enriched motif sequences in cell types of the adult colon scATAC-Seq data. C , D Spatial transcriptomic profiles of adult colon sample 1 ( C ) and sample 2 ( D ). The top TFs were selected, and their spatial expressions were mapped onto the slide images. E , F Top receptor-ligand interactions between cell type classes in colon 1 ( E ) and colon 2 ( F ) of the spatial transcriptomics data. Color blocks on the outer circle represent the cell type class, and the color in the inner circle represents the receptor (blue) and ligand (red). Arrows indicate the direction of receptor-ligand interactions. G , H Top receptor-ligand interactions between cell type classes in the adult colon ( G ) and fetal colon ( H ) of the scRNA-seq data. Color blocks on the outer circle represent the cell type class, and the color in the inner circle represents the receptor (blue) and ligand (red). Arrows indicate the direction of receptor-ligand interactions

We examined the physical landscape of the leading TFs (found in scRNA-Seq and scATAC-Seq) in spatial transcriptomics data from two adult colons [ 5 ]. TFs ELF3 and NR2F6 were expressed generally in many locations in colonic tissue and displayed similar expression patterns for both of the adult colons (Fig.  6 C and D), consistent with significant up-regulation in almost all MSC lineage cell types in the pseudotemporal transitions (Additional file 2 : Table S6). In contrast, SPIB was not up-regulated in general, while displaying higher expression in B cells (Fig.  6 C and D), consistent with its role in adaptive immunity, as previously discussed. For other leading TFs, such as BCLAF1, EPAS1, and PLAG1, there were no clear discrete patterns of expression among the cell types.
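Mapping a TF's expression onto the tissue section, as described above, is a routine visualization step; a minimal Seurat sketch is shown below. The object name and the assumption that the Visium sample was loaded and normalized beforehand are illustrative.

```r
library(Seurat)

# 'colon_spatial' is assumed to be a Seurat object built from a 10x Visium run
# (e.g. via Load10X_Spatial) and already normalized
SpatialFeaturePlot(colon_spatial, features = c("ELF3", "NR2F6", "SPIB"))
```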

To examine how cells interact with one another in spatial transcriptomics of the adult colon, we performed receptor-ligand interaction analysis [ 38 ]. Leading interactions included VIP/VIPR2 and ADCYAP1/VIPR2 interactions between neurons and fibroblasts, the NCAM1/GFRA1 interaction between neuronal cells, as well as LTB/CD40 and LY86/CD180 interactions between B cells (Fig.  6 E, Additional file 2 : Table S7). In colon 2, leading interactions occurred between the B cells and between the B cells and enterocytes or fibroblasts. These included LTB/CD40, APOE/LRP8, LY86/CD180, and VCAM1/ITGB7 between B cells; APOE/VLDLR between B cells (APOE) and enterocytes (VLDLR); and CXCL12/CXCR4, FN1/CD79A, CD34/SELL, and ICAM2/ITGAL between fibroblasts and B cells (Fig.  6 F, Additional file 2 : Table S7).

The same type of analysis was performed on the scRNA-seq data from both adult and fetal colons. In the adult colon scRNA-seq data (Fig. 6G), fibroblasts comprised the leading interactions with cells such as CD8 T cells (CCL8-ACKR2), other fibroblasts (CCL13-CCR9), goblet cells (CCL13-CCR3), and mast cells (PROC-PROCR). In the fetal colon, leading interaction pairs were derived mostly from fibroblasts and macrophages with other cells (Fig. 6H, Additional file 2: Table S7), including C4BPA-CD40 between fibroblasts (C4BPA) and endothelial cells (CD40); CCL24-CCR2 between neuronal cells (CCL24) and macrophages (CCR2); CCL13-CCR1 and MUC7-SELL between goblet cells (CCL13 and MUC7) and macrophages (CCR1 and SELL); and IL21-IL21R between smooth muscle cells (IL21) and macrophages (IL21R). In the scRNA-seq data of both adult and fetal colons, the active interactions of fibroblasts with other cells via CCL-family ligand-receptor pairs suggest a key regulatory role for fibroblasts in immune cell recruitment in the colon (through the active interaction and activation of monocyte chemoattractants, i.e., the CCL family), consistent with prior publications [32, 33].

Comparing the two omics data sets, both colon samples from the spatial transcriptomics data shared leading interactions with the scRNA-seq data from adult and fetal colons (Additional file 2: Table S7). Between spatial colon 1 and the scRNA-seq fetal colon, common interaction pairs were found between neuronal cells, between enterocytes and neurons, and between neurons and fibroblasts (Additional file 2: Table S7). For spatial colon 2, 25 of its 95 top unique interactions were shared with the scRNA-seq adult colon, and 10 were shared with the scRNA-seq fetal colon (Additional file 2: Table S7). For the scRNA-seq adult colon, 445 of its 852 top unique interactions were found in the scRNA-seq fetal colon; examples include CLEC3A-CLEC10A interactions between macrophages (CLEC10A) and enterocytes, goblet cells, or smooth muscle cells (CLEC3A), as well as between macrophages themselves. Among the four groups, the scRNA-seq fetal colon shared the greatest number of cell-type-specific interactions with the other three (Additional file 2: Table S7).

At 1% BH FDR and log2FC > 0.25 for the bulk RNA-seq data of the adult transverse colon, we compared the upregulated genes with the top genes in scRNA-seq and with the top genes from expression quantitative trait loci (eQTL) analysis (eGenes) and splicing QTL (sQTL) analysis (sGenes) of the WGS data from the corresponding transverse colon samples (Additional file 1: Fig. S6). Comparing the top 10 eGenes and sGenes, no common genes were found (Additional file 1: Figs. S7A and S7B). Comparing the overlapping patterns of bulk transcriptomics and scRNA-seq data, there was a much higher number of overlaps of scRNA-seq with eGenes and sGenes than of bulk RNA-seq (Additional file 1: Fig. S7C). We grouped the overlapping genes according to their cell types in scRNA-seq (Additional file 1: Fig. S7D). In particular, the proportions of goblet cells and enterocytes among the eGene overlaps were similar between bulk RNA-seq and scRNA-seq. Similar phenomena were observed for sGenes (Additional file 1: Fig. S7D).
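The overlap comparison itself reduces to simple set intersections once the four gene lists are in hand; a base-R sketch follows, with all four vector names assumed for illustration.

```r
# Hypothetical gene-set overlap check between omics layers:
# 'bulk_up', 'sc_up', 'egenes', and 'sgenes' are assumed character vectors of
# gene symbols from bulk RNA-seq DE, scRNA-seq DE, eQTL eGenes, and sQTL sGenes
overlaps <- list(
  bulk_vs_eGenes = intersect(bulk_up, egenes),
  bulk_vs_sGenes = intersect(bulk_up, sgenes),
  sc_vs_eGenes   = intersect(sc_up,   egenes),
  sc_vs_sGenes   = intersect(sc_up,   sgenes)
)
lengths(overlaps)   # number of shared genes per comparison
```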

Utility and discussion

User interface (UI) overview

SCA offers an intuitive, user-friendly interface designed to facilitate seamless navigation and efficient phenotype retrieval by researchers across eight single-cell and bulk omics from 125 healthy adult and fetal tissues. Designed with a focus on user experience, the UI provides simple navigation through the complex layers of multi-omics, multi-tissue resources. An overview of the SCA UI follows:

  • Home Page: The landing page of the database and gateway to the comprehensive features of the SCA, offering users a starting point to dive into the wealth of multi-omics data.
  • About: Offers a thorough description of the portal, complemented by an introductory video summarizing the key features of the database to guide new users.
  • Overview: Highlights the diversity of omics data available, providing a snapshot of the various omics types and summarizing key information about each.
  • Atlas: Features interactive representations of human adult and fetal anatomies and serves as a gateway for users to explore each tissue in depth, with detailed phenotypes specific to each tissue and its corresponding omics.
  • Query: While the Atlas tab showcases comprehensive features of each tissue, the Query tab is dedicated to exploring key phenotypic features across all tissues for different omics types, such as regulon search, receptor-ligand interactions, and clonotype abundance.
  • Demo: Offers a comprehensive walkthrough of the database, using the adult transverse colon tissue as an illustrative example, to demonstrate the capabilities of the platform and how users can extract meaningful insights.
  • Analyze: Provides an extensive suite of tools tailored to assist users in performing single-cell analyses across a wide array of omics, along with rapid plotting tools that allow customizable plots to be created quickly and efficiently.
  • Download: Provides the option for batch downloads, enabling users to conveniently download the data used within the database based on their specific selections.
  • Sources: Offers detailed information about the origins of the raw data used to construct the database, ensuring transparency and trust in the data provided.
  • Discussion: Facilitates a collaborative community space where users can interact, offer assistance, pose questions, and share feedback and suggestions, enhancing the collective utility of the platform.
  • News: Keeps users informed about the latest updates, additions, and enhancements to the database, ensuring the SCA community stays abreast of new developments.

Intended uses of the database and envisioned benefits

SCA is crafted to serve as a comprehensive resource in the burgeoning field of single-cell and multi-omics research. Its primary intention is to facilitate a deeper understanding of the cellular complexity and diversity inherent in healthy adult and fetal tissues through simultaneous exploration of multiple omics. Beyond this, SCA aims to serve as a robust analysis platform to support post-quantification analysis of high-throughput single-cell sequencing data. As such, researchers can leverage SCA for comparative studies, hypothesis generation, and validation purposes. The integration of multi-omics data facilitates a deeper understanding of cellular mechanisms, potentially accelerating discoveries in developmental biology and the identification of therapeutic targets.

Explicitly, SCA enables scientists to quickly derive insights that would otherwise require extensive time and resources to uncover, thereby speeding up the cycle of hypothesis, experimentation, and conclusion. The database will significantly enhance data accessibility and integration, allowing researchers to easily combine data from different omics types and tissues to obtain a holistic view of cellular functions. This integrative approach is crucial for understanding complex biological systems and for developing comprehensive models of human health and disease. By cataloging cellular characteristics across a range of tissues and conditions, SCA empowers precision medicine initiatives. It provides a detailed cellular context for phenotypic variations and potential markers at the single-cell level, with bulk-level data for comparative assessments, supporting the development of potential personalized treatment plans based on cellular profiles.

SCA fosters a collaborative research environment by providing a common platform for scientists from diverse backgrounds with research specialties across tissues, diseases, and omics analysis. It encourages interdisciplinary approaches, connecting researchers from diverse fields and promoting the exchange of knowledge and methodologies. This collaborative ethos is expected to drive forward innovations in research and technology.

Benchmarking with existing databases

Here, we evaluated our SCA database against other existing databases [9, 11, 13, 20, 81], emphasizing the distinctive attributes that make SCA stand out (Additional file 2: Table S8). SCA integrates eight distinct omics types, surpassing the scope of the Single Cell Portal (SCP) [20], Human Cell Atlas (HCA) [11], GTEx Portal [81], DISCO [9], and PanglaoDB [13] in providing a wide-ranging multi-omics platform for exhaustive single-cell omics research. Data are publicly accessible for all these platforms, except for the GTEx Portal, which encompasses both public and protected datasets (Additional file 2: Table S8). SCA is noteworthy for its extensive coverage of eight single-cell and bulk omics over 125 differentiated tissues, establishing a significant lead over the other portals in terms of omics types. Furthermore, SCA sets a new standard with its capabilities: beyond the typical representations of cell type proportions and the visualization of basic features in cell types, features such as cell–cell interactions, transcription factor activities, visualization of regulon modules, motif enrichments, clonotype abundance, and detailed repertoire profiles are notably limited or absent in SCP, HCA, DISCO, and PanglaoDB. SCA is also the sole provider of specialized queries targeting various phenotypes across multiple omics (Additional file 2: Table S8). This specificity of analysis remains unparalleled when juxtaposed with the other databases in our comparative cohort. Ultimately, SCA stands out as a premier, all-encompassing resource for the omics research community.

Future development and maintenance

In an effort to ensure the platform remains relevant, up-to-date, and increasingly valuable to the broad spectrum of researchers, we will be implementing annual updates. These will incorporate findings from newly published studies and novel phenotypic analyses gathered over the year. As we strive to continually enrich our platform, these updates will address gaps in tissue representation for each omics type, and simultaneously expand the sample size within each tissue. Our commitment to transparency and traceability is reflected in our approach to versioning. We will systematically denote improvements to the database, including new features and datasets, in an accessible point-form format. Updates will be marked by adjustments to the database accession number, with the current version designated as SCA V1.0.0. In addition to serving as a resource for data and phenotypic features, our ultimate aim is for SCA to function as a user-friendly platform, facilitating rapid access to multi-omics data resources and enabling cross-comparison of user datasets with our own.

Conclusions

Our study establishes a comprehensive evaluation of the healthy human multi-tissue and multi-omics landscape at the single-cell level, culminating in the construction of a multi-omics human map and its accompanying web-based platform SCA. This innovative platform streamlines the delivery of multi-omics insights, potentially reducing costs and accelerating research by obviating the need for extensive data consolidation. The big data framework of SCA facilitates the exploration of a broad spectrum of phenotypic features, offering a more representative snapshot of the study population than traditional single omics or bulk analysis could achieve. This multi-omics approach is poised to be instrumental in unraveling the complexities of multidimensional biological systems, offering a holistic perspective that enhances our understanding of biological phenomena.

Despite its robust capabilities, SCA faces challenges associated with the technological limitations of flow cytometry and CyTOF modalities, which restrict the number of detectable proteins. These constraints complicate the integration of data from different studies. We have consciously chosen not to pursue the imputation of expression values across these datasets due to concerns about reliability. Moving forward, we aim to refine tissue stratification within the portal by introducing more detailed sample classifications, such as sampling sites, age groups, genders across tissues, and for fetal tissues, different developmental stages. This advancement depends on the acquisition of comprehensive data to support more precise and accurate analyses.

SCA is designed not only as a database but as a catalyst for a paradigm shift towards a multi-omics-focused research approach. It encourages the scientific community to embrace a multi-omics perspective in their research, facilitating the generation of new hypotheses and the discovery of novel insights. This platform is expected to foster an environment rich in intellectual exploration, propelling forward the development of groundbreaking research trajectories. In essence, SCA emerges as a pioneering open-access, single-cell multi-omics atlas, offering an in-depth view of healthy human tissues across a wide array of omics disciplines and 125 diverse adult and fetal tissues. It unlocks new avenues for exploration in multi-omics research, positioning itself as a vital tool in advancing our understanding of life sciences. SCA is set to become an invaluable asset in the research community, significantly contributing to advancements in biology and medicine by facilitating a deeper comprehension of complex biological systems.

Availability of data and materials

This paper used and analyzed publicly available data sets; their resource references are available at http://www.singlecellatlas.org . Code used for the construction of the database, data analysis, and visualization has been deposited on GitHub at https://github.com/eudoraleer/sca under the MIT License [82] and is also archived on Zenodo at https://zenodo.org/records/10906053 [83]. Web-based platforms hosting the interactive atlas and database queries are available at https://www.singlecellatlas.org .

Aldridge S, Teichmann SA. Single cell transcriptomics comes of age. Nat Commun. 2020;11:4307.


Zhu C, Preissl S, Ren B. Single-cell multimodal omics: the power of many. Nat Methods. 2020;17:11–4.


Mimitou EP, Lareau CA, Chen KY, Zorzetto-Fernandes AL, Hao Y, Takeshima Y, Luo W, Huang T-S, Yeung BZ, Papalexi E, et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat Biotechnol. 2021;39:1246–58.

Li X. Harnessing the potential of spatial multiomics: a timely opportunity. Signal Transduct Target Ther. 2023;8:234.


Fawkner-Corbett D, Antanaviciute A, Parikh K, Jagielowicz M, Gerós AS, Gupta T, Ashley N, Khamis D, Fowler D, Morrissey E, et al. Spatiotemporal analysis of human intestinal development at single-cell resolution. Cell. 2021;184:810-826.e823.

Miao Z, Humphreys BD, McMahon AP, Kim J. Multi-omics integration in the age of million single-cell data. Nat Rev Nephrol. 2021;17:710–24.

Chappell L, Russell AJC, Voet T. Single-Cell (Multi)omics Technologies. Annu Rev Genomics Hum Genet. 2018;19:15–41.

Li H, Qu L, Yang Y, Zhang H, Li X, Zhang X. Single-cell transcriptomic architecture unraveling the complexity of tumor heterogeneity in distal cholangiocarcinoma. Cell Mol Gastroenterol Hepatol. 2022;13(1592–1609): e1599.


Li M, Zhang X, Ang KS, Ling J, Sethi R, Lee NYS, Ginhoux F, Chen J. DISCO: a database of Deeply Integrated human Single-Cell Omics data. Nucleic Acids Res. 2022;50:D596-d602.

Pan L, Mou T, Huang Y, Hong W, Yu M, Li X. Ursa: A comprehensive multiomics toolbox for high-throughput single-cell analysis. Mol Biol Evol. 2023;40(12):msad267.

Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, et al. The Human Cell Atlas. eLife. 2017;6:e27041.


Clough E, Barrett T. The gene expression omnibus database. Statistical Genomics: Methods and Protocols. 2016:93–110.

Franzén O, Gan L-M, Björkegren JLM. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database. 2019;2019.

Cummins C, Ahamed A, Aslam R, Burgin J, Devraj R, Edbali O, Gupta D, Harrison PW, Haseeb M, Holt S, et al. The European Nucleotide Archive in 2021. Nucleic Acids Res. 2022;50:D106-d110.

Pan L, Shan S, Tremmel R, Li W, Liao Z, Shi H, Chen Q, Zhang X, Li X. HTCA: a database with an in-depth characterization of the single-cell human transcriptome. Nucleic Acids Res. 2022;51:D1019–28.


Elmentaite R, Domínguez Conde C, Yang L, Teichmann SA. Single-cell atlases: shared and tissue-specific cell types across human organs. Nat Rev Genet. 2022;23:395–410.

Quake SR. A decade of molecular cell atlases. Trends Genet. 2022.

Zeng J, Zhang Y, Shang Y, Mai J, Shi S, Lu M, Bu C, Zhang Z, Zhang Z, Li Y, et al. CancerSCEM: a database of single-cell expression map across various human cancers. Nucleic Acids Res. 2022;50:D1147-d1155.

Ner-Gaon H, Melchior A, Golan N, Ben-Haim Y, Shay T. JingleBells: A Repository of Immune-Related Single-Cell RNA-Sequencing Datasets. J Immunol. 2017;198:3375–9.

Tarhan L, Bistline J, Chang J, Galloway B, Hanna E, Weitz E. Single Cell Portal: an interactive home for single-cell genomics data. bioRxiv. 2023.

Kolodziejczyk Aleksandra A, Kim JK, Svensson V, Marioni John C, Teichmann Sarah A. The Technology and Biology of Single-Cell RNA Sequencing. Mol Cell. 2015;58:610–20.

Schwartzman O, Tanay A. Single-cell epigenomics: techniques and emerging applications. Nat Rev Genet. 2015;16:716–26.

Gomes T, Teichmann SA, Talavera-López C. Immunology Driven by Large-Scale Single-Cell Sequencing. Trends Immunol. 2019;40:1011–21.

Cheung RK, Utz PJ. CyTOF—the next generation of cell detection. Nat Rev Rheumatol. 2011;7:502–3.

Spitzer MH, Nolan GP. Mass cytometry: single cells, many features. Cell. 2016;165:780–91.


Tian Y, Carpp LN, Miller HER, Zager M, Newell EW, Gottardo R. Single-cell immunology of SARS-CoV-2 infection. Nat Biotechnol. 2022;40:30–41.

McKinnon KM. Flow cytometry: an overview. Curr Protoc Immunol. 2018;120:5.1.1–5.1.11.

Rao A, Barkley D, França GS, Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596:211–20.

Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019;20:631–56.

Ng PC, Kirkness EF. Whole Genome Sequencing. In: Barnes MR, Breen G, editors. Genetic Variation: Methods and Protocols. Totowa, NJ: Humana Press; 2010. p. 215–26.


Hughes CE, Nibbs RJB. A guide to chemokines and their receptors. Febs j. 2018;285:2944–71.

Stadler M, Pudelko K, Biermeier A, Walterskirchen N, Gaigneaux A, Weindorfer C, Harrer N, Klett H, Hengstschläger M, Schüler J, et al. Stromal fibroblasts shape the myeloid phenotype in normal colon and colorectal cancer and induce CD163 and CCL2 expression in macrophages. Cancer Lett. 2021;520:184–200.

Davidson S, Coles M, Thomas T, Kollias G, Ludewig B, Turley S, Brenner M, Buckley CD. Fibroblasts as immune regulators in infection, inflammation and cancer. Nat Rev Immunol. 2021;21:704–17.

Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, Lee MJ, Wilk AJ, Darby C, Zager M, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573-3587.e3529.

Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, Chen H, Wang J, Tang H, Ge W, et al. Construction of a human cell landscape at single-cell level. Nature. 2020;581:303–9.

Kariminekoo S, Movassaghpour A, Rahimzadeh A, Talebi M, Shamsasenjan K, Akbarzadeh A. Implications of mesenchymal stem cells in regenerative medicine. Artificial Cells, Nanomedicine, and Biotechnology. 2016;44:749–57.

Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, et al. SCENIC: single-cell regulatory network inference and clustering. Nat Methods. 2017;14:1083–6.

Cillo AR, Kürten CHL, Tabib T, Qi Z, Onkar S, Wang T, Liu A, Duvvuri U, Kim S, Soose RJ, et al. Immune Landscape of Viral- and Carcinogen-Driven Head and Neck Cancer. Immunity. 2020;52:183-199.e189.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47–e47.

Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 2021;49:D545-d551.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.

The Gene Ontology resource. enriching a GOld mine. Nucleic Acids Res. 2021;49:D325-d334.

Article   Google Scholar  

Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc: Ser B (Methodol). 1995;57:289–300.

Staley JR, Blackshaw J, Kamat MA, Ellis S, Surendran P, Sun BB, Paul DS, Freitag D, Burgess S, Danesh J, et al. PhenoScanner: a database of human genotype-phenotype associations. Bioinformatics. 2016;32:3207–9.

Kamat MA, Blackshaw JA, Young R, Surendran P, Burgess S, Danesh J, Butterworth AS, Staley JR. PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics. 2019;35:4851–3.

Ballardini G, Bianchi F, Doniach D, Mirakian R, Pisi E, Bottazzo G. ABERRANT EXPRESSION OF HLA-DR ANTIGENS ON BILEDUCT EPITHELIUM IN PRIMARY BILIARY CIRRHOSIS: RELEVANCE TO PATHOGENESIS. The Lancet. 1984;324:1009–13.

Hirschfield GM, Liu X, Xu C, Lu Y, Xie G, Lu Y, Gu X, Walker EJ, Jing K, Juran BD, et al. Primary Biliary Cirrhosis Associated with HLA, IL12A, and IL12RB2 Variants. N Engl J Med. 2009;360:2544–55.

Peng A, Ke P, Zhao R, Lu X, Zhang C, Huang X, Tian G, Huang J, Wang J, Invernizzi P, et al. Elevated circulating CD14(low)CD16(+) monocyte subset in primary biliary cirrhosis correlates with liver injury and promotes Th1 polarization. Clin Exp Med. 2016;16:511–21.

Chen Y-Y, Arndtz K, Webb G, Corrigan M, Akiror S, Liaskou E, Woodward P, Adams DH, Weston CJ, Hirschfield GM. Intrahepatic macrophage populations in the pathophysiology of primary sclerosing cholangitis. JHEP Reports. 2019;1:369–76.

Olmos JM, García JD, Jiménez A, de Castro S. Impaired monocyte function in primary biliary cirrhosis. Allergol Immunopathol (Madr). 1988;16:353–8.

Britanova OV, Putintseva EV, Shugay M, Merzlyak EM, Turchaninova MA, Staroverov DB, Bolotin DA, Lukyanov S, Bogdanova EA, Mamedov IZ, et al. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J Immunol. 2014;192:2689–98.

Borcherding N, Bormann NL, Kraus G. scRepertoire: An R-based toolkit for single-cell immune receptor analysis. F1000Research. 2020;9.

Larbi A, Fulop T. From “truly naïve” to “exhausted senescent” T cells: When markers predict functionality. Cytometry A. 2014;85:25–35.

Article   PubMed   Google Scholar  

Lee S-W, Choi HY, Lee G-W, Kim T, Cho H-J, Oh I-J, Song SY, Yang DH, Cho J-H. CD8<sup>+</sup> TILs in NSCLC differentiate into TEMRA via a bifurcated trajectory: deciphering immunogenicity of tumor antigens. J Immunother Cancer. 2021;9: e002709.

Chen K, Kolls JK. T Cell-Mediated Host Immune Defenses in the Lung. Annu Rev Immunol. 2013;31:605–33.

Mowat AM, Agace WW. Regional specialization within the intestinal immune system. Nat Rev Immunol. 2014;14:667–85.

Godfrey DI, Koay H-F, McCluskey J, Gherardin NA. The biology and functional importance of MAIT cells. Nat Immunol. 2019;20:1110–28.

Nel I, Bertrand L, Toubal A, Lehuen A. MAIT cells, guardians of skin and mucosa? Mucosal Immunol. 2021;14:803–14.

Legoux F, Salou M, Lantz O. MAIT Cell Development and Functions: the Microbial Connection. Immunity. 2020;53:710–23.

van den Broek T, Borghans JAM, van Wijk F. The full spectrum of human naive T cells. Nat Rev Immunol. 2018;18:363–73.

Soundararajan M, Kannan S. Fibroblasts and mesenchymal stem cells: Two sides of the same coin? J Cell Physiol. 2018;233:9099–109.

Muzlifah AH, Matthew PC, Christopher DB, Francesco D. Mesenchymal stem cells: the fibroblasts’ new clothes? Haematologica. 2009;94:258–63.

Lendahl U, Muhl L, Betsholtz C. Identification, discrimination and heterogeneity of fibroblasts. Nat Commun. 2022;13:3409.

Steens J, Unger K, Klar L, Neureiter A, Wieber K, Hess J, Jakob HG, Klump H, Klein D. Direct conversion of human fibroblasts into therapeutically active vascular wall-typical mesenchymal stem cells. Cell Mol Life Sci. 2020;77:3401–22.

Ichim TE, O’Heeron P, Kesari S. Fibroblasts as a practical alternative to mesenchymal stem cells. J Transl Med. 2018;16:212.

Beumer J, Clevers H. Cell fate specification and differentiation in the adult mammalian intestine. Nat Rev Mol Cell Biol. 2021;22:39–53.

Moor AE, Harnik Y, Ben-Moshe S, Massasa EE, Rozenberg M, Eilam R, Bahar Halpern K, Itzkovitz S. Spatial Reconstruction of Single Enterocytes Uncovers Broad Zonation along the Intestinal Villus Axis. Cell. 2018;175:1156-1167.e1115.

Kendall RT, Feghali-Bostwick CA. Fibroblasts in fibrosis: novel roles and mediators. Front Pharmacol. 2014;5:123.

Oliver JR, Kushwah R, Wu J, Pan J, Cutz E, Yeger H, Waddell TK, Hu J. Elf3 plays a role in regulating bronchiolar epithelial repair kinetics following Clara cell-specific injury. Lab Invest. 2011;91:1514–29.

Ng AYN, Waring P, Ristevski S, Wang C, Wilson T, Pritchard M, Hertzog P, Kola I. Inactivation of the transcription factor Elf3 in mice results in dysmorphogenesis and altered differentiation of intestinal epithelium. Gastroenterology. 2002;122:1455–66.

Chen R, Kang R, Tang D. The mechanism of HMGB1 secretion and release. Exp Mol Med. 2022;54:91–102.

Dai S, Sodhi C, Cetin S, Richardson W, Branca M, Neal MD, Prindle T, Ma C, Shapiro RA, Li B, et al. Extracellular High Mobility Group Box-1 (HMGB1) Inhibits Enterocyte Migration via Activation of Toll-like Receptor-4 and Increased Cell-Matrix Adhesiveness 2<sup></sup>. J Biol Chem. 2010;285:4995–5002.

Klepsch V, Gerner RR, Klepsch S, Olson WJ, Tilg H, Moschen AR, Baier G, Hermann-Kleiter N. Nuclear orphan receptor NR2F6 as a safeguard against experimental murine colitis. Gut. 2018;67:1434–44.

Klepsch V, Hermann-Kleiter N, Baier G. Beyond CTLA-4 and PD-1: Orphan nuclear receptor NR2F6 as T cell signaling switch and emerging target in cancer immunotherapy. Immunol Lett. 2016;178:31–6.

Sanz-Pamplona R, Berenguer A, Cordero D, Molleví DG, Crous-Bou M, Sole X, Paré-Brunet L, Guino E, Salazar R, Santos C, et al. Aberrant gene expression in mucosa adjacent to tumor reveals a molecular crosstalk in colon cancer. Mol Cancer. 2014;13:46.

McPherson JP, Sarras H, Lemmers B, Tamblyn L, Migon E, Matysiak-Zablocki E, Hakem A, Azami SA, Cardoso R, Fish J, et al. Essential role for Bclaf1 in lung development and immune system function. Cell Death Differ. 2009;16:331–9.

Aw S. Sun H, Geng Y, Peng Q, Wang P, Chen J, Xiong T, Cao R, Tang J: Bclaf1 is an important NF-κB signaling transducer and C/EBPβ regulator in DNA damage-induced senescence. Cell Death Differ. 2016;23:865–75.

Zhou X, Li X, Cheng Y, Wu W, Xie Z, Xi Q, Han J, Wu G, Fang J, Feng Y. BCLAF1 and its splicing regulator SRSF10 regulate the tumorigenic potential of colon cancer cells. Nat Commun. 2014;5:4581.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102:15545–50.

Liberzon A, Subramanian A, Pinchback R. Thorvaldsdottir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–40.

GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–30.

Pan L, Parini P, Tremmel R, Loscalzo J, Lauschke VM, Maron BA, Paci P, Ernberg I, Tan NS, Liao Z, Yin W, Rengarajan S, Li X: Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Github. https://github.com/eudoraleer/sca/ ; 2024.

Pan L, Parini P, Tremmel R, Loscalzo J, Lauschke VM, Maron BA, Paci P, Ernberg I, Tan NS, Liao Z, Yin W, Rengarajan S, Wang ZN, Li X: Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Zenodo. https://zenodo.org/doi/10.5281/zenodo.10906053 ; 2024.

Download references

Acknowledgements

The computations and data handling were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Rackham, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. We would like to thank Vladimir Kuznetsov for his advice on the manuscript, and Liming Zhang and Xueqiang Peng for their help in data handling.

Members of The SCA Consortium

Lu Pan 1, Paolo Parini 2,3, Roman Tremmel 4,5, Joseph Loscalzo 6, Volker M. Lauschke 4,5,7, Bradley A. Maron 6, Paola Paci 8, Ingemar Ernberg 9, Nguan Soon Tan 10,11, Zehuan Liao 9,10, Weiyao Yin 1, Sundararaman Rengarajan 12, Xuexin Li 13,14,*

1 Institute of Environmental Medicine, Karolinska Institutet, Solna, 171 65, Sweden.

2 Cardio Metabolic Unit, Department of Medicine, and Department of Laboratory Medicine, Karolinska Institutet, Stockholm, 141 86, Sweden.

3 Medicine Unit, Theme Inflammation and Ageing, Karolinska University Hospital, Stockholm, 141 86, Sweden.

4 Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, 70376, Germany.

5 University of Tuebingen, Tuebingen, 72076, Germany.

6 Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA.

7 Department of Physiology and Pharmacology, Karolinska Institutet, Solna, 171 65, Sweden.

8 Department of Computer, Control and Management Engineering, Sapienza University of Rome, Rome, 00185, Italy.

9 Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Solna, 171 65, Sweden.

10 School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore.

11 Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore 308232, Singapore.

12 Department of Physical Therapy, Movement & Rehabilitation Sciences, Northeastern University, Boston, MA, 02115, USA.

13 Department of General Surgery, The Fourth Affiliated Hospital, China Medical University, Shenyang 110032, China.

14 Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, 171 65, Sweden.

Review history

The review history is available as Additional File 4.

Peer review information

Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Funding

Open access funding provided by Karolinska Institute. This work is supported by the Karolinska Institute Network Medicine Global Alliance (KI NMA) collaborative grants C24401073 (X.L., L.P.), C62623013 (X.L., L.P.), and C331612602 (X.L., L.P.).

Author information

Authors and affiliations

Institute of Environmental Medicine, Karolinska Institutet, 171 65, Solna, Sweden

Lu Pan & Weiyao Yin

Cardio Metabolic Unit, Department of Medicine, and Department of Laboratory Medicine, Karolinska Institutet, 141 86, Stockholm, Sweden

Paolo Parini

Theme Inflammation and Ageing, Medicine Unit, Karolinska University Hospital, 141 86, Stockholm, Sweden

Dr. Margarete Fischer-Bosch Institute of Clinical Pharmacology, 70376, Stuttgart, Germany

Roman Tremmel & Volker M. Lauschke

University of Tuebingen, 72076, Tuebingen, Germany

Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, 02115, USA

Joseph Loscalzo & Bradley A. Maron

Department of Physiology and Pharmacology, Karolinska Institutet, 171 65, Solna, Sweden

Volker M. Lauschke

Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185, Rome, Italy

Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, 171 65, Solna, Sweden

Ingemar Ernberg & Zehuan Liao

School of Biological Sciences, Nanyang Technological University, Singapore, 637551, Singapore

Nguan Soon Tan & Zehuan Liao

Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, Singapore, 308232, Singapore

Nguan Soon Tan

Department of Physical Therapy, Movement & Rehabilitation Sciences, Northeastern University, Boston, MA, 02115, USA

Sundararaman Rengarajan

Department of General Surgery, The Fourth Affiliated Hospital, China Medical University, Shenyang, 110032, China

Department of Medical Biochemistry and Biophysics, Karolinska Institutet, 171 65, Solna, Sweden


Contributions

Conceptualization, X.L., L.P., and J.L.; methodology, X.L. and L.P.; investigation, X.L., L.P., V.M.L., R.T., and J.L.; analysis and visualization, L.P.; cross-checking and validation, X.L. and L.P.; website construction, L.P., X.L., and R.T.; funding acquisition, X.L. and L.P.; project administration, X.L., L.P., P.P., and V.M.L.; supervision, X.L. and J.L.; writing, L.P. and X.L. All authors edited and reviewed the manuscript.

Corresponding author

Correspondence to Xuexin Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

VML is CEO and shareholder of HepaPredict AB, co-founder and shareholder of PersoMedix AB, and discloses consultancy work for Enginzyme AB. The other authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Figure S1. Sample count in fetal and adult groups across tissues and omics types.

Figure S2. Correlations between cell types based on gene expression signatures revealed distinct cell type class clusters. (A-B) Heatmaps showing the correlations between adult (A) and fetal (B) cell types based on the expression of their top upregulated genes. The intensity of the heatmap shows the AUROC level between cell types. Colour blocks on the top of the heatmap represent tissues (first row from the top), biological systems (second row), cell types (third row) and cell type classes (fourth row).

Figure S3. Correlations between cell types based on TF signatures revealed similar clustering patterns. (A-B) Heatmaps showing the correlations between adult (A) and fetal (B) cell types based on the expression of the TF signatures of each cell type. The intensity of the heatmap shows the AUROC level between cell types. Colour blocks on the top of the heatmap represent tissues (first row from the top), biological systems (second row), cell types (third row) and cell type classes (fourth row).

Figure S4. Phenotype or disease trait associations. Forest plot showing the associations of phenotype or disease traits in selected cell type classes of scRNA-seq data for both adult and fetal tissues. The X-axis displays the odds ratio of each trait, and the colors of the points represent cell type classes.

Figure S5. Landscape of clonal expansion patterns across tissues. (A) tSNE of the multi-modal tissues from the scImmune-profiling data. Colors indicate clonal type expansion groups of the cells. Cells not present in the T or B repertoires are colored gray (NA group). Tissues with too few cells in the T or B repertoires (i.e., bile duct and kidney) were excluded from the main analysis. (B) Stacked bar plots revealing the overall clonal expansion landscapes of the T and B cell repertoires. Colors represent clonal type groups. (C) Alluvial plot showing the top clonal types in T cell repertoires and their proportions shared across tissues containing these clonotypes. Colors represent clonotypes.

Figure S6. Pseudotime heatmaps of MSC lineage cell types in the adult and fetal colon. (A-B) Pseudotime trajectory of each cell type in the MSC lineage of adult (A) and fetal (B) colons. The color represents the cell type, and the violin plots represent the density of cells across pseudotime.

Figure S7. Comparison of DE gene overlaps between bulk RNA-seq, scRNA-seq and WGS. (A) Chromosomal positions of the top 10 eGenes in colon transverse bulk RNA-seq data. Gene names and their SNP rsIDs are shown. (B) Chromosomal positions of the top 10 sGenes in colon transverse bulk RNA-seq data. Gene names and their SNP rsIDs are shown. (C) Stacked bar plot showing the number of DE genes shared between the bulk RNA-seq data, the scRNA-seq data, and the genes of the top eQTLs and sQTLs. The color represents the omics type. (D) Stacked bar plot showing the number of shared DE genes across the bulk RNA-seq data, the scRNA-seq data, and the genes of the top eQTLs and sQTLs. Colors represent the cell types to which the genes belonged, with reference to the DE genes of the cell types in the scRNA-seq data.

Figure S8. Comprehensive workflow for scATAC-Seq data analyses in SCA V1.0.0.
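The similarity heatmaps described for Figures S2 and S3 compare cell types on the expression of their top upregulated genes or TF signatures, with AUROC reported as the similarity measure. For readers who want to prototype the general idea, the sketch below is not the authors' pipeline: it assumes a genes-by-cells matrix expr, a matching cell_type label vector, and a precomputed list markers_by_type of top marker genes per cell type, and it substitutes Spearman correlation of pseudo-bulk profiles for AUROC.

```r
# Illustrative sketch only: a cell-type-by-cell-type similarity heatmap.
# Assumptions: `expr` is a genes x cells expression matrix, `cell_type` labels each
# column of `expr`, and `markers_by_type` is a list of top upregulated genes per cell type.
# The published figures report AUROC; Spearman correlation is used here as a simpler proxy.
library(pheatmap)

top_genes  <- unique(unlist(markers_by_type))
pseudobulk <- sapply(split(seq_along(cell_type), cell_type), function(idx)
  rowMeans(expr[top_genes, idx, drop = FALSE]))        # average marker expression per cell type
similarity <- cor(pseudobulk, method = "spearman")     # cell type x cell type similarity matrix
pheatmap(similarity, clustering_method = "ward.D2")    # heatmap analogous to Figures S2/S3
```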

Additional file 2:

Table S1. Cell counts of the adult and fetal tissue groups at each omics level.

Table S2. Filtered-matrix raw read counts for scRNA-Seq across tissues in both fetal and adult groups. The Cell_Count_Filtered_Matrix column represents raw read counts either obtained directly from published studies or after filtering to remove background noise.

Table S3. Statistics of the upregulated genes from adult and fetal tissues, filtered by average log2 fold-change > 0.25 and adjusted P < 0.05. Clusters represent cell types. Genes were ranked by average log2 fold-change.

Table S4. Top receptor-ligand interaction profiles of the cell types in the 38 matching adult and fetal tissues. Interaction analysis was done separately for each tissue, and information on the interaction pairs can be viewed in the first column.

Table S5. Top clonotypes (VDJ gene combinations) of each cell type present in the T and B cell repertoires.

Table S6. Top TFs in the pseudotime transitions of adult and fetal colon cell types.

Table S7. Top receptor-ligand pairs in spatial transcriptomics of adult colons (colon 1 and colon 2) as well as in scRNA-seq adult and fetal colons. The first column represents the data type to which the interactions belong. The table is ranked by decreasing interaction ratio.

Table S8. Comparison of SCA with other single-cell omics databases. A green tick indicates yes and a red cross indicates no.

Table S9. List of public resources included in the SCA database portal. SCA_PID refers to the SCA-designated project identity number (PID).
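The filtering criteria behind Table S3 (average log2 fold-change > 0.25, adjusted P < 0.05, genes ranked by fold-change within each cell type) can be applied to any per-cluster marker table. The snippet below is a minimal, hypothetical illustration, assuming a data frame markers with Seurat-style columns cluster, gene, avg_log2FC, and p_val_adj; it is not the authors' exact code.

```r
# Hypothetical sketch of the Table S3-style filter on a Seurat-style marker table.
# Assumed columns: cluster (cell type), gene, avg_log2FC, p_val_adj.
library(dplyr)

top_markers <- markers %>%
  filter(avg_log2FC > 0.25, p_val_adj < 0.05) %>%   # keep upregulated, significant genes
  group_by(cluster) %>%
  arrange(desc(avg_log2FC), .by_group = TRUE)       # rank by average log2 fold-change per cell type
```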

Additional file 3.

Supplementary Methods.

Additional file 4.

Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Pan, L., Parini, P., Tremmel, R. et al. Single Cell Atlas: a single-cell multi-omics human cell encyclopedia. Genome Biol 25, 104 (2024). https://doi.org/10.1186/s13059-024-03246-2


Received: 16 November 2022

Accepted: 12 April 2024

Published: 19 April 2024

DOI: https://doi.org/10.1186/s13059-024-03246-2


Keywords

  • Single-cell omics
  • Multi-omics
  • Single Cell Atlas
  • Human database
  • Single-cell RNA-sequencing
  • Spatial transcriptomics
  • Single-cell ATAC-sequencing
  • Single-cell immune profiling
  • Mass cytometry
  • Flow cytometry

Genome Biology

ISSN: 1474-760X

