1) Introduction
Next Generation Sequencing (NGS), also known as Deep Sequencing, Massively Parallel Sequencing, or Second- and Third-Generation Sequencing, is a technique that offers unprecedented detail on the genomic, transcriptomic, and epigenomic patterns associated with cellular processes. A medium-sized lab (10-15 scientists) can easily generate multiple terabytes of data during an NGS end-to-end process. Building scalable, cost-effective, and secure data pipelines is therefore essential for the life sciences industry, and for the genomics domain in particular.
Genomics: The study of genomes (the complete set of genetic material within an organism). Genomics involves the mapping, sequencing, and analysis of genomes, including their structure, function, comparison, and evolution.
Transcriptomics: Techniques for identifying mRNA from actively transcribed genes.
Epigenomics: The study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome.
Epigenetics: The study of heritable changes in gene expression that do not involve changes to the underlying DNA sequence - a change in phenotype without a change in genotype - which in turn affects how cells read genes.
Epigenome: The multitude of chemical compounds that tell the genome what to do.
Precision Medicine (formerly Personalized Medicine): An approach to disease treatment and prevention that considers each person's individual genes, microbiome, environment, and lifestyle.
Human microbiome: The genes of humans together with the genes of all microorganisms that live on and in humans.
Pharmacogenomics: A part of precision medicine; the study of how genes affect a person's response to drugs. It combines pharmacology (the science of drugs) and genomics to develop effective, safe medications and doses tailored to variations in a person's genes.
3) Next Generation Sequencing Process
A typical NGS process consists of three steps:
· Sample Preparation
· Sample Sequencing using NGS Instruments
· Analysis of Data
In a high-throughput pipeline such as NGS, Analysis of Data is a time-consuming and expensive step of the end-to-end process.
3.1. Next Generation Sequencing Methods
NGS enables a wide variety of sequencing solutions. Some of the most common solution areas and their methods are:
· Genomics - includes Whole-Genome Sequencing, Exome Sequencing, De novo Sequencing, and Targeted Sequencing.
· Transcriptomics - includes Total RNA and mRNA Sequencing, Targeted RNA Sequencing, and Small RNA and Noncoding RNA Sequencing.
· Epigenomics - includes Methylation Sequencing, ChIP Sequencing, and Ribosome Profiling.
3.2. NGS Data Analysis Pipeline
An NGS data analysis pipeline is typically termed Bioinformatics. Depending on the solution and the sequencing method, an NGS bioinformatics pipeline consists of one or more types of data analysis. For example, a de novo assembly has different steps than a reference sequence analysis, but a generic data analysis pipeline consists of Primary, Secondary, and Tertiary phases.
3.2.1. Primary Analysis Phase
Primary analysis is carried out concurrently with the generation of sequencing data by the sequencers. Some vendors provide primary analysis software tools as part of their sequencer platform suite; these tools can run on compute resources bundled with the sequencer instruments, on on-premises computing clusters, or in a public cloud environment. Primary analysis includes de-multiplexing of short reads, base sequence quality scoring, etc. The raw data is delivered in FASTQ format, and the key computational need is to keep up with the throughput of the instrument as it produces measurement information, such as images of chemistry.
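Since the primary phase delivers raw reads as FASTQ, a minimal reader helps make the format concrete. This is a sketch assuming the common 4-line record layout and Phred+33 quality encoding; the function name and the sample record are illustrative.

```python
from typing import Iterator, List, Tuple

def parse_fastq(lines) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from FASTQ text lines.

    Assumes the common 4-line record layout and Phred+33 quality
    encoding, which most modern sequencers emit.
    """
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                                # '+' separator line
        qual = next(it).strip()
        scores = [ord(c) - 33 for c in qual]    # Phred+33 decoding
        yield header.strip().lstrip("@"), seq, scores

record = [
    "@read1",
    "GATTACA",
    "+",
    "IIIIIII",   # 'I' (ASCII 73) decodes to Phred score 40
]
read_id, seq, scores = next(parse_fastq(record))
```

The quality scores are what downstream QC steps consume; a real pipeline would stream records from (possibly gzip-compressed) files rather than an in-memory list.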
3.2.2. Secondary Analysis Phase
Secondary analysis performs operations on the aggregated data from one or more runs of the machine and hence requires significant data storage and compute resources. It is a repeatable process of running data through various algorithms on a per-sample basis, monitoring quality metrics, and tweaking or improving the algorithms and their parameters. Because secondary analysis benefits from economies of scale in computing and resource utilization, a public cloud can be a natural choice for executing this pipeline phase. The files generated during this phase are BAM (Binary Sequence Alignment/Map) and BAI (BAM index) files.
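The per-sample, metric-driven loop described above can be sketched as a simple quality gate; the metric (mean base quality), the threshold, and the sample names are illustrative assumptions, not a specific pipeline's defaults.

```python
def mean_base_quality(phred_scores):
    """Average Phred score across a sample's bases."""
    return sum(phred_scores) / len(phred_scores)

def qc_pass(samples, min_mean_q=30.0):
    """Return only the samples whose mean base quality meets the
    threshold; failing samples would be re-run or re-analyzed with
    tweaked parameters in the secondary-analysis loop."""
    return {name: q for name, q in samples.items()
            if mean_base_quality(q) >= min_mean_q}

samples = {
    "sample_A": [38, 39, 40, 37],
    "sample_B": [20, 22, 25, 24],   # low-quality run
}
passed = qc_pass(samples)
```

In practice the metrics come from tools in the alignment/variant-calling toolchain, but the monitor-and-gate structure is the same.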
3.2.3. Tertiary Analysis Phase
The tertiary phase is an integrative analysis pipeline consisting of heavy data analysis steps to make sense of the data in the context of the research study. Multiple samples need to be brought together along with phenotype and other experimental data. The pipeline steps include annotation with data from public datasets, sequence comparison using the Basic Local Alignment Search Tool (BLAST), statistical analysis, hierarchical clustering, machine learning and/or deep neural networks, reports, dashboards, etc. Variant data is delivered as Variant Call Format (VCF) files. A public cloud such as Azure can offer all the tools required for this phase.
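The VCF files mentioned above are tab-separated text. A minimal parser for the eight fixed columns illustrates the layout; header lines (`##...`, `#CHROM`) and per-sample genotype columns are out of scope in this sketch, and the example line is made up.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict.

    Only the eight fixed columns are handled. INFO is split into
    key=value pairs; flag-style keys map to True.
    """
    fields = line.rstrip("\n").split("\t")
    rec = dict(zip(VCF_COLUMNS, fields))
    rec["POS"] = int(rec["POS"])                # 1-based position
    rec["INFO"] = dict(
        kv.split("=", 1) if "=" in kv else (kv, True)
        for kv in rec["INFO"].split(";")
    )
    return rec

line = "chr1\t12345\trs111\tA\tG\t50\tPASS\tDP=100;AF=0.5"
rec = parse_vcf_line(line)
```

Tertiary steps such as annotation and statistical analysis typically start from records of this shape.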
An NGS pipeline with multiple repetitive steps aligns very well with the concept of an ETL (Extract, Transform, and Load) process. In an ETL process, data is acquired (Extracted) from a data source in real-time or batch mode, then processed (Transformed) using computing resources, and finally persisted (Loaded) into scalable storage or a datastore for further advanced analysis and reporting. Azure offers a number of services and tools for deploying and executing an ETL process end-to-end.
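The three ETL stages described above can be sketched with local stand-ins: a list for the data source and a dict for the datastore. The record shape and the filter rule are illustrative assumptions.

```python
def extract(source):
    """Extract: read raw records from a source (a list stands in
    for a sequencer feed or a blob container here)."""
    yield from source

def transform(records):
    """Transform: normalize sequences and drop ambiguous reads."""
    for r in records:
        seq = r["seq"].upper()
        if "N" not in seq:          # 'N' marks an ambiguous base
            yield {"id": r["id"], "seq": seq, "length": len(seq)}

def load(records, sink):
    """Load: persist into a datastore (a dict stands in for
    Blob Storage or a database)."""
    for r in records:
        sink[r["id"]] = r
    return sink

raw = [{"id": "r1", "seq": "gattaca"}, {"id": "r2", "seq": "ganta"}]
store = load(transform(extract(raw)), {})
```

Because each stage is a generator, records stream through without materializing the whole dataset, which mirrors how a cloud ETL pipeline handles terabyte-scale inputs.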
4) NGS Data Pipeline on Azure
A cloud platform such as Microsoft Azure can offer high-performance, scalable, and cost-effective connectivity (from on-premises), storage, computing, security, and developer tools for end-to-end orchestration of Next Generation Sequencing (NGS) pipelines.
4.1. Connectivity to Azure Cloud
4.1.1. On-Premises Connectivity
Sequencers running in an on-premises lab environment can be securely connected to Azure over either a Virtual Private Network (VPN) or a direct private connection. A VPN provides encrypted communication over the Internet between an on-premises VPN device and an Azure VPN gateway. Typically, in a high-throughput environment such as NGS, a dedicated private connection called "ExpressRoute" is deployed, which can offer bandwidths of up to 10 Gbps.
4.1.2. Internet2 Network Connectivity
Internet2 is a member-owned advanced technology community founded by the nation's leading higher education institutions. The network provides a collaborative environment for U.S. research and education organizations to solve common technology challenges and to develop innovative solutions in support of their educational, research, and community service missions. Internet2 can offer speeds of up to 100 Gbps.
Microsoft provides direct peering between Azure and Internet2 to help research and educational institutions. Universities that are part of the Internet2 network can therefore connect the DNA sequencers in their R&D labs directly to Azure through a private peering connection.
4.2. NGS Data Transfer to Azure Cloud
Using the Azure cloud connectivity options, a customer can transfer raw sequence data to the Azure cloud either through streaming or in batch mode. Data can be streamed from on-premises sequencers to Azure Blob Storage using the Azure Storage Data Services API, offered in several popular programming languages. In batch mode, AzCopy for Windows or Linux can be leveraged to transfer bulk data to the cloud. The data can also be shipped to the closest Azure datacenter using the Azure Import/Export service or the new Azure Data Box service (in preview).
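As a sketch, a batch-mode bulk transfer of a sequencing run folder with AzCopy might look like the following; the account, container, and SAS token are placeholders, and the flags follow AzCopy v10 syntax.

```shell
# Recursively copy a local run folder to a Blob container (AzCopy v10).
# <account>, <container>, and <SAS> are placeholders for your values.
azcopy copy "/data/ngs/run_2024_01" \
    "https://<account>.blob.core.windows.net/<container>?<SAS>" \
    --recursive
```

A SAS (shared access signature) token scopes the upload to one container without exposing the storage account key.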
4.3. NGS Data Encryption
Azure offers Storage Service Encryption (SSE), which encrypts data before writing it to Azure Storage. Azure also offers client-side encryption through storage client libraries that can encrypt data before sending it across the wire from the client to Azure.
4.4. Data Encryption Keys
Azure Key Vault helps safeguard the cryptographic keys and secrets used by cloud applications. Key Vault allows keys to be encrypted using keys that are protected by Hardware Security Modules (HSMs). Microsoft processes keys in FIPS 140-2 Level 2 validated HSMs (hardware and firmware).
4.5. Azure Blob Storage for NGS Data
Azure Storage offers three storage tiers for sequencing data based on its usage patterns. The hot storage tier is optimized for data that is accessed frequently. The cool storage tier is optimized for data that is infrequently accessed and stored for at least a month. The new archive storage tier (in preview) is optimized for data that is rarely accessed and stored for at least six months, with flexible latency requirements (on the order of hours).
During the data analysis pipeline execution, many data files are generated and used by the various workflows. Depending on the type and frequency of use of the data, the following storage options can be employed:
· BCL files - Raw files containing base calls per cycle. These files are analyzed once and rarely used later, so they can be moved to the Azure archive tier.
· FASTQ files - Store biological sequence and quality score information. These files are generated during the initial stages of the analysis pipeline and can then be archived to Azure archive storage for data compliance needs.
· BAM files - Binary files that contain sequence alignment data. These files are used frequently in the initial workflow steps and rarely later, so they can be moved to the Azure cool storage tier after initial processing.
· VCF files - Store gene sequence variations in text format. These files are used frequently for further analysis and hence are stored in the Azure hot tier.
· Reference genome data - A digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. Reference data is used during the alignment and variant analysis steps, so it can be stored in Azure Blob Storage or on premium Azure Data Disks attached to cloud computing resources for high throughput and IOPS.
The other popular Azure storage options are:
- Azure Data Lake Store - a Hadoop compatible repository for analytics on data of any size, type, and ingestion speed
- Azure SQL Data Warehouse - which combines the SQL server relational database with Massive parallel processing
- Azure Cosmos DB - a globally distributed database service supporting document, key/value, or graph databases
- Azure SQL database - a fully managed relational database-as-a-service using the Microsoft SQL server engine.
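The tiering guidance for NGS file types above can be expressed as a simple policy lookup. The mapping follows the usage patterns just described, but the policy function itself is an illustrative assumption, not an Azure feature; real lifecycle management would be driven by storage policies or an automation job.

```python
# Illustrative mapping from NGS file extension to a Blob Storage tier,
# following the usage patterns described above.
TIER_BY_TYPE = {
    "bcl":   "archive",   # analyzed once, rarely touched again
    "fastq": "archive",   # retained for compliance
    "bam":   "cool",      # frequent early in the workflow, rare later
    "bai":   "cool",      # index follows its BAM
    "vcf":   "hot",       # actively queried downstream
}

def pick_tier(filename: str) -> str:
    """Return the storage tier for a pipeline output file."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return TIER_BY_TYPE.get(ext, "hot")   # default to hot when unknown

tier = pick_tier("sample42.bam")
```

Reference genome data stays on hot storage or premium data disks, per the note above, since every alignment run reads it.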
4.6. Azure Computing Resources for NGS Workloads
A big-data analysis pipeline such as NGS needs to scale its computing resources both horizontally (several computing nodes) and vertically (adding more memory and/or CPU processing on one computing node). Azure provides multiple on-demand, auto-scale, and secure options through managed and unmanaged services.
4.6.1. Virtual Machines (VM)
Azure offers virtualized infrastructure using Windows Server or Linux distributions such as Red Hat and Ubuntu. An Azure VM gives the flexibility of virtualization without having to buy and maintain the physical hardware that runs it. However, the customer needs to maintain the VM by performing tasks such as configuring, patching, and installing the software that runs on it.
Azure VMs are well suited for NGS workflows that have high memory/CPU requirements, need full access to the VM, or need to run custom software that can be challenging to run as an Azure managed service (or through containerization). Azure offers GPU-enabled instances that include NVIDIA GPU cards, optimized for compute-intensive and network-intensive applications and algorithms, including CUDA- and OpenCL-based simulations.
For deploying, managing, and monitoring VMs, Azure offers multiple cloud services. Some of the key capabilities are:
· Automate VM configuration
· Create custom VM images (Windows & Linux)
· Highly available VM scale sets (up to 1,000 VMs) with load balancers
· Manage VMs with virtual networks
· Back up, monitor, and secure VMs
· Create a CI/CD pipeline using Jenkins and Team Services
· Infrastructure-as-Code automation tools such as Chef, Puppet, Azure Resource Manager templates, etc.
A VM lifecycle can be managed through the Azure Portal, the Azure Command Line Interface (CLI), or Azure PowerShell. It can also be managed through the REST API and SDKs offered in multiple programming languages. Additionally, a few pre-configured VMs with software packages such as Galaxy, Bioconductor, etc. are available from Microsoft partners on the Azure Marketplace.
4.6.2. Azure App Services
Azure App Service facilitates the creation and deployment of web apps in multiple programming languages, the building and consumption of cloud APIs, and the building and hosting of the backend for any mobile app. Azure App Service is a managed service, so all the server infrastructure is managed by the service, including auto-scaling and high availability of the applications.
Azure App Service fits well into the NGS data analysis pipeline steps where the workflow needs to be managed from web and/or mobile applications.
4.6.3. Azure Container Services and Azure Batch
Tasks such as sequence alignment, variant calling, quality control, rendering, and the analysis and processing of images of chemistry can be containerized and orchestrated using Azure Container Service and/or Azure Batch.
4.6.3.1. Azure Container Services (ACS)
Azure Container Instances offer the fastest and simplest way to run a container in Azure, without having to provision any virtual machines and without having to adopt a higher-level service. As the NGS data pipeline involves multiple steps with variable processing time and scale, containers are an excellent choice for NGS processing. A container image can be built with all the required software modules for a pipeline step, analysis method(s), etc., then added to an automation process or pipeline job, or shared with other collaborators. Azure offers multiple container options.
4.6.3.1.1. Service Fabric Windows Container on Azure
Azure Service Fabric is a distributed systems platform for deploying and managing scalable and reliable microservices and containers. The container deployment process consists of four simple steps:
· Package a Docker image container
· Configure communication
· Build and package the Service Fabric application
· Deploy the container application to Azure
4.6.3.1.2. Azure Container Service (ACS) for Kubernetes
ACS
for Kubernetes makes it simple to create, configure, and manage a cluster of
virtual machines that are preconfigured to run containerized applications. ACS
combines the enterprise-grade features of Azure and application portability
through Kubernetes and the Docker image format.
4.6.3.1.3. Azure Container Service with DC/OS and Swarm
ACS allows quick deployment of a production-ready DC/OS or Docker Swarm cluster. DC/OS is a distributed operating system based on the Apache Mesos distributed systems kernel. It includes the Marathon orchestration platform for scheduling workloads.
Docker Swarm provides native clustering for Docker. Supported tools for managing containers on a Swarm cluster include Dokku, the Docker CLI, Docker Compose, Krane, Jenkins, etc.
4.6.3.1.4. Azure Container Registry
Azure Container Registry is a managed Docker registry service based on the open-source Docker Registry 2.0. It allows you to store and manage private Docker container images. The images can be pulled from scalable orchestration systems such as Kubernetes, DC/OS, and Docker Swarm.
4.6.3.2. Azure Batch
Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. Azure Batch schedules compute-intensive workloads to run on a managed collection of virtual machines and can automatically scale compute resources to meet the needs of computing jobs.
As a managed service, Azure Batch eliminates the need to manually create, configure, and manage an HPC cluster, individual virtual machines, virtual networks, or a complex job and task scheduling infrastructure.
A high-level Azure Batch workflow for an NGS data analysis pipeline consists of:
- Upload DNA sequence data files to Azure Storage. Azure Batch includes built-in support for accessing Azure Blob Storage
- Upload the analysis modules that the tasks will run
- Create a pool of compute nodes. As part of this step, the number of compute nodes, their size, operating system, etc. are defined
- Create a job. The job manages a collection of tasks
- Add tasks to the job. Each task runs the analysis modules to process the data files. As each task completes, it can upload its output to Azure Storage for further analysis
- Monitor job progress and retrieve the task output from Azure Storage
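The pool/job/task structure above can be mirrored locally with a thread pool standing in for the pool of compute nodes. The real service would use the Azure Batch SDK and move data through Blob Storage; all names in this local sketch are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(data_file):
    """Stand-in for an analysis module a Batch task would run."""
    return f"{data_file}.result"

def run_batch_job(data_files, pool_size=4):
    """Mirror the Batch workflow locally: a pool of workers
    (compute nodes), one task per data file, and outputs collected
    for further analysis."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:  # create a pool
        tasks = [pool.submit(analyze, f) for f in data_files]  # add tasks
        return [t.result() for t in tasks]  # monitor and retrieve output

outputs = run_batch_job(["s1.fastq", "s2.fastq"])
```

The key design point carries over: tasks are independent per data file, so the pool can scale out to match the job size.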
4.7. Azure Analytic Tools for NGS Data Analysis
4.7.1. Big Data Analysis Workloads
4.7.1.1. Azure HDInsight
Azure HDInsight allows you to process and analyze big data and develop solutions with Hadoop, Spark, R Server, Storm, and other technologies in the Hadoop ecosystem.
4.7.1.2. Azure Data Lake Store and Analytics
Azure Data Lake Store provides a hyper-scale, Hadoop-compatible repository for analytics on data of any size, type, and ingestion speed. Azure Analysis Services provides enterprise-grade data modeling in the cloud: a fully managed platform, integrated with Azure data platform services.
Azure Analysis Services allows you to mash up and combine data from multiple sources, define metrics, and secure data in a single, trusted semantic data model. The data model provides an easier and faster way to browse massive amounts of data with client applications such as Power BI, Excel, Reporting Services, and third-party and custom apps.
4.7.2. Building AI and ML Models
Azure Machine Learning (AML) is an integrated, end-to-end data science analytics solution. It enables data scientists to prepare data, develop experiments, and deploy models at cloud scale. The main components of AML are:
· Azure Machine Learning Workbench
· Azure Machine Learning Experimentation Service
· Azure Machine Learning Model Management Service
· Microsoft Machine Learning Libraries for Apache Spark (MMLSpark Library)
· Visual Studio Code Tools for AI
Azure Machine Learning fully supports open-source technologies and machine learning frameworks such as:
· Scikit-learn
· TensorFlow
· Microsoft Cognitive Toolkit
· Spark ML
Azure Machine Learning is built on top of the following open-source technologies:
· Jupyter Notebook
· Apache Spark
· Docker
· Kubernetes
· Python
· Conda
4.8. Monitoring and Management
Azure provides a number of services to monitor and manage the NGS end-to-end data analysis pipeline. The key services offered are:
· Azure Monitor - highly granular and real-time monitoring data for any Azure resource
· Application Insights - detect, triage, and diagnose issues in your web apps and services
· Azure Cost Management - track cloud usage and expenditures
4.9. Security and Compliance
4.9.1. Azure Security Center
Azure Security Center provides unified security management and advanced threat protection for workloads running in Azure, on-premises, and in other clouds. Security Center facilitates:
· Unified visibility and control
· Adaptive threat prevention
· Intelligent threat detection and response
4.9.2. Compliance
To help organizations comply with national, regional, and industry-specific requirements governing the collection and use of individual data, Microsoft offers the most comprehensive set of compliance offerings of any cloud service provider. Some of the key compliance offerings for the healthcare industry are:
· HIPAA/HITECH - Microsoft offers Health Insurance Portability & Accountability Act Business Associate Agreements (BAAs)
· HITRUST - Azure is certified to the Health Information Trust Alliance Common Security Framework
· MARS-E - Microsoft complies with the US Minimum Acceptable Risk Standards for Exchanges
· NEN 7510:2011 - Organizations in the Netherlands must demonstrate control over patient health data in accordance with the NEN 7510 standard
· NHS IG Toolkit - Azure is certified to the UK National Health Service Information Governance Toolkit
4.10. Partner Platforms for NGS Data Pipeline
Microsoft has a large partner ecosystem in major industries, including healthcare and specifically genomics. Some of the key cloud partners are:
· Appistry
· BC Platforms
· DNAnexus
· WuXi NextCODE
5) Conclusion
The Azure cloud provides several services and tools for an end-to-end Next Generation Sequencing (NGS) data analysis pipeline. The platform is optimized to address the challenges of security, scalability, and collaboration between organizations and research institutions.