
Building a Genome Next Generation Sequencing (NGS) Data Pipeline in Azure


1)              Introduction

Next Generation Sequencing (NGS), also known as Deep Sequencing, Massively Parallel Sequencing, or Second- and Third-Generation Sequencing, is a technique that offers unprecedented detail in the genomic, transcriptomic, and epigenomic patterns associated with cellular processes. A medium-size lab (10-15 scientists) can easily generate multiple terabytes of data during an NGS end-to-end process. Hence, building scalable, cost-effective, and secure data pipelines is the lifeblood of the life sciences industry, and of the genomics domain in particular.

2)                   Key Terms

Genomics: The study of genomes (the complete set of genetic material within an organism). Genomics involves the mapping, sequencing, and analysis of genomes, including their structure, function, comparison, and evolution
Transcriptomics: The set of techniques used to identify mRNA from actively transcribed genes
Epigenomics: The study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome
Epigenetics: The study of heritable changes in gene expression that do not involve changes to the underlying DNA sequence (a change in phenotype without a change in genotype), which in turn affect how cells read genes
Epigenome: The multitude of chemical compounds that tell the genome what to do
Precision Medicine (formerly Personalized Medicine): An approach to disease treatment and prevention that considers each person's genes, microbiome, environment, and lifestyle
Human microbiome: The genes of humans together with the genes of all microorganisms that live on and in humans
Pharmacogenomics: A part of precision medicine; the study of how genes affect a person's response to drugs. It combines pharmacology (the science of drugs) and genomics to develop effective, safe medications and doses that are tailored to variations in a person's genes


3)                   Next Generation Sequencing Process

A typical NGS process consists of three steps:
·       Sample preparation (using preparation kits)
·       Sample sequencing using NGS instruments
·       Analysis of data

In a high-throughput pipeline such as NGS, the analysis of data is the most time-consuming and expensive step of the end-to-end process.

3.1                Next Generation Sequencing Methods

NGS enables a wide variety of sequencing solutions. Some of the most common solution areas and their methods are:
·       Genomics - includes Whole-Genome Sequencing, Exome Sequencing, De novo Sequencing, and Targeted Sequencing
·       Transcriptomics - includes Total RNA and mRNA Sequencing, Targeted RNA Sequencing, and Small RNA and Noncoding RNA Sequencing
·       Epigenomics - includes Methylation Sequencing, ChIP Sequencing, and Ribosome Profiling

3.2                Next Generation Sequencing Data Analysis Pipeline

An NGS data analysis pipeline is typically referred to as a bioinformatics pipeline. Depending on the solution and the sequencing method, an NGS bioinformatics pipeline consists of one or more types of data analysis. For example, a de novo assembly has different steps than a reference-based sequence analysis, but a generic data analysis pipeline consists of Primary, Secondary, and Tertiary phases.


3.2.1.              Primary Analysis Phase

Primary analysis is carried out simultaneously with the generation of sequencing data by the sequencers. Some vendors provide primary analysis software tools as part of their sequencer platform suite; the tools can be run on compute resources bundled with the sequencer instruments, on on-premises computing clusters, or in a public cloud environment. Primary analysis includes de-multiplexing of short reads, base calling, and base sequence quality scoring. The raw data is delivered in FASTQ format, and the key computational need is keeping up with the throughput of the instrument as it produces measurement information, such as images of the sequencing chemistry.
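As a concrete illustration of the FASTQ output of this phase, the sketch below parses one FASTQ record and decodes its per-base Phred+33 quality scores (a minimal Python sketch; the record content is made up for illustration):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) tuples from FASTQ lines.

    Each record is 4 lines: '@' header, base calls, '+', quality string.
    Qualities are decoded from Phred+33 ASCII: Q = ord(char) - 33.
    """
    for i in range(0, len(lines), 4):
        read_id = lines[i][1:].strip()                        # drop leading '@'
        sequence = lines[i + 1].strip()                       # base calls
        quals = [ord(c) - 33 for c in lines[i + 3].strip()]   # quality line
        yield read_id, sequence, quals

# A single made-up FASTQ record
record = [
    "@SEQ_001",
    "GATTTGGGG",
    "+",
    "!''*((((*",
]
read_id, seq, quals = next(parse_fastq(record))
mean_quality = sum(quals) / len(quals)
```

A real pipeline would stream millions of such records; the decoding rule, however, is exactly this one-liner per quality string.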

3.2.2.            Secondary Analysis Phase

Secondary analysis performs operations on the aggregated data from one or more runs of the machine and hence requires a significant amount of data storage and compute resources. It is a repeatable process of running data through various algorithms on a per-sample basis, monitoring quality metrics, and tweaking or improving the algorithms and their parameters. Because secondary analysis requires economies of scale in computing and resource utilization, a public cloud is a natural choice for executing this pipeline phase. The files generated during this phase are BAM and BAI (Binary Sequence Alignment/Map and its index) files.
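The repeatable, per-sample nature of this phase can be sketched as a loop that runs each sample through an algorithm, records a quality metric, and flags samples for re-processing with tweaked parameters (a minimal sketch; the alignment-rate metric and threshold are fabricated for illustration):

```python
def run_secondary_analysis(samples, min_alignment_rate=0.9):
    """Run a per-sample step, collect quality metrics, and flag failures.

    `samples` maps sample IDs to simulated alignment rates; in a real
    pipeline this value would come from the aligner's output metrics.
    """
    passed, flagged = {}, {}
    for sample_id, alignment_rate in samples.items():
        if alignment_rate >= min_alignment_rate:
            passed[sample_id] = alignment_rate
        else:
            flagged[sample_id] = alignment_rate  # re-run with tweaked parameters
    return passed, flagged

passed, flagged = run_secondary_analysis({"S1": 0.97, "S2": 0.85, "S3": 0.93})
```

Because every sample goes through the same loop independently, this step parallelizes naturally across cloud compute nodes.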


3.2.3.              Tertiary Analysis Phase

The tertiary phase is an integrative analysis pipeline consisting of heavy data analysis steps to make sense of the data in the context of the research study. Multiple samples need to be brought together along with phenotype and other experimental data. The pipeline steps include annotation with data from public datasets, sequence comparison using the Basic Local Alignment Search Tool (BLAST), statistical analysis, hierarchical clustering, machine learning and/or deep neural networks, reports, dashboards, etc. Variant data is delivered as Variant Call Format (VCF) files. A public cloud such as Azure can offer all the tools required for this phase.
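Tertiary tools consume the VCF files produced upstream. The sketch below parses the fixed columns of a single VCF data line (a minimal Python sketch; the example variant line is illustrative):

```python
def parse_vcf_line(line):
    """Parse the fixed columns of a single VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    return {
        "CHROM": fields[0],        # chromosome
        "POS": int(fields[1]),     # 1-based position
        "ID": fields[2],           # variant identifier, e.g. an rsID
        "REF": fields[3],          # reference allele
        "ALT": fields[4],          # alternate allele(s)
        "QUAL": float(fields[5]),  # call quality
        "FILTER": fields[6],       # PASS or filter name
    }

variant = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29.0\tPASS\tNS=3")
```

Annotation and statistical steps then join records like this one against public datasets.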

4)                   Azure Services for NGS Data Analysis Pipeline

An NGS pipeline with multiple repetitive steps aligns very well with the concept of an ETL (Extract, Transform, and Load) process. In an ETL process, data is acquired (Extracted) from a data source in real-time or batch mode, then processed (Transformed) using computing resources, and finally persisted (Loaded) into a scalable storage or datastore for further advanced analysis and reporting. Azure offers a number of services and tools for deploying and executing an ETL process end-to-end.
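The three ETL stages can be sketched as three composable functions (a toy, stdlib-only Python sketch; the in-memory source and datastore stand in for real sequencers and cloud storage):

```python
def extract(source):
    """Extract: read raw records from a source (here, an in-memory list)."""
    return list(source)

def transform(records):
    """Transform: normalize each record (uppercase sequences, drop empties)."""
    return [r.upper() for r in records if r]

def load(records, datastore):
    """Load: persist transformed records into a datastore (here, a dict)."""
    for i, r in enumerate(records):
        datastore[f"read-{i}"] = r
    return datastore

# Chain the three stages end-to-end
store = load(transform(extract(["gattaca", "", "ttgacc"])), {})
```

In the Azure mapping discussed below, Extract corresponds to data transfer, Transform to the compute services, and Load to the storage tiers and datastores.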

A cloud platform such as Microsoft Azure can offer high-performance, scalable, and cost-effective connectivity (from on-premises), storage, computing, security, and developer tools for end-to-end orchestration of Next Generation Sequencing (NGS) pipelines.

4.1.              Connectivity to Azure Cloud

4.1.1.                  On-Premises Connectivity

Sequencers running in an on-premises lab environment can be securely connected to Azure over either a Virtual Private Network (VPN) or a direct private connection. A VPN provides encrypted communication over the Internet between an on-premises VPN device and an Azure VPN gateway. Typically, in a high-throughput environment such as NGS, a dedicated private connection called ExpressRoute is deployed, which can offer bandwidths of up to 10 Gbps.

4.1.2.       Internet2 Network Connectivity

The Internet2 network is a member-owned advanced technology community founded by the nation's leading higher education institutions. The network provides a collaborative environment for U.S. research and education organizations to solve common technology challenges and to develop innovative solutions in support of their educational, research, and community service missions. Internet2 can offer speeds of up to 100 Gbps.

Microsoft provides direct peering between Azure and Internet2 to help research and educational institutions. Hence, universities that are part of the Internet2 network can connect the DNA sequencers in their R&D labs directly to Azure through a private peering connection.

4.2.              NGS Data Transfer to Azure Cloud

Using these Azure cloud connectivity options, a customer can transfer raw sequence data to Azure either by streaming or in batch mode. Data can be streamed from on-premises sequencers to Azure Blob storage using the Azure Storage APIs, which are offered in several popular programming languages. In batch mode, AzCopy for Windows or Linux can be leveraged to transfer bulk data to the cloud. Data can also be shipped to the closest Azure datacenter using the Azure Import/Export service or the new Azure Data Box service (in preview).
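Bulk transfer tools split large sequence files into blocks that are staged in parallel and then committed as an ordered list of block IDs. The chunking step can be sketched locally as follows (a Python sketch under that block-blob-style assumption; the payload and block size are illustrative, and no actual Azure upload is performed):

```python
import hashlib

def chunk_for_upload(data: bytes, block_size: int):
    """Split a payload into fixed-size blocks, each with a stable block ID.

    Block-blob style uploads stage blocks individually and then commit
    the ordered list of block IDs; here the IDs are content hashes.
    """
    blocks = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_id = hashlib.sha256(chunk).hexdigest()[:16]  # short stable ID
        blocks.append((block_id, chunk))
    return blocks

payload = b"ACGT" * 1024  # 4 KiB of made-up sequence data
blocks = chunk_for_upload(payload, block_size=1024)
```

Chunking is what lets a transfer saturate a high-bandwidth link: blocks can be sent concurrently and retried independently on failure.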

4.3.              NGS Data Encryption

Azure offers Storage Service Encryption (SSE), which encrypts data before writing it to Azure Storage. Azure also offers client-side encryption through storage client libraries that can encrypt data before sending it across the wire from the client to Azure.

4.4.       Data Encryption Keys

Azure Key Vault helps safeguard the cryptographic keys and secrets used by cloud applications. Key Vault allows keys to be protected by Hardware Security Modules (HSMs). Microsoft processes keys in FIPS 140-2 Level 2 validated HSMs (hardware and firmware).

4.5.       Azure Blob Storage for NGS Data

Azure Storage offers three storage tiers for sequencing data based on its usage patterns. The Azure hot storage tier is optimized for storing data that is accessed frequently. The Azure cool storage tier is optimized for storing data that is infrequently accessed and stored for at least a month. The new archive storage tier (in preview) is optimized for storing data that is rarely accessed and stored for at least six months with flexible latency requirements (on the order of hours).

During the data analysis pipeline execution, many data files are generated and used by the various workflows. Depending on the type and frequency of the usage of the data, the following storage options can be employed:
·       BCL files - Raw files containing base calls per cycle. The raw files are analyzed once and rarely used later; hence, they can be moved to the Azure archive tier.
·       FASTQ files - These files store biological sequence and quality score information. They are generated during the initial stages of the analysis pipeline and can then be moved to the Azure archive tier for data compliance needs.
·       BAM files - Binary files that contain sequence alignment data. These files are used frequently in the initial workflow steps and rarely afterwards; hence, they can be moved to the Azure cool storage tier after initial processing.
·       VCF files - Store gene sequence variations in text format. These files are used frequently for further analysis and hence are stored in the Azure hot tier.
·       Reference genome data - A digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. Reference data is used during the alignment and variant analysis steps, and hence can be stored in Azure Blob Storage or on premium Azure Data Disks attached to cloud computing resources for high throughput and IOPS.

The other popular Azure storage options are:
  •  Azure Data Lake Store - a Hadoop-compatible repository for analytics on data of any size, type, and ingestion speed
  •  Azure SQL Data Warehouse - combines the SQL Server relational database engine with massively parallel processing
  • Azure Cosmos DB - a globally distributed database service supporting document, key/value, or graph databases
  • Azure SQL Database - a fully managed relational database-as-a-service using the Microsoft SQL Server engine.
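The file-type-to-tier mapping above can be captured as a small lookup that a pipeline could apply when persisting its outputs (a minimal Python sketch; the mapping simply encodes the access patterns described above):

```python
def storage_tier(file_type: str) -> str:
    """Map an NGS file type to a blob storage tier by access pattern."""
    tiers = {
        "BCL": "archive",    # raw base calls: analyzed once, rarely read again
        "FASTQ": "archive",  # kept mainly for data-compliance needs
        "BAM": "cool",       # used early in the workflow, rarely afterwards
        "VCF": "hot",        # queried frequently during tertiary analysis
    }
    return tiers.get(file_type.upper(), "hot")  # default to hot when unknown
```

Encoding the policy in one place makes it easy to revisit as access patterns (and tier prices) change.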
    

4.6.       Azure Computing Resources for NGS Workloads

A big-data analysis pipeline such as NGS needs to scale computing resources both horizontally (adding computing nodes) and vertically (adding more memory and/or CPU to a single computing node). Azure provides multiple on-demand, auto-scaling, and secure options through managed and unmanaged services.

4.6.1.                   Virtual Machines (VM)

Azure offers virtualized infrastructure using Windows Server or Linux distributions such as Red Hat and Ubuntu. An Azure VM gives the flexibility of virtualization without having to buy and maintain the physical hardware that runs it. However, the customer needs to maintain the VM by performing tasks such as configuring, patching, and installing the software that runs on it.
Azure VMs are well suited for NGS workflows that have high memory/CPU requirements, need full access to the VM, or need to run custom software that can be challenging to run as an Azure managed service (or through containerization). Azure also offers GPU-enabled instances with NVIDIA GPUs, optimized for compute-intensive and network-intensive applications and algorithms, including CUDA- and OpenCL-based simulations.

For deploying, managing, and monitoring VMs, Azure offers multiple cloud services. Some of the key capabilities are:
·       Automated VM configuration
·       Custom VM images (Windows and Linux)
·       Highly available VM scale sets (up to 1,000 VMs) with load balancers
·       VM management with virtual networks
·       VM backup, monitoring, and security
·       CI/CD pipelines using Jenkins and Team Services
·       Infrastructure-as-Code automation tools such as Chef, Puppet, Azure Resource Manager templates, etc.

A VM's lifecycle can be managed through the Azure Portal, the Azure Command-Line Interface (CLI), or Azure PowerShell. It can also be managed through the REST API and SDKs offered in multiple programming languages. Additionally, a few pre-configured VMs with software packages like Galaxy, Bioconductor, etc. are available from Microsoft partners on the Azure Marketplace.

4.6.2.  Azure App Services

Azure App Service facilitates the creation and deployment of web apps in multiple programming languages, the building and consumption of cloud APIs, and the building and hosting of the backend for any mobile app. Azure App Service is a managed service, so all the server infrastructure is managed by the service, including auto-scaling and high availability of the applications.

Azure App Service fits well into the NGS data analysis pipeline steps where the workflow needs to be managed from web and/or mobile applications.

4.6.3.              Azure Container Services and Azure Batch

Tasks such as sequence alignment, variant calling, quality control, rendering, and the analysis and processing of images of the sequencing chemistry can be containerized and orchestrated using Azure Container Service and/or Azure Batch.

4.6.3.1.             Azure Container Services (ACS)

Azure Container Instances offer the fastest and simplest way to run a container in Azure, without having to provision any virtual machines and without having to adopt a higher-level service. As the NGS data pipeline involves multiple steps with variable processing time and scale, containers are an excellent choice for NGS processing. A container image can be built with all the required software modules for a pipeline step or analysis method, added to an automation process or pipeline job, or shared with other collaborators. Azure offers multiple container options.

4.6.3.1.1.       Service Fabric Windows Containers on Azure

Azure Service Fabric is a distributed systems platform for deploying and managing scalable and reliable microservices and containers. The container deployment process consists of four simple steps:
·       Package a Docker image container
·       Configure communication
·       Build and package the Service Fabric application
·       Deploy the container application to Azure

4.6.3.1.2.       Azure Container Service (ACS) for Kubernetes

ACS for Kubernetes makes it simple to create, configure, and manage a cluster of virtual machines that are preconfigured to run containerized applications. ACS combines the enterprise-grade features of Azure and application portability through Kubernetes and the Docker image format.

4.6.3.1.3.       Azure Container Service with DC/OS and Swarm

ACS allows quick deployment of a production-ready DC/OS or Docker Swarm cluster. DC/OS is a distributed operating system based on the Apache Mesos distributed systems kernel. It includes the Marathon orchestration platform for scheduling workloads.

Docker Swarm provides native clustering for Docker. Supported tools for managing containers on a Swarm cluster include Dokku, Docker CLI and Docker Compose, Krane, Jenkins, etc.

4.6.3.1.4.       Azure Container Registry

Azure Container Registry is a managed Docker registry service based on the open-source Docker Registry 2.0. It lets you store and manage private Docker container images. The images can be pulled by scalable orchestration systems such as Kubernetes, DC/OS, and Docker Swarm.

4.6.3.2.             Azure Batch

Azure Batch is a platform service for running large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud. Azure Batch schedules compute-intensive workloads to run on a managed collection of virtual machines and can automatically scale compute resources to meet the needs of computing jobs.

As a managed service, Azure Batch eliminates the manual work of creating, configuring, and managing an HPC cluster, individual virtual machines, virtual networks, or a complex job and task scheduling infrastructure.

A high-level Azure Batch workflow for an NGS data analysis pipeline consists of:
  • Upload DNA sequence data files to Azure Storage. Azure Batch includes built-in support for accessing Azure Blob Storage
  • Upload the analysis modules that the tasks will run
  • Create a pool of compute nodes. As part of this step, the number of compute nodes, their size, operating system, etc. are defined
  • Create a job. The job manages a collection of tasks
  • Add tasks to the job. Each task runs the analysis modules to process the data files. As each task completes, it can upload its output to Azure Storage for further analysis
  • Monitor job progress and retrieve the task output from Azure Storage
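The steps above can be sketched as a local simulation of the fan-out pattern (a Python sketch only; no Azure Batch API calls are made, and the file names and analysis function are fabricated for illustration):

```python
def run_batch_workflow(data_files, analyze, pool_size=2):
    """Simulate a Batch-style workflow: one task per input file,
    tasks spread round-robin across a pool of compute nodes."""
    # Create a pool of (simulated) compute nodes.
    pool = [f"node-{i}" for i in range(pool_size)]
    # Create a job: one task per data file, assigned round-robin to nodes.
    tasks = [
        {"node": pool[i % pool_size], "input": f}
        for i, f in enumerate(data_files)
    ]
    # Run each task and collect its output (in the real service, the
    # output would be uploaded to Azure Storage and monitored).
    return {t["input"]: analyze(t["input"]) for t in tasks}

results = run_batch_workflow(
    ["sample1.fastq", "sample2.fastq", "sample3.fastq"],
    analyze=lambda f: f.replace(".fastq", ".bam"),
)
```

The essential idea is that the per-sample independence of NGS analysis maps directly onto Batch's one-task-per-input model.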

4.7.         Azure Analytic Tools for NGS Data Analysis

4.7.1.                   Big Data Analysis Workloads

4.7.1.1.             Azure HDInsight

Azure HDInsight allows you to process and analyze big data and to develop solutions using Hadoop, Spark, R Server, Storm, and other technologies in the Hadoop ecosystem.

4.7.1.2.             Azure Data Lake Store and Analytics

Azure Data Lake Store provides a hyper-scale, Hadoop-compatible repository for analytics on data of any size, type, and ingestion speed. Azure Analysis Services provides enterprise-grade data modeling in the cloud as a fully managed platform, integrated with Azure data platform services.

Azure Analysis Services allows you to mash up and combine data from multiple sources, define metrics, and secure data in a single, trusted semantic data model. The data model provides an easier and faster way to browse massive amounts of data with client applications like Power BI, Excel, Reporting Services, third-party tools, and custom apps.

4.7.2.                   Building AI and ML Models

Azure Machine Learning (AML) is an integrated, end-to-end data science analytics solution. It enables data scientists to prepare data, develop experiments, and deploy models at cloud scale. The main components of AML are:
·       Azure Machine Learning Workbench
·       Azure Machine Learning Experimentation Service
·       Azure Machine Learning Model Management Service
·       Microsoft Machine Learning Libraries for Apache Spark (MMLSpark Library)
·       Visual Studio Code Tools for AI

Azure Machine Learning fully supports open source technologies and machine learning frameworks such as:
·       Scikit-learn
·       TensorFlow
·       Microsoft Cognitive Toolkit
·       Spark ML

Azure Machine Learning is built on top of the following open source technologies:
·       Jupyter Notebook
·       Apache Spark
·       Docker
·       Kubernetes
·       Python
·       Conda

4.8.      Monitoring and Management

Azure provides a number of services to monitor and manage the NGS end-to-end data analysis pipeline. The key services offered are:
·       Azure Monitor - Highly granular and real-time monitoring data for any Azure resource
·       Application Insights - Detect, triage, and diagnose issues in your web apps and services
·       Azure Cost Management - Track cloud usage and expenditures

4.9.           Security and Compliance

4.9.1.        Azure Security Center

Azure Security Center provides unified security management and advanced threat protection for workloads running in Azure, on-premises, and in other clouds. Security Center facilitates:
·         Unified visibility and control
·         Adaptive threat prevention
·         Intelligent threat detection and response

4.9.2.        Compliance

To help comply with national, regional, and industry-specific requirements governing the collection and use of individual data, Microsoft offers the most comprehensive set of compliance offerings of any cloud service provider. Some of the key compliance offerings for the healthcare industry are:
·         HIPAA/HITECH - Microsoft offers Health Insurance Portability and Accountability Act Business Associate Agreements (BAAs)
·         HITRUST - Azure is certified to the Health Information Trust Alliance Common Security Framework
·         MARS-E - Microsoft complies with the US Minimum Acceptable Risk Standards for Exchanges
·         NEN 7510:2011 - Organizations in the Netherlands must demonstrate control over patient health data in accordance with the NEN 7510 standard
·         NHS IG Toolkit - Azure meets the requirements of the UK National Health Service Information Governance Toolkit

4.10.          Partner Platforms for NGS Data Pipeline

Microsoft has a large partner ecosystem in major industries including healthcare and specifically Genomics. Some of the key cloud partners are:
·         Appistry
·         BC Platforms
·         DNAnexus
·         WuXI NextCODE

5)                   Conclusion

Azure provides several cloud services and tools for building an end-to-end Next Generation Sequencing (NGS) data analysis pipeline. The platform is optimized to address the challenges of security, scalability, and collaboration between organizations and research institutions.
