Bioinformatics with Python Cookbook - Second Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data. This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries. This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark. By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

Computing sequencing statistics using Spark


If you need parallel computing, Spark is an alternative to Dask. It works at a slightly higher level of abstraction, which gives you less granular control over the computation but makes the code more declarative. Spark is also somewhat language agnostic: it is itself Java/Scala-based and runs on the JVM, but it can be driven from other languages, including Python. Here, we will compute some very basic statistics over the Parquet dataset that we generated in the previous recipe.
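To give a feel for that declarative style, here is a minimal sketch using PySpark's DataFrame API. It is not the recipe's own code: the function name `basic_stats`, the `local[*]` master, and the application name are assumptions made for illustration, and the path you pass in must point to your own Parquet dataset.

```python
def basic_stats(parquet_path, master="local[*]"):
    """Print the schema, row count, and summary statistics of a Parquet dataset.

    Illustrative sketch only; the name and defaults are assumptions.
    """
    # Imported lazily so this function can be defined even where
    # PySpark is not installed; the import fails only when it is called.
    from pyspark.sql import SparkSession

    # Assumption: a local Spark session. Change `master` to point at
    # your own Spark server if you are running one.
    spark = (
        SparkSession.builder
        .master(master)
        .appName("parquet-basic-stats")
        .getOrCreate()
    )
    try:
        # Describe what we want; Spark decides how to split and run the work.
        df = spark.read.parquet(parquet_path)
        df.printSchema()

        n_rows = df.count()
        print("rows:", n_rows)

        # describe() reports count, mean, stddev, min, and max
        # for the numeric and string columns.
        df.describe().show()
        return n_rows
    finally:
        # Always release the session, even if reading fails.
        spark.stop()
```

Calling `basic_stats` with the path to a Parquet directory prints the schema and summary table and returns the row count. Note that nothing in the function says how the data is partitioned or scheduled, which is the trade-off described above.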

Getting ready

Preparing for this recipe can be quite tricky. First, we will have to start a Spark server. At the time of writing this book, the conda packages for accessing Spark were quite immature. We will still use conda here, but we will not install any Spark packages from conda. Follow these steps to prepare the environment:

  1. Make sure that you have Java 8 installed. Be careful with the Java version: an older version will not work, and a newer one may also be problematic.
  2. Download Spark (https://spark.apache.org/downloads.html). This code was tested...