Having a genome sequence is interesting, but we will want to extract features from it, such as genes, exons, and coding sequences. This type of annotation information is made available in Generic Feature Format (GFF) and General Transfer Format (GTF) files. In this recipe, we will look at how to parse and analyze GFF files, using the annotation of the Anopheles gambiae genome as an example.
Use the Chapter03/Annotations.ipynb Notebook file, which is provided in the code bundle of this book.
Let's take a look at the following steps:
import gffutils
import sqlite3

try:
    db = gffutils.create_db('gambiae.gff.gz', 'ag.db')
except sqlite3.OperationalError:
    db = gffutils.FeatureDB('ag.db')
The gffutils library creates a SQLite database to store annotations efficiently. Here, we will try to create the database, but if it already exists, we will use the existing...
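The create-or-reuse pattern can be illustrated on its own, without gffutils or an annotation file. The following is a minimal sketch using only the standard library's sqlite3 module. The function name open_or_create and the table features_demo are hypothetical names chosen for this illustration, and the table does not reproduce the schema that gffutils actually writes. It only shows why a second attempt to create tables in an existing database file surfaces as sqlite3.OperationalError, which is the exception the recipe catches:

```python
import os
import sqlite3
import tempfile


def open_or_create(db_path):
    """Try to create a demo table; if it already exists, reuse the database.

    Returns a (connection, created) pair, where created tells us whether
    this call built the table or found it already in place.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical table for illustration only (not the gffutils schema).
        # SQLite raises OperationalError if the table already exists.
        conn.execute(
            "CREATE TABLE features_demo (id TEXT PRIMARY KEY, featuretype TEXT)"
        )
        conn.commit()
        created = True
    except sqlite3.OperationalError:
        # The database was populated by an earlier run, so we just reuse it.
        created = False
    return conn, created


# First call creates the table; the second call finds it and reuses the file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn1, first_created = open_or_create(demo_path)
conn1.close()
conn2, second_created = open_or_create(demo_path)
conn2.close()
print(first_created, second_created)
```

Catching the exception keeps the notebook cell safe to rerun: the expensive build happens only once, and later runs simply open the file that is already on disk.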