Studying genome accessibility and filtering SNP data
While the previous recipes gave an overview of Python libraries for working with alignment and variant call data, this recipe concentrates on actually using them with a clear purpose in mind.
If you are using NGS data, chances are that the most important file you will analyze is a VCF file, which is produced by a genotype caller such as SAMtools mpileup or GATK. The quality of your VCF calls may need to be assessed and filtered. Here, we will put in place a framework to filter SNP data. Rather than giving you filtering rules (an impossible task to perform in a general way), we will give you procedures to assess the quality of your data. With these, you can devise your own filters.
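As a rough illustration of what "assessing the quality of your data" can mean in practice, the following minimal sketch summarizes two common per-site annotations, the QUAL column and the INFO DP (depth) key, for biallelic SNPs. It is not this recipe's own code: it uses only the standard library and parses the tab-separated VCF text directly, whereas a real analysis would normally rely on a dedicated VCF library. The function name summarize_vcf and the demo records are hypothetical, and the sketch assumes that your caller actually writes a DP key into the INFO column.

```python
import statistics

BASES = {"A", "C", "G", "T"}


def summarize_vcf(lines):
    """Summarize QUAL and INFO/DP for biallelic SNPs in an iterable of VCF lines.

    Hypothetical helper for illustration only; assumes tab-separated records
    with the eight fixed VCF columns and an optional DP key in INFO.
    """
    quals, depths = [], []
    n_snps = 0
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:  # not a complete VCF record
            continue
        ref, alt, qual, info = fields[3], fields[4], fields[5], fields[7]
        # Keep only biallelic SNPs: one reference base and one alternate base
        if ref not in BASES or alt not in BASES:
            continue
        n_snps += 1
        if qual != ".":  # "." marks a missing value in VCF
            quals.append(float(qual))
        for item in info.split(";"):
            if item.startswith("DP="):
                depths.append(int(item[3:]))
    return {
        "n_snps": n_snps,
        "median_qual": statistics.median(quals) if quals else None,
        "median_depth": statistics.median(depths) if depths else None,
    }


# Made-up records, used only to show the call
demo_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\tPASS\tDP=10",
    "chr1\t200\t.\tC\tT\t30.0\t.\tDP=20;AF=0.5",
    "chr1\t300\t.\tG\tGA\t99.0\tPASS\tDP=40",
    "chr1\t400\t.\tT\tC\t.\tPASS\tDP=30",
]
print(summarize_vcf(demo_lines))
```

For a real file, the same function can be fed an open text handle (for example, one returned by gzip.open(path, "rt") for a compressed VCF). Summaries like these do not tell you where to set a threshold; they only show how an annotation is distributed in your data, which is the starting point for devising a filter.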
Getting ready
In the best-case scenario, you have a VCF file with proper filters applied. If this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this might...