The Bio.Selectome namespace has features to query Selectome. Selectome is a database that combines data from Ensembl with results from the programs in PAML, which compute the ratio of non-synonymous to synonymous (dN/dS) mutations along various branches of the phylogenetic tree. A low dN/dS ratio indicates that the protein sequence is under strong selective constraint, while a high one indicates that selective constraint is more relaxed. Selectome is also a fantastic resource for gene trees and multiple sequence alignments. Using Selectome and .NET Bio allows you to quickly investigate divergence across the vertebrate phylogeny.
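For reference, the ratio is conventionally written as ω. The thresholds below are the standard textbook interpretation, given here as background rather than as a description of how Selectome itself classifies branches:

```latex
\omega = \frac{d_N}{d_S}, \qquad
\begin{cases}
\omega < 1 & \text{purifying (negative) selection: strong constraint} \\
\omega \approx 1 & \text{neutral evolution: relaxed constraint} \\
\omega > 1 & \text{positive (diversifying) selection}
\end{cases}
```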
This page walks through how to query Selectome, convert the text data it returns into objects, and then compute and plot various quantities from those objects.
Example Walk Through.
Step 0: Set up F# – If you haven’t used F# before, you can download Tsunami. Open the program, add references to Bio.Selectome using the #r command, and open the relevant namespaces (highlight the code and hit Alt+Enter to run it in the console).
    #r @"C:\Users\Nigel\Software\Bio.dll"
    #r @"C:\Users\Nigel\Software\Bio.Pamsam.dll"
    #r @"C:\Users\Nigel\Software\Bio.Selectome.dll"
    open System
    open System.IO
    open Bio.Selectome
    open System.Linq
Step 1: Make Gene List – Selectome requires that Ensembl identifiers be used in queries. To create a set of interesting genes, I first downloaded the full set of genes from the MitoCarta website. These genes are identified by Entrez IDs, while Selectome uses Ensembl IDs, so to convert between them I used the GeneID converter website to create a flat file of the new IDs. Given this flat file, we can load it and convert it to typed records as follows:
    let fname = @"C:\SomeFolder\FullListOfGenes.csv"
    let data =
        File.ReadAllLines fname
        |> Array.map (fun x -> x.Split(','))
    type Gene = { Gene: string; EntrezID: string; EnsemblID: string }
    let genesForQuery =
        Array.map (fun (x: string[]) -> { Gene = x.[4]; EnsemblID = x.[3]; EntrezID = "Not Set" }) data
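The loader above assumes every row has at least five columns and a non-empty Ensembl ID. A minimal sketch of a more defensive variant is shown here, using in-memory sample rows instead of the real file. The sample rows, the placeholder IDs, and the names `tryParseGene` and `genesForQuerySketch` are all hypothetical; only the column positions (3 for the Ensembl ID, 4 for the gene name) come from the code above.

```fsharp
type Gene = { Gene: string; EntrezID: string; EnsemblID: string }

// Hypothetical sample rows standing in for FullListOfGenes.csv.
// The IDs are placeholders, not real Ensembl identifiers.
let sampleLines =
    [| "a,b,c,ENSG_PLACEHOLDER_1,GENE1"
       "a,b,c,,GENE2"
       "too,short"
       "a,b,c,ENSG_PLACEHOLDER_2,GENE3" |]

// Returns Some Gene for a usable row, and None when the row has
// fewer than five columns or an empty Ensembl ID column.
let tryParseGene (line: string) : Gene option =
    let fields = line.Split(',')
    if fields.Length > 4 && fields.[3].Trim() <> "" then
        Some { Gene = fields.[4].Trim()
               EnsemblID = fields.[3].Trim()
               EntrezID = "Not Set" }
    else
        None

// Array.choose keeps only the rows that parsed successfully.
let genesForQuerySketch = sampleLines |> Array.choose tryParseGene

genesForQuerySketch
|> Array.iter (fun g -> printfn "%s -> %s" g.Gene g.EnsemblID)
```

With the real file, `sampleLines` would be replaced by `File.ReadAllLines fname`; whether the file has a header row to skip depends on how the converter exported it.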
Step 2: Query Selectome – All Selectome data is accessed through the SelectomeDataFetcher class. This class returns a SelectomeQueryResult that tells you whether the query was successful. Currently, queries succeed only for genes that exist in the database and have data available for the full vertebrate tree. If no data is available, the Result will be NoResultsFound; if Selectome returned data but no tree was available for all vertebrates (only for primates, for example), the Result will be NoVeterbrateTreeDataFound. We want to extract the genes from query results that successfully returned data for the vertebrate tree.
    let getGene (gene: Gene) : SelectomeQueryResult =
        SelectomeDataFetcher.FetchGeneByEnsemblID(gene.EnsemblID)
    // Keep only the genes whose query returned vertebrate-tree data.
    let successfulQueries =
        genesForQuery
        |> Array.map getGene
        |> Array.filter (fun x -> x.Result = QUERY_RESULT.Success)
        |> Array.map (fun x -> x.Gene)
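Before discarding the unsuccessful queries, it can be useful to see how many fell into each outcome. The sketch below shows one way to tally them with `Array.countBy`. It runs without a Selectome connection because it uses stand-in types: `QueryStatus`, `FakeResult`, and the sample data are hypothetical local mirrors of the three outcomes described above, not the Bio.Selectome types.

```fsharp
// Stand-in types, NOT the Bio.Selectome API: local mirrors of the
// three outcomes described in the text.
type QueryStatus =
    | Success
    | NoResultsFound
    | NoVeterbrateTreeDataFound

type FakeResult = { Id: string; Status: QueryStatus }

// Hypothetical results, as if four queries had been run.
let fakeResults =
    [| { Id = "ID_A"; Status = Success }
       { Id = "ID_B"; Status = NoResultsFound }
       { Id = "ID_C"; Status = Success }
       { Id = "ID_D"; Status = NoVeterbrateTreeDataFound } |]

// Count how many queries ended in each status.
let statusCounts = fakeResults |> Array.countBy (fun r -> r.Status)

statusCounts
|> Array.iter (fun (status, n) -> printfn "%A: %d" status n)

// Keep only the IDs whose query succeeded, mirroring the filter above.
let succeededIds =
    fakeResults
    |> Array.filter (fun r -> r.Status = Success)
    |> Array.map (fun r -> r.Id)
```

With the real library, the same `Array.countBy` call could presumably be applied to the `Result` property of each query result before filtering.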
Step 3: Determine how many genes show positive selection – F# makes this easy:
    let dividedBySelection =
        successfulQueries
        |> Array.partition (fun (x: SelectomeGene) -> x.SelectionSignature)
    let showsSelection =
        (float32 (fst dividedBySelection).Length) / (float32 successfulQueries.Length)
Interestingly, roughly 33% of genes show selection, so we know not to get too excited by any one result!
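The partition-and-divide step can be tried without a Selectome connection. The sketch below uses a stand-in record (`FakeGene` and its sample values are hypothetical) that carries only the Boolean flag the partition depends on, and adds a guard for an empty input.

```fsharp
// Stand-in record: only the Boolean flag used by the partition matters.
type FakeGene = { Name: string; SelectionSignature: bool }

let fakeGenes =
    [| { Name = "g1"; SelectionSignature = true }
       { Name = "g2"; SelectionSignature = false }
       { Name = "g3"; SelectionSignature = false }
       { Name = "g4"; SelectionSignature = true } |]

// Array.partition returns (items where the predicate holds, the rest).
let selected, notSelected =
    fakeGenes |> Array.partition (fun g -> g.SelectionSignature)

// Guard against an empty input so the division is always defined.
let fractionSelected =
    if fakeGenes.Length = 0 then 0.0
    else float selected.Length / float fakeGenes.Length

printfn "%d of %d genes flagged (%.2f)" selected.Length fakeGenes.Length fractionSelected
```

Destructuring the tuple into `selected` and `notSelected` avoids the repeated `fst` and `snd` calls, which is a matter of taste.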
Step 4: Download Multiple Sequence Alignments – To decide how conserved a protein is relative to other proteins, we can download the multiple sequence alignment for each protein in this set and compare it to a particular protein of interest. In Selectome, each protein comes with a masked and an unmasked alignment for both protein and DNA sequences. These objects are available from the SelectomeGene class and are lazily downloaded from the Selectome server when requested. The alignment downloads are also cached for 30 days in a temporary directory, so you do not have to wait for them again if you reload your script. Once downloaded, they are converted to full-fledged .NET Bio multiple sequence alignments, meaning one can do nearly anything with them. The example below gets the masked DNA alignment and the BLOSUM90 alignment score for the masked amino acid alignments.
    let maskedAlignments =
        successfulQueries
        |> Array.map (fun x -> x.MaskedDNAAlignment)
    let alignmentScores =
        successfulQueries
        |> Array.map (fun x -> x.GetMaskedBlosum90AlignmentScore())
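To illustrate the caching behaviour described above, here is a minimal sketch of a file cache with an expiry age. It is not the Bio.Selectome implementation, whose internals are not shown here; `fetchWithCache`, `fakeDownload`, the cache directory name, and the `.txt` file layout are all assumptions made for the example.

```fsharp
open System
open System.IO

// Hypothetical helper, not part of Bio.Selectome: returns the cached text
// for `key` if its cache file is younger than `maxAge`; otherwise calls
// `download`, stores the result on disk, and returns it.
let fetchWithCache (cacheDir: string) (maxAge: TimeSpan) (key: string) (download: unit -> string) : string =
    Directory.CreateDirectory(cacheDir) |> ignore
    let path = Path.Combine(cacheDir, key + ".txt")
    let isFresh =
        File.Exists(path)
        && DateTime.UtcNow - File.GetLastWriteTimeUtc(path) < maxAge
    if isFresh then
        File.ReadAllText(path)
    else
        let text = download ()
        File.WriteAllText(path, text)
        text

// Usage: a fake download that counts how often it is called.
let downloads = ref 0
let fakeDownload () =
    downloads.Value <- downloads.Value + 1
    ">seq1\nATGC"

// A fresh directory per run, so the first call always has to download.
let cacheDir =
    Path.Combine(Path.GetTempPath(), "selectome-cache-sketch-" + Guid.NewGuid().ToString("N"))
let maxAge = TimeSpan.FromDays 30.0

let first = fetchWithCache cacheDir maxAge "example-gene" fakeDownload
let second = fetchWithCache cacheDir maxAge "example-gene" fakeDownload
printfn "downloads performed: %d" downloads.Value
```

The second call is served from disk, so the fake download runs only once. Using the file's last-write time as the age stamp keeps the sketch free of any separate index file.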
Step 5: Download the Gene Trees – The Selectome gene defines a class, SelectomeTree, that provides a great set of functions to query all the interesting metadata provided by Selectome. These features are best explored using the autocomplete functionality of your editor, as there is a lot of useful information. Some examples are shown below.
    let example = successfulQueries.[0]
    let tree = example.VertebrateTree
    let taxa = tree.TaxaPresent
    let selectedNode = tree.SelectedNodes.[0]
    selectedNode.BootStrap
    selectedNode.PValue
    selectedNode.Name
    ...
Tree queries are also cached locally to avoid going back to the server in the event of repeated requests.
Step 6: Plot distribution of interest using the R type provider – You can call R plotting functions directly from F# using the R type provider. More information is available from the R type provider site, but the code snippet below is sufficient to produce a histogram of alignment scores, with no need to dump to a flat file first!
    #r @"C:\Users\Nigel\packages\RDotNet.FSharp.0.1.2.1\lib\net40\RDotNet.FSharp.dll"
    #r @"C:\Users\Nigel\packages\R.NET.1.5.5\lib\net40\RDotNet.dll"
    #r @"C:\Users\Nigel\packages\RProvider.1.0.1\lib\RProvider.dll"
    open RDotNet
    open RProvider
    open RProvider.``base``
    open RProvider.graphics
    open RProvider.stats

    // Named arguments for R's hist(), keyed by parameter name.
    let args = new System.Collections.Generic.Dictionary<string,Object>()
    // alignmentScores is the array computed in Step 4.
    args.["x"] <- alignmentScores
    args.["main"] <- "BLOSUM90 Alignment Scores"
    args.["col"] <- "blue"
    args.["breaks"] <- 20
    args.["xlab"] <- "BLOSUM90 Score"
    R.hist(args)
Huzzah! One intermediate machine to rule them all (or at least to avoid useless glue between different libraries/APIs).