{"id":153,"date":"2013-09-16T20:12:09","date_gmt":"2013-09-16T20:12:09","guid":{"rendered":"http:\/\/evolvedmicrobe.com\/blogs\/?p=153"},"modified":"2016-06-15T21:55:46","modified_gmt":"2016-06-15T21:55:46","slug":"using-selectome-with-net-bio-f-and-r","status":"publish","type":"post","link":"http:\/\/evolvedmicrobe.com\/blogs\/?p=153","title":{"rendered":"Using Selectome with .NET Bio, F# and R"},"content":{"rendered":"The Bio.Selectome namespace has features to query\u00a0<a href=\"http:\/\/selectome.unil.ch\/\">Selectome<\/a>. Selectome is a database that merges data from <a href=\"http:\/\/useast.ensembl.org\/index.html\">Ensembl<\/a>\u00a0and the programs in <a href=\"http:\/\/abacus.gene.ucl.ac.uk\/software\/paml.html\">PAML<\/a> used to compute the ratio of non-synonymous to synonymous (<strong>dN\/dS<\/strong>)\u00a0mutations along various branches of the phylogenetic tree. A low dN\/dS ratio\u00a0indicates that the protein sequence is under strong selective constraint, while a high one indicates that selective constraint is more relaxed. Selectome is\u00a0also a fantastic resource to get <strong>gene trees and multiple sequence alignments<\/strong>.\u00a0Using Selectome and .NET Bio allows you to quickly investigate divergence across\u00a0the vertebrate phylogeny.\r\n\r\nThis page walks through how to query Selectome, convert\u00a0the text data it returns into objects, and then compute and plot various quantities from those objects.\r\n\r\n<br\/>\r\n<p>\r\n<strong><span style=\"font-size: large;\">Example Walk Through.<\/span><\/strong>\r\n<\/p>\r\n<br\/>\r\n<p>\r\n<strong>Step 0: Set up F# \u2013 <\/strong>If you haven\u2019t used F# before, you can download\u00a0<a href=\"http:\/\/tsunami.io\/index.html\" target=\"_blank\">Tsunami<\/a>. Open the program and add references\u00a0<span style=\"line-height: 1.714285714; font-size: 1rem;\">to Bio.Selectome using the #r command and open the relevant namespaces. 
(highlight and hit Alt+Enter\u00a0<\/span><span style=\"line-height: 1.714285714; font-size: 1rem;\">to run in the console).<\/span>\r\n<pre class=\"brush: fsharp\">#r @\"C:\\Users\\Nigel\\Software\\Bio.dll\"\r\n#r @\"C:\\Users\\Nigel\\Software\\Bio.Pamsam.dll\"\r\n#r @\"C:\\Users\\Nigel\\Software\\Bio.Selectome.dll\"\r\n\r\nopen System\r\nopen System.IO\r\nopen Bio.Selectome\r\nopen System.Linq<\/pre>\r\n<\/p>\r\n<br\/>\r\n<strong>Step 1: Make Gene List \u2013 <\/strong>\r\nSelectome requires Ensembl identifiers in queries. To create a set of\u00a0interesting genes, I first downloaded the full set of genes from the\u00a0<a href=\"http:\/\/www.broadinstitute.org\/pubs\/MitoCarta\/\">MitoCarta website<\/a>. These genes are identified by Entrez IDs,\u00a0while Selectome uses Ensembl IDs, so to convert between these I used the\r\n<a href=\"http:\/\/idconverter.bioinfo.cnio.es\/\">GeneID\u00a0converter website<\/a> to create a flat file of the new IDs. Given this\u00a0flat file, we can load it and convert it to typed records as follows:\r\n\r\n<pre class=\"brush: fsharp\">let fname= @\"C:\\SomeFolder\\FullListOfGenes.csv\"\r\nlet data=File.ReadAllLines fname |> Array.map (fun x->x.Split(','))\r\ntype Gene = {Gene: string; EntrezID:string; EnsemblID:string}\r\n// In this flat file, column 3 holds the Ensembl ID and column 4 the gene name.\r\nlet genesForQuery=Array.map (fun (x:string[])-> {Gene=x.[4];EnsemblID=x.[3];EntrezID=\"Not Set\"}) data<\/pre>\r\n<br\/>\r\n\r\n<strong>Step 2: Query Selectome &#8211;\u00a0 <\/strong>All Selectome data is\u00a0accessed through the <code>SelectomeDataFetcher<\/code> class. This class will return a <code>SelectomeQueryResult<\/code>\u00a0that will let you know if the query was successful. Currently, the queries will\u00a0only be successful for genes that exist in the database and have data available\u00a0for the full vertebrate tree. 
If no data is available, the <code>Result<\/code> will be <code>NoResultsFound<\/code>.\u00a0If Selectome returned data but no tree was available for all vertebrates (only, say, for primates),\u00a0<span style=\"line-height: 1.714285714; font-size: 1rem;\">the result will be <\/span><code style=\"line-height: 1.714285714;\">NoVeterbrateTreeDataFound<\/code><span style=\"line-height: 1.714285714; font-size: 1rem;\">. We want to extract genes from query results that successfully returned data for the vertebrate tree.<\/span>\r\n\r\n<pre class=\"brush: fsharp\">let getGene (gene:Gene) : SelectomeQueryResult = SelectomeDataFetcher.FetchGeneByEnsemblID(gene.EnsemblID)\r\n// Keep only the queries that succeeded, then unwrap the gene from each result.\r\nlet successfulQueries =\r\n    genesForQuery\r\n    |> Array.map getGene\r\n    |> Array.filter (fun x->x.Result=QUERY_RESULT.Success)\r\n    |> Array.map (fun x-> x.Gene)<\/pre>\r\n<br\/>\r\n\r\n<strong>Step 3: Determine how many genes show positive selection\u00a0&#8211; <\/strong>F# makes this easy:\r\n<pre class=\"brush: fsharp\">let dividedBySelection=successfulQueries \r\n     |> Array.partition (fun (x:SelectomeGene)-> x.SelectionSignature) \r\nlet showsSelection= \r\n     (float32 (fst dividedBySelection).Length)\/ (float32 successfulQueries.Length)\r\n<\/pre>\r\nInterestingly, roughly 33% of genes show selection, so we know not to get too\r\nexcited by any one result!\r\n<br\/>\r\n<p>\r\n\r\n<strong>Step 4: Download Multiple Sequence Alignments\u00a0<\/strong>\u00a0&#8211; In order to decide how conserved a protein is relative to\u00a0other proteins, we can download the multiple sequence alignment for each protein in this set and compare it to a particular protein of interest.\u00a0 In Selectome, each protein comes with masked and unmasked\u00a0alignments for both the protein and DNA sequences. 
These objects are available from the <code>SelectomeGene<\/code> class and are lazily downloaded from the Selectome server when requested.\u00a0 These sequence\u00a0alignment downloads are also cached for 30 days in a temporary directory, so you do not have to\u00a0wait for downloads again if you rerun your script.\u00a0 Once downloaded, they\u00a0are converted to full-fledged .NET Bio multiple sequence alignments, meaning one\u00a0can do nearly anything with them. The example below gets the masked DNA alignments and the BLOSUM90\u00a0alignment scores for the masked amino acid alignments.\r\n<\/p>\r\n<pre class=\"brush: fsharp\">\r\nlet maskedAlignments = successfulQueries \r\n     |> Array.map (fun x-> x.MaskedDNAAlignment)\r\nlet alignmentScores= successfulQueries \r\n     |> Array.map (fun x-> x.GetMaskedBlosum90AlignmentScore())\r\n<\/pre>\r\n<br\/>\r\n<strong>Step 5: Download the Gene Trees <\/strong>&#8211; The Selectome gene exposes a class, <code>SelectomeTree<\/code>, that provides a great set of functions to query all the\u00a0interesting metadata provided by Selectome. These features are most easily explored by\u00a0using the autocomplete functionality of your editor, and there is a lot of useful information! 
Some\u00a0examples are also shown below.\r\n<p>\r\n<pre class=\"brush: fsharp\">let example=successfulQueries.[0]\r\nlet tree=example.VertebrateTree\r\nlet taxa=tree.TaxaPresent\r\nlet selectedNode=tree.SelectedNodes.[0]\r\nselectedNode.BootStrap\r\nselectedNode.PValue\r\nselectedNode.Name\r\n...<\/pre>\r\n<\/p>\r\nTree queries are also cached locally to avoid going back to the server in the\u00a0event of repeated requests.\r\n<br\/>\r\n<strong>Step 6: Plot distribution of interest using the R type provider &#8211; <\/strong>You\u00a0<span style=\"line-height: 1.714285714; font-size: 1rem;\">can call R plotting functions directly from F# using the\u00a0<\/span><a href=\"https:\/\/github.com\/BlueMountainCapital\/FSharpRProvider\">R type provider<\/a>.\u00a0More information is available from that site, but the code snippet below is\u00a0sufficient to produce a histogram of alignment scores, with no need to dump to a flat\u00a0file first!\r\n<pre class=\"brush: fsharp\">\r\n#r @\"C:\\Users\\Nigel\\packages\\RDotNet.FSharp.0.1.2.1\\lib\\net40\\RDotNet.FSharp.dll\"\r\n#r @\"C:\\Users\\Nigel\\packages\\R.NET.1.5.5\\lib\\net40\\RDotNet.dll\"\r\n#r @\"C:\\Users\\Nigel\\packages\\RProvider.1.0.1\\lib\\RProvider.dll\"\r\nopen RDotNet\r\nopen RProvider\r\nopen RProvider.``base``\r\nopen RProvider.graphics\r\nopen RProvider.stats\r\n \r\n \r\n// Named arguments for R's hist(); alignmentScores comes from Step 4.\r\nlet args = new System.Collections.Generic.Dictionary<string,Object>()\r\nargs.[\"x\"] <- alignmentScores\r\nargs.[\"main\"] <-\"BLOSUM90 Alignment Scores\"\r\nargs.[\"col\"] <-\"blue\"\r\nargs.[\"breaks\"]<-20\r\nargs.[\"xlab\"]<-\"BLOSUM90 Score\"\r\nR.hist(args)\r\n<\/pre>\r\nHuzzah! 
One intermediate machine to rule them all (or at least to avoid useless glue between different libraries\/APIs).\r\n<div align=\"center\"><img alt=\"\" src=\"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/Alignments.png?resize=300%2C240\" align=\"middle\" data-recalc-dims=\"1\" \/><\/div>","protected":false},"excerpt":{"rendered":"The Bio.Selectome namespace has features to query\u00a0Selectome.Selectome is a database that merges data from Ensembl\u00a0and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN\/dS)\u00a0mutations along various branches of the phylogenetic tree. A low dN\/dS ratio\u00a0indicates that the protein sequence is under strong selective constraint, while a high one indicates that [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[18,14,8,3,17],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":188,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=188","url_meta":{"origin":153,"position":0},"title":"The .NET Bio BAM Parser is Smoking Fast","date":"October 12, 2013","format":false,"excerpt":"The .NET Bio library has an improved version of it's BAM file\u00a0parser, which makes it significantly faster and easily competitive with the\u00a0current standard C coded SAMTools for obtaining\u00a0sequencing data and working with it. 
The chart below compares the time it\u00a0takes in seconds for the old version of the parser and\u2026","rel":"","context":"In &quot;.NET Bio&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/10\/img5.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":398,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=398","url_meta":{"origin":153,"position":1},"title":".NET Bio is Significantly Faster on .Net Core 2.0","date":"November 5, 2017","format":false,"excerpt":"Summary: With the release of .NET Core 2.0, .NET Bio is able to run significantly faster (~2X) on Mac OSX due to better compilation and memory mangement. The .NET Bio\u00a0library contains libraries for genomic data processing tasks like parsing, alignment, etc. that are too computationally intense to be\u00a0undertaken with interpreted\u2026","rel":"","context":"In \".NET Bio\"","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2017\/11\/Benchmark-1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":71,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=71","url_meta":{"origin":153,"position":2},"title":"Java vs. C# Performance Comparison for Parsing VCF Files","date":"May 26, 2013","format":false,"excerpt":"Making a comparison with a reasonably complex program ported between the two languages. 
Update 3\/10\/2014: After writing this post I changed the C# parser to remove an extra List<> allocation in the C# code that was not in the Java code.\u00a0\u00a0After this, the Java\/C# versions are indistinguishable on speed, but\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/05\/image_thumb1.png?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":91,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=91","url_meta":{"origin":153,"position":3},"title":"Accessing dbSNP with C# and the .NET Platform","date":"August 22, 2013","format":false,"excerpt":"NCBI Entrez can be accessed with many different platforms (python, R, etc.) , but I find .NET one of the best because the static typing makes it easy to infer what all the datafields mean, and navigate the data with much greater ease. Documentation is sparse for this task, but\u2026","rel":"","context":"In &quot;Bioinformatics&quot;","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":35,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=35","url_meta":{"origin":153,"position":4},"title":"Comparing data structure enumeration speeds in C#","date":"March 1, 2013","format":false,"excerpt":"Determining which data structure to use for storing data involves trade-offs between how much memory they require and how long different operations, such as insertions, deletions, or searches take.\u00a0 In C#, using a linear array is the fastest way to enumerate all of the objects in a collection.\u00a0 However, the\u2026","rel":"","context":"Similar post","img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":112,"url":"http:\/\/evolvedmicrobe.com\/blogs\/?p=112","url_meta":{"origin":153,"position":5},"title":"Mono.Simd and the Mandlebrot Set.","date":"September 10, 2013","format":false,"excerpt":"C# and .NET are some of the fastest high level languages, but still cannot truly compete with 
C\/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the\u2026","rel":"","context":"In &quot;Algorithms&quot;","img":{"alt_text":"","src":"https:\/\/i0.wp.com\/evolvedmicrobe.com\/blogs\/wp-content\/uploads\/2013\/09\/img2_thumb.gif?resize=350%2C200","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/153"}],"collection":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=153"}],"version-history":[{"count":22,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/153\/revisions"}],"predecessor-version":[{"id":354,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=\/wp\/v2\/posts\/153\/revisions\/354"}],"wp:attachment":[{"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=153"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=153"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/evolvedmicrobe.com\/blogs\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=153"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}