Monthly Archives: September 2013

Using Selectome with .NET Bio, F# and R

The Bio.Selectome namespace has features to query Selectome.Selectome is a database that merges data from Ensembl and the programs in PAML used to compute the ratio of non-synonymous to synonymous (dN/dS) mutations along various branches of the phylogenetic tree. A low dN/dS ratio indicates that the protein sequence is under strong selective constraint, while a high one indicates that selective constraint is more relaxed. Selectome is also a fantastic resource to get gene trees and multiple sequence alignments. Using selectome and .NET Bio allows you to quickly investigate divergence across the vertebrate phylogeny. This page gives a walk of how to query selectome, convert the text data it returns in to objects, and the compute and plot various quantities from those objects.

Example Walk Through.


Step 0: Setup F#–If you haven’t used F# before, you can download Tsunami. Open the program and add references to Bio.Selectome using the #r command and open the relevant namespaces. (highlight and hit Alt+Enter to run in the console).


Step 1: Make Gene List – Selectome requires ensembl identifiers be used in queries.To create a set of interesting genes, I first downloaded the full set of genes from the MitoCarta website. These genes are identified by Entrez IDs, while selectome uses Ensembl IDs, so to convert between these I used the GeneID converter website to create a flatfile of the new ids. Given this flatfile, we can load it and convert it to typed classes as follows:
Step 2: Query Selectome –  All selectome data is accessed through the SelectomeDataFetcher class. This class will return a SelelctomeQueryResult that will let you know if the query was successful. Currently, the queries will only be successful for genes that exist in the database and have data available for the full vertebrate tree. If no data is available the Result will be NoResultsFound, if selectome returned data but there was no tree available for all vertebrates(but maybe just primates) the result will be NoVeterbrateTreeDataFound. We want to extract genes from query results that successfully returned data for the vertebrate tree.
Step 3: Determine how many genes show positive selection – F# makes this easy: Interestingly, roughly 33% of genes show selection, so we know not to get too excited by any one result!

Step 4: Download Multiple Sequence Alignments  – In order to decide how conserved a protein is relative to other proteins, we can download the multiple sequence alignment for each protein in this set and compare it to a particular protein of interest.  In Selectome, each protein comes with a masked and unmasked alignment for both proteins and DNA. These objects are available from the SelectomeGene class and are lazily downloaded when requested from the Selectome server.  These sequence alignment downloads are also cached for 30 days in a temporary directory to avoid waiting for downloads if you want to reload your script of interest.  Once downloaded they are converted to full-fledged .NET BIO multiple sequence alignments, meaning one can do nearly anything with them. The example below gets the DNA alignment and the BLOSUM90 alignment score for the masked amino acid alignments.


Step 5: Download the Gene Trees – The selectome gene defines a class, SelectomeTree, that provides a great set of functions to query all the interesting metadata provided by selectome. These features are most usefully investigated by using the autocomplete functionality of your editor, but there is a lot of useful information! Some examples are also shown below.

Tree queries are also cached locally to avoid going back to the server in the event of repeated requests.
Step 6: Plot distribution of interest using the R data provider – You can call R plotting functions directly from F# using the R type provider. More information is available from that site, but the code snippet below is sufficient to produce a histogram of alignment scores, no need to dump to a flat file first! Huzzah! One intermediate machine to rule them all (or at least to avoid useless glue between different libraries/APIs).

Mono.Simd and the Mandlebrot Set.

C# and .NET are some of the fastest high level languages, but still cannot truly compete with C/C++ for low level speed, and C# code can be anywhere from 20%-300% slower. This is despite the fact that the C# compiler often gets as much information about a method as the C compiler.  It has been suggested that SSE/SIMD instructions in C# could overcome some of these issues.  To evaluate this, I took a famous computational task, re-coded it using C# SIMD instructions, and evaluated the performance by looking at the execution time and how the emitted assembly code compared to the optimal assembly code.

Background

In 2009, Miguel de Icaza demonstrated a framework that allows C# to use SSE2 intrinsic operations.This is now part of the Mono library and in theory, such methods can greatly save computational time (1.5X-16X), particularly for operations on byte arrays, a common task in bioinformatics.

Test Case

Computing the mandelbrot set is one of the tasks of the computer benchmarks game, which compares program speed in different languages.  Currently, C# is listed as being slower than Java, though the programs in each language use different techniques and neither uses an optimal algorithm (see here for a better one).  However, it makes a useful test case for raw computing speed.  The challenge is to compute the picture shown below by recursively squaring a complex number and adding a constant to it.

z_{n+1} = z_n^2 + c

image_thumb[1]

The Algorithm

Complex numbers are a particularly challenging case because their multiplication is not a simple operation, but involves a rather annoyingly squaring of different terms and then adding/subtracting them.

image_thumb1

These are not easy vector operations, and as will be shown later one result of this is that using SSE to speed up the values for one multiplication is useless (I tried, it was worse). So, the performance improvement comes from doing two elements of the loop at a time (this is how the current Java submission is faster than the C# one, though it does not use SIMD).  The current C# version does the inner loop of the program without SIMD instructions as follows:


This loop is iterating until convergence (<4) or until the max iterations (i<0) have expired. See the wikipedia page for a full explanation.  Of interest here is that the “t” variables exist for pipelining purposes, so this is a reasonably optimized code snippet.

Trying to Implement the Optimal Solution Using Mono.Simd

Actually figuring out how to do the complex arthimetic with SSE2 is rather challenging.  Fortunately Intel published a document giving the best solution as hand coded assembly, which involves using some special SSE3 instructions I was not aware of.  Notably, the Intel optimized code is far faster than even their C code, but is only about 25% better than their optimized assembly code without SSE3.

The code below shows my attempt to implement the Intel approach in Mono.  It should be pretty fast, however it may suffer from one fatal flaw.  The comparison at the end to check for the final conditions currently requires unpacking both values in the Vector2D (when either Length.X or Length.Y have been < 4.0 at least once, then the loop stops).  The comparison for both X and Y can be done in one operation in SIMD using the built in less than statement.  However, I do not know how to turn that into a control loop function, as this requires a movmskps assembly function which mono does not seem to expose.

Results – It’s Faster on Ubuntu

img2

Shown above are the times for three different algorithms. The first is the original best submission on the website.  My SIMD submission is shown on the bottom.  Because the SIMD version does two cycles within each inner loop, as the Java submission does, I tested the Java submission converted to C# as well.  Compared to the current submission this version shows a 81% improvement, but clearly much of that is simply from doing 2 cycles in one loop.  However, even once this change is made, the SIMD instructions still give a performance boost.

The Assembly Generated is Still Not Optimal

Although the assembly generated did include lots of SSE commands, in general inspecting the assembly I noticed several things.

  1. Never unpack the double array values!  My first attempt tried to do the SSE2 steps with those instructions, and then unpack the single values as needed.  However, this failed prettty miserably, as accessing the double seemed to involve a lot of stack to register movement.

  2. All the XMM registers are not used.  The optimal version uses all of them, the C# version uses only 0-3, and soften moves things to and from the stack. Not using registers optimally seems to be a common problem though with C#.

Conclusions

The SIMD version was indeed a fair bit faster, which is nice!  However, in this case it was not a game-changer.  Most importantly though, it was really easy to use, and I think I might incorporate the byte array operations at some point in future code.  This also gave me an appreciation for assembly, which in contrast to what I had heard is easy to read and seems easy to optimize.  I just submitted the code to the shoot-out, assuming it works there it should be up soon, and I would love for someone to show me how to fix the end statement.