The topics discussed in this post result from working on jchemhub. It's still a very young project soon to be renamed to kemia.
Some javascript benchmarks
Like I said in the previous post, firefox & chrome javascript performance is impressive Unfortunately, at the other end of the spectrum there is internet explorer. The way simple things like iterating over an array are done can make a huge difference. To illustrate, here are 4 code samples for iterating over the elements of an array. The preferred way is highlighted in green.
// example 1: check arr.length each iteration
for (var i = 0; i < arr.length; i++) {
var tmp = arr[i] + arr[i];
}
// example 2: store arr.length in a variable
for (var i = 0, li = arr.length; i < li; i++) {
var tmp = arr[i] + arr[i];
}
// example 3: for ... in ...
for (var i in arr) {
var tmp = arr[i] + arr[i];
}
// example 4: forEach
function doStuff(element, index, arr) {
var tmp = arr[index] + arr[index];
}
goog.array.forEach(doStuff);


It should be obvious that both for ... in ... and forEach should be avoided at all cost. To non-javascript developers, the for ... in .. syntax might be surprising. It doesn't give you the elements but the keys to the elements. As a result, you still need the var element = elements[i]; line and there is not much benefit over a classic c-style for loop. No optimization is done when running javascript and accessing an arr.length property for each iteration, will be slower than storing the property in a local variable.
The second benchmark shows the overhead of a function call for accessing an object property.
var obj = { x: [], getX: function(index) {return this.x[index]; } };
for (var i = 0; i < 10000; i++) {
obj.x.push({});
}
// example 1: direct access
for (var i = 0; i < 10000; i++) {
obj.x[i];
}
// example 2: wrapper function
for (var i = 0; i < 10000; i++) {
obj.getX(i);
}

A final benchmark shows that common knowledge from using other languages may not always apply. When creating an array of known size, using push() to expand the array is the fastest method in chrome. This is not true for internet explorer and the best compromise are the first two examples.
// example 1: Use goog.array.repeat.
var arr1 = goog.array.repeat(2, 100000);
// example 2: Use access by index
var arr2 = [];
for (var i = 0; i < 100000; i++) {
arr2[i] = 2;
}
// example 3: Allocate using access by index.
var arr3 = [];
arr3[99999] = 2;
for (var i = 0; i < 100000; i++) {
arr3[i] = 2;
}
// example 4: Use push
var arr4 = [];
for (var i = 0; i < 100000; i++) {
arr4.push(2);
}


More of these benchmarks will appear in the future on this page.
Ring perception
There haven't been any posts on this blog about ring perception before. The aim of ring perception is simple, it finds the rings in a molecule. Ring perception is also required in various other fields like electrical engineering. While it may be easy to get the concept, actually implementing it (efficiently) is not straightforward. There are various algorithms to compute the rings in a graph and the results are not the same. The algorithms can be classified by their output ring sets. In this post, exhaustive ring search and the Smallest Set of Smallest Rings (SSSR) will be discussed. An interesting approach combining both algorithms is also presented.
Exhaustive ring search generates a large ring set containing all possible rings. This includes many combinations of rings. The Hanser algorithm is the fastest exhaustive algorithm. However, finding all rings in a reasonable large molecule will be slow. Fortunately, a depth or maximum ring size can be specified. For many structures, rings up to size 6 are often enough and the algorithm is really fast in these cases. Increasing the maximum ring size quickly results in unacceptable performance though. Using other programming languages such as c++ or java, it might be possible to increase the maximal ring size to make it usable for depiction. The implementation in jchemhub is ported from the MX java cheminformatics library.
The Smallest Set of Smallest Rings (SSSR) is a popular kind of ring set. The number of sssr rings in a molecule (single connected fragment) is given by Cauchy's formulla:
m - n + 1 m: number of bonds
n: number of atoms
If the molecule contains more disconnected fragments, 1 can be replaced with the fragment count. Knowing the number of rings in the sssr set in advance can be used for various optimizations. The simplest optimization is to conclude there are no rings if the formula is equal to 0. SSSR rings have their disadvantages though. For example, SSSR rings are not canonical, the found rings will depend on the atom order of the input molecule. This would be a good topic for an additional blog post but for now the SSSR will be good enough.
There are several algorithms to compute the SSSR such as Berger's and Figueras' algorithm. The algorithm in jchemhub implements a recent algorithm that makes use of path-included distance matrices. The algorithm is based on the Floyd-Warhsall algorithm for computing the distance matrix. The element i,j in a distance matrix is the length of the shortest path (i.e. number of bonds) connecting atoms with index i and j. Two additional matrices are created at the same time containing the shortest path and the shortest path+1 between atoms i and j. From these path-included matrices, the ring candidates can be computed. The method has an O(n^3) runtime where n is the number of atoms. This isn't bad for finding the SSSR but using it on large molecules can be time consuming.
The ideal algorithm would be fast when there are only small rings but it should also detect large rings. To solve this problem, both algorithms can be combined. First the Hanser algorithm detects all rings up to size 6. From this ring set, an attempt is made to construct the SSSR using a XOR function to test if a ring is already in the SSSR. If the full SSSR is found (i.e. the number of rings in the set after selection is equal to the calculated number of rings), the molecule doesn't contain any rings larger than 6. However, if the full SSSR set is not found the real SSSR algorithm is used to find all the rings of the SSSR including the larger rings.
To evaluate the combined ring search approach, the use case is important. When doing substructure search, the average time is the most important. If an individual molecule takes some more time to process, thi is not a problem as long as the whole process is fast. The graph below shows the average time for each algorithm. A rapid decrease in performance for the Hanser algorithm can be seen. Also, the combined algorithm is 3.4 times faster than the SSSR algorithm. These tests are done in nodejs.

A second graph shows the worst case for each algorithm. This is relevant for use cases like on-the-fly depiction. Here it doesn't matter if all molecules take 50ms more as long as there are no cases that seem to take forever and make the browser unresponsive. The SSSR algorithm can take up to 6 seconds limiting it's use. The same applies for the Hanser algorithm with larger maximal ring sizes.

In the previous graphs, a simple optimization is used. All terminal atoms are progressively removed until there are no more terminal atoms. This reduces the molecule's size and therefore improves the performance of the ring perception (regardless of the algorithm). This worked great on firefox and chrome but wasn't enough for internet explorer 7. Internet explorer is still widely used and it should be supported. Combining the algorithms and the simple atom removal brought IE7's time for perceiving rings in a 534 atom molecule down to around 10 seconds. To get it down to the current 500ms another trick was needed.
Ring membership for atoms can be computed using an O(n) algorithm. A breath-first iteration is done and visited atoms and bonds are tracked together with the atom depth. When a exploring the bonds from an atom to continue iterating, there will be cases where the bond isn't visited yet but the atom it connects to is. In these cases, the bond is a ring-closure. When a ring-closure is found a backtrack can be performed to set isInCycle properties for the atoms in the ring until the paths converge. Using this atom ring membership information, the molecule can be divided in merged ring systems (i.e. rings with at least one atom in common are part of the same ring system). Running ring perception on these smaller molecules makes a huge difference when using an O(n^3) algorithm. This together with some IE7 specific performance benchmarks allowed the algorithm to be efficient in IE7 too. The worst case for the 23000 molecules in nodejs is now only 49ms.

In conclusion, it is possible to make cheminformatics software using javascript, even for Internet Explorer. Doing so requires a clever combination of algorithms and programming though. The next bottleneck is rendering, VML in IE is slow...
1 comments:
Tim - nice idea on the combined ring algo approach :) - neat.
I have some brief comments about the Lee/Kang/Cho/No (LKCN) algorithm you are using. Can you correct me if I've lost my way in my analysis.
I haven't come across LKCN before but on reading the paper it appears to be a 90% rediscovery of the Balducci/Pearlman (BP) algorithm. There are some key differences however in the path generation phase but otherwise they converge on identical post processing of the joined paths, namely work from smallest to largest ring, do ring uniqueness tests, do ring independence test and terminate at the cyclomatic number of found rings. The differences ...
a) The LKCN algorthm computes the Pe, Pe' PID matrices in advance, whereas the BP is progressive and in effect only computes the equivalent of one of these matrices and then only incompletely up to the point the cyclomatic number of rings is found. (Actually as path messages migrating through the graph). This says that the LKCN does more upfront work as it builds paths that may not get used
b) The BP discovers each even ring N times where N is the number of atoms in the ring. It also discovers each odd ring N times as it uses a simple edge collision test to keep this "path symmetric" (both colliding paths are same length - but overlap by one edge). The LKCN algorithm also finds each even ring N times, but by my reading the odd sized rings are found 2xN times - because of the choice of combining an odd path + even path going both ways around the ring
So LKCN computes more paths than BP and can find the rings more times than BP. Consequently I'm very much doubting the O(V^3) claim. BP erroneously claimed their algo was worst case O(d^3V^2logV) (d is max degree of node) and in chemical graphs d << V -> O(V^2logV). Vismara's thesis reanalysed the BP algo and found it to be O(duE^2V) (u is cyclomatic number)[this was reported in Berger's counterexamples paper].
Its interesting that LKCN paper skips the details of big-oh'ing i) detecting the new ring is identical to one already found - e.g. could be O(N) or O(1) choosing between sequential search and a hash lookup ii) ring independence test - BP use Gaussian elimination - big-oh ? How does the LKCN do these steps and hence the final big-oh quoted is opaque to analysis in the paper??
I've just implemented and optimised the BP algo for my PhD studies, but I haven't tried the LKCN (yet)
You have the LKCN implemented so maybe you can fill in my ??
Thanks, Tony
Post a Comment