Tuesday, September 29, 2009

The automorphism group & bliss

In the previous post, an algorithm for computing the automorphism group was given. Although it drastically reduced the number of permutations to be checked, it was still much to slow for use in cheminformatics systems. A real world example is inositol where there are 2 orbits of 6 atoms. This example was also used for a stereochemistry-puzzler on Depth-First (answer). Since it is a good example (i.e. it contains "large" orbits) to test the performance and correctness (there are only 9 stereoisomers) of an implementation, I'll be using it for the following posts too.



While 6! * 6! = 518400 is already way faster better than 12! = 479001600, computing half a million matrix equations still takes a considerable amount of time. The actual time (seconds to minutes) depends on implementation details but the method is generally inefficient.

A faster set of algorithms to solve this problem are known as backtracking algorithms. Some well known examples include VFLib (for isomorphisms in general) and nauty (automorphisms). Another, relatively new algorithm called bliss (paper available from their site, MIT license) came to my attention. It is a very small library which computes a set of generators of the automorphism group. From these generators, the full automorphism group can easily be constructed. Using the bliss API was easy and straightforward (see permutation.cpp below). Using inositol as example again, bliss gives 2 generators:

Generator: 1 6 5 4 3 2 7 12 11 10 9 8
Generator: 2 3 4 5 6 1 12 7 8 9 10 11

Next the generators' inverses and all possible products can be added. This is repeated until all automorphisms are found. [note: the matrix equation doesn't have to be used, the inverse and products are always part of the automorphism group too -- don't forget the identity permutation :-) ]

The final 12 automorphisms constructed from the given generators are :

1 2 3 4 5 6 7 8 9 10 11 12
1 6 5 4 3 2 7 12 11 10 9 8
2 3 4 5 6 1 12 7 8 9 10 11
6 1 2 3 4 5 8 9 10 11 12 7
2 1 6 5 4 3 12 11 10 9 8 7
6 5 4 3 2 1 8 7 12 11 10 9
5 4 3 2 1 6 9 8 7 12 11 10
4 3 2 1 6 5 10 9 8 7 12 11
3 2 1 6 5 4 11 10 9 8 7 12
5 6 1 2 3 4 9 10 11 12 7 8
4 5 6 1 2 3 10 11 12 7 8 9
3 4 5 6 1 2 11 12 7 8 9 10

This whole computation takes .019s here :-D

The code for the example can be found here: permutation.h & permutation.cpp (requires OB trunk + libbliss)

In the next post, these automorphisms will be used to construct the signed permutation matrix to finally identify identical stereoisomers.

Wednesday, September 16, 2009

Molsketch: Reaction & Mechanism arrows

Here is a short screencast showing some of the new features in Molsketch. First, there are the reaction and mechanism arrows. While there are only two major types of arrows, double clicking an arrow in move mode will bring up a dialog that allows you to change the arrow's appearance.

The second new feature is the .msk file format. Since it is actually cml, it will likely be renamed to cml later. Apart from the molecules, the format also saves the atom/bond colours and both types of arrows.



If I find some time in the near future, I will make a release and build a windows installer so windows users can try it out without going through the difficult build process.

More para-stereocenters, permutation groups and the automorphism group

Rich Apodaca recently blogged about this stereochemistry related series of posts. He correctly pointed out that the automorphism group lies at the heart of the Razinger paper on stereoismer generation. However, before the automorphism group can be used, all stereocenters have to be found.

In a previous post I briefly described rule 1 and gave some examples. Below are more examples of these para-stereocenters identified by applying rule 1. As you will notice, most of these structures contain no stereocenter and are used to avoid false positives in the unit tests.
While rule 1 handles interdependant stereocenters for cycles, both rule 2 and rule 3 handle individual stereocenters and can be applied directly (rule 1 is applied recursively). The rules can schematically be summarized as:

2) tetracoordinate carbon is para-stereocenter if it attached:
2a) 1 or 2 pairs of identical symmetry classes (e.g. 1123, 1122), each pair contains at least :
2a') >= 1 true-stereocenter (have 4 different symmetry classes) -- or --
2a") >= 2 para-stereocenters (from rule 1)
2b) 3 or 4 identical symmetry classes (e.g. 1112, 1111), duplicated symmetry class ligand contains:
2b') >= 2 true-stereocenters -- or --
2b") >= 2 separate assemblies (merged rings) with each >= 2 para-stereocenters

3) double bond is para-stereocenter if terminal with identical symmetry classes contains at least:
3') >= 1 true-stereocenter
3") >= 2 para-stereocenters

Some examples of rule 2 & 3:


Now that all stereocenters are found, a (very) short introduction to permutations and permutation groups will help us get to the automorphism group (This can be skipped if you are familiar with it). A permutations is simply a reordering of a set of numbers. Although there are several ways to represent permutations, the matrix representation is most useful here. A permutation group is unsurprisingly a group or collection of Permutations. (see Razinger paper + wikipedia Permutation Group & Permutation Matrix)

The automorphism group is a subgroup of the Sn group containing all permutations. Since elements are not allowed to repeat, there will be n! permutations in Sn where n is the number of atoms. In general, the term "auto" means self and permutations belonging to the automorphism group change the atom labeling in such a way that the molecule isn't changed (i.e. no bonds broken or neighbors changed, the molecule remains itself). Mathematically this can be expressed using the adjacency matrix (A) and the permutation matrix (P) in the equation:


While a brute force check of all n! permutations would work for small molecules, it is not efficient enough for use with most drug-like molecules. To limit the number of automorphism candidates, symmetry classes can be used to provide an initial partitioning. Atoms with identical symmetry class are said to be in the same orbit. Only permutations within orbits are allowed (i.e. to be part automorphism group). These intra-orbit permutations can easily be computed in c++ using std::next_permutation. These intra-orbit permutations are all candidates for the automorphism group (G) and will be added to G if the equation holds (duplicates are ignored, especially the identity permutation will be duplicated many times).

Next, all inter-orbit permutations still need to be considered. This is achieved by "multiplying" the permutations (I used matrix multiplication but other methods are possible) and this gives rise to a new set of candidates which are also checked using the equation. All permutations in G now make up the automorphism group.

Below are 3 examples to illustrate how these "obrits" limit the number of permutations. The structure on the left, has 2 orbits, each containing 2 atoms. This structures' Sn group has 7! or 5040 permutations. Using the orbits we can reduce the number of candidates to 2! x 2! = 4. The structure in the middle also has 2 orbits but one orbit has 2 atoms and the other has 4 (6! = 720 -> 2! x 4! = 48). The structure on the right has 3 orbits, each containing 2 atoms (9! = 362880 -> 2! x 2! x 2! = 8).


While OpenBabel doesn't have code to compute the automorphism group yet, here is some proof of concept code which will be integrated later. It also contains a method to compute all permutations of the Sn group which can be used in combination with small molecules to validate the automorphism group generating algorithm.

Once the stereocenters are identified and the automorphism group has been found, this information can be used to eventually create a stereoparity matrix. The details and code will follow in a later post but knowing that duplicate rows in this matrix represent the same stereoisomer should be motivating :-) (for both stereoisomer generation & canonicalization)

Friday, September 11, 2009

As promised, here are some molecules with stereogenic units that are considered "hard" to detect. As can be seen from the image, the previous definition of atoms having four different neighbours (by comparing symmetry classes) doesn't always work.



The term para-stereocenter is used and adopted for these cases.[1] In general para means ressemble, so para-stereocenters ressemble true-stereocenters but there are differences. The most important difference is that para-stereocenters depend on each other to be stereogenic. In fact this is rule 1 from the Razinger paper:

  • rule1: para-stereocenters are assigned to be stereogenic if there are at least two of them in a ring assemblage (single ring, sharing atom/bond, bridged, ...). In OpenBabel code this is called mergedRings but it basically means the same thing.


  • This simple rule allows all stereocenters to be found in the given examples. There is also a rule 2 and 3 but these will be covered in the following posts. Still need to commit the code for rule1 later tonight.

    [1] M. Razinger, K. Balasubramanian, M. Perdih, M. E. Munk,
    Stereoisomere Generation in Computer-Enhanced Structure Elucidation,
    J. Chem. Inf. Comput. Si. 1993, 33, 812-825
    http://www.mcs.csueastbay.edu/~kbalasub/reprints/282.pdf

    Symmetry classes: How to get them and what to do with them

    For a long time, I've wanted to make a series of blog posts about my stereochemistry adventures. This is post will mainly be about symmetry but we'll get to it's relationship to stereochemistry at the end. In this post (and the following), symmetry will always mean topological symmetry (or graph symmetry). This doesn't imply any 3D orientation or special symmetry. For example, below is an image with symmetry classes assigned to the atoms (click to enlarge):



    Perhaps, a better title would have been "... and what to do with them in the first place" but this seemed too long. However, to motivate my readers, I'll start by answering the second question. In OpenBabel we currently use the symmetry classes to identify stereogenic units (i.e. tetrahedral atoms, cis/trans double bonds, ...). For example, if an atom contains neighbors with 4 distinct symmetry classes, the atom is considered to be chiral or stereogenic. There are exceptions to this which will be discussed in later posts.



    A second and perhaps more useful everyday example is to determine how many peaks there should be in an 13C-NMR spectrum. Since carbon atoms in the same symmetry class will be in a similar environment, their chemical shifts will be similar. In the example below, you can clearly see that there are 8 different carbon symmetry classes. Two classes have more than one atom and correctly show the symmetry (6 & 8). The spectrum taken from wikipedia doesn't hve a hight enough resolution but the peaks should be there :-)





    For H-NMR this is slightly more complex due to stronger coupling.

    Computing the symmetry classes is actually not that hard. The most common method use is the Morgan's Extended Connectivity (EC) algorithm:

    • Assign each atom it's degree as starting point

    • Iterator over the atoms and assign the summed values for each atom's neighbors to it.

    • Repeat untill the number of distinct classes remains the same for two iterations.


    The degree is the Graph Invariant (GI) here. In reality OpenBabel uses another GI with more discriminating power to avoid isospectral points (i.e. points with the same symmetry class which are not symmetric) but the principle remains the same. The symmetry classes are also renumbered so they are all within 1 ... n (# symmetry classes).

    Next post will be about the identification of stereocenters for symmetrical molecules.

    Avogadro 0.9.8

    Avogadro 0.9.8 was released yesterday and the windows installer including the python bindings and dependencies is available from sourceforge. There are no big new features but several bugs are fixed including some problems with translations. Accents were not displayed correctly and it had something to do with the encoding (UTF-8) of the files but we found a solution and Avogadro looks fine in French.

    Monday, September 7, 2009

    Molsketch

    The last 2 months Nicola Zonta and I have been working on improving Molsketch. While development is still at an early stage and bugs may be present, great progress has been made. Nicola released a version called Zem to go together with the 3D Zodiac program.

    If you want to try it out, a windows installer is available on zeden.org. Linux users can get the current development code for molsketch at my github
    page. NOTE: requires OpenBabel trunk: svn co https://openbabel.svn.sourceforge.net/svnroot/openbabel/openbabel/trunk.

    The zem release doen't include the latest features yet but here is a screencast made with the current github version of molsketch. The big difference with zem 0.3 is the ItemPlugin concept which is illustrated in the screencast. The idea is that more of these plugins can be added. While I don't have an example of this yet I intend to make it possible to use REST API's in this way.

    Feel free to make suggestions, propose ideas, file bugs, contribute, etc.