Clustering of ligand cavity shape

In this tutorial, we will perform clustering of ligand cavity. This tutorial can be used as a giude to perform protein channel shape clustering to analyse different shapes of channel during the simulations. The protein considered in this tutorial has ligand cavity burried deep inside the protein and cavity is almost linear.

Final result

429c7d9de1514d59a808695740d057cd

Instructions

  • Tutorial files: The tutorial files can be downloaded from here.

  • Extract the files: tar -zxvf protein-pca.tar.gz

  • Go to directory: cd protein-pca

  • Copy the Jupyter Notebook: This notebook is available in the GitHub repo. Download and copy it from the github.

Required Tools

Steps

  1. Preparation of reference structure

  2. Generate a new trajectory aligned on reference structure

  3. Plotting radius and residues distribution

  4. Calculation of cavity/channel radius

  5. Create feature-file

  6. Clustering using radius as features

  7. Analysis

1. Preparation of reference structure

A reference structure need to be prepared where cavity/channel axis align with one of the principal axis. For protein channel, it might be already done as channel-axis is mostly aligned along z-axis. In this case, I have already aligned the ligand-cavity axis along the Y-axis and it is provided as inputs/ref_cavity.pdb.

097f7af0c3ab4cda828bb3f38747a086

2. Generate a new trajectory superimposed on reference structure

gmx_clusterByFeatures hole has a functionality to superimpose the conformation for consistent calculation of cavity/channel radius along the given vector However, during clustering, the output central structures and cluster trajectory will be not superimposed and therefore, it is good to superimpose whole trajectory on reference structure so that cluster’s central structures orientations will be same as of reference structure.

We will use gmx trjconv to create a new superimposed trajectory with reference structre as follows.

[3]:
%%bash

rm traj_superimposed.xtc

echo 11 0 | gmx trjconv -s inputs/ref_cavity.pdb -f inputs/onlyProtein.xtc -n inputs/onlyProtein.ndx \
                        -fit rot+trans -o traj_superimposed.xtc
                     :-) GROMACS - gmx trjconv, 2025.2 (-:

Executable:   /opt/gromacs-2025/bin/gmx
Data prefix:  /opt/gromacs-2025
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-pca
Command line:
  gmx trjconv -s inputs/ref_cavity.pdb -f inputs/onlyProtein.xtc -n inputs/onlyProtein.ndx -fit rot+trans -o traj_superimposed.xtc

Will write xtc: Compressed trajectory (portable xdr format): xtc
WARNING: all CONECT records are ignored

WARNING: If there are molecules in the input trajectory file
         that are broken across periodic boundaries, they
         cannot be made whole (or treated as whole) without
         you providing a run input file.

Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Select a group: Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Reading frame       0 time    0.000
Precision of inputs/onlyProtein.xtc is 0.001 (nm)
Using output precision of 0.001 (nm)
Reading frame   22000 time 2200000.000    ->  frame  21999 time 2199900.000

Last written: frame  22500 time 2250000.000


GROMACS reminds you: "We can make it into a friend class. But I don't like having friends." (Joe Jordan)

Note that major changes are planned in future for trjconv, to improve usability and utility.

WARNING: Masses and atomic (Van der Waals) radii will be guessed
         based on residue and atom names, since they could not be
         definitively assigned from the information in your input
         files. These guessed numbers might deviate from the mass
         and radius of the atom type. Please check the output
         files if necessary. Note, that this functionality may
         be removed in a future GROMACS version. Please, consider
         using another file format for your input.

Select group for least squares fit
Selected 11: 'r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444'
Select group for output
Selected 0: 'System'

3. Calculation of cavity/channel radius

We will use HOLE2 tool to calculate ligand cavity. It should be installed and hole should be in PATH. HOLE accept only one pdb file as input. gmx_clusterByFeatures hole use external hole and calculate radius from each frame automating the entire calculation for the trajectory.

Following options are used here:

  • -catmid 1903 - seed atom from where radius calculation will start in both the direction along the axis

  • -rad bondi - radius of atoms

  • -cvect 0 1 0 - vector along which radius will be calculated. Here, it is Y-axis.

  • -endrad 12 - stop radius calculation when this threshold radius is reached. It means, now it is in solvent phase.

  • -sample 0.5 - slab size along the cavity/channel axis. Radius is calculated in each slab

  • -mcstep 2000 - Monte-Carlo step size. See details in HOLE manual

This command will dump radius data as a function of time in radius.dat file

[1]:
%%bash

rm radius.dat

echo 11 1 | gmx_clusterByFeatures hole -f traj_superimposed.xtc -s inputs/ref_cavity.pdb -n inputs/onlyProtein.ndx \
                                       -rad bondi -catmid 1903 -cvect 0 1 0 -endrad 12 -sample 0.5 -mcstep 2000
 :-) GROMACS - gmx_clusterByFeatures hole, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures hole
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-pca
Command line:
  'gmx_clusterByFeatures hole' -f traj_superimposed.xtc -s inputs/ref_cavity.pdb -n inputs/onlyProtein.ndx -rad bondi -catmid 1903 -cvect 0 1 0 -endrad 12 -sample 0.5 -mcstep 2000


        :-)  gmx_clusterByFeatures hole (-:

        Author: Rajendra Kumar

        Copyright (C) 2018-2025  Rajendra Kumar


        gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
        the Free Software Foundation, either version 3 of the License, or
        (at your option) any later version.

        gmx_clusterByFeatures is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
        GNU General Public License for more details.

        You should have received a copy of the GNU General Public License
        along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
        "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
        LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
        A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
        OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
        SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
        TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
        PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
        LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
        NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
        SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

WARNING: all CONECT records are ignored
Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Select a group: Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Reading frame   22000 time 2200000.000
-------- -------- ------------------- -------- ----------
O.S. Smart, J.M. Goodfellow and B.A. Wallace (1993)
The Pore Dimensions of Gramicidin A
Biophysical Journal 65:2455-2460.
-------- -------- ------------------- -------- ----------

GROMACS reminds you: "Never replicate a successful experiment." (Fett's law.)


Choose a group for the least squares fit
Selected 11: 'r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444'

Choose a group for hole calculation...
Selected 1: 'Protein'


Selected "H" atom of "GLY-122" as seed for channel/cavity

Thanks for using gmx_hole!!!

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++

4. Plotting radius and residues distribution

To plot radius value distributions and residues outlining cavity/channel positions distribution over the simulations, gmx_clusterByFeatures holeplot sub-command can be used. It reads radius file generated in the above command, extract radius values for the given segment of the cavity/channel, calculate average/distributions and plot it as a function of cavity/channel axis.

Options

  • -resplot - enable plotting of residues positions distributions

  • -vplot - distributions are plotted as violin-plot

  • -xmin - minimum value of cavity/channel axis position to consider

  • -xmax - maximum value of cavity/channel axis position to consider

  • -endrad - ignore if radius is greater than this value (mostly at the opening end)

  • -gap - slab size along the cavity/channel

  • -ax - X, Y or Z axis along which cavity/channel axis is aligned

  • -rlsize - font-size of residues in residue-distribution plots.

[1]:
%%bash

gmx_clusterByFeatures holeplot -i radius.dat -resplot -vplot -o average_radius_distribution.png \
                               -xmin 0 -xmax 17 -endrad 6 -gap 0.5 -ax Y -rlsize 7

Reading file : radius.dat
Reading frame 0 at time        0.000
Maximum axis range is: (0.000 to 17.000)
Reading frame 20000 at time  2000000.000
Finishid reading.... Total number of frame read =  22501
After removing axis points with any missing data:
------------------------------------------------------------
#        Axis        Mean radius     Std. deviation
------------------------------------------------------------
       0.000            0.864            0.286
       0.500            0.867            0.336
       1.000            0.889            0.394
       1.500            0.904            0.442
       2.000            0.920            0.460
       2.500            0.958            0.449
       3.000            0.993            0.455
       3.500            1.043            0.466
       4.000            1.144            0.448
       4.500            1.288            0.414
       5.000            1.449            0.386
       5.500            1.614            0.387
       6.000            1.750            0.451
       6.500            1.907            0.534
       7.000            2.081            0.594
       7.500            2.146            0.631
       8.000            2.142            0.637
       8.500            2.127            0.610
       9.000            2.111            0.581
       9.500            2.085            0.569
      10.000            2.036            0.571
      10.500            1.984            0.567
      11.000            1.948            0.559
      11.500            1.925            0.561
      12.000            1.915            0.571
      12.500            1.916            0.574
      13.000            1.925            0.572
      13.500            1.938            0.595
      14.000            1.963            0.646
      14.500            2.001            0.715
      15.000            2.049            0.796
      15.500            2.088            0.884
      16.000            2.113            0.965
      16.500            2.178            1.044
------------------------------------------------------------

4cb883262c2243e6bfbf0c3248bc853a

Observations

As can be seen aboive, radius of cavity flucuated by at-least 2Å and much more along the cavity. Therefore, it does not provide a true picture of different shapes sampled during the simulations. Clustering could unravel all those shapes sampled during the simulations.

5. Create feature file

In the next step, we will create a file containing cavity radius as features for clustering. gmx_clusterByFeatures holefeatures can be used to create this file using the input radius.dat file, which is obtained in the previous step.

Options

  • -xmin Minimum axis position value above which radius will be considered. Here, we take it as 5.5 because GLY122 is present at the bottom most region of the cavity and residue-distribution shows that GLY122 postions start from 5.5Å along the axis.

  • -xmax Maximum axis position value below which radius will be considered

  • -endrad Maximum radius to be considered for opening end of the cavity

  • -gap Radius calculation slab size

  • -ax Cavity/Channel axis

[2]:
%%bash

gmx_clusterByFeatures holefeatures -i radius.dat -o cavity_features.xvg \
                                   -xmin 5.5 -xmax 17 -endrad 6 -gap 0.5 -ax Y

Reading file : radius.dat
Reading frame 0 at time        0.000
Maximum axis range is: (5.500 to 17.000)
Reading frame 20000 at time  2000000.000
Finishid reading.... Total number of frame read =  22501
After removing axis points with any missing data:
------------------------------------------------------------
#        Axis        Mean radius     Std. deviation
------------------------------------------------------------
       5.500            1.629            0.384
       6.000            1.750            0.451
       6.500            1.907            0.534
       7.000            2.081            0.594
       7.500            2.146            0.631
       8.000            2.142            0.637
       8.500            2.127            0.610
       9.000            2.111            0.581
       9.500            2.085            0.569
      10.000            2.036            0.571
      10.500            1.984            0.567
      11.000            1.948            0.559
      11.500            1.925            0.561
      12.000            1.915            0.571
      12.500            1.916            0.574
      13.000            1.925            0.572
      13.500            1.938            0.595
      14.000            1.963            0.646
      14.500            2.001            0.715
      15.000            2.049            0.796
      15.500            2.088            0.884
      16.000            2.113            0.965
      16.500            2.178            1.044
------------------------------------------------------------
WARNING: radius value for the axis point was not calculated by hole. Zero will be written as feature value. However, large number of missing values will lead to wrong clustering. Therefore, please try to minimize or eliminate the missing values by changing axis-point range using xmin and xmax option and/or dataOccupancy option.
(23, 22501)

6. Clustering using radius as features

Now, we use radius as the features and perform clustering. We will use K-Means clustering. However, there is one drawback that the number of clusters should be known beforehand. gmx_clusterByFeatures implements several cluster metrics and also automatated way to decide number of clusters. We will use Elbow method to automatically determine the number of clusters.

Following command will perform the clustering of conformations on the basis of cavity-radius profile. Explanation of options are as follows:

  • -method kmeans: Use K-Means clustering algorithm

  • -ncluster 15: K-Means clustering will be performed for 1 upto 15 clusters times each time. Finally, based on -ssrchange option, final number of clusters will be automatically selected.

  • -cmetric ssr-sst: Use Elbow method to decide final number of clusters.

  • -nfeature 23: Take 23 features from the feature file. Here it is 23 slabs for which cavity radius is dumped in feature file cavity_features.xvg.

  • -sort features: Sort the output clustered trajectory based on the distance in feature space from its central structure.

  • -ssrchange 1.5: Threshold (percentage) of change in SSR/SST ratio in Elbow method to automatically decide the number of clusters.

  • -g cluster.log: Log file containing clustering information and cluster-mterics.

index group order

  1. First index group - Output group of atoms in the central structures

  2. Second index group - Group of atoms to calculate RMSD between central conformations of clusters as RMSD matrix, which is dumped in the log file with -g option. Here, it is Ligand without hydrogen atoms.

  3. Third index group - Used for Superimposition by least-square fitting. ONLY used in separate clustered trajectories to superimpose conformations on the central structure.

Content of the output ``-g clustering-radius.log`` file

It contains the command summary, and for each input cluster-numbers, number of frames in each clusters. At the end it dumps the Cluster Metrics Summary, which is important for deciding final number of clusters.

[3]:
%%bash

mkdir clustering-radius-output

echo 0 10 11 | gmx_clusterByFeatures cluster -f traj_superimposed.xtc -s inputs/ref_cavity.pdb -n inputs/onlyProtein.ndx \
                                            -feat cavity_features.xvg -method kmeans -nfeature 23 -cmetric ssr-sst -ssrchange 1.5 -ncluster 15 \
                                            -cpdb clustering-radius-output/central.pdb -g clustering-radius.log -clid cluster-id-radius.xvg
mkdir: cannot create directory ‘clustering-radius-output’: File exists
 :-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures cluster
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-pca
Command line:
  'gmx_clusterByFeatures cluster' -f traj_superimposed.xtc -s inputs/ref_cavity.pdb -n inputs/onlyProtein.ndx -feat cavity_features.xvg -method kmeans -nfeature 23 -cmetric ssr-sst -ssrchange 1.5 -ncluster 15 -cpdb clustering-radius-output/central.pdb -g clustering-radius.log -clid cluster-id-radius.xvg


         :-)  gmx_clusterByFeatures cluster (-:

             Author: Rajendra Kumar

       Copyright (C) 2018-2019  Rajendra Kumar


gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


Back Off! I just backed up clustering-radius.log to ./#clustering-radius.log.2#
WARNING: all CONECT records are ignored
Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Select a group: Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Select a group: Group     0 (         System) has  8327 elements
Group     1 (        Protein) has  8327 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5659 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (r_5-535_&_MainChain) has  2124 elements
Group    11 (r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444) has  1920 elements
Group    12 (      r_79_&_CA) has     1 elements
Group    13 (     r_341_&_CA) has     1 elements
Group    14 (     r_293_&_CA) has     1 elements
Group    15 (r_10-535_&_MainChain) has  2104 elements
Group    16 (          r_203) has    11 elements
Group    17 (          r_447) has    18 elements
Group    18 (  oxyanion_hole) has    52 elements
Group    19 (    acyl_pocket) has    84 elements
Group    20 (   choline_site) has    45 elements
Group    21 (peripheral_site) has   106 elements
Reading frame       4 time  400.000
=======================
  Cluster Log output
=======================

Command:
=======================
'gmx_clusterByFeatures cluster' -f traj_superimposed.xtc -s inputs/ref_cavity.pdb -n inputs/onlyProtein.ndx -feat cavity_features.xvg -method kmeans -nfeature 23 -cmetric ssr-sst -ssrchange 1.5 -ncluster 15 -cpdb clustering-radius-output/central.pdb -g clustering-radius.log -clid cluster-id-radius.xvg
=======================

Choose a group for the output:
Selected 0: 'System'

Choose a group for clustering/RMSD calculation:
Selected 10: 'r_5-535_&_MainChain'

Choose a group for fitting or superposition:
Selected 11: 'r_5-535_&_MainChain_&_!r_72-85_&_!r_257-268_&_!r_338-352_&_!r_435-444'


 Input Trajectory dt = 100 ps



###########################################
########## NUMBER OF CLUSTERS : 1 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               22501
===========================



###########################################
########## NUMBER OF CLUSTERS : 2 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               11602
2               10899
===========================



###########################################
########## NUMBER OF CLUSTERS : 3 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               10730
2               9339
3               2432
===========================



###########################################
########## NUMBER OF CLUSTERS : 4 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               6985
2               6795
3               6561
4               2160
===========================



###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               6529
2               6088
3               5529
4               2354
5               2001
===========================



###########################################
########## NUMBER OF CLUSTERS : 6 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               5775
2               5088
3               4794
4               3109
5               1922
6               1813
===========================



###########################################
########## NUMBER OF CLUSTERS : 7 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               5268
2               4624
3               3265
4               3174
5               2646
6               1767
7               1757
===========================



###########################################
########## NUMBER OF CLUSTERS : 8 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4331
2               3408
3               3075
4               2879
5               2867
6               2719
7               1742
8               1480
===========================



###########################################
########## NUMBER OF CLUSTERS : 9 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4341
2               3427
3               3083
4               2782
5               2730
6               2683
7               1495
8               1139
9               821
===========================



###########################################
########## NUMBER OF CLUSTERS : 10 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4230
2               3259
3               3048
4               2659
5               2538
6               2271
7               1482
8               1405
9               811
10              798
===========================



###########################################
########## NUMBER OF CLUSTERS : 11 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3098
2               3006
3               2976
4               2940
5               2501
6               2009
7               1538
8               1443
9               1405
10              794
11              791
===========================



###########################################
########## NUMBER OF CLUSTERS : 12 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3011
2               2996
3               2735
4               2691
5               2073
6               1907
7               1489
8               1396
9               1320
10              1302
11              795
12              786
===========================



###########################################
########## NUMBER OF CLUSTERS : 13 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               2946
2               2685
3               2572
4               2233
5               1988
6               1916
7               1596
8               1373
9               1326
10              1191
11              1105
12              792
13              778
===========================



###########################################
########## NUMBER OF CLUSTERS : 14 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               2910
2               2698
3               2431
4               2107
5               1946
6               1919
7               1618
8               1393
9               1392
10              1218
11              919
12              734
13              733
14              483
===========================



###########################################
########## NUMBER OF CLUSTERS : 15 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               2508
2               2322
3               2069
4               1926
5               1917
6               1676
7               1668
8               1555
9               1350
10              1319
11              1285
12              944
13              752
14              723
15              487
===========================



===========================================================================================================
                                          Cluster Metrics Summary
-----------------------------------------------------------------------------------------------------------
Clusters       SSR/SST    D(SSR/SST)         (Psuedo)F-stat.    Silhouette-score   Davies-bouldin-score
2              39.50      39.503                  14691.213     0.347              1.121
3              50.59      11.086                  11517.190     0.337              1.117
4              57.15       6.563                  10002.451     0.240              1.329
5              62.30       5.150                   9294.526     0.241              1.271
6              65.75       3.452                   8638.371     0.239              1.232
7              68.68       2.930                   8222.448     0.231              1.272
8              70.67       1.983                   7741.149     0.212              1.304
9              72.57       1.902                   7437.596     0.218              1.263
10             74.22       1.647                   7192.800     0.218              1.202
11             75.34       1.123                   6870.415     0.209              1.273
12             76.28       0.946                   6576.296     0.202              1.314
13             77.03       0.750                   6286.124     0.200              1.299
14             77.87       0.837                   6087.276     0.201              1.297
15             78.54       0.671                   5879.271     0.196              1.331
===========================================================================================================


#####################################
Final number of cluster selected: 10
#####################################

Calculating central structure for cluster-10 ...
Reading frame       9 time 407500.000
Back Off! I just backed up cluster-id-radius.xvg to ./#cluster-id-radius.xvg.2#
Reading frame       9 time 407500.000
GROMACS reminds you: "Christianity may be OK between consenting adults in private but should not be taught to young children." (Francis Crick)



===========================================
Cluster-ID      Central Frame   Total Frames
1               19488           4230
2               9807            3259
3               1873            3048
4               14884           2659
5               20134           2538
6               13371           2271
7               22103           1482
8               2417            1405
9               21652           811
10              4075            798
===========================================



Extracting coordinates of the central structure...


Calculating RMSD between central structures...


=====================================
 Central structurs - RMSD matrix
=====================================
    c1     c2     c3     c4     c5     c6     c7     c8     c9    c10
 0.000  0.162  0.219  0.173  0.115  0.213  0.169  0.242  0.167  0.233
 0.162  0.000  0.172  0.174  0.160  0.148  0.159  0.184  0.161  0.183
 0.219  0.172  0.000  0.231  0.221  0.188  0.190  0.142  0.204  0.154
 0.173  0.174  0.231  0.000  0.166  0.220  0.162  0.241  0.169  0.230
 0.115  0.160  0.221  0.166  0.000  0.213  0.164  0.247  0.159  0.242
 0.213  0.148  0.188  0.220  0.213  0.000  0.214  0.175  0.208  0.174
 0.169  0.159  0.190  0.162  0.164  0.214  0.000  0.213  0.127  0.211
 0.242  0.184  0.142  0.241  0.247  0.175  0.213  0.000  0.229  0.140
 0.167  0.161  0.204  0.169  0.159  0.208  0.127  0.229  0.000  0.225
 0.233  0.183  0.154  0.230  0.242  0.174  0.211  0.140  0.225  0.000
=====================================


Writing central structure to pdb-files...

7. Analysis and visualizations

Now, we will perform few analysis and visualizations.

  1. Calculating and plotting average cavity radius cluster-wise

  2. Visualizing clustered cavity shapes

1. Calculating and plotting average cavity radius cluster-wise

gmx_clusterByFeatures holeclustersplot can be used to plot the average radius of each clusters. The options to the sub-commands are as follows:

  • -xmin - minimum value of cavity/channel axis position to consider

  • -xmax - maximum value of cavity/channel axis position to consider

  • -endrad - ignore if radius is greater than this value (mostly at the opening end)

  • -gap - slab size along the cavity/channel

  • -ax - X, Y or Z axis along which cavity/channel axis is aligned

  • -dl - Do not consider last two smallest clusters for calculation and plotting.

  • -csv radius-clusterwise.csv - Dump the average and standard-deviation of each clusters’ radius to a csv file

  • -o radius-cluster.png - Plot the average radius values from each cluster

[4]:
%%bash

gmx_clusterByFeatures holeclustersplot -i radius.dat -clid cluster-id-radius.xvg -o radius-cluster.png -csv radius-clusterwise.csv \
                                       -xmin 5.5 -xmax 17 -endrad 6 -gap 0.5 -ax Y -dl 2 -wd 8

Reading file : radius.dat
Reading frame 0 at time        0.000
Maximum axis range is: (5.500 to 17.000)
Reading frame 20000 at time  2000000.000
Finishid reading.... Total number of frame read =  22501
After removing axis points with any missing data:
------------------------------------------------------------
#        Axis        Mean radius     Std. deviation
------------------------------------------------------------
       5.500            1.629            0.384
       6.000            1.750            0.451
       6.500            1.907            0.534
       7.000            2.081            0.594
       7.500            2.146            0.631
       8.000            2.142            0.637
       8.500            2.127            0.610
       9.000            2.111            0.581
       9.500            2.085            0.569
      10.000            2.036            0.571
      10.500            1.984            0.567
      11.000            1.948            0.559
      11.500            1.925            0.561
      12.000            1.915            0.571
      12.500            1.916            0.574
      13.000            1.925            0.572
      13.500            1.938            0.595
      14.000            1.963            0.646
      14.500            2.001            0.715
      15.000            2.049            0.796
      15.500            2.088            0.884
      16.000            2.113            0.965
      16.500            2.178            1.044
------------------------------------------------------------

b1b166497a364856a9c660a5961dbffa

2. Visualizing clustered cavity shapes

To visualize the cavity shape of each cluster, we will use central structure pdb file and calculate the radius with same options as we did it originally with whole trajectory. This step also generate a pdb file with -pdb option containg cavity shape.

Subsequently, we will use sph_process and sos_triangle tools (provided with HOLE package) to generate a file that can be used to draw the cavity in the VMD.

Note: Remove %%capture --no-stdout and %%capture --no-stderr to populate all the output and errors generated from these commands.

[5]:
%%capture --no-stderr
%%capture --no-stdout
%%bash

for i in 1 2 3 4 5 6 7 8
do
input=central_c${i}

echo 11 1 | gmx_clusterByFeatures hole -f clustering-radius-output/${input}.pdb -s inputs/ref_cavity.pdb -n inputs/onlyProtein.ndx -rad bondi \
                                       -catmid 1903 -cvect 0 1 0 -endrad 12 -sample 0.5 -pdb ${input}_sph -o dummy_radius.dat
mv ${input}_sph.pdb ${input}.sph

rm ${input}.sos ${input}.vmd_plot dummy_radius.dat
sph_process -sos -dotden 15 -color ${input}.sph ${input}.sos
sos_triangle -s < ${input}.sos > cavity-snapshots/${input}.vmd_plot
rm ${input}.sph ${input}.sos
done

Visualizing cavity shape using VMD

Above commands generate cluster_<cluster_id>.vmd_plot files that can be loaded in VMD to visualize the cavity.

  1. Go to the cavity-snapshots library

  2. At first load the central strucure by using follwoing command: vmd ../clustering-radius-output/central_c1.pdb -e snap.vmd. This command will load central structure of first crystal structure and sets several representations.

  3. In VMD Main Window, click on Analysis -> Extensions -> Tk Console. It will open the Tk console. Write source central_c1.vmd_plot and it will plot the cavity.

15c8cdfa36c2462683f29bd73da2465a