Protein-Ligand Interaction Clustering

During the protein-ligand simulations, both protein and ligand undergoes for conformational change, and therefore interaction between them is also evolve and changes during the simulations. Sometime, protein and ligand conformation changes such that interaction remains constant. Capturing similarities in protein-ligand interactions while both are undergoing for changes is challenging. Therefore, clustering of protein-ligand interactions will provide collection of complex conformations that are grouped together by the interactions.

This tutorial provides the steps with examples to perform conformational clustering besed on the protein-ligand interactions. This method is employed in this publication. For more details about the theory of the method, please follow the method section of this publication.

Instructions

Tutorial files: The tutorial files can be downloaded from here.
Extract the files: tar -zxvf protein-ligand-interaction.tar.gz
Go to directory: cd protein-ligand-interaction
Copy the Jupyter Notebook: This notebook is available in the GitHub repo. Download and copy it from the github.

Final result

230d5756491f4ff393034352781b607a 6be6ea28abd843a5b1446db56b21ec57

Required Tools

GROMACS
gmx_clusterByFeatures

Steps

Modification of ligand topology file
Calculation of protein-ligand reciprocal-distance-matrix trajectory
PCA of the reciprocal-distance-matrix trajectory
Calculations of projection of first five PCs on reciprocal-distance-matrix trajectory
Pre-Clustering scan for Cluster-Metrics
Clustering with pre-determined number of clusters
Analysis

1. Modification of ligand topology file

In general, whole ligand is considered as a single residue. At first, ligand’s atoms are grouped as several residues and these are marked in the topology file as shown below in example. Idea is to divide ligand into virtual residues.

29e378f42a4941a584c1a0f7e92c416a 16991786bf5d4bccb7311f507dc7ef12

Above right side, ligand is colored by the virtual residues.

Notes:

The atoms in a residue should be continuous and its ID and Name should be unique.
Do not reorder the atoms in topology after simulation is performed because same order is present in trajectory.
If required, always reorder atoms during ligand topology generation at the very begining of the simulation setup. In this way, atoms in generated trajectory will have same order as in topology and tpr files.

After, modifying the topology, rerun gmx grompp to regenerate tpr file with ligand having virtual residues. Also, create new index files to include make new groups depending on the requirements.

2. Calculation of protein-ligand reciprocal-distance-matrix trajectory

Now, we calculate reciprocal-distance matrix using distmat sub-command. Since, ligand is now made-up of virtual residues, reciprocal-distance-matrix between ligand and protein will contain reciprocal of minimum-distance between ligand’s virtual residues and protein’s residues.

Notes:

-power -1 option is used to calulate 1/(minimum-distance) in place of minimum-distance.
Selected index group: 1 is protein and 12 is ligand (Other)

[1]:

%%bash

echo 1 12 | gmx_clusterByFeatures distmat -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr \
                                          -gx 1 -power -1 -var var_lig_protein.dat -cmap cmap_lig_protein.dat -pca pca.xtc

 :-) GROMACS - gmx_clusterByFeatures distmat, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures distmat
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
  'gmx_clusterByFeatures distmat' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -gx 1 -power -1 -var var_lig_protein.dat -cmap cmap_lig_protein.dat -pca pca.xtc


         :-)  gmx_clusterByFeatures distmat (-:

             Author: Rajendra Kumar

       Copyright (C) 2018-2019  Rajendra Kumar


gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Select first group:
Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Select a group: Select second group:
Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Select a group: There are 542 residues with 8325 atoms in first group
There are 12 residues with 68 atoms in second group
Reading frame   22000 time 440000.000

GROMACS reminds you: "Chemistry: It tends to be a messy science." (Gunnar von Heijne, former chair of the Nobel Committee for chemistry)

Selected 1: 'Protein'
Selected 12: 'Other'
 Number of distance-matrix elements for PCA trajectory: 6505
 Number of distance-matrix coordinates in PCA trajectory: 2169

3. PCA of the reciprocal-distance-matrix trajectory

Now, we will use pca.xtc and pca_dummy.pdb generated in the above command as input files to GROMACS tool gmx covar to perform PCA. This step will calculate covariance matrix, eigenvectors and eigenvalues. By default, the eigenvectors are written in eigenvec.trr while eigenvalues are written in eigenval.xvg files.

-nofit, -nomwa and -nopbc options are used because xtc file does not contain cartesian-coordinates and these option has no meanings in the case of the distance-matrix trajectory.

[2]:

%%bash

echo 0 0 | gmx covar -f pca.xtc -s pca_dummy.pdb -nofit -nomwa -nopbc

                      :-) GROMACS - gmx covar, 2025.2 (-:

Executable:   /opt/gromacs-2025/bin/gmx
Data prefix:  /opt/gromacs-2025
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
  gmx covar -f pca.xtc -s pca_dummy.pdb -nofit -nomwa -nopbc

Group     0 (         System) has  2169 elements
Group     1 (        Protein) has  2169 elements
Group     2 (      Protein-H) has  1102 elements
Group     3 (        C-alpha) has   140 elements
Group     4 (       Backbone) has   419 elements
Group     5 (      MainChain) has   558 elements
Group     6 (   MainChain+Cb) has   682 elements
Group     7 (    MainChain+H) has   683 elements
Group     8 (      SideChain) has  1486 elements
Group     9 (    SideChain-H) has   544 elements
Select a group: Calculating the average structure ...
Reading frame   22000 time 440000.000
Constructing covariance matrix (6507x6507) ...
Reading frame   22000 time 440000.000
Read 22501 frames

Trace of the covariance matrix: 140.004 (nm^2)

Diagonalizing ...

Sum of the eigenvalues: 140.004 (nm^2)

Writing eigenvalues to eigenval.xvg

Writing average structure & eigenvectors 1--6507 to eigenvec.trr
Wrote the log to covar.log

GROMACS reminds you: "Can I have everything louder than everything else?" (Deep Purple)


WARNING: Masses and atomic (Van der Waals) radii will be guessed
         based on residue and atom names, since they could not be
         definitively assigned from the information in your input
         files. These guessed numbers might deviate from the mass
         and radius of the atom type. Please check the output
         files if necessary. Note, that this functionality may
         be removed in a future GROMACS version. Please, consider
         using another file format for your input.


Choose a group for the covariance analysis
Selected 0: 'System'

4. Projection of first five PCs on reciprocal-distance-matrix trajectory

We will use eigenvectors written in eigenvec.trr, pca.xtc and pca_dummy.pdb as input files to GROMACS tool gmx anaeig to calculate projection of first 5 eigenvectors on distance-matrix trajectory. These projections will be written into proj.xvg file.

[3]:

%%bash

echo 0 0 | gmx anaeig -f pca.xtc -s pca_dummy.pdb -first 1 -last 5 -proj

                      :-) GROMACS - gmx anaeig, 2025.2 (-:

Executable:   /opt/gromacs-2025/bin/gmx
Data prefix:  /opt/gromacs-2025
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
  gmx anaeig -f pca.xtc -s pca_dummy.pdb -first 1 -last 5 -proj

trr version: GMX_trn_file (single precision)
Eigenvectors in eigenvec.trr were determined without fitting
Read non mass weighted average/minimum structure with 2169 atoms from eigenvec.trr
Read 6507 eigenvectors (for 2169 atoms)


WARNING: If there are molecules in the input trajectory file
         that are broken across periodic boundaries, they
         cannot be made whole (or treated as whole) without
         you providing a run input file.

Group     0 (         System) has  2169 elements
Group     1 (        Protein) has  2169 elements
Group     2 (      Protein-H) has  1102 elements
Group     3 (        C-alpha) has   140 elements
Group     4 (       Backbone) has   419 elements
Group     5 (      MainChain) has   558 elements
Group     6 (   MainChain+Cb) has   682 elements
Group     7 (    MainChain+H) has   683 elements
Group     8 (      SideChain) has  1486 elements
Group     9 (    SideChain-H) has   544 elements
Select a group: 5 eigenvectors selected for output: 1 2 3 4 5
Reading frame   22000 time 440000.000


GROMACS reminds you: "Mathematics is a game played according to certain rules with meaningless marks on paper." (David Hilbert)


Select an index group of 2169 elements that corresponds to the eigenvectors
Selected 0: 'System'

5. Pre-Clustering scan for Cluster-Metrics

Before performing final clutering, we would first generate cluster-mterics that can be used to make a decision on the number of clusters. One of the drawback of K-Means clustering is that the number of clusters should be known beforehand. Although, gmx_clusterByFeatures implements several cluster metrics and also automatated way to decide number of clusters, here, we first calculate cluster-metrics and final clustering can be performed later.

Following command will perform the clustering of conformations using first 5 PCs projection. Explanation of options are as follows:

-method kmeans: Use K-Means clustering algorithm
-ncluster 15: K-Means clustering will be performed for 1 upto 15 clusters times each time. Finally, based on -ssrchange option, final number of clusters will be automatically selected.
-cmetric ssr-sst: Use Elbow method to decide final number of clusters. It is not for final purpose.
-nfeature 5: Take 5 feature from feat proj.xvg input file. Here it is projection of first 5 eigenvectors on the trajectory.
-ssrchange 1.0: Threshold (percentage) of change in SSR/SST ratio in Elbow method to automatically decide the number of clusters.
-g clusters_scan.log: Log file containing clustering information and cluster-mterics.

index group order

First index group - Output group of atoms in the central structures and clustered trajectories
Second index group - Group of atoms to calculate RMSD between central conformations of clusters as RMSD matrix, which is dumped in the log file with -g option. Here, it is Ligand.
Third index group - Used for Superimposition by least-square fitting. ONLY used in separate clustered trajectories to superimpose conformations on the central structure. Here, it is protein’s C-alpha atoms.

Content of the output ``-g clusters_scan.log`` file

It contains the command summary, and for each input cluster-numbers, number of frames in each clusters. At the end it dumps the Cluster Metrics Summary, which is important for deciding final number of clusters.

[4]:

%%bash

echo 0 12 26 | gmx_clusterByFeatures cluster -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx \
                                            -method kmeans -feat proj.xvg -nfeature 10 -cmetric ssr-sst -ncluster 15 -ssrchange 1.0 \
                                            -g clusters_scan.log

 :-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures cluster
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
  'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -method kmeans -feat proj.xvg -nfeature 10 -cmetric ssr-sst -ncluster 15 -ssrchange 1.0 -g clusters_scan.log


         :-)  gmx_clusterByFeatures cluster (-:

             Author: Rajendra Kumar

       Copyright (C) 2018-2019  Rajendra Kumar


gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Group    25 (  Protein_Other) has  8393 elements
Group    26 (C-alpha_Other_&_!H*) has   574 elements
Select a group: Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Group    25 (  Protein_Other) has  8393 elements
Group    26 (C-alpha_Other_&_!H*) has   574 elements
Select a group: Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Group    25 (  Protein_Other) has  8393 elements
Group    26 (C-alpha_Other_&_!H*) has   574 elements
Reading frame       4 time   80.000

=======================
  Cluster Log output
=======================

Command:
=======================
'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -method kmeans -feat proj.xvg -nfeature 10 -cmetric ssr-sst -ncluster 15 -ssrchange 1.0 -g clusters_scan.log
=======================

Choose a group for the output:
Selected 0: 'System'

Choose a group for clustering/RMSD calculation:
Selected 12: 'Other'

Choose a group for fitting or superposition:
Selected 26: 'C-alpha_Other_&_!H*'


 Input Trajectory dt = 20 ps



###########################################
########## NUMBER OF CLUSTERS : 1 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               22501
===========================



###########################################
########## NUMBER OF CLUSTERS : 2 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               15000
2               7501
===========================



###########################################
########## NUMBER OF CLUSTERS : 3 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               7510
2               7501
3               7490
===========================



###########################################
########## NUMBER OF CLUSTERS : 4 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               7501
2               7484
3               5309
4               2207
===========================



###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               7501
2               7485
3               3872
4               2205
5               1438
===========================



###########################################
########## NUMBER OF CLUSTERS : 6 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               7485
2               5829
3               3872
4               2205
5               1672
6               1438
===========================



###########################################
########## NUMBER OF CLUSTERS : 7 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               5829
2               5075
3               3872
4               2415
5               2200
6               1672
7               1438
===========================



###########################################
########## NUMBER OF CLUSTERS : 8 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               5830
2               5082
3               2408
4               2386
5               2199
6               1671
7               1491
8               1434
===========================



###########################################
########## NUMBER OF CLUSTERS : 9 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               5083
2               4605
3               2407
4               2398
5               2199
6               1552
7               1479
8               1434
9               1344
===========================



###########################################
########## NUMBER OF CLUSTERS : 10 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4900
2               4615
3               2387
4               2195
5               2188
6               1542
7               1489
8               1434
9               1344
10              407
===========================



###########################################
########## NUMBER OF CLUSTERS : 11 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4692
2               4290
3               2387
4               2188
5               2177
6               1489
7               1473
8               1434
9               1336
10              803
11              232
===========================



###########################################
########## NUMBER OF CLUSTERS : 12 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4692
2               4290
3               2189
4               2177
5               1682
6               1473
7               1434
8               1336
9               1118
10              1075
11              803
12              232
===========================



###########################################
########## NUMBER OF CLUSTERS : 13 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3029
2               2944
3               2332
4               2271
5               2189
6               1866
7               1682
8               1434
9               1250
10              1118
11              1075
12              951
13              360
===========================



###########################################
########## NUMBER OF CLUSTERS : 14 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4290
2               2969
3               2357
4               2188
5               2177
6               1432
7               1384
8               1240
9               1092
10              1082
11              935
12              803
13              320
14              232
===========================



###########################################
########## NUMBER OF CLUSTERS : 15 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4339
2               3051
3               2367
4               2248
5               2143
6               1394
7               1250
8               1153
9               952
10              924
11              790
12              787
13              537
14              333
15              233
===========================



===========================================================================================================
                                          Cluster Metrics Summary
-----------------------------------------------------------------------------------------------------------
Clusters       SSR/SST    D(SSR/SST)         (Psuedo)F-stat.    Silhouette-score   Davies-bouldin-score
2              50.45      50.455                  22911.836     0.530              0.729
3              84.67      34.215                  62129.371     0.689              0.503
4              89.04       4.374                  60944.555     0.652              0.591
5              92.08       3.039                  65406.023     0.701              0.477
6              94.15       2.069                  72417.812     0.644              0.538
7              95.13       0.974                  73154.625     0.533              0.738
8              95.52       0.395                  68511.797     0.487              0.908
9              95.76       0.237                  63453.414     0.407              1.049
10             96.10       0.345                  61606.973     0.399              1.024
11             96.28       0.182                  58261.570     0.374              1.043
12             96.48       0.201                  56103.879     0.371              1.070
13             96.64       0.152                  53840.742     0.327              1.200
14             96.80       0.165                  52342.699     0.351              1.065
15             96.91       0.111                  50402.191     0.302              1.157
===========================================================================================================


#####################################
Final number of cluster selected: 6
#####################################

Calculating central structure for cluster-6 ...

Reading frame       5 time 188780.000
GROMACS reminds you: "No, no, you're not thinking, you're just being logical." (Niels Bohr)



===========================================
Cluster-ID      Central Frame   Total Frames
1               18119           7485
2               3515            5829
3               10774           3872
4               12874           2205
5               833             1672
6               9439            1438
===========================================



Extracting coordinates of the central structure...


Calculating RMSD between central structures...


=====================================
 Central structurs - RMSD matrix
=====================================
    c1     c2     c3     c4     c5     c6
 0.000  0.414  0.512  0.479  0.458  0.536
 0.414  0.000  0.605  0.624  0.214  0.600
 0.512  0.605  0.000  0.253  0.587  0.221
 0.479  0.624  0.253  0.000  0.587  0.266
 0.458  0.214  0.587  0.587  0.000  0.583
 0.536  0.600  0.221  0.266  0.583  0.000
=====================================

6. Clustering with pre-determined number of clusters

As can be seen above in Cluster-Metrics table obtained in clusters_scan.log file, best scores (both Silhouette and Davies-Bouldin) were obtained for 5 clusters. Therefore, now we will do final clustering with 5 clusters and extract trajectories of these clusters.

Explanation of options are as follows:

-method kmeans: Use K-Means clustering algorithm
-ncluster 5: Five clusters will be generated using K-Means clustering.
-cmetric prior: No cluster-mterics is used, number of clusters is already known.
-nfeature 5: Take 5 feature from feat proj.xvg input file. Here it is projection of first 5 eigenvectors on the trajectory.
cluster-5: Cluster log file
-cpdb clustered_trajs/central.pdb: Central structures of each cluster as PDB file
-fout clustered_trajs/cluster.xtc: Trajectory file of each cluster
-outframe 1000: Only first 1000 frame after sorting to be written in cluster trajectory file
-sort features: Sort the conformation in each cluster trajectory file based on the distance between central structure and current conformation in feature sub-space.

This command could take a long time to execute!

This command could take a long time to execute because it is writing output trajectory file for each cluster sorted by distance in feature-space. Therefore, it needs to read input trajectory back-and-forth many time to extract the conformations in sorted manner. XTC format is fast for back-and-forth reading, and it still could take long time to dump the output trajectories.

[5]:

%%bash

mkdir clustered_trajs

echo 0 12 26 | gmx_clusterByFeatures cluster -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx \
                                            -feat proj.xvg -nfeature 5 -method kmeans -cmetric prior -ncluster 5 -g cluster-5 \
                                            -cpdb clustered_trajs/central.pdb -fout clustered_trajs/cluster.xtc -outframe 1000 \
                                            -plot pca-cluster-5.png -sort features

 :-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures cluster
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
  'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -feat proj.xvg -nfeature 5 -method kmeans -cmetric prior -ncluster 5 -g cluster-5 -cpdb clustered_trajs/central.pdb -fout clustered_trajs/cluster.xtc -outframe 1000 -plot pca-cluster-5.png -sort features


         :-)  gmx_clusterByFeatures cluster (-:

             Author: Rajendra Kumar

       Copyright (C) 2018-2019  Rajendra Kumar


gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Group    25 (  Protein_Other) has  8393 elements
Group    26 (C-alpha_Other_&_!H*) has   574 elements
Select a group: Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Group    25 (  Protein_Other) has  8393 elements
Group    26 (C-alpha_Other_&_!H*) has   574 elements
Select a group: Group     0 (         System) has  8393 elements
Group     1 (        Protein) has  8325 elements
Group     2 (      Protein-H) has  4217 elements
Group     3 (        C-alpha) has   542 elements
Group     4 (       Backbone) has  1626 elements
Group     5 (      MainChain) has  2169 elements
Group     6 (   MainChain+Cb) has  2660 elements
Group     7 (    MainChain+H) has  2668 elements
Group     8 (      SideChain) has  5657 elements
Group     9 (    SideChain-H) has  2048 elements
Group    10 (    Prot-Masses) has  8325 elements
Group    11 (    non-Protein) has    68 elements
Group    12 (          Other) has    68 elements
Group    13 (             L1) has     4 elements
Group    14 (             L2) has     3 elements
Group    15 (             L3) has     8 elements
Group    16 (             L4) has     8 elements
Group    17 (             L5) has    14 elements
Group    18 (             L6) has     2 elements
Group    19 (             L7) has     3 elements
Group    20 (             L8) has     1 elements
Group    21 (             L9) has    10 elements
Group    22 (            L10) has    10 elements
Group    23 (            L11) has     1 elements
Group    24 (            L12) has     4 elements
Group    25 (  Protein_Other) has  8393 elements
Group    26 (C-alpha_Other_&_!H*) has   574 elements
Reading frame       4 time   80.000

=======================
  Cluster Log output
=======================

Command:
=======================
'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -feat proj.xvg -nfeature 5 -method kmeans -cmetric prior -ncluster 5 -g cluster-5 -cpdb clustered_trajs/central.pdb -fout clustered_trajs/cluster.xtc -outframe 1000 -plot pca-cluster-5.png -sort features
=======================

Choose a group for the output:
Selected 0: 'System'

Choose a group for clustering/RMSD calculation:
Selected 12: 'Other'

Choose a group for fitting or superposition:
Selected 26: 'C-alpha_Other_&_!H*'


 Input Trajectory dt = 20 ps



###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               7501
2               7485
3               3872
4               2205
5               1438
===========================



#####################################
Final number of cluster selected: 5
#####################################

Calculating central structure for cluster-5 ...

Reading frame       4 time 188780.000   <string>:127: MatplotlibDeprecationWarning: The non_interactive_bk attribute was deprecated in Matplotlib 3.9 and will be removed in 3.11. Use ``matplotlib.backends.backend_registry.list_builtin(matplotlib.backends.BackendFilter.NON_INTERACTIVE)`` instead.

Back Off! I just backed up clid.xvg to ./#clid.xvg.1#
Reading frame    4000 time 188780.000
GROMACS reminds you: "Those people who think they know everything are a great annoyance to those of us who do." (Isaac Asimov)



===========================================
Cluster-ID      Central Frame   Total Frames
1               2786            7501
2               18119           7485
3               10774           3872
4               12874           2205
5               9439            1438
===========================================



Extracting coordinates of the central structure...


Calculating RMSD between central structures...


=====================================
 Central structurs - RMSD matrix
=====================================
    c1     c2     c3     c4     c5
 0.000  0.418  0.603  0.619  0.599
 0.418  0.000  0.512  0.479  0.536
 0.603  0.512  0.000  0.253  0.221
 0.619  0.479  0.253  0.000  0.266
 0.599  0.536  0.221  0.266  0.000
=====================================


Writing central structure to pdb-files...


Writing trajectory for each cluster...

7. Analysis

Now, we will perform following analysis on obtained clusters:

Comparison of RMSDs within and between the clusters: It will show the qualitative measure of separation of conformations of clusters in term of RMSD.
Plotting PC vs PC cluster-wise. In fact, this is already plotted in the above obtained pca-cluster-5.png file. However, we will focus on first three PCs to demonstrate the distribution of conformation in PC space.
Cluster-ID with time: We will plot cluster-id as a function of time to analyze, how conformation is changing between clusters as a function of time.

At first, we will load Python modules and define some functions as follows:

[6]:

import re
import sys
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

[7]:

def read_xvg(filename):
    ''' Read any XVG file and return the data as 2D array where data is row-wise with respect to time.
    '''
    fin = open(filename, 'r')

    data = []
    for line in fin:
        line = line.rstrip().lstrip()
        if not line:
            continue

        if re.search('^#|^@', line) is not None:
                     continue

        temp = re.split('\s+', line)
        data.append(list(map(float, temp)))

    data = np.asarray(data)

    return data.T

1a. Calculation of RMSDs within and between the clusters

At first, we need to calculate RMSDs of complex within and between the clusters using gmx rms command as follows.

Note: The complex structure is already superimposed when separated cluster-trajectory were written in the previous step, therefore, we are not performing fitting in RMSD calculations below,

Note: Remove %%capture --no-stdout and %%capture --no-stderr to populate all the output generated from gmx rms commands.

[8]:

%%capture --no-stdout
%%capture --no-stderr
%%script bash

# make directory for rmsd files
mkdir clustered-rmsd

for i in `seq 1 5`
do
    for j in `seq 1 5`
    do
        echo 26 | gmx rms -f clustered_trajs/cluster_c${j}.xtc -s clustered_trajs/central_c${i}.pdb -n inputs/index.ndx -o clustered-rmsd/c${i}_c${j} -nopbc -fit none
    done
done

1b. Comparison of ligand RMSDs within and between the clusters

We will use Python to plot all the obtained RMSDs above. Conformations in cluster trajectory is sorted by distance in feature-space and therefore, RMSD will be randomly fluctuate. RMSDs within the same cluster is expected to be lowest.

[9]:

mpl.rcParams['font.size']=12
fig = plt.figure(figsize=(12,16))
fig.subplots_adjust(top=0.98, bottom=0.05, hspace=0.25)

for i in range(1,6):

    ax = fig.add_subplot(6, 2, i)

    for j in range(1,6):
        filename = 'clustered-rmsd/c{0}_c{1}.xvg'.format(i, j)
        label = "{0} vs {1}".format(i, j)
        data = read_xvg(filename) # read file
        if i==j:
            ax.scatter(data[0]/1000, data[1], label=label, lw=0, s=2)
        else:
            ax.scatter(data[0]/1000, data[1], label=label, lw=0, s=0.5)

    ax.set_ylim(0, 0.5)
    ax.set_xlabel('Time (ns)')
    ax.set_ylabel('RMSD (nm)')
    plt.legend(loc='upper center', ncol=4, markerscale=10, borderaxespad=0.1, columnspacing=1, handlelength=1, handletextpad=0.4)


plt.savefig('rmsd-comparison.png', dpi=300)
plt.show()

../_images/tutorials_protein-ligand-interaction_18_0.png

3. Plotting PC vs PC cluster-wise

It will be done in two steps:

An input file will be prepared containing information about feature searial and their labels.
gmx_clusterByFeatures featuresplot will be used to generate the plot.

[10]:

%%bash

# First step - preparation of input file
echo "1,2,PC-1,PC-2" > features-label.txt
echo "2,3,PC-2,PC-3" >> features-label.txt
echo "1,3,PC-1,PC-3" >> features-label.txt
echo "1,4,PC-1,PC-4" >> features-label.txt
cat features-label.txt

# Second step - plotting
gmx_clusterByFeatures featuresplot -i features-label.txt -feat proj.xvg -clid clid.xvg -lcols 6 -o protein-ligand-interaction-PCs-vs-PCs.png

1,2,PC-1,PC-2
2,3,PC-2,PC-3
1,3,PC-1,PC-3
1,4,PC-1,PC-4

a62c78ab1c4d4206a7f018d7ffa70969

3. Cluster-ID with time

We will use clid.xvg obtained from the cluster subcommand to plot both cluster-id and also highlight the occurance of the given cluster.

[11]:

mpl.rcParams['font.size']=12

data=read_xvg('clid.xvg')
fig = plt.figure(figsize=(8,6))
fig.subplots_adjust(right=0.85)

ax = fig.add_subplot(111)
ax.scatter(data[0]/1000, data[1], marker='x', color='k')
ax.set_ylabel('Cluster')
ax.set_xlabel('Time (ns)')

plt.savefig('clid.png', dpi=300)
plt.show()

../_images/tutorials_protein-ligand-interaction_23_0.png