Protein-Ligand Interaction Clustering
During the protein-ligand simulations, both protein and ligand undergoes for conformational change, and therefore interaction between them is also evolve and changes during the simulations. Sometime, protein and ligand conformation changes such that interaction remains constant. Capturing similarities in protein-ligand interactions while both are undergoing for changes is challenging. Therefore, clustering of protein-ligand interactions will provide collection of complex conformations that are grouped together by the interactions.
This tutorial provides the steps with examples to perform conformational clustering besed on the protein-ligand interactions. This method is employed in this publication. For more details about the theory of the method, please follow the method section of this publication.
Instructions
Tutorial files: The tutorial files can be downloaded from here.
Extract the files:
tar -zxvf protein-ligand-interaction.tar.gzGo to directory:
cd protein-ligand-interactionCopy the Jupyter Notebook: This notebook is available in the GitHub repo. Download and copy it from the github.
Final result

Required Tools
GROMACS
gmx_clusterByFeatures
Steps
Modification of ligand topology file
Calculation of protein-ligand reciprocal-distance-matrix trajectory
PCA of the reciprocal-distance-matrix trajectory
Calculations of projection of first five PCs on reciprocal-distance-matrix trajectory
Pre-Clustering scan for Cluster-Metrics
Clustering with pre-determined number of clusters
Analysis
1. Modification of ligand topology file
In general, whole ligand is considered as a single residue. At first, ligand’s atoms are grouped as several residues and these are marked in the topology file as shown below in example. Idea is to divide ligand into virtual residues.

Above right side, ligand is colored by the virtual residues.
Notes:
The atoms in a residue should be continuous and its ID and Name should be unique.
Do not reorder the atoms in topology after simulation is performed because same order is present in trajectory.
If required, always reorder atoms during ligand topology generation at the very begining of the simulation setup. In this way, atoms in generated trajectory will have same order as in topology and tpr files.
After, modifying the topology, rerun gmx grompp to regenerate tpr file with ligand having virtual residues. Also, create new index files to include make new groups depending on the requirements.
2. Calculation of protein-ligand reciprocal-distance-matrix trajectory
Now, we calculate reciprocal-distance matrix using distmat sub-command. Since, ligand is now made-up of virtual residues, reciprocal-distance-matrix between ligand and protein will contain reciprocal of minimum-distance between ligand’s virtual residues and protein’s residues.
Notes:
-power -1option is used to calulate 1/(minimum-distance) in place of minimum-distance.Selected index group:
1is protein and12is ligand (Other)
[1]:
%%bash
echo 1 12 | gmx_clusterByFeatures distmat -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr \
-gx 1 -power -1 -var var_lig_protein.dat -cmap cmap_lig_protein.dat -pca pca.xtc
:-) GROMACS - gmx_clusterByFeatures distmat, 2025.0-dev-20250210-6949615-local (-:
Executable: gmx_clusterByFeatures distmat
Data prefix: /project/external/gmx_installed
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
'gmx_clusterByFeatures distmat' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -gx 1 -power -1 -var var_lig_protein.dat -cmap cmap_lig_protein.dat -pca pca.xtc
:-) gmx_clusterByFeatures distmat (-:
Author: Rajendra Kumar
Copyright (C) 2018-2019 Rajendra Kumar
gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures. If not, see <http://www.gnu.org/licenses/>.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Select first group:
Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Select a group: Select second group:
Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Select a group: There are 542 residues with 8325 atoms in first group
There are 12 residues with 68 atoms in second group
Reading frame 22000 time 440000.000
GROMACS reminds you: "Chemistry: It tends to be a messy science." (Gunnar von Heijne, former chair of the Nobel Committee for chemistry)
Selected 1: 'Protein'
Selected 12: 'Other'
Number of distance-matrix elements for PCA trajectory: 6505
Number of distance-matrix coordinates in PCA trajectory: 2169
3. PCA of the reciprocal-distance-matrix trajectory
Now, we will use pca.xtc and pca_dummy.pdb generated in the above command as input files to GROMACS tool gmx covar to perform PCA. This step will calculate covariance matrix, eigenvectors and eigenvalues. By default, the eigenvectors are written in eigenvec.trr while eigenvalues are written in eigenval.xvg files.
-nofit, -nomwa and -nopbc options are used because xtc file does not contain cartesian-coordinates and these option has no meanings in the case of the distance-matrix trajectory.
[2]:
%%bash
echo 0 0 | gmx covar -f pca.xtc -s pca_dummy.pdb -nofit -nomwa -nopbc
:-) GROMACS - gmx covar, 2025.2 (-:
Executable: /opt/gromacs-2025/bin/gmx
Data prefix: /opt/gromacs-2025
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
gmx covar -f pca.xtc -s pca_dummy.pdb -nofit -nomwa -nopbc
Group 0 ( System) has 2169 elements
Group 1 ( Protein) has 2169 elements
Group 2 ( Protein-H) has 1102 elements
Group 3 ( C-alpha) has 140 elements
Group 4 ( Backbone) has 419 elements
Group 5 ( MainChain) has 558 elements
Group 6 ( MainChain+Cb) has 682 elements
Group 7 ( MainChain+H) has 683 elements
Group 8 ( SideChain) has 1486 elements
Group 9 ( SideChain-H) has 544 elements
Select a group: Calculating the average structure ...
Reading frame 22000 time 440000.000
Constructing covariance matrix (6507x6507) ...
Reading frame 22000 time 440000.000
Read 22501 frames
Trace of the covariance matrix: 140.004 (nm^2)
Diagonalizing ...
Sum of the eigenvalues: 140.004 (nm^2)
Writing eigenvalues to eigenval.xvg
Writing average structure & eigenvectors 1--6507 to eigenvec.trr
Wrote the log to covar.log
GROMACS reminds you: "Can I have everything louder than everything else?" (Deep Purple)
WARNING: Masses and atomic (Van der Waals) radii will be guessed
based on residue and atom names, since they could not be
definitively assigned from the information in your input
files. These guessed numbers might deviate from the mass
and radius of the atom type. Please check the output
files if necessary. Note, that this functionality may
be removed in a future GROMACS version. Please, consider
using another file format for your input.
Choose a group for the covariance analysis
Selected 0: 'System'
4. Projection of first five PCs on reciprocal-distance-matrix trajectory
We will use eigenvectors written in eigenvec.trr, pca.xtc and pca_dummy.pdb as input files to GROMACS tool gmx anaeig to calculate projection of first 5 eigenvectors on distance-matrix trajectory. These projections will be written into proj.xvg file.
[3]:
%%bash
echo 0 0 | gmx anaeig -f pca.xtc -s pca_dummy.pdb -first 1 -last 5 -proj
:-) GROMACS - gmx anaeig, 2025.2 (-:
Executable: /opt/gromacs-2025/bin/gmx
Data prefix: /opt/gromacs-2025
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
gmx anaeig -f pca.xtc -s pca_dummy.pdb -first 1 -last 5 -proj
trr version: GMX_trn_file (single precision)
Eigenvectors in eigenvec.trr were determined without fitting
Read non mass weighted average/minimum structure with 2169 atoms from eigenvec.trr
Read 6507 eigenvectors (for 2169 atoms)
WARNING: If there are molecules in the input trajectory file
that are broken across periodic boundaries, they
cannot be made whole (or treated as whole) without
you providing a run input file.
Group 0 ( System) has 2169 elements
Group 1 ( Protein) has 2169 elements
Group 2 ( Protein-H) has 1102 elements
Group 3 ( C-alpha) has 140 elements
Group 4 ( Backbone) has 419 elements
Group 5 ( MainChain) has 558 elements
Group 6 ( MainChain+Cb) has 682 elements
Group 7 ( MainChain+H) has 683 elements
Group 8 ( SideChain) has 1486 elements
Group 9 ( SideChain-H) has 544 elements
Select a group: 5 eigenvectors selected for output: 1 2 3 4 5
Reading frame 22000 time 440000.000
GROMACS reminds you: "Mathematics is a game played according to certain rules with meaningless marks on paper." (David Hilbert)
Select an index group of 2169 elements that corresponds to the eigenvectors
Selected 0: 'System'
5. Pre-Clustering scan for Cluster-Metrics
Before performing final clutering, we would first generate cluster-mterics that can be used to make a decision on the number of clusters. One of the drawback of K-Means clustering is that the number of clusters should be known beforehand. Although, gmx_clusterByFeatures implements several cluster metrics and also automatated way to decide
number of clusters, here, we first calculate cluster-metrics and final clustering can be performed later.
Following command will perform the clustering of conformations using first 5 PCs projection. Explanation of options are as follows:
-method kmeans: Use K-Means clustering algorithm-ncluster 15: K-Means clustering will be performed for 1 upto 15 clusters times each time. Finally, based on-ssrchangeoption, final number of clusters will be automatically selected.-cmetric ssr-sst: Use Elbow method to decide final number of clusters. It is not for final purpose.-nfeature 5: Take 5 feature fromfeat proj.xvginput file. Here it is projection of first 5 eigenvectors on the trajectory.-ssrchange 1.0: Threshold (percentage) of change in SSR/SST ratio in Elbow method to automatically decide the number of clusters.-g clusters_scan.log: Log file containing clustering information and cluster-mterics.
index group order
First index group - Output group of atoms in the central structures and clustered trajectories
Second index group - Group of atoms to calculate RMSD between central conformations of clusters as RMSD matrix, which is dumped in the log file with
-goption. Here, it is Ligand.Third index group - Used for Superimposition by least-square fitting. ONLY used in separate clustered trajectories to superimpose conformations on the central structure. Here, it is protein’s C-alpha atoms.
Content of the output ``-g clusters_scan.log`` file
It contains the command summary, and for each input cluster-numbers, number of frames in each clusters. At the end it dumps the Cluster Metrics Summary, which is important for deciding final number of clusters.
[4]:
%%bash
echo 0 12 26 | gmx_clusterByFeatures cluster -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx \
-method kmeans -feat proj.xvg -nfeature 10 -cmetric ssr-sst -ncluster 15 -ssrchange 1.0 \
-g clusters_scan.log
:-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:
Executable: gmx_clusterByFeatures cluster
Data prefix: /project/external/gmx_installed
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -method kmeans -feat proj.xvg -nfeature 10 -cmetric ssr-sst -ncluster 15 -ssrchange 1.0 -g clusters_scan.log
:-) gmx_clusterByFeatures cluster (-:
Author: Rajendra Kumar
Copyright (C) 2018-2019 Rajendra Kumar
gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures. If not, see <http://www.gnu.org/licenses/>.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Group 25 ( Protein_Other) has 8393 elements
Group 26 (C-alpha_Other_&_!H*) has 574 elements
Select a group: Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Group 25 ( Protein_Other) has 8393 elements
Group 26 (C-alpha_Other_&_!H*) has 574 elements
Select a group: Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Group 25 ( Protein_Other) has 8393 elements
Group 26 (C-alpha_Other_&_!H*) has 574 elements
Reading frame 4 time 80.000
=======================
Cluster Log output
=======================
Command:
=======================
'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -method kmeans -feat proj.xvg -nfeature 10 -cmetric ssr-sst -ncluster 15 -ssrchange 1.0 -g clusters_scan.log
=======================
Choose a group for the output:
Selected 0: 'System'
Choose a group for clustering/RMSD calculation:
Selected 12: 'Other'
Choose a group for fitting or superposition:
Selected 26: 'C-alpha_Other_&_!H*'
Input Trajectory dt = 20 ps
###########################################
########## NUMBER OF CLUSTERS : 1 ########
###########################################
===========================
Cluster-ID TotalFrames
1 22501
===========================
###########################################
########## NUMBER OF CLUSTERS : 2 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15000
2 7501
===========================
###########################################
########## NUMBER OF CLUSTERS : 3 ########
###########################################
===========================
Cluster-ID TotalFrames
1 7510
2 7501
3 7490
===========================
###########################################
########## NUMBER OF CLUSTERS : 4 ########
###########################################
===========================
Cluster-ID TotalFrames
1 7501
2 7484
3 5309
4 2207
===========================
###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################
===========================
Cluster-ID TotalFrames
1 7501
2 7485
3 3872
4 2205
5 1438
===========================
###########################################
########## NUMBER OF CLUSTERS : 6 ########
###########################################
===========================
Cluster-ID TotalFrames
1 7485
2 5829
3 3872
4 2205
5 1672
6 1438
===========================
###########################################
########## NUMBER OF CLUSTERS : 7 ########
###########################################
===========================
Cluster-ID TotalFrames
1 5829
2 5075
3 3872
4 2415
5 2200
6 1672
7 1438
===========================
###########################################
########## NUMBER OF CLUSTERS : 8 ########
###########################################
===========================
Cluster-ID TotalFrames
1 5830
2 5082
3 2408
4 2386
5 2199
6 1671
7 1491
8 1434
===========================
###########################################
########## NUMBER OF CLUSTERS : 9 ########
###########################################
===========================
Cluster-ID TotalFrames
1 5083
2 4605
3 2407
4 2398
5 2199
6 1552
7 1479
8 1434
9 1344
===========================
###########################################
########## NUMBER OF CLUSTERS : 10 ########
###########################################
===========================
Cluster-ID TotalFrames
1 4900
2 4615
3 2387
4 2195
5 2188
6 1542
7 1489
8 1434
9 1344
10 407
===========================
###########################################
########## NUMBER OF CLUSTERS : 11 ########
###########################################
===========================
Cluster-ID TotalFrames
1 4692
2 4290
3 2387
4 2188
5 2177
6 1489
7 1473
8 1434
9 1336
10 803
11 232
===========================
###########################################
########## NUMBER OF CLUSTERS : 12 ########
###########################################
===========================
Cluster-ID TotalFrames
1 4692
2 4290
3 2189
4 2177
5 1682
6 1473
7 1434
8 1336
9 1118
10 1075
11 803
12 232
===========================
###########################################
########## NUMBER OF CLUSTERS : 13 ########
###########################################
===========================
Cluster-ID TotalFrames
1 3029
2 2944
3 2332
4 2271
5 2189
6 1866
7 1682
8 1434
9 1250
10 1118
11 1075
12 951
13 360
===========================
###########################################
########## NUMBER OF CLUSTERS : 14 ########
###########################################
===========================
Cluster-ID TotalFrames
1 4290
2 2969
3 2357
4 2188
5 2177
6 1432
7 1384
8 1240
9 1092
10 1082
11 935
12 803
13 320
14 232
===========================
###########################################
########## NUMBER OF CLUSTERS : 15 ########
###########################################
===========================
Cluster-ID TotalFrames
1 4339
2 3051
3 2367
4 2248
5 2143
6 1394
7 1250
8 1153
9 952
10 924
11 790
12 787
13 537
14 333
15 233
===========================
===========================================================================================================
Cluster Metrics Summary
-----------------------------------------------------------------------------------------------------------
Clusters SSR/SST D(SSR/SST) (Psuedo)F-stat. Silhouette-score Davies-bouldin-score
2 50.45 50.455 22911.836 0.530 0.729
3 84.67 34.215 62129.371 0.689 0.503
4 89.04 4.374 60944.555 0.652 0.591
5 92.08 3.039 65406.023 0.701 0.477
6 94.15 2.069 72417.812 0.644 0.538
7 95.13 0.974 73154.625 0.533 0.738
8 95.52 0.395 68511.797 0.487 0.908
9 95.76 0.237 63453.414 0.407 1.049
10 96.10 0.345 61606.973 0.399 1.024
11 96.28 0.182 58261.570 0.374 1.043
12 96.48 0.201 56103.879 0.371 1.070
13 96.64 0.152 53840.742 0.327 1.200
14 96.80 0.165 52342.699 0.351 1.065
15 96.91 0.111 50402.191 0.302 1.157
===========================================================================================================
#####################################
Final number of cluster selected: 6
#####################################
Calculating central structure for cluster-6 ...
Reading frame 5 time 188780.000
GROMACS reminds you: "No, no, you're not thinking, you're just being logical." (Niels Bohr)
===========================================
Cluster-ID Central Frame Total Frames
1 18119 7485
2 3515 5829
3 10774 3872
4 12874 2205
5 833 1672
6 9439 1438
===========================================
Extracting coordinates of the central structure...
Calculating RMSD between central structures...
=====================================
Central structurs - RMSD matrix
=====================================
c1 c2 c3 c4 c5 c6
0.000 0.414 0.512 0.479 0.458 0.536
0.414 0.000 0.605 0.624 0.214 0.600
0.512 0.605 0.000 0.253 0.587 0.221
0.479 0.624 0.253 0.000 0.587 0.266
0.458 0.214 0.587 0.587 0.000 0.583
0.536 0.600 0.221 0.266 0.583 0.000
=====================================
6. Clustering with pre-determined number of clusters
As can be seen above in Cluster-Metrics table obtained in clusters_scan.log file, best scores (both Silhouette and Davies-Bouldin) were obtained for 5 clusters. Therefore, now we will do final clustering with 5 clusters and extract trajectories of these clusters.
Explanation of options are as follows:
-method kmeans: Use K-Means clustering algorithm-ncluster 5: Five clusters will be generated using K-Means clustering.-cmetric prior: No cluster-mterics is used, number of clusters is already known.-nfeature 5: Take 5 feature fromfeat proj.xvginput file. Here it is projection of first 5 eigenvectors on the trajectory.cluster-5: Cluster log file-cpdb clustered_trajs/central.pdb: Central structures of each cluster as PDB file-fout clustered_trajs/cluster.xtc: Trajectory file of each cluster-outframe 1000: Only first 1000 frame after sorting to be written in cluster trajectory file-sort features: Sort the conformation in each cluster trajectory file based on the distance between central structure and current conformation in feature sub-space.
This command could take a long time to execute!
This command could take a long time to execute because it is writing output trajectory file for each cluster sorted by distance in feature-space. Therefore, it needs to read input trajectory back-and-forth many time to extract the conformations in sorted manner. XTC format is fast for back-and-forth reading, and it still could take long time to dump the output trajectories.
[5]:
%%bash
mkdir clustered_trajs
echo 0 12 26 | gmx_clusterByFeatures cluster -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx \
-feat proj.xvg -nfeature 5 -method kmeans -cmetric prior -ncluster 5 -g cluster-5 \
-cpdb clustered_trajs/central.pdb -fout clustered_trajs/cluster.xtc -outframe 1000 \
-plot pca-cluster-5.png -sort features
:-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:
Executable: gmx_clusterByFeatures cluster
Data prefix: /project/external/gmx_installed
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/protein-ligand-interaction
Command line:
'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -feat proj.xvg -nfeature 5 -method kmeans -cmetric prior -ncluster 5 -g cluster-5 -cpdb clustered_trajs/central.pdb -fout clustered_trajs/cluster.xtc -outframe 1000 -plot pca-cluster-5.png -sort features
:-) gmx_clusterByFeatures cluster (-:
Author: Rajendra Kumar
Copyright (C) 2018-2019 Rajendra Kumar
gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures. If not, see <http://www.gnu.org/licenses/>.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Reading file inputs/complex_res_segments.tpr, VERSION 5.1.2 (single precision)
Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Group 25 ( Protein_Other) has 8393 elements
Group 26 (C-alpha_Other_&_!H*) has 574 elements
Select a group: Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Group 25 ( Protein_Other) has 8393 elements
Group 26 (C-alpha_Other_&_!H*) has 574 elements
Select a group: Group 0 ( System) has 8393 elements
Group 1 ( Protein) has 8325 elements
Group 2 ( Protein-H) has 4217 elements
Group 3 ( C-alpha) has 542 elements
Group 4 ( Backbone) has 1626 elements
Group 5 ( MainChain) has 2169 elements
Group 6 ( MainChain+Cb) has 2660 elements
Group 7 ( MainChain+H) has 2668 elements
Group 8 ( SideChain) has 5657 elements
Group 9 ( SideChain-H) has 2048 elements
Group 10 ( Prot-Masses) has 8325 elements
Group 11 ( non-Protein) has 68 elements
Group 12 ( Other) has 68 elements
Group 13 ( L1) has 4 elements
Group 14 ( L2) has 3 elements
Group 15 ( L3) has 8 elements
Group 16 ( L4) has 8 elements
Group 17 ( L5) has 14 elements
Group 18 ( L6) has 2 elements
Group 19 ( L7) has 3 elements
Group 20 ( L8) has 1 elements
Group 21 ( L9) has 10 elements
Group 22 ( L10) has 10 elements
Group 23 ( L11) has 1 elements
Group 24 ( L12) has 4 elements
Group 25 ( Protein_Other) has 8393 elements
Group 26 (C-alpha_Other_&_!H*) has 574 elements
Reading frame 4 time 80.000
=======================
Cluster Log output
=======================
Command:
=======================
'gmx_clusterByFeatures cluster' -f inputs/trajectory.xtc -s inputs/complex_res_segments.tpr -n inputs/index.ndx -feat proj.xvg -nfeature 5 -method kmeans -cmetric prior -ncluster 5 -g cluster-5 -cpdb clustered_trajs/central.pdb -fout clustered_trajs/cluster.xtc -outframe 1000 -plot pca-cluster-5.png -sort features
=======================
Choose a group for the output:
Selected 0: 'System'
Choose a group for clustering/RMSD calculation:
Selected 12: 'Other'
Choose a group for fitting or superposition:
Selected 26: 'C-alpha_Other_&_!H*'
Input Trajectory dt = 20 ps
###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################
===========================
Cluster-ID TotalFrames
1 7501
2 7485
3 3872
4 2205
5 1438
===========================
#####################################
Final number of cluster selected: 5
#####################################
Calculating central structure for cluster-5 ...
Reading frame 4 time 188780.000 <string>:127: MatplotlibDeprecationWarning: The non_interactive_bk attribute was deprecated in Matplotlib 3.9 and will be removed in 3.11. Use ``matplotlib.backends.backend_registry.list_builtin(matplotlib.backends.BackendFilter.NON_INTERACTIVE)`` instead.
Back Off! I just backed up clid.xvg to ./#clid.xvg.1#
Reading frame 4000 time 188780.000
GROMACS reminds you: "Those people who think they know everything are a great annoyance to those of us who do." (Isaac Asimov)
===========================================
Cluster-ID Central Frame Total Frames
1 2786 7501
2 18119 7485
3 10774 3872
4 12874 2205
5 9439 1438
===========================================
Extracting coordinates of the central structure...
Calculating RMSD between central structures...
=====================================
Central structurs - RMSD matrix
=====================================
c1 c2 c3 c4 c5
0.000 0.418 0.603 0.619 0.599
0.418 0.000 0.512 0.479 0.536
0.603 0.512 0.000 0.253 0.221
0.619 0.479 0.253 0.000 0.266
0.599 0.536 0.221 0.266 0.000
=====================================
Writing central structure to pdb-files...
Writing trajectory for each cluster...
7. Analysis
Now, we will perform following analysis on obtained clusters:
Comparison of RMSDs within and between the clusters: It will show the qualitative measure of separation of conformations of clusters in term of RMSD.
Plotting PC vs PC cluster-wise. In fact, this is already plotted in the above obtained
pca-cluster-5.pngfile. However, we will focus on first three PCs to demonstrate the distribution of conformation in PC space.Cluster-ID with time: We will plot cluster-id as a function of time to analyze, how conformation is changing between clusters as a function of time.
At first, we will load Python modules and define some functions as follows:
[6]:
import re
import sys
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
[7]:
def read_xvg(filename):
''' Read any XVG file and return the data as 2D array where data is row-wise with respect to time.
'''
fin = open(filename, 'r')
data = []
for line in fin:
line = line.rstrip().lstrip()
if not line:
continue
if re.search('^#|^@', line) is not None:
continue
temp = re.split('\s+', line)
data.append(list(map(float, temp)))
data = np.asarray(data)
return data.T
1a. Calculation of RMSDs within and between the clusters
At first, we need to calculate RMSDs of complex within and between the clusters using gmx rms command as follows.
Note: The complex structure is already superimposed when separated cluster-trajectory were written in the previous step, therefore, we are not performing fitting in RMSD calculations below,
Note: Remove %%capture --no-stdout and %%capture --no-stderr to populate all the output generated from gmx rms commands.
[8]:
%%capture --no-stdout
%%capture --no-stderr
%%script bash
# make directory for rmsd files
mkdir clustered-rmsd
for i in `seq 1 5`
do
for j in `seq 1 5`
do
echo 26 | gmx rms -f clustered_trajs/cluster_c${j}.xtc -s clustered_trajs/central_c${i}.pdb -n inputs/index.ndx -o clustered-rmsd/c${i}_c${j} -nopbc -fit none
done
done
1b. Comparison of ligand RMSDs within and between the clusters
We will use Python to plot all the obtained RMSDs above. Conformations in cluster trajectory is sorted by distance in feature-space and therefore, RMSD will be randomly fluctuate. RMSDs within the same cluster is expected to be lowest.
[9]:
mpl.rcParams['font.size']=12
fig = plt.figure(figsize=(12,16))
fig.subplots_adjust(top=0.98, bottom=0.05, hspace=0.25)
for i in range(1,6):
ax = fig.add_subplot(6, 2, i)
for j in range(1,6):
filename = 'clustered-rmsd/c{0}_c{1}.xvg'.format(i, j)
label = "{0} vs {1}".format(i, j)
data = read_xvg(filename) # read file
if i==j:
ax.scatter(data[0]/1000, data[1], label=label, lw=0, s=2)
else:
ax.scatter(data[0]/1000, data[1], label=label, lw=0, s=0.5)
ax.set_ylim(0, 0.5)
ax.set_xlabel('Time (ns)')
ax.set_ylabel('RMSD (nm)')
plt.legend(loc='upper center', ncol=4, markerscale=10, borderaxespad=0.1, columnspacing=1, handlelength=1, handletextpad=0.4)
plt.savefig('rmsd-comparison.png', dpi=300)
plt.show()
3. Plotting PC vs PC cluster-wise
It will be done in two steps:
An input file will be prepared containing information about feature searial and their labels.
gmx_clusterByFeatures featuresplotwill be used to generate the plot.
[10]:
%%bash
# First step - preparation of input file
echo "1,2,PC-1,PC-2" > features-label.txt
echo "2,3,PC-2,PC-3" >> features-label.txt
echo "1,3,PC-1,PC-3" >> features-label.txt
echo "1,4,PC-1,PC-4" >> features-label.txt
cat features-label.txt
# Second step - plotting
gmx_clusterByFeatures featuresplot -i features-label.txt -feat proj.xvg -clid clid.xvg -lcols 6 -o protein-ligand-interaction-PCs-vs-PCs.png
1,2,PC-1,PC-2
2,3,PC-2,PC-3
1,3,PC-1,PC-3
1,4,PC-1,PC-4

3. Cluster-ID with time
We will use clid.xvg obtained from the cluster subcommand to plot both cluster-id and also highlight the occurance of the given cluster.
[11]:
mpl.rcParams['font.size']=12
data=read_xvg('clid.xvg')
fig = plt.figure(figsize=(8,6))
fig.subplots_adjust(right=0.85)
ax = fig.add_subplot(111)
ax.scatter(data[0]/1000, data[1], marker='x', color='k')
ax.set_ylabel('Cluster')
ax.set_xlabel('Time (ns)')
plt.savefig('clid.png', dpi=300)
plt.show()