To cluster a protein conformations using distance-matrix PCA

In this tutorial, conformational clustering of a flexible protein TFAM will be performed using the distance-matrix PCA (dmPCA). TFAM is a mitochondiral protein that interact and binds with the mitochondiral DNA. When TFAM is bound to the DNA, its conformation is rigid, however, in unbound form TFAM is extremely flexible.

Final result

2e38ba1ccaf94dcc8164cb8feb4298c0

Instructions

Tutorial files: The tutorial files can be downloaded from here.
Extract the files: tar -zxvf distmat-TFAM-pca.tar.gz
Go to directory: cd distmat-TFAM-pca
Copy the Jupyter Notebook: This notebook is available in the GitHub repo. Download and copy it from the github.

Required Tools

GROMACS
gmx_clusterByFeatures

Steps

Since the conformations are flexible, the fitting or superimposition is not accurate with the reference structure, therefore cartesian-coordinate PCA is not enough to accurately cluster the conformation. Instead we will employ distance-matrix based PCA to cluster the conformations of the TFAM. Following steps will be performed to do clustering:

Calculation of distance-matrix over the trajectory
PCA from the distance-matrix
Calculation of Projections of first ten PCs on the trajectory
Clustering using first ten PCs as the features
Analysis

1. Calculation of distance-matrix over the trajectory

We will use distmat sub-command to calculate the distance-matrix over the trajectory. Following command will generate two files:

pca.xtc: This file is a container for distance-matrices over the entire trajectory in xtc file format. This is not a real trajectory file.
pca_dummy.pdb: This is a dummy pdb file containing same number of entries as obtained in above xtc file.

Notes

-gx 5 is used to reduce the size of distance-matrix. It means that there is a gap of 4 residues along X-axis in distance-matrix. For example, if a protein contains 100 residues, distance-matrix size is 100x100. If -gx 5 is used, new size is 20x100.
-gx and -gy options ONLY affect output produced with -pca option of distmat.
The distance-matrix calculation is parallelized using pthread, and therefore will use all the available threads of the CPU. It can be controlled using -nt option.

[1]:

%%bash

echo 3 3 | gmx_clusterByFeatures distmat -f inputs/woDNA.xtc -s inputs/woDNA.tpr -n inputs/index.ndx -pca -gx 5 -std stdev-matrix.dat

 :-) GROMACS - gmx_clusterByFeatures distmat, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures distmat
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distmat-TFAM-pca
Command line:
  'gmx_clusterByFeatures distmat' -f inputs/woDNA.xtc -s inputs/woDNA.tpr -n inputs/index.ndx -pca -gx 5 -std stdev-matrix.dat


         :-)  gmx_clusterByFeatures distmat (-:

             Author: Rajendra Kumar

       Copyright (C) 2018-2019  Rajendra Kumar


gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Reading file inputs/woDNA.tpr, VERSION 4.5.5-dev-20110921-e25c350 (single precision)
Reading file inputs/woDNA.tpr, VERSION 4.5.5-dev-20110921-e25c350 (single precision)
Select first group:
Group     0 (         System) has  3580 elements
Group     1 (        Protein) has  3580 elements
Group     2 (      Protein-H) has  1884 elements
Group     3 (        C-alpha) has   195 elements
Group     4 (       Backbone) has   585 elements
Group     5 (      MainChain) has   781 elements
Group     6 (   MainChain+Cb) has   973 elements
Group     7 (    MainChain+H) has   969 elements
Group     8 (      SideChain) has  2611 elements
Group     9 (    SideChain-H) has  1103 elements
Group    10 (    Prot-Masses) has  3338 elements
Group    11 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180) has  2331 elements
Group    12 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180_&_C-alpha) has   124 elements
Group    13 (r_1-98_&_C-alpha) has    98 elements
Group    14 (r_99-195_&_C-alpha) has    97 elements
Select a group: Select second group:
Group     0 (         System) has  3580 elements
Group     1 (        Protein) has  3580 elements
Group     2 (      Protein-H) has  1884 elements
Group     3 (        C-alpha) has   195 elements
Group     4 (       Backbone) has   585 elements
Group     5 (      MainChain) has   781 elements
Group     6 (   MainChain+Cb) has   973 elements
Group     7 (    MainChain+H) has   969 elements
Group     8 (      SideChain) has  2611 elements
Group     9 (    SideChain-H) has  1103 elements
Group    10 (    Prot-Masses) has  3338 elements
Group    11 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180) has  2331 elements
Group    12 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180_&_C-alpha) has   124 elements
Group    13 (r_1-98_&_C-alpha) has    98 elements
Group    14 (r_99-195_&_C-alpha) has    97 elements
Select a group: There are 195 residues with 195 atoms in first group
There are 195 residues with 195 atoms in second group
Reading frame   32000 time 640000.000

GROMACS reminds you: "The Feeling of Power was Intoxicating, Magic" (Frida Hyvonen)

Selected 3: 'C-alpha'
Selected 3: 'C-alpha'
 Number of distance-matrix elements for PCA trajectory: 3706
 Number of distance-matrix coordinates in PCA trajectory: 1236

2. PCA from the distance-matrix

Now, we will use pca.xtc and pca_dummy.pdb generated in the above command as input files to GROMACS tool gmx covar to perform PCA. This step will calculate covariance matrix, eigenvectors and eigenvalues. By default, the eigenvectors are written in eigenvec.trr while eigenvalues are written in eigenval.xvg files.

-nofit, -nomwa and -nopbc options are used because xtc file does not contain cartesian-coordinates and these option has no meanings in the case of the distance-matrix trajectory.

[2]:

%%bash

echo 0 | gmx covar -f pca.xtc -s pca_dummy.pdb -nofit -nomwa -nopbc

                      :-) GROMACS - gmx covar, 2025.2 (-:

Executable:   /opt/gromacs-2025/bin/gmx
Data prefix:  /opt/gromacs-2025
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distmat-TFAM-pca
Command line:
  gmx covar -f pca.xtc -s pca_dummy.pdb -nofit -nomwa -nopbc

Group     0 (         System) has  1236 elements
Group     1 (        Protein) has  1236 elements
Group     2 (      Protein-H) has  1236 elements
Group     3 (        C-alpha) has  1236 elements
Group     4 (       Backbone) has  1236 elements
Group     5 (      MainChain) has  1236 elements
Group     6 (   MainChain+Cb) has  1236 elements
Group     7 (    MainChain+H) has  1236 elements
Group     8 (      SideChain) has     0 elements
Group     9 (    SideChain-H) has     0 elements
Select a group: Calculating the average structure ...
Reading frame   32000 time 640000.000
Constructing covariance matrix (3708x3708) ...
Reading frame   32000 time 640000.000
Read 32016 frames

Trace of the covariance matrix: 803.441 (nm^2)

Diagonalizing ...

Sum of the eigenvalues: 803.442 (nm^2)

Writing eigenvalues to eigenval.xvg

Writing average structure & eigenvectors 1--3708 to eigenvec.trr
Wrote the log to covar.log

GROMACS reminds you: "Courage is like - it's a habitus, a habit, a virtue: you get it by courageous acts. It's like you learn to swim by swimming. You learn courage by couraging." (Marie Daly)


WARNING: Masses and atomic (Van der Waals) radii will be guessed
         based on residue and atom names, since they could not be
         definitively assigned from the information in your input
         files. These guessed numbers might deviate from the mass
         and radius of the atom type. Please check the output
         files if necessary. Note, that this functionality may
         be removed in a future GROMACS version. Please, consider
         using another file format for your input.


Choose a group for the covariance analysis
Selected 0: 'System'

3. Projections of first ten PCs on the trajectory

We will use eigenvectors written in eigenvec.trr, pca.xtc and pca_dummy.pdb as input files to GROMACS tool gmx anaeig to calculate projection of first 10 eigenvectors on distance-matrix trajectory. These projections will be written into proj.xvg file.

[3]:

%%bash

echo 0 0 | gmx anaeig -f pca.xtc -s pca_dummy.pdb -first 1 -last 10 -proj

                      :-) GROMACS - gmx anaeig, 2025.2 (-:

Executable:   /opt/gromacs-2025/bin/gmx
Data prefix:  /opt/gromacs-2025
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distmat-TFAM-pca
Command line:
  gmx anaeig -f pca.xtc -s pca_dummy.pdb -first 1 -last 10 -proj

trr version: GMX_trn_file (single precision)
Eigenvectors in eigenvec.trr were determined without fitting
Read non mass weighted average/minimum structure with 1236 atoms from eigenvec.trr
Read 3708 eigenvectors (for 1236 atoms)


WARNING: If there are molecules in the input trajectory file
         that are broken across periodic boundaries, they
         cannot be made whole (or treated as whole) without
         you providing a run input file.

Group     0 (         System) has  1236 elements
Group     1 (        Protein) has  1236 elements
Group     2 (      Protein-H) has  1236 elements
Group     3 (        C-alpha) has  1236 elements
Group     4 (       Backbone) has  1236 elements
Group     5 (      MainChain) has  1236 elements
Group     6 (   MainChain+Cb) has  1236 elements
Group     7 (    MainChain+H) has  1236 elements
Group     8 (      SideChain) has     0 elements
Group     9 (    SideChain-H) has     0 elements
Select a group: 10 eigenvectors selected for output: 1 2 3 4 5 6 7 8 9 10
Reading frame   32000 time 640000.000


GROMACS reminds you: "Courage is like - it's a habitus, a habit, a virtue: you get it by courageous acts. It's like you learn to swim by swimming. You learn courage by couraging." (Marie Daly)


Select an index group of 1236 elements that corresponds to the eigenvectors
Selected 0: 'System'

4. Clustering using first ten PCs

Now, we will perform clustering using K-Means algorithm. One of the drawback of K-Means clustering is that the number of clusters should be known beforehand. To automate the decision about number of clusters, gmx_clusterByFeatures implements several cluster metrics. We will use option -cmetric ssr-sst to use the [Elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering).

Following command will perform the clustering of conformations using firt 10 PCs projection. Explanation of options are as follows:

-method kmeans: Use K-Means clustering algorithm
-ncluster 10: K-Means clustering will be performed for 1 upto 10 clusters times each time. Finally, based on -ssrchange option, final number of clusters will be automatically selected.
-cmetric ssr-sst: Use Elbow method to decide final number of clusters.
-nfeature 10: Take 10 feature from feat proj.xvg input file. Here it is projection of first ten eigenvectors.
-sort features: Sort the output clustered trajectory based on the distance in feature space from its central structure.
-fit2central: Fit/Superimpose the conformation on central structure in output clustered trajectory.
-ssrchange 2: Threshold (percentage) of change in SSR/SST ratio in Elbow method to decide the number of clusters.
-cpdb clustered-trajs/central.pdb: Dump central conformation of each cluster as a separate pdb file.
-fout clustered-trajs/cluster.xtc: Dump conformations of each cluster as the separate trajectory file.
-plot pca_cluster.png: Plot the feature-space (here, it is first 10 PCs from PCA) coloured by the clusters.
-rmsd clustered-trajs/rmsd.xvg: Calculate the RMSD with reference to central conformation of each clusters and written in separate files.

index group order

First index group - Output group of atoms in the central structures and clustered trajectories
Second index group - Group of atoms to calculate distance-matrix RMSD between central conformations of clusters as RMSD matrix, which is dumped in the log file with -g option. Here, it is C-alpha atoms of protein.
Third index group - Used for Superimposition by least-square fitting. ONLY used in separate clustered trajectories to superimpose conformations on the central structure.

Note: This command could take a long time to execute!

This command could take a long time to execute because it is writing output trajectory file for each cluster sorted by distance in feature-space. Therefore, it needs to read input trajectory back-and-forth many time to extract the conformations in sorted manner. XTC format is fast for back-and-forth reading, and it still could take long time to dump the output trajectories.

Content of output ``-g cluster.log`` file

It contains the command summary, and for each input cluster-numbers, number of frames in each clusters. At the end, it contains the Cluster Metrics Summary, which is important for deciding final number of clusters.

[ ]:

%%bash

# create a new folder to contain clustered trajectory and pdb files
mkdir clustered-trajs

echo 0 3 3 | gmx_clusterByFeatures cluster -s inputs/woDNA.tpr -f inputs/woDNA.xtc -n inputs/index.ndx -feat proj.xvg -method kmeans \
                                           -nfeature 10 -cmetric ssr-sst -ncluster 20 -fit2central -sort rmsdist \
                                           -ssrchange 2 -cpdb clustered-trajs/central.pdb -fout clustered-trajs/cluster.xtc \
                                           -plot pca_cluster.png

 :-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:

Executable:   gmx_clusterByFeatures cluster
Data prefix:  /project/external/gmx_installed
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distmat-TFAM-pca
Command line:
  'gmx_clusterByFeatures cluster' -s inputs/woDNA.tpr -f inputs/woDNA.xtc -n inputs/index.ndx -feat proj.xvg -method kmeans -nfeature 10 -cmetric ssr-sst -ncluster 20 -fit2central -sort rmsdist -ssrchange 2 -cpdb clustered-trajs/central.pdb -fout clustered-trajs/cluster.xtc -plot pca_cluster.png


         :-)  gmx_clusterByFeatures cluster (-:

             Author: Rajendra Kumar

       Copyright (C) 2018-2019  Rajendra Kumar


gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures.  If not, see <http://www.gnu.org/licenses/>.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Reading file inputs/woDNA.tpr, VERSION 4.5.5-dev-20110921-e25c350 (single precision)
Reading file inputs/woDNA.tpr, VERSION 4.5.5-dev-20110921-e25c350 (single precision)
Group     0 (         System) has  3580 elements
Group     1 (        Protein) has  3580 elements
Group     2 (      Protein-H) has  1884 elements
Group     3 (        C-alpha) has   195 elements
Group     4 (       Backbone) has   585 elements
Group     5 (      MainChain) has   781 elements
Group     6 (   MainChain+Cb) has   973 elements
Group     7 (    MainChain+H) has   969 elements
Group     8 (      SideChain) has  2611 elements
Group     9 (    SideChain-H) has  1103 elements
Group    10 (    Prot-Masses) has  3338 elements
Group    11 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180) has  2331 elements
Group    12 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180_&_C-alpha) has   124 elements
Group    13 (r_1-98_&_C-alpha) has    98 elements
Group    14 (r_99-195_&_C-alpha) has    97 elements
Select a group: Group     0 (         System) has  3580 elements
Group     1 (        Protein) has  3580 elements
Group     2 (      Protein-H) has  1884 elements
Group     3 (        C-alpha) has   195 elements
Group     4 (       Backbone) has   585 elements
Group     5 (      MainChain) has   781 elements
Group     6 (   MainChain+Cb) has   973 elements
Group     7 (    MainChain+H) has   969 elements
Group     8 (      SideChain) has  2611 elements
Group     9 (    SideChain-H) has  1103 elements
Group    10 (    Prot-Masses) has  3338 elements
Group    11 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180) has  2331 elements
Group    12 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180_&_C-alpha) has   124 elements
Group    13 (r_1-98_&_C-alpha) has    98 elements
Group    14 (r_99-195_&_C-alpha) has    97 elements
Select a group: Group     0 (         System) has  3580 elements
Group     1 (        Protein) has  3580 elements
Group     2 (      Protein-H) has  1884 elements
Group     3 (        C-alpha) has   195 elements
Group     4 (       Backbone) has   585 elements
Group     5 (      MainChain) has   781 elements
Group     6 (   MainChain+Cb) has   973 elements
Group     7 (    MainChain+H) has   969 elements
Group     8 (      SideChain) has  2611 elements
Group     9 (    SideChain-H) has  1103 elements
Group    10 (    Prot-Masses) has  3338 elements
Group    11 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180) has  2331 elements
Group    12 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180_&_C-alpha) has   124 elements
Group    13 (r_1-98_&_C-alpha) has    98 elements
Group    14 (r_99-195_&_C-alpha) has    97 elements
Reading frame       4 time   80.000

=======================
  Cluster Log output
=======================

Command:
=======================
'gmx_clusterByFeatures cluster' -s inputs/woDNA.tpr -f inputs/woDNA.xtc -n inputs/index.ndx -feat proj.xvg -method kmeans -nfeature 10 -cmetric ssr-sst -ncluster 20 -fit2central -sort rmsdist -ssrchange 2 -cpdb clustered-trajs/central.pdb -fout clustered-trajs/cluster.xtc -plot pca_cluster.png
=======================

Choose a group for the output:
Selected 0: 'System'

Choose a group for clustering/RMSD calculation:
Selected 3: 'C-alpha'

Choose a group for fitting or superposition:
Selected 3: 'C-alpha'


 Input Trajectory dt = 20 ps



###########################################
########## NUMBER OF CLUSTERS : 1 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               32016
===========================



###########################################
########## NUMBER OF CLUSTERS : 2 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               16361
2               15655
===========================



###########################################
########## NUMBER OF CLUSTERS : 3 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               11769
2               11232
3               9015
===========================



###########################################
########## NUMBER OF CLUSTERS : 4 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               14782
2               6838
3               6711
4               3685
===========================



###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               11103
2               8298
3               4572
4               4469
5               3574
===========================



###########################################
########## NUMBER OF CLUSTERS : 6 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               10817
2               8501
3               4195
4               3453
5               3117
6               1933
===========================



###########################################
########## NUMBER OF CLUSTERS : 7 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               7022
2               6867
3               6716
4               4211
5               3361
6               2132
7               1707
===========================



###########################################
########## NUMBER OF CLUSTERS : 8 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               6783
2               5835
3               4901
4               4757
5               3120
6               2781
7               2043
8               1796
===========================



###########################################
########## NUMBER OF CLUSTERS : 9 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               6175
2               5059
3               4361
4               4350
5               3074
6               2767
7               2476
8               2040
9               1714
===========================



###########################################
########## NUMBER OF CLUSTERS : 10 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4994
2               4179
3               3965
4               3882
5               3618
6               2685
7               2623
8               2473
9               1988
10              1609
===========================



###########################################
########## NUMBER OF CLUSTERS : 11 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4465
2               3960
3               3851
4               3330
5               3283
6               2662
7               2622
8               2506
9               1946
10              1878
11              1513
===========================



###########################################
########## NUMBER OF CLUSTERS : 12 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               4370
2               4069
3               4000
4               3629
5               3293
6               2108
7               2038
8               1934
9               1872
10              1814
11              1478
12              1411
===========================



###########################################
########## NUMBER OF CLUSTERS : 13 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3456
2               3334
3               3306
4               3256
5               3109
6               2748
7               2350
8               2327
9               1940
10              1803
11              1657
12              1432
13              1298
===========================



###########################################
########## NUMBER OF CLUSTERS : 14 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3629
2               3217
3               2921
4               2887
5               2846
6               2839
7               2429
8               1837
9               1762
10              1744
11              1743
12              1547
13              1378
14              1237
===========================



###########################################
########## NUMBER OF CLUSTERS : 15 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3666
2               3233
3               3217
4               3012
5               2987
6               2790
7               1906
8               1718
9               1603
10              1479
11              1342
12              1286
13              1281
14              1255
15              1241
===========================



###########################################
########## NUMBER OF CLUSTERS : 16 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3524
2               3309
3               2865
4               2841
5               2744
6               2651
7               2305
8               1745
9               1568
10              1436
11              1337
12              1263
13              1236
14              1197
15              1160
16              835
===========================



###########################################
########## NUMBER OF CLUSTERS : 17 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3192
2               2864
3               2849
4               2579
5               2405
6               2366
7               2259
8               1904
9               1602
10              1528
11              1468
12              1373
13              1349
14              1169
15              1168
16              1012
17              929
===========================



###########################################
########## NUMBER OF CLUSTERS : 18 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3326
2               2848
3               2777
4               2575
5               2479
6               2444
7               2109
8               1532
9               1483
10              1455
11              1356
12              1334
13              1202
14              1172
15              1160
16              1018
17              928
18              818
===========================



###########################################
########## NUMBER OF CLUSTERS : 19 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               3158
2               2785
3               2657
4               2474
5               2447
6               2361
7               2211
8               1531
9               1444
10              1393
11              1362
12              1349
13              1255
14              1234
15              1159
16              1011
17              921
18              803
19              461
===========================



###########################################
########## NUMBER OF CLUSTERS : 20 ########
###########################################

===========================
Cluster-ID      TotalFrames
1               2566
2               2549
3               2314
4               2285
5               2260
6               2216
7               2025
8               1777
9               1489
10              1469
11              1443
12              1324
13              1158
14              1157
15              1119
16              1115
17              1102
18              988
19              909
20              751
===========================



===========================================================================================================
                                          Cluster Metrics Summary
-----------------------------------------------------------------------------------------------------------
Clusters       SSR/SST    D(SSR/SST)         (Psuedo)F-stat.    Silhouette-score   Davies-bouldin-score
2              31.27      31.268                  14563.710     0.262              1.381
3              44.44      13.173                  12803.176     0.228              1.406
4              52.77       8.332                  11923.432     0.234              1.306
5              57.66       4.885                  10897.036     0.203              1.391
6              61.47       3.816                  10214.958     0.211              1.345
7              64.70       3.222                   9775.873     0.206              1.339
8              66.75       2.057                   9180.444     0.191              1.398
9              68.29       1.540                   8616.962     0.188              1.428
10             69.71       1.420                   8185.188     0.182              1.403
11             70.90       1.193                   7799.618     0.183              1.396
12             72.00       1.095                   7481.315     0.181              1.486
13             72.84       0.841                   7152.562     0.179              1.432
14             73.87       1.033                   6960.579     0.179              1.424
15             74.69       0.816                   6745.368     0.178              1.446
16             75.54       0.849                   6588.000     0.181              1.416
17             76.10       0.559                   6367.346     0.182              1.404
18             76.89       0.793                   6262.695     0.180              1.407
19             77.40       0.508                   6087.359     0.181              1.387
20             77.92       0.521                   5942.556     0.182              1.411
===========================================================================================================


#####################################
Final number of cluster selected: 8
#####################################

Calculating central structure for cluster-8 ...

Reading frame       7 time 274760.000   <string>:127: MatplotlibDeprecationWarning: The non_interactive_bk attribute was deprecated in Matplotlib 3.9 and will be removed in 3.11. Use ``matplotlib.backends.backend_registry.list_builtin(matplotlib.backends.BackendFilter.NON_INTERACTIVE)`` instead.
Reading frame    2000 time 220340.000

5. Analysis

Now, we will perform following analysis on obtained clusters:

Comparison of Distance-Matrix-RMSDs within and between the clusters: It will highlight the quality of clustering by measuring the difference in the clusters using distance-matrix RMSD
Plotting PC vs PC cluster-wise. In fact, this is already plotted in the above obtained pca_cluster.png file. However, we will focus on first three PCs to demonstrate the distribution of conformation in PC space.
Plotting distance-matrix-RMSF: Because, we are calculating distance-matrix over the trajectory, it gives us another way to compute RMSF by using distance-matrix. To gain insight of the flexible and rigid regions of the protein, this analysis is useful. It is similar to RMSF but avoids structure fitting and therefore, suitable for highly flexible proteins.
Cluster-ID with time: We will plot cluster-id as a function of time to analyze, how conformation is changing between clusters as a function of time.

At first, we will load Python modules and define some functions as follows:

[13]:

import re
import sys
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

[14]:

def read_mat_file(filename):
    ''' Read the matrix file generated from distmat sub-command
    '''
    fin = open(filename, 'r')

    data = []
    for line in fin:
        line = line.rstrip().lstrip()
        if not line:
            continue

        temp = re.split('\s+', line)
        data.append(list(map(float, temp)))

    data = np.asarray(data)

    return data.T

def read_xvg(filename):
    ''' Read any XVG file and return the data as 2D array where data is row-wise with respect to time.
    '''
    fin = open(filename, 'r')

    data = []
    for line in fin:
        line = line.rstrip().lstrip()
        if not line:
            continue

        if re.search('^#|^@', line) is not None:
                     continue

        temp = re.split('\s+', line)
        data.append(list(map(float, temp)))

    data = np.asarray(data)

    return data.T

1a. Calculation of Distance-Matrix-RMSDs within and between the clusters

At first, we need to calculate distance-matrix RMSD within and between the clusters using gmx rmsdist command as follows.

Note: Remove %%capture --no-stdout and %%capture --no-stderr to populate all the output generated from gmx rmsdist commands.

[15]:

%%capture --no-stdout
%%capture --no-stderr
%%script bash

mkdir rmsdist

# Nested For loop to calculate rmsdist of clustered-trajectory with reference to central structure with all possible combinations.
for i in `seq 1 8`
do
    for j in `seq 1 8`
    do
        echo 3 | gmx rmsdist -f clustered-trajs/cluster_c${j}.xtc -s clustered-trajs/central_c${i}.pdb -o rmsdist/c${i}_c${j} -nopbc
    done
done

1b. Comparison of Distance-Matrix-RMSDs within and between the clusters

We will use Python to plot all the obtained Distance-Matrix-RMSDs above.

[16]:

mpl.rcParams['font.size']=12
fig = plt.figure(figsize=(12,12))
fig.subplots_adjust(top=0.98, bottom=0.05, hspace=0.25)

for i in range(1,9):

    ax = fig.add_subplot(4, 2, i)

    for j in range(1,9):
        filename = 'rmsdist/c{0}_c{1}.xvg'.format(i, j)
        label = "{0} vs {1}".format(i, j)
        data = read_xvg(filename) # read file
        if i==j:
            ax.scatter(data[0]/1000, data[1], label=label, lw=0, s=2)
        else:
            ax.scatter(data[0]/1000, data[1], label=label, lw=0, s=0.5)

    ax.set_ylim(0, 1.8)
    ax.set_xlabel('Time (ns)')
    ax.set_ylabel('dmRMSD (nm)')
    plt.legend(loc='upper center', ncol=4, markerscale=10, borderaxespad=0.1, columnspacing=1, handlelength=1, handletextpad=0.4)


plt.savefig('rmsdist/rmsdist.png', dpi=300)
plt.show()

../_images/tutorials_distmat-pca-TFAM_14_0.png

2. Plotting PC vs PC cluster-wise

It will be done in two steps:

An input file will be prepared containing information about feature searial and their labels.
gmx_clusterByFeatures featuresplot will be used to generate the plot.

[21]:

%%bash

# First step - preparation of input file
echo "1,2,PC-1,PC-2" > features-label.txt
echo "2,3,PC-2,PC-3" >> features-label.txt
echo "1,3,PC-1,PC-3" >> features-label.txt
echo "1,4,PC-1,PC-4" >> features-label.txt
cat features-label.txt

# Second step - plotting
gmx_clusterByFeatures featuresplot -i features-label.txt -feat proj.xvg -clid clid.xvg -o distmat-pca-TFAM-PCs-vs-PCs.png

1,2,PC-1,PC-2
2,3,PC-2,PC-3
1,3,PC-1,PC-3
1,4,PC-1,PC-4

389fe3090e3b4bb6a74b50c2a19b2284

3a. Plotting distance-matrix-RMSF

Here we compare dmRMSF and conventional RMSF

We plot the standard deviation in distances in distance matrix. We use stdev-matrix.dat file obtained from distmat sub-command as the input file.
We calculate RMSF of the protein using gmx rmsf

[18]:

%%bash

# Generate the dmRMSF plot
gmx_clusterByFeatures matplot -i stdev-matrix.dat -o dmRMSF-matrix.png

# Calculate the conventional RMSF
echo 3 | gmx rmsf -f inputs/woDNA.xtc -s inputs/woDNA.tpr -n inputs/index.ndx -res -o rmsf.xvg

                       :-) GROMACS - gmx rmsf, 2025.2 (-:

Executable:   /opt/gromacs-2025/bin/gmx
Data prefix:  /opt/gromacs-2025
Working dir:  /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distmat-TFAM-pca
Command line:
  gmx rmsf -f inputs/woDNA.xtc -s inputs/woDNA.tpr -n inputs/index.ndx -res -o rmsf.xvg

Reading file inputs/woDNA.tpr, VERSION 4.5.5-dev-20110921-e25c350 (single precision)
Reading file inputs/woDNA.tpr, VERSION 4.5.5-dev-20110921-e25c350 (single precision)
Select group(s) for root mean square calculation
Group     0 (         System) has  3580 elements
Group     1 (        Protein) has  3580 elements
Group     2 (      Protein-H) has  1884 elements
Group     3 (        C-alpha) has   195 elements
Group     4 (       Backbone) has   585 elements
Group     5 (      MainChain) has   781 elements
Group     6 (   MainChain+Cb) has   973 elements
Group     7 (    MainChain+H) has   969 elements
Group     8 (      SideChain) has  2611 elements
Group     9 (    SideChain-H) has  1103 elements
Group    10 (    Prot-Masses) has  3338 elements
Group    11 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180) has  2331 elements
Group    12 (r_16-25_r_35-45_r_52-75_r_82-110_r_120-130_r_136-145_r_152-180_&_C-alpha) has   124 elements
Group    13 (r_1-98_&_C-alpha) has    98 elements
Group    14 (r_99-195_&_C-alpha) has    97 elements
Reading frame   32000 time 640000.000

Back Off! I just backed up rmsf.xvg to ./#rmsf.xvg.1#

GROMACS reminds you: "Being Great is Not So Good" (Red Hot Chili Peppers)

Selected 3: 'C-alpha'

91edf6a50f00409293d4104d5e5fa80a

3b. Deriving residue-wise fluctuation from dmRMSF

Above standard-deviation is a matrix. Therefore, to calculate fluctuations of residues, we will calculate mean of each rows. It will give the idea about fluctuations of each residue as simular to that of conventional RMSF. Subsequently, rmsf.xvg is read and plotted.

[19]:

dmRMSF = read_mat_file('stdev-matrix.dat') # read the stdev-matrix.dat file containing standard-deviation of distance matrix
derived_dmRMSF = np.mean(dmRMSF, axis=1)   # Derive the fluctuation by averaging over rows
rmsfData = read_xvg('rmsf.xvg')            # Read the conventional RMSF data

mpl.rcParams['font.size']=12
plt.plot(rmsfData[0], derived_dmRMSF, label='dmRMSF-derived')
plt.plot(rmsfData[0], rmsfData[1], label='Original RMSF')
plt.xlabel('Residue Number')
plt.ylabel('Fluctuation (nm)')
plt.legend()
plt.show()

../_images/tutorials_distmat-pca-TFAM_22_0.png

4. Cluster-ID with time

We will use clid.xvg obtained from the cluster subcommand to plot both cluster-id and also highlight the occurance of the given cluster.

[20]:

# Occurance has been calculated manually from the data obtained in cluster.log file
occur=[ 21.0, 18.1, 15.5, 14.7, 9.8, 8.7, 6.5, 5.6]

mpl.rcParams['font.size']=12

data=read_xvg('clid.xvg')
fig = plt.figure(figsize=(6,4))
fig.subplots_adjust(right=0.85)

ax = fig.add_subplot(111)
ax.scatter(data[0]/1000, data[1], marker='x', color='k')
ax.tick_params(axis='y', right=True)

ax.set_ylabel('Cluster')
ax.set_xlabel('Time (ns)')
ax.set_xlim(-5, 650)
for i in range(8):
    ax.text(665, i+1, '{0:3.1f}'.format(occur[i]), verticalalignment='center')
ax.text(760, 4.5, 'Occurence (%)', verticalalignment='center', rotation='vertical', horizontalalignment='center')

plt.savefig('clid.png', dpi=300)
plt.show()

../_images/tutorials_distmat-pca-TFAM_24_0.png