Clustering conformations using distances between atoms
This tutorial showcase the example where calculated features such as any distances can be used to perform the conformational clustering. The advantages is that it can be used to filter-out parts of the trajectory based on the given features. Since, trajectories could be separated, further analysis can be perofrmed on filtered trajectory.
Instructions
Tutorial files: The tutorial files can be downloaded from here.
Extract the files:
tar -zxvf distances.tar.gzGo to directory:
cd distancesCopy the Jupyter Notebook: This notebook is available in the GitHub repo. Download and copy it from the github.
Required Tools
GROMACS
gmx_clusterByFeatures
Steps
In this tutorial, conformations of G-Quadruplex DNA is clustered according to distances between three atom-pairs. These atom-pairs form hydrogen bonds during the simulations. However, only one atom-pair among them could form hydrogen bond at a time or neither of them ccould form hydrogen bond. The formation of hydrogen bonds are extremely fluctuating between these atom-pairs, and therefore, clustering will filter conformations based on these hydrogen bonds and distances between atom-pairs.
Calculation of distances
Preparation of feature input file
Clustering using three distances
Analysis
1. Calculation of distances

We will use the GROMACS command gmx pairdist ro calculate these three distances between these three atom-pairs as described in following.
[1]:
%%bash
gmx pairdist -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -ref "resid 1 and atomname N7" -sel "resid 17 and atomname H62" -o r1N7-r17H62
gmx pairdist -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -ref "resid 1 and atomname H62" -sel "resid 17 and atomname N3" -o r1H62-r17N3
gmx pairdist -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -ref "resid 1 and atomname H61" -sel "resid 17 and atomname N1" -o r1H61-r17N1
:-) GROMACS - gmx pairdist, 2025.2 (-:
Executable: /opt/gromacs-2025/bin/gmx
Data prefix: /opt/gromacs-2025
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distances
Command line:
gmx pairdist -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -ref 'resid 1 and atomname N7' -sel 'resid 17 and atomname H62' -o r1N7-r17H62
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading frame 40000 time 800000.000
Analyzed 40639 frames, last time 812760.000
GROMACS reminds you: "Load Up Your Rubber Bullets" (10 CC)
:-) GROMACS - gmx pairdist, 2025.2 (-:
Executable: /opt/gromacs-2025/bin/gmx
Data prefix: /opt/gromacs-2025
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distances
Command line:
gmx pairdist -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -ref 'resid 1 and atomname H62' -sel 'resid 17 and atomname N3' -o r1H62-r17N3
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading frame 40000 time 800000.000
Analyzed 40639 frames, last time 812760.000
GROMACS reminds you: "A programmer's spouse says 'While you're at the grocery store, buy some eggs.' The programmer never comes back." (Anonymous)
:-) GROMACS - gmx pairdist, 2025.2 (-:
Executable: /opt/gromacs-2025/bin/gmx
Data prefix: /opt/gromacs-2025
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distances
Command line:
gmx pairdist -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -ref 'resid 1 and atomname H61' -sel 'resid 17 and atomname N1' -o r1H61-r17N1
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading frame 40000 time 800000.000
Analyzed 40639 frames, last time 812760.000
GROMACS reminds you: "A scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it." (Max Planck)
2. Preparation of feature input file
For clustering, these distances are required to put in a single file as the three features. We perform following steps to create a single feature file compatible with the gmx_clusterByFeatures:
[2]:
%%bash
cat r1N7-r17H62.xvg > distances.xvg
printf "\n& \n\n" >> distances.xvg
cat r1H62-r17N3.xvg >> distances.xvg
printf "\n& \n\n" >> distances.xvg
cat r1H61-r17N1.xvg >> distances.xvg
printf "\n& \n\n" >> distances.xvg
3. Clustering using three distances
Now, we will perform clustering using K-Means algorithm. One of the drawback of K-Means clustering is that the number of clusters should be known beforehand. To automate the decision about number of clusters, gmx_clusterByFeatures implements several cluster metrics. We will use option -cmetric ssr-sst to use the [Elbow
method](https://en.wikipedia.org/wiki/Elbow_method_(clustering).
Following command will perform the clustering of conformations using firt 10 PCs projection. Explanation of options are as follows:
-method kmeans: Use K-Means clustering algorithm-ncluster 10: K-Means clustering will be performed for 1 upto 10 clusters times each time. Finally, based on-ssrchangeoption, final number of clusters will be automatically selected.-cmetric ssr-sst: Use Elbow method to decide final number of clusters.-nfeature 3: There are three distances as the features for clustering.-sort features: Sort the output clustered trajectory based on the distance in feature space from its central structure.-fit2central: Fit/Superimpose the conformation on central structure in output clustered trajectory.-ssrchange 2: Threshold (percentage) of change in SSR/SST ratio in Elbow method to decide the number of clusters.-cpdb clustered-trajs/central.pdb: Dump central conformation of each cluster as a separate pdb file.-fout clustered-trajs/cluster.xtc: Dump conformations of each cluster as the separate trajectory file.-plot pca_cluster.png: Plot the feature-space (here, it will plot distance vs distance for all three combinations) coloured by the clusters.
index group order
First index group - Output group of atoms in the central structures and clustered trajectories
Second index group - Group of atoms to calculate RMSD between central conformations of clusters as RMSD matrix, which is dumped in the log file with
-goption. Here, it is C-alpha atoms of protein.Third index group - Used for Superposition by least-square fitting. ONLY used in separate clustered trajectories to superimpose conformations on the central structure.
[3]:
%%bash
# create a new folder to contain clustered trajectory and pdb files
mkdir clustered-trajs
echo 0 1 7 | gmx_clusterByFeatures cluster -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -n inputs/dna_ion.ndx -feat distances.xvg \
-method kmeans -nfeature 3 -cmetric ssr-sst -ncluster 10 -fit2central \
-sort features -ssrchange 2 -cpdb clustered-trajs/central.pdb \
-fout clustered-trajs/cluster.xtc -plot features_cluster.png \
:-) GROMACS - gmx_clusterByFeatures cluster, 2025.0-dev-20250210-6949615-local (-:
Executable: gmx_clusterByFeatures cluster
Data prefix: /project/external/gmx_installed
Working dir: /home/raj/workspace/gmx_clusterByFeatrues/tutorials/distances
Command line:
'gmx_clusterByFeatures cluster' -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -n inputs/dna_ion.ndx -feat distances.xvg -method kmeans -nfeature 3 -cmetric ssr-sst -ncluster 10 -fit2central -sort features -ssrchange 2 -cpdb clustered-trajs/central.pdb -fout clustered-trajs/cluster.xtc -plot features_cluster.png
:-) gmx_clusterByFeatures cluster (-:
Author: Rajendra Kumar
Copyright (C) 2018-2019 Rajendra Kumar
gmx_clusterByFeatures is a free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
gmx_clusterByFeatures is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with gmx_clusterByFeatures. If not, see <http://www.gnu.org/licenses/>.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Reading file inputs/dna_ion.tpr, VERSION 2016.5 (single precision)
Group 0 ( System) has 774 elements
Group 1 ( DNA) has 719 elements
Group 2 ( K) has 38 elements
Group 3 ( CL) has 17 elements
Group 4 ( Ion) has 55 elements
Group 5 ( K) has 38 elements
Group 6 ( CL) has 17 elements
Group 7 (r_2-4_r_6-7_r_9_r_11-13_r_18-20_&_!O1P_O2P_P_O5'_C5'_C4'_C3'_C2'_C1'_O4'_O3'_H*) has 132 elements
Select a group: Group 0 ( System) has 774 elements
Group 1 ( DNA) has 719 elements
Group 2 ( K) has 38 elements
Group 3 ( CL) has 17 elements
Group 4 ( Ion) has 55 elements
Group 5 ( K) has 38 elements
Group 6 ( CL) has 17 elements
Group 7 (r_2-4_r_6-7_r_9_r_11-13_r_18-20_&_!O1P_O2P_P_O5'_C5'_C4'_C3'_C2'_C1'_O4'_O3'_H*) has 132 elements
Select a group: Group 0 ( System) has 774 elements
Group 1 ( DNA) has 719 elements
Group 2 ( K) has 38 elements
Group 3 ( CL) has 17 elements
Group 4 ( Ion) has 55 elements
Group 5 ( K) has 38 elements
Group 6 ( CL) has 17 elements
Group 7 (r_2-4_r_6-7_r_9_r_11-13_r_18-20_&_!O1P_O2P_P_O5'_C5'_C4'_C3'_C2'_C1'_O4'_O3'_H*) has 132 elements
Reading frame 4 time 80.000
=======================
Cluster Log output
=======================
Command:
=======================
'gmx_clusterByFeatures cluster' -s inputs/dna_ion.tpr -f inputs/dna_ion_combine.xtc -n inputs/dna_ion.ndx -feat distances.xvg -method kmeans -nfeature 3 -cmetric ssr-sst -ncluster 10 -fit2central -sort features -ssrchange 2 -cpdb clustered-trajs/central.pdb -fout clustered-trajs/cluster.xtc -plot features_cluster.png
=======================
Choose a group for the output:
Selected 0: 'System'
Choose a group for clustering/RMSD calculation:
Selected 1: 'DNA'
Choose a group for fitting or superposition:
Selected 7: 'r_2-4_r_6-7_r_9_r_11-13_r_18-20_&_!O1P_O2P_P_O5'_C5'_C4'_C3'_C2'_C1'_O4'_O3'_H*'
Input Trajectory dt = 20 ps
###########################################
########## NUMBER OF CLUSTERS : 1 ########
###########################################
===========================
Cluster-ID TotalFrames
1 40639
===========================
###########################################
########## NUMBER OF CLUSTERS : 2 ########
###########################################
===========================
Cluster-ID TotalFrames
1 29163
2 11476
===========================
###########################################
########## NUMBER OF CLUSTERS : 3 ########
###########################################
===========================
Cluster-ID TotalFrames
1 21574
2 11844
3 7221
===========================
###########################################
########## NUMBER OF CLUSTERS : 4 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15783
2 11509
3 8457
4 4890
===========================
###########################################
########## NUMBER OF CLUSTERS : 5 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15640
2 8435
3 6343
4 5325
5 4896
===========================
###########################################
########## NUMBER OF CLUSTERS : 6 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15632
2 6331
3 5360
4 4828
5 4465
6 4023
===========================
###########################################
########## NUMBER OF CLUSTERS : 7 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15632
2 6241
3 4767
4 4766
5 4108
6 2830
7 2295
===========================
###########################################
########## NUMBER OF CLUSTERS : 8 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15631
2 6130
3 4687
4 4073
5 2854
6 2703
7 2428
8 2133
===========================
###########################################
########## NUMBER OF CLUSTERS : 9 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15630
2 6121
3 4680
4 3901
5 2873
6 2293
7 1996
8 1876
9 1269
===========================
###########################################
########## NUMBER OF CLUSTERS : 10 ########
###########################################
===========================
Cluster-ID TotalFrames
1 15633
2 5792
3 4203
4 3343
5 2877
6 2116
7 1950
8 1857
9 1613
10 1255
===========================
===========================================================================================================
Cluster Metrics Summary
-----------------------------------------------------------------------------------------------------------
Clusters SSR/SST D(SSR/SST) (Psuedo)F-stat. Silhouette-score Davies-bouldin-score
2 65.54 65.541 77290.039 0.611 0.628
3 79.25 13.707 77587.961 0.565 0.781
4 90.08 10.832 122991.211 0.661 0.550
5 93.78 3.699 153133.891 0.713 0.430
6 95.60 1.826 176765.344 0.686 0.520
7 96.87 1.263 209391.516 0.687 0.534
8 97.35 0.485 213448.906 0.675 0.611
9 97.68 0.330 214039.922 0.674 0.621
10 97.98 0.299 219102.672 0.670 0.680
===========================================================================================================
#####################################
Final number of cluster selected: 5
#####################################
Calculating central structure for cluster-5 ...
Reading frame 4 time 787380.000 <string>:127: MatplotlibDeprecationWarning: The non_interactive_bk attribute was deprecated in Matplotlib 3.9 and will be removed in 3.11. Use ``matplotlib.backends.backend_registry.list_builtin(matplotlib.backends.BackendFilter.NON_INTERACTIVE)`` instead.
Reading frame 40000 time 712940.000
GROMACS reminds you: "You Fill Your Space So Sweet" (F. Apple)
===========================================
Cluster-ID Central Frame Total Frames
1 27707 15640
2 19707 8435
3 2166 6343
4 24630 5325
5 39369 4896
===========================================
Extracting coordinates of the central structure...
Calculating RMSD between central structures...
=====================================
Central structurs - RMSD matrix
=====================================
c1 c2 c3 c4 c5
0.000 0.522 0.192 0.266 0.401
0.522 0.000 0.562 0.453 0.408
0.192 0.562 0.000 0.271 0.418
0.266 0.453 0.271 0.000 0.385
0.401 0.408 0.418 0.385 0.000
=====================================
Writing central structure to pdb-files...
Writing trajectory for each cluster...
Number of clusters
As can be seen above in the Cluster Metrics Summary, both Silhouette and Davies-Bouldin score is best when 5 clusters are generated. Using Elbow method also, same number of clusters have been automatically decided. Therefore, for this purpose, total number of clusters should be 5 according to the cluster-metrics.
4. Analysis
We will perform following analyses on how the above clustering is able to separate conformations based on these three distances.
Mapping of cluster with the distances as a function of time
Distribution of distance cluster-wise
Setup for analysis: At first, we defined few Python functions to read the file in Python.
[4]:
import re
import sys
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
def read_clid(filename):
''' Read the clid.xvg file
It returns the dictionary listing all the frame indices belonging to that given cluster.
Returns
-------
Dictionary of clid to index
'''
fin = open(filename, 'r')
data = dict()
index = 0
for line in fin:
line = line.rstrip().lstrip()
if not line:
continue
if re.search('^#|^@', line) is not None:
continue
temp = re.split('\s+', line)
clid = int(temp[1])
if clid not in data:
data[clid] = [ index ]
else:
data[clid] = data[clid] + [ index ]
index = index + 1
return data
def read_xvg(filename):
''' Read any XVG file and return the data as 2D array where data is row-wise with respect to time.
'''
fin = open(filename, 'r')
data = []
for line in fin:
line = line.rstrip().lstrip()
if not line:
continue
if re.search('^#|^@', line) is not None:
continue
temp = re.split('\s+', line)
data.append(list(map(float, temp)))
data = np.asarray(data)
return data.T
1. Mapping of cluster with the distances as a function of time
Here, we will highlight the distances according to the clusters as a function of time.
[17]:
dist_files=["r1N7-r17H62", "r1H62-r17N3", "r1H61-r17N1" ]
labels = ["A1.N7-A17.H62", "A1.H62-A17.N3", "A1.H61-A17.N1" ]
clid_index = read_clid("clid.xvg")
clid = list(sorted(clid_index.keys()))
fig = plt.figure()
fig.subplots_adjust(hspace=0)
axes=[]
for i in range(len(dist_files)):
data = read_xvg(dist_files[i]+'.xvg')
data[0] = data[0]/1000
if i == 0:
ax = fig.add_subplot(3,1,i+1)
else:
ax = fig.add_subplot(3,1,i+1, sharex=axes[0])
ax.tick_params(axis='x', bottom=True, top=True, direction='inout', length=10)
ax.set_ylabel(labels[i])
axes.append(ax)
plot_data = []
time = []
for ix in clid:
plot_data.append(data[1][clid_index[ix]])
time.append(data[0][clid_index[ix]])
ax.scatter(time, plot_data, label='cluster-{0}'.format(ix), s=0.2)
plot_data = []
time = []
if i==0:
handles, legend_labels = ax.get_legend_handles_labels()
axes[2].set_xlabel('Time (ns)')
fig.text(0.03, 0.4,'Distance (nm)', rotation='vertical')
fig.legend(handles, legend_labels, ncol=5, loc='upper center',scatterpoints=4,markerscale=8, handletextpad=0.1,
columnspacing=1.4, handlelength=2, borderaxespad=1.2)
plt.savefig('distance_cluster.png', dpi=300)
plt.show()
2. Distribution of distance cluster-wise
We will calculate and plot distribution of all three distances with respect to each cluster. It will demonstrate that which distance is filtered out in which cluster more easily.
[23]:
fig = plt.figure()
fig.subplots_adjust(hspace=0)
axes=[]
for i in range(len(dist_files)):
clid = list(sorted(clid_index.keys()))
data = read_xvg(dist_files[i]+'.xvg')
if i == 0:
ax = fig.add_subplot(3,1,i+1)
else:
ax = fig.add_subplot(3,1,i+1, sharex=axes[0])
ax.tick_params(axis='x', bottom=True, top=True, direction='inout', length=7)
ax.set_ylabel(labels[i])
axes.append(ax)
plot_data = []
plot_data.append(data[1])
for ix in clid:
plot_data.append(data[1][clid_index[ix]])
clid = [0] + clid
ax.violinplot(plot_data, positions=clid, showmedians = True, showextrema=False, points=500)
xticks = []
for c in clid:
if c == 0:
xticks.append('Full')
else:
xticks.append('Cluster-{0}'.format(c))
ax.set_xticks([ cid for cid in clid])
ax.set_xticklabels(xticks)
fig.text(0.03, 0.4,'Distance (nm)', rotation='vertical')
plt.savefig("distance_distribution.png", dpi=300)
Above distance distributions clearly show:
Cluster-1 : Hydrogen bond between A1.N7 and A17.H62
Cluster-2 : All three distances between 0.5 to 1.5 nm
Cluster-3 : Hydrogen bond between A1.H61 and A17.N1
Cluster-4 : Hydrogen bond between A1.H62 and A17.N3
Cluster-5 : All distances above 1.5 nm