cluster
It is the main tool for clustering. It takes at least three input files and performs clustering according to the given options. It also generates a log file containing information about the clustering.
gmx_clusterByFeatures cluster can be used with trajectory and tpr files generated by GROMACS. For other GROMACS versions or other programs such as NAMD and AMBER, a PDB file can be used in place of the tpr file.
Trajectories from NAMD and AMBER should be converted to GROMACS compatible formats such as trr, xtc, pdb etc.
Execute the following command to get full help:
gmx_clusterByFeatures cluster -h
Warning
Only PBC-corrected trajectory and tpr files should be used as inputs. A PBC-corrected PDB/GRO file can be used in place of the tpr file.
Command summary
gmx_clusterByFeatures cluster [-f [<.xtc/.trr/...>]] [-s [<.tpr/.gro/...>]]
[-feat [<.xvg>]] [-n [<.ndx>]] [-clid [<.xvg>]] [-g [<.log>]]
[-fout [<.xtc/.trr/...>]] [-cpdb [<.pdb>]] [-rmsd [<.xvg>]]
[-b <time>] [-e <time>] [-dt <time>] [-tu <enum>] [-xvg <enum>]
[-method <enum>] [-nfeature <int>] [-cmetric <enum>]
[-ncluster <int>] [-crmsthres <real>] [-ssrchange <real>]
[-db_eps <real>] [-db_min_samples <int>] [-sil_ssize <real>]
[-nminfr <int>] [-[no]fit] [-[no]fit2central] [-outframe <int>]
[-sort <enum>] [-plot <string>] [-fsize <int>] [-pltw <real>]
[-plth <real>]
Options summary
| Option | Default | File type |
|---|---|---|
| -f | traj.xtc | Trajectory: xtc trr cpt gro g96 pdb tng |
| -s | topol.tpr | Structure+mass(db): tpr gro g96 pdb brk ent |
| -n | index.ndx | Index file |
| -feat | feature.xvg | xvgr/xmgr file |
| Option | Default | File type |
|---|---|---|
| -clid | clid.xvg | xvgr/xmgr file (Can be used as both input and output) |
| -g | cluster.log | Log file |
| -fout | trajout.xtc | Trajectory: xtc trr cpt gro g96 pdb tng |
| -cpdb | central.pdb | Protein data bank file |
| -rmsd | rmsd.xvg | xvgr/xmgr file |
| Option | Default | Description |
|---|---|---|
| -b | 0 | First frame (ps) to read from trajectory |
| -e | 0 | Last frame (ps) to read from trajectory |
| -dt | 0 | Only use frame when t MOD dt = first time (ps) |
| -xvg | xmgrace | xvg plot formatting: xmgrace, xmgr, none |
| -method | kmeans | Clustering methods. Accepted methods are: kmeans, dbscan, gmixture |
| -nfeature | 10 | Number of features to use for clustering |
| -cmetric | prior | Cluster metrics: Method to determine cluster number. Accepted methods are: prior, rmsd, ssr-sst, pFS, DBI |
| -ncluster | 5 | Number of clusters to generate for prior method. Maximum number of cluster for ctrmsd method. |
| -crmsthres | 0.1 | RMSD (nm) threshold between central structures for RMSD cluster metric method. |
| -ssrchange | 2 | Threshold relative change % in SSR/SST ratio for ssr-sst cluster metric method. |
| -sil_ssize | 20 | Percentage of number of frames to be considered as sample size for silhouette score calculation. |
| -db_eps | 0.5 | The maximum distance between two samples for them to be considered as in the same neighborhood. |
| -db_min_samples | 20 | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. |
| -nminfr | 20 | Number of minimum frames in a cluster to output it as trajectory |
| -[no]fit | Enable | Enable fitting and superimposition of the atoms groups different from RMSD/clustering group before RMSD calculation. |
| -[no]fit2central | Disable | Enable/Disable trajectory superimposition or fitting to central structure in the output trajectory |
| -outframe | -1 | Number of maximum frames in the output trajectories. |
| -sort | none | Sort trajectory according to these values. Accepted methods are: none, rmsd, rmsdist, features, user |
| -plot | pca_cluster.png | To plot features with clusters in this file. |
| -fsize | 14 | Font size in plot. |
| -pltw | 12 | Width (inch) of the plot. |
| -plth | 20 | Height (inch) of the plot. |
Options to specify input files
-f traj.xtc
Input trajectory file of xtc trr cpt gro g96 pdb or
tng format.
Note
If this file is not provided, only clustering will be performed. Operations that require a trajectory, such as RMSD calculation, central structure calculation and writing clustered trajectories, will be skipped.
Note
In case of XTC and TNG formats, writing central structures and clustered trajectories is relatively fast.
-s topol.tpr
An input structure file of tpr gro g96 or pdb format. It is required
if a trajectory is given as input.
-n index.ndx
If given, index groups from this file will be offered for selection. Otherwise, default index groups will be offered for selection.
This file is ignored when no trajectory file is provided.
- Users will be prompted for three index groups:
Choose a group for the output: Select an index group to write as the central structure and clustered trajectory. It can be the whole system or any part of the system.
Choose a group for clustering/RMSD calculation: The atom group on which clustering is performed and for which RMSD is calculated.
Note
If you are doing PCA based clustering, it should be the same as the second index group selected in gmx covar and gmx anaeig.
Choose a group for fitting or superposition: The atom group used for fitting or superposition before RMSD calculation.
Note
This group will only be prompted for when the -fit or -fit2central option is given. Otherwise, the group selected above will be used for fitting.
Note
If you are doing PCA based clustering, it should be the same as the first index group selected in gmx covar and gmx anaeig.
-feat features.xvg
It accepts a file containing features of the trajectory as a function of time.
Its format is similar to the projections file generated by gmx anaeig.
Therefore, in case of PCA data, the output (-proj) of gmx anaeig can be
used directly as input for gmx_clusterByFeatures.
Each feature is stored as a block of two columns: the first column is time and the second column is the feature value. Consecutive time-feature blocks should be separated by “&”.
The format is as follows:
# FEATURE - 1
# Time values
0.0 123.12
10.0 123.12
20.0 123.12
.
.
.
&
# FEATURE - 2
0.0 123.12
10.0 123.12
20.0 123.12
.
.
.
&
# FEATURE - 3
0.0 123.12
10.0 123.12
20.0 123.12
.
.
.
&
Note
If this file is not provided, -clid [<.xvg>] becomes a required option.
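The block layout described above can be read with a few lines of Python. The following is only a minimal sketch of a reader for this layout, not the parser used by gmx_clusterByFeatures; the function name `parse_feature_file` is hypothetical, and skipping lines that start with `#` or `@` is an assumption based on the usual xvg conventions.

```python
def parse_feature_file(text, nfeature=10):
    """Return a list of features; each feature is a list of (time, value) pairs."""
    features, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines, comments (#) and xmgrace directives (@) carry no data.
        if not line or line[0] in "#@":
            continue
        if line == "&":
            # "&" closes the current time-feature block.
            if current:
                features.append(current)
            current = []
            continue
        time, value = line.split()[:2]
        current.append((float(time), float(value)))
    if current:  # tolerate a missing trailing "&"
        features.append(current)
    # Mimic -nfeature: keep at most the requested number of features.
    return features[:nfeature]


example = """\
# FEATURE - 1
0.0 1.5
10.0 1.7
&
# FEATURE - 2
0.0 -0.3
10.0 -0.1
&
"""
features = parse_feature_file(example)
```

Here `features[0]` holds the time series of the first feature and `features[1]` that of the second.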
Options to specify output files
-clid clid.xvg
It can be both an input and an output file. It contains two columns: the first column is time and the second column is the cluster label/id.
By default, when clustering is performed, this file is written after clustering finishes and contains the cluster id of each frame.
However, it can also be given as input to obtain clustered trajectories. For example,
if clustering was performed with “gmx cluster”, the resulting -clid [<.xvg>]
file can be used here to extract clustered trajectories.
Note
To treat this as an input file, do not use -feat [<.xvg>] option.
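To make the two-column layout concrete, the sketch below groups frame times by cluster id. It is an illustration only: the function name `read_cluster_ids` is hypothetical, and ignoring lines that start with `#`, `@` or `&` is an assumption based on common xvg conventions.

```python
from collections import defaultdict


def read_cluster_ids(text):
    """Map each cluster id to the list of frame times assigned to it."""
    clusters = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: comment, directive and separator lines carry no data.
        if not line or line[0] in "#@&":
            continue
        time, label = line.split()[:2]
        clusters[int(float(label))].append(float(time))
    return dict(clusters)


clid_text = """\
# time  cluster-id
0.0   1
10.0  1
20.0  2
30.0  1
"""
frames_per_cluster = read_cluster_ids(clid_text)
```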
-g cluster.log
It is the output log file and contains information about the clustering method and the results obtained.
-fout trajout.xtc
Output clustered trajectories. A separate trajectory is written for each cluster, and these trajectories can be used for further analysis.
Each trajectory file name is suffixed with its respective cluster-id.
-cpdb central.pdb
Output separate pdb files for central structures of each cluster.
Each pdb file name is suffixed by its respective cluster-id.
-rmsd rmsd.xvg
RMSD of clustering atom groups with respect to central structure.
Each RMSD file name is suffixed by its respective cluster-id.
With the -sort rmsdist option, distance-matrix RMSD is calculated.
Other options
-xvg xmgrace
It directs the formatting of all output <.xvg> files. By default, <.xvg> files are
in xmgrace format, which can be plotted using Grace (xmgrace command).
To plot with any other program, use -xvg none to obtain a plain text
file.
- Three keywords are accepted:
xmgrace
xmgr
none
-method kmeans
Method to use for clustering. All methods are taken from the Python scikit-learn library.
An overview of the clustering methods is available in the scikit-learn documentation.
- Presently, the following methods are implemented:
-method kmeans: K-means clustering. It needs the number of clusters as input (-ncluster <int>), so one should know beforehand how many clusters are present in the data. To determine the number of clusters automatically, see -cmetric. For more details about the k-means method, see the scikit-learn documentation.
-method dbscan: DBSCAN (density-based spatial clustering of applications with noise). It does not require the number of clusters beforehand. The clusters are controlled by two other input options: -db_eps and -db_min_samples. For more details about the DBSCAN method, see the scikit-learn documentation.
-method gmixture: Gaussian mixture model clustering. It also needs the number of clusters as input (-ncluster <int>), so one should know beforehand how many clusters are present in the data. To determine the number of clusters automatically, see -cmetric. For more details about the Gaussian mixture model, see the scikit-learn documentation.
-nfeature 10
Number of features to be read from the -feat file.
If the file contains fewer features than requested, all features will be read.
-cmetric prior
Cluster metric to determine the total number of clusters automatically, particularly for k-means and Gaussian mixture model clustering.
Note
All the cluster metrics are only applicable when -method kmeans or
-method gmixture is used.
- Presently, the following cluster metrics are implemented:
-cmetric prior: If the number of clusters is known beforehand, use this with -ncluster <int>. Here, -ncluster takes the number of clusters as input.
-cmetric rmsd: Root mean square deviation between central structures of clusters. It uses the -crmsthres option as the RMSD threshold/cutoff.
Note
It requires a trajectory file as input. Otherwise, -cmetric ssr-sst will be used as the cluster metric with the default -ssrchange value.
-cmetric ssr-sst: The SSR/SST ratio, used for the Elbow method. The number of clusters is chosen using a threshold on the relative change (in percent) of the SSR/SST ratio, given by -ssrchange.
-cmetric silhouette: Silhouette score. From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters." The first cluster count encountered with the highest silhouette score is taken as the final number of clusters.
To calculate the score, either the entire data set is used with the option -sil_ssize -1, which can be time-consuming, or a percentage of the data is taken by random sampling with the option -sil_ssize. Because of the random sampling, this score might not be reproduced exactly in successive calculations.
-cmetric DBI: Davies–Bouldin index. The cluster count with the lowest value is selected.
-ncluster 5
It takes the number of clusters. Its usage depends on -cmetric.
Note
It is only applicable when -method kmeans or -method gmixture
is used.
- Conditions:
For -cmetric prior, it is the number of clusters to be generated.
For -cmetric rmsd, it is the largest number of clusters to be generated. The number of clusters is then reduced iteratively until no RMSD between central structures is below the RMSD threshold (-crmsthres <real>).
For -cmetric ssr-sst, -cmetric pFS and -cmetric DBI, it is the maximum number of clusters to be generated. At first, two clusters are generated, and the number of clusters is increased by one in each iteration. When the maximum number of clusters is reached, these three cluster metrics are calculated and the final number of clusters is selected.
-crmsthres 0.1
RMSD (nm) threshold between central structures for the RMSD cluster metric method.
It is used with -cmetric rmsd. In each iteration, the RMSD between all pairs of central
structures is calculated. If any RMSD value is within the input RMSD (nm)
threshold, the number of clusters is decreased by one in the next iteration.
It is assumed that when the RMSD between two central structures is within the threshold, the structures are similar enough for the two clusters to be merged into one. However, these two particular clusters will not necessarily merge in the next iteration.
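The iteration described above can be summarised in a short Python sketch. This illustrates the stated logic only and is not the tool's implementation: `central_rmsds` is a hypothetical callable that re-clusters the data into `k` clusters and returns the pairwise RMSDs (nm) between the resulting central structures, and falling back to a single cluster when no larger count passes is an assumption.

```python
def choose_ncluster_by_rmsd(central_rmsds, ncluster=5, crmsthres=0.1):
    """Decrease the cluster count until all central structures differ by more than crmsthres."""
    k = ncluster
    while k > 1:
        # Accept k only if no pair of central structures is within the threshold.
        if all(rmsd > crmsthres for rmsd in central_rmsds(k)):
            return k
        k -= 1  # some pair is too similar: try one cluster fewer
    return 1  # assumption: fall back to a single cluster


# Made-up pairwise RMSDs for each cluster count, for illustration only.
fake_rmsds = {5: [0.05, 0.30], 4: [0.08, 0.25], 3: [0.15, 0.20, 0.40], 2: [0.50]}
selected = choose_ncluster_by_rmsd(fake_rmsds.__getitem__, ncluster=5, crmsthres=0.1)
```

With these made-up values, five and four clusters are rejected because one pair of central structures lies within 0.1 nm, and three clusters are accepted.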
-ssrchange 2.0
Threshold for the relative percentage change in the SSR/SST ratio, used to choose the number of clusters automatically. This threshold gives the potential position of the elbow in the Elbow method.
Note
This option is only used when -cmetric ssr-sst is provided as input.
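As a rough illustration of how such a threshold can locate an elbow, the sketch below takes precomputed SSR/SST ratios for increasing cluster counts and returns the last count before the relative gain drops below -ssrchange. The exact selection rule used by gmx_clusterByFeatures is not specified here, so this rule, the function name `pick_elbow` and the example ratios are all assumptions.

```python
def pick_elbow(ratios, ssrchange=2.0):
    """ratios maps cluster count k to its SSR/SST ratio; returns the chosen k."""
    counts = sorted(ratios)
    for prev, k in zip(counts, counts[1:]):
        # Relative change (in percent) of the ratio when going from prev to k clusters.
        change = 100.0 * (ratios[k] - ratios[prev]) / ratios[prev]
        if abs(change) < ssrchange:
            return prev  # adding another cluster no longer helps noticeably
    return counts[-1]  # threshold never reached: keep the largest count


# Made-up SSR/SST ratios for 2 to 6 clusters.
example_ratios = {2: 0.50, 3: 0.70, 4: 0.80, 5: 0.81, 6: 0.815}
elbow = pick_elbow(example_ratios, ssrchange=2.0)
```

In this example the ratio grows by 40% and about 14% for three and four clusters, but only by 1.25% for five clusters, so four clusters are selected.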
-sil_ssize 20
Percentage of the total number of frames to use as the sample size for
the silhouette score calculation. If its value is -1, no sampling is
performed and all frames are used.
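The effect of the sample size can be illustrated with a small pure-Python silhouette calculation. This is a sketch under assumptions, not the routine used by the tool (which relies on scikit-learn): the function name `sampled_silhouette` and the `seed` argument are hypothetical, and points in a singleton cluster or in a sample containing only one cluster are given a score of 0.

```python
import math
import random


def sampled_silhouette(points, labels, sil_ssize=20, seed=None):
    """Mean silhouette score over a random sample of sil_ssize percent of the frames."""
    n = len(points)
    if sil_ssize == -1:
        idx = list(range(n))  # no sampling: use every frame
    else:
        size = max(2, round(n * sil_ssize / 100))
        idx = random.Random(seed).sample(range(n), min(size, n))
    scores = []
    for i in idx:
        # Distances from frame i to the other sampled frames, grouped by cluster.
        by_cluster = {}
        for j in idx:
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(points[i], points[j]))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster in the sample
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster (cohesion)
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (5.1, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
full_score = sampled_silhouette(points, labels, sil_ssize=-1)
```

Because the sample is drawn at random, two runs with the same percentage can give slightly different scores unless the random seed is fixed.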
-db_eps 0.5
The maximum distance between two samples for them to be considered as in the same neighborhood.
-db_min_samples 20
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
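How -db_eps and -db_min_samples interact can be shown with a brute-force core-point check. This is only a sketch of the DBSCAN core-point definition quoted above, assuming Euclidean distance in feature space; the function name `core_points` is hypothetical and the full DBSCAN algorithm (cluster expansion, noise labelling) is not reproduced.

```python
import math


def core_points(points, db_eps=0.5, db_min_samples=20):
    """Return indices of points with at least db_min_samples neighbours within db_eps."""
    core = []
    for i, p in enumerate(points):
        # The point itself is counted, because dist(p, p) == 0 <= db_eps.
        neighbours = sum(1 for q in points if math.dist(p, q) <= db_eps)
        if neighbours >= db_min_samples:
            core.append(i)
    return core


toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
core = core_points(toy, db_eps=0.5, db_min_samples=3)
```

In this toy data set the first three points are core points, while the isolated fourth point is not.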
-nminfr 20
Minimum number of frames a cluster must contain to be written as a trajectory. If a cluster has fewer frames than this number, it will be ignored.
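The filtering rule can be expressed in a couple of lines. The sketch below assumes the cluster assignments are already available as a mapping from cluster id to frame times; the function name `clusters_to_write` is hypothetical.

```python
def clusters_to_write(frames_per_cluster, nminfr=20):
    """Keep only clusters that contain at least nminfr frames."""
    return {
        cid: frames
        for cid, frames in frames_per_cluster.items()
        if len(frames) >= nminfr  # clusters with fewer frames are ignored
    }


assignments = {1: [0.0, 10.0, 30.0], 2: [20.0], 3: [40.0, 50.0]}
kept = clusters_to_write(assignments, nminfr=2)
```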
-fit/-nofit
Enable fitting and superimposition using an atom group different from the RMSD/clustering group before RMSD calculation. If enabled, the index group for fitting will be prompted for. Otherwise, fitting will be performed with the RMSD/clustering group.
-fit2central/-nofit2central
Enable/disable superimposition or fitting of the output trajectory to the
central structure. The atom group used for fitting depends on the -[no]fit
option. With -nofit, the second input index group (RMSD/clustering group) is
used for fitting; otherwise, the third index group is used.
-outframe -1
Maximum number of frames in the output trajectories. It can be helpful for obtaining an output trajectory with only the structures around the central structure.
-sort none
Sort trajectory according to these values.
- Accepted methods are:
-sort none: Output trajectory will not be sorted.
-sort rmsd: Sort trajectory according to RMSD with respect to the central structure. Therefore, the first frame of the obtained trajectory will be the central structure, and RMSD will increase gradually after the first frame.
-sort rmsdist: Sort trajectory according to distance-matrix RMSD with respect to the central structure. Therefore, the first frame of the obtained trajectory will be the central structure, and distance-matrix RMSD will increase gradually after the first frame.
-sort features: Sort trajectory according to the feature sub-space. The distance of each conformation to its respective central structure is calculated in feature space, and the trajectory is written from lowest to highest distance. In this trajectory, the first frame will be the central structure.
This option is very useful when the features are something other than projections on PCA eigenvectors.
-sort user: Sort trajectory using values supplied by the user. Not yet implemented.
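The idea behind -sort features, combined with -outframe, can be sketched as follows. This is an illustration under assumptions rather than the tool's code: the distance is taken to be Euclidean, a negative -outframe is assumed to mean "no limit", and the function name `sort_by_feature_distance` is hypothetical.

```python
import math


def sort_by_feature_distance(frames, central, outframe=-1):
    """frames: list of (time, feature_vector); central: feature vector of the central structure."""
    # Order frames from lowest to highest distance to the central structure.
    ordered = sorted(frames, key=lambda frame: math.dist(frame[1], central))
    # Assumption: a negative outframe keeps every frame.
    return ordered if outframe < 0 else ordered[:outframe]


cluster_frames = [
    (0.0, (1.0, 1.0)),
    (10.0, (0.1, 0.0)),
    (20.0, (0.0, 0.0)),  # the central structure itself
    (30.0, (2.0, 2.0)),
]
nearest = sort_by_feature_distance(cluster_frames, central=(0.0, 0.0), outframe=2)
```

The central structure has zero distance to itself, so it always comes first in the sorted output.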
-plot pca_cluster.png
Plot features vs. features with clusters and save the plot to this file.
A plot is generated in which features are plotted against each other, with the clusters shown in different colors. It is helpful for checking whether the number of clusters is sufficient.
See also
Similar types of plots can be obtained using featuresplot sub-command.