cluster
¶
It is the main tool for clustering. It takes at least three input files and performs clustering according to the given options. It also generates a log file containing information about the clustering.
gmx_clusterByFeatures cluster can be used with trajectory and tpr files generated by GROMACS.
- In case of other versions or other programs such as NAMD and AMBER, a PDB file can be used in place of the tpr file.
- Trajectories from NAMD and AMBER should be converted to GROMACS-compatible formats such as trr, xtc or pdb.
Execute the following command to get full help:
gmx_clusterByFeatures cluster -h
Warning
Only PBC-corrected trajectory and tpr files should be used as inputs. A PBC-corrected PDB/GRO file can be used in place of the tpr file.
Command summary¶
gmx_clusterByFeatures cluster [-f [<.xtc/.trr/...>]] [-s [<.tpr/.gro/...>]]
[-feat [<.xvg>]] [-n [<.ndx>]] [-clid [<.xvg>]] [-g [<.log>]]
[-fout [<.xtc/.trr/...>]] [-cpdb [<.pdb>]] [-rmsd [<.xvg>]]
[-b <time>] [-e <time>] [-dt <time>] [-tu <enum>] [-xvg <enum>]
[-method <enum>] [-nfeature <int>] [-cmetric <enum>]
[-ncluster <int>] [-crmsthres <real>] [-ssrchange <real>]
[-db_eps <real>] [-db_min_samples <int>] [-sil_ssize <real>]
[-nminfr <int>] [-[no]fit] [-[no]fit2central] [-outframe <int>]
[-sort <enum>] [-plot <string>] [-fsize <int>] [-pltw <real>]
[-plth <real>]
Options summary¶
Option | Default | File type |
---|---|---|
-f [<.xtc/.trr/…>] | traj.xtc | Trajectory: xtc trr cpt gro g96 pdb tng |
-s [<.tpr/.gro/…>] | topol.tpr | Structure+mass(db): tpr gro g96 pdb brk ent |
-n [<.ndx>] | index.ndx | Index file |
-feat [<.xvg>] | feature.xvg | xvgr/xmgr file |
Option | Default | File type |
---|---|---|
-clid [<.xvg>] | clid.xvg | xvgr/xmgr file (Can be used as both input and output) |
-g [<.log>] | cluster.log | Log file |
-fout [<.xtc/.trr/…>] | trajout.xtc | Trajectory: xtc trr cpt gro g96 pdb tng |
-cpdb [<.pdb>] | central.pdb | Protein data bank file |
-rmsd [<.xvg>] | rmsd.xvg | xvgr/xmgr file |
Option | Default | Description |
---|---|---|
-b <real> | 0 | First frame (ps) to read from trajectory |
-e <real> | 0 | Last frame (ps) to read from trajectory |
-dt <real> | 0 | Only use frame when t MOD dt = first time (ps) |
-xvg <keyword> | xmgrace | xvg plot formatting: xmgrace, xmgr, none |
-method <keyword> | kmeans | Clustering methods. Accepted methods are: kmeans, dbscan, gmixture |
-nfeature <int> | 10 | Number of features to use for clustering |
-cmetric <keyword> | prior | Cluster metrics: Method to determine cluster number. Accepted methods are: prior, rmsd, ssr-sst, pFS, DBI |
-ncluster <int> | 5 | Number of clusters to generate for the prior method. Maximum number of clusters for the rmsd method. |
-crmsthres <real> | 0.1 | RMSD (nm) threshold between central structures for RMSD cluster metric method. |
-ssrchange <real> | 2 | Threshold for the relative change (%) in SSR/SST ratio for the ssr-sst cluster metric method. |
-sil_ssize <real> | 20 | Percentage of number of frames to be considered as sample size for silhouette score calculation. |
-db_eps <real> | 0.5 | The maximum distance between two samples for them to be considered as in the same neighborhood. |
-db_min_samples <int> | 20 | The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. |
-nminfr <int> | 20 | Number of minimum frames in a cluster to output it as trajectory |
-[no]fit | Enable | Enable fitting and superimposition of the atoms groups different from RMSD/clustering group before RMSD calculation. |
-[no]fit2central | Disable | Enable/Disable trajectory superimposition or fitting to central structure in the output trajectory |
-outframe <int> | -1 | Number of maximum frames in the output trajectories. |
-sort <keyword> | none | Sort trajectory according to these values. Accepted methods are: none, rmsd, rmsdist, features, user |
-plot <string> | pca_cluster.png | To plot features with clusters in this file. |
-fsize <int> | 14 | Font size in plot. |
-pltw <real> | 12 | Width (inch) of the plot. |
-plth <real> | 20 | Height (inch) of the plot. |
Options to specify input files¶
-f traj.xtc
¶
Input trajectory file in xtc, trr, cpt, gro, g96, pdb or tng format.
Note
If this file is not provided, only clustering will be performed. Operations that require a trajectory, such as RMSD calculation, central structure calculation and writing of clustered trajectories, will be skipped.
Note
In case of XTC and TNG formats, writing central structures and clustered trajectories is relatively fast.
-s topol.tpr
¶
An input structure file in tpr, gro, g96 or pdb format. It is required if a trajectory is given as input.
-n index.ndx
¶
If given, index groups from this file will be prompted for selection. Otherwise, default index groups will be prompted for selection.
This file is ignored when no trajectory file is provided.
- Users will be prompted for three index groups:
- Choose a group for the output: Select an index group to output as central structure and clustered trajectory. It can be the whole system or any part of the system.
- Choose a group for clustering/RMSD calculation: The atom group for which clustering is performed and RMSD is calculated.
Note
If you are doing PCA-based clustering, this should be the same as the second index group selected in gmx covar and gmx anaeig.
- Choose a group for fitting or superposition: The atom group used for fitting or superposition before RMSD calculation.
Note
This group is prompted for only when the -fit or -fit2central option is given. Otherwise, the group selected above is used for fitting.
Note
If you are doing PCA-based clustering, this should be the same as the first index group selected in gmx covar and gmx anaeig.
-feat features.xvg
¶
It accepts a file containing features of the trajectory as a function of time.
Its format is similar to the projections file generated by gmx anaeig.
Therefore, in case of PCA data, the output (-proj) of gmx anaeig can be directly used as input for gmx_clusterByFeatures.
In this file, each feature is stored as a two-column block: the first column is time and the second column is the feature value. Consecutive blocks are separated by “&”.
The format is as follows:
# FEATURE - 1
# Time values
0.0 123.12
10.0 123.12
20.0 123.12
.
.
.
&
# FEATURE - 2
0.0 123.12
10.0 123.12
20.0 123.12
.
.
.
&
# FEATURE - 3
0.0 123.12
10.0 123.12
20.0 123.12
.
.
.
&
Note
If this file is not provided, the -clid [<.xvg>] option is required.
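As an illustration of the layout described above, the following minimal Python sketch formats several feature series into "&"-separated blocks. The function name `format_features`, the number formatting and the sample values are assumptions made for this example; they are not part of gmx_clusterByFeatures.

```python
def format_features(times, features):
    """Return text in the '&'-separated time/value layout.

    times    : list of time values, one per frame
    features : list of feature series, each as long as `times`
    """
    blocks = []
    for idx, values in enumerate(features, start=1):
        if len(values) != len(times):
            raise ValueError(
                f"feature {idx} has {len(values)} values, expected {len(times)}"
            )
        lines = [f"# FEATURE - {idx}"]
        # One "time value" pair per line
        lines += [f"{t:.3f} {v:.6f}" for t, v in zip(times, values)]
        # Each block is closed by a line containing only '&'
        lines.append("&")
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


# Made-up data: three frames, two features
text = format_features(
    times=[0.0, 10.0, 20.0],
    features=[[1.25, 1.5, 1.75], [-0.5, 0.0, 0.5]],
)
print(text)
```

The resulting text could be written to a file and passed with -feat. Features that are not PCA projections, such as distances or angles, would follow the same layout.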
Options to specify output files¶
-clid clid.xvg
¶
It can be both an input and an output file. It contains two columns: the first column is time and the second column is the cluster label/id.
By default, when clustering is performed, it is written after clustering has finished and contains the cluster id of each frame.
However, it can also be given as input to obtain clustered trajectories. For example, if clustering was performed with “gmx cluster”, the resulting -clid [<.xvg>] file can be used here to extract clustered trajectories.
Note
To treat this as an input file, do not use the -feat [<.xvg>] option.
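To inspect a cluster-id file outside the tool, the two columns can be read with a few lines of Python. This is a minimal sketch: it assumes the usual xvg conventions (lines starting with "#" or "@" are comments or plot directives), and the sample content and the name `read_clid` are invented for illustration.

```python
from collections import Counter


def read_clid(lines):
    """Parse (time, cluster_id) pairs from the lines of a cluster-id file."""
    pairs = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments (#), xmgrace directives (@) and separators (&)
        if not line or line[0] in "#@&":
            continue
        fields = line.split()
        pairs.append((float(fields[0]), int(float(fields[1]))))
    return pairs


# Hypothetical file content, used here instead of reading from disk
sample = """# hypothetical clid.xvg content
@ title "Cluster-ID"
0.0 1
10.0 1
20.0 2
30.0 1
""".splitlines()

pairs = read_clid(sample)
frames_per_cluster = Counter(cid for _, cid in pairs)
print(frames_per_cluster)
```

Counting frames per cluster in this way gives a quick check of cluster populations before extracting trajectories.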
-g cluster.log
¶
It is the output log file and contains information about the clustering method and the obtained results.
-fout trajout.xtc
¶
Output clustered trajectories. A separate trajectory is written for each cluster; these trajectories can be used for further analysis.
Each trajectory file name is suffixed by its respective cluster-id.
-cpdb central.pdb
¶
Output separate pdb files for central structures of each cluster.
Each pdb file name is suffixed by its respective cluster-id.
-rmsd rmsd.xvg
¶
RMSD of clustering atom groups with respect to central structure.
Each RMSD file name is suffixed by its respective cluster-id.
With the -sort rmsdist option, the distance-matrix RMSD is calculated.
Other options¶
-xvg xmgrace
¶
It directs the formatting of all output <.xvg> files. By default, <.xvg> files are in xmgrace format, which can be plotted using Grace (xmgrace command).
To plot with any other program, use -xvg none to obtain a plain text file.
- Three keywords are accepted:
- xmgrace
- xmgr
- none
-method kmeans
¶
Method to use for clustering. All methods are taken from the Python scikit-learn library.
An overview of the clustering methods is available in the scikit-learn documentation.
- Presently, the following methods are implemented:
- -method kmeans: K-means clustering. It needs the number of clusters as input (-ncluster <int>). Therefore, one should know beforehand how many clusters are present in the data. To determine the number of clusters automatically, see -cmetric. For more details about the k-means method, see the scikit-learn documentation.
- -method dbscan: DBSCAN (density-based spatial clustering of applications with noise). It does not require the number of clusters beforehand. The clusters are controlled by two other input options: -db_eps and -db_min_samples. For more details about the DBSCAN method, see the scikit-learn documentation.
- -method gmixture: Gaussian mixture model clustering. It also needs the number of clusters as input (-ncluster <int>). Therefore, one should know beforehand how many clusters are present in the data. To determine the number of clusters automatically, see -cmetric. For more details about the Gaussian mixture model, see the scikit-learn documentation.
-nfeature 10
¶
Number of features to be read from the -feat file.
If the file contains fewer features than requested, all features will be read.
-cmetric prior
¶
Cluster metric used to determine the total number of clusters automatically, particularly for k-means and the Gaussian mixture model.
Note
All the cluster metrics are only applicable when -method kmeans or -method gmixture is used.
- Presently, the following cluster metrics are implemented:
- -cmetric prior: If the number of clusters is known beforehand, use this with -ncluster <int>. Here, -ncluster takes the number of clusters as input.
- -cmetric rmsd: Root mean square deviation between central structures of clusters. It uses the -crmsthres option as the RMSD threshold/cutoff.
Note
It requires a trajectory file as input. Otherwise, -cmetric ssr-sst will be used as cluster metric with the default -ssrchange value.
- -cmetric ssr-sst: SSR/SST ratio, used for the Elbow method. The -ssrchange option sets the threshold for the relative change in the SSR/SST ratio, in percent.
- -cmetric silhouette: Silhouette score. From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters." The first encountered number of clusters with the highest silhouette score is taken as the final number of clusters.
To calculate the score, either the entire data set is used with -sil_ssize -1, which can be time-consuming, or a percentage of the data is taken by random sampling with the -sil_ssize option. Because of the random sampling, the score might not be reproduced exactly in successive calculations.
- -cmetric DBI: Davies–Bouldin index. The number of clusters with the lowest value is selected.
-ncluster 5
¶
It takes the number of clusters. Its usage depends on -cmetric.
Note
It is only applicable when -method kmeans or -method gmixture is used.
- Conditions:
- For -cmetric prior, it is the number of clusters to be generated.
- For -cmetric rmsd, it is the largest number of clusters to be generated. The number of clusters is then reduced iteratively until no RMSD between central structures is below the RMSD threshold (-crmsthres <real>).
- For -cmetric ssr-sst, -cmetric pFS and -cmetric DBI, it is the maximum number of clusters to be generated. At first, two clusters are generated, and the number of clusters is increased by one in each iteration. When the maximum number of clusters is reached, these three cluster metrics are calculated and the final number of clusters is selected.
-crmsthres 0.1
¶
RMSD (nm) threshold between central structures for the RMSD cluster metric method.
It is used with -cmetric rmsd. In each iteration, the RMSD between all pairs of central structures is calculated. If any RMSD value is within the input RMSD (nm) threshold, the number of clusters is decreased by one in the next iteration.
It is assumed that when the RMSD between two central structures is within the threshold, the structures are similar enough for the two clusters to be merged into a single cluster. However, these two clusters will not necessarily merge in the next iteration.
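The iteration described above can be sketched in Python as follows. This is only an illustration of the control flow, not the tool's implementation: `central_rmsd` is a hypothetical callback that clusters the data into k groups and returns the RMSD (nm) for every pair of central structures, and the toy callback below simply makes central structures closer as k grows.

```python
from itertools import combinations


def choose_ncluster_by_rmsd(max_clusters, threshold, central_rmsd):
    """Lower the cluster count until all central structures differ by more than `threshold`."""
    for k in range(max_clusters, 1, -1):
        pair_rmsd = central_rmsd(k)  # {(i, j): rmsd_nm} for k clusters
        # Accept k only if no pair of central structures is within the threshold
        if all(r > threshold for r in pair_rmsd.values()):
            return k
    return 1  # every tested k had at least one pair within the threshold


def toy_central_rmsd(k):
    """Hypothetical stand-in: neighbouring central structures are 0.45/k nm apart."""
    spacing = 0.45 / k
    return {(i, j): abs(i - j) * spacing for i, j in combinations(range(k), 2)}


best = choose_ncluster_by_rmsd(
    max_clusters=8, threshold=0.1, central_rmsd=toy_central_rmsd
)
print(best)  # 4 for this toy callback
```

Because the data are re-clustered for each k, the pair that triggered the reduction is not guaranteed to end up in the same cluster at k-1, which matches the caveat above.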
-ssrchange 2.0
¶
Threshold for the relative change (%) in the SSR/SST ratio, used to choose the number of clusters automatically. This threshold gives the potential position of the elbow in the Elbow method.
Note
This option is only used when -cmetric ssr-sst is provided as input.
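One plausible reading of this threshold is sketched below in Python. The exact formula used by the tool is not stated here, so the definition of relative change (the gain in SSR/SST from k to k+1, relative to the value at k) and the function name `pick_elbow` are assumptions; the ratios are made-up numbers.

```python
def pick_elbow(ratios, threshold_pct):
    """Return the smallest k for which adding one more cluster raises
    SSR/SST by less than `threshold_pct` percent.

    ratios : dict {k: SSR/SST ratio} for consecutive cluster counts
    """
    ks = sorted(ratios)
    for prev, nxt in zip(ks, ks[1:]):
        change = (ratios[nxt] - ratios[prev]) / ratios[prev] * 100.0
        if change < threshold_pct:
            return prev
    return ks[-1]  # no elbow found below the threshold


# Made-up SSR/SST ratios for k = 2..6
ratios = {2: 0.50, 3: 0.70, 4: 0.80, 5: 0.81, 6: 0.815}
elbow = pick_elbow(ratios, threshold_pct=2.0)
print(elbow)  # 4: going from 4 to 5 clusters gains only 1.25 %
```

A larger threshold places the elbow earlier and yields fewer clusters; a smaller one yields more.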
-sil_ssize 20
¶
Percentage of the number of frames to be used as sample size for silhouette score calculation. If its value is -1, no sampling is done and the entire data set is used.
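The effect of this option on the number of frames entering the score can be sketched in Python. The rounding rule, the use of Python's `random` module and the function name `silhouette_sample` are assumptions for illustration; the tool's own sampling may differ.

```python
import random


def silhouette_sample(n_frames, ssize_pct, seed=None):
    """Return the frame indices that would enter the silhouette calculation."""
    if ssize_pct == -1:
        return list(range(n_frames))  # no sampling: use every frame
    n_sample = max(1, round(n_frames * ssize_pct / 100.0))
    rng = random.Random(seed)  # unseeded runs draw different subsets
    return sorted(rng.sample(range(n_frames), n_sample))


subset = silhouette_sample(n_frames=1000, ssize_pct=20, seed=1)
print(len(subset))  # 200 frames out of 1000
```

Different random subsets generally give slightly different scores, which is why the silhouette value may vary between runs.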
-db_eps 0.5
¶
The maximum distance between two samples for them to be considered as in the same neighborhood.
-db_min_samples 20
¶
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
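How -db_eps and -db_min_samples interact in the core-point definition can be shown with a small pure-Python sketch. It uses Euclidean distance and made-up 2D points, and it only identifies core points; it is not a full DBSCAN implementation and not the code the tool runs.

```python
import math


def core_points(points, eps, min_samples):
    """Return indices of points that have at least `min_samples`
    neighbours within `eps`, counting the point itself."""
    cores = []
    for i, p in enumerate(points):
        n_neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if n_neighbors >= min_samples:
            cores.append(i)
    return cores


# Three close points and one isolated point (made-up feature values)
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
cores = core_points(points, eps=0.5, min_samples=3)
print(cores)  # [0, 1, 2]
```

Raising -db_min_samples or lowering -db_eps makes it harder for a point to qualify as a core point, so more frames are treated as noise.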
-nminfr 20
¶
Minimum number of frames in a cluster for it to be written as a trajectory. If a cluster has fewer frames than this number, it is ignored.
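The filtering rule can be expressed in a few lines of Python. The per-frame cluster ids below are made up, and `clusters_to_write` is a hypothetical helper name used only for this sketch.

```python
from collections import Counter


def clusters_to_write(cluster_ids, nminfr):
    """Return the ids of clusters that contain at least `nminfr` frames."""
    counts = Counter(cluster_ids)
    return sorted(cid for cid, n in counts.items() if n >= nminfr)


# Made-up per-frame cluster ids: 30, 20 and 5 frames
cluster_ids = [1] * 30 + [2] * 20 + [3] * 5
kept = clusters_to_write(cluster_ids, nminfr=20)
print(kept)  # [1, 2]; cluster 3 is too small
```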
-fit/-nofit
¶
Enable fitting and superimposition using an atom group different from the RMSD/clustering group before RMSD calculation. If enabled, the index group for fitting will be prompted for. Otherwise, fitting is performed with the RMSD/clustering group.
-fit2central/-nofit2central
¶
Enable/disable superimposition (fitting) of the output trajectory onto the central structure. The atom group used for fitting depends on the -[no]fit option. With -nofit, the second input index group (RMSD/clustering group) is used for fitting; otherwise, the third index group is used.
-outframe -1
¶
Maximum number of frames in the output trajectories. It can be helpful for obtaining an output trajectory that contains only the structures around the central structure.
-sort none
¶
Sort trajectory according to these values.
- Accepted methods are:
- -sort none: Output trajectory will not be sorted.
- -sort rmsd: Sort trajectory according to RMSD with respect to the central structure. The first frame of the resulting trajectory is therefore the central structure, and RMSD increases gradually after the first frame.
- -sort rmsdist: Sort trajectory according to distance-matrix RMSD with respect to the central structure. The first frame of the resulting trajectory is therefore the central structure, and distance-matrix RMSD increases gradually after the first frame.
- -sort features: Sort trajectory according to the features sub-space. The distance of each conformation to its respective central structure is calculated in feature space, and the trajectory is written from lowest to highest distance. In this trajectory, the first frame is the central structure. This option is very useful when the features are something other than projections on PCA eigenvectors.
- -sort user: Sort trajectory using values supplied by the user. Not yet implemented.
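The ordering produced by feature-space sorting can be illustrated with a short Python sketch. Euclidean distance is assumed here, and the frame indices and feature vectors are made up; the tool's actual distance measure is not specified in this section.

```python
import math


def sort_by_feature_distance(frames, central):
    """Order frame indices by Euclidean distance to the central structure.

    frames  : dict {frame_index: feature_vector} for one cluster
    central : feature vector of that cluster's central structure
    """
    return sorted(frames, key=lambda idx: math.dist(frames[idx], central))


# Made-up two-feature vectors for four frames of one cluster
frames = {0: (1.0, 1.0), 1: (0.0, 0.0), 2: (0.2, 0.1), 3: (3.0, -2.0)}
central = frames[1]  # frame 1 plays the role of the central structure
order = sort_by_feature_distance(frames, central)
print(order)      # [1, 2, 0, 3]
print(order[:2])  # keeping only the first two frames of the sorted cluster
```

Truncating the sorted list, as in the last line, is the kind of selection that a frame limit such as -outframe makes on a sorted trajectory.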
-plot pca_cluster.png
¶
Plot features against features, with clusters, in this file.
A plot is generated in which features are plotted against each other, with clusters shown in different colors. It is helpful for checking whether the number of clusters is sufficient.
See also
Similar types of plots can be obtained using featuresplot sub-command.