cluster

It is the main tool for clustering. It takes at least three input files and performs clustering according to the given options. It also generates a log file containing information about the clustering.

  • gmx_clusterByFeatures cluster can be used with trajectory and tpr files generated by GROMACS.
  • For other GROMACS versions or other programs such as NAMD and AMBER, a PDB file can be used in place of the tpr file.
  • Trajectories from NAMD and AMBER should be converted to GROMACS-compatible formats such as trr, xtc or pdb.

Execute the following command to get full help:

gmx_clusterByFeatures cluster -h

Warning

Only PBC-corrected trajectory and tpr files should be used as inputs. A PBC-corrected PDB/GRO file can be used in place of the tpr file.

Command summary

gmx_clusterByFeatures cluster [-f [<.xtc/.trr/...>]] [-s [<.tpr/.gro/...>]]
            [-feat [<.xvg>]] [-n [<.ndx>]] [-clid [<.xvg>]] [-g [<.log>]]
            [-fout [<.xtc/.trr/...>]] [-cpdb [<.pdb>]] [-rmsd [<.xvg>]]
            [-b <time>] [-e <time>] [-dt <time>] [-tu <enum>] [-xvg <enum>]
            [-method <enum>] [-nfeature <int>] [-cmetric <enum>]
            [-ncluster <int>] [-crmsthres <real>] [-ssrchange <real>]
            [-db_eps <real>] [-db_min_samples <int>] [-sil_ssize <real>]
            [-nminfr <int>] [-[no]fit] [-[no]fit2central] [-outframe <int>]
            [-sort <enum>] [-plot <string>] [-fsize <int>] [-pltw <real>]
            [-plth <real>]
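
As a rough illustration of how the options above combine, the following sketch assembles one possible invocation from Python. It uses only flags listed in the command summary; the file names are the defaults from the options table, and the helper name `build_cluster_command` is hypothetical.

```python
import shlex


def build_cluster_command(traj="traj.xtc", structure="topol.tpr",
                          features="feature.xvg", method="kmeans",
                          cmetric="prior", ncluster=5):
    """Assemble an argument list for `gmx_clusterByFeatures cluster`.

    Hypothetical helper: it only builds the list, it does not run anything.
    """
    return [
        "gmx_clusterByFeatures", "cluster",
        "-f", traj,            # input trajectory
        "-s", structure,       # input structure (tpr/gro/pdb)
        "-feat", features,     # features as a function of time
        "-method", method,     # kmeans, dbscan or gmixture
        "-cmetric", cmetric,   # how the number of clusters is chosen
        "-ncluster", str(ncluster),
    ]


cmd = build_cluster_command()
print(shlex.join(cmd))
```

Running the command additionally requires the tool to be installed, and the index groups are requested interactively, so a scripted run would have to supply those selections on standard input.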

Options summary

Options to specify input files to cluster
Option Default File type
-f [<.xtc/.trr/…>] traj.xtc Trajectory: xtc trr cpt gro g96 pdb tng
-s [<.tpr/.gro/…>] topol.tpr Structure+mass(db): tpr gro g96 pdb brk ent
-n [<.ndx>] index.ndx Index file
-feat [<.xvg>] feature.xvg xvgr/xmgr file
Options to specify output files to cluster
Option Default File type
-clid [<.xvg>] clid.xvg xvgr/xmgr file (Can be used as both input and output)
-g [<.log>] cluster.log Log file
-fout [<.xtc/.trr/…>] trajout.xtc Trajectory: xtc trr cpt gro g96 pdb tng
-cpdb [<.pdb>] central.pdb Protein data bank file
-rmsd [<.xvg>] rmsd.xvg xvgr/xmgr file
Other options to cluster
Option Default Description
-b <real> 0 First frame (ps) to read from trajectory
-e <real> 0 Last frame (ps) to read from trajectory
-dt <real> 0 Only use frame when t MOD dt = first time (ps)
-xvg <keyword> xmgrace xvg plot formatting: xmgrace, xmgr, none
-method <keyword> kmeans Clustering methods. Accepted methods are: kmeans, dbscan, gmixture
-nfeature <int> 10 Number of features to use for clustering
-cmetric <keyword> prior Cluster metrics: Method to determine cluster number. Accepted methods are: prior, rmsd, ssr-sst, pFS, DBI
-ncluster <int> 5 Number of clusters to generate for the prior method. Maximum number of clusters for the rmsd method.
-crmsthres <real> 0.1 RMSD (nm) threshold between central structures for RMSD cluster metric method.
-ssrchange <real> 2 Threshold for the relative change (%) in the SSR/SST ratio for the ssr-sst cluster metric method.
-sil_ssize <real> 20 Percentage of number of frames to be considered as sample size for silhouette score calculation.
-db_eps <real> 0.5 The maximum distance between two samples for them to be considered as in the same neighborhood.
-db_min_samples <int> 20 The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
-nminfr <int> 20 Number of minimum frames in a cluster to output it as trajectory
-[no]fit Enable Enable fitting and superimposition of the atoms groups different from RMSD/clustering group before RMSD calculation.
-[no]fit2central Disable Enable/Disable trajectory superimposition or fitting to central structure in the output trajectory
-outframe <int> -1 Number of maximum frames in the output trajectories.
-sort <keyword> none Sort trajectory according to these values. Accepted methods are: none, rmsd, rmsdist, features, user
-plot <string> pca_cluster.png To plot features with clusters in this file.
-fsize <int> 14 Font size in plot.
-pltw <real> 12 Width (inch) of the plot.
-plth <real> 20 Height (inch) of the plot.

Options to specify input files

-f traj.xtc

Input trajectory file of xtc trr cpt gro g96 pdb or tng format.

Note

If this file is not provided, only clustering will be performed. Operations that require a trajectory, such as RMSD calculation, central structure calculation and writing clustered trajectories, will be skipped.

Note

In case of XTC and TNG formats, writing central structures and clustered trajectories is relatively fast.


-s topol.tpr

An input structure file of tpr, gro, g96 or pdb format. It is required if a trajectory is given as input.


-n index.ndx

If given, index groups from this file will be prompted for selection. Otherwise, default index groups will be prompted for selection.

This file is ignored when no trajectory file is provided.

Users will be prompted for three index groups:
  • Choose a group for the output: Select an index group to output as central structure and clustered trajectory. It can be the whole system or any part of the system.

  • Choose a group for clustering/RMSD calculation: The atom group for which clustering is done and RMSD is calculated.

    Note

    If you are doing PCA-based clustering, it should be the same as the second index group selected in gmx covar and gmx anaeig.

  • Choose a group for fitting or superposition: The atom group used for fitting or superposition before RMSD calculation.

    Note

    This group is only prompted for when the -fit or -fit2central option is given. Otherwise, the group selected above will be used for fitting.

    Note

    If you are doing PCA-based clustering, it should be the same as the first index group selected in gmx covar and gmx anaeig.


-feat feature.xvg

It accepts a file containing features of the trajectory as a function of time. Its format is similar to the projections file generated by gmx anaeig. Therefore, in case of PCA data, the output (-proj) of gmx anaeig can be used directly as input for gmx_clusterByFeatures.

Each feature is written as a block of two columns: the first column is time and the second column is the feature value. Successive feature blocks should be separated by a line containing “&”.

The format is as following:

# FEATURE - 1
# Time    values
0.0     123.12
10.0    123.12
20.0    123.12
.
.
.

&

# FEATURE - 2
0.0     123.12
10.0    123.12
20.0    123.12
.
.
.

&

# FEATURE - 3
0.0     123.12
10.0    123.12
20.0    123.12
.
.
.

&
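
For features that do not come from gmx anaeig, a file in the layout above can be written with a few lines of Python. This is a minimal sketch: the helper name `format_feature_file`, the column spacing and the numeric precision are assumptions, and the values are made up.

```python
def format_feature_file(times, features):
    """Return text with one time/value block per feature, each ended by '&'."""
    lines = []
    for number, values in enumerate(features, start=1):
        if len(values) != len(times):
            raise ValueError("each feature needs one value per time point")
        lines.append(f"# FEATURE - {number}")
        lines.extend(f"{t:.1f}    {v:.5f}" for t, v in zip(times, values))
        lines.append("&")  # block separator between features
    return "\n".join(lines) + "\n"


# Made-up example: two features sampled at three time points
times = [0.0, 10.0, 20.0]
features = [[1.5, 1.7, 1.6], [0.2, 0.1, 0.3]]
text = format_feature_file(times, features)
print(text)
```

The returned text can then be saved to a file and passed to -feat.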

Note

If this file is not provided, the -clid [<.xvg>] option is required.


Options to specify output files

-clid clid.xvg

It can be both an input and an output file. It contains two columns: the first column is time and the second column is the cluster label/id.

By default, when clustering is performed, this file is written after clustering has finished and contains the cluster id of each frame.

However, it can also be given as input to obtain clustered trajectories. For example, if clustering was performed with “gmx cluster”, the resulting -clid [<.xvg>] file can be used here to extract clustered trajectories.

Note

To treat this as an input file, do not use -feat [<.xvg>] option.
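
The two-column layout makes this file easy to inspect outside the tool. The sketch below parses such content and counts the frames per cluster. It assumes that lines starting with `#` or `@` are comments or plot directives, as is usual for xvg files, and the example content is made up.

```python
from collections import Counter


def read_cluster_ids(text):
    """Parse (time, cluster-id) pairs, skipping comment and directive lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@&":
            continue
        time, label = line.split()[:2]
        pairs.append((float(time), int(float(label))))
    return pairs


# Made-up content in the two-column time / cluster-id layout
example = """# hypothetical clid.xvg content
@ title "Cluster ID"
0.0   1
10.0  1
20.0  2
30.0  1
"""

pairs = read_cluster_ids(example)
frames_per_cluster = Counter(label for _, label in pairs)
print(frames_per_cluster)
```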


-g cluster.log

It is the output log file and contains information about the clustering methods and the obtained results.


-fout trajout.xtc

Output clustered trajectories. A separate trajectory is written for each cluster. These separate trajectories can be used for further analysis.

Each trajectory file name is suffixed by its respective cluster-id.


-cpdb central.pdb

Output separate pdb files for central structures of each cluster.

Each pdb file name is suffixed by its respective cluster-id.


-rmsd rmsd.xvg

RMSD of the clustering atom group with respect to the central structure.

Each RMSD file name is suffixed by its respective cluster-id.

With the -sort rmsdist option, the distance-matrix RMSD is calculated instead.


Other options

-xvg  xmgrace

It controls the formatting of all output <.xvg> files. By default, <.xvg> files are in xmgrace format, which can be plotted using Grace (xmgrace command).

To plot with any other program, use -xvg none to obtain a plain text file.

Three keywords are accepted:
  • xmgrace
  • xmgr
  • none

-method kmeans

Method to use for clustering. All the methods are taken from the Python scikit-learn library.

An overview of the clustering methods is presented here.

Presently, the following methods are implemented:
  1. -method kmeans

    K-means clustering - It needs the number of clusters as input (-ncluster <int>). Therefore, one should know beforehand how many clusters are in the data. To determine the number of clusters automatically, see -cmetric. For more details about the k-means method, see here.

  2. -method dbscan

    DBSCAN - Density-based spatial clustering of applications with noise - It does not require cluster number beforehand. The clusters are controlled by two other input options: -db_eps and -db_min_samples. For more details about DBSCAN method, see here.

  3. -method gmixture

    Gaussian mixture model clustering - It also needs the number of clusters as input (-ncluster <int>). Therefore, one should know beforehand how many clusters are in the data. To determine the number of clusters automatically, see -cmetric. For more details about the Gaussian mixture model method, see here.


-nfeature 10

Number of features to be read from the -feat file.

If the file contains fewer features than requested, all features will be read.


-cmetric prior

Cluster metric used to determine the total number of clusters automatically, particularly for k-means and the Gaussian mixture model.

Note

All the cluster metrics are only applicable when -method kmeans or -method gmixture is used.

Presently, the following cluster metrics are implemented:
  1. -cmetric prior

    If the number of clusters is known beforehand, use this with -ncluster <int>. Here, -ncluster takes the number of clusters as input.

  2. -cmetric rmsd

    Root mean square deviation between the central structures of clusters. It uses the -crmsthres option as RMSD threshold/cutoff.

    Note

    It requires a trajectory file as input. Otherwise, -cmetric ssr-sst will be used as cluster metric with the default -ssrchange value.

  3. -cmetric ssr-sst

    It uses the SSR/SST ratio, as in the Elbow method. The number of clusters is chosen from the relative change (in percent) in the SSR/SST ratio, using the threshold given by -ssrchange.

  4. -cmetric silhouette

    Silhouette score. From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters." The first cluster count encountered with the highest silhouette score is taken as the final number of clusters.

    The score is calculated either on the entire data set (option -sil_ssize -1), which can be time-consuming, or on a randomly sampled percentage of the data given by -sil_ssize. Because of the random sampling, the score might not be reproduced exactly in successive calculations.

  5. -cmetric DBI

    DBI: Davies–Bouldin index. The number of clusters with the lowest index value is selected.
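
To make the silhouette metric more concrete, the following sketch computes the mean silhouette score of a labelled data set with the standard library only. It follows the textbook definition with Euclidean distance and is not the scikit-learn routine used by the tool; the function name and the data are made up.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette score with Euclidean distance; needs >= 2 clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up 2D features: two well separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # matches the separation
bad = mean_silhouette(points, [0, 1, 0, 1])   # mixes the two pairs
print(good, bad)
```

A labelling that matches the structure of the data scores close to +1, while one that mixes the groups scores below zero.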


-ncluster 5

It takes the number of clusters. How it is used depends on -cmetric.

Note

It is only applicable when -method kmeans or -method gmixture is used.

Conditions:
  1. For -cmetric prior, it is the number of clusters to be generated.
  2. For -cmetric rmsd, it is the largest number of clusters to be generated. The number of clusters is then reduced iteratively until no RMSD between central structures is below the RMSD threshold (-crmsthres <real>).
  3. For -cmetric ssr-sst, -cmetric pFS and -cmetric DBI, it is the maximum number of clusters to be generated. At first, two clusters are generated, and the number of clusters is then increased by one in each iteration. When the maximum number of clusters is reached, these three cluster metrics are calculated and the final number of clusters is selected.

-crmsthres 0.1

RMSD (nm) threshold between central structures for RMSD cluster metric method.

It is used with -cmetric rmsd. In each iteration, the RMSD between all central structures is calculated. If any RMSD value is within the input RMSD (nm) threshold, the number of clusters is decreased by one in the next iteration.

It is assumed that when the RMSD between two central structures is within the threshold, the central structures are similar enough for the two clusters to be merged into a single cluster. However, these two particular clusters will not necessarily merge in the next iteration.
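
The iteration described above can be sketched as follows. This shows the control flow only: `central_rmsds` is a hypothetical callable standing in for "cluster with k clusters and return the pairwise RMSD values (nm) between the central structures", and the numbers are made up.

```python
def choose_ncluster_by_rmsd(max_clusters, threshold, central_rmsds):
    """Reduce the cluster count until every central-structure RMSD exceeds threshold."""
    k = max_clusters
    while k > 1:
        # Accept k only if no pair of central structures is within the threshold
        if all(value > threshold for value in central_rmsds(k)):
            return k
        k -= 1  # otherwise retry with one cluster fewer
    return k


# Made-up RMSD values (nm) between central structures for each cluster count
fake_rmsds = {5: [0.05, 0.30, 0.40], 4: [0.08, 0.50], 3: [0.25, 0.30, 0.45]}
chosen = choose_ncluster_by_rmsd(5, 0.1, lambda k: fake_rmsds[k])
print(chosen)
```

In this toy run, five and four clusters are rejected because one pair of central structures lies within 0.1 nm, and three clusters are accepted.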


-ssrchange 2.0

Threshold for the relative percentage change in the SSR/SST ratio, used to choose the number of clusters automatically. This threshold gives the potential position of the elbow in the Elbow method.

Note

This option is only used when -cmetric ssr-sst is provided as input.
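
The sketch below illustrates one way such a threshold can locate the elbow. It rests on two assumptions that the text above does not confirm: SSR is taken as the between-cluster sum of squares and SST as the total sum of squares, and the selected cluster count is the first one after which the ratio grows by less than the threshold. All names and numbers are made up.

```python
def ssr_sst_ratio(points, labels):
    """Between-cluster sum of squares divided by the total sum of squares."""
    dim = len(points[0])

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    overall = centroid(points)
    sst = sum(sq_dist(p, overall) for p in points)
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    ssr = sum(len(rows) * sq_dist(centroid(rows), overall)
              for rows in clusters.values())
    return ssr / sst


def pick_elbow(cluster_counts, ratios, threshold=2.0):
    """Return the first cluster count after which the ratio changes by < threshold %."""
    changes = [100.0 * (b - a) / a for a, b in zip(ratios, ratios[1:])]
    for k, change in zip(cluster_counts, changes):
        if change < threshold:
            return k
    return cluster_counts[-1]


ratio = ssr_sst_ratio([(0.0,), (1.0,), (10.0,), (11.0,)], [0, 0, 1, 1])
# Made-up SSR/SST ratios for 2 to 6 clusters
elbow = pick_elbow([2, 3, 4, 5, 6], [0.60, 0.80, 0.88, 0.89, 0.895])
print(ratio, elbow)
```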


-sil_ssize 20

Percentage of number of frames to be considered as sample size for silhouette score calculation. If its value is -1, sampling is not considered.


-db_eps 0.5

The maximum distance between two samples for them to be considered as in the same neighborhood.


-db_min_samples 20

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
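
The roles of -db_eps and -db_min_samples can be illustrated by the core-point test alone. The sketch below is not a full DBSCAN implementation: it uses Euclidean distance, ignores sample weights and only flags which points would qualify as core points. The function name and data are made up.

```python
import math


def core_points(points, eps=0.5, min_samples=20):
    """Return indices of points with at least min_samples neighbours within eps.

    Each point counts as a member of its own neighbourhood.
    """
    cores = []
    for index, p in enumerate(points):
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_samples:
            cores.append(index)
    return cores


# Made-up 2D features: three nearby points and one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(core_points(points, eps=0.5, min_samples=3))
```

A larger -db_eps or a smaller -db_min_samples makes it easier for a point to qualify, which generally leaves fewer frames classified as noise.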


-nminfr 20

Number of minimum frames in a cluster to output it as trajectory. If number of frames is less than this number, the cluster will be ignored.


-fit/-nofit

Enable fitting and superimposition on an atom group different from the RMSD/clustering group before RMSD calculation. If enabled, the index group for fitting will be prompted for. Otherwise, fitting will be performed with the RMSD/clustering group.


-fit2central/-nofit2central

Enable/Disable trajectory superimposition or fitting to the central structure in the output trajectory. The atom group used for fitting depends on the -[no]fit option. With -nofit, the second input index group (RMSD/clustering group) is used for fitting; otherwise, the third index group is used.


-outframe -1

Number of maximum frames in the output trajectories. It can be helpful to get output trajectory with only structures around the central structure.


-sort none

Sort trajectory according to these values.

Accepted methods are:
  • -sort none : Output trajectory will not be sorted

  • -sort rmsd

    Sort trajectory according to RMSD with respect to central structure. Therefore, obtained trajectory’s first frame will be central structure and RMSD will increase gradually after first frame.

  • -sort rmsdist

    Sort trajectory according to distance-matrix RMSD with respect to central structure. Therefore, obtained trajectory’s first frame will be central structure and distance-matrix RMSD will increase gradually after first frame.

  • -sort features

    Sort trajectory according to the feature sub-space. The distance of each conformation to its respective central structure is calculated in feature space, and the trajectory is written from lowest to highest distance. In this trajectory, the first frame will be the central structure.

    This option is very useful when the features are something other than PCA projections on eigenvectors.

  • -sort user

    Sort trajectory using values supplied by user. Not yet implemented.
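
The ordering produced by -sort features can be pictured with the sketch below, which sorts the frames of one cluster by their Euclidean distance to the central frame in feature space. How the tool identifies the central structure is not covered here, so the central frame index is simply passed in; the function name and data are made up.

```python
import math


def sort_frames_by_feature_distance(features, central_index):
    """Return frame indices ordered by distance to the central frame."""
    center = features[central_index]
    return sorted(range(len(features)),
                  key=lambda i: math.dist(features[i], center))


# Made-up 2D feature values for four frames of one cluster
features = [(1.0, 1.0), (0.0, 0.0), (0.2, 0.1), (3.0, 4.0)]
order = sort_frames_by_feature_distance(features, central_index=1)
print(order)
```

The central frame comes first because its distance to itself is zero. Keeping only the first few entries of such an ordering corresponds to the idea behind -outframe, which limits the output to structures around the central structure.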


-plot pca_cluster.png

Name of the file in which features are plotted against each other, with clusters shown.

A plot is generated in which features are plotted against each other, with different clusters shown in different colors. It is helpful in checking whether the number of clusters is sufficient.

See also

Similar types of plots can be obtained using featuresplot sub-command.