`cluster`¶

It is the main tool for clustering. It takes at least three input files and performs clustering according to the given options. It also generates a log file containing information about the clustering.

gmx_clusterByFeatures cluster can be used with trajectory and tpr files generated by GROMACS.
For other GROMACS versions or other programs such as NAMD and AMBER, a PDB file can be used in place of the tpr file.
Trajectories from NAMD and AMBER should be converted to GROMACS-compatible formats such as trr, xtc or pdb.

Run the following command to get the full help:

gmx_clusterByFeatures cluster -h

Warning

Only PBC-corrected trajectory and tpr files should be used as inputs. A PBC-corrected PDB/GRO file can be used in place of the tpr file.

Command summary¶

gmx_clusterByFeatures cluster [-f [<.xtc/.trr/...>]] [-s [<.tpr/.gro/...>]]
            [-feat [<.xvg>]] [-n [<.ndx>]] [-clid [<.xvg>]] [-g [<.log>]]
            [-fout [<.xtc/.trr/...>]] [-cpdb [<.pdb>]] [-rmsd [<.xvg>]]
            [-b <time>] [-e <time>] [-dt <time>] [-tu <enum>] [-xvg <enum>]
            [-method <enum>] [-nfeature <int>] [-cmetric <enum>]
            [-ncluster <int>] [-crmsthres <real>] [-ssrchange <real>]
            [-db_eps <real>] [-db_min_samples <int>] [-sil_ssize <real>]
            [-nminfr <int>] [-[no]fit] [-[no]fit2central] [-outframe <int>]
            [-sort <enum>] [-plot <string>] [-fsize <int>] [-pltw <real>]
            [-plth <real>]

Options summary¶

Options to specify input files to cluster¶
Option	Default	File type
-f [<.xtc/.trr/…>]	traj.xtc	Trajectory: xtc trr cpt gro g96 pdb tng
-s [<.tpr/.gro/…>]	topol.tpr	Structure+mass(db): tpr gro g96 pdb brk ent
-n [<.ndx>]	index.ndx	Index file
-feat [<.xvg>]	feature.xvg	xvgr/xmgr file

Options to specify output files to cluster¶
Option	Default	File type
-clid [<.xvg>]	clid.xvg	xvgr/xmgr file (Can be used as both input and output)
-g [<.log>]	cluster.log	Log file
-fout [<.xtc/.trr/…>]	trajout.xtc	Trajectory: xtc trr cpt gro g96 pdb tng
-cpdb [<.pdb>]	central.pdb	Protein data bank file
-rmsd [<.xvg>]	rmsd.xvg	xvgr/xmgr file

Other options to cluster¶
Option	Default	Description
-b <real>	0	First frame (ps) to read from trajectory
-e <real>	0	Last frame (ps) to read from trajectory
-dt <real>	0	Only use frame when t MOD dt = first time (ps)
-xvg <keyword>	xmgrace	xvg plot formatting: xmgrace, xmgr, none
-method <keyword>	kmeans	Clustering method. Accepted methods are: kmeans, dbscan, gmixture
-nfeature <int>	10	Number of features to use for clustering
-cmetric <keyword>	prior	Cluster metrics: Method to determine cluster number. Accepted methods are: prior, rmsd, ssr-sst, pFS, DBI
-ncluster <int>	5	Number of clusters to generate for the prior method. Maximum number of clusters for the other cluster metric methods.
-crmsthres <real>	0.1	RMSD (nm) threshold between central structures for RMSD cluster metric method.
-ssrchange <real>	2	Threshold for the relative change (%) in SSR/SST ratio for the ssr-sst cluster metric method.
-sil_ssize <real>	20	Percentage of number of frames to be considered as sample size for silhouette score calculation.
-db_eps <real>	0.5	The maximum distance between two samples for them to be considered as in the same neighborhood.
-db_min_samples <int>	20	The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
-nminfr <int>	20	Minimum number of frames a cluster must contain to be written as a trajectory
-[no]fit	Enable	Enable fitting and superimposition using an atom group different from the RMSD/clustering group before RMSD calculation.
-[no]fit2central	Disable	Enable/disable superimposition (fitting) of the output trajectory onto the central structure
-outframe <int>	-1	Maximum number of frames in the output trajectories.
-sort <keyword>	none	Sort trajectory according to these values. Accepted methods are: none, rmsd, rmsdist, features, user
-plot <string>	pca_cluster.png	File in which features are plotted with clusters.
-fsize <int>	14	Font size in plot.
-pltw <real>	12	Width (inch) of the plot.
-plth <real>	20	Height (inch) of the plot.

Options to specify input files¶

`-f traj.xtc`¶

Input trajectory file in xtc, trr, cpt, gro, g96, pdb or tng format.

Note

If this file is not provided, only clustering is performed. Operations that require a trajectory, such as RMSD calculation, central structure calculation and writing clustered trajectories, are skipped.

Note

With the XTC and TNG formats, writing central structures and clustered trajectories is relatively fast.

`-s topol.tpr`¶

An input structure file in tpr, gro, g96 or pdb format. It is required if a trajectory is given as input.

`-n index.ndx`¶

If given, the index groups from this file are offered for selection. Otherwise, the default index groups are offered.

This file is ignored when no trajectory file is provided.

Users are prompted for three index groups:

Choose a group for the output: Select an index group to write out as the central structure and clustered trajectory. It can be the whole system or any part of it.
Choose a group for clustering/RMSD calculation: The atom group on which clustering is done and for which RMSD is calculated.

Note

If you are doing PCA-based clustering, it should be the same as the second index group selected in gmx covar and gmx anaeig.
Choose a group for fitting or superposition: The atom group used for fitting or superposition before RMSD calculation.

Note

This group is only prompted for when the -fit or -fit2central option is given. Otherwise, the group selected above is used for fitting.

Note

If you are doing PCA-based clustering, it should be the same as the first index group selected in gmx covar and gmx anaeig.

`-feat features.xvg`¶

It accepts a file containing features of the trajectory as a function of time. Its format is similar to the projections file generated by gmx anaeig. Therefore, in the case of PCA data, the output (-proj) of gmx anaeig can be used directly as input for gmx_clusterByFeatures.

Each feature is given as a two-column block: the first column is time and the second column is the feature value. Successive feature blocks are separated by “&”.

The format is as follows:

# FEATURE - 1
# Time    values
0.0     123.12
10.0    123.12
20.0    123.12
.
.
.

&

# FEATURE - 2
0.0     123.12
10.0    123.12
20.0    123.12
.
.
.

&

# FEATURE - 3
0.0     123.12
10.0    123.12
20.0    123.12
.
.
.

&

Note

If this file is not provided, -clid [<.xvg>] is required.
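The block layout described above can be checked with a few lines of Python before running the tool. The following is only a sketch of a reader for the “&”-separated format; the function name `parse_features` and the sample values are made up for illustration, and the tool's own parser may behave differently (for example, in how it treats comment lines).

```python
def parse_features(text, nfeature=10):
    """Split an '&'-separated feature file into per-feature (time, value) lists."""
    features, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#@":
            continue  # skip blank lines, comments and xmgrace directives
        if line == "&":
            if current:
                features.append(current)
            current = []
            continue
        time, value = line.split()[:2]
        current.append((float(time), float(value)))
    if current:  # last block without a trailing '&'
        features.append(current)
    # Mimic -nfeature: keep at most the requested number of features.
    return features[:nfeature]


# Made-up two-feature sample in the documented layout.
sample = """# FEATURE - 1
0.0 1.5
10.0 2.5
&
# FEATURE - 2
0.0 -0.5
10.0 0.25
&
"""
features = parse_features(sample)
print(len(features), "features,", len(features[0]), "frames each")
```

Each entry of `features` is one feature as a list of (time, value) pairs, so a quick check that all features have the same number of frames is straightforward.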

Options to specify output files¶

`-clid clid.xvg`¶

It can be both an input and an output file. It contains two columns: the first column is time and the second column is the cluster label/id.

By default, when clustering is performed, it is written after clustering has finished and contains the cluster id of each frame.

However, it can also be given as input to obtain clustered trajectories. For example, if clustering was performed with “gmx cluster”, the obtained -clid [<.xvg>] file can be used here to extract the clustered trajectories.

Note

To treat this as an input file, do not use -feat [<.xvg>] option.
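As an illustration of the two-column layout, the sketch below reads cluster ids from such a file and counts the frames in each cluster. The helper name `read_clid` and the sample content are hypothetical; it assumes that lines starting with `#`, `@` or `&` are not data.

```python
from collections import Counter


def read_clid(text):
    """Return a list of (time, cluster_id) pairs from clid-style text."""
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#@&":
            continue  # skip blank lines, comments and xmgrace directives
        time, label = line.split()[:2]
        frames.append((float(time), int(float(label))))
    return frames


# Made-up sample: four frames assigned to two clusters.
sample = """# time cluster-id
0 1
10 1
20 2
30 1
"""
frames = read_clid(sample)
counts = Counter(label for _, label in frames)
for cluster_id, n in sorted(counts.items()):
    print(f"cluster {cluster_id}: {n} frames")
```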

`-g cluster.log`¶

It is the output log file and contains information about the clustering method and the obtained results.

`-fout trajout.xtc`¶

Output clustered trajectories. A separate trajectory is written for each cluster; these trajectories can be used for further analysis.

Each trajectory file name is suffixed by its respective cluster-id.

`-cpdb central.pdb`¶

Output separate pdb files for the central structure of each cluster.

Each pdb file name is suffixed by its respective cluster-id.

`-rmsd rmsd.xvg`¶

RMSD of the clustering atom group with respect to the central structure.

Each RMSD file name is suffixed by its respective cluster-id.

With the -sort rmsdist option, distance-matrix RMSD is calculated.

Other options¶

`-xvg xmgrace`¶

It sets the formatting of all output <.xvg> files. By default, <.xvg> files are in xmgrace format, which can be plotted using Grace (xmgrace command).

To plot with any other program, use -xvg none to obtain a plain text file.

Three keywords are accepted:

xmgrace
xmgr
none

`-method kmeans`¶

Method to use for clustering. All methods are taken from the Python scikit-learn library.

An overview of the clustering methods is presented here.

Presently, the following methods are implemented:

-method kmeans

K-means clustering - It needs the number of clusters as input (-ncluster <int>). Therefore, one should know beforehand how many clusters are present in the data. To determine the number of clusters automatically, see -cmetric. For more details about the k-means method, see here.
-method dbscan

DBSCAN - Density-based spatial clustering of applications with noise - It does not require the number of clusters beforehand. The clusters are controlled by two other input options: -db_eps and -db_min_samples. For more details about the DBSCAN method, see here.
-method gmixture

Gaussian mixture model clustering - It also needs the number of clusters as input (-ncluster <int>). Therefore, one should know beforehand how many clusters are present in the data. To determine the number of clusters automatically, see -cmetric. For more details about the Gaussian mixture model method, see here.

`-nfeature 10`¶

Number of features to be read from the -feat file.

If the file contains fewer features than requested, all features are read.

`-cmetric prior`¶

Cluster metric to determine the total number of clusters automatically, particularly for k-means and Gaussian mixture model.

Note

All the cluster metrics are only applicable when -method kmeans or -method gmixture is used.

Presently, the following cluster metrics are implemented:

-cmetric prior

If the number of clusters is known beforehand, use this with -ncluster <int>. Here, -ncluster gives the number of clusters.
-cmetric rmsd

Root mean square deviation between the central structures of clusters. It uses the -crmsthres option as the RMSD threshold/cutoff.

Note

It requires a trajectory file as input. Otherwise, -cmetric ssr-sst is used as the cluster metric with the default -ssrchange value.
-cmetric ssr-sst

SSR/SST ratio, used for the Elbow method. The number of clusters is chosen from the relative change (in percent) of the SSR/SST ratio, with the threshold given by -ssrchange.
-cmetric silhouette

Silhouette score. From Wikipedia: "The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters." The first cluster count that reaches the highest silhouette score is taken as the final number of clusters.

To calculate the score, either the entire data set is used with -sil_ssize -1, which can be time-consuming, or a percentage of the data is drawn by random sampling, with the percentage given by -sil_ssize. Because of the random sampling, the score might not be reproduced exactly in successive calculations.
-cmetric DBI

DBI: Davies–Bouldin index. The number of clusters with the lowest value is selected.
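To make the "lowest value" criterion of the Davies–Bouldin index concrete, here is a small pure-Python sketch of the standard definition: for each cluster, take the worst-case ratio of summed within-cluster scatter to centroid separation, then average over clusters. The function name `davies_bouldin` and the toy points are illustrative only; the tool itself relies on scikit-learn, and this sketch is not its implementation.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for labelled points (needs at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Centroid of each cluster, coordinate by coordinate.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: mean distance of the cluster members to their centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data: two well separated pairs of points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # compact, separated clusters
bad = davies_bouldin(points, [0, 1, 0, 1])   # clusters that mix both pairs
print(good, bad)
```

The sensible labelling gives the smaller index, which is why the lowest value is preferred.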

`-ncluster 5`¶

It takes the number of clusters. Its usage depends on -cmetric.

Note

It is only applicable when -method kmeans or -method gmixture is used.

Conditions:

For -cmetric prior, it is the number of clusters to be generated.
For -cmetric rmsd, it is the largest number of clusters to be generated. The number of clusters is then reduced iteratively until no RMSD between central structures is below the RMSD threshold (-crmsthres <real>).
For -cmetric ssr-sst, -cmetric pFS and -cmetric DBI, it is the maximum number of clusters to be generated. At first, two clusters are generated, and the number of clusters is increased by one in each iteration. When the maximum number of clusters is reached, these three cluster metrics are calculated and, finally, the number of clusters is selected.

`-crmsthres 0.1`¶

RMSD (nm) threshold between central structures for RMSD cluster metric method.

It is used with -cmetric rmsd. In each iteration, the RMSD between all pairs of central structures is calculated. If any RMSD value is within the input RMSD (nm) threshold, the number of clusters is decreased by one in the next iteration.

It is assumed that when the RMSD between two central structures is within the threshold, the central structures are similar enough for the two clusters to be merged into a single cluster. However, these two clusters will not necessarily merge in the next iteration.
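The iteration described above can be sketched as a simple loop. In this sketch, `central_rmsds` is a hypothetical callback standing in for "cluster into k groups and return the pairwise RMSDs between the k central structures"; the RMSD values in `table` are made up, and the real tool performs the clustering and RMSD calculation itself.

```python
def choose_ncluster(ncluster, crmsthres, central_rmsds):
    """Reduce the cluster count until all central structures differ enough.

    central_rmsds(k) must return the pairwise RMSDs (nm) between the
    central structures obtained with k clusters.
    """
    k = ncluster
    while k > 1:
        if all(rmsd > crmsthres for rmsd in central_rmsds(k)):
            return k  # every pair of central structures is distinct enough
        k -= 1  # some pair is within the threshold: try one cluster fewer
    return 1


# Made-up pairwise RMSDs for k = 5, 4 and 3 clusters (10, 6 and 3 pairs).
table = {
    5: [0.05, 0.21, 0.30, 0.25, 0.40, 0.33, 0.28, 0.19, 0.22, 0.31],
    4: [0.09, 0.24, 0.35, 0.27, 0.41, 0.30],
    3: [0.18, 0.26, 0.37],
}
selected = choose_ncluster(5, 0.1, lambda k: table[k])
print("selected number of clusters:", selected)
```

With these values, the runs with five and four clusters each contain a pair of central structures within 0.1 nm, so the loop stops at three clusters.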

`-ssrchange 2.0`¶

Threshold for the relative change (in percent) in SSR/SST ratio used to choose the number of clusters automatically. This threshold gives the potential position of the elbow in the Elbow method.

Note

This option is only used when -cmetric ssr-sst is provided as input.
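A rough sketch of how such a threshold can locate the elbow is given below. The exact rule used by the tool is not spelled out here, so this sketch makes an assumption: it picks the smallest number of clusters after which adding one more cluster changes the SSR/SST ratio by less than -ssrchange percent. The function name `pick_elbow` and the ratio values are made up.

```python
def pick_elbow(ratios, ssrchange=2.0, first_k=2):
    """Pick a cluster count from SSR/SST ratios.

    ratios[i] is the SSR/SST ratio obtained with first_k + i clusters.
    Assumed rule: stop at the first cluster count for which one more
    cluster changes the ratio by less than ssrchange percent.
    """
    for i in range(1, len(ratios)):
        change = abs(ratios[i] - ratios[i - 1]) / ratios[i - 1] * 100.0
        if change < ssrchange:
            return first_k + i - 1  # adding one more cluster barely helped
    return first_k + len(ratios) - 1  # no elbow found: use the maximum


# Made-up ratios for 2, 3, 4, 5 and 6 clusters.
ratios = [0.50, 0.70, 0.80, 0.81, 0.815]
k = pick_elbow(ratios)
print("elbow at", k, "clusters")
```

Here the ratio changes by 40 % and about 14 % for the first two steps, then by only 1.25 % when going from four to five clusters, so four clusters are selected.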

`-sil_ssize 20`¶

Percentage of the total number of frames to use as the sample size for the silhouette score calculation. If its value is -1, no sampling is done and all frames are used.
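The effect of this option on the number of frames entering the score can be illustrated as follows. The helper `silhouette_sample` is hypothetical, and the truncation to an integer is an assumption; the tool may round differently.

```python
import random


def silhouette_sample(n_frames, sil_ssize=20.0, seed=None):
    """Return the frame indices used for the silhouette score."""
    if sil_ssize == -1:
        return list(range(n_frames))  # no sampling: use every frame
    # Assumed rounding: truncate, but keep at least one frame.
    n_sample = max(1, int(n_frames * sil_ssize / 100.0))
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return sorted(rng.sample(range(n_frames), n_sample))


subset = silhouette_sample(1000, sil_ssize=20, seed=0)
print(len(subset), "of 1000 frames sampled")
```

Without a fixed seed, each call draws a different subset, which is why the score can differ slightly between runs.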

`-db_eps 0.5`¶

The maximum distance between two samples for them to be considered as in the same neighborhood.

`-db_min_samples 20`¶

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
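How -db_eps and -db_min_samples interact can be seen from the core-point test alone. The following sketch is not the scikit-learn implementation; it only applies the definition above to a few made-up 2D points, with unit weights and Euclidean distance assumed.

```python
import math


def core_points(points, db_eps=0.5, db_min_samples=20):
    """Return indices of points that qualify as DBSCAN core points."""
    core = []
    for i, p in enumerate(points):
        # The neighbourhood includes the point itself (distance 0).
        neighbours = sum(1 for q in points if math.dist(p, q) <= db_eps)
        if neighbours >= db_min_samples:
            core.append(i)
    return core


# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
core = core_points(points, db_eps=0.5, db_min_samples=3)
print("core point indices:", core)
```

A larger -db_eps or a smaller -db_min_samples makes more points qualify as core points, which generally leads to fewer points being treated as noise.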

`-nminfr 20`¶

Minimum number of frames a cluster must contain to be written as a trajectory. If a cluster has fewer frames than this number, it is ignored.
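The filter can be pictured as a simple count over per-frame cluster ids. The function name `clusters_to_write` and the label list below are illustrative only.

```python
from collections import Counter


def clusters_to_write(labels, nminfr=20):
    """Return the cluster ids that have at least nminfr frames."""
    counts = Counter(labels)
    return sorted(cid for cid, n in counts.items() if n >= nminfr)


# Made-up per-frame cluster ids: 25, 5 and 20 frames in clusters 1, 2 and 3.
labels = [1] * 25 + [2] * 5 + [3] * 20
kept = clusters_to_write(labels, nminfr=20)
print("clusters written as trajectories:", kept)
```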

`-fit/-nofit`¶

Enable fitting and superimposition using an atom group different from the RMSD/clustering group before RMSD calculation. If enabled, the index group for fitting is prompted for. Otherwise, fitting is performed with the RMSD/clustering group.

`-fit2central/-nofit2central`¶

Enable/disable superimposition (fitting) of the output trajectory onto the central structure. The atom group used for fitting depends on the -[no]fit option. With -nofit, the second input index group (RMSD/clustering group) is used for fitting; otherwise, the third index group is used.

`-outframe -1`¶

Maximum number of frames in the output trajectories. It can be helpful for getting an output trajectory with only the structures around the central structure.

`-sort none`¶

Sort trajectory according to these values.

Accepted methods are:

-sort none : Output trajectory will not be sorted
-sort rmsd

Sort trajectory according to RMSD with respect to central structure. Therefore, obtained trajectory’s first frame will be central structure and RMSD will increase gradually after first frame.
-sort rmsdist

Sort trajectory according to distance-matrix RMSD with respect to central structure. Therefore, obtained trajectory’s first frame will be central structure and distance-matrix RMSD will increase gradually after first frame.
-sort features

Sort trajectory according to feature sub-space. The distance of each conformation to the respective central structure is calculated in feature space, and the trajectory is written from lowest to highest distance. In this trajectory, the first frame will be the central structure.

This option is very useful when the features are something other than PCA projections on eigenvectors.
-sort user

Sort trajectory using values supplied by user. Not yet implemented.
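The ordering produced by -sort features can be sketched as follows. The sketch assumes Euclidean distance in feature space and takes the index of the central structure as given; the function name `sort_by_feature_distance` and the feature values are hypothetical.

```python
import math


def sort_by_feature_distance(features, central_index):
    """Order frame indices by distance to the central frame in feature space."""
    center = features[central_index]
    return sorted(
        range(len(features)),
        key=lambda i: math.dist(features[i], center),
    )


# Made-up 2D features for four frames of one cluster; frame 1 is central.
features = [(1.0, 1.0), (0.0, 0.0), (0.2, 0.1), (2.0, 2.0)]
order = sort_by_feature_distance(features, central_index=1)
print("frame order:", order)

# Keeping only the first frames of a sorted trajectory, as -outframe does,
# retains the conformations closest to the central structure.
closest = order[:2]
```

The central frame has distance zero to itself, so it always comes first in the sorted order.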

`-plot pca_cluster.png`¶

File in which feature-vs-feature plots with clusters are written.

A plot is generated in which features are plotted against each other, with clusters shown in different colors. It is helpful for checking whether the number of clusters is sufficient.
