Manual on how to use the fit-clusters script for clustering
NOTE: fit-clusters requires installation with the gin extras, e.g. pip install divik[gin]
fit-clusters is a single CLI executable that runs the DiviK algorithm, any other clustering algorithm supported by scikit-learn, or even a pipeline with pre-processing.
There are two types of parameters:
--param - sets the value of a parameter when launching the fit-clusters executable, i.e. it overrides a value provided in a config file or a default.
--config - provides a list of config files. Their content is treated as one big (ordered) list of settings. In case of conflict, a later file overrides a setting provided by an earlier one.
These go directly to the CLI.
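The override order described above can be pictured as successive dictionary updates. The sketch below is only a mental model in plain Python, not how fit-clusters is implemented (the actual resolution is done by gin); the function name resolve_settings and all sample values are hypothetical.

```python
def resolve_settings(defaults, config_files, params):
    """Model the override order: defaults < config files (in order) < --param."""
    resolved = dict(defaults)
    for file_settings in config_files:  # later files win over earlier ones
        resolved.update(file_settings)
    resolved.update(params)  # --param values win over everything else
    return resolved


defaults = {"experiment.destination": "result", "experiment.verbose": False}
first_file = {"experiment.verbose": True, "KMeans.n_clusters": 3}
second_file = {"KMeans.n_clusters": 5}
params = {"experiment.destination": "my-run"}

settings = resolve_settings(defaults, [first_file, second_file], params)
print(settings)
```

Here the second file overrides the cluster count from the first one, and the --param value replaces the default destination.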
Sample fit-clusters call:
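A call could look like the sketch below. Only the --param and --config flags and the load_data.path and experiment.verbose names come from this manual; the data path and the two .gin file names are hypothetical placeholders. The sketch also assumes that each --param value is parsed as a gin binding, so string values carry their own quotes inside the shell quotes.

```bash
fit-clusters \
  --param \
    "load_data.path='/data/my_data.csv'" \
    "experiment.verbose=True" \
  --config \
    my-defaults.gin \
    my-overrides.gin
```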
All the parameters are described in detail in Experiment configuration and Model setup.
The following parameters are available when launching experiments:
load_data.path - path to the file with data for clustering. Observations in rows, features in columns.
load_xy.path - path to the file with X and Y coordinates for the observations. The number of coordinate pairs must be the same as the number of observations. Only integer coordinates are supported now.
experiment.model - the clustering model to fit to the data. See more in Model setup.
experiment.steps_that_require_xy - when using a scikit-learn Pipeline, it may be required to provide spatial coordinates to fit specific algorithms. This parameter accepts the list of steps that should be provided with spatial coordinates during pipeline execution (e.g. EximsSelector).
experiment.destination - the destination directory for the experiment outputs. Default: result.
experiment.omit_datetime - if True, the destination directory will be directly populated with the results of the experiment. Otherwise, a subdirectory with date and time will be created to keep separation between runs. Default: False.
experiment.verbose - if True, extends the messaging on the console. Default: False.
experiment.exist_ok - if True, the experiment will not fail if the destination directory exists. By default the experiment fails in that case, which prevents overwriting existing results. Default: False.
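As a config fragment, the parameters above could be set as follows. This is a sketch: the file paths are hypothetical, and @KMeans() merely stands in for whichever model is configured as described in Model setup.

```
import divik.cluster

# Hypothetical input files: observations in rows, features in columns.
load_data.path = '/data/my_data.csv'
load_xy.path = '/data/my_xy.csv'

experiment.model = @KMeans()
experiment.destination = 'result'
experiment.omit_datetime = False
experiment.verbose = True
experiment.exist_ok = False
```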
divik models
To use the DiviK algorithm in an experiment, a config file must:
Import the algorithms into scope, e.g.:
Tell the experiment which algorithm to use, e.g.:
Configure the algorithm, e.g.:
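The three steps above might look like this in a config file. It is a fragment rather than a complete setup, and the DiviK parameter names shown are assumptions that should be checked against the installed divik version.

```
# 1. Import the algorithms into scope.
import divik.cluster

# 2. Tell the experiment which algorithm to use.
experiment.model = @DiviK()

# 3. Configure the algorithm (parameter names assumed).
DiviK.distance = 'euclidean'
DiviK.verbose = True
```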
KMeans
Below is a sample configuration file that sets up a simple KMeans:
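A minimal sketch of such a file is given here. The KMeans parameter names and values are assumptions for illustration; only the experiment.* settings are taken from this manual.

```
import divik.cluster

# Parameter names below are assumptions; verify them against your divik version.
KMeans.n_clusters = 3
KMeans.distance = 'correlation'
KMeans.max_iter = 100
KMeans.normalize_rows = True

experiment.model = @KMeans()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
```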
DiviK
Below is a configuration file with a full setup of DiviK. DiviK requires an automated clustering method for the stop condition and a separate one for clustering. Here we use GAPSearch for the stop condition and DunnSearch for selecting the number of clusters. These in turn require a KMeans method set for a specific distance method, etc.:
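The following is a reduced sketch of how the pieces described above could be wired together, not the complete setup. All parameter names (kmeans, fast_kmeans, max_clusters, and so on) and values are assumptions and should be verified against the divik API before use.

```
import divik.cluster

# Base KMeans used by both searches (parameter names are assumptions).
KMeans.n_clusters = 1
KMeans.distance = 'correlation'

# GAPSearch is used for the stop condition.
GAPSearch.kmeans = @KMeans()
GAPSearch.max_clusters = 2

# DunnSearch selects the number of clusters.
DunnSearch.kmeans = @KMeans()
DunnSearch.max_clusters = 10

# Assumed parameter names for plugging both searches into DiviK.
DiviK.kmeans = @DunnSearch()
DiviK.fast_kmeans = @GAPSearch()
DiviK.distance = 'correlation'

experiment.model = @DiviK()
```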
scikit-learn models
For a model to be used with fit-clusters, it needs to be marked as gin.configurable. While this is true for DiviK and the remaining algorithms within the divik package, scikit-learn requires additional setup.
Import the helper module:
Tell the experiment which algorithm to use, e.g.:
Configure the algorithm, e.g.:
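In a config file, these steps might read as below. The helper module name is an assumption about the divik package layout and should be verified; n_jobs and max_iter are regular MeanShift parameters in scikit-learn.

```
# 1. Import the helper module (module name assumed).
import divik.core.gin_sklearn_configurables

# 2. Tell the experiment which algorithm to use.
experiment.model = @MeanShift()

# 3. Configure the algorithm.
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
```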
WARNING: Importing both scikit-learn and divik will result in an ambiguity when using e.g. KMeans. In such a case it is necessary to refer to specific algorithms by their full name, e.g. divik.cluster._kmeans._core.KMeans.
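A sketch of the disambiguation follows, assuming both modules have been imported in the same config. The helper module name and the n_clusters value are illustrative assumptions; the full name of the divik KMeans is the one quoted above.

```
import divik.cluster
import divik.core.gin_sklearn_configurables  # assumed helper module name

# Use the full name both when referencing and when configuring the algorithm.
experiment.model = @divik.cluster._kmeans._core.KMeans()
divik.cluster._kmeans._core.KMeans.n_clusters = 3
```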
MeanShift
Below is a sample configuration file that sets up a simple MeanShift:
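As a complete file, a MeanShift setup could look like this sketch. The helper module name and the data path are assumptions; bin_seeding is a regular MeanShift parameter in scikit-learn.

```
import divik.core.gin_sklearn_configurables  # assumed helper module name

load_data.path = '/data/my_data.csv'  # hypothetical path

MeanShift.n_jobs = -1
MeanShift.max_iter = 300
MeanShift.bin_seeding = True

experiment.model = @MeanShift()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
```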
scikit-learn Pipelines have a separate section to provide additional explanation, even though they are part of scikit-learn.
Import the helper module:
Import the algorithms into scope:
Tell the experiment which algorithm to use, e.g.:
Configure the algorithms, e.g.:
Configure the pipeline:
(If needed) configure the steps that require spatial coordinates:
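Step by step, such a config might be written as follows. Both imported module names are assumptions about the divik package layout, and the step names 'exims' and 'mean_shift' are arbitrary labels chosen for this sketch.

```
# 1. Import the helper module (module name assumed).
import divik.core.gin_sklearn_configurables
# 2. Import the algorithms into scope (assumed home of EximsSelector).
import divik.feature_selection

# 3. Tell the experiment which algorithm to use.
experiment.model = @Pipeline()

# 4. Configure the algorithms.
MeanShift.n_jobs = -1
MeanShift.max_iter = 300

# 5. Configure the pipeline as (step name, estimator) pairs.
Pipeline.steps = [
    ('exims', @EximsSelector()),
    ('mean_shift', @MeanShift()),
]

# 6. (If needed) list the steps that require spatial coordinates.
experiment.steps_that_require_xy = ['exims']
```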
Pipeline
Below is a sample configuration file that sets up a simple Pipeline:
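Put together as one file with data loading and output settings, a Pipeline config could resemble this sketch. The module names and file paths are assumptions; the load_* and experiment.* names are the ones documented in Experiment configuration.

```
import divik.core.gin_sklearn_configurables  # assumed helper module name
import divik.feature_selection  # assumed home of EximsSelector

load_data.path = '/data/my_data.csv'  # hypothetical path
load_xy.path = '/data/my_xy.csv'      # hypothetical path, used by the 'exims' step

MeanShift.n_jobs = -1

Pipeline.steps = [
    ('exims', @EximsSelector()),
    ('mean_shift', @MeanShift()),
]

experiment.model = @Pipeline()
experiment.steps_that_require_xy = ['exims']
experiment.destination = 'result'
experiment.verbose = True
```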
The fit-clusters executable can work with custom algorithms as well.
Mark an algorithm class as gin.configurable at definition time:
or when importing it from a library:
Define artifact-saving methods:
There are some default savers defined, which are compatible with many divik and scikit-learn algorithms, supporting things like:
model pickling
JSON summary saving
labels saving (.npy, .csv)
centroids saving (.npy, .csv)
pipeline saving
A saver should be highly reusable and could be a pleasant contribution to the divik library.
In the config, import the module that marks your algorithm as configurable:
Continue with the algorithm setup and plumbing as in the previous scenarios.