# Cluster analysis with fit-clusters

**NOTE**: `fit-clusters` requires installation with `gin` extras, e.g. `pip install divik[gin]`

fit-clusters is just one CLI executable that allows you to run DiviK algorithm, any other clustering algorithms supported by scikit-learn or even a pipeline with pre-processing.

## Usage

### CLI interface

There are two types of parameters:

1. `--param` - this way you can set the value of a parameter during

   fit-clusters executable launch, i.e. you can overwrite parameter provided

   in a config file or a default.
2. `--config` - this way you can provide a list of config files. Their

   content will be treated as a one big (ordered) list of settings. In case of

   conflict, the later file overwrites a setting provided by earlier one.

These go directly to the CLI.

```
usage: fit-clusters [-h] [--param [PARAM [PARAM ...]]]
                [--config [CONFIG [CONFIG ...]]]

optional arguments:
-h, --help            show this help message and exit
--param [PARAM [PARAM ...]]
                        List of Gin parameter bindings
--config [CONFIG [CONFIG ...]]
                        List of paths to the config files
```

Sample `fit-clusters` call:

```
fit-clusters \
  --param \
    load_data.path='/data/my_data.csv' \
    DiviK.distance='euclidean' \
    DiviK.use_logfilters=False \
    DiviK.n_jobs=-1 \
  --config \
    my-defaults.gin \
    my-overrides.gin
```

The elaboration of all the parameters is included in Experiment configuration and Model setup.

### Experiment configuration

Following parameters are available when launching experiments:

1. `load_data.path` - path to the file with data for clustering. Observations

   in rows, features in columns.
2. `load_xy.path` - path to the file with X and Y coordinates for the

   observations. The number of coordinate pairs must be the same as the number

   of observations. Only integer coordinates are supported now.
3. `experiment.model` - the clustering model to fit to the data. See more in

   Model setup.
4. `experiment.steps_that_require_xy` - when using scikit-learn Pipeline,

   it may be required to provide spatial coordinates to fit specific algorithms.

   This parameter accepts the list of the steps that should be provided with

   spatial coordinates during pipeline execution (e.g. `EximsSelector`).
5. `experiment.destination` - the destination directory for the experiment

   outputs. Default `result`.
6. `experiment.omit_datetime` - if `True`, the destination directory will be

   directly populated with the results of the experiment. Otherwise, a

   subdirectory with date and time will be created to keep separation between

   runs. Default `False`.
7. `experiment.verbose` - if `True`, extends the messaging on the console.

   Default False.
8. `experiment.exist_ok` - if `True`, the experiment will not fail if the

   destination directory exists. This is to avoid results overwrites. Default

   `False`.

## Model setup

### `divik` models

To use DiviK algorithm in the experiment, a config file must:

1. Import the algorithms to the scope, e.g.:

```
import divik.cluster
```

1. Point experiment which algorithm to use, e.g.:

```
experiment.model = @DiviK()
```

1. Configure the algorithm, e.g.:

```
DiviK.distance = 'euclidean'
DiviK.verbose = True
```

#### Sample config with `KMeans`

Below you can check sample configuration file, that sets up simple KMeans:

```
import divik.cluster

KMeans.n_clusters = 3
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True

experiment.model = @KMeans()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
```

#### Sample config with `DiviK`

Below is the configuration file with full setup of DiviK. `DiviK` requires an automated clustering method for stop condition and a separate one for clustering. Here we use `GAPSearch` for stop condition and `DunnSearch` for selecting the number of clusters. These in turn require a `KMeans` method set for a specific distance method, etc.:

```
import divik.cluster

KMeans.n_clusters = 1
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True

GAPSearch.kmeans = @KMeans()
GAPSearch.max_clusters = 2
GAPSearch.n_jobs = 1
GAPSearch.seed = 42
GAPSearch.n_trials = 10
GAPSearch.sample_size = 1000
GAPSearch.drop_unfit = True
GAPSearch.verbose = True

DunnSearch.kmeans = @KMeans()
DunnSearch.max_clusters = 10
DunnSearch.method = "auto"
DunnSearch.inter = "closest"
DunnSearch.intra = "furthest"
DunnSearch.sample_size = 1000
DunnSearch.seed = 42
DunnSearch.n_jobs = 1
DunnSearch.drop_unfit = True
DunnSearch.verbose = True

DiviK.kmeans = @DunnSearch()
DiviK.fast_kmeans = @GAPSearch()
DiviK.distance = "correlation"
DiviK.minimal_size = 200
DiviK.rejection_size = 2
DiviK.minimal_features_percentage = 0.005
DiviK.features_percentage = 1.0
DiviK.normalize_rows = True
DiviK.use_logfilters = True
DiviK.filter_type = "gmm"
DiviK.n_jobs = 1
DiviK.verbose = True

experiment.model = @DiviK()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
```

### `scikit-learn` models

For a model to be used with `fit-clusters`, it needs to be marked as `gin.configurable`. While it is true for DiviK and remaining algorithms within `divik` package, `scikit-learn` requires additional setup.

1. Import helper module:

```
import divik.core.gin_sklearn_configurables
```

1. Point experiment which algorithm to use, e.g.:

```
experiment.model = @MeanShift()
```

1. Configure the algorithm, e.g.:

```
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
```

**WARNING**: Importing both `scikit-learn` and `divik` will result in an ambiguity when using e.g. `KMeans`. In such a case it is necesary to point specific algorithms by a full name, e.g. `divik.cluster._kmeans._core.KMeans`.

#### Sample config with `MeanShift`

Below you can check sample configuration file, that sets up simple MeanShift:

```
import divik.core.gin_sklearn_configurables

MeanShift.cluster_all = True
MeanShift.n_jobs = -1
MeanShift.max_iter = 300

experiment.model = @MeanShift()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
```

### Pipelines

`scikit-learn` Pipelines have a separate section to provide an additional explanation, even though these are part of `scikit-learn`.

1. Import helper module:

```
import divik.core.gin_sklearn_configurables
```

1. Import the algorithms into the scope:

```
import divik.feature_extraction
```

1. Point experiment which algorithm to use, e.g.:

```
experiment.model = @Pipeline()
```

1. Configure the algorithms, e.g.:

```
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
```

1. Configure the pipeline:

```
Pipeline.steps = [
    ('histogram_equalization', @HistogramEqualization()),
    ('exims', @EximsSelector()),
    ('pca', @KneePCA()),
    ('mean_shift', @MeanShift()),
]
```

1. (If needed) configure steps that require spatial coordinates:

```
experiment.steps_that_require_xy = ['exims']
```

#### Sample config with `Pipeline`

Below you can check sample configuration file, that sets up simple Pipeline:

```
import divik.core.gin_sklearn_configurables
import divik.feature_extraction

MeanShift.n_jobs = -1
MeanShift.max_iter = 300

Pipeline.steps = [
    ('histogram_equalization', @HistogramEqualization()),
    ('exims', @EximsSelector()),
    ('pca', @KneePCA()),
    ('mean_shift', @MeanShift()),
]

experiment.model = @Pipeline()
experiment.steps_that_require_xy = ['exims']
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
```

### Custom models

The `fit-clusters` executable can work with custom algorithms as well.

1. Mark an algorithm class `gin.configurable` at the definition time:

```
import gin

@gin.configurable
class MyClustering:
    pass
```

or when importing them from a library:

```
import gin

gin.external_configurable(MyClustering)
```

1. Define artifacts saving methods:

```
from divik.core.io import saver

@saver
def save_my_clustering(model, fname_fn, **kwargs):
    if not hasattr(model, 'my_custom_field_'):
        return
    # custom saving logic comes here
```

There are some default savers defined, which are compatible with lots of `divik` and `scikit-learn` algorithms, supporting things like:

* model pickling
* JSON summary saving
* labels saving (`.npy`, `.csv`)
* centroids saving (`.npy`, `.csv`)
* pipeline saving

A `saver` should be highly reusable and could be a pleasant contribution to the `divik` library.

1. In config, import the module which marks your algorithm configurable:

```
import myclustering
```

1. Continue with the algorithm setup and plumbing as in the previous scenarios


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://sut-data-mining.gitbook.io/divik/cluster-analysis.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
