Python implementation of Divisive iK-means (DiviK) algorithm.
Clustering at your command line with fit-clusters
Set of algorithm implementations for unsupervised analyses
DiviK - hands-free clustering method with built-in feature selection
K-Means with Dunn method for selecting the number of clusters
K-Means with GAP index for selecting the number of clusters
Modular K-Means implementation with custom distance metrics and initializations
Two-step meta-clustering
Gaussian Mixture Model based data-driven feature selection
High Abundance And Variance Selector - allows you to select highly variant features above noise level, based on GMM-decomposition
Outlier Abundance And Variance Selector - allows you to select highly variant features above noise level, based on outlier detection
Percentage based selector - allows you to select highly variant features above noise level with your predefined thresholds for each
Stratified Sampler - generates samples of fixed number of rows from given dataset, preserving groups proportion
Uniform PCA Sampler - generates samples of random observations within boundaries of an original dataset, and preserving the rotation of the data
Uniform Sampler - generates samples of random observations within boundaries of an original dataset
The recommended way to use this software is through Docker. This is the most convenient way, if you want to use divik application.
To install latest stable version use:
docker pull gmrukwa/divikPrerequisites for installation of base package:
Python 3.6 / 3.7 / 3.8
compiler capable of compiling the native C code and OpenMP support
You should have it already installed with GCC compiler, but if somehow not, try the following:
sudo apt-get install libgomp1OpenMP is available as part of LLVM. You may need to install it with conda:
conda install -c conda-forge "compilers>=1.0.4,!=1.1.0" llvm-openmpHaving prerequisites installed, one can install latest base version of the package:
pip install divikIf you want to have compatibility with gin-config, you can install necessary extras with:
pip install divik[gin]Note: Remember about \ before [ and ] in zsh shell.
You can install all extras with:
pip install divik[all]If you are using DiviK to run the analysis that could fail to fit RAM of your computer, consider disabling the default parallelism and switch to dask. It's easy to achieve through configuration:
set all parameters named n_jobs to 1;
set all parameters named allow_dask to True.
Note: Never set n_jobs>1 and allow_dask=True at the same time, the computations will freeze due to how multiprocessing and dask handle parallelism.
It can happen if the he gamred_native package (part of divik package) was compiled with different numpy ABI than scikit-learn. This could happen if you used different set of compilers than the developers of the scikit-learn package.
In such a case, a handler is defined to display the stack trace. If the trace comes from _matlab_legacy.py, the most probably this is the issue.
To resolve the issue, consider following the installation instructions once again. The exact versions get updated to avoid the issue.
Contribution guide will be developed soon.
Format the code with:
isort -m 3 --fgw 3 --tc .
black -t py36 .This software is part of contribution made by Data Mining Group of Silesian University of Technology, rest of which is published here.
Manual on how to use the fit-clusters script for clustering
NOTE: fit-clusters requires installation with gin extras, e.g. pip install divik[gin]
fit-clusters is just one CLI executable that allows you to run DiviK algorithm, any other clustering algorithms supported by scikit-learn or even a pipeline with pre-processing.
There are two types of parameters:
--param - this way you can set the value of a parameter during
fit-clusters executable launch, i.e. you can overwrite parameter provided
in a config file or a default.
--config - this way you can provide a list of config files. Their
content will be treated as a one big (ordered) list of settings. In case of
conflict, the later file overwrites a setting provided by earlier one.
These go directly to the CLI.
usage: fit-clusters [-h] [--param [PARAM [PARAM ...]]]
                [--config [CONFIG [CONFIG ...]]]
optional arguments:
-h, --help            show this help message and exit
--param [PARAM [PARAM ...]]
                        List of Gin parameter bindings
--config [CONFIG [CONFIG ...]]
                        List of paths to the config filesSample fit-clusters call:
fit-clusters \
  --param \
    load_data.path='/data/my_data.csv' \
    DiviK.distance='euclidean' \
    DiviK.use_logfilters=False \
    DiviK.n_jobs=-1 \
  --config \
    my-defaults.gin \
    my-overrides.ginThe elaboration of all the parameters is included in Experiment configuration and Model setup.
Following parameters are available when launching experiments:
load_data.path - path to the file with data for clustering. Observations
in rows, features in columns.
load_xy.path - path to the file with X and Y coordinates for the
observations. The number of coordinate pairs must be the same as the number
of observations. Only integer coordinates are supported now.
experiment.model - the clustering model to fit to the data. See more in
Model setup.
experiment.steps_that_require_xy - when using scikit-learn Pipeline,
it may be required to provide spatial coordinates to fit specific algorithms.
This parameter accepts the list of the steps that should be provided with
spatial coordinates during pipeline execution (e.g. EximsSelector).
experiment.destination - the destination directory for the experiment
outputs. Default result.
experiment.omit_datetime - if True, the destination directory will be
directly populated with the results of the experiment. Otherwise, a
subdirectory with date and time will be created to keep separation between
runs. Default False.
experiment.verbose - if True, extends the messaging on the console.
Default False.
experiment.exist_ok - if True, the experiment will not fail if the
destination directory exists. This is to avoid results overwrites. Default
False.
divik modelsTo use DiviK algorithm in the experiment, a config file must:
Import the algorithms to the scope, e.g.:
import divik.clusterPoint experiment which algorithm to use, e.g.:
experiment.model = @DiviK()Configure the algorithm, e.g.:
DiviK.distance = 'euclidean'
DiviK.verbose = TrueKMeansBelow you can check sample configuration file, that sets up simple KMeans:
import divik.cluster
KMeans.n_clusters = 3
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True
experiment.model = @KMeans()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = TrueDiviKBelow is the configuration file with full setup of DiviK. DiviK requires an automated clustering method for stop condition and a separate one for clustering. Here we use GAPSearch for stop condition and DunnSearch for selecting the number of clusters. These in turn require a KMeans method set for a specific distance method, etc.:
import divik.cluster
KMeans.n_clusters = 1
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True
GAPSearch.kmeans = @KMeans()
GAPSearch.max_clusters = 2
GAPSearch.n_jobs = 1
GAPSearch.seed = 42
GAPSearch.n_trials = 10
GAPSearch.sample_size = 1000
GAPSearch.drop_unfit = True
GAPSearch.verbose = True
DunnSearch.kmeans = @KMeans()
DunnSearch.max_clusters = 10
DunnSearch.method = "auto"
DunnSearch.inter = "closest"
DunnSearch.intra = "furthest"
DunnSearch.sample_size = 1000
DunnSearch.seed = 42
DunnSearch.n_jobs = 1
DunnSearch.drop_unfit = True
DunnSearch.verbose = True
DiviK.kmeans = @DunnSearch()
DiviK.fast_kmeans = @GAPSearch()
DiviK.distance = "correlation"
DiviK.minimal_size = 200
DiviK.rejection_size = 2
DiviK.minimal_features_percentage = 0.005
DiviK.features_percentage = 1.0
DiviK.normalize_rows = True
DiviK.use_logfilters = True
DiviK.filter_type = "gmm"
DiviK.n_jobs = 1
DiviK.verbose = True
experiment.model = @DiviK()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = Truescikit-learn modelsFor a model to be used with fit-clusters, it needs to be marked as gin.configurable. While it is true for DiviK and remaining algorithms within divik package, scikit-learn requires additional setup.
Import helper module:
import divik.core.gin_sklearn_configurablesPoint experiment which algorithm to use, e.g.:
experiment.model = @MeanShift()Configure the algorithm, e.g.:
MeanShift.n_jobs = -1
MeanShift.max_iter = 300WARNING: Importing both scikit-learn and divik will result in an ambiguity when using e.g. KMeans. In such a case it is necesary to point specific algorithms by a full name, e.g. divik.cluster._kmeans._core.KMeans.
MeanShiftBelow you can check sample configuration file, that sets up simple MeanShift:
import divik.core.gin_sklearn_configurables
MeanShift.cluster_all = True
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
experiment.model = @MeanShift()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = Truescikit-learn Pipelines have a separate section to provide an additional explanation, even though these are part of scikit-learn.
Import helper module:
import divik.core.gin_sklearn_configurablesImport the algorithms into the scope:
import divik.feature_extractionPoint experiment which algorithm to use, e.g.:
experiment.model = @Pipeline()Configure the algorithms, e.g.:
MeanShift.n_jobs = -1
MeanShift.max_iter = 300Configure the pipeline:
Pipeline.steps = [
    ('histogram_equalization', @HistogramEqualization()),
    ('exims', @EximsSelector()),
    ('pca', @KneePCA()),
    ('mean_shift', @MeanShift()),
](If needed) configure steps that require spatial coordinates:
experiment.steps_that_require_xy = ['exims']PipelineBelow you can check sample configuration file, that sets up simple Pipeline:
import divik.core.gin_sklearn_configurables
import divik.feature_extraction
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
Pipeline.steps = [
    ('histogram_equalization', @HistogramEqualization()),
    ('exims', @EximsSelector()),
    ('pca', @KneePCA()),
    ('mean_shift', @MeanShift()),
]
experiment.model = @Pipeline()
experiment.steps_that_require_xy = ['exims']
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = TrueThe fit-clusters executable can work with custom algorithms as well.
Mark an algorithm class gin.configurable at the definition time:
import gin
@gin.configurable
class MyClustering:
    passor when importing them from a library:
import gin
gin.external_configurable(MyClustering)Define artifacts saving methods:
from divik.core.io import saver
@saver
def save_my_clustering(model, fname_fn, **kwargs):
    if not hasattr(model, 'my_custom_field_'):
        return
    # custom saving logic comes hereThere are some default savers defined, which are compatible with lots of divik and scikit-learn algorithms, supporting things like:
model pickling
JSON summary saving
labels saving (.npy, .csv)
centroids saving (.npy, .csv)
pipeline saving
A saver should be highly reusable and could be a pleasant contribution to the divik library.
In config, import the module which marks your algorithm configurable:
import myclusteringContinue with the algorithm setup and plumbing as in the previous scenarios