Cluster analysis with fit-clusters

Manual on how to use the fit-clusters script for clustering

NOTE: fit-clusters requires installation with gin extras, e.g. pip install divik[gin]

fit-clusters is a single CLI executable that allows you to run the DiviK algorithm, any other clustering algorithm supported by scikit-learn, or even a pipeline with pre-processing.

Usage

CLI interface

    usage: fit-clusters [-h] [--param [PARAM [PARAM ...]]]
                        [--config [CONFIG [CONFIG ...]]]

    optional arguments:
      -h, --help            show this help message and exit
      --param [PARAM [PARAM ...]]
                            List of Gin parameter bindings
      --config [CONFIG [CONFIG ...]]
                            List of paths to the config files

There are two types of parameters:

  1. --param - this way you can set the value of a parameter during the
     fit-clusters executable launch, i.e. you can overwrite a parameter provided
     in a config file or a default.

  2. --config - this way you can provide a list of config files. Their
     content will be treated as one big (ordered) list of settings. In case of
     conflict, the later file overwrites a setting provided by an earlier one.

These go directly to the CLI.

Sample fit-clusters call:

    fit-clusters \
      --param \
        load_data.path='/data/my_data.csv' \
        DiviK.distance='euclidean' \
        DiviK.use_logfilters=False \
        DiviK.n_jobs=-1 \
      --config \
        my-defaults.gin \
        my-overrides.gin

The elaboration of all the parameters is included in Experiment configuration and Model setup.

Experiment configuration

The following parameters are available when launching experiments:

  • load_data.path - path to the file with data for clustering. Observations
    in rows, features in columns.

  • load_xy.path - path to the file with X and Y coordinates for the
    observations. The number of coordinate pairs must be the same as the number
    of observations. Only integer coordinates are supported now.

  • experiment.model - the clustering model to fit to the data. See more in
    Model setup.

  • experiment.steps_that_require_xy - when using a scikit-learn Pipeline,
    it may be required to provide spatial coordinates to fit specific algorithms.
    This parameter accepts the list of the steps that should be provided with
    spatial coordinates during pipeline execution (e.g. EximsSelector).

  • experiment.destination - the destination directory for the experiment
    outputs. Default result.

  • experiment.omit_datetime - if True, the destination directory will be
    directly populated with the results of the experiment. Otherwise, a
    subdirectory with date and time will be created to keep separation between
    runs. Default False.

  • experiment.verbose - if True, extends the messaging on the console.
    Default False.

  • experiment.exist_ok - if True, the experiment will not fail if the
    destination directory exists. This is to avoid results overwrites. Default
    False.
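
For orientation, a minimal sketch of these experiment-level bindings in a gin config might look as follows; the paths are placeholders, and experiment.model is configured separately as described in Model setup:

    load_data.path = '/data/my_data.csv'
    load_xy.path = '/data/my_xy.csv'

    experiment.destination = 'result'
    experiment.omit_datetime = False
    experiment.verbose = True
    experiment.exist_ok = False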

Model setup

divik models

To use the DiviK algorithm in an experiment, a config file must:

  1. Import the algorithms into the scope, e.g.:

    import divik.cluster

  2. Point the experiment to the algorithm to use, e.g.:

    experiment.model = @DiviK()

  3. Configure the algorithm, e.g.:

    DiviK.distance = 'euclidean'
    DiviK.verbose = True

Sample config with KMeans

Below you can check a sample configuration file that sets up a simple KMeans:

    import divik.cluster

    KMeans.n_clusters = 3
    KMeans.distance = "correlation"
    KMeans.init = "kdtree_percentile"
    KMeans.leaf_size = 0.01
    KMeans.percentile = 99.0
    KMeans.max_iter = 100
    KMeans.normalize_rows = True

    experiment.model = @KMeans()
    experiment.omit_datetime = True
    experiment.verbose = True
    experiment.exist_ok = True

Sample config with DiviK

Below is a configuration file with a full setup of DiviK. DiviK requires an automated clustering method for the stop condition and a separate one for clustering. Here we use GAPSearch for the stop condition and DunnSearch for selecting the number of clusters. These in turn require a KMeans method set for a specific distance metric, etc.:

    import divik.cluster

    KMeans.n_clusters = 1
    KMeans.distance = "correlation"
    KMeans.init = "kdtree_percentile"
    KMeans.leaf_size = 0.01
    KMeans.percentile = 99.0
    KMeans.max_iter = 100
    KMeans.normalize_rows = True

    GAPSearch.kmeans = @KMeans()
    GAPSearch.max_clusters = 2
    GAPSearch.n_jobs = 1
    GAPSearch.seed = 42
    GAPSearch.n_trials = 10
    GAPSearch.sample_size = 1000
    GAPSearch.drop_unfit = True
    GAPSearch.verbose = True

    DunnSearch.kmeans = @KMeans()
    DunnSearch.max_clusters = 10
    DunnSearch.method = "auto"
    DunnSearch.inter = "closest"
    DunnSearch.intra = "furthest"
    DunnSearch.sample_size = 1000
    DunnSearch.seed = 42
    DunnSearch.n_jobs = 1
    DunnSearch.drop_unfit = True
    DunnSearch.verbose = True

    DiviK.kmeans = @DunnSearch()
    DiviK.fast_kmeans = @GAPSearch()
    DiviK.distance = "correlation"
    DiviK.minimal_size = 200
    DiviK.rejection_size = 2
    DiviK.minimal_features_percentage = 0.005
    DiviK.features_percentage = 1.0
    DiviK.normalize_rows = True
    DiviK.use_logfilters = True
    DiviK.filter_type = "gmm"
    DiviK.n_jobs = 1
    DiviK.verbose = True

    experiment.model = @DiviK()
    experiment.omit_datetime = True
    experiment.verbose = True
    experiment.exist_ok = True

scikit-learn models

For a model to be used with fit-clusters, it needs to be marked as gin.configurable. While this is true for DiviK and the remaining algorithms within the divik package, scikit-learn requires additional setup.

  1. Import the helper module:

    import divik.core.gin_sklearn_configurables

  2. Point the experiment to the algorithm to use, e.g.:

    experiment.model = @MeanShift()

  3. Configure the algorithm, e.g.:

    MeanShift.n_jobs = -1
    MeanShift.max_iter = 300

WARNING: Importing both scikit-learn and divik will result in an ambiguity when using e.g. KMeans. In such a case it is necessary to point to specific algorithms by their full name, e.g. divik.cluster._kmeans._core.KMeans.
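
For instance, a sketch of such fully-qualified bindings (following gin's module-scoped naming) could look like this:

    # Disambiguate by binding to the divik implementation explicitly
    divik.cluster._kmeans._core.KMeans.n_clusters = 3
    experiment.model = @divik.cluster._kmeans._core.KMeans()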

Sample config with MeanShift

Below you can check a sample configuration file that sets up a simple MeanShift:

    import divik.core.gin_sklearn_configurables

    MeanShift.cluster_all = True
    MeanShift.n_jobs = -1
    MeanShift.max_iter = 300

    experiment.model = @MeanShift()
    experiment.omit_datetime = True
    experiment.verbose = True
    experiment.exist_ok = True

Pipelines

scikit-learn Pipelines have a separate section to provide an additional explanation, even though these are part of scikit-learn.

  1. Import the helper module:

    import divik.core.gin_sklearn_configurables

  2. Import the algorithms into the scope:

    import divik.feature_extraction

  3. Point the experiment to the algorithm to use, e.g.:

    experiment.model = @Pipeline()

  4. Configure the algorithms, e.g.:

    MeanShift.n_jobs = -1
    MeanShift.max_iter = 300

  5. Configure the pipeline:

    Pipeline.steps = [
        ('histogram_equalization', @HistogramEqualization()),
        ('exims', @EximsSelector()),
        ('pca', @KneePCA()),
        ('mean_shift', @MeanShift()),
    ]

  6. (If needed) configure the steps that require spatial coordinates:

    experiment.steps_that_require_xy = ['exims']

Sample config with Pipeline

Below you can check a sample configuration file that sets up a simple Pipeline:

    import divik.core.gin_sklearn_configurables
    import divik.feature_extraction

    MeanShift.n_jobs = -1
    MeanShift.max_iter = 300

    Pipeline.steps = [
        ('histogram_equalization', @HistogramEqualization()),
        ('exims', @EximsSelector()),
        ('pca', @KneePCA()),
        ('mean_shift', @MeanShift()),
    ]

    experiment.model = @Pipeline()
    experiment.steps_that_require_xy = ['exims']
    experiment.omit_datetime = True
    experiment.verbose = True
    experiment.exist_ok = True

Custom models

The fit-clusters executable can work with custom algorithms as well.

  1. Mark an algorithm class gin.configurable at definition time:

    import gin

    @gin.configurable
    class MyClustering:
        pass

or when importing them from a library:

    import gin

    gin.external_configurable(MyClustering)

  2. Define artifact saving methods:

    from divik.core.io import saver

    @saver
    def save_my_clustering(model, fname_fn, **kwargs):
        if not hasattr(model, 'my_custom_field_'):
            return
        # custom saving logic comes here

There are some default savers defined, which are compatible with lots of divik and scikit-learn algorithms, supporting things like:

  • model pickling

  • JSON summary saving

  • labels saving (.npy, .csv)

  • centroids saving (.npy, .csv)

  • pipeline saving

A saver should be highly reusable and could be a pleasant contribution to the divik library.

  3. In config, import the module which marks your algorithm configurable:

    import myclustering

  4. Continue with the algorithm setup and plumbing as in the previous scenarios; a combined sketch is shown below.
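
Putting the pieces together, a hypothetical end-to-end config for such a custom model might look as follows; MyClustering and its n_clusters parameter are illustrative names only, not part of divik:

    import myclustering

    # illustrative parameter of the hypothetical MyClustering estimator
    MyClustering.n_clusters = 5

    experiment.model = @MyClustering()
    experiment.destination = 'result'
    experiment.verbose = True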


DiviK package

Python implementation of Divisive iK-means (DiviK) algorithm.

Tools within this package

  • Clustering at your command line with fit-clusters

  • Set of algorithm implementations for unsupervised analyses

    • Clustering

      • DiviK - hands-free clustering method with built-in feature selection

      • K-Means with Dunn method for selecting the number of clusters

      • K-Means with GAP index for selecting the number of clusters

      • Modular K-Means implementation with custom distance metrics and initializations

      • Two-step meta-clustering

    • Feature extraction

      • PCA with knee-based components selection

      • Locally Adjusted RBF Spectral Embedding

    • Feature selection

      • EXIMS

      • Gaussian Mixture Model based data-driven feature selection

        • High Abundance And Variance Selector - allows you to select highly variant features above noise level, based on GMM-decomposition

      • Outlier based selector

        • Outlier Abundance And Variance Selector - allows you to select highly variant features above noise level, based on outlier detection

      • Percentage based selector - allows you to select highly variant features above noise level with your predefined thresholds for each

    • Sampling

      • Stratified Sampler - generates samples of a fixed number of rows from a given dataset, preserving group proportions

      • Uniform PCA Sampler - generates samples of random observations within the boundaries of an original dataset, preserving the rotation of the data

      • Uniform Sampler - generates samples of random observations within the boundaries of an original dataset

Installation

Docker

The recommended way to use this software is through Docker. This is the most convenient way if you want to use the divik application.

To install the latest stable version use:

    docker pull gmrukwa/divik
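
As a sketch only (assuming the image exposes the fit-clusters entry point and that your data and config live in the current directory), a containerized run could then be wired like this:

    docker run --rm -it \
        -v "$(pwd)":/data \
        gmrukwa/divik \
        fit-clusters \
          --param load_data.path='/data/my_data.csv' \
          --config /data/my-config.gin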

Python package

Prerequisites for installation of the base package:

  • Python 3.6 / 3.7 / 3.8

  • a compiler capable of compiling the native C code, with OpenMP support

Installation of OpenMP for Ubuntu / Debian

You should have it already installed with the GCC compiler, but if somehow not, try the following:

    sudo apt-get install libgomp1

Installation of OpenMP for Mac

OpenMP is available as part of LLVM. You may need to install it with conda:

    conda install -c conda-forge "compilers>=1.0.4,!=1.1.0" llvm-openmp

DiviK Installation

Having the prerequisites installed, one can install the latest base version of the package:

    pip install divik

If you want to have compatibility with gin-config, you can install the necessary extras with:

    pip install divik[gin]

Note: Remember about \ before [ and ] in the zsh shell.

You can install all extras with:

    pip install divik[all]

High-Volume Data Considerations

If you are using DiviK to run an analysis that could fail to fit into the RAM of your computer, consider disabling the default parallelism and switching to dask. It's easy to achieve through configuration:

  • set all parameters named n_jobs to 1;

  • set all parameters named allow_dask to True.

Note: Never set n_jobs>1 and allow_dask=True at the same time, as the computations will freeze due to how multiprocessing and dask handle parallelism.
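
In a gin config for fit-clusters, the overrides could be sketched as below, mirroring the DiviK sample config shown earlier; which estimators actually expose allow_dask depends on your setup, so the last binding is a placeholder to adapt:

    # keep multiprocessing-based parallelism off everywhere
    DiviK.n_jobs = 1
    DunnSearch.n_jobs = 1
    GAPSearch.n_jobs = 1

    # placeholder name - substitute a configurable from your setup that exposes allow_dask
    SomeEstimator.allow_dask = True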

Known Issues

Segmentation Fault

It can happen if the gamred_native package (part of the divik package) was compiled with a different numpy ABI than scikit-learn. This could happen if you used a different set of compilers than the developers of the scikit-learn package.

In such a case, a handler is defined to display the stack trace. If the trace comes from _matlab_legacy.py, this is most probably the issue.

To resolve the issue, consider following the installation instructions once again. The exact versions get updated to avoid the issue.

Contributing

Contribution guide will be developed soon.

Format the code with:

    isort -m 3 --fgw 3 --tc .
    black -t py36 .

References

This software is part of the contribution made by the Data Mining Group of Silesian University of Technology, the rest of which is published here.

Mrukwa, G. and Polanska, J., 2020. DiviK: Divisive intelligent K-means for hands-free unsupervised clustering in biological big data. arXiv preprint arXiv:2009.10706.
