DiviK package

Python implementation of the Divisive iK-means (DiviK) algorithm.

Tools within this package

  • Clustering at your command line with fit-clusters

  • Set of algorithm implementations for unsupervised analyses

      • DiviK - hands-free clustering method with built-in feature selection

      • K-Means with Dunn method for selecting the number of clusters

Installation

Docker

The recommended way to use this software is through Docker. This is the most convenient way if you want to use the divik application.

To install the latest stable version, use:

    docker pull gmrukwa/divik

Python package

Prerequisites for installation of base package:

  • Python 3.6 / 3.7 / 3.8

  • a compiler capable of compiling the native C code, with OpenMP support
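
The interpreter requirement above can be checked before installing. The snippet below is a minimal sketch: `is_supported_python` is a hypothetical helper (not part of divik), and the version set simply mirrors the list above.

```python
import sys

# Versions listed in the prerequisites above.
SUPPORTED_VERSIONS = {(3, 6), (3, 7), (3, 8)}


def is_supported_python(version=None):
    """Return True if (major, minor) is among the listed versions.

    Hypothetical helper for illustration; it is not part of divik.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "listed as supported" if is_supported_python() else "not listed as supported"
    print(f"Python {major}.{minor} is {status}")
```

The check compares only the major and minor numbers, so any patch release of a listed version passes.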

Installation of OpenMP for Ubuntu / Debian

It should already be installed along with the GCC compiler. If it is not, try the following:

    sudo apt-get install libgomp1

Installation of OpenMP for Mac

OpenMP is available as part of LLVM. You may need to install it with conda:

    conda install -c conda-forge "compilers>=1.0.4,!=1.1.0" llvm-openmp

DiviK Installation

With the prerequisites installed, you can install the latest base version of the package:

    pip install divik

If you want compatibility with gin-config, you can install the necessary extras with:

    pip install divik[gin]

Note: In zsh, remember to put \ before [ and ], for example divik\[gin\].
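
The escaping is needed because zsh treats square brackets as glob characters. If the install command is assembled by a script, quoting the requirement has the same effect as the backslashes. A minimal sketch using the standard library follows; `pip_install_command` is a hypothetical helper name.

```python
import shlex


def pip_install_command(requirement):
    """Build a shell-safe `pip install` command line for one requirement."""
    return "pip install " + shlex.quote(requirement)


if __name__ == "__main__":
    print(pip_install_command("divik"))       # pip install divik
    print(pip_install_command("divik[gin]"))  # pip install 'divik[gin]'
```

Single quotes stop the shell from interpreting the brackets, so the quoted form can be pasted into zsh as-is.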

You can install all extras with:

    pip install divik[all]
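
After installation, one way to confirm that the packages are visible to the interpreter is to look them up without importing them. This is a sketch, not an official check: `is_installed` is a hypothetical helper, and `gin` is assumed to be the import name of the gin-config distribution.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "divik" is the base package; "gin" is assumed to be the import name
    # provided by the gin-config extra.
    for name in ("divik", "gin"):
        print(name, "found" if is_installed(name) else "missing")
```

Looking up the module spec avoids loading the native extension, so the check is safe to run even if the compiled part is broken.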

High-Volume Data Considerations

If you are using DiviK to run an analysis that may not fit into your computer's RAM, consider disabling the default parallelism and switching to dask. This is easy to achieve through configuration:

  • set all parameters named n_jobs to 1;

  • set all parameters named allow_dask to True.

Note: Never set n_jobs>1 and allow_dask=True at the same time. The computations will freeze due to how multiprocessing and dask handle parallelism.
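
The two settings and the warning above can be sketched as a pair of small helpers. This is an illustration under assumptions: the configuration is modeled as plain nested dictionaries, the section names are made up, and `prefer_dask` and `find_conflicts` are hypothetical helpers rather than divik functions. Only the parameter names `n_jobs` and `allow_dask` come from the text.

```python
def prefer_dask(config):
    """Return a copy of a nested dict with n_jobs=1 and allow_dask=True.

    Only keys that already exist are changed; no new keys are added.
    """
    result = {}
    for key, value in config.items():
        if isinstance(value, dict):
            result[key] = prefer_dask(value)
        elif key == "n_jobs":
            result[key] = 1
        elif key == "allow_dask":
            result[key] = True
        else:
            result[key] = value
    return result


def find_conflicts(config, path=""):
    """List dotted paths of sections with n_jobs > 1 and allow_dask=True."""
    conflicts = []
    n_jobs = config.get("n_jobs")
    if config.get("allow_dask") is True and isinstance(n_jobs, int) and n_jobs > 1:
        conflicts.append(path or "<root>")
    for key, value in config.items():
        if isinstance(value, dict):
            child = f"{path}.{key}" if path else key
            conflicts.extend(find_conflicts(value, child))
    return conflicts


if __name__ == "__main__":
    # Hypothetical configuration; the section names are invented.
    config = {
        "clustering": {"n_jobs": 4, "allow_dask": True},
        "selection": {"n_jobs": 2, "allow_dask": False},
    }
    print(find_conflicts(config))  # sections that would freeze
    print(prefer_dask(config))     # rewritten, dask-friendly copy
```

Running `find_conflicts` on a configuration before starting a long analysis is a cheap way to catch the freezing combination early.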

Known Issues

Segmentation Fault

A segmentation fault can occur if the gamred_native package (part of the divik package) was compiled against a different numpy ABI than scikit-learn. This can happen if you used a different set of compilers than the developers of the scikit-learn package.

In such a case, a handler is defined to display the stack trace. If the trace comes from _matlab_legacy.py, this is most probably the issue.

To resolve the issue, consider following the installation instructions once again. The exact versions listed there get updated to avoid it.
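
For readers who want to capture such a trace in their own scripts, the standard library's `faulthandler` module can dump Python tracebacks on fatal signals. Whether divik's built-in handler works this way is an assumption; the sketch below only shows the general technique, and both helper names are hypothetical.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Dump Python tracebacks to `log_path` on fatal signals such as SIGSEGV.

    Returns the open file, which must stay open while the handler is active.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file)
    return log_file


def points_to_matlab_legacy(trace_text):
    """Return True if a stack trace mentions _matlab_legacy.py."""
    return "_matlab_legacy.py" in trace_text


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "divik_crash.log")
    log_file = enable_crash_traces(path)
    print("crash traces enabled:", faulthandler.is_enabled())
    print("traces will be written to", path)
```

If a crash occurs, passing the contents of the log file to `points_to_matlab_legacy` tells you whether the symptom matches the ABI mismatch described above.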

Contributing

A contribution guide will be developed soon.

Format the code with:

    isort -m 3 --fgw 3 --tc .
    black -t py36 .

References

This software is part of the contribution made by the Data Mining Group of Silesian University of Technology, the rest of which is published here.

Algorithm implementations in this package also include:

  • Clustering

    • K-Means with GAP index for selecting the number of clusters

    • Modular K-Means implementation with custom distance metrics and initializations

    • Two-step meta-clustering

  • Feature extraction

    • PCA with knee-based components selection

    • Locally Adjusted RBF Spectral Embedding

  • Feature selection

    • EXIMS

    • Gaussian Mixture Model based data-driven feature selection

      • High Abundance And Variance Selector - allows you to select highly variant features above noise level, based on GMM-decomposition

    • Outlier based selector

      • Outlier Abundance And Variance Selector - allows you to select highly variant features above noise level, based on outlier detection

    • Percentage based selector - allows you to select highly variant features above noise level with your predefined thresholds for each

  • Sampling

    • Stratified Sampler - generates samples of a fixed number of rows from a given dataset, preserving group proportions

    • Uniform PCA Sampler - generates samples of random observations within the boundaries of an original dataset, preserving the rotation of the data

    • Uniform Sampler - generates samples of random observations within the boundaries of an original dataset

Mrukwa, G. and Polanska, J., 2020. DiviK: Divisive intelligent K-means for hands-free unsupervised clustering in biological big data. arXiv preprint arXiv:2009.10706.