DiviK package

Python implementation of the Divisive iK-means (DiviK) algorithm.

Tools within this package

Clustering at your command line with fit-clusters
Set of algorithm implementations for unsupervised analyses
- Clustering
  - DiviK - hands-free clustering method with built-in feature selection
  - K-Means with Dunn method for selecting the number of clusters
  - K-Means with GAP index for selecting the number of clusters
  - Modular K-Means implementation with custom distance metrics and initializations
  - Two-step meta-clustering
- Feature extraction
  - PCA with knee-based components selection
  - Locally Adjusted RBF Spectral Embedding
- Feature selection
  - EXIMS
  - Gaussian Mixture Model based data-driven feature selection
    - High Abundance And Variance Selector - allows you to select highly variant features above noise level, based on GMM-decomposition
  - Outlier based selector
    - Outlier Abundance And Variance Selector - allows you to select highly variant features above noise level, based on outlier detection
  - Percentage based selector - allows you to select highly variant features above noise level with your predefined thresholds for each
- Sampling
  - Stratified Sampler - generates samples with a fixed number of rows from a given dataset, preserving group proportions
  - Uniform PCA Sampler - generates samples of random observations within the boundaries of an original dataset, while preserving the rotation of the data
  - Uniform Sampler - generates samples of random observations within the boundaries of an original dataset
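
To make the Uniform Sampler description concrete, here is a minimal standard-library sketch of the idea: draw random observations whose features stay within the per-feature minimum and maximum of the original data. The function name uniform_sample and its parameters are hypothetical and chosen for illustration only; this is not the package's implementation or API.

```python
import random


def uniform_sample(data, n_rows, seed=0):
    """Draw n_rows random observations within the per-feature bounds of data.

    `data` is a list of equally long rows of numbers. This is an illustrative
    sketch, not the sampler shipped with the package.
    """
    rng = random.Random(seed)
    # Transpose rows into columns to find each feature's observed range.
    columns = list(zip(*data))
    bounds = [(min(col), max(col)) for col in columns]
    # Every generated value is drawn uniformly from its feature's range.
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_rows)]
```

A sample produced this way keeps the bounding box of the dataset but none of its internal structure, which is what makes it useful as a reference distribution.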

Installation

Docker

The recommended way to use this software is through Docker. It is the most convenient option if you want to use the divik application.

To install the latest stable version, use:

docker pull gmrukwa/divik

Python package

Prerequisites for installing the base package:

Python 3.6 / 3.7 / 3.8
a compiler capable of compiling native C code, with OpenMP support

Installation of OpenMP for Ubuntu / Debian

It should already be installed along with the GCC compiler. If it is not, try the following:

sudo apt-get install libgomp1

Installation of OpenMP for Mac

OpenMP is available as part of LLVM. You may need to install it with conda:

conda install -c conda-forge "compilers>=1.0.4,!=1.1.0" llvm-openmp

DiviK Installation

With the prerequisites installed, you can install the latest base version of the package:

pip install divik

For compatibility with gin-config, install the necessary extras with:

pip install divik[gin]

Note: In the zsh shell, escape [ and ] with a backslash (for example, divik\[gin\]).

You can install all extras with:

pip install divik[all]

High-Volume Data Considerations

If your analysis might not fit into your computer's RAM, consider disabling the default parallelism and switching to dask. This is easy to achieve through configuration:

set all parameters named n_jobs to 1;
set all parameters named allow_dask to True.

Note: Never set n_jobs>1 and allow_dask=True at the same time. The computations will freeze due to how multiprocessing and dask handle parallelism.
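
The two rules above can be applied mechanically to a nested configuration. The sketch below assumes the configuration is held as plain Python dictionaries and lists; the helper names use_dask and find_conflicts, and the section names in the usage note, are hypothetical and not part of the package.

```python
import copy


def use_dask(config):
    """Return a copy of config with every n_jobs set to 1 and allow_dask to True."""
    config = copy.deepcopy(config)

    def visit(node):
        if isinstance(node, dict):
            for key in node:
                if key == "n_jobs":
                    node[key] = 1
                elif key == "allow_dask":
                    node[key] = True
                else:
                    visit(node[key])
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(config)
    return config


def find_conflicts(config, path=""):
    """List paths of sections that combine n_jobs > 1 with allow_dask=True."""
    conflicts = []
    if isinstance(config, dict):
        n_jobs = config.get("n_jobs", 1)
        # Only the combination named in the note above is flagged.
        if config.get("allow_dask") is True and isinstance(n_jobs, int) and n_jobs > 1:
            conflicts.append(path or "<root>")
        for key, value in config.items():
            conflicts.extend(find_conflicts(value, f"{path}.{key}" if path else key))
    elif isinstance(config, list):
        for index, item in enumerate(config):
            conflicts.extend(find_conflicts(item, f"{path}[{index}]"))
    return conflicts
```

For example, find_conflicts({"kmeans": {"n_jobs": 4, "allow_dask": True}}) reports the "kmeans" section, and passing the same dictionary through use_dask yields a configuration with no conflicts. The original dictionary is left unchanged.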

Known Issues

Segmentation Fault

This can happen if the gamred_native package (part of the divik package) was compiled against a different numpy ABI than scikit-learn, for example if you used a different set of compilers than the developers of the scikit-learn package.

In such a case, a handler is defined to display the stack trace. If the trace comes from _matlab_legacy.py, this is most probably the issue.

To resolve the issue, consider following the installation instructions once again. The exact versions get updated to avoid the issue.
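
Before reinstalling, it can help to record which versions are currently installed. The following is a small diagnostic sketch, not part of the package: the helper name installed_versions is hypothetical, and it relies on importlib.metadata, which is only available in the standard library from Python 3.8, so on older interpreters it reports None for everything.

```python
try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:  # Python 3.6 / 3.7
    metadata = None


def installed_versions(packages=("numpy", "scikit-learn", "divik")):
    """Map each distribution name to its installed version, or None if unknown."""
    if metadata is None:
        return {name: None for name in packages}
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not found'}")
```

The output only shows what is installed. It cannot tell which numpy version a compiled extension was built against, so treat it as a starting point for comparison with the installation instructions.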

Contributing

A contribution guide will be developed soon.

Format the code with:

isort -m 3 --fgw 3 --tc .
black -t py36 .

References

This software is part of a contribution made by the Data Mining Group of the Silesian University of Technology, the rest of which is published here.

Mrukwa, G. and Polanska, J., 2020. DiviK: Divisive intelligent K-means for hands-free unsupervised clustering in biological big data. arXiv preprint arXiv:2009.10706.
