# DiviK package

## Tools within this package

* Clustering at your command line with [`fit-clusters`](https://sut-data-mining.gitbook.io/divik/cluster-analysis)
* Set of algorithm implementations for unsupervised analyses
  * [Clustering](https://divik.readthedocs.io/en/latest/modules/divik.cluster.html)
    * [DiviK](https://divik.readthedocs.io/en/latest/modules/divik.cluster.html#divik.cluster.DiviK) - hands-free clustering method with built-in feature selection
    * [K-Means with Dunn method](https://divik.readthedocs.io/en/latest/modules/divik.cluster.html#divik.cluster.DunnSearch) for selecting the number of clusters
    * [K-Means with GAP index](https://divik.readthedocs.io/en/latest/modules/divik.cluster.html#divik.cluster.GAPSearch) for selecting the number of clusters
    * Modular [K-Means implementation](https://divik.readthedocs.io/en/latest/modules/divik.cluster.html#divik.cluster.KMeans) with custom distance metrics and initializations
    * [Two-step](https://divik.readthedocs.io/en/latest/modules/divik.cluster.html#divik.cluster.TwoStep) meta-clustering
  * [Feature extraction](https://divik.readthedocs.io/en/latest/modules/divik.feature_extraction.html)
    * [PCA with knee-based components selection](https://divik.readthedocs.io/en/latest/modules/divik.feature_extraction.html#divik.feature_extraction.KneePCA)
    * [Locally Adjusted RBF Spectral Embedding](https://divik.readthedocs.io/en/latest/modules/divik.feature_extraction.html#divik.feature_extraction.LocallyAdjustedRbfSpectralEmbedding)
  * [Feature selection](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html)
    * [EXIMS](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html#divik.feature_selection.EximsSelector)
    * [Gaussian Mixture Model based](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html#divik.feature_selection.GMMSelector) data-driven feature selection
      * [High Abundance And Variance Selector](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html#divik.feature_selection.HighAbundanceAndVarianceSelector) - allows you to select highly variant features above noise level, based on GMM-decomposition
    * [Outlier based selector](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html#divik.feature_selection.OutlierSelector)
      * [Outlier Abundance And Variance Selector](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html#divik.feature_selection.OutlierAbundanceAndVarianceSelector) - allows you to select highly variant features above noise level, based on outlier detection
    * [Percentage based selector](https://divik.readthedocs.io/en/latest/modules/divik.feature_selection.html#divik.feature_selection.PercentageSelector) - allows you to select highly variant features above noise level with your predefined thresholds for each
  * [Sampling](https://divik.readthedocs.io/en/latest/modules/divik.sampler.html)
    * [Stratified Sampler](https://divik.readthedocs.io/en/latest/modules/divik.sampler.html#divik.sampler.StratifiedSampler) - generates samples with a fixed number of rows from a given dataset, preserving group proportions
    * [Uniform PCA Sampler](https://divik.readthedocs.io/en/latest/modules/divik.sampler.html#divik.sampler.UniformPCASampler) - generates samples of random observations within the boundaries of the original dataset, preserving the rotation of the data
    * [Uniform Sampler](https://divik.readthedocs.io/en/latest/modules/divik.sampler.html#divik.sampler.UniformSampler) - generates samples of random observations within boundaries of an original dataset

## Installation

### Docker

The recommended way to use this software is through [Docker](https://www.docker.com/). It is the most convenient option if you want to use the `divik` application.

To install the latest stable version, use:

```bash
docker pull gmrukwa/divik
```

### Python package

Prerequisites for installing the base package:

* Python 3.6 / 3.7 / 3.8
* a compiler capable of compiling native C code with OpenMP support

#### **Installation of OpenMP for Ubuntu / Debian**

OpenMP should already be installed along with the GCC compiler. If it is not, try the following:

```bash
sudo apt-get install libgomp1
```

#### **Installation of OpenMP for Mac**

OpenMP is available as part of LLVM. You may need to install it with conda:

```bash
conda install -c conda-forge "compilers>=1.0.4,!=1.1.0" llvm-openmp
```

#### **DiviK Installation**

With the prerequisites installed, you can install the latest base version of the package:

```bash
pip install divik
```

If you want compatibility with [`gin-config`](https://github.com/google/gin-config), you can install the necessary extras with:

```bash
pip install divik[gin]
```

**Note:** In the `zsh` shell, escape `[` and `]` with `\`, e.g. `pip install divik\[gin\]`.

You can install all extras with:

```bash
pip install divik[all]
```

## High-Volume Data Considerations

If you are using DiviK to run an analysis that may not fit into the RAM of your computer, consider disabling the default parallelism and switching to [dask](https://dask.org/). This is easy to achieve through configuration:

* set all parameters named `n_jobs` to `1`;
* set all parameters named `allow_dask` to `True`.

**Note:** Never set `n_jobs>1` and `allow_dask=True` at the same time. The computations will freeze due to how `multiprocessing` and `dask` handle parallelism.
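
The two rules above can be applied mechanically to a configuration. The following is a minimal sketch, not part of DiviK: it assumes the parameters are available as a flat mapping in the scikit-learn `get_params()` style, where nested parameters are named like `kmeans__allow_dask`. The helper names and the example parameter names are hypothetical.

```python
def dask_safe_overrides(params):
    """Return overrides setting every `n_jobs` to 1 and every `allow_dask` to True.

    `params` is a flat mapping of parameter names to values. Nested names are
    assumed to use the `outer__inner` convention (an assumption of this sketch).
    """
    overrides = {}
    for name in params:
        leaf = name.rsplit("__", 1)[-1]  # last component of a nested name
        if leaf == "n_jobs":
            overrides[name] = 1
        elif leaf == "allow_dask":
            overrides[name] = True
    return overrides


def mixes_parallelism(params):
    """Return True if some `n_jobs` exceeds 1 while some `allow_dask` is True.

    Only the literal `n_jobs > 1` case from the note is checked; values such
    as -1 or None are not interpreted here.
    """
    leaves = [(name.rsplit("__", 1)[-1], value) for name, value in params.items()]
    many_jobs = any(
        leaf == "n_jobs" and isinstance(value, int) and value > 1
        for leaf, value in leaves
    )
    dask_on = any(leaf == "allow_dask" and value is True for leaf, value in leaves)
    return many_jobs and dask_on


# Hypothetical parameter names, used only for illustration.
params = {"n_jobs": 4, "kmeans__allow_dask": True, "kmeans__n_clusters": 2}
fixed = {**params, **dask_safe_overrides(params)}
```

If the estimators follow the scikit-learn `get_params` / `set_params` convention, the overrides could presumably be applied with `estimator.set_params(**dask_safe_overrides(estimator.get_params()))`.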

## Known Issues

### Segmentation Fault

This can happen if the `gamred_native` package (part of the `divik` package) was compiled against a different NumPy ABI than scikit-learn. This may be the case if you used a different set of compilers than the developers of the scikit-learn package.

In such a case, a handler is defined to display the stack trace. If the trace comes from `_matlab_legacy.py`, this is most probably the issue.

To resolve the issue, follow the installation instructions once again. The exact versions listed there get updated to avoid this issue.
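
When reinstalling or reporting this problem, it may help to record which versions are actually installed. The sketch below is not part of DiviK and does not detect an ABI mismatch by itself. It only lists the installed versions of the packages involved, using the standard-library `importlib.metadata` module (available from Python 3.8; on 3.6 / 3.7 the `importlib_metadata` backport would be needed). The helper name is hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages=("numpy", "scikit-learn", "divik")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package, e.g. for pasting into an issue report.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```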

## Contributing

A contribution guide will be developed soon.

Format the code with:

```bash
isort -m 3 --fgw 3 --tc .
black -t py36 .
```

## References

This software is part of the contribution made by the [Data Mining Group of Silesian University of Technology](http://www.zaed.polsl.pl/), the rest of which is published [here](https://github.com/ZAEDPolSl).

[Mrukwa, G. and Polanska, J., 2020. DiviK: Divisive intelligent K-means for hands-free unsupervised clustering in biological big data. *arXiv preprint arXiv:2009.10706*.](https://arxiv.org/abs/2009.10706)

