Metadata-Version: 2.4
Name: timc-vector-toolkit
Version: 20250509
Summary: Ensemble package for all Tutte Institute libraries involved in data vectorization, dimension reduction, clustering and exploratory visualization
Author-email: Benoit Hamelin <benoit.hamelin@cyber.gc.ca>
Requires-Python: >=3.10
Requires-Dist: datamapplot>=0.6.4
Requires-Dist: einops>=0.8.1
Requires-Dist: evoc>=0.1.3
Requires-Dist: fast-hdbscan>=0.2.2
Requires-Dist: glasbey>=0.3.0
Requires-Dist: hdbscan>=0.8.40
Requires-Dist: ipykernel>=6.30.1
Requires-Dist: toponymy>=0.3.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: umap-learn[plot]>=0.5.8
Requires-Dist: vectorizers>=0.2.2
Description-Content-Type: text/markdown

This package brings together the libraries that the Tutte Institute has
built for exploratory analysis, unsupervised learning and interactive
visualization of unstructured data. It bundles the individual packages
described below.

Learn more at <https://github.com/TutteInstitute>


----------------------
Vector space embedding
----------------------

**vectorizers**

Embeds various types of data into high-dimensional vector spaces. This
includes data consisting of distributions over vector spaces, which are
embedded by approximately solving optimal transport problems.


-----------------------------------
Nearest neighbour network discovery
-----------------------------------

**pynndescent**

Builds the k-nearest neighbour graph of a set of high-dimensional vectors
expressed as either dense or sparse arrays, under a wide range of distances
and pseudo-metrics. Doubles as an in-memory index for querying the
neighbours of arbitrary vectors.
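
To make the notion of a k-nearest neighbour graph concrete, here is a minimal
sketch that computes it exactly, by exhaustive search under the Euclidean
distance. The function `knn_graph` is a hypothetical helper written for this
illustration and is not part of pynndescent, which instead approximates the
graph so that it scales to large vector sets.

```python
import math


def knn_graph(vectors, k):
    """Exact k-nearest-neighbour graph by exhaustive search (Euclidean)."""
    graph = []
    for i, v in enumerate(vectors):
        # Distance from v to every other vector, skipping v itself.
        dists = [(math.dist(v, w), j) for j, w in enumerate(vectors) if j != i]
        dists.sort()
        # Keep the indices of the k closest vectors, nearest first.
        graph.append([j for _, j in dists[:k]])
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
neighbours = knn_graph(points, k=2)
```

Row `i` of `neighbours` lists the indices of the two vectors closest to
`points[i]`. The exhaustive search compares every pair of vectors, which is
the cost that an approximate method avoids.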


-------------------
Dimension reduction
-------------------

**umap** (package name is umap-learn)

Uniform Manifold Approximation and Projection is a manifold learning
dimension reduction algorithm that preserves the local similarity
structure of a set of vectors. It works on both dense and sparse
vector arrays.


----------
Clustering
----------

**hdbscan**

Hierarchical Density-Based Spatial Clustering of Applications with Noise.
This clustering algorithm partitions a set of vectors into groups based on
mutual reachability distance, discarding outliers as noise.
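
As a rough illustration of mutual reachability distance, the sketch below
computes it by brute force for a handful of 2D points. It assumes the usual
definition: the largest of the two core distances and the direct distance,
where the core distance of a vector is the distance to its k-th nearest
neighbour. The helper names and the choice to exclude the vector itself when
counting neighbours are conventions of this sketch, not the hdbscan API.

```python
import math


def core_distances(vectors, k):
    """Distance from each vector to its k-th nearest other vector."""
    cores = []
    for i, v in enumerate(vectors):
        dists = sorted(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(vectors, k, a, b):
    """Largest of the two core distances and the direct distance."""
    cores = core_distances(vectors, k)
    return max(cores[a], cores[b], math.dist(vectors[a], vectors[b]))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
dense_pair = mutual_reachability(points, k=2, a=0, b=1)
outlier_pair = mutual_reachability(points, k=2, a=0, b=3)
```

The isolated point at `(10.0, 0.0)` has a large core distance, so every pair
involving it is pushed far apart. This is what lets a density-based method
set such points aside as noise.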

**fast_hdbscan**

A new implementation of HDBSCAN optimized for runtime efficiency by
restricting computations to low-dimensional vectors under the Euclidean
metric.

**evoc**

Embedding Vector-Oriented Clustering is a new clustering algorithm that
streamlines and approximates the combined UMAP and HDBSCAN approach to
clustering, so as to compute high-quality clusterings of high-dimensional
vector sets at a fraction of the computational cost.
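
For reference, the two-step pipeline that evoc streamlines might look like the
sketch below: reduce dimension with UMAP, then cluster the result with
HDBSCAN. The function name `umap_then_hdbscan` and the parameter values are
assumptions chosen for illustration, not recommended settings.

```python
def umap_then_hdbscan(vectors, n_components=5, min_cluster_size=20):
    """Reduce dimension with UMAP, then cluster the result with HDBSCAN."""
    # Imported here so that the heavy dependencies load only when needed.
    import hdbscan
    import umap

    reduced = umap.UMAP(n_components=n_components).fit_transform(vectors)
    # The label -1 marks points that HDBSCAN discards as noise.
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
```

Calling `umap_then_hdbscan(vectors)` on a 2D array of shape
`(n_samples, n_features)` returns one integer cluster label per row.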


-------------------------
Interactive visualization
-------------------------

**datamapplot**

Creates static plots and interactive views of 2D vectors and metadata,
with an emphasis on presentation aesthetics and interactive exploration
for insight discovery.

**toponymy**

Generates a multiresolution hierarchy of annotation labels for text
embeddings by querying a large language model with representative,
distinctive and contrastive characterizations of data clusters. These
labels are then useful for annotating data maps produced with
datamapplot.

