ppforest2 v0.1.0
Projection Pursuit Decision Trees and Random Forests
ppforest2 is a fast, memory-efficient implementation of Projection Pursuit Random Forests, built on Projection Pursuit (oblique) Decision Trees. By learning linear projections at each split, the model captures complex structure that axis-aligned trees often miss, without sacrificing interpretability or scalability.
The C++ core is the single source of truth for the implementation. Language bindings (R, Python planned) wrap the core via thin interface layers.
The library is organised around a small set of core abstractions:
| Abstraction | Header | Purpose |
|---|---|---|
| Tree | models/Tree.hpp | A single projection pursuit decision tree |
| Forest | models/Forest.hpp | An ensemble of bootstrap trees with OOB estimation |
| TrainingSpec | models/TrainingSpec.hpp | Composes a PP, DR, and SR strategy into a training configuration |
| PPStrategy | models/PPStrategy.hpp | Projection pursuit index optimisation (e.g. PDA) |
| DRStrategy | models/DRStrategy.hpp | Dimensionality reduction / variable selection |
| SRStrategy | models/SRStrategy.hpp | Split threshold rule |
| GroupPartition | stats/GroupPartition.hpp | Contiguous-block representation of grouped observations |
| VariableImportance | models/VariableImportance.hpp | Three variable importance measures (permuted, projections, weighted) |
Tree training is parameterised by three pluggable strategies. Concrete implementations are composed at runtime via TrainingSpec, so new optimisation criteria or variable selection methods can be added without changing the tree-building logic.
For example, the PDA projection pursuit strategy is parameterised by a penalty lambda: lambda = 0 gives standard LDA, while lambda in (0, 1] penalises the within-group covariance matrix. For forests, the dimensionality reduction strategy can select n_vars variables uniformly at random.
TrainingSpec is a single concrete class that composes these strategies together with forest-level parameters (size, seed, threads, max retries).
Strategies are held via shared_ptr and are immutable after construction, so TrainingSpec can be freely copied and shared across trees without deep cloning. Each strategy implements to_json() for serialisation.
For a step-by-step guide to implementing new strategies, see Extending: Custom Strategies.
Two visitor interfaces avoid dynamic_cast and keep traversal logic decoupled from the model types.
For a step-by-step guide to implementing new visitors, see Extending: Custom Visitors.
Three measures quantify each variable's contribution to predictions: permuted, projections, and weighted (see models/VariableImportance.hpp).
The core uses single-precision (float) arithmetic by default. Compile with -DPPFOREST2_DOUBLE_PRECISION=ON to switch to double throughout. See types::Feature.
| Namespace | Purpose |
|---|---|
| ppforest2 | Core model types: Tree, Forest, TrainingSpec, VariableImportance |
| ppforest2::types | Numeric type aliases (Feature, Response, FeatureMatrix, ...) |
| ppforest2::stats | Statistical infrastructure: RNG, GroupPartition, Uniform, ConfusionMatrix |
| ppforest2::pp | Projection pursuit strategies |
| ppforest2::dr | Dimensionality reduction strategies |
| ppforest2::sr | Split rule strategies |
| ppforest2::serialization | JSON serialisation and deserialisation |
| ppforest2::viz | Visualisation visitors for tree structure and decision boundaries |
| ppforest2::io | File I/O (CSV, JSON) and presentation utilities |
| ppforest2::io::style | ANSI-aware coloured terminal output |
| ppforest2::io::layout | Column-driven table formatting |
| ppforest2::math | Numeric comparison utilities |
| ppforest2::cli | Command-line interface parsing and subcommands |
| ppforest2::sys | System utilities (memory measurement) |
Results are identical across platforms (Ubuntu/GCC, macOS/Clang, Windows/MinGW) for the same seed. This is enforced by: