ppforest2 v0.1.0
Projection Pursuit Decision Trees and Random Forests
ppforest2 is a fast, memory-efficient implementation of Projection Pursuit Random Forests, built on Projection Pursuit (oblique) Decision Trees. By learning linear projections at each split, the model captures complex structure that axis-aligned trees often miss, without sacrificing interpretability or scalability.
The C++ core is the single source of truth for the implementation. Language bindings (R, Python planned) wrap the core via thin interface layers.
The library is organised around a small set of core abstractions:
| Abstraction | Header | Purpose |
|---|---|---|
| Tree | models/Tree.hpp | A single projection pursuit decision tree |
| Forest | models/Forest.hpp | An ensemble of bootstrap trees with OOB estimation |
| TrainingSpec | models/TrainingSpec.hpp | Composes seven strategies into a training configuration |
| ProjectionPursuit | strategies/pp/ProjectionPursuit.hpp | Projection pursuit index optimisation (e.g. PDA) |
| VariableSelection | strategies/vars/VariableSelection.hpp | Variable subset selection |
| Cutpoint | strategies/cutpoint/Cutpoint.hpp | Split cutpoint computation |
| StopRule | strategies/stop/StopRule.hpp | Node stopping condition |
| Binarization | strategies/binarize/Binarization.hpp | Multiclass → binary regrouping |
| Grouping | strategies/grouping/Grouping.hpp | Group partition management |
| LeafStrategy | strategies/leaf/LeafStrategy.hpp | Leaf node creation |
| GroupPartition | stats/GroupPartition.hpp | Grouped observations with arbitrary row indices |
| VariableImportance | models/Evaluation.hpp | Three variable importance measures (permuted, projections, weighted) |
Tree training is parameterised by seven pluggable strategies. Concrete implementations are composed at runtime via TrainingSpec, so new optimisation criteria, variable selection methods, or splitting rules can be added without changing the tree-building logic.
- [...] lambda). lambda = 0 gives standard LDA; lambda in (0, 1] penalises the within-group covariance matrix.
- [...] count variables uniformly at random, for forests).

TrainingSpec is a single concrete class that composes these strategies together with forest-level parameters (size, seed, threads, max retries).
Strategies are held via shared_ptr and are immutable after construction, so TrainingSpec can be freely copied and shared across trees without deep cloning. Each strategy implements to_json() for serialisation.
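As a rough illustration of why shared, immutable strategies make copying cheap, the sketch below composes one strategy into a small spec. All names (StopRuleSketch, MinSizeStop, SpecSketch, copy_shares_strategy) are hypothetical and are not taken from the ppforest2 headers; the real interfaces and JSON layout may differ. It assumes C++11 or later.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for one strategy interface (not the ppforest2 header).
struct StopRuleSketch {
    virtual ~StopRuleSketch() = default;
    virtual bool should_stop(int node_size) const = 0;
    virtual std::string to_json() const = 0;
};

// A concrete strategy whose state is fixed at construction.
class MinSizeStop final : public StopRuleSketch {
public:
    explicit MinSizeStop(int min_size) : min_size_(min_size) {}

    bool should_stop(int node_size) const override {
        return node_size < min_size_;
    }

    // Illustrative JSON layout only; the library's format is not shown here.
    std::string to_json() const override {
        return "{\"name\":\"min_size\",\"min_size\":" +
               std::to_string(min_size_) + "}";
    }

private:
    const int min_size_;
};

// Hypothetical spec: one strategy slot plus two forest-level parameters.
// The strategy is held as pointer-to-const, so holders cannot mutate it.
struct SpecSketch {
    std::shared_ptr<const StopRuleSketch> stop;
    int size;
    unsigned seed;
};

// Copying the spec copies the pointer, not the strategy object.
inline bool copy_shares_strategy() {
    const SpecSketch spec{std::make_shared<MinSizeStop>(5), 100, 42u};
    const SpecSketch copy = spec;  // bumps the reference count only
    assert(copy.stop->should_stop(3));
    return copy.stop.get() == spec.stop.get() && spec.stop.use_count() == 2;
}
```

In this sketch both specs point at the same strategy object. Because the strategy is reachable only through const member functions, handing the same instance to many trees needs no deep clone.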
For a step-by-step guide to implementing new strategies, see Extending: Custom Strategies.
Two visitor interfaces avoid dynamic_cast and keep traversal logic decoupled from the model types.
For a step-by-step guide to implementing new visitors, see Extending: Custom Visitors.
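The general idea of replacing dynamic_cast with double dispatch can be sketched as follows. This is a generic visitor over a toy tree, not the ppforest2 interfaces: NodeSketch, SplitNode, LeafNode, NodeVisitorSketch, LeafCounter and count_leaves are all hypothetical names chosen for illustration, and the sketch assumes C++11 or later.

```cpp
#include <memory>
#include <utility>

struct SplitNode;
struct LeafNode;

// Hypothetical visitor interface: one overload per concrete node type.
struct NodeVisitorSketch {
    virtual ~NodeVisitorSketch() = default;
    virtual void visit(const SplitNode& node) = 0;
    virtual void visit(const LeafNode& node) = 0;
};

// Hypothetical node base: each node dispatches to the matching overload.
struct NodeSketch {
    virtual ~NodeSketch() = default;
    virtual void accept(NodeVisitorSketch& visitor) const = 0;
};

struct LeafNode final : NodeSketch {
    int label;
    explicit LeafNode(int l) : label(l) {}
    void accept(NodeVisitorSketch& visitor) const override {
        visitor.visit(*this);
    }
};

struct SplitNode final : NodeSketch {
    std::unique_ptr<NodeSketch> lower;
    std::unique_ptr<NodeSketch> upper;
    SplitNode(std::unique_ptr<NodeSketch> lo, std::unique_ptr<NodeSketch> up)
        : lower(std::move(lo)), upper(std::move(up)) {}
    void accept(NodeVisitorSketch& visitor) const override {
        visitor.visit(*this);
    }
};

// Traversal logic lives in the visitor, not in the node classes.
struct LeafCounter final : NodeVisitorSketch {
    int leaves = 0;
    void visit(const SplitNode& node) override {
        node.lower->accept(*this);  // recurse into both children
        node.upper->accept(*this);
    }
    void visit(const LeafNode&) override { ++leaves; }
};

inline int count_leaves(const NodeSketch& root) {
    LeafCounter counter;
    root.accept(counter);
    return counter.leaves;
}
```

A new traversal (printing, serialising, collecting projections) would be another visitor class in this sketch; the node types stay unchanged and no runtime type checks are needed.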
Three measures quantify each variable's contribution to predictions:
- [...] Var(y_oob)). Averaged over all trees. See Forest::vi_permuted().
- [...] 1 - error_rate; regression: max(0, 1 - NMSE)), and each split contributes I_s × |a_j|. See Forest::vi_weighted_projections().

The core uses single-precision (float) arithmetic for all feature data. This is sufficient for classification and reduces memory usage. If a strategy needs higher precision internally (e.g. for regression loss computation), it can cast to double within its own scope. See types::Feature.
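A minimal sketch of the "cast to double within its own scope" idea: the data stays in float, while the loss is accumulated in double. The alias FeatureSketch and the function mean_squared_error are hypothetical and only stand in for whatever types::Feature and the library's own loss code actually look like.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumed stand-in for the library's single-precision feature type.
using FeatureSketch = float;

// Mean squared error over float data, accumulated in double so that
// rounding error does not grow with the number of observations as fast
// as it would with a float accumulator.
inline double mean_squared_error(const std::vector<FeatureSketch>& y,
                                 const std::vector<FeatureSketch>& y_hat) {
    if (y.empty() || y.size() != y_hat.size()) {
        throw std::invalid_argument("inputs must be non-empty and equal in size");
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        // Widen each value before subtracting; storage remains float.
        const double diff =
            static_cast<double>(y[i]) - static_cast<double>(y_hat[i]);
        sum += diff * diff;
    }
    return sum / static_cast<double>(y.size());
}
```

Only the local accumulator and the intermediate difference are double here, so the memory footprint of the feature data is unaffected.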
| Namespace | Purpose |
|---|---|
| ppforest2 | Core model types: Tree, Forest, TrainingSpec, VariableImportance |
| ppforest2::types | Numeric type aliases (Feature, Outcome, FeatureMatrix, ...) |
| ppforest2::stats | Statistical infrastructure: RNG, GroupPartition, Uniform, ConfusionMatrix |
| ppforest2::pp | Projection pursuit strategies |
| ppforest2::vars | Variable selection strategies |
| ppforest2::cutpoint | Split cutpoint strategies |
| ppforest2::stop | Stopping rule strategies |
| ppforest2::binarize | Multiclass binarization strategies |
| ppforest2::grouping | Group partition strategies |
| ppforest2::leaf | Leaf node creation strategies |
| ppforest2::serialization | JSON serialization and deserialization |
| ppforest2::viz | Visualisation visitors for tree structure and decision boundaries |
| ppforest2::io | File I/O (CSV, JSON) and presentation utilities |
| ppforest2::io::style | ANSI-aware colored terminal output |
| ppforest2::io::layout | Column-driven table formatting |
| ppforest2::math | Numeric comparison utilities |
| ppforest2::cli | Command-line interface parsing and subcommands |
| ppforest2::sys | System utilities (memory measurement) |
Results are identical across platforms (Ubuntu/GCC, macOS/Clang, Windows/MinGW) for the same seed. This is enforced by: