ppforest2 v0.1.0
Projection Pursuit Decision Trees and Random Forests
Statistical infrastructure for training and evaluation.
Namespaces | |
| namespace | simulation |
Classes | |
| struct | ClassificationMetrics |
| Classification evaluation metrics. | |
| struct | ConfusionMatrix |
| A confusion matrix comparing predicted vs actual group labels. | |
| struct | DataPacket |
| Bundled dataset: features, response, and group labels. | |
| class | GroupPartition |
| Contiguous-block representation of grouped observations. | |
| class | Normal |
| Normal (Gaussian) random number generator. | |
| struct | RegressionMetrics |
| Regression evaluation metrics. | |
| struct | Split |
| Indices for a train/test split. | |
| class | Uniform |
| Discrete uniform random integer generator over [min, max]. | |
Typedefs | |
| using | Metrics = std::variant<ClassificationMetrics, RegressionMetrics> |
| Mode-polymorphic metrics block. | |
| using | RNG = pcg32 |
Functions | |
| float | accuracy (types::OutcomeVector const &predictions, types::GroupIdVector const &actual) |
| Fraction of predictions matching the ground-truth class label. | |
| double | error_rate (types::OutcomeVector const &predictions, types::GroupIdVector const &actual) |
| Misclassification rate: 1 - accuracy. | |
| double | error_rate (types::OutcomeVector const &predictions, types::OutcomeVector const &actual) |
| Convenience overload: float-typed labels (cast to GroupId locally). | |
| std::map< int, int > | get_labels_map (types::GroupIdVector const &y_pred, types::GroupIdVector const &y) |
| Build a sorted mapping from unique group labels to contiguous indices. | |
| std::map< types::GroupId, int > | group_indices (std::set< types::GroupId > const &groups) |
| Map each label in groups to its index in iteration order. | |
| double | mae (types::OutcomeVector const &predictions, types::OutcomeVector const &actual) |
| Mean absolute error. | |
| types::Outcome | majority_vote (std::vector< types::Outcome > const &preds) |
| Majority vote over a sequence of integer-coded class labels. | |
| types::Outcome | mean (std::vector< types::Outcome > const &preds) |
| Arithmetic mean of a sequence of outcome values. | |
| Metrics | metrics_from_outcomes (types::OutcomeVector const &y_pred, types::OutcomeVector const &y, types::Mode mode) |
| Build a mode-appropriate Metrics variant from in-memory tensors. | |
| double | mse (types::OutcomeVector const &predictions, types::OutcomeVector const &actual) |
| Mean squared error. | |
| double | r_squared (types::OutcomeVector const &predictions, types::OutcomeVector const &actual) |
| Coefficient of determination (R²). Returns 0 when total variance is 0. | |
| template<typename Derived> | |
| double | sd (Eigen::MatrixBase< Derived > const &data) |
| Sample standard deviation of a vector: sqrt(var(data)). | |
| types::FeatureVector | sd (types::FeatureMatrix const &data) |
| Column-wise sample standard deviation: element-wise sqrt of var. | |
| DataPacket | simulate (int n, int p, int G, RNG &rng, simulation::params::Classification const &params={}) |
| Generate a simulated classification dataset. | |
| DataPacket | simulate (int n, int p, RNG &rng, simulation::params::Regression const &params={}) |
| Generate a simulated regression dataset. | |
| template<typename Y> | |
| void | sort (types::FeatureMatrix &x, Y &y) |
| Sort a feature matrix and a response vector by the response values. | |
| Split | split (DataPacket const &data, float train_ratio, RNG &rng) |
| Perform a stratified random train/test split on a DataPacket. | |
| std::set< types::GroupId > | unique (types::GroupIdVector const &column) |
| Unique group labels in a response vector. | |
| template<typename Derived> | |
| double | var (Eigen::MatrixBase< Derived > const &data) |
| Sample variance of a vector (unbiased, n-1 denominator). | |
| types::FeatureVector | var (types::FeatureMatrix const &data) |
| Column-wise sample variance of a matrix. | |
Statistical infrastructure for training and evaluation.
Provides the random number generator (pcg32), discrete uniform sampling (Lemire's method), grouped-observation bookkeeping (GroupPartition), confusion matrices, data simulation, and basic descriptive statistics used throughout the training pipeline.
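As a rough illustration of the discrete uniform sampling mentioned above, the following sketch applies Lemire's multiply-and-reject technique to draw an integer in [min, max]. It is not the library's Uniform class: the function names are hypothetical, std::mt19937 stands in for pcg32, and the sketch assumes max >= min and a span smaller than 2^32.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: unbiased draw from [0, range) using Lemire's method.
// std::mt19937 is only a stand-in for a 32-bit generator such as pcg32.
std::uint32_t bounded_draw_sketch(std::mt19937& gen, std::uint32_t range) {
  // Multiply a 32-bit draw by the range; the high 32 bits are the candidate.
  std::uint64_t m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(gen())) * range;
  std::uint32_t low = static_cast<std::uint32_t>(m);
  if (low < range) {
    // threshold == 2^32 mod range; low parts below it would introduce bias.
    std::uint32_t threshold = (0u - range) % range;
    while (low < threshold) {
      m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(gen())) * range;
      low = static_cast<std::uint32_t>(m);
    }
  }
  return static_cast<std::uint32_t>(m >> 32);
}

// Hypothetical wrapper: integer in the closed interval [min, max].
// Assumes max >= min and max - min + 1 < 2^32.
int uniform_int_sketch(std::mt19937& gen, int min, int max) {
  std::uint32_t range = static_cast<std::uint32_t>(max - min) + 1u;
  return min + static_cast<int>(bounded_draw_sketch(gen, range));
}
```

The rejection loop is rarely entered for small ranges, so most draws cost one multiplication and no division.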
| using ppforest2::stats::Metrics = std::variant<ClassificationMetrics, RegressionMetrics> |
Mode-polymorphic metrics block.
Carries ClassificationMetrics for classification models and RegressionMetrics for regression models. Consumers that want the scalar / matrix can std::visit the variant instead of branching on which JSON key (or strategy mode) happens to be present.
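A consumer might dispatch on the variant roughly as follows. This is a sketch only: the two structs are stand-ins whose fields are hypothetical (the real members are not listed on this page), and describe is an illustrative name.

```cpp
#include <string>
#include <type_traits>
#include <variant>

// Stand-in types with hypothetical fields; the real structs may differ.
struct ClassificationMetrics { double error_rate; };
struct RegressionMetrics { double mse; double mae; double r_squared; };
using Metrics = std::variant<ClassificationMetrics, RegressionMetrics>;

// One visitor handles both modes; no branching on JSON keys or a mode flag.
std::string describe(Metrics const& metrics) {
  return std::visit(
      [](auto const& m) -> std::string {
        using T = std::decay_t<decltype(m)>;
        if constexpr (std::is_same_v<T, ClassificationMetrics>) {
          return "classification: error rate " + std::to_string(m.error_rate);
        } else {
          return "regression: mse " + std::to_string(m.mse);
        }
      },
      metrics);
}
```

Because the visitor must compile for every alternative, adding a third metrics type to the variant would surface as a compile-time prompt to handle it rather than a silent fall-through.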
| using ppforest2::stats::RNG = pcg32 |
| float ppforest2::stats::accuracy | ( | types::OutcomeVector const & | predictions, |
| types::GroupIdVector const & | actual ) |
Fraction of predictions matching the ground-truth class label.
| double ppforest2::stats::error_rate | ( | types::OutcomeVector const & | predictions, |
| types::GroupIdVector const & | actual ) |
Misclassification rate — 1 - accuracy.
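| double ppforest2::stats::error_rate | ( | types::OutcomeVector const & | predictions, |
| types::OutcomeVector const & | actual ) |
inline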
Convenience overload: float-typed labels (cast to GroupId locally).
Used by the unified training pipeline where y is carried as OutcomeVector for both classification and regression.
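The relationship between accuracy and the two error_rate overloads can be sketched with plain standard-library vectors. The aliases and function names below are hypothetical stand-ins for types::OutcomeVector and types::GroupIdVector, and the sketch assumes outcomes are float-coded integer labels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for types::OutcomeVector and types::GroupIdVector.
using OutcomeVec = std::vector<float>;
using GroupIdVec = std::vector<int>;

// Fraction of predictions whose integer-cast value equals the true label.
float accuracy_sketch(OutcomeVec const& predictions, GroupIdVec const& actual) {
  assert(predictions.size() == actual.size() && !actual.empty());
  std::size_t hits = 0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (static_cast<int>(predictions[i]) == actual[i]) ++hits;
  }
  return static_cast<float>(hits) / static_cast<float>(actual.size());
}

// Misclassification rate: 1 - accuracy.
double error_rate_sketch(OutcomeVec const& predictions, GroupIdVec const& actual) {
  return 1.0 - accuracy_sketch(predictions, actual);
}

// Float-label overload: cast the labels to integer ids, then delegate.
double error_rate_sketch(OutcomeVec const& predictions, OutcomeVec const& actual) {
  GroupIdVec ids(actual.size());
  for (std::size_t i = 0; i < actual.size(); ++i) {
    ids[i] = static_cast<int>(actual[i]);
  }
  return error_rate_sketch(predictions, ids);
}
```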
| std::map< int, int > ppforest2::stats::get_labels_map | ( | types::GroupIdVector const & | y_pred, |
| types::GroupIdVector const & | y ) |
Build a sorted mapping from unique group labels to contiguous indices.
Includes labels from both predictions and actual so the confusion matrix has a cell for every class on either side: classes the model predicted that never appear in the ground truth, and ground-truth classes the model never predicted.
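One way to build such a mapping is to take the sorted union of both label vectors and number the labels in order. The sketch below assumes integer group ids held in std::vector; labels_map_sketch is a hypothetical name, not the library function.

```cpp
#include <map>
#include <set>
#include <vector>

// Sorted union of predicted and actual labels, numbered 0, 1, 2, ...
std::map<int, int> labels_map_sketch(std::vector<int> const& y_pred,
                                     std::vector<int> const& y) {
  std::set<int> labels(y_pred.begin(), y_pred.end());
  labels.insert(y.begin(), y.end());

  std::map<int, int> index;
  int next = 0;
  for (int label : labels) {
    index[label] = next++;  // std::set iterates in ascending order
  }
  return index;
}
```

With predictions {2, 5, 2} and ground truth {1, 2, 2}, label 5 appears only in the predictions and label 1 only in the ground truth, yet both receive an index.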
| std::map< types::GroupId, int > ppforest2::stats::group_indices | ( | std::set< types::GroupId > const & | groups | ) |
Map each label in groups to its index in iteration order.
Used by predict(FeatureMatrix, Proportions) on ClassificationTree and ClassificationForest to assign each group a column in the proportions matrix. std::set iterates in ascending key order, so the resulting indices are sorted by group label.
| double ppforest2::stats::mae | ( | types::OutcomeVector const & | predictions, |
| types::OutcomeVector const & | actual ) |
Mean absolute error.
| types::Outcome ppforest2::stats::majority_vote | ( | std::vector< types::Outcome > const & | preds | ) |
Majority vote over a sequence of integer-coded class labels.
Returns the label with the largest count. Ties are broken by the smallest GroupId — std::map iterates in ascending key order and the strict > comparison keeps the earliest winner.
Used for classification ensemble aggregation: per-row OOB voting (models/Evaluation.cpp) and per-row in-bag voting (ClassificationForest::predict). Throws on empty input.
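The tie-breaking rule can be illustrated with a short sketch. It assumes types::Outcome is a float holding an integer class code; the function name and the choice of std::invalid_argument are assumptions, since this page only states that the function throws.

```cpp
#include <map>
#include <stdexcept>
#include <vector>

// Majority vote with ties resolved in favour of the smallest label.
float majority_vote_sketch(std::vector<float> const& preds) {
  if (preds.empty()) {
    throw std::invalid_argument("majority_vote: empty input");  // assumed exception type
  }
  std::map<int, int> counts;  // label -> number of votes, ascending by label
  for (float p : preds) {
    ++counts[static_cast<int>(p)];
  }
  int best_label = counts.begin()->first;
  int best_count = 0;
  for (auto const& entry : counts) {
    // Strict '>' means a later label with an equal count never replaces
    // the earlier (smaller) one.
    if (entry.second > best_count) {
      best_label = entry.first;
      best_count = entry.second;
    }
  }
  return static_cast<float>(best_label);
}
```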
| types::Outcome ppforest2::stats::mean | ( | std::vector< types::Outcome > const & | preds | ) |
Arithmetic mean of a sequence of outcome values.
Used for regression ensemble aggregation: per-row OOB averaging (models/Evaluation.cpp) and per-row in-bag averaging (RegressionForest::predict). Sum is computed in double to avoid loss-of-precision when many float outcomes are added. Throws on empty input.
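A minimal sketch of the double-precision accumulation described above, assuming types::Outcome is float. The function name and exception type are hypothetical.

```cpp
#include <stdexcept>
#include <vector>

// Mean of float outcomes, accumulated in double to limit rounding error.
float mean_sketch(std::vector<float> const& preds) {
  if (preds.empty()) {
    throw std::invalid_argument("mean: empty input");  // assumed exception type
  }
  double sum = 0.0;
  for (float p : preds) {
    sum += p;  // each float is widened to double before the addition
  }
  return static_cast<float>(sum / static_cast<double>(preds.size()));
}
```

Only the final result is narrowed back to float, so rounding happens once instead of at every addition.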
| Metrics ppforest2::stats::metrics_from_outcomes | ( | types::OutcomeVector const & | y_pred, |
| types::OutcomeVector const & | y, | ||
| types::Mode | mode ) |
Build a mode-appropriate Metrics variant from in-memory tensors.
Sibling to serialization::metrics_from_json — same variant, different source. Lets callers compute metrics once and pass the variant to polymorphic consumers (e.g. print_metrics_block).
| double ppforest2::stats::mse | ( | types::OutcomeVector const & | predictions, |
| types::OutcomeVector const & | actual ) |
Mean squared error.
| double ppforest2::stats::r_squared | ( | types::OutcomeVector const & | predictions, |
| types::OutcomeVector const & | actual ) |
Coefficient of determination (R²). Returns 0 when total variance is 0.
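The three regression metrics can be sketched together. This assumes the textbook definitions (R² = 1 - SS_res / SS_tot, with SS_tot taken around the mean of the actual values), which this page does not spell out; the function names and the use of std::vector<float> are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;  // hypothetical stand-in for types::OutcomeVector

// Mean squared error: average of (prediction - actual)^2.
double mse_sketch(Vec const& predictions, Vec const& actual) {
  assert(predictions.size() == actual.size() && !actual.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    double diff = static_cast<double>(predictions[i]) - actual[i];
    sum += diff * diff;
  }
  return sum / static_cast<double>(actual.size());
}

// Mean absolute error: average of |prediction - actual|.
double mae_sketch(Vec const& predictions, Vec const& actual) {
  assert(predictions.size() == actual.size() && !actual.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    sum += std::abs(static_cast<double>(predictions[i]) - actual[i]);
  }
  return sum / static_cast<double>(actual.size());
}

// R^2 = 1 - SS_res / SS_tot, returning 0 when the actual values are constant.
double r_squared_sketch(Vec const& predictions, Vec const& actual) {
  assert(predictions.size() == actual.size() && !actual.empty());
  double mean = 0.0;
  for (float a : actual) mean += a;
  mean /= static_cast<double>(actual.size());

  double ss_res = 0.0;
  double ss_tot = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    double res = static_cast<double>(actual[i]) - predictions[i];
    double dev = static_cast<double>(actual[i]) - mean;
    ss_res += res * res;
    ss_tot += dev * dev;
  }
  if (ss_tot == 0.0) return 0.0;  // degenerate case noted in the description
  return 1.0 - ss_res / ss_tot;
}
```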
| double ppforest2::stats::sd | ( | Eigen::MatrixBase< Derived > const & | data | ) |
Sample standard deviation of a vector — sqrt(var(data)).
| types::FeatureVector ppforest2::stats::sd | ( | types::FeatureMatrix const & | data | ) |
Column-wise sample standard deviation — element-wise sqrt of var.
| DataPacket ppforest2::stats::simulate | ( | int | n, |
| int | p, | ||
| int | G, | ||
| RNG & | rng, | ||
| simulation::params::Classification const & | params = {} ) |
Generate a simulated classification dataset.
Rows are sorted by group label.
| DataPacket ppforest2::stats::simulate | ( | int | n, |
| int | p, | ||
| RNG & | rng, | ||
| simulation::params::Regression const & | params = {} ) |
Generate a simulated regression dataset.
Rows are sorted by the continuous response. Distinguished from the classification overload by the absence of a group-count parameter.
| void ppforest2::stats::sort | ( | types::FeatureMatrix & | x, |
| Y & | y ) |
Sort a feature matrix and a response vector by the response values.
Templated on the response vector type so callers can sort by integer group labels (GroupIdVector) or continuous responses (OutcomeVector).
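The idea of sorting features and response together can be sketched with an index permutation. This version is not the library implementation: it uses nested std::vector rows instead of an Eigen matrix, is templated on the element type rather than the vector type, and assumes a stable sort, which this page does not confirm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

using Rows = std::vector<std::vector<float>>;  // hypothetical stand-in for FeatureMatrix

// Reorders the rows of x and the entries of y so that y is ascending.
template <typename T>
void sort_by_response_sketch(Rows& x, std::vector<T>& y) {
  assert(x.size() == y.size());

  // Sort row indices by response value instead of moving rows directly.
  std::vector<std::size_t> order(y.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&y](std::size_t a, std::size_t b) { return y[a] < y[b]; });

  // Apply the same permutation to both containers.
  Rows sorted_x;
  std::vector<T> sorted_y;
  sorted_x.reserve(x.size());
  sorted_y.reserve(y.size());
  for (std::size_t i : order) {
    sorted_x.push_back(std::move(x[i]));
    sorted_y.push_back(y[i]);
  }
  x = std::move(sorted_x);
  y = std::move(sorted_y);
}
```

The same template instantiates for integer group labels and for floating-point responses, which mirrors the motivation given for templating the library function.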
| Split ppforest2::stats::split | ( | DataPacket const & | data, |
| float | train_ratio, | ||
| RNG & | rng ) |
Perform a stratified random train/test split on a DataPacket.
Samples indices within each group proportional to train_ratio so that group balance is preserved in both train and test sets.
| data | The full dataset. |
| train_ratio | Proportion of data to use for training, in the open interval (0, 1). |
| rng | Random number generator. |
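A stratified split along these lines could look like the sketch below. Everything here is illustrative: SplitSketch and its fields are hypothetical (the members of Split are not listed on this page), std::mt19937 stands in for the pcg32-based RNG, and truncating the per-group training count is an assumption about rounding.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Hypothetical result type: row indices assigned to each side of the split.
struct SplitSketch {
  std::vector<int> train;
  std::vector<int> test;
};

SplitSketch stratified_split_sketch(std::vector<int> const& groups,
                                    float train_ratio, std::mt19937& rng) {
  // Collect the row indices belonging to each group label.
  std::map<int, std::vector<int>> by_group;
  for (std::size_t i = 0; i < groups.size(); ++i) {
    by_group[groups[i]].push_back(static_cast<int>(i));
  }

  SplitSketch out;
  for (auto& entry : by_group) {
    std::vector<int>& idx = entry.second;
    std::shuffle(idx.begin(), idx.end(), rng);
    // Assumed rounding: truncate train_ratio * group size.
    std::size_t n_train = static_cast<std::size_t>(train_ratio * idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k) {
      (k < n_train ? out.train : out.test).push_back(idx[k]);
    }
  }
  return out;
}
```

Sampling inside each group, rather than across the whole dataset, is what keeps the class proportions of the training and test sets close to those of the full data.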
| std::set< types::GroupId > ppforest2::stats::unique | ( | types::GroupIdVector const & | column | ) |
Unique group labels in a response vector.
| column | Group ID vector. |
| double ppforest2::stats::var | ( | Eigen::MatrixBase< Derived > const & | data | ) |
Sample variance of a vector (unbiased, n-1 denominator).
Returns 0 for single-element inputs (variance undefined; 0 is the natural degenerate value for "no spread"). Throws on empty input.
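The edge cases above can be made concrete with a small sketch over std::vector<double>. The function names and the exception type are assumptions; the library versions operate on Eigen types.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Unbiased sample variance: sum of squared deviations divided by n - 1.
double var_sketch(std::vector<double> const& data) {
  if (data.empty()) {
    throw std::invalid_argument("var: empty input");  // assumed exception type
  }
  if (data.size() == 1) {
    return 0.0;  // degenerate case: a single value has no spread
  }
  double mean = 0.0;
  for (double v : data) mean += v;
  mean /= static_cast<double>(data.size());

  double sum_sq = 0.0;
  for (double v : data) {
    sum_sq += (v - mean) * (v - mean);
  }
  return sum_sq / static_cast<double>(data.size() - 1);
}

// Sample standard deviation: sqrt(var(data)).
double sd_sketch(std::vector<double> const& data) {
  return std::sqrt(var_sketch(data));
}
```

For the input {2, 4, 6} the mean is 4, the squared deviations sum to 8, and dividing by n - 1 = 2 gives a variance of 4 and a standard deviation of 2.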
| types::FeatureVector ppforest2::stats::var | ( | types::FeatureMatrix const & | data | ) |
Column-wise sample variance of a matrix.
| data | Feature matrix with at least 2 rows. |