learning_machines_drift package
Submodules
learning_machines_drift.backends module
Backend module.
- class Backend(*args, **kwargs)
Bases: Protocol
A protocol class for a Backend.
- clear_logged_dataset(tag: str) bool
Delete directory containing logged files.
- Parameters:
tag (str) – Path to logged directory.
- Returns:
True if tag/logged path exists. False if tag/logged path does not exist.
- Return type:
bool
- clear_reference_dataset(tag: str) bool
Delete directory containing reference files.
- Parameters:
tag (str) – Path to reference directory.
- load_logged_dataset(tag: str) Dataset
Return a Dataset from the union of logged data.
- Parameters:
tag (str) – Tag identifying dataset.
- load_reference_dataset(tag: str) Dataset
Load reference dataset from reference path.
- Parameters:
tag (str) – Tag identifying dataset.
- save_logged_features(tag: str, identifier: UUID, dataframe: DataFrame) None
Save logged features using tag as the path with UUID prepended to filename.
- Parameters:
tag (str) – Tag identifying dataset.
identifier (UUID) – A unique identifier for the logged dataset.
dataframe (pd.DataFrame) – The dataframe that needs saving.
- save_logged_labels(tag: str, identifier: UUID, labels: Series) None
Save logged labels using tag as the path with UUID prepended to filename.
- Parameters:
tag (str) – Tag identifying dataset.
identifier (UUID) – A unique identifier for the labels of the dataset.
labels (pd.Series) – The labels series that needs saving.
- save_logged_latents(tag: str, identifier: UUID, dataframe: DataFrame) None
Save optionally passed latents dataframe using tag as the path with UUID prepended to filename.
- Parameters:
tag (str) – Tag identifying dataset.
identifier (UUID) – A unique identifier for the latents of the dataset.
dataframe (pd.DataFrame) – The dataframe of latents to be saved.
- class FileBackend(root_dir: Union[str, Path])
Bases: object
Implements the Backend protocol for writing files to the filesystem.
- clear_logged_dataset(tag: str) bool
Delete directory containing logged files.
- Parameters:
tag (str) – Path to logged directory.
- Returns:
True if tag/logged path exists. False if tag/logged path does not exist.
- Return type:
bool
- clear_reference_dataset(tag: str) bool
Delete directory containing reference files.
- Parameters:
tag (str) – Path to reference directory.
- load_logged_dataset(tag: str) Dataset
Return a Dataset from the union of logged data.
- Parameters:
tag (str) – Tag identifying dataset.
- load_reference_dataset(tag: str) Dataset
Load reference dataset from reference path.
- Parameters:
tag (str) – Tag identifying dataset.
- save_logged_features(tag: str, identifier: UUID, dataframe: DataFrame) None
Save logged features using tag as the path with UUID prepended to filename.
- Parameters:
tag (str) – Tag identifying dataset.
identifier (UUID) – A unique identifier for the logged dataset.
dataframe (pd.DataFrame) – The dataframe that needs saving.
- save_logged_labels(tag: str, identifier: UUID, labels: Series) None
Save logged labels using tag as the path with UUID prepended to filename.
- Parameters:
tag (str) – Tag identifying dataset.
identifier (UUID) – A unique identifier for the labels of the dataset.
labels (pd.Series) – The labels series that needs saving.
- save_logged_latents(tag: str, identifier: UUID, dataframe: Optional[DataFrame]) None
Save optionally passed latents dataframe using tag as the path with UUID prepended to filename.
- Parameters:
tag (str) – Tag identifying dataset.
identifier (UUID) – A unique identifier for the latents of the dataset.
dataframe (Optional[pd.DataFrame]) – The dataframe of latents to be saved.
- get_identifier(path_object: Union[str, Path]) Optional[UUID]
Extract the UUID from the filename. The filename should have the format UUID + some other text and a file extension. The UUID should match the regex in the pattern variable UUIDHex4.
- Parameters:
path_object (Union[str, Path]) – Path or filename from which to extract the UUID.
- Returns:
Optional universally unique identifier (UUID) from path_object.
- Return type:
Optional[UUID]
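The extraction described for get_identifier can be sketched with the standard library. This is only an illustration under assumptions: the function name extract_uuid and the regular expression UUID4_PATTERN are hypothetical stand-ins, and the actual UUIDHex4 pattern used by the package may differ.

```python
import re
import uuid
from pathlib import Path
from typing import Optional, Union

# Hypothetical pattern (an assumption, not the package's UUIDHex4):
# a hyphenated version-4 UUID in hex at the very start of the filename.
UUID4_PATTERN = re.compile(
    r"^([0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12})",
    re.IGNORECASE,
)


def extract_uuid(path_object: Union[str, Path]) -> Optional[uuid.UUID]:
    """Return the UUID that prefixes the filename, or None if there is none."""
    # Only the final path component is inspected, not the parent directories.
    match = UUID4_PATTERN.match(Path(path_object).name)
    return uuid.UUID(match.group(1)) if match else None
```

A filename without a leading UUID yields None rather than raising, which mirrors the Optional[UUID] return type documented above.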
learning_machines_drift.datasets module
Datasets module with functions for generating example data.
- example_dataset(n_rows: int, seed: Optional[int] = None) Tuple[DataFrame, Series, DataFrame]
Generates data and returns features, labels and latents.
- Parameters:
n_rows (int) – Number of rows/samples.
seed (Optional[int]) – Random seed for reproducibly generating data.
- Returns:
A dataset tuple of generated features, labels and latents.
- Return type:
Tuple[pd.DataFrame, pd.Series, pd.DataFrame]
- logistic_model(x_mu: ndarray[Any, dtype[float64]] = array([0., 0., 0.]), x_scale: ndarray[Any, dtype[float64]] = array([1., 1., 1.]), x_corr: ndarray[Any, dtype[float64]] = array([[1., 0.4, 0.], [0.4, 1., 0.], [0., 0., 1.]]), alpha: float = 0.5, beta: ndarray[Any, dtype[float64]] = array([1., 0.5, 0.]), size: int = 50, seed: Optional[int] = None, return_latents: bool = False) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], Optional[ndarray[Any, dtype[float64]]]]
Generate synthetic features, labels and latents.
Features are generated from a multivariate normal distribution, where the mean vector, scale vector and correlation matrix can be specified, allowing users to simulate covariate drift.
Labels are generated with a logistic regression model. The regression parameters are controlled with the beta parameter, allowing simulation of concept drift.
Latents are a single feature characterizing the Bernoulli probability generated by the model.
- Parameters:
x_mu (NDArray[np.float64]) – Mean vector of features. Defaults to np.array([0.0, 0.0, 0.0]).
x_scale (NDArray[np.float64]) – Scale of features. Defaults to np.array([1.0, 1.0, 1.0]).
x_corr (NDArray[np.float64]) – Correlation matrix giving the correlation between features. Defaults to np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]]).
alpha (float) – Regression alpha parameter. Defaults to 0.5.
beta (NDArray[np.float64]) – Regression beta parameters. Defaults to np.array([1.0, 0.5, 0.0]).
size (int) – Number of samples to draw from model. Defaults to 50.
seed (Optional[int]) – Random seed for reproducibly generating data. Defaults to None.
return_latents (bool) – Return underlying prediction value before thresholding as ‘latent’ data. Defaults to False.
- Returns:
Tuple of features, labels and (optional) latents generated.
- Return type:
Tuple[NDArray[np.float64], NDArray[np.float64], Optional[NDArray[np.float64]]]
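The generative process described for logistic_model can be sketched in pure Python. This is a minimal illustration, not the package's implementation: the names cholesky and sketch_logistic_model are hypothetical, the covariance is assumed to be diag(x_scale) · x_corr · diag(x_scale), and labels are assumed to be drawn as Bernoulli samples from the logistic probability.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def cholesky(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sketch_logistic_model(
    x_mu: Sequence[float] = (0.0, 0.0, 0.0),
    x_scale: Sequence[float] = (1.0, 1.0, 1.0),
    x_corr: Sequence[Sequence[float]] = ((1.0, 0.4, 0.0), (0.4, 1.0, 0.0), (0.0, 0.0, 1.0)),
    alpha: float = 0.5,
    beta: Sequence[float] = (1.0, 0.5, 0.0),
    size: int = 50,
    seed: Optional[int] = None,
) -> Tuple[List[List[float]], List[int], List[float]]:
    """Draw correlated normal features, then logistic labels and latents."""
    rng = random.Random(seed)
    n = len(x_mu)
    # Assumed covariance: scale_i * corr_ij * scale_j.
    cov = [[x_scale[i] * x_corr[i][j] * x_scale[j] for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    features, labels, latents = [], [], []
    for _ in range(size):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # x = mu + L z gives a multivariate normal with covariance L L^T.
        x = [x_mu[i] + sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        eta = alpha + sum(b * v for b, v in zip(beta, x))
        prob = 1.0 / (1.0 + math.exp(-eta))  # Bernoulli probability (the latent)
        features.append(x)
        labels.append(1 if rng.random() < prob else 0)
        latents.append(prob)
    return features, labels, latents
```

Shifting x_mu or x_corr between two calls imitates covariate drift, while changing beta imitates concept drift, as the description above notes.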
learning_machines_drift.display module
Class for displaying drift scores between reference and registered datasets.
- class Display
Bases: object
A class for converting a dictionary of drift scores to displayed output.
- classmethod plot(result: StructuredResult, score_name: Optional[str] = None, score_type: str = 'pvalue', alpha: float = 0.05) Tuple[Figure, Any]
Plot method for displaying a set of scores on a subplot grid.
- Parameters:
result (StructuredResult) – Structured result from a drift score measurement.
score_type (str) – Either “statistic” or “pvalue”.
score_name (Optional[str]) – Name of score to be plotted and used as plot title.
alpha (float) – Value of alpha to be used in p-value plots.
- Returns:
Tuple of figure and subplot array.
- Return type:
Tuple[plt.Figure, Any]
- classmethod table(result: StructuredResult, verbose: bool = True) DataFrame
Gets a pandas dataframe and optionally prints a table of results from drift scoring.
- Parameters:
result (StructuredResult) – Structured result from a drift score measurement.
verbose (bool) – Whether to print the table of results.
- Returns:
Dataframe of scores.
- Return type:
pd.DataFrame
learning_machines_drift.registry module
Module for registry handling storage and logging of datasets.
- class Registry(tag: str, expect_features: bool = True, expect_labels: bool = True, expect_latent: bool = False, backend: Optional[Backend] = None, clear_logged: bool = False, clear_reference: bool = False)
Bases: object
Class for registry for logging datasets.
- tag
Tag identifying dataset.
- Type:
str
- registered_features
Optional registered features.
- Type:
Optional[pd.DataFrame]
- registered_labels
Optional registered labels.
- Type:
Optional[pd.Series]
- registered_latents
Optional registered latents.
- Type:
Optional[pd.DataFrame]
- expect_features
Whether features are expected in registry.
- Type:
bool
- expect_labels
Whether a labels series is expected in registry.
- Type:
bool
- expect_latent
Whether latents are expected in registry.
- Type:
bool
- all_registered() bool
Checks whether all expected datasets are registered.
- Returns:
True if all expected datasets are registered, False otherwise.
- Return type:
bool
- property identifier: UUID
Gets the identifier of the registry.
- Returns:
The identifier.
- Return type:
UUID
- log_dataset(dataset: Dataset) None
Logs a dataset in registered data.
- Parameters:
dataset (Dataset) – New dataset to be logged.
- log_features(features: DataFrame) None
Logs dataset features in registered data.
- Parameters:
features (pd.DataFrame) – Features dataframe to be registered.
- log_labels(labels: Series) None
Logs dataset labels in registered data.
- Parameters:
labels (pd.Series) – Labels series to be registered.
- log_latents(latent: DataFrame) None
Logs dataset latents in registered data.
- Parameters:
latent (pd.DataFrame) – Latents dataframe to be registered.
- ref_summary() BaselineSummary
Return a JSON describing the shape of the dataset features, labels and latents.
- Returns:
Summary of the dataset shapes.
- Return type:
BaselineSummary
- register_ref_dataset(features: DataFrame, labels: Series, latents: Optional[DataFrame] = None) None
Registers passed reference data.
- Parameters:
features (pd.DataFrame) – Reference features to be stored.
labels (pd.Series) – Reference labels to be stored.
latents (Optional[pd.DataFrame]) – Reference latents to be stored.
learning_machines_drift.filter module
Module with class to filter a dataset.
- class Comparison(value)
Bases: Enum
Comparison enum for ‘LESS’, ‘GREATER’ and ‘EQUAL’ cases.
- EQUAL = 3
- GREATER = 2
- LESS = 1
- class Condition(comparison_str: str, value: Any)
Bases: object
Condition class comprising a ‘comparison’ and a ‘value’.
- comparison: Comparison
- value: Any
- class Filter(conditions: Optional[dict[str, List[learning_machines_drift.filter.Condition]]])
Bases: object
Filter class.
Filters a given dataset through an AND operation applied across all passed conditions.
- conditions: Optional[dict[str, List[learning_machines_drift.filter.Condition]]]
Dict with key (variable) and value as a list of (condition, value) to be used for filtering.
- Type:
dict[str, List[Condition]]
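The AND semantics described for Filter can be sketched as follows. This is an illustration under assumptions: Comparison and Condition are re-declared locally as simplified stand-ins (the package's Condition is constructed from a comparison string, whereas this one takes the enum member directly), the helper names row_passes and filter_rows are hypothetical, and rows are plain dictionaries rather than dataframe rows.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List


class Comparison(Enum):
    LESS = 1
    GREATER = 2
    EQUAL = 3


@dataclass
class Condition:
    comparison: Comparison
    value: Any


def row_passes(row: Dict[str, Any], conditions: Dict[str, List[Condition]]) -> bool:
    """A row passes only if every condition on every variable holds (logical AND)."""
    for variable, variable_conditions in conditions.items():
        observed = row[variable]
        for condition in variable_conditions:
            if condition.comparison is Comparison.LESS:
                ok = observed < condition.value
            elif condition.comparison is Comparison.GREATER:
                ok = observed > condition.value
            else:
                ok = observed == condition.value
            if not ok:
                return False
    return True


def filter_rows(
    rows: List[Dict[str, Any]], conditions: Dict[str, List[Condition]]
) -> List[Dict[str, Any]]:
    """Keep only the rows that satisfy all conditions."""
    return [row for row in rows if row_passes(row, conditions)]
```

For example, the conditions {"age": [GREATER 25, LESS 60], "sex": [EQUAL "F"]} keep only rows with age strictly between 25 and 60 and sex equal to "F".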
learning_machines_drift.monitor module
Monitor class for interacting with data and scoring drift.
- class Monitor(tag: str, backend: Optional[Backend] = None)
Bases: object
A class for monitoring data, loading data from a backend and scoring drift with the metrics class.
- tag
The tag where data for monitoring is located within backend.
- Type:
str
- registered_dataset
The logged, registered dataset for drift comparison to reference dataset.
- Type:
Optional[Dataset]
- property metrics: Metrics
Drift metrics.
- Raises:
ReferenceDatasetMissing – The reference dataset is None.
ValueError – There is no additional registered data.
learning_machines_drift.exceptions module
Exceptions module.
- exception ReferenceDatasetMissing
Bases: Exception
Raised when no reference dataset is logged.
learning_machines_drift.metrics module
Class for scoring drift between reference and registered datasets.
- class Metrics(reference_dataset: Dataset, registered_dataset: Dataset, random_state: Optional[int] = None)
Bases: object
A class with metrics for scoring data drift between registered and reference datasets.
- random_state
Optional seeding for reproducibility.
- Type:
Optional[int]
- get_boundary_adherence() StructuredResult
Computes, for each feature, the proportion of registered data that lies within the minimum and maximum of the reference dataset.
See SDMetrics for further details.
- Returns:
The boundary adherence of the registered dataset compared to the reference dataset.
- Return type:
StructuredResult
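The boundary adherence described above can be sketched for a single numeric feature. This is a minimal illustration of the stated definition only; the function name boundary_adherence is hypothetical, and the SDMetrics implementation that the package relies on may handle missing values and other column types differently.

```python
from typing import Sequence


def boundary_adherence(reference: Sequence[float], registered: Sequence[float]) -> float:
    """Fraction of registered values inside [min(reference), max(reference)]."""
    low, high = min(reference), max(reference)
    inside = sum(1 for value in registered if low <= value <= high)
    return inside / len(registered)
```

A score of 1.0 means that no registered value falls outside the range seen in the reference data.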
- get_range_coverage() StructuredResult
Computes, for each feature, the proportion of the range of the registered data that is covered by the reference dataset.
See SDMetrics for further details.
- Returns:
The range coverage of the registered dataset compared to the reference dataset.
- Return type:
StructuredResult
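Read literally, the range coverage described above is the fraction of the registered feature's interval that overlaps the reference feature's interval. The sketch below implements that reading for one numeric feature; the function name range_coverage is hypothetical, and the exact formula and direction of comparison used by SDMetrics may differ.

```python
from typing import Sequence


def range_coverage(reference: Sequence[float], registered: Sequence[float]) -> float:
    """Fraction of the registered [min, max] interval overlapped by the reference interval."""
    ref_low, ref_high = min(reference), max(reference)
    reg_low, reg_high = min(registered), max(registered)
    width = reg_high - reg_low
    if width == 0:
        # Degenerate interval: covered only if the single point lies in the reference range.
        return 1.0 if ref_low <= reg_low <= ref_high else 0.0
    overlap = min(reg_high, ref_high) - max(reg_low, ref_low)
    return max(0.0, overlap) / width
```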
- logistic_detection(normalize: bool = False, score_type: Optional[str] = None, seed: Optional[int] = None, verbose: bool = True) StructuredResult
Calculates a measure of similarity using a fitted logistic regression to predict whether a sample comes from the reference or the registered dataset. The SDMetrics package source is adapted to permit an optional score_type and seed to be given, allowing alternative and reproducible metrics.
- score_type can be:
None: defaults to scoring of logistic_detection method.
“f1”: Cross-validated F1 score with 0.5 threshold.
“roc_auc”: Cross-validated receiver operating characteristic (area under the curve).
- Parameters:
score_type (Optional[str]) – None for default or string; “f1” and “roc_auc” currently implemented.
seed (Optional[int]) – Optional integer for reproducibility of scoring as cross-validation performed.
verbose (bool) – Boolean for verbose output to stdout.
- Returns:
Score providing an overall similarity measure of reference and registered datasets.
- Return type:
StructuredResult
- scipy_kolmogorov_smirnov(verbose: bool = True) StructuredResult
Calculates feature-wise two-sample Kolmogorov-Smirnov test for goodness of fit. Assumes continuous underlying distributions but scores are still interpretable if data is approximately continuous.
- Parameters:
verbose (bool) – Boolean for verbose output to stdout.
- Returns:
Dictionary of statistics and p-values by feature.
- Return type:
StructuredResult
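For intuition, the two-sample Kolmogorov-Smirnov statistic for one feature is the largest gap between the two empirical distribution functions. The sketch below computes only that statistic in pure Python; the function name ks_statistic is hypothetical, no p-value is computed, and the package itself delegates to SciPy.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_x: Sequence[float], sample_y: Sequence[float]) -> float:
    """Maximum absolute difference between the two empirical CDFs."""
    xs, ys = sorted(sample_x), sorted(sample_y)

    def ecdf(sorted_sample: Sequence[float], point: float) -> float:
        # Proportion of observations less than or equal to `point`.
        return bisect_right(sorted_sample, point) / len(sorted_sample)

    # The maximum gap is always attained at one of the observed values.
    points = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, point) - ecdf(ys, point)) for point in points)
```

A statistic near 0 indicates similar distributions for the feature, while a value near 1 indicates almost no overlap.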
- scipy_mannwhitneyu(verbose: bool = True) StructuredResult
Calculates feature-wise Mann-Whitney U test, a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. Provides a test for the difference in location of two distributions. Assumes continuous underlying distributions but scores are still interpretable if data is approximately continuous.
- Parameters:
verbose (bool) – Boolean for verbose output to stdout.
- Returns:
Dictionary of statistics and p-values by feature.
- Return type:
StructuredResult
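The U statistic behind the Mann-Whitney test can be sketched from rank sums. This pure-Python illustration computes only the statistic for the first sample, using midranks for ties; the function name mann_whitney_u is hypothetical, no p-value is computed, and the package itself delegates to SciPy.

```python
from typing import Sequence


def mann_whitney_u(sample_x: Sequence[float], sample_y: Sequence[float]) -> float:
    """U statistic of sample_x: rank sum of x minus n_x * (n_x + 1) / 2."""
    # Tag each value with its sample of origin (0 for x, 1 for y) and sort by value.
    combined = sorted(
        [(value, 0) for value in sample_x] + [(value, 1) for value in sample_y],
        key=lambda pair: pair[0],
    )
    rank_sum_x = 0.0
    start = 0
    while start < len(combined):
        end = start
        while end < len(combined) and combined[end][0] == combined[start][0]:
            end += 1
        # Tied values occupy ranks start+1 .. end and share their average rank.
        midrank = (start + 1 + end) / 2.0
        rank_sum_x += midrank * sum(1 for k in range(start, end) if combined[k][1] == 0)
        start = end
    n_x = len(sample_x)
    return rank_sum_x - n_x * (n_x + 1) / 2.0
```

Equivalently, U counts the pairs in which a value from the first sample exceeds a value from the second, with ties counted as one half.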
- scipy_permutation(agg_func: Callable[..., float] = <function mean>, verbose: bool = True) StructuredResult
Performs a feature-wise permutation test. By default, the statistic used to measure differences under permutations of labels is the mean.
- Parameters:
agg_func (Callable[..., float]) – Function for comparing two samples.
verbose (bool) – Boolean for verbose output to stdout.
- Returns:
Dictionary with keys as features and values as scipy.stats.permutation_test object with test results.
- Return type:
StructuredResult
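The idea of a permutation test with the mean as aggregation function can be sketched for one feature. This is an illustration under assumptions: the function name permutation_test and the parameters n_resamples and seed are hypothetical, the test is two-sided on the difference of aggregated values, and the p-value uses an add-one correction, which is a common convention for randomized tests. The package itself delegates to scipy.stats.permutation_test.

```python
import random
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple


def permutation_test(
    sample_x: Sequence[float],
    sample_y: Sequence[float],
    agg_func: Callable[..., float] = mean,
    n_resamples: int = 999,
    seed: Optional[int] = None,
) -> Tuple[float, float]:
    """Return (observed statistic, two-sided p-value) for agg(x) - agg(y)."""
    rng = random.Random(seed)
    observed = agg_func(sample_x) - agg_func(sample_y)
    pooled = list(sample_x) + list(sample_y)
    n_x = len(sample_x)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffling the pooled data reassigns sample membership at random.
        rng.shuffle(pooled)
        statistic = agg_func(pooled[:n_x]) - agg_func(pooled[n_x:])
        if abs(statistic) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value
```

Passing a different agg_func, for example statistics.median, changes the aspect of the distribution being compared without changing the procedure.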
learning_machines_drift.types module
Module of drift types.
- class BaselineSummary(*, shapes: ShapeSummary)
Bases: BaseModel
Class for storing a shape summary with JSON string representation.
- shapes: ShapeSummary
A shape summary instance of a dataset.
- Type:
ShapeSummary
- class Dataset(features: DataFrame, labels: Series, latents: Optional[DataFrame] = None)
Bases: object
Class for representing a drift dataset.
- property feature_names: List[str]
Returns a list of features dataframe columns.
- Returns:
A list of feature column names as strings.
- Return type:
List[str]
- features: DataFrame
A combined dataframe of input features and ground truth labels.
- Type:
pd.DataFrame
- labels: Series
A series of predicted labels from a model.
- Type:
pd.Series
- latents: Optional[DataFrame] = None
An optional dataframe of latent variables per sample.
- Type:
Optional[pd.DataFrame]
- unify() DataFrame
Returns a column-wise concatenated dataframe of features, labels and latents.
- Returns:
Column-wise concatenated dataframe of features, labels and latents.
- Return type:
pd.DataFrame
- class FeatureSummary(*, n_rows: int, n_features: int)
Bases: BaseModel
Provides a summary of a features dataframe.
- n_features: int
Number of features (columns).
- Type:
int
- n_rows: int
Number of samples (rows).
- Type:
int
- class LabelSummary(*, n_rows: int, n_labels: int)
Bases: BaseModel
Provides a summary of a labels series.
- n_labels: int
Number of distinct labels. For example, for binary data, this would be equal to 2.
- Type:
int
- n_rows: int
Number of samples (rows).
- Type:
int
- class LatentSummary(*, n_rows: int, n_latents: int)
Bases: BaseModel
Provides a summary of a latents dataframe.
- n_latents: int
Number of latent features (columns).
- Type:
int
- n_rows: int
Number of samples (rows).
- Type:
int
- class ShapeSummary(*, features: FeatureSummary, labels: LabelSummary, latents: Optional[LatentSummary] = None)
Bases: BaseModel
Provides a summary of the object shapes in a dataset of features, labels and latents.
- features: FeatureSummary
Features shape summary.
- Type:
FeatureSummary
- labels: LabelSummary
Labels shape summary.
- Type:
LabelSummary
- latents: Optional[LatentSummary]
Optional latents shape summary.
- Type:
Optional[LatentSummary]
- class StructuredResult(method_name: str, results: Dict[str, Dict[str, float]])
Bases: object
A type for representing a result from the hypothesis tests module.
- method_name: str
Name of the scoring method used.
- Type:
str
- results: Dict[str, Dict[str, float]]
Dictionary of results with keys as feature_name or, if for a unified dataset, “single_value”. Values are a dictionary containing the result statistic and p-value (if available) for a given method_name.
- Type:
Dict[str, Dict[str, float]]
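The nested results dictionary can be consumed directly. The sketch below assumes the inner keys are named "statistic" and "pvalue" (the two score types named for Display.plot); the feature names, the numbers and the helper drifted_features are hypothetical and purely illustrative.

```python
from typing import Dict, List

# Illustrative values only, shaped like StructuredResult.results.
results: Dict[str, Dict[str, float]] = {
    "age": {"statistic": 0.12, "pvalue": 0.30},
    "income": {"statistic": 0.45, "pvalue": 0.001},
}


def drifted_features(results: Dict[str, Dict[str, float]], alpha: float = 0.05) -> List[str]:
    """Names of features whose p-value falls below the significance level alpha."""
    return [
        name
        for name, scores in results.items()
        # Some methods report only a statistic, so a missing p-value is skipped.
        if "pvalue" in scores and scores["pvalue"] < alpha
    ]
```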
Module contents
Tools for measuring data drift.
The package top level re-exports the following classes, each documented in its submodule section above: Dataset, Display, FileBackend, Filter, Monitor, Registry and StructuredResult.