learning_machines_drift package

Submodules

learning_machines_drift.backends module

Backend module.

class Backend(*args, **kwargs)

Bases: Protocol

A protocol class for a Backend.

clear_logged_dataset(tag: str) bool

Delete directory containing logged files.

Parameters:

tag (str) – Path to logged directory.

Returns:

True if the tag/logged path exists, False if it does not.

Return type:

bool

clear_reference_dataset(tag: str) bool

Delete directory containing reference files.

Parameters:

tag (str) – Path to reference directory.

load_logged_dataset(tag: str) Dataset

Return a Dataset from the union of logged data.

Parameters:

tag (str) – Tag identifying dataset.

load_reference_dataset(tag: str) Dataset

Load reference dataset from reference path.

Parameters:

tag (str) – Tag identifying dataset.

save_logged_features(tag: str, identifier: UUID, dataframe: DataFrame) None

Save logged features using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the logged dataset.

  • dataframe (pd.DataFrame) – The dataframe that needs saving.

save_logged_labels(tag: str, identifier: UUID, labels: Series) None

Save logged labels using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the labels of the dataset.

  • labels (pd.Series) – The series of labels that needs saving.

save_logged_latents(tag: str, identifier: UUID, dataframe: DataFrame) None

Save optionally passed latents dataframe using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the latents of the dataset.

  • dataframe (pd.DataFrame) – The dataframe of latents to be saved.

save_reference_dataset(tag: str, dataset: Dataset) None

Saves passed dataset to backend under tag.

Parameters:
  • tag (str) – A tag for locating the dataset within the backend.

  • dataset (Dataset) – Reference dataset to be saved.
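
Because Backend is a Protocol, a class can act as a backend by providing matching methods without inheriting from it. The sketch below illustrates that structural typing with a hypothetical, cut-down protocol (TagStore) and a hypothetical in-memory class; neither name is part of this package, and the real Backend protocol has the fuller method set listed above.

```python
from typing import Dict, Protocol, runtime_checkable


@runtime_checkable
class TagStore(Protocol):
    """Hypothetical, cut-down stand-in for the Backend protocol."""

    def save(self, tag: str, payload: str) -> None: ...

    def clear(self, tag: str) -> bool: ...


class InMemoryStore:
    """Satisfies TagStore structurally; it does not inherit from it."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, tag: str, payload: str) -> None:
        self._data[tag] = payload

    def clear(self, tag: str) -> bool:
        # Mirror the documented contract: True if the tag existed, else False.
        return self._data.pop(tag, None) is not None


store = InMemoryStore()
store.save("demo", "payload")
first_clear = store.clear("demo")   # the tag existed
second_clear = store.clear("demo")  # nothing left to delete
```

A custom storage layer (for example a database or object store) could presumably be passed wherever a Backend is expected, provided it implements every method of the protocol.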

class FileBackend(root_dir: Union[str, Path])

Bases: object

Implements the Backend protocol for writing files to the filesystem.

clear_logged_dataset(tag: str) bool

Delete directory containing logged files.

Parameters:

tag (str) – Path to logged directory.

Returns:

True if the tag/logged path exists, False if it does not.

Return type:

bool

clear_reference_dataset(tag: str) bool

Delete directory containing reference files.

Parameters:

tag (str) – Path to reference directory.

load_logged_dataset(tag: str) Dataset

Return a Dataset from the union of logged data.

Parameters:

tag (str) – Tag identifying dataset.

load_reference_dataset(tag: str) Dataset

Load reference dataset from reference path.

Parameters:

tag (str) – Tag identifying dataset.

save_logged_features(tag: str, identifier: UUID, dataframe: DataFrame) None

Save logged features using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the logged dataset.

  • dataframe (pd.DataFrame) – The dataframe that needs saving.
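
As a rough illustration of "tag as the path with UUID prepended to filename", the sketch below builds such a path and writes a small file to a temporary directory. The helper name, the "logged" subdirectory and the "_features.csv" suffix are assumptions made for the example; the actual layout used by FileBackend is not documented here.

```python
import tempfile
from pathlib import Path
from uuid import UUID, uuid4


def logged_features_path(root_dir: Path, tag: str, identifier: UUID) -> Path:
    """Build <root_dir>/<tag>/logged/<uuid>_features.csv.

    The "logged" directory name and the file suffix are assumptions.
    """
    return root_dir / tag / "logged" / f"{identifier}_features.csv"


identifier = uuid4()
with tempfile.TemporaryDirectory() as tmp:
    path = logged_features_path(Path(tmp), "my_model", identifier)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("age,income\n42,30000\n")
    written = path.exists()

file_name = path.name
```

Prefixing every file with the same UUID is what would allow the features, labels and latents of one logging call to be matched up again when the logged data is loaded.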

save_logged_labels(tag: str, identifier: UUID, labels: Series) None

Save logged labels using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the labels of the dataset.

  • labels (pd.Series) – The series of labels that needs saving.

save_logged_latents(tag: str, identifier: UUID, dataframe: Optional[DataFrame]) None

Save optionally passed latents dataframe using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the latents of the dataset.

  • dataframe (pd.DataFrame) – The dataframe of latents to be saved.

save_reference_dataset(tag: str, dataset: Dataset) None

Saves passed dataset to backend under tag.

Parameters:
  • tag (str) – A tag for locating the dataset within the backend.

  • dataset (Dataset) – Reference dataset to be saved.

get_identifier(path_object: Union[str, Path]) Optional[UUID]

Extract the UUID from the filename. The filename should have the format UUID + some other text and a file extension. The UUID should match the regex in the pattern variable UUIDHex4.

Parameters:

path_object (Union[str, Path]) – Path or filename from which to extract the UUID.

Returns:

Optional universally unique identifier (UUID) from path_object.

Return type:

Optional[UUID]
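
A minimal sketch of this kind of extraction is shown below. The regular expression is an assumed pattern for a hyphenated version-4 UUID at the start of the filename; the package's own UUIDHex4 pattern is not reproduced here and may differ, and extract_identifier is a hypothetical name.

```python
import re
from pathlib import Path
from typing import Optional, Union
from uuid import UUID

# Assumed pattern: a hyphenated version-4 UUID at the start of the filename.
UUID4_PATTERN = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def extract_identifier(path_object: Union[str, Path]) -> Optional[UUID]:
    """Return the UUID that prefixes the filename, or None if there is none."""
    match = UUID4_PATTERN.match(Path(path_object).name)
    return UUID(match.group(0)) if match else None


found = extract_identifier("logs/3f2b8c1e-9d4a-4e6f-8a1b-0c2d3e4f5a6b_features.csv")
missing = extract_identifier("logs/features.csv")
```

Returning None instead of raising lets a caller skip files in the directory that were not written by the backend.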

learning_machines_drift.datasets module

Datasets module with functions for generating example data.

example_dataset(n_rows: int, seed: Optional[int] = None) Tuple[DataFrame, Series, DataFrame]

Generates data and returns features, labels and latents.

Parameters:
  • n_rows (int) – Number of rows/samples.

  • seed (Optional[int]) – Random seed for reproducibly generating data.

Returns:

A dataset tuple of generated features, labels and latents.

Return type:

Tuple[pd.DataFrame, pd.Series, pd.DataFrame]

logistic_model(x_mu: ndarray[Any, dtype[float64]] = array([0., 0., 0.]), x_scale: ndarray[Any, dtype[float64]] = array([1., 1., 1.]), x_corr: ndarray[Any, dtype[float64]] = array([[1., 0.4, 0.], [0.4, 1., 0.], [0., 0., 1.]]), alpha: float = 0.5, beta: ndarray[Any, dtype[float64]] = array([1., 0.5, 0.]), size: int = 50, seed: Optional[int] = None, return_latents: bool = False) Tuple[ndarray[Any, dtype[float64]], ndarray[Any, dtype[float64]], Optional[ndarray[Any, dtype[float64]]]]

Generate synthetic features, labels and latents.

Features are generated from a multivariate normal distribution, where the mean vector, scale vector and correlation matrix can be specified, allowing users to simulate covariate drift.

Labels are generated with a logistic regression model. The regression parameters are controlled with the beta parameter, allowing simulation of concept drift.

Latents are a single feature: the Bernoulli probability generated by the model.

Parameters:
  • x_mu (NDArray[np.float64]) – Mean vector of features. Defaults to np.array([0.0, 0.0, 0.0]).

  • x_scale (NDArray[np.float64]) – Scale of features. Defaults to np.array([1.0, 1.0, 1.0]).

  • x_corr (NDArray[np.float64]) – Correlation matrix giving the correlation between features. Defaults to np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]]).

  • alpha (float) – Regression alpha parameter. Defaults to 0.5.

  • beta (NDArray[np.float64]) – Regression beta parameters. Defaults to np.array([1.0, 0.5, 0.0]).

  • size (int) – Number of samples to draw from model. Defaults to 50.

  • seed (Optional[int]) – Random seed for reproducibly generating data. Defaults to None.

  • return_latents (bool) – Return underlying prediction value before thresholding as ‘latent’ data. Defaults to False.

Returns:

Tuple of features, labels and (optional) latents generated.

Return type:

Tuple[NDArray[np.float64], NDArray[np.float64], Optional[NDArray[np.float64]]]
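
The generative process described above can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the package's code: the function names are hypothetical, lists stand in for NumPy arrays, and correlated normals are drawn through a hand-written Cholesky factor of diag(scale) · corr · diag(scale).

```python
import math
import random
from typing import List, Optional, Tuple


def cholesky(matrix: List[List[float]]) -> List[List[float]]:
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def simulate(
    x_mu: List[float],
    x_scale: List[float],
    x_corr: List[List[float]],
    alpha: float,
    beta: List[float],
    size: int,
    seed: Optional[int] = None,
) -> Tuple[List[List[float]], List[int], List[float]]:
    """Return (features, labels, latents) for a logistic regression model."""
    rng = random.Random(seed)
    dim = len(x_mu)
    # Covariance = diag(scale) @ corr @ diag(scale).
    cov = [
        [x_scale[i] * x_corr[i][j] * x_scale[j] for j in range(dim)]
        for i in range(dim)
    ]
    lower = cholesky(cov)
    features: List[List[float]] = []
    labels: List[int] = []
    latents: List[float] = []
    for _ in range(size):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x = [
            x_mu[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(dim)
        ]
        # Bernoulli probability from the logistic link; this is the latent.
        linear = alpha + sum(b * v for b, v in zip(beta, x))
        prob = 1.0 / (1.0 + math.exp(-linear))
        features.append(x)
        latents.append(prob)
        labels.append(1 if rng.random() < prob else 0)
    return features, labels, latents


corr = [[1.0, 0.4, 0.0], [0.4, 1.0, 0.0], [0.0, 0.0, 1.0]]
beta = [1.0, 0.5, 0.0]
reference = simulate([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], corr, 0.5, beta, 50, seed=1)
rerun = simulate([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], corr, 0.5, beta, 50, seed=1)
# Covariate drift: move the mean of the first feature by two units.
shifted = simulate([2.0, 0.0, 0.0], [1.0, 1.0, 1.0], corr, 0.5, beta, 50, seed=1)
```

Changing x_mu, x_scale or x_corr between two calls simulates covariate drift, while changing alpha or beta changes the relationship between features and labels, which corresponds to concept drift.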

learning_machines_drift.display module

Class for displaying drift scores between reference and registered datasets.

class Display

Bases: object

A class for converting a dictionary of drift scores to displayed output.

classmethod plot(result: StructuredResult, score_name: Optional[str] = None, score_type: str = 'pvalue', alpha: float = 0.05) Tuple[Figure, Any]

Plot method for displaying a set of scores on a subplot grid.

Parameters:
  • result (StructuredResult) – Structured result from a drift score measurement.

  • score_type (str) – Either “statistic” or “pvalue”.

  • score_name (str) – Name of score to be plotted and used as plot title.

  • alpha (float) – Value of alpha to be used in p-value plots.

Returns:

tuple of fig and subplot array.

Return type:

Tuple[plt.Figure, Any]

classmethod table(result: StructuredResult, verbose: bool = True) DataFrame

Gets a pandas dataframe and optionally prints a table of results from drift scoring.

Parameters:
  • result (StructuredResult) – Structured result from a drift score measurement.

  • verbose (bool) – Whether to print the table of results.

Returns:

Dataframe of scores.

Return type:

pd.DataFrame

learning_machines_drift.registry module

Module for registry handling storage and logging of datasets.

class Registry(tag: str, expect_features: bool = True, expect_labels: bool = True, expect_latent: bool = False, backend: Optional[Backend] = None, clear_logged: bool = False, clear_reference: bool = False)

Bases: object

Registry class for logging datasets.

backend

Optional backend for data.

Type:

Optional[Backend]

tag

Tag identifying dataset.

Type:

str

ref_dataset

Optional reference dataset.

Type:

Optional[Dataset]

registered_features

Optional registered features.

Type:

Optional[pd.DataFrame]

registered_labels

Optional registered labels.

Type:

Optional[pd.Series]

registered_latents

Optional registered latents.

Type:

Optional[pd.Series]

expect_features

Whether features are expected in registry.

Type:

bool

expect_labels

Whether a labels series is expected in registry.

Type:

bool

expect_latent

Whether latents are expected in registry.

Type:

bool

all_registered() bool

Checks whether all expected datasets are registered.

Returns:

True if all expected datasets are registered, False otherwise.

Return type:

bool

property identifier: UUID

Gets the identifier of the registry.

Returns:

The identifier.

Return type:

UUID

log_dataset(dataset: Dataset) None

Logs the passed dataset in registered data.

Parameters:

dataset (Dataset) – New dataset to be logged.

log_features(features: DataFrame) None

Logs dataset features in registered data.

Parameters:

features (pd.DataFrame) – Features dataframe to be registered.

log_labels(labels: Series) None

Logs dataset labels in registered data.

Parameters:

labels (pd.Series) – Labels series to be registered.

log_latents(latent: DataFrame) None

Logs dataset latents in registered data.

Parameters:

latent (pd.DataFrame) – Latents dataframe to be registered.

ref_summary() BaselineSummary

Return a JSON describing the shape of the dataset features, labels and latents.

Returns:

Summary of the dataset shapes.

Return type:

BaselineSummary

register_ref_dataset(features: DataFrame, labels: Series, latents: Optional[DataFrame] = None) None

Registers passed reference data.

Parameters:
  • features (pd.DataFrame) – Reference features to be stored.

  • labels (pd.Series) – Reference labels to be stored.

  • latents (Optional[pd.DataFrame]) – Reference latents to be stored.

property registered_dataset: Dataset

Gets the registered dataset.

Returns:

The registered dataset.

Return type:

Dataset

save_reference_dataset(dataset: Dataset) None

Registers passed reference data.

Parameters:

dataset (Dataset) – Reference dataset to be stored.

learning_machines_drift.filter module

Module with class to filter a dataset.

class Comparison(value)

Bases: Enum

Comparison enum for ‘LESS’, ‘GREATER’ and ‘EQUAL’ cases.

EQUAL = 3
GREATER = 2
LESS = 1

class Condition(comparison_str: str, value: Any)

Bases: object

Condition class comprising a ‘comparison’ and a ‘value’.

comparison: Comparison
value: Any

class Filter(conditions: Optional[dict[str, List[learning_machines_drift.filter.Condition]]])

Bases: object

Filter class.

Filters a given dataset through an AND operation applied across all passed conditions.

conditions: Optional[dict[str, List[learning_machines_drift.filter.Condition]]]

Dict with key (variable) and value as a list of (condition, value) to be used for filtering.

Type:

dict[str, List[Condition]]

transform(dataset: Dataset) Dataset

Transform the passed dataset given filter.

Parameters:

dataset (Dataset) – the dataset to be filtered.

Returns:

transformed dataset given filters.

Return type:

Dataset
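
The AND semantics can be sketched with plain dictionaries standing in for dataset rows. The classes below are simplified, hypothetical stand-ins: the real Condition is constructed from a comparison string, whereas this sketch takes the enum member directly, and whether LESS and GREATER are strict in the package is an assumption.

```python
from enum import Enum
from typing import Any, Dict, List, NamedTuple


class Comparison(Enum):
    LESS = 1
    GREATER = 2
    EQUAL = 3


class Condition(NamedTuple):
    comparison: Comparison
    value: Any


def matches(observed: Any, condition: Condition) -> bool:
    """Check one observed value against one condition (strict inequalities)."""
    if condition.comparison is Comparison.LESS:
        return observed < condition.value
    if condition.comparison is Comparison.GREATER:
        return observed > condition.value
    return observed == condition.value


def filter_rows(
    rows: List[Dict[str, Any]], conditions: Dict[str, List[Condition]]
) -> List[Dict[str, Any]]:
    """Keep only the rows that satisfy every condition on every variable."""
    return [
        row
        for row in rows
        if all(
            matches(row[variable], condition)
            for variable, variable_conditions in conditions.items()
            for condition in variable_conditions
        )
    ]


rows = [
    {"age": 25, "group": "a"},
    {"age": 40, "group": "a"},
    {"age": 40, "group": "b"},
    {"age": 70, "group": "a"},
]
conditions = {
    "age": [Condition(Comparison.GREATER, 30), Condition(Comparison.LESS, 65)],
    "group": [Condition(Comparison.EQUAL, "a")],
}
kept = filter_rows(rows, conditions)
```

Two conditions on the same variable, as with "age" here, express a range; adding a second variable narrows the selection further because every condition must hold.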

learning_machines_drift.monitor module

Monitor class for interacting with data and scoring drift.

class Monitor(tag: str, backend: Optional[Backend] = None)

Bases: object

A class for monitoring data, loading data from the backend and scoring drift with the Metrics class.

tag

The tag where data for monitoring is located within backend.

Type:

str

ref_dataset

The reference dataset.

Type:

Optional[Dataset]

registered_dataset

The logged, registered dataset for drift comparison to reference dataset.

Type:

Optional[Dataset]

load_data(drift_filter: Optional[Filter] = None) Monitor

Load data from backend into monitor.

Parameters:

drift_filter (Filter, optional) – An optional filter with conditions applied to both reference and registered loaded data.

Returns:

The calling Monitor instance with (optionally) filtered datasets loaded.

Return type:

Monitor

property metrics: Metrics

Drift metrics.

Raises:
  • ReferenceDatasetMissing – The reference dataset is None.

  • ValueError – There is no additional registered data.

learning_machines_drift.exceptions module

Exceptions module.

exception ReferenceDatasetMissing

Bases: Exception

Raised when no reference dataset has been logged.

learning_machines_drift.metrics module

Class for scoring drift between reference and registered datasets.

class Metrics(reference_dataset: Dataset, registered_dataset: Dataset, random_state: Optional[int] = None)

Bases: object

A class with metrics for scoring data drift between registered and reference datasets.

reference_dataset

Reference dataset for drift measures.

Type:

Dataset

registered_dataset

Registered/logged dataset for drift measures.

Type:

Dataset

random_state

Optional seeding for reproducibility.

Type:

Optional[int]

get_boundary_adherence() StructuredResult

For each feature the proportion of registered data that lies within the minimum and maximum of the reference dataset.

See SDMetrics for further details.

Returns:

The boundary adherence of the registered dataset compared to the reference dataset.

Return type:

StructuredResult
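
For a single feature, the quantity described above can be sketched in a few lines. This follows the description literally (inclusive bounds taken from the reference data) and is not the SDMetrics implementation; the function name is hypothetical.

```python
from typing import Sequence


def boundary_adherence(
    reference: Sequence[float], registered: Sequence[float]
) -> float:
    """Fraction of registered values inside [min(reference), max(reference)]."""
    low, high = min(reference), max(reference)
    inside = sum(1 for value in registered if low <= value <= high)
    return inside / len(registered)


# Three of the five registered values fall inside the reference range [0, 3].
score = boundary_adherence([0.0, 1.0, 2.0, 3.0], [-1.0, 0.5, 2.5, 3.0, 4.0])
```

A score of 1.0 means no registered value falls outside the range seen in the reference data; lower scores indicate values beyond the reference extremes.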

get_range_coverage() StructuredResult

For each feature the proportion of the range of the registered data that is covered by the reference dataset.

See SDMetrics for further details.

Returns:

The range coverage of the registered dataset compared to the reference dataset.

Return type:

StructuredResult
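
One literal reading of this description, for a single feature, is the length of the overlap between the two value ranges divided by the length of the registered range. The sketch below implements that reading only; the exact formula and the roles of the two datasets in SDMetrics may differ, and the function name is hypothetical.

```python
from typing import Sequence


def range_coverage(reference: Sequence[float], registered: Sequence[float]) -> float:
    """Share of the registered value range that the reference range overlaps."""
    reg_low, reg_high = min(registered), max(registered)
    ref_low, ref_high = min(reference), max(reference)
    reg_range = reg_high - reg_low
    if reg_range == 0:
        raise ValueError("registered data has zero range")
    overlap = min(reg_high, ref_high) - max(reg_low, ref_low)
    # Disjoint ranges give a negative overlap, which is clamped to zero.
    return max(overlap, 0.0) / reg_range


# Registered range is [5, 25]; the reference range [0, 10] overlaps it on [5, 10].
coverage = range_coverage([0.0, 10.0], [5.0, 15.0, 25.0])
```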

logistic_detection(normalize: bool = False, score_type: Optional[str] = None, seed: Optional[int] = None, verbose: bool = True) StructuredResult

Calculates a measure of similarity using a fitted logistic regression to predict whether each sample comes from the reference or the registered dataset. The SDMetrics package source is adapted to permit an optional score_type and seed, allowing alternative and reproducible metrics.

score_type can be:
  • None: defaults to scoring of logistic_detection method.

  • “f1”: Cross-validated F1 score with 0.5 threshold.

  • “roc_auc”: Cross-validated receiver operating characteristic (area under the curve).

Parameters:
  • score_type (Optional[str]) – None for default or string; “f1” and “roc_auc” currently implemented.

  • seed (Optional[int]) – Optional integer for reproducibility of scoring as cross-validation performed.

  • verbose (bool) – Boolean for verbose output to stdout.

Returns:

Score providing an overall similarity measure of reference and registered datasets.

Return type:

StructuredResult

scipy_kolmogorov_smirnov(verbose: bool = True) StructuredResult

Calculates feature-wise two-sample Kolmogorov-Smirnov test for goodness of fit. Assumes continuous underlying distributions but scores are still interpretable if data is approximately continuous.

Parameters:

verbose (bool) – Boolean for verbose output to stdout.

Returns:

Statistics and p-values by feature.

Return type:

StructuredResult
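
To show what the test statistic measures for one feature, the sketch below computes the two-sample Kolmogorov-Smirnov statistic by hand: the largest absolute gap between the two empirical distribution functions. It does not compute a p-value, the function name is hypothetical, and the method itself presumably delegates to SciPy, as its name suggests.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap is attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
identical = ks_statistic([1, 2], [1, 2])
disjoint = ks_statistic([1, 2], [3, 4])
```

The statistic ranges from 0 for identical empirical distributions to 1 for samples that do not overlap at all.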

scipy_mannwhitneyu(verbose: bool = True) StructuredResult

Calculates feature-wise Mann-Whitney U test, a nonparametric test of the null hypothesis that the distribution underlying sample x is the same as the distribution underlying sample y. Provides a test for the difference in location of two distributions. Assumes continuous underlying distributions but scores are still interpretable if data is approximately continuous.

Parameters:

verbose (bool) – Boolean for verbose output to stdout.

Returns:

Statistics and p-values by feature.

Return type:

StructuredResult
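
The U statistic itself can be sketched by direct pair counting. This is an illustration of the definition for one feature rather than the routine the method calls; the function name is hypothetical and no p-value is computed.

```python
from typing import Sequence


def mann_whitney_u(sample_x: Sequence[float], sample_y: Sequence[float]) -> float:
    """U statistic for sample_x: each pair with x > y counts 1, each tie 0.5."""
    u = 0.0
    for x in sample_x:
        for y in sample_y:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


u_x = mann_whitney_u([1, 2, 3], [2, 4])
u_y = mann_whitney_u([2, 4], [1, 2, 3])
```

The two statistics always sum to the number of pairs (here 3 × 2 = 6), so a value near half of that indicates no shift in location between the samples.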

scipy_permutation(agg_func: Callable[[...], float] = <function mean>, verbose: bool = True) StructuredResult

Performs a feature-wise permutation test. By default, the statistic used to measure differences under permutations of labels is the mean.

Parameters:
  • agg_func (Callable[..., float]) – Function for comparing two samples.

  • verbose (bool) – Boolean for verbose output to stdout.

Returns:

Dictionary with keys as features and values as scipy.stats.permutation_test object with test results.

Return type:

StructuredResult
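
A permutation test for one feature can be sketched as below. It assumes the test statistic is the absolute difference between agg_func applied to each sample, which is one plausible reading of the description; the function name, the n_resamples and seed parameters, and the add-one correction are choices made for this sketch, not details taken from the package.

```python
import random
from statistics import mean
from typing import Callable, Optional, Sequence


def permutation_pvalue(
    sample_a: Sequence[float],
    sample_b: Sequence[float],
    agg_func: Callable[..., float] = mean,
    n_resamples: int = 2000,
    seed: Optional[int] = None,
) -> float:
    """Estimate a two-sided p-value by shuffling the pooled samples."""
    rng = random.Random(seed)
    observed = abs(agg_func(sample_a) - agg_func(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        statistic = abs(agg_func(pooled[:n_a]) - agg_func(pooled[n_a:]))
        if statistic >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (n_resamples + 1)


same = permutation_pvalue([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], seed=0)
shifted = permutation_pvalue(list(range(10)), list(range(100, 110)), seed=0)
```

Passing a different agg_func, such as statistics.median, changes which aspect of the two distributions the test is sensitive to.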

class Wrapper(value)

Bases: Enum

Enum for specifying the calculation type.

TYPE_OTHER = 2
TYPE_SDMETRIC = 3
TYPE_TUPLE = 1

learning_machines_drift.types module

Module of drift types.

class BaselineSummary(*, shapes: ShapeSummary)

Bases: BaseModel

Class for storing a shape summary with JSON string representation.

shapes: ShapeSummary

A shape summary instance of a dataset.

Type:

ShapeSummary

class Dataset(features: DataFrame, labels: Series, latents: Optional[DataFrame] = None)

Bases: object

Class for representing a drift dataset.

property feature_names: List[str]

Returns a list of features dataframe columns.

Returns:

A list of feature column names as strings.

Return type:

List[str]

features: DataFrame

A combined dataframe of input features and ground truth labels.

Type:

pd.DataFrame

labels: Series

A series of predicted labels from a model.

Type:

pd.Series

latents: Optional[DataFrame] = None

An optional dataframe of latent variables per sample.

Type:

Optional[pd.DataFrame]

unify() DataFrame

Returns a column-wise concatenated dataframe of features, labels and latents.

Returns:

Column-wise concatenated dataframe of features, labels and latents.

Return type:

pd.DataFrame

class FeatureSummary(*, n_rows: int, n_features: int)

Bases: BaseModel

Provides a summary of a features dataframe.

n_features: int

Number of features (columns).

Type:

int

n_rows: int

Number of samples (rows).

Type:

int

class LabelSummary(*, n_rows: int, n_labels: int)

Bases: BaseModel

Provides a summary of a labels series.

n_labels: int

Number of distinct labels. For example, for binary data, this would be equal to 2.

Type:

int

n_rows: int

Number of samples (rows).

Type:

int

class LatentSummary(*, n_rows: int, n_latents: int)

Bases: BaseModel

Provides a summary of a latents dataframe.

n_latents: int

Number of latent features (columns).

Type:

int

n_rows: int

Number of samples (rows).

Type:

int

class ShapeSummary(*, features: FeatureSummary, labels: LabelSummary, latents: Optional[LatentSummary] = None)

Bases: BaseModel

Provides a summary of the object shapes in a dataset of features, labels and latents.

features: FeatureSummary

Features shape summary.

Type:

FeatureSummary

labels: LabelSummary

Labels shape summary.

Type:

LabelSummary

latents: Optional[LatentSummary]

Optional latents shape summary.

Type:

Optional[LatentSummary]

class StructuredResult(method_name: str, results: Dict[str, Dict[str, float]])

Bases: object

A type for representing a result from the hypothesis tests module.

method_name: str

Name of the scoring method used.

Type:

str

results: Dict[str, Dict[str, float]]

Dictionary of results with keys as feature_name or, if for a unified dataset, “single_value”. Values are a dictionary containing the result statistic and p-value (if available) for a given method_name.

Type:

Dict[str, Dict[str, float]]
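
The nested results dictionary lends itself to simple post-processing. The sketch below uses a hypothetical stand-in dataclass with the two documented attributes and picks out features whose p-value falls below a threshold. The inner key names "statistic" and "pvalue" are assumed from the score_type values accepted by Display.plot, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Result:
    """Hypothetical stand-in mirroring the two documented attributes."""

    method_name: str
    results: Dict[str, Dict[str, float]]


def drifted_features(result: Result, alpha: float = 0.05) -> List[str]:
    """Names of features whose p-value is below alpha, sorted alphabetically."""
    return sorted(
        name
        for name, scores in result.results.items()
        # Entries without a p-value are never flagged.
        if scores.get("pvalue", 1.0) < alpha
    )


# Illustrative values only; not real output from the package.
result = Result(
    method_name="scipy_kolmogorov_smirnov",
    results={
        "age": {"statistic": 0.31, "pvalue": 0.002},
        "income": {"statistic": 0.08, "pvalue": 0.64},
    },
)
flagged = drifted_features(result)
```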

Module contents

Tools for measuring data drift.

class Dataset(features: DataFrame, labels: Series, latents: Optional[DataFrame] = None)

Bases: object

Class for representing a drift dataset.

property feature_names: List[str]

Returns a list of features dataframe columns.

Returns:

A list of feature column names as strings.

Return type:

List[str]

features: DataFrame

A combined dataframe of input features and ground truth labels.

Type:

pd.DataFrame

labels: Series

A series of predicted labels from a model.

Type:

pd.Series

latents: Optional[DataFrame] = None

An optional dataframe of latent variables per sample.

Type:

Optional[pd.DataFrame]

unify() DataFrame

Returns a column-wise concatenated dataframe of features, labels and latents.

Returns:

Column-wise concatenated dataframe of features, labels and latents.

Return type:

pd.DataFrame

class Display

Bases: object

A class for converting a dictionary of drift scores to displayed output.

classmethod plot(result: StructuredResult, score_name: Optional[str] = None, score_type: str = 'pvalue', alpha: float = 0.05) Tuple[Figure, Any]

Plot method for displaying a set of scores on a subplot grid.

Parameters:
  • result (StructuredResult) – Structured result from a drift score measurement.

  • score_type (str) – Either “statistic” or “pvalue”.

  • score_name (str) – Name of score to be plotted and used as plot title.

  • alpha (float) – Value of alpha to be used in p-value plots.

Returns:

tuple of fig and subplot array.

Return type:

Tuple[plt.Figure, Any]

classmethod table(result: StructuredResult, verbose: bool = True) DataFrame

Gets a pandas dataframe and optionally prints a table of results from drift scoring.

Parameters:
  • result (StructuredResult) – Structured result from a drift score measurement.

  • verbose (bool) – Whether to print the table of results.

Returns:

Dataframe of scores.

Return type:

pd.DataFrame

class FileBackend(root_dir: Union[str, Path])

Bases: object

Implements the Backend protocol for writing files to the filesystem.

clear_logged_dataset(tag: str) bool

Delete directory containing logged files.

Parameters:

tag (str) – Path to logged directory.

Returns:

True if the tag/logged path exists, False if it does not.

Return type:

bool

clear_reference_dataset(tag: str) bool

Delete directory containing reference files.

Parameters:

tag (str) – Path to reference directory.

load_logged_dataset(tag: str) Dataset

Return a Dataset from the union of logged data.

Parameters:

tag (str) – Tag identifying dataset.

load_reference_dataset(tag: str) Dataset

Load reference dataset from reference path.

Parameters:

tag (str) – Tag identifying dataset.

save_logged_features(tag: str, identifier: UUID, dataframe: DataFrame) None

Save logged features using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the logged dataset.

  • dataframe (pd.DataFrame) – The dataframe that needs saving.

save_logged_labels(tag: str, identifier: UUID, labels: Series) None

Save logged labels using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the labels of the dataset.

  • labels (pd.Series) – The series of labels that needs saving.

save_logged_latents(tag: str, identifier: UUID, dataframe: Optional[DataFrame]) None

Save optionally passed latents dataframe using tag as the path with UUID prepended to filename.

Parameters:
  • tag (str) – Tag identifying dataset.

  • identifier (UUID) – A unique identifier for the latents of the dataset.

  • dataframe (pd.DataFrame) – The dataframe of latents to be saved.

save_reference_dataset(tag: str, dataset: Dataset) None

Saves passed dataset to backend under tag.

Parameters:
  • tag (str) – A tag for locating the dataset within the backend.

  • dataset (Dataset) – Reference dataset to be saved.

class Filter(conditions: Optional[dict[str, List[learning_machines_drift.filter.Condition]]])

Bases: object

Filter class.

Filters a given dataset through an AND operation applied across all passed conditions.

conditions: Optional[dict[str, List[learning_machines_drift.filter.Condition]]]

Dict with key (variable) and value as a list of (condition, value) to be used for filtering.

Type:

dict[str, List[Condition]]

transform(dataset: Dataset) Dataset

Transform the passed dataset given filter.

Parameters:

dataset (Dataset) – the dataset to be filtered.

Returns:

transformed dataset given filters.

Return type:

Dataset

class Monitor(tag: str, backend: Optional[Backend] = None)

Bases: object

A class for monitoring data, loading data from the backend and scoring drift with the Metrics class.

tag

The tag where data for monitoring is located within backend.

Type:

str

ref_dataset

The reference dataset.

Type:

Optional[Dataset]

registered_dataset

The logged, registered dataset for drift comparison to reference dataset.

Type:

Optional[Dataset]

load_data(drift_filter: Optional[Filter] = None) Monitor

Load data from backend into monitor.

Parameters:

drift_filter (Filter, optional) – An optional filter with conditions applied to both reference and registered loaded data.

Returns:

The calling Monitor instance with (optionally) filtered datasets loaded.

Return type:

Monitor

property metrics: Metrics

Drift metrics.

Raises:
  • ReferenceDatasetMissing – The reference dataset is None.

  • ValueError – There is no additional registered data.

class Registry(tag: str, expect_features: bool = True, expect_labels: bool = True, expect_latent: bool = False, backend: Optional[Backend] = None, clear_logged: bool = False, clear_reference: bool = False)

Bases: object

Registry class for logging datasets.

backend

Optional backend for data.

Type:

Optional[Backend]

tag

Tag identifying dataset.

Type:

str

ref_dataset

Optional reference dataset.

Type:

Optional[Dataset]

registered_features

Optional registered features.

Type:

Optional[pd.DataFrame]

registered_labels

Optional registered labels.

Type:

Optional[pd.Series]

registered_latents

Optional registered latents.

Type:

Optional[pd.Series]

expect_features

Whether features are expected in registry.

Type:

bool

expect_labels

Whether a labels series is expected in registry.

Type:

bool

expect_latent

Whether latents are expected in registry.

Type:

bool

all_registered() bool

Checks whether all expected datasets are registered.

Returns:

True if all expected datasets are registered, False otherwise.

Return type:

bool

property identifier: UUID

Gets the identifier of the registry.

Returns:

The identifier.

Return type:

UUID

log_dataset(dataset: Dataset) None

Logs the passed dataset in registered data.

Parameters:

dataset (Dataset) – New dataset to be logged.

log_features(features: DataFrame) None

Logs dataset features in registered data.

Parameters:

features (pd.DataFrame) – Features dataframe to be registered.

log_labels(labels: Series) None

Logs dataset labels in registered data.

Parameters:

labels (pd.Series) – Labels series to be registered.

log_latents(latent: DataFrame) None

Logs dataset latents in registered data.

Parameters:

latent (pd.DataFrame) – Latents dataframe to be registered.

ref_summary() BaselineSummary

Return a JSON describing the shape of the dataset features, labels and latents.

Returns:

Summary of the dataset shapes.

Return type:

BaselineSummary

register_ref_dataset(features: DataFrame, labels: Series, latents: Optional[DataFrame] = None) None

Registers passed reference data.

Parameters:
  • features (pd.DataFrame) – Reference features to be stored.

  • labels (pd.Series) – Reference labels to be stored.

  • latents (Optional[pd.DataFrame]) – Reference latents to be stored.

property registered_dataset: Dataset

Gets the registered dataset.

Returns:

The registered dataset.

Return type:

Dataset

save_reference_dataset(dataset: Dataset) None

Registers passed reference data.

Parameters:

dataset (Dataset) – Reference dataset to be stored.

class StructuredResult(method_name: str, results: Dict[str, Dict[str, float]])

Bases: object

A type for representing a result from the hypothesis tests module.

method_name: str

Name of the scoring method used.

Type:

str

results: Dict[str, Dict[str, float]]

Dictionary of results with keys as feature_name or, if for a unified dataset, “single_value”. Values are a dictionary containing the result statistic and p-value (if available) for a given method_name.

Type:

Dict[str, Dict[str, float]]