dataquality_utils

qoa4ml.utils.dataquality_utils

Functions

eva_duplicate(data)

Evaluate and return the number and percentage of duplicate entries in the data.

Parameters:

data : numpy.ndarray or pandas.DataFrame
    Input data to be evaluated.

Returns:

dict or None
    A dictionary containing the following keys if successful:
    - DataQualityEnum.duplicate_ratio: Percentage of duplicate data.
    - DataQualityEnum.total_duplicate: Total number of duplicate entries.
    Returns None if the input data type is unsupported or if an exception occurs.
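As a rough illustration of the two metrics described above, the sketch below counts duplicate rows with pandas. The helper name `duplicate_stats` and the plain string keys are hypothetical stand-ins for the `DataQualityEnum` members. It assumes that a row counts as a duplicate only when it repeats an earlier row, and that the ratio is scaled to 0-100. The library may differ on both points.

```python
import pandas as pd


def duplicate_stats(df: pd.DataFrame) -> dict:
    """Hypothetical helper: count rows that repeat an earlier row."""
    # duplicated() marks every repeat of an earlier row as True
    total_duplicate = int(df.duplicated().sum())
    # Assumption: ratio expressed as a percentage of all rows
    duplicate_ratio = 100.0 * total_duplicate / len(df) if len(df) else 0.0
    return {
        "total_duplicate": total_duplicate,
        "duplicate_ratio": duplicate_ratio,
    }


df = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})
stats = duplicate_stats(df)
print(stats)
```

Here only the second row repeats an earlier one, so one of four rows is flagged.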

eva_erronous(data, errors=None)

Evaluate and return the number and percentage of erroneous data entries.

Parameters:

data : numpy.ndarray or pandas.DataFrame
    Input data to be evaluated.
errors : list, optional
    List of items considered as errors. If not provided, NaNs will be considered as errors.

Returns:

dict or None
    A dictionary containing the following keys if successful:
    - DataQualityEnum.total_errors: Total number of errors.
    - DataQualityEnum.error_ratios: Percentage of errors.
    Returns None if the input data type is unsupported or if an exception occurs.
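The effect of the `errors` argument can be sketched with pandas as follows. The helper `error_stats` and its string keys are hypothetical. The sketch assumes errors are counted per cell and that the ratio is a 0-100 percentage of all cells, neither of which this reference confirms.

```python
import numpy as np
import pandas as pd


def error_stats(df: pd.DataFrame, errors=None) -> dict:
    """Hypothetical helper: count cells treated as erroneous."""
    if errors is None:
        # Default described above: only NaN cells count as errors
        mask = df.isna()
    else:
        # Otherwise a cell is an error if its value is in the list
        mask = df.isin(errors)
    total_errors = int(mask.to_numpy().sum())
    # Assumption: ratio is a percentage of all cells
    ratio = 100.0 * total_errors / df.size if df.size else 0.0
    return {"total_errors": total_errors, "error_ratios": ratio}


df = pd.DataFrame(
    {"temp": [21.5, -999.0, np.nan], "humidity": [40.0, 55.0, -999.0]}
)
default_stats = error_stats(df)                  # NaN cells only
custom_stats = error_stats(df, errors=[-999.0])  # sentinel values only
print(default_stats, custom_stats)
```

Note that in this sketch a custom `errors` list replaces the NaN check rather than extending it, following the wording "If not provided".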

eva_input_file_type(input_file, allowed_data_type)

Check if the input file matches any of the allowed data types.

Parameters:

input_file : UploadFile
    The uploaded file object to be checked for data type.
allowed_data_type : List[str]
    A list of allowed data types to compare against the content type of the input file.

Returns:

bool
    True if the content type of the input file is in the list of allowed data types, otherwise False.
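The check described above amounts to a membership test on the file's content type. The sketch below uses a hypothetical `FakeUpload` class in place of a real `UploadFile`, assuming only that the uploaded object exposes a `content_type` string. The helper name `is_allowed_type` is also illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeUpload:
    """Hypothetical stand-in exposing only a content_type attribute."""
    content_type: str


def is_allowed_type(input_file, allowed_data_type: List[str]) -> bool:
    """Return True when the file's content type is in the allowed list."""
    return input_file.content_type in allowed_data_type


allowed = ["image/png", "image/jpeg"]
png_ok = is_allowed_type(FakeUpload("image/png"), allowed)
csv_ok = is_allowed_type(FakeUpload("text/csv"), allowed)
print(png_ok, csv_ok)
```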

eva_missing(data, null_count=True, correlations=False, predict=False)

Evaluate and return statistics about missing data in the dataset.

Parameters:

data : numpy.ndarray or pandas.DataFrame
    Input data to be evaluated.
null_count : bool, default=True
    If True, return the count of missing values in each column.
correlations : bool, default=False
    If True, return the correlation matrix of missing values.
predict : bool, default=False
    If True, enable missing data prediction (not implemented).

Returns:

dict or None
    A dictionary containing:
    - DataQualityEnum.null_count: Count of missing values (if null_count is True).
    - DataQualityEnum.null_correlations: Correlation matrix of missing values (if correlations is True).
    Returns None if the input data type is unsupported or if an exception occurs.
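One plausible reading of "correlation matrix of missing values" is the correlation between per-column missingness indicators. The sketch below takes that reading as an assumption. The helper `missing_stats` and its string keys are hypothetical, and the `predict` flag is omitted because the reference marks it as not implemented.

```python
import numpy as np
import pandas as pd


def missing_stats(df: pd.DataFrame, null_count=True, correlations=False) -> dict:
    """Hypothetical helper: per-column null counts and nullity correlations."""
    result = {}
    if null_count:
        # Number of missing cells in each column
        result["null_count"] = df.isna().sum().to_dict()
    if correlations:
        # 1 where a cell is missing, 0 otherwise
        indicator = df.isna().astype(int)
        # Drop columns that are never or always missing (zero variance)
        varying = indicator.loc[:, indicator.nunique() > 1]
        result["null_correlations"] = varying.corr()
    return result


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [np.nan, 2.0, np.nan, 4.0],
        "c": [1.0, np.nan, 3.0, 4.0],
    }
)
stats = missing_stats(df, correlations=True)
print(stats["null_count"])
print(stats["null_correlations"])
```

In this sample, columns `a` and `b` are missing in exactly opposite rows, so their indicators are perfectly negatively correlated.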

eva_none(data)

Evaluate and return statistics about valid and None (NaN) values in the dataset.

Parameters:

data : numpy.ndarray or pandas.DataFrame
    Input data to be evaluated.

Returns:

dict or None
    A dictionary containing the following keys if successful:
    - DataQualityEnum.total_valid: Total count of valid (non-NaN) entries.
    - DataQualityEnum.total_none: Total count of None (NaN) entries.
    - DataQualityEnum.none_ratio: Percentage of valid entries.
    Returns None if the input data type is unsupported or if an exception occurs.
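The counts described above can be approximated with pandas as in the sketch below. The helper `none_stats` and all string keys are hypothetical. The reference lists `none_ratio` as the percentage of valid entries. Since the key name could also be read as the share of missing entries, the sketch reports both shares under separate keys rather than guessing which one the library stores.

```python
import numpy as np
import pandas as pd


def none_stats(data) -> dict:
    """Hypothetical helper: count valid and missing cells."""
    # Accept either an ndarray or a DataFrame
    df = pd.DataFrame(data)
    total_none = int(df.isna().to_numpy().sum())
    total_valid = int(df.size - total_none)
    total = total_valid + total_none
    return {
        "total_valid": total_valid,
        "total_none": total_none,
        # Assumption: both shares scaled to 0-100
        "valid_percent": 100.0 * total_valid / total if total else 0.0,
        "none_percent": 100.0 * total_none / total if total else 0.0,
    }


arr = np.array([[1.0, np.nan], [np.nan, 4.0], [5.0, 6.0]])
stats = none_stats(arr)
print(stats)
```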

image_quality(input_image)

Assess various quality metrics of an input image.

Parameters:

input_image : bytes or np.ndarray
    The input image in either byte format or as a numpy array.

Returns:

dict
    A dictionary containing the following keys:
    - ImageQualityNameEnum.image_size: The size of the image (width, height).
    - ImageQualityNameEnum.color_mode: The color mode of the image (e.g., 'RGB').
    - ImageQualityNameEnum.color_channel: The number of color channels in the image.
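For the array case only, the three fields above can be derived from the array shape, as in this sketch. The helper `describe_array_image`, the string keys, and the channel-to-mode mapping (using Pillow-style mode names) are all assumptions. How the library decodes `bytes` input is not covered by this reference and is not attempted here.

```python
import numpy as np


def describe_array_image(img: np.ndarray) -> dict:
    """Hypothetical helper: basic properties of an image stored as an array."""
    # Arrays are assumed to be laid out as (height, width[, channels])
    height, width = img.shape[:2]
    channels = 1 if img.ndim == 2 else img.shape[2]
    # Assumed mapping from channel count to a Pillow-style mode name
    mode = {1: "L", 3: "RGB", 4: "RGBA"}.get(channels, "unknown")
    return {
        "image_size": (width, height),
        "color_mode": mode,
        "color_channel": channels,
    }


img = np.zeros((480, 640, 3), dtype=np.uint8)
info = describe_array_image(img)
print(info)
```

Note the order swap: NumPy shapes put height first, while the documented `image_size` is (width, height).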