dataquality_utils
qoa4ml.utils.dataquality_utils
Functions
eva_duplicate(data)
Evaluate and return the number and percentage of duplicate entries in the data.
Parameters:
data : numpy.ndarray or pandas.DataFrame
Input data to be evaluated.
Returns:
dict or None
A dictionary containing the following keys if successful:
- DataQualityEnum.DUPLICATE_RATIO: Percentage of duplicate data.
- DataQualityEnum.TOTAL_DUPLICATE: Total number of duplicate entries.
Returns None if the input data type is unsupported or if an exception occurs.
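The two values described above can be illustrated with a minimal plain-Python sketch. It is not the qoa4ml implementation: `duplicate_stats` and the string keys are hypothetical stand-ins for `eva_duplicate` and the `DataQualityEnum` members, and it assumes a duplicate is any row that repeats an earlier row.

```python
from collections import Counter


def duplicate_stats(rows):
    """Hypothetical stand-in for eva_duplicate; not the library code."""
    rows = [tuple(r) for r in rows]  # make rows hashable
    total = len(rows)
    # Assumption: every occurrence after the first counts as a duplicate.
    duplicates = sum(n - 1 for n in Counter(rows).values())
    ratio = 100 * duplicates / total if total else 0.0
    return {"total_duplicate": duplicates, "duplicate_ratio": ratio}


stats = duplicate_stats([(1, "a"), (2, "b"), (1, "a"), (1, "a")])
print(stats)
```

With four rows, two of which repeat the first, the sketch reports 2 duplicates and a ratio of 50.0.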
eva_erronous(data, errors=None)
Evaluate and return the number and percentage of erroneous data entries.
Parameters:
data : numpy.ndarray or pandas.DataFrame
Input data to be evaluated.
errors : list, optional
List of items considered as errors. If not provided, NaNs will be considered as errors.
Returns:
dict or None
A dictionary containing the following keys if successful:
- DataQualityEnum.TOTAL_ERRORS: Total number of errors.
- DataQualityEnum.ERROR_RATIOS: Percentage of errors.
Returns None if the input data type is unsupported or if an exception occurs.
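A rough sketch of the `errors` fallback described above, in plain Python. The helper name `error_stats` and the string keys are hypothetical. Whether the library also counts NaNs when an explicit `errors` list is supplied is not stated here, so this sketch assumes it does not.

```python
import math


def error_stats(values, errors=None):
    """Hypothetical stand-in for eva_erronous; not the library code."""

    def is_error(v):
        if errors is None:
            # Default: only NaN values are treated as errors.
            return isinstance(v, float) and math.isnan(v)
        # Assumption: with an explicit list, only listed items count.
        return v in errors

    total = len(values)
    n_errors = sum(1 for v in values if is_error(v))
    ratio = 100 * n_errors / total if total else 0.0
    return {"total_errors": n_errors, "error_ratios": ratio}


default_stats = error_stats([1.0, float("nan"), 3.0, float("nan")])
custom_stats = error_stats([1, -999, 3, -999, 5], errors=[-999])
```

Here `default_stats` counts the two NaNs, while `custom_stats` counts the two `-999` sentinel values out of five entries.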
eva_input_file_type(input_file, allowed_data_type)
Check whether the input file matches any of the allowed data types.
Parameters:
input_file : UploadFile
The uploaded file object to be checked for data type.
allowed_data_type : List[str]
A list of allowed data types to compare against the content type of the input file.
Returns:
bool
True if the content type of the input file is in the list of allowed data types, otherwise False.
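The membership check described above might look like the following sketch. `FakeUpload` is a hypothetical stand-in for an upload object that exposes a `content_type` attribute, and `is_allowed` is an illustrative name, not the library function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeUpload:
    """Hypothetical stand-in for an uploaded file with a content type."""

    content_type: str


def is_allowed(input_file, allowed_data_type: List[str]) -> bool:
    # True only when the file's content type appears in the allow-list.
    return input_file.content_type in allowed_data_type


allowed = ["text/csv", "application/json"]
csv_ok = is_allowed(FakeUpload("text/csv"), allowed)
png_ok = is_allowed(FakeUpload("image/png"), allowed)
```

In this example `csv_ok` is True and `png_ok` is False.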
eva_missing(data, null_count=True, correlations=False, predict=False)
Evaluate and return statistics about missing data in the dataset.
Parameters:
data : numpy.ndarray or pandas.DataFrame
Input data to be evaluated.
null_count : bool, default=True
If True, return the count of missing values in each column.
correlations : bool, default=False
If True, return the correlation matrix of missing values.
predict : bool, default=False
If True, enable missing data prediction (not implemented).
Returns:
dict or None
A dictionary containing:
- DataQualityEnum.NULL_COUNT: Count of missing values (if null_count is True).
- DataQualityEnum.NULL_CORRELATIONS: Correlation matrix of missing values (if correlations is True).
Returns None if the input data type is unsupported or if an exception occurs.
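To make the per-column null count concrete, here is a small plain-Python sketch. It covers only the `null_count` part, not the correlation matrix. The name `null_count_per_column` and the dict-of-lists input are assumptions for illustration, and the sketch treats both `None` and NaN as missing.

```python
import math


def is_missing(value):
    # Assumption: both None and float NaN count as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def null_count_per_column(columns):
    """Hypothetical helper: count missing values in each named column."""
    return {
        name: sum(1 for v in values if is_missing(v))
        for name, values in columns.items()
    }


counts = null_count_per_column(
    {
        "temp": [20.5, None, 21.0],
        "humidity": [float("nan"), None, 0.4],
    }
)
print(counts)
```

The `temp` column has one missing value and `humidity` has two.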
eva_none(data)
Evaluate and return statistics about valid and None (NaN) values in the dataset.
Parameters:
data : numpy.ndarray or pandas.DataFrame
Input data to be evaluated.
Returns:
dict or None
A dictionary containing the following keys if successful:
- DataQualityEnum.TOTAL_VALID: Total count of valid (non-NaN) entries.
- DataQualityEnum.TOTAL_NONE: Total count of None (NaN) entries.
- DataQualityEnum.NONE_RATIO: Percentage of None (NaN) entries, computed as 100 * none_count / total (0.0 when the dataset is empty). The field name is authoritative: previous versions accidentally computed the valid ratio.
Returns None if the input data type is unsupported or if an exception occurs.
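The ratio formula above, including the empty-dataset case, can be checked with a short sketch. `none_stats` and the string keys are hypothetical stand-ins for `eva_none` and the `DataQualityEnum` members, and the sketch works on a flat list rather than an array or DataFrame.

```python
import math


def none_stats(values):
    """Hypothetical stand-in for eva_none; not the library code."""
    total = len(values)
    none_count = sum(
        1
        for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    # NONE_RATIO is the share of missing entries, not of valid ones.
    none_ratio = 100 * none_count / total if total else 0.0
    return {
        "total_valid": total - none_count,
        "total_none": none_count,
        "none_ratio": none_ratio,
    }


stats = none_stats([1.0, None, float("nan"), 4.0])
empty_stats = none_stats([])
```

Two of the four entries are missing, so `none_ratio` is 50.0. For the empty list the ratio is 0.0 and no division takes place.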
image_quality(input_image)
Assess various quality metrics of an input image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_image | bytes or ndarray | The input image in either byte format or as a numpy array. | required |
Returns:
| Type | Description |
|---|---|
| dict | A dictionary keyed by |
Raises:
| Type | Description |
|---|---|
| TypeError | If |