Datagnosis Tutorial 02 - image data
If you prefer, this tutorial is also available on Google Colab.
In this tutorial we will see how to use “hardness characterization method” plugins to calculate hardness scores for images. We will also plot these values and extract some data points based on these scores. For this tutorial we will be using the CIFAR and MNIST datasets from PyTorch.
OK, let's start!
First we import our logger from datagnosis and set the logging level at “INFO”. If something goes wrong and you want to see more detailed logs, you can change the logging level to “DEBUG” or, conversely, if you don’t want to see any logs you can remove them with log.remove().
[1]:
import sys
import datagnosis.logger as log
log.add(sink=sys.stderr, level="INFO")
Load the image dataset. Set the value of the dataset variable to "mnist" or "cifar" to select which one you’d like to use.
[2]:
import torch
from datagnosis.utils.datasets.images.mnist import load_mnist
from datagnosis.utils.datasets.images.cifar import load_cifar
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = "cifar"
model_name = "LeNet"
if dataset == "mnist":
    X_train, y_train, X_test, y_test = load_mnist()
elif dataset == "cifar":
    X_train, y_train, X_test, y_test = load_cifar()
else:
    raise ValueError("Invalid dataset!")
Files already downloaded and verified
Files already downloaded and verified
The next key step is to pass the data to the DataHandler object provided by Datagnosis. This is done by passing the features and the labels separately. The features can be a pandas.DataFrame, numpy.ndarray or torch.Tensor. The labels can be a pandas.Series, numpy.ndarray or torch.Tensor.
Here we’ve limited the number of examples to reduce training time. If the [:100]s are removed it will load the whole dataset. This will affect fitting times. Why not play around with the size of the dataset you pass to the DataHandler and see how that affects the results!
[3]:
from datagnosis.plugins.core.datahandler import DataHandler
# Let's just use the first 100 examples from the dataset
datahandler = DataHandler(X_train[:100], y_train[:100], batch_size=64)
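To experiment with the subset size as suggested above, one option is to draw a reproducible random subset instead of the first 100 rows. The following is a minimal sketch using only the standard library. `sample_indices` is a hypothetical helper (not part of Datagnosis), and the commented usage assumes that `X_train` and `y_train` can be indexed with a list of integers, as NumPy arrays and PyTorch tensors can.

```python
import random


def sample_indices(n_total, n_samples, seed=0):
    """Return n_samples distinct, sorted indices drawn from range(n_total)."""
    if n_samples > n_total:
        raise ValueError("n_samples cannot exceed n_total")
    rng = random.Random(seed)  # local generator, so the global RNG state is untouched
    return sorted(rng.sample(range(n_total), n_samples))


# Example: pick 100 of 50,000 rows (50,000 is an illustrative size only).
subset = sample_indices(n_total=50_000, n_samples=100, seed=42)
print(len(subset), subset[:5])

# Assumed usage with the objects from the cell above:
# datahandler = DataHandler(X_train[subset], y_train[subset], batch_size=64)
```

Fixing the seed keeps the subset identical between runs, which makes it easier to compare scores across different subset sizes or plugins.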
Now we define some values which we will pass to the plugin, such as the model that we want to use to classify the data.
[4]:
import datagnosis.plugins.core.models.image_nets as im_nets
import torch
import torch.nn as nn
# Instantiate the neural network
if dataset == 'cifar':
    if model_name == 'LeNet':
        model = im_nets.LeNet(num_classes=10).to(DEVICE)
    elif model_name == 'ResNet':
        model = im_nets.ResNet18().to(DEVICE)
elif dataset == 'mnist':
    if model_name == 'LeNet':
        model = im_nets.LeNetMNIST(num_classes=10).to(DEVICE)
    elif model_name == 'ResNet':
        model = im_nets.ResNet18MNIST().to(DEVICE)
# Create the optimizer and loss function objects
learning_rate = 0.01
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Import the Plugins object from Datagnosis. Then, by calling list() on it, we can see all the available plugins that we can use.
[5]:
# datagnosis absolute
from datagnosis.plugins import Plugins
Plugins().list()
[5]:
['conf_agree',
'prototypicality',
'aum',
'grand',
'data_iq',
'el2n',
'allsh',
'confident_learning',
'data_maps',
'large_loss',
'vog',
'forgetting']
Now we can call get() to load up a specific plugin from the list and then we call fit() to fit the plugin.
[6]:
hcm = Plugins().get(
    "data_iq",
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr=learning_rate,
    epochs=2,
    num_classes=10,
    logging_interval=1,
)
hcm.fit(
    datahandler=datahandler,
)
[2023-07-24T14:45:03.106610+0100][19092][INFO] Fitting data_iq
[2023-07-24T14:45:04.360566+0100][19092][INFO] Epoch 1/2: Loss=50.3924
[2023-07-24T14:45:04.368485+0100][19092][INFO] Epoch 2/2: Loss=10.1667
Now that the plugin has been fit, we can access the scores. First, let's get a description of the scores, then print them.
[7]:
print("description: ", hcm.score_description())
print("hard direction: ", hcm.hard_direction())
print("\nscores:")
print(hcm.scores)
description: Compute scores returns two scores for this data_iq plugin. The first is the Aleatoric
Uncertainty and the second is the Confidence. Aleatoric uncertainty permits a principled characterization
and then subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard).
Confidence is a measure of the model's confidence in its prediction. High Confidence predictions
define the category `Easy`. Low Confidence scores define `Hard`. High Aleatoric Uncertainty scores define ambiguous.
hard direction: low
scores:
(array([0.09147589, 0.11849582, 0.0835323 , 0.04855546, 0.02332794,
0.08970094, 0.07786956, 0.08399799, 0.11791816, 0.06903086,
0.12531565, 0.10292949, 0.10487543, 0.141856 , 0.06954783,
0.03266784, 0.13237974, 0.10527245, 0.13733722, 0.10353062,
0.10164936, 0.11001916, 0.10047639, 0.0257053 , 0.04987705,
0.14117853, 0.11219541, 0.11800013, 0.10056169, 0.12216834,
0.13285862, 0.0147678 , 0.04241518, 0.0143281 , 0.09451754,
0.14933638, 0.10936733, 0.04254796, 0.01308143, 0.09072555,
0.07707541, 0.13792846, 0.07920272, 0.10007098, 0.06903636,
0.02880334, 0.11424593, 0.09427848, 0.11145239, 0.11049786,
0.07402061, 0.09792005, 0.0070019 , 0.12334572, 0.04296402,
0.00655646, 0.09524225, 0.04669097, 0.10536374, 0.11247582,
0.08786047, 0.12601283, 0.10181613, 0.05603585, 0.09381205,
0.10759981, 0.11060429, 0.09895134, 0.12267244, 0.09632455,
0.10449035, 0.0951715 , 0.09650907, 0.10457175, 0.13437259,
0.08920917, 0.06973697, 0.05712068, 0.10334858, 0.12492844,
0.12519793, 0.08870013, 0.07843236, 0.08399148, 0.10043813,
0.08761903, 0.12944601, 0.1540111 , 0.1048195 , 0.13080277,
0.11425949, 0.11124209, 0.04196164, 0.10048518, 0.06891061,
0.04235439, 0.05105817, 0.11078806, 0.10150963, 0.11170949]), array([0.08310805, 0.10445456, 0.07655465, 0.04619783, 0.02278375,
0.08165468, 0.07180589, 0.07694233, 0.10401346, 0.0642656 ,
0.10961164, 0.09233501, 0.09387657, 0.12173288, 0.06471093,
0.03160065, 0.11485534, 0.09419016, 0.11847571, 0.09281203,
0.09131677, 0.09791495, 0.09038089, 0.02504454, 0.04738933,
0.12124716, 0.0996076 , 0.1040761 , 0.09044904, 0.10724324,
0.11520721, 0.01454971, 0.04061613, 0.0141228 , 0.08558397,
0.12703503, 0.09740611, 0.04073763, 0.0129103 , 0.08249442,
0.07113479, 0.1189042 , 0.07292965, 0.09005678, 0.06427034,
0.02797371, 0.1011938 , 0.08539005, 0.09903075, 0.09828808,
0.06854156, 0.08833172, 0.00695287, 0.10813155, 0.04111812,
0.00651348, 0.08617117, 0.04451092, 0.09426222, 0.09982501,
0.08014101, 0.1101336 , 0.09144961, 0.05289584, 0.08501135,
0.09602209, 0.09837098, 0.08915997, 0.10762391, 0.08704613,
0.09357211, 0.08611388, 0.08719507, 0.0936365 , 0.1163166 ,
0.08125089, 0.06487373, 0.05385791, 0.09266765, 0.10932132,
0.10952341, 0.08083242, 0.07228072, 0.07693691, 0.09035031,
0.07994193, 0.11268974, 0.13029168, 0.09383237, 0.1136934 ,
0.10120426, 0.09886729, 0.04020086, 0.09038791, 0.06416193,
0.04056049, 0.04845123, 0.09851407, 0.09120543, 0.09923048]))
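To make the two scores in the description above more concrete, here is a minimal sketch of how Confidence and Aleatoric Uncertainty are commonly defined in the Data-IQ formulation: for each example, take the probability p that the model assigns to the true label at each training epoch, then average p for Confidence and p * (1 - p) for Aleatoric Uncertainty. This illustrates that definition only and is not the plugin's actual implementation. `data_iq_scores` and the input probabilities are hypothetical.

```python
from statistics import mean


def data_iq_scores(true_class_probs):
    """Compute per-example (confidence, aleatoric uncertainty).

    true_class_probs[e][i] is the probability assigned to the true label
    of example i at epoch e.
    """
    n_examples = len(true_class_probs[0])
    confidence, aleatoric = [], []
    for i in range(n_examples):
        per_epoch = [epoch[i] for epoch in true_class_probs]
        confidence.append(mean(per_epoch))
        aleatoric.append(mean(p * (1.0 - p) for p in per_epoch))
    return confidence, aleatoric


# Made-up probabilities for 2 examples over 3 epochs (illustration only).
probs = [
    [0.2, 0.9],  # epoch 1
    [0.4, 0.9],  # epoch 2
    [0.6, 0.9],  # epoch 3
]
confidence, aleatoric = data_iq_scores(probs)
print(confidence)
print(aleatoric)
```

Under this definition the aleatoric term is largest when the true-label probability sits near 0.5 and shrinks towards zero as the model becomes consistently right or consistently wrong.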
The printed scores are difficult to digest, so now we will plot them instead.
[8]:
print(hcm.score_names)
hcm.plot_scores(axis=1)
[2023-07-24T14:45:04.391648+0100][19092][INFO] Plotting data_iq scores
('Confidence', 'Aleatoric Uncertainty')
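The score description printed earlier groups examples into Easy, Ambiguous and Hard based on these two scores. A minimal sketch of such a grouping is given below. The threshold values and the `stratify` helper are assumptions chosen for illustration; they are not values or functions taken from Datagnosis.

```python
def stratify(confidence, aleatoric, conf_low=0.25, conf_high=0.75, aleatoric_cut=0.2):
    """Assign each example to 'Easy', 'Ambiguous' or 'Hard'.

    All three thresholds are illustrative assumptions.
    """
    groups = []
    for conf, unc in zip(confidence, aleatoric):
        if unc >= aleatoric_cut:
            groups.append("Ambiguous")  # high aleatoric uncertainty
        elif conf >= conf_high:
            groups.append("Easy")       # high confidence
        elif conf <= conf_low:
            groups.append("Hard")       # low confidence
        else:
            # Mid-range confidence with low uncertainty: treated as
            # Ambiguous here, which is an assumption of this sketch.
            groups.append("Ambiguous")
    return groups


# Made-up scores for three examples (illustration only).
groups = stratify(confidence=[0.95, 0.10, 0.50], aleatoric=[0.04, 0.08, 0.24])
print(groups)  # ['Easy', 'Hard', 'Ambiguous']
```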
Finally the extract_datapoints method can be used to select data based on the hcm score. Available methods for extract include "top_n", "threshold" and "index". Give them all a go!
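To illustrate what the "top_n" and "threshold" selections mean when the hard direction is "low", here is a minimal sketch on a handful of scores copied from the first array printed above. `hardest_top_n` and `below_threshold` are hypothetical helpers written for this example; they are not the library's implementation of extract_datapoints.

```python
def hardest_top_n(scores, n, hard_direction="low"):
    """Return the indices of the n hardest examples, hardest first."""
    reverse = hard_direction == "high"
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=reverse)
    return order[:n]


def below_threshold(scores, threshold):
    """Return the indices whose score is strictly below the threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]


# Scores for examples 50-55, copied from the first printed array.
scores = [0.07402061, 0.09792005, 0.0070019, 0.12334572, 0.04296402, 0.00655646]
offset = 50

print([offset + i for i in hardest_top_n(scores, n=2)])     # [55, 52]
print([offset + i for i in below_threshold(scores, 0.05)])  # [52, 54, 55]
```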
The following cell takes the hardest data point, summarises it in a pandas.DataFrame and displays the corresponding image.
[9]:
import pandas as pd
import matplotlib.pyplot as plt
hardest_image = hcm.extract_datapoints(method="top_n", n=1)
display(pd.DataFrame(
    data={
        "indices": hardest_image[0][2],
        "labels": hardest_image[0][1],
        "scores": hardest_image[1],
    }
))
# Move channels last, (C, H, W) -> (H, W, C), as expected by imshow
plt.imshow((hardest_image[0][0][0]/256).permute(1, 2, 0))
[2023-07-24T14:45:04.532444+0100][19092][INFO] Sorting extracted datapoints
[2023-07-24T14:45:04.533156+0100][19092][INFO] [55]
[2023-07-24T14:45:04.533594+0100][19092][INFO] [55]
| | indices | labels | scores |
|---|---|---|---|
| 0 | 55 | 2 | 0.006556 |
[9]:
<matplotlib.image.AxesImage at 0x7f7516b31210>