How to get val loss in 3.x? #9904
Sorry, the val loss calculation in version 3.x is not yet supported. We will support it in the next few releases. |
Any updates on this @RangiLyu ? |
Why doesn't mmdetection 3.x include validation loss? Has it been removed for a specific reason? This is a critical feature because, without validation loss, we cannot assess whether a model is overfitting or generalizing. Does mmdetection suggest any alternative methods for addressing this? I'm feeling concerned and confused because I couldn't find anything related to this issue in the documentation. |
No, validation loss is not supported yet. The only way to check for overfitting is to compare mAP scores on the training and validation data. There is a way to get validation loss, but it's more of a hack that involves creating hooks in the pipeline. |
Anything more on the hacking with the hooks, you can point me to? |
no update yet? |
@abdksyed could you share how to log mAP on the training dataset while training? |
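# Imports assumed from the names used below
import os
import shutil

import cv2
import numpy as np
import torch
import wandb
from mmdet.apis import inference_detector, init_detector
from mmengine.hooks import Hook
from moviepy.editor import VideoFileClip
from pycocotools.coco import COCO
from skimage import io
from torchmetrics.classification import BinaryJaccardIndex


class FindIoU(Hook):
    def __init__(self, name):
        os.makedirs("bestepochs", exist_ok=True)
        # Some necessary variables for me
        self.bestIoU = 0
        self.bestepoch = None
        self.name = name
        self.metric = BinaryJaccardIndex()
        # RGB format
        self.CLS2COLOR = {
            1: (228, 0, 120),  # Red
            2: (42, 82, 190),  # Blue
            3: (3, 192, 60),   # Green
        }
        # define our custom x axis metric
        wandb.define_metric("coco/epoch")
        # define which metrics will be plotted against it
        wandb.define_metric("coco/pGen1IoU", step_metric="coco/epoch", step_sync=False)
        wandb.define_metric("coco/pGen2IoU", step_metric="coco/epoch", step_sync=False)
        wandb.define_metric("coco/meanIoU", step_metric="coco/epoch", step_sync=False)
        self.artifact = wandb.Artifact(self.name, type='model')
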
    def after_val(self, runner, **kwargs):
        IoUs = []
        checkpoint_file = runner.work_dir + f"/epoch_{runner.epoch}.pth"
        model = init_detector(runner.cfg, checkpoint_file, device='cuda:0')
        meanIoU = []
        val_file = runner.cfg.val_dataloader.dataset.ann_file
        test_file = runner.cfg.test_dataloader.dataset.ann_file
        for f_type, json_path in zip(['pGen1', 'pGen2'], [val_file, test_file]):
            # json_path = f"{data_type}.json"
            coco = COCO(json_path)
            img_dir = f"combined_data"
            cat_ids = coco.getCatIds()
            frames = {}
            for idx, img_data in coco.imgs.items():
                anns_ids = coco.getAnnIds(imgIds=img_data['id'], catIds=cat_ids, iscrowd=None)
                anns = coco.loadAnns(anns_ids)
                # Merge all annotation masks of the image into one binary mask
                truth_mask = coco.annToMask(anns[0])
                for i in range(1, len(anns)):
                    truth_mask = np.maximum(truth_mask, coco.annToMask(anns[i])*1)
                img = f'combined_data/{img_data["file_name"]}'  # or img = mmcv.imread(img), which will only load it once
                result = inference_detector(model, img)
                # outputs = predictor(im)
                # Merge all predicted instance masks into one binary mask
                pred_mask = np.zeros_like(truth_mask)
                for i in result.pred_instances.masks.type(torch.int8):
                    pred_mask = np.maximum(pred_mask, i.to('cpu').numpy().astype(np.uint8))
                # frame = label2rgb(pred_mask, cv2.imread(img), alpha=0.3, bg_label=0)*255
                target = torch.tensor(truth_mask)
                preds = torch.tensor(pred_mask)
                intersection_mask = np.logical_and(pred_mask == 1, truth_mask == 1)
                pred_mask[truth_mask == 1] = 2
                pred_mask[intersection_mask] = 3
                # Repeating Channels to make it three channels
                pred_mask = np.tile(pred_mask[..., np.newaxis], (1, 1, 3))
                # red -> Wrong Predicted, blue -> Ground Truth, green -> Correct Predicted
                frame = io.imread(img)
                for color_id in range(1, 4):
                    mask = np.where(pred_mask == (color_id,)*3, self.CLS2COLOR[color_id], 0).astype('uint8')
                    frame = cv2.addWeighted(frame, 1.0, mask, 0.5, 0)
                frames[img_data["file_name"]] = frame
                IoUs.append(self.metric(preds, target).item())
            size1, size2, _ = frame.shape
            out = cv2.VideoWriter('output.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 1, (size2, size1), True)
            # Sorting the frames according to frame number eg: p3_frame_000530..PNG
            for _, i in sorted(frames.items(), key=lambda x: x[0]):
                out_img = cv2.cvtColor(i, cv2.COLOR_BGR2RGB)
                out.write(out_img)  # assumed: the write call is missing from the post
            out.release()  # assumed: needed to finalize the file before re-encoding
            # Convert MPV4 codec to libx264 codec
            input_file = 'output.mp4'
            output_file = f_type+'.mp4'
            clip = VideoFileClip(input_file)
            clip.write_videofile(output_file, codec='libx264')
            # Collect all meanIoUs for all Generalization Patients
            meanIoU.append(sum(IoUs)/len(IoUs))  # assumed: the append is missing from the post
            print(f"IoU: {sum(IoUs)/len(IoUs)}")
            # axes are (time, channel, height, width)
            wandb.log({f"{self.name}_{f_type}_epoch_{runner.epoch}": wandb.Video(output_file)})
        for IoU, log in zip(meanIoU, ['pGen1', 'pGen2']):
            wandb.log({f'coco/{log}': IoU, 'coco/epoch': runner.epoch})
        meanIoU = sum(meanIoU)/len(meanIoU)
        if meanIoU > self.bestIoU:
            self.bestIoU = meanIoU
            self.bestepoch = checkpoint_file
        print(f"meanIoU: {meanIoU}")
        wandb.log({'coco/iou': meanIoU, 'coco/epoch': runner.epoch})
        print(f"Saving checkpoint of epoch {runner.epoch} to wandb")
        self.artifact.add_file(checkpoint_file, name=f'epoch_{runner.epoch}.pth')
        # wandb.log_artifact(self.artifact)

    def after_run(self, runner, **kwargs):
        shutil.copy(self.bestepoch, f"bestepochs/{self.name}.pth")
        print(f"Saving best checkpoint to wandb")
        self.artifact.add_file(self.bestepoch, name=f"best.pth")
        wandb.log_artifact(self.artifact)

This is a hook I implemented to find IoU values after each epoch. Here I run inference and take the predicted mask to compute the IoU against the ground-truth mask, and I also create videos of the frames and save them to Weights & Biases. You can change the logic of the code, but the function names will be the same for you. There is some inefficiency: I run inference again on the validation/test data to get the IoU, even though inference on the validation data is already done by default during training to get the mAP values. I couldn't find how to get the results of the validation that was already performed, so I had to run inference again. |
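On the point about re-running inference: as far as I know, mmengine passes the evaluator's metrics dict to the `after_val_epoch` hook method, so metrics that are already computed during validation can be read there without a second pass. The following is a minimal sketch under that assumption, not a drop-in replacement for the hook above. The class name `BestMetricHook` and the metric key `coco/segm_mAP` are hypothetical (the key depends on the evaluator you configure), and the `Hook = object` fallback is only there so the sketch runs without mmengine installed.

```python
from types import SimpleNamespace

try:
    from mmengine.hooks import Hook  # real base class when mmengine is installed
except ImportError:  # fallback so the sketch can run stand-alone
    Hook = object


class BestMetricHook(Hook):
    """Track the best value of one validation metric without re-running inference."""

    def __init__(self, key="coco/segm_mAP"):
        self.key = key  # assumed metric name; depends on the evaluator in use
        self.best_value = float("-inf")
        self.best_epoch = None

    def after_val_epoch(self, runner, metrics=None):
        # `metrics` is the results dict produced by the validation evaluator
        if not metrics or self.key not in metrics:
            return
        value = metrics[self.key]
        if value > self.best_value:
            self.best_value = value
            self.best_epoch = runner.epoch


# Stand-alone demonstration with a fake runner object
hook = BestMetricHook()
for epoch, score in enumerate([0.31, 0.42, 0.40], start=1):
    hook.after_val_epoch(SimpleNamespace(epoch=epoch), {"coco/segm_mAP": score})
print(hook.best_epoch, hook.best_value)  # prints: 2 0.42
```

This only covers metrics the evaluator already reports; a custom per-image IoU like the one above would still need access to the per-batch outputs.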
@abdksyed thanks for sharing this code. This appears to be a way to compute mask IoU loss over the validation set. You mentioned that there's a way to get the mAP on the train set as well:
It appears that the same approach of a custom hook could be used in the docs link you provided above. Perhaps something with |
Yes, for train loss, you can use |
Does anyone have an example script that gets the validation loss using the hook approach? |
Same question, can anyone share how to get the validation loss using the hook approach? |
This would be a very useful feature and would appreciate an update on this @RangiLyu . |
Any update on the feature? |
this may be useful |
The loss computation seems to be more on the mmengine side. |
@Ileal16 this is now included in v0.10.5 of MMEngine. However, as explained here, MMDetection will need to update the … You can work around this by creating a custom model (likely by overriding whichever MMDet model you're using) and appending the loss to the end of its … Upgrading MMEngine to the latest version and adding this workaround should enable validation loss without a second forward pass. Ideally, MMDetection would add this to all of its default models, so this feature is available out of the box. |
Thank you all for your attention to this issue. Recently, I found that mmcls 0.25.0 involves validation loss. In short, it's about putting model.train() in model.eval(), and then calculating the loss as in forward_train. |
hey, guys. I did it. Perhaps my implementation is quite simple and immature, but I hope it can serve as a reference for everyone. If you encounter any bugs while using my method, I would appreciate your feedback (I don’t know how to insert images in the text, and the images I have contain Chinese, so I’ll just describe the modifications in text form). The versions of mmdetection and other related libraries I’m using are as follows: The following instructions assume you are constructing the project from source code. Core Idea:
1. Adding Loss Calculation in the run_iter Part of ValLoop: |
Hey everyone, I just came across this thread and want to share what I do :) TLDR: I create an extra dataloader for val loss calculations and call a modified model.train_step (without gradient updates) from a hook.

Create an extra dataloader in Runner

import copy
from itertools import cycle

from mmengine.registry import RUNNERS
from mmengine.runner import Runner


@RUNNERS.register_module()  # assumed: needed so that runner_type='CustomRunner' can be resolved
class CustomRunner(Runner):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.val_loss_dl = CustomRunner.build_val_loss_dl(self._train_dataloader, self._val_dataloader)

    @staticmethod
    def build_val_loss_dl(train_dataloader, val_dataloader):
        # have to be uninitialized (still config dicts, not built dataloaders)
        assert isinstance(train_dataloader, dict)
        assert isinstance(val_dataloader, dict)
        # ensure val dataloader for loss calculation uses same sampler/pipeline as train dl (just switch anns and imgs)
        dl = copy.deepcopy(train_dataloader)
        dl['dataset']['ann_file'] = copy.deepcopy(val_dataloader['dataset']['ann_file'])
        dl['dataset']['data_prefix'] = copy.deepcopy(val_dataloader['dataset']['data_prefix'])
        dl = CustomRunner.build_dataloader(dl)
        return cycle(dl)

In a custom logger hook run train_step without gradient updates

Add to config with … Note that the …

from contextlib import nullcontext
import torch
from mmengine.dist import all_reduce_dict
from mmengine.hooks import Hook
from mmengine.hooks.hook import DATA_BATCH
from mmengine.model import is_model_wrapper
from mmengine.registry import HOOKS
from mmengine.runner.amp import autocast


@HOOKS.register_module()  # assumed: needed so the config can refer to the hook by name
class CustomLoggerHook(Hook):
    priority = 'BELOW_NORMAL'

    def __init__(self, interval: int = 100):
        self.interval = interval

    def before_train_iter(self,
                          runner,
                          batch_idx: int,
                          data_batch: DATA_BATCH = None) -> None:
        # see base class Hook for different ways to check intervals and change to your needs
        if self.every_n_train_iters(runner=runner, n=self.interval):
            outputs = self._get_loss_on_val_batch(runner)  # losses on val data
            all_reduce_dict(outputs, op='mean')
            val_loss = {}
            for k, v in outputs.items():
                val_loss[k] = v.item()  # cuda to cpu
            # do whatever you want with the loss from here, e.g. log it :)
            # or integrate this in a different hook...

    def _get_loss_on_val_batch(self, runner):
        # we basically run model.train_step but with:
        # - model.eval() so we don't update batch_norm stats with val data!!!
        # - torch.no_grad() so we don't produce gradients
        # - no optim_wrapper.update_params but we have no gradients anyway
        # - manual amp context instead of optim_wrapper.optim_context() so we don't change anything inside
        #   the optim_wrapper but still get correct amp loss. Not sure if this is actually necessary but
        #   let's not fiddle with optim_wrapper. Maybe we could simply use the context just without calling
        #   optim_wrapper.update_params() but I'm not sure so we just avoid it.
        # see: https://github.com/open-mmlab/mmengine/blob/main/mmengine/model/base_model/base_model.py#L84
        model = runner.model
        if is_model_wrapper(model):  # unwrap DDP
            model = model.module
        data = next(runner.val_loss_dl)
        amp = hasattr(runner.optim_wrapper, 'cast_dtype')
        cast_dtype = getattr(runner.optim_wrapper, 'cast_dtype', None)
        model.eval()  # assumed from the comment above; the call itself is missing from the post
        with torch.no_grad():
            with autocast(dtype=cast_dtype) if amp else nullcontext():
                data = model.data_preprocessor(data, True)
                losses = model._run_forward(data, mode='loss')  # type: ignore
        model.train()  # assumed: switch back so training continues in train mode
        parsed_losses, log_vars = model.parse_losses(losses)  # type: ignore
        return log_vars
|
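One thing to be aware of with the `cycle(dl)` wrapper in the runner above: `itertools.cycle` saves a copy of every item from the first pass and replays those copies afterwards. For a dataloader this would mean that all batches of the first pass stay in memory and that later passes are never reshuffled or re-augmented. A minimal sketch of an alternative that restarts the iterable on each pass is shown below; `repeat_forever` and `CountingLoader` are hypothetical names used only for this illustration, and `CountingLoader` merely stands in for a dataloader.

```python
from itertools import cycle


def repeat_forever(iterable):
    """Yield items endlessly, calling iter() on the iterable again for each pass."""
    while True:
        empty = True
        for item in iterable:
            empty = False
            yield item
        if empty:  # avoid spinning forever on an empty iterable
            return


class CountingLoader:
    """Stand-in for a dataloader: tags each item with the pass it was produced in."""

    def __init__(self, n):
        self.n = n
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter([(self.passes, i) for i in range(self.n)])


stream = repeat_forever(CountingLoader(2))
fresh = [next(stream) for _ in range(4)]

replayed = cycle(CountingLoader(2))
cached = [next(replayed) for _ in range(4)]

print(fresh)   # [(1, 0), (1, 1), (2, 0), (2, 1)]  -> second pass is re-created
print(cached)  # [(1, 0), (1, 1), (1, 0), (1, 1)]  -> first pass is replayed from memory
```

Whether this matters depends on the size of the validation set and on how often the hook fires; with a large interval the first pass may never be exhausted.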
@JohannesTheo : I would like to use the custom runner in the rtmdet ins config, but where to set the runner_type='CustomRunner'? |
Hey @danielsagmeister-cw, sorry, I forgot to mention that I'm using mmdet 3.3.0. In that case,

from .custom_runner import CustomRunner
runner_type = CustomRunner

or if you are using the text based configs:

# I'm not 100% sure about the correct path etc. but,
# you need the custom_imports to trigger the registry mechanism
custom_imports = dict(imports=['.custom_runner'], allow_failed_imports=False)
runner_type = 'CustomRunner'

If you are using mmdet 2.x however, the mechanism will be different. The runner is instantiated here and defined as … EDIT: I just checked and in case of mmdet 2.x, you might be able to do almost the same thing but have to extend/inherit from https://github.com/open-mmlab/mmcv/blob/1.x/mmcv/runner/epoch_based_runner.py . I didn't check what's different in terms of Hooks but since the 'EpochBasedRunner' implements the train loop directly, it seems that this part can be customized even easier and more directly, even without a hook. |
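For the mmdet 3.x text-based config case, the runner and the hook from the earlier comment could be wired together roughly as follows. This is a sketch under assumptions: the module paths under `my_project` are hypothetical placeholders, and both classes need to be registered (in the `RUNNERS` and `HOOKS` registries respectively) for the string names to resolve.

```python
# Hypothetical module paths: replace "my_project" with the package that
# actually contains CustomRunner and CustomLoggerHook.
custom_imports = dict(
    imports=["my_project.custom_runner", "my_project.custom_logger_hook"],
    allow_failed_imports=False,
)

# Looked up by name in the RUNNERS registry.
runner_type = "CustomRunner"

# Looked up by name in the HOOKS registry; "interval" is the hook's constructor argument.
custom_hooks = [
    dict(type="CustomLoggerHook", interval=100),
]
```

Importing the modules through `custom_imports` is what triggers the registration decorators, which is why absolute, importable module paths are used here.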
Perhaps it could be easier?
and in my config,I can
Although useful, it is inefficient |
I have seen Validation Loss During Training #7971.
but there is no workflow in base/default_runtime.py.
my mmdetection version is 3.x