Evaluation error with PAF heads: `ValueError: matrix contains invalid numeric entries` #2631

BBrianZhang · 2024-06-21T09:18:20Z

Is there an existing issue for this?

I have searched the existing issues

Bug description

Hello,
While using DeepLabCut with the PyTorch engine, I ran into an issue with the model dlcrnet_stride16_ms5. During training, a ValueError was raised with the message "matrix contains invalid numeric entries". Neither reducing the learning rate nor changing the training batch size resolves the problem.

Operating System

Ubuntu 18.04

DeepLabCut version

DLC 3.0.0rc1

DeepLabCut mode

multi animal

Device type

gpu

Steps To Reproduce

1. Create a training dataset
2. Use the following pytorch_config.yaml

data:
  colormode: RGB
  inference:
    normalize_images: true
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling:
      - 1.0
      - 1.0
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: false
    covering: false
    gaussian_noise: 12.75
    hist_eq: false
    motion_blur: false
    normalize_images: true
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path: 
    /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts:
  - Front
  - Right
  - Middle
  - Left
  - FL1
  - BL1
  - FR1
  - BR1
  - BL2
  - BR2
  - FL2
  - FR2
  - Body1
  - Body2
  - Body3
  unique_bodyparts: []
  individuals:
  - MARMOSET_1
  - MARMOSET_2
  with_identity: true
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: true
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: &id001
        - - 0
          - 1
        - - 0
          - 3
        - - 2
          - 3
        - - 1
          - 2
        - - 0
          - 2
        - - 2
          - 12
        - - 12
          - 13
        - - 13
          - 14
        - - 7
          - 14
        - - 7
          - 9
        - - 5
          - 8
        - - 5
          - 14
        - - 6
          - 12
        - - 6
          - 11
        - - 4
          - 12
        - - 4
          - 10
        edges_to_keep:
        - 0
        - 1
        - 2
        - 3
        - 4
        - 5
        - 6
        - 7
        - 8
        - 9
        - 10
        - 11
        - 12
        - 13
        - 14
        - 15
      target_generator:
        type: SequentialGenerator
        generators:
        - type: HeatmapPlateauGenerator
          num_heatmaps: 15
          pos_dist_thresh: 17
          heatmap_mode: KEYPOINT
          generate_locref: true
          locref_std: 7.2801
        - type: PartAffinityFieldGenerator
          graph: *id001
          width: 20
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels:
        - 2304
        - 15
        kernel_size:
        - 3
        strides:
        - 2
      locref_config:
        channels:
        - 2304
        - 30
        kernel_size:
        - 3
        strides:
        - 2
      paf_config:
        channels:
        - 2304
        - 32
        kernel_size:
        - 3
        strides:
        - 2
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: false
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: false
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels:
        - 2304
        - 2
        kernel_size:
        - 3
        strides:
        - 2
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus:
  key_metric: test.mAP
  key_metric_asc: true
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list:
      - - 1e-05
      - - 1e-06
      milestones:
      - 90
      - 120
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: false
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: true
  display_iters: 1000
  epochs: 200
  seed: 42

3. Run train_network

Relevant log output

Training with configuration:
data:
  colormode: RGB
  inference:
    normalize_images: True
  train:
    affine:
      p: 0.5
      rotation: 30
      scaling: [1.0, 1.0]
      translation: 0
    collate:
      type: ResizeFromDataSizeCollate
      min_scale: 0.4
      max_scale: 1.0
      min_short_side: 128
      max_short_side: 1152
      multiple_of: 32
      to_square: False
    covering: False
    gaussian_noise: 12.75
    hist_eq: False
    motion_blur: False
    normalize_images: True
device: auto
metadata:
  project_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19
  pose_config_path: /mnt/Data16Tb/Data/boyang/pose/MMVISV3-BRIAN-2024-06-19/dlc-models-pytorch/iteration-3/MMVISV3Jun19-trainset70shuffle0/train/pose_cfg.yaml
  bodyparts: ['Front', 'Right', 'Middle', 'Left', 'FL1', 'BL1', 'FR1', 'BR1', 'BL2', 'BR2', 'FL2', 'FR2', 'Body1', 'Body2', 'Body3']
  unique_bodyparts: []
  individuals: ['MARMOSET_1', 'MARMOSET_2']
  with_identity: True
method: bu
model:
  backbone:
    type: DLCRNet
    model_name: resnet50
    pretrained: True
    output_stride: 16
  backbone_output_channels: 2304
  pose_model:
    stride: 8
  heads:
    bodypart:
      type: DLCRNetHead
      predictor:
        type: PartAffinityFieldPredictor
        num_animals: 2
        num_multibodyparts: 15
        num_uniquebodyparts: 0
        nms_radius: 5
        sigma: 1.0
        locref_stdev: 7.2801
        min_affinity: 0.05
        graph: [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]]
        edges_to_keep: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
      target_generator:
        type: SequentialGenerator
        generators: [{'type': 'HeatmapPlateauGenerator', 'num_heatmaps': 15, 'pos_dist_thresh': 17, 'heatmap_mode': 'KEYPOINT', 'generate_locref': True, 'locref_std': 7.2801}, {'type': 'PartAffinityFieldGenerator', 'graph': [[0, 1], [0, 3], [2, 3], [1, 2], [0, 2], [2, 12], [12, 13], [13, 14], [7, 14], [7, 9], [5, 8], [5, 14], [6, 12], [6, 11], [4, 12], [4, 10]], 'width': 20}]
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
        locref:
          type: WeightedHuberCriterion
          weight: 0.05
        paf:
          type: WeightedHuberCriterion
          weight: 0.1
      heatmap_config:
        channels: [2304, 15]
        kernel_size: [3]
        strides: [2]
      locref_config:
        channels: [2304, 30]
        kernel_size: [3]
        strides: [2]
      paf_config:
        channels: [2304, 32]
        kernel_size: [3]
        strides: [2]
      num_stages: 5
    identity:
      type: HeatmapHead
      predictor:
        type: HeatmapPredictor
        location_refinement: False
      target_generator:
        type: HeatmapPlateauGenerator
        num_heatmaps: 2
        pos_dist_thresh: 17
        heatmap_mode: INDIVIDUAL
        generate_locref: False
      criterion:
        heatmap:
          type: WeightedBCECriterion
          weight: 1.0
      heatmap_config:
        channels: [2304, 2]
        kernel_size: [3]
        strides: [2]
net_type: dlcrnet_stride16_ms5
runner:
  type: PoseTrainingRunner
  gpus: None
  key_metric: test.mAP
  key_metric_asc: True
  eval_interval: 25
  optimizer:
    type: AdamW
    params:
      lr: 0.0001
  scheduler:
    type: LRListScheduler
    params:
      lr_list: [[1e-05], [1e-06]]
      milestones: [90, 120]
  snapshots:
    max_snapshots: 5
    save_epochs: 50
    save_optimizer_state: False
train_settings:
  batch_size: 16
  dataloader_workers: 0
  dataloader_pin_memory: True
  display_iters: 1000
  epochs: 200
  seed: 42
Loading pretrained weights from Hugging Face hub (timm/resnet50.a1_in1k)
[timm/resnet50.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
Data Transforms:
  Training:   Compose([
  Affine(always_apply=False, p=0.5, interpolation=1, mask_interpolation=0, cval=0, mode=0, scale={'x': (1.0, 1.0), 'y': (1.0, 1.0)}, translate_percent=None, translate_px={'x': (0, 0), 'y': (0, 0)}, rotate=(-30, 30), fit_output=False, shear={'x': (0.0, 0.0), 'y': (0.0, 0.0)}, cval_mask=0, keep_ratio=True, rotate_method='largest_box'),
  GaussNoise(always_apply=False, p=0.5, var_limit=(0, 162.5625), per_channel=True, mean=0),
  Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
  Validation: Compose([
  Normalize(always_apply=False, p=1.0, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], max_pixel_value=255.0),
], p=1.0, bbox_params={'format': 'coco', 'label_fields': ['bbox_labels'], 'min_area': 0.0, 'min_visibility': 0.0, 'min_width': 0.0, 'min_height': 0.0, 'check_each_transform': True}, keypoint_params={'format': 'xy', 'label_fields': ['class_labels'], 'remove_invisible': False, 'angle_in_degrees': True, 'check_each_transform': True}, additional_targets={}, is_check_shapes=True)
Using custom collate function: {'type': 'ResizeFromDataSizeCollate', 'min_scale': 0.4, 'max_scale': 1.0, 'min_short_side': 128, 'max_short_side': 1152, 'multiple_of': 32, 'to_square': False}
Using 102 images and 44 for testing

Starting pose model training...
--------------------------------------------------
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv2d(input, weight, bias, self.stride,
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Epoch 1/200 (lr=0.0001), train loss 0.35200
Epoch 2/200 (lr=0.0001), train loss 0.16082
Epoch 3/200 (lr=0.0001), train loss 0.13178
Epoch 4/200 (lr=0.0001), train loss 0.08032
Epoch 5/200 (lr=0.0001), train loss 0.04940
Epoch 6/200 (lr=0.0001), train loss 0.04160
Epoch 7/200 (lr=0.0001), train loss 0.04610
Epoch 8/200 (lr=0.0001), train loss 0.05107
Epoch 9/200 (lr=0.0001), train loss 0.03826
Epoch 10/200 (lr=0.0001), train loss 0.03448
Epoch 11/200 (lr=0.0001), train loss 0.02770
Epoch 12/200 (lr=0.0001), train loss 0.02214
Epoch 13/200 (lr=0.0001), train loss 0.02729
Epoch 14/200 (lr=0.0001), train loss 0.03104
Epoch 15/200 (lr=0.0001), train loss 0.02087
Epoch 16/200 (lr=0.0001), train loss 0.03396
Epoch 17/200 (lr=0.0001), train loss 0.02404
Epoch 18/200 (lr=0.0001), train loss 0.02167
Epoch 19/200 (lr=0.0001), train loss 0.02466
Epoch 20/200 (lr=0.0001), train loss 0.02104
Epoch 21/200 (lr=0.0001), train loss 0.02200
Epoch 22/200 (lr=0.0001), train loss 0.02281
Epoch 23/200 (lr=0.0001), train loss 0.02097
Epoch 24/200 (lr=0.0001), train loss 0.01914
Training for epoch 25 done, starting evaluation
/usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:74: RuntimeWarning: Mean of empty slice
  distance_matrix[i, j] = np.nanmean(d)
{
	"name": "ValueError",
	"message": "matrix contains invalid numeric entries",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 deeplabcut.train_network(config_path, shuffle=0)

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/compat.py:245, in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights, modelprefix, superanimal_name, superanimal_transfer_learning, engine, **torch_kwargs)
    242     if \"display_iters\" not in torch_kwargs:
    243         torch_kwargs[\"display_iters\"] = displayiters
--> 245     return train_network(
    246         config,
    247         shuffle=shuffle,
    248         trainingsetindex=trainingsetindex,
    249         modelprefix=modelprefix,
    250         max_snapshots_to_keep=max_snapshots_to_keep,
    251         **torch_kwargs,
    252     )
    254 raise NotImplementedError(f\"This function is not implemented for {engine}\")

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:336, in train_network(config, shuffle, trainingsetindex, modelprefix, device, snapshot_path, detector_path, batch_size, epochs, save_epochs, detector_batch_size, detector_epochs, detector_save_epochs, display_iters, max_snapshots_to_keep, pose_threshold, **kwargs)
    323     detector_run_config[\"train_settings\"][\"weight_init\"] = loader.model_cfg[
    324         \"train_settings\"
    325     ].get(\"weight_init\")
    326     train(
    327         loader=loader,
    328         run_config=detector_run_config,
   (...)
    333         max_snapshots_to_keep=max_snapshots_to_keep,
    334     )
--> 336 train(
    337     loader=loader,
    338     run_config=loader.model_cfg,
    339     task=pose_task,
    340     device=device,
    341     logger_config=loader.model_cfg.get(\"logger\"),
    342     snapshot_path=snapshot_path,
    343     max_snapshots_to_keep=max_snapshots_to_keep,
    344 )
    346 destroy_file_logging()

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/apis/train.py:189, in train(loader, run_config, task, device, gpus, logger_config, snapshot_path, transform, inference_transform, max_snapshots_to_keep)
    186 else:
    187     logging.info(\"\\nStarting pose model training...\\n\" + (50 * \"-\"))
--> 189 runner.fit(
    190     train_dataloader,
    191     valid_dataloader,
    192     epochs=run_config[\"train_settings\"][\"epochs\"],
    193     display_iters=run_config[\"train_settings\"][\"display_iters\"],
    194 )

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:181, in TrainingRunner.fit(self, train_loader, valid_loader, epochs, display_iters)
    179 with torch.no_grad():
    180     logging.info(f\"Training for epoch {e} done, starting evaluation\")
--> 181     valid_loss = self._epoch(
    182         valid_loader, mode=\"eval\", display_iters=display_iters
    183     )
    184     if self._print_valid_loss:
    185         msg += f\", valid loss {float(valid_loss):.5f}\"

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:236, in TrainingRunner._epoch(self, loader, mode, display_iters)
    234 perf_metrics = None
    235 if mode == \"eval\":
--> 236     perf_metrics = self._compute_epoch_metrics()
    237     self._metadata[\"metrics\"] = perf_metrics
    238     self._epoch_predictions = {}

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/runners/train.py:365, in PoseTrainingRunner._compute_epoch_metrics(self)
    358 \"\"\"Computes the metrics using the data accumulated during an epoch
    359 Returns:
    360     A dictionary containing the different losses for the step
    361 \"\"\"
    362 num_animals = max(
    363     [len(kpts) for kpts in self._epoch_ground_truth[\"bodyparts\"].values()]
    364 )
--> 365 poses = pair_predicted_individuals_with_gt(
    366     self._epoch_predictions[\"bodyparts\"], self._epoch_ground_truth[\"bodyparts\"]
    367 )
    369 # pad predictions if there are any missing (needed for top-down models)
    370 gt, pred = {}, {}

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/metrics/scoring.py:391, in pair_predicted_individuals_with_gt(predictions, ground_truth)
    389 matched_poses = {}
    390 for image, pose in predictions.items():
--> 391     match_individuals = rmse_match_prediction_to_gt(pose, ground_truth[image])
    392     matched_poses[image] = pose[match_individuals]
    394 return matched_poses

File /usr/local/anaconda3/envs/deeplabcut3/lib/python3.10/site-packages/deeplabcut/pose_estimation_pytorch/post_processing/match_predictions_to_gt.py:76, in rmse_match_prediction_to_gt(pred_kpts, gt_kpts)
     73         d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
     74         distance_matrix[i, j] = np.nanmean(d)
---> 76 _, col_ind = linear_sum_assignment(distance_matrix)  # len == len(valid_gt_indices)
     78 gt_idx_to_pred_idx = {
     79     valid_gt_indices[valid_gt_index]: valid_pred_indices[valid_pred_index]
     80     for valid_gt_index, valid_pred_index in enumerate(col_ind)
     81 }
     82 matched_pred = {valid_pred_indices[i] for i in col_ind}

ValueError: matrix contains invalid numeric entries"
}

Anything else?

No response

Code of Conduct

I agree to follow this project's Code of Conduct

MMathisLab · 2024-06-21T09:33:53Z

This looks like an evaluation error rather than a training error. Your config also seems to mix HRNet and DLCRNet settings; can you let us know if you modified this file?
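For readers following along: the `RuntimeWarning: Mean of empty slice` just before the traceback suggests that one (ground truth, prediction) pair had no keypoints to compare, so `np.nanmean` returned NaN and `linear_sum_assignment` rejected the matrix. Below is a minimal sketch of that failure mode, not the DeepLabCut code itself. The function name `squared_distance_matrix`, the array shapes, and the way `mask` is built are assumptions made for illustration; only the `np.nanmean` and `linear_sum_assignment` calls come from the traceback.

```python
import warnings

import numpy as np
from scipy.optimize import linear_sum_assignment


def squared_distance_matrix(gt, pred):
    """Mean squared distance for every (ground truth, prediction) pair.

    Hypothetical helper: gt and pred are assumed to have shape
    (num_individuals, num_keypoints, 2), with NaN for missing keypoints.
    """
    matrix = np.zeros((len(gt), len(pred)))
    for i, gt_idv in enumerate(gt):
        for j, pred_idv in enumerate(pred):
            # Assumed mask: keypoints that are defined in both individuals
            mask = ~np.isnan(gt_idv[:, 0]) & ~np.isnan(pred_idv[:, 0])
            d = (gt_idv[mask, :2] - pred_idv[mask, :2]) ** 2
            with warnings.catch_warnings():
                # numpy warns "Mean of empty slice" when the mask is all False
                warnings.simplefilter("ignore", RuntimeWarning)
                matrix[i, j] = np.nanmean(d)
    return matrix


# Two individuals with two keypoints each; the second ground-truth individual
# and the second prediction share no defined keypoint.
gt = np.array([[[10.0, 10.0], [20.0, 20.0]],
               [[50.0, 50.0], [np.nan, np.nan]]])
pred = np.array([[[11.0, 10.0], [21.0, 20.0]],
                 [[np.nan, np.nan], [60.0, 60.0]]])

cost = squared_distance_matrix(gt, pred)

try:
    linear_sum_assignment(cost)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(cost)
print(error_message)
```

If this is indeed the cause, lowering the learning rate or changing the batch size would not be expected to help, since the NaN comes from the matching step during evaluation rather than from the training loss.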

BBrianZhang · 2024-06-21T09:42:50Z

Sorry, I pasted the wrong pytorch_config.yaml content. I have edited the issue so that it now shows the correct file.

n-poulsen · 2024-06-21T12:03:08Z

@BBrianZhang thanks for reporting this! I'll look into it

- Moved all metric computation code to a deeplabcut/core/metrics folder (as metrics are computed with numpy)
- Cleaned metric computation code so the prediction/ground truth matching always happens
- Refactored in a way such that no OOM errors should occur, even on very large datasets (>60k images)
- Multi-animal RMSE: only compute RMSE using (ground-truth, detection) matches with non-zero RMSE
- Add compute_detection_rmse to compute "detection" RMSE, matching the DeepLabCut 2.X implementation
- Fixed the bug for PAF models documented in #2631
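One way to make a prediction/ground-truth matching step tolerate NaN costs is sketched below. This is an illustration only and is not taken from the linked refactor; the function name `match_with_nan_guard` and the placeholder strategy are assumptions. Non-finite entries are swapped for a cost larger than every finite entry before calling `linear_sum_assignment`, and any match that landed on a placeholder is discarded afterwards.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_with_nan_guard(cost):
    """Return (row, col) matches, skipping pairs whose cost is not finite.

    Hypothetical helper, not part of DeepLabCut.
    """
    cost = np.asarray(cost, dtype=float)
    finite = np.isfinite(cost)
    if not finite.any():
        # No pair can be compared at all, so nothing can be matched
        return []

    # Placeholder chosen to be larger than every finite cost, so a pair that
    # cannot be compared is treated as worse than any comparable pair
    placeholder = cost[finite].max() + 1.0
    safe_cost = np.where(finite, cost, placeholder)

    rows, cols = linear_sum_assignment(safe_cost)
    # Drop assignments that were only possible through a placeholder entry
    return [(int(r), int(c)) for r, c in zip(rows, cols) if finite[r, c]]


cost = np.array([[0.5, 1600.0],
                 [1560.5, np.nan]])
matches = match_with_nan_guard(cost)
print(matches)
```

With this choice the second ground-truth individual stays unmatched instead of raising an error. Whether an unmatched individual should count as a miss in the metrics is a separate design decision that the sketch does not address.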

n-poulsen · 2024-07-19T15:11:38Z

Should have been fixed in #2679

BBrianZhang assigned n-poulsen Jun 21, 2024

BBrianZhang mentioned this issue Jun 21, 2024

⚠️ NOTICE: a new major release of DeepLabCut, v3.0.0 is pending -- we need your feedback on v3.0.0rc1 #2616

Open

MMathisLab added the DLC3.0🔥 label Jul 2, 2024

rdarie mentioned this issue Jul 2, 2024

IndexError during evaluation step of training with the new pytorch engine #2648

Open

2 tasks

n-poulsen added the bug label Jul 2, 2024

n-poulsen changed the title from ~~ValueError: matrix contains invalid numeric entries~~ to Evaluation error with PAF heads: `ValueError: matrix contains invalid numeric entries` Jul 4, 2024

n-poulsen mentioned this issue Jul 19, 2024

evaluation refactor #2679

Merged

n-poulsen closed this as completed Jul 19, 2024
