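Computer vision metrics evaluate models across detection, segmentation, generation, 3D, and pose estimation tasks. Detection, segmentation, and pose metrics mostly reduce to an overlap or precision/recall-style measure tailored to the prediction type (box, mask, keypoint); generation and 3D metrics instead compare feature distributions, pixel values, or point clouds.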
When to use which metric
| Metric | When to use |
|---|---|
| IoU | Box or mask overlap with ground truth. |
| AP | Area under the precision-recall curve for a class at a given IoU threshold. |
| mAP | Mean AP across classes (and often across IoU thresholds). |
| Pixel Accuracy | Fraction of correctly classified pixels — dominated by large classes. |
| mIoU | Mean IoU across classes (standard semantic-segmentation metric). |
| FWIoU | mIoU weighted by class frequency. |
| Dice Coefficient | F1 equivalent for segmentation masks. |
| Mask AP | AP computed over masks instead of boxes. |
| Panoptic Quality (PQ) | Panoptic segmentation — segmentation quality × recognition quality. |
| FID | Generative image quality/diversity vs real distribution. |
| Inception Score (IS) | Generative image sharpness and diversity. |
| SSIM | Perceptual similarity via luminance/contrast/structure. |
| PSNR | Reconstruction quality (denoising, super-resolution). |
| LPIPS | Deep-feature perceptual similarity. |
| Chamfer / EMD | Point-cloud distance. |
| PCK / MPJPE / OKS | Pose-keypoint accuracy. |
Object Detection Metrics
Object detection models predict bounding boxes around objects and classify them.
Intersection over Union (IoU)
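Measures the overlap between predicted and ground truth bounding boxes.

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

Where $A$ is the predicted box and $B$ is the ground truth box.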
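As a rough illustration, box IoU can be sketched in a few lines of plain Python. The function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for this example; other box formats (for example center plus width/height) would need converting first.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 7.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # roughly 0.143
```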
Average Precision (AP)
Area under the precision-recall curve for a specific class, calculated at a specific IoU threshold.

$$\text{AP} = \int_0^1 p(r)\,dr$$

Where $p(r)$ is the precision at recall level $r$.
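In practice the area is approximated from a ranked list of detections. The sketch below uses all-point interpolation (precision made non-increasing, then summed over recall steps), which is one common convention among several. The function name `average_precision` and its inputs are hypothetical: `is_tp` is assumed to already mark each detection as matched or unmatched at the chosen IoU threshold, and `num_gt` is assumed to be positive.

```python
def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class at one IoU threshold."""
    # Rank detections by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: each value becomes the max precision to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas over each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Four detections (TP, FP, TP, TP by rank) against four ground truth objects.
print(average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4))  # 0.625
```

Averaging this value over classes (and, where a benchmark requires it, over IoU thresholds) gives mAP.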
Mean Average Precision (mAP)
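Mean of AP values across all object classes, often also averaged over multiple IoU thresholds (COCO, for example, averages over thresholds from 0.50 to 0.95 in steps of 0.05).

$$\text{mAP} = \frac{1}{N}\sum_{i=1}^{N} \text{AP}_i$$

Where $N$ is the number of classes and $\text{AP}_i$ is the AP for class $i$.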
Semantic Segmentation Metrics
In semantic segmentation, each pixel belongs to one class.
Pixel Accuracy
Proportion of correctly classified pixels among all pixels. Can be dominated by large classes (e.g., background).
Mean Intersection over Union (mIoU)
Average IoU across all classes.

$$\text{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c}$$

Where:
- $TP_c$ is the number of true positive pixels for class $c$.
- $FP_c$ is the number of false positive pixels for class $c$.
- $FN_c$ is the number of false negative pixels for class $c$.
- $C$ is the number of classes.
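Both pixel accuracy and mIoU can be read off a class confusion matrix. The following is a minimal sketch under that assumption; the helper names (`confusion_matrix`, `pixel_accuracy`, `per_class_iou`, `mean_iou`) are made up for this example, labels are assumed to be flattened lists of integer class ids, and classes that never appear in either prediction or ground truth are skipped when averaging.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    correct = sum(cm[c][c] for c in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_iou(cm):
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, truly another class
        denom = tp + fp + fn
        ious.append(tp / denom if denom else None)  # None: class absent everywhere
    return ious


def mean_iou(cm):
    ious = [v for v in per_class_iou(cm) if v is not None]
    return sum(ious) / len(ious)


# 8 background pixels (class 0) and 2 object pixels (class 1); one object pixel is missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
cm = confusion_matrix(y_true, y_pred, num_classes=2)
print(pixel_accuracy(cm))  # 0.9
print(mean_iou(cm))        # roughly 0.694
```

The example shows why pixel accuracy can mislead: missing half of the small class still yields 90% accuracy, while mIoU drops to about 0.69.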
Frequency Weighted IoU (FWIoU)
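Variant of mIoU in which each class's IoU is weighted by how often the class occurs, so frequent classes contribute more to the score.

$$\text{FWIoU} = \frac{1}{\sum_{c=1}^{C} n_c}\sum_{c=1}^{C} n_c\,\frac{TP_c}{TP_c + FP_c + FN_c}$$

Where $n_c = TP_c + FN_c$ is the total number of pixels that truly belong to class $c$, and $TP_c$, $FP_c$, $FN_c$ are the true positive, false positive, and false negative pixel counts for class $c$.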
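A small sketch of the frequency weighting, assuming the input is a confusion matrix where `cm[t][p]` counts pixels of true class `t` predicted as class `p`. The function name `fw_iou` is hypothetical.

```python
def fw_iou(cm):
    """Frequency-weighted IoU from a confusion matrix cm[true][pred]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for c in range(n):
        tp = cm[c][c]
        n_c = sum(cm[c])                           # pixels truly in class c (TP + FN)
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = n_c + fp                           # TP + FN + FP
        if denom:
            score += (n_c / total) * (tp / denom)  # class frequency times class IoU
    return score


# Class 0: 8 pixels, IoU 8/9. Class 1: 2 pixels, IoU 0.5.
print(fw_iou([[8, 0], [1, 1]]))  # roughly 0.811
```

For the same confusion matrix, the unweighted mean of the two class IoUs is about 0.694, so the weighting pulls the score toward the dominant class.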
Dice Coefficient
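The F1 score computed over mask pixels: twice the overlap divided by the total size of the two masks.

$$\text{Dice} = \frac{2\,|A \cap B|}{|A| + |B|} = \frac{2\,TP}{2\,TP + FP + FN}$$

Where $A$ is the predicted mask and $B$ is the ground truth mask. Dice and IoU are monotonically related: $\text{Dice} = 2\,\text{IoU} / (1 + \text{IoU})$.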
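A minimal sketch for binary masks, assuming each mask is a nested list of 0/1 values with the same shape. The function name `dice` and the convention of returning 1.0 for two empty masks are choices made for this example.

```python
def dice(mask_a, mask_b):
    """Dice coefficient of two binary masks given as nested lists."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    inter = sum(x and y for x, y in zip(a, b))  # pixels set in both masks
    size = sum(a) + sum(b)
    return 2 * inter / size if size else 1.0    # two empty masks count as a match


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(dice(pred, gt))  # roughly 0.667; the IoU of the same pair is 0.5
```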
Instance Segmentation Metrics
Instance segmentation involves both semantic segmentation and instance differentiation (separating individual objects).
Mask AP
Average Precision calculated based on IoU between predicted and ground truth masks instead of bounding boxes.
Panoptic Quality (PQ)
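Combines recognition and segmentation quality for panoptic segmentation tasks.

$$\text{PQ} = \frac{\sum_{(p,g)\in TP}\text{IoU}(p,g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}$$

Where:
- $p$ is a predicted segment.
- $g$ is a ground truth segment.
- $TP$, $FP$, $FN$ are the sets of true positives (matched segment pairs with IoU above 0.5), false positives (unmatched predicted segments), and false negatives (unmatched ground truth segments).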
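The sketch below computes PQ for a single class, along with its two factors: segmentation quality (mean IoU of matched pairs) and recognition quality (an F1-style detection score). It assumes each segment is a non-empty Python set of pixel ids and that segments within an image do not overlap, so an IoU above 0.5 can match at most one pair. The function name `panoptic_quality` is hypothetical.

```python
def panoptic_quality(pred_segments, gt_segments):
    """Return (PQ, SQ, RQ) for one class; segments are sets of pixel ids."""
    iou_sum = 0.0
    matched_pred = set()
    tp = 0
    for g in gt_segments:
        for j, p in enumerate(pred_segments):
            if j in matched_pred:
                continue
            iou = len(p & g) / len(p | g)
            if iou > 0.5:  # with non-overlapping segments this match is unique
                iou_sum += iou
                matched_pred.add(j)
                tp += 1
                break
    fp = len(pred_segments) - tp  # unmatched predictions
    fn = len(gt_segments) - tp    # unmatched ground truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = iou_sum / tp if tp else 0.0
    rq = tp / denom
    return sq * rq, sq, rq


gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {6, 7}]
print(panoptic_quality(pred, gt))  # (0.375, 0.75, 0.5)
```

Here one pair matches with IoU 0.75, one prediction is spurious, and one ground truth segment is missed, so PQ = 0.75 × 0.5.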
Image Generation and Synthesis Metrics
These metrics evaluate the quality, diversity, and realism of generated images.
Fréchet Inception Distance (FID)
Measures the distance between the distribution of features from generated images and real images, extracted using a pre-trained Inception network. Compares the mean and covariance of these feature distributions. Lower values indicate more realistic generated images.

$$\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

Where:
- $\mu_r$ and $\mu_g$ are the mean feature representations of real and generated images.
- $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the feature representations.
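To show how the mean and covariance terms combine, the sketch below evaluates the Fréchet distance under a simplifying assumption that is not part of the real metric: both covariance matrices are diagonal, so the matrix square root reduces to per-dimension square roots. A real FID computation uses full covariances of Inception features and a matrix square root routine. The function name `fid_diagonal` and its inputs are hypothetical.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Frechet distance between two Gaussians with DIAGONAL covariances.

    Illustrative simplification: real FID uses full covariance matrices.
    """
    # Squared distance between the mean vectors.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) reduces to a per-dimension sum.
    cov_term = sum(vr + vg - 2 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


# Means differ by 1 in the first dimension; variance differs there too (1 vs 4).
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [4.0, 1.0]))  # 2.0
```

Identical statistics give a distance of zero, and either a shifted mean or a mismatched spread increases it.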
Inception Score (IS)
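Measures the quality (sharpness, recognizability by a pre-trained Inception network) and diversity of generated images. Higher values are better.

$$\text{IS} = \exp\left(\mathbb{E}_{x}\left[D_{KL}\big(p(y \mid x)\,\|\,p(y)\big)\right]\right)$$

Where:
- $p(y \mid x)$ is the conditional class distribution for image $x$.
- $p(y)$ is the marginal class distribution.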
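Given per-image class probabilities, the score is the exponential of the average KL divergence between each image's class distribution and the marginal over all images. The sketch below assumes the probabilities have already been produced by a classifier; the function name `inception_score` and the list-of-lists input are choices made for this example.

```python
import math


def inception_score(probs):
    """probs: list of per-image class distributions, each summing to 1."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution: average of the per-image distributions.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


confident_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
print(inception_score(confident_and_diverse))  # close to 2, the number of classes
print(inception_score(uninformative))          # 1.0
```

The score is bounded above by the number of classes, reached when every image is classified confidently and the classes are covered evenly.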
Structural Similarity Index (SSIM)
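Measures perceptual similarity between two images based on luminance, contrast, and structure. Ranges from −1 to 1, although values for natural image pairs usually fall between 0 and 1; a value of 1 means the images are identical. Generally more consistent with human perception than PSNR/MSE.

$$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

Where:
- $\mu_x$ and $\mu_y$ are the average pixel values.
- $\sigma_x^2$ and $\sigma_y^2$ are the variances.
- $\sigma_{xy}$ is the covariance.
- $C_1$ and $C_2$ are small constants that stabilize the division when the denominators are close to zero.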
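The sketch below evaluates the SSIM expression once over whole images, given as flat lists of pixel values. This is a simplification: SSIM is usually computed over local windows and averaged across the image. The function name `ssim_global` is hypothetical, and the constants use the commonly cited defaults $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$ with $L$ the dynamic range, which is an assumption rather than something stated in the text above.

```python
def ssim_global(x, y, max_val=255.0):
    """Single-window SSIM over two equally sized flat pixel lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_val) ** 2  # assumed default stabilizing constants
    c2 = (0.03 * max_val) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


img = [10, 50, 200, 120]
inverted = [255 - v for v in img]
print(ssim_global(img, img))       # 1.0 up to floating-point rounding
print(ssim_global(img, inverted))  # negative: the structure is anti-correlated
```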
Peak Signal-to-Noise Ratio (PSNR)
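Measures the quality of reconstructed images in tasks like denoising or super-resolution. It is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects its fidelity, computed from the MSE and expressed in decibels. Higher values indicate a closer reconstruction.

$$\text{PSNR} = 10 \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$$

Where:
- $MAX_I$ is the maximum possible pixel value (255 for 8-bit images).
- $MSE$ is the mean squared error between images.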
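A minimal sketch of the computation, assuming images are flat lists of pixel values on a 0 to 255 scale. The function name `psnr` and the choice to return infinity for identical images are conventions adopted for this example.

```python
import math


def psnr(original, reconstructed, max_val=255.0):
    """PSNR in decibels between two equally sized flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")  # identical images: no noise power
    return 10 * math.log10(max_val ** 2 / mse)


clean = [100, 100, 100, 100]
noisy = [110, 90, 100, 100]      # MSE = (100 + 100 + 0 + 0) / 4 = 50
print(round(psnr(clean, noisy), 2))  # 31.14
```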
Learned Perceptual Image Patch Similarity (LPIPS)
Measures perceptual similarity using deep features from pre-trained networks (VGG, AlexNet). Aligns better with human perception than pixel-wise metrics like MSE.
3D Vision Metrics
Metrics for evaluating 3D reconstruction, depth estimation, and point cloud processing.
Depth Estimation
- Mean Absolute Error (MAE): average absolute difference between predicted and ground truth depths.
- Root Mean Squared Error (RMSE): square root of the average squared difference between predicted and ground truth depths.
- Threshold Accuracy: percentage of pixels whose ratio between predicted depth $\hat{d}$ and ground truth depth $d$ satisfies $\max(\hat{d}/d,\ d/\hat{d}) < \delta$ (commonly $\delta = 1.25$, $1.25^2$, $1.25^3$).
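The three depth metrics above can be sketched together. The example assumes flat lists of strictly positive depths over valid pixels only; the function name `depth_metrics` and the default threshold of 1.25 are choices made for this illustration.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Return (MAE, RMSE, threshold accuracy) for positive depth values."""
    n = len(gt)
    mae = sum(abs(p - g) for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # A pixel counts as accurate when the larger of the two ratios is below the threshold.
    within = sum(max(p / g, g / p) < threshold for p, g in zip(pred, gt))
    return mae, rmse, within / n


gt = [1.0, 2.0, 4.0, 10.0]
pred = [1.1, 2.0, 6.0, 9.0]
mae, rmse, acc = depth_metrics(pred, gt)
print(round(mae, 3), round(rmse, 3), acc)  # 0.775 1.119 0.75
```

Only the third pixel (predicted 6.0 against 4.0, ratio 1.5) falls outside the 1.25 threshold.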
Point Cloud
- Chamfer Distance: average distance from each point in one cloud to its nearest neighbor in the other, computed in both directions and summed.
- Earth Mover’s Distance (EMD): minimum total “cost” of moving the points of one cloud onto the other under a one-to-one matching.
- F-Score: harmonic mean of precision and recall at a specific distance threshold.
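A brute-force sketch of Chamfer distance and the point-cloud F-score follows; EMD is left out because it needs an assignment solver. Points are assumed to be coordinate tuples, the function names are hypothetical, and the nearest-neighbor search is quadratic, so this is only suitable for small clouds. Some implementations use squared distances in the Chamfer term; this version uses plain Euclidean distance.

```python
import math


def _nearest(p, cloud):
    """Distance from point p to its nearest neighbor in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    # Mean nearest-neighbor distance in each direction, then summed.
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def f_score(pred, gt, tau):
    # Precision: predicted points close to the ground truth cloud.
    precision = sum(_nearest(p, gt) < tau for p in pred) / len(pred)
    # Recall: ground truth points close to the predicted cloud.
    recall = sum(_nearest(g, pred) < tau for g in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pred = [(0, 0, 0), (1, 0, 0)]
gt = [(0, 0, 0), (1, 0, 1)]
print(chamfer_distance(pred, gt))  # 1.0
print(f_score(pred, gt, tau=0.5))  # 0.5
```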
3D Reconstruction
- Volumetric IoU: intersection over union of the predicted and ground truth 3D volumes.
- Surface-to-Surface Distance: average distance between the reconstructed and ground truth surfaces.
Human Pose Estimation Metrics
Percentage of Correct Keypoints (PCK)
Percentage of predicted keypoints that fall within a distance threshold of the ground truth keypoints.
Mean Per Joint Position Error (MPJPE)
Average Euclidean distance between predicted and ground truth joint positions.
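PCK and MPJPE both start from per-joint Euclidean distances, so they can be sketched side by side. Keypoints are assumed to be coordinate tuples in matching order; the function names are hypothetical, and the PCK threshold here is an absolute distance, whereas benchmarks typically normalize it by a reference length such as head or torso size.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of keypoints within `threshold` of the ground truth."""
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)


def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground truth joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


gt = [(0, 0), (10, 0), (10, 10)]
pred = [(3, 4), (10, 0), (10, 16)]  # per-joint errors: 5, 0, 6
print(pck(pred, gt, threshold=5))   # 2 of 3 joints are within the threshold
print(mpjpe(pred, gt))              # (5 + 0 + 6) / 3
```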
Object Keypoint Similarity (OKS)
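Plays the role of IoU for keypoints: a similarity score between 0 and 1 that accounts for keypoint visibility and object scale.

$$\text{OKS} = \frac{\sum_i \exp\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

Where:
- $d_i$ is the Euclidean distance between predicted and ground truth keypoint $i$.
- $s$ is the object scale.
- $k_i$ is the per-keypoint constant.
- $v_i$ is the visibility flag for keypoint $i$.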
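The sketch below evaluates OKS for a single object, assuming 2D keypoints as `(x, y)` tuples, a scalar object scale, and one constant per keypoint. The function name `oks`, the parameter names, and the example constants are hypothetical; benchmark-specific values for the per-keypoint constants are not reproduced here.

```python
import math


def oks(pred, gt, visibility, scale, k):
    """Object Keypoint Similarity over labeled keypoints (visibility > 0)."""
    total = 0.0
    labeled = 0
    for p, g, v, k_i in zip(pred, gt, visibility, k):
        if v > 0:  # unlabeled keypoints are ignored
            d2 = (p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2
            total += math.exp(-d2 / (2 * scale ** 2 * k_i ** 2))
            labeled += 1
    return total / labeled if labeled else 0.0


gt = [(0, 0), (10, 10), (20, 20)]
pred = [(0, 0), (10, 12), (50, 50)]
visibility = [2, 2, 0]  # the third keypoint is unlabeled and does not count
print(oks(pred, gt, visibility, scale=10.0, k=[0.1, 0.1, 0.1]))  # roughly 0.568
```

The first keypoint is exact (similarity 1), the second is off by 2 pixels (similarity exp(−2)), and the score is their average.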