Computer vision metrics

Computer vision metrics evaluate models across detection, segmentation, generation, 3D, and pose estimation tasks. The detection and segmentation metrics mostly reduce to a precision/recall-style overlap measure tailored to the prediction type (box, mask, keypoint), while the generation and 3D metrics compare feature distributions, pixel values, or point distances.

When to use which metric

Metric	When to use
IoU	Box or mask overlap with ground truth.
AP	Area under precision-recall for a class at a given IoU.
mAP	Mean AP across classes (and often across IoU thresholds).
Pixel Accuracy	Fraction of correctly classified pixels — dominated by large classes.
mIoU	Mean IoU across classes (standard semantic-segmentation metric).
FWIoU	mIoU weighted by class frequency.
Dice Coefficient	F1 equivalent for segmentation masks.
Mask AP	AP computed over masks instead of boxes.
Panoptic Quality (PQ)	Panoptic segmentation — segmentation quality × recognition quality.
FID	Generative image quality/diversity vs real distribution.
Inception Score (IS)	Generative image sharpness and diversity.
SSIM	Perceptual similarity via luminance/contrast/structure.
PSNR	Reconstruction quality (denoising, super-resolution).
LPIPS	Deep-feature perceptual similarity.
Chamfer / EMD	Point-cloud distance.
PCK / MPJPE / OKS	Pose-keypoint accuracy.

Object Detection Metrics

Object detection models predict bounding boxes around objects and classify them.

Intersection over Union (IoU)

Measures the overlap between predicted and ground truth bounding boxes.

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
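As an illustration of the ratio above, here is a minimal sketch for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions made for this example, not something the text prescribes.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2×2 boxes offset by one unit in each direction share an overlap of 1 and a union of 7, so `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.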

Average Precision (AP)

Area under the Precision-Recall curve for a specific class, calculated at a specific IoU threshold.

AP = \int_{0}^{1} p(r) \, dr

Where $p(r)$ is the precision at recall level $r$.
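In practice the integral is approximated from a ranked list of detections. The sketch below uses a plain rectangle rule without interpolation. Benchmarks such as PASCAL VOC and COCO use interpolated variants, so their numbers can differ from this one. The function name and its inputs (a confidence score per detection, a flag saying whether it matched a ground truth box at the chosen IoU threshold, and the number of ground truth objects) are assumptions for illustration.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Uninterpolated AP for one class at one IoU threshold."""
    # Rank detections from most to least confident.
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Rectangle rule: precision at each rank times the recall gained at that rank.
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))
```

The matching step that produces `is_true_positive` (assigning each detection to at most one ground truth box) is left out here and would have to follow the benchmark's own rules.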

Mean Average Precision (mAP)

Mean of AP values across all object classes, often calculated at multiple IoU thresholds.

mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i}

Semantic Segmentation Metrics

In semantic segmentation, each pixel belongs to one class.

Pixel Accuracy

Proportion of correctly classified pixels among all pixels. Can be dominated by large classes (e.g., background).

\text{Pixel Accuracy} = \frac{\text{Number of correctly classified pixels}}{\text{Total number of pixels}}

Mean Intersection over Union (mIoU)

Average IoU across all classes.

IoU_{c} = \frac{TP_{c}}{TP_{c} + FP_{c} + FN_{c}}

mIoU = \frac{1}{n_{c}} \sum_{c=1}^{n_{c}} IoU_{c}

Where:

$TP_{c}$ is the number of true positive pixels for class $c$.
$FP_{c}$ is the number of false positive pixels for class $c$.
$FN_{c}$ is the number of false negative pixels for class $c$.
$n_{c}$ is the number of classes.
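The per-class counts can be read off a confusion matrix, which gives a compact way to compute mIoU. The following is a minimal sketch, assuming integer label maps and a matrix whose rows are true classes and whose columns are predicted classes. The helper names are hypothetical, and skipping classes that never occur (rather than counting them as 0) is a choice made here, not a universal convention.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    y_true = np.asarray(y_true, dtype=np.int64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.int64).ravel()
    idx = num_classes * y_true + y_pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as c but truly another class
    fn = cm.sum(axis=1) - tp  # truly c but predicted as another class
    denom = tp + fp + fn
    # Classes absent from both prediction and ground truth get NaN.
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

def mean_iou(cm):
    # Assumption: absent classes are ignored in the average.
    return float(np.nanmean(per_class_iou(cm)))
```

For a 2×2 label map with one mislabelled pixel, `mean_iou(confusion_matrix([[0, 0], [1, 1]], [[0, 1], [1, 1]], 2))` averages the class IoUs 1/2 and 2/3.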

Frequency Weighted IoU (FWIoU)

Weighted version of mIoU that accounts for class imbalance.

FWIoU = \frac{1}{\sum_{k=1}^{n_{c}} t_{k}} \sum_{c=1}^{n_{c}} t_{c} \cdot IoU_{c}

Where $t_{c}$ is the total number of pixels that truly belong to class $c$.
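To make the weighting concrete, the sketch below computes FWIoU directly from a confusion matrix. It assumes rows are true classes and columns are predicted classes. The function name is hypothetical.

```python
import numpy as np

def frequency_weighted_iou(cm):
    """FWIoU from a confusion matrix (rows: true class, columns: predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    t = cm.sum(axis=1)                  # number of pixels truly in each class
    union = t + cm.sum(axis=0) - tp     # TP + FP + FN per class
    present = union > 0                 # skip classes that never occur
    iou = tp[present] / union[present]
    return float(np.sum(t[present] * iou) / t.sum())
```

With `cm = [[8, 0], [1, 1]]` the class IoUs are 8/9 and 1/2. The plain mean is about 0.69, while the frequency-weighted value is 73/90 ≈ 0.81 because the large class dominates the weights.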

Dice Coefficient

F1 equivalent for segmentation masks.

Dice_{c} = \frac{2 \cdot TP_{c}}{2 \cdot TP_{c} + FP_{c} + FN_{c}} = \frac{2 \cdot IoU_{c}}{IoU_{c} + 1}
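The identity between Dice and IoU can be checked numerically on a pair of binary masks. This is a small sketch with a hypothetical function name. Returning 1.0 when both masks are empty is an assumption, since the ratio is undefined in that case.

```python
import numpy as np

def dice_and_iou(pred_mask, gt_mask):
    """Dice and IoU for one class, computed from binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)

    tp = int(np.logical_and(pred, gt).sum())
    fp = int(np.logical_and(pred, ~gt).sum())
    fn = int(np.logical_and(~pred, gt).sum())

    if tp + fp + fn == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou
```

For `pred = [1, 1, 0, 0]` and `gt = [1, 0, 1, 0]` there is one true positive, one false positive, and one false negative, giving an IoU of 1/3 and a Dice of 1/2, which matches 2·IoU / (IoU + 1).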

Instance Segmentation Metrics

Instance segmentation involves both semantic segmentation and instance differentiation (separating individual objects).

Mask AP

Average Precision calculated based on IoU between predicted and ground truth masks instead of bounding boxes.

Panoptic Quality (PQ)

Combines recognition and segmentation quality for panoptic segmentation tasks.

PQ = \underbrace{\frac{\sum_{(p, g) \in TP} IoU(p, g)}{|TP|}}_{\text{segmentation quality (SQ)}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2} |FP| + \frac{1}{2} |FN|}}_{\text{recognition quality (RQ)}}

Where:

$p$ is a predicted segment.
$g$ is a ground truth segment.
$TP$, $FP$, $FN$ are the sets of matched segment pairs (true positives), unmatched predicted segments (false positives), and unmatched ground truth segments (false negatives).
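Once segments have been matched, PQ is simple arithmetic. The sketch below assumes the matching has already been done (in the usual definition a predicted and a ground truth segment match when their IoU exceeds 0.5) and takes the matched IoUs plus the unmatched counts as input. The function name and the choice to return zeros when there are no matches are assumptions for this example.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) for one class.

    tp_ious: IoU of every matched (predicted, ground truth) segment pair.
    num_fp:  number of unmatched predicted segments.
    num_fn:  number of unmatched ground truth segments.
    """
    num_tp = len(tp_ious)
    if num_tp == 0:
        # Assumption: with no matches, all three quantities are reported as 0.
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / num_tp                           # mean IoU of matches
    rq = num_tp / (num_tp + 0.5 * num_fp + 0.5 * num_fn)  # F1-style detection term
    return sq * rq, sq, rq
```

With two matches of IoU 0.8 and 0.6, one false positive, and one false negative, SQ is 0.7, RQ is 2/3, and PQ is about 0.467.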

Image Generation and Synthesis Metrics

These metrics evaluate the quality, diversity, and realism of generated images.

Fréchet Inception Distance (FID)

Measures the distance between the distribution of features from generated images and real images, extracted using a pre-trained Inception network. Compares the mean and covariance of these feature distributions. Lower values indicate more realistic generated images.

FID = \| \mu_{r} - \mu_{g} \|^{2} + Tr\left( \Sigma_{r} + \Sigma_{g} - 2 (\Sigma_{r} \Sigma_{g})^{1/2} \right)

Where:

$\mu_{r}$ and $\mu_{g}$ are the mean feature representations of real and generated images.
$\Sigma_{r}$ and $\Sigma_{g}$ are the covariance matrices of the feature representations.
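Given the feature means and covariances, the distance itself needs only linear algebra. The sketch below is NumPy-only and assumes the Inception features have already been extracted. It uses the fact that the trace of the matrix square root of the covariance product equals the sum of the square roots of that product's eigenvalues. Function names are hypothetical.

```python
import numpy as np

def feature_statistics(features):
    """Mean and covariance of an (n_samples, n_features) array."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    mu_r, mu_g = np.asarray(mu_r, dtype=float), np.asarray(mu_g, dtype=float)
    sigma_r, sigma_g = np.asarray(sigma_r, dtype=float), np.asarray(sigma_g, dtype=float)

    diff = mu_r - mu_g
    # Trace of the square root of the covariance product, via its eigenvalues.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    # Clip tiny negative values that come from round-off.
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)
```

Two identical Gaussians give a distance of 0. Published FID numbers also depend on the specific Inception weights, layer, and image preprocessing, none of which this sketch covers.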

Inception Score (IS)

Measures the quality (sharpness, recognizability by a pre-trained Inception network) and diversity of generated images.

IS = \exp\left( \mathbb{E}_{x} \left[ KL\left( p(y \mid x) \,\|\, p(y) \right) \right] \right)

Where:

$p(y \mid x)$ is the conditional class distribution for image $x$.
$p(y)$ is the marginal class distribution.
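The score can be computed from a matrix of class probabilities, one row per generated image. The following sketch assumes those probabilities have already been produced by the classifier. The function name and the small `eps` used to avoid `log(0)` are assumptions for this example.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_images, n_classes) array, each row a distribution p(y|x)."""
    probs = np.asarray(probs, dtype=float)
    # Marginal p(y): average of the conditional distributions over all images.
    p_y = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for every image.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

If every image gets the same uniform prediction, the KL term is 0 and the score is 1, its minimum. If each image is classified with full confidence and the classes are used evenly, the score approaches the number of classes.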

Structural Similarity Index (SSIM)

Measures perceptual similarity between two images based on luminance, contrast, and structure. Values range from −1 to 1, with 1 meaning the images are identical; for natural image pairs the value usually falls between 0 and 1. More consistent with human perception than PSNR/MSE.

SSIM(x, y) = \frac{(2 \mu_{x} \mu_{y} + c_{1}) (2 \sigma_{xy} + c_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1}) (\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2})}

Where:

$\mu_{x}$ and $\mu_{y}$ are the average pixel values.
$\sigma_{x}^{2}$ and $\sigma_{y}^{2}$ are the variances.
$\sigma_{xy}$ is the covariance.
$c_{1}$ and $c_{2}$ are small constants that stabilize the division when a denominator is close to zero.
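The formula can be evaluated directly on whole images as a sanity check. Note the simplification: standard SSIM is computed over local windows and then averaged, whereas this sketch treats the entire image as a single window. The constants follow the common choice of c1 = (0.01·L)² and c2 = (0.03·L)² with L the dynamic range. The function name is hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image (simplified, not windowed)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

An image compared with itself gives 1. An image compared with its inverted copy has negative covariance and therefore a negative score, which shows where the lower part of the range comes from.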

Peak Signal-to-Noise Ratio (PSNR)

Measures the quality of reconstructed images in tasks like denoising or super-resolution. It is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects its fidelity, expressed in decibels and computed from the MSE. Higher values indicate a reconstruction closer to the reference.

PSNR = 10 \cdot \log_{10} \left( \frac{MAX_{I}^{2}}{MSE} \right)

Where:

$MAX_{I}$ is the maximum possible pixel value.
$MSE$ is the mean squared error between images.
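A direct translation of the formula is shown below as a minimal sketch. The function name and the default `max_value` of 255 (8-bit images) are assumptions, and identical images are reported as infinite PSNR because the MSE is zero.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in decibels."""
    reference = np.asarray(reference, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

A constant error of 25.5 on an 8-bit scale gives an MSE of 650.25, so the ratio inside the logarithm is 100 and the PSNR is 20 dB.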

Learned Perceptual Image Patch Similarity (LPIPS)

Measures perceptual similarity using deep features from pre-trained networks (VGG, AlexNet). Aligns better with human perception than pixel-wise metrics like MSE.

3D Vision Metrics

Metrics for evaluating 3D reconstruction, depth estimation, and point cloud processing.

Depth Estimation

Mean Absolute Error (MAE): average absolute difference between predicted and ground truth depths.
Root Mean Squared Error (RMSE): square root of the average squared differences.
Threshold Accuracy: percentage of pixels for which the ratio between predicted and ground truth depth, taken in whichever direction is larger, falls below a threshold $t$ (commonly $t \in \{1.25, 1.25^{2}, 1.25^{3}\}$).
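The three depth metrics above can be computed together. The sketch below assumes that pixels with a ground truth depth of 0 are invalid and should be ignored, and that predictions are strictly positive. Both are assumptions for this example, as are the function name and the returned dictionary keys.

```python
import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """MAE, RMSE, and threshold accuracies over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    valid = gt > 0  # assumption: zero depth marks missing ground truth
    pred, gt = pred[valid], gt[valid]

    error = pred - gt
    # Larger of the two ratios, so over- and under-estimation count equally.
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "mae": float(np.mean(np.abs(error))),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "threshold_accuracy": [float(np.mean(ratio < t)) for t in thresholds],
    }
```

For `gt = [1, 2, 4]` and `pred = [1, 3, 4]`, only the middle pixel is off, with a ratio of 1.5. It fails the 1.25 threshold but passes 1.25² ≈ 1.56, so the accuracies are 2/3, 1, and 1.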

Point Cloud

Chamfer Distance: for each point in one cloud, the squared distance to its nearest neighbor in the other cloud, averaged within each direction and summed over both directions.

CD(S_{1}, S_{2}) = \frac{1}{|S_{1}|} \sum_{x \in S_{1}} \min_{y \in S_{2}} \| x - y \|_{2}^{2} + \frac{1}{|S_{2}|} \sum_{y \in S_{2}} \min_{x \in S_{1}} \| y - x \|_{2}^{2}

Earth Mover’s Distance (EMD): minimum “cost” to transform one point cloud into another.
F-Score: harmonic mean of precision and recall at a specific distance threshold.
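For small clouds, Chamfer Distance and the F-Score can both be computed from a dense pairwise distance matrix. The sketch below is brute force, with O(|S1|·|S2|) memory, so larger clouds would need a nearest-neighbor index instead. Function names and the threshold parameter `tau` are assumptions for this example.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between every point in a and every point in b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def chamfer_distance(s1, s2):
    d = pairwise_sq_dists(s1, s2)
    # Nearest neighbor in s2 for each s1 point, plus the reverse direction.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def f_score(pred, gt, tau):
    d = np.sqrt(pairwise_sq_dists(pred, gt))
    precision = np.mean(d.min(axis=1) < tau)  # predicted points near the ground truth
    recall = np.mean(d.min(axis=0) < tau)     # ground truth points near the prediction
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

EMD is not included because it requires solving an assignment problem between the two clouds rather than a nearest-neighbor lookup.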

3D Reconstruction

Volumetric IoU — intersection over union of 3D volumes.
Surface-to-Surface Distance — average distance between reconstructed and ground truth surfaces.

Human Pose Estimation Metrics

Percentage of Correct Keypoints (PCK)

Percentage of predicted keypoints that fall within a distance threshold of the ground truth keypoints.

Mean Per Joint Position Error (MPJPE)

Average Euclidean distance between predicted and ground truth joint positions.
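PCK and MPJPE both start from the per-keypoint Euclidean error, so they can share a helper. The sketch below assumes arrays of shape `(n_keypoints, dims)` and a threshold expressed in the same units as the coordinates. In practice the PCK threshold is often normalized by a body-size reference such as head or torso length, which is not handled here. Function names are hypothetical.

```python
import numpy as np

def keypoint_errors(pred, gt):
    """Euclidean distance between each predicted and ground truth keypoint."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=-1)

def pck(pred, gt, threshold):
    # Fraction of keypoints whose error is within the threshold.
    return float(np.mean(keypoint_errors(pred, gt) <= threshold))

def mpjpe(pred, gt):
    # Mean error over all joints.
    return float(np.mean(keypoint_errors(pred, gt)))
```

With one exact keypoint and one that is 5 units off, `pck(..., threshold=1)` is 0.5 and `mpjpe` is 2.5.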

Object Keypoint Similarity (OKS)

Similar to IoU but for keypoints — accounts for keypoint visibility and scale.

OKS = \frac{\sum_{i} \exp\left( - d_{i}^{2} / (2 s^{2} k_{i}^{2}) \right) \delta(v_{i} > 0)}{\sum_{i} \delta(v_{i} > 0)}

Where:

$d_{i}$ is the Euclidean distance between predicted and ground truth keypoint $i$.
$s$ is the object scale.
$k_{i}$ is the per-keypoint constant.
$v_{i}$ is the visibility flag for keypoint $i$, and $\delta(v_{i} > 0)$ is 1 when the keypoint is labeled and 0 otherwise.
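The OKS formula maps to a few array operations. In this sketch `scale` plays the role of $s$ and `k` holds the per-keypoint constants. How these are derived (for example from the object area and dataset-specific keypoint constants) depends on the benchmark and is left to the caller. The function name and the choice to return 0 when no keypoint is labeled are assumptions.

```python
import numpy as np

def oks(pred, gt, visibility, scale, k):
    """Object Keypoint Similarity for one object."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    k = np.asarray(k, dtype=float)
    labeled = np.asarray(visibility) > 0

    if not labeled.any():
        return 0.0  # assumption: no labeled keypoints means no similarity

    d2 = np.sum((pred - gt) ** 2, axis=-1)  # squared distances d_i^2
    similarity = np.exp(-d2 / (2.0 * scale ** 2 * k ** 2))
    # Average only over keypoints that are labeled in the ground truth.
    return float(similarity[labeled].mean())
```

A perfect prediction gives 1. Each keypoint's contribution decays with its distance relative to the object scale, and keypoints with visibility 0 are excluded from both the sum and the count.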

DSWoK — Data Science Well of Knowledge
