### Object Detection Metrics
Object detection models predict bounding boxes around objects and classify them.
1. **Intersection over Union (IoU)**: Measures the overlap between predicted and ground truth bounding boxes.
$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$
2. **Average Precision (AP)**: The area under the Precision-Recall curve for a specific class, calculated at a specific IoU threshold.
$\text{AP} = \int_{0}^{1} p(r) dr$
Where $p(r)$ is the precision at recall level $r$.
3. **Mean Average Precision (mAP)**: The mean of AP values across all object classes, often calculated at multiple IoU thresholds (see the sketch after this list).
$\text{mAP} = \frac{1}{n} \sum_{i=1}^{n} \text{AP}_i$
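The relationship between these quantities is easy to see in code. Below is a minimal NumPy sketch, not a reference implementation: the function names (`box_iou`, `average_precision`) and the `[x1, y1, x2, y2]` box convention are assumptions, and matching detections to ground truth at a given IoU threshold is assumed to have been done already.

```python
import numpy as np

def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    overlap = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return overlap / (area_a + area_b - overlap)

def average_precision(is_tp, n_gt):
    """AP for one class. Detections are already sorted by descending confidence;
    is_tp[i] is True if detection i matches a ground-truth box at the chosen
    IoU threshold. Integrates the all-point precision-recall curve."""
    is_tp = np.asarray(is_tp)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # add sentinel points and take the upper envelope of the precision curve
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.sum(np.diff(recall) * precision[1:])

# mAP at a single IoU threshold is just the mean of the per-class APs
ap_per_class = [average_precision([True, True, False], n_gt=2),
                average_precision([True, False, True], n_gt=3)]
print("mAP:", np.mean(ap_per_class))
```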
### Semantic Segmentation Metrics
In semantic segmentation, each pixel belongs to one class.
1. **Pixel Accuracy**: The proportion of correctly classified pixels among all pixels. It can be dominated by large classes such as the background.
$\text{Pixel Accuracy} = \frac{\text{Number of correctly classified pixels}}{\text{Total number of pixels}}$
2. **Mean Intersection over Union (mIoU)**: The average IoU across all classes (see the sketch after this list).
$\text{IoU}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FP}_c + \text{FN}_c}$
$\text{mIoU} = \frac{1}{n_c} \sum_{c=1}^{n_c} \text{IoU}_c$
Where:
- $\text{TP}_c$ is the number of true positive pixels for class $c$
- $\text{FP}_c$ is the number of false positive pixels for class $c$
- $\text{FN}_c$ is the number of false negative pixels for class $c$
- $n_c$ is the number of classes
3. **Frequency Weighted IoU (FWIoU)**: A weighted version of mIoU that accounts for class imbalance.
$\text{FWIoU} = \frac{1}{\sum_{k=1}^{n_c} t_k} \sum_{c=1}^{n_c} t_c \cdot \text{IoU}_c$
Where $t_c$ is the total number of pixels that truly belong to class $c$.
4. **Dice Coefficient**: The segmentation equivalent of the F1 score.
$\text{Dice}_c = \frac{2\,\text{TP}_c}{2\,\text{TP}_c + \text{FP}_c + \text{FN}_c} = \frac{2\,\text{IoU}_c}{\text{IoU}_c + 1}$
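All of the pixel-level metrics above can be read off a single class-by-class confusion matrix. A minimal NumPy sketch (function and variable names are illustrative, not a library API):

```python
import numpy as np

def segmentation_metrics(gt, pred, num_classes):
    """Pixel accuracy, mIoU, FWIoU and Dice from integer label maps.
    gt and pred have the same shape, with values in [0, num_classes)."""
    gt = gt.ravel()
    pred = pred.ravel()
    # confusion[i, j] = number of pixels with true class i predicted as class j
    confusion = np.bincount(gt * num_classes + pred,
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp            # predicted as c, truly another class
    fn = confusion.sum(axis=1) - tp            # truly c, predicted as another class
    pixel_acc = tp.sum() / confusion.sum()
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)         # equivalently 2*IoU / (IoU + 1)
    freq = confusion.sum(axis=1) / confusion.sum()   # t_c / sum_k t_k
    return {"pixel_acc": pixel_acc,
            "mIoU": np.nanmean(iou),
            "FWIoU": np.nansum(freq * iou),
            "mean_Dice": np.nanmean(dice)}

gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(segmentation_metrics(gt, pred, num_classes=3))
```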
### Instance Segmentation Metrics
Instance segmentation involves both semantic segmentation and instance differentiation (separating individual objects).
1. **Mask AP**: Average Precision calculated based on IoU between predicted and ground truth masks instead of bounding boxes.
2. **Panoptic Quality (PQ)**: Combines recognition and segmentation quality for panoptic segmentation tasks (see the sketch after this list).
$\text{PQ} = \underbrace{\frac{\sum_{(p,g) \in TP} \text{IoU}(p,g)}{|TP|}}_\text{segmentation quality (SQ)} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_\text{recognition quality (RQ)}$
Where:
- $p$ is a predicted segment
- $g$ is a ground truth segment
- $TP$, $FP$, $FN$ are true positives, false positives, and false negatives
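Given the matched (prediction, ground truth) pairs and their IoUs, PQ reduces to a few lines. A small sketch, assuming the matching step (a pair counts as a true positive when its mask IoU exceeds 0.5) has already been performed:

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """tp_ious: IoU of each matched (prediction, ground truth) segment pair."""
    if not tp_ious:
        return 0.0
    sq = sum(tp_ious) / len(tp_ious)                                   # segmentation quality
    rq = len(tp_ious) / (len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn)   # recognition quality
    return sq * rq

print(panoptic_quality(tp_ious=[0.9, 0.75, 0.6], num_fp=1, num_fn=2))
```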
### Image Generation and Synthesis Metrics
These metrics evaluate the quality, diversity, and realism of generated images.
1. **Fréchet Inception Distance (FID)**: Measures the distance between the feature distributions of generated and real images, with features extracted by a pre-trained Inception network. It compares the mean and covariance of the two feature distributions; lower values indicate that generated images are more similar to real images in terms of deep features (see the sketch after this list).
$\text{FID} = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2\sqrt{\Sigma_r \Sigma_g})$
Where:
- $\mu_r$ and $\mu_g$ are the mean feature representations of real and generated images
- $\Sigma_r$ and $\Sigma_g$ are the covariance matrices of the feature representations
2. **Inception Score (IS)**: Measures the quality (sharpness, recognizability by a pre-trained Inception network) and diversity of generated images.
$\text{IS} = \exp\left( \mathbb{E}_x [ \text{KL}(p(y|x) || p(y)) ] \right)$
Where:
- $p(y|x)$ is the conditional class distribution for image $x$
- $p(y)$ is the marginal class distribution
3. **Structural Similarity Index (SSIM)**: Measures the perceptual similarity between two images based on luminance, contrast, and structure. Values range from -1 to 1 (often rescaled to 0 to 1), where 1 indicates perfect similarity. It aims to be more consistent with human perception than PSNR/MSE.
$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$
Where:
- $\mu_x$ and $\mu_y$ are the average pixel values
- $\sigma_x^2$ and $\sigma_y^2$ are the variances
- $\sigma_{xy}$ is the covariance
- $c_1$ and $c_2$ are constants to avoid division by zero
4. **Peak Signal-to-Noise Ratio (PSNR)**: Measures the quality of reconstructed images in tasks like denoising or super-resolution. It is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects its fidelity, and it is computed from the MSE.
$\text{PSNR} = 10 \cdot \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)$
Where:
- $\text{MAX}_I$ is the maximum possible pixel value
- $\text{MSE}$ is the mean squared error between images
5. **Learned Perceptual Image Patch Similarity (LPIPS)**: Measures perceptual similarity using deep features from pre-trained networks (e.g., VGG, AlexNet). It aims to align better with human perception of similarity than pixel-wise metrics like MSE.
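A hedged sketch of the FID computation from pre-extracted features. The arrays below are random placeholders; in practice both feature sets must come from the same pre-trained Inception layer, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_real, feat_gen):
    """FID between two sets of features of shape (n_samples, n_features)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    # matrix square root of the product of the two covariance matrices
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):      # drop tiny imaginary parts from numerics
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# placeholder features; real usage extracts them with an Inception network
rng = np.random.default_rng(0)
feat_real = rng.normal(size=(256, 64))
feat_gen = rng.normal(loc=0.1, size=(256, 64))
print("FID:", frechet_distance(feat_real, feat_gen))
```

For PSNR and SSIM, ready-made implementations such as `skimage.metrics.peak_signal_noise_ratio` and `skimage.metrics.structural_similarity` are available (see the scikit-image link below).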
### 3D Vision Metrics
Metrics for evaluating 3D reconstruction, depth estimation, and point cloud processing.
1. **Depth Estimation Metrics**:
- **Mean Absolute Error (MAE)**: Average absolute difference between predicted and ground truth depths.
- **Root Mean Squared Error (RMSE)**: Square root of the average squared differences.
- **Threshold Accuracy**: Percentage of pixels for which $\max(d_{\text{pred}} / d_{\text{gt}},\ d_{\text{gt}} / d_{\text{pred}}) < t$ (commonly $t \in \{1.25, 1.25^2, 1.25^3\}$).
2. **Point Cloud Metrics**:
- **Chamfer Distance**: Measures the average distance from each point in one point cloud to its nearest neighbor in the other (see the sketch after this list).
$\text{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} ||x-y||_2^2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} ||y-x||_2^2$
- **Earth Mover's Distance (EMD)**: The minimum "cost" to transform one point cloud into another.
- **F-Score**: The harmonic mean of precision and recall at a specific distance threshold.
3. **3D Reconstruction Metrics**:
- **Volumetric IoU**: The intersection over union of 3D volumes.
- **Surface-to-Surface Distance**: The average distance between reconstructed and ground truth surfaces.
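A brute-force NumPy sketch of the Chamfer distance formula above (O(n·m) pairwise distances, so only suitable for small clouds; the function name is illustrative):

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between point clouds of shape (n, 3) and (m, 3)."""
    # squared Euclidean distance between every pair of points
    d2 = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
cloud_a = rng.normal(size=(100, 3))
cloud_b = cloud_a + rng.normal(scale=0.01, size=(100, 3))
print("Chamfer distance:", chamfer_distance(cloud_a, cloud_b))
```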
### Human Pose Estimation Metrics
1. **Percentage of Correct Keypoints (PCK)**: The percentage of predicted keypoints that fall within a distance threshold of the ground truth keypoints.
2. **Mean Per Joint Position Error (MPJPE)**: The average Euclidean distance between predicted and ground truth joint positions.
3. **Object Keypoint Similarity (OKS)**: Similar to IoU but for keypoints, accounting for keypoint visibility and object scale (see the sketch after this list).
$\text{OKS} = \frac{\sum_i \exp(-d_i^2 / (2s^2k_i^2)) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$
Where:
- $d_i$ is the Euclidean distance between predicted and ground truth keypoint $i$
- $s$ is the object scale
- $k_i$ is the per-keypoint constant
- $v_i$ is the visibility flag for keypoint $i$
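A minimal sketch of OKS for a single object. The per-keypoint constants `k` below are placeholder values in the style of the COCO keypoint sigmas, not an official configuration.

```python
import numpy as np

def oks(pred, gt, visibility, k, scale):
    """Object Keypoint Similarity.
    pred, gt: (num_keypoints, 2) arrays of (x, y) coordinates
    visibility: (num_keypoints,) flags, > 0 means the keypoint is labelled
    k: (num_keypoints,) per-keypoint constants; scale: object scale s."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)
    visible = visibility > 0
    similarity = np.exp(-d2 / (2.0 * scale ** 2 * k ** 2))
    return similarity[visible].sum() / visible.sum()

gt = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
pred = gt + np.array([[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]])
print("OKS:", oks(pred, gt, visibility=np.array([2, 1, 0]),
                  k=np.array([0.026, 0.079, 0.107]), scale=5.0))
```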
## Links
- [COCO Evaluation Metrics](https://cocodataset.org/#detection-eval)
- [Scikit-image Comparison Metrics](https://scikit-image.org/docs/stable/api/skimage.metrics.html)
- [PyTorch Vision Metrics](https://pytorch.org/vision/stable/reference.html)