Object Localization
Image Classification
- Purpose: Identifies a single object in an image and assigns it a specific label (e.g., “car”).
Object Localization
- Purpose: Recognizes and determines the precise location and size of an object within an image by drawing a bounding box around it.
- Relation to Object Detection: It is a fundamental component of object detection, which deals with detecting multiple objects.
Final Output
object probability(P_c): The probability that the object exists. bounding box: The position of the bounding box. classes: The label of the corresponding object class, consisting of 0s and 1s.
Classification with Localization
Outputs: In addition to the object’s class, the neural network outputs four parameters (bx, by, bh, bw) that define the bounding box of the detected object. These parameters are part of the output vectory.(Outputs a bounding box that indicates the location as well as the class of the object.)
- Bounding Box Coordinates: The image is conceptually mapped such that the upper-left corner is (0,0) and the lower-right corner is (1,1).
- Output Vector Components:
- pc: Indicates the presence (1) or absence (0) of a classifiable object (not background).
- bx, by, bh, bw: Specify the bounding box dimensions and position.
- c1, c2, c3: Class indicators, which determine whether the object is a pedestrian, car, or motorcycle. When an object is present (pc = 1), one of these indicators will also be 1. If no object is detected, these are irrelevant.
- Loss Function: Training uses a sum of squared errors to minimize the differences between predicted outputs and actual labels for all components of y. Focus is on all components when an object is detected, and solely on pc when no object is present. - In a three-class classification problem, if you use the squared error as the loss
Object Detection
- Purpose: Detects and localizes multiple objects within a single image, each potentially belonging to different categories.
- Complexity: Handles more complex scenes with multiple dynamic and static objects, crucial for applications like autonomous driving.
Landmark Detection
Landmark Detection
- Capability: Now able to output X and Y coordinates of critical points or landmarks within images.
- Utility: Extremely useful in applications like face recognition, where it’s crucial to precisely identify specific details such as the corners of eyes.
Multiple Landmark Outputs
- Functionality: Networks can be tailored to output coordinates for multiple landmarks, accommodating complex requirements like identifying all corners of the eyes or other detailed facial features, with capability extending up to 64 points on a face.
Network Configuration
- Setup: Configured to have 129 output units; one unit to detect the presence of a face and 128 units to manage the coordinates of 64 landmarks.
Applications in Emotion Recognition and Augmented Reality
- Foundational Technologies: These capabilities underpin advanced technologies such as emotion recognition and augmented reality filters, which dynamically apply effects based on detected facial landmarks.
- Training Requirements: Effective training demands a well-labeled dataset with consistent annotations across all images to ensure accurate landmark detection.
Extension to Pose Detection
- Application Extension: Extends to pose detection, where the network identifies key body positions, useful in various fields from health monitoring to interactive gaming.
Consistency Requirement
- Importance: It’s essential that the identity of each landmark remains consistent across different images to ensure the network’s reliability and accuracy.
Sliding Windows Detection Algorithm
- Purpose: This algorithm is used to detect objects within an image by systematically analyzing different sections of the image.
- Training: Initially, a Convolutional Neural Network (ConvNet) is trained with a labeled dataset consisting of closely cropped images of cars. These images are specifically chosen to ensure that the car occupies most of the image, thereby simplifying the recognition task for the ConvNet.
- Implementation: In application, the algorithm divides a test image into multiple small rectangular regions. Each region is processed by the ConvNet, which predicts whether a car is present (output 1) or not (output 0).
- Adaptation: The process involves varying the size of the windows and the strides between them to cover different scales and positions within the image.
- Computational Cost: A significant drawback of this method is its high computational demand. Each small region must be individually processed, which becomes resource-intensive.
- Evolution: Previously, simpler classifiers based on hand-engineered features were used due to their lower computational cost. However, with advancements in neural networks, these have largely been replaced by more capable but computationally expensive ConvNets.
- Efficiency Improvements: Despite its challenges, the Sliding Windows Detection approach can be optimized. By implementing the detection in a convolutional manner, the process becomes more efficient, reducing the computational load while maintaining accuracy.
Convolutional Implementation of Sliding Windows Object Detection
- Converting a fully connected (FC) layer to a convolutional layer within a convolutional neural network (CNN) creates an architecture that is better suited for tasks such as image classification.
- Sermanet published a paper in 2014 that replaced the backward fully connected layer of a traditional convolutional neural network with a convolutional layer.
- Comparison of the traditional fully connected layer (above) and the convolutional fully connected layer (below):
- Compared to the traditional layer (above), the convolutional layer (below) performs convolution operations instead of flattening all dimensions to one dimension after max pooling.
- Mathematically, this is equivalent to the operations of a fully connected layer. The result produces 400 values, each of which is a random linear function of a filter of size 5 x 5 x 16, and these values pass through an activation function.
In sliding window detection, it is utilized as follows.
Convolutional Sliding Window Detection Method
- Let’s take an example where detection occurs by sliding across four windows. If a convolutional layer, rather than a fully connected layer, is used at the end, regions corresponding to these four windows are created as shown below. This allows classification of all four windows at once without passing through the convolutional layer four times, significantly reducing computational costs and thus enabling more efficient calculations.
Starting Point: Originally, the sliding windows method used a convnet to detect objects, but it was too slow.
- Better Approach: To speed things up, the system now converts fully connected layers to convolutional layers. This changes the output from classifying four types of probabilities with a softmax unit to using convolutions.
- Sliding window detection chops up a picture into smaller sections (windows) and processes each one separately through the model to get results.
- It doesn’t just use one size for these windows; it tries out several sizes to cover more ground.
Faster Processing: By running the convnet on the whole image at once, this new method saves a lot of computing time. It does this by doing many calculations together that used to be done separately.
- Results and Drawbacks: This setup makes the whole detection process quicker because it does all the work in one go. However, it’s not perfect at figuring out exactly where to place bounding boxes. There might need to be some tweaks or a new strategy to get this part right.
- One downside is that this method can be resource-heavy since it scans the entire image in one shot.
- Also, if the window sizes are too small or the steps between them too big, the accuracy drops, and the system might miss spotting some objects correctly.
reference:
Coursera - Convolutional Neural Networks
Boostcourse - Convolutional Neural Networks