Implementation of HOG for Human Detection

Introduction

The Problem

As described in the paper, this project tries to solve the problem of human detection, even in cluttered backgrounds under difficult illumination.
The ability to tell whether and where a human appears in an image could be helpful for many kinds of organizations (image search engines, security companies and more).

This project is an implementation of the algorithm described in the paper, including another application which evaluates the solution in the wild.

Related Paper

Our work related to HOG feature extraction is fully based on the paper:
Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs, CVPR 2005

Our Work

Our aims in this work were:

To implement our own HOG feature extractor (according to the algorithm description in the paper)
To train a classifier on the dataset which was prepared and used by the authors of the paper, and to get high accuracy on the corresponding testing set
To create a program which evaluates the classifier in the wild, in a way that it scans images provided by the user and uses our trained classifier to detect humans within those provided images

Used Datasets and Technologies

The Used Dataset

We used the INRIA Person Dataset, which was prepared by the authors of the paper our work is based on

About the dataset:

A 970MB tar archive file, named INRIAPerson.tar
Contains the following relevant directories:

96x160H96/Train/pos:	2416 human images centered and cropped, used as positive training set (each image is sized 96x160 pixels)
70x134H96/Test/pos:	1132 human images centered and cropped, used as positive testing set (each image is sized 70x134 pixels)
Train/neg:	1218 images with no humans within. Image sizes vary. We used them to randomly crop and save 24360 patches (20 patches from each), to be used as a negative training set
Test/neg:	453 images with no humans within. Image sizes vary. We used them to randomly crop and save 9060 patches (20 patches from each), to be used as a negative testing set

Some samples from the dataset:

96x160H96/Train/pos:
70x134H96/Test/pos:
Train/neg:
Test/neg:

The Used HOG Feature Vector Extractor

The HOG feature vector extractor is implemented in Matlab, in the function computeHOG126x63(). Its implementation is found in the file computeHOG126x63.m

The function computeHOG126x63() expects an image sized at least 63x126 pixels

It assumes that a human is centered in the provided image (if it is a positive sample) and it computes the HOG feature vector only on the sub-image formed by the central 63x126 pixels

We use 63x126 pixels per human image because, following the paper, we use cells of 6x6 pixels and blocks of 3x3 cells, which makes each block 18x18 pixels.
The paper also overlaps neighboring blocks by 50%, so blocks advance in steps of 9 pixels, and the width and height of the window must both be divisible by 9.
Hence we chose 63x126 pixels: both dimensions are divisible by 9, and it is the closest such size to the 64x128 pixels suggested in the paper
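The arithmetic behind this choice can be checked with a short C sketch. The function names are hypothetical and are not part of the project's code; the constants (6x6 cells, 3x3 cells per block, 9 bins, 50% overlap) are the ones stated in this document.

```c
#include <stddef.h>

/* Number of block positions along one axis of a window that is `extent`
 * pixels long, for a given block side and stride (both in pixels).
 * Returns 0 if no block fits. */
size_t blocks_along(size_t extent, size_t block, size_t stride)
{
    if (stride == 0 || extent < block)
        return 0;
    return (extent - block) / stride + 1;
}

/* Length of the HOG feature vector of a width x height window. */
size_t hog_length(size_t width, size_t height)
{
    const size_t cell = 6;                        /* cell side in pixels   */
    const size_t cells_per_side = 3;              /* a block is 3x3 cells  */
    const size_t bins = 9;                        /* bins per cell         */
    const size_t block = cell * cells_per_side;   /* 18 pixels             */
    const size_t stride = block / 2;              /* 50% overlap: 9 pixels */

    size_t nx = blocks_along(width, block, stride);
    size_t ny = blocks_along(height, block, stride);
    return nx * ny * cells_per_side * cells_per_side * bins;
}
```

For a 63x126 window this gives 6 block positions horizontally and 13 vertically, so 78 blocks of 81 values each. A 64-pixel-wide window would also hold 6 block columns, but the last pixel column would never be covered by any block.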

The Used Classifier and Related Libraries

The training and classification code is written in C

We used a FeedForward neural network as a classifier, from the library: C/C++ Neural Network
As the feature vectors are computed and exported by Matlab into CSV files, we also had to use a CSV parser library: C/C++ CSV Parser

Our FeedForward neural network is set with the following structure and behavior:

6318 inputs (The length of the feature vector)
50 neurons as a 1st hidden level
30 neurons as a 2nd hidden level
1 output neuron
All the neurons use the tanh activation function

Here is a scheme of the structure of our neural network:

The Workflow

The following diagram explains our workflow in a very high level:

Note: CSV stands for "Comma Separated Values"
For example, the below sample content of a CSV file defines the 3 vectors [1 2 3] , [1.23 0.54 -6], [-1 -1 -1]

1,2,3
1.23,0.54,-6
-1,-1,-1

Thus, we can compute the HOG vectors of many images and then export them all at once into a CSV file.
This CSV file can be parsed and used by other applications, such as our C application for training and activating neural networks
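As an illustration of how little parsing such a file needs, the sketch below reads one CSV line of numbers with the standard strtod() function. It is not the C/C++ CSV Parser library used in the project, and parse_csv_line() is a hypothetical name.

```c
#include <stdlib.h>
#include <stddef.h>

/* Parse one CSV line of numbers into out[0..max_values-1].
 * Returns the number of values parsed, or -1 on malformed input or when
 * the line holds more than max_values values. */
int parse_csv_line(const char *line, double *out, size_t max_values)
{
    size_t count = 0;
    const char *p = line;

    while (*p != '\0' && *p != '\n' && *p != '\r') {
        char *end;
        double value;

        if (count == max_values)
            return -1;              /* more values than the caller expects */
        value = strtod(p, &end);
        if (end == p)
            return -1;              /* no number where one was expected */
        out[count++] = value;
        p = end;
        if (*p == ',')
            ++p;                    /* skip the separator */
        else if (*p != '\0' && *p != '\n' && *p != '\r')
            return -1;              /* unexpected character after a number */
    }
    return (int)count;
}
```

Calling it on the line "1.23,0.54,-6" fills the output array with the second vector of the example above and returns 3. For a HOG file, max_values would be 6318.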

Dataset Preparation

As explained in the Used Dataset part, not all of the dataset images are ready to use:

The simplest case is that of the positive samples (both for training and for testing):

A human is centered in each of the positive sample images (Explained in the Used Dataset part)
As explained in the Used HOG Feature Vector Extractor part, our function for HOG feature vector extraction always takes only the sub-image formed by the central 63x126 pixels
It actually means that in our case, the positive samples do not require any preprocessing, as our function will anyway process only the relevant part of each of these samples

The more challenging case is the preparation of the negative samples:

We want the negative training set to be ~10 times larger than the positive one
As the positive training set contains 2416 samples, while there are 1218 large negative training images with no human within, we randomly crop 20 patches of 63x126 pixels each, from each of the 1218 large negative images
This cropping gives us a negative training set, which contains 24360 samples sized 63x126 each

The same is done with the negative testing set: there are 1132 positive testing samples and 453 large negative testing images.
We again randomly crop 20 patches of 63x126 pixels from each large negative image, to form a negative testing set of 9060 samples
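The core of the random cropping is choosing a top-left corner such that the whole patch stays inside the source image. The project does this in Matlab in cutRandomImages(); the C sketch below only illustrates the idea, and its names are hypothetical.

```c
#include <stdlib.h>

typedef struct {
    int x;   /* column of the top-left pixel of the patch */
    int y;   /* row of the top-left pixel of the patch    */
} PatchOrigin;

/* Pick a random top-left corner so that a patch_w x patch_h patch fits
 * inside an img_w x img_h image.
 * Returns 0 on success, -1 if the image is smaller than the patch. */
int random_patch_origin(int img_w, int img_h, int patch_w, int patch_h,
                        PatchOrigin *origin)
{
    if (img_w < patch_w || img_h < patch_h)
        return -1;
    /* rand() % n is slightly biased, which is acceptable for a sketch. */
    origin->x = rand() % (img_w - patch_w + 1);
    origin->y = rand() % (img_h - patch_h + 1);
    return 0;
}
```

Calling this 20 times per image with a patch size of 63x126 yields the 20 patch positions per negative image described above.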

Here are some samples of the negative patches we randomly cropped from the large images:

Training negative samples:
Testing negative samples:

Finally, the dataset we are going to use is formed by:

2416 positive training samples	Taken without preprocessing from 96x160H96/Train/pos
24360 negative training samples	Prepared by cropping patches from the 1218 images in Train/neg
1132 positive testing samples	Taken without preprocessing from 70x134H96/Test/pos
9060 negative testing samples	Prepared by cropping patches from the 453 images in Test/neg

We copy these photos we prepared into 4 directories, so we can easily access them later:

train/pos - 2416 positive training images
train/neg - 24360 negative training images
test/pos - 1132 positive testing images
test/neg - 9060 negative testing images

Note: The cropping of the negative images into samples is performed by our implemented Matlab function cutRandomImages(). To automatically cut and save 20 patches per image from source images directory SRC into destination images directory DST, where each patch is sized 63x126 pixels, one should call it as cutRandomImages(SRC, DST, 20, 63, 126)
This function is found in the file cutRandomImages.m

Extracting HOG Feature Vectors from the Dataset

After the Dataset Preparation step, we have 4 directories of images, which we want to turn into HOG vectors.
We use our Matlab function createDataset() in order to turn the images into HOG vectors and to save them in corresponding 4 CSV files.

Our function createDataset() gets 2 parameters:

Input images directory
Output CSV file

We run it 4 times, each time for a different directory.
For example, to create a CSV file of the positive training samples, we'll call the function with the path to the train/pos directory and with a path to an output file, which we'll name something like train_pos.csv.

In general, the flow of the function createDataset() is:

Get all the images that are in the provided input image directory
Prepare a matrix sized Nx6318 (where N is the number of input images found)
For each image, call the function computeHOG126x63() so it computes the 6318-sized HOG vector of that image, and save this vector as a row of the matrix
Finally, when the matrix rows contain all the HOG vectors, output the matrix into the provided output file, in CSV format

Below is a detailed explanation of the HOG extraction flow.

Given a 63x126 pixels image, our HOG feature extractor works according to the following flow:

If the provided image is not in Grayscale, convert it to Grayscale
Compute the image X-gradient matrix using convolution with the mask: [-1 0 1]
Compute the image Y-gradient matrix using convolution with the mask: [-1 0 1]' (the ' operator stands for transpose)
For each pixel in the image, compute its gradient magnitude, using the computed X and Y gradient matrices. The gradient magnitude in pixel (i,j) is defined by the formula: G(i,j) = sqrt(X'(i,j)^2 + Y'(i,j)^2), where X' and Y' stand for our computed X-gradient matrix and Y-gradient matrix respectively
For each pixel (i,j) in the image, compute its gradient direction, using the formula: D(i,j) = arctan(Y'(i,j) / X'(i,j)), where again, X' and Y' stand for our computed X-gradient matrix and Y-gradient matrix respectively
As the arctan() function returns values between -90° and 90°, while we want the directions to be between 0° and 180°, we add (+180°) to all the negative values (we actually "flip" the negative angles to the other side). More information about it is in the notes below

Iterate over all blocks sized 18x18 pixels, with 50% overlapping between each block, and do:

Divide the block into a grid of 3x3 equal cells (such that each cell is formed by 6x6 pixels)
Iterate over the cells, and do:

Allocate a new array with 9 indices
Compute the sum of all the gradient magnitudes of all the pixels with gradient direction between 0° and 20°. Assign this value to the 1st index in the allocated array
Compute the sum of all the gradient magnitudes of all the pixels with gradient direction between 20° and 40°. Assign this value to the 2nd index in the allocated array
. . .
Compute the sum of all the gradient magnitudes of all the pixels with gradient direction between 160° and 180°. Assign this value to the 9th index in the allocated array
Note: This cell processing is in fact the computation of a 9-binned Histogram of Oriented Gradients for this specific cell. We divide the range of 0° to 180° into 9 equal bins of 20° each and, for each bin, sum the magnitudes of the matching pixels of the 6x6 cell

Concatenate all the cells' arrays into one array with 81 indices (one cell's array has 9 indices and there are 3x3 = 9 cells ==> there should be 81 indices in the concatenation of the arrays)
Note: The new array we just formed by concatenating the arrays of all the cells in the block is in fact the Histogram of Oriented Gradients of this block
Normalize the 81-sized array by dividing it by: , where x is the array. This normalization is done in order to reduce the effect of local lighting differences

Concatenate all the blocks' 81-sized vectors into 1 long vector with 6318 indices (There are totally 78 blocks of 18x18 pixels each and with 50% overlapping, while each block's vector has 81 indices, hence, the final vector has 78 * 81 = 6318 indices)
Output the vector sized 6318 as the HOG feature vector of the provided image

The following graphical slideshow might help to understand it better:

Notes:

In the paper, it is reported that the best results were obtained when the gradient directions were unsigned (i.e. between 0° and 180°)
arctan() receives only the ratio Y'/X', so the signs of the two components are lost. A positive ratio can come from (+/+), a direction in [0°,90°], or from (-/-), a direction in [180°,270°]. A negative ratio can come from (+/-), a direction in [90°,180°], or from (-/+), a direction in [270°,360°]. In each pair the two directions differ by exactly 180°, so they are the same unsigned direction and nothing we need is lost. In Matlab, the function returns values only between -90° and 90°, so positive ratios already give the result we want, while negative ratios give results between -90° and 0° instead of between 90° and 180°. Hence, we "flip" all these negative angles by adding (+180°) to each of them, which brings all our results into the range of 0° to 180°

Training and Evaluating the Classifier

The training and classification processes are done in a C application.

The training process:

After the Extracting HOG Feature Vectors from the Dataset step, we have 4 CSV files:

train_pos.csv - contains 2416 CSV lines
train_neg.csv - contains 24360 CSV lines
test_pos.csv - contains 1132 CSV lines
test_neg.csv - contains 9060 CSV lines

Note that each line in each CSV file represents a HOG vector sized 6318

We are going to train the network, in a way that we'll want each activation of a positive sample on the network to output 1.0, and each activation of a negative sample on the network to output (-1.0).
As the negative training set is more than 10 times larger than the positive training set, we will each time train the network on a positive sample, and then on 10 negative samples, until we finish with all the positive samples.
It means that we actually give up the last 200 negative training samples, as we are going to use only 24160 of them (which is 10 times the size of the positive training set).
It also means that finally each training epoch will consist of 2416 + 24160 = 26576 training iterations.
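The interleaving can be expressed as a mapping from a step number within an epoch to the sample that is trained on. The sketch below is hypothetical: in particular, it assumes that positive sample i is followed by negative samples 10i to 10i+9, which the text above does not state explicitly.

```c
#include <stddef.h>

typedef struct {
    int is_positive;   /* 1 for a positive sample, 0 for a negative one  */
    size_t index;      /* index into the positive or negative sample set */
} TrainStep;

/* Sample used at a given step of an epoch, when every positive sample is
 * followed by neg_per_pos negative samples. */
TrainStep schedule_step(size_t step, size_t neg_per_pos)
{
    size_t group = step / (neg_per_pos + 1);    /* which positive sample */
    size_t offset = step % (neg_per_pos + 1);   /* position in the group */
    TrainStep s;

    if (offset == 0) {
        s.is_positive = 1;
        s.index = group;
    } else {
        s.is_positive = 0;
        s.index = group * neg_per_pos + (offset - 1);
    }
    return s;
}

/* Number of training iterations in one epoch. */
size_t steps_per_epoch(size_t n_pos, size_t neg_per_pos)
{
    return n_pos * (neg_per_pos + 1);
}
```

With 2416 positive samples and 10 negatives per positive, the last step of an epoch uses negative sample 24159, so negatives 24160 to 24359 are never visited.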

In order to accelerate the training, we use the FeedForwardNetwork API function FeedForwardNetwork_train_fast() instead of the standard function FeedForwardNetwork_train().
The function FeedForwardNetwork_train_fast() accelerates the training in a way that it lets the user define a callback function, which tells whether the output is satisfying. If the answer is positive, it skips the backpropagation and weight updating over the network, and thus it saves a lot of computation time.

We define this callback function to return a positive answer if the sample is positive and the network output is greater than 0.5, or if the sample is negative and the network output is less than 0.5. All other cases yield a negative answer, which causes the API to update the weights of the network using the Backpropagation algorithm.
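The decision made by the callback can be sketched as a small predicate. The callback signature expected by FeedForwardNetwork_train_fast() is not shown in this document, so the function below is hypothetical and only captures the rule; adapting it to the library's real callback type is left open.

```c
/* Returns nonzero when the network output is already on the correct side
 * of 0.5 for the given target (1.0 for positive samples, -1.0 for negative
 * ones), in which case the weight update can be skipped. */
int output_is_satisfying(double target, double output)
{
    if (target > 0.0)
        return output > 0.5;
    return output < 0.5;
}
```

An output of exactly 0.5 satisfies neither case, so such a sample always triggers a weight update.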

The flow of the training process is the following:

Read each of the CSV files into an array of arrays of doubles (See external link to C/C++ CSV Parser)
Create an instance of a FeedForward neural network, as described in The Used Classifier and Related Libraries, with learning rate of 0.01
"forever" do:

Iterate over the positive training samples (there are 2416 such), and do:

Call the training function on the current positive sample
Call the training function on the 10 corresponding negative samples

Check the network performance over the testing set, and if it is the best performance so far, save the network into a file (using the API function FeedForwardNetwork_saveToFile())

Kill the process when the network is fully trained on the training set, such that the results stop changing

At the end of this training process, we have a trained network file, which we can later load to create an instance of the network that is used from then on only as a classifier

The evaluating process:

This step is much simpler than the training process, as now we already have a trained network.

We get from the user a path to a CSV input file and a path to a trained network file.

The flow of the evaluating process is the following:

Read all the input HOG vectors from the provided CSV file into an array of arrays of doubles (again using C/C++ CSV Parser)
Create an instance of a FeedForward neural network using the trained network file (Using the API function FeedForwardNetwork_new_from_file())
Iterate over the HOG vectors, and do:

Activate the vector on the network using the API function FeedForwardNetwork_activate().
If the output is greater than 0.5, print "1", else, print "0"

Finally, the output of the evaluating application will contain N lines, where N is the number of vectors in the input CSV file.
Each output line contains one of the numbers {0, 1}, such that 1 is a positive answer and 0 is a negative answer, where of course, the result in the i'th output line is corresponding to the sample represented by the HOG vector at the i'th line in the provided CSV file

Using the Classifier in the Wild

After training and testing the classifier on the dataset, we also wanted to check how effective it is at detecting humans within arbitrary images.

Note that the negative training and testing sets we used contain patches from many varied human-free images.
We assume that such random images contain only a small number of non-human objects whose HOG descriptors look similar to a human's.

However, some objects with strong vertical components, like trees, poles and human limbs, could produce HOG descriptors that are very similar to those of humans and too hard to discriminate from real humans.
The following work we've found supports this claim: Pedestrian Detection project (Computer Vision course, Computer Science Department in Stanford University)

Based on this assumption, we do not expect the same accuracy on every random image that is checked. Some images could give better results, while others could give worse, depending on the contents of the image.
Of course, when measured over a large enough number of scanned random images, we expect the accuracy rates to approach the results we reported on the testing set.

Our detector works in the following way:

It computes the HOG vectors of all the 63x126 windows in the input image, with 80% overlap between neighboring windows
It then scans all the windows whose size is multiplied by sqrt(2), downscales each of them to our 63x126 size and computes their HOG vectors as well
It continues the same way, multiplying the window size by sqrt(2) each time, until the window exceeds the height or the width of the image

We then output the HOG vectors of all the windows into a CSV file and call the classifier
The classifier activates each input vector on the network, and returns a positive result if the output of the network was greater than 0.5, and a negative result otherwise.
The corresponding windows of all vectors which got a positive result from the classifier are highlighted with green border

Here are some results we got:

A perfect performance. 100% accuracy

An excellent performance. Only 6 false-alarms out of thousands of checked windows. Less than 1% of mistake

In a crowded street, the results are more noisy and harder to read.
However, all the humans (sized at least 63x126) were detected and the rate of false-alarms is less than 2% of the thousands of checked windows

In another less crowded street, most pedestrians were detected, and again we can spot a few false-alarms

Results

As mentioned in the dataset preparation section, we had 1132 positive samples and 9060 negative samples in the testing set.

We start with showing a ROC curve that describes the results:

A high-scope ROC curve graph

A zoom on the ROC curve interesting area (The area in the blue circle above)

Below is the confusion matrix calculated with a threshold of 0.66, the threshold that yields the highest accuracy on the testing set:

	Predicted Human	Predicted Non-Human
Actual Human	TP 1054 (93.11%)	FN 78 (6.89%)
Actual Non-Human	FP 65 (0.72%)	TN 8995 (99.28%)

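Using these values, we can compute precision, recall and other scores.
Note that precision, F1 score and magnitude are computed here from the per-class rates (TP = 0.9311, FN = 0.0689, FP = 0.0072), which weighs both classes equally although the testing set holds about 8 times more negative samples than positive ones. Computed from the raw counts instead, precision is 1054 / (1054 + 65) = 94.19%.

Precision:	99.23%	[TP / (TP + FP)] = [0.9311 / (0.9311 + 0.0072)]
Recall:	93.11%	[TP / (TP + FN)] = [0.9311 / (0.9311 + 0.0689)]
Accuracy:	98.6%	[(TP + TN) / (P + N)] = [(1054 + 8995) / (1132 + 9060)]
F1 Score:	96.07%	[2TP / (2TP + FP + FN)] = [(2 * 0.9311) / (2 * 0.9311 + 0.0072 + 0.0689)]
Magnitude:	96.22%	[sqrt(precision^2 + recall^2) / sqrt(2)] = [sqrt(0.9923^2 + 0.9311^2) / sqrt(2)] Note: the magnitude is divided by sqrt(2) to get a result between 0 and 1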
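The scores above can be reproduced with a few lines of C. The struct and function names are hypothetical. The same function accepts either raw counts or per-class rates: with the rates (0.9311, 0.0689, 0.0072, 0.9928) it reproduces the precision and F1 values listed above, while with the raw counts (1054, 78, 65, 8995) it gives the listed recall and accuracy and a count-based precision.

```c
typedef struct {
    double precision;
    double recall;
    double accuracy;
    double f1;
} Metrics;

/* Scores derived from the four confusion-matrix entries. The entries may
 * be raw counts or per-class rates. No guard against empty classes is
 * included, so the denominators must be non-zero. */
Metrics compute_metrics(double tp, double fn, double fp, double tn)
{
    Metrics m;

    m.precision = tp / (tp + fp);
    m.recall    = tp / (tp + fn);
    m.accuracy  = (tp + tn) / (tp + fn + fp + tn);
    m.f1        = 2.0 * tp / (2.0 * tp + fp + fn);
    return m;
}
```

Recall is identical in both cases because it only involves the positive class. Precision and F1 differ because they mix the two classes, whose sizes are very different in this testing set.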

Note that the C code was tested only with the GCC compiler on Unix. With small changes, however, it should also work with MSVC.

Relevant files:

createDataset.m
Used for creating a CSV file of HOG vectors out of a directory of images

cutRandomImages.m
Used for randomly cutting and saving patches from a directory of large images into a target directory

computeHOG126x63.m
Used for computing a HOG vector of the subimage formed by the central 63x126 pixels of the provided image

detectHumans.m
Used for scanning windows in a large input image and evaluating the classifier on them, in order to highlight detected humans

train.c
Used for creating, training and saving a FeedForward neural network that is trained on the dataset

evaluate.c
Used for evaluating vectors on a trained network and returning True/False

External Links and Resources

The paper: Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs, CVPR 2005
The dataset: INRIA Person Dataset
A C library for parsing CSV files: C/C++ CSV Parser
A C library for working with neural networks: C/C++ Neural Networks

Credits

This project was prepared by Tal Hakim and David Cohn, M.Sc. students in Computer Science at the University of Haifa, Israel, as an assignment in the course Recognition and Classification in Images and Videos, taught by Dr. Margarita Osadchy, 2015.
The project is based on the paper: Histograms of Oriented Gradients for Human Detection, Navneet Dalal and Bill Triggs, CVPR 2005