
Biologically Motivated Model for Outdoor Scene Classification

A literature review

1. Introduction

Human beings understand the world through their senses. Everything we have achieved, invented or discovered has been done with the help of our senses in coordination with our brain. Vision is among the most powerful of these systems: from reading the expressions of other humans to analyzing the environment, its capability is second to none. Scene classification, or scene recognition, is one such use of vision. It is a branch of image classification concerned with extracting both important and trivial information from an input image. Humans get the entire "gist" of a scene from just a glance; the processing takes only tens of milliseconds, and millions if not billions of neural processes take place to accomplish the task. With ever-growing technology, the applications of scene classification are also increasing. Numerous applications are already in use, such as identifying faces and expressions in an image or adjusting a camera to get the best shot. Scene classification can also be used to build maps and to localize and navigate a mobile robot. This paper focuses on scene classification in an outdoor environment using a mobile robot. It compares various existing non-biological as well as biologically inspired methods of scene classification, and it devises a new biologically inspired model. This model tries to approximate the various processes that take place while we analyze a scene. It requires a visual gist module to characterize an image and the Incremental Hierarchical Discriminant Regression (IHDR) technique to simulate the prefrontal cortex of the human brain, which is used for memory generation and recall. The approach approximates the brain's memory mechanisms to obtain an accurate, high-speed "gist" of an input image. Simulating such speed and accuracy in a real-time environment is a challenging task.
However idealized, the research reported an accuracy of 100 percent during testing and also reduced the training cost of the system. The experiments were performed on a University of Southern California data set.

In previous decades, scientists and engineers used laser and sonar to determine the location of an object. But these devices are limited to environments with long-range regularities and similar orientations, such as doors, walls and tables. With advances in computer vision and artificial intelligence, growing knowledge about the human visual system has encouraged researchers to mimic the mechanisms of the visual cortex, and emulating such features is becoming a possibility. For example, sonar closely resembles the way bats find prey in the dark, and Gabor filters are used for object recognition. Despite these biologically plausible devices, however, overall scene classification remains a very complex process. The research conducted in this area can be grouped under three categories:

  1. Low level visual feature based approach
  2. Middle level visual feature based approach
  3. Biologically inspired feature based approach.

Low level feature based approach

The low level approach is used to grasp basic information about an image or a scene, much as the human eye would. It involves three basic characteristics of an image: color, shape and texture. At this level the complex details of an image are overlooked; the purpose is simply to extract each feature from each image patch independently, and no high level information about the input is gathered. To simplify further, this approach is divided into two sections:

  1. Global feature based approach, which focuses on the entire input.
  2. Local feature based approach, which focuses on just a part of the input.

These approaches are fast, but they are not capable of performing the task alone, as they cannot match the semantic understanding of humans. Approaches that build such semantic content on top of these features are known as middle level visual feature approaches, an overview of which is given in later sections of this report.

In the following paper some non-biological and biological models are discussed, and the mechanisms by which they are inspired are stated and explained. The idea is to compare these models on various fronts to review their effectiveness.

2. Appearance-Based Place Recognition System (April 2000)

From the very beginning of machine intelligence, localization has been a fundamental requirement. Localization means knowing where the system is with respect to an origin, a destination or some point of reference. A machine has to be aware of its location in order to perform any function and to plan its path (navigate) accordingly. Three basic localization methods have been used so far:

  1. Geometrical localization, in which the exact location is given in a two dimensional coordinate system. It is the method that has been researched the most.
  2. Topological localization, which uses an adjacency graph.
  3. Hybrid localization, which combines the previous two methods.

Geometrical localization is reliable only in specific environments and is rarely applied to complex ones. Recent studies of this method use the Kalman filter, but the incremental nature of the filter and the uncertainty involved in its implementation make it unsuitable for dynamic environments. For example, a small error could cause a robot collecting soil samples in a mountainous region to fall off a cliff, which illustrates the problems that arise in outdoor situations.

The discussed research adopted the topological method (deviating from the popular geometrical localization), which works well in different environments, both indoors and outdoors. This novel model uses an adjacency graph to depict the environment: the graph contains nodes, which specify different locations within the environment, and arcs, which connect them.


Fig 1. Adjacency map for an apartment (a)

The above figure shows a topological map of an apartment in which the nodes are specific locations connected by double-headed arrows. Such maps can be created with a map editor. Unlike the geometrical method, this one does not need a coordinate system, which is complex and time consuming, and no dimensional calculation is required; the maps serve only for visualization. Since this method uses only color to distinguish between different locations, the sensor should be precise enough to work efficiently without any other characteristics (text or shape) or localization methods. The system does, however, need to be trained with samples before being deployed in the actual environment.

Basic idea of using adjacency map

This can be explained with the help of an example. Suppose we take 100 images of the apartment whose adjacency map is shown above: 10 images of the office, 10 of the lab, and so on. Now we input an image for classification (say, from the office). This image must be matched against the 100 reference images, but by using the adjacency map to constrain the candidate locations, we can limit the number of comparisons from 100 to 10. This makes the process faster and more efficient.
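The pruning idea can be sketched in a few lines; the location names, image counts and adjacency links below are illustrative examples, not taken from the paper:

```python
# Sketch of how an adjacency map limits histogram comparisons.
# Locations and neighbor links here are hypothetical examples.
ADJACENCY = {
    "office":      ["corridor"],
    "lab":         ["corridor"],
    "kitchen":     ["corridor", "living room"],
    "corridor":    ["office", "lab", "kitchen"],
    "living room": ["kitchen"],
}

# Ten reference images stored per location (placeholders here).
references = {loc: [f"{loc}_img_{i}" for i in range(10)] for loc in ADJACENCY}

def candidate_images(last_known_location):
    """Only the last known location and its graph neighbors are searched,
    instead of every reference image in the database."""
    rooms = [last_known_location] + ADJACENCY[last_known_location]
    return [img for room in rooms for img in references[room]]

# A robot last localized in the office compares against 20 images, not 50.
print(len(candidate_images("office")))  # 20
```

The graph itself never changes at runtime; it simply restricts which reference sets are ever loaded for comparison.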

General working

This topological, vision based localization system first captures a variety of sample images, each labeled with its corresponding location; this is the training period. During operation the input image is compared with the labeled samples, and the location of the best-matching sample is given as output. This method does place a constraint on the training samples: a set of images must represent each location.

Training procedure

The training procedure is simple as well as fast. It involves two steps.

  1. In the first step, the robot is driven through the test environment and images are captured in such a way that each location is adequately described by a set of images. The images are captured at a frequency of 1 hertz.
  2. In the next step, the images are labeled with their locations. This process is fundamental and takes only a few minutes in practice.

The research uses an Omnicam for image capture. The main advantage of an Omnicam over a regular lens camera is that it captures panoramic images of the environment, providing rotational invariance: the same area is captured irrespective of the camera's orientation. Because of this large field of view, far fewer sample images are needed during training for a particular location, which reduces the memory requirements of the system. The system also becomes less sensitive to small changes in the environment, which is what usually happens in practical applications.

Fig 2.

The figure shows a panoramic view of an image; it can be seen that, due to the large field of view, the number of images needed for a particular location is reduced.


After the training period, a module is needed to match the test image with the samples. This process is called image retrieval. Because of the high potential and the large number of applications based on image retrieval, research in this field is very active and can be drawn upon for this experiment. Most image retrieval systems use histograms. A histogram shows the frequency of pixel intensity values; it is essentially a graphical representation of the tonal distribution of an image. Histograms are used for matching because of several advantages:

  1. A histogram requires far less memory than a raw image: a raw input image takes 231 kB of storage, while a histogram takes only 1 kB. For the same reason a histogram requires less processing time, and comparison is faster than between two raw inputs.
  2. Because histograms are rotationally invariant, a single image acquired at a certain angle represents all images of the same location at different orientations.
  3. They are very robust to the small translations that commonly occur in outdoor environments. Below, the color histogram (explained in the next section) is shown for the adjacency map (a) and for a color image captured by the Omnicam.
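As a minimal sketch (with made-up pixel values and an arbitrary bin count), building and comparing normalized histograms looks like this:

```python
# Minimal sketch: build a coarse histogram from pixel values and compare
# two of them bin-by-bin. Bin count and pixel values are illustrative.
def histogram(pixels, bins=16, max_value=256):
    h = [0] * bins
    for p in pixels:
        h[p * bins // max_value] += 1
    # Normalize so images of different sizes remain comparable.
    n = len(pixels)
    return [c / n for c in h]

office = histogram([10, 12, 200, 201, 199, 13])
lab    = histogram([10, 11, 14, 90, 95, 93])

# L1 distance between two normalized histograms (0 = identical).
l1 = sum(abs(a - b) for a, b in zip(office, lab))
print(l1)  # 1.0
```

A 16-bin histogram like this is what gets stored and compared, regardless of how large the original image was.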


Fig 3. Image adopted from


Matching process

To match the input histogram (computed online) with the sample histograms (computed offline), nearest neighbor learning is carried out. Since this place recognition system uses color to discriminate between images, the images are first transformed into the HLS and normalized RGB color spaces. HLS stands for hue, luminance and saturation: in simple terms, hue is the actual color, luminance (brightness) is the amount of white or black in a color, and saturation is the amount of grey in the hue. RGB refers to red, green and blue, similar to the color picker familiar from Microsoft Office and Microsoft Paint; RGB is used because all colors can be described by different concentrations of these three primaries. Depending on the type of environment, each image is represented by three color bands for HLS and three for RGB or normalized RGB, so a total of six color bands represent each image. Histograms are then built for these bands. With the help of the adjacency map, the number of matches needed for classification can be reduced to a particular location (office or lab) and its neighbors. To match the histograms for image retrieval, we need a metric that can find the nearest similar image.

The research tests various metrics: the L1 distance, Jeffrey divergence, χ² statistics, quadratic-form distance and the Earth Mover's Distance (EMD). The most suitable for this application was found to be the Jeffrey divergence.

In the figure below, eight images are shown. The leftmost image is the input image, and the others are the best matches found using the L1 distance, Jeffrey divergence, χ² statistics, quadratic-form distance and Earth Mover's Distance respectively.

Fig 4.

The Jeffrey divergence is computed as

d(h, k) = Σᵢ [ hᵢ log(2hᵢ / (hᵢ + kᵢ)) + kᵢ log(2kᵢ / (hᵢ + kᵢ)) ]

where h and k are the histograms for which the distance is computed, and hᵢ and kᵢ are their entries.
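A direct implementation of this divergence (assuming both histograms are normalized, and skipping empty bins so the logarithm stays defined) might look like:

```python
import math

def jeffrey_divergence(h, k):
    """Jeffrey divergence between two normalized histograms:
    a symmetric, numerically stable relative of the KL divergence."""
    d = 0.0
    for hi, ki in zip(h, k):
        m = (hi + ki) / 2  # average bin; positive whenever hi or ki is
        if hi > 0:
            d += hi * math.log(hi / m)
        if ki > 0:
            d += ki * math.log(ki / m)
    return d

# Identical histograms have zero distance; disjoint ones are maximally far.
print(jeffrey_divergence([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))  # 0.0
print(jeffrey_divergence([1.0, 0.0], [0.0, 1.0]))            # 2 ln 2 ≈ 1.386
```

Unlike the raw KL divergence, this form is symmetric in h and k and never divides by a single empty bin, which is why it behaves well on sparse histograms.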

After that, a confidence measure is calculated for each color band. The confidence measure c_b quantifies the vote that each color band casts for a particular location; the vote is cast on the basis of the minimum distance calculated using the Jeffrey divergence. It is computed as

c_b = 1 − d_m / min(d_i)

where d_m is the minimum matching distance of the candidate location and the d_i are the minimum matching distances of all other locations.

The value ranges from 0 to 1: the closer it is to 1, the better the input image matches the candidate location compared with all other candidates. Finally, the six votes (one for each HLS and RGB band) are combined for classification.
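Assuming the confidence takes the form c_b = 1 − d_m / min(d_i), which is consistent with the stated 0-to-1 range, a minimal sketch is:

```python
# Sketch of the per-band confidence measure: how much better the winning
# location matches than the best of the remaining candidate locations.
def confidence(d_winner, other_distances):
    """Returns a value in [0, 1]; closer to 1 means a more decisive vote."""
    return 1.0 - d_winner / min(other_distances)

# The winning location matched with distance 0.1; the runner-up with 0.4.
print(confidence(0.1, [0.4, 0.7, 0.9]))  # 0.75
```

When the winner's distance approaches the runner-up's, the confidence drops toward 0, so ambiguous bands contribute weak votes to the final classification.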

Testing and results:

The table below shows the results for this localization system in four different environments, each of which tests the robustness of the system. Three of the tests were taken indoors and one outdoors. Each test was performed twice, with the two sample sets taken at different times so that the color constancy of the system could be checked: one set was used for testing and the other to train the system. The adjacency maps for the locations are shown. The tests were considered highly successful, with accuracy rates of 97.7% even during transitions from one test location to another; the lowest rate, 87.5%, occurred in test number 4.

The table for 3 indoor and 1 outdoor locations is shown:

Table 1: Results of cross-validation tests


Shown below are the adjacency maps for tests 2, 3 and 4 respectively, with their results presented in the accompanying graphs. Since each location was tested twice, the graphs show the classification accuracy for both tests.

Fig 5

Fig 6. Image adopted from

Fig 7.

Fig 8.

Fig 9.

Fig 10.


The tests proved very successful, with images detected efficiently. The system has strong potential and can easily achieve an operating frequency of 2 hertz. The system is fast, but its applicability to high dimensional images still needs to be explored; it shows high potential for future development.

Analysis of metric methods for image retrieval (1998)

In the following section, different metrics that can be used efficiently in processes like image retrieval are compared. The methods compared are histogram based, and a signature based Earth Mover's Distance method is also shown. The experiment uses ten good images extracted manually from a database of 75 images, selected so that a red car dominates the major part of each image against a grey or green background, as shown. The graph below compares the average number of relevant images as a function of the number of retrieved images for the different metrics.

Fig 11.

Fig 12.

It can be observed that among the histogram methods the Earth Mover's Distance performs better than all others, including the Jeffrey divergence; the background of each image was chosen to be different. It can also be seen that the signature based EMD outperforms all the metrics compared. EMD is a better choice for metric matching for the following reasons:

  • It captures perceptual similarity better than other methods such as the Jeffrey divergence, χ² statistics, the L1 distance and the L2 distance.
  • Due to its variable-length representation, it performs better than histogram methods, which have a fixed size structure.
  • It is more robust and efficient thanks to the availability of many linear optimization algorithms and its support for partial matching.

Signatures, which represent an image as a cluster of features, perform better than histograms because of their adaptive nature. Histograms are generally of fixed size, and it is difficult to maintain the tradeoff between efficiency and expressiveness.
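In the special 1-D case with equal total mass, the Earth Mover's Distance reduces to the area between the two cumulative distributions, which makes the "work of moving earth" intuition concrete:

```python
# Sketch of the Earth Mover's Distance in the simple 1-D, equal-mass case,
# where it equals the accumulated absolute difference of the two CDFs.
def emd_1d(h, k):
    work, cdf_diff = 0.0, 0.0
    for hi, ki in zip(h, k):
        cdf_diff += hi - ki      # running surplus of mass to move right
        work += abs(cdf_diff)    # each unit of surplus costs one bin-step
    return work

# Moving all mass one bin to the right costs exactly 1 unit of work.
print(emd_1d([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 1.0
# Moving it two bins costs 2 -- EMD grows with perceptual dissimilarity.
print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
```

The general signature-based EMD solves a transportation (linear optimization) problem instead, but this 1-D case already shows why nearby bins are treated as "closer" than distant ones, which bin-by-bin metrics ignore.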

Context Based Place Recognition System (Oct 2003)

This method is a form of global image representation, as stated earlier. The idea behind such a system is that it should be able to recognize a set of objects and justify their location: a scene containing a lot of books is probably a library, a scene with desks and chairs may be a classroom or a workplace, and so on. One of the main advantages of such a system is that it limits the number of processes needed for further classification once the context of the location is known; for example, if the system knows its location is a library, finding a specific book becomes much easier. Another advantage is that the system can adapt to environments it has never been introduced to before. This is the major drawback of low level and appearance based systems: they cannot easily generalize to new environments.

Why is this system useful?

The need for such a system arises from the disadvantages of extracting image patches independently, as explained earlier: to extract low level features (like color) the resolution of the image must be high, and hence so must the dimensionality. The basic idea is illustrated with the three images shown. In the first we see a very high resolution picture of a coffee machine (the object); for a mobile robot to identify this object it needs ample information, such as color, shape and texture. In the second image the object is only part of a larger scene. By using contextual information, namely that the coffee machine is part of a kitchen, the number of objects that need to be referenced for recognition is limited to the objects found in kitchens. In conclusion, contextual knowledge can help both scene recognition and object recognition despite poor object resolution.

Fig 13.

Computation of textural features

Wavelet image decomposition is used to represent the textural information in an image; the purpose is to represent the image at multiple resolutions. In this process the spectrum of the image to be classified is divided into two parts using a second order band-pass filter, and a low pass filter is then used for subsampling. This creates multiple spectral bands in such a way that the difference in information between two successive resolutions (bands) can be extracted, which makes it possible to analyze each section of the image independently. This is the key to edge detection (determining the boundaries of objects in an image). In this model the image is described at different orientations and scales, and the dimensionality is further reduced using Principal Component Analysis (PCA).
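One level of the simplest such decomposition, the 1-D Haar wavelet, illustrates how the detail band captures exactly the information lost between two successive resolutions (the signal values below are illustrative):

```python
import math

# Minimal sketch of one level of a Haar wavelet decomposition: the signal
# is split into a low-pass (average) band and a high-pass (detail) band,
# so the detail band holds what is lost when halving the resolution.
def haar_step(signal):
    approx = [(a + b) / math.sqrt(2) for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / math.sqrt(2) for a, b in zip(signal[0::2], signal[1::2])]
    return approx, detail

signal = [4.0, 4.0, 4.0, 4.0, 9.0, 9.0, 1.0, 1.0]
approx, detail = haar_step(signal)

# Flat regions produce zero detail; the half-length approximation keeps
# the coarse structure of the signal.
print(detail)  # [0.0, 0.0, 0.0, 0.0]
print([round(a * math.sqrt(2) / 2, 1) for a in approx])  # [4.0, 4.0, 9.0, 1.0]
```

Repeating the step on the approximation band yields the multi-resolution pyramid the paragraph describes; large detail coefficients mark edges between regions.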

Principal Component Analysis (PCA)

PCA can be thought of as a way to represent a high dimensional image or set of images (a multivariate dataset) in a lower dimensional form such that most of the informative data is preserved. It performs an orthogonal transformation of a set of correlated variables (the image or set of images) into uncorrelated variables known as principal components (PCs); the number of principal components is always no greater than the number of original variables. The first principal component captures the largest share of the variance in the dataset, the second captures less, and so on; the later components increasingly capture noise. Even before computing the PCA for a set of images, the dimensionality is reduced to ease the required calculation.
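A minimal sketch of this idea: finding the first principal component of a small 2-D dataset by power iteration on its covariance matrix (the data points below are made up):

```python
import math

# Points scattered roughly along the line y = x (illustrative data).
points = [(2.0, 1.9), (1.0, 1.1), (3.0, 3.2), (4.0, 3.9), (0.0, 0.1)]

# Center the data.
mx = sum(p[0] for p in points) / len(points)
my = sum(p[1] for p in points) / len(points)
centered = [(x - mx, y - my) for x, y in points]

# 2x2 covariance matrix entries.
n = len(points)
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration converges to the eigenvector with the largest eigenvalue,
# i.e. the direction of greatest variance: the first principal component.
vx, vy = 1.0, 0.0
for _ in range(100):
    nx, ny = cxx * vx + cxy * vy, cxy * vx + cyy * vy
    norm = math.hypot(nx, ny)
    vx, vy = nx / norm, ny / norm

# The points lie near y = x, so the PC is close to (0.707, 0.707).
print(round(vx, 2), round(vy, 2))
```

Projecting each point onto this direction gives the one-number-per-point compression that, scaled up to thousands of pixel dimensions, is what the model applies to its image database.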

Fig 14. Image adopted from

Here an image of a woman and its reconstruction using 50 principal components are shown.

Returning to the computation of textural features, PCA is applied to a database containing thousands of images. Two images and their respective textural features are shown here. The reduced representation retains 80 components of global information, along with some noise.

Fig 15.


In this section the system is tested both indoors and outdoors. The performance can be explained with the help of the plot shown below, which can be divided into three sections. The bottom section shows the estimated probability of the system being indoors or outdoors. The middle section shows the category of each location, such as office, plaza or street, where the test is conducted; the system is trained on a total of 63 locations, with training and testing done in a random order. The top section shows the posterior probability of each location (red dots); the actual location is shown as a solid line.

Fig 16. Performance of Context based place recognition system

X axis shows the time frame of observations

Y axis shows location of system

Although the system is robust and able to classify over 60 locations, it commits many errors in doing so, as can be observed from the performance graph above.

  1. During the frame interval 2100-2200, the system confuses "Draper Street" with "Draper Plaza". This error arises from dividing space into regions too small for the system to distinguish.
  2. A drop in the solid line can be seen just before frame t = 1500, from elevator 200/6 to elevator 400/1, due to a computational error in the algorithm.
  3. Since the system is based on estimation, an estimation error occurs at t = 1500 at "corridor6a".

Generalization to new environments

The appearance based place recognition system discussed earlier had a major drawback: it cannot be applied to locations that do not lie within its training paradigm. Because of the context based nature of this model, however, it can be applied to unfamiliar places, and its performance there is shown below. The performance in a new environment can be compared with that in a familiar one by comparing the graphs. It can be seen that the system is uncertain about the exact location in the new environment (low confidence), yet it is able to classify the category of the location effectively. At frame t = 1500 the system works robustly and shows the same performance as in the graph above.






Fig 17.

Biologically inspired feature based approach

Many approaches to scene classification have been described above. These approaches have found success and are already in use in various applications, but none of them is comparable to human vision: the performance and adaptability of human vision are orders of magnitude greater than those of existing systems. This fact encouraged researchers to simulate the physiological mechanisms involved in human vision, and many gist models were subsequently proposed, giving rise to biologically inspired approaches. In the following sections we discuss different models and their ability to approximate human perception.

Rapid Biologically Inspired Scene Classification (with visual attention 2007)

One such approach was proposed by Christian Siagian and Laurent Itti. They proposed a method with the same low level grasp and processing as human vision. Human beings can interpret an entire image and can also focus on a specific part of the field of view: if we view an image for just a fraction of a second, we can determine whether it was an indoor or an outdoor scene. This means the visual system can extract both the 'gist' and the 'saliency' of an input image, and this feature is included in the approach. The word 'gist' here means the low level representation of a scene.


Finding saliency means finding the locations that stand out in the entire visual field; in parallel, there should be a mechanism for overall analysis of the image. Siagian and Itti observed that although the saliency and gist features are complementary opposites in their function, both are implemented through the same visual areas, V1-V4 and the Middle Temporal visual area (V5), of the primate brain. There should therefore be a way to implement both modules simultaneously, fed by the same low level features, rather than computing each from scratch. The saliency model answers the 'where' of an image, while the gist model answers the 'what'. The divergence between the two arises later in processing, as shown in the figure below: after the input image is filtered, two paths appear, called the dorsal and ventral pathways. The dorsal pathway implements the saliency model and the ventral pathway the gist model. Both are computed separately and integrated later in the model. In the following sections the saliency and gist models are explained separately.

Fig 18.  Model for gist and saliency

Saliency Based Visual Attention for Rapid Scene Analysis (Nov 1998)

This model explains human search strategies in a fairly simple way and has therefore been used as the basis for various biologically inspired models explained later. The figure shows the working of the dorsal pathway.


Fig 19. General architecture of dorsal pathway (b)


How visual features are extracted (feature map creation)

When the image is input, it is decomposed into a set of topographic maps, which in primates are believed to lie in the posterior parietal cortex and the thalamus. The model focuses on fast processing of the interesting parts that stand out in an image; the typical input image has a resolution of 640 x 480. After features are extracted for color, intensity and orientation, a "center-surround" operation is performed. This operation is biologically inspired and occurs in the visual receptive field: each neuron covers an area of the field of vision, and the region within which a stimulus can cause that neuron to fire is called the neuron's receptive field. The cone cells at the center of the retina are believed to be more sensitive than the rod cells in the periphery. Research has shown that visual neurons produce an excitatory response when a stimulus is presented at the center of the receptive field, and an inhibitory response when it is presented in the concentric surround. This is the reason we notice features that stand out at first glance. The phenomenon is simulated in this model by comparing coarse and fine scales: the difference between two scales is taken. The purpose of simulating the center-surround operation is to extract from the feature channels the information that stands out from its neighborhood. In primates this operation involves the retina, the lateral geniculate nucleus and the primary visual cortex.


Fig 20.

Low level feature extraction

After linear filtering, the broadly tuned channels R, G, B and Y are created from the raw r, g and b channels of the input image:

R = r − (g + b)/2 for red
G = g − (r + b)/2 for green
B = b − (r + g)/2 for blue
Y = (r + g)/2 − |r − g|/2 − b for yellow

where r, g and b are the red, green and blue channels of the input image, and the intensity I is obtained from the colors as I = (r + g + b)/3. Gaussian pyramids are then generated for the intensity and color (red, green, blue and yellow) channels. The third and last set of low level feature maps, the orientation maps, is obtained from the intensity using Gabor pyramids. Gabor filters are biologically inspired band-pass filters used in pattern analysis: they extract textural features from an image or a scene, and are used for texture analysis and edge detection, which helps in image segmentation. They have tunable frequency, orientation and bandwidth, and generally respond to edges and changes in textural orientation. When designing Gabor filters for scene classification (cognitive modeling), we want a robust system in which the magnitude of the filter response does not change when the image is rotated or shifted. An example of a Gabor filter is shown below; it can be observed that the magnitude response barely changes when the image is shifted. The center-surround operation is carried out on the intensity, color and orientation feature maps, which are then fed into the saliency map.
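These per-pixel channel formulas can be written down directly; following the usual convention in this model, negative responses are clamped to zero:

```python
# Per-pixel channel computation, assuming r, g, b are the raw color values
# of one pixel; negative channel responses are set to zero.
def channels(r, g, b):
    intensity = (r + g + b) / 3
    R = max(0.0, r - (g + b) / 2)
    G = max(0.0, g - (r + b) / 2)
    B = max(0.0, b - (r + g) / 2)
    Y = max(0.0, (r + g) / 2 - abs(r - g) / 2 - b)
    return intensity, R, G, B, Y

# A pure red pixel excites only the red channel.
print(channels(255, 0, 0))  # (85.0, 255.0, 0.0, 0.0, 0.0)
```

Note that yellow is not a raw sensor channel: it is synthesized from r and g, which is why it needs its own formula rather than a simple subtraction.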

Feature map extraction using Gabor filter

As stated earlier, Gabor filters are used to represent texture information and to detect edges in an image. To extract this information effectively, the image must be represented at multiple orientations and scales, and a filter bank is used for this purpose. The Gabor filter takes its motivation from the simple cells of the human visual system. Simple cells are responsible for edge detection in early processing in primates; they achieve this through distinct inhibitory and excitatory regions in their receptive fields, arranged so that where the response is inhibitory in one region it is excitatory in another, keeping a balance between the two. Gabor filters, just like simple cells, are tuned to multiple frequencies and orientations. The two dimensional equation of the Gabor filter is given as


g(x, y) = exp(−(x′² + γ² y′²) / (2σ²)) · cos(2π x′ / λ), with x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ, where θ is the orientation, λ is the wavelength (frequency), γ is the aspect ratio and σ is the extent.

The Gabor filter emulates the behavior of a simple cell through its elongated elliptical interior structure, which is of opposite sign to its surroundings. Because of this opposition, the response to edges is high (due to the considerable change in intensity). Jones and Palmer (1987) [] showed that these filters are the designs that most effectively model simple cells in the receptive field. The filters are well established, and experimental results have shown them to be well suited to global feature representation. The Gabor filter based Completed Local Binary Pattern (CLBP) is one such design, which extracts multiple global features together with textural information: the Gabor filter extracts global features and detects edges, while CLBP determines local features []. Results show this combination to be more effective for scene classification than many local feature extraction methods (such as the bag of features model).
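A sketch of a small Gabor kernel built from the two dimensional equation above; the size and parameter values are illustrative, not the paper's:

```python
import math

# Build a (size x size) Gabor kernel for orientation theta, wavelength lam,
# aspect ratio gamma and Gaussian extent sigma (all values illustrative).
def gabor_kernel(size=9, theta=0.0, lam=4.0, gamma=0.5, sigma=2.0):
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's preferred orientation.
            xp = x * math.cos(theta) + y * math.sin(theta)
            yp = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope times a cosine carrier wave.
            envelope = math.exp(-(xp * xp + gamma * gamma * yp * yp)
                                / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xp / lam))
        kernel.append(row)
    return kernel

k = gabor_kernel(theta=math.pi / 2)  # vertically oriented filter
print(k[4][4])  # the center of the kernel has the maximum weight: 1.0
```

Convolving an intensity image with a bank of such kernels at several values of theta and several scales produces the orientation pyramid used later in the model.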

Fig 21.

Intensity feature maps:

The intensity feature maps in primates are created by neurons that are sensitive to a dark center with a bright surround, or vice versa. The maps are computed from a center scale and a surround scale, whose purpose is explained in the "center-surround" discussion; the center and surround scales are defined on a per-pixel basis.

I(c, s) = | I(c) Θ I(s) |

where c is the center (fine) scale, c ∈ {2, 3, 4}, and s is the surround (coarse) scale, given by s = c + δ with δ ∈ {3, 4}.
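A 1-D toy version of this across-scale difference, using box blurs at a fine and a coarse scale, shows the operation responding at an edge and staying silent in flat regions:

```python
# Sketch of a 1-D "center-surround" response: the absolute difference
# between a fine-scale (lightly smoothed) and a coarse-scale (heavily
# smoothed) version of a signal. A step edge gives the strongest response.
def box_blur(signal, radius):
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

signal = [0.0] * 8 + [1.0] * 8          # a step edge at the midpoint
center = box_blur(signal, 1)            # fine scale
surround = box_blur(signal, 4)          # coarse scale
response = [abs(c - s) for c, s in zip(center, surround)]

# The response peaks near the edge and vanishes in flat regions.
print(response.index(max(response)) in (6, 7, 8, 9))  # True
```

In the real model the two scales come from levels of a Gaussian pyramid rather than box blurs, and the same center-minus-surround subtraction is applied to the color and orientation channels as well.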

Color feature maps:

The color maps in the cortex are represented by a "double opponent" system: where neurons are excited by one color in the center of the receptive field, they are inhibited by the other color, and in the surround the phenomenon is reversed. These opponent pairs in the visual cortex are red/green, green/red, blue/yellow and yellow/blue.


RG (c,s) = |(R(c) – G(c)) Θ (G(s) – R(s))|

BY (c,s) = |(B(c) – Y(c)) Θ (Y(s) – B(s))|

Orientation Feature maps:

Orientation feature maps are biologically inspired from neurons in visual system, which are selective to orientation information.

Orientation maps are given by

Mθ(c) = Gabor(c, θ)

where θ ∈ {0°, 45°, 90°, 135°} at four spatial scales c ∈ {0, 1, 2, 3}.

In the proposed method, 6, 12 and 24 maps are created for intensity, color and orientation respectively.


Fig 22. Response of a Gabor filter to a shifted image

Saliency Map

Since the feature maps are extracted by different mechanisms and have large dynamic ranges, normalizing them is very challenging; in their incomparable state they cannot be fed into the saliency map. A normalization operator N(·) is therefore proposed, whose working is explained in the following diagram.

Fig 23.

Here it can be seen that normalization operator is used to promote small peaks in the orientation map. And also in the intensity map it is used to suppress large number of comparable peaks. The working of the normalization operator biologically motivates from coartical lateral inhibition.

Lateral inhibition is the ability of a neuron to suppress the activity of its neighboring neurons; the suppression acts in the lateral direction and limits the spread of activity. After normalization, the feature maps are ready to be combined. The combined map should depict the salient, or "conspicuous", locations in the image. The summation is done using:

S = (1/3) [ N(I) + N(C) + N(O) ]

where N(I), N(C) and N(O) are the normalized across-scale combinations of the intensity, color and orientation feature maps respectively.
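
A simplified sketch of the normalization operator N(.) may help: scale the map to a fixed range [0, M], estimate the mean of the local maxima other than the global one, and multiply the whole map by the squared difference. This is a deliberately crude toy version (in particular the peak detection), not the exact operator from the paper.

```python
def normalize_map(feature_map, M=1.0):
    """Simplified N(.): scale to [0, M], then multiply by
    (M - mbar)^2 where mbar is the mean of all local maxima other
    than the global one. A map with one strong peak is promoted;
    a map with many comparable peaks is suppressed."""
    top = max(v for row in feature_map for v in row)
    if top <= 0:
        return feature_map
    scaled = [[v * M / top for v in row] for row in feature_map]
    h, w = len(scaled), len(scaled[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = scaled[y][x]
            # crude peak test: strictly greater than all 4-neighbors
            nbrs = [scaled[j][i]
                    for j, i in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= j < h and 0 <= i < w]
            if nbrs and all(v > n for n in nbrs):
                peaks.append(v)
    peaks.sort(reverse=True)
    others = peaks[1:] or [0.0]            # drop the global maximum
    mbar = sum(others) / len(others)
    factor = (M - mbar) ** 2
    return [[v * factor for v in row] for row in scaled]

lone  = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]   # one strong peak
multi = [[4, 0, 4], [0, 0, 0], [4, 0, 4]]   # many comparable peaks
print(max(v for r in normalize_map(lone)  for v in r))  # 1.0 (promoted)
print(max(v for r in normalize_map(multi) for v in r))  # 0.0 (suppressed)
```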

To implement the saliency map, each location is modeled as a leaky integrate-and-fire unit: a single capacitor integrates the charge delivered by its synaptic input, together with a leakage conductance and a voltage threshold. The output of the saliency map is fed to a "winner-takes-all" mechanism, which is explained below.

Winner takes all mechanism

This mechanism can be explained with the help of the images shown below. Neurons in the saliency map excite their corresponding winner-takes-all (WTA) neurons. There is a preset threshold; as soon as the first excitatory neuron crosses it, it fires. This leads to three processes that take place simultaneously:

  1. The first neuron to cross the threshold, the winner-takes-all (WTA) neuron, becomes the Focus of Attention (FOA). This is shown as the red telephone booth at simulated time 92 ms.
  2. Since all the neurons are evolving towards the threshold, every neuron except the winner is inhibited.
  3. In order to jump to the next salient location, local inhibition is activated transiently. The second, third and fourth FOA are shown at times 145, 206 and 260 ms.
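
The three steps above can be sketched as a leaky integrate-and-fire loop. This is a toy model: the threshold, leak rate and saliency inputs below are illustrative values, not the paper's parameters.

```python
def winner_take_all(saliency, threshold=1.0, leak=0.1, steps=50):
    """Leaky integrate-and-fire sketch of the WTA layer: each
    location integrates its saliency input while leaking charge;
    the first unit to cross threshold fires, becomes the focus of
    attention, and is then inhibited (inhibition of return) so
    attention can jump to the next most salient location."""
    n = len(saliency)
    v = [0.0] * n                 # membrane potentials
    inhibited = set()
    fixations = []
    for _ in range(steps):
        for i in range(n):
            if i in inhibited:
                continue
            v[i] += saliency[i] - leak * v[i]   # integrate + leak
            if v[i] >= threshold:
                fixations.append(i)             # this unit fires first
                inhibited.add(i)                # inhibition of return
                v = [0.0] * n                   # global reset after a fire
                break
    return fixations

# most salient location fires first, then the next, and so on
print(winner_take_all([0.5, 0.2, 0.9]))  # [2, 0, 1]
```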

Fig 24. Normalization and summation of input image

Fig 25. Salient features

This process also prevents the FOA from immediately returning to a previously attended position. The same phenomenon occurs in human psychophysics and is called "inhibition of return".

Drawback of saliency approach

Since the feature maps are created by different mechanisms, it is difficult to cluster them into a combined output that is meaningful for saliency analysis. For this reason a normalization is carried out which ignores homogeneous areas and promotes 'activation spots': each spot is compared with the overall average to find the most active locations, and the difference indicates the uniqueness of a map. This creates a contrast, which is exactly what is required from this model. Also, when combining many feature maps (42 in this case), there is a real possibility of noise being masked onto the combined map. The purpose of using three separate channels for intensity, color and orientation, each normalized independently, is that dissimilar attributes contribute independently while similar features compete for saliency within the saliency model. A major disadvantage is that the working of magnocellular cells is not considered; magnocellular cells are concerned with the "where" pathway of our visual system. Although the approach proposed above is simple, its efficiency is highly dependent on the feature maps. The model relies on simple low-level features, so there is uncertainty in detecting salient locations for features whose detectors are not yet completely defined, such as T-junctions or line terminators.

Gist feature model

After visual feature extraction is done as explained in the previous section, feature maps are generated and processed further to determine the conspicuous region in each map according to the saliency architecture shown in the figure. Saliency maps are established using the winner-takes-all mechanism. The same visual extraction mechanism is used for color, intensity and orientation. In the gist model, no further pattern analysis is performed on the orientation channel, as its maps are already in the required form.

Fig 26. Visual feature used in Gist model

After the center-surround operation on the color and intensity channels, we extract the gist features. The process can be illustrated with a vertically aligned orientation sub-channel: when the image shown is passed through a vertically aligned Gabor filter, a feature map representing its textural content is obtained. This map is divided into 4×4 sub-regions, producing the 16 values that contribute to the gist feature vector. The computation of these 16 raw gist features can be written as:

G(k,l) = (16 / (W·H)) Σu Σv M(u,v),  with u ∈ [kW/4, (k+1)W/4) and v ∈ [lH/4, (l+1)H/4)

where k and l are the sub-region indices in the horizontal and vertical directions (k, l ∈ {0, 1, 2, 3}), W and H are the width and height of the feature map, and M(u,v) is the map value at pixel (u,v).
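
The 4×4 sub-region averaging can be sketched directly; the 8×8 map below is a made-up example, with k and l as the horizontal and vertical sub-region indices.

```python
def gist_vector(feature_map):
    """Average a feature map over a 4x4 grid of sub-regions,
    giving the 16 raw gist values contributed by one map."""
    H, W = len(feature_map), len(feature_map[0])
    gist = []
    for l in range(4):            # vertical sub-region index
        for k in range(4):        # horizontal sub-region index
            total, count = 0.0, 0
            for v in range(l * H // 4, (l + 1) * H // 4):
                for u in range(k * W // 4, (k + 1) * W // 4):
                    total += feature_map[v][u]
                    count += 1
            gist.append(total / count)
    return gist

fmap = [[float(x) for x in range(8)] for _ in range(8)]  # toy 8x8 map
g = gist_vector(fmap)
print(len(g))   # 16
```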

Fig 27. Gist feature extraction for orientation sub-channel

This process is repeated for each map, i.e. for color, intensity and orientation, giving a total of 544 raw gist features.

Dimensionality reduction

There are 6, 12 and 16 feature maps for intensity, color and orientation respectively. Averaging each map over its 16 sub-regions leads to a total of 544 dimensions. Processing such a high-dimensional vector requires more storage and increases processing time, slowing the system down. It is therefore advisable to use dimensionality-reduction techniques that shrink the vector without losing significant information. In this model, Principal Component Analysis (PCA) and Independent Component Analysis (ICA) are used to reduce the dimensionality from 544 to 80, retaining 97% of the variance. For classification, a three-layer neural network is used because of its short training time and its scalability to many samples; a large number of samples is required to build a robust system.
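
As an aside, the core of PCA, finding the direction of maximum variance, can be sketched with power iteration on the covariance matrix. This is a toy 2-D example, not the 544-dimensional case used in the model.

```python
def first_principal_component(data, iters=100):
    """Power-iteration sketch of PCA: center the data, then
    repeatedly apply the covariance matrix C to a vector until it
    converges to C's leading eigenvector, i.e. the direction of
    maximum variance."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - means[j] for j in range(d)] for row in data]
    # covariance matrix C = X^T X / n
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / n
          for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# toy points stretched along the x-axis: the first PC should be ~(1, 0)
pts = [[-3.0, 0.1], [-1.0, -0.1], [1.0, 0.1], [3.0, -0.1]]
v = first_principal_component(pts)
print(round(abs(v[0]), 3), round(abs(v[1]), 3))  # 1.0 0.02
```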

Testing and results:

The test was conducted at the University of Southern California. We analyze the results of one experiment, conducted at the Ahmanson Center for Biological Research (ACB). The test samples were taken with an 8 mm camcorder. The scenes in the dataset include flat walls and different parts of the building, and contain very little textural information. The video clips are divided into equal segments; images extracted from each segment are shown below, together with a map of the segments, from which it can be seen that the segments were recorded in a random order. The table of test results is attached below. The term "false+" gives the number of frames guessed as segment x when the answer is another segment, divided by the total frames in that segment, whereas "false-" gives the number of frames of segment x that were incorrectly guessed as another segment. The test results are highly efficient, with an accuracy rate of 87.96 percent as shown.

Fig 28. Images in each segments (Ahmanson center for biological research)

Fig 29. Testing results

Fig 30.  Map of segments taken

Errors and Limitations:

  • The confusion matrix is given in the table shown below. Several errors arose during testing:
  1. As can be seen from the segment map shown above, segments 1 and 2 are continuations of each other, and the system is unable to detect that transition. This is shown by the spikes in the matrix.
  2. There are also errors whose causes cannot be easily determined. For example, segment 2 shows 163 "false+" frames when the answer is segment 7. This error may be caused by the similarity of the segments, as can be observed from the two images shown above.


  • Due to changes in sun position and atmospheric conditions throughout the day, the illumination differs between image samples, and even between different sections of a single image. Because of this, a large number of samples needs to be taken under different lighting conditions (from brightest to dark).
  • The proposed "gist" model does not handle partial occlusion: the system is unable to analyze the background if it is partially hidden by an object in the foreground.
  • The samples for the experiment were taken in a static environment, but in practical use the number of dynamic objects (people, cars, etc.) increases. The behaviour of the system in such cases is undefined.
  • The proposed model still needs to be validated for identifying the "gist" of large open spaces.

Fig 31. Confusion matrix

Drawbacks of the model proposed by Siagian and Itti (by Song and Tao)

In this model, the orientation maps gathered from the input images do not fully correspond to the complex cells in the human visual cortex. Furthermore, Principal Component Analysis (PCA) and Independent Component Analysis (ICA) are based on the standard Euclidean properties of vector spaces and are not well defined for non-Euclidean geometries such as spherical or hyperbolic ones. Since most of the datasets used are non-Euclidean, classical PCA cannot be applied effectively. Because of these shortcomings, a new gist model was proposed and tested on the same dataset (Ahmanson Center for Biological Research (ACB)), with improved efficiency.

Independent component analysis (ICA)

ICA finds a simpler representation of random vectors and is very useful in feature extraction alongside PCA. It applies a linear transformation to a multivariate variable such that the resulting components are as statistically independent as possible. The method can be used to solve the well-known cocktail-party problem. Suppose two recording devices are placed at different locations in a room to record two speakers. The recordings can then be represented by the linear equations

x1(t) = a11 s1(t) + a12 s2(t)

x2(t) = a21 s1(t) + a22 s2(t)

where x1 and x2 are the recorded amplitudes, s1 and s2 are the signals of speakers 1 and 2 respectively, and a11, a12, a21, a22 are the mixing parameters, assuming ideal recording conditions (zero delay and other factors). The approximate original signals and the mixed signals are shown here. To extract the original signals from the mixtures, we must estimate the parameters of the two equations above. If we assume that the signals emitted by the two speakers (s1 and s2) at each time index t are statistically independent, both signals can be extracted from the mixtures. The third figure shows the signals extracted by ICA; it can be observed that the first and third figures are approximately the same. These methods were later applied beyond the cocktail-party problem and are now used frequently for feature extraction, linearizing multivariate data (audio, images and other signals). ICA also finds application in biomedical devices such as the electroencephalogram (EEG), where the task is likewise to find the components responsible for brain activity.
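
A toy version of this separation can be sketched end to end: mix a sine wave with uniform noise, whiten the mixtures, and search for the rotation whose outputs are maximally non-Gaussian (measured by excess kurtosis). This brute-force search is a stand-in for proper ICA algorithms such as FastICA; all signal and mixing values below are made up.

```python
import math, random

def unmix_two(x1, x2):
    """Minimal two-signal ICA sketch: center, whiten via the 2x2
    covariance eigendecomposition, then brute-force search
    rotation angles for the one whose outputs have the most
    extreme excess kurtosis (ICA's non-Gaussianity criterion)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    # covariance of the centered mixtures
    c11 = sum(a * a for a in x1) / n
    c22 = sum(b * b for b in x2) / n
    c12 = sum(a * b for a, b in zip(x1, x2)) / n
    # whitening: rotate into the eigenbasis and scale to unit variance
    phi = 0.5 * math.atan2(2 * c12, c11 - c22)
    cp, sp = math.cos(phi), math.sin(phi)
    l1 = c11 * cp * cp + 2 * c12 * sp * cp + c22 * sp * sp
    l2 = c11 * sp * sp - 2 * c12 * sp * cp + c22 * cp * cp
    z1 = [(cp * a + sp * b) / math.sqrt(l1) for a, b in zip(x1, x2)]
    z2 = [(-sp * a + cp * b) / math.sqrt(l2) for a, b in zip(x1, x2)]

    def kurt(s):                       # excess kurtosis (unit variance assumed)
        return sum(v ** 4 for v in s) / n - 3.0

    best_score, best_t = -1.0, 0.0
    for k in range(180):               # search rotations in 1-degree steps
        t = math.pi * k / 180.0
        ct, st = math.cos(t), math.sin(t)
        y1 = [ct * a + st * b for a, b in zip(z1, z2)]
        y2 = [-st * a + ct * b for a, b in zip(z1, z2)]
        score = abs(kurt(y1)) + abs(kurt(y2))
        if score > best_score:
            best_score, best_t = score, t
    ct, st = math.cos(best_t), math.sin(best_t)
    return ([ct * a + st * b for a, b in zip(z1, z2)],
            [-st * a + ct * b for a, b in zip(z1, z2)])

random.seed(0)
n = 2000
s1 = [math.sin(0.05 * i) for i in range(n)]        # speaker 1: sine wave
s2 = [random.uniform(-1, 1) for _ in range(n)]     # speaker 2: uniform noise
x1 = [0.6 * a + 0.4 * b for a, b in zip(s1, s2)]   # recording 1
x2 = [0.3 * a + 0.7 * b for a, b in zip(s1, s2)]   # recording 2
y1, y2 = unmix_two(x1, x2)

def corr(u, w):
    mu, mw = sum(u) / len(u), sum(w) / len(w)
    num = sum((a - mu) * (b - mw) for a, b in zip(u, w))
    den = (sum((a - mu) ** 2 for a in u) *
           sum((b - mw) ** 2 for b in w)) ** 0.5
    return num / den

# one recovered component should match each source, up to sign and order
print(round(max(abs(corr(y1, s1)), abs(corr(y2, s1))), 2))
```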

Fig 32. Original signal

Fig 33. Mixture of two recordings

Fig 34. Original signal (from ICA method)

Improved Gist model (Dongjin Song and Dacheng tao 2008)

The improved model proposed following replacement:

  • Due to the problems associated with the Gabor filter mentioned in the previous section, orientation information is extracted using C1 units. C1 units emulate the complex cells of our visual cortex, while S1 units correspond to the simple cells. The C1 units apply a maximization operation over the S1 units, keeping the maximum response over an area for a particular pair of orientation and scale. The color and intensity channels are left untouched by the C1 units; their feature maps are formed by the same "color double-opponent" system and the "dark center on bright surround (and vice versa)" mechanism respectively. The mechanism is visualized in the figure shown below.

Fig 35. Extraction mechanism for low level features

The three feature maps are combined later using the same steps for gist model.

  • Since Principal Component Analysis (PCA) and Independent Component Analysis (ICA) do not map effectively the non-Euclidean geometry of the sample labels and biological features (magnocellular motion), Locality Preserving Projection (LPP) is used instead. LPP not only preserves the geometry of the biological features but also retains the sample-label information, although supervision is needed for this to work. There is also an advantage in dimensionality reduction: the PCA used in the previous gist model reduced the 544 features to 80 dimensions, while LPP reduces them to 4, so the processing speed of the classifier increases significantly.
  • Instead of the three-layer neural network, the Nearest Neighbor Rule (NNR) is used for scene recognition. A major advantage is that no training phase is required before testing, and no non-linear transformation module is needed.
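
The Nearest Neighbor Rule is simple enough to sketch in a few lines; the gist vectors and segment labels below are made-up examples.

```python
def nearest_neighbor_classify(query, samples, labels):
    """Nearest Neighbor Rule sketch: assign the query the label of
    the closest stored gist vector (squared Euclidean distance),
    with no training phase at all."""
    best_i = min(range(len(samples)),
                 key=lambda i: sum((q - s) ** 2
                                   for q, s in zip(query, samples[i])))
    return labels[best_i]

gists  = [[0.1, 0.9, 0.2, 0.4], [0.8, 0.1, 0.7, 0.3]]   # stored gist vectors
labels = ["segment 1", "segment 2"]
print(nearest_neighbor_classify([0.75, 0.2, 0.6, 0.3], gists, labels))  # segment 2
```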

Testing and comparison

Since the model is an improvement over the conventional "gist" model, the same segments and samples of ACB are taken as input. A step-by-step evaluation is performed by replacing each of the three modules (Gabor filter, PCA/ICA, neural network) with its respective improvement. The reference is the conventional model, with a 12.04% error rate.

Fig 37. Step by step justification

It can be seen that the C1 units, when used with the nearest-neighbor classifier, increase the accuracy to 73.84%. LPP alone, used with the same recognition module, gives an improvement of approximately 6% over the compared model. The overall model has a considerably higher accuracy rate of 95.37%, and the processing speed is almost 50 times faster, which proves its effectiveness.

Drawbacks of the "gist" model

Although the improved gist model by Song D and Tao D considerably increases the accuracy and processing speed of the gist model, some problems remain with the existing models:

  • There are 544 raw features for color, intensity and orientation combined, and their dimensionality is reduced to 80 and 4 using PCA/ICA and LPP respectively. These methods, however, waste considerable processing time and are expensive to implement.
  • Since the required dimensionality depends on the classification task, a system is needed in which the dimensionality is adaptive, with a broad dynamic range, changing according to the requirements of the classification. Such a system has yet to be built.
  • As seen in testing, errors arise from insufficient samples; there is always a shortage of samples covering the various lighting conditions. A system is needed that can work efficiently with a limited number of samples.
  • There is a strict separation between the training (learning) and testing phases; to emulate the human visual system, which is always learning, a system should be built that can learn incrementally.

These above limitations were kept in mind while developing the proposed model for outdoor scene classification.


Understanding and extracting useful information from an image or scene is a complex task, and work in this field has been going on for decades. Many classification approaches have been proposed and are in practice in applications such as robot navigation, image retrieval and object recognition. However, we cannot simply declare one model or approach the best; we can only compare models on their robustness, adaptability to different situations and a variety of other parameters. Despite this, particular trends have been followed in the scene-classification world. Early on, scenes were classified using low-level information such as color and texture []. It was believed (or the field was limited by technology) that color alone could extract sufficient information, but such approaches cannot be generalized to new environments, so context-based models were proposed, as discussed in this text. At the same time, many approaches combined these two kinds of features, such as image retrieval using color and shape. These systems were fast but performed poorly in dynamic environments. To improve performance further, local and global features need to be combined (scene classification using local and global features with collaborative representation fusion). Their usage is limited, however, because they are unable to reveal content information about the image, such as whether it contains water. Categorizing images in this manner helps to classify new scenes and increases robustness; semantic modeling of natural scenes for content-based image retrieval is one such model. Many researchers have argued that this approach deserves more exploration, as this method of classification is closer to human working; its training and testing complexity is also much lower than that of the previously discussed systems. It uses a Support Vector Machine (SVM) for classification.
Results have also favored semantic modeling (by almost 10 percent) when compared with low-level feature modeling. These models use bag-of-features or bag-of-words methods [22,23], in which the image is divided and its segments are represented as a collection of features. This approach, however, fails to capture the specific orientation and spatial area covered by the features. As an enhancement, Spatial Pyramid Matching [24] was designed; it overcomes the drawback by segmenting the image into very fine regions and drawing histograms of those regions (represented by local features). The method proved very effective, and many versions and improvements were later proposed. Lowe's Scale-Invariant Feature Transform (SIFT) [16,17] is one of them, and is referenced in the model described in "Scene Classification using a hybrid generative/discriminative approach".

Although all the methods and approaches mentioned above have proved very significant in scene classification, they are still no match for our own ability to classify an image. With groundbreaking technology, the working of the human visual system can be analyzed and implemented using computer vision. But the trend here is narrow: most biologically inspired models are improvements on one another, such as the improved "gist" model explained above. These methods take similar test and training data (the USC dataset) and propose new approaches on it. Although the improvements show significant gains in classification accuracy, their use beyond the scope of their own domain is still to be tested. I therefore conclude that biologically motivated scene classification still has a long way to go as new studies and technologies emerge; we may yet see some of these models implemented.


  1. Jingjing Zhao, Chun Du, Hao Sun, Xingtong Liu and Jixiang Sun, "Biologically Motivated Model for Outdoor Scene Classification".
  2. Ulrich I, Nourbakhsh I. Appearance-based place recognition for topological localization. In: Proceedings of IEEE international conference on robotics and automation, Apr 2000, pp. 1023–1029.
  3. Torralba A, Murphy KP, Freeman WT, Rubin MA. Context based vision system for place and object recognition. In: Proceedings of IEEE international conference on computer vision (ICCV), Oct 2003, pp. 1023–1029
  4. Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell. 1998;20(11):1254–1259.
  5. Siagian C, Itti L. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Trans Pattern Anal Mach Intell. 2007; 29(2) :300–12
  6. Song D, Tao D. C1 units for scene classification. In: Proceedings of IEEE international conference on pattern and recognition, 2008, pp. 1–4.
  7. Rubner, Y., Tomasi, C., and Guibas, L.J., “The Earth Mover’s Distance as a Metric for Image Retrieval”, STAN-CS-TN-98-86, Stanford University, 1998.
  8. Jain AK, Vailaya A. Image retrieval using color and shape. Pattern Recognit. 1996;29(8):1233–44.
  9. Vogel J, Schiele B. Semantic modeling of natural scenes for content-based image retrieval. Int J Comput Vis. 2007;72(2):133–57
  10. Bosch A, Zisserman A, Munoz X. Scene classification using a hybrid generative/discriminative approach. IEEE Trans Pattern Anal Mach Intell. 2008;30(4):712–27
  11. “The Application of Wavelet Transform in Digital Image Processing,” 2008 International Conference on MultiMedia and Information Technology, Three Gorges, 2008, pp. 326-329.
  12. Baoyu Dong and Guang Ren, “A New Scene Classification Method Based on Local Gabor Features,” Mathematical Problems in Engineering, vol. 2015, Article ID 109718, 14 pages, 2015. doi:10.1155/2015/109718
  13. A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Netw. 2000;13(4-5):411–430.
  14. Jinyi Zou, Wei Li, Chen Chen, Qian Du, Scene classification using local and global features with collaborative representation fusion, Information Sciences, Volume 348, 20 June 2016, Pages 209-226,
  15. N. Serrano, A. Savakis, J. Luo, Improved scene classification using efficient low-level features and semantic cues, Pattern Recognit. 37 (2004) 1773–1784
  16. D. Lowe, Distinctiveimage features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2004) 91–110.
  17. D. Lowe, Object recognition from local scale-invariant features, Int. J. Comput. Vis. 2 (1999) 1150–1157
  19. C. Chen, L. Zhou, J. Guo, W. Li, H. Su and F. Guo, “Gabor-Filtering-Based Completed Local Binary Patterns for Land-Use Scene Classification,” 2015 IEEE International Conference on Multimedia Big Data, Beijing, 2015, pp. 324-329
  20. Jones, J. P. and Palmer, L. (1987). An evaluation of the two-dimensional gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology,58:1233–1258.
  21. Javier R. Movellan ,“Tutorial on Gabor Filters”, Summer 2008.
  22. L. Zhou, Z. Zhou, D. Hu, Scene classification using a multi-resolution bag-of-features model, Pattern Recognit. 46 (1) (2013) 424–433.
  23.  L. Zhou, Z. Zhou, D. Hu, Scene classification using a multi-resolution low-level feature combination, Neurocomput. 122 (25) (2013) 284–297.
  24.  S. Chen, Y. Tian, Pyramid of spatial relations for scene-level land use classification, IEEE Trans. Geosci. Remote Sens. 53 (4) (2015) 1947–1957.
