Plant phenotyping plays an important role in improving plant varieties, for example through higher biomass, better nutrition and resistance to biotic and abiotic stressors. By analysing the relationship between genetic data and the resulting appearance, it is possible to find useful traits that can help tackle grain shortages or climate change. This kind of genetic research requires a large number of phenotype measurements. Common measurement parameters include leaf characteristics, yield-related traits, and biotic and abiotic stress responses. These measurements were previously made manually, which is not only time-consuming but also intractable at scale. The growing demand for high-throughput feature annotation has motivated scientists to develop automated processes for plant phenotyping.
Over the last decades, many image-based phenotyping methods and corresponding image analysis algorithms have been developed for extracting relevant plant traits. In A framework for the extraction of quantitative traits from 2D images of mature Arabidopsis thaliana, a 2D geometrical approach is introduced. The authors developed a strategy for constructing a visual model of the root derived from the position of intersections on the stem and the positions of the tips of the siliques. The framework first converts the image to greyscale. It then analyses the centreline of the object and traces it iteratively; the geometrical and topological properties of the plant are determined while the centreline is extracted. This framework provides reliable performance on the task. However, its drawback is obvious: the features are determined manually. What if we could build a model that starts from the raw image and arrives directly at a phenotyping determination or trait? With the development of machine learning in recent years, this has become feasible.
Machine learning is a powerful tool for data analysis. Its ability to learn from samples and make predictions with high accuracy has inspired researchers to apply it in many areas. In particular, deep learning, part of a broader family of machine learning methods, has shown great capability in image processing tasks. A recent study applying deep convolutional neural networks to root and shoot feature identification and localisation reports state-of-the-art performance in image-based plant phenotyping (over 97% accuracy). The use of deep learning approaches is a revolution in plant image analysis.
This work focuses on the development of deep learning models for extracting traits of interest (e.g. the count and length of seeds) from plant images. The procedure involves two parts: classification and localisation. Classifying the objects allows us to obtain the counts, while localisation provides the information needed to calculate the length of the seeds.
Because image datasets are usually large and share common features, it is more efficient to use transfer learning to adapt a pre-trained model. Our aim is to identify a state-of-the-art image processing model, study its structure and use it to perform our plant phenotyping task.
Furthermore, we wish to evaluate the effect of the size of the dataset (i.e. the number of annotations in each image) and the effect of different background colours.
- Survey deep learning models for image processing, compare their performance and choose one as our method.
- Understand the structure and implement the model.
- Optimise the model by several approaches, e.g. hyper-parameter tuning, feature extraction, image augmentation and different pre-trained weights.
- Compare the results and evaluate the influence of different factors (datasets, noise, background colours).
Since the plant images are acquired in natural environments in real time, many factors affect the appearance of the plants, including backgrounds, biotic factors (insects, viruses), brightness and overlaps. Therefore, a robust model is required for this task. Moreover, the number of pictures taken by autonomous machines is growing rapidly, resulting in a large number of images to be processed. An efficient high-throughput approach is highly desirable.
Moreover, the way the length of the siliques is measured needs consideration. The information available for calculating length is limited, so we need a way to convert the predicted results into silique lengths. The proposed method is to use the predicted coordinates and take the diagonal as the length. This works in some cases, but it can only be an estimate because siliques are curved: the diagonal is a straight line, which is somewhat shorter than the true length. A more appropriate approach is therefore needed for measuring silique length.
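The diagonal estimate described above can be sketched as follows. This is a minimal illustration, assuming each prediction is an axis-aligned bounding box given as (x_min, y_min, x_max, y_max) in pixels; the function names and the example centreline points are hypothetical and are not taken from the project code. The second function, which sums the segments of a sampled centreline, is included only to show why the straight diagonal underestimates a curved silique.

```python
import math

def bbox_diagonal(x_min, y_min, x_max, y_max):
    """Length of the diagonal of an axis-aligned bounding box, in pixels."""
    return math.hypot(x_max - x_min, y_max - y_min)

def polyline_length(points):
    """Sum of the straight segments joining consecutive (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

# Hypothetical centreline of a bent silique, sampled at three points.
curved_silique = [(0, 0), (3, 4), (6, 0)]
xs = [x for x, _ in curved_silique]
ys = [y for _, y in curved_silique]

straight = bbox_diagonal(min(xs), min(ys), max(xs), max(ys))
curved = polyline_length(curved_silique)
print(f"diagonal estimate: {straight:.2f} px, centreline length: {curved:.2f} px")
```

Both values are in pixels; converting them to physical units would additionally require a known image scale.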
Deep learning techniques have been shown to be successful in many types of computer vision tasks, including image classification, multi-instance detection and segmentation. In recent years, researchers have been applying these techniques to plant science. Their work has shown the advantage of deep learning approaches over traditional, manually designed computer vision pipelines in complex image-based plant phenotyping tasks.
For example, in a stress identification task for soybean leaves, a deep convolutional neural network was applied as the classification framework and achieved 90.03% accuracy on the test dataset, a significant improvement over previous work done by human experts. Moreover, it takes only a few seconds to generate a prediction for an image, which is much faster than manual work.
Another application also addresses a classification and localisation problem, similar to our project. The authors designed a deep convolutional neural network containing 26 layers. After being trained on a dataset of 2500 annotated images of whole plant root systems, it improved the accuracy of plant phenotyping to 97%. This approach provides results comparable to, or even surpassing, those of humans.
Together, these advances have encouraged us to apply deep learning models to plant science to promote the development of the discipline.
With the upsurge of machine learning and deep learning in the past few years, various efforts have been made to apply them in plant science. Their applications in plant phenotyping have not only improved detection performance but also saved human resources. This chapter provides an overview of previous methods for data-driven plant phenotyping tasks.
Image pre-processing plays an important role in image analysis. Pre-processing steps can usually reduce noise, such as missing values or erroneous data (outliers), so that the model can discover more features of interest. This procedure is essential because, without pre-processing, noise may reduce the quality of the training data and interfere with model generation, which eventually degrades model performance.
There are many image pre-processing algorithms. General methods include greyscale conversion, filtering, cropping and edge detection, as described in . More advanced methods such as dimensionality reduction, clustering and segmentation involve more steps. These steps not only manipulate the data itself but also change its structure, making it easier to learn from. An example application can be seen in , where greyscale conversion, binary segmentation, smoothing and filtering were used as pre-processing methods for identifying plant species under varying illumination and viewpoint conditions. Another application in leaf disease analysis found that histogram equalisation can be very useful for enhancing image contrast and providing a clearer view for the human eye. However, greyscale conversion is reported to degrade performance in a leaf pathogen identification task. Therefore, we cannot conclude that one pre-processing method is better than another; the appropriate approach should be chosen according to the project aim and data quality.
In a Convolutional Neural Network, however, these pre-processing algorithms are largely replaced by the convolution and pooling layers. These layers have capabilities similar to cropping, filtering and flattening images, which is key to a fully automated process. The underlying principle of the layers will be discussed later. With deep learning algorithms, methods such as greyscale conversion or dimensionality reduction can still be applied before feeding the data into the network to improve results, but the neural network is able to work well with raw images as input.
In image processing, images are usually segmented into a number of subsets for learning and classification. Segmentation isolates and identifies regions of interest in the training samples, allowing the model to focus on the areas that contain useful features. Normally, the regions of interest are determined by the relationship between adjacent pixels in terms of pattern, colour and statistics.
A common approach is threshold segmentation. An Android-based plant leaf identification system adopted this algorithm to identify plant leaves by grouping pixels according to their intensity level, separating the plants from the background. Another segmentation example is proposed in . The images are segmented by searching for the threshold that minimises the weighted within-class variance. This algorithm has been applied to background subtraction in an automatic image-based plant location and recognition system. However, in some cases it can underestimate the signal, which causes under-segmentation. It is also slower than other threshold segmentation methods.
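The variance-minimising threshold search described above can be sketched in a few lines. This is a minimal sketch, assuming 8-bit greyscale intensities supplied as a flat list of integers; the function name is illustrative, and the exhaustive search is written for clarity rather than speed. The criterion is the one commonly known as Otsu's method.

```python
def otsu_threshold(pixels):
    """Return the threshold t that minimises the weighted within-class variance.

    Pixels with intensity < t form the background class, the rest the foreground.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)

    best_t, best_within = 0, float("inf")
    for t in range(1, 256):
        n0 = sum(hist[:t])
        n1 = total - n0
        if n0 == 0 or n1 == 0:
            continue  # one class is empty, so this split is not meaningful
        mu0 = sum(i * hist[i] for i in range(t)) / n0
        mu1 = sum(i * hist[i] for i in range(t, 256)) / n1
        ss0 = sum(hist[i] * (i - mu0) ** 2 for i in range(t))
        ss1 = sum(hist[i] * (i - mu1) ** 2 for i in range(t, 256))
        # w0 * var0 + w1 * var1 simplifies to (ss0 + ss1) / total
        within = (ss0 + ss1) / total
        if within < best_within:
            best_t, best_within = t, within
    return best_t

# Toy image: dark background (10) and bright plant pixels (200).
toy_pixels = [10] * 50 + [200] * 50
t = otsu_threshold(toy_pixels)
plant_mask = [p >= t for p in toy_pixels]
```

The resulting mask marks the pixels assigned to the brighter class; in a real pipeline the same threshold would be applied to the two-dimensional image.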
The watershed algorithm is widely used in plant image segmentation. The algorithm is based on an immersion analogy, in which the flooding of water across the picture is efficiently simulated using a queue of pixels. The method computes the gradient magnitude of the image, whose maxima appear at the edges of objects, so that the segmentation boundaries fall along those edges. Its applications include growth rate identification, detection of overlapping leaves and leaf segmentation.
Deep learning models are well known for their ability to learn from data and predict results; in fact, they can also learn to predict segmentations. In our approach, we adopted the method from , which uses a Region Proposal Network (RPN) to predict regions of interest. The details of the network will be discussed later. Because this network is fully convolutional, the segmentation procedure can be integrated into the fully automated model.
Features are the distinctive signatures of the given data that describe a pattern. Feature extraction algorithms extract the relevant features hidden in a pattern so that detection models can perform detection more easily. Feature extraction is a special form of dimensionality reduction: its goal is to encode the most relevant features as feature vectors in a lower-dimensional space. Common image features include edges, textures, pixel intensities and combinations of different colour spaces.
Widely used feature extraction methods include template matching, deformable templates, unitary image transforms, graph description, projection histograms, contour profiles, zoning, geometric moment invariants, Zernike moments, spline curve approximation, Fourier descriptors, gradient features and Gabor features. They are all built to extract a set of features (feature vectors) that maximises the recognition rate with the smallest number of elements.
A paper proposed an imaging and analysis platform for automatic phenotyping and trait ranking of plant root systems. In the experiment, 16 automatically acquired phenotypic traits, including statistical, geometric and shape features obtained from 2297 images of 118 individuals, are vectorised to perform the ranking task.
There are also extraction methods that acquire features invariant to scale and rotation. Such algorithms help a phenotyping model identify features regardless of scale, angle and illumination. For example, Wei et al. introduced a dynamic model for detecting flowering paddy rice. They applied the Scale-Invariant Feature Transform (SIFT) descriptor, combined with dense sampling, to extract the flowering part of panicles. Another plant phenotyping application adopted SIFT in a model working with stereo and ToF images; in that paper, the algorithm is used to create dense depth maps for identifying pepper leaves in a glasshouse. Speeded-Up Robust Features (SURF) is partly inspired by the SIFT descriptor but is reported to be faster and more robust than its predecessor. A multi-view stereo image-based 3D plant model used SURF to obtain local invariant features.
In a Convolutional Neural Network, feature extraction is performed by the convolutional layers. This does not mean that other feature extraction algorithms cannot work with a CNN; rather, a CNN has characteristics that make it capable of extracting features on its own. A common approach is to combine a CNN with other feature extraction frameworks.
The development of machine learning models has inspired their application to high-throughput plant phenotyping tasks. As the amount of plant data keeps growing, machine learning provides methods that learn from data without relying on manually specified rules. This is important when the dataset is so large that hidden patterns or traits cannot be found by hand. The advantage of machine learning algorithms over traditional methods is their capability to capture potential patterns in large datasets using combinations of multiple elements instead of analysing each element individually.
There are many machine learning algorithms, covering supervised and unsupervised learning, clustering, dimensionality reduction and many kinds of neural networks. Among them, three classic methods, k-Nearest Neighbours (kNN), Support Vector Machine (SVM) and Naïve Bayes (NB), have been applied to the segmentation of Antirrhinum majus leaves. Each algorithm was implemented with different kernel functions and trained on both raw data and data processed with two types of normalisation. The results show that the kNN classifier performs better on RGB data without normalisation, while the SVM performs better on NIR images with normalisation method 1 (mean and standard deviation). There is therefore no single ‘best’ algorithm; performance depends on the actual setting, e.g. the composition of the dataset or the project aim.
Deep learning is a relatively new branch of machine learning. ‘Deep’ means that the models have more layers and hence a greater capacity to learn. Deep learning models have been tested on a wide range of tasks such as natural language processing (NLP), pattern recognition, computer vision and speech recognition. Their performance has been shown to be comparable to that of human experts and even to exceed it under certain circumstances. The Convolutional Neural Network (CNN) is a popular deep learning model. A CNN consists of multiple layers, including convolution layers, pooling layers and fully connected layers. The most important of these, the convolution layer, uses filters learnt from the samples, freeing the model from reliance on prior experience and manual feature design. A study in plant phenotyping applied CNNs to plant shoot and root classification and localisation and significantly improved on the state-of-the-art performance. Another application used a CNN to detect plant disease; it performs detection in less than 1 second, which makes it efficient enough for use in a high-throughput pipeline. Overall, deep learning is a cutting-edge technology within machine learning and has a promising future in analysing phenomic data.
In , they proposed two networks, a root CNN and a shoot CNN, with 11 and 15 layers respectively. Much deeper CNNs have since been developed for detecting objects and providing fast and accurate predictions. In this research, we adopted a recent state-of-the-art model, Mask-RCNN, to perform our detection task.
In this chapter, we introduce in detail the machine learning background related to our project aim.
Today, computers process tasks on our behalf. They do exactly what we instruct, quickly and efficiently. However, this is not enough: we would like computers to ‘think’ and to ‘learn’. Can we build a computer system that improves itself by learning from previous experience, as the brain does? This is the problem that the field of machine learning is trying to solve. In essence, machine learning uses algorithms to parse data, learn from it, and then make a determination or prediction about something in the world.
What makes machine learning powerful is not just automation. Treating it only as a counting or simple statistical tool misses its most valuable insights. With sufficient computational power, the algorithms can build models or find patterns in large amounts of data that would otherwise be overlooked by humans.
There are many types of machine learning algorithms, which are typically grouped into three main categories: supervised learning, unsupervised learning and semi-supervised learning. Supervised learning is the most commonly used of the three.
Most real-world problems can be formulated as a relationship ƒ : X → Y, where X is the set of inputs and Y is the set of outputs. Supervised learning is the procedure of learning this mapping function. A supervised learning algorithm analyses the input training data and produces an inferred function, which can be used for mapping new examples. For example, after being trained on a dataset of cats and dogs, the algorithm should be able to tell whether a new image contains a cat or a dog. In the optimal scenario, the trained algorithm correctly predicts the corresponding output Y for an unseen input instance X.
Supervised learning has two major forms: classification and regression. Both have been widely used in object detection tasks in recent years. In classification, a new instance is assigned to one of the categories present in the training data. In our task, for instance, we can use a neural network classifier to identify the seeds in the plant. In regression, the model learns a function that predicts continuous numerical outputs from the input variables. In this research, regression is used to predict the coordinates of the bounding boxes.
An Artificial Neural Network (ANN) is a supervised machine learning algorithm that learns from labelled training data and generates a mapping function as described above. The model is inspired by the human brain, with which we learn and memorise things. As an imitation of the brain, it also has neurons as its computational units. The structure is modelled on the biological neural system, in which billions of neurons connect to each other and transmit signals.
To construct a computational version of the brain, several concepts have been introduced. A group of neurons that receive the same input is called a layer. An activation is the value produced by the activation function, which transforms the neuron's output according to a formula. Common non-linear activation functions are tanh, sigmoid and the rectified linear unit (ReLU). The trainable parameters of a neuron are called weights, and they are adjusted during training. A connection is excitatory if its weight is positive and inhibitory if its weight is negative. If the weight is zero, the output does not depend on that input. ANNs often use the sigmoid as their output activation function, which converts the weighted sum to a value between 0 and 1. Hidden layers are the layers between the input and output layers. They are commonly activated by the ReLU function, which helps to mitigate the vanishing-gradient problem during training.
source from: http://nl.wikipedia.org/wiki/Perceptron
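The single neuron described above can be sketched as a weighted sum followed by a sigmoid activation. This is an illustrative sketch only; the function names, inputs and weights are made up for the example.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

x = [1.0]
excited = neuron(x, [2.0], 0.0)     # positive weight pushes the output above 0.5
inhibited = neuron(x, [-2.0], 0.0)  # negative weight pushes the output below 0.5
ignored = neuron(x, [0.0], 0.0)     # zero weight: the output does not depend on x
print(excited, inhibited, ignored)
```

With a zero weight and zero bias the weighted sum is 0, so the sigmoid returns exactly 0.5 whatever the input.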
During the forward pass of a neural network, each layer receives an input, processes it and feeds the result into the next layer until the output layer is reached. This is what we call a feed-forward network. With this workflow, the network can receive inputs and output a prediction. However, this prediction is likely to be wrong at first because the parameters in the network are randomly initialised with small numbers, so the neurons have no information about how to handle the input and the output is unlikely to match the label. Since we want the network to learn and make accurate predictions, some kind of ‘feedback’ is required to ‘teach’ the network by updating the parameters in the neurons. The ‘loss’ provides this feedback: it measures the deviation between the output and the desired value. After the signal has passed through the network and reached the output layer, the loss is calculated and its gradient is propagated back through the layers, according to the chain rule, to update the weights of the neurons. This process is therefore called backward propagation of errors (backpropagation), because the error is computed at the last layer and distributed back through the network layers.
Backpropagation is repeated in each iteration, updating the model so that it fits the data. A factor called the learning rate affects the speed and quality of this process. It is the factor by which the gradient is scaled before being subtracted from the weight to update the parameters. As the network fits the data better, the loss generally decreases and eventually converges to a local minimum. When the gradient is estimated from a randomly chosen sample or mini-batch at each step, the procedure is called Stochastic Gradient Descent (SGD). It is a popular algorithm for optimising a wide range of models in machine learning. SGD computes the updated weight as follows:
$$w_{i+1} = w_i - \eta \nabla g(w_i)$$
where $\eta$ is the learning rate and $\nabla g(w_i)$ is the gradient of the loss $g$ with respect to the weights, evaluated at $w_i$.
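The update rule can be tried on a toy problem. The sketch below is an assumption-laden illustration, not the training code of this project: it fits a single weight w so that w*x matches y = 2x, uses the squared error of one randomly chosen sample per step as the loss, and picks an arbitrary learning rate, step count and seed.

```python
import random

def sgd_fit_slope(samples, eta=0.01, steps=1000, seed=0):
    """Fit w in y = w * x by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(samples)     # one random sample per step (the "stochastic" part)
        grad = 2.0 * x * (w * x - y)   # derivative of (w*x - y)^2 with respect to w
        w = w - eta * grad             # the update rule: w <- w - eta * gradient
    return w

# Noise-free toy data generated with a true slope of 2.
samples = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = sgd_fit_slope(samples)
print(f"learned slope: {w:.6f}")
```

A larger learning rate would reach the neighbourhood of the solution in fewer steps but can overshoot and diverge, which is the speed and quality trade-off mentioned above.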
Deep learning is a branch of machine learning. Among deep learning algorithms, the Convolutional Neural Network is the model that performs best on data with spatial topology, such as images, audio and bioinformatic sequences. We discuss it in this subchapter.
A Convolutional Neural Network is a deep, feed-forward neural network that is commonly applied to pattern recognition tasks. The underlying concept goes back to Hubel and Wiesel, who analysed the visual cortex neurons of cats and monkeys in the 1960s. Their research showed that the visual cortex consists of neurons that respond individually to external signals. They described a receptive field as the region of visual space in which visual stimuli affect the firing of a single neuron.
Individual neurons respond to different receptive fields, which overlap so that together they cover the entire visual field. Hubel and Wiesel's paper described two types of visual cell in the brain. The simple cell is sensitive to straight edges and has a limited receptive field. The complex cell has a larger receptive field and is insensitive to the exact location of the edges within it.
The visual cortex is such a powerful image processing system that many shallow models in visual recognition, such as the Neocognitron, LeNet-5 and the shift-invariant neural network, were inspired by and modelled on visual neurons. As computational power increased and machine learning technology developed, deeper networks based on the animal visual system were proposed, i.e. Convolutional Neural Networks.
Sparse connectivity and shared weights are the two properties that underpin the Convolutional Neural Network. Both significantly decrease the number of parameters compared with traditional neural networks, allowing a CNN to include more hidden layers.
In a traditional neural network, the neurons in each layer are fully connected to all the neurons in the previous layer, producing a large number of weights to be trained. What makes a CNN different is that the neurons in a convolution layer are connected only to a small region of the previous layer, analogous to the receptive field in a visual system. This sparse connectivity enables a CNN to exploit spatially local traits.
From figure 3(a), we can observe that the inputs of layers one and two are local subsets of the neurons in the previous layer. The receptive field size of the CNN in figure 3(a) is three, so each neuron connects to only three neurons in the previous layer. With this architecture, a CNN preserves the spatial structure of the input data and responds more strongly to spatially local patterns than a fully connected network, which treats input pixels that are far apart in the same way as adjacent pixels. Moreover, stacking many such layers achieves dimensionality reduction on the spatial data, converting it into a low-dimensional, feature-rich vector space.
On the other hand, CNN layers can significantly reduce the number of trainable weights in the network. For example, in figure 3(b) an input of length 3 results in 3*3 = 9 weights across 3 layers. If the input length is 5, the number of weights increases to 5*3 = 15. But if we use a CNN on this data, the required number of weights is just 5+3+1 = 9. The complexity thus decreases when a CNN is used as the model, while the structural properties are preserved.
Another distinguishing feature of a CNN is shared weights. The neurons in a convolutional layer, sometimes called ‘filters’, slide across the entire visual space. All neurons belonging to the same filter share an identical weight vector and bias. This allows the neurons in a given convolutional layer to detect the same feature within their respective receptive fields. Because the neurons are replicated in this way, feature detection is insensitive to the position of the feature.
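The effect of weight sharing can be illustrated in one dimension. In this minimal sketch a single two-weight filter is slid over two signals that contain the same rising edge at different positions; the filter, the signals and the function name are invented for the example, and the sliding operation is the cross-correlation that CNN libraries conventionally call convolution.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

edge_filter = [-1, 1]  # the same two weights are reused at every position

signal_a = [0, 0, 1, 1, 1, 1, 1, 1]  # rising edge near the start
signal_b = [0, 0, 0, 0, 0, 1, 1, 1]  # the same edge, shifted three steps right

resp_a = conv1d(signal_a, edge_filter)
resp_b = conv1d(signal_b, edge_filter)
print(resp_a)
print(resp_b)
```

The peak response has the same strength in both cases and simply moves with the edge, so the filter detects the feature wherever it occurs while using only two trainable weights.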
Convolution layers are the key component of a CNN and are what distinguishes it from a traditional fully connected neural network. A convolution layer consists of a number of trainable units, called filters. The filters convolve across the input volume during the forward pass, transforming the input into a tensor called an activation map. The activation maps produced by all the filters together form the full output tensor of the convolution layer, which can then be used by the subsequent fully connected (dense) layers.
Figure 4 shows an example of the convolution process. Assume that we have a 32*32*3 image and use a filter of size 5*5*3, as shown on the left. The convolution operation produces a 32*32*1 feature map, whose spatial size is given by the formula (W−F+2P)/S+1, where W is the input size, F is the filter size, P is the amount of zero padding and S is the stride. In this case, P = 2 and S = 1 are used in order to preserve the spatial shape of the input. If we apply 10 different filters to the image, we obtain a three-dimensional output of shape 32*32*10, i.e. a stack of 10 feature maps. Because of the shared weights, each 32*32*1 feature map produced by a single filter represents the same feature detected across the original image. Acquiring 10 feature maps therefore means that 10 features are extracted from the image.
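The output-size formula can be checked with a small helper. This is only a sketch of the arithmetic; the function name is illustrative, and the error for non-integer results is a design choice of this example rather than the behaviour of any particular library.

```python
def conv_output_size(w, f, p, s):
    """Spatial output size (W - F + 2P) / S + 1 of a convolution layer."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not tile the padded input with this stride")
    return span // s + 1

same_size = conv_output_size(32, 5, 2, 1)   # padding of 2 preserves the 32*32 shape
no_padding = conv_output_size(32, 5, 0, 1)  # without padding the map shrinks

# Ten 5*5*3 filters, each with one bias, as in the example above.
n_filters = 10
n_params = n_filters * (5 * 5 * 3 + 1)
print(same_size, no_padding, n_params)
```

Without padding the same filter would give a 28*28 map, which is why P = 2 is needed to keep the 32*32 shape.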
The pooling layer is frequently used in CNNs for non-linear down-sampling. There are several types of pooling layer, of which max-pooling is the most popular. The max-pooling layer reduces the amount of computation and the number of parameters in subsequent layers by dividing the input into a set of non-overlapping sub-regions, usually of size 2*2, and outputting the maximum value of each region. The intuition is that the exact location of a feature is less important than its rough location relative to other features. Pooling layers thus reduce the spatial size of the representation and help to limit overfitting.
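A 2*2 max-pooling step over non-overlapping regions can be sketched as follows. The feature map values are made up, the function name is illustrative, and the input is assumed to have even height and width.

```python
def max_pool_2x2(fmap):
    """Max-pool a 2-D list over non-overlapping 2*2 regions (stride 2)."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 4*4 map is reduced to 2*2, keeping only the strongest response in each region.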
ReLU, the abbreviation of Rectified Linear Unit, is a popular activation function in CNN layers. The function is given by the formula:
$$f(x) = \max(0, x)$$
The function sets every value lower than 0 to zero, which increases the non-linear properties of the output without affecting the spatial structure of the receptive field. There are also other functions for increasing non-linearity. The reason ReLU is often preferred is that it trains the network several times faster without making a noticeable difference to generalisation accuracy.
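A short sketch of ReLU and its gradient is given below, next to the sigmoid for comparison. The comparison reflects a commonly given explanation for the faster training, namely that the sigmoid gradient shrinks towards zero for large inputs while the ReLU gradient stays at 1; it is offered as an illustration, not as the analysis from the cited work, and the function names are illustrative.

```python
import math

def relu(x):
    """Rectified Linear Unit: negative values become zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Gradient of the sigmoid, which saturates for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

activations = [relu(v) for v in (-2.0, 0.0, 3.0)]
print(activations)
print(relu_grad(10.0), sigmoid_grad(10.0))
```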
Softmax is another activation function, commonly used in the output layer for classification tasks. It is given by the equation:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{k} e^{z_i}}$$
where k is the dimension of the input vector z and j = 1, …, k. The softmax function squashes the k-dimensional vector into k real values, each between 0 and 1, that sum to 1. Each output value represents the predicted probability of the corresponding class label.
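The softmax function can be sketched directly from its definition. Subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result; the input scores and the function name are arbitrary choices for this example.

```python
import math

def softmax(z):
    """Map k real scores to k probabilities that sum to 1."""
    m = max(z)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the largest score receives the largest probability, so taking the index of the maximum gives the predicted label.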
In this section, we discuss popular architectures for convolutional neural networks. These structures introduce innovative ways of constructing CNNs that allow for more efficient learning. They can serve not only as task solutions but also as rich feature extractors.
LeNet-5 is a 7-level convolutional network designed by LeCun et al. in 1998 for handwritten and machine-printed character recognition. The network consists of two convolutional layers with 5*5 filters and two pooling layers of size 2*2, followed by 2 fully connected layers. This architecture transforms the 32*32 input image into feature maps, which are processed by the fully connected layers and eventually converted into a vector of size 10, each element indicating the probability of the corresponding class label. LeNet was successfully applied to recognising handwritten numbers on bank cheques. However, it fell short on higher-resolution images, which require larger and more numerous convolutional layers, and was therefore limited by the computational resources available at the time.
AlexNet is the first large-scale CNN architecture, and it outperformed all other competitors in the ImageNet competition in 2012. The success of this model promoted the development of CNNs and deep learning. Its design is based on LeNet but with larger and deeper layers. The most intuitive way to see the difference is to compare the number of parameters: AlexNet has around 60 million parameters, about 1,000 times more than LeNet, which has only about 60,000.
GoogLeNet and VGG-16
GoogLeNet and VGG-16 were both developed for the ImageNet competition in 2014. GoogLeNet, designed by Google, won first place, and VGGNet, from Oxford, came second. VGGNet increases the number of layers to 16, while GoogLeNet goes deeper, to 22. GoogLeNet also introduced the concept of the ‘inception cell’, in which a series of convolutions at different scales and depths is performed, followed by a merge operation. The design of the ‘inception’ modules also saves computation: although GoogLeNet is deeper than AlexNet, it has about twelve times fewer parameters.
The GoogLeNet researchers subsequently published a paper that introduced a more efficient alternative to the original inception module. Although convolution filters with a large spatial extent (e.g. 5*5, 7*7) have greater potential for extracting features, they are disproportionately expensive computationally. The authors showed that a 5*5 convolution kernel can be replaced more efficiently by two stacked 3*3 kernels.
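The saving from replacing one 5*5 kernel with two stacked 3*3 kernels can be verified with simple arithmetic. The sketch below counts weights only (biases are ignored) and assumes stride 1 and an equal number of input and output channels; the channel count of 64 and the function names are arbitrary choices for the example.

```python
def conv_weights(kernel, channels_in, channels_out):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel * kernel * channels_in * channels_out

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of equal size."""
    return 1 + layers * (kernel - 1)

channels = 64
single_5x5 = conv_weights(5, channels, channels)
two_3x3 = 2 * conv_weights(3, channels, channels)

print(stacked_receptive_field(3, 2))  # two 3*3 layers cover the same 5*5 area
print(single_5x5, two_3x3, two_3x3 / single_5x5)
```

Two 3*3 layers cover the same 5*5 receptive field with 18 rather than 25 weights per channel pair, a reduction of 28%.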
VGGNet consists of a large number of 3*3 convolution filters only, and was trained on 4 GPUs for 2-3 weeks. Although it only took second place in the competition, behind GoogLeNet, VGG-16 is currently a popular choice for extracting features from images. Its weights and configurations are publicly available and have been applied to many other image processing tasks as a baseline feature extractor. However, the architecture has over 138 million parameters, which is a considerable drawback.
The introduction of the ‘Deep Residual Network’ (ResNet) was a breakthrough in research on deeper CNN structures. It is widely believed that deeper networks can learn more complex features and patterns from the data. In practice, however, simply adding more layers can have a negative effect on model performance. The authors of ResNet refer to this phenomenon as the Degradation Problem. Under degradation, a deeper network still converges, but usually to a higher error rate than its shallower counterpart. ResNet provides a remedy for this problem, called ‘Residual Blocks’. A residual block consists of two convolution layers, as shown in figure 9.
Its distinguishing feature is that the input X is added directly to the output of the second convolution layer, and the sum forms the output of the residual block. The use of modules in ResNet is similar to that in GoogLeNet. However, with this ‘skip connection’ structure, ResNet is able to stack up to 152 layers while its complexity remains lower than that of VGGNet.
ResNet was the winner of ILSVRC 2015, with an error rate of 3.57%, which notably outperformed human experts on the dataset used for this competition.
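The skip connection can be written as y = relu(F(X) + X), where F stands for the two convolution layers. The following is a minimal sketch of that idea only: the vectors are plain Python lists, and `residual_fn` is a hypothetical stand-in for the convolutional layers rather than a real convolution.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return relu(F(x) + x), where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return relu([f + xi for f, xi in zip(fx, x)])


def zero_residual(x):
    """A residual branch that has learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


x = [0.5, 1.0, 2.0]
# With F(x) = 0 the block passes its (non-negative) input through unchanged.
print(residual_block(x, zero_residual))
```

This illustrates why extra residual blocks need not hurt: if the additional layers have nothing useful to add, the block can fall back to the identity mapping by driving F towards zero.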
Mask-RCNN, proposed by the Facebook Research group, is an efficient model for detecting objects in images while simultaneously generating an accurate segmentation mask for each instance. The model builds on Faster-RCNN, extending it with a new branch for predicting the segmentation masks. The new branch is a Fully Convolutional Network added on top of the feature map, which predicts the mask with pixel-level precision. The feature map is extracted by ResNet101 or ResNeXt-101.
A small modification is made to the Faster-RCNN architecture to adapt it for mask prediction. Because image segmentation requires much more accurate alignment than bounding boxes, Mask-RCNN proposes a RoIAlign layer to replace the original RoIPooling layer. RoIAlign corrects the location misalignment caused by quantization in RoIPooling, where the fractional part of the floating-point coordinates is discarded during rescaling. Instead, bilinear interpolation is applied to compute the feature values at the exact floating-point locations.
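To make the interpolation step concrete, the sketch below samples a small 2D grid at a fractional location. It is only an illustration of bilinear interpolation, not the RoIAlign layer itself; the function name and the toy grid are hypothetical, and the coordinates are assumed to lie inside the grid.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Sample a 2D grid (list of rows) at a fractional (y, x) location.

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    height, width = len(feature_map), len(feature_map[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the second neighbour so that border samples stay inside the grid.
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


grid = [[0.0, 10.0],
        [20.0, 30.0]]

# The centre of the four cells is the mean of their values.
print(bilinear_sample(grid, 0.5, 0.5))
```

Rounding the location (0.5, 0.5) to an integer cell, as quantization does, would return one of the four corner values instead of their weighted mix.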
Transfer learning is a family of machine learning algorithms that improve learning in a specific domain of interest by transferring knowledge from a trained network whose data is related but follows a different distribution. When applying a Convolutional Neural Network to such tasks, it is more practical to pretrain the model on a very large dataset such as ImageNet. This is because task-specific datasets usually have insufficient data compared to ImageNet, which contains 1.2 million general images in 1000 categories. The pretrained CNN then serves as an initialization or as a feature extractor that is adapted to the task of interest.
The knowledge learned by a CNN can be transferred because the first few layers learn generic features that can be adapted to other recognition tasks. The last fully connected layers can be considered as independent classifiers that operate on the feature maps output by the convolutional layers. When the last FC layer is removed, the rest of the architecture can be kept as a fixed feature extractor. Following this principle, we can replace the classifier layer with another fully connected layer to perform a different detection task.
Fine-tuning the transferred model
When the pretrained network is used as an initialization, not only is the classifier on top replaced and retrained, but the rest of the network is also fine-tuned on the target dataset. One can either retrain all the layers of the CNN, or freeze the bottom layers, which contain more generic features that are useful for other tasks, and fine-tune only some of the high-level layers, which focus on details of the original dataset. The early layers are usually left unchanged because of potential overfitting. Compared to the fixed feature extractor, fine-tuning the transferred model takes much more time, but retraining the feature extraction layers gives the model a stronger ability to detect local features.
This chapter gives an overview of the dataset used in this recognition problem. The plant dataset includes two major components. One has been filtered to a white background and contains nothing but the target object. The other is a raw dataset that contains clutter and has a background colour close to black.
The datasets were obtained from a diversity panel of the plant Arabidopsis thaliana. Seeds of each line were sown in individual pots and grown under controlled conditions using the PlantScreen Phenotyping System (PSI) at the National Plant Phenomics Centre facilities in Aberystwyth, UK. At maturity, the stems of each line were imaged with a flatbed scanner (Plustek, OpticPro A320) at 300 dpi and stored as PNG. A sample image is shown in Figure 11. A total of 7133 images were collected. The main purpose of acquiring the images was to keep a record of the results and to enable subsequent phenotyping of traits such as silique number and length. Siliques were counted manually in 5430 images, since no image analysis methods were available at that time.
The dataset is composed of two types of plant images. There are 189 annotated white background images and 98 annotated black background images. The resolution of the white background images is 3600*5100, while the black background images are 1701*2800. Each image has 3 colour channels (Red, Green, and Blue). We used the white background images as our training sample, split into 132 instances for training and 57 instances for validation (70% for training and 30% for validation). The siliques were labelled beforehand by fellows of the Phenomics Centre at Aberystwyth University, resulting in 10474 labelled samples. The dataset has three classes: siliques, stems and flowers. However, our aim in this project is to count and measure the siliques in each image, so we focus only on silique detection. The prediction layer thus contains only two classes, background and siliques, represented by 0 and 1 respectively.
The number of siliques counted in each white background image is shown in figure 11. The counts cluster around 25 and 75, where the frequency is highest. The mean count is 55, i.e. each image has approximately 55 siliques on average.
Figure 12 shows the histogram of siliques lengths. The distribution of lengths is nearly symmetric and its peak value is located at around 33.
The other dataset, with black background images, is not used for training because the images have a complicated, noisy background and too many overlaps, which are difficult for the model to handle and learn. Therefore, we use it only for testing.
Nevertheless, the black background images are good testing samples for evaluating how robust our model is after being trained on the white background dataset. Moreover, we can apply transfer learning to this dataset. After obtaining an appropriate model from the white background images, can we transfer the knowledge to a similar problem, detecting siliques against a different background? Do the features extracted from the training dataset still work on the black background? These are interesting questions that we can investigate experimentally.
This dataset has 98 images but 19103 labels, which means it has many more labels per image than the white background one. The average number of siliques in black background images is 194 and the mean length is 39.88.
Because the dataset has only 189 instances, it is not enough for training the model. We therefore augment the images to produce more training samples, which helps to avoid overfitting and improve performance. The augmentation tool we used is the ‘imgaug’ library in Python, which provides a wide range of data augmentation algorithms. Figure 15 shows examples of different augmented results produced from the same original image. The algorithms used to augment the data include flipping, affine rotation, pixel enhancement, Gaussian blur, etc.
(a) Original (b) Vertical reversal (c) Horizontal reversal
(d) 90 degree rotation (e) Pixel value multiplication (f) Gaussian blur
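The geometric and pixel-level operations listed above can be illustrated without any library. The sketch below is not the imgaug API; it is a plain-Python illustration on a tiny greyscale image stored as nested lists, and all function names are hypothetical. Gaussian blur is omitted for brevity.

```python
def flip_vertical(image):
    """Vertical reversal: the top row becomes the bottom row."""
    return [list(row) for row in image[::-1]]


def flip_horizontal(image):
    """Horizontal reversal: each row is mirrored left to right."""
    return [list(row[::-1]) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def multiply_pixels(image, factor, max_value=255):
    """Scale every pixel value by `factor`, clipping at max_value."""
    return [[min(max_value, int(round(p * factor))) for p in row] for row in image]


image = [[1, 2],
         [3, 4]]

for name, augmented in [
    ("vertical reversal", flip_vertical(image)),
    ("horizontal reversal", flip_horizontal(image)),
    ("90 degree rotation", rotate_90_clockwise(image)),
    ("pixel multiplication", multiply_pixels(image, 2.0)),
]:
    print(name, augmented)
```

In a real pipeline the silique annotations have to undergo the same geometric transformation as the image, otherwise the labels no longer line up with the pixels.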
In this chapter, we first introduce how silique detection is performed on the basis of Mask-RCNN. The second part presents the experimental choices we made and their justification. Finally, we evaluate the results we achieved.
According to the authors of Mask-RCNN, the model surpasses all previous state-of-the-art single-model results on all three tasks of the COCO 2016 challenge: instance segmentation, bounding-box object detection, and person keypoint detection. The model is conceptually intuitive and offers such flexibility and robustness that it has been adapted and implemented for many object detection and segmentation tasks, with top results. It is therefore a suitable choice for our silique detection task.
Mask-RCNN is composed of two major steps. The first step convolves over the image and produces feature maps, from which the Region Proposal Network generates region proposals. In the second step, the features of the proposed regions are fed into different classifiers and regressors for predicting bounding boxes and masks.
The backbone in a CNN architecture is usually the set of bottom convolutional layers that serve as a feature extractor. In Mask-RCNN, the backbone of choice could be ResNet50 or ResNet101, which we introduced in chapter 3. Other common backbone networks are AlexNet, VGG, Inception, Inception-ResNet and ResNeXt.
In the backbone, the input image is scanned and condensed into feature maps for the following layers to use. The early layers of the backbone extract low-level features, while the later layers successively recognise higher-level features. Typically, an image of size height*width*(colour channels) is converted to a 32*32*2048 feature map (for example, from a 1024*1024 input) after passing through Mask-RCNN’s backbone network. Note that ‘same’ padding is applied on each layer in the backbone in order to preserve integer scales.
Feature Pyramid Network (FPN) is introduced to perform scale-invariant detection. It is an extension of the original feature pyramid produced by the CNN backbone. As shown in figure 11, the left pyramid consists of the output feature maps from the backbone feature extractor. The native feature maps pose an inherent trade-off: the low-level features are weak but more general, while the high-level features are strong and more abstract.
FPN adds an extra pyramid that takes the high-level features from the original pyramid and propagates them back down to the lower feature maps. This process improves the quality of the features in the lower layers, making all the feature maps useful for training. It is important to note that the feature map is selected dynamically depending on the size of the instance.
Selective search is a popular region proposal method, yet it consumes much more runtime than the detection network, slowing down the overall detection process. As mentioned above, the Region Proposal Network proposes areas of interest by scanning the feature map with a sliding window. The RPN is specifically designed to generate region proposals efficiently over a wide range of scales and aspect ratios. Anchors are the prototypical object regions evaluated at each position of the RPN sliding window. The anchor boxes serve as reference boxes on the feature map at different scales. The anchors are classified into two classes, foreground or background; the ‘foreground’ anchor boxes are those likely to contain an object.
Because the RPN scans the feature map instead of the raw image, the network can reuse the extracted features and avoid extra computation. The RPN is therefore able to generate region proposals quickly, in about 10ms as stated in Faster RCNN.
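One common rule for this size-dependent selection is the one given in the FPN paper, k = floor(k0 + log2(sqrt(w*h) / 224)), where w and h are the width and height of the region and k0 = 4 is the level assigned to a 224*224 region. The sketch below implements that rule as an illustration; the clamping range of levels 2 to 5 and the function name are assumptions, and the implementation used in this project may differ in detail.

```python
import math


def fpn_level(box_width, box_height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pyramid level for a region of the given size (FPN paper rule)."""
    scale = math.sqrt(box_width * box_height) / canonical_size
    k = math.floor(k0 + math.log2(scale))
    # Clamp to the pyramid levels that actually exist (assumed 2..5 here).
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 1000):
    print(f"{size}*{size} region -> level {fpn_level(size, size)}")
```

Small regions such as individual siliques are therefore routed to the finer, higher-resolution levels of the pyramid, while large regions use the coarser levels.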
Bounding Box Refinement
There may be several foreground anchor boxes (or positive anchor boxes) generated for the same instance in an image. In this case, the algorithm keeps the best-fitting anchor and discards the rest based on their scores, a procedure referred to as Non-maximum Suppression (NMS). Eventually, the RPN keeps the anchor boxes with the top scores, i.e. those most likely to contain an object, and refines their size and location.
This part of the model performs classification and regression on the regions of interest proposed by the RPN. The outputs include the predicted class and the 4 coordinates of the bounding box. The last fully connected layer, activated by a softmax function, outputs the class label of the object in the RoI. The other fully connected layers are the bounding box regressors, which output the coordinates of the object in the RoI.
Because the classifiers normally require input features of a fixed size, we need a tool to transform the RoI boxes proposed by the RPN, which have different scales and aspect ratios. The RoI Pooling layer is typically designed to normalise the features by taking the average value from the 2*2 grid in RoI boxes. Mask RCNN pushes this further by replacing the RoI pooling layer with a RoIAlign layer. RoIAlign removes the coordinate quantization, which reduces the information loss and makes the localization of the segmentation mask more precise.
The components described in the previous sections are all part of the Faster-RCNN framework. Mask-RCNN adapts them to perform mask prediction by adding a mask network branch. The mask branch is a convolutional neural network that uses the refined detection boxes (post-NMS boxes) for mask prediction, which increases accuracy.
The masks generated from the refined boxes are very small, which keeps the mask branch lightweight. Moreover, the generated 28*28 masks are soft masks, meaning that they are represented by floating-point numbers and are more precise than binary masks. The mini masks are scaled up during inference to match the size of the RoI bounding boxes.
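The suppression step can be sketched as a greedy loop: boxes are visited in order of decreasing score, and a box is kept only if it does not overlap too much with a box that has already been kept. The code below is a minimal illustration of this idea, not the implementation used in the project; the (x1, y1, x2, y2) box format, the threshold of 0.5 and the example boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it overlaps little with every box kept so far.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]  # hypothetical
scores = [0.9, 0.8, 0.7]

# The second box overlaps the first heavily and is suppressed.
print(non_max_suppression(boxes, scores))
```

For densely packed or overlapping siliques the threshold matters: a low value may suppress a genuine neighbouring silique, while a high value may leave duplicate detections.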
The overall architecture includes three parts: the convolutional backbone, the bounding box detection network and the mask prediction network. They are all adapted from Mask-RCNN architecture provided by matterport  with some modifications and hyper-parameter changes to fit our plant image dataset. The model is implemented on the basis of a deep learning framework called Keras, with Tensorflow running as the backend. The structure of Mask-RCNN is summarised in table 1.
As shown in table 1, the model contains 407 layers and over 63 million trainable parameters. It is not feasible to train the model end-to-end with the existing small dataset and computing resources. Therefore, we apply transfer learning, transferring knowledge from another task to reduce training time and improve performance. A potential solution is to initialise the network with weights pretrained on the COCO dataset. After initialisation, we can train the head layers on our dataset to fit the model for silique detection. Furthermore, we can try to train the fully connected layers or even the whole network with a smaller learning rate to pursue better performance.
After obtaining the results, we need methods to measure their accuracy. The following sections describe the common evaluation methods.
Precision measures how accurate the positive predictions are, i.e. the percentage of positive predictions that are correct. It is given by the formula:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Equation 1: Definition of precision
where TP is the number of true positives and FP is the number of false positives. However, precision alone is not enough to measure the actual model performance, because some objects could be mistakenly classified as negative (false negatives, FN), which precision does not take into account. Hence, recall is introduced as a complementary measure. It is given by the following formula:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
Equation 2: Definition of recall
Recall measures the percentage of all actual positives that are correctly predicted.
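Both measures follow directly from the counts of true positives, false positives and false negatives. The sketch below computes them for one set of counts; the numbers are made up for illustration and are not results from this project.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical image: 40 siliques found correctly, 10 spurious detections,
# 15 siliques missed.
precision, recall = precision_recall(tp=40, fp=10, fn=15)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

In this example the detector is fairly precise (80% of its detections are real siliques) but misses more than a quarter of the siliques, which precision alone would not reveal.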
Intersection over Union (IoU) measures the similarity between two regions. It is defined as the area of overlap between the two regions divided by the area of their union. This measures how good our prediction is compared to the ground truth. As shown in figure 19, the intersection is the area of A ∩ B and the union is the area of A ∪ B, so IoU = |A ∩ B| / |A ∪ B|.
The mean Average Precision (mAP) is commonly used in the field of object detection and information retrieval. It is the mean value of a set of Average Precisions, so in order to obtain the mAP we need to calculate the Average Precision (AP) first. Note that AP is calculated over a dataset, so different datasets have different AP and mAP values.
In our task, we aim to evaluate the precision of the predicted bounding boxes. The AP is therefore defined as the average precision at an IoU threshold of 50% (i.e. the minimum IoU value for a prediction to count as a positive match).
Equation 3: Definition of Average Precision
In general, this amounts to finding the maximum precision value at each recall level (0, 0.1, …, 1) and averaging the results.
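One way to read this description is as the 11-point interpolated AP used in the PASCAL VOC evaluation, AP = (1/11) * sum over r in {0, 0.1, …, 1} of max_{r' >= r} p(r'). The sketch below follows that reading, which is an assumption about the exact variant used here. It expects the detections to be sorted by descending confidence and already marked as true or false positives (for example by the 50% IoU rule); the function name and example flags are hypothetical.

```python
def average_precision_11pt(is_true_positive, num_ground_truth):
    """11-point interpolated AP.

    is_true_positive: flags for detections sorted by descending score.
    num_ground_truth: number of ground-truth objects in the dataset.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)           # precision after `rank` detections
        recalls.append(tp / num_ground_truth)  # recall after `rank` detections

    total = 0.0
    for level in range(11):
        r = level / 10
        # Best precision achieved at any recall of at least r (0 if none).
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11


flags = [True, True, False, True]  # hypothetical ranked detections
ap = average_precision_11pt(flags, num_ground_truth=4)
print(f"AP = {ap:.4f}")
```

In the example one of the four ground-truth objects is never detected, so the recall levels above 0.75 contribute zero and pull the AP down.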
Pearson correlation coefficient (also called Pearson’s r) measures the linear correlation between two variables X and Y. Combined with a scatter plot, the coefficient can clearly illustrate how well the model fits the data.
The best case is r = +1, which means the model follows the exact pattern of the data, but this is an idealised result. In general Pearson’s r ranges from -1 to +1; here the values of interest fall between 0 and 1, where 0 indicates that there is no linear correlation between the two variables.
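The coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it for two short lists; the manual and predicted counts are invented for illustration and are not results from this project.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


manual_counts = [40, 55, 62, 75]     # hypothetical manual silique counts
predicted_counts = [38, 57, 60, 78]  # hypothetical model predictions

print(f"r = {pearson_r(manual_counts, predicted_counts):.3f}")
```

A high r shows that the predictions rise and fall with the manual counts, but it does not by itself rule out a constant offset or scale error, which is why the scatter plot is inspected as well.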
For the plant images in our dataset, because the siliques are usually straight with no curvature, we can use the diagonal of the bounding box as an estimate of the silique length. Taking the diagonal as the target length, we can compute it from the known coordinates of the bounding box. The length is thus given by:
\[ \text{Length} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \]
where x1, x2 are the abscissae and y1, y2 are the ordinates of two opposite corners of the bounding box.
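The length estimate is the Euclidean distance between two opposite corners of the bounding box. The sketch below computes it in pixels and, as an optional extra step, converts pixels to millimetres. The conversion is an assumption that only holds if the box coordinates refer to the image at its original 300 dpi scan resolution; after resizing, the scale factor would have to be included. Both function names are hypothetical.

```python
import math


def silique_length(x1, y1, x2, y2):
    """Diagonal of the bounding box with opposite corners (x1, y1), (x2, y2)."""
    return math.hypot(x1 - x2, y1 - y2)


def pixels_to_mm(pixels, dpi=300):
    """Convert a length in pixels to millimetres (25.4 mm per inch)."""
    return pixels * 25.4 / dpi


length_px = silique_length(10, 20, 13, 24)  # a 3-by-4 pixel box
print(length_px, "pixels")
print(pixels_to_mm(300), "mm for 300 pixels at 300 dpi")
```

The diagonal is a good estimate for a straight silique lying at an angle, but it overestimates the length of a nearly horizontal or vertical silique whose box has a non-negligible width.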
In this chapter, we will describe the experiments and their results, as well as the justification of choices.
The experiment is conducted on a workstation provided by my supervisor. The workstation is equipped with Intel Xeon Processor E5-2630 v4 @ 2.20GHz which has 10 cores (2 logical cores per physical) and a single NVIDIA TITAN Xp with 12GB of memory.
As mentioned in chapter 4, data pre-processing is important when training on small datasets. In particular, the white background dataset had only around 100 valid annotated images at the early stage of this project. The number of available images increased to 189 in the middle stage. With data augmentation, however, the number of valid images can easily be increased to several times that of the raw data. In our project, we randomly applied some of the algorithms (flipping, affine rotation, pixel enhancement, Gaussian blur) to the raw images, which resulted in many variations for use in training.
With more training data available, we split the dataset into two parts, with 70% used for training and 30% for testing. This enables us to detect overfitting and to evaluate predictions on unseen data, which improves the reliability of the model.
To obtain the best result, the hyperparameters need to be tuned to fit the target data. The large-scale model architecture has many tuneable parameters. We therefore start training with the default settings proposed by the author and tune them based on my understanding of the network and the experience shared by contributors on GitHub.
I selected ResNet-101 as my backbone feature extractor. In general, the deeper the network, the stronger the features it can detect from the data. ResNet-101 addresses the degradation problem by adding units called ‘residual blocks’, which allow it to have more layers but fewer parameters than the VGG network. In addition, my computing resources are sufficient for training ResNet-101.
The default optimizer in the original paper is SGD. Since SGD uses a fixed learning rate, the model can converge faster with an adaptive optimizer such as ADAM. The adaptive learning rate of ADAM also helps the model approach a local minimum more accurately, improving training performance. Moreover, because ADAM adapts the learning rate automatically, manual tuning becomes less important, whereas in SGD the initial learning rate largely determines the model performance.
Because the images are very large (5100*3600), it is necessary to reduce their size for training. I first resized the images to 1024*1024, which is a common scale for training. The performance of the model was acceptable in terms of Average Precision and Pearson Correlation Coefficient. I then tried reducing the size to 512*512, but the evaluation performance was worse than at the previous dimension, and I could clearly spot many undetected siliques. Following this negative result, the next experiment was performed on images of 1280*1280. The model trained at the larger scale seems to be better at detecting overlaps and horizontal or vertical siliques (we will discuss why they are hard to detect in chapter 7), and the model performance improved slightly as well. However, the dimension cannot be increased indefinitely because of memory limits. Moreover, the image dimensions and the data augmentation affect each other, which leads to an exponential increase in complexity. Therefore, 1280*1280 is the final training scale for my model.
The parameters of the anchor boxes in the RPN need to be selected carefully because, as mentioned in chapter 5, they affect the way the model learns features. If the anchor boxes do not fit the objects in the data, the model cannot generate accurate predictions. Compared to the training data in the original paper, our plant data has smaller targets but more instances per image. Therefore, I decreased the scale and increased the number of anchor boxes. The current anchor ratios are (0.5, 1, 2), which means the anchor boxes are squares, vertical rectangles and horizontal rectangles. However, our siliques can be longer and thinner than these ratios allow, so other ratios such as (0.3, 1, 3) might be more effective. This is left as future work for this project.
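To see what the two sets of ratios mean in pixels, the sketch below derives anchor widths and heights from a scale and a ratio while keeping the anchor area fixed at scale squared. It assumes the convention ratio = width / height and a scale of 32 pixels; both the convention and the scale value are assumptions for illustration and should be checked against the configuration actually used.

```python
import math


def anchor_shapes(scale, ratios):
    """Return (width, height) per ratio, with area equal to scale squared.

    Assumes ratio = width / height.
    """
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in ratios]


scale = 32  # hypothetical anchor scale in pixels

for ratios in [(0.5, 1, 2), (0.3, 1, 3)]:
    print("ratios", ratios)
    for r, (w, h) in zip(ratios, anchor_shapes(scale, ratios)):
        print(f"  ratio {r}: width {w:.1f}, height {h:.1f}")
```

With ratios of 0.3 and 3 the long side of the anchor is about three times the short side instead of twice, which is closer to the elongated shape of an upright or horizontal silique.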