In: Computer Science
Write a full project report: Fruit Detection System Using Neural
Networking. (Follow this structure: Abstract, Introduction, Methodology, Dataset, CNN-TensorFlow, Experimental Results, Conclusion, Future Work, References.)
N.B. If you cannot finish the answer fully here, I will submit the remaining parts (e.g., Dataset, CNN) as separate questions. Please make sure to answer properly, part by part.
FRUIT DETECTION SYSTEM USING NEURAL NETWORKING:
1. Abstract:
The aim of this project is to apply deep neural networks (DNNs) to build a model for fruit detection. The resulting system is an accurate, fast and reliable fruit detector, which is a vital element of an autonomous agricultural robotic platform and a key component for fruit yield estimation and automated harvesting. Recent work in deep neural networks has led to the development of a state-of-the-art object detector termed Faster Region-based CNN (Faster R-CNN). We adapt this model, through transfer learning, to the task of fruit detection using imagery obtained from two modalities: colour (RGB) and Near-Infrared (NIR). Early and late fusion methods are explored for combining the multi-modal (RGB and NIR) information. This leads to a novel multi-modal Faster R-CNN model, which achieves state-of-the-art results compared to prior work, with the F1 score (which takes into account both precision and recall) improving from 0.807 to 0.838 for the detection of sweet pepper. In addition to improved accuracy, this approach is also much quicker to deploy for new fruits, as it requires bounding box annotation rather than pixel-level annotation (annotating bounding boxes is approximately an order of magnitude quicker to perform). The model is retrained to detect seven fruits, with the entire process of annotating and training the new model taking four hours per fruit.
2. Introduction:
Sourcing skilled farm labour in the agriculture industry (especially horticulture) is one of the most cost-demanding factors in that industry. This is compounded by the rising costs of inputs such as power, water for irrigation and agrochemicals, which places farm enterprises and the horticultural industry under pressure with small profit margins. Under these challenges, food production still needs to meet the demands of an ever-growing world population, and this poses a critical problem for the years to come.
Robotic harvesting can provide a potential solution to this problem by reducing the costs of labour (longer endurance and high repeatability) and increasing fruit quality. For these reasons, there has been growing interest in the use of agricultural robots for harvesting fruit and vegetables over the past three decades. The development of such platforms involves numerous challenging tasks, such as manipulation and picking. However, the development of an accurate fruit detection system is a crucial step toward fully-automated harvesting robots, as this is the front-end perception system that precedes the subsequent manipulation and grasping systems; if fruit is not detected or seen, it cannot be picked. This step is challenging due to various factors, among which are illumination variation, occlusions, and cases in which the fruit exhibits a similar visual appearance to the background, as shown in Figure 1. To overcome these challenges, a well-generalised model that is invariant and robust to brightness and viewpoint changes, together with highly discriminative feature representations, is required.
Figure 1. Example images of the detection of two fruits. (a) and (b) show a colour (RGB) and a Near-Infrared (NIR) image of sweet pepper detection, respectively, with detections denoted by red bounding boxes. (c) and (d) show the detection of rock melon.
In this work, we present a rapidly-trained (about 2 h on a K40 GPU) and real-time fruit detection system based on Deep Convolutional Neural Networks (DCNNs) that generalises well to various tasks using pre-trained parameters. It can also be easily adapted to different types of fruit with a minimal number of training images. In addition, we introduce approaches that combine multiple modalities of information (colour and near-infrared images) with early and late fusion. For the evaluation, we present both quantitative and qualitative results compared to previous work. The contributions of this paper are therefore:
- Developing a high-performance fruit detection system that can be rapidly trained with a small number of images, using a DCNN that has been pre-trained on a large dataset such as ImageNet.
- Proposing multi-modal fusion approaches that combine information from colour (RGB) and Near-Infrared (NIR) images, leading to state-of-the-art detection performance.
- Returning our findings to the community through open datasets and tutorial documentation.
To the best of our knowledge, this is the first attempt to fuse RGB and NIR multi-modal images within a DCNN framework for fruit detection. We use standard evaluation metrics, precision-recall curves and the F1 score (i.e., the harmonic mean of precision and recall), to perform extensive evaluations using data collected from three commercial sites, acquired during day and night. This dataset, along with the annotated ground truth imagery and the labelling tool, will be distributed upon the publication of this work to encourage further research in this area.
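Since the F1 score is used throughout the evaluation, a minimal sketch of how it is computed from detection counts is given below; the true/false positive and false negative counts are hypothetical and purely illustrative.

```python
# Minimal sketch: precision, recall and F1 score as used for detection
# evaluation. The counts below (tp, fp, fn) are hypothetical.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for a set of detections."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # e.g., 90 true positives, 15 false positives, 20 missed fruits (hypothetical)
    print(round(f1_score(tp=90, fp=15, fn=20), 3))
```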
3. Methodology:
Fruit segmentation is an essential step in order to distinguish the fruits from the background (leaves and stems). This task is challenging due to variation in fruit colour and illumination, as well as high levels of occlusion.
In this section, we present the state-of-the-art fruit detection system [4], which performs pixel-wise segmentation, against which we compare. We then describe the DCNN approach, Faster R-CNN, which forms the basis of our proposed method. The details behind adapting this model for fruit detection are then given, followed by a description of the fusion methods we propose for this DCNN architecture.
3.1. Fruit Detection Using a Conditional Random Field
In prior work [4], we demonstrated that using a CRF [27] to model colour and visual texture features from multi-spectral images led to impressive performance for sweet pepper segmentation. The multi-spectral images contain both colour (RGB) and Near-Infrared (NIR) information, and the CRF uses both colour and texture features. The colour features are constructed by directly converting the RGB values to the HSV colour space. Visual texture features are extracted from the NIR channel, as it was found to be more consistent than the colour imagery. Three sets of visual texture features are used: (i) Sparse Autoencoder (SAE) features [13]; (ii) Local Binary Patterns (LBP) [28]; and (iii) a Histogram of Gradients (HoG) [29]. These features capture different properties: the distribution of local gradients, edges and texture, respectively. In particular, the LBP feature can capture information such as the smooth surface of sweet peppers and provides an efficient way of encoding visual texture.
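As an illustration of the hand-crafted features described above (HSV colour from the RGB image, LBP and HoG texture from the NIR channel), a minimal Python sketch is given below. It assumes scikit-image as a stand-in for the original MATLAB implementation and omits the SAE features and the CRF itself; the images are random placeholders.

```python
# Sketch of the hand-crafted features described above: HSV colour from RGB,
# LBP and HoG texture from the NIR channel. scikit-image is an assumed
# substitute for the original MATLAB implementation; the CRF and the sparse
# autoencoder features are omitted.
import numpy as np
from skimage.color import rgb2hsv
from skimage.feature import local_binary_pattern, hog

def colour_features(rgb: np.ndarray) -> np.ndarray:
    """Per-pixel HSV colour features, shape (H, W, 3)."""
    return rgb2hsv(rgb)

def texture_features(nir: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """LBP map and a HoG descriptor computed on the NIR channel."""
    lbp = local_binary_pattern(nir, P=8, R=1, method="uniform")
    hog_desc = hog(nir, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), feature_vector=True)
    return lbp, hog_desc

if __name__ == "__main__":
    rgb = np.random.rand(64, 64, 3)                         # placeholder RGB image
    nir = (np.random.rand(64, 64) * 255).astype(np.uint8)   # placeholder NIR channel
    hsv = colour_features(rgb)
    lbp, hog_desc = texture_features(nir)
    print(hsv.shape, lbp.shape, hog_desc.shape)
```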
Although the CRF-based approach yields impressive results, there are two key challenges. First, ground truthing (i.e., pixel-wise image annotation) for training and model evaluation requires labour-intensive work, as shown in Figure 2: pixel-wise annotations took an order of magnitude more time to produce than bounding box annotations (in our experiments, per-pixel segmentation of 5 images took approximately 470 s, whereas bounding box annotation took approximately 43 s). Second, the slow processing time (∼1.5 s/frame using only the LBP feature) of the current MATLAB implementation is a bottleneck for robotic applications, which usually require closed-loop control. We present quantitative comparisons with our newly proposed method in Section 4.2.
Figure 2. Pixel-wise (a) and bounding box (b) image annotation.
3.2. Fruit Detection Using Faster R-CNN
Despite the recent progress made using deep convolutional neural networks on large-scale image classification and detection [5], accurate object detection still remains a challenging problem in the computer vision and machine learning fields. This task requires not only recognising which objects are in a scene, but also determining where they are located. Accurate region proposal algorithms therefore play a significant role in the object detection task.
There are recent works, such as Selective Search [19], which merges superpixels based on low-level features, and EdgeBoxes [20], which makes use of edge information to generate region proposals. However, these methods require as much running time as the detection itself to hypothesise object locations. Faster R-CNN [21] was proposed to overcome this challenge by introducing the Region Proposal Network (RPN), which shares convolutional features with the classification network; the two networks are concatenated into a single network that can be trained and tested end-to-end. As a result, region proposal generation takes only around 10 ms, and the framework can maintain a 5 fps detection rate while achieving state-of-the-art object detection accuracy using very deep models [30].
The Faster R-CNN work of [21] uses colour (RGB) images to perform general object detection. It consists of two parts: (i) a region proposal network; and (ii) a region classifier. The region proposal step produces a set of N_P proposed regions (bounding boxes) in which the object(s) of interest could reside within the image. The region classifier step then determines whether the region belongs to an object class of interest; the classifier can be a 2-class or an N-class classifier. To train the Faster R-CNN for our task, we perform fine-tuning [31]. This requires labelled (annotated) bounding box information for each of the classes to be trained. An example of the bounding boxes required is given in Figure 2.
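For illustration, the sketch below shows one way such bounding box annotations could be represented for fine-tuning, using the torchvision detection target format as an assumed stand-in for the original Caffe/PASCAL VOC-style annotations; the class ids and coordinates are hypothetical.

```python
# Illustrative bounding-box annotation for one training image, in the
# torchvision detection target format (an assumed stand-in for the original
# Caffe/PASCAL VOC-style annotations). Class ids: 0 = background,
# 1 = sweet pepper, 2 = rock melon. Coordinates are hypothetical.
import torch

target = {
    "boxes": torch.tensor([[120.0, 200.0, 180.0, 270.0],   # [x1, y1, x2, y2]
                           [300.0, 150.0, 360.0, 230.0]]),
    "labels": torch.tensor([1, 1]),  # two sweet peppers in this image
}
print(target["boxes"].shape, target["labels"].tolist())
```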
Figure 3 illustrates the test-time detection pipeline. First, regions of interest are generated from the input image, and these are fed into the subsequent convolutional layers. In this paper, we use the VGG-16 model, which has 13 convolutional layers. The RPN produces region proposals using the previously-generated feature map; these proposals highlight the regions that are highly likely to contain an object. The fully-connected layers and the softmax classifier yield n bounding boxes, B_n, and their corresponding class probability scores, P(x_n|B_n).
Figure 3. Illustration of the Faster Region-based Convolutional Neural Network (R-CNN) at test time. There are 13 convolutional layers, 2 fully-connected layers (Fc6 and Fc7) and one softmax classifier layer. N denotes the number of proposals and is set to 300. O_{1:N} is the output containing N bounding boxes and their scores. Non-Maximum Suppression (NMS) with a threshold of 0.3 removes duplicate predictions. B_K is the bounding box of the K-th detection, a 4 × 1 vector containing the coordinates of the top-left and bottom-right points. x_K is a scalar representing the object being detected.
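As a small illustration of the NMS step mentioned in the caption, the sketch below applies an IoU threshold of 0.3 to a few hypothetical boxes, using torchvision.ops.nms as an assumed substitute for the original Caffe implementation.

```python
# Non-Maximum Suppression with an IoU threshold of 0.3, as used above to
# remove duplicate predictions. torchvision.ops.nms is an assumed substitute
# for the original implementation; boxes and scores are hypothetical.
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10.0, 10.0, 60.0, 60.0],
                      [12.0, 12.0, 62.0, 62.0],      # near-duplicate of the first
                      [100.0, 100.0, 160.0, 160.0]])
scores = torch.tensor([0.95, 0.90, 0.80])

keep = nms(boxes, scores, iou_threshold=0.3)
print(keep.tolist())  # indices of the boxes kept, e.g. [0, 2]
```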
Although it is usually challenging, and out of the scope of this paper, to prove why deep convolutional neural networks work well for the object detection task, we present some visual evidence that the network captures discriminative features. Figure 4a is a visualisation of the first convolutional layer of the colour VGG-16 network. This model is designed to use 3 × 3 convolutional kernels (masks) and a 2 × 2 pooling mask throughout its 13 convolutional layers. It can be observed that some filters have reddish and greenish colours that correspond to red and green sweet peppers, while other filters represent edge filters in varying orientations. Figure 4b shows the input data layer and Figure 4c one of the feature maps from the conv5 layer. It can be seen that the regions containing red sweet peppers (cyan boxes) are strongly activated, and this information is highly useful for the RPN and the subsequent classification process.
Figure 4. (a) The 3 × 3 (pixel) conv1 filters (64 in total) of the RGB network from VGG; (b) the input data; and (c) one of the feature activations from the conv5 layer. The cyan boxes in (b) are manually labelled in the data input layer to highlight the fruits corresponding to the feature map.
We also perform a further investigation for visual proof by visualising a high-dimensional feature space. It has been shown that the output of the fully-connected layer can be used as a feature representation for classification tasks [17], and here we show how discriminative these features are.
4096-dimensional feature vectors are extracted from the fully-connected layer 7 (fc7) and fed into the t-Distributed Stochastic Neighbour Embedding (t-SNE) algorithm [32] together with their corresponding labels. t-SNE is a popular dimensionality reduction method that measures pairwise neighbouring similarities, using the L2 norm distance, in both the high- and low-dimensional spaces. The pairwise similarities are calculated around the sample points, and the Kullback–Leibler divergence is used to gauge the distance between the two probability distributions (i.e., the similarities in high and low dimensions). Stochastic Gradient Descent (SGD) minimises this distance so as to preserve the local structure in the low-dimensional space. Figure 5 shows the low-dimensional (2D) feature visualisation using t-SNE. Each point represents a feature, and its colour the corresponding label. It is obvious that sweet peppers (green) and rock melons (blue) are highly distinguishable from each other and from the background (red). This figure also suggests that good detection results can be expected given a reasonable classifier.
Figure 5. t-SNE feature visualisation of 3 classes. The 4096-dimensional features are extracted from the fc7 layer and visualised in 2D. For the visualisation, 86 images are randomly selected from the dataset and processed by the network shown in Figure 3.
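The t-SNE visualisation described above could be reproduced along the lines of the sketch below, which uses scikit-learn and matplotlib as assumed libraries; the fc7-style features and labels here are random placeholders rather than real network activations.

```python
# Sketch of the t-SNE visualisation described above: project 4096-dimensional
# fc7-style features to 2D and colour points by class. The features are
# random placeholders; scikit-learn and matplotlib are assumed libraries.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(86, 4096))          # placeholder fc7 features
labels = rng.integers(0, 3, size=86)            # 0=background, 1=pepper, 2=melon

embedding = TSNE(n_components=2, perplexity=20, init="pca",
                 random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="brg", s=15)
plt.title("t-SNE of fc7 features (illustrative)")
plt.savefig("tsne_fc7.png")
```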
Note that the key contribution of this model (VGG-16) is in demonstrating that the depth of the network plays a significant role in detection performance; despite its slightly inferior classification power, the features generated by its architecture outperform those of other state-of-the-art networks, such as AlexNet [17], ZF [33] and GoogLeNet [34]. It is, therefore, the most popular choice in the computer vision and machine learning communities for the front-end feature extraction module at the time of writing. Faster R-CNN also makes use of these feature maps as guidance for where to look. We present how to train the VGG-16 network and deploy it for fruit detection in the following section.
3.3. DeepFruits Training and its Deployment
Our data are multi-modal, consisting of colour (RGB) and NIR images, and so we fine-tune (adapt) the Faster R-CNN for each modality independently. Fine-tuning consists of updating, or adapting, the model parameters using the new data. In practice, this involves initialising a new classification layer and updating all of the layers of both the region proposal and classification networks. The classification network uses the same architecture as VGG [30], as this provided the best performance.
The VGG network configuration used (Configuration D) consists of 13 convolutional layers followed by two fully-connected layers and is referred to as VGG-D. The original implementation of Faster R-CNN was fine-tuned on the PASCAL VOC dataset (20 objects, 11 k images and 27 k annotated objects), with the network initialised from weights pre-trained on the ImageNet dataset, which consists of 1000 object categories, 1.2 million images and their bounding box annotations [5]. This means that we must fine-tune the network again using our custom data; otherwise, Faster R-CNN can only detect the 20 ordinary object classes on which it was trained, such as aeroplane, bicycle, bird, cat, dog, and so on. By fine-tuning, we can make use of features learned from a large-scale dataset, which generalise well to various visual recognition tasks.
Given the VGG-16 network, we define three classes named ‘background’, ‘sweet pepper’ and ‘rock melon’ and fine-tune the network. Regarding this fine-tuning topic [35], abundant resources are available online [36], and we have also made our implementation and tutorial document publicly available [6].
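A hedged sketch of this fine-tuning step is given below. It uses torchvision's ResNet-50 FPN Faster R-CNN as a stand-in for the original VGG-16/Caffe implementation, replaces the classification head with a 3-class predictor and runs one dummy training step; the image and target are placeholders for the annotated fruit data.

```python
# Hedged sketch of fine-tuning a pre-trained Faster R-CNN for the three
# classes ('background', 'sweet pepper', 'rock melon'). torchvision's
# ResNet-50 FPN detector is used here as a stand-in for the original
# VGG-16/Caffe implementation described in the text.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 3  # background, sweet pepper, rock melon

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")           # pre-trained detector
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# One (dummy) training step; in practice the images and targets would come
# from the annotated fruit dataset.
model.train()
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[100.0, 100.0, 200.0, 220.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
print({k: round(v.item(), 3) for k, v in loss_dict.items()})
```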
Table 1 shows the number of training images used by the CRF and Faster R-CNN for the performance evaluation. We can only use a relatively small number of images due to the limited pixel-wise image annotation datasets available from [4]. For a fair comparison, the same training and testing images are utilised, and the experimental results are presented in Section 4.2. We also conduct further experiments, increasing the number of classes and training images to detect another fruit and to demonstrate the generalisation ability of the approach.
Table 1. Number of images used for training and testing for CRF and Faster R-CNN.
After training, we deploy the trained fruit detector on a laptop that has an Intel i7 64-bit 2.90 GHz quad-core CPU, a GeForce GTX 980M 8 GB GPU (1536 CUDA cores) and 16 GB of memory, running Ubuntu 14.04 Linux. Input images are obtained from a multi-spectral camera, the JAI AD-130GE, and a Microsoft Kinect 2, with resolutions of 1296 × 964 and 1920 × 1080, respectively. Detection takes an average of 341 ms with a 4 ms standard deviation for the JAI images and 393 ms with a 3 ms standard deviation for the Kinect 2 images. The gap in processing time is caused by an external library used for reading images of different resolutions.
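Per-frame latency statistics like those above could be measured along the lines of the sketch below; the detect function and frame sizes are placeholders, and actual timings depend entirely on the hardware and implementation used.

```python
# Small sketch of measuring per-frame detection latency (mean and standard
# deviation), as reported above. `detect` is a placeholder for the trained
# detector, and the frames are random stand-ins for camera images.
import time
import numpy as np
import torch

def detect(frame: torch.Tensor) -> None:
    """Placeholder for the trained Faster R-CNN forward pass."""
    time.sleep(0.01)  # simulate inference

frames = [torch.rand(3, 964, 1296) for _ in range(20)]  # JAI-sized frames
latencies = []
for frame in frames:
    start = time.perf_counter()
    detect(frame)
    latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds

print(f"mean {np.mean(latencies):.1f} ms, std {np.std(latencies):.1f} ms")
```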
3.4. Multi-Modal Fusion
In the previous section, we introduced the proposed fruit detection approach using the Faster R-CNN framework; here, we present the two methods, late and early fusion, that we use to combine the multi-modal (RGB and NIR) imagery that we have. Late fusion combines the classification decisions from the two modalities. Early fusion alters the structure of the input layer of the VGG network so that 4 channels, rather than 3, are provided.
3.4.1. Late Fusion
Late fusion combines the classification information from the two modalities, colour and NIR imagery. Using the independently-trained models for each modality (see Section 3.2), we combine the classification information in the following manner.
Each modality m produces N_{m,P} region proposals. To combine the two modalities, these region proposals are pooled to form a single set of N*_P = M × N_{m,P} region proposals, where M is the number of modalities. A score s_{m,p} is then produced for the p-th proposed region by the m-th modality, and a single score for the p-th region is obtained by combining the responses across the modalities,
s_p = ∑_{m=1}^{M} s_{m,p}        (1)
The score is a C-dimensional variable, where C is the number of classes to be classified.
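A minimal numpy sketch of this late fusion rule is given below. It assumes, for simplicity, that every pooled region is scored by both modality networks; the boxes and scores are random placeholders, and details of the original implementation may differ.

```python
# Minimal numpy sketch of the late fusion rule described above: proposals from
# the RGB and NIR networks are pooled into a single set of 2N regions, and each
# pooled region receives a single C-dimensional score by summing the per-class
# scores from both modalities (Equation (1)). Boxes and scores are random
# placeholders, and re-scoring every pooled region by both networks is a
# simplifying assumption.
import numpy as np

N, C = 300, 3                                   # proposals per modality, classes
rng = np.random.default_rng(0)

boxes_rgb = rng.uniform(0, 640, size=(N, 4))    # proposals from the RGB network
boxes_nir = rng.uniform(0, 640, size=(N, 4))    # proposals from the NIR network
pooled_boxes = np.vstack([boxes_rgb, boxes_nir])         # 2N x 4 pooled regions

scores_rgb = rng.random(size=(2 * N, C))        # s_{RGB,p} for each pooled region
scores_nir = rng.random(size=(2 * N, C))        # s_{NIR,p} for each pooled region
fused_scores = scores_rgb + scores_nir          # s_p = sum over modalities (Eq. 1)

print(pooled_boxes.shape, fused_scores.shape)   # (600, 4) (600, 3)
```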
3.4.2. Early Fusion
Early fusion alters the structure of the input layer of the VGG network so that the input data layer has N_c = 4 channels (3 channels from RGB and 1 channel from NIR), rather than N_c = 3. The VGG network is modified and adapted to receive RGB and NIR information simultaneously; an overview is provided in Figure 6. To achieve this, we duplicate the R-channel weights of the VGG-D network and use them to initialise the extra NIR channel; the R channel (620–750 nm) is chosen as it is closest to the NIR channel's wavelength range (750–1400 nm). This early fusion network is then fine-tuned as previously described.
Figure 6. A diagram of the early and late fusion networks. (a) The early fusion network, which concatenates a 1-channel NIR image with a 3-channel RGB image; (b) the late fusion network, which stacks the outputs, O_{RGB+NIR,1:2N}, from two Faster R-CNN networks. O_{RGB,1:N} and O_{NIR,1:N} represent the outputs containing N = 300 bounding boxes and their scores from the RGB and NIR networks, respectively. K is the number of objects detected. Note that the Faster R-CNNs of the early fusion are identical to that of Figure 3.
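A hedged PyTorch sketch of the early fusion input modification is shown below: the first convolutional layer of a pre-trained VGG-16 is widened to 4 input channels, and the new NIR channel is initialised by copying the R-channel weights, as described above. torchvision's VGG-16 is an assumed stand-in for the original Caffe VGG-D model.

```python
# Hedged sketch of the early fusion input modification: widen the first
# convolutional layer of a pre-trained VGG-16 to accept 4 channels
# (RGB + NIR), initialising the new NIR channel by copying the weights of
# the R channel. torchvision's VGG-16 is an assumed stand-in for the
# original Caffe VGG-D model.
import torch
import torch.nn as nn
from torchvision.models import vgg16

model = vgg16(weights="DEFAULT")
old_conv = model.features[0]                      # Conv2d(3, 64, kernel_size=3)

new_conv = nn.Conv2d(4, old_conv.out_channels, kernel_size=3, padding=1)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight              # copy RGB filters
    new_conv.weight[:, 3:4] = old_conv.weight[:, 0:1]     # NIR init from R channel
    new_conv.bias.copy_(old_conv.bias)

model.features[0] = new_conv

rgbnir = torch.rand(1, 4, 224, 224)               # dummy 4-channel (RGB + NIR) input
features = model.features(rgbnir)
print(features.shape)                             # torch.Size([1, 512, 7, 7])
```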
Note: Kindly submit the next part as a separate question, as I am not able to complete the entire report here due to the word limit.