
Employment of low-cost low-power ARM machines as tracking device for real time vehicle movement

Bachelor Thesis 2013, 64 Pages

Computer Science - Technical Computer Science

Excerpt

Table of Contents

1. Introduction
1.1 Motivations behind this project
1.2 Background
1.3 Aims and Objectives
1.4 Brief Risk Assessment and Contingency Planning
1.5 Content Summary

2 Literature Review
2.1 Background Modelling
2.1.1 Static Reference Images
2.1.2 Background Learning
2.2 Blob Detection
2.3 Object Tracking
2.4 Optimization

3 Methodology
3.1 Tools
3.1.1 Hardware
3.1.2 Software
3.2 Methods
3.2.3 Blob Detection
3.2.4 Tracking

4 Implementation
4.1 Development OS
4.2 Software Methodology
4.2.2 Methodology Evaluation
4.3 Background Modelling
4.3.1 Frame Differencing algorithm
4.3.2 Median Filter
4.3.3 Grimson Mixture of Gaussian Modelling
4.4 Blob Extraction
4.5 Tracking & Counting

5 Results
5.1 Plans for testing
5.2 Results
5.2.1.A Process Latency
5.2.1.B Higher Resolution Latency Testing
5.2.2 Tracking accuracy
5.2.3 24 Hour Stress Test

6 Conclusion and reflection
6.1 Major Development Problems
6.2 Future Research
6.3 Summation of the Artefact

References

Appendix

Appendix - Results from regional tracking prototype

1. Introduction

In this chapter, the overall context of the project is presented, along with a brief background on the realisation and development of computer vision systems, followed by the project's aims and objectives.

1.1 Motivations behind this project

Due to the computationally expensive nature of computer vision systems, many autonomous surveillance and monitoring systems rely upon a distribution of cameras sending data to a central supercomputer. This supercomputer simultaneously performs calculations on these streams of data, locally storing or outputting these results. Whilst impressive demonstrations of such systems have been exhibited, they are fundamentally restricted by two factors: they depend entirely on the continued operation of the supercomputer, and their capacity to expand is finite. The analogy proposed by Grimson & Stauffer (1998) of a “forest of sensors” can only work if processing is performed locally on the camera. In such a system the central machine could be a low-powered laptop receiving post-processed data, enabling almost infinite scalability. Furthermore, if a node were to go offline, even the central node, the rest of the system would remain operational.

1.2 Background

Gilbert (1980) introduces the idea of a Real Time Video (RTV) tracking system, explaining the computational restrictions with regard to object recognition frame rates, analogue-to-digital conversion and resolution, and concluding that “A great deal of work must be done before a truly versatile video analysis system can be built”. A couple of years later Gennery (1982), on a 2x2x1 metre computer, managed a rudimentary object tracking system over a finite field of interest using a simple, known three-dimensional object, taking 3.2 seconds to process each image.

As early as 1989, academic discussions entertained the concept of cameras paired with computer vision systems monitoring vehicles on roads, evaluating the safety concerns of relying on computer vision systems (Hitchcock 1989). Such systems were first realised by Koller et al (1991): using the central band of colours in a grayscale video feed, displacement vectors were calculated from the distances between key pixels across adjacent frames, taking just over 46 seconds per image. Feature-based tracking was developed further, and is best described by Sonka et al (1993). Huang et al (1993) and Koller et al (1994) demonstrate the application of the Kalman filter in real time vehicle tracking.

Using an inexpensive entry-level UNIX workstation, Grimson et al (1998) created a system capable of learning its background based on the colour patterns of each pixel in previous frames. The system demonstrated a comprehensive ability to adapt to the changing lighting conditions found in an external environment, enabling rugged foreground separation and blob detection, and performing at a rate of 7 frames per second, classifying it as real time.

Cucchiara (2003) looks at a novel approach to background subtraction, addressing issues with shadow and ghost removal. Zivkovic (2004) furthers noise reduction and methods of false positive suppression. Goyat (2006) looked at tracking the speed and trajectories of vehicles, using the full RGB colour space at a resolution of 640x480 and achieving a frame rate of 30fps.

In recent years the incorporation of computer vision systems has steadily increased, leading to a broad spectrum of applications. “Examples of computer vision applications include augmented reality, vehicle lane detection systems, and vision-based video game controllers,” explains Clemons et al (2011), justifying this by stating that “the number of computer vision applications has been growing steadily as the underlying algorithms have become more efficient and robust”. Kisacanin et al (2005) accredits this to “Moore's Law improvements in hardware, advancements in camera technology, and the availability of useful software tools”, leading to “small, flexible and affordable vision systems.”

1.3 Aims and Objectives

Kandhalu et al. (2009) explain that due to improvements in camera technology, twinned with a great increase in affordability, “video cameras are now being installed at an unprecedented pace” in both public and private environments. However, they go on to comment that the continual watching of multiple video streams by human operators can be “neither reliable nor scalable”. With this in mind, and given the computationally intensive nature of computer vision systems, the goal of this project is to provide an answer to the following question:

“Can a low-cost, low-power ARM machine be employed to track the real time movement of vehicles?”

The principal aim of this project is to answer the above question by demonstrating that a Raspberry Pi has the computational capacity to emulate an embedded system and run modern background subtraction algorithms at a fraction of the usual cost. Developed by a team led by Dr. Eben Upton at the University of Cambridge, the Raspberry Pi was designed primarily to encourage children into computer programming at an early age. In an interview on CNN news, Upton (2012) explains that, having seen a decline in computer science admissions, the Raspberry Pi's original purpose was an “attempt to reboot some of that 1980's feel that was responsible for giving us that stream of very talented students”. The driving force behind this ambition was cost: the original Raspberry Pi “Model A” retailed for just $25. When Upton (2012) spoke at a TED talk he explained that the Raspberry Pi was an attempt to give every child a computer, and thus the price had to be low enough that a child could buy one themselves, it being “the usual price for a textbook”. Due to its high specification-to-cost ratio it caught the attention of a much larger group: Upton & Jones (2012) explain that optimistic sales expectations of 10,000 units were exceeded tenfold as first day sales reached over 100,000 units, the larger group responsible being hackers, programmers and computer hobbyists testing the possibilities provided by such a small, affordable device. In accordance with the main project goals, an attempt shall be made to:

- Establish and install an optimally performing operating system.
- Install the required languages and libraries.
- Work from sample videos.
- Establish a stable camera feed for the Raspberry Pi.
- Perform background subtraction and foreground separation, as well as noise reduction and morphology.
- Apply tracking algorithms capable of tracking over many frames.
- Keep track of multiple and occluded vehicles.
- Count the vehicles.
- Maintain stability over long durations.

In brief, the main system is required to monitor a real time camera stream of a road, distinguishing any moving entities from the static background, be they pedestrians or vehicles. The direction of these entities must be recognised and tracked, counting the vehicles travelling in each direction. During the conception, creation and testing of this software, special attention will be paid to properly observing the Rapid Application Development (RAD) process (further discussion of its application can be found in section 4.2).

1.4 Brief Risk Assessment and Contingency Planning

Throughout every stage of this project, as with all projects, a number of unavoidable risks could occur, the manifestation of which could be detrimental to the performance and progression of the project. General risks such as poor time management, the loss of data, or illness have the capacity to hamper the advancement of the project; as such their impact must be understood and relevant contingency plans decided. An in-depth contingency plan can be viewed in Appendix 1.

1.5 Content Summary

This project will be broken up into six chapters.

Chapter I Introduction

Introduces the rationale of the project, presenting a background to computer vision systems, as well as listing the aims and objectives.

Chapter II Literature Review

Examines relevant academic literature in the disciplines of computer vision and background subtraction.

Chapter III Methodology

Covers the project methodology, evaluating the tools and methods chosen for the development.

Chapter IV Implementation

Describes the steps taken during the implementation of the project, examining their requirements in attaining the aims and objectives.

Chapter V Testing

Provides an evaluation of the final system in terms of meeting the aims and objectives originally set out.

Chapter VI Conclusion & Reflection

Presents an overall conclusion relating to the experience drawn from the work conducted during this project, as well as improvements and possible further work.

2 Literature Review

2.1 Background Modelling

A fundamental process undertaken by many stationary-camera computer vision systems is foreground-background separation. Many methods for this have been proposed; however, effective algorithms for modelling the background have been difficult to demonstrate. The following sections look at these.

2.1.1 Static Reference Images

Davies et al (1995) describe a system to monitor crowd behaviour in “semi-confined spaces”, introducing the idea of using a static 'background only' reference image (Figure 1, A) as an efficient template for deducing non-stationary objects. Using a fixed 256-level grey camera, they targeted a large open area of Liverpool station. Such a solution was enabled both by the stationary nature of their camera and the low lighting fluctuations of an internal environment, meaning the background would remain fairly regular. Each pixel of an input frame (Figure 1, B) is compared against its respective background pixel, and where the variance exceeds a predefined threshold a mask is created (Figure 1, C). Finally, an edge detection algorithm was used in an attempt to separate individuals (Figure 1, D).

A new frame enters the system at intervals of 10 seconds. After a mask of the crowd is established, a comparison is made between the mask's size and a manual count of the actual number of people in the scene. This process is repeated 150 times to improve accuracy.

illustration not visible in this excerpt

Fig. 1 A | Reference Image B | Real Image C | Mask subtraction D | Edge detection

(Source: Crowd monitoring using image processing, 1995)
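The core of such a static-reference system is a per-pixel absolute difference against the stored background, thresholded into a binary mask. The following is a minimal sketch of that step, assuming single-channel 8-bit images; the function name and threshold value are illustrative, not taken from Davies et al:

```python
import numpy as np

def foreground_mask(frame, background, threshold=25):
    """Static reference image subtraction: absolute per-pixel
    difference against the stored background, thresholded to a
    binary mask (1 = foreground, 0 = background)."""
    # cast to a signed type so the subtraction cannot wrap around
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return (diff > threshold).astype(np.uint8)
```

In Davies et al's system the area of this mask is then compared to a manual head count; the same masking step underlies most of the learning-based models discussed below, which differ only in how the background image is obtained.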

Whilst boasting impressive performance, characteristically dynamic scenes rarely rely upon static reference image background modelling techniques, since over time they can easily fall out of sync with the real background after the simple act of introducing new background objects.

2.1.2 Background Learning

2.1.2.1 Temporal Median

Lo et al (2001) made use of static fixed cameras on the London Underground, designing and developing a system to measure congestion autonomously and alert a controller if a threshold level was reached. Whilst their system was indoors (underground), brightness variance became an early issue for background modelling due to reflective trains arriving into the station. To combat this they proposed using the morphological operator of a variance filter to blur the intensity of the image and reduce noise, explaining that “brightness variation has negligible affect on the variance filter”.

Such an algorithm establishes the mean value for a 3x3 grid, assigning to the centre pixel the sum of the squared differences between the perimeter pixels and that mean. This filter was applied to their “87 training samples”, which were in turn stored in memory. Comparisons with these photos were then made against 6,400 test frames. The resultant system broke crowding levels into 4 categories (see Figure 2); the average accuracy for the entire system was just 79.875%.

illustration not visible in this excerpt

Fig. 2 Frame difference Equation

The reliability of this system is therefore fundamentally flawed: for an indoor, fixed-lighting system to produce a false positive one time out of five, clear improvements can be made. It is also very memory intensive, requiring (frame size) x (n training frames) worth of free memory for the background modelling alone.

Cucchiara et al (2001) also demonstrate the use of variance filters, examining their use for background suppression in designing and developing a general purpose motion tracking system called SaKbOT (Statistic and Knowledge-based Object Tracker). As with Lo et al (2001), Cucchiara is quick to address the difficulty of creating reliable background images under lighting fluctuations, noting the “typical trade off” between “high responsiveness” and “a reliable background”. They observe two other computer vision complexities: shadows, and objects belonging to the background that “start moving”, known as “ghosts”. To overcome these difficulties their proposed solution is twofold:

To detect ghosts they first attempt to detect any object. To do this their algorithm applies a variance filter to the previous 'n' frames, disclosing that “typical parameters used in SaKbOT are n = 7 for the length of the sequence of previous frames”; as their system runs at 10fps, that enables a background model covering just under 1 second. The system then compares the colour space values of the current pixel to the median of the past 'n' frames, and by thresholding it is able to distinguish both MVOs (moving visual objects) and 'ghosts'. In addition, to add confidence to the background model, the pixel values of MVOs are excluded from it.
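The temporal-median part of this scheme can be sketched as below. This is a simplified, grayscale illustration of the principle (per-pixel median over the last n frames), not SaKbOT itself; in particular it omits the variance filter and the exclusion of MVO pixels from the model, and the class name and threshold are my own:

```python
import numpy as np
from collections import deque

class MedianBackground:
    """Temporal median background model sketch: keep the last n
    frames (n = 7 as in SaKbOT) and take the per-pixel median as the
    background; pixels far from the median are flagged foreground."""
    def __init__(self, n=7, threshold=30):
        self.frames = deque(maxlen=n)   # rolling window of frames
        self.threshold = threshold

    def apply(self, frame):
        self.frames.append(frame.astype(np.float64))
        background = np.median(np.stack(self.frames), axis=0)
        mask = np.abs(frame - background) > self.threshold
        return mask.astype(np.uint8), background
```

Because a moving object occupies any given pixel for only a minority of the n frames, the median naturally ignores it, which is what lets the model separate MVOs from the stable background.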

Shadow detection works in a similar way, achieved by establishing where a detected object's “appearance” closely resembles the colours and luminance of the modelled background; e.g. if a shadow is cast upon a red wall it will appear as a darker red, and so forth. To distinguish between MVO shadows and ghost shadows, the algorithm checks whether a shadow and an MVO are close to or connected to one another. Figure 3 shows a flow chart of this process.

illustration not visible in this excerpt

Fig. 3 Algorithm Flowchart
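The "darker version of the same colour" test at the heart of this shadow check can be sketched as a per-channel ratio against the modelled background. This is an illustrative simplification under my own assumptions (RGB tuples, hypothetical ratio band), not Cucchiara's exact formulation:

```python
def is_shadow(pixel, bg_pixel, low=0.4, high=0.9):
    """Shadow test sketch: a pixel is a candidate shadow when it is a
    uniformly darkened version of the modelled background, i.e. every
    channel keeps roughly the same ratio to the background value,
    and that ratio falls in a darkening band [low, high]."""
    ratios = [p / max(b, 1) for p, b in zip(pixel, bg_pixel)]
    return all(low <= r <= high for r in ratios)
```

A shadow on a red wall (background around (200, 40, 40)) darkens all three channels by a similar factor, so it passes; an actual object usually changes hue or brightens some channel, so at least one ratio falls outside the band.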

Experimenting with the video feeds shown in Figure 4, Cucchiara was able to establish the centroid pixels of MVOs to an accuracy 4.6 pixels greater with shadow detection than without. In review, SaKbOT is a very robust system for monitoring the movement of unknown objects with a static camera; it handles light fluctuations due to the variance functions and is capable of avoiding false positives, with accuracy improved through shadow detection, leading to good tracking. A visible shortcoming of this system, however, is that because MVO detection is excluded from the background model, the model can never update to reflect an object becoming static, relying on the principle that MVOs by definition stay moving. This will cause problems if, for example, a car parks, a door opens, or a bright light turns on.

2.1.2.2 Clustering

Benezeth et al. (2008) point out that the “basic inter-frame difference with a global threshold is often a too simplistic method” for attaining a reliable background model. Fukunaga et al (1975) describe a novel solution known as the “Mean Shift” algorithm, a process for categorising segments of images not by how much they differ from the background but by mapping homogeneous regions of pixels that follow similar colour patterns or textures. As humans, we have the ability to distinguish and recognise shapes, even objects foreign to us; the purpose of such an algorithm is to support computer vision systems in a similar feat. The clustering segmentation occurs in iterations: a pre-defined kernel sweeps over the entire image and begins to group regions together, and each iteration smooths noise levels and lessens the number of clusters in the image. This smoothing can be seen in Figure 5. The loop will break either after a certain number of iterations has occurred or when the number of clusters reaches a predefined value. Figure 6 shows the output from a mean-shift where the desired number of clusters was three; a clear transformation can be seen from the scattered image data of A to the uniform nature of B.

Figure 5 Clustering distribution

illustration not visible in this excerpt
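The iterative grouping described above can be sketched in one dimension, which is enough to show the mechanism: each point repeatedly moves to the mean of the data within a kernel bandwidth, so points drift toward the modes of the distribution. This is a minimal mode-seeking illustration under my own parameter choices, not a full image-segmentation implementation:

```python
import numpy as np

def mean_shift_1d(values, bandwidth=10.0, iterations=20):
    """Mean-shift sketch on 1-D intensity values: each point is
    repeatedly replaced by the mean of all data points within
    `bandwidth` of it, so points converge onto cluster modes."""
    points = np.asarray(values, dtype=np.float64).copy()
    data = points.copy()  # the fixed data the kernel averages over
    for _ in range(iterations):
        for i, p in enumerate(points):
            neighbours = data[np.abs(data - p) < bandwidth]
            points[i] = neighbours.mean()
    return points
```

On an image the same idea runs in a joint spatial-plus-colour space, which is why each iteration both denoises and merges neighbouring regions.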

Due to the nature of this algorithm it is described as a “low level” computer vision task, one upon which further computation would rely. A problem with the mean-shift algorithm proposed by Fukunaga is its reliance upon user input to configure the algorithm for a given image. Comaniciu et al (1999) explain that “Today, it is an accepted fact in the vision community that the execution of low level tasks should be task driven, i.e., supported by independent high level information.”, and propose an improvement upon the algorithm.

2.1.2.3 Gaussian Distribution

Wren et al (1997) propose a probabilistic subtraction approach that monitors each pixel individually. Recognising flaws with the previously mentioned pixel clustering algorithms, they state that extracting blob contours based upon pixel visual coherence has been “proven unreliable and difficult to find and use”, observing that small lighting changes and colour oscillations are introduced naturally into the background image. Their system, “Pfinder” (people finder), monitored individual pixels, modelling each as a Gaussian distribution. By doing so, a rugged assessment could be made of which colours most commonly existed within each pixel, allowing for more accurate thresholding and handling some light flickering. Whilst this greatly improved on the background modelling of the time, it fell short for colour patterns that belonged to the background but were missed by the singular nature of the solo Gaussian: if the pixel value of the background changed colour, even in a regular formation such as that produced by a blinking light, it would be deemed part of the foreground.
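A Pfinder-style single Gaussian per pixel can be sketched as a running mean and variance, with pixels more than k standard deviations from their mean flagged as foreground. This is an illustrative grayscale sketch with hypothetical parameter values, not Wren et al's full colour-space formulation:

```python
import numpy as np

class PixelGaussian:
    """Single Gaussian per pixel (Pfinder-style sketch): running mean
    and variance, updated with learning rate alpha; a pixel more than
    k standard deviations from its mean is foreground."""
    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full_like(self.mean, 100.0)  # initial variance guess
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        frame = frame.astype(np.float64)
        d = frame - self.mean
        foreground = d ** 2 > (self.k ** 2) * self.var
        # update the model only where the pixel matched the background
        self.mean = np.where(foreground, self.mean, self.mean + self.alpha * d)
        self.var = np.where(foreground, self.var,
                            (1 - self.alpha) * self.var + self.alpha * d ** 2)
        return foreground.astype(np.uint8)
```

The shortcoming described above is visible here: a pixel that alternates between two distinct colours (a blinking light) can never fit inside one mean-variance pair, so one of its states is permanently misclassified as foreground.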

Stauffer and Grimson (1999) discuss the inherent difficulty of such noise and false positives with regard to background modelling. Entities such as “branches rustling in the wind”, which to the human eye are perceived as static, have the ability to disrupt a clear model of the background. Building upon attempts made by Friedman & Russell (1997), they improved the single Gaussian ideology by modelling each pixel as a “Mixture of Gaussians”. By doing so, a single background pixel could represent one or more colours, creating a truly rugged background modelling system, recognised across the computer vision field as “Grimson's algorithm”.

Their paper describes a robust system, similar to that of this project, with the capacity to be placed in any location, both indoors and out. Their ambition is for each sensor to encase “on-board computational power, local memory, communication capability and possibly locational instrumentation (e.g. GPS)”. Local processing would enable processed data to be sent back to the aforementioned relatively low specification central machine.

KaewTraKulPong and Bowden (2001) highlight a shortcoming of the Grimson algorithm in that the returned foreground entities are larger than the objects themselves due to the inclusion of shadows. Advancing the algorithm, they enable online shadow suppression. In addition, KaewTraKulPong observed slow update times for large changes in lighting conditions, such as those caused by the movement of clouds or a newly introduced light source, an issue addressed by Lee (2005), who managed to create a state system able to rapidly adapt to changes in light levels. Further improvements come from Zivkovic and Heijden (2005), who compare two previously proposed algorithms, enabling a probabilistically based background update loop.
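The Grimson-style mixture update for a single pixel can be sketched as follows. This is a heavily simplified scalar-intensity illustration under my own parameter choices: it keeps K Gaussians per pixel, updates the matched one, and replaces the weakest when nothing matches, but it omits the full algorithm's weight/variance ranking of background components and the shadow and adaptation refinements discussed above:

```python
class MixturePixel:
    """One pixel modelled as a mixture of K Gaussians (Grimson-style
    sketch): each Gaussian has a weight, mean and variance. A matched
    Gaussian is updated; no match replaces the weakest one."""
    def __init__(self, k=3, alpha=0.05, match_sigma=2.5):
        self.k, self.alpha, self.ms = k, alpha, match_sigma
        self.weights = [1.0 / k] * k
        self.means = [0.0] * k
        self.vars = [900.0] * k

    def update(self, x):
        matched = None
        for i in range(self.k):
            if (x - self.means[i]) ** 2 < (self.ms ** 2) * self.vars[i]:
                matched = i
                break
        if matched is None:
            # no Gaussian fits: replace the least-weighted with a wide new one
            i = self.weights.index(min(self.weights))
            self.means[i], self.vars[i], self.weights[i] = x, 900.0, 0.05
            matched = i
            is_background = False
        else:
            # background = the matched Gaussian carries high weight
            is_background = self.weights[matched] > 0.5
            d = x - self.means[matched]
            self.means[matched] += self.alpha * d
            self.vars[matched] = (1 - self.alpha) * self.vars[matched] + self.alpha * d * d
        for i in range(self.k):
            self.weights[i] = (1 - self.alpha) * self.weights[i] + self.alpha * (1.0 if i == matched else 0.0)
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
        return is_background
```

Because several Gaussians can simultaneously carry weight, a pixel that alternates between two regular colours (the blinking-light case that defeats a single Gaussian) can hold one Gaussian per colour and still be classified as background.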

2.2 Blob Detection

Jung et al. (1991) explain a system for detecting the connected components in a binary mask, producing an array of entities they describe as blobs. Once the position of a blob has been established, further properties such as the number of connected pixels, width, height, mean colour, location and circularity can be established.
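Connected-component labelling of this kind can be sketched with an iterative flood fill. The sketch below is a minimal 4-connectivity illustration under my own naming, returning for each blob the pixel count (area) and bounding box from which width and height follow; production systems typically use optimised library routines instead:

```python
import numpy as np

def label_blobs(mask):
    """Connected-component labelling sketch (4-connectivity) via
    iterative flood fill on a binary mask. Returns the label image
    and, for each blob, its area and bounding box (x0, y0, x1, y1)."""
    labels = np.zeros(mask.shape, dtype=int)
    blobs = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not labels[sy, sx]:
                label = len(blobs) + 1
                stack, pixels = [(sy, sx)], []
                labels[sy, sx] = label
                while stack:  # flood fill this component
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                            labels[ny, nx] = label
                            stack.append((ny, nx))
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                blobs.append({"area": len(pixels),
                              "bbox": (min(xs), min(ys), max(xs), max(ys))})
    return labels, blobs
```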

2.3 Object Tracking

Kalman (1960) proposes a novel probability-based approach for estimating the location of an entity based upon its previous movement and locations, making it an ideal solution for object tracking. The Kalman filter was developed to reduce randomness and noise in signals, replacing them with a non-random representation of the data. A recursive clock logs the location of a signal and the location of the same signal on the next clock cycle, creating a vector; evaluating this vector creates probable regions in which to search for the signal in the next frame. For the first few iterations this is likely to be unreliable; however, after several iterations the “growing memory” creates a high degree of search accuracy. Taking the actual location of a signal and the assumed location based upon the Kalman filter, a more precise, noise-reduced location can be determined.
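The predict-then-correct cycle described above can be sketched for one coordinate with a constant-velocity motion model. This is a textbook-style sketch with hypothetical noise parameters, not the configuration used in any of the cited systems; for tracking a blob on screen one such filter would run per coordinate:

```python
import numpy as np

class Kalman1D:
    """Constant-velocity Kalman filter sketch for one coordinate:
    state = [position, velocity]. Each step predicts the next state
    from the motion model, then corrects it with the measured position."""
    def __init__(self, q=1e-3, r=1.0):
        self.x = np.zeros(2)                          # state estimate
        self.P = np.eye(2) * 500.0                    # state covariance (uncertain start)
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])   # motion model: pos += vel
        self.H = np.array([[1.0, 0.0]])               # we observe position only
        self.Q = np.eye(2) * q                        # process noise
        self.R = np.array([[r]])                      # measurement noise

    def step(self, z):
        # predict: project the state and its uncertainty forward
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # correct: blend prediction and measurement by their uncertainties
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + (K @ (np.array([z]) - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]
```

The "growing memory" effect is visible in P: it starts large (trust the measurements), shrinks as evidence accumulates, and the velocity estimate it yields is exactly the displacement vector used to predict where to search next frame.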

Koller et al (1994) demonstrate the application of the Kalman filter within computer vision when applied to the real time monitoring of traffic. They explain their rationale for the project by saying “unlike conventional loop detectors, which are buried underneath highways to count vehicles, video monitoring systems are less disruptive and less costly to install.” They go on to explain that “they also have greater range and allow for more detailed descriptions of traffic situations”. Figure 7 shows the Kalman filter being applied to a busy highway.

illustration not visible in this excerpt

Figure 7 A | Real Image B | Blob detection C | Kalman Filter

A problem they discuss is overlapping vehicles, or “occlusions”: if a large vehicle were to obscure a smaller one, or two were close enough to be perceived as one, the ability to track would cease. They make use of an “occlusion reasoning” algorithm that relies on hard-coded scene geometry data; such a system makes it possible to determine depth ordering, which, alongside the Kalman filter's probabilistic estimations, enables smooth vehicle tracking.

2.4 Optimization

Whilst creating a real time car tracking system, Koller et al (1993) acknowledge a fundamental shortcoming of machine vision, explaining that it does “not provide reliable data fast enough, since they demand a high computational effort”, and suggesting two sub-sampling methods to lessen the constraints upon the CPU, enabling 'real time' tracking:

1. Sub-sample the image, using a lower resolution image.
2. Sub-sample the frame rate.

They explain that such sub-sampling methods come at a quality cost: “In both cases you gain speed-up but lose accuracy (since you discard some information from the image or skip some frames) and reliability (the probability of spurious results increases due to the loss of information using sub sampling)”.
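Both sub-sampling methods are trivial to express in code. The sketch below is an illustrative helper of my own (simple pixel decimation rather than proper filtered downscaling), combining the two ideas: reduce resolution by taking every k-th pixel, and skip frames so only every m-th frame is processed:

```python
import numpy as np

def subsample(frame, spatial_factor=2, frame_index=0, temporal_factor=2):
    """The two speed-ups above: (1) frame-rate sub-sampling -- process
    only every `temporal_factor`-th frame, returning None otherwise;
    (2) resolution sub-sampling -- keep every `spatial_factor`-th
    pixel in each dimension (naive decimation, no low-pass filter)."""
    if frame_index % temporal_factor != 0:
        return None  # skipped frame
    return frame[::spatial_factor, ::spatial_factor]
```

With factors of 2 in both dimensions plus every second frame dropped, the per-second pixel workload falls by a factor of 8, at exactly the accuracy and reliability cost Koller et al describe.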

[...]

Details

Pages: 64
Year: 2013
ISBN (eBook): 9783656680734
ISBN (Book): 9783656680741
File size: 3.3 MB
Language: English
Catalogue number: v275269
Institution / College: University of Lincoln – School of Computer Science
Grade: 69
Keywords: Computer vision, Raspberry Pi, ARM, low cost, low power, gcc, linux, arch, autonomous, CCTV
