Excerpt

Acknowledgment

Abstract

Kurzfassung

Abbreviations

1 Introduction
1.1 Background
1.2 Objectives
1.3 Organization
1.4 Structure of the Thesis

2 Background and Related Work
2.1 Chip Multiprocessing
2.2 Hardware Multithreading
2.3 Multithreaded Chip Multiprocessors
2.4 Symmetric Shared-Memory Multiprocessors
2.5 Multi-Processor System-on-Chip
2.6 Memory Hierarchy
2.7 Direct Memory Access
2.8 Memory Consistency and Coherence
2.9 System Interconnect
2.9.1 Shared-Medium Interconnect
2.9.2 Switched-Medium Interconnect
2.9.3 Hybrid Interconnect
2.10 Processor Performance Counters
2.10.1 Previous Work
2.10.2 Proposed Solution: MIPSProf

3 Evaluation Platforms
3.1 Danube Platform
3.1.1 Features
3.1.2 Architecture
3.1.3 Processor Subsystem
3.1.4 Danube Evaluation Framework
3.2 ARM11 MPCore Architecture
3.3 VaST CoMET Platform

4 DMA and Cache Coherence
4.1 Background
4.2 DMA Operation
4.3 Linux DMA Driver
4.4 Original DMA Solution
4.5 Proposed Solution # 1: Software-Based Solution
4.6 Proposed Solution # 2: Hybrid Solution
4.7 Testing and Validation
4.8 Results and Analysis
4.9 Conclusions

5 OS Instruction Cache Utilization
5.1 Background
5.2 Previous Work
5.2.1 Hardware Solutions
5.2.2 Software Solutions
5.2.3 Hybrid Solutions
5.3 Optimizations for Linux
5.3.1 Cache Locking
5.4 CPU Subsystem Behavior
5.4.1 Effect of Cache Size Reduction
5.4.2 Effect of the Absence of Critical Word First Filling
5.4.3 Effect of Cache Locking and Instruction Pre-fetching
5.5 Conclusions

6 System Throughput in MPSoCs
6.1 Background
6.2 Evaluation Methodology
6.3 System Architecture
6.4 Results and Analysis
6.4.1 Effect of Memory Bandwidth
6.4.2 Effect of Number of Cores
6.4.3 Effect of Instruction Reuse
6.5 Conclusion

7 Summary and Outlook

Bibliography

List of Tables

2.1 Comparison of different memory types based on access time and price

3.1 Danube features
3.2 Cache configuration for the MIPS cores in Danube
3.3 Danube evaluation framework - Hardware components
3.4 Danube evaluation framework - Software components
3.5 Cache configuration for the MP11 cores used in the simulation

4.1 Un-cached loads and stores in the original DMA implementation
4.2 Events measured during the profiling
4.3 Events measured concurrently during the profiling
4.4 Number of fragments per packet that are sent from the ping host
4.5 Overhead due to the new DMA solution

5.1 Original DMA solution cache accesses and misses

6.1 Parameters configuration for the memory bandwidth simulations

List of Figures

1.1 Processor performance improvement between 1978-2005

2.1 Multicore architecture examples
2.2 Comparison of hardware threading types with CMP
2.3 General classification of parallel systems
2.4 Memory hierarchy example
2.5 DMA read operation
2.6 DMA write operation
2.7 Memory consistency problem in the presence of DMA
2.8 Shared-medium vs. switched-medium interconnects
2.9 Proposed kernel module design

3.1 Danube architecture
3.2 Danube application example
3.3 Experimental environment
3.4 ARM11 MPCore architecture

4.1 DMA descriptors chain
4.2 DMA read usage through RX direction of Ethernet driver
4.3 DMA write usage through TX direction of Ethernet driver
4.4 DMA read and write logic
4.5 Descriptors access pattern within the DMA Driver
4.6 Improvement in the RX handler of ping due to the new solution
4.7 Improvement in the TX handler of ping due to the new solution

5.1 Proposed locking mechanism
5.2 Effect of cache size reduction on the total number of cycles
5.3 Effect of reducing the cache size on I-cache misses
5.4 Effect of absence of critical word first filling
5.5 Effect of the instruction prefetching on sequential hazard-free code streams
5.5 Effect of the instruction prefetching on sequential hazard-free code streams

6.1 MPCore-based System architecture
6.2 Effect of memory bandwidth on system IPC
6.3 Effect of number of cores on system IPC
6.4 Effect of instruction reuse on system IPC

Acknowledgment

First of all, all thanks and praise are due to God for all that He has given me. Then I would like to express my gratitude to my family; without their continuous encouragement and support, I would not have been able to achieve what I have done so far. I would like to thank my supervisors, Dr. Jinan Lin at Infineon Technologies and Prof. Dr. Hans Michael Gerndt at the university, for providing me with the opportunity to work on a very interesting topic for this master’s thesis. This thesis was enabled by their continuous support and brilliant guidance. I would also like to thank Prof. Dr. Arndt Bode for agreeing to be the second examiner.

All of my colleagues in the Advanced Systems and Circuits department gave me the feeling of being at home. Thank you Dr. Xiaoning Nie, Stefan Maier, Stefan Meier, Mario Steinert, Dr. Ulf Nordqvist, and Chao Wang for all the help and being wonderful colleagues.

I am grateful to all the people in the Infineon Communications Solutions unit who provided me with support and guidance throughout this thesis. Many thanks to Sandrine Avakian, Stefan Linz, Yang Xu, Dmitrijs Burovs, Beier Li, and Mars Lin.

I would also like to thank Ralf Bächle from the Linux Kernel community for his valuable advice and the interesting discussions.

I am very grateful to Shakeel UrRehman from the ASIC Design and Security team at Infineon and Abdelali Zahi from Risklab Germany for their continuous advice and for proofreading this thesis.

Last, but absolutely not least, a big “thank you” to all of my friends for being there all the time. We really had a very nice time together.

To the soul of my father...

Abstract

Multicore systems are dominating the processor market; they enable the increase in computing power of a single chip in proportion to the Moore’s law-driven increase in number of transistors. A similar evolution is observed in the system-on-chip (SoC) market through the emergence of multi-processor SoC (MPSoC) designs. Nevertheless, MPSoCs introduce some challenges to the system architects concerning the efficient design of memory hierarchies and system interconnects while maintaining the low power and cost constraints. In this master’s thesis, I try to address some of these challenges: namely, non-cache coherent DMA transfers in MPSoCs, low instruction cache utilization by OS codes, and factors governing the system throughput in MPSoC designs. These issues are investigated using empirical and simulation approaches. Empirical studies are conducted on the Danube platform. Danube is a commercial MPSoC platform that is based on two 32-bit MIPS cores and developed by Infineon Technologies AG for deployment in access network processing equipment such as integrated access devices, customer premises equipment, and home gateways. Simulation-based studies are conducted on a system based on the ARM MPCore architecture. Achievements include the successful implementation and testing of novel hardware and software solutions for improving the performance of non-cache coherent DMA transfers in MPSoCs. Several techniques for reducing the instruction cache miss rate are investigated and applied. Finally, a qualitative analysis of the impact of instruction reuse, number of cores, and memory bandwidth on the system throughput in MPSoC systems is presented.

Kurzfassung

Multicore Systeme dominieren inzwischen den Prozessormarkt. Sie ermöglichen den Anstieg der Rechenleistung eines einzelnen Chips mit der gleichen Geschwindigkeit, in der nach Moores Law die Anzahl der Transistoren wächst. Eine ähnliche Entwicklung kann bei Systemen auf einem Chip (System-on-Chip, SoC) mit dem Auftreten von Multiprozessor-SoCs (MPSoC) beobachtet werden. MPSoCs stellen jedoch neue Herausforderungen an die Entwickler solcher Architekturen. Sie liegen insbesondere im effizienten Design von Speicherhierarchien und der Verbindungseinrichtungen mit niedrigem Energieverbrauch und niedrigen Kosten. In dieser Masterarbeit versuche ich einige dieser Herausforderungen zu thematisieren: Nicht Cache-kohärente direkte Speicherzugriffe (direct memory access, DMA) in MPSoCs, schlechte Ausnutzung des Instruktions-Caches durch das Betriebssystem (OS) und Faktoren, die den Gesamtdurchsatz eines MPSoCs bestimmen. Diese Punkte werden mit empirischen und simulativen Ansätzen untersucht. Die empirischen Versuche werden auf der Danube Plattform durchgeführt. Danube ist eine kommerzielle MPSoC Plattform basierend auf zwei 32-bit MIPS Prozessorkernen, die von der Infineon Technologies AG für den Einsatz in Zugangsnetzen (Access Networks) entwickelt wurde. Simulativ wird ein System basierend auf der ARM MPCore Architektur untersucht. Ziel dieser Arbeit ist das erfolgreiche Implementieren und Testen neuer Hard- und Softwaremethoden, um die Leistung von nicht Cache-kohärenten DMA-Transfers in MPSoCs zu steigern. Weiterhin werden einige Techniken zur Reduzierung der Cache-Fehlzugriffsrate untersucht und angewandt. Abschließend zeigt eine qualitative Analyse die Auswirkungen der Wiederverwendung von Instruktionen im Cache, der Anzahl der Prozessorkerne und der Speicherbandbreite auf den Durchsatz eines MPSoCs.

Abbreviations

illustration not visible in this excerpt

Chapter 1 Introduction

1.1 Background

Doubling CPU performance every 18 months has been the trend in the hardware market for the last 50 years. Moore’s law has fueled this phenomenon for a long time, but today several challenges imposed by physical limitations make the trend more difficult to follow. The first is the so-called speed of light limit, which constrains the length of the wires within the chip in order to avoid hazards [20]. The second challenge is the fabrication technology. Typical CMOS gate lengths at the time of writing are in the range of 130-45 nm, and experiments have shown that it is possible to achieve lengths around 6 nm [21]. Nevertheless, chip design and manufacturing at such a scale become extremely difficult and costly. The third challenge is power dissipation. Power has been a major problem for a long time, and the “straightforward” approach to this problem was to lower the voltage with each new generation of chips, since the dynamic power dissipation depends on the square of the voltage:

$$P = \alpha \cdot C \cdot V^{2} \cdot f \tag{1.1}$$

where:

P: The dynamic power dissipation

α: The switching activity factor

C: The effective capacitance

V: The operational voltage

f: The clock frequency

Equation 1.1 shows that the power also depends on the frequency. With chip frequencies rising and approaching 4 GHz at the time of writing [78], the gain from reducing the voltage is becoming quite small.
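To make Equation 1.1 concrete, the following sketch evaluates it for three operating points. The numeric values chosen for α, C, V and f are illustrative assumptions and not measurements of any real chip; only the formula itself is taken from the text.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """Dynamic power P = alpha * C * V^2 * f (Equation 1.1), in watts."""
    return alpha * capacitance * voltage ** 2 * frequency


# Hypothetical operating points (assumed values, for illustration only).
baseline = dynamic_power(alpha=0.2, capacitance=1e-9, voltage=1.2, frequency=2e9)
lower_voltage = dynamic_power(alpha=0.2, capacitance=1e-9, voltage=1.0, frequency=2e9)
higher_clock = dynamic_power(alpha=0.2, capacitance=1e-9, voltage=1.0, frequency=4e9)

print(f"baseline (1.2 V, 2 GHz):      {baseline:.3f} W")
print(f"lower voltage (1.0 V, 2 GHz): {lower_voltage:.3f} W")
print(f"higher clock (1.0 V, 4 GHz):  {higher_clock:.3f} W")
```

Under these assumed numbers, lowering the voltage from 1.2 V to 1.0 V reduces the power by roughly 31%, while doubling the clock frequency at the lower voltage more than cancels that saving.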

A fourth challenge is the leakage current, which became a major concern with the advent of nano-scale technologies: it has grown so large that in some cases it is almost equal to the operational current [46].

Further challenges also arise from the architecture complexity and data-intensive applications. According to [61], architects have reached the limit in extracting the computing power from single core processors. Increasing the performance of single core processors will require either enlarging them or adding more sophisticated logic for extracting more instruction-level parallelism (ILP) or thread-level parallelism (TLP) through superscalarity, very long instruction word (VLIW) architectures, and simultaneous multithreading (SMT) [32, 10].

illustration not visible in this excerpt

Figure 1.1: Processor performance improvement between 1978-2005. Source: [32]

Figure 1.1 shows the processor speedup during the last 30 years. One can easily see that in the last five years the rate of speedup has decreased. This is clear evidence of the effect of the challenges mentioned above.

In summary: To keep Moore’s law valid, we need a new approach which can fuel the increase in CPU performance in the future. One possibility is to utilize the large number of transistors that can already be put into a single chip today and switch to a parallel processing approach. Here the idea is to put more than one CPU within the same chip rather than trying to increase the amount of logic that can be put into a single CPU. The CPU in such systems is usually called a processor core or simply core, and the resulting system is called a multicore¹ system.

Multicore systems represent a revolution in many aspects of computing, since they require new paradigms, techniques, and tools for efficiently utilizing the computing power available on such chips. Many researchers see this revolution as similar to the RISC revolution of the 1980s [30]. Multicore systems are already deployed in the market and many chips are available, such as Intel Montecito, AMD Opteron, and Sun Niagara [57, 5, 48]. However, these chips are targeted toward the server and workstation market. The embedded market, in which the constraints are primarily related to power consumption and low cost, has also realized the potential of multicore systems. Vendors have already started to release chips based on embedded multicore architectures. Such chips are usually called multiprocessor system-on-chips (MPSoCs) since they combine the multiprocessing and system-on-chip paradigms.

MPSoC systems present some challenges to system architects. These challenges include efficient memory hierarchy design, scalable system interconnect, and new programming paradigms, among others. Some of these problems also exist in unicore systems. However, the lack of extensive experience with embedded multicore chips represents both a challenge and an opportunity for the architects, since they need to come up with novel architectures that satisfy all the constraints. One important market for multicore chips is network processors. The tremendous bandwidth available today in network connections requires innovative solutions in the protocol processing engines to fully utilize it. Manufacturers have already started shipping products based on multicore chips such as the Intel IXP family, Cavium Octeon family, and Freescale C-Port family [43, 60, 36].

One particular product is the Danube chip from Infineon Technologies AG [2]. Danube is a dual-core chip based on 32-bit MIPS processors. According to the technical specifications [2]:

Danube is Infineon’s next generation ADSL2+ IAD single-chip solution. It comprises the highest integration of VoIP and ADSL for high-end and cost-optimized ADSL2/2+ IADs. It enables the most effective and scalable realization of VoIP applications. It is offered with a complete development kit consisting of reference designs, (...), and an application software stack based upon the Linux operating system.

In a typical application, the two cores in the Danube are organized as:

1. CPU 0: A 32-bit MIPS 24KEc core used to control the overall operation of the system.
2. CPU 1: A 32-bit MIPS 24KEc core with Digital Signal Processing Application Specific Extensions (DSP ASE), used only for VoIP processing.

1.2 Objectives

The purpose of this thesis is to:

1. Explore and analyze the existing architectural solutions for MPSoC systems, especially the ones concerning memory hierarchy and system interconnect, using empirical and simulation approaches.
2. Implement some novel concepts in hardware and software to improve the performance of these solutions.

Based on that, the thesis was split into three concrete tasks:

1. Direct memory access (DMA) and cache coherence: Software-based solutions for cache coherence are attractive in terms of power consumption for MPSoC systems [54, 56]. The aim here is to analyze and enhance the existing software solutions for non-cache coherent DMA transfers in MPSoCs. The Danube platform is used for conducting the experiments and testing the new solutions.
2. Operating system instruction cache utilization: OS codes have poor instruction cache utilization [4, 77]. The goal here is to investigate and improve the instruction cache utilization for the Linux Kernel 2.4 series running on Danube.
3. System throughput of MPSoCs: A qualitative analysis of the impact of instruction reuse, number of cores, and memory bandwidth on the overall system IPC is presented. The analysis is obtained by simulating a system based on the ARM MPCore architecture using the VaST CoMET platform [7, 72].

1.3 Organization

The thesis was conducted at the Infineon Technologies headquarters in Munich, Germany, between May and December 2007 within the Advanced Systems and Circuits (ASC) department. The supervisor from Infineon Technologies is Dr.-Ing. Jinan Lin from the Protocol Processing Architectures (PPA) team. The principal examiner from the university is Prof. Dr. Hans Michael Gerndt from the Lehrstuhl für Rechnertechnik und Rechnerorganisation/Parallelrechnerarchitektur, and the second examiner is Prof. Dr. Arndt Bode from the same department.

1.4 Structure of the Thesis

Chapter 2 presents a review of selected topics from computer architecture that are used throughout the thesis. Chapter 3 outlines the evaluation platforms used for conducting the experiments. Chapter 4 presents the DMA and cache coherence task and the implemented solutions along with the obtained results. In Chapter 5, the instruction cache utilization problem is presented along with the proposed solutions and results. Chapter 6 presents a qualitative analysis of the impact of instruction reuse, number of cores, and memory bandwidth on the overall system throughput. Chapter 7 summarizes the conclusions and outlines suggestions for possible further research on the topic.

Chapter 2 Background and Related Work

This chapter aims to introduce the reader to the prerequisite topics in computer architecture and operating systems that are used throughout the thesis.

2.1 Chip Multiprocessing

A multicore processor is one that combines two or more independent processors into a single chip. The cores might be identical (i.e. homogeneous) or hybrid (i.e. heterogeneous). Cores in a multicore chip may share the cache(s) on different levels. They may also share the interconnect to the rest of the system. Each core independently implements optimizations such as pipelining, superscalar execution, and multithreading. A system with N cores is effective when it is presented with N or more threads concurrently. Multicore systems are multiprocessor systems that belong to the MIMD (Multiple-Instruction, Multiple-Data) family according to Flynn’s taxonomy [25]. Another common name used to describe multicore processors is Chip Multiprocessors (CMP). This name emphasizes the fact that all the cores are present on a single physical chip.

illustration not visible in this excerpt

Figure 2.1: Multicore architecture examples. Source: [80]

2.2 Hardware Multithreading

Hardware multithreading exploits thread-level parallelism (TLP) by allowing multiple threads to share the functional units of a single processor in an overlapping fashion. According to [32, 22], it can be implemented in three different ways:

1. Coarse-grained: The CPU switches to a new thread when the thread occupying the processor blocks on a memory request or other long-latency request.
2. Fine-grained: The CPU switches between threads on each cycle, causing the execution of multiple threads to be interleaved. This interleaving is usually done in a round-robin fashion, skipping any threads that are stalled at that time.
3. Simultaneous Multithreading (SMT): Adds multi-context support to multiple-issue, out-of-order processors. Unlike conventional multiple-issue processors, SMT processors can issue instructions from different streams on each cycle for improved ILP. SMT helps to eliminate both vertical and horizontal waste. Vertical waste is introduced when the processor issues no instructions in a cycle, whereas horizontal waste is introduced when not all issue slots can be filled in a cycle.
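The distinction between vertical and horizontal waste can be illustrated with a small sketch that counts wasted issue slots over a trace of cycles. The trace, the issue width, and the function name `issue_slot_waste` are hypothetical; counting an empty cycle as `issue_width` wasted slots is a modelling assumption of this sketch, not a definition taken from [32, 22].

```python
def issue_slot_waste(issued_per_cycle, issue_width):
    """Return (vertical, horizontal) wasted issue slots for a trace of cycles.

    issued_per_cycle -- number of instructions issued in each cycle
    issue_width      -- maximum number of instructions issued per cycle
    """
    vertical = 0
    horizontal = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must lie between 0 and issue_width")
        if issued == 0:
            vertical += issue_width             # nothing issued: whole cycle lost
        else:
            horizontal += issue_width - issued  # some slots left unfilled
    return vertical, horizontal


# Hypothetical six-cycle trace on a 4-wide processor.
trace = [4, 0, 2, 1, 0, 3]
vertical, horizontal = issue_slot_waste(trace, issue_width=4)
print(f"vertical waste: {vertical} slots, horizontal waste: {horizontal} slots")
```

In this trace the two empty cycles account for the vertical waste, and the partially filled cycles account for the horizontal waste; an SMT processor tries to fill both kinds of gaps with instructions from other threads.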

Figure 2.2 highlights the difference between the different types of hardware threading and CMP.

illustration not visible in this excerpt

Figure 2.2: Comparison of hardware threading types with CMP. Source: [32]

2.3 Multithreaded Chip Multiprocessors

According to [22], a new emerging type of processor is the so-called multithreaded chip multiprocessor. In this type of processor, multiple processor cores are integrated into a single chip as in CMP and each core implements hardware multithreading (MT). Sun Microsystems calls this paradigm Chip Multithreading (CMT) [42]. Examples of CMT processors include IBM POWER5 and Sun Niagara [45, 48].

2.4 Symmetric Shared-Memory Multiprocessors

Symmetric¹ Multiprocessors (SMP) are shared-memory multiprocessor systems with the following characteristics:

1. Global physical address space
2. Symmetric access to all the main memory from any processor

They belong to the uniform memory access (UMA) family according to the general classification of parallel systems. SMPs were originally implemented with each processor being on a separate chip. With the advent of mega-transistor chips, it became possible to integrate all the processors into a single chip. SMP systems represent the most popular type of multicore systems today [32, 10].

illustration not visible in this excerpt

Figure 2.3: General classification of parallel systems. Source: [26]

2.5 Multi-Processor System-on-Chip

According to [44], MPSoCs represent the new solution targeted towards embedded computing to overcome the performance limitations of single processor microcontrollers. MPSoCs exhibit some general properties that can be summarized as follows:

- Heterogeneous processing elements: MPSoCs usually combine general processors, application processors, and accelerators.
- Application-oriented: They are targeted towards a specific application field (e.g. protocol processing, video processing, security).
- Low power: Since the main application field for these chips is embedded computing, low power is one of the primary factors that drive the design of such chips.

2.6 Memory Hierarchy

Almost all modern computer systems incorporate some kind of memory hierarchy. Such a hierarchy is needed to cope with the different speeds at which the different components in a computer system operate. For example, registers are very fast but limited in capacity (∼ 1 KB), whereas the hard disk is very large in capacity (∼ 100 GB) but very slow to access. Moreover, CPU speed increases by 55% annually whereas memory speed increases only by 10% annually, leading to what is known as the memory wall [81]. Thus, a solution is needed to cope with such a growing disparity in speed. The solution is to provide some kind of intermediate memory between the CPU and the main memory which is faster than the main memory and a little slower than the CPU registers. It serves to bridge the gap between the CPU and the main memory. This intermediate memory is known as cache. Cache helps to hide the latencies incurred in accessing the main memory. To achieve that, cache operation is based on a phenomenon observed in most programs, known as locality of reference. Locality of reference refers to the fact that a memory location which is accessed now will very probably be accessed again in the near future, or that the adjacent locations will very likely be accessed in the near future. Locality can be split into two types [63]:

1. Temporal locality (locality in time): If an item is referenced, it will tend to be referenced again soon.
2. Spatial locality (locality in space): If an item is referenced, items whose addresses are close by will tend to be referenced soon.

Thus we can take advantage of the principle of locality of reference by implementing the memory of a computer as a hierarchy. A memory hierarchy consists of multiple levels of memory with different sizes and speeds. As the distance from the CPU increases, both capacity and access time of memories increase.

illustration not visible in this excerpt

Table 2.1: Comparison of different memory types based on access time and price. Source: [63]

Table 2.1 shows a comparison between the different types of memory. It can be observed that the main memory access time is about 100 times greater than the CPU register access time. The access time gets even larger for external hard-disk access: accessing the external hard disk is around 10,000,000 times slower than accessing the CPU internal registers. This table shows that a memory hierarchy is necessary for modern computer systems in order to achieve decent performance. A level closer to the processor is generally a subset of any level further away. A memory hierarchy might contain multiple levels, but data is copied between only two adjacent levels at a time.

Figure 2.4 shows a general abstraction of the memory hierarchy used in modern computers. A small but very fast cache memory is placed on the chip inside the CPU; it is called the CPU cache². This cache is usually built with SRAM technology and can usually be accessed at a speed comparable to the CPU register access time. It is important to notice that the actual first level of memory is the CPU registers. Most modern CPUs contain two caches: one instruction cache and one data cache. Moreover, some CPUs implement more than one level of caching between the CPU and the main memory. For instance, the Intel Core Duo processor [19] implements two levels of caches called L1 cache and L2 cache, with the L2 cache being shared between the two cores.

illustration not visible in this excerpt

Figure 2.4: Memory hierarchy example

Data transfer between the different levels of memory occurs in terms of blocks or lines. The block size might be different between each two levels. Now, suppose that the CPU requests a given address; if this address is present in the cache, then we say that we have a hit. Otherwise, we have a miss. One important goal for the system architect and programmer is to maximize the hit ratio, which is the fraction of memory accesses that hit in the cache. Since performance is the major reason for having a memory hierarchy, the time to service hits and misses is important. Hit time is the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss. The miss penalty is the time to replace a block in the upper level with the needed block from the lower level plus the time needed to deliver this block to the processor.
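The hit time, miss penalty, and hit ratio defined above are commonly combined into the average memory access time (AMAT). The relation below is the standard textbook form rather than a result of this thesis, and the symbols and numbers in the example are assumptions chosen purely for illustration.

```latex
\text{AMAT} = t_{\text{hit}} + (1 - h) \cdot t_{\text{miss}}

% h: hit ratio, t_hit: hit time, t_miss: miss penalty
% Example with assumed values t_hit = 1 cycle, h = 0.95, t_miss = 100 cycles:
\text{AMAT} = 1 + (1 - 0.95) \cdot 100 = 6 \text{ cycles}
```

Even with a hit ratio of 95% in this assumed example, the average access takes six times longer than a hit, which is why both the hit ratio and the miss penalty matter to the architect.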

One important issue which arises here is that, since the cache has far fewer lines than the main memory, a policy is needed to determine how memory lines are mapped into cache lines. There exist three general strategies for cache mapping:

1. Direct Mapping: In this scheme, each memory location is mapped to exactly one location in the cache. This means that a block of main memory can only be brought into the same line of cache every time.
2. Fully-Associative Mapping: A block of main memory may be mapped into any line of the cache and it is no longer restricted to a single line of cache.
3. Set-Associative Mapping: In order to overcome the complexity of fully-associative cache, the cache is divided into a number of sets. Each set contains a number of lines (e.g. a 2-way set-associative cache has 2 lines per set). Under this scheme, a block of memory is restricted to a specific set of lines. A block of main memory may map to any line in the given set. It represents a compromise between direct mapping and fully-associative mapping.
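The three mapping strategies can be compared with a toy cache model in which only the number of lines per set changes. This is a minimal sketch under stated assumptions: the function name `count_hits`, the modulo-based set selection, the LRU replacement policy, and the access trace are all illustrative choices and do not describe any particular processor.

```python
from collections import OrderedDict


def count_hits(block_addresses, num_lines, ways):
    """Count hits for a trace of block addresses in a cache with num_lines lines.

    ways == 1          -> direct mapping
    ways == num_lines  -> fully-associative mapping
    otherwise          -> set-associative mapping
    LRU replacement within a set is an assumption of this sketch.
    """
    if num_lines % ways != 0:
        raise ValueError("num_lines must be a multiple of ways")
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for block in block_addresses:
        lines = sets[block % num_sets]     # the set this block maps to
        if block in lines:
            hits += 1
            lines.move_to_end(block)       # mark as most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[block] = True
    return hits


# Blocks 0 and 4 map to the same line of a 4-line direct-mapped cache.
trace = [0, 4, 0, 4, 0, 4]
direct = count_hits(trace, num_lines=4, ways=1)
two_way = count_hits(trace, num_lines=4, ways=2)
fully = count_hits(trace, num_lines=4, ways=4)
print(f"direct: {direct} hits, 2-way: {two_way} hits, fully-associative: {fully} hits")
```

With this trace the direct-mapped configuration keeps evicting one block to load the other, whereas two lines per set are already enough to keep both blocks resident, which illustrates why set-associative mapping is described as a compromise.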

2.7 Direct Memory Access

Direct Memory Access (DMA) is a technique used to offload costly I/O operations from the CPU onto a “third party” processor known as the DMA controller (DMAC). Prior to DMA, whenever a hardware device wanted to send data to or receive data from the memory, the CPU had to initiate the transfer and perform the copy between the device and the memory. This means that the CPU was kept busy during the whole process. Recalling from Section 2.6 that I/O devices and memory are much slower than the CPU, the CPU in fact spent most of that time idle, just waiting for the I/O devices and the memory. With DMA, the CPU initiates the transfer and offloads it to the DMAC, which then performs the copy operation between the device and the memory. Meanwhile, the CPU can go back to useful processing while the I/O operations are handled by the DMAC. Upon completion, the DMAC signals the CPU that the operation has finished. This represents a large performance saving, since the CPU time is not wasted in waiting for the slow peripheral or memory devices.

In general, DMAC provides two core operations: read and write. Figures 2.5 and 2.6 show the simplified sequence of actions taken during these operations.

Modern DMA controllers (e.g. ARM PrimeCell [6]) provide a wide range of operations. These operations include memory-to-memory, memory-to-peripheral, peripheral-to-memory, and peripheral-to-peripheral transactions. Typically, DMAC provides its functionality through DMA channels. These channels determine the interconnection between the different modules within the system via the DMAC. The channel assignment can be fixed or configurable. Furthermore, DMAC is usually split internally into an RX controller and a TX controller with both of them operating in parallel.

illustration not visible in this excerpt

Figure 2.5: DMA Read Operation: (1) The first step is to transfer the data from the I/O device (e.g. network controller) via DMAC into the main memory. (2) After that, DMAC issues an interrupt to the CPU signaling the reception of data (e.g. packet). (3) Finally, CPU accesses the main memory and processes the received data.

illustration not visible in this excerpt

Figure 2.6: DMA Write Operation: (1) The first step here is taken by the CPU which writes the data into the main memory. (2) After that, the CPU signals the DMAC to start the transaction. (3) Finally, DMAC transfers the data from the main memory into the I/O device.

However, DMA comes at a cost: the presence of another data producer/consumer (i.e. DMAC) in the system implies the need for maintaining memory consistency. To illustrate this problem, consider Figure 2.7. Suppose that the following operations are executed in order on a given address:

1: CPU writes the value Y into the cache without updating the main memory

2: DMAC transfers the old value X from the main memory into an external device

illustration not visible in this excerpt

Figure 2.7: Memory consistency problem in the presence of DMA: a situation in which a lack of coherence in the memory system might render the whole system useless.

Clearly, the DMAC will transfer a stale value for that memory address. This leads us to the definition of two important properties of memory systems: coherence and consistency.
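The two-step scenario above can be reproduced with a small simulation. The sketch below assumes a write-back cache that the DMAC bypasses; `WriteBackCache`, `dma_to_device`, and `demo` are hypothetical names chosen for illustration. The `flush_before_dma` switch shows one common software remedy, namely writing the dirty line back to main memory before the transfer starts. It is meant only as an illustration of the problem, not as the solution developed in this thesis.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay in the cache until flushed."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}     # address -> cached value
        self.dirty = set()  # addresses modified but not yet written back

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def flush(self, addr):
        """Write a dirty line back to main memory."""
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)


def dma_to_device(memory: dict, addr):
    """The DMAC bypasses the CPU cache and reads main memory directly."""
    return memory[addr]


def demo(flush_before_dma: bool):
    memory = {0x100: "X"}
    cache = WriteBackCache(memory)
    cache.write(0x100, "Y")              # step 1: CPU writes Y into the cache
    if flush_before_dma:
        cache.flush(0x100)               # software-managed write-back
    return dma_to_device(memory, 0x100)  # step 2: DMAC reads main memory
```

Calling `demo(False)` returns the stale value `"X"`, whereas `demo(True)` returns `"Y"` because the dirty line reaches main memory before the DMAC reads it.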

2.8 Memory Consistency and Coherence

Any memory system that incorporates more than one memory data producer/consumer (i.e. processor) should maintain two important principles in order to keep computations correct and avoid situations such as the one illustrated in Figure 2.7. These two principles are:

1. Coherence: Deﬁnes what values can be returned by a read.
2. Consistency: Determines when a written value will be returned by a read.

A refined version of the above definition can be stated as follows: coherence defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to accesses to other memory locations.

However, these definitions are somewhat vague. A more concrete definition is found in [32]. It states that a memory system is coherent if:

1. A read by a processor, P, to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are suﬃciently separated and no other writes to X occur between the two accesses.
3. Writes to the same location are serialized: that is, two writes to the same location by any two processors are seen in the same order by all processors. Namely, the system should exhibit write serialization. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
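The write serialization condition can be expressed as a simple check on what each processor observes. The following sketch assumes that the values written to a location are distinct and that the global write order is known; the function names `respects_write_order` and `is_serialized` are hypothetical and do not come from [32].

```python
def respects_write_order(write_order, observed):
    """Return True if the observed reads never go backwards in write_order.

    write_order: the values written to one location, in serialization order
                 (assumed to be distinct).
    observed:    the sequence of values one processor read from that location.
    """
    position = {value: i for i, value in enumerate(write_order)}
    last = -1
    for value in observed:
        idx = position[value]
        if idx < last:
            return False  # an older write was seen after a newer one
        last = idx
    return True


def is_serialized(write_order, reads_per_processor):
    """All processors must see the writes in the same (global) order."""
    return all(respects_write_order(write_order, reads)
               for reads in reads_per_processor)
```

For the example in condition 3, `is_serialized([1, 2], [[1, 1, 2], [2, 2]])` holds, while any processor that reads 2 and later 1, as in `[[1, 2], [2, 1]]`, violates the condition.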

Consistency, however, is more complex. It specifies constraints on the order in which memory operations become visible to the other processors. It includes operations to the same location and to different locations. Therefore, it subsumes coherence. The most straightforward model for memory consistency is called sequential consistency. Sequential consistency requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved.
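To make the definition of sequential consistency concrete, the sketch below enumerates every interleaving of two short programs that preserves each processor's program order and collects the possible results. The two programs form a small test case chosen here for illustration (it is not taken from the thesis): processor P0 writes x and then reads y, while P1 writes y and then reads x, with both locations initially 0.

```python
from itertools import combinations

# Each operation is (kind, location, argument):
#   ("write", loc, value) stores value; ("read", loc, reg) loads into reg.
P0 = [("write", "x", 1), ("read", "y", "r0")]
P1 = [("write", "y", 1), ("read", "x", "r1")]


def interleavings(a, b):
    """Yield all merges of a and b that keep each program's own order."""
    n = len(a) + len(b)
    for chosen in combinations(range(n), len(a)):
        slots = set(chosen)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def execute(schedule):
    """Run one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "write":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r0"], regs["r1"]


def sc_outcomes():
    """Set of (r0, r1) results allowed under sequential consistency."""
    return {execute(s) for s in interleavings(P0, P1)}
```

The six interleavings yield only the results (0, 1), (1, 0), and (1, 1). The result (0, 0) never appears, so a memory system that produces it for this program is not sequentially consistent.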

One last remark is that these two principles apply to accesses to both main memory and caches. The cache coherence problem also arises in multiprocessor systems where each CPU has its own cache, so coherence has to be ensured at the level of the caches as well. The general concepts are the same as above; only the implementation differs. For a more detailed discussion of memory consistency and cache coherence, please refer to [32].

2.9 System Interconnect

System interconnect refers to the medium that connects the different parts of a system. In this context, we restrict ourselves to the on-chip interconnect that connects the processors, memory controllers, DMACs, and other peripheral interfaces in an SoC [44, 14]. A new trend in interconnect technology is to use internetworking ideas (e.g. the OSI model) to implement the on-chip network. This led to the introduction of on-chip networks (OCNs), also referred to as networks-on-chip (NoCs) [13, 32], which are used for interconnecting microarchitecture functional units, register files, caches, and processor and IP cores within a chip or multichip modules.

According to [32, 59, 49, 13], on-chip interconnect can be classiﬁed into three main categories:

1. Shared-Medium Interconnect
2. Switched-Medium Interconnect
3. Hybrid Interconnect

2.9.1 Shared-Medium Interconnect

In a shared-medium interconnect, all devices share the interconnect medium. An example of such an architecture is the traditional system bus, in which all devices are connected via the bus. Only one message can be sent at a time, and this message is broadcast to all the devices on the bus. Access to the bus is controlled by an arbiter, which decides which requesting device is granted the bus medium. One obvious bottleneck of the bus architecture is its limited scalability. Some prominent industrial standards deployed in many products at the time of writing are ARM Advanced Microcontroller Bus Architecture (AMBA), STMicroelectronics STBus, and IBM CoreConnect [8, 70, 34].
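The role of the arbiter can be illustrated with a short simulation. The text above does not prescribe an arbitration policy, so the sketch assumes a simple round-robin scheme; `RoundRobinArbiter` and `simulate` are hypothetical names and are not taken from AMBA, STBus, or CoreConnect.

```python
class RoundRobinArbiter:
    """Grants the bus to one requesting device per cycle, in rotating order."""

    def __init__(self, num_devices: int):
        self.num_devices = num_devices
        self.last_granted = num_devices - 1  # so that device 0 is served first

    def grant(self, requests: set):
        """Return the id of the device that gets the bus, or None if idle."""
        for offset in range(1, self.num_devices + 1):
            candidate = (self.last_granted + offset) % self.num_devices
            if candidate in requests:
                self.last_granted = candidate
                return candidate
        return None


def simulate(pending: dict, num_devices: int) -> list:
    """pending maps device id -> number of messages it still wants to send.

    Returns the device granted the bus in each cycle. Since only one message
    can be on the bus at a time, the number of cycles equals the total number
    of messages, regardless of how many devices are attached.
    """
    arbiter = RoundRobinArbiter(num_devices)
    pending = dict(pending)
    schedule = []
    while any(pending.values()):
        requests = {dev for dev, count in pending.items() if count > 0}
        winner = arbiter.grant(requests)
        pending[winner] -= 1
        schedule.append(winner)
    return schedule
```

For example, `simulate({0: 2, 1: 1, 2: 2}, 3)` needs five cycles for five messages. Adding devices adds requests but not bandwidth, which is the scalability bottleneck mentioned above.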

[...]

¹ I have chosen to use multicore instead of multi-core to reflect the fact that these systems are becoming the norm. See [47] for more information about the usage of the hyphen in English.

¹ The word Symmetric here refers to the symmetry in memory access latency. It should not be confused with a Symmetric Multiprocessing OS, in which a single OS image runs on all the cores. For more information, please refer to [66, 32].

² The cache principle is not restricted to the CPU. Caches are used in many subsystems within a computer system (e.g. disk cache).


Details

Title: Embedded Multiprocessor System-on-Chip for Access Network Processing
College: Technical University of Munich (Institute for Informatics)
Grade: 1.0
Author: Mohamed Bamakhrama (Author)
Year: 2007
Pages: 90
Catalog Number: V111469
ISBN (eBook): 9783640095223
ISBN (Book): 9783640112609
File size: 915 KB
Language: English
Tags: Embedded Multiprocessor System-on-Chip Access Network Processing
