A Clustered SMT Architecture for Scalable Embedded Processors
| Content Provider | Semantic Scholar |
|---|---|
| Author | Dehnavi, Maryam Mehri; Hassanein, Wessam |
| Abstract | Power consumption and wire delays are two limiting factors in designing embedded systems with a centralized architecture. Traditionally, the majority of embedded systems have used either VLIW or superscalar architectures. Recently, Simultaneous Multithreaded (SMT) processors have appeared commercially, such as the Intel Pentium 4 and the IBM Power5. SMT architectures differ from superscalar processors in allowing multiple threads to execute concurrently and share resources. This allows better utilization of processor resources and potentially better performance. Several recent studies have explored SMT architectures for embedded systems. Clustering is a technique by which resources are partitioned between threads to reduce wire delays, complexity, and power consumption, as well as to enhance scalability. These advantages make clustering suitable for embedded systems, which favor cost and power consumption over performance. In this work, we study the different clustering options of an SMT processor that favor reduced power consumption, and their effects on performance. We concentrate on how new processor features such as clustered SMT can be tailored to benefit embedded processors. Clustered SMT Embedded Processors: Embedded processor architectures differ from general-purpose processor architectures in their concentration on low cost and low power consumption, usually at the expense of performance. Trends in embedded systems are leading to increasingly complex applications and larger performance requirements. To meet these growing demands, future embedded processors will resemble current high-performance processors. This work studies changes to state-of-the-art processor architectures that address the embedded-processor challenge of achieving high performance while reducing cost and power consumption. 
Recently, processor architectures have been achieving performance through Thread Level Parallelism (TLP) using multithreaded processors. This approach allows multiple threads to execute concurrently on the same processor. Two main directions exist: (1) Chip Multiprocessing (CMP), where multiple cores exist per chip, and (2) Simultaneous Multithreading (SMT), where a single superscalar processor is able to execute multiple threads concurrently by sharing its resources between the different threads. Moreover, combinations of these two techniques (CMP + SMT) currently exist in commercial processors such as the IBM Power5, where two cores exist per chip, each an SMT processor. In this work, we concentrate on the SMT architecture, examples of which include the Intel Pentium 4 Hyper-Threading processors, the Intel Xeon processor, and the IBM Power5. Clustering is a technique by which resources are partitioned between threads to reduce wire delays, complexity, and power consumption, as well as to enhance scalability. These advantages make clustering suitable for embedded systems, which favor cost and power consumption over performance. Hardware clustering allows a wide superscalar processor to maintain high clock rates by grouping resources into small clusters. Local communication then travels shorter distances, and complex logic operates on only a subset of the resources. However, clustered architectures pay a higher inter-cluster communication cost. Research on hardware clustering has advanced in three essential directions (Collins and Tullsen 2004): (1) deeper clustering (more resources clustered), (2) heavier clustering (a larger number of clusters), and (3) higher inter-cluster communication cost. Studying each of these features is therefore essential to arriving at an ideal architecture for embedded systems. A considerable amount of work has been done on clustering the resources of a centralized architecture. 
However, most of this work concentrates on single-threaded applications. Only recently have clustered multithreaded architectures been studied. Latorre et al. (2004) investigated dynamic and static assignment of very simple back-end clusters. Raasch et al. investigated the impact of static partitioning in an SMT architecture. In Krishnan and Torrellas (1998), threads are statically assigned to a number of non-clustered execution cores in an SMT architecture. In Berekovic et al. (2004), a network topology is implemented for communication between threads in an SMT processor. In Moursy et al. (2005), different partitioning schemes for clustered multithreaded processors are studied. Clustering can be implemented on different processor resources, both at the back-end (comprising the functional units, the register renaming units, the register files, and the data cache) and at the front-end (comprising the fetch unit, the decode unit, and the instruction cache). One of the main challenges of clustering resources in a processor is that instructions are distributed among a set of clusters. Therefore, if dependent instructions are located in different clusters, a considerable amount of communication has to occur to transfer values from the producer clusters to the consumer clusters. In current clustered architectures, communication among the back-ends is implemented by a network, which results in a Network-on-Chip (NoC) architecture. To reduce the communication rate, different network topologies (such as unidirectional, bidirectional, and radix-k crossbar networks) have to be studied, and different techniques to reduce dependencies between threads have to be considered. In this work, a clustered SMT architecture is implemented, and different clustering implementations and network topologies are studied to evaluate the communication rates. 
The specific contributions of this work are three-fold. First, clustering the resources of an embedded processor's architecture in both the front-end and the back-end to meet the power and energy constraints of these applications. Second, optimizing the underlying network topology that carries communication between back-end clusters in order to reduce the communication overhead. Third, adding SMT-enabled features to the back-end clusters in order to hide communication and increase throughput. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://www.enel.ucalgary.ca/~whassane/papers/Prwt_2006_hassanein.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
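
The abstract notes that a value produced in one cluster must cross the inter-cluster network to reach a dependent instruction in another cluster, and that the topology affects this cost. The following is a minimal sketch of that idea, not the authors' simulator. It assumes the unidirectional and bidirectional networks are rings, that a crossbar delivers any value in one hop, and that cost is measured in hops; the function names `hops` and `average_hops` are hypothetical.

```python
def hops(src, dst, n_clusters, topology):
    """Hops needed to forward a value from cluster src to cluster dst.

    Assumed model (not from the source): unidirectional and bidirectional
    networks are rings; a crossbar connects every pair directly.
    """
    if src == dst:
        return 0  # producer and consumer share a cluster: no network traffic
    forward = (dst - src) % n_clusters  # distance travelling in one direction
    if topology == "unidirectional":
        return forward
    if topology == "bidirectional":
        # The value may travel either way around the ring.
        return min(forward, n_clusters - forward)
    if topology == "crossbar":
        return 1
    raise ValueError(f"unknown topology: {topology}")


def average_hops(n_clusters, topology):
    """Mean hop count over all ordered pairs of distinct clusters."""
    pairs = [(s, d) for s in range(n_clusters)
             for d in range(n_clusters) if s != d]
    return sum(hops(s, d, n_clusters, topology) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    for topology in ("unidirectional", "bidirectional", "crossbar"):
        print(topology, average_hops(4, topology))
```

Under these assumptions, the average distance between four clusters falls as the network becomes richer, which is the trade-off between communication overhead and network complexity that the text describes.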