Lecture 1

Lecture 1: Fundamentals of Quantitative Design and Analysis

CRA Community White Paper, “21st century computer architecture,” available at http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf, May 2012.
J.L. Manferdelli, N.K. Govindaraju, and C. Crall, "Challenges and opportunities in many-core computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 808-815, May 2008.
S. Borkar, "Thousand core chips - a technology perspective," Proc. IEEE/ACM Design Automation Conf. (DAC), 2007, pp. 746-749.
A. Roy, J. Xu, and M.H. Chowdhury, "Multi-core processors: a new way forward and challenge," Proc. Int'l Conf. Microelectronics, 2008, pp. 454-457.

Lecture 2: Memory Hierarchy Design

R. Heald, K. Shin, V. Reddy, I.-F. Kao, M. Khan, W. L. Lynch, G. Lauterbach, and J. Petolino, “64-KByte sum-addressed-memory cache with 1.6-ns cycle and 2.6-ns latency,” J. of Solid State Circuits, pp. 1682, Nov. 1998.
B. Jacob and T. Mudge, “Virtual memory in contemporary microprocessors,” IEEE Micro, vol. 18, no. 4, pp. 60-75, Jul./Aug. 1998.
J. Kim, A.J. Hong, S.M. Kim, K.-S. Shin, et al., “A stacked memory device on logic 3D technology for ultra-high-density data storage,” Nanotechnology, vol. 22, no. 25, Nov. 2011.
G.H. Loh, “3D-stacked memory architectures for multi-core processors,” in Proc. IEEE/ACM Int’l Symp. Computer Architecture (ISCA), 2008, pp. 453-464.
K. Olukotun, T. N. Mudge, and R. B. Brown, “Multilevel optimization of pipelined caches,” IEEE Trans. Computers, vol. 46, no. 10, pp. 1093-1102, Oct. 1997.
E. Rotenberg et al., “Trace cache: a low latency approach to high bandwidth instruction fetching,” in Proc. 29th Symp. Microarchitecture, Dec. 1996.
S. P. Vanderwiel and D. J. Lilja, “When caches aren't enough: data prefetching techniques,” IEEE Computer, vol. 30, no. 7, pp. 23-30, Jul. 1997.
D.H. Woo, N.H. Seong, D.L. Lewis, and H.H.S. Lee, “An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth,” in Proc. IEEE 16th HPCA, 2010, pp. 1-12.

Lecture 3: Instruction-Level Parallelism and Its Exploitation

G. Doshi, “Understanding the IA-64 Architecture,” 1999.
J. Douglas, “Intel 8xx series and Paxville Xeon-MP Microprocessors,” Proc. Hot Chips, Stanford University, Aug. 2005.
G. Hinton et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, Q1, 2001.
S. A. Mahlke, “A Comparison of Full and Partial Predicated Execution Support for ILP Processors,” Proc. 22nd Annual Symp. Computer Architecture, pp. 138-150, Jun. 1995.
H.M. Mathis, A.E. Mericas, J. D. McCalpin, R.J. Eickemeyer, and S.R. Kunkel, “Characterization of the multithreading (SMT) efficiency in Power5,” IBM J. Res. & Dev., 49:4/5 (July/September), 555–564.
S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-effective superscalar processors,” Proc. 24th Annual Symp. Computer Architecture, Jun. 1997.
K. Robberts-Hoffman, “ARM Cortex-A8 vs. Intel Atom: architecture and benchmark comparisons,” EE6304 Course Project, UT Dallas, Fall 2009.
B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner, “POWER5 system microarchitecture,” IBM J. Res. & Dev., 49:4/5, 505–521.
J.M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy, “Power4 system microarchitecture,” IBM J. Res. & Dev., 46:1, 5–26.
N. Tuck and D. Tullsen, “Initial observations of the simultaneous multithreading Pentium 4 processor,” Proc. 12th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2003, pp. 26–34.

Lecture 5: Thread-Level Parallelism

B. Busck, M. Engbom, S. Lee, M. Dubois, and P. Stenstrom, “Loop-level speculative parallelism in embedded applications,” Proc. Int’l Conf. Parallel Processing, 2007.
M. M. Islam, A. Busck, M. Engbom, S. Lee, M. Dubois, and P. Stenstrom, “Limits on thread-level speculative parallelism in embedded applications,” Proc. IEEE Int’l Symp. High-Performance Computer Architecture, 2007.
A. Kejariwal, M. Girkar, X. Tian, H. Saito, et al., “Exploitation of nested thread-level speculative parallelism on multi-core system,” Proc. 7th ACM Int’l Conf. Computing Frontiers, 2010.
D. Koufaty and D. T. Marr, “Hyperthreading technology in the NetBurst microarchitecture,” IEEE Micro, vol. 23, no. 2, Mar.-Apr. 2003.
J.L. Lo, J.S. Emer, H.M. Levy, R.L. Stamm, D.M. Tullsen, and S.J. Eggers, “Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading,” ACM Trans. Computer Systems (TOCS), vol. 15, no. 3, pp. 322-354, Aug. 1997.
D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm, “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor,” Proc. Symp. Computer Architecture, 1996.

Lecture 6: Data-Level Parallelism

L.A. Barroso and U. Holzle, The Datacenter as a Computer: an Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Publishers, 2009.
M. Chu, R. Ravindran, and S. Mahlke, “Data access partitioning for fine-grain parallelism on multicore architecture,” Proc. 40th IEEE/ACM Int’l Symp. Microarchitecture (MICRO), 2007.
J. Sampson, R. Gonzalez, J. Collard, et al., “Exploiting fine-grained data parallelism with chip multiprocessors and fast barriers,” Proc. 39th IEEE/ACM Int’l Symp. Microarchitecture, 2006, pp. 235-246.
Y. Yi, W. Han, A. Major, A. T. Erdogan, and T. Arslan, “Exploiting loop-level parallelism on multi-core architectures for the WiMAX physical layer,” Proc. IEEE Int’l SoC Conf., 2008.
H. Zhong, S. Lieberman, and S. A. Mahlke, “Extending multicore architectures to exploit hybrid parallelism in single-thread applications,” Proc. IEEE Int’l Symp. High-Performance Computer Architecture (HPCA), 2009.

Appendix A: Instruction Set Architecture

K. Diefendorff et al., “Altivec Extension to PowerPC Accelerates Media Processing,” IEEE Micro, vol. 20, no. 2, pp. 85-95, Sept. 2001.
A. Eden and T. Mudge, “The YAGS branch prediction scheme,” Proc. of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, 69–80.
J. Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, vol. 20, no. 5, pp. 12-23, Sept. 2000.
D.A. Jimenez and C. Lin, “Neural methods for dynamic branch prediction,” ACM Trans. Computer Sys., 20:4 (November), 369–397.
C. McNairy and D. Soltis, “Itanium 2 processor microarchitecture,” IEEE Micro, vol. 23, no. 2, pp. 44–55, Mar.-Apr. 2003.
J. E. Smith, “A Study of Branch Prediction Strategies,” Proc. 8th Annual Symp. Computer Architecture, pp. 135-148, May 1981.