In-Depth Analysis Of Intel 12th Generation Core Alder Lake, Thread Director, and Other Tech


Over the past two years, Intel has been getting further and further ahead of its own launches, making many technical details public months before the products actually ship. At this year's Intel Architecture Day 2021, Intel announced the architecture and technical details of the 12th Generation Core, codenamed Alder Lake.

Intel unveils its biggest architectural shifts in a generation for CPUs, GPUs and IPUs to satisfy the crushing demand for more compute performance.


As in previous years, Architecture Day is where Intel discloses details of its future chip designs: upcoming CPU and GPU architectures, new core designs, and new technologies. The headline of this year's event was the next-generation processor architecture, Alder Lake.

Intel Alder Lake Introduction

Alder Lake is a new architecture that Intel has spent years building, and it will be at the heart of the upcoming 12th generation Core processors.

Like previous Intel processor architectures, Alder Lake integrates CPU cores, a GPU, a memory controller, IO, display outputs, and AI acceleration. It is also Intel's first high-performance processor with a hybrid design of big and small cores, and it brings the following major changes.

Let's go through Alder Lake's changes one by one.

x86 big and small core heterogeneity

The biggest change in Alder Lake is its heterogeneous architecture of big and small cores. Intel previously trialed this kind of heterogeneity with Lakefield, which shipped in two official products (such as the ThinkPad X1 Fold), but those were low-power, low-performance parts. Alder Lake is therefore arguably the first high-performance x86 processor with a hybrid big/small core design, with deep improvements over what it inherits from Lakefield. Let's first look at Gracemont, which Intel calls the Efficient Core (E-Core).

The "small core" E-Core: Gracemont with overall performance close to Skylake at lower power

Intel's small cores come from a separate design lineage from the big cores, generally known as the Atom line. Gracemont is the direct successor to Tremont, continuing that lineage.

In the figure above, Tremont is on the left and Gracemont on the right. It is immediately obvious that Gracemont has far more execution ports, jumping from 10 to 17, with a corresponding increase in the number of execution units.

On the integer side, the number of ALUs grows from 3 to 4, the AGUs double from 2 to 4, and a set of MUL and DIV units has been added, so integer throughput is greatly enhanced. The floating-point side improves as well: where there used to be a single unit, there are now two FADD and two FMUL units, which can be paired to process 256-bit-wide data, enough to execute the AVX2 instruction set. The floating-point ALU and store-data (STD) ports each gain one unit, so floating-point throughput also rises significantly.
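
For reference, AVX2 is the widest vector extension the E-Cores expose, so code like the following 256-bit FMA loop runs on Gracemont, with each 256-bit operation handled by pairing the FP units as described above. This is a hypothetical saxpy-style kernel of my own, not from Intel's material; compile with -mavx2 -mfma.

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] += a * x[i] using 256-bit AVX2/FMA vectors (8 floats per iteration).
// Gracemont handles the 256-bit width by pairing its FP units, per the description above.
void saxpy_avx2(float a, const float* x, float* y, std::size_t n)
{
    const __m256 va = _mm256_set1_ps(a);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);   // vy = va * vx + vy
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                      // scalar tail for leftover elements
        y[i] += a * x[i];
}
```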

To feed the greatly expanded back end, the front end has been enhanced accordingly. The decoder keeps its two-cluster, 3-wide-per-cluster design, which can decode up to six instructions per cycle when both clusters are active. The L1 instruction cache (L1I) has been doubled to 64KB, and the branch predictor has been strengthened with larger history and target storage.

In the out-of-order middle of the core, the reorder buffer (ROB) grows to 256 entries, larger than Skylake's 224 and equal to Zen 3's.

Lastly, the cache subsystem: as mentioned, the AGUs have doubled from 2 to 4, split into 2 load and 2 store, while the L1 data cache stays at 32KB and the L2 cache can be up to 4MB. It is also worth noting that the small cores come in clusters of four, and one cluster of E-Cores occupies roughly the same die area as a single Golden Cove (i.e., one big core).

Altogether, these improvements add up to a considerable performance boost for Gracemont. In Intel's official comparison against Skylake, for single-threaded integer workloads Gracemont delivers more than 40% higher performance at the same power, or the same performance at about 40% less power, an excellent efficiency profile.

For multi-threading, at the same 4 threads, 4 Gracemont cores deliver the same integer performance as 2 Skylake cores with Hyper-Threading enabled while consuming 80% less power. Running flat out, the 4 Gracemont cores provide about 1.8 times the integer throughput, still at lower power.

Overall, Alder Lake uses Gracemont to raise the processor's total multi-threaded performance, while the small cores' excellent efficiency extends battery life in power-conscious scenarios.

The “big core” P-Core: Golden Cove with approximately 19% IPC improvement

The small core is plenty strong, and the big core, which Intel calls the Performance Core (P-Core) and which is based on Golden Cove, has changed even more. In Intel's own words, it has become wider, deeper, and smarter.

Wider means the core decodes and executes more instructions in parallel; deeper means the various buffers and caches inside the core have grown; and smarter means components such as the branch predictor make more accurate decisions.

Golden Cove's front end has changed considerably. The most obvious change is that the 4-wide decoder (in practice 4+1) that had remained unchanged for years has been upgraded to a 6-wide decoder (likely 6+1). Unlike Arm and other RISC designs, x86 is a CISC architecture, so widening the decoders comes at considerable cost; both AMD and Intel had kept their front ends at 4-wide, and Intel is now the first to move beyond that. To keep the 6-wide decoder fed, instruction fetch bandwidth from the L1I cache has also been doubled to 32 bytes per cycle.

Widening the decoder lengthens the pipeline, which makes the penalty for branch mispredictions heavier. Intel's answer is a much larger branch target buffer (BTB), raising the number of branch entries from 5K to 12K, nearly double Zen 3's 6.5K. The branch predictor itself has also become "smarter", with accuracy continuing to improve.
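
To get a feel for why misprediction penalties matter, here is a minimal, self-contained C++ micro-benchmark (my own illustration, not Intel's) that sums the same data twice: once in random order, where the threshold branch is nearly unpredictable, and once sorted, where it is almost perfectly predicted. On most out-of-order x86 cores the unpredictable pass is several times slower.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Sums only the elements above a threshold; the branch is what we measure.
static int64_t sum_above(const std::vector<int>& v, int threshold) {
    int64_t sum = 0;
    for (int x : v) {
        if (x > threshold)   // predictable if v is sorted, nearly random otherwise
            sum += x;
    }
    return sum;
}

template <class F>
static double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    volatile int64_t sink = 0;  // keeps the compiler from optimizing the work away
    double random_ms = time_ms([&] { sink = sink + sum_above(data, 128); });

    std::sort(data.begin(), data.end());  // sorted data makes the branch predictable
    double sorted_ms = time_ms([&] { sink = sink + sum_above(data, 128); });

    std::cout << "random order: " << random_ms << " ms, sorted: " << sorted_ms << " ms\n";
}
```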

Micro-op (µOP) delivery has increased from 6 per cycle, a figure that had not changed in years, to 8, while the µOP cache has continued to grow, from 2.25K to 4K entries, on par with Zen 2/Zen 3. The µOP queue has also been reorganized with Hyper-Threading in mind: each thread gets a 72-entry queue when two threads share the core, and the queues combine into a full 144 entries when a single thread has the core to itself.

The out-of-order middle of the core gets wider as well: the allocation/rename width grows from 5 to 6, and the reorder buffer expands from Sunny Cove's 352 entries to 512, closing in on the 600+ of Apple's Firestorm core, although a ROB this large also adds noticeably to core power. In addition, two more execution ports bring the total to 12, but integer and floating-point still share issue ports rather than moving to the separate schedulers that are popular elsewhere.

Although the ports are shared, Intel still discusses the integer and floating-point improvements separately, and the changes to the execution units themselves are relatively modest. As the two figures above show, the integer side gains one ALU; the FP side gains two dedicated FADD units, which are more efficient and lower latency than going through the FMA units; and the FMA units add FP16 support, useful for low-precision math, but since it is exposed only through the AVX-512 instruction set, it is unavailable on Alder Lake and we cannot take advantage of it.
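
Because AVX-512 (and with it the new FP16 path) may or may not be exposed depending on the part, software has to check at runtime rather than assume it. Below is a minimal sketch using GCC/Clang's built-in CPU feature checks; on Alder Lake you would expect AVX2 to report as available and AVX-512F not, so any AVX-512/FP16 code path should be gated behind such a check.

```cpp
#include <cstdio>

int main() {
    __builtin_cpu_init();  // initialize the runtime CPU feature data

    // GCC/Clang builtins that read CPUID feature bits at runtime.
    bool has_avx2    = __builtin_cpu_supports("avx2");
    bool has_avx512f = __builtin_cpu_supports("avx512f");  // foundation of AVX-512

    std::printf("AVX2:     %s\n", has_avx2 ? "yes" : "no");
    std::printf("AVX-512F: %s\n", has_avx512f ? "yes" : "no");
    return 0;
}
```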

The cache subsystem also gains a new port: an extra load AGU raises load bandwidth to 3 loads per cycle, on par with Zen 3. The L2 follows Willow Cove's design, still non-inclusive at 1.25MB per core, but adds a new prefetch mechanism that reduces the number of DRAM accesses.

Add everything up and Golden Cove averages roughly 19% higher performance at the same frequency (i.e., IPC) than Cypress Cove, with some workloads improving by as much as about 60%. Curiously, a few items in Intel's own chart regress slightly. Overall, Golden Cove is a thorough overhaul, probably the most heavily reworked core microarchitecture since Skylake.

Intel Thread Director: the key to scheduling big and small cores

The gains from both the big and small cores are significant, but how do you schedule threads so that each type of core plays to its strengths? Arm has already waded through this problem ahead of x86: big.LITTLE has been around for more than a decade, and mainstream operating systems, Windows included, have long since learned that a processor can contain cores of different performance levels. Intel's answer is a combined hardware and software solution called Thread Director.

At the operating system level, Intel and Microsoft have worked together to improve Windows' task scheduling. Starting with Windows 11, the scheduler receives more information about what level of performance each running thread actually needs and which instruction sets it uses, and it also knows how to make lower-priority work yield the fast cores to high-priority tasks.

On the hardware side, Intel has integrated a tiny microcontroller into the Alder Lake processor that monitors what each core is doing and characterizes every running thread: which instruction mix it executes, how demanding it is, and so on. This telemetry is fed back to Windows 11, which combines it with its own information to decide whether a thread should be migrated to a different core. The whole loop completes in roughly 30 microseconds or less, whereas a traditional scheduler might need more than 100 milliseconds to reach the same conclusion.
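
Thread Director itself is transparent to applications, but Windows does expose a documented hint that feeds into this scheduling: the power-throttling ("EcoQoS") thread attribute, which marks a thread as preferring efficiency over speed and therefore a good candidate for the E-Cores. The sketch below uses the public Win32 API; note this is an OS-level hint, not Thread Director's internal interface, and the background-worker scenario is just an assumed example.

```cpp
#include <windows.h>

// Mark the calling thread as "efficiency preferred" (EcoQoS). On a hybrid CPU,
// the Windows 11 scheduler tends to place such threads on the E-Cores.
static bool PreferEfficiency()
{
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = THREAD_POWER_THROTTLING_EXECUTION_SPEED;  // opt in to throttling

    return SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
}

int main()
{
    // Hypothetical background worker opting into EcoQoS before doing
    // long-running, latency-insensitive work.
    PreferEfficiency();
    // ... batch work goes here ...
    return 0;
}
```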

By default, Alder Lake still schedules threads onto the P-Cores first, falling back only when all the high-performance cores are busy. Intel divides Alder Lake's hardware threads into three performance tiers, as follows.

In general, the scheduler fills the 8 physical P-Core threads first; once those are occupied it moves on to the E-Cores, and only if that is still not enough does it resort to the P-Cores' Hyper-Threading sibling threads (whose performance is lower than an E-Core's). A 20-thread workload, for example, would use the 8 physical P-Core threads + the 8 E-Core threads + 4 of the P-Cores' Hyper-Threading threads.
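
For software that wants to know which kind of core it is currently running on, Intel documents a hybrid flag and a core-type field in CPUID. The sketch below uses GCC/Clang's <cpuid.h> helpers; the leaf and bit positions (hybrid flag in leaf 7 EDX bit 15, core type in leaf 0x1A EAX bits 31:24, with 0x20 meaning Atom/E-Core and 0x40 meaning Core/P-Core) are quoted from memory of Intel's documentation, so verify them against the SDM before relying on this.

```cpp
#include <cpuid.h>
#include <cstdio>

int main()
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

    // Leaf 7, sub-leaf 0: EDX bit 15 indicates a hybrid (P-Core + E-Core) part.
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 15))) {
        std::puts("Not a hybrid CPU (or leaf not supported).");
        return 0;
    }

    // Leaf 0x1A: EAX[31:24] is the core type of the *current* logical processor,
    // so pin the thread to a core first for the answer to be meaningful.
    if (__get_cpuid_count(0x1a, 0, &eax, &ebx, &ecx, &edx)) {
        unsigned core_type = eax >> 24;
        if (core_type == 0x40)      std::puts("Running on a P-Core (Core).");
        else if (core_type == 0x20) std::puts("Running on an E-Core (Atom).");
        else                        std::printf("Unknown core type 0x%02x\n", core_type);
    }
    return 0;
}
```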

Windows 10 can still schedule across big and small cores, but, simply put, it is not as smart about it; Alder Lake should show better energy efficiency under Windows 11.

DDR5 and LPDDR5 memory support, still compatible with DDR4 and LPDDR4

With the cores covered, we will skip the Xe GPU, which sees no substantial changes, and move straight to the other notable updates, starting with the memory controller.

As you can see, Alder Lake adds support for DDR5 and LPDDR5 memory: DDR5 at up to 4800 MT/s and LPDDR5 at up to 5200 MT/s out of the box. DDR5 modules will start shipping later this year, while LPDDR5 is already widely used in mobile devices; Tiger Lake was originally supposed to support LPDDR5 but, for various reasons, never did in shipping products. Once Alder Lake officially launches, expect plenty of thin-and-light notebooks with LPDDR5 memory.
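
As a back-of-envelope check on what those transfer rates mean, the sketch below computes theoretical peak bandwidth from the usual formula (transfer rate times total bus width); the 128-bit total bus (e.g., dual-channel DDR5) is my assumption for illustration, not a figure from Intel's slides.

```cpp
#include <cstdio>

// Theoretical peak bandwidth in GB/s: transfer rate (MT/s) * total bus width in bytes.
static double peak_gb_per_s(double mt_per_s, int total_bus_bits)
{
    return mt_per_s * (total_bus_bits / 8.0) / 1000.0;
}

int main()
{
    // Assumed 128-bit total memory bus, for illustration only.
    std::printf("DDR5-4800,   128-bit bus: %.1f GB/s\n", peak_gb_per_s(4800, 128)); // ~76.8
    std::printf("LPDDR5-5200, 128-bit bus: %.1f GB/s\n", peak_gb_per_s(5200, 128)); // ~83.2
    return 0;
}
```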

New IO with PCIe 5.0 support

Alder Lake's PCIe support is aggressive, jumping straight to the latest PCIe 5.0 and doubling per-slot bandwidth over PCIe 4.0 to roughly 64GB/s at x16. This is likely exclusive to the desktop platform, for power reasons. The CPU-attached x4 lanes introduced on Rocket Lake-S and Tiger Lake remain PCIe 4.0 and can be used for SSDs. Although not stated explicitly, the link to the PCH should be upgraded to DMI 4.0 at a minimum of x4 width, with the high-end PCH connected to the CPU over DMI 4.0 x8. The PCH itself can provide 12 PCIe 4.0 and 16 PCIe 3.0 lanes, so platform expandability is considerably better than before.
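
As a quick sanity check on that 64GB/s figure, the sketch below computes per-direction link bandwidth from the raw signaling rate and the 128b/130b encoding that PCIe 4.0 and 5.0 both use; PCIe 5.0 x16 works out to about 63GB/s, which marketing rounds up to 64GB/s.

```cpp
#include <cstdio>

// Per-direction PCIe link bandwidth in GB/s:
// raw GT/s per lane * lanes * encoding efficiency / 8 bits per byte.
static double pcie_gb_per_s(double gt_per_s, int lanes, double encoding_efficiency)
{
    return gt_per_s * lanes * encoding_efficiency / 8.0;
}

int main()
{
    const double eff = 128.0 / 130.0;  // 128b/130b line encoding, ~98.5% efficient
    std::printf("PCIe 4.0 x16: %.1f GB/s\n", pcie_gb_per_s(16.0, 16, eff)); // ~31.5
    std::printf("PCIe 5.0 x16: %.1f GB/s\n", pcie_gb_per_s(32.0, 16, eff)); // ~63.0
    return 0;
}
```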

Unified Alder Lake

Unlike the 11th generation, where desktop and mobile used different core designs, Alder Lake reunifies them under one architecture, though different platforms will of course still get different configurations.

On the desktop side, Alder Lake offers up to 8 big cores and 8 small cores, but no integrated Thunderbolt 4 controller, and integrated graphics still top out at 32 EUs. The highest-end mobile parts go up to 6 big cores and 8 small cores, with 96 EUs and 4 Thunderbolt controllers, plus the usual integrated image processing unit (IPU). The ultra-low-power thin-and-light tier, being more power sensitive, tops out at 2 big cores and 8 small cores, with the Thunderbolt controller count reduced to two.

Alder Lake represents one of the biggest architectural shifts Intel has made in years; both the reworked cores themselves and the hybrid big/small design are remarkably bold. It is genuinely surprising to see Intel deliver such an inventive new architecture, but we still have to wait for the official launch of the 12th generation Core processors to see how Alder Lake performs, and we are looking forward to it.

Intel also showed plenty of other interesting material at Architecture Day, such as the core architecture behind its upcoming Arc gaming graphics cards, which we will cover separately.

