ARM Cortex-X1 Mega-core Evolution and Comparison with Cortex-A78

ARM Cortex-X1 Mega-core Evolution

Last year, ARM officially released its Cortex-A78 and Cortex-X1 architectures, the former positioned as a big core, the latter as a mega-core. The Cortex-X series is ARM’s new high-performance core line; its first product, the Cortex-X1, delivers 30% higher performance than the A77, 22% higher than the A78, and a 100% improvement in machine-learning capability.

The Cortex-X1 also allows customers to customize and build in more differentiated features, though this requires customer involvement at an early development stage. Today, Xiaomi shared some detailed highlights of the Cortex-X1 mega-core used in the Snapdragon 888 SoC in the Xiaomi 11.

Xiaomi phones have long been among the first to carry Snapdragon 8-series flagship mobile platforms, and this time the Xiaomi 11 is the world’s first phone with the Snapdragon 888 flagship processor, bringing the most cutting-edge mobile technology to consumers.

The most significant enhancement of the Snapdragon 888 is the introduction of the ARM Cortex-X1 mega-core, the ultimate architecture in the pure pursuit of performance. The X1 is about 2.3 times larger than the A78 core, and that giant size brings giant power: peak performance is up 30% over the previous-generation A77, an unprecedented leap that truly opens the era of the mega-core in the Android camp.

ARM Cortex-X1

130% peak performance: perhaps the biggest generational improvement

A cell phone processor faces many constraints in its design. For a long time, designers have had to balance chip area against energy efficiency, but more and more high-load scenarios are placing higher demands on processor performance. The design goal of the Cortex-X1 is to pursue the ultimate peak performance, so that phones perform better under large-scale, complex computation.

In contrast, the Cortex-A78, also a new architecture, focuses more on sustained performance. If the Cortex-A78 is a marathon runner who balances speed and endurance, then the Cortex-X1 is a 100-meter sprinter with great explosive power. The combination of the two makes the Snapdragon 888 an all-rounder that can both sprint and sustain the lead.

The Cortex-X1 delivers a 30% increase in peak performance compared to the Cortex-A77 and a 22% increase compared to the latest-generation Cortex-A78. This is the first true mega-core for Android.

How is such strong performance achieved?

The working process of a cell phone processor can be roughly divided into three parts: front-end instruction prediction and prefetch, mid-end instruction decode and dispatch, and back-end instruction execution. That is, the processor first predicts and fetches the instructions and data it is likely to need from memory into the cache; after successive stages of processing and transmission, the execution units perform the calculations and output the results, which the memory-management unit then passes on to other units such as the DPU and GPU.
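As a rough sketch of the fetch/decode/execute flow described above, here is a toy three-stage "processor" in Python. The instruction format, register names, and stage functions are all invented for illustration; a real core overlaps these stages every cycle rather than running them sequentially.

```python
# Toy model of the three broad pipeline stages: front-end fetch,
# mid-end decode, back-end execute. Purely illustrative.

def front_end_fetch(program, pc):
    """Fetch the next instruction from 'memory' (a list of strings)."""
    return program[pc] if pc < len(program) else None

def mid_end_decode(instr):
    """Decode a textual instruction into (op, operands)."""
    op, *operands = instr.split()
    return op, operands

def back_end_execute(op, operands, regs):
    """Execute the decoded operation against a register file (a dict)."""
    if op == "mov":
        regs[operands[0]] = int(operands[1])
    elif op == "add":
        regs[operands[0]] = regs[operands[1]] + regs[operands[2]]
    return regs

def run(program):
    regs, pc = {}, 0
    while (instr := front_end_fetch(program, pc)) is not None:
        op, operands = mid_end_decode(instr)
        regs = back_end_execute(op, operands, regs)
        pc += 1
    return regs

print(run(["mov r0 2", "mov r1 3", "add r2 r0 r1"]))
# {'r0': 2, 'r1': 3, 'r2': 5}
```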

In high-load scenarios such as gaming and video editing, the volume of instantaneous concurrent data is very large, and the processor must handle that data in parallel. This parallel throughput is exactly where the Cortex-X1 mega-core focuses its gains.

A careful comparison of the two detailed designs shows that the Cortex-X1 doubles most of its resources relative to the Cortex-A78.

2x L2 cache improves instruction and data prefetch hit rate

The working process of a cell phone processor looks a bit complicated, but it is much easier to understand if we compare it to a highway transportation system. A transport task begins with dispatching vehicles: the vehicles that will be needed are gathered from various locations and held on standby, waiting to be sent out. The L2 cache, an important part of the processor used to store instructions and data, is the parking area for these spare vehicles.

In the processor, this corresponds to the front-end instruction prediction and prefetch stage. The processor first predicts which instructions and data will be needed, reads them from the L3 cache or external memory into the L2 cache, and then reads them from the L2 cache into the L1 instruction cache and L1 data cache, respectively.

Compared to the Cortex-A78, the Cortex-X1 directly doubles the L2 cache capacity. The increase in L2 cache capacity means that more instructions and data can be prefetched for backup, resulting in a higher instruction and data prediction hit rate and a lower impact on execution efficiency from re-reading resources due to prediction errors.
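The payoff from a larger cache can be sketched with a toy fully-associative LRU cache simulation in Python. The access trace, capacities, and replacement policy below are invented for illustration and are not Cortex-X1 parameters; the point is only that when a working set fits after doubling capacity, the hit rate jumps.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Replay an address trace through a toy fully-associative LRU cache
    and return the fraction of accesses that hit."""
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[addr] = True
    return hits / len(trace)

# Synthetic trace: a loop over a working set of 6 addresses, repeated.
# The small cache thrashes; the doubled cache holds the whole set.
trace = [addr for _ in range(100) for addr in range(6)]

small = lru_hit_rate(trace, capacity=4)   # working set does not fit
large = lru_hit_rate(trace, capacity=8)   # doubled capacity: it fits
print(small, large)
```

With the cyclic trace above, the 4-entry cache evicts each line just before it is reused (hit rate 0), while the 8-entry cache misses only on the first pass.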

At the same time, the cache bandwidth has been increased as well: the Cortex-X1 doubles the bandwidth of both its L1 data cache and L2 cache, preventing bandwidth from becoming a bottleneck for data transfers. It is as if a two-way, two-lane highway instantly became a two-way, four-lane one, greatly increasing traffic capacity and letting more vehicles travel unimpeded.

1.25X Instruction Decoding Performance Improves Parallel Instruction Processing Capability

Instructions read into the L1 instruction cache still need to be decoded before the execution units can process them. To enhance decoding efficiency, the Cortex-X1 widens its instruction decoder, raising decode capacity to 1.25 times that of the Cortex-A78. This is equivalent to adding toll windows to the highway to improve throughput and reduce congestion.
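A back-of-the-envelope sketch of what 1.25x decode capacity means: assuming (for illustration only) a 4-wide decoder widened to 5-wide, which is consistent with the 1.25x figure, the cycles spent decoding a block of instructions shrink proportionally.

```python
import math

def decode_cycles(num_instructions, decode_width):
    """Cycles a toy decoder needs for a block of instructions,
    decoding 'decode_width' instructions per cycle."""
    return math.ceil(num_instructions / decode_width)

n = 100                            # instructions in a hot loop (made-up)
cycles_narrow = decode_cycles(n, 4)  # assumed 4-wide decode
cycles_wide = decode_cycles(n, 5)    # assumed 5-wide decode (1.25x)
print(cycles_narrow, cycles_wide)    # 25 20
```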

After decoding, the macro-operations (MOPs) are split into smaller micro-operations (μOPs) and sent to the reorder buffer, where they await scheduling and final execution by the execution units. When an instruction depends on other instructions or data, it must wait in the reorder buffer; MOPs that need to be reused are temporarily stored in the MOP buffer.

The higher decoding throughput allows more instructions to proceed in parallel, and the Cortex-X1’s MOP buffer and reorder buffer are enlarged significantly to carry the extra micro-operations (μOPs): the MOP buffer is 100% larger and the reorder buffer 1.4 times larger than the A78’s. These buffers are equivalent to a service area on a highway, where a larger area allows more vehicles to wait for dispatch. As a result, the processor’s ability to process instructions is greatly increased.
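Why a larger reorder buffer helps can be sketched with a toy out-of-order model: when a long-latency instruction (say, a cache-missing load) blocks in-order retirement, a bigger window lets later independent instructions execute "under" it. Everything here — latencies, window sizes, issue width — is invented for illustration, not an A78/X1 model.

```python
def run_window(latencies, window, width=4):
    """Toy out-of-order core. 'window' models the reorder-buffer size,
    'width' the per-cycle issue/retire width. Returns total cycles to
    retire all instructions. Illustrative only."""
    n = len(latencies)
    done_at = [None] * n      # cycle at which each instruction finishes
    head, cycle = 0, 0        # oldest un-retired instruction, clock
    while head < n:
        cycle += 1
        issued = 0
        # issue up to 'width' waiting instructions inside the window
        for i in range(head, min(head + window, n)):
            if done_at[i] is None:
                if issued == width:
                    break
                done_at[i] = cycle + latencies[i]
                issued += 1
        # retire finished instructions strictly in program order
        retired = 0
        while (head < n and done_at[head] is not None
               and done_at[head] <= cycle and retired < width):
            head += 1
            retired += 1
    return cycle

# Synthetic stream: a 20-cycle "load" followed by fifteen 1-cycle ops,
# repeated 8 times (all numbers invented).
lat = ([20] + [1] * 15) * 8
cycles_small = run_window(lat, window=8)    # small reorder buffer
cycles_large = run_window(lat, window=32)   # 4x larger buffer
print(cycles_small, cycles_large)
```

With the small window, each load serializes the stream; with the larger window, later instructions (including the next load) start early, so total cycles drop.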

2x NEON instruction execution units, 100% improvement in machine learning performance

When a convoy reaches its destination, the goods must be unloaded and processed immediately; likewise, the execution units are the final destination of instructions and data. The execution units process data according to the instructions and are the core of the Cortex-X1.

The Cortex-X1 doubles the number of NEON instruction execution units, doubling its machine-learning capability. This greatly benefits the growing demand for AI and multimedia computing on phones, effectively enhancing practical experiences such as voice/face recognition, audio and video codecs, and game rendering.
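NEON is ARM’s SIMD unit: one instruction operates on several data lanes at once, and doubling the units doubles how many lanes complete per cycle. The Python sketch below only illustrates that idea; the lane width, unit count, and cycle accounting are simplified assumptions, not real NEON behavior.

```python
def scalar_add(a, b):
    """Element-wise add, one element per 'cycle'."""
    out, cycles = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        cycles += 1
    return out, cycles

def simd_add(a, b, lanes=4, units=2):
    """SIMD-style add: each unit handles 'lanes' elements per cycle,
    and 'units' execution units work in parallel (the X1 doubles them)."""
    out, cycles, step = [], 0, lanes * units
    for i in range(0, len(a), step):
        out.extend(x + y for x, y in zip(a[i:i + step], b[i:i + step]))
        cycles += 1
    return out, cycles

a, b = list(range(16)), list(range(16))
print(scalar_add(a, b)[1])         # 16 cycles, one element at a time
print(simd_add(a, b, units=1)[1])  # 4 cycles with one 4-lane unit
print(simd_add(a, b, units=2)[1])  # 2 cycles with two units: 2x throughput
```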

More than ten key module performance enhancements in total

The larger parking area, extra toll windows, additional lanes, and bigger unloading and processing areas at the destination greatly enhance the capacity of this whole highway transportation system.

In addition, the Cortex-X1 has 1.5 times the BTB (branch target buffer) capacity, 1.33 times the dynamic access units, and more than ten key module enhancements in total, ensuring that the execution units can access the required instructions and data on demand. The X1 core’s 2.3x larger area relative to the Cortex-A78 is a visual demonstration of this series of comprehensive improvements.

Android flagship performance watershed

The Snapdragon 888 achieves even greater peak performance thanks to the vastly increased ability of the Cortex-X1 to process data in parallel. Whether launching applications, using multi-window, running two instances of a game, or playing large titles that rely heavily on CPU performance, you can experience the leap in real-world experience the mega-core brings. This is a watershed moment for Android flagship performance.

Previously, the functions a phone carried were relatively limited, but in the future phones will serve a much wider range of application scenarios. Increasingly rich device forms, highly integrated usage scenarios, and ever more sophisticated computational photography all place higher demands on platform computing power. The X1, too, is built for these cutting-edge applications.

The Xiaomi 11 is the world’s first Snapdragon 888 phone, relying on the extreme performance of the X1 mega-core to bring a more advanced, higher-quality experience to users worldwide. It is this leading, ahead-of-its-time configuration that makes the Xiaomi 11 a top flagship. Following the Snapdragon 888, the Samsung Galaxy S21 series also adopted the Exynos 2100, a 5nm SoC with a Cortex-X1 mega-core.

Source 1, Source 2
