It’s time for another annual architecture update from ARM. Yesterday evening, ARM launched its new mobile CPU designs: the Cortex-X2 super core, the Cortex-A710 big core, and the Cortex-A510 little core, which succeed the existing X1, A78, and A55 respectively.
After years in service, the A55 little core finally gets an update. All three CPUs are based on the Armv9 architecture and bring a new tier of performance; the X2 supports only 64-bit AArch64 instructions.
Let’s look at the Cortex-X2 first. According to ARM, the X2 is 16% faster than the X1 when built on the same process node and run at the same frequency. Peak performance has also been tuned, and machine learning (ML) performance is doubled.
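Because the 16% figure is quoted at the same node and frequency, it is effectively a per-clock gain. A rough way to reason about it is the simple model "performance scales with per-clock throughput times clock frequency". The sketch below applies that model; the function name and the 5% clock bump are illustrative assumptions, not figures from ARM.

```python
def relative_speed(per_clock_gain: float, clock_ratio: float = 1.0) -> float:
    """Simplified model: speed relative to the old core is
    (1 + per-clock gain) multiplied by the clock frequency ratio."""
    return (1.0 + per_clock_gain) * clock_ratio


# Same frequency as the X1: only the quoted 16% gain applies.
iso_frequency = relative_speed(0.16)

# Hypothetical 5% higher clock (an assumed value, for illustration only).
with_clock_bump = relative_speed(0.16, clock_ratio=1.05)

print(f"same clock: {iso_frequency:.3f}x, with 5% clock bump: {with_clock_bump:.3f}x")
```

Real chips rarely scale this cleanly, since memory latency and thermal limits also matter, so the model is only a first approximation.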
On the front end, branch prediction is decoupled from instruction fetch so that the predictor can run ahead of the core. Branch prediction accuracy has also improved, which reduces mispredictions and boosts performance for workloads with large instruction footprints.
On the core side, the pipeline has been shortened from 11 to 10 cycles, with the dispatch stage reduced from 2 cycles to 1. The out-of-order execution window has grown by up to 30 percent, from 224 to a maximum of 288 entries.
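As a quick sanity check on these figures, the relative change between two sizes can be computed directly. The sketch below assumes the commonly cited 224-entry window for the X1 and 288 entries for the X2; the helper name is illustrative.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Out-of-order window: assumed 224 entries on the X1, 288 on the X2.
window_growth = percent_change(224, 288)

# Pipeline depth: 11 cycles down to 10.
pipeline_change = percent_change(11, 10)

print(f"window: {window_growth:+.1f}%, pipeline: {pipeline_change:+.1f}%")
```

With these inputs the window grows by a little under 29%, which is consistent with the "up to 30 percent" wording.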
On the back end, the load-store window and structures have grown by 33%, which improves memory-level parallelism. The first-level data TLB is 20% larger, and data prefetching has been enhanced. In summary, ARM claims that the X2’s peak single-threaded performance is 40% higher than that of Intel’s Core i5-1135G7.
ARM Cortex-A710 and Cortex-A510
Next come the A710 and A510, also based on the 64-bit Armv9 instruction set. They are architecturally similar to the X2, so all three can be integrated into the same SoC.
Note, however, that the X2 and A510 are 64-bit only and no longer run 32-bit code, while the A710, designed this way at the request of Chinese customers, continues to support AArch32 at EL0 (that is, for user-space applications).
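On such a mixed SoC, software has to know which cores can still run 32-bit applications. Recent Linux kernels document a sysfs file, `/sys/devices/system/cpu/aarch32_el0`, that lists those CPUs in the usual kernel CPU-list format when only a subset is capable. The sketch below reads it; the path comes from the kernel ABI documentation rather than from ARM's announcement, and the helper names are illustrative assumptions.

```python
from __future__ import annotations

from pathlib import Path

# Present only on arm64 kernels where a subset of CPUs supports 32-bit EL0.
AARCH32_EL0 = Path("/sys/devices/system/cpu/aarch32_el0")


def parse_cpu_list(text: str) -> set[int]:
    """Parse the kernel CPU-list format, e.g. '0-3,7' -> {0, 1, 2, 3, 7}."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def aarch32_capable_cpus() -> set[int] | None:
    """Return the CPUs able to run 32-bit apps, or None if the file is absent
    (in which case either all or none of the CPUs support AArch32)."""
    try:
        return parse_cpu_list(AARCH32_EL0.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(aarch32_capable_cpus())
```

On a hypothetical design where only the A710 cores keep AArch32 support, the file would list just those core numbers, and a scheduler could use that set to restrict where 32-bit tasks run.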
The A710 also improves branch prediction accuracy. The first-level instruction TLB grows from 32 to 48 entries, but the macro-op cache stays at 1.5K entries (the X2 has 3K).
The width of the macro-op cache and the dispatch stage has been reduced from 6 to 5, mainly to optimize power consumption and energy efficiency. This is also an important distinction between the X and A series.
As a result, the Cortex-A710 is only 10% faster than the A78 (on the same node and at the same frequency), but energy efficiency improves by 30% and machine learning performance doubles.
Finally, the A510 is the most important upgrade. Compared with the years-old A55, performance improves by 35% to 62% and power consumption drops by 20%, while machine learning performance triples. According to ARM, the A510 comes close to an earlier A-series big core in performance. In other words, future low-end and mid-range devices built on SoCs with the A510 will see a considerable performance boost.
ARM Mali-G710, Mali-G610, Mali-G510 and Mali-G310 GPU Line-up
ARM shipped over 1 billion Mali GPUs in 2020, and Mali now powers half of all smartphones and 80% of smart TVs. Along with the upgraded CPU cores, ARM is bringing out its widest range of GPU designs yet, covering every category of smartphone.
The new Mali-G710 sits at the top. It is 20% faster and 20% more energy efficient than older designs, and gets a 35% boost in ML tasks. The G710 will feature in future flagship smartphones as well as Chromebooks. The ML speed boost will come in handy for improved image enhancement and for enabling new video modes.
Below that is the Mali-G610, which is based on the G710, though it targets a lower price point and can be used in high-end phones. The Mali-G510 is twice as fast and 22% more energy efficient than older designs (ML performance is doubled too). This will become a mainstay of mid-range phones, smart TVs, and set-top boxes.
The Mali-G310 is the second most exciting part of the announcement after the A510. Together, these two will change the experience at the lower end. The G310 promises a 4.5x uplift in Vulkan performance, texture units that are 6x faster, and doubled Android UI rendering performance.