Samsung CPU architecture details revealed: a weapon for challenging Qualcomm and Huawei?

(This article was translated from AnandTech by the WeChat public account Semiconductor Industry Observation (ID: Icbank) and is reprinted here for sharing.)

As one of the highlights of this year's HotChips conference, we are very glad to see Samsung finally give an official presentation of its newest CPU design, the Exynos M3.

In January this year, the media first reported on Samsung's new microarchitecture. It has been clear to us ever since that this is a design that cannot be ignored: Samsung achieved a very large jump in performance, of a kind we have not seen in silicon designs in recent years.

In the months that followed, however, less and less was disclosed about the new Exynos 9810 and its M3 core. We did a lot of exploration ourselves in the meantime, but could never quite see into its heart. Any additional material from Samsung is therefore a valuable reference.

Looking back at how Samsung developed this family of microarchitectures, it has been a welcome source of innovation and momentum for the industry.

At HotChips 2016, Samsung presented its first-generation microarchitecture, the Exynos M1. As we have learned, Samsung's CPU IP is developed by the Samsung Austin R&D Center (SARC) in Austin, Texas. The center was established in 2010 with the goal of building in-house IP for Samsung's S.LSI division and its Exynos chipsets. It employs senior experts from AMD, Intel, and other companies and universities; custom memory controllers and interconnects are among its work as well. Of course, Samsung's first custom CPU is the star among these.

Samsung later introduced the M2. It is worth noting that because the M2 improved IPC by 20% across workloads, its performance still exceeded that of the M1 even though the production chip's clock speed was 12% lower. Samsung implemented in the M2 a few features originally planned for the M3, which allowed the new M3 design to become more aggressive.
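
A rough first-order model shows why a 12% lower clock can still leave the M2 ahead. The sketch below assumes performance scales as IPC times clock frequency, which ignores memory and thermal effects; the function name is ours and purely illustrative.

```python
def relative_performance(ipc_gain: float, clock_change: float) -> float:
    """First-order model: performance ~ IPC x clock frequency.

    Both arguments are fractional changes versus the baseline core,
    e.g. 0.20 for +20% IPC and -0.12 for a 12% lower clock.
    """
    return (1.0 + ipc_gain) * (1.0 + clock_change)


# Figures quoted in the text for the M2 relative to the M1.
m2_vs_m1 = relative_performance(ipc_gain=0.20, clock_change=-0.12)
print(f"M2 vs. M1: {m2_vs_m1:.3f}x")
```

Under this simplified model the M2 lands between 5% and 6% ahead of the M1 despite the lower clock.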

Here Samsung made clear one of the most unforgiving aspects of the industry: within a release cycle, the IP and the chip must stay in sync. We see many SoC vendors rushing products to market in order to catch the commercial release window of a new device.

Comparing the Exynos M3 overview slide with the original M1 one, we see many similarities, but the M3 brings a lot more to the table. The SARC team widened the microarchitecture's decode unit from 4-wide to 6-wide, which is the defining feature of the new μarch. We also see a newly added second integer ALU with multiplier capability, a second load unit, and a considerably expanded floating-point/SIMD block with three times the compute capacity.

Samsung never really disclosed the M2 microarchitecture, and there is no specific compiler machine model for it either. In today's disclosure, one change we see is that Samsung went from a 96-entry to a 100-entry reorder buffer. As we mentioned when the μarch was first revealed in January, the M3 expands this greatly to 228 entries, which in this respect makes the μarch look more like Intel's core designs (although we cannot compare directly across different ISAs, since instruction density and complexity differ).

When Arm announced the μarch details of the A76, it disclosed a 128-entry ROB, which looks small next to the M3's. In Arm's view, this is a balance between performance and area/power. Arm pointed out in particular that a 7% increase in ROB capacity brings a performance gain of only 1%.

Samsung explains that ROB capacity is a design choice closely tied to the rest of the microarchitecture, including the capacities of the various buffers and back-end schedulers. μarch width and ROB capacity complement each other: a wider μarch such as the M3 can fill the ROB more quickly and therefore extract more performance from a larger one. Overall, weighing performance gains against cost, the M3 made a different design choice from the M1/M2.

A bigger front-end

Looking deeper into the front-end, we see various improvements to the branch predictor and the fetch unit. The M1's branch predictor differed from other μarchs in that it could take two branches per cycle and had two branch ports in the back-end. The M3 appears to keep this width but doubles the μBTB from 64 to 128 entries. The main BTB stays at 4K entries, but taken-branch latency has improved noticeably.

In addition, branch predictor quality has improved overall, reducing mispredicted branches by 15% on average. Interestingly, Samsung actually published a real MPKI (Misses Per Kilo Instructions) value, something we have not seen from Arm (or any other vendor?) to date. Samsung monitors a constantly growing set of 4000 to 6000 code traces taken from a wide variety of applications and use cases, in order to validate performance during development.

The branch predictor and the fetch unit are separated by a decoupled address queue and a decoupled instruction queue respectively, which should allow these units to be clock-gated in the implementation.

The bandwidth of the fetch unit has doubled: it can now read up to 48 bytes per cycle, equivalent to 12 32-bit instructions per cycle. This brings the fetch-to-decode ratio to 2:1, noticeably up from the M1's 1.5:1 (24B/cycle with 4-wide decode). Samsung explains that such a large increase is needed to cope with the growing branch-bubble problem on wider microarchitectures. They acknowledge that, on average, the distance between taken branches is less than 12 instructions, but the larger width helps a great deal with temporary bursts of instructions.
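
The fetch-to-decode ratios quoted above follow directly from the fetch bandwidth and the decode width. The short sketch below reproduces that arithmetic; the helper name is ours, and it assumes every instruction is a fixed 32-bit encoding, as the text does.

```python
INSTRUCTION_BYTES = 4  # fixed 32-bit instruction encoding, as in the text


def fetch_to_decode_ratio(fetch_bytes_per_cycle: int, decode_width: int) -> float:
    """Instructions fetched per cycle divided by instructions decoded per cycle."""
    fetched_per_cycle = fetch_bytes_per_cycle / INSTRUCTION_BYTES
    return fetched_per_cycle / decode_width


# M1: 24 bytes/cycle feeding a 4-wide decoder.
m1_ratio = fetch_to_decode_ratio(fetch_bytes_per_cycle=24, decode_width=4)
# M3: 48 bytes/cycle feeding a 6-wide decoder.
m3_ratio = fetch_to_decode_ratio(fetch_bytes_per_cycle=48, decode_width=6)
print(f"M1 {m1_ratio}:1, M3 {m3_ratio}:1")
```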

Although this change leads to high instantaneous power usage, it has a net positive effect on overall power: whenever the instruction queue (now twice as deep) fills faster than the decode unit can drain it, the fetch unit can be clock-gated. Overall energy efficiency here is more closely tied to the quality of the branch predictor, because how fast instructions are fetched matters little if they turn out not to be needed.

The instruction cache (L1I) is 64KB. We are not sure whether this is an increase over the M2, because it is hard to measure, but it is certainly double that of the M1 μarch.

The instruction translation lookaside buffer (ITLB) has grown from 256 to 512 entries. Note that Samsung uses a three-level hierarchy, unlike the structure we see in Arm's processors. The A75 and A76 have first-level μITLBs of 32 and 48 entries respectively, backed by a main TLB of 1280 entries in total: 1024 entries (for page sizes up to 64KB) plus a secondary 256-entry table (for page sizes of 1MB or larger).

Samsung also has first-level data and instruction TLBs, but did not disclose the size of the L1 ITLB.

Middle-Machine: wider decode, rename and dispatch

Moving on to the middle machine (decode, rename, dispatch), we see the 1.5 times wider decode unit. Samsung did not disclose any details here, but it did improve instruction/μOP fusion. Rename and dispatch throughput match the decode width. It is important not to read too much into this or into comparisons with Arm's CPU cores, because we are talking about different kinds of μOPs from different vendors. Samsung's μarch has supported multi-dispatch since the M1: a μOP issued by the decoder can be dispatched to several schedulers at once, yet in the ROB it is still counted as a single dispatch and a single entry.

In the integer core, we see two additional dispatch ports, so the M3 can now issue 9 μOps rather than the 7 μOps of the previous generation. One of the new ports is an additional ALU with multiplication capability, which doubles MUL throughput and increases simple integer arithmetic throughput by 25%.

The other added port is a second load AGU, which doubles the core's load bandwidth.

The “beast” of a floating-point unit

In the floating-point core, we see a “beast” that is very different from the previous μarchs. Samsung added a third pipe, increasing the number of μOPs dispatched to and issued in the FPU. In terms of raw floating-point capability, the M3 has three 128b FMAC/FADD units, tripling multiply and add throughput. This means throughput grows from 3 FLOPS (1x FMAC (2) + 1x FADD (1)) to 6 FLOPS (3x FMAC (2)).
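
The FLOPS figures above can be re-derived from the per-operation counts shown in parentheses. The following is only a sketch of that bookkeeping; the dictionary and function names are ours, and the convention (a fused multiply-accumulate counts as 2, a plain add as 1) is taken from the figures quoted in the text.

```python
# FLOPs credited to one operation, following the counts quoted in the text:
# a fused multiply-accumulate counts as 2, a plain add as 1.
FLOPS_PER_OP = {"FMAC": 2, "FADD": 1}


def peak_flops_per_cycle(issue_mix: dict) -> int:
    """Sum the FLOPs of the operations that can be issued in one cycle."""
    return sum(count * FLOPS_PER_OP[op] for op, count in issue_mix.items())


# Previous generation: one FMAC pipe plus one FADD pipe.
previous_gen = peak_flops_per_cycle({"FMAC": 1, "FADD": 1})
# M3: three pipes, each capable of an FMAC.
m3 = peak_flops_per_cycle({"FMAC": 3})
print(previous_gen, m3)
```

Counted this way, peak FLOPS double, while the number of pipes able to execute a multiply or an add each triples, which is where the threefold throughput claim comes from.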

Because of this sharp jump in execution throughput, the scheduler and the physical register file had to grow as well: the scheduler goes from 32 to 62 entries, and the FP PRF from 96 to 192.

Samsung has kept working to reduce execution latency, and this applies to the floating-point pipelines too. The multiply unit has gone from 4 cycles to 3, which also helps the FMAC drop from 5 cycles to 4. Simple floating-point adds fall from 3 cycles to 2, while FDIV moves up to a Radix-64 unit, significantly reducing division latency.

As an aside, I remember Arm heavily promoting the A76's new floating-point pipelines, which had been in the works for several years; they were very proud of the “state-of-the-art” VX datapath in the new core. But Samsung seems to have beaten Arm here, because the M3 has the same floating-point latencies together with higher execution throughput and lower-latency ASIMD operations. Once we can test silicon side by side, we will compare these in more detail.

New load/store unit

In the load/store unit, the addition of a second 128b load port again doubles load bandwidth. Load-use latency stays unchanged at 4 cycles. Memory bandwidth has improved correspondingly. This generation, the M3 has a twofold bandwidth advantage, because its two load units each work at 128b/cycle, whereas the A75's work at 64b/cycle.

Overall, the capacity of the load/store schedulers has increased. For the store buffer we have no exact figure, but it appears to have roughly doubled. To better serve the wider μarch, the number of outstanding misses on the L1 data cache has grown from 8 to 12. This means that during cache misses the unit can serve up to 12 subsequent data requests while the core/system fetches data from higher cache levels or from memory. Considering the machine width of the M3 μarch, this seems quite low. Arm has not published this figure for the A75 and earlier products, but it made memory-level parallelism (MLP) a focus of the A76, whose L1D supports up to 20 outstanding misses. That is more than the M3 can do.

Samsung's prefetchers therefore need to be of top quality in order to avoid any memory bottleneck and come as close as possible to ideal cache-hit operation. Indeed, they say the new “hybridized” prefetchers have been improved. Hybridized here essentially means that there can be multiple prefetchers, or that a single prefetcher can handle different types of memory patterns.

The slide again mentions the new TLB hierarchy that we described earlier on the instruction side. On the data side, we see the same 32-entry micro-DTLB as in the M1, but there is now a brand-new mid-level DTLB with 512 entries. The instruction and data TLBs are now backed by a larger unified L2 TLB with 4096 entries, up from only 1024 entries in the previous generation.

Core pipeline: everything has a cost

Widening a microarchitecture comes at a price. Compared with the Exynos M1, the M3 adds two cycles to its pipeline depth: a secondary dispatch stage and a second stage for register read. CPU pipeline depth is usually counted from branch prediction/fetch to register write-back; by this measure, the M3 has 17 stages, versus 15 for the M1 and 13 for the A75 and A76.

The branch misprediction penalty is 16 cycles, because one drive cycle is needed to return to the front-end; this is 2 cycles more than the M1's 14-cycle penalty. Samsung would not say whether the μarch has any shortcuts between stages that could reduce latency in critical cases. Where the M3 and M1 fall behind their Arm counterparts is in the 3-versus-2-stage fetch and decode units (2 stages in total), the 2-versus-1-stage rename unit (1 stage), and the need for a second dispatch stage (1 stage).

Samsung acknowledges that this is a negative, but calls it a “necessary evil” for making the larger μarch work as planned. Although the machine does very well at avoiding branch mispredictions, this is a major cost of the new μarch.
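
To get a feel for how the longer penalty trades off against the better predictor, one can multiply mispredictions per kilo-instruction by the penalty in cycles. The sketch below does this under two loud assumptions: the baseline MPKI is a made-up placeholder (the text does not quote Samsung's figure), and the 15% reduction in mispredictions is taken to be relative to the M1, which the source does not state.

```python
def mispredict_cycles_per_kilo_instr(mpki: float, penalty_cycles: int) -> float:
    """Cycles lost to branch mispredictions per 1000 instructions."""
    return mpki * penalty_cycles


# Hypothetical baseline, NOT a Samsung figure; chosen only for illustration.
HYPOTHETICAL_M1_MPKI = 4.0
# Assumption: the 15% average reduction quoted earlier applies versus the M1.
MISPREDICT_REDUCTION = 0.15

m1_cost = mispredict_cycles_per_kilo_instr(HYPOTHETICAL_M1_MPKI, penalty_cycles=14)
m3_cost = mispredict_cycles_per_kilo_instr(
    HYPOTHETICAL_M1_MPKI * (1.0 - MISPREDICT_REDUCTION), penalty_cycles=16
)
print(f"M1: {m1_cost:.1f} cycles/kI, M3: {m3_cost:.1f} cycles/kI")
```

Whatever baseline is plugged in, the ratio is 0.85 x 16 / 14, or about 0.97, so under these assumptions the better predictor roughly cancels out the two extra penalty cycles.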

Overall, this choice has not actually given Samsung's microarchitecture a very large clock-frequency advantage in real products. It mainly seems to give them more room in physical design and in constraining critical paths, so that higher frequencies can be reached at reasonable voltages.

A new three-level cache hierarchy

Moving away from the CPU core itself, we look at the new L2/L3 cache hierarchy. Like the A75 and A76, the M3 introduces a new private L2 cache as an intermediate level between the core and the shared last-level cache. The new private L2 is inclusive of the lower-level data caches and has a capacity of 512KB per core. Compared with the shared L2 in the M1, access latency drops from 22 cycles to 12 cycles. Here Samsung is at a disadvantage versus Arm's A75, whose figures show an L2 hit latency of only 8 cycles. It should be noted that in actual silicon this number may rise as a result of RAM and physical-layout design choices. In practice, the Snapdragon 845 measures an L2 latency of ~4.4ns at 2.8GHz, while the 2.7GHz Exynos 9810 comes in at about 4.6ns.
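
Converting the quoted cycle counts into nanoseconds makes the gap between design figures and measured silicon easier to see. The sketch below is ours and assumes the L2 runs at the core clock; the measured values are the ones quoted in the paragraph above.

```python
def cycles_to_ns(cycles: int, freq_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at a given frequency."""
    return cycles / freq_ghz


# Nominal L2 hit latency from the quoted cycle counts (assumes L2 at core clock).
nominal_ns = {
    "Exynos 9810 (M3)": cycles_to_ns(12, freq_ghz=2.7),
    "Snapdragon 845 (A75)": cycles_to_ns(8, freq_ghz=2.8),
}
# Measured L2 latencies quoted in the text, in nanoseconds.
measured_ns = {"Exynos 9810 (M3)": 4.6, "Snapdragon 845 (A75)": 4.4}

for soc, nominal in nominal_ns.items():
    print(f"{soc}: nominal {nominal:.2f} ns, measured ~{measured_ns[soc]} ns")
```

On these numbers the M3's measured latency sits close to its nominal value, whereas the Snapdragon 845's measured latency is well above what 8 cycles at 2.8GHz would give, consistent with the remark that the figure may rise in real implementations.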

The bandwidth of the L2 cache has also doubled, now reaching 32B/cycle versus the M1's 16B/cycle. By comparison, the A75 reads from its L2 at 16B/cycle and writes at 32B/cycle.

When Samsung first announced how the L3 cache of its Exynos 9810 works, we were somewhat confused. We eventually got clarification: Arm does not actually allow third-party cores to plug into its DynamIQ cluster/L3 system, which means that the silicon implementation in the new SoC has nothing to do with Arm's counterpart.

Here we see a large 4MB cache implemented as a NUCA (Non-Uniform Cache Architecture). There are four 1MB slices in total, each located next to a CPU core. Because of this non-uniform arrangement, access latency differs between cores and slices. A core accessing its adjacent slice sees a latency of 32 cycles, while an access between a CPU and the farthest slice takes 44 cycles. Samsung quoted an average latency of 37 cycles for the typical case.

Compared with Arm's solution, the M3 looks weaker here.

The L3 hit latency of the Arm A75 is 25 cycles. In practice, we see the Snapdragon 845 reach ~11.4ns, while the Exynos 9810 measures between 11ns and 20ns. Although the DSU's lower maximum clock may be a drawback, it is actually also an advantage: in the opposite situation, when the CPU cores' clock frequency is reduced, they can still use the fast-running DSU/L3 cache and its lower latency. By contrast, the M3's cache hierarchy slows down together with its CPU cores.

The bus unit of the M1/M2 handled up to 28 outstanding misses, while the M3 handles up to 80. Whether this applies to the L3 or in some way also includes the L2 slices is not defined in the slide. Arm never discussed this capability for the A75, but it specifies that the A76 can handle 46 outstanding misses on the L2 cache and 94 outstanding misses on the DSU's L3.

Data partitioning between the L3 slices is determined by an address hash, and all slices are powered up at the same time. By comparison, the DSU by default uses two slices in larger SoC implementations, each of which can be half powered down, giving a power-management granularity of 1/4 of the L3. I am not sure how the SD845 implements this.

Finally, Samsung explains that the sliced design aims at better configurability for design implementations beyond high-end mobile devices, although mobile of course remains the foremost consideration. S.LSI is also working hard in the automotive sector.

On the cache hierarchy overall, Samsung acknowledges that the end result did not truly reach the level they wanted. It is what it is because of the trade-offs that had to be made in order to implement a three-level cache hierarchy in this generation. Here I think we will be paying closer attention to the next-generation M4.

Physical layout: understanding the silicon
 

Samsung published the floorplan of its core this year, which we are very happy to see. Below are brief explanations of the labels:

PL2: Private L2 cache; here we see the 512KB cache implemented in what seem to be two banks/slices.

FPB: Floating point data path; the FP and ASIMD execution units themselves.

FRS: Floating point schedulers as well as the FP/vector physical register file memories.

MC: Mid-core; the decoders and rename units.

DFX: This is debug/test logic and stands for “design for X”, such as DFD (Design for Debug), DFT (Design for Test), DFM (Design for Manufacturability), and other miscellaneous logic.

LS: Load/store unit along with the 64KB of L1 data cache memories.

IXU: Integer execution unit; contains the execution units, schedulers and integer physical register file memories.

TBW: Transparent buffer writes; includes the TLB structures.

FE: The front-end, including branch predictors, fetch units and the 64KB L1 instruction cache memories.

Exynos 9810 floorplan

Compared with the M1, nearly every functional unit in the M3 has grown considerably. The final core functional block measures 2.52mm², with a further 0.98mm² for the 512KB L2 cache and its logic.

Exynos 9810 floorplan

Here Samsung showed the floorplan of the whole cluster, again labelling the four cores. The cores are arranged next to one another, and the L2 and L3 slices are likewise laid out side by side in order. This layout seems to have saved some placement work, since the design of each slice is simply replicated four times.

IPC increases by ~59%

Finally, Samsung talked about its performance-analysis infrastructure and how it runs a variety of workload traces through RTL and model simulation in order to evaluate design choices, find bugs, and fine-tune the μarch.

On this slide we finally get the official figure for the core's IPC increase: ~59%.

As the chart shows, the gain is not uniform across workloads: high-ILP workloads gain only about 25%, and MLP workloads hardly improve at all. On the other hand, many mixed workloads see IPC increases of more than 80%.

Performance and efficiency: Samsung's data

The next slide shows the GeekBench4 performance of the M2, M3, and A75. The representative products are the Exynos 8895, Exynos 9810, and Snapdragon 845 respectively.

Power efficiency has always been a major topic for the M3

As we described in our review, Samsung's high 2.7GHz frequency requires very high voltage and power. Although it delivers leading performance, the resulting efficiency ends up below that of the M2 in the Exynos 8895. The figures here represent active system power, meaning CPU, memory controller, and DRAM, which is the same way we measure it at AT.

With the clock lowered to the same 2.3GHz as the M2, Samsung's presentation shows the M3 leading in efficiency.

The following graphs show the energy used to complete the workloads and the average power during the test. The bars on the left indicate efficiency: the shorter the bar (the fewer joules), the more efficient the platform. The bars on the right represent the performance score: the longer the bar, the better the performance.

I also re-ran the workloads at the M3's top three frequencies, 1794, 2314, and 2704MHz, to give a broader view of how efficiency changes with performance.

Overall, the M3 offers a very dynamic range of results. At (nearly) equivalent peak performance to this generation's A75 competition, the M3 can show a good efficiency advantage. Even the M3's lowest performance point still beats the M2's peak performance at 2.3GHz, while holding a notable power and energy-efficiency advantage.

At a clock frequency of 2.3GHz, the M3's performance clearly exceeds that of the A75.

At 2.7GHz the performance gap widens further, but at a very high efficiency cost: it draws more power than any other recent SoC.

Samsung's future strategy and conclusion

Finally, Samsung talked more about the project's timeline and how the work began.

As we said in the introduction, planning for the M3 began in the second quarter of 2014, as the M1 was being finished, and RTL work started in the first quarter of 2015. Samsung then changed its pace and plan, moving some of the features originally intended for the M3 into the M2. The original M3 plan was revised in the first quarter of 2016 in order to push for a much larger gain in microarchitectural performance.

RTL was delivered to the SoC team in the first quarter of 2017 for the first EVT0 tape-out of the Exynos 9810. Notably, the silicon in actual production is EVT1, whose tape-out took place in mid-2017. The final commercial Exynos 9810 reached the market in March 2018.

The M3 design was quite a breakthrough effort for the Samsung team, because they had to go through something close to a restart of the project and cope with extreme time pressure, delivering the next-generation product before the deadline.

Because of the time constraints, Samsung left a lot of room for improvement in this product, especially in one of the weaker parts of the microarchitecture, the cache hierarchy. Samsung itself says it is dissatisfied with this, and it is what drives the design team to keep pushing forward.

Samsung was not willing to disclose any physical implementation details. Because HotChips is a microarchitecture forum, the disclosure was kept to the μarch of the M3. As we have seen in the past, when vendors implement a design in different ways, the performance and power characteristics of a single microarchitecture can differ greatly. Given this, when measuring the end product it is very hard to separate these two intertwined sides of the silicon.

The M3 looks like a solid microarchitecture overall, one that feels more like what we are used to seeing in desktop-class products. It seems that Samsung took a more direct approach to μarch performance, and in many respects its design is more “brute-force” than Arm's, which also explains the M3's larger silicon size.

When evaluating the efficiency of an IP, examining the high-level microarchitecture alone is not enough, because the actual transistor-level electrical engineering and the details of the design choices can easily outweigh any apparent high-level characteristics. Moreover, no vendor would really disclose these details, and they would in any case go far beyond what the general reader can follow.

Here, the final slide may be the most enlightening part of the disclosure, giving us a glimpse of Samsung's future strategy:

The SARC design team is said to now be on an aggressive yearly release cadence with continuous improvement. In fact, when I asked about several design choices and specifications and compared the M3 with the A76, Samsung reminded me that the real competitor to Arm's new core will be next year's new Exynos M4, not the M3.

So far we have only seen two generational updates released, but with IPC gains of 20% and 59% for the M2 and M3 respectively, Samsung has established a short but very aggressive track record.

A few days ago, Arm publicly laid out its performance-core roadmap through 2020, announcing the A76 successors Deimos and Hercules and promising generational gains of about 15% and 10%. The M3 already seems to match or exceed the A76 in projected performance (at least in SPEC2006), so depending on the M4's power efficiency, we may finally see the competitive advantage of Samsung's custom design pay off.

Overall, we thank Samsung for making today's microarchitecture disclosure possible; outside of Arm, such disclosures are quite an unusual event in this secretive industry. We hope S.LSI and SARC can address the weaknesses of the Exynos 9810 and M3 and make next year's SoC a greater success. We are certainly looking forward to it!
