marginalia_nu 9 hours ago

This is very tangential. I did some work in trinary computer emulation a lifetime ago; there's a cute trick for finding a closed-form translation between division by a power of 3 and a series of bit shifts and additions.

Start by observing that

1/3 - 1/2 = 2/6 - 3/6

or

1/3 = 1/2 - (1/2)(1/3)

Substitute the equation above into the RHS an infinite number of times and find

1/3 = sum of -(-1/2)^N for N in 1..inf, i.e. 1/2 - 1/4 + 1/8 - ...

You can do this with arbitrary pairs of powers of 2 and 3 (also other bases).

The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors, for divisors that are close to a power of two.
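
To make it concrete, here's a toy C sketch of that series (names and the uint32_t framing are mine; a real circuit would also correct the truncation error this naive version accumulates):

  #include <stdint.h>
  #include <stdio.h>

  /* Toy sketch of the series above: 1/3 = 1/2 - 1/4 + 1/8 - ...,
     whose terms pair up into 1/4 + 1/16 + 1/64 + ... Every term is a
     shift, so the divide is nothing but shifts and adds. Integer
     shifts truncate, so this naive version underestimates; a real
     divider would add a correction step. */
  static uint32_t div3_approx(uint32_t x) {
      uint32_t q = 0;
      for (int s = 2; s < 32; s += 2)
          q += x >> s;                 /* x/4 + x/16 + x/64 + ... */
      return q;
  }

  int main(void) {
      for (uint32_t x = 90; x <= 99; x++)
          printf("%u/3 = %u, approx %u\n", x, x / 3, div3_approx(x));
      return 0;
  }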

philsnow 7 hours ago

> you may have heard of techniques such as carry lookahead, Kogge-Stone addition

Just an aside, that is Peter Kogge, who did his PhD work at Stanford, worked on the Space Shuttle, is an IBM Fellow, and invented the first multi-core CPU.

  • atq2119 6 hours ago

    > invented the first multi-core CPU

    The man clearly has many deserved achievements to his name (that remains true without this one), and I'm confident the world would be better without this kind of statement.

    "A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.

    • philsnow 2 hours ago

      Thank you for keeping me honest, I concede the point; I was quoting from his Wikipedia entry and wasn’t particularly critical because I took an architecture class from him in grad school and liked him as a professor.

  • oidar 4 hours ago

    I thought Kunle Olukotun led the team for the first multi-core CPU.

    • philsnow 2 hours ago

      You may absolutely be right, I don’t know who did it “first”.

      I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...

      (In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)

phkahler 9 hours ago

The Cinematronics arcade game processor has two 12-bit accumulators. The multiply instruction shifts them right as a single 24-bit value and adds the contents of memory if a 1 came out of the LSB. So you clear the upper half, load one value into the lower, I forget how you set the memory address for the other operand... and execute several 1-bit multiplies in a row. You can get a 24-bit product this way, but most code I saw had runs of 8 multiplies. The most common thing was a 2x2 matrix multiply to do coordinate rotations for game objects.

This was in the mid-1970s using off-the-shelf 7400-series parts, and it had a peak throughput of 5 MIPS.
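
A rough C model of that 1-bit multiply step, for anyone who wants to play with it (the register layout is my guess at the spirit of it, not the actual hardware):

  #include <stdint.h>
  #include <stdio.h>

  /* Rough model of the step described above: treat the two 12-bit
     accumulators as one 24-bit value, shift right, and add the memory
     operand into the upper bits whenever a 1 falls out of the LSB.
     Twelve steps give the full 24-bit product. */
  static uint32_t mul12x12(uint16_t a, uint16_t b) {
      uint32_t acc = a & 0xFFF;              /* multiplier in lower half */
      for (int i = 0; i < 12; i++) {         /* one 1-bit multiply each  */
          int lsb = acc & 1;
          acc >>= 1;                         /* 24-bit right shift       */
          if (lsb)
              acc += (uint32_t)(b & 0xFFF) << 11;
      }
      return acc & 0xFFFFFF;                 /* 24-bit product           */
  }

  int main(void) {
      printf("%u\n", mul12x12(1234, 567));   /* prints 699678 */
      return 0;
  }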

  • greesil 3 hours ago

    Not exactly one cycle per multiply, it sounds like. That 5 MIPS would be eaten up quickly. I have sometimes had to do fixed-point math in the past 20 years and have much respect for those programmers who came before me.

kens 14 hours ago

Author here if anyone has questions...

  • vanderZwan 12 hours ago

    What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something making it obsolete?

    • phire 8 hours ago

      Obsolete.

      You can observe the evolution of multipliers in the MIPS line, which I've been studying and happen to know.

      The R3000A had the same Radix-8 (aka 3-bit per cycle) integer multiplier in 1989.

      The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.

      The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8-bits per cycle).

      In parallel, MIPS spent a lot of time creating a low-power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power-saving optimisations [1], [2]. Instead of separate integer and floating-point data paths, they created a unified data path that did both, allowing them to use the same radix-8 multiplier for everything. They even unified the multiplier unit into the main adder, rather than using a separate adder like earlier designs.

      In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath. But it gained the radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.

      As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
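
      (For anyone unfamiliar: a carry-save adder compresses three addends into a sum word and a carry word with no carry rippling between bit positions, which is what makes stacking them fast. A minimal sketch in C:)

        #include <stdint.h>

        /* One carry-save adder stage: a + b + c == sum + carry, with
           every bit computed independently (no carry propagation).
           Stacking such stages defers the one slow carry-propagate
           add to a single final adder. */
        static void csa(uint64_t a, uint64_t b, uint64_t c,
                        uint64_t *sum, uint64_t *carry) {
            *sum   = a ^ b ^ c;                           /* per-bit sum */
            *carry = ((a & b) | (a & c) | (b & c)) << 1;  /* carry bits  */
        }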

      In 1996, MIPS released the R10000. Transistors were now cheap enough that they could implement a full 52-bit adder for their floating-point data path, allowing them to issue one double-precision multiplication every cycle (though it's pipelined with a 2-cycle latency). I assume it's just 52 stacked adders, though it seems like they probably need to be doing something fancier with carries by the time it's that big.

      Most modern CPUs have ended up at the same point. Massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3 cycle latency.

      [1] https://www.youtube.com/watch?v=nll5MWlG7q4 [2] https://ieeexplore.ieee.org/document/282948/

      • atq2119 5 hours ago

        It seems you misunderstood the design of the multiplier described in the article.

        The Pentium multiplier isn't 3 bits per cycles, it's one full multiplication per cycle (though pipelined over multiple cycles).

        The part that's 3 bits is the Booth multiplier trick of multiplying 3-bit digits instead of 1-bit digits, but they're all multiplied in parallel.

        As such, I suspect (but don't know) that this trick or some variant of it is used even today.

        • phire 2 hours ago

          Yeah, you are right... I misunderstood radix-8 multiplication, missed that this post was only talking about a small part of the Pentium's multiplier, and jumped to conclusions... And annoyingly, Hacker News doesn't allow you to edit comments after a few hours.

          On the R3000/R4000/R4200, the 3-bits-per-cycle multipliers do use radix-8 multiplication, but they don't have a dedicated 3x multiplier. Instead, the 3x result is latched during the first cycle (by adding (x << 1) + x). For the remaining cycles it can do a 3-bit multiplication with nothing more than a bit of Booth recoding logic and a single 64-bit wide adder.

          Then MIPS entirely abandoned this radix-8 encoding for the 8-bits-per-cycle multiplier in the R4400 and R4300, replacing it with a simple array of binary carry-save adders, probably because an array of base-2 adders is just much simpler. (Or at least that's what I think I can see on the R4300's die shot; I'm going to need to go back and take a much closer look at the multiplier.)

          Anything I say about radix-256 in my first comment is probably nonsense; it's not radix-256 simply because it can do 8 bits in one cycle.

          What I missed is that there is nothing limiting you to one radix-8 addition per cycle (like the early MIPS designs); you can combine the radix-8 encoding with an array of adders, and you only need a third of the adders that a base-2 multiplier would need. The entire point of using the radix-8 encoding is that there is only one carry every 3 bits.

          You are probably right. This trick with the dedicated 3x multiplier is probably still used today.
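
          To make the recoding concrete, here's a toy version in C. (My own digit convention: digits land in -3..+4 rather than the textbook -4..+4, but the point survives: every partial product is a shift or negation, except the ±3x that needs the precomputed term.)

            #include <stdint.h>
            #include <stdio.h>

            /* Toy radix-8 signed-digit recoding: rewrite a multiplier
               as base-8 digits in -3..+4, so each partial product is
               0, ±x, ±2x, ±3x, or +4x -- all shifts/negations except
               ±3x. (Real Booth recoding works from overlapping bit
               windows; this is just the idea.) */
            static int recode8(uint32_t m, int d[12]) {
                int n = 0, carry = 0;
                for (int i = 0; i < 32; i += 3) {
                    int digit = (int)((m >> i) & 7) + carry;
                    carry = digit > 4;
                    if (carry) digit -= 8;   /* borrow from next digit */
                    d[n++] = digit;
                }
                if (carry) d[n++] = 1;
                return n;
            }

            int main(void) {
                int d[12], n = recode8(1234567u, d);
                int64_t v = 0;
                for (int i = 0; i < n; i++)
                    v += (int64_t)d[i] << (3 * i);  /* digit * 8^i */
                printf("%lld\n", (long long)v);     /* prints 1234567 */
                return 0;
            }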

    • kens 10 hours ago

      There were a lot of different algorithms in use back then. I don't know if there is a winner used in modern chips.

  • dsvf 12 hours ago

    No question, but just in general appreciation for the wonderful content you put out there that we get to enjoy!

  • kristianp 7 hours ago

    This is probably a basic question, but is this for floating-point multiplies? Isn't the part that's being multiplied less than 64 bits, because you also add the exponents?

    • kens 6 hours ago

      Yes, this is for floating point multiplies. But internally, the FPU uses x86 extended-precision numbers, 80-bit floats, with 64 bits of significand and 15 exponent bits. So the multiplier is dealing with 64-bit numbers, plus a few more bits to support rounding correctly.

      https://en.wikipedia.org/wiki/Extended_precision#x86_extende...
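
      (You can poke at this from C on x86, where long double is the 80-bit x87 format; on other platforms long double may be something else entirely:)

        #include <float.h>
        #include <stdio.h>

        /* On x86, long double is the 80-bit x87 extended format
           described above: 64-bit significand, 15-bit exponent. */
        int main(void) {
            printf("significand bits: %d\n", LDBL_MANT_DIG); /* 64 on x87 */
            printf("max exponent:     %d\n", LDBL_MAX_EXP);  /* 16384     */
            printf("storage size:     %zu bytes\n", sizeof(long double));
            return 0;
        }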

  • java-man 13 hours ago

    Ken, maybe it's time to publish a book?

    • kens 9 hours ago

      That would require me to focus on one subject :-)

  • dwedge 11 hours ago

    I only tenuously understand this so feel free to ignore the question if it's too asinine but:

    > (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

    Why can't you do the same to subtract x4 from x7 to get x3?

    • thombat 10 hours ago

      Because x8 is simply shifting 3 bits left, so it's "free" (fixed-size shifts are very simple to implement), whereas generating the x7 would require additional circuitry, which would also be more expensive than the x3 was (since it's another shift+add beyond what x3 required).

    • mikequinlan 10 hours ago

      x7 is computed by x8 - x, so you get x8 - x - x4 which is 2 subtractions (and 2 delays). You only want 1 delay.

    • russdill 10 hours ago

      The question would be why isn't it 4x - 1x?

      • kens 8 hours ago

        The trick is that Booth's algorithm gives you the ×8 term for free; you multiply by one more in the next base-8 digit. You can put either ×4 or -×1 into a term, but you can't put both of them into one term. So you can do ×8 - ×1 but you can't do ×4 - ×1.

      • mjevans 8 hours ago

        I suspect it's logically equivalent, but one bit wider in hardware.

        Binary 4x is left shift by 2, rather than 2x which is by one.

        Adding or subtracting in 2's complement space is the same operation for the bits of the word.

mjevans 12 hours ago

"This ×3 multiplier contains roughly 9000 transistors, a bit more than an entire Z80 microprocessor (1976). Keep in mind that the ×3 multiplier is a small part of the floating-point multiplier, which is part of the floating-point unit in the Pentium. Thus, this small piece of a feature is more complicated than an entire microprocessor from 17 years earlier, illustrating the incredible growth in processor complexity."

That's the pace of performance growth that led software to become as bloated as it is today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context/locality.

Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter, not harder.

  • JumpCrisscross 11 hours ago

    > Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics

    We’ve been at that limit for decades.

    • GrumpyYoungMan 11 hours ago

      The fabs propped up the corpse of Moore's Law by throwing mountains of cash at expanding transistors into the third dimension: finFET, GAA, CFET, etc. That has kept the party going a little while longer than it would have lasted otherwise, but it's a one-time deal, since there are no more dimensions to expand into.

      • brookst 11 hours ago

        …but that's how it's always worked. Moore's law is dead, we're at the limit of everything, oh hey, Moore's law limps by again because someone did something clever.

        • light_hue_1 10 hours ago

          That's never how it worked. Moore's law was never dead. People are just endlessly confused about what Moore's law is.

          What ended was Dennard scaling, around 2006: roughly, that frequency would keep going up as feature size went down. But because so many people are confused about what is what, you see a crappy muddled message.

          Moore's law has been going strong. It must end eventually; current predictions are that it will be in a decade or two.

          • perching_aix 10 hours ago

            It's starting to get a bit old that whenever I see Moore's law mentioned, I usually also run into a spiel about how people have the wrong idea about what it actually refers to, and that it's holding up just fine. This is despite the gen-on-gen and year-on-year performance improvements of computer hardware very clearly tapering off in recent memory.

            Maybe debating which always-somehow-wrong law to cite should not be the focus? Like it's very clear to me that being technically correct about what Moore's law or Dennard scaling refers to is leaps and bounds less important than the actual, practical computing performance trends that have been observable in the market.

            • Kon5ole 8 hours ago

              What we see in the market is caused by software bloat. Chips are gaining performance faster than ever in absolute terms.

              I think Moore’s law should be avoided altogether when discussing progress in this area, because it’s hard to understand the effects of doubling intuitively. Rice grains on chessboards and all that.

              One might think "Moore's law is slowing down" means progress was faster before and slower now, when it is in fact the complete opposite.

              If you consider the 20 years between the Intel 286 and the Pentium 3, transistor count went from about 150 thousand to 10 million.

              Today (using the Ryzen 5950 and 7950 as examples), we got 5 billion more transistors in just 2 years.

              So in 2 years we added 500 times more transistors to our CPUs than the first 20 years of "prime Moore's law" did.

              This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat, and the fact that most users aren’t doing things with their computers where they can notice any improvements in performance.

              • perching_aix 8 hours ago

                > Chips are gaining performance faster than ever in absolute terms.

                But this is not what I as a consumer end up seeing at all. Consider the RTX 5090. Gen-on-gen (so, compared to the 4090), for 20-30% more money, using 20-30% more power, you get 20-30% more raster performance. Meaning the generational improvement is 0, software notwithstanding.

                > If you consider the 20 years between the Intel 286 and the Pentium 3, transistor count went from about 150 thousand to 10 million. Today (using the Ryzen 5950 and 7950 as examples), we got 5 billion more transistors in just 2 years.

                Why would you bring absolute values into comparison with a relative value? Why compare the 286 and the P3 and span 20 years when you can match the 2 year timespan of your Ryzen comparison, and pit the P2 ('97) against the P3 ('99) instead? Mind you, that would reveal a generational improvement of 7.5M -> 28M transistors, a relative difference of +273%! Those Ryzens went from 8.3B to 13.2B, a +59% difference. But even this is misleading, because we're not considering die area or any other parameter.

                • relaxing 3 hours ago

                  Moore’s law never mentioned die area, and not because Gordon thought die sizes would never grow.

              • _0ffh 5 hours ago

                > This enormous acceleration of progress is increasingly unnoticed due to even faster increases in software bloat

                Can't resist throwing in this quote (from one of Wirth's students, I think):

                "Software gets slower faster than hardware gets faster."

            • sitkack 10 hours ago

              It is ultimately a market effect; the technical specifics are not really important and are even conflated by industry insiders. See my sibling comment.

          • sitkack 10 hours ago

            Moore's law ends when the whole universe is a computer (which it already is).

            https://hasler.ece.gatech.edu/Published_papers/Technology_ov...

            Some view it as a doubling every 18 months, or a cost per transistor (this has gone up with the smallest nodes).

            It is roughly an exponential curve in the number of transistors we can use to make a "thing" with.

            It is both a capability question (can we make things with a certain number of transistors?) and an economic one (is it viable to build things of that size?).

            You could stay at the current node size and halve the cost of that wafer every 18 months, and you would still be on the same curve. But it is easier in our economic system to decrease the node size, keeping the rest of the fixed wafer costs the same, and get 2x or 4x the density on the same lines.

            If I get nerd sniped, I'd find the two video presentations, one by Krste and another by Jim Keller, where they unambiguously explain Dennard Scaling and Moore's Law in a way that is congruent with what I just said.

          • wtallis 10 hours ago

            Also, chip fabs keep getting more expensive and taping out a chip design for those new fabs keeps getting more expensive. That makes today's semiconductor industry work quite differently from how things worked 30 years ago and means some segments see reduced or delayed benefits from the continued progression of Moore's Law. See eg. the surprisingly long-lasting commercial relevance of 28nm, 14nm, and 7nm nodes, all of which were used in leading products like desktop GPUs and CPUs for more years than Moore's Law would lead you to expect.

      • dehrmann 3 hours ago

        > into the third dimension

        Does this actually work? At some point, and this has been the case for a while, you're limited by thermals. You can't stack more layers without adding more cooling.

      • WXLCKNO 9 hours ago

        Depends, there could be 11 dimensions to expand into.

        • skissane 7 hours ago

          Assuming those extra dimensions really exist (it is unproven), I think we are centuries or even millennia away from being able to make technological use of them, if we ever will be at all.

        • relaxing 3 hours ago

          Expand into the time dimension, evaluate infinite execution paths by reversing the pipeline, rerunning, and storing the results in a future accumulator.

      • acchow 7 hours ago

        > since there are no more dimensions to expand into.

        Quantum computing is next, right?

        • adastra22 6 hours ago

          That’s not how quantum computing works.

    • Joker_vD 11 hours ago

      Depends on what limit exactly you are thinking about: the size of a transistor is still shrinking.

  • dlcarrier 10 hours ago

    The speed of software bloat matching hardware improvement is known as Wirth's law: https://en.wikipedia.org/wiki/Wirth%27s_law

    I think software bloat is growing faster, though.

    • _carbyau_ 9 hours ago

      I think the software bloat is more affected by the speed of light. If only every damn thing didn't need internet access, with its associated lag, often variable in timing...

      • klempner 7 hours ago

        Speed of light is very relevant for actual performance-sensitive software, not just shitty, overly chatty trash.

        While in practice the latency is over an order of magnitude larger, the speed-of-light round trip between two sockets on the same motherboard is likely on the scale of 1-2 nanoseconds, which is already several cycles. Once you're talking about warehouse-sized datacenters, the speed-of-light distance can get into the microseconds, even in a straight line through vacuum/air.

      • dragontamer 9 hours ago

        The internet has largely replaced CD-ROMs, floppies, and DVDs.

        The stuff you use a computer hard drive or flash drive for has remained consistent. Some items may have moved to the internet, but most games, OS utilities, and whatnot are stored on local HDDs or SSDs. Always have been and likely will remain so.

        I remember when data interchange meant burning a disc and walking a disc over to my buddy. XML will standardize that and simplify my workflow. /s

  • acchow 7 hours ago

    History of function calls:

    - goto/jmp to instr

    - vtable lookup

    - hash and lookup in a dictionary

    - run a Large Language Model

  • userbinator 11 hours ago

    On the other hand, the multiplier is far more regular in structure than a Z80. The datapath on the Pentium is several times wider too.

    • kens 8 hours ago

      A bit surprisingly, the Z80's datapath is 4 bits wide. The Pentium's FPU is more than 64 bits wide (extra bits for rounding and stuff). So, yes, there is a big difference in the datapath widths.

  • HPsquared 12 hours ago

    Luckily there is plenty of room for improvement in most applications.

    • ajsnigrutin 11 hours ago

      But why would we waste thousands of hours of programmers' time optimizing stuff, if we can instead waste millions if not billions of hours of users' time?

      /s

      • Someone 11 hours ago

        Because it saves lives. https://folklore.org/Saving_Lives.html:

        "Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and thats 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?”

        • ab71e5 9 hours ago

          That assumes booting is a synchronous action instead of asynchronous, giving time to fetch a cup of coffee

        • speed_spread 9 hours ago

          By that logic, Electron apps kill people. Makes sense.

      • dijit 11 hours ago

        You know, it's funny: the first time I ever read a book about compiled languages (I think it might have even been K&R C) it talked about the fact that code is run significantly more often than it's compiled, and read more often than it's written. Thus we should optimise for runtime speed first and readability afterwards.

        Yet on hacker news, the exact opposite sentiment is true.

        Employee hours are expensive, so optimise for them as much as you possibly can and push as much time onto the user as they will tolerate. Focus primarily on readability, to the detriment of all other factors. Heck, don't even bother compiling your program; use an interpreter on the server.

        Keep your local cost down and your remote cost up… after all, remote costs are not your costs, especially if every company is doing the same thing.

        It's ironic that I could read a book and it could be so wrong, yet that book is the foundational Bible for many people's programming journeys.

        • cipheredStones 10 hours ago

          Well, programming is a tool, right? There are clearly incorrect ways to program, but that doesn't mean there's one correct way to program, because it depends on what you're trying to make.

          It's incorrect to make a bookshelf that can't hold the weight of books, but IKEA isn't incorrect to make a crappy-but-adequate bookshelf that will get given away when you move, and a master carpenter isn't incorrect to make an expensive-but-beautiful bookshelf that will be used for many decades.

          A seed-stage startup trying to see if anyone cares about their product at all should probably accept it being 100x slower than it theoretically could be. And the Linux maintainers should be working hard to shave milliseconds.

        • jjmarr 9 hours ago

          I've been told it depends on management's goals.

          If you are a startup (many HN readers are), employee hours are relatively expensive. The goal is to get a product out as fast as possible, and most of the time you fail at making a working product anyways. Meanwhile, you only have to spend money on servers once you have a successful product. Deferring the problem makes sense here.

          On the other hand, I personally write code that'll be near guaranteed to run millions of times in some factory somewhere, and there's little benefit if I ship my code before the factory starts making stuff. I'll take the time to write high-performance and high-quality software because there's now a financial benefit to doing so.

          But ultimately, it's my boss who gives me insight into what I should be doing. Not a book, because the book doesn't have context that [company] will lose tens of thousands of dollars every day that I don't fix [issue].

        • joquarky 10 hours ago

          It feels like companies care a lot less about delivering quality.

        • tehjoker 10 hours ago

          This is just the difference between rational design and capitalist design. Capital must externalize costs to achieve profits, but it's not a rational approach to system design. This appears everywhere in the economy as capitalist pressures override rationality.

  • Salgat 11 hours ago

    Which is how it should be. There's a limited amount of resources in developing a chip; best to take the least costly route given those constraints.

  • EncomLab 11 hours ago

    Jonathan Blow has entered the chat...

sifar 11 hours ago

Interesting choice of a radix-8 Booth multiplier that needs the x3 circuit. Seems like an area/perf tradeoff made to push the fmax, which means they had a constraint on latency cycles, since the same could be achieved via more pipelining.

  • kens 8 hours ago

    Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because it avoids the extra ×3 circuit. Pipelining is tricky because there isn't a nice place to break the multiplication array into two pieces.

    • sifar 5 hours ago

      Sure it takes a few iterations, but there is nothing inherently complex about it.

      You feed the partial products into a carry-save adder tree and choose a level which minimizes the delay and the extra flop area/power. I am not sure about the Pentium's process technology, but whatever the area, it might be comparable to the x3 multiplier.

      There is an implicit add in the x3 multiplier circuit that increases the delay, which was evidently deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).

efitz 10 hours ago

A long time ago on the 6502 I was trying to figure out how to quickly multiply by 3 in assembly and I came up with:

1. Shift-left

2. Add original value

For multibyte values, use rotate-left (which goes through carry) and do the rotates from LSB to MSB. Then clear carry and do the adds from LSB to MSB with carries. Should work for any word size.
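
In C terms, the scheme looks something like this (a sketch of the idea, not the 6502 instruction sequence itself):

  #include <stdint.h>
  #include <stdio.h>

  /* C model of the multibyte x3 described above: save a copy of the
     original, shift the whole value left one bit (the ROL chain, LSB
     to MSB), then clear carry and add the copy back in (the ADC
     chain). Little-endian byte order, any width. */
  #define NBYTES 4

  static void times3(uint8_t v[NBYTES]) {
      uint8_t orig[NBYTES];
      unsigned carry = 0;
      for (int i = 0; i < NBYTES; i++) orig[i] = v[i];
      for (int i = 0; i < NBYTES; i++) {         /* ASL/ROL chain */
          unsigned t = ((unsigned)v[i] << 1) | carry;
          v[i] = (uint8_t)t;
          carry = t >> 8;
      }
      carry = 0;                                 /* CLC           */
      for (int i = 0; i < NBYTES; i++) {         /* ADC chain     */
          unsigned t = v[i] + orig[i] + carry;
          v[i] = (uint8_t)t;
          carry = t >> 8;
      }
  }

  int main(void) {
      uint8_t v[NBYTES] = { 0x39, 0x30, 0, 0 };  /* 12345         */
      times3(v);
      printf("%u\n", (uint32_t)v[0] | (uint32_t)v[1] << 8 |
                     (uint32_t)v[2] << 16 | (uint32_t)v[3] << 24);
      return 0;                                  /* prints 37035  */
  }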

Ofc I was a teenager but this seemed clever and simple to me.

Was (am) I just missing something?

  • bean-weevil 10 hours ago

    This is covered in the article. See the section "Implementing a fast ×3 circuit with carry lookahead"
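
    The gist, for anyone not clicking through: instead of letting carries ripple, you compute generate/propagate signals and combine them in a log-depth tree. A software rendering of the Kogge-Stone flavor (illustrative only, not the article's exact circuit):

      #include <stdint.h>
      #include <stdio.h>

      /* Parallel-prefix (Kogge-Stone style) addition: five combine
         rounds instead of 32 ripple steps. g marks bits that generate
         a carry, p bits that would propagate one; each round doubles
         the span over which carries are resolved. */
      static uint32_t add_prefix(uint32_t a, uint32_t b) {
          uint32_t p = a ^ b, g = a & b;
          uint32_t pp = p, gg = g;
          for (int d = 1; d < 32; d <<= 1) {
              gg |= pp & (gg << d);  /* carry crosses the widened span */
              pp &= pp << d;         /* span propagates iff both halves do */
          }
          return p ^ (gg << 1);      /* sum = propagate XOR carries in */
      }

      int main(void) {
          printf("%u\n", add_prefix(123456789u, 987654321u)); /* 1111111110 */
          return 0;
      }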

    • efitz 10 hours ago

      Thank you. I stopped reading after the octal discussion.

  • russdill 10 hours ago

    For most assembly, I'm not sure how this has an advantage over two adds.

    • dashtiarian 9 hours ago

      Shift-and-add used to execute much faster than multiplication in the 8088 days, and people would use it when they had to multiply an int by a known scalar (shift took 4 clocks and add took 12).

      • funny_falcon 43 minutes ago

        Compilers still prefer LEA to a multiply for 3, 5, and 9; it performs a shift+add, I believe.
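
        For instance, gcc and clang at -O2 typically compile the function below to a single lea (one address calculation, no multiplier):

          /* Typically becomes: lea eax, [rdi + rdi*2] */
          int times3(int x) { return x * 3; }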

    • jonsen 9 hours ago

      A shift is faster than an add.

      • russdill 9 hours ago

        On the 6502 they are both 2 cycles. On the vast majority of processors I'm aware of, "simple" ALU operations don't vary in execution time.

        • kragen 9 hours ago

          On 8086 or ARM you can get a shift "for free" with your add, although on the 8086 you have to use the LEA instruction.

        • ksherlock 7 hours ago

          ADC #immediate is 2 cycles, as is ASL. Are you really multiplying a constant by 3 at runtime? Probably not. ADC zp is 3 cycles. Plus 2 cycles for the CLC.

thaumasiotes 10 hours ago

I'm missing something:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.

> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.

> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...

why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?

Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.

  • kens 8 hours ago

    For the 64-bit multiplication, you're adding 22 terms, one for each base-8 digit. (Think of grade-school multiplication.) Each term needs to be trivial to compute, so you can do a shift and/or a negation to get the term, but you can't do another addition.

    The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't do ×2 and ×1 into a term to get ×3 because every term would need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.

    For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.

    Hopefully this makes sense. There are a lot of digits and sums flying around.
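
    A toy software model of that term structure, scaled down to a 32-bit multiplier (the recoding convention is mine, and the chip sums its 22 terms in parallel hardware, but the arithmetic is the same: x3 is computed once, and every term after that is only a shift or a negation):

      #include <stdint.h>
      #include <stdio.h>

      /* Recode the multiplier m into base-8 signed digits, precompute
         3x once, then form each term from {0, x, 2x, 3x, 4x} by shift
         and negate alone. Unsigned wraparound keeps the negated terms
         well-defined; the sum equals x*m (mod 2^64). */
      static uint64_t mul_radix8(uint64_t x, uint32_t m) {
          uint64_t x3 = (x << 1) + x;         /* the one up-front addition */
          uint64_t mult[5] = { 0, x, x << 1, x3, x << 2 };
          uint64_t product = 0;
          int carry = 0;
          for (int i = 0; i < 32; i += 3) {   /* one term per base-8 digit */
              int d = (int)((m >> i) & 7) + carry;
              carry = d > 4;
              if (carry) d -= 8;              /* digit now in -3..+4 */
              uint64_t term = mult[d < 0 ? -d : d] << i;
              product += d < 0 ? -term : term;
          }
          return product;
      }

      int main(void) {
          printf("%llu\n", (unsigned long long)mul_radix8(12345, 6789));
          return 0;                           /* prints 83810205 */
      }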

    • thaumasiotes 8 hours ago

      How do you do a negation without doing an addition? Can you do better than -x = ~x + 1?

      • kens 7 hours ago

        I think that the +1 is accomplished simply by setting carry-in on the adder for that term, so you don't need a separate adder. (Disclaimer: I haven't looked at that part of the circuit yet.)

        Another interesting question is how sign extension is implemented for a negative term. The problem is that you're adding 64-bit shifted terms to create a 128-bit result. When negating a 64-bit term, all the unused bits to the left need to be set to 1. But the chip doesn't do this. (Among other things, the adders would need to be much wider than 64 bits.) So it must be using a trick for the sign extension.
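
        (The carry-in trick is the usual two's complement identity, with the +1 supplied for free by the adder:)

          #include <assert.h>
          #include <stdint.h>

          /* An adder computes a + b + cin, so feeding it ~b with
             cin = 1 performs a - b with no separate incrementer. */
          int main(void) {
              uint32_t a = 123456, b = 789, cin = 1;
              assert(a + ~b + cin == a - b);   /* holds mod 2^32 */
              return 0;
          }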

      • sifar 4 hours ago

        If you are negating a single term you can't.

        If you are adding n terms using an adder tree, the 1 feeds into the carry-in of the next level (it is simply a concatenation, since the carry term is shifted left by 1). Only the final carry-propagate adder will have the add with 1.

artemonster 12 hours ago

TLDR: 3a = (a << 1) + a

  • vanderZwan 12 hours ago

    From the article:

    > Multiplying a number by three is straightforward in binary: add the number to itself, shifted to the left one position. (As mentioned above, shifting to the left is the same as multiplying by two and is easy in hardware.) Unfortunately, using a simple adder is too slow.

    So no, not really. In fact, the entire point of the article is arguably explaining why it is not that simple.

brcmthrowaway 10 hours ago

Can AI come up with this circuit?

  • goldfishgold 2 hours ago

    Almost certainly, insofar as it’s already known and described in a number of places, including Wikipedia.