Why are GPUs so good at cracking passwords?
What is it about GPUs that lets them crack passwords so quickly?
It seems like the driving force behind adopting good key-derivation functions for passwords (bcrpyt, PBKDF2, scrypt) instead of yesterday's cryptographic hash (MD*, SHA*) is that the later are vulnerable to programs that run on GPUs and guess huge numbers of passwords extremely quickly. Why should GPUs happen to be so much better at evaluating those hash functions than CPUs?
To complete @Terry's answer: a GPU has a lot of cores (hundreds). Each core is basically able to compute one 32-bit arithmetic operation per clock cycle -- as a pipeline. Indeed, GPU work well with extreme parallelism: when there are many identical work units to perform, actually many more than actual cores ("identical" meaning "same instructions", but not "same data").
Some details, for a somewhat old NVidia card (a GTX 9800+, from early 2009): there are 128 cores, split into 16 "multicore units". Each multicore can initiate 8 operations per cycle (hence the idea of 128 cores: that's 16 times 8). The multicore handles work units ("threads") by groups of 32, so that when a multicore has an instruction to run, it actually issues that instruction to its 8 cores over 4 clock cycles. This is operation initiation: each individual operation takes up to 22 clock cycles to run. You can imagine the instruction and its operands walking into the circuit as an advancing front line, like a wave in a pool: a given wave will take some time to reach the other end of the pool, but you can send several waves sequentially.
So you can maintain the rhythm of "128 32-bit operations per cycle" only as long as you have at least 22 times as many "threads" to run (i.e. a minimum of 22·128 = 2816), such that threads can be grouped by packs of 32 "identical" threads which execute the same instructions at the same time, like hip-hop dancers. In practice, there are some internal thresholds and constraints which require more threads to achieve the optimal bandwidth, up to about 4096.
I could achieve close to 99% of the optimal bandwidth with a SHA-1 implementation. SHA-1 uses a bit more than 1100 32-bit operations (that would be around 900 on a CPU, but a GTX 9800+ has no rotation opcode, so rotations must be split into two shifts and a logical or), and the GPU ran at 1450 MHz, for a grand total of about 160 million SHA-1 computations per second. This can be achieved only as long as you have millions of SHA-1 instances to compute in parallel, as is the case for password cracking (at any time, you need 4096 parallel SHA-1 to feed the GPU cores, but you also have to deal with I/O costs for input of potential passwords, and these costs will dominate if you do not have a lot of SHA-1 instances to process).
The host PC, on its CPU (a quad-core 2.4 GHz Intel Core2), could achieve about 48 million SHA-1 per second, and that was with thoroughly optimized SSE2 code. A single SHA-1 will use about 500 clock cycles on such a CPU (the CPU can compute several instructions in a single cycle, provided they don't compete for resources and don't depend on each other), but, for password cracking, it is worthwhile to use SSE2 with its 128-bit registers, and able to compute 4 instructions in parallel. With SSE2 constraints, it takes about 800 clock cycles to run four parallel SHA-1, so that's 200 clock cycles per SHA-1 instance. There are four cores in that CPU and the whole thing runs at 2400 MHz, hence 48 million per second.
More recent hardware will be faster, but GPU more so. A GTX 680 sports a whooping 1536 cores, and there are two such GPU in a GTX 690. We are talking billions of SHA-1 instances per second here.
(For comparison, I also did an implementation of SHA-1 on the Cell processor, i.e. the CPU in a PS3 console, with its 8 "SPU" coprocessors. One SPU was not available. With the 7 others, I reached about 100 million SHA-1 per second, i.e. better than a contemporary big PC CPU, but not as good as a good GPU of the same era.)
Summary: GPU achieve great performance by using heavy parallelism, with hundreds (if not thousands) of cores. This is made possible by pipelining (each individual operation takes many cycles to run, but successive operations can be launched like trucks on a highway) and sharing instruction decoding (since many cores will run the same instructions at the same time).
Some context on the numbers: assuming a set of about 100 possible characters, there are 1 trillion possible 6-digit passwords. At a billion passwords per second, this is a bit under 17 minutes to try every possible 6-digit password. Less if we assume the character set is smaller (most people will never use a good chunk of the printable ASCII character set in passwords). Another reminder to salt your hashes.
If a password algorithm were designed to maximize the usefulness of a GPU for validating a *single* password (e.g. given the password FNORD, compute iterated SHA-1 hashes for FNORD0000, FNORD0001, FNORD0002, etc. through FNORD9999 and then combine them via some method), would any kind of cracking-hardware be able provide more bang-for-the buck than would be available for a "good guy" system that needed to validate passwords one at a time?
Do GPUs actually perform a single 32-bit arithmetic operation in a single clock cycle (taking into account latency so e.g. a throughput of 4 and latency of 4 would still count as 1 retired instruction per cycle)? I know that for CPUs at least, different kinds of arithmetic might take several cycles, even dozens (in the case of division, for example). Are GPUs not like that?