11.0, musings and gripes (starting the unstable branch off with a bang)

Mozilla is considering their options for the release of Firefox 11 given some recent events (more presently), but I think it is important to establish our unstable branch in a timely manner to reassure you and our studio audience that TenFourFox isn't throwing in the towel with 10. (How alliterative.) Remember, for those new to this blog, that 11 through 16 beta are unstable builds. Do not use them if you're not prepared to deal with bugs; use 10.x.

Firefox 11 is not the big leap that 10 was and there is little new user-facing, but there are some important changes in the machinery that are a big deal to us. The two most important are SPDY support (preffed off by default), and improved animated GIF performance. Many people have noticed and I commented way back in Fx4 that a big pack of animated GIFs on a page can bring the browser to a crawl. This doesn't completely get it back to Fx3.6, but it's a lot better, and at least now I can look at the smilies in 68KMLA without watching the core temperatures rise on the G5.

SPDY is another big deal, mostly because Google is pushing it hard and we are, near as I can tell, the only browser for PPC OS X that supports it in any form. SPDY is a modified HTTP with nearly ubiquitous TLS encryption and DEFLATE compression that furthermore multiplexes data transfer over a single connection rather than relying on traditional simultaneous sockets or sequential request pipelining. Personally, I'm not wild about it; it makes a moderately heavy protocol into a nightmare and my suspicion of Google knows no bounds. However, suitably written, it is faster, and it is definitely faster than conventional HTTP over SSL. Google Chrome supports it, natch, and uses it to talk to Google properties, and Twitter has recently deployed it, so the ball is rolling and the IETF is evaluating it for HTTP/2.0. It will become enabled by default in the Fx13 timeframe and like it or not, it's here to stay.
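To make the multiplexing idea concrete, here is a minimal sketch in Python. It is not the SPDY wire format: the frame layout (a bare stream ID plus a chunk of bytes), the function names, the stream IDs and the 4-byte frame size are all assumptions made up for illustration. It only shows the general shape of interleaving several responses on one connection and sorting them back out at the other end.

```python
from collections import defaultdict
from itertools import zip_longest

def chunk(data, size):
    """Split a payload into pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def multiplex(streams, frame_size=4):
    """Interleave payloads round-robin as (stream_id, chunk) frames.

    Hypothetical framing: real SPDY frames carry more header fields.
    """
    per_stream = [
        [(sid, piece) for piece in chunk(payload, frame_size)]
        for sid, payload in streams.items()
    ]
    frames = []
    for row in zip_longest(*per_stream):
        # Streams that have run out of data contribute None; skip those.
        frames.extend(frame for frame in row if frame is not None)
    return frames

def demultiplex(frames):
    """Reassemble each stream's payload from the interleaved frames."""
    out = defaultdict(bytes)
    for sid, piece in frames:
        out[sid] += piece
    return dict(out)

# Three made-up responses sharing one connection.
responses = {1: b"<html>index</html>", 3: b"body{color:red}", 5: b"GIF89a"}
frames = multiplex(responses)
assert demultiplex(frames) == responses
```

The point of the exercise is that the small GIF finishes after two frames instead of waiting behind the whole HTML response, which is what sequential pipelining would force, and no extra sockets were opened to get there.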

In the local changes dept., I rewrote the G3/G4 square root routine to completely avoid red zone stores and this seems to have fixed issue 134 (and made the square root routine shorter and faster to boot). Because we do not have automated test coverage and did not detect this problem with our routine testing, I have decided to leave our inline square root disabled on 10-stable unless there is a huge hue and cry over performance regressions. So, if you are using a math-heavy application, you should probably be using unstable.

Let's also have a little episode of "Optimizing for the G5, part III" (see part II and part I) in which, yet again, we discover another nasty little secret about the PowerPC 970 that Apple never told anyone about. In this episode, we focus on the mtctr and b[c]ctr[l] instructions, which act sort of like a computed GOTO. You can load any arbitrary address into a general-purpose register and use mtctr to transfer that register into the counter "CTR" special-purpose register, used both as a (surprise) loop counter and as a branch target. Thusly loaded, you can use bctr to branch there, bctrl to call a subroutine there (computed GOSUB?), or bcctr and bcctrl to do those based on a condition register status.

We already know from our previous treatise that the G5, and in fact all POWER CPUs from the POWER4 (on which the 970 is based) through today's POWER7, divvies up the instruction stream into dispatch groups of approximately 4 instructions, give or take, with an optional branch in slot 5. There are certain restrictions about the dispatch groups. While we knew that mtctr liked to be first, in fact, you can only manipulate one SPR per dispatch group, and any SPR-manipulation instruction must be in the first slot, not just mtctr. So, if you have something like mtctr r5:mflr r0 (load CTR from GPR 5; load GPR 0 from link register), this gets executed in two groups.
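As a rough illustration of those rules, here is a toy model in Python. It encodes only what is stated above (four regular slots, an optional branch in slot 5, an SPR move only in the first slot) plus one assumption of mine, that a branch always closes its group. The opcode sets and the function name are made up for the sketch, and the real 970 has more restrictions than this.

```python
# Toy model of G5 dispatch-group formation. Assumed, simplified rules:
#  - up to four non-branch instructions per group
#  - a branch rides in slot 5 and closes the group
#  - an SPR move may only occupy the first slot
SPR_OPS = {"mtctr", "mfctr", "mtlr", "mflr"}  # illustrative subset
BRANCH_OPS = {"b", "bl", "blr", "bctr", "bctrl", "bcctr", "bcctrl"}

def dispatch_groups(instructions):
    """Split a list of instruction strings into modelled dispatch groups."""
    groups, current = [], []
    for insn in instructions:
        op = insn.split()[0]
        if op in BRANCH_OPS:
            # The branch joins whatever group is open, then ends it.
            current.append(insn)
            groups.append(current)
            current = []
            continue
        # Start a new group if this is an SPR move that would not be
        # first, or if the four regular slots are already full.
        if current and (op in SPR_OPS or len(current) == 4):
            groups.append(current)
            current = []
        current.append(insn)
    if current:
        groups.append(current)
    return groups

print(dispatch_groups(["mtctr r5", "mflr r0"]))
# [['mtctr r5'], ['mflr r0']]
```

Under this model the mtctr r5:mflr r0 pair lands in two groups, as described, because the mflr cannot share a group with an earlier instruction.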

But wait, it gets worse! Recall we mentioned that there is an optional slot 5 where a branch instruction can be carried along for the ride. So, slicko: we can say mtctr r24:bctr and simply branch to register 24 or whatever in one group, right? Yes, you can, but you pay a specific and severe penalty for mtctr and any CTR branch in the same dispatch group. The G3 and G4 don't have this problem, only the G5 and other "big POWER" chips.

While auditing Fx11 in Shark to make sure gcc wasn't putting bad instructions like mcrxr in despite our CPU tuning parameters, I noticed that a particular routine called JaegerStubVeneer accounted for a disproportionate share of the profile. All architectures except x86 use something called a "veneer" in JavaScript JaegerMonkey, which is used to change the return address when a native C or C++ routine has to throw an exception. It is generally a performance robber -- the ARM guys estimate its penalty at around 4% -- but there is no good way around it on RISC systems because the return address is generally in a register, not on the stack, so it can't be adjusted without having a veneer routine to go through and manipulate it. There are a lot of natives available even to a JIT routine, so it gets called frequently. The PowerPC veneer is very short and looked like this:

    ; Stash LR in the reserved spot in the VMFrame.
    mflr r0
    stw r0, 124(r1)
    ; Call r12.
    mtctr r12
    bctrl
    ; Get LR back.
    lwz r0, 124(r1)
    mtlr r0
    blr

In Shark, that bctrl was amazingly hot because of this limit on the G5. Now it looks like this (and in the next release, we will align it to 16 bytes to favour the G5 and G4 even more):

    ; Prepare to call r12.
    mtctr r12
    ; Stash LR in the reserved spot in the VMFrame. (second group)
    mflr r0
    stw r0, 124(r1)
    #if defined(_PPC970_)
    ; Keep bctrl away from mtctr! This appears to be the optimal scheduling.
    ; If they are together, G5 pays a huge penalty, more than other SPRs.
    ; It actually got worse with two nops, and putting the stw with bctrl.
    nop
    nop
    nop
    #endif
    ; Branch. (third group)
    bctrl
    ; Get LR back.
    lwz r0, 124(r1)
    mtlr r0
    blr

As you can see from the comments, this required quite a bit of empirical testing. Optimal scheduling executes this in three dispatch groups: the mtctr all by itself, and then the mflr and stw (saving the return address so that it can be adjusted if the stub throws), and then the bctrl. We put in three nops to force the bctrl to be off in its own dispatch group and not in the branch slot of the second one. Despite being longer, this actually cuts the execution time of the veneer in half on the G5, and this small change improves V8 by over two percent!
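For the curious, the grouping of the old and new veneers can be sanity-checked with a small Python sketch. It is a simplified model of my own, not a description of the real hardware: it assumes four regular slots, a branch that rides in slot 5 and closes the group, and SPR moves that must sit in the first slot. The helper names and opcode sets are hypothetical.

```python
# Simplified, assumed dispatch rules; the real 970 has more of them.
SPR_OPS = {"mtctr", "mfctr", "mtlr", "mflr"}
CTR_BRANCHES = {"bctr", "bctrl", "bcctr", "bcctrl"}
BRANCH_OPS = CTR_BRANCHES | {"b", "bl", "blr"}

def dispatch_groups(ops):
    """Group bare mnemonics into modelled dispatch groups."""
    groups, current = [], []
    for op in ops:
        if op in BRANCH_OPS:
            # A branch takes slot 5 of the open group and ends it.
            groups.append(current + [op])
            current = []
            continue
        # SPR moves must be first; four regular slots fill a group.
        if current and (op in SPR_OPS or len(current) == 4):
            groups.append(current)
            current = []
        current.append(op)
    if current:
        groups.append(current)
    return groups

def ctr_hazard(ops):
    """True if mtctr shares a modelled group with a CTR branch."""
    return any("mtctr" in g and CTR_BRANCHES & set(g)
               for g in dispatch_groups(ops))

old_veneer = ["mflr", "stw", "mtctr", "bctrl", "lwz", "mtlr", "blr"]
new_veneer = ["mtctr", "mflr", "stw", "nop", "nop", "nop",
              "bctrl", "lwz", "mtlr", "blr"]

print(ctr_hazard(old_veneer))  # True: mtctr and bctrl share a group
print(ctr_hazard(new_veneer))  # False: bctrl lands two groups later
```

In this model the new veneer starts with mtctr alone, then mflr, stw and two nops, and then the third nop carrying bctrl in its branch slot, which lines up with the three groups described above. The model only flags the shared-group case; it has nothing to say about why the two-nop variant measured worse, which remains an empirical result.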

Interestingly, changing our entire branching system to split every mtctr and CTR branch into separate dispatch groups actually made performance worse, presumably because it made the code longer and bulkier and caused fewer branches to fit within their standard displacement (which is always faster). Admittedly, it's hard to do instruction-level scheduling based on the current design of the JIT. Instead, we just do this in certain specific places where we know the two instructions will always occur together. The net improvement is nearly 3% for what is ultimately some extra no-ops and just a few lines of code.

I found in the LLVM sources an interesting little source file on G5 hazards and designing optimal dispatch groups which we will use in future optimizations. I attached it to issue 135 for the interested.

Now for the musings and gripes. Pwn2Own has come to its typical explosive end, and the schadenfreude is thick since Google Chrome's much ballyhooed sandbox took it on the chin (but props to Google, who are paying their promised $60,000 bounty to both successful attackers, and already have fixes on the way). Naturally, Firefox fell too, and the suspicion is that this is a cross-platform flaw which I am not allowed to talk about in detail (you'll find out soon enough). If the attack is as suspected, then we are vulnerable to it, although it would require special effort to attack Power Macs.

It is not clear if this will delay Firefox 11, but details on the exact flaw are not available, and launch day is Tuesday, so Mozilla may choose to fudge on the release date until more information surfaces. There are also some issues with video drivers that do not pose a concern to us. There will definitely be a followup release for 10-stable to address the security issue (I will wait to see if Mozilla retracts the 10.0.3 RC and issues a new one; we will follow suit), and if there is a security issue on 11 (this is not yet confirmed either), I will chemspill on this branch too.

We are presently pushing upstream our JaegerMonkey-with-type-inference backend to Mozilla as bug 731110, pending a couple higher priority fixes getting in first that clash with our work. That should be a nice benefit to 10.5 PPC builders building from the tree, will work with little change on AIX, and gives our Linux, Amiga and BSD brethren a starting point to convert it to SysV ABI. But it might not be there very long because of this interesting post by David Anderson in which he gives an ETA for IonMonkey, the next generation JavaScript JIT, of about 2-3 months. And, well, that really sucks. Ostensibly IonMonkey builds on the work already done with JaegerMonkey, but looking at the in-progress Mercurial tree for the ARM version of IonMonkey (which we would be based on), I say the hell it is: it's an almost completely different set of macro-ops and requires significantly longer and more complex logic for code generation. So it's kind of Sisyphean to finally get our JIT boulder up to the top after tracejit foundered and then have it roll back to the bottom in a few short months with IonMonkey. This thing had better wash windows and do dishes after the amount of effort that we invested in JM+TI. I just hope it lands after Firefox 17 so that we have some cycles to work on it.

Getting back to less gripe-y things, I was made aware of a TenFourBird project that is building a Thunderbird for PowerPC based on our changesets, and probably a few others of their own to comm-central. There are no builds available, but there is a build wiki, and I am delighted to see the project appear because I know that people have requested such a thing in the past. Please note that I know nothing about the person(s) working on it, and am not personally involved with it myself, so the usual caveats apply. Also, if you hate our icon, you'll really have a conniption with theirs. ;) Jokes aside, please let me know if you make contact with the developer(s) or have tried to build it with their instructions.

This is also a good time to point out a couple of other community builds. Tobias is maintaining up-to-date WebKit frameworks for 10.5 and has incorporated some of the JIT work for regular expressions. You should be alert for bugs, and it does not support 10.4, but Tobias has been a valued contributor to this project and I'm sure his builds will serve well those of you who need WebKit (but also make sure you support OmniWeb, which is still 10.4-compatible).

hikerxbiker is also issuing SeaMonkey builds for 10.5 PPC. These are built more or less off the tree and don't include any of our special features right now, but will eventually include the JIT when that gets through the pipeline. This might be a good option for those of you who need SeaMonkey's additional features, such as mail-news, Chatzilla, etc.

Anyway, release notes and builds (please read comments):

G3 ~~build removed until further notice due to architecture tag failure~~ build now corrected
G4/7400
G4/7450
G5
