Wednesday, March 17, 2010

Troubleshooting an odd symlink bug

About a week ago an odd bug was brought to my attention: it occurs when people try to install the Puppet package into an image made with InstaDMG. The bug started out in private emails, but we got it moved over to the developer mailing list, and you can take a look at it there. A group of us banged our collective heads over it for a while, and I finally found it by going over every step to see what was wrong.

The problem manifested itself as the Puppet installer overwriting the symlink that you normally find at '/usr/lib/ruby/site_ruby' and putting a folder with the desired contents there instead. Replacing this symlink apparently broke other things, and thus began the bug hunt. The bug was reported against InstaDMG because the installer works fine when used on a booted volume. My bet is that the same problem would have shown up if someone had tried installing to any volume other than the boot volume, which would have cleared InstaDMG of this bug, but we didn't think of that at the time.

My first instinct was that something was wrong with the code in the 'installer' program when it had to follow the complex series of symlinks involved (a listing of those appears in a moment). I even created a script that mounted a dmg and tried to re-create the problem in a much simpler manner, but with no success. I could still reproduce the observed behavior, so I knew there was a problem in there somewhere, and I decided to figure out what was different about the symlink chain in the real case compared with my test case. So I carefully followed the chain of symlinks on a mounted volume (an InstaDMG output dmg, since I have a few of those lying around).
Here is what I found:

    /usr/lib/ruby -> ../../System/Library/Frameworks/Ruby.framework/Versions/Current/usr/lib/ruby
    /System/Library/Frameworks/Ruby.framework/Versions/Current -> 1.8
    /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/site_ruby -> ../../../../../../../../../../Library/Ruby/Site

So if you follow this set of rules, on a booted volume '/usr/lib/ruby/site_ruby' winds up pointing at '/Library/Ruby/Site'. My understanding is that this is the same in both 10.5.x and 10.6.x, but was different in 10.4.

But if you are careful, you will count the number of back-references in that last link. There are 10. If you count the number of folders in the chain back from the 'site_ruby' link to the root, you will only find 9. When you are booted from the volume this does not matter: once you are at the root directory you can keep back-referencing all you like, and you still wind up in the same place. As a quick demonstration, you can do this in the Terminal: 'cd /; cd ..; pwd' and you will still be at root. But when the volume is merely mounted, the tenth back-reference climbs above the volume's own root into the folder where it is mounted (typically '/Volumes'), so the 'site_ruby' link winds up pointing back outside the image.

So this explains the bad behavior: when the installer goes to look for the folder at this point it finds a broken symlink, so it replaces that broken symlink with a valid folder. A pretty reasonable thing for the installer to do. I might have made this a failing error if I were the one programming it, but I am sure a lot of smart people came together in a meeting at Apple (or possibly NeXT) at some point in the past and decided that this was the correct behavior, and I can't call them wrong.

When I started looking back, it seems that this extra back-reference has been in place since 10.5.0 and has been kept all the way through 10.6.2 (and it might continue). It has been masked all along because it only becomes a problem when you are not booted from the volume.
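The off-by-one described above can be checked mechanically. Here is a minimal Python sketch that compares the number of '..' components in a relative symlink target against the depth of the folder holding the link. The function name 'climbs_above_root' is made up for this illustration, and the check is purely lexical: it assumes the folders above the link are real folders and ignores any further symlinks inside the target itself.

```python
def climbs_above_root(link_path, target):
    """Return True if a relative symlink target lexically climbs above '/'.

    link_path: absolute path of the symlink, as seen from the volume's own root.
    target:    the string stored in the symlink.
    """
    if target.startswith("/"):
        return False  # absolute targets are a separate question

    # Depth of the folder that contains the link (0 means the volume root).
    depth = len([p for p in link_path.split("/")[:-1] if p])

    for part in target.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            depth -= 1
            if depth < 0:
                return True  # one '..' too many: we left the volume
        else:
            depth += 1
    return False


site_ruby = "/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/site_ruby"
print(climbs_above_root(site_ruby, "../" * 10 + "Library/Ruby/Site"))  # True
print(climbs_above_root(site_ruby, "../" * 9 + "Library/Ruby/Site"))   # False
```

With ten back-references the link leaves the volume; with nine it would land exactly on the volume's root and stay inside the image.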
I really should write up a small tool to comb through the whole filesystem and see if any other symlinks have a similar problem, and report those as well in another Radar report, but I think I will leave that for another day. I thought I would get this out there now, though, so that anyone who runs into a similar bug might remember it.
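For anyone who wants a head start on such a tool, here is a rough Python sketch, not a finished utility. It walks a mounted volume and lists relative symlinks whose targets, resolved lexically, land outside the volume. The function name 'find_escaping_symlinks' is my own invention. The check skips absolute targets, does not follow symlinks that appear in the middle of a target, and cannot find anything when pointed at '/' itself, because extra back-references are harmless at the real root.

```python
import os


def find_escaping_symlinks(volume_root):
    """List (link_path, target) pairs for relative symlinks that point
    outside volume_root when the target is resolved lexically."""
    root = os.path.abspath(volume_root)
    escaping = []
    # followlinks=False: symlinked folders are reported but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path)
            if os.path.isabs(target):
                continue  # absolute targets are a different kind of problem
            # Resolve the target relative to the folder holding the link.
            resolved = os.path.normpath(os.path.join(dirpath, target))
            if os.path.commonpath([root, resolved]) != root:
                escaping.append((path, target))
    return escaping
```

You would call it with the mount point of the image, for example find_escaping_symlinks('/Volumes/MyImage'), where the volume name is only a placeholder.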
