Saturday, March 24, 2012

building the command-line reliably from user provided input

The shell offers many opportunities for zen-like unlightenment, by throwing spitballs at every attempt to make sense of user-generated input handling. By user-generated I mean anything originating from an unreliable entity (a.k.a. a user), and therefore without any guarantees towards being well-formed for its intended purpose. Users are notoriously good at this. You ask them to type in a number, they say 'twelve'. You ask them for their surname, they type 'Mozes Kriebel'. The last case is problematic because most programmers are just like ordinary people, and ordinary people assume surnames are just single words. Our habit is reasoning about the common case, instead of considering all possible alternatives, and therefore we live our lives believing such unbearable nonsense as 'all crocodiles are green', 'all people have ten fingers' and 'it never rains in the Sahara.' Each of us has his own set of ridiculous beliefs; I feel safe in the claim that there isn't a fact so wrong that it lacks a true believer. But maybe that's just one of my beliefs.
Back to the programmer who made the wrong assumption that surnames don't contain spaces. Our friend is trying to process a dataset of names by calling a utility for each one and he's using shell scripts. His first attempt didn't go so well.
process_surname $name
The process_surname utility expects just one argument--a surname. But when get_next_surname returned Mozes Kriebel, the shell's word splitting turned this into two arguments: "Mozes" and "Kriebel".
The programmer would soon overcome this problem by adding double quotes to prevent word splitting:
process_surname "$name"
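To see the difference for yourself, here is a minimal sketch. The count_args function is a hypothetical stand-in for process_surname (it is not part of the original script); all it does is report how many arguments arrived.

```sh
#!/bin/sh
# count_args is a hypothetical stand-in for process_surname:
# it only reports how many arguments it received.
count_args() {
    echo "$#"
}

name='Mozes Kriebel'

count_args $name      # unquoted: the shell splits on the space, prints 2
count_args "$name"    # quoted: the value stays one argument, prints 1
```

This assumes the default IFS (space, tab, newline); a script that changes IFS will split differently.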
That will do for today's lesson. But wait, there is more!
Let's assume that the process_surname utility is actually richer in functionality, and may accept some options with arguments.
if [ "$name" != "$origname" ]; then
    option="--nee $origname"
fi
process_surname $option "$name"
See what happened? If not, here's a short run-down. The first line checks if the name is identical to the original name (assume for the sake of the example that a name change could happen through marriage). Here, the double quotes are required as in the original script, to defend against shell script errors. If $name was unquoted, the test would read 'Mozes Kriebel != ...' which is incorrect syntax. The second line sets the --nee option, to pass to the processing function later on. This is also a quoted string.
The catch is in the final line. Observant readers spot the lack of quotes around $option. This is not a mistake! If $option were quoted, it would be passed as $1 to process_surname in its entirety, i.e. including the space following the '--nee' and the original surname after that. If this utility scans its arguments looking for an exact match of '--nee', it won't find it. So we need the shell's word-splitting to separate '--nee' from what comes after it.
The problem is now clear. If $origname happens to be 'Jemig de Pemig' there seems to be no way to preserve the spaces on passing it as an argument to --nee.
I won't dwell on my journey along the Path of Many Misconceptions About the Shell, but I will show you just about the simplest way to do this generally.
set --
if [ "$name" != "$origname" ]; then
    set -- --nee "$origname"
fi
process_surname "$@" "$name"
This is one of the few times I found a use for setting the positional parameters. The magic bit is in the use of "$@", which expands to the positional parameters, each one as a separate word as if it were individually quoted. In a plain POSIX shell there is no other construct that does this. The set on line 3 made $1 equal to "--nee" and $2 equal to "Jemig de Pemig". The last line is then equivalent to
process_surname "--nee" "Jemig de Pemig" "Mozes Kriebel"
which is exactly what we need.
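If you want to convince yourself, the sketch below replays the trick with a stand-in for the utility. Both show_args and build_and_call are hypothetical names made up for this illustration; show_args simply prints every argument it receives between angle brackets.

```sh
#!/bin/sh
# show_args is a hypothetical stand-in for process_surname:
# it prints each argument it receives in angle brackets, one per line.
show_args() {
    for arg in "$@"; do
        printf '<%s>\n' "$arg"
    done
}

# build_and_call is a hypothetical wrapper; inside a function, set --
# only replaces the function's own positional parameters.
build_and_call() {
    name=$1
    origname=$2
    set --                          # start with an empty argument list
    if [ "$name" != "$origname" ]; then
        set -- --nee "$origname"    # two parameters: the flag and its value
    fi
    show_args "$@" "$name"
}

build_and_call 'Mozes Kriebel' 'Jemig de Pemig'
# prints:
# <--nee>
# <Jemig de Pemig>
# <Mozes Kriebel>
```

When both names are equal the list stays empty, "$@" expands to nothing at all, and the utility receives just the one name.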

Thursday, July 21, 2011

CREAM did it, using bugs in path length constraints, in OpenSSL/Globus

Well, here goes. My first addition to the blog after many years of having the right to do so, as I've got a nice piece of software witchcraft to uncover.

This regards GGUS ticket #67040, which can be viewed by the happy few at the URL:

In short (inspired by the game Cluedo):
CREAM did it, using bugs in path length constraints, in OpenSSL/Globus

And now the slightly more elaborate explanation about the problem, how we analyzed it, interpreted the information and implemented a reliable workaround. It also shows that the CREAM CE itself is not directly the cause, but a trigger of the bug. This problem can occur in a lot of other places too and is a pain to analyse. One added reason why it's such a pain to analyse is that I'm seeing known effects and problems occur along the analysis, steering me in mildly the right direction, while I'm already mind-programming a workaround.

Reproducing the problem was hard:
The effect observed by users is a failure in job submission to any gLite 3.2 CREAM CE, when it's submitted through a WMS. Probably on all EMI-1 CREAM CEs too. The error message returned from the CREAM CE indicates a failure in gLExec's LCMAPS plugin that verifies a proxy certificate chain.

Prerequisites (all of this must be true aka logical AND) to trigger the faulty situation:
- Use the Terena eScience Personal TCS, which has a pathlen = 0 set on the final CA.
- Use old style proxies (GT2), note: they don't feature a path length constraint field.
- Use a CREAM CE on gLite 3.2 (uses Globus GT4 from VDT)
- Access the CREAM CE through a WMS to use sufficient delegations or MyProxy

Change any of the above parameters and it will work. Meaning, the problem did NOT occur when the following was used:
- Direct job submission (only ONE proxy delegation may be used)
- Direct gLExec test on the shell, which just works.

Unverified situations:
- The effects when using RFC 3820 proxies
- Using EMI-1's CREAM CE

Tests have shown that the certificate chain is constructed properly. The hypothesis is that the GT4 from the VDT is interfering with OpenSSL sequences that we rely on in LCMAPS.

Cause(s) of the problem and analyses so far:
The gLExec in the CREAM CE uses LCMAPS to perform the account mapping in gLite 3.2. LCMAPS is dynamically linked to Globus to support its direct Globus based interfaces. The LCMAPS framework launches several plugins, of which the verify-proxy is the first, from the lcmaps-plugins-verify-proxy package.

The verify-proxy fails with an error in the log file, originating from OpenSSL, that the path length of the certificate chain exceeded the constraint bound from the certificate chain itself. Analysis of the chain has shown that neither the RFC 5280 path length constraint nor the RFC 3820 path length constraint applied here. The TERENA eScience Personal CA has a critical basic constraint set to indicate a path length of 0 (=zero). This means that no other CA certificate can follow this CA certificate in a chain. The RFC 3820 path length constraint doesn't apply to old-style (i.e. GT2) proxy certificates.
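For those who want to check a CA certificate themselves, here is a small sketch. It assumes the text dump format of openssl x509, where the constraint shows up as "CA:TRUE, pathlen:N" under the Basic Constraints heading. The helper name pathlen_from_text and the file name ca.pem are made up for this illustration.

```sh
#!/bin/sh
# pathlen_from_text is a hypothetical helper. It reads the text dump of a
# certificate (as printed by `openssl x509 -noout -text`) on stdin and prints
# the pathlen value from the Basic Constraints line, or "none" if the
# certificate sets no path length constraint.
pathlen_from_text() {
    sed -n 's/.*CA:TRUE, pathlen:\([0-9][0-9]*\).*/\1/p' | {
        read -r value || value=none
        echo "$value"
    }
}

# Usage (ca.pem is a placeholder file name):
#   openssl x509 -in ca.pem -noout -text | pathlen_from_text
```

For a CA like the one described above this should print 0; a CA without the constraint gives "none".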

Despite the installation and the certificate chain involved, OpenSSL triggers an X509_V_ERR_PROXY_PATH_LENGTH_EXCEEDED error code, indicating that the path length was exceeded in the proxy certificates. Given the research on the certificate chain we will assume that this is a false-positive (or true-negative).

The interesting detail here is that the Terena eScience Personal CA, Terena eScience SSL CA and the FNAL SLCS are the only CAs using a Path Length Constraint of 0 (=zero) in the IGTF. This gives a motivation to search in this direction as similar certificate chains are not affected at all.

On both our EMI and gLite 3.2 test nodes running gLExec we couldn't reproduce the problem. We tried a gLite 3.2 CREAM CE and could reproduce the failure when we introduced a few extra delegations to the certificate chain before we submitted a test job.

Looking at the libraries used on the CREAM CE, being GT4 from the VDT, and knowing that their OpenSSL interaction is significantly different, made us put the blame on the GT4 libraries. They are known to have changed parts of OpenSSL itself and their own callbacks. This might cause the weird effect in the verification stage. We've experienced race conditions in library loading before, where the order of dynamic library resolution and loading was significant for the observed failures. This problem has characteristics of that, as it seemed to be specific to the machine. We would need to investigate the GT4 code that interacts with OpenSSL to be certain about it. This is not an easy task and might be too expensive, while a workaround is possible.

We looked at the CREAM CE interaction some more, installed a new CREAM CE from scratch and were interested to reproduce the problem in gLExec. Somehow we couldn't reproduce it when we ran gLExec standalone on the CREAM CE. This should not happen. It should have failed. We tried another proxy chain (mine this time) created from my OSX build of voms-proxy-init version 1.8.8. Again, the problem didn't occur. I hacked the gLExec script that was executing on the failing CREAM CE, which I tested using the glite-ce-job-submit tool, to copy the proxy certificate before deleting itself. We used this chain in the bare gLExec run and then it failed. This certificate chain was examined and turned out to be OK, but it is different as it had CA certificates in it.

This seemed to be the root cause of the problem. The CREAM CE (or perhaps its delegation service) is writing the proxy certificate chain from the SSL context in the Tomcat instance from the user's interaction. This certificate chain was written including all the CA certificates up to the root CA.
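A quick first check on such a file is to count the certificates in it and compare that with a proxy file you created yourself. The sketch below does only that; count_pem_certs is a hypothetical helper name, and the count alone doesn't tell you which of the certificates are CA certificates.

```sh
#!/bin/sh
# count_pem_certs is a hypothetical helper: it prints how many
# certificates a PEM file contains by counting the BEGIN markers.
count_pem_certs() {
    grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Usage (both file names are placeholders):
#   count_pem_certs my-own-proxy.pem
#   count_pem_certs proxy-copied-from-ce.pem
```

A higher count for the copied file than the extra delegations alone would explain is a hint that something more than proxies was written to it.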

We tested the gLExec with the output of voms-proxy-init/grid-proxy-init, which do *not* include the CA certificates in the certificate chain. As these are not added, the CA certificates will be added to the verification sequences in a different way by the OpenSSL routines. This is required to verify the full chain. There is a use case for adding your own (intermediate) CA to the client/host certificate chain, but this doesn't count in the Grid world with the IGTF. As the CA certificates are added in a different way later and treated differently, OpenSSL will verify the certificate chain differently. Either the Globus OpenSSL or the OpenSSL 0.9.8a is to blame that certificate chains with old-style proxies have the path length constraint field, used exclusively for RFC 3820 proxies, set to 0 (=zero) instead of -1 (=minus one) aka uninitialized. This nullification is most probably triggered by the path length constraint value in the Terena sub-CA certificate added to the normal certificate chain evaluation sequences, instead of kept aside in the list of used CA certificates for a certificate chain in an SSL context.

The workaround: build a DIY (=Do It Yourself) Path Length Constraint check a la RFC 5280 and RFC 3820 in the verify-proxy LCMAPS plugin. This will work around any potential library loading issue that could possibly happen. It also works around odd implementations of the verification sequences and it can work around the bug of wrong initialization values for the path length constraint. Another possible workaround would be to alter the certificate chain before it hits the verification stage. This could work, but needs research into the right location in the OpenSSL code to make it work reliably. We're also going to introduce a duplicate of the certificate chain so as not to tamper with the original input, and pragmatically we need to work with two different certificate chains. The first option is significantly less work and straightforward.
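To make the two rules a bit more tangible, here is a toy model of such a check in shell. This is not the plugin's code (which lives in the LCMAPS plugin and works on real certificates); it is a simplified sketch with a made-up notation, and the function name check_pathlen is hypothetical. It ignores the special treatment of self-issued CA certificates in RFC 5280.

```sh
#!/bin/sh
# check_pathlen is a hypothetical, simplified model of the two path length
# rules. Arguments describe a chain from the root towards the leaf:
#   ca:N     CA certificate with RFC 5280 pathLenConstraint N
#   ca:-     CA certificate without a constraint
#   eec      end-entity (user or host) certificate
#   proxy:N  RFC 3820 proxy with pCPathLenConstraint N
#   proxy:-  proxy without the field (always the case for GT2 proxies)
check_pathlen() {
    ca_left=-1       # -1 means "no limit seen so far"
    proxy_left=-1
    for cert in "$@"; do
        kind=${cert%%:*}
        limit=${cert#*:}
        case $kind in
        ca)
            if [ "$ca_left" -eq 0 ]; then
                echo "too many CA certificates" >&2
                return 1
            fi
            if [ "$ca_left" -gt 0 ]; then ca_left=$((ca_left - 1)); fi
            # A constraint only ever tightens the remaining budget.
            if [ "$limit" != - ]; then
                if [ "$ca_left" -lt 0 ] || [ "$limit" -lt "$ca_left" ]; then
                    ca_left=$limit
                fi
            fi
            ;;
        proxy)
            if [ "$proxy_left" -eq 0 ]; then
                echo "too many proxy certificates" >&2
                return 1
            fi
            if [ "$proxy_left" -gt 0 ]; then proxy_left=$((proxy_left - 1)); fi
            if [ "$limit" != - ]; then
                if [ "$proxy_left" -lt 0 ] || [ "$limit" -lt "$proxy_left" ]; then
                    proxy_left=$limit
                fi
            fi
            ;;
        esac
    done
    echo "path length ok"
}

# A chain shaped like the one in this story: root CA, sub-CA with
# pathlen 0, user certificate, three old-style proxies.
check_pathlen ca:- ca:0 eec proxy:- proxy:- proxy:-
```

In this model the chain from the story passes: the pathlen of 0 on the sub-CA only forbids further CA certificates, and old-style proxies carry no proxy limit at all. That is the outcome we argued OpenSSL should have reached.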

To consider for other tools:
OpenSSL and possibly GT5 need double checking whether the support for RFC proxies is capable of handling edge-case input, as demonstrated by the CREAM CE (or a component thereof). The CREAM CE should not add the CA certificates to the gLExec input. We should be tolerant on the gLExec side, but regardless the CREAM CE should not have done this and should have followed the same approach with gLExec as when setting up an SSL context. This means that you do not send CA certificates over the wire unless you are absolutely sure that this is really needed.

lcmaps-plugins-verify version 1.4.11 is to be certified featuring a function to catch the X509_V_ERR_PROXY_PATH_LENGTH_EXCEEDED error and check the certificate chain for its RFC 5280 and RFC 3820 compliance regarding path length constraints.

Wednesday, May 11, 2011

notes from a dirty system installation

Normal system installations involve boot media, such as a CD-ROM, USB, or even a floppy. In our case it's PXE boot (netboot), which is a little more involved to set up initially because you need a network plan with DNS, DHCP, tftp and probably HTTP, but it is definitely worth the effort if you have to manage a couple of hundred systems. Some new ways to do installations have arrived with the introduction of virtual machines, and this is very easy as you only need to provide the disk or CD-ROM images as files on the host system.
But Sven asked me to transfer a system from a virtual machine to a physical box, for some reason that I won't mention here now. He would provide me with the (small) disk image of the virtual machine, and the physical box was something old that had already seen some use and was hooked up with the network.
I quickly realised that this was going to be an interesting exercise. I would need to write the image to disk, which meant that I had to boot into a ramdisk of sorts. The first problem that presented itself was that the box would only do PXE boot, and as the network was not under my control I would have to involve other system administrators.
The box had a previous installation on it (Backtrack, which is Debian based), and I figured I might as well try to do everything from within this installation.
After adding my ssh key to /root/.ssh/authorized_keys (and turning off the firewall) I could get out of the noise of the machine room and work from the peaceful quiet of my office. By inspecting /proc/partitions I found out the machine had 2 disks, and Sven agreed that we should set up a (software) RAID1 mirror set.

Now the system was running from /dev/hda1, and I couldn't mess with that disk live. (You should try this some day if you feel in a particularly evil mood; run dd if=/dev/zero of=/dev/sda in the background while you continue to work. Observe how the system develops amnesia, dementia and finally something close to mad cow disease.) I decided to do something dirty: I would create a RAID1 set with just one disk. The mdadm program thinks this is a bad idea, so you have to --force the issue. The command-line was mdadm --create --level=1 -n 1 --force /dev/md0 /dev/hdb1 or something. Of course I first repartitioned /dev/hdb to have just a single partition of type raid autodetect. Next step: losetup -f vmdisk.img to treat the disk image as a block device, and kpartx -a /dev/loop0 to have mapped devices for each of the partitions inside. Now a simple dd if=/dev/mapper/loop0p1 of=/dev/md0 was all I needed to write the image.
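Put together, the steps look roughly like the sketch below. To keep it harmless it only prints the commands unless you explicitly switch off the dry run; the run helper, the DRY_RUN variable and the function name are my own inventions for this illustration, and the loop device is assumed to be /dev/loop0, which is only true if that was the first free one.

```sh
#!/bin/sh
# run is a hypothetical helper: with DRY_RUN=1 (the default here) it only
# prints the command, so the sketch can be read and tried without touching
# any disk. Set DRY_RUN=0 to really execute, at your own risk.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Device and file names are the ones from the text above.
image_to_degraded_mirror() {
    run mdadm --create --level=1 -n 1 --force /dev/md0 /dev/hdb1
    run losetup -f vmdisk.img
    run kpartx -a /dev/loop0
    run dd if=/dev/mapper/loop0p1 of=/dev/md0
}

image_to_degraded_mirror
```

Remember that the dd step overwrites whatever is on /dev/md0, so triple-check the device names before running any of this for real.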
The next step was to boot into the newly written system (a Debian 5). This involved some grubbery: after mount /dev/md0 /mnt and chroot /mnt I could navigate the system as if it was already there. I had to edit /boot/grub/menu.lst to set the root device to (hd1,0) and root=/dev/md0, and I had to install grub on the first bootable disk, which was (hd0,0). After cloning /etc/network/interfaces from the present system and setting up ssh keys again to ensure access, I rebooted with fingers crossed.
Call it luck, but it worked. I was now running the cloned system from a RAID1 root device with only one disk. I did a resize2fs /dev/md0 because the image I originally wrote to it was really small compared to the disk. Now it was /dev/hda's turn to be added to the RAID set. After repartitioning to have the same size as its counterpart (the two disks weren't the same size), I added it with mdadm --manage --add /dev/md0 /dev/hda1 which unfortunately didn't work as expected, as the new addition just became a spare.

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[1](S) hdb1[0]
120053632 blocks [1/1] [U]

Notice the (S) which indicates that hda1 is a spare. It won't be used until another disk fails, but as this set unfortunately only has a single disk, a single disk failure means game over.
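If you have more than a handful of machines you may want to spot this situation automatically. Here is a small sketch that picks the spares out of mdstat-style text; md_spares is a hypothetical helper name, and it only assumes the line layout shown in the output above.

```sh
#!/bin/sh
# md_spares is a hypothetical helper: it reads /proc/mdstat-style text on
# stdin and prints "array device" for every member flagged (S).
md_spares() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(S\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the [n](S) suffix
                print $1, dev
            }
    }'
}

# Usage:
#   md_spares < /proc/mdstat
```

For the output above this would print "md0 hda1".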
The final command to activate the spare was mdadm --grow --raid-devices=2 /dev/md0. This enlarges the raid set, and the spare will now be activated. Indeed, the system started to recover immediately:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[2] hdb1[0]
120053632 blocks [2/1] [U_]
[>....................] recovery = 0.0% (84608/120053632) finish=70.8min speed=28202K/sec

It took a while, but eventually I had a mirrored two-disk raid set!

Wednesday, April 21, 2010

Moving to Ubuntu 10.04 (Lucid) on my Macbook 2,1

I've been the happy owner/user of a Macbook (a 2,1 to be exact) for a few years now. Actually it is company property, but since I'm the only user they let me run whatever I like. Which happens to be Ubuntu.

It was really Willem who set me on to this; I don't think I would have had the guts to do this to my Mac if he hadn't been adamant that it would work. And it does! It did require quite a few tweaks, though, as the Ubuntu community page testifies.

But overall I was happily running 9.10 (Karmic); I learned to live with the few quirks, like having to log off every time I came into work and attached the external monitor (it would crash the video driver if I didn't). And suspend/resume was just really slow.

Now that Lucid is just around the corner, and after my previous success with installing it on my wife's new desktop, I got feisty and tried it on the Mac. First, I used the live CD and that worked so splendidly that I just had to do the actual upgrade. It lasted all night, but in the end it worked like a charm.

So at this point it's worth making a few notes.
  • The new theme did not appear the first time I booted. This was to be expected because I did an upgrade, not a fresh install. Most things should preferably stay as they were.
  • The ssh-agent environment variables were gone from my terminal shells. Why? I don't know, but a somewhat related bug report suggests the use of the keychain package.
  • Thunderbird 3.0 is chugging away on indexing all my mail. I could turn it off but I think I'll let it run for a while and check out the new search capability.
  • Sound didn't immediately work on the live CD but this was resolved by killing the pulseaudio daemon. Sound does work after the upgrade.
  • The volume and brightness buttons work too. Very nice.
  • Attaching the external monitor turns off the desktop effects. This may be related to the crash bug I mentioned earlier. But you can turn it back on right after.
That's it for now, more news later!

Friday, April 9, 2010

My new inspiron 560

(Sorry, this is not really about software but still something I'd like to share)

I decided to go all-in and finally buy myself (and my wife) a new desktop machine. It replaces a machine that I got for a token fee, an old box that my company was going to recycle. At the time I thought it would be good enough for running Linux, even though it was slow, didn't have a big disk and had little memory; Linux is frugal, right? Well, the Linux desktop has grown up considerably in the last decade and the combined memory footprint of Evolution, Gnome and Firefox rivals that of mainstream systems.

So the new machine which now proudly hums away on my desk (more on the humming later) is an all-black, all-shiny, DELL Inspiron 560. The cheapest DELL has to offer through their website, for just over 400 euros including delivery to my home. Even though it is cheap, it is a step up from what I had: it has a dual-core 2.7 GHz CPU, 4 GB memory, 320 GB disk. Funny detail: black was the cheapest color. Reminds me of the Ford model T days.

My wife is the principal user of the desktop, since I do most of my work on a laptop that I carry everywhere. To save myself from a support nightmare, I switched to Ubuntu LTS releases some time ago. Just mid 2009, we made the switch to Ubuntu 8.04. So Ubuntu 8.04 was the first and most logical choice for the new box; the smoothest possible transition I could imagine.

After popping in a fresh Ubuntu 8.04.4 CD and starting up, most things looked alright. Except there was no network. This puzzled me somewhat as a device clearly showed up in the output of ifconfig; there were just no packets coming through at all. A quick search for the precise hardware spec revealed a known issue with the driver, and the workaround to download and use the Realtek provided (open-source) driver. I was worried that this problem would just keep coming back with every kernel update, but it fixed the immediate issue.

Meanwhile, my attention was drawn to another, quite severe problem. The machine was making quite a bit more noise than I expected, to the point of being irritating. There was clearly a fan spinning loud and hard in there. My first suspicion went to the case fan near the rear of the box, but this turned out to be wrong. To make matters worse, the loud fan had a bearing problem and started to make horrible rumbling sounds.

Since I had already done away with Windows completely I had no chance to verify that this was caused by some software defect. It started as soon as the computer was turned on, so in any case it wasn't due to something Linux did. It was such a depressing conclusion that I had bought a lemon. Just the thought of having to spend time and energy getting this fixed (imagine explaining to a support person over the phone that you run Linux, not Windows...) caused agony.

Luckily DELL shipped a bootable diagnostics CD with the computer, and it allows you to run several tests to verify the correct workings of the machine. Two tests were especially interesting: a CPU fan test and a case fan test. Both tests drive up the fans to a high RPM, and then down again. I should explain that while I was running from the diagnostics disc, the terrible noise persisted.

The CPU fan test revealed that the CPU does indeed have a fan (or maybe more than one) and that it can be heard, but only at high RPM. At low settings (normally, if the CPU is not under load) it can hardly be made out (1700 RPM or so). The case fan was even lower; it's a big one so it runs only at 500 RPM nominally, but can be stepped up to 1500 or so if the case runs hot. Both tests produced different sounds, and the noisy fan stayed noisy. It could only mean one thing: the video card fan. The card is an Nvidia GeForce 310.

The Ubuntu 8.04 system worked, but not very smoothly. The network driver was a kludge, I couldn't get anything but the VESA driver working for video (or I didn't try hard enough) and the noise made the whole thing just unworkable. I decided that maybe, just maybe, the upcoming LTS release, Ubuntu 10.04, would be a better option. This proved to be sheer lucidity.

Ubuntu 10.04 is not even out, but you can get beta 2 already and this is actually encouraged as the experience of more people trying the system at this stage will help iron out the remaining wrinkles. From the moment I popped in the CD and started, I was amazed by the ground they've covered since the last release (I'm running 9.10 on my laptop so I'm close to cutting-edge there). The installation was smooth, very few questions asked and in no time at all I had a new OS running plus network. (Still, noise.) I logged on, clicked around appreciatively, and then I selected 'maximum visual effects'. This triggered the system to prompt me whether I wanted to install the proprietary Nvidia drivers (of course, you silly!). After the installation there was some hiccup about not being able to switch or load drivers (some kind of fb kernel driver got in the way? I couldn't tell), but a reboot set things straight. And how! As soon as the Nvidia driver loaded, the computer became silent (relatively speaking). And this makes all the difference. First I had regretted my choice for this particular desktop machine, and now I find it very good value for money. I should probably still chase up DELL about the resonance in the video card fan, but it no longer prevents me from enjoying the new computer. It's funny that during startup and shutdown the noise can still be heard, that is before loading and after unloading the Nvidia driver.

And Ubuntu 10.04 is just fine, even at beta 2. I was so confident that I replaced my old desktop with the new one just the day I had to leave for four days to visit the last EGEE User Forum, after I rsync'd all user data and tested that my wife could still read her e-mail.

Friday, February 12, 2010

Chicken and egg: install rpm using rpm

What do you do when a colleague has deleted the rpm and yum packages from a CentOS system (by mucking around with the sqlite package)? Reinstall them of course. Hmm, but how, when rpm is absent?

Setting up a local rpm installation

The solution is to copy rpm from another computer with (approximately) the same operating system. The following files are required (substitute lib for lib64 when you're on a 32-bit system); put them in a temporary directory on the target system:

  • /bin/rpm --> bin/
  • /usr/lib64/librpm*.so --> lib/
  • /usr/lib64/libsqlite*.so --> lib/
  • /usr/lib/rpm/macros --> lib/rpm

Some configuration files are expected to be present, though, and rpm needs to be told to look for them in the correct location. This is done with a little wrapper script (named like this:

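#!/bin/sh
# Run the borrowed rpm binary with its own libraries and macros.
here=`dirname "$0"`
export LD_LIBRARY_PATH="$here/lib"
# Temporarily replace the user's macros file; it is restored afterwards.
[ -e ~/.rpmmacros ] && mv ~/.rpmmacros ~/.rpmmacros.orig
cp "$here/lib/rpm/macros" ~/.rpmmacros
"$here/bin/rpm" --rcfile "$here/lib/rpmrc" --define "_rpmlock_path /var/lock/rpm" "$@"
[ -e ~/.rpmmacros.orig ] && mv ~/.rpmmacros.orig ~/.rpmmacros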

You can then use the temporary rpm installation by going to the temporary directory and running ./ It will still work on the system's package database.

Installing rpm's RPMs

This is easy now. First download rpm and required packages from a CentOS mirror. You need the packages for rpm, rpm-libs and sqlite (make sure you choose the right platform, i386 or x86_64). Then do a ./ -i *.rpm so all these packages are installed at once. Now you can run the system's rpm again, phew!

Installing Yum

You may still need to get yum back. This is done similarly, by downloading the packages yum, yum-fastestmirror, yum-metadata-parser, rpm-python and python-sqlite. Then do a ./ -i *.rpm for these and you can install packages easily again.

Wednesday, July 1, 2009

User SSH configuration for virtualised remote hosts

As part of the agile testbed here at Nikhef's grid group, there are a couple of Xen hosts. Thanks to Dennis' mktestbed scripts it's quite easy to manage the guests' lifecycles. The guests are on their own network and one can use ssh forwarding to access them from the desktop. So you would login to the Xen host, and from there login to the guests.
But sometimes it is convenient to use graphical tools available on the desktop to do something on the remote guest machine. This would be possible with a direct SSH connection. The straightforward solution would be to use SSH port forwarding.
There is a more convenient way to make the remote guests appear as ordinary hosts from the desktop via ssh (without resorting to a VPN or so): using the per-user ssh configuration file located in ~/.ssh/config:
Host coolhost.testdomain coolhost
Hostname coolhost.testdomain
Protocol 2
User root
# avoid often changing host fingerprint prompt
CheckHostIP no
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
ForwardX11 yes
# route through the Xen host
ProxyCommand ssh -q -A xenhost.domain nc %h %p 2>/dev/null
The key line here is the last one: this opens an ssh connection to the Xen host, and uses netcat to open a connection to the guest's ssh socket.
The configuration above also removes the hostkey check present in SSH. Usually one would really want that check, but as I'm generating and destroying machines all the time and the connection to the xenhost is verified already, it doesn't really bring much additional security.
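With guests coming and going all the time, typing such a block by hand for each one gets old quickly. Below is a sketch of a small generator; print_guest_stanza is a hypothetical helper (it is not part of the mktestbed scripts), and it simply prints the same block as above for a given guest, domain and Xen host.

```sh
#!/bin/sh
# print_guest_stanza is a hypothetical helper: it prints a Host block like
# the one above for a given guest name, guest domain and Xen host.
print_guest_stanza() {
    guest=$1
    domain=$2
    xenhost=$3
    cat <<EOF
Host $guest.$domain $guest
Hostname $guest.$domain
Protocol 2
User root
# avoid often changing host fingerprint prompt
CheckHostIP no
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
ForwardX11 yes
# route through the Xen host
ProxyCommand ssh -q -A $xenhost nc %h %p 2>/dev/null
EOF
}

print_guest_stanza coolhost testdomain xenhost.domain
```

Append the output to your ssh configuration file and a plain ssh coolhost (or scp, or anything else that runs over ssh) should reach the guest through the Xen host.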