Thursday, July 21, 2011

CREAM did it, using bugs in path length constraints, in OpenSSL/Globus

Well, here goes. My first addition to the blog after many years of having the right to do so. As I've got a nice piece of software witchcraft to uncover.

This regards GGUS ticket #67040, which can be viewed by the happy few at the URL: https://ggus.eu/ws/ticket_info.php?ticket=67040


In short (inspired by the game Cluedo):
CREAM did it, using bugs in path length constraints, in OpenSSL/Globus


And now the slightly more elaborate explanation about the problem, how we analyzed it, interpreted the information and implemented a reliable workaround. It also shows that the CREAM CE itself is not directly the cause, but a trigger of the bug. This problem can occur in a lot of other places too and is a pain to analyse. One added motivation on why its such a pain to analyse is that I'm seeing known effects and problems occur along the analyses steering me in mildly the right direction, while I'm already mind-programming a workaround.

Reproducing the problem was hard:
The effects observed by users is a failure in job submission to any gLite 3.2 CREAM CE, when its submitted through a WMS. Probably also on all EMI-1 CREAM CE too. The error message returned from the CREAM CE indicates a failure in gLExec's LCMAPS plugin that verifies a proxy certificate chain.

Prerequisites (all of this must be true aka logical AND) to trigger the faulty situation:
- Use the Terena eScience Personal TCS, which has a pathlen = 0 set on the final CA.
- Use old style proxies (GT2), note: they don't feature a path length constraint field.
- Use a CREAM CE on gLite 3.2 (uses Globus GT4 from VDT)
- Access the CREAM CE through a WMS to use sufficient delegations or MyProxy

Change any of the above parameters and it will work. Meaning, the problem did NOT occure when the following was used:
- Direct job submission (only ONE proxy delegation may be used)
- Direct gLExec test on the shell, which just works.

Unverified situations:
- The effects when using RFC 3820 proxies
- Using EMI-1's CREAM CE

Hypothesis:
Tests have shown that the certificate chain is constructed properly. The hypothesis is that the GT4 from the VDT is interfering with OpenSSL sequences that we rely on in LCMAPS.

Cause(s) of the problem and analyses so far:
The gLExec in the CREAM CE uses LCMAPS to perform the account mapping in gLite 3.2. LCMAPS is dynamically linked to Globus to support its direct Globus based interfaces. The LCMAPS framework launched several plugins, of which the verify-proxy is the first, from the lcmaps-plugins-verify-proxy package.

The verify-proxy fails with an error in the log file, originating from OpenSSL, that the path length of the certificate chain exceeded the constraint bound from the certificate chain itself. Analyses of the chain has shown that both the RFC5280 path length constraint and the RFC3820 path length constraint did not apply here. The Terena eScence TERENA eScience Personal CA has a critical basic constraint set to indicate a path length is 0 (=zero). This means that no other CA certificate can follow this CA certificate in a chain. The RFC 3820 path length constraint doesn't apply on old-style (i.e. GT2) proxy certificates.

Despite the installation and the certificate chain involved; OpenSSL triggers an X509_V_ERR_PROXY_PATH_LENGTH_EXCEEDED error code, indicating the path length exceeded in the proxy certificates. Given the research on the certificate chain we will assume that this is a false-positive (or true-negative).


The interesting details here is that the Terena eScience Personal CA, Terena eScience SSL CA and the FNAL SLCS are the only CAs using a Path Length Constraint of 0 (=zero) in the IGTF. This gives a motivation to search in this direction as similar certificate chains are not affected at all.

On both our EMI and gLite 3.2 test nodes running gLExec we couldn't reproduce the problem. We tried a gLite 3.2 CREAM CE and could reproduce the failure when we introduced a few extra delegations to the certificate chain before we submitted a test job.

After looking at the libraries used on the CREAM CE, being GT4 from the VDT, and knowing that the OpenSSL interaction is significantly different made us put the blame on the GT4 libraries. They are known to have changed parts of OpenSSL itself and their own callbacks. This might cause the weird effect in the verification stage. We've experienced race condition in library loading where the order of dynamic library resolvement and loading was significant for the observed failures. This problem has characteristics of it as the problem seemed to be specifc to the machine. We would need to investigate the GT4 OpenSSL interacting code to be certain about it. This is not an easy task and might be too expensive, while a work around is possible.

We looked at the CREAM CE interaction some more, installed a new CREAM CE from scratch and were interested to reproduce the problem in gLExec. Somehow we couldn't reproduce it when we ran gLExec standalone on the CREAM CE. This should not happen. It should have failed. We tried another proxy chain (mine this time) created from my OSX build of voms-proxy-init version 1.8.8. Again, the problem didn't occure. I hacked the gLExec script that was executing on the failing CREAM CE, which I tested using the glite-ce-job-submit tool, to copy the proxy certificate before deleting itself. We used this chain in the bare gLExec run and then it failed. This certificate chain was examined, turned out to be OK, but is different as it had CA certificates in it.

This seemed to be the root cause of the problem. The CREAM CE (or perhaps its delegation service) is writing the proxy certificate chain from the SSL contect in the Tomcat instance from the user's interaction. This certificate chain was writing including all the CA certificates up to the root CA.

We tested the gLExec with the output of voms-proxy-init/grid-proxy-init which do *not* include the CA certificates in the certificate chain. As this is not added, the CA certificates will be added to the verification sequences in a different way by the OpenSSL routines. This is required to verify the full chain. There is a use case for adding your own (intermediate) CA to the client/host certificate chain, but this doesn't count in the Grid world with the IGTF. As the CA certificates are added in a different way later and treated differently, OpenSSL will verify the certificate chain differently. Either the Globus OpenSSL or the OpenSSL 0.9.8a is to blame that certificate chains with old-style proxies have the path length constraint field, used exlusively for RFC 3820 proxies, set to 0 (=zero) instead of -1 (=minus one) aka uninitialized. This nullification is most probably triggered by the path length constraint value in the Terena sub-CA certificate added to the normal certificate chain evaluation sequences, instead of kept aside in the list of used CA certificates for a certificate chain in an SSL context.

Workaround:
Build a DIY (=Do It Yourself) Path Length Constraint a la RFC 5280 and RFC 3820 in the verify proxy LCMAPS plugin. This will work around any potential library loading issue that could possibly happen. It also works around odd implementations of the verification sequences and it can work around the bug of wrong initialization values for path length constraint. Another possible workaround would be to alter the certificate chain before it hits the verification stage. This could work, but needs research in the right code-wise location in OpenSSL to let this work reliably. We're also going to introduce a duplication of the certificate chain to not tamper with the original input and pragmatically we need to work with two different certificate chains. The first option is significantly less work and straight forward.

To consider for other tools:
OpenSSL and possibly GT5 needs double checking if the support for RFC proxies is capable of handling edge-case input, demonstrated by the CREAM CE (or a component thereof). The CREAM CE should not add the CA certificates to the gLExec input. We should be tolerant on the gLExec side, but regardless the CREAM CE should not have done this and should have followed the same approach with gLExec as to setting up an SSL context. This means that you do not send CA certificates over the wire unless you are absolutely sure that this is really needed.

Output:
lcmaps-plugins-verify version 1.4.11 is to be certified featuring a function to catch the X509_V_ERR_PROXY_PATH_LENGTH_EXCEEDED error and check the certificate chain for its RFC 5280 and RFC 3820 compliance regarding path length constraints.