OurGrid - Bug 871

This bug was a consequence of strange behavior in the Openfire XMPP server: some server-to-server (s2s) messages were arriving out of order. We struggled through several approaches and ideas before finding out "the truth". Here is the complete story:

A new Peer wanted to join the OurGrid community, so it registered with LSD's Discovery Service and began to be shown at http://status.ourgrid.org/. Sometimes, however, it strangely disappeared and then appeared again. This Peer had one particular feature: it was not connected to xmpp.ourgrid.org [1], but to a different server (also running Openfire).

After raising the log level to debug, we verified that LSD's Discovery Service was receiving failure notifications for this Peer, which was unexpected because the Peer had never been stopped. After some debugging, we discovered that Commune was sending the failure notifications because it was receiving messages with unexpected sequence numbers (Commune is designed to shut down the connection with an external component that sends messages whose sequence numbers differ from the expected ones).
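The strict check described above can be pictured with a minimal sketch. This is not Commune's actual code: the class name SequenceChecker and its method are hypothetical, and we assume sequence numbers simply increase by one per message.

```java
// Hypothetical illustration of a strict sequence check (not Commune's real code).
final class SequenceChecker {
    private long expected; // sequence number the next message must carry

    SequenceChecker(long firstExpected) {
        this.expected = firstExpected;
    }

    /**
     * Returns true and advances the counter when seq is the expected number.
     * Returns false otherwise; a caller would then treat the remote component
     * as failed and close the connection.
     */
    boolean accept(long seq) {
        if (seq != expected) {
            return false;
        }
        expected++;
        return true;
    }
}
```

With a check like this, a single pair of swapped messages is enough to make a healthy Peer look like a failed one, which matches what we saw.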

The first approach was to inspect Smack (which Commune uses) for bugs that could interfere with the order of arriving messages. After some Google research, however, we discovered that this issue was a known, already registered Openfire bug. So we tried installing other XMPP servers, such as ejabberd and Tigase, although they are very different from Openfire, which has a pretty, user-friendly web console (where you can change all the configuration). Tigase did not work at all, and ejabberd was very "slow"; in addition, some control messages, such as the components' status, were not being delivered.

After those attempts, the only solution we could envision was to implement a buffer for reordering messages at the application level (Commune). Every time a message arrives, the application checks whether it has the expected sequence number; if not, the message is buffered and waits, for a limited timeout, for the message with the right sequence number to arrive.
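To make the idea concrete, here is a minimal sketch of such a reordering buffer. It is an illustration only, not the code that shipped in Commune: the class ReorderBuffer, its methods, and the choice to pass the current time in as a parameter (to keep the example deterministic) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical reordering buffer keyed by sequence number (not Commune's real code).
final class ReorderBuffer {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private final long timeoutMillis;
    private long expected;          // next sequence number to deliver
    private long waitingSince = -1; // when the current gap was first seen; -1 if no gap

    ReorderBuffer(long firstExpected, long timeoutMillis) {
        this.expected = firstExpected;
        this.timeoutMillis = timeoutMillis;
    }

    /** Buffers the message and returns every message that is now deliverable in order. */
    List<String> offer(long seq, String payload, long nowMillis) {
        List<String> ready = new ArrayList<>();
        if (seq < expected) {
            return ready; // stale duplicate: already delivered, ignore it
        }
        pending.put(seq, payload);
        // Drain the buffer while its smallest entry is the one we are waiting for.
        while (!pending.isEmpty() && pending.firstKey() == expected) {
            ready.add(pending.pollFirstEntry().getValue());
            expected++;
        }
        if (pending.isEmpty()) {
            waitingSince = -1;        // no gap left
        } else if (!ready.isEmpty() || waitingSince < 0) {
            waitingSince = nowMillis; // a new gap starts now
        }
        return ready;
    }

    /** True when the missing message has been awaited for longer than the timeout. */
    boolean timedOut(long nowMillis) {
        return waitingSince >= 0 && nowMillis - waitingSince > timeoutMillis;
    }
}
```

For example, offering sequence numbers 0, 2, 1 in that order delivers "0" immediately, holds "2", and releases "1" and "2" together once "1" arrives. What to do when timedOut becomes true (report a failure, or skip the gap) is a design decision this sketch leaves open.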

The solution was implemented and tested, and until then everything worked fine. OurGrid version 4.2.4 was released and deployed... Users started to run their jobs... The community froze: jobs were not finishing, and some Workers stayed allocated to Brokers permanently... Let's do some log analysis... Messages are getting lost! Hmm... It seems the out-of-order message buffer is not working very well.

What to do now? After some "backtracking", we remembered that ejabberd is the XMPP server most used by all kinds of organizations and has a very active community, but... Why? It is very slow... And what about larger messages, like the OurGrid components' status?

The answer: you're doing it wrong! ejabberd has a lot of configuration properties, such as the c2s and s2s shapers, max_stanza_size for file transfers, and so on. Once properly configured, ejabberd works very well, with no out-of-order messages exchanged between distinct servers.

The truth: Openfire has a very strange bug, possibly caused by a concurrency problem, that has not been fixed as of its most recent release. ejabberd does not have it, and maybe for this (and a lot of other reasons) ejabberd is currently the most used.

The lesson: no matter how much pressure your boss is putting on you, never abandon an attempt (especially the one most likely to succeed) until you have investigated it as deeply as possible (or until your boss orders you to, of course).

[1] - OurGrid's public XMPP Server