LTSLLC Blog: Prospero

Showing posts with label Prospero. Show all posts

Wednesday, April 12, 2017

Introducing Miranda: Miranda is Distributed

In preparation for a talk I'm giving at DOSUG I'm going to post my thoughts as they develop. At this point, I'm filling in gaps, so things might jump around a bit.

One of the design goals of Miranda was to make it able to cross availability zones. This was due to a limitation of the previous system, Prospero. Prospero had trouble crossing availability zones because the database it used, Mnesia, did not like latency. Furthermore, Prospero used RabbitMQ a lot and every round trip meant network delays.

Miranda does not use Mnesia or RabbitMQ, so it does not have these problems.

If a hurricane takes down a data center, the remaining nodes will keep going. Subscriptions that the lost node was responsible for will be distributed among the remaining nodes.

When the down node comes back online, the system will "fill it in" on what happened while it was down. More specifically, when a node joins the cluster, it sends and receives the SHA1s of the cluster, users, topics, subscriptions, events and deliveries of the system. If the SHA1 that it has locally does not match a remote SHA1, then the system tries to merge the remote file with its local file.

When a node joins or leaves the cluster the online nodes hold an election. During the election, the subscriptions are distributed to the various nodes of the cluster.

When an Event or Delivery takes place it is shared among the members of the cluster. That way, any node can host any Subscription.

Changes to Users, Topics, and Subscriptions are also shared among the cluster members.

Sunday, April 9, 2017

Introducing Miranda: Prospero and Miranda Security

In preparation for a talk I'm giving at DOSUG I'm going to post my thoughts as they develop.

Prospero supported limited security. Admins could connect to it, log in, create new users and perform other administrative operations with a web browser. It used HTTP for everything, however, and all its communications were in plain text. It also used plain text to talk to RabbitMQ.

Messages (POSTs) sent to it had to be signed with a symmetric key. Admins therefore had access to all the user keys and they are stored in plain text in the database.

Miranda is more serious about security. When the system is first installed a new certificate authority is created. This CA is used to sign the certificates that the various nodes present when they join the cluster.

All users have a key pair and to do anything, they must first create a session. The session is a random, 8 byte integer that is encrypted with the user's public key when it is handed back to the user.

All communication going into Miranda is encrypted using SSL/TLS. Communications coming out of Miranda depend on the subscription: it can be HTTP or HTTPS.

Tuesday, March 28, 2017

Introducing Mindanda: the Next Step

In preparation for a talk I'm giving at DOSUG I'm going to post my thinking as it develops.

The next step for Prospero is Miranda.

Miranda builds on many of the concepts that Prospero had. Things like capturing HTTP POSTs, forwarding them to subscribers, etc. Miranda does what Prospero does and takes the next step. In addition to capturing POSTs, it also capture PUT and DELETE.

Miranda also addresses Prospero's limitations. Prospero, for example, does everything in the clear. Miranda does everything in SSL/TLS. There are many areas that Miranda improves on Prospero.

Miranda works by sitting in front of, and recording for, web services. When a client sends a POST/PUT/DELETE to an endpoint, Miranda records it. Since Miranda is a distributed, fault-tolerant system it makes the underling web service to appear to be more reliable than it actually is.

Ideally, the service appears to be as reliable as Miranda is, and hopefully, that is very reliable.

Monday, March 27, 2017

Introducing Miranda: Prospero

OK, if a system with "9 9s of reliability" is unattainable with a sane budget, what is attainable?

The answer was Prospero.

Prospero began life as a "skunkworks" project with Erlang. While its initial goal may have been something else, it was used to make systems which only had 1 or 2 9s of reliability seem like they were up most of the time.

Clients would define a topic and then send HTTP POSTs to Prospero. Other clients would "subscribe" to these topics and register a URL that Prospero would associate with them. When a client POSTed Prospero would, in turn, POST to the URL that had been registered.

Prospero did its job pretty well, and Pearson later released it as an open source project with the name of "Subpub."

Chris Chew initially did most of the development. I did a little bit in the way of XML processing.

Over the course of about 3 years of support, the following limitations were discovered:

It was difficult to find people with Erlang experience
Prospero was difficult to modify
Prospero was limited to one data center
Mnesia (the database that Prospero used) could become corrupted
Prospero had difficulty with binary messages

There were other problems like one down subscriber could cripple the system, or that Prospero would stubbornly refuse to send any additional messages until all of the current messages were dealt with, but these were the issues that really stuck with us.

Saturday, March 25, 2017

Introducing Miranda: Motivation

In preparation for a talk that I'll give in June at DOSUG I'm going to post my thinking as it develops. The first section has to do with the motivation for Miranda. The next section deals with how Miranda works and the last sections sums up.

So without further ado here's what I came up with:

The first section deals with the motivation for and the origins of the predecessor to Miranda: Prospero. It talks about how our boss wanted "9 9s of reliability" and how this works out to 30 milliseconds of downtime per year! For the curious, I did some tests...

A ping of google.com yielded an average of time of 100ms.
micosoft.com and ibm.com timed out.

A system with "9 9s of reliability" is unattainable for the average company because it would cost too much. You would need highly available hardware, several sites, a distributed, formally verified system, and hundreds if not thousands of developers.

Consider that the system would need to run on special hardware that has at least two "nodes" each of which has its own CPU, memory, disk and network connection. Each node sends a "heartbeat" to the other nodes to ensure they are up. Remember that a failing node needs to switch over in less than 30 milliseconds when a fault occurs or the show's over!

Just the system that monitors the actual system would be challenging to design!

Such a system could, however, be built.

It would just probably cost billions of dollars, take years to develop, and require hundreds if not thousands of developers.

Friday, March 24, 2017

Why not Prosepo? Why not Modify Prospero?

Up until now, I have been talking about the limitations of the Prospero system. But wouldn't it be easier to just modify Prospero instead of starting from scratch? That's what this post is about.

I had to create a new system because:

A total rewrite was required to get away from Erlang
Prospero is tightly tied to Mnesia
Prospero is not well documented

A Total Rewrite was Required to get away from Erlang

A limitation of Prospero was that finding people with Erlang experience was difficult or impossible. Using a language like Java means a total rewrite anyways.

Prospero is tightly tied to Mnesia

The database that Prospero uses, Mnesia, is very tightly coupled to Prospero. One of Mnesia's limitations is that it doesn't cross availability zones well. Getting Prospero to work with multiple availability zones would require replacing Mnesia - a major rewrite. Given that situation, a total rewrite would not require too much additional effort.

Prospero is not well Documented

While Prospero (Subpub in its latest incarnation) is widely used it is not well documented in terms of comments and the like. The amount of effort required to document Prospero would approach the effort required for a rewrite.

Thursday, March 23, 2017

Why not Prospero? Miscellaneous

This is part of a multi-part discussion of why I started the Miranda project. This post is a grab-bag of issues that I had with the original system.

Prospero Only Records HTTP POSTs

The original system only captures HTTP POST events. It didn't forward PUT and DELETE.

Prospero Only Uses HTTP for New Events

You could not send new events via HTTPS. Furthermore, all communications between nodes was unencrypted, making it unsuitable for a non-secure environment.

Prospero Uses Symmetric Key Encryption

This meant that administrators knew all the keys; and if an attacker gains access to one table, they get all the client messages.

Miranda Addresses All these Issues

Miranda forwards HTTP PUT and DELETE, as well as POST.

Miranda does everything using SSL/TLS dealing with the 2nd issue.

Miranda uses public key encryption instead of secret key encryption. Thus administrators don't actually have client keys and an attacker gains no advantage by getting access to the keys that clients use.

Wednesday, March 22, 2017

Why not Prspero? Availability Zones

Prospero cannot work across availability zones: so you cannot have a West coast data center and an East coast data center work together. This is, admittedly a rare problem: only a few companies even have a data center these days, but with the advent of services like AWS people want this capability for their systems.

Prospero relies on a distributed database called Mnesia, and Mnesia does not deal well with long delays, such as those you might run across in coast-to-coast operations. In addition Prospero uses RabbitMQ, which also wants all its nodes to be in the same data center. For these reasons, Propero is limited to one data center.

This was a problem. For a mission critical application to go down if you lose one data center is bad. If a hurricane takes down a data center for a week this could be a very bad thing indeed.

We never did come up with a solution, but Miranda is designed for distributed use.

Tuesday, March 21, 2017

Why not Prospero? Erlang

This is the first part of a multi-part discussion of why I started the Miranda project. This post goes into the limitations of the language that Prospero was written in, Erlang.

Erlang was created at Ericsson (a telecom company based in Stockholm, Sweden) in the 80s and released to the world in the 90s. It is used with "soft real-time" (where you can occasionally miss a deadline).

Two major drawbacks to Erlang are that it is hard to find people with Erlang experience and it can take several months for someone used to a language like Java to become proficient in Erlang.

It is Hard to Find People with Erlang Experience

Unlike C++, Java or C#, it is much harder to find people who have experience with Erlang. At Pearson in Denver, for example, we had to contract with a firm in Europe to get support for Ejabberd, a chat program written in Erlang.

When we tried to find new members for the team to support Prospero, we had to dispense with Erlang experience as a requirement because nobody had it.

Erlang is basically a niche language in this country, with few adherents.

It is Hard to Train People in Erlang

As a rule of thumb, it would take several months before a developer was "up to speed" with Erleng.

In learning Erlang, one had to learn a different style of development called Functional Programming. An important difference with Functional Programming is that there are no variables - so a statement like "i++" should not be supported in a functional language. This is very different from traditional (imperative) languages, and takes awhile to get used to.

Erlang syntax is also very different from the various "C-like" languages. For example, if expressions in Erlang cannot have function calls and are seldom used.

The difference in programming styles and syntax combine to make Erlang a difficult language to pick up.

Erlang also has Good Points

Erlang also has its good points like light weight processes. I once created a program that used several million threads (called processes in Erlang), but I couldn't do the same thing in Java. At several thousand threads the VM wanted more memory than the system had.

Monday, March 20, 2017

Why not Prospero?

Prospero, the predecessor to Miranda, was a system written by Chris Chew (with a little help from me) in Erlang to capture HTTP POSTs and play them back to interested parties.

The question is this: why not just use Prospero? Why go to the trouble of writing a new system? I will answer this question in a series of posts. This is partially because the answer is that complex, but also so that I have something to talk about for the next few days.

In broad strokes, here is the answer:

Prospero is written in Erlang

It is hard to find people who have experience with Erlang
It is hard to train people to use Erlang

Prospero cannot cross availability zones

Mnesia limitation

Prospero depends on many other systems

Erlang
RabbitMQ
Mnesia

Why not modify Prospero?
Misc reasons

Prospero only deals in POSTs
Prospero only runs over HTTP, not HTTPS

Friday, February 24, 2017

Miranda and Encryption

Why does Miranda want to use a certificate authority? Why does Miranda use encryption at all?

Briefly, Miranda uses local certificate authorities to make it cheaper to use and easier to evaluate. Miranda uses encryption because events (messages) may contain things like personally identifiable information or other sensitive information.

The long answers are, well, longer.

First of all, if you don't have a requirement to encrypt things, then you can turn encryption off. Miranda was designed to be used on things like Amazon cloud, however, with traffic potentially going across the internet, so your events (messages) could be sent in the clear. If you are comfortable with that arrangement, then you can simply turn off encryption.

Miranda uses local certificate authorities because all nodes in the cluster are required to have certificates. It can quickly get expensive creating CERTs for every node in your cluster, not to mention inconvenient, with that approach. Instead, you create your own certificate authority and use the local CA to sign all your node keys.

Miranda uses encryption because I found myself in situations where I wished that its predecessor, Prospero, did. In particular, one of the obstacles to using Prospero in AWS was its lack of support for encryption. Another problem was crossing availability zones. If we had a node on the West coast, and another on the East coast, then they would probably talk across the internet.

As far as what to use, I thought SSL/TLS with their wide use, would be well supported, secure, cheap, and easy to use. While they are indeed well supported, secure and inexpensive I have not found SSL/TLS to be at all easy to use. I have run across a problem that has forced me to do all my development work "in the clear." I refer to the dreaded "Invalid signature" problem that I posted on Stack Overflow about.

At any rate, that is why Miranda uses local certificate authorities and encryption in general.

All hail netty!

Sunday, January 1, 2017

Prospero

The name Prospero comes form the play The Tempest by William Shakespeare. The original Prospero system was written by Chris Chew in Erlang with a little help from me (I wrote a module that parsed XML).

Miranda is a Java port of Prospero with a few changes. First of all it is written in Java instead of Erlang. It was very difficult to find people who had done functional programming (other than JavaScript) let alone Erlang so I view this as an improvement.

Secondly, Miranda doesn't need RabbitMQ. This meant that you needed a whole other cluster devoted to RabbitMQ which had to have fast access to your nodes to work.

Finally, Prospero used Mnesia. In addition to backups (Mnesia had the unfortunate habit of corrupting its files), Mnesia didn't cross AWS availability zones too well. Miranda is designed to run with nodes in different availability zones.

LTSLLC Blog