LTSLLC Blog: March 2017

Friday, March 31, 2017

Introducing Miranda: How Miranda Works 2

In preparation for a talk I'm giving at DOSUG I'm going to post my thoughts as they develop.

Jumping back...

Miranda is composed of subsystems, each of which runs in its own thread. The threads communicate by means of Messages and BlockingQueues.

A thread simply takes the next Message off the queue and processes it.
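A minimal sketch of that loop, using only java.util.concurrent, might look like the following. The class name, the "stop" message and the String message type are assumptions made for illustration; they are not Miranda's actual classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: these names are hypothetical, not Miranda's real classes.
class QueueLoopSketch {
    // A "poison pill" message that tells the subsystem thread to finish.
    static final String STOP = "stop";

    // Starts a subsystem thread that drains the queue, then waits for it to end.
    static List<String> runSubsystem(BlockingQueue<String> queue) {
        List<String> processed = new ArrayList<>();
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String message = queue.take(); // blocks until a message arrives
                    if (STOP.equals(message)) {
                        return;
                    }
                    processed.add("processed " + message);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and exit
            }
        });
        worker.start();
        try {
            worker.join(); // after join, the worker's writes are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("newEvent");
        queue.add("fileChanged");
        queue.add(STOP);
        System.out.println(runSubsystem(queue));
    }
}
```

Other threads never call the subsystem directly in this arrangement; they only add messages to its queue.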

The various subsystems of Miranda each have a State object. The states "know" how to process a Message (a method called processMessage), and common behavior, like shutting down or how to deal with a file change, is gathered into the superclasses of states.

Each state has a class it is partnered with that does the "real" work.

Introducing Miranda: Closing

In preparation for a talk I'm giving at DOSUG I'm going to post my thoughts as they develop.

Research shows that people tend to remember the opening and closing of a presentation the most, so I am going to go into my closing early on.

Hopefully this talk has given you an idea of what Miranda can do for you and the confidence that it can do it. Now the decision you have to make is whether to pursue it further.

You can download Miranda from GitHub at https://ClarkHobbie/miranda and perform a trial run. The tools that you will need are there as well. Setting up Miranda is relatively simple.

I would bid you to throw off the shackles of the Pager! While Miranda may not eliminate the need for a pager altogether, it will give you the confidence to know that, if you take your time in responding to a page, it's not the end of the world.

Thursday, March 30, 2017

Introducing Miranda: How Miranda Works

In preparation for a talk I'm giving at DOSUG I'm going to post my thoughts as they develop.

So Miranda looks like something you might be interested in, but how does it work?

Miranda sits in front of your web service and accepts POST/PUT/DELETEs on its behalf. It then sends those events to your web service when it is up.

When a client sends a message to Miranda, that is called an Event. The endpoint that they send the POST/PUT/DELETE to is called a Topic. An Event is delivered to a client as part of a Subscription. When Miranda gets a 200 response as a result of sending an Event to a client, it records it as a Delivery. All these things are set up with Users.

Before doing anything, a User must log in to the system and establish a Session.

Miranda operates as a cluster of nodes. To do anything, a quorum of 2 nodes needs to be established. Each Topic can have different rules about when to recognize an Event. The default policy is that after Miranda receives a POST/PUT/DELETE and has forwarded the Event to a quorum of other nodes, it responds to the client, telling them that it received the Event.
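The default policy can be sketched as a small acknowledgement counter. This is only an illustration: the class name, the node names, and the assumption that the count covers acknowledgements from other nodes are mine, not taken from Miranda's code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: tracks which nodes have acknowledged one Event.
class QuorumSketch {
    private final int quorum; // acknowledgements needed before replying to the client
    private final Set<String> acknowledgedBy = new HashSet<>();

    QuorumSketch(int quorum) {
        this.quorum = quorum;
    }

    // Records an acknowledgement; returns true once the quorum has been reached.
    boolean acknowledge(String nodeName) {
        acknowledgedBy.add(nodeName); // a Set ignores repeated acks from one node
        return hasQuorum();
    }

    boolean hasQuorum() {
        return acknowledgedBy.size() >= quorum;
    }

    public static void main(String[] args) {
        QuorumSketch event = new QuorumSketch(2);
        System.out.println(event.acknowledge("node-b")); // one ack, still waiting
        System.out.println(event.acknowledge("node-c")); // second ack, safe to reply
    }
}
```

Using a Set rather than a plain counter means a node that acknowledges twice is only counted once.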

At the same time that a node is telling the other nodes about an Event, it writes the Event to the persistent store.

Events are kept for a configurable period of time, but the default is a week. Deliveries are also kept for a configurable period of time, and this also defaults to a week.
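As a rough sketch, a retention check like this could be written with java.time. The class and method names are hypothetical; only the one-week default comes from the description above.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: decides whether a stored Event or Delivery is past its retention period.
class RetentionSketch {
    // The default described above: keep things for a week.
    static final Duration DEFAULT_RETENTION = Duration.ofDays(7);

    // True when "created + retention" is strictly earlier than "now".
    static boolean isExpired(Instant created, Instant now, Duration retention) {
        return created.plus(retention).isBefore(now);
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2017-03-01T00:00:00Z");
        Instant now = Instant.parse("2017-03-09T00:00:00Z");
        System.out.println(isExpired(created, now, DEFAULT_RETENTION));
    }
}
```

Passing "now" in as a parameter, instead of calling Instant.now() inside the method, keeps the check easy to test.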

Wednesday, March 29, 2017

Introducing Miranda: 9s of Reliability

In preparation for a talk I'm giving at DOSUG I'm going to post my thoughts as they develop.

What do the various "9s" of reliability mean? Here is a table with the amount of yearly downtime that each level allows:

# of Nines	Percentage	Downtime per Year
1	90%	A month
2	99%	3 days
3	99.9%	8.8 hours
4	99.99%	52.6 Minutes
5	99.999%	5.3 Minutes
6	99.9999%	32 Seconds
7	99.99999%	3.2 Seconds
8	99.999999%	320 Milliseconds
9	99.9999999%	32 Milliseconds
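The figures in the table follow from simple arithmetic: n nines allows a fraction of 10^-n of the year to be downtime. For example, 5 nines allows 31,536,000 seconds x 0.00001, which is about 315 seconds or 5.3 minutes. The sketch below does the same calculation, assuming a 365-day year; the class and method names are made up for illustration.

```java
// Illustrative only: recomputes the yearly downtime allowed by each number of nines.
class NinesSketch {
    // Seconds in a 365-day year: 31,536,000.
    static final double SECONDS_PER_YEAR = 365 * 24 * 60 * 60;

    // Allowed downtime per year, in seconds, for the given number of nines.
    static double downtimeSeconds(int nines) {
        return SECONDS_PER_YEAR / Math.pow(10, nines);
    }

    public static void main(String[] args) {
        for (int nines = 1; nines <= 9; nines++) {
            System.out.printf("%d nines: %.3f seconds per year%n", nines, downtimeSeconds(nines));
        }
    }
}
```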

Any team with a working system can manage 1 9. If they even have a pager, the person carrying it does not pay a lot of attention to it.

If the company is serious about it, 2 9s is no longer a joke. They definitely have a pager that people trade off on a weekly basis. When the pager goes off, the person tries to respond in 20 minutes or less. The person on call has the phone numbers of the rest of the team in case they need something.

At 3 9s, there is definitely someone with a pager and there have been very serious conversations about getting a control center, a la NASA, for the system. The system may or may not be distributed.

At 4 9s, there is a control room manned by very humor-limited folks who have the numbers of each of the team members, and when something goes wrong, they call them. The person on call switches off weekly, reminiscent of wearing a pager, and if there isn't a hot standby then there have been very serious conversations as to why not.

At 5 9s, there is a control room, manned by people who were too serious for the previous level. The control person's duty is to switch over the system to a hot standby and to call the on-call person when a problem develops. Being on-call is no joke, and when the pager goes off, the on-call person must respond within 10 minutes.

At 6 9s, things are insane. The people who were too serious for 5 9s are manning the control room. There is a person whose sole duty is to decide whether to switch to the hot standby. There are two levels of people on call, and each of them must respond within 5 minutes.

Levels 7, 8 and 9 require varying levels of hardware to support them, and there is still a control room and lists of on-call people. But the question becomes, if you are serious, what is this for? At 7 or 8 9s it becomes hard to tell if the system has been down. After all, people's internet connections are not 7 or 8 9s reliable.

Miranda Takes Systems with 1 or 2 9s and Makes Them Appear to have Between 5 and 6 Nines

Miranda is a distributed, fault-tolerant system designed to run behind a load balancer that accepts messages on behalf of the underlying system and delivers them when the system is up.

The underlying system can be down for hours or days while Miranda accepts messages for it. Therefore, the system appears to have between 5 and 6 9s of reliability, when it has fewer than that.

Tuesday, March 28, 2017

Introducing Miranda: the Next Step

In preparation for a talk I'm giving at DOSUG I'm going to post my thinking as it develops.

The next step for Prospero is Miranda.

Miranda builds on many of the concepts that Prospero had, such as capturing HTTP POSTs and forwarding them to subscribers. Miranda does what Prospero does and takes the next step. In addition to capturing POSTs, it also captures PUT and DELETE.

Miranda also addresses Prospero's limitations. Prospero, for example, does everything in the clear. Miranda does everything in SSL/TLS. There are many areas in which Miranda improves on Prospero.

Miranda works by sitting in front of, and recording for, web services. When a client sends a POST/PUT/DELETE to an endpoint, Miranda records it. Since Miranda is a distributed, fault-tolerant system, it makes the underlying web service appear to be more reliable than it actually is.

Ideally, the service appears to be as reliable as Miranda is, and hopefully, that is very reliable.

Monday, March 27, 2017

Introducing Miranda: Prospero

OK, if a system with "9 9s of reliability" is unattainable with a sane budget, what is attainable?

The answer was Prospero.

Prospero began life as a "skunkworks" project with Erlang. While its initial goal may have been something else, it was used to make systems which only had 1 or 2 9s of reliability seem like they were up most of the time.

Clients would define a topic and then send HTTP POSTs to Prospero. Other clients would "subscribe" to these topics and register a URL that Prospero would associate with them. When a client POSTed, Prospero would, in turn, POST to the URL that had been registered.

Prospero did its job pretty well, and Pearson later released it as an open source project with the name of "Subpub."

Chris Chew initially did most of the development. I did a little bit in the way of XML processing.

Over the course of about 3 years of support, the following limitations were discovered:

It was difficult to find people with Erlang experience
Prospero was difficult to modify
Prospero was limited to one data center
Mnesia (the database that Prospero used) could become corrupted
Prospero had difficulty with binary messages

There were other problems like one down subscriber could cripple the system, or that Prospero would stubbornly refuse to send any additional messages until all of the current messages were dealt with, but these were the issues that really stuck with us.

Sunday, March 26, 2017

Update Properties

It is time, once again, to update the properties page. Since I haven't updated in a while, there is lots to do.

Saturday, March 25, 2017

Introducing Miranda: Motivation

In preparation for a talk that I'll give in June at DOSUG I'm going to post my thinking as it develops. The first section has to do with the motivation for Miranda. The next section deals with how Miranda works and the last section sums up.

So without further ado here's what I came up with:

The first section deals with the motivation for and the origins of the predecessor to Miranda: Prospero. It talks about how our boss wanted "9 9s of reliability" and how this works out to 30 milliseconds of downtime per year! For the curious, I did some tests...

A ping of google.com yielded an average time of 100 ms.
microsoft.com and ibm.com timed out.

A system with "9 9s of reliability" is unattainable for the average company because it would cost too much. You would need highly available hardware, several sites, a distributed, formally verified system, and hundreds if not thousands of developers.

Consider that the system would need to run on special hardware that has at least two "nodes" each of which has its own CPU, memory, disk and network connection. Each node sends a "heartbeat" to the other nodes to ensure they are up. Remember that a failing node needs to switch over in less than 30 milliseconds when a fault occurs or the show's over!

Just the system that monitors the actual system would be challenging to design!

Such a system could, however, be built.

It would just probably cost billions of dollars, take years to develop, and require hundreds if not thousands of developers.

Friday, March 24, 2017

Why not Prospero? Why not Modify Prospero?

Up until now, I have been talking about the limitations of the Prospero system. But wouldn't it be easier to just modify Prospero instead of starting from scratch? That's what this post is about.

I had to create a new system because:

A total rewrite was required to get away from Erlang
Prospero is tightly tied to Mnesia
Prospero is not well documented

A Total Rewrite was Required to get away from Erlang

A limitation of Prospero was that finding people with Erlang experience was difficult or impossible. Using a language like Java means a total rewrite anyways.

Prospero is tightly tied to Mnesia

The database that Prospero uses, Mnesia, is very tightly coupled to Prospero. One of Mnesia's limitations is that it doesn't cross availability zones well. Getting Prospero to work with multiple availability zones would require replacing Mnesia - a major rewrite. Given that situation, a total rewrite would not require too much additional effort.

Prospero is not well Documented

While Prospero (Subpub in its latest incarnation) is widely used, it is not well documented in terms of comments and the like. The amount of effort required to document Prospero would approach the effort required for a rewrite.

Thursday, March 23, 2017

Why not Prospero? Miscellaneous

This is part of a multi-part discussion of why I started the Miranda project. This post is a grab-bag of issues that I had with the original system.

Prospero Only Records HTTP POSTs

The original system only captured HTTP POST events. It didn't forward PUT and DELETE.

Prospero Only Uses HTTP for New Events

You could not send new events via HTTPS. Furthermore, all communications between nodes were unencrypted, making it unsuitable for a non-secure environment.

Prospero Uses Symmetric Key Encryption

This meant that administrators knew all the keys; and if an attacker gains access to one table, they get all the client messages.

Miranda Addresses All these Issues

Miranda forwards HTTP PUT and DELETE, as well as POST.

Miranda does everything using SSL/TLS, dealing with the 2nd issue.

Miranda uses public key encryption instead of secret key encryption. Thus administrators don't actually have client keys and an attacker gains no advantage by getting access to the keys that clients use.
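To illustrate the difference, here is a minimal sketch using the standard java.security and javax.crypto APIs. It is not Miranda's code: the class name, the RSA key size and the choice of OAEP padding are assumptions. The point is that a server holding only the client's public key can encrypt for that client, but only the client's private key can decrypt.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import javax.crypto.Cipher;

// Illustrative only: the server stores PublicKey objects; clients keep their PrivateKey.
class PublicKeySketch {
    // Assumed padding choice for this sketch.
    static final String TRANSFORMATION = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    // Generated on the client side; only the public half is handed to the server.
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Anyone holding the public key can encrypt a short message.
    static byte[] encrypt(PublicKey publicKey, String text) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, publicKey);
            return cipher.doFinal(text.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Only the holder of the matching private key can decrypt it.
    static String decrypt(PrivateKey privateKey, byte[] cipherText) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, privateKey);
            return new String(cipher.doFinal(cipherText), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair client = newKeyPair();
        byte[] secret = encrypt(client.getPublic(), "hello");
        System.out.println(decrypt(client.getPrivate(), secret));
    }
}
```

RSA on its own only handles short messages, so a real system would typically use it to protect a session key rather than whole payloads.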

Wednesday, March 22, 2017

Why not Prospero? Availability Zones

Prospero cannot work across availability zones, so you cannot have a West coast data center and an East coast data center work together. This is, admittedly, a rare problem: only a few companies even have a data center these days, but with the advent of services like AWS people want this capability for their systems.

Prospero relies on a distributed database called Mnesia, and Mnesia does not deal well with long delays, such as those you might run across in coast-to-coast operations. In addition, Prospero uses RabbitMQ, which also wants all its nodes to be in the same data center. For these reasons, Prospero is limited to one data center.

This was a problem. For a mission critical application to go down if you lose one data center is bad. If a hurricane takes down a data center for a week this could be a very bad thing indeed.

We never did come up with a solution, but Miranda is designed for distributed use.

Tuesday, March 21, 2017

Why not Prospero? Erlang

This is the first part of a multi-part discussion of why I started the Miranda project. This post goes into the limitations of the language that Prospero was written in, Erlang.

Erlang was created at Ericsson (a telecom company based in Stockholm, Sweden) in the 80s and released to the world in the 90s. It is used for "soft real-time" systems (where you can occasionally miss a deadline).

Two major drawbacks to Erlang are that it is hard to find people with Erlang experience and it can take several months for someone used to a language like Java to become proficient in Erlang.

It is Hard to Find People with Erlang Experience

Compared with C++, Java or C#, it is much harder to find people who have experience with Erlang. At Pearson in Denver, for example, we had to contract with a firm in Europe to get support for Ejabberd, a chat program written in Erlang.

When we tried to find new members for the team to support Prospero, we had to dispense with Erlang experience as a requirement because nobody had it.

Erlang is basically a niche language in this country, with few adherents.

It is Hard to Train People in Erlang

As a rule of thumb, it would take several months before a developer was "up to speed" with Erlang.

In learning Erlang, one had to learn a different style of development called Functional Programming. An important difference with Functional Programming is that variables cannot be changed once they are set, so a statement like "i++" is not supported in a functional language. This is very different from traditional (imperative) languages, and takes a while to get used to.
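The difference in style can be shown even in Java. The sketch below (hypothetical names, written for illustration) sums the numbers 1 to n twice: once with a counter that is updated in place, and once in a functional style where nothing is ever reassigned and each recursive call simply works with new values.

```java
// Illustrative only: the same calculation in imperative and functional style.
class FunctionalStyleSketch {
    // Imperative: "total" and "i" are changed over and over.
    static int sumImperative(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Functional style: no value is modified after it is set.
    static int sumRecursive(final int n) {
        return n == 0 ? 0 : n + sumRecursive(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(sumImperative(10));
        System.out.println(sumRecursive(10));
    }
}
```

Java does not optimize deep recursion, so the recursive version is only suitable here for small values of n.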

Erlang syntax is also very different from the various "C-like" languages. For example, the conditions of if expressions in Erlang are guards, which cannot call user-defined functions, so if expressions are seldom used.

The difference in programming styles and syntax combine to make Erlang a difficult language to pick up.

Erlang also has Good Points

Erlang also has its good points, like lightweight processes. I once created a program that used several million threads (called processes in Erlang), but I couldn't do the same thing in Java. At several thousand threads the VM wanted more memory than the system had.

Monday, March 20, 2017

Why not Prospero?

Prospero, the predecessor to Miranda, was a system written by Chris Chew (with a little help from me) in Erlang to capture HTTP POSTs and play them back to interested parties.

The question is this: why not just use Prospero? Why go to the trouble of writing a new system? I will answer this question in a series of posts. This is partially because the answer is that complex, but also so that I have something to talk about for the next few days.

In broad strokes, here is the answer:

Prospero is written in Erlang

It is hard to find people who have experience with Erlang
It is hard to train people to use Erlang

Prospero cannot cross availability zones

Mnesia limitation

Prospero depends on many other systems

Erlang
RabbitMQ
Mnesia

Why not modify Prospero?
Misc reasons

Prospero only deals in POSTs
Prospero only runs over HTTP, not HTTPS

Saturday, March 18, 2017

Hierarchical States

Miranda uses hierarchical states.

Hierarchical means that behavior can be shared across several classes. This is very useful if you are sharing files, as Miranda does, and you don't want to write the same code over and over again.

I first heard about hierarchical states with Harel Statecharts, but I later used ROOMcharts (part of the late Real-Time Object-Oriented Modeling methodology that later got absorbed into UML) because I liked them better.

The nice thing about hierarchical states is that you can define a behavior in a base state and all states that extend it get that behavior as well. In the case of Miranda, the State class, that all state classes extend, responds to the stop message. That way, all classes "know" how to stop.

Some classes do a bit more. The TopicFile class checks to see if it needs to be written, while the Node class, which represents different nodes in the cluster, needs to disconnect before shutting down. The state classes for these objects know to watch for a shut down message, and behave differently in those cases.
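A stripped-down sketch of the idea follows. The class names, the String messages and the log list are assumptions made for illustration; Miranda's real State classes are richer than this.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a base state that handles "stop", and two states that add to it.
class HierarchicalStateSketch {
    static class State {
        final List<String> log = new ArrayList<>();

        // Every state inherits this reaction to the stop message.
        void processMessage(String message) {
            if ("stop".equals(message)) {
                onStop();
            }
        }

        void onStop() {
            log.add("stopped");
        }
    }

    // Hypothetical state for a node: disconnect first, then do the shared shutdown.
    static class NodeState extends State {
        @Override
        void onStop() {
            log.add("disconnected");
            super.onStop();
        }
    }

    // Hypothetical state for a file: write it if needed, then do the shared shutdown.
    static class FileState extends State {
        boolean dirty = true;

        @Override
        void onStop() {
            if (dirty) {
                log.add("file written");
                dirty = false;
            }
            super.onStop();
        }
    }

    public static void main(String[] args) {
        NodeState node = new NodeState();
        node.processMessage("stop");
        System.out.println(node.log);
    }
}
```

The subclasses only describe what is different about them; the shared shutdown behavior lives in one place.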

Friday, March 17, 2017

Jetty

I am using Jetty for HTML and servlets for Miranda. So far it has been pretty good.

Uh oh, I probably just jinxed it.

I always envisioned a web interface for Miranda - to do things like check status, add or modify users and for administering topics.

I'm using Google's AngularJS with this - which is probably overkill. But I took an online class on Angular so I'm going to use it. So far things have worked out well. The main problem is that the web site looks like something I would create: clunky and lacking in grace.

Thursday, March 16, 2017

Mockito for the Win

I have not been doing a lot of work on Miranda this week. I wanted to say that Mockito has worked out well, however. It has simplified tests, particularly for states, by providing mock objects for Consumers.

Tuesday, March 14, 2017

It's Alive2: SSL

Miranda seems to work with SSL (actually TLS).

I didn't actually have to change anything, I just regenerated the serverkeystore and truststore files; and it seemed to work!

This is an important milestone for the Miranda project. I have spent several weeks trying to get SSL/TLS working. The key was being able to swap out Netty with Mina. For some reason, Netty was giving me the "Invalid signature" error and I couldn't get it to work. Fortunately, Mina was a bit more forgiving.

Sunday, March 12, 2017

It's Alive!

Today I got two nodes to talk to each other. Granted it was through an unencrypted channel but still...I felt like some sort of mad scientist: "Do you hear me Egor? It's aaaaalliiiivveeeee!"

Thursday, March 9, 2017

Well at least Mina Works

I recently opened a defect on Netty. While they responded very quickly, the person that I worked with insisted that it was not a bug and that something with the certificates was being messed up.

I don't know how to get things working with Netty, but things work with core Java and Mina, so it looks like Miranda will be going with Mina.

Wednesday, March 8, 2017

A Bug for Netty

I have put this off for as long as I can; it's time to post a bug to the Netty project.

As many readers of this blog will know, I have had a rocky relationship with Netty. The turning point came when I saw what a sorry state SSL/TLS was in when it comes to Java. I then posted an apology to Netty. Nevertheless Netty may have some problems.

In particular, my attempts to use Netty with a "local certificate authority" have met with failure. I posted my problem to Stack Overflow in hopes of finding a solution there, but after a week no one has voted for the question or (aside from Jim Garrison) commented on the question.

Since that time I discovered another networking framework called Mina. I have created a new test up on GitHub that uses Mina and seems to work.

I have been working with someone from Netty regarding the issue, so far without result.

Tuesday, March 7, 2017

Mina Seems to Work

I tried using Apache Mina in a test program (up on GitHub at https://github.com/ClarkHobbie/ssltest3) and it worked. I have posted a bug to the Netty project and will see what they do with it. In the meantime I will use Apache Mina.

This is not an indictment of Netty; I will be able to switch over if the folks there figure out what is going wrong.

UPDATE

The folks at Netty (specifically, normanmaurer) got back to me almost immediately and asked me to try 4.1.9. I did and it still had problems, but I am very impressed with how fast they got back to me.

Monday, March 6, 2017

Basic Use Cases

These are the basic use cases that Miranda needs to implement in order to be useful.

Log in as admin
Create a user
Create a topic
Create a subscription
Create a message
Create a delivery

Log in as Admin

The user connects to the system with a browser
The system asks the user to log in
The user supplies their credentials
The system responds with the status page and gives the user a cookie

Create a User

The user goes to the users section of the app
The system presents the user with a menu of operations
The user indicates create a user
The system presents the user with the new user form
The user completes the form and presents it to the system
The system creates the new user
The system tells the other node about the new user
The system receives acknowledgement from the other node about the new user
The system reports success to the user

Create a Topic through the UI

The user goes to the topics section of the app
The system presents the user with the topics status screen
The user indicates they want to create a new topic
The system presents the user with the new topic form
The user completes the form and presents it to the system
The system creates the new topic
The system tells the other nodes about the new topic
The system receives acknowledgement about the new topic from the other nodes
The system reports success to the user

Create a Subscription by Sending a POST

The client posts a new subscription
The system checks the client's session
The client's session is valid
The system checks for duplicated subscriptions
The new subscription does not duplicate another subscription
The system creates the new subscription
The system tells the other nodes about the new subscription
The system receives acknowledgement about the new subscription from the other nodes
The system reports success to the client
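The steps above can be sketched roughly as follows. Everything here is hypothetical: the class, the method names and the String results are made up for illustration, and the two rejection branches are an assumption, since the use case only lists the successful path. Telling the other nodes is reduced to recording a notification.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: session check, duplicate check, create, notify, report.
class SubscriptionSketch {
    private final Set<String> validSessions = new HashSet<>();
    private final Set<String> subscriptions = new HashSet<>();
    private final List<String> sentToOtherNodes = new ArrayList<>();

    void addSession(String sessionId) {
        validSessions.add(sessionId);
    }

    String createSubscription(String sessionId, String subscriptionName) {
        if (!validSessions.contains(sessionId)) {
            return "rejected: invalid session"; // assumed failure path
        }
        if (subscriptions.contains(subscriptionName)) {
            return "rejected: duplicate subscription"; // assumed failure path
        }
        subscriptions.add(subscriptionName);
        // Stand-in for telling the other nodes and waiting for their acknowledgements.
        sentToOtherNodes.add("new subscription: " + subscriptionName);
        return "success";
    }

    List<String> sentToOtherNodes() {
        return sentToOtherNodes;
    }

    public static void main(String[] args) {
        SubscriptionSketch system = new SubscriptionSketch();
        system.addSession("session-1");
        System.out.println(system.createSubscription("session-1", "orders"));
        System.out.println(system.createSubscription("session-1", "orders"));
    }
}
```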

Create a Subscription through the UI

The user goes to the subscriptions section of the app
The system presents the user with the subscriptions status screen
The user indicates they want to create a new subscription
The system presents the user with the new subscription form
The user completes the form and presents it to the system
The system checks for duplicated subscriptions
The new subscription does not duplicate another subscription
The system creates the new subscription
The system tells the other nodes about the new subscription
The system receives acknowledgement about the new subscription from the other nodes
The system reports success to the user

Create a new Event

The client does an HTTP POST to a topic
The system checks the client's session
The client's session is valid.
The system records the data of the client's POST to a new event
The system tells the other nodes about the new event.
The system writes the new Event to the persistent store
The system reports success to the client

Deliver an Event

A new Event comes in
The system checks to see what subscriptions are interested
At least one subscription is interested
The system hands the Event off to the Subscription for delivery.
The Subscription notes the Event in the persistent store
The Subscription delivers the Event to the subscriber.
The system creates a new Delivery object
The system adds the Delivery to the subscription
The Subscription starts writing the Delivery to the persistent store
The system tells the other nodes about the Delivery
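A simplified sketch of this flow is shown below. The names are hypothetical, the persistent store and the other nodes are left out, and the Subscriber interface stands in for the HTTP call to the subscriber. Following the rule described elsewhere on this blog, a Delivery is only recorded when the subscriber answers with a 200.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: match an Event to interested Subscriptions and record Deliveries.
class DeliverySketch {
    // Stand-in for POSTing to the subscriber's URL; returns the HTTP status code.
    interface Subscriber {
        int post(String eventBody);
    }

    static class Subscription {
        final String topic;
        final Subscriber subscriber;
        final List<String> deliveries = new ArrayList<>(); // ids of delivered Events

        Subscription(String topic, Subscriber subscriber) {
            this.topic = topic;
            this.subscriber = subscriber;
        }

        boolean isInterestedIn(String eventTopic) {
            return topic.equals(eventTopic);
        }

        // Sends the Event and records a Delivery only on a 200 response.
        boolean deliver(String eventId, String eventBody) {
            if (subscriber.post(eventBody) == 200) {
                deliveries.add(eventId);
                return true;
            }
            return false;
        }
    }

    // Hands the Event to every interested Subscription; returns how many delivered it.
    static int dispatch(List<Subscription> subscriptions, String topic, String eventId, String body) {
        int delivered = 0;
        for (Subscription subscription : subscriptions) {
            if (subscription.isInterestedIn(topic) && subscription.deliver(eventId, body)) {
                delivered++;
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        List<Subscription> subscriptions = new ArrayList<>();
        subscriptions.add(new Subscription("orders", body -> 200));
        subscriptions.add(new Subscription("orders", body -> 503));
        System.out.println(dispatch(subscriptions, "orders", "event-1", "{}"));
    }
}
```

An Event that fails to deliver stays without a Delivery record in this sketch, which is what would let a later retry find it.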

Sunday, March 5, 2017

To Post or not to Post

I have been taking an online course on AngularJS in the hopes that I could use it on the admin side of Miranda. During my setup of the web site (I'm using Jetty btw) I discovered that Chrome doesn't like my certificate authority either.

I was planning on posting a bug to Netty regarding my difficulties in using a local CA, but I think I will wait until Chrome likes my web site.

Saturday, March 4, 2017

State Classes: Where the Brains are in the Miranda System

I decided to use external states in the Miranda system. Because Miranda is a multi-threaded system this is what I call a Big Deal.

What this means, in practical terms, is that each thread reacts differently depending on what state it's in. For example, when a node is waiting for a "join" message, it reacts differently to a "join" message than when it is trying to connect.

What this boils down to is that most of the behavior logic goes in the state classes and what I normally think of as the class itself does very little.
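Here is a small sketch of what that can look like. The state names and messages are invented for illustration and are not Miranda's actual states. Note how the Node class does almost nothing: it hands each message to its current state and keeps whatever state comes back.

```java
// Illustrative only: the same "join" message is handled differently in each state.
class ExternalStateSketch {
    interface NodeState {
        NodeState processMessage(String message); // returns the next state
        String name();
    }

    static class ConnectingState implements NodeState {
        public NodeState processMessage(String message) {
            if ("connected".equals(message)) {
                return new WaitingForJoinState();
            }
            return this; // a "join" is not expected yet, so it is ignored
        }

        public String name() {
            return "connecting";
        }
    }

    static class WaitingForJoinState implements NodeState {
        public NodeState processMessage(String message) {
            if ("join".equals(message)) {
                return new ReadyState();
            }
            return this;
        }

        public String name() {
            return "waiting for join";
        }
    }

    static class ReadyState implements NodeState {
        public NodeState processMessage(String message) {
            return this;
        }

        public String name() {
            return "ready";
        }
    }

    // The class itself does very little: it just delegates to its current state.
    static class Node {
        private NodeState state = new ConnectingState();

        void handle(String message) {
            state = state.processMessage(message);
        }

        String stateName() {
            return state.name();
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        node.handle("join");
        System.out.println(node.stateName());
        node.handle("connected");
        node.handle("join");
        System.out.println(node.stateName());
    }
}
```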

Time will tell if this was a good move or A Very Bad Idea.

Friday, March 3, 2017

Exceptions and Panics

Miranda should never stop. Ever.

This means my blithe strategy of printing a stack trace and calling System.exit in response to an exception cannot hold. Instead, most threads must continue and instead throw a Panic when things get really bad.

A Panic is like an exception, except that the system treats a panic as a request to shut down instead of automatically printing a stack trace and exiting.

I have also created a panic method on the Miranda class. If the method returns at all (the default behavior is to call System.exit but this will change) it returns a boolean to indicate if the system as a whole is panicking, or if the caller should try and keep going. The system may attempt to shutdown instead of calling System.exit so correctly handling the response to a call to Miranda.panic is important.
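As a sketch of the calling convention, consider the following. It is not the actual Miranda.panic method: the PanicPolicy class and its rule of giving up after a fixed number of panics are assumptions invented for this example. What it shows is the contract described above, where the caller reports a Panic and then acts on the boolean it gets back.

```java
// Illustrative only: a Panic type and a stand-in for the panic method.
class PanicSketch {
    // Like an exception, but treated as a request to shut down.
    static class Panic extends Exception {
        Panic(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical policy: tolerate a limited number of panics before shutting down.
    static class PanicPolicy {
        private final int maxPanics;
        private int panicCount = 0;

        PanicPolicy(int maxPanics) {
            this.maxPanics = maxPanics;
        }

        // Returns true if the system as a whole is panicking (the caller should stop),
        // false if the caller should try to keep going.
        synchronized boolean panic(Panic panic) {
            panicCount++;
            return panicCount >= maxPanics;
        }
    }

    public static void main(String[] args) {
        PanicPolicy policy = new PanicPolicy(2);
        Panic panic = new Panic("could not write file", new java.io.IOException("disk full"));
        if (policy.panic(panic)) {
            System.out.println("shutting down");
        } else {
            System.out.println("carrying on");
        }
    }
}
```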

Thursday, March 2, 2017

Getting a Handle on Things

Given that

netty TLS is giving me grief*
Socket-based SSL/TLS seems to work

I have decided the following:

All network communication (send/receive) will go through the network object
All network communications will use a handle: nothing will use the network directly
For sockets, all receives will use their own thread

I'm not too crazy about the last point, and in particular I don't know if you can ask a socket to send a message while you already issued a receive. We shall see.

The good news is that this should make testing easier: just call a method and see what messages end up in the various queues.
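A rough sketch of the handle idea is shown below. All of the names are hypothetical, and real sockets and receive threads are left out. Callers only ever see an integer handle, and a test can check what ended up in the queue behind it without touching a network at all.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: all sends go through this object, addressed by handle.
class NetworkSketch {
    private final Map<Integer, BlockingQueue<String>> outbound = new HashMap<>();
    private int nextHandle = 1;

    // Opens a logical connection and returns an opaque handle for it.
    synchronized int connect() {
        int handle = nextHandle++;
        outbound.put(handle, new LinkedBlockingQueue<>());
        return handle;
    }

    // Queues a message for the connection behind the handle; false if the handle is unknown.
    synchronized boolean send(int handle, String message) {
        BlockingQueue<String> queue = outbound.get(handle);
        if (queue == null) {
            return false;
        }
        return queue.offer(message);
    }

    // Test hook: removes and returns the next queued message, or null if there is none.
    synchronized String nextQueued(int handle) {
        BlockingQueue<String> queue = outbound.get(handle);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        NetworkSketch network = new NetworkSketch();
        int handle = network.connect();
        network.send(handle, "join");
        System.out.println(network.nextQueued(handle));
    }
}
```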

* = After looking at the state of non-blocking, secure (i.e. SSL/TLS) I/O in Java, I changed my opinion of netty; but it still may have bugs.

Apologies to netty

Yesterday I saw just how bad SSL/TLS was for nio and java and realised just how much work the folks who put out netty really have to do. Up until now I had been frustrated with netty, but I have gained respect for anyone who will deal with SSL/TLS using non-blocking I/O in java.

So in summary I am sorry for my attitude up until now and will try to do better.

Wednesday, March 1, 2017

Who is Responsible for nio TLS?

And I thought netty was bad...

It doesn't hold a candle to nio TLS...

Consider this link. My god, you would need to be a TLS expert to use it! And this is from Oracle...

Putting the reasons aside for the moment, it seems clear that

nio TLS is non-trivial to use
There are very few libraries available

And this is after 10 years!

I am speechless. Either developers are not using SSL/TLS with java, or I am missing something.