Pensieri di un lunatico minore
With the recent Amazon S3 outage of approximately 8 hours, a lot of people are burning a lot of energy lambasting Amazon for the downtime. While I think we’d all love to have systems that never go down, the probability of that happening in the “real world” is relatively small, unless you’re running some esoteric hardware that most people aren’t. Before we go any further, let’s quickly break down what we mean by availability.
When most people talk about availability, they often use the marketing-speak method of speaking in “nines”. Five nines, or 99.999%, is the “gold standard” that most people talk about but few actually achieve. To clarify, here’s what it means when you convert percentages to actual time spans.
This means that the vaunted “five nines” allows for only a bit more than five minutes of downtime in a year, and only 6 seconds per week, which translates to less than 1 second per day. Quite honestly, you can’t even bounce an HTTP server in that period of time reliably. For example, if you’re running on a single server, and you have to reboot it more than once a year, you’ll likely never hit 99.99%, even if nothing else breaks.
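The arithmetic behind the “nines” is easy to check for yourself. Here is a minimal sketch in Python, assuming a 365-day year; the function and variable names are mine, purely for illustration.

```python
# Allowed downtime for a given availability target.
# Assumes a 365-day year; all names here are illustrative only.
SECONDS_PER_DAY = 24 * 60 * 60

PERIODS = {
    "day": SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
    "year": 365 * SECONDS_PER_DAY,
}


def allowed_downtime(availability_percent, period_seconds):
    """Return the seconds of downtime permitted within the period."""
    return (1 - availability_percent / 100) * period_seconds


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"{allowed_downtime(nines, secs):.1f}s per {name}"
            for name, secs in PERIODS.items()
        )
        print(f"{nines}%: {row}")
```

Each extra nine divides the allowance by ten, which is why the jump from 99.9% to 99.999% is so much harder than it sounds.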
So what does this mean when we talk about systemic availability? It means that putting all your eggs in one basket, regardless of the quality of the basket, is silly. While many people think about drives failing, and implement RAID or some other technique, and some think about CPU and memory, very few think about or plan for electrical system failure or cooling failure. These kinds of problems, which strike entire data centers, are not uncommon, and cannot be waved away by saying that you have redundant infrastructure.
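To put rough numbers on the basket problem, here is a sketch that assumes independent failures and uses invented availability figures (none of them come from a real facility). Redundant servers combine in parallel, but the data center they share sits in series with all of them.

```python
# Illustrative only: the availability figures below are invented,
# and failures are assumed to be independent of one another.

def parallel(availability, copies):
    """Availability of redundant units where any single one suffices."""
    return 1 - (1 - availability) ** copies


def series(*availabilities):
    """Availability of a chain in which every element must be up."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


server = 0.99      # hypothetical single server
facility = 0.999   # hypothetical power and cooling for one data center

one_site = series(facility, parallel(server, 3))
two_sites = parallel(one_site, 2)

print(f"three servers, one site:  {one_site:.6f}")
print(f"same setup in two sites:  {two_sites:.6f}")
```

However many servers you add inside one site, the result never exceeds the facility’s own figure. Only a second site moves the number past it.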
A vast majority of availability problems, however, are not hardware driven—even though that’s all people think about. They come from a few areas:
- Operator error
- Configuration error
- Software failure/bugs
- Networking
- Power and cooling
All of these cost serious money to solve. They are solved through processes and planning and not just traditional technical operations. In examining failure modes of systems I’ve worked on, a vast majority are preventable. They are due to someone making an unplanned change that isn’t properly vetted. They’re based on software configuration errors, and they’re based on upgrades that simply weren’t tested first.
The silver lining here is this: a vast majority of sites, companies, etc., do not need this kind of availability. The pursuit of high availability tends to be a mental masturbation exercise by people who want to spend money, but aren’t willing to do the cost-benefit analysis. Before undertaking anything above 99.9%, you really need to understand your business to a level that will allow you to make a rational decision about risks. Often, it is cheaper to rebate money to people than it is to fix the problem.
So what do I say to those who puff up and say “I can do better”? I say “no you can’t”. At least, not likely. The cost of running exceptionally high availability systems is not just hardware. It is operational costs. It is staffing, monitoring infrastructure, planning and operational processes. It doesn’t happen when you’ve only got one machine. It doesn’t even happen when you have 50 machines.
Don’t delude yourself any more about your own ability to run systems at that level than you delude yourself into assuming someone else can as well.
That’s overstating the case, perhaps, but what I hate is the whole insane concept that UML is a “documentation format”. It’s not. It’s OK to use it to think about a problem, and to sketch out the architecture of the problem, but it is not a substitute for narrative and true documentation.
If someone hands me another UML model (complete with thousands of HTML pages documenting every class), and says that their software is “documented”, I’m going to smack them.
Giles Bowkett whines about technology support in the iPhone:
Apple decided to leave out hardware support for the radio band that T-Mobile will be using for their 3G system. So while you can use it on T-Mobile’s voice and EDGE network, 3G on T-Mobile won’t work, and since it’s a hardware limitation, can’t be hacked to make it work either.
I’m pretty sure these tactics are illegal under United States antitrust law and have been for over a hundred years.
Apple also doesn’t support TDMA, CDMA, AMPS, W-CDMA, TD-CDMA, iDEN, WiDEN, TD-SCDMA, CDMA2000 or even D-AMPS.
Waaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaah.
I’m pretty sure he hasn’t the foggiest notion what antitrust laws are really about, nor the idea of monopoly and restraint of trade, neither of which apply in this context. Maybe one day they will, but even that is highly unlikely.
For the past few weeks, I’ve been heads-down working on getting an environment up and running. It’s not exactly a small task, as you can see from the picture on the right. Close to 250 blade servers and a couple of heavily partitioned AIX boxes are the main components.
As much as I enjoy a challenge, the sheer volume of equipment that has to come up, run and have huge numbers of applications installed is pretty intimidating—even for me. Having said that, I’ve got a great team, and we’re getting close to finishing this part of the project. Now, if I can just get the client to stop making absurd decisions about networks and systems.
GigaOm puts it best:
Under the new law, for example, a board meeting may be conducted “in person or through the use of [an] electronic or telecommunications medium.” A “‘virtual company’ will be, as a legal matter, a Vermont limited liability company,” said Johnson. And other states are required to recognize the corporation as a legitimate LLC. So while in the past many companies registered in Delaware to take advantage of that state’s business-friendly policies, with this law, Internet-driven startups may find Vermont even more ideal.
This is a huge step forward in leveling the playing field for start-ups. I’m a bit surprised it was VT, rather than some other state, but who knows? With the ability to fully virtualize your company from hardware to people and services, why should you have to have some pointless “physical” presence? I’m sure, however, that other states may get annoyed.
For a project I’m working on, we have upwards of 200 Intel-based blades across 16 chassis. One of the joys of this is that all of them need to be imaged to an absolutely identical base image, with a few things then distributed out that are different on each of them. Rather than use the traditional Ghost approach, which in my opinion has increased in suckage substantially in the past few releases, we’re exploring a new option: FOG. FOG has a few advantages over Ghost:
- It’s Linux-based and therefore actually reliable
- Can image almost anything to anything
- Has a central deployment console
- Can use multicast to distribute images/etc at a high rate to lots of boxes simultaneously.
The big advantage over something like IBM Director? Price and customization.
At work, we just started rolling out Lotus Notes 8 to all of our clients. For those under a rock, this is a ground-up rewrite of Notes and it is based on the Eclipse framework. Now, my experience with Eclipse has been almost uniformly horrible, but perhaps that’s just working with Rational tools. Anyway, a few minor observations on my machine (T42p ThinkPad, 2GB RAM):
- Looks a lot better. The old Notes 7 “look and feel”, if you could call it that, was antiquated at best. It felt like a throwback to the early 90s.
- It feels snappier. I don’t know if this is necessarily true, but it is more responsive.
- Integration of Sametime is 1000x better, and finally useful.
- Message “threading” is finally there. If it existed before, I couldn’t find it.
Not all is well, but it’s still a lot better than 7.
I remember when computers were, for me at least, exclusively for work. I might occasionally dial up a server to get mail or ftp files, but most of the time I was offline. All I could do was write and program. Now I feel as if someone snuck a television onto my desk. Terribly addictive things are just a click away. Run into an obstacle in what you’re working on? Hmm, I wonder what’s new online. Better check.
That’s Paul Graham in Disconnecting Distraction. What most struck me as I was reading the article is how much it has mirrored my time with computers. Over the years, the Internet has become more central to my job, and more indispensable; or so I thought. What he has done is illuminate the dark underbelly of work and the Internet. I am reminded of something said by someone perhaps even more seminal in their work: Donald Knuth. In Knuth versus Email, Prof. Knuth writes:
Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things. What I do takes long hours of studying and uninterruptible concentration. I try to learn certain areas of computer science exhaustively; then I try to digest that knowledge into a form that is accessible to people who don’t have time for such study.
Now, I certainly can’t take Prof. Knuth’s position. My job is, quite often, to be “on top” of what is going on, and this requires a constant digestion of new information. What I have learned is that there is a certain amount of training required of others in their electronic interactions. Towards that, I’ve made the following rules:
- My email is only refreshed every 30 minutes, and often I don’t read it more often than once an hour or less. If someone truly needs something, they can call me or send me an instant message.
- I read all instant messages; however, I do not respond immediately simply because someone sent me a message. Sometimes they require real-time interaction, but often they’re simply a way to get around my time-filter on email.
- I don’t, as a rule, answer my phone after hours if it’s a work call. I do this by making sure that everyone I work with is in my iPhone, as well as the number of the company-wide call-router.
In the past, I worked to respond immediately to all requests, no matter how tedious and insensitive to time. This is akin to a fully interrupt-driven existence, which is not a survivable model. First, it means that forces outside your control dictate your ability to focus. More importantly though, it means that every time the interrupt fires there is a mental context-switch. Take, for example, email. Putting down whatever I’m working on, changing to my email client, and reading 1-2 emails may only take 30 seconds, but it then takes several minutes to get back in the zone to do work.
Reading “Disconnecting Distraction”, I am reminded that I also allow other things to intrude into work, whether professional or personal. Things like RSS, personal email and news all conspire to diminish my ability to focus. It is a combination of over-stimulation and mind-numbing tedium, and it is something that I need to work on to be more effective.
I have a unique talent, it seems. I can “pick up” new technology at an exceptionally quick rate. This sounds useful, but it tends to get me into trouble because it quickly appears that I have some enormous depth in something I’m just beginning to sort out.
I’m not sure why this is. Perhaps it is the dearth of actual competent individuals in IT; perhaps it is that once you demonstrate some breadth of knowledge, it is assumed that you know everything to a certain level of competence. I don’t. I know something about nearly everything in the technology world, but there’s a much smaller set of areas where my knowledge is deep and extensive. This can be dangerous. While it lets me bridge a lot of gaps when working with clients, it sometimes leads project management people to make assumptions.
Take this morning for example. I’ve been asked to represent my team in a discussion about Active Directory with the client. Certainly, I know what directories are. In fact, I probably know directories (and meta-directories) better than 99% of the people working with Active Directory. That doesn’t, however, mean that I understand implementation issues with Active Directory in particular (I don’t), nor do I understand all the intricate and interwoven dependencies with Windows deployments (I don’t, and have no desire). My experience is in glueing together directories from dozens upon dozens of organizations into one organization-wide view. It’s not a trivial task, but it’s not the same task, and assuming that because I’ve done one that I can do the other? Dangerous.
I know it, but why is it so hard for others to understand the various skill sets inside even a specific technology/discipline? I can learn—that’s my talent—but even for me it takes some time, something we are sorely lacking in this project.
I really do, you’re so sexy and svelte and say all the right things, but see, I’ve tried to install you on 2 servers now, and also on 2 different platforms under VMware (Fusion and Server), and in both cases strange things have happened that cause you to abandon me. I’m sorry, but I think it’s time to go back to my first love: FreeBSD.
At work, I wanted to set up a spare blade in a chassis to be a general catch-all server for the lab. The blade in question is a HS40 blade for the IBM BladeCenter chassis. While it’s about 2 years old, it’s a 4-socket blade with 8 cores and 16GB RAM. Not exactly something to sneeze at. The distribution I chose to use is Ubuntu 8.04 LTS Server. I’ve generally had good luck with Ubuntu on the desktop, so I figured I’d give it a whirl on the server.
No go, unfortunately. For some reason, I ran into a few problems that I need to work out:
- Ubuntu doesn’t see the MegaRAID controller, and instead somehow sees the two individual drives. Looks like I need special drivers.
- It seems unhappy with memory mirroring set up, which shouldn’t even be visible.
- Finally, it installs, but then reboots and says it’s an “unsupported CPU type”
Strange, but a problem for tomorrow.
I’d like to go on record as announcing the demise of Blu-Ray. Now, I’m not saying that people won’t buy it, nor that it won’t exist for many years as a format. Even Laserdisc survived for over 20 years with 2% market share in the United States. What I mean by this incendiary comment is that Blu-Ray is irrelevant. This is true for several reasons.
First, the average person simply has no reference for quality. Therefore quality doesn’t sell. VHS beat Beta, VHS beat Laserdisc. VHS sucks. It’s not that people couldn’t tell the difference if they wanted to, but that, with the exception of a minuscule number of videophiles, they simply don’t care. Does the picture move? Do the sounds go boom? Good enough. DVD didn’t win, in my estimation, primarily because of quality. It won because it was easy to use, lasted longer, and was less subject to damage by children than VHS was. In addition, it was a lot cheaper to manufacture. Also, it was “familiar” in form-factor to the Compact Disc.
Next, while there is a growing number of people with high definition televisions, it’s still a minority of the market, and even more, the number of people with a 1080p set is effectively zero as a percentage. Is it zero? No, but it’s a largely irrelevant number. A huge number of people are still using CRTs and aren’t likely to replace them until forced to, i.e. when they die.
Finally, the Internet. With the rise of broadband penetration into the United States—never mind the rest of the world that left us behind long ago—video-on-demand is becoming more and more of a reality. Looking at iTunes gives you an idea of where things are going in the home movie-watching world.
I’d also like to point out that movies are 100x more ripe for rental use than CDs, but the reasoning behind that is best saved for a later post.
In earth-shattering news, Ryan Park has revealed that CouchDB is not a relational database, and therefore may not behave like a relational database. Glad we got that out of the way. In more important news:
- Oracle, DB2, SQL Server, PostgreSQL and MySQL are also not strictly relational, and in many ways not even close
- Relational databases don’t solve all problems, or even many problems well
- Much of the world’s data is not in a relational database
- The web ain’t the world
I realize these are difficult truths, and that when you have a golden hammer, you wish to slam it into every nail, but that doesn’t mean it’s the right answer. Some of the most advanced applications I’ve ever worked on were written in object databases, and I’ve even built some very cool applications for data correlation in something that’s really more of an object/network blend. I’ve supported applications in network databases and even hierarchical databases and even something that isn’t really describable.
Understanding what data model works in reality is very different than what seems conceptually pure and simple. Relational algebra is a nice model, but it’s not always the right solution. Relational algebra is a good way to think about a problem, but don’t confuse that with it being the best way to solve it.
Unfortunately, a vast majority of “developers” have either never been forced to work through problems with a different tool or, even more telling, never chosen to try and figure out other solutions. If you’ve only looked through one window, then the view out another, into the same world, may be terrifying.
For a project, I’m doing a bunch of integration with Cisco’s Call Manager and automatic remapping of handsets. It’s a pretty simple process really, except for one fact: Cisco’s documentation on this sucks. It’s also a rather broken approach. Since others might run into this, I thought I’d post a summary of what actually works.
First, you need the following bits of information:
- Device name (e.g., the handset name like SEP00175A52EE7C)
- Mobility manager user ID and password
- Target user
- Target profile (optional)
One of the catches is that, at least in the v4.2 documentation for the release I’m working with, Cisco sometimes says password, sometimes PIN and even calls it a certificate in places. This is confusing if you don’t know that it needs to actually be a password.
So let’s say you want to remotely login a handset (SEP00175A52EE7C) for a user (JDOE) with a profile (JDOEMOBILE). You’d need to create an XML tree that looks like this:
<request>
  <appinfo>
    <appid>MobilityManagerUser</appid>
    <appcertificate>supersquirrel</appcertificate>
  </appinfo>
  <login>
    <devicename>SEP00175A52EE7C</devicename>
    <userid>JDOE</userid>
    <deviceprofile>JDOEMOBILE</deviceprofile>
  </login>
</request>
Simple enough, on the surface. Here’s where the trick comes in. Cisco is unclear on how you might deliver this. If your Call Manager is at ccm.acme.com, then the URL for your API is http://ccm.acme.com/emservice/EMServiceServlet. Cisco says “use POST”, but that’s not quite what they mean. If you just POST the XML to that URL with a Content-Type of text/xml, you’ll get all sorts of useless server errors.
What you actually have to do is pretend that you’re posting a FORM. To do this, I use multipart/form-data encoding and pass the XML document above inside a part labeled “xml”. Once you figure that out, it’s not complicated.
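For the curious, here is a rough Python sketch of that request, built only from the description in this post. The host name and credentials are the placeholders used above, the function name is my own, and the tag names are shown in lower case as they appear here (see the note at the end about the real casing). Treat it as a starting point rather than a tested client.

```python
# Sketch only: builds the multipart/form-data POST described above.
# Host, credentials and tag casing are the placeholders from this post;
# build_form_request is a hypothetical helper, not part of any Cisco SDK.
import urllib.request
import uuid

LOGIN_XML = (
    "<request>"
    "<appinfo>"
    "<appid>MobilityManagerUser</appid>"
    "<appcertificate>supersquirrel</appcertificate>"
    "</appinfo>"
    "<login>"
    "<devicename>SEP00175A52EE7C</devicename>"
    "<userid>JDOE</userid>"
    "<deviceprofile>JDOEMOBILE</deviceprofile>"
    "</login>"
    "</request>"
)


def build_form_request(url, xml_document, boundary=None):
    """Wrap the XML in one form part named "xml" and return a POST request."""
    boundary = boundary or uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="xml"\r\n'
        "\r\n"
        f"{xml_document}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_form_request(
    "http://ccm.acme.com/emservice/EMServiceServlet", LOGIN_XML
)
# Sending is left commented out so the sketch runs without a Call Manager:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

The same helper should work for logout by passing the logout document instead.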
Logout is similar, with the following XML document:
<request>
  <appinfo>
    <appid>MobilityManagerUser</appid>
    <appcertificate>supersquirrel</appcertificate>
  </appinfo>
  <logout>
    <devicename>SEP00175A52EE7C</devicename>
  </logout>
</request>
So if you ever need to work with a Cisco Call Manager to do station mobility, there’s a bit of insight. This might have changed in later releases, but the client I’m working with is intent on staying on v4.2.
N.B. For some reason, WordPress is converting all the XML tags to lower case, which they are not. They are sorta camel case. Make sure you get that right, as Cisco seems to care very deeply about it.
Besides offering both an amazing shopping experience and the leading cloud computing offering, Amazon apparently also owns a mind-reading robot. I say this because today, Amazon announced persistent disk volumes. While they’re not available yet, they are coming very soon. There’s a couple of observations I’d make about what’s there, and what’s missing at this point. First the great things:
Familiarity: As far as I can tell, it looks to be a familiar structure with some twists. This means that applications should be able to run unmodified on this new capability.
Public Discussion: I think the recent pattern that Amazon has demonstrated, with explaining and exploring new features publicly prior to announcement, is one to be applauded in a company. While this won’t work for everyone (often the user has no idea what they need), it does work well to polish up the last bit of a service offering.
So what’s missing? What would I like more information on?
Technology: Not that it matters overly, since it’s only exposed via web services, but I’m guessing this is an iSCSI service. It might be FCoIP, but I doubt it since we’re not having to worry about non-UNIXy traffic (i.e. FICON).
Pricing: The suitability of this solution will come down to cost. As with S3, it’s likely that there are going to be multiple components to pricing, but above the cost of storage itself, I’m wondering how transfers will be accounted for. This could have the biggest impact on databases that do an enormous amount of reading and writing. While you could simply ignore utilization and charge only for the storage itself, I think that unfairly penalizes people who are “sane” with their read/write patterns.
Performance: The big elephant in the room is performance. I’m sure it’ll be acceptable for 99% of applications, but it would be nice to understand the rough performance model of the system. Also, can the volumes be put into RAID arrays to speed up sequential reads/writes?
Reliability: Amazon has done well at demonstrating a high level of reliability, but I think it would be nice to know a little more about what’s underneath the volumes, if only to better recognize the risks.
Snapshot Format: It would be useful to fully understand the snapshot format that gets written. I’m assuming it will be very similar to what is used for AMIs, but documentation is key. Documenting it would accomplish two things: 1) protect against long-term tie-in with Amazon, and 2) allow for pulling down snapshots to debug locally. Since S3 only supports objects up to 5GB, a snapshot will obviously have to be split up in many situations.
Snapshot Timing: How long is it going to take to snapshot large elements? This has two phases. First, the “marking” of a point in time to snapshot. The second is the actual movement of the snapshot to S3. The first is important because, at least in the case of many applications like databases, you must quiesce the database prior to snapshotting for best effect. If this is less than 1 second, then it’s great. If it’s longer, then hiccups could happen. The second simply tells you what your exposure window is.
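On the snapshot format point, the splitting itself is simple to reason about, whatever Amazon ends up doing. Here is a small Python sketch of how a large snapshot could be cut into objects that fit under the 5GB cap mentioned above. The function name and the (offset, length) layout are hypothetical; nothing here reflects an actual Amazon format.

```python
# Sketch: plan how a large snapshot would be cut into objects that fit
# under a per-object size cap. The 5 GB figure is the limit mentioned
# above; plan_parts and its output layout are hypothetical.
MAX_OBJECT_BYTES = 5 * 1024**3


def plan_parts(total_bytes, max_part_bytes=MAX_OBJECT_BYTES):
    """Return (offset, length) pairs covering total_bytes in order."""
    parts = []
    offset = 0
    while offset < total_bytes:
        length = min(max_part_bytes, total_bytes - offset)
        parts.append((offset, length))
        offset += length
    return parts


# A 12 GB volume needs three objects: 5 GB, 5 GB and 2 GB.
for offset, length in plan_parts(12 * 1024**3):
    print(offset, length)
```

A documented format would need to record at least this ordering so that the pieces can be reassembled outside Amazon.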
All in all, this represents the last major piece of the puzzle to adoption. There’s other things that would be nice, and perhaps I’ll write about those another time, but these give you everything you need to take existing infrastructure and “move it into the cloud”. I think it’s clear that Amazon intends to be the player in this space, and isn’t content to leave a great offering alone.
James finally did it. What is it? He typed rm -rf / on a Solaris box. Fortunately, he wasn’t root. I, however, was root when I typed rm -rf * on a SunOS 4 box that ran a lot of core apps for a company in the mid 1990s. By the time I realized what had happened, I had blown away the kernel and /etc. Needless to say, much hilarity ensued. Fortunately, it didn’t crash, because the kernel and all the files it needed were still open and therefore accessible. It was the reboot that would kill it.
So, I had to restore /etc from tape, the kernel from a backup location (it had been custom built), and hope. Turned out I got it right, except for marking the magic setting on the kernel to make it bootable. Live and learn.
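The machine stayed up because, on Unix-like systems, deleting a file only removes its name; the data survives for as long as something holds it open. Here is a small Python sketch of the same behaviour, using a temporary file as a stand-in (it will not behave this way on Windows).

```python
# Demonstrates POSIX unlink semantics: removing a file's name does not
# free its data while a process still holds it open. Unix-like systems only.
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w+") as handle:
    handle.write("still here")
    handle.flush()

    os.unlink(path)                      # the directory entry is gone...
    name_exists = os.path.exists(path)

    handle.seek(0)
    contents = handle.read()             # ...but the open handle still works

print(name_exists, repr(contents))
```

Once the last handle closes, or the machine reboots, the data is gone for good.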
InfoWorld has an article about Java’s blazing new speed. I read it when it was posted about a week ago, and stewed on it, and various people have commented on it, but I just find the whole thing absurd in the extreme. The fetishistic obsession with speed is amusing, but misguided. As I’ve said repeatedly in the past, performance is a complex topic, and the speed an application runs is often limited by many things unrelated to the actual performance of the underlying language implementation.
My favorite quote, though, is this brilliant little insight:
Scripting languages are ideal for smaller programs but Java is the choice for larger programs, he said. “As your program grows in size, the lack of strong typing basically kills your ability to handle a very large program and so you don’t find the million-line Perl program,” he [Cliff Click] said. One-million-line Java programs are plentiful, Click said. Strong typing refers to the capability of knowing the type of memory objects.
I don’t know about you, dear reader, but the concept of a million-line monolithic program is absurd and laughable. Now, I’m sure 1/2 those lines in a Java program are absorbed in meaningless syntactical stupidity and obsession with XML and the latest nonsense in getting around a brain-dead language, but it’s still an absurd size to build one monolithic system. To me, what this demonstrates is the total inability to demonstrate perspective in understanding when your tool is in the way.
I know of at least one ERP system, implemented in a “scripting language” for a large multi-national corporation. It’s not even close to 1M lines of code, but it is broken into a huge number of loosely coupled systems. The company doesn’t care to discuss it publicly as they consider it a strategic advantage to be able to adjust their business at a quick rate.
James Robertson writes about the disastrous opening of Terminal 5 at Heathrow. For the past 18 months, I’ve been mired in two major airport projects, and perhaps I can shed some light onto the situation. I also happen to have some knowledge of BAA, which is the entity that operates Heathrow (among many others).
First, I think it’s important to understand what an airport is, and is not. To understand airports, one must not think about transportation, but about shopping malls and civic projects. The financial model for an airport, at least in the United States [1], is by and large based on renting space to airlines, and sometimes services around that space. It’s a shopping mall. Sure, there’s a lot of other financial components, and obviously it has a lot of expenses that a shopping mall doesn’t, but in the end, it’s often operated that way.
Next, all airports are projects of “civic importance”. They are huge projects, often in the hundreds of millions, if not billions of dollars, and are seen as a point of pride for a city, county or region. While you would think this would be a good thing, it actually tends to mean that the management is highly politicized, the operations molasses-like in their agility, and priorities inverted to ensure difficulty.
For example, I can’t begin to recount the number of meetings I’ve sat in where hours were spent discussing exactly what shade of terrazzo stone to use on the floor, and then the entire IT operations were dismissed as being “unimportant”. I’ve had 2 hour discussions about the exact shade of white that a CCTV camera needs to be painted, without anyone discussing whether D1 resolution is really the right choice. It’s a beauty contest, quite honestly. The budget for art is often bigger than the IT capital budget for new construction.
When you’re in that situation, with inverted priorities, and a business model that isn’t really driven by the end consumer, i.e., the traveler, the chances of failure are high. No airport ever opens on time. No airport ever opens without problems. Some of it is complexity, but some of it is lack of planning and priorities around operational concerns.
[1] I believe this is largely true world-wide, though less so in places where a single carrier dominates an airport, like British Airways at Terminal 5.
The other day, I was chatting with a friend who is trying to start a company. This friend was worried about the infrastructure costs associated with building up her company. I mentioned Amazon Web Services, which I thought nicely solved a lot of the issues. There were 3 big issues that I saw in making this successful:
- Getting DNS wired up properly with all IP addresses being dynamic
- Implementing some real form of redundancy when everything is behind a hazy curtain
- Making your application fit into Amazon’s model (S3, SDB, etc.)
Today, I got an email from Amazon saying they’ve fixed the first two. Huzzah! After playing a tiny bit with it, I think it’s 99% of the way there. Here are a few observations:
- Redundancy still seems to only come from the east coast, and it’s unclear if that’s a temporary thing or otherwise. It’d be nice to also be able to solve the geography thing.
- I think dynamic DNS is an arguably better solution than static IP reservations. I’m quite sure Amazon already uses this internally for a lot of management, why not expose it and tie it together? Static IPs only solve the “public” facing part of the problem.
- Charging to use a static IP (when you don’t have it bound to a running EC2 instance) isn’t really a fiscal issue, so much as it’s a reasonably polite way of discouraging people from “sitting” on IPs.
Overall, though, great progress. That just leaves the whole model issue. What I’d love to see is an S3-like storage rental for EC2, but implemented using iSCSI or something similar. Then I could grab it anytime I had to restart a machine, and not have to, quite honestly, create kludges around it. There are still issues, like filesystem stability, etc., but it is a thought.
Mercurial, my favorite distributed version control system, has turned 1.0. This is very exciting. I’ve been using it for about 6 months, and have been very happy not only with its capabilities, but also with the general performance and stability that I’ve seen. There are a few nits here and there, but I’m looking forward to seeing if the 1.0 release fixes them.
Yesterday, James Governor posited the fifteen rules of cloud computing, or at least something that might be interpreted as such by many people. It’s not a lot deeper an inspection of the issues than one David Letterman might use, but it is an interesting point of departure for discussion. Since I’ve been doing some cloud/grid/fog/whatever work for the past year, I figured I might add my inflation adjusted two cents.
I am going to use these out of order, since I want to discuss certain sets together.
- If you peel back the label and its says “Grid” or “OGSA” underneath… its not a cloud.
- If you need to send a 40 page requirements document to the vendor then… it is not cloud.
- If there is a consultant in the room… its not a cloud.
- If there is no API… its not a cloud.
I’ve lumped all of these together, because they seem to me to talk about complexity. The term “grid” is highly overloaded and often means different things to different people in different contexts. To me, the biggest difference between a grid and a cloud is that a grid is designed around a specific application domain (i.e. high performance computing), and the cloud is an abstract utility. You could build a grid on top of a cloud, I guess I’m saying.
I don’t disagree prima facie with this argument, though I don’t think the presence of a consultant is the “smell”; what matters is whether you have to have one.
- If you need to install software to use it… it’s not a cloud.
- If you can’t buy it on your personal credit card… it is not a cloud.
Well, as some people have pointed out, a lot of us can’t buy it on our personal credit cards, depending on the application, but I think the spirit is that the barrier to entry, financially and otherwise, should be effectively zero. Amazon has done this. Whatever I use, I can just have billed to my card monthly. A few pennies here, a dollar there. The final application of micropayments in some ways.
- If they are trying to sell you hardware… it’s not a cloud.
OK, that’s obvious. A cloud is a service.
- If you need to rearchitect your systems for it… it’s not a cloud.
Well, that’s not really fair. Sure, you could deploy a system as is onto a cloud, but that doesn’t mean it’d work, be reliable, or otherwise take advantage of what a cloud is about: elastic scalability. Amazon is probably the prototypical “cloud service”, but you do have to rethink a lot of things if you want it to be bulletproof. These aren’t necessarily things you don’t have to worry about normally, just things that most people don’t worry about normally.
- If it takes more than ten minutes to provision… it’s not a cloud.
- If you can’t deprovision in less than ten minutes… it’s not a cloud.
- If you need to specify the number of machines you want upfront… it’s not a cloud.
Elasticity. That’d be the key. With Amazon, I can bring servers on and offline quickly. In fact, I’ve had to develop some approaches to hysteresis to reduce the volatility of the systems. This is actually not unlike any form of dynamic capacity management, but as in my previous comment, it’s simply something largely ignored in the past.
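To make the hysteresis idea concrete, here is a minimal sketch of one way it could look. It is not the approach I actually run; the class name, the thresholds, and the cooldown length are all made-up values for illustration. The point is simply that the scale-up and scale-down thresholds are kept apart, and that a change is followed by a quiet period, so a load hovering near a single threshold doesn’t cause servers to flap on and off.

```python
from dataclasses import dataclass


@dataclass
class HysteresisScaler:
    """Hypothetical scaler: two thresholds with a gap, plus a cooldown."""
    scale_up_at: float = 0.75    # add a server above this utilization (assumed value)
    scale_down_at: float = 0.40  # remove a server below this utilization (assumed value)
    cooldown_ticks: int = 3      # samples to ignore after any change (assumed value)
    min_servers: int = 1
    max_servers: int = 20
    servers: int = 1
    _cooldown: int = 0

    def observe(self, utilization: float) -> int:
        """Feed one utilization sample (0.0 to 1.0); return the server count."""
        if self._cooldown > 0:
            # Still settling after the last change: ignore this sample.
            self._cooldown -= 1
            return self.servers
        if utilization > self.scale_up_at and self.servers < self.max_servers:
            self.servers += 1
            self._cooldown = self.cooldown_ticks
        elif utilization < self.scale_down_at and self.servers > self.min_servers:
            self.servers -= 1
            self._cooldown = self.cooldown_ticks
        # Between the two thresholds nothing changes: that dead band is the hysteresis.
        return self.servers


scaler = HysteresisScaler()
samples = [0.80, 0.74, 0.76, 0.41, 0.78, 0.39, 0.60, 0.60, 0.60, 0.30]
history = [scaler.observe(u) for u in samples]
print(history)
```

Widening the gap between the two thresholds, or lengthening the cooldown, trades responsiveness for stability; where to set them depends entirely on how spiky the load is and how long a server takes to come up.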
- If you know where the machines are… it’s not a cloud.
- If you can’t connect to it from your own machine… it’s not a cloud.
- If you own all the hardware… it’s not a cloud.
They’re in a data center? Seriously though, I do need to know roughly where the machines are. For example, I might want to store some of my data in the US and some in Europe (see European privacy laws), or I might want to serve my Japanese customers out of a local Asian facility, rather than dragging the traffic across the transpacific fiber. What I think might be more accurate is that you can’t reach out and touch your machines, no matter what.
Think of it like the electric company. I know sorta where some of my power comes from, but I can’t know exactly, and they certainly won’t let me go in and touch the shiny buttons and colorful dials. I can, however, contract with them for certain constraints on the creation of my energy, abstractly, by requiring that it be green energy.
- If it only runs one operating system… it’s not a cloud.
Well, this seems a bit silly. For better or worse, the cloud world has consolidated on the Linux environment. Do I wish I could get FreeBSD? Sure, and I’m sure a lot of people would, for some inexplicable reason, like to run Windows, but Linux has become the de facto option, and I don’t think that’s likely to change. It’s not economically viable to support lots of options for most organizations.
Not a bad start, but I think we need to look at the bigger issues, and less at the implementation.
No thoughts
FreeBSD has always been my favorite server UNIX implementation. With FreeBSD 7.0, I think they’ve retaken the crown in performance and stability. Way to go guys. More details here, with pretty graphs.
No thoughts
I’m so sick of spam. Just on this blog:
Akismet has caught 91,332 spam for you since you first installed it.
That’s in about a year, or an average of 250/day.
1 thought
David Zatz looks at the technical insanity going on over supposedly lost e-mails at the White House:
If, in fact, the bulk of the White House email records are now stored in bundles of rotting PST files, all at or above their maximum safe load-level, that ain’t good in a very big way. [...] I object to using the inaccurate and inflated claim of excessive cost as a reason to avoid compliance with the Presidential Records Act.
Basically, it comes down to this. Either they are grossly incompetent or they are perjuring themselves to protect something. Which is more comforting as an American?
Having worked on some of the initial efforts at protecting governmental email for the Archives, I can tell you it’s hard. But it is solvable, and much more easily now than in 1997.
No thoughts
I’m a huge admirer of Steve Wozniak, and think he did some brilliant work, and may still be doing it. I am not sure why we care what he thinks about consumer gear, since I don’t believe I’ve seen a single piece of development with any impact on the mass market since the Apple II that has his name on it. That’s 30 years. That’s not to belittle his contributions, but I’m simply not sure how his opinion is more relevant than the average Slashdot poster’s.
No thoughts
So what does it take to build some software on this new toy of mine? Well, here’s some of the steps that are required to get some basics installed:
sudo apt-get install build-essential
sudo apt-get install subversion
sudo apt-get install mercurial
Finally, since I was looking to play with Scheme, I decided to build it on my machine. This consisted of the following steps:
- Download the latest snapshot from the website at http://pre.plt-scheme.org/plt-clean-tree.tgz
- Unpack with GNU tar. For comparison, the Eee PC took 5.3 seconds to do this, whereas my Core Duo MacBook Pro w/2GB of RAM and a 5400RPM drive took 8.5 seconds, although it spent less time in user space.
- Discover that you need to install a bunch of things using apt-get, such as: libx11-dev, libxt-dev, libxmu-dev, libxaw7-dev
- Run ./configure. For comparison, the Eee PC took 35 seconds to run, the MacBook Pro 26.1. This is where CPU is starting to make a big difference, I suspect.
- Build it. For the Eee PC, I just used regular make, and it took 1125 seconds. Since the MacBook Pro has 2 cores, I used make -j3, and so it only took 489 seconds. This isn’t a totally valid comparison since on Linux it uses the Athena toolkit and Xforms, whereas on MacOS X it uses wxWindows. Still, an interesting comparison.
N.B.: I had originally intended to use an ext3-formatted SD card to build things like this on, but I kept getting I/O errors after around 2MB of writing, and just gave up until I can track it down.
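For anyone who wants the timings above as ratios, here is a quick sketch that does the arithmetic. The numbers are the ones quoted in the steps; the dictionary and function names are just made up for this example, and the build row carries the same caveat as before (plain make on one machine, make -j3 and a different toolkit on the other).

```python
# Wall-clock seconds quoted in the post: (Eee PC, MacBook Pro).
timings = {
    "unpack": (5.3, 8.5),
    "configure": (35.0, 26.1),
    "build": (1125.0, 489.0),
}


def eee_to_mac_ratio(step: str) -> float:
    """Return how many times longer the Eee PC took than the MacBook Pro."""
    eee_seconds, mac_seconds = timings[step]
    return eee_seconds / mac_seconds


for step in timings:
    print(f"{step}: Eee PC took {eee_to_mac_ratio(step):.2f}x the MacBook Pro's time")
```

A ratio below 1 means the Eee PC was faster (the unpack step, presumably thanks to the SSD); above 1 means it was slower.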
No thoughts
For those paying attention, I mentioned in a previous post that I had purchased an Asus Eee PC. Specifically, I bought a black 4G. Some specifications, as they might or might not matter:
- 7” LCD (800×480) display with LED backlight
- 900MHz Intel Celeron (which I believe is a Dothan¹ chip), down-clocked to around 633MHz
- 512MB RAM on a single SO-DIMM
- 4GB SSD
- WiFi, Ethernet, 3 x USB 2.0
- 5200mAh battery
- VGA-quality webcam
- Stereo speakers
For the $400 it cost from Amazon, not a bad setup. Certainly it’s not going to replace the Sony TZ series or Apple’s MacBook Air. It is, however, smaller and lighter than either of them. And 1/4 the price.
So what do I think of it so far? That’s a bit hard to say. I’ve not really taken it on a trip yet to evaluate it on a plane—the big test—but I can make some initial observations:
- Wow, 2 pounds is light. It makes my ThinkPad T42p or MacBook Pro feel like a boat anchor.
- The screen is tiny, but amazingly legible. The DPI rating is pretty high, so I’ve been able to compensate for some of the size limitations by just making fonts smaller.
- The keyboard is tiny, and requires some getting used to. I can type hunt-and-peck without a problem, but touch-typing is definitely going to take practice. The feel, however, is pretty good by notebook standards. It’s not a ThinkPad, or even a MacBook, but it’s better than any Dell I’ve used.
- Speakers are pretty good for a notebook, and coworkers commented they sounded better than you’d expect for the size.
- Moving from the captive UI to the normal KDE user interface is pretty easy.
- Skype works great on it.
- Performance isn’t bad, and certainly better than I expected. Application launch isn’t blinding, but once things are running, it’s pretty zippy.
- At first, I was confused as there is only a single trackpad button, but it differentiates left/right by where you click on it. It is, however, way too firm.
- Battery life with the backlight down some and WiFi turned off is exceptional. I actually got 4 hours and there was still some left.
- All my coworkers want one, mostly for the same things I’m using it for: network monitoring, configuration, and other “mobile terminal” applications.
- With a few additional software packages installed, I still have about 1GB free for user data, which is sufficient for my application. I’ve added a 16GB SDHC card to augment this.
So, overall, I’m quite satisfied with my purchase. Driving home from Raleigh last night, I thought: you know, this would be a brilliant “car computer”. Now if I could only think of a justification to do that! Otherwise, my only planned modifications are more memory, potentially hacking an internal GPS receiver (SiRFstar III here I come), and maybe finding a little more storage.
¹ Dothan? Not Dothan, AL I hope. I hated that place when I had to visit there for work a few times.
1 thought
I grew up on big iron. I didn’t learn to program on the Apple II, like so many of my peers, or the original IBM PC under Turbo Pascal, but instead on a PDP-10, then Data General Nova systems. From there I worked on Prime and a whole host of others. I actually didn’t really start writing any software for PCs until long after Windows ruled the roost. During that time, I also spent some time as a system operator, and part-time systems programmer, on a few different mainframes running either MVS/XA or VM/SP.
What I’m getting to is that to me mainframes are more “normal” than PCs are, and I still wonder sometimes how people get anything done on a PC. That means I was quite interested to read IBM’s announcement of the new z10 Enterprise Class server. For those who don’t really understand what separates a mainframe from effectively any other machine, including your average UNIX machine, here are a few specs:
- Up to 64 CP (Central Processor) modules, each running at 4.4GHz, and which can be assigned to very specialized tasks as appropriate to a specific environment
- Up to 1.5TB of RAM
- Up to 64 coupling links for building huge clusters, up to 32 of which can be InfiniBand
- Up to 1,024 ESCON or 336 FICON Express4 connections
- Up to 8 cryptographic features
Mind you, that’s one system. A Parallel Sysplex lets you tie together 32 of these systems into a single logical facility.
One of the things that has historically escaped many people, when they look at the number of CP modules, the megahertz rating, or the MIPS rating, is that mainframes burn almost zero CPU cycles dealing with I/O, network, etc. All of this is offloaded to specialized processors. Having built a lot of high-performance gear on x86 platforms, I can tell you that trying to run huge I/O or network bandwidths tends to enter voodoo territory quite quickly. The traditional interrupt-driven architecture is simple, but it doesn’t scale usefully to multiple gigabits of traffic. You don’t buy a mainframe to replace a supercomputer. You buy it to run huge databases and high-throughput applications.
I’m not saying that a mainframe is for everyone, but it is certainly a very interesting product to this day. Things that mainframe shops achieve regularly, like 99.999% true availability, are rarely, if ever, achieved in the traditional client-server world. Some of this comes down to simple process issues. Mainframe shops tend to be very regimented. The rest is simply that much of the hardware and software has thousands and thousands of man-years of testing behind it.
3 thoughts
I was bored, and so I bought a new toy:

More to come.
2 thoughts
This “morning” Amazon experienced a massive failure of their web services infrastructure; specifically Simple Storage Service and Elastic Compute Cloud. Lots of people jumped in to quickly yell and scream about the issue, wringing their hands about trust and such.
What seems to be completely missed here is something very, very common: technology fails. It fails regularly, and often unpredictably. It’s the way the world works. You can build hardware that doesn’t fail, but that doesn’t mean your software won’t break. I’m not trying to excuse Amazon for their issues. Without knowing what happened, it’s nearly impossible to know how foreseeable it would have been. What I am trying to say is that people need to get a grip.
Here’s how this downtime calculation works, assuming Amazon’s outage was 6 hours¹, and that there are approximately 8,765.81 hours in a year. That gets you a downtime fraction of about .0007, or, inverted to the normal view and converted to percentages, an uptime² of about 99.93%. Is that perfect? Nope. Is that “good enough”? Maybe. What I do know is that actually achieving this kind of uptime isn’t as easy as the armchair geeks would like to think.
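If you want to play with the numbers yourself, here is a small sketch of the arithmetic. The function names are my own invention for this example; the 8,765.81-hour year is the figure used above, and the 6-hour and 2.5-hour outages are the two cases discussed in this post and its footnotes. Depending on how you round, the second decimal place may differ slightly from a figure worked out by hand.

```python
HOURS_PER_YEAR = 8765.81  # average year length used in the post


def uptime_percent(outage_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return uptime over the period as a percentage."""
    return (1.0 - outage_hours / period_hours) * 100.0


def allowed_downtime_seconds(availability_percent: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in seconds, for an availability target."""
    return (1.0 - availability_percent / 100.0) * period_hours * 3600.0


for hours in (6.0, 2.5):
    print(f"{hours} h outage -> {uptime_percent(hours):.2f}% uptime over a year")

# Going the other way: how much downtime does "five nines" leave you?
print(f"99.999% allows {allowed_downtime_seconds(99.999) / 60:.1f} minutes per year")
```

Running the inverse calculation is a useful sanity check before anyone promises a number of nines: the yearly budget shrinks to minutes very quickly.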
So what lessons should the users of Amazon’s web services take out of this?
- Always have a contingency plan, even if it consists of a dark blanket over your head and a six pack of your favorite beverage.
- Make sure you understand the true cost of downtime, and what the impact is to your organization/business/whatever.
- SLAs mean nothing.
- Nobody’s perfect.
- Don’t assume.
And what lessons should Amazon take away?
- Communication is key. A vast majority of the “complaints” were centered around lack of information.
- Information is confidence. Confidence is trust. Amazon should institute something similar to what Salesforce.com did after their big issues a few years ago.
- Never underestimate the anger of a mob.
All told, I still think that “cloud computing” for a lot of applications is the future.
¹ My understanding is it was somewhere around 2.5 to 3 hours, but I’m including a few extra to cover stabilization times, and restart times.
² If we just allow 2.5 hours, it’s an uptime of 99.97%.
1 thought