Objects, Anomalies, and Actors: The Next Revolution
Steve Vinoski believes that actor-oriented languages such as Erlang are better prepared for the challenges of the future: cloud, multicore, high availability and fault tolerance.
Bio
Steve Vinoski is the author of the "Toward Integration" column in IEEE Internet Computing, has written for magazines such as C/C++ Users Journal and C++ Report, and is the co-author of Advanced CORBA Programming with C++ (APC) with Michi Henning. He is currently an architect at Basho Technologies. He previously worked as chief architect for IONA Technologies, HP, Apollo Computer and Texas Instruments.
About the conference
Software is changing the world; QCon aims to empower software development by facilitating the spread of knowledge and innovation in the enterprise software development community; to achieve this, QCon is organized as a practitioner-driven conference designed for people influencing innovation in their teams: team leads, architects, project managers, engineering directors.
Making the Web Faster with HTTP 2 Protocol
As we all know very well, HTTP is the protocol by which browsers and Web servers communicate to implement all the Web applications that we have loved and used since the 1990s.
The last ratified version of the HTTP specification is 1.1. It was approved in 1996. Over time it has seen only small improvements, mostly clarifications. HTTP 1.1 has served well since then and continues to serve well for most Web applications, but there is plenty of room for further improvement.
As a matter of fact, between 1997 and 1998 there was an attempt to specify a better HTTP protocol, then called HTTP-NG. However, the work on that specification ceased because it was considered too early to submit a proposal for a new protocol version, given that HTTP 1.1 was only starting to be adopted by most browsers.
Only recently has the work on the HTTP 2.0 specification finally resumed with a call for proposals. There were a few proposals, but in reality they are not exactly new. Most of the proposals are based on ideas already explored by the researchers of the HTTP-NG working group.
Currently there are three proposals submitted to the HTTP 2.0 working group also known as HTTP bis: the SPDY protocol, HTTP Speed + Mobility and Network-Friendly HTTP Upgrade.
Let's take a look at the current set of proposals, so you can see where we are and what we can expect from a final HTTP 2.0 protocol specification.
Amon 0.9 – Fast, scalable async JSON logging with ZeroMQ
One of the key features of Amon is that you can use it to directly log any data structure. There is no need to convert your log entries to strings, as with traditional logging. Behind the scenes Amon converts the data to JSON and stores it in a Mongo database. The benefit of this approach is that you can search your data through the web interface. If you need to parse and analyze your data, you can export it from Mongo to JSON using a simple one-line command.
Amon uses HTTP as its default input protocol. HTTP is easy to set up and reliable, but it is blocking and really slows down your application when you have to log a lot of data. After some time experimenting and reading about UDP, TCP and RabbitMQ, I stumbled upon this article by Nicholas Piël about ZeroMQ. ZeroMQ is simple and elegant, although it will take some time to get your head around it. Most importantly, it is extremely fast and reliable. Below you can see the results of my benchmarks comparing the speed of traditional logging to a file with logging to Amon using ZeroMQ.
Data-Intensive Text Processing with MapReduce
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader "think in MapReduce", but also discusses limitations of the programming model as well.
Nodejs vs Play for Front-End Apps
Mar 29, 2011: The source used for these tests is now available at https://github.com/s3u/ebay-srp-nodejs and https://github.com/s3u/ebay-srp-play.
Mar 27, 2011: I updated the charts based on new runs and some feedback. If you have any tips for improving numbers for either Nodejs or Play, please leave a comment, and I will rerun the tests.
We often see “hello world” style apps used for benchmarking servers. A “hello world” app can produce low-latency responses under several thousands of concurrent connections, but such tests do not help make choices for building real world apps. Here is a test I did at eBay recently comparing a front-end app built using two different stacks:
- nodejs (version 0.4.3) as the HTTP server, using Express (with NODE_ENV=production) as the web framework with EJS templates and cluster for launching node instances (cluster launches 8 instances of nodejs for the machine I used for testing)
- Play framework (version 1.1.1) as the web framework in production mode on Java 1.6.0_20.
The intent behind my choice of the Play framework is to pick up a stack that uses rails-style controller and view templates for front-end apps, but runs on the JVM. The Java-land is littered with a large number of complex legacy frameworks that don’t even get HTTP right, but I found Play easy to work with. I spent nearly equal amounts of time (under two hours) to build the same app on nodejs and Play.
WebSockets: A Guide
WebSockets provide two-way realtime communication between a client and server, and thus are exceedingly useful in building modern web games. Browser-based games can profit from an always-on, low-latency connection by enabling the rapid transmission of information about player and global game state previously emulated by methods such as Ajax polling and Comet. It is useful to first look at the history of WebSockets and gain an understanding of how WebSockets work at a technical level before we examine how we may use WebSockets most effectively. Armed with this knowledge, we can simplify the network layer and build amazingly responsive games that provide a high level of multiplayer interactions.
History
The Internet was developed (to grossly oversimplify) as a way to allow organizations to share information efficiently and with little delay. Information on the Internet is transported using a suite of connection protocols named TCP/IP, defining the method through which computers would share information on a decentralized network with reasonable certainty that the information arrived correctly. TCP/IP functions by providing information about the message, such as source and destination. The message contains a checksum (a value calculated from the data in the message) that can be used by a receiver to verify if all of the information was received correctly. The spread of the “web” as we know it was through the HTTP protocol, providing a layer of abstraction that packages TCP/IP connections in an envelope containing information about the request as well as the data for the request itself, such as form fields and cookie values. This HTTP interface provides a simple request-response interface that works well for actions such as fetching a web page, loading an image, or submitting data to a server for persistence.
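To make the checksum idea concrete, here is a minimal sketch of a 16-bit ones'-complement checksum in the style used by the TCP/IP family (described in RFC 1071). The function names are illustrative, and a real TCP implementation also covers header fields that are omitted here; this only shows how a receiver can detect that a message changed in transit.

```python
def internet_checksum(data: bytes) -> int:
    """Return a 16-bit ones'-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Treat each pair of bytes as one big-endian 16-bit word.
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verify(data: bytes, checksum: int) -> bool:
    """Receiver side: recompute the checksum and compare with the one sent."""
    return internet_checksum(data) == checksum


if __name__ == "__main__":
    message = b"hello world"
    sent = internet_checksum(message)
    print(verify(message, sent))         # intact message
    print(verify(b"hello worle", sent))  # one byte corrupted in transit
```

The sender computes the value once and attaches it to the message; the receiver repeats the calculation and rejects the message when the two values disagree.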
Using SPDY and HTTP transparently using Netty
Most people have already heard about SPDY, the protocol from Google proposed as a replacement for the aging HTTP protocol. Web servers and browsers are slowly implementing this protocol and support is growing. In a recent article I wrote about how SPDY works and how you can enable SPDY support in Jetty. For a couple of months now, Netty (originally from JBoss) has also had support for SPDY. Since Netty is often used for high-performance protocol servers, SPDY is a logical fit. In this article I'll show you how you can create a basic Netty-based server that does protocol negotiation between SPDY and HTTP. It uses the example HTTPRequestHandler from the Netty snoop example to consume and produce some HTTP content.
To get everything working we'll need to do the following things:
- Enable NPN in Java to determine protocol to use.
- Determine, based on the negotiated protocol, whether to use HTTP or SPDY.
- Make sure the correct SPDY headers are sent back with HTTP.
SPDY uses a TLS extension to determine the protocol to use in communication. This is called NPN. I wrote a more complete explanation and showed the messages involved in the article on how to use SPDY on Jetty, so for more info look at that article. Basically, during the TLS exchange a server and client also exchange the transport level protocols they support. In the case of SPDY a server could support both the SPDY protocol and the HTTP protocol. A client implementation can then determine which protocol to use.
Since this isn't something which is available in the standard Java implementation, we need to extend the Java TLS functionality with NPN.
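The selection step can be sketched independently of Java and Netty. The Python below is only an illustration of the idea, not the Netty or Jetty API: the function names, the protocol identifier strings and the handler labels are assumptions made for this example. It walks the list the server advertised and picks the first entry the client also supports, falling back to plain HTTP.

```python
from typing import Iterable, Set

FALLBACK = "http/1.1"  # assumed default when nothing matches


def choose_protocol(server_advertised: Iterable[str],
                    client_supported: Set[str]) -> str:
    """Pick the first server-advertised protocol the client also supports."""
    for proto in server_advertised:
        if proto in client_supported:
            return proto
    return FALLBACK


def pipeline_for(protocol: str) -> str:
    """Map the negotiated protocol to a (hypothetical) handler pipeline name."""
    return "spdy-pipeline" if protocol.startswith("spdy/") else "http-pipeline"


if __name__ == "__main__":
    advertised = ["spdy/2", "http/1.1"]          # illustrative identifiers
    print(pipeline_for(choose_protocol(advertised, {"spdy/2", "http/1.1"})))
    print(pipeline_for(choose_protocol(advertised, {"http/1.1"})))
```

A client that knows SPDY ends up on the SPDY pipeline, while an older client sharing only HTTP with the server is served over plain HTTP on the same port.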
Felix’s Node.js Convincing the boss guide
Now that you're all hyped up about using node.js, it's time to convince your boss. Well, maybe. I have had the pleasure of consulting for different businesses on whether node.js is the right technology, and sometimes the answer is simply no.
So this guide is my opinionated collection of advice for those of you who want to explore whether node.js makes sense for your business, and if so, how to convince the management.
Bad Use Cases
CPU heavy apps
Even though I love node.js, there are several use cases where it simply doesn't make sense. The most obvious such case is apps that are very heavy on CPU usage, and very light on actual I/O. So if you're planning to write video encoding software, artificial intelligence or similar CPU hungry software, please do not use node.js. While you can twist and bend things quite a bit, you'll probably get better results with C or C++.
That being said, node.js allows you to easily write C++ addons, so you could certainly use it as a scripting engine on top of your super-secret algorithms.
Simple CRUD / HTML apps
While node.js will eventually be a fun tool for writing all kinds of web applications, you shouldn't expect it to provide you with more benefits than PHP, Ruby or Python at this point. Yes, your app might end up slightly more scalable, but no - your app will not magically get more traffic just because you write it in node.js.
The truth is that while we are starting to see good frameworks for node.js, there is nothing as powerful as Rails, CakePHP or Django on the scene yet. If most of your app is simply rendering HTML based on some database, using node.js will not provide many tangible business benefits yet.
NoSQL + Node.js + Buzzword Bullshit
If the architecture for your next app reads like a cookbook of NoSQL ingredients, please pause for a second and read this.
From MongoDB to Riak
At Bump Technologies, we recently completed a significant database migration from MongoDB to Riak. Almost all of our users' data -- the lists of people they've bumped, communications sent and received, handset information, social network OAuth tokens, etc. -- had been stored in MongoDB, but if you open the app today all of these interactions will be backed by Riak.
Lightweight in-process concurrent programming
The greenlet package is a spin-off of Stackless, a version of CPython that supports micro-threads called "tasklets". Tasklets run pseudo-concurrently (typically in a single or a few OS-level threads) and are synchronized with data exchanges on "channels".
A "greenlet", on the other hand, is a still more primitive notion of micro-thread with no implicit scheduling; coroutines, in other words. This is useful when you want to control exactly when your code runs. You can build custom scheduled micro-threads on top of greenlet; however, it seems that greenlets are useful on their own as a way to make advanced control flow structures. For example, we can recreate generators; the difference with Python's own generators is that our generators can call nested functions and the nested functions can yield values too. Additionally, you don't need a "yield" keyword. See the example in tests/test_generator.py.
Greenlets are provided as a C extension module for the regular unmodified interpreter.
Greenlets are lightweight coroutines for in-process concurrent programming.
Why we moved from NodeJS to RoR
Disclaimer: This post is in no way a rant about NodeJS or Ruby on Rails. It merely reflects on our decision and the reasoning behind it. Both frameworks are great for the purposes they were built for, and yes, that is why a part of our stack is still running on NodeJS.
I am a huge fan of NodeJS. I believe it's a very exciting technology and we will see it getting more popular down the line. I greatly admire it, but in spite of everything I recently ported Targeter App from NodeJS to Ruby on Rails.
The reason we wrote it in NodeJS initially was pretty simple. I had a library that I could use to instantly ship the app (we had 54 hours to make it at Startup Weekend), and I have been coding in JavaScript far more regularly than in Ruby. Since our stack involved MongoDB, it only made sense to live in a JS-only environment. However, as the app grew, I realized that NodeJS was the wrong choice for the app. Let me outline the reasons below.
The JavaScript World Domination Plan at 16 Years
Brendan Eich recaps the major milestones and controversies in JavaScript's history, the performance improvements, and the current work on the next version of JavaScript, ending with some demos.
Bio
Brendan Eich is CTO of Mozilla. In 1995, Eich invented JavaScript (ECMAScript), the Internet’s most widely used programming language. He also co-founded the mozilla.org project in 1998, serving as chief architect. Eich helped launch the award winning Firefox Web browser in November 2004 and Thunderbird e-mail client in December 2004.
About the conference
SPLASH stands for Systems, Programming, Languages and Applications: Software for Humanity. SPLASH is an annual conference that embraces all aspects of software construction and delivery, and that joins all factions of programming technologies.
Introducing RabbitMQ-Web-Stomp
For quite a while here, at RabbitMQ headquarters, we were struggling to find a good way to expose messaging in a web browser. In the past we tried many things ranging from the old-and-famous JsonRPC plugin (which basically exposes AMQP via AJAX), to Rabbit-Socks (an attempt to create a generic protocol hub), to the management plugin (which can be used for basic things like sending and receiving messages from the browser).
Over time we've learned that the messaging on the web is very different to what we're used to. None of our attempts really addressed that, and it is likely that messaging on the web will not be a fully solved problem for some time yet.
That said, there is a simple thing RabbitMQ users keep on asking about, and although not perfect, it's far from the worst way to do messaging in the browser: exposing STOMP through WebSockets.
Some queuing theory: throughput, latency and bandwidth
You have a queue in Rabbit. You have some clients consuming from that queue. If you don't set a QoS setting at all (basic.qos), then Rabbit will push all the queue's messages to the clients as fast as the network and the clients will allow. The consumers will balloon in memory as they buffer all the messages in their own RAM. The queue may appear empty if you ask Rabbit, but there may be millions of messages unacknowledged as they sit in the clients, ready for processing by the client application. If you add a new consumer, there are no messages left in the queue to be sent to the new consumer. Messages are just being buffered in the existing clients, and may be there for a long time, even if there are other consumers that become available to process such messages sooner. This is rather suboptimal.
So, the default QoS prefetch setting gives clients an unlimited buffer, and that can result in poor behaviour and performance. But what should you set the QoS prefetch buffer size to? The goal is to keep the consumers saturated with work, but to minimise the client's buffer size so that more messages stay in Rabbit's queue and are thus available for new consumers or to just be sent out to consumers as they become free.
Let's say it takes 50ms for Rabbit to take a message from this queue, put it on the network and for it to arrive at the consumer. It takes 4ms for the client to process the message. Once the consumer has processed the message, it sends an ack back to Rabbit, which takes a further 50ms to be sent to and processed by Rabbit. So we have a total round trip time of 104ms. If we have a QoS prefetch setting of 1 message then Rabbit won't send out the next message until after this round trip completes. Thus the client will be busy for only 4ms of every 104ms, or 3.8% of the time. We want it to be busy 100% of the time.
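The arithmetic can be checked with a few lines of Python. This is a simplified model and not part of RabbitMQ: the function names are made up for illustration, and it assumes constant network and processing times and a consumer that handles one message at a time, so that a prefetch of n keeps the client busy for n processing slots per round trip.

```python
import math


def round_trip_ms(deliver_ms: float, process_ms: float, ack_ms: float) -> float:
    """Time from Rabbit sending a message to Rabbit processing its ack."""
    return deliver_ms + process_ms + ack_ms


def busy_fraction(prefetch: int, deliver_ms: float,
                  process_ms: float, ack_ms: float) -> float:
    """Fraction of time the consumer spends processing, capped at 100%."""
    rtt = round_trip_ms(deliver_ms, process_ms, ack_ms)
    return min(1.0, prefetch * process_ms / rtt)


def min_prefetch_to_saturate(deliver_ms: float, process_ms: float,
                             ack_ms: float) -> int:
    """Smallest prefetch that keeps the consumer busy in this model."""
    return math.ceil(round_trip_ms(deliver_ms, process_ms, ack_ms) / process_ms)


if __name__ == "__main__":
    print(round_trip_ms(50, 4, 50))                       # 104
    print(f"{busy_fraction(1, 50, 4, 50):.1%}")           # 3.8%
    print(min_prefetch_to_saturate(50, 4, 50))            # 26
```

Under these assumptions a prefetch of 1 reproduces the 3.8% figure, and the model suggests that a buffer of about 26 messages (104ms divided by 4ms) would be needed before the consumer stops waiting on the network.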
An Introduction to Redis in PHP using Predis
Redis is an open source data structure server with an in-memory dataset that does much more than simple key/value storage thanks to its built-in data types.
It was started in 2009 by Salvatore Sanfilippo and quickly grew in popularity, being chosen by big companies like VMware (who later hired Sanfilippo to work on the project full time), GitHub, Craigslist, Disqus, Digg, Blizzard, Instagram, and more (see redis.io/topics/whos-using-redis).
You can use Redis as a session handler, which is especially useful if you are using a multi-server architecture behind a load balancer. Redis also has a publish/subscribe system, which is great for creating an online chat or a live booking system. Documentation and more information on Redis and all of its commands can be found on the project’s website, redis.io.
There is a lot of debate over whether Redis or Memcache is better, though as the benchmarks show, they perform pretty much on par with each other for basic operations. Redis has more features than Memcache, such as in-memory and disk persistence, atomic commands and transactions, and server-side data structures, without logging every change to disk.
In this article we’ll take a look at some of the basic but powerful commands that Redis has to offer using the Predis library.
The Anatomy Of Search Technology: Blekko’s NoSQL Database
Imagine that you're crazy enough to think about building a search engine. It's a huge task: the minimum index size needed to answer most queries is a few billion webpages. Crawling and indexing a few billion webpages requires a cluster with several petabytes of usable disk -- that's several thousand 1 terabyte disks -- and produces an index that's about 100 terabytes in size.
Serving query results quickly involves having most of the index in RAM or on solid state (flash) disk. If you can buy a server with 100 gigabytes of RAM for about $3,000, that's 1,000 servers at a capital cost of $3 million, plus about $1 million per year of server co-location cost (power/cooling/space.) The SSD alternative requires fewer servers, but serves a lot fewer queries per second, because SSDs are much slower than RAM.
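The server count and capital cost in the RAM scenario follow from simple division. The sketch below just reproduces that arithmetic; the function names are made up for illustration, the inputs are the figures quoted in the text, and it assumes decimal units (1 terabyte = 1,000 gigabytes) with the whole index held in RAM.

```python
def servers_needed(index_gb: int, ram_per_server_gb: int) -> int:
    """Number of servers required to hold the whole index in RAM (rounded up)."""
    return -(-index_gb // ram_per_server_gb)  # ceiling division on integers


def capital_cost_usd(servers: int, cost_per_server_usd: int) -> int:
    """Up-front hardware cost for the given number of servers."""
    return servers * cost_per_server_usd


if __name__ == "__main__":
    index_gb = 100 * 1000          # about 100 terabytes of index
    servers = servers_needed(index_gb, ram_per_server_gb=100)
    print(servers)                                   # 1000
    print(capital_cost_usd(servers, 3000))           # 3000000
```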
You might think that Amazon's AWS cloud would be a great way to reduce the cost of starting a search engine. It isn't, for 4 main reasons:
- Crawling and indexing requires a lot of resources all the time; you can't save money by only renting most of the servers some of the time.
- Amazon currently doesn't rent servers with SSDs. Putting the index into RAM on Amazon is very expensive, and only makes sense for a search engine with several % market share.
- Amazon only rents a limited number of ratios of disk I/O to RAM size to core count. It turns out that we need a lot of disk I/O relative to everything else, which makes Amazon less cost effective.
- At some cluster size, a startup has enough economy of scale to beat Amazon's cost+profit margin. At launch (November, 2010) blekko had 700 servers, and we currently have 1,500. That's well beyond the break-even point.
Debugging node.js memory leaks
Part of the value of dynamic and interpreted environments is that they handle the complexities of dynamic memory allocation. In particular, one needn’t explicitly free memory that is no longer in use: objects that are no longer referenceable are found automatically and destroyed via garbage collection. While garbage collection simplifies the program’s relationship with memory, it does not mean the end of all memory-based pathologies: if an application retains a reference to an object that is ultimately rooted in a global scope, that object won’t be considered garbage and the memory associated with it will not be returned to the system. If enough such objects build up, allocation will ultimately fail (memory is, after all, finite) and the program will (usually) fail along with it. While this is not strictly a memory leak, from the native code perspective anyway (the application has not leaked memory so much as neglected to unreference a semantically unused object), the effect is nonetheless the same and the same nomenclature is used.
While all garbage collected environments create the potential to create such leaks, it can be particularly easy in JavaScript: closures create implicit references to variables within their scopes — references that might not be immediately obvious to the programmer. And node.js adds a new dimension of peril with its strictly asynchronous interface with the system: if backpressure from slow upstream services (I/O, networking, database services, etc.) isn’t carefully passed to downstream consumers, memory will begin to fill with the intermediate state. (That is, what one gains in concurrency of operations one may pay for in memory.) And of course, node.js is on the server — where the long-running nature of services means that the effect of a memory leak is much more likely to be felt and to affect service. Take all of these together, and you can easily see why virtually anyone who has stood up node.js in production will identify memory leaks as their most significant open issue.
Vert.x vs node.js simple HTTP benchmarks
For a bit of fun, I decided to do a little bit of micro-benchmarking of Vert.x vs node.js HTTP performance.
Firstly, a disclaimer: This isn’t rigorous benchmarking, and I haven’t attempted to benchmark a wide range of use cases (I just test some HTTP stuff here). All the benchmarking is done on a single machine (my desktop). This is not ideal: a good benchmark would have clients and servers on different physical machines and a real network between them. So don’t read too much into these results. (You can read a little bit, just not too much.) In the future, when I can get hold of some real hardware, I intend to do some real benchmarking. Until then this will have to do.
Apparatus: All tests were run on my desktop: An AMD Phenom II X6 (that’s a six core, not as good as the latest Intels but pretty good), 8GB RAM (although only a fraction was used in the tests), Ubuntu 11.04.
Versions: vert.x-1.0.final, node.js 0.6.6
Pushing Data, Not Pages is the New Model for Application Development
A crop of tools for building applications in a new way has emerged over the past month. The way developers build applications has changed over the past few years. The old client-server model involved doing most of the work on the server side and then piping the results down to a dumb client. That might be a command line application crunching numbers and then displaying the results to a terminal application, or it might be a complex Ruby on Rails application transforming data queried from MySQL and rendering it as an HTML page that is sent to a browser. But to create Web applications that are both feature rich and responsive, like Gmail and Facebook, developers have put more and more of the application in the browser. Add the inconsistent network connections of mobile devices to the mix, and you’ve got a host of new development challenges.
That’s where these new tools come in. Meteor, Mojito and Firebase have joined CouchApps in supporting the modern data-centric model of development. What they all have in common is a design philosophy of sending data to apps, not rendered pages. Applications can then process the data and send the changes back to the server. The cloud is still important: extremely heavy lifting can still be done server side, but many of the smaller processing actions will be done on the client side. Data will be stored in the cloud, but cached locally. Clients get smarter and servers become less visible.
Frameworks
The Meteor team thinks they’ve found a solution in providing a framework for using JavaScript for both client side and server side development. Developers can use Meteor to build real-time applications that live mostly in the browser, and even push code changes to users while the application is in use. Meteor doesn’t replace AJAX libraries like jQuery; instead, it helps AJAX applications get smarter.
Redis-proxy – It’s like haproxy except for redis
Why RedisProxy?
Typically, for every redis server we set up, we have a backup server set up as a slave of the main server.
If the active Redis crashes or goes down for maintenance, we want the application to seamlessly use (read/write) data from the backup server. But the problem is that once the backup takes over as active, it will be out of sync with the original (master), which should then become the slave of the current active. This is solved by redis-proxy, which proxies the active redis. It is also smart enough to issue SLAVEOF commands to machines that start up, and to make masters SLAVEOF NO ONE.
This reduces the common redis slave-master replication dance that needs to be done when bad stuff happens or maintenance of the servers is needed.
Features
- Server Monitoring (to track masters and slaves)
- Automatic slave upgrade on master failure
- Connection Pooling
- Supports Pipelining
- Honors Existing Master Slave Configurations (i.e. if the masters and slaves are already set up, it will maintain the same configuration instead of a large-scale movement of data)