document databases in online publishing
TRANSCRIPT
My name is Irakli. Let me give you some background about myself and how I tricked the conference organizers into thinking that I was qualified to talk today. :) I am a director of engineering at National Public Radio, which is a fancy way of saying: I lead the software team that is responsible for the code behind npr.org, the NPR API and the NPR mobile apps. Prior to joining NPR, I spent several years developing open-source products for the online publishing industry. Some of these products are now used by news organizations like The Nation, The New Republic, Thomson Reuters and Al Jazeera. I have been using document-based (or so-called NoSQL) databases, on and off, for almost a year now and have enjoyed the experience a lot! Because I enjoyed it so much, I wanted to share my story at this conference. I contacted the organizers and they kindly agreed (I hope they will not regret it by the time we are done :)). So here it is: one guy’s story of falling in love with document databases and why he thinks they have a significant role in online publishing, specifically.
1
One of the main reasons why I love document databases is that they are a truly disruptive technology. And when we say “disruptive technology” we mean something so innovative that it helps create a fundamentally new value network, thus altering the existing market and displacing legacy technologies in it. The innovation of disruptive technologies is not just an incremental progression over existing capabilities. Rather, it is a fundamentally rethought, novel approach to solving hard problems. For instance, there are many good SQL databases, both open-source and commercial. And everybody has their favorite: some like SQL server X’s simplicity, others love the power of database Y, etc. But fundamentally SQL is one way to model data and solve data-warehousing problems. It has its time-proven advantages, as well as some significant shortcomings. Document databases are an architecturally different approach to solving data problems. They are not a drop-in replacement for, or an incremental improvement over, SQL. They do have their own shortcomings, but they also allow solving problems that were either very hard or impossible to solve with traditional, SQL-oriented databases.
2
Traditional SQL database theory places strong emphasis on ACID compliance. You probably remember that ACID stands for: Atomicity, Consistency, Isolation and Durability. The Consistency property ensures that no database transaction violates the referential integrity rules defined in the database schema. Isolation is the requirement that, given concurrent access to data, parallel operations cannot access data that is being modified by another transaction, but have to wait until that transaction completes. Isolation is commonly implemented with pessimistic locking. The Isolation and Consistency requirements of ACID compliance constitute a fundamental problem for a system’s scalability.
3
To put it in the words of Werner Vogels, CTO of Amazon and one of the foremost experts in the field of distributed computing: “If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.” ACID compliance is all about the various processes (and nodes) in the system checking with each other to keep data consistent across the entire system. Therefore, the bigger challenge is not how well master-slave or master-master replication is implemented in your database, but the architectural constraint that ACID compliance imposes on scalability.
4
How important is scalability for a Web system? Is it something that matters just for Amazon, Facebook, Google and the like? The Internet is an incredibly fast-growing medium. It took radio 38 years after its introduction to reach 50 million users; it took television 13 years; the Internet did it in just 4, and it has been growing exponentially ever since.
5
In a report published in June of this year, Cisco forecast that global IP traffic will quadruple by 2015. That means more users, a larger amount of content, more types of content, more sources of content and more real-time content. In this context, by “real-time content” I mean things like check-ins, coverage of live events and citizen journalism during breaking news. Now, most of us in the content-production industry believe that having more traffic and more content is good news. Scratch that: it’s great news! As a matter of fact, the Internet community has gotten so obsessed with the amount of website traffic that it is often used as the most significant measure of a website’s success or failure. So: more traffic is good news… unless you are the developer responsible for making sure the website is still up and running when traffic quadruples.
6
We started the scalability discussion by mentioning the scalability limitations that the ACID-compliance requirement imposes. This constraint is actually a specific case of a more general theorem called Brewer’s theorem, or the CAP theorem. It was formulated as a conjecture by UC Berkeley professor Eric Brewer in 2000. Two years later, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's conjecture. The CAP theorem states that, when designing distributed software systems, there are three properties that are commonly desired: (1) Consistency, (2) Availability and (3) Partition Tolerance. The theorem proves that it is impossible to achieve all three at the same time[1]. Even though the names sound intuitive, it is probably worthwhile to clarify what Gilbert and Lynch meant by each of the terms in CAP, since there are multiple (sometimes contradictory) and confusing definitions floating around the web.
7
Consistency basically stands for the requirement that all nodes in a distributed system see the same data at the same time (a subset of ACID compliance). Availability means every request should receive a response: the system as a whole should be highly available. Partition Tolerance means that a distributed system should tolerate some faults: when some nodes crash or some communication links fail, it is important that the system still performs as expected.
8
Let’s look at some popular distributed data storage systems that you are probably familiar with and see which bucket they fall into on the CAP spectrum. Relational databases, LDAP directory servers and xFS file systems are all examples of consistent and available distributed systems. They are consistent because they provide ACID compliance. They are not partition-tolerant because they do not have a quorum system for removing unreachable nodes from the system.
9
MongoDB, Terrastore, Redis and BigTable all guarantee consistency, and they use quorum for partition tolerance, but they forfeit Availability.
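To make the quorum idea a bit more concrete, here is a minimal Python sketch. It is a generic illustration under my own assumptions, not how any of the systems named above are actually implemented: the node dictionaries and the `has_quorum` and `write_with_quorum` helpers are hypothetical names. The point is only that a write is refused when a majority of nodes cannot be reached, which keeps the data consistent at the cost of availability.

```python
def has_quorum(reachable_count, total_count):
    """A strict majority of all configured nodes must be reachable."""
    return reachable_count > total_count // 2


def write_with_quorum(nodes, value):
    """Apply a write only if a majority of nodes can acknowledge it."""
    reachable = [node for node in nodes if node["up"]]
    if not has_quorum(len(reachable), len(nodes)):
        # Refusing the request keeps the data consistent,
        # at the price of being unavailable during the partition.
        return False
    for node in reachable:
        node["value"] = value
    return True


# Hypothetical three-node cluster; one node is cut off by a partition.
cluster = [
    {"name": "n1", "up": True, "value": None},
    {"name": "n2", "up": True, "value": None},
    {"name": "n3", "up": False, "value": None},
]
accepted = write_with_quorum(cluster, "v1")  # 2 of 3 reachable: accepted

cluster[1]["up"] = False                     # a second node drops out
rejected = write_with_quorum(cluster, "v2")  # 1 of 3 reachable: refused
```

With two of three nodes reachable the write goes through; once only one node is left, the sketch refuses the write instead of letting the minority side diverge.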
10
The Domain Name System (yep, the one that drives all Internet traffic), CouchDB, Riak and Cassandra are all examples of Available and Partition-tolerant distributed systems. They do not guarantee consistency. Rather, they provide a promise of something known as “eventual consistency”. For any given request, you may receive a value that is stale relative to the system as a whole and definitely not isolated per ACID-compliance requirements, but eventually all nodes will sync up. Not running an agreement-based algorithm, which is what Amazon’s Werner Vogels was preaching, is exactly the sacrifice that systems like CouchDB and DNS make to provide extreme scalability and fault tolerance.
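A toy Python sketch may help picture what “eventual consistency” looks like from the reader’s side. Everything here is an assumption made for illustration: the `Replica` class, the integer version counters and the “higher version wins” rule are simplifications and are not how CouchDB or DNS actually replicate.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def write(self, key, value):
        # Bump the local version; no other replica is consulted.
        version = self.store.get(key, (0, None))[0] + 1
        self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Pull entries from another replica; the higher version wins."""
        for key, (version, value) in other.store.items():
            if key not in self.store or self.store[key][0] < version:
                self.store[key] = (version, value)


a, b = Replica("a"), Replica("b")
a.write("headline", "Storm approaches")
b.sync_from(a)

# A new write lands on replica a only; b has not heard about it yet.
a.write("headline", "Storm makes landfall")
stale = b.read("headline")   # b still serves the older headline

b.sync_from(a)
fresh = b.read("headline")   # after syncing, both replicas agree
```

Between the second write and the next sync, replica `b` keeps answering requests, just with an older value. That is the trade: no request waits for agreement, and the replicas converge later.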
11
In his 2000 keynote at the ACM Symposium on Principles of Distributed Computing (the same one where he formulated the CAP theorem), Dr. Brewer also came up with a new definition he called BASE. BASE stands for: Basically Available, Soft-state, Eventual consistency. He formulated and used the BASE principles to demonstrate the trade-offs and differences from ACID-compliant systems.
12
ACID-compliant systems have the following traits: consistency, isolation, focus on commit, nested transactions and pessimistic locking. Typically they are also based on a fixed schema and are therefore inflexible to evolve.
13
In contrast, BASE systems exhibit: weak consistency, availability prioritized above all else, a best-effort approach to conflict resolution and optimistic locking. Systems with the BASE philosophy consider approximate responses to be OK, are architecturally simpler and faster, and evolve flexibly, since they are typically schema-less.
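Optimistic locking is easier to see in code than in a bullet list, so here is a minimal Python sketch. It is loosely modeled on the way CouchDB rejects an update that carries an out-of-date document revision, but the `DocumentStore` class, the `ConflictError` exception and the plain integer revisions are all assumptions of mine for illustration, not CouchDB’s API.

```python
class ConflictError(Exception):
    """Raised when a writer's revision is no longer the current one."""


class DocumentStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        revision, body = self._docs[doc_id]
        return revision, dict(body)  # hand out a copy, never a lock

    def put(self, doc_id, body, expected_rev=None):
        current_rev = self._docs.get(doc_id, (None, None))[0]
        if current_rev != expected_rev:
            # Somebody else saved first; the caller must re-read and retry.
            raise ConflictError(f"{doc_id}: expected rev {expected_rev}, "
                                f"current rev {current_rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocumentStore()
store.put("story-1", {"title": "Draft"})

# Two editors read the same revision without blocking each other.
rev_a, doc_a = store.get("story-1")
rev_b, doc_b = store.get("story-1")

doc_a["title"] = "Edited by A"
store.put("story-1", doc_a, expected_rev=rev_a)  # first save wins

doc_b["title"] = "Edited by B"
try:
    store.put("story-1", doc_b, expected_rev=rev_b)
    conflict_detected = False
except ConflictError:
    conflict_detected = True  # B's save is rejected, nothing was locked
```

Nobody waits on a lock here. The cost is pushed to the rare writer who loses the race and has to re-read the document and try again.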
14
CouchDB is not a “better MySQL” or a “simpler Oracle”. It is really good at availability and partition tolerance and has many traits that make it a better tool for some of the problems traditionally solved with relational databases. But one thing it is not: it is not a drop-in replacement for SQL databases. There are tradeoffs when choosing a document database, and specifically CouchDB. The most obvious and honestly “scary” tradeoff is forfeiting Consistency. We as computer scientists were trained long and hard that data must be consistent, models must be normalized, referential integrity must be maintained, etc. How can we even dream about forfeiting consistency, even for scalability and fault tolerance?
15
The reality, however, is that there are systems engineering problems where strict data consistency is crucial, but there are many where it is not. If you are building stock trading software, you should probably use a data store that guarantees consistency. Financial systems in general require a high level of consistency, but that is not a given for just any system. Anybody who has built a real-life, high-throughput system knows that in many cases you end up de-normalizing the data model to allow for better performance. It is similar to forfeiting consistency in the CAP model. With a document-based database like Couch, some of your requests may occasionally return slightly stale data. Additionally, data in document format is often highly de-normalized and less referentially consistent than data in a fully normalized, relational database. However, if you are building a news publishing website, none of this is unheard of. High-traffic news websites have been de-normalizing data and implementing aggressive caching for years. This is neither new nor radical. On the contrary, instead of home-cooked and half-baked proprietary solutions, we can now use a standard, open-source, highly optimized, well-tested solution like CouchDB. Personally, I think it’s a pretty good deal.
16
At this point, I’ve spent a good portion of this presentation explaining the scalability profile of CouchDB (and similar systems) and discussing how the improvements are not quantitative but fundamentally qualitative. We have also talked about the tradeoffs that the increased availability imposes. Let’s forget about scalability for now, however, and talk about other characteristics of CouchDB as a document storage engine. After all, CouchDB is not the only document database, and there are document databases that do guarantee data consistency, so forfeiting consistency is actually a trait of AP systems (in the CAP model), not of document databases in general. An important trait of document databases, however, is that they are schema-less. There is no pre-defined, strict schema, no table structures or rigid relationships between document types. Document types live in a free world and evolve very flexibly.
17
OK, this is by far one of my ugliest slides. What you see here is a rough ER diagram generated from a fresh, vanilla installation of a popular open-source content management system: Drupal. There are 72 tables in this diagram. Some of you may be familiar with Drupal. It is highly extensible (and generally really awesome), but it does not do much out of the box. So when we used Drupal for creating websites like those of The Nation or The New Republic, we installed dozens of additional Drupal modules and wrote a bunch on top ourselves. Meaning: we added even more tables. And you can clearly see how unreadable this schema already is. Obviously we never even tried to visualize the entire data model on any real projects, because it would have been useless.
18
The same data model in a document-based database would look like this: (see slide) I know, I know! I am exaggerating. Obviously we would have more than one logical type of document even in a document database, but schema-less modeling means that at the physical level it is just one document type, so what you see here is really not that far from reality as far as actual data storage goes. Most things above and beyond that are really part of the application logic and business rules. Since my presentation is one of the last ones at this conference, I am sure you have already listened to presenters who went into great detail about data modeling in CouchDB, and I am sure they are much bigger experts on the subject than I am. So I will spare you the experience. Suffice it to say that embedding documents greatly simplifies data models. Think about just the number of so-called “mapping” tables that relational systems need to model things like many-to-many relationships. Also, in the case of online publishing specifically, most business objects are… well, documents, so having a storage engine that operates in terms of documents is extremely natural and enjoyable. There’s much less discrepancy between the physical and logical models. Things, in most cases, just make sense and fall in line naturally.
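As a rough illustration of the mapping-table point, here is a small Python sketch that puts the two shapes side by side. The story, the tag names, the author and the field names are all made up for the example; they do not come from Drupal or from any real NPR data model.

```python
# Relational shape: three "tables" just to say which tags a story has.
stories = [{"id": 1, "title": "Budget vote delayed"}]
tags = [{"id": 10, "name": "politics"}, {"id": 11, "name": "economy"}]
story_tags = [  # the mapping table for the many-to-many relationship
    {"story_id": 1, "tag_id": 10},
    {"story_id": 1, "tag_id": 11},
]


def tags_for_story(story_id):
    """Resolve a story's tags by joining through the mapping table."""
    tag_ids = {row["tag_id"] for row in story_tags
               if row["story_id"] == story_id}
    return [tag["name"] for tag in tags if tag["id"] in tag_ids]


# Document shape: the same information embedded in one document.
story_doc = {
    "_id": "story-1",
    "type": "story",
    "title": "Budget vote delayed",
    "tags": ["politics", "economy"],
    "authors": [{"name": "A. Reporter"}],  # embedded, no join needed
}

joined_tags = tags_for_story(1)
embedded_tags = story_doc["tags"]
```

Both shapes answer the same question. The document version simply has the answer sitting inside the thing you already fetched.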
19
Another important, stark difference between relational databases and CouchDB is the absence of a query language. Like most other things about CouchDB, it’s pretty “scary” for newcomers. So much so that some other document databases have actually opted to implement an SQL-like syntax (MongoDB, for instance), and I know a lot of people who appreciate that. In contrast, CouchDB uses Map/Reduce: first filtering the data with a Map function and then (optionally) grouping it with a Reduce function. The documents, as well as the results of the Map and Reduce functions, are all stored in B-trees (the secret sauce of CouchDB’s performance). Whereas in a relational database you would normalize the data and then index some of its columns, most things in Couch are a B-tree index to begin with. This has significant consequences and, much as with forfeiting data consistency, there are some real trade-offs to be made. While Map/Reduce is very powerful, you will find some queries that you could run in SQL that are either impossible to model with a View or are too expensive or too slow. Also, Views are not as dynamic as SQL queries. They are built incrementally, and a complete rebuild of one in a large database is an expensive operation. As such, it really pays off to think carefully through the Views that a system will be using at the early stages of system design.
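For those who have not seen a View before, here is a minimal sketch of the Map/Reduce idea in Python. Real CouchDB view functions are written in JavaScript and run inside the database, so treat this as an analogy only: the sample documents, `map_story_by_section`, `reduce_count`, `build_view` and `query_grouped` are hypothetical, and a sorted list stands in for the B-tree.

```python
from itertools import groupby

# Made-up documents of mixed types living in the same database.
docs = [
    {"_id": "s1", "type": "story", "section": "news"},
    {"_id": "s2", "type": "story", "section": "music"},
    {"_id": "s3", "type": "story", "section": "news"},
    {"_id": "a1", "type": "author", "name": "A. Reporter"},
]


def map_story_by_section(doc):
    """Map step: filter to stories and emit (key, value) rows."""
    if doc.get("type") == "story":
        yield doc["section"], 1


def reduce_count(values):
    """Reduce step: collapse the values emitted for one key."""
    return sum(values)


def build_view(documents, map_fn):
    """Run the map function over every document and keep rows sorted by key."""
    rows = [row for doc in documents for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def query_grouped(view_rows, reduce_fn):
    """Group the sorted rows by key and reduce each group."""
    return {
        key: reduce_fn([value for _, value in group])
        for key, group in groupby(view_rows, key=lambda row: row[0])
    }


view = build_view(docs, map_story_by_section)
stories_per_section = query_grouped(view, reduce_count)
```

Note that the question being asked (“stories per section”) is baked into the map function ahead of time. That is the sense in which Views are less dynamic than SQL: a new question generally means a new View, and building it means visiting every document.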
20
The good news is: in online publishing, most user-facing content is a document type, a listing of documents or an aggregation, exactly the things that document-based databases and CouchDB’s Views are highly optimized for. As a matter of fact, at NPR, to withstand the millions of unique users that the main website gets, our legacy system uses an architecture with very similar constraints. It has content objects that are serialized XML, XML lists of content objects, and aggregations that are also represented in an XML format. While we do use an SQL database in the back end, the front-end architecture has made many architectural decisions similar to those made in CouchDB. Yes, the legacy system uses XML instead of JSON… I know, I know! But we have been running our systems for a long while, so some of it pre-dates the time when JSON got all sexy and trendy :)
21
To summarize, AP-style (as defined by the CAP model) document databases exhibit the following traits, which are important for online publishing systems that get significant traffic and have real-time content streams:
- High availability
- Partition tolerance
- Schema-less architecture
- Document-oriented storage
- Index-based, semi-dynamic querying like that in CouchDB Views
The benefit from each one of these features is the result of a tradeoff. For teams architecting systems and implementing document databases, it is crucial to understand and appreciate the tradeoffs made. That said, document databases are disruptive, the benefits they provide are real, and ignoring them, rather than augmenting traditional relational storage systems with document-based ones, would be a mistake.
22
Thank you for your attention.
23