Document Databases in Online Publishing


Upload: irakli-nadareishvili

Post on 28-Nov-2014

TRANSCRIPT

Page 1: Document  Databases In  Online Publishing

My name is Irakli. Let me give you some background about myself and how I tricked the conference organizers into thinking that I was qualified to talk today. :)

I am a director of engineering at National Public Radio, which is a fancy way of saying: I lead the software team that is responsible for the code behind npr.org, the NPR API and the NPR mobile apps.

Prior to joining NPR, I spent several years developing open-source products for the online publishing industry. Some of these products are now used by news organizations like The Nation, The New Republic, Thomson Reuters and Al Jazeera.

I have been using document-based [or so-called NoSQL] databases, on and off, for almost a year now and have enjoyed the experience a lot! Because I enjoyed it so much, I wanted to share my story at this conference. I contacted the organizers and they kindly agreed [I hope they will not regret it by the time we are done :)].

So here it is: one guy’s story of falling in love with document databases and why he thinks they have a significant role in online publishing, specifically.


Page 2: Document  Databases In  Online Publishing

One of the main reasons why I love document databases is that they are a truly disruptive technology. And when we say “disruptive technology” we mean something so innovative that it helps create a fundamentally new value network, thus altering the existing market and disrupting legacy technologies in that market.

The innovation of disruptive technologies is not just an incremental progression over existing capabilities. Rather, it is a fundamentally re-thought, novel approach to solving hard problems.

For instance, there are many good SQL databases, both open-source and commercial. And everybody has their favorite: some like SQL server X’s simplicity, others love the power of database Y, etc. But fundamentally SQL is one way to model data and solve data-warehousing problems. It has its time-proven advantages, as well as some significant shortcomings.

Document databases are an architecturally different approach to solving data problems. They are not a drop-in replacement for, or an incremental improvement over, SQL. They do have their own shortcomings, but they also allow solving problems that were either very hard or impossible to solve with traditional, SQL-oriented databases.


Page 3: Document  Databases In  Online Publishing

Traditional SQL database theory has a strong emphasis on ACID compliance. You probably remember that ACID stands for Atomicity, Consistency, Isolation and Durability.

The Consistency property ensures that no database transaction violates the referential integrity rules defined in the database schema.

Isolation is the requirement that, given concurrent access to data, parallel operations cannot access data that is being modified by another transaction, but have to wait until that transaction completes. Isolation is commonly implemented with pessimistic locking.

The Isolation and Consistency requirements of ACID compliance constitute a fundamental problem for a system’s scalability.


Page 4: Document  Databases In  Online Publishing

To put it in the words of Werner Vogels, CTO of Amazon and one of the foremost experts in the field of distributed computing:

“If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.”

ACID compliance is all about the various processes [and nodes] in the system checking with each other to keep data consistent across the entire system. Therefore, the issue is not so much how well master-slave or master-master replication is implemented in your database; the bigger challenge is the architectural constraint that ACID compliance imposes on scalability.


Page 5: Document  Databases In  Online Publishing

How important is scalability for a Web system? Is it something that matters just for Amazon, Facebook, Google and the like?

The Internet is an incredibly fast-growing medium. It took radio 38 years after its introduction to reach 50 million users and it took television 13 years; the Internet did it in just 4 and has been growing exponentially ever since.


Page 6: Document  Databases In  Online Publishing

In a report published in June of this year, Cisco forecast that global IP traffic will quadruple by 2015. It means more users, a larger amount of content, more types of content, more sources of content and more real-time content. In this context, by “real-time content” I mean things like check-ins, coverage of live events and citizen journalism during breaking news.

Now, most of us in the content-production industry believe that having more traffic and more content is good news. Scratch that: it’s great news! As a matter of fact, the Internet community has gotten so obsessed with the amount of website traffic that it is often used as the most significant measure of a website’s success or failure.

So: more traffic is good news… unless you are the developer responsible for making sure the website is still up and running when traffic quadruples.


Page 7: Document  Databases In  Online Publishing

We started the scalability discussion by mentioning the scalability limitations that the ACID-compliance requirement imposes.

This constraint is actually a specific case of a more general theorem called Brewer’s Theorem, or the CAP Theorem.

The theorem was formulated as a conjecture by a UC Berkeley professor, Eric Brewer, in 2000. Two years later, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's conjecture.

The CAP Theorem states that, when designing distributed software systems, there are three properties that are commonly desired:
1. Consistency
2. Availability
3. Partition Tolerance

The theorem proves that it is impossible to achieve all three at the same time[1].

Even though the names sound intuitive, it is probably worthwhile to clarify what Gilbert and Lynch meant by each of the terms in CAP, since there are multiple (sometimes contradictory) and confusing definitions floating around the web.


Page 8: Document  Databases In  Online Publishing

Consistency basically stands for the requirement that all nodes in a distributed system must see the same data all the time (a subset of ACID compliance).

Availability means every request should receive a response. The system as a whole should be highly available.

Partition Tolerance, in a distributed system, means the system should allow for some fault-tolerance. When some nodes crash or some communication links fail, it is important that the system still performs as expected.


Page 9: Document  Databases In  Online Publishing

Let’s look at some popular distributed data storage systems that you are probably familiar with and see which bucket they fall into in the CAP spectrum.

Relational databases, LDAP directory servers and xFS file systems are all examples of consistent and available distributed systems. They are consistent because they provide ACID compliance. They are not partition-tolerant because they do not have a quorum system for removing unreachable nodes from the system.


Page 10: Document  Databases In  Online Publishing

MongoDB, Terrastore, Redis and BigTable all guarantee consistency, and they use a quorum for partition tolerance, but they forfeit Availability.


Page 11: Document  Databases In  Online Publishing

The Domain Name System (yep, the one that drives all internet traffic), CouchDB, Riak and Cassandra are all examples of Available and Partition-tolerant distributed systems. They do not guarantee consistency. Rather, they provide a promise of something known as “eventual consistency”.

For any given request, you may receive a value that is stale system-wide and definitely not isolated per ACID-compliance requirements, but eventually all nodes will sync up.

Not running an agreement-based algorithm, which is what Amazon’s Werner Vogels was preaching, is exactly the sacrifice that systems like CouchDB and DNS make to provide extreme scalability and fault-tolerance.
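To make “eventual consistency” a little more concrete, here is a toy sketch in Python. It is not how CouchDB or DNS replicate; the `Replica` class, the per-key version counter and the last-writer-wins rule are all assumptions made purely for illustration. The point is only that a read can be stale between syncs, and that the replicas agree once a sync has happened.

```python
class Replica:
    """A hypothetical node that accepts writes locally and syncs later."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        # Bump a simple per-key version counter on every local write.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Highest version wins. This is a deliberate simplification:
        # real systems track revision histories or vector clocks instead.
        for key, (version, value) in other.data.items():
            if key not in self.data or self.data[key][0] < version:
                self.data[key] = (version, value)


a, b = Replica("a"), Replica("b")
a.write("headline", "Breaking: draft")
b.sync_from(a)

a.write("headline", "Breaking: final")  # replica b has not heard about this yet
stale = b.read("headline")              # still the old headline

b.sync_from(a)                          # no agreement round, just a later sync
fresh = b.read("headline")
print(stale, "->", fresh)
```

Nothing in the write path waits for the other replica, which is exactly the “no running agreement” property. The price is the window in which `b` serves the old headline.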


Page 12: Document  Databases In  Online Publishing

In his 2000 keynote at the ACM Symposium on Principles of Distributed Computing (the same one where he formulated the CAP theorem), Dr. Brewer also came up with a new definition he called BASE.

BASE stands for: Basically Available, Soft-state, Eventual consistency.

He formulated and used the BASE principles to demonstrate the trade-offs and differences from ACID-compliant systems.


Page 13: Document  Databases In  Online Publishing

ACID-compliant systems have the following traits: consistency, isolation, a focus on commit, nested transactions, pessimistic locking and, typically, a fixed schema, which makes them inflexible to evolve.


Page 14: Document  Databases In  Online Publishing

In contrast, BASE systems exhibit: weak consistency, availability prioritized above all else, a best-effort approach to conflict resolution and optimistic locking. Systems with the BASE philosophy consider approximate responses to be OK, are architecturally simpler and faster, and evolve flexibly, since they are typically schema-less.
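Optimistic locking is easier to appreciate with a small example. The sketch below is an in-memory Python stand-in, not CouchDB’s API: `DocumentStore`, `ConflictError` and the integer revisions are hypothetical names chosen for illustration. The idea it imitates is real, though: CouchDB gives every document a `_rev` token and rejects an update made against an outdated revision with a conflict, instead of making writers wait on a lock.

```python
class ConflictError(Exception):
    """Raised when a writer presents an out-of-date revision."""


class DocumentStore:
    """Hypothetical in-memory store with revision-checked writes."""

    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # hand out a copy, nobody holds a lock

    def put(self, doc_id, body, rev=None):
        current_rev = self._docs.get(doc_id, (None, None))[0]
        if rev != current_rev:
            # Someone else updated the document since this writer read it.
            raise ConflictError(f"expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocumentStore()
store.put("story-1", {"title": "Draft"})

# Two editors read the same revision at the same time.
rev_a, doc_a = store.get("story-1")
rev_b, doc_b = store.get("story-1")

doc_a["title"] = "Editor A's title"
store.put("story-1", doc_a, rev=rev_a)      # first write wins

doc_b["title"] = "Editor B's title"
try:
    store.put("story-1", doc_b, rev=rev_b)  # stale revision
    conflict = False
except ConflictError:
    conflict = True

print("conflict detected:", conflict)
```

Reads never block, and the cost of a collision is pushed to the losing writer, who has to re-read and retry. For editorial workflows, where two people rarely save the same story in the same second, that tends to be a reasonable trade.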


Page 15: Document  Databases In  Online Publishing

CouchDB is not a “better MySQL” or a “simpler Oracle”. It is really good at availability and partition tolerance and has many traits that make it a better tool for some of the problems traditionally solved with relational databases. But one thing it is not: a drop-in replacement for SQL databases.

There are tradeoffs when choosing a document database, and specifically CouchDB. The most obvious and honestly “scary” tradeoff is forfeiting Consistency.

We as computer scientists were trained hard and long that data must be consistent, models must be normalized, referential integrity must be maintained, etc. How can we even dream about forfeiting consistency, even for scalability and fault-tolerance?


Page 16: Document  Databases In  Online Publishing

The reality, however, is that there are systems engineering problems where strict data consistency is crucial, but there are many where it is not. If you are building stock trading software, you should probably use data storage that guarantees consistency. Financial systems in general require a high level of consistency, but that is not a given for just any system. Anybody who has built a real-life, high-throughput system knows that in many cases you end up de-normalizing the data model to allow for better performance. It is similar to forfeiting consistency in the CAP model.

With a document-based database like Couch, some of your requests may occasionally return slightly stale data. Additionally, data in document format is often highly de-normalized and less referentially consistent than data in a fully normalized relational database.

However, if you are building a news publishing website, none of this is unheard of. High-traffic news websites have been de-normalizing data and implementing aggressive caching for years. This is neither new nor radical. On the contrary, instead of home-cooked, half-baked, proprietary solutions, we can now use a standard, open-source, highly optimized, well-tested solution like CouchDB.

Personally, I think it’s a pretty good deal.


Page 17: Document  Databases In  Online Publishing

At this point, I’ve spent a good portion of this presentation explaining the scalability profile of CouchDB (and similar systems) and discussing how the improvements are not quantitative but fundamentally qualitative. We have also talked about the tradeoffs that the increased availability imposes.

Let’s forget about scalability for now, however, and talk about other characteristics of CouchDB as a document storage engine. After all, CouchDB is not the only document database and there are document databases that do guarantee data consistency, so forfeiting consistency is actually a trait of AP systems (in the CAP model), not of document databases in general.

An important trait of document databases, however, is that they are schema-less. There is no pre-defined, strict schema, no table structures or rigid relationships between document types. Document types live in a free world and evolve very flexibly.


Page 18: Document  Databases In  Online Publishing

OK, this is by far one of my ugliest slides. What you see here is a rough ER diagram generated from a fresh, vanilla installation of a popular open-source content management system: Drupal.

There are 72 tables in this diagram.

Some of you may be familiar with Drupal. It is highly extensible (and generally really awesome), but it does not do much out of the box. So when we used Drupal to create websites like those of The Nation or The New Republic, we installed dozens of additional Drupal modules and wrote a bunch more ourselves. Meaning: we added even more tables. And you can clearly see how unreadable this schema already is. Obviously, we never even tried to visualize the entire data model on any real project, because it would have been useless.


Page 19: Document  Databases In  Online Publishing

The same data model in a document-based database would look like this: (see slide)

I know, I know! I am exaggerating. Obviously, we would have more than one logical type of document even in a document database, but schema-less modeling means that at the physical level it is just one document type, so what you see here is really not that far from reality as far as actual data storage goes. Most things above and beyond that are really part of the application logic and business rules.

Since my presentation is one of the last ones at this conference, I am sure you have already listened to presenters who went into great detail about data modeling in CouchDB, and I am sure they are much bigger experts on the subject than I am. So I will spare you the experience.

Suffice it to say that embedding documents greatly simplifies data models. Just think about the number of so-called “mapping” tables that relational systems need to model things like many-to-many relationships.

Also, in the case of online publishing specifically, most business objects are… well, documents, so having a storage engine that operates in terms of documents is extremely natural and enjoyable. There’s much less discrepancy between the physical and logical models. Things, in most cases, just make sense and fall in line naturally.
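For those who like to see the mapping-table point spelled out, here is a rough Python sketch. The table names, field names and sample rows are invented for illustration and do not come from Drupal or from any real schema. It shows a story with authors and tags modeled relationally (three tables plus two mapping tables) and then folded into a single self-contained document.

```python
import json

# Relational shape: three entity tables plus two mapping tables
# that exist only to express many-to-many relationships.
stories = [{"id": 1, "title": "City council votes on budget"}]
authors = [{"id": 10, "name": "A. Reporter"}]
tags = [{"id": 100, "label": "politics"}, {"id": 101, "label": "local"}]
story_authors = [{"story_id": 1, "author_id": 10}]
story_tags = [{"story_id": 1, "tag_id": 100}, {"story_id": 1, "tag_id": 101}]


def to_document(story):
    """Fold the joined rows into one document with embedded lists."""
    author_ids = {m["author_id"] for m in story_authors if m["story_id"] == story["id"]}
    tag_ids = {m["tag_id"] for m in story_tags if m["story_id"] == story["id"]}
    return {
        "_id": f"story-{story['id']}",
        "type": "story",  # the logical type is just a field, not a table
        "title": story["title"],
        "authors": [a["name"] for a in authors if a["id"] in author_ids],
        "tags": [t["label"] for t in tags if t["id"] in tag_ids],
    }


doc = to_document(stories[0])
print(json.dumps(doc, indent=2))
```

The embedded version is de-normalized on purpose: renaming a tag now means touching every story that carries it. That is the same tradeoff discussed earlier, just seen from the data-modeling side.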


Page 20: Document  Databases In  Online Publishing

Another important, stark difference between relational databases and CouchDB is the absence of a query language. Like most other things about CouchDB, it’s pretty “scary” for newcomers. So much so that some other document databases (MongoDB, for instance) have actually opted to implement an ad hoc query language that feels closer to SQL, and I know a lot of people who appreciate that.

In contrast, CouchDB uses Map/Reduce: it first filters the data with a Map function and then (optionally) groups it with a Reduce function. The documents, as well as the results of the map and reduce functions, are all saved in B-trees (the secret sauce of CouchDB’s performance). Whereas in a relational database you would normalize the data and then index some of its columns, most things in Couch are a B-tree index to begin with.

This has significant consequences and, much like with forfeiting data consistency, there are some real trade-offs to be made. While Map/Reduce is very powerful, you will obviously find some queries that you could run in SQL that are either impossible to model with a View or are too expensive or too slow. Also, Views are not as dynamic as SQL queries. They are built incrementally, and a complete rebuild of one in a large database is an expensive operation. As such, it really pays off to carefully think through the Views that a system will be using at the early stages of system design.
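Here is a small Python imitation of what a View does, for anyone who has not written one. In CouchDB itself, view functions are normally written in JavaScript and stored in design documents; the function names, the sample documents and the sorted list standing in for the B-tree index below are all assumptions made for the sake of a runnable sketch.

```python
from collections import defaultdict

docs = [
    {"_id": "s1", "type": "story", "section": "politics", "title": "Budget vote"},
    {"_id": "s2", "type": "story", "section": "science", "title": "New telescope"},
    {"_id": "s3", "type": "story", "section": "politics", "title": "Election recap"},
    {"_id": "a1", "type": "author", "name": "A. Reporter"},
]


def map_stories_by_section(doc):
    """Map step: emit (key, value) rows only for documents the view cares about."""
    if doc.get("type") == "story":
        yield doc["section"], 1


def reduce_count(values):
    """Reduce step: collapse all values emitted under one key."""
    return sum(values)


def build_view(documents, map_fn, reduce_fn):
    # Sorting the emitted rows by key stands in for the ordered index.
    rows = sorted((key, value) for doc in documents for key, value in map_fn(doc))
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(value)
    return rows, {key: reduce_fn(values) for key, values in grouped.items()}


rows, counts = build_view(docs, map_stories_by_section, reduce_count)
print(rows)
print(counts)
```

Notice that the “query” is decided when the view is defined, not when it is read. Asking for stories per author instead of per section means writing a different map function and building a different index, which is why it pays to think the Views through early.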


Page 21: Document  Databases In  Online Publishing

The good news is: in online publishing most user-facing content is a document type, a listing of documents and an aggregation: exactly the things that document-based databases and CouchDB’s Views are highly optimized for.

As a matter of fact, at NPR, to withstand the millions of unique users that the main website gets, our legacy system uses an architecture with very similar constraints. It has content objects that are serialized XML, XML lists of content objects and aggregations also represented in an XML format. While in the back end we do use an SQL database, the front-end architecture has made many architectural decisions similar to those made in CouchDB.

Yes, the legacy system uses XML instead of JSON… I know, I know! But we have been running our systems for a long while, so some of it pre-dates the time when JSON got all sexy and trendy :)


Page 22: Document  Databases In  Online Publishing

To summarize, AP-style (as defined by the CAP model) document databases exhibit the following traits, important for online publishing systems that get significant traffic and have real-time content streams:
- High availability
- Partition tolerance
- Schema-less architecture
- Document-oriented storage
- Index-based, semi-dynamic querying like that in CouchDB Views

The benefit of each one of these features is the result of a tradeoff. For teams architecting systems and implementing document databases, it is crucial to understand and appreciate the tradeoffs made. That said, document databases are disruptive, the benefits they provide are real, and ignoring them by not augmenting traditional, relational storage systems with document-based ones would be a mistake.


Page 23: Document  Databases In  Online Publishing

Thank you for your attention.
