www.mindteck.com
Building Scalable Solutions for Commerce
August 2011
http://www.mindteck.com/coe/cloud-computing.html([email protected])
Version 1.4
Confidential © Mindteck 2011 | 2 | www.mindteck.com
Disclaimer
• Coverage Disclaimer
– I don’t cover every aspect of building large scale applications,
– but I’m sincerely working on it!
• Presentation
– I’m only representing what I’ve understood, learnt and practiced during my architecture experience
• Objective
– I’m here to share and learn and I’m sure you gathered here for the same purpose too!
– Good luck to you all
Just a quote to begin with..
• A system is only as scalable as its weakest part:
– The least scalable component of your system becomes a bottleneck for the whole system
Reference http://msdn.microsoft.com/en-us/library/aa291873(v=vs.71).aspx
Tagged.com
• Tagged Architecture – Scaling To 100 Million Users, 1000 Servers, And 5 Billion Page Views
– 100 million registered members
– 25 million unique worldwide monthly visitors
– 6 million unique U.S. monthly visitors
– 7 billion page views per month
• Platform
– PHP Webapp, Java, Memcached
YouTube.com
• Servers
– Supports the delivery of over 100 million videos per day
– Founded 2/2005
– 3/2006: 30 million video views/day
– 7/2006: 100 million video views/day
• Platform
– Apache
– Python
– Linux (SuSE)
– MySQL
– Psyco, a dynamic Python->C compiler
– lighttpd for video instead of Apache
Google.com
• World’s largest search engine
– ~450,000 low-cost commodity servers in 2006
– Google indexed 8 billion+ web pages in 2005
– Over 200 GFS clusters at Google
– A cluster can have 1000 or even 5000 machines
• Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster
– ~6000 MapReduce applications at Google and hundreds of new applications are being written each month
– BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users
• Platform
– Linux
– A large diversity of languages: Python, Java, C++
Facebook.com
• Biggest social site
– Serves 570 billion page views per month (according to Google Ad Planner)
– There are more photos on Facebook than all other photo sites combined (including sites like Flickr)
– More than 3 billion photos are uploaded every month
– Facebook’s systems serve 1.2 million photos per second
• This doesn’t include the images served by Facebook’s CDN.
– More than 25 billion pieces of content (status updates, comments, etc) are shared every month
– Facebook has more than 30,000 servers
• Platform
– PHP, but it has built a compiler for it so it can be turned into native code on its web servers, thus boosting performance
– Linux, but has optimized it for its own purposes
– MySQL, but primarily as a key-value persistent storage, moving joins and logic onto the web servers since optimizations are easier to perform there
– Memcached
Amazon.com
• World’s largest ecommerce site
– More than 55 million active customer accounts
– More than 1 million active retail partners worldwide
– Between 100-150 services are accessed to build a page
• Platform
– Linux
– Oracle
– C++
– Perl
– Mason
– Java
– JBoss
– Servlets
Scalability – A close look
• Accommodate
– an increased volume of user requests
– a larger dataset, while still meeting the agreed SLA
• Maintainable
– to ensure service requests are met and managed easily
• Is a property of
– a system, indicating its ability either to handle growing load or to be enlarged as demand increases
• Scaling succeeds if
– the system continues to be available at consistent speeds as the number of users and requests grows to very high numbers.
Are there really any challenges in designing scalable solutions? You bet!
• Cost
– How much can one afford to spend on h/w as scalability demands grow?
• Timeline
– The sooner you do it, the better the business returns
• Maintainability & Manageability
– Staying easy to maintain as the complexity of the components grows
• Tools and Technology Approach
– Choosing tools that directly provide scalability features that are easy to leverage
• (Growing) Data
– Complex business processes for dealing with the data
– Information access becomes more challenging as scalable components add a layer of complexity
Scalability Design Requirements
• Increase in performance (can it really?)
– Caching, replication techniques
• Low Latency
– Network dependency, too many back-end integrations, multiple DB read/write accesses
• High Reliability
– Data integrity and access to the right information
• Dynamic
– Number of users, volume of data
– Peak data load during a spike
• Operational efficiency
– Round-trip presentation, retrieval and performance
• Low cost
– Focus on design rather than investing in new features
• High Availability
– Ensure the features are always available
• Manageability
– Easy to manage and administer with limited skill and knowledge
Scalability layers of Architecture
• Client
– Caching
– HTTP Protocol
• Web Server
– Load Balancing
• Application Server
– Distributed Server Caching
– Connection Pooling
– Load balancer & Clustering (Commodity h/w)
– Synch vs Asynch Choice of Messaging
• Database
– Data Replication
– Federation
– Database Sharding or Partitioning
• Sharding helps to isolate and constrain storage, CPU, memory, and IO
– Memcached
– Hadoop
Techniques to improve scalability
Scalability and Performance Technique
• Load balancing
– Vertical Scaling and Horizontal Scaling
• Caching/replication
• Partitioning
• Parallelism
• Redundancy
• Request Processing
• Asynchronous Messaging
• Multi-thread
• Resource Pooling
• Session Management
Scalability Approach – Load balancing Hardware
• Load Balancing and Clustering
– Vertical Scaling
– Vertical Partitioning
– Horizontal Scaling
– Horizontal Partitioning
Scaling Explained
How we do typically – Vertical Scaling
Increasing the hardware resources without changing the number of nodes
Referred to as “Scaling up” the server
[Figure: a server with CPU and RAM scaled up to CPUn and RAMn]
How we do typically – Vertical Partitioning
Each service deployed at individual physical node/hardware
How we do typically – Horizontal Scaling
Each service deployed at individual physical node/hardware
[Figure: Load Balancer]
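As a minimal sketch of what the load balancer in front of horizontally scaled nodes does, the class below hands out nodes in round-robin order. The class name `RoundRobinBalancer` and the node names are hypothetical; real balancers (hardware or software) also track node health and load, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round-robin balancer: each call to pick() returns the next node in turn.
final class RoundRobinBalancer {
    private final List<String> nodes;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = new ArrayList<>(nodes); // defensive copy
    }

    // Thread-safe: the counter is atomic, and floorMod keeps the index
    // non-negative even after the int counter overflows.
    String pick() {
        int index = Math.floorMod(next.getAndIncrement(), nodes.size());
        return nodes.get(index);
    }
}
```

Scaling out then means adding another entry to the node list; callers of `pick()` do not change.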
Client/Server: Use Efficient HTTP Protocol Design
• Connection Management
• Intermediate network support
• Concurrent Request Processing
• Design Considerations
– Architect your application in a way that encourages HTTP caching
– Identify key settings of an HTTP server that affect scalability and performance
– Understand important efficiency-related parameters of a typical HTTP API, such as that provided by Java
• Scalability Tips
– Use GET and POST Judiciously
– Consider HTTP for Nonbrowser Clients
– Promote HTTP Response Caching
– Support Persistent Connections
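One common way to promote HTTP response caching is the conditional GET: the server tags each response with an `ETag`, and when the client sends that tag back in `If-None-Match`, the server can answer `304 Not Modified` without resending the body. The sketch below shows only that decision; the class `ConditionalGet` and its method names are hypothetical, and hashing the body with SHA-256 is just one assumed way to derive a tag.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper showing the server-side decision behind a conditional GET.
final class ConditionalGet {

    // Derives a quoted ETag from the response body (hex-encoded SHA-256).
    static String etagFor(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("\"");
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns 304 when the client's If-None-Match value matches the current
    // body's ETag (the cached copy is still valid), otherwise 200.
    static int statusFor(String ifNoneMatch, String currentBody) {
        return etagFor(currentBody).equals(ifNoneMatch) ? 304 : 200;
    }
}
```

A 304 response carries no body, so repeated reads of unchanged resources cost the server a hash comparison rather than a full transfer.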
Caching Explained
Caching / Replication
• Readily accessible data structure that allows thread-safe access to in-memory data
• Clustered Cache
– Cache system where each cache instance is aware of other cache instances in a cluster and is capable of synchronizing operations with its peers. Cache contents are typically mirrored
• Distributed Cache
– Distribute any cached state across a cluster to maximize retrieval efficiency, reduce overall memory used, and guarantee data redundancy (fragmented data sets over the network)
• Technology Tool
– Squid, EHCache
– JCache (JSR 107)
– Memcached
– Hadoop / Map-Reduce
Distributed Cache Techniques
• Replicated Cache
– Where memory isn’t an issue
– Only a few cache nodes
Caching Explained
• Partitioned Cache
– More cache nodes available
– No single node contains the entire cache data
– Partition the data across the cluster, so each node shares the burden
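The partitioning idea can be sketched in a few lines: each key is hashed to exactly one partition, so no partition holds the whole data set. The class `PartitionedCache` is hypothetical, and in-process maps stand in for what would be separate cache nodes on a network; real products also rebalance when nodes join or leave, which simple modulo hashing does not handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical partitioned cache: each in-memory map stands in for one cache node.
final class PartitionedCache<K, V> {
    private final List<Map<K, V>> partitions = new ArrayList<>();

    PartitionedCache(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("nodeCount must be at least 1");
        }
        for (int i = 0; i < nodeCount; i++) {
            partitions.add(new ConcurrentHashMap<K, V>());
        }
    }

    // Maps a key to the single partition responsible for it.
    int partitionOf(K key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    void put(K key, V value) {
        partitions.get(partitionOf(key)).put(key, value);
    }

    V get(K key) {
        return partitions.get(partitionOf(key)).get(key);
    }

    int sizeOfPartition(int index) {
        return partitions.get(index).size();
    }
}
```

By contrast, a replicated cache would write every entry to every map, trading memory for the ability to read from any node.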
Asynchronous Messaging
• An asynchronous process does not block further processing and may optionally be notified when the operation is completed
• Use the Java 5.0 concurrency package (java.util.concurrent) for asynchronous behavior
• Ajax (Asynchronous JavaScript and XML) – the XMLHttpRequest object is used to exchange data asynchronously with the web server
• JMS (Java Message Service) – in a distributed environment the JMS API can be used to create, send, receive and read messages in multiple formats
• Asynchronous Web Services
• Asynchronous communication modes
– Polling type
– Push type
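The two communication modes can be illustrated with java.util.concurrent. In the sketch below, `pollStyle` uses a Java 5 `Future` that the caller checks later, while `pushStyle` uses `CompletableFuture` (added in Java 8, so newer than the Java 5.0 package the slide mentions) to attach a callback that runs when the result arrives. The class `AsyncDemo` and the `slowSquare` task are hypothetical stand-ins for real work.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class AsyncDemo {

    // Polling type: submit the work, carry on, and ask the Future for the result later.
    static int pollStyle(ExecutorService pool) {
        Callable<Integer> task = () -> slowSquare(7);
        Future<Integer> pending = pool.submit(task);
        // ... the caller is free to do other work here ...
        try {
            return pending.get(); // blocks only if the result is not ready yet
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Push type: register a callback that is invoked once the result is available.
    static CompletableFuture<String> pushStyle(ExecutorService pool) {
        return CompletableFuture
                .supplyAsync(() -> slowSquare(7), pool)
                .thenApply(n -> "result=" + n);
    }

    // Stand-in for a slow operation such as a remote call.
    static int slowSquare(int n) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return n * n;
    }
}
```

In both cases the submitting thread is not blocked while `slowSquare` runs; the difference is whether the caller asks for the result or is told about it.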
Request Processing
• Connection Management
• Data Marshaling
• Request Servicing
• Design Considerations
– Synchronous Communication
• Servlets/JSP
– Asynchronous Communication
• JMS
• Scalability Tips
Parallelism
• Conceptually, doing more than one task at a time. It can be implemented in software or in hardware
• Software
– Threads
• Hardware
– Massively parallel processors (MPP) – nodes that don’t share data but compute by routing data between nodes
– Symmetric multiprocessing machines (SMP) – nodes consist of multiple processors that share the same data
– Clustered computing systems – nodes consist of multiple computers that don’t share the same data but route it between computers over the network
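For the software (threads) case, a minimal sketch is to split one job into independent chunks, run the chunks on a thread pool, and combine the partial results. The class `ParallelSum` is hypothetical; summing an array is only a placeholder for any work that can be divided without shared mutable state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSum {

    // Sums the array by giving each worker thread its own contiguous chunk.
    static long sum(int[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads; // ceiling division
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                tasks.add(() -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) {
                        partial += data[i];
                    }
                    return partial;
                });
            }
            long total = 0;
            // invokeAll waits until every chunk has finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each task reads its own slice and writes only a local variable, no locking is needed; the only coordination point is collecting the partial sums.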
Redundancy
• Duplication of hardware or software, so that more resources are available for execution
• Redundancy increases the ability of the system to scale and also increases reliability, i.e. the application stays available if one node crashes.
• It can also refer to duplication of data across all nodes.
• Drawbacks are deployment cost and keeping the copies consistent.
Resource Pooling
• Collection of pre-created objects that can be loaned out to save the expense of creating them many times
– Thread Pool
– EJB Pool
– Database Connection Pool
• When an application has to deal with a large number of requests, resource pooling can reduce the overhead
• Database connections can be pooled, and so can thread objects.
• Based on the application’s requirements, a fixed pool of objects can be created from which requests borrow.
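A fixed pool of pre-created objects can be sketched with a blocking queue: objects are created once, borrowed by requests, and returned afterwards. The class `FixedPool` and its method names are hypothetical; production connection pools add validation, eviction and leak detection that this sketch omits.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical fixed-size pool: all resources are created up front and then loaned out.
final class FixedPool<T> {
    private final BlockingQueue<T> idle;

    FixedPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<T>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the creation cost once
        }
    }

    // Waits up to timeoutMillis for a free resource; returns null if none becomes free.
    T borrow(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // Returns a borrowed resource so another request can reuse it.
    void giveBack(T resource) {
        idle.offer(resource);
    }

    int available() {
        return idle.size();
    }
}
```

The fixed size also acts as a cap: once every object is on loan, further requests wait or fail fast instead of creating unbounded resources.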
Best Practices
• Caching best practices:
– Choose between lazy and early loading of objects into memory.
– Cache objects that are expensive to compute and frequently used.
– Use immutable keys for caching to eliminate the possibility of map leaks.
– Allocate enough heap.
– Do not cache objects that are frequently written.
• Use many cache nodes.
• Use Externalizable, which is faster than default serialization.
• Use an LRU cache policy, or another policy that fits the cached data.
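An LRU policy can be sketched with the JDK alone: `LinkedHashMap` in access order, with `removeEldestEntry` overridden, evicts the least recently used entry once a capacity is exceeded. The class name `LruCache` is hypothetical, and the map is not thread-safe on its own, so a shared cache would need external synchronization (for example `Collections.synchronizedMap`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache built on LinkedHashMap's access-order mode.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by most recent access
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the eldest (least recently used) entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

Reading an entry with `get` counts as a use, so frequently read objects stay cached while idle ones are evicted first.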
Best Practices
• Cache objects that are as coarse-grained as possible, in read-only mode
• Object caching frameworks:
– Open Source
o Java Caching System (JCS) from Jakarta (part of the Turbine project)
o OSCache
o Commons Collections (another Jakarta project)
o JCache API (SourceForge.net)
– Commercial
o SpiritCache (from SpiritSoft)
o Coherence (Tangosol)
o Javlin (eXcelon)
o Object Caching Service for Java (Oracle)
Best Practices
• Use clustering technologies
• Consider logical versus physical tiers
• Isolate transactional methods
• Eliminate business logic layer state when possible
• Use Caching Extensively and Appropriately
– Avoid hitting the DB or opening a transaction or connection unless absolutely required
– Avoid remote communication; make proper use of value objects
• Constrain concurrent access to limited resources
• Use the java.util.concurrent package properly.
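One way to constrain concurrent access with java.util.concurrent is a `Semaphore` that caps how many callers may use a resource at once. In the sketch below the class `LimitedResource` and the fallback behaviour are hypothetical; whether to reject, queue or block callers when the limit is reached is a design decision the slide leaves open.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical guard that lets at most maxConcurrent callers use a resource at once.
final class LimitedResource {
    private final Semaphore permits;

    LimitedResource(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Runs the task if a permit is free; otherwise returns the fallback immediately
    // instead of piling more load onto the limited resource.
    <T> T tryCall(Supplier<T> task, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback;
        }
        try {
            return task.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Replacing `tryAcquire()` with the blocking `acquire()` would make callers wait for a permit rather than fall back.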
Best Practices
• Understand that scaling takes iteration(s)
• Don’t over-design upfront
• Choose the right tool for the job after doing enough research and understanding your requirements. Don’t follow a tool choice just because everyone recommends it
• Be open to thinking differently and moving away from traditional approaches to find your own scalability solutions
Quick Review of Scalability Example
[Figure: Smart Energy Management System architecture. Labels: Server Station, Enterprise Database, Billing, Cloud engine, Gateway, GSM/GPRS, IP Link, ZigBee Wireless, “N” Environments, Smart Energy Meter, Smart Appliance, Electric Gas Meter, In-Premise display]
Challenge: Make the Smart Energy Management System scalable and accessible to a very large client base. Performance should not degrade. The service provided should be secure. Minimal infrastructure cost.
Mindteck’s Approach: Deploy the services offered by the Smart Energy Management System onto a cloud platform coupled with RESTful web services for scalability, with load-balanced gateway servers
Benefits: Scalable enterprise-level test infrastructure that meets requirements
Application Software, Smart Energy Banking-Financial-
Services-Insurance BFSI, Business Intelligence, Business Process
Outsourcing, Content Analytics, Electronic Design Services, Firmware,
Hardware / Device, Infrastructure, Java, Knowledge Management,
Wireless
Life Sciences, Maintenance, Mechanical, Microsoft Technologies,
Mobile Platforms, MySQL, Open Source, Oracle, Public Sector, Product
Development, QA & Testing Services, SAP, Services,
Semiconductor, Smart Energy, Storage, Support
Services, System Software, SQL Server, Verticals, ZigBee
Thank You