Architecting Enterprise Solutions

Chapter 8

System Control Patterns

Speed without control!

Factors:

• Is the right information available at the right time?

• Does the system allow a response that fits the circumstances?

• Is the necessary control accessible?

• Do the wrong people have access to too much control?

Lack of control results, at best, in systems of lesser value to the business.

Requirement for the architect:

Systems must be designed so that the business can control the application from acceptance to retirement.


Continuous Status Reporting (CSR)

As long as a system is working normally, there are no problems.

But when one or more elements fail, the effects can spread through the whole system very quickly.

To fix a fault, there are two things you need to know:

1. What has failed?

2. To what degree has it failed?

What is required to get the system running again?

How do we prevent the fault from recurring?



Characteristics of a distributed system:

- Typically relatively high complexity

- Designed to provide scalability, performance, availability, etc.

Elements that fail => higher load on the remaining elements

How much must fail, and for how long, before the remaining parts can no longer handle the load and the whole system goes down?

- The ticket window

- Thermal runaway

- Spark plugs
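The cascade described above can be illustrated with simple arithmetic. This is a minimal sketch that assumes load is shared evenly across identical elements; the function names and the numbers in the usage lines are hypothetical and not taken from the book.

```python
def load_per_element(total_load: float, elements: int, failed: int) -> float:
    """Load carried by each surviving element when load is shared evenly."""
    remaining = elements - failed
    if remaining <= 0:
        raise ValueError("no elements left to carry the load")
    return total_load / remaining


def max_tolerable_failures(total_load: float, elements: int, capacity: float) -> int:
    """Largest number of failed elements whose load the survivors can absorb."""
    failed = 0
    while (failed + 1 < elements
           and load_per_element(total_load, elements, failed + 1) <= capacity):
        failed += 1
    return failed


# Four servers share 400 requests/s; each can handle at most 150 requests/s.
print(load_per_element(400, 4, 1))
print(max_tolerable_failures(400, 4, 150))
```

With these assumed numbers, losing one server is survivable, but a second failure pushes the remaining two past their capacity.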



Remember: users are not owners.

Who should react?

Control teams must be assigned to a system to ensure that it runs optimally in daily operation.



The load?

App server capacity?

React before it is too late.

Who?

How?

Debuggers,

Graphical monitors,

Network monitors,

Interactive system monitor tools

These can be cumbersome to use in a live scenario.

[Figure 8-1]


Designing a reporting interface

You must answer the following questions for the system:

• S1. Which elements would cause the most impact in terms of availability and performance should they fail?

• S2. Which elements would be a bottleneck for performance and scalability should they approach maximum capacity?

• S3. Are any elements particularly unreliable?

• S4. To where will elements report their status?

• S5. What should be the format of the status information and how will elements transmit it?

• S6. How much load (processor, memory, network traffic, etc.) will the suggested level of monitoring generate? Do you need more capacity in parts of the system to cope with this?


Designing a reporting interface

For each system element from which you need status reports, you also need to answer the following questions:

• E1. What information do you need to know about the element?

• E2. Does the element have a built-in capability for reporting its status to the required location using the required mechanism? If not, how can this be added?

• E3. What conditions or thresholds are important for this element in terms of excessive load or imminent failure?

• E4. How often should the element report its status?


Designing a reporting interface

General information groups

• Status information: the element is running correctly; the network card is running at 100 Mbps; all disks in the RAID array are working OK.

• Usage information: the number of concurrent user requests; the level of memory usage; the amount of data passed through a network interface.

• Execution information: the user thread has entered a particular method on a software component; an HTTP connection has been made to the web server; an SSL connection has been established with the content switch.
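The three information groups above could be carried in a single report structure. The following is a minimal sketch, not the book's design: the class name `StatusReport`, its field names, and the JSON encoding are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StatusReport:
    element: str                                 # hypothetical element name, e.g. "web-01"
    running: bool                                # status information
    usage: dict = field(default_factory=dict)    # usage information
    events: list = field(default_factory=list)   # execution information

    def to_json(self) -> str:
        """Serialize the report for transmission to the monitoring endpoint."""
        return json.dumps(asdict(self), sort_keys=True)


report = StatusReport(
    element="web-01",
    running=True,
    usage={"concurrent_requests": 12, "memory_mb": 512},
    events=["HTTP connection accepted"],
)
print(report.to_json())
```

Keeping the three groups as separate fields lets a receiver read the status flag without parsing the bulkier usage and execution data.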


GlobalTech support site: what needs to be monitored

• The routers that connect the demilitarized zone (DMZ) to the outside world.

• The firewall between the DMZ and the internal network.

• The network switches on the DMZ and the internal network.

• The hardware servers that house the web servers.

• The business tier servers called by the web servers.

• The database housing the support knowledge base.


GlobalTech partial solutions

• The web servers have the ability to continuously write their status to log files. A simple log-scraping program is run every 30 seconds to extract the number of user requests served, whether the request was for a dynamic page or a binary asset, and the time to serve each request. The log-scraping program makes this information available across SNMP.

• The application servers have a built-in SNMP interface. With a small modification to the application software we can ensure the servers report the number of requests for dynamic pages, the average time to serve each request, and the number of concurrent user sessions. We also report individual requests that take more than 30 seconds to serve, including detailed information on the request and the related session state information.

• The database server has a proprietary monitoring client used by the database administrator. We write a small proxy that intercepts information coming out of the database and extracts the high-level information about the number of queries run and the average time to return the result set.

• The switches and routers all support SNMP for indicating they are still alive.

• We run the ‘top’ command on every hardware server to monitor the server processes, reporting the CPU time and memory used by each process – this covers the load balancers as well as the other types of software server.
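The log-scraping step for the web servers might look roughly like the sketch below. The log line layout (timestamp, path, milliseconds), the suffix list used to recognise binary assets, and all names are assumptions for illustration. Publishing the result over SNMP is left out.

```python
from dataclasses import dataclass

BINARY_SUFFIXES = (".jpg", ".png", ".gif", ".pdf", ".zip")  # assumed asset types


@dataclass
class ScrapeResult:
    requests: int = 0
    dynamic: int = 0
    binary: int = 0
    total_ms: float = 0.0

    @property
    def mean_ms(self) -> float:
        return self.total_ms / self.requests if self.requests else 0.0


def scrape(lines):
    """Summarise log lines of the assumed form '<timestamp> <path> <millis>'."""
    result = ScrapeResult()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        _timestamp, path, millis = parts
        try:
            elapsed = float(millis)
        except ValueError:
            continue  # skip lines without a numeric serve time
        result.requests += 1
        result.total_ms += elapsed
        if path.lower().endswith(BINARY_SUFFIXES):
            result.binary += 1
        else:
            result.dynamic += 1
    return result


sample = [
    "2004-01-01T10:00:00 /support/search 120",
    "2004-01-01T10:00:01 /img/logo.png 8",
    "not a log line",
]
print(scrape(sample))
```

Run every 30 seconds against the newest part of the log, a function like this would yield the request count, the dynamic/binary split, and the serve times mentioned above.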


Impact of the Pattern on Non-functional Characteristics

Availability Availability is potentially improved as the generated information can be used to identify and predict element failure or overload.

Performance Performance is negatively impacted because of the overhead of the continuous reporting.

Scalability Unaffected by this pattern.

Security There is potentially a negative impact on security as extended system information is available to any intruder who has the capability of monitoring network traffic.

Manageability Manageability is improved because up-to-date information about each element's condition is continuously available.

Maintainability Maintainability is potentially improved because management information can sometimes be useful in diagnosing a fault or problem. For example, requests for dynamic pages failing when the data access servers take more than 30 seconds to pass back the result set may indicate a pre-defined time-out in the database drivers used by the application servers.

Flexibility Unaffected by this pattern.

Portability Unaffected by this pattern.

Cost The cost of introducing continuous reporting for every type of system element is always going to be significant whether the element supports reporting out of the box or not. This cost is justified because continuous status reporting is at the heart of a controllable system.


Operational Monitoring and Alerting

The system that gives the operations team access to the information from the CSR, and with it the ability to react.

Remember: it will never be better than your CSR.

Weighting and prioritizing CSR information:

- Too little

- Sufficient

- Too much

- Too late?

The minimum is status (OK / !OK). When the state is OK, also report memory usage, disk usage, and the number of requests per unit of time.


Operational Monitoring and Alerting

• Rate of system state change. There is no point in reporting information more frequently than the state of a particular system element changes or is likely to change.

• Resolution mechanism. Given that the point of this pattern is to notify operations staff in time for them to apply remedial action, the rate of status reporting is related to the amount of time it takes to implement the remedial action. For example, if the remedial action involves sourcing and configuring a new server (which would take days) then there is no point in reporting the triggering status every 15 seconds.

• Impact of reporting on performance. Report information too frequently and performance will be affected by the amount of time spent monitoring and processing the information and the volume of information present in the network.

• Failure window. Report information too infrequently and serious problems could arise between reporting intervals. You can ask how long the system could reasonably survive if a particular system element fails. For example, if there is a single router to the outside world, it should be reporting its status every 30 seconds as any capacity issues or downtime will directly affect the user perception of the system. On the other hand, the status of one presentation tier server out of 20 will be less of an issue. This server could be down for five minutes without anyone really noticing. Hence you could have the presentation tier servers notify their status every four or five minutes.


OMA for GlobalTech

• The reporting agents on each of the web servers will report user load and the time to serve each request. These are delivered to the system management application which will generate an alert should user load or response time exceed pre-defined thresholds for a sustained period.

• SNMP messages from the application servers are sent to the system management application. These messages are monitored for unexpected increases in the level of load, the amount of time to process a request and the number of concurrent users. Sustained increases of this type will generate an alert.

• Database information about the number of queries run and the average time to return the result set is sent to the system management application, which will generate an alert should these values exceed pre-defined thresholds for a sustained period.

• SNMP messages indicating normal operation are generated by all switches, routers, network cards, operating systems, application servers, databases, and web servers in the system and delivered to the management application. If such messages cease for a particular element, a critical alert is generated in the form of pager messages to members of the system operations team currently on duty (or on call). Any running instance of the graphical management console will display a dialog box requesting immediate action.
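Two alerting rules recur in the list above: a threshold exceeded for a sustained period, and "alive" messages that stop arriving. The sketch below shows one possible way to express both. The class and function names, the window of three samples, and the timeout value are assumptions, not details from the book.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only when every one of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))


def silent_elements(last_seen: dict, now: float, timeout: float) -> list:
    """Elements whose last 'alive' message is older than `timeout` seconds."""
    return sorted(name for name, seen in last_seen.items() if now - seen > timeout)


# Response times in seconds; alert once three samples in a row exceed 2.0.
alert = SustainedThresholdAlert(threshold=2.0, window=3)
print([alert.observe(v) for v in [3, 1, 3, 3, 3]])

# The database has been silent for 70 seconds against a 60-second timeout.
print(silent_elements({"router": 100.0, "db": 40.0}, now=110.0, timeout=60.0))
```

Requiring a full window of high samples avoids paging the operations team for a single spike, while the silence check treats a missing message as the critical case.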


Operational Monitoring and Alerting - NFC

Availability The alerts can help the operations team prevent the system from becoming partially or wholly unavailable.

Performance Performance is negatively impacted because a reasonably high level of continuous reporting is required on some system elements to support the required level of monitoring.

Scalability Unaffected by this pattern.

Security Unaffected by this pattern.

Manageability Manageability is improved as there is no need to manually monitor the system.

Maintainability Unaffected by this pattern.

Flexibility Unaffected by this pattern.

Portability Unaffected by this pattern.

Cost Cost is increased, regardless of whether a specific management application is purchased or custom solutions are built. This cost is justified as it makes the system manageable for less money than employing many operations people.


3-Category logging

Gives a detailed picture of exactly what the system was doing when it failed.

"There is little point in knowing of the impending doom of the system unless you can do something about it."

3-Category logging is typically an extension of OMA.


3-Category logging

Log information in three categories:

• Debug – usually execution-trace information such as which methods have been called on a software component and with what parameters

• Information – simple warnings about the system condition such as timeouts, missing data or uncommon code flows

• Error – things that go very wrong such as failure to connect to a database or loss of connection between web server and load balancer


3-Category logging

What, and when?

Error:

- Unhandled exceptions

- Violated pre- and postconditions

Debug:

- Every method starts by logging debug information with the parameter values it received.

Information:

- Reserved for known system events, e.g.:

- Scheduled processes

- Incoming messages

- Etc.
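The rules above map fairly directly onto the levels of Python's standard `logging` module. This is a minimal sketch under that assumption; the logger name `support_site`, the `traced` decorator, and `lookup_article` are hypothetical names invented for the example.

```python
import functools
import logging

log = logging.getLogger("support_site")  # hypothetical logger name


def traced(func):
    """Log a debug entry with the received arguments, and an error entry
    if the call ends in an unhandled exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            return func(*args, **kwargs)
        except Exception:
            log.error("unhandled exception in %s", func.__name__, exc_info=True)
            raise
    return wrapper


@traced
def lookup_article(article_id: int) -> str:
    if article_id < 0:
        # Violated precondition: surfaces as an error entry via the decorator.
        raise ValueError("article_id must be non-negative")
    log.info("incoming request for article %d", article_id)  # known system event
    return f"article-{article_id}"
```

Putting the debug and error logging in a decorator is one way to address the concern that every developer must remember to log method entry by hand.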


3-Category logging

Three levels of logging within each category

Mild:

Log a summary.

Moderate:

Log detailed information.

Severe:

Log a full stack trace and all available state information.
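How much is written at each level can be decided in one place. The sketch below is an illustration only: the function name `format_entry`, the lowercase severity strings, and the `|` separator are assumptions.

```python
import traceback
from typing import Optional


def format_entry(summary: str, severity: str, detail: str = "",
                 exc: Optional[BaseException] = None,
                 state: Optional[dict] = None) -> str:
    """Build a log message whose level of detail grows with severity."""
    if severity == "mild":
        return summary
    if severity == "moderate":
        return f"{summary} | {detail}"
    if severity == "severe":
        parts = [summary, detail]
        if exc is not None:
            # Full stack trace of the exception that triggered the entry.
            parts.append("".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)))
        if state is not None:
            parts.append(f"state={state!r}")
        return " | ".join(p for p in parts if p)
    raise ValueError(f"unknown severity: {severity}")


print(format_entry("db timeout", "mild", "connection pool exhausted"))
print(format_entry("db timeout", "moderate", "connection pool exhausted"))
```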


3-Category logging

Rolling archive

- Switch to a new log file at a suitable interval (typically every day)

- Overwrite old log files at a suitable interval (typically every week)

Old logs can be preserved via the pattern "Offline Reporting export".
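In Python, the standard library's `TimedRotatingFileHandler` provides this behaviour. The sketch below assumes daily rotation at midnight and one week of retention; the function name and the log format string are hypothetical. Note that the handler deletes the oldest files rather than literally overwriting them.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def make_rolling_handler(path: str) -> TimedRotatingFileHandler:
    """Rotate the log file at midnight and keep the seven most recent old files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",   # start a new file once per day
        backupCount=7,     # older files beyond one week are removed
        delay=True,        # do not create the file until the first record
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler
```

Anything that must outlive the retention window has to be exported before the handler removes it.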


3-Category logging

Availability Unaffected by this pattern.

Performance Performance is negatively impacted because the logging mechanism introduces a processing overhead.

Scalability Unaffected by this pattern.

Security Unaffected by this pattern.

Manageability Manageability is improved because the logged information can also be used by system managers to monitor system execution.

Maintainability Maintainability is improved as the logging gives support and development engineers the information they need to track errors in the system or trace its execution.

Flexibility Unaffected by this pattern.

Portability Unaffected by this pattern.

Cost Cost is increased as it will take time and effort to add and configure the logging for different system elements. This cost can be hard to quantify: it is fairly simple to implement a logging mechanism but a lot harder to ensure that all developers write their code to use it in the correct way. However, this cost can be very quickly recouped as less time is spent on troubleshooting during the time the system is in production.


System Overview

Create a general picture of the system's state, and tailor the view to the different types of recipient.

Interesting elements:

- Current system status

- Historical status, which allows the information to be extrapolated


System Overview

Overview of GlobalTech's data volume:

At full system operation there are 26 simultaneous streams of logging information.

Each reports data on between 4 and 20 characteristics.

We assume 2500 users (system requirement).

We assume that a user loads a page at most once every 5 minutes.

See the calculation on page 167 => over 6 million log entries per day.

If a log entry takes about 80 bytes, that gives roughly 480 MB of log data per day (6 million × 80 bytes), distributed across several servers.
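The two stated figures, the number of entries per day and the size of an entry, can be checked by direct multiplication. This is a minimal sketch: the page-load helper covers only the two user assumptions above and does not reproduce the book's full calculation on page 167. The constant and function names are hypothetical.

```python
ENTRIES_PER_DAY = 6_000_000   # "over 6 million" from the text, rounded down
BYTES_PER_ENTRY = 80          # approximate entry size given in the text


def daily_log_volume_mb(entries: int = ENTRIES_PER_DAY,
                        entry_bytes: int = BYTES_PER_ENTRY) -> float:
    """Daily log volume in megabytes (1 MB = 1,000,000 bytes)."""
    return entries * entry_bytes / 1_000_000


def page_loads_per_day(users: int = 2500, minutes_between_loads: int = 5) -> int:
    """Upper bound on page loads per day if every user is active all day."""
    return users * (24 * 60 // minutes_between_loads)


print(page_loads_per_day())
print(daily_log_volume_mb())
```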


System Overview

System Overview must therefore aggregate across this enormous volume and present the relevant information as easily digestible text and graphics.

The complexity of implementing this depends on:

• The granularity of the system elements to be monitored.

• The amount of information from each element to be aggregated.

• The view of the information.

Goal:

The amount of information must be reduced to something you can take in at a glance, or at least over a cup of coffee.
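The aggregation step can be as simple as collapsing many samples into one row per element. The sketch below assumes each sample arrives as an (element, metric, value) tuple and that mean and maximum are the figures of interest; both choices, and the function name, are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Collapse (element, metric, value) samples into one summary per element."""
    grouped = defaultdict(lambda: defaultdict(list))
    for element, metric, value in samples:
        grouped[element][metric].append(value)
    return {
        element: {
            metric: {"mean": mean(values), "max": max(values)}
            for metric, values in metrics.items()
        }
        for element, metrics in grouped.items()
    }


samples = [
    ("web-01", "response_ms", 100),
    ("web-01", "response_ms", 300),
    ("db", "queries", 5),
]
print(summarize(samples))
```

Different recipients could be served by calling the same function with a coarser or finer choice of element name, for example one row per tier instead of one row per server.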


System Overview

Availability Unaffected by this pattern.

Performance Although performance of the management function is negatively impacted by the introduction of an extra layer of communication, the performance of the system itself is unchanged.

Scalability Scalability is improved indirectly as the need for extra capacity will be determined in good time and additional capacity can be added (finances permitting).

Security Unaffected by this pattern.

Manageability Manageability is improved because all system elements are considered as a single entity for monitoring.

Maintainability Unaffected by this pattern.

Flexibility Flexibility is improved as a new reporting agent or monitoring agent can be implemented under the abstracting layer without impacting existing agents.

Portability Unaffected by this pattern.

Cost Cost is increased by the creation (or purchase) of an additional layer. This cost may be quite substantial, depending on the degree of analysis and the number of different views we wish to generate.


Dynamically Adjustable Configuration

Makes it possible to configure the system without disrupting operation.

Identify key parameters that can affect the non-functional characteristics (NFC):

• The number of simultaneous requests that can be made to a web server

• The number of simultaneous sessions that can be maintained by an application server

• The load-balancing algorithm used

• The number of simultaneous connections to a data access server

• The size of data caches

• Security keys

Implement the identified parameters so that they can reload their configuration at run time.
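One simple way to let a parameter pick up new values at run time is to re-read its configuration file whenever the file changes. The sketch below is a pull-based illustration under assumed details: a JSON file, change detection by modification time, and the hypothetical names `ReloadableConfig` and `max_sessions`.

```python
import json
import os


class ReloadableConfig:
    """Pull-based configuration: re-read the file when its modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self._mtime = None
        self._values = {}

    def get(self, key, default=None):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:            # file changed since the last read
            with open(self.path, encoding="utf-8") as fh:
                self._values = json.load(fh)
            self._mtime = mtime
        return self._values.get(key, default)
```

A component would call, say, `config.get("max_sessions", 100)` each time it needs the limit instead of caching the value at start-up. Every call costs a file status check, which is the price of polling for changes.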


Dynamically Adjustable Configuration

Availability Unaffected by this pattern.

Performance Performance is negatively impacted by the processing overhead if configuration changes are read using a ‘pull-based’ mechanism.

Scalability Unaffected by this pattern.

Security Unaffected by this pattern.

Manageability Manageability is improved as the system's characteristics can be more easily altered to cope with unexpected conditions.

Maintainability Unaffected by this pattern.

Flexibility Unaffected by this pattern.

Portability Unaffected by this pattern.

Cost Cost is increased by the analysis effort required to identify the system parameters that can be altered to significantly affect the non-functional characteristics of the system and by the implementation of the dynamic reconfiguration mechanism for these parameters.


DMZ

Protect network access to your system against malicious interests, both internal and external.

GlobalTech's solution ->


DMZ

Availability Availability may be negatively impacted as the firewall becomes a single point of failure (standard procedure is for a firewall to ‘fail closed’, i.e. in the event of failure it will deny all connections to the protected systems).

Performance There is a potential negative impact on performance due to the overhead of network traffic filtering and the necessity for physical separation between the web servers and the application servers as defined in DEDICATED WEB AND APPLICATION SERVERS (although splitting the servers may actually improve performance). If this has not already been done to improve another non-functional characteristic, it must be done to implement a DMZ and so will add multiple extra network hops for each user transaction.

Scalability The scalability of the underlying application is unaffected. However, the additional elements (such as filtering routers and firewall software) must be able to scale to the desired number of users and concurrent connections.

Security Security is improved because fewer systems are exposed to attack and multiple firewall artefacts must be breached to compromise security.

Manageability Manageability is negatively impacted since the very restrictions that limit access to internal data may make it difficult to access the application from an internal monitor.

Maintainability Unaffected by this pattern. (partly agree)

Flexibility Unaffected by this pattern.

Portability Unaffected by this pattern.

Cost Cost is increased as extra elements must be procured to build the DMZ. These include not only the filtering routers, firewall software and firewall host, but also the extra network equipment, such as switches and cabling, used in the DMZ itself.


Information Obscurity

How do we protect our secret data if an unauthorized person gains access to our system?

- Lock the data inside your CPS.

- Obscure the data.


Information Obscurity

In most cases, obscuring means encryption.

• Encryption and decryption are comparatively slow and, in general, the stronger the encryption mechanism used the slower it is to encrypt or decrypt. Hence, encrypting all the data in our system would have a serious impact on the performance of the system unless we were willing to make a large investment in dedicated encryption hardware to speed this up.

• In order to encrypt or decrypt, the application must have access to the encryption key or keys. These keys cannot themselves be encrypted as you still need a key to recover the keys! This is a similar issue to the protection of credentials for the ‘locks’. At some point, initial information is needed to start unlocking the protection around the data which means that this initial information is very sensitive and itself needs protecting. Definitely a case of ‘quis custodiet ipsos custodes?’ (Who guards the guards?)

• All parts of the system that need to access encrypted data will need access to the encryption keys. This means that if all data is encrypted then all parts of the system must be encryption-aware (or interface with something that is) and the encryption keys must be made widely available – making them more vulnerable to being stolen.



Should everything be obscured?

• The impact, should that data be accessed by an unauthorized third party, for the user, for the company and for the relationship between the two.

• The incentive for a third party to find this data.

• The accessibility of the place where the data is stored.

• Whether this data can be used to compromise further data.

• The data protection rules governing this type of data.



Availability Availability should not be negatively impacted, but care should be taken not to introduce single points of failure in the form of encryption key distribution and management services.

Performance Performance is negatively impacted if an obscurity mechanism is introduced because of the processing overhead associated with the mechanism. This is particularly true of complex encryption algorithms with long key lengths.

Scalability There should not be a negative impact on scalability, but any mechanisms used by the obscurity policy, such as encryption key distribution and management services, should themselves be scalable.

Security Security is improved by data obscurity because, even in the event of an attack during which the attacker gains access to the file system, system memory and application database, any sensitive data is not usable by the attacker. Security is also improved by configuration obscurity as any attacker will find it more difficult to obtain the information they need to crack the system.

Manageability Manageability is negatively impacted as additional resources will be needed for the encryption mechanism (such as key management).

Maintainability Obfuscation techniques, in particular, can affect the maintainability of the system as the developers have to remember obscure names for the configuration files, etc.

Flexibility Flexibility may be negatively impacted as you may need to maintain back-compatibility with existing encrypted data or obscured configuration.

Portability Portability is negatively impacted as you must ensure that any new platform supports the encryption mechanisms you wish to use.

Cost Cost is probably increased as the extra requirements of encryption may require either additional general capability to support software encryption or dedicated encryption hardware. You may also need to buy additional encryption software depending on what comes with your existing platforms and tools.


Secure Channels

How do we make sure that nobody is listening in?

- The user must identify themselves with a username and password

- Sensitive data must be encrypted before transmission

The most common way to implement this is SSL, since it is widely supported by web browsers.
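On the client side, Python's standard `ssl` module can set up such a channel. Current libraries implement TLS, the successor to SSL. This is a minimal sketch: the function names and the choice of TLS 1.2 as the minimum version are assumptions, and the connection helper needs a reachable server to run.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Context that verifies the server certificate and the host name."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed policy
    return context


def open_secure_channel(host: str, port: int = 443) -> ssl.SSLSocket:
    """Connect to `host` and wrap the socket in an encrypted channel."""
    raw = socket.create_connection((host, port), timeout=10)
    return make_client_context().wrap_socket(raw, server_hostname=host)
```

User identification (username and password) would then be sent only after the channel is established, so the credentials themselves are encrypted in transit.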


SSL

GlobalTech uses SSL and certificates for:

• Viewing of order status by retailers

• Placing of orders by retailers

• Logging in by public Internet customers

• Changing of details by retailers and public Internet customers


Secure Channel - NFC

Availability There is potentially a negative impact on availability if the obscurity mechanism causes server-affinity, which undermines effective failover.

Performance Performance is negatively impacted by the processing overhead if you introduce an obscurity mechanism for data in transit.

Scalability There is potentially a negative impact on scalability if the SECURE CHANNEL causes server-affinity, which undermines effective load-balancing.

Security Security is improved because data that is captured in transit is not usable by the attacker.

Manageability There is a slightly negative impact on manageability as there are now artefacts of the SECURE CHANNEL to be managed, such as SSL server certificates.

Maintainability Unaffected by this pattern.

Flexibility Unaffected by this pattern.

Portability The choice of obscurity mechanism and its level of support on multiple platforms may have a negative impact on portability.

Cost Cost is increased as you must obtain and maintain one or more server certificates for your SECURE CHANNEL. Also, you may need to increase the hardware specification of your web servers or buy dedicated encryption hardware to mitigate the associated performance overhead.