Oracle 11g R1/R2 Real Application Clusters Essentials

Oracle 11g R1/R2 Real Application Clusters Essentials

Overview of this book

Oracle RAC or Real Application Clusters is a grid computing solution that allows multiple nodes (servers) in a clustered system to mount and open a single database that resides on shared disk storage. Should a single system (node) fail, the database service will still be available on the remaining nodes. Oracle RAC is an integral part of the Oracle database setup. You have one database with multiple users accessing it, in real time. This book will enable DBAs to get their finger on the pulse of the Oracle 11g RAC environment quickly and easily.This book will cover all areas of the Oracle RAC environment and is indispensable if you are an Oracle DBA who is charged with configuring and implementing Oracle11g R1, with bonus R2 information included. This book presents a complete method for the configuration, installation, and design of Oracle 11g RAC, ultimately enabling rapid administration of Oracle 11g RAC environments.This practical handbook documents how to administer a complex Oracle 11g RAC environment. Packed with real world examples, expert tips and troubleshooting advice, the book begins by introducing the concept of Oracle RAC and High Availability. It then dives deep into the world of RAC configuration, installation and design, enabling you to support complex RAC environments for real world deployments. Chapters cover Oracle RAC and High Availability, Oracle 11g RAC Architecture, Oracle 11g RAC Installation, Automatic Storage Management, Troubleshooting, Workload Management and much more. By following the practical examples in this book, you will learn every concept of the RAC environment and how to successfully support complex Oracle 11g R1 and R2 RAC environments for various deployments within real world situations. This book is the updated release of our previous Oracle 11g R1/R2 Real Application Clusters Handbook. If you already own a copy of that Handbook, there is no need to upgrade to this book.

Oracle 11g R1/R2 Real Application Clusters Essentials

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Free Chapter

High Availability

High availability concepts

Fault-tolerant systems and high availability

High availability solutions for Oracle

Summary

Oracle 11g RAC Architecture

Oracle 11g RAC architecture

Hardware architecture for Oracle 11g RAC

Network architecture for Oracle 11g RAC

Storage architecture for Oracle 11g RAC

Storage protocols for RAC

Oracle 11g RAC components

New ASM features and RAC

New Oracle 11g ASM Disk Group compatibility features

Summary

Clusterware Installation

Preparing for a cluster installation

Oracle 11g R1 Clusterware installation

Oracle 11g R2 Clusterware installation

Removing/Reconfiguring a Grid Infrastructure configuration

Summary

Automatic Storage Management

Overview of Automatic Storage Management (ASM)

ASM instance configuration and management

Overview of ASMCMD

ASM 11g R1 new features

ASM 11g R2 new features

ASM backup strategies

Summary

Managing and Troubleshooting Oracle 11g Clusterware

Oracle 11g RAC Clusterware administration

Managing Oracle 11g Clusterware utilities

Troubleshooting Oracle 11g Clusterware

New features in Oracle 11g R2 Clusterware

Summary

RAC Database Administration and Workload Management

RAC database configuration and creation

What's new in Oracle 11g R1 and R2 databases?

RAC database administration

Automatic Workload Management

What's new in Oracle 11g services' behavior?

Summary

Backup and Recovery

An overview of backup and recovery

An overview of Recovery Manager (RMAN)

Backup types and methods

RMAN new features in 11g R1 and 11g R2

RMAN best practices for RAC

OCR and Voting disk backup and recovery strategies

Summary

Performance Tuning

Tuning differences: single instance versus RAC

New Oracle 11g performance tuning features

New performance features in Oracle 11gR2

Analyzing the Cache Fusion impact on RAC performance

Monitoring RAC cluster interconnect performance

Oracle cluster interconnects

Monitoring RAC wait events

Summary

Oracle 11g Clusterware Upgrade

Overview of an upgrade

Upgrading Oracle 10g R2 Clusterware to Oracle 11g R1

Upgrading to Oracle 11g R2 Clusterware

Downgrading Oracle Clusterware after an upgrade

Summary

Real-world Scenarios

Adding a new node to an existing cluster

Removing a node from the cluster

Adding an RAC database instance

Deleting an RAC database instance

Converting a single-instance database to an RAC database

Relocating an RAC database and instances across nodes

Summary

Enabling RAC for EBS

EBS architecture

Oracle 11g RAC suitability

Installing EBS 12.1.1

EBS implementation on Oracle 11g RAC

RAC-enabling EBS 12.1.1

Establishing applications environment for Oracle RAC

Setting up load balancing

Configuring Parallel Concurrent Processing

Cloning EBS concepts in brief

Summary

Maximum Availability

Oracle 11g Streams for RAC

Best practices for Streams in an RAC environment

New features for Streams in Oracle 11g R2

Oracle 11g Data Guard and RAC

New features for Data Guard in Oracle 11g R2

Summary

Additional Resources and Tools for the Oracle RAC Professional

Sample configurations

Oracle RAC commands and tips

Operating system-level commands for tuning and diagnosis

Additional references and tips

Clusterware startup sequence for Oracle 11g R2

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

High availability concepts

High availability provides data center environments that run mission-critical database applications with the resiliency to withstand failures that may occur due to natural, human, or environmental conditions. For example, if a hurricane wipes out the production data center that hosts a financial application's production database, high availability would provide the much-needed protection to avoid data loss, minimize downtime, and maximize availability of the firm's resources and database applications. Let's now move to the high availability concepts.

Planned versus unplanned downtime

The distinction needs to be made between planned downtime and unplanned downtime. In most cases, planned downtime is the result of maintenance that is disruptive to system operations and cannot be avoided with current system designs for a data center. An example of planned downtime would be a DBA maintenance activity such as database patching to an Oracle database, which would require taking an outage to take the system offline for a period of time. From the database administrator's perspective, planned downtime situations usually are the result of management-initiated events.

On the other hand, unplanned downtime issues frequently occur due to a physical event caused by a hardware, software, or environmental failure or caused by human error. A few examples of unplanned downtime events include hardware server component failures such as CPU, disk, or power outages.

Most data centers will exclude planned downtime from the high availability factor in terms of calculating the current total availability percentage. Even so, both planned and unplanned maintenance windows affect high availability. For instance, database upgrades require a few hours of downtime. Another example would be a SAN replacement. Such items make comprehensive four nine solutions nigh impossible to implement without additional considerations. The fact is that implementing a true 100% high availability is nearly impossible without exorbitant costs. To have complete high availability for all components within the data center requires an architecture for all systems and databases that eliminates any Single Point of Failure (SPOF) and allows for total online availability for all server hardware, network, operating systems, applications, and database systems.

Service Level Agreements for high availability

When it comes to determining high availability ratios, this is often expressed as the percentage of uptime in a given year. The following table shows the approximate downtime that is allowed for a specific percentage of high availability, granted that the system is required to operate continuously. Service Level Agreements (SLAs) usually refer to monthly downtime or availability in order to calculate service levels to match monthly financial cycles. The following table from the International Organization for Standardization (ISO) illustrates the correlation between a given availability percentage and the relevant amount of time a system would be unavailable per year, month, or week:

Availability %	Annual downtime	Monthly downtime*	Weekly downtime
90%	36.5 days	72 hours	16.8 hours
95%	18.25 days	36 hours	8.4 hours
98%	7.30 days	14.4 hours	3.36 hours
99%	3.65 days	7.20 hours	1.68 hours
99.5%	1.83 days	3.60 hours	50.4 minutes
99.8%	17.52 hours	86.23 minutes	20.16 minutes
99.9% ("three nines")	8.76 hours	43.2 minutes	10.1 minutes
99.95%	4.38 hours	21.56 minutes	5.04 minutes
99.99% ("four nines")	52.6 minutes	4.32 minutes	1.01 minutes
99.999% ("five nines")	5.26 minutes	25.9 seconds	6.05 seconds
99.9999% ("six nines")	31.5 seconds	2.59 seconds	0.605 seconds

Note

For monthly calculations, a 30-day month is used.

It should be noted that availability and uptimes are not the same thing. For instance, a database system may be online but not available, as in the case of application outages such as when a user's SQL script cannot be executed.

In most cases, the number of nines is not often used by the database or system professional when measuring high availability for data center environments because it is difficult to extrapolate such hard numbers without a large test environment. For practical purposes, availability is calculated more as a probability or average downtime given per annual basis.

High availability interpretations

When it comes to discussing how availability is measured, there is a debate on the correct method of interpretation for high availability ratios. For instance, an Oracle database server that has been online for 365 days in a given non-leap year might have been eclipsed by an application failure that lasted for nine hours during a peak usage period. As a consequence, the users will see the complete system as unavailable, whereas the Oracle database administrator will claim 100% "uptime." However, given the true definition of availability, the Oracle database will be approximately 99.897% available (8751 hours of available timeout of 8760 hours per non-leap year). Furthermore, Oracle database systems experiencing performance problems are often deemed partially or entirely unavailable by users, while in the eyes of the database administrator the system is fine and available.

Another situation that presents a challenge in terms of what constitutes availability would be the scenario in which the availability of a mission-critical application might go offline yet is not viewed as unavailable by the Oracle DBA, as the database instance could still be online and thus available. However, the application in question is offline to the end user, thus presenting a status of unavailable from the perspective of the end user. This illustrates the key point that a true availability measure must be from a holistic perspective and not strictly from the database's point of view.

Availability should be measured with comprehensive monitoring tools that are themselves highly available and present the proper instrumentation. If there is a lack of instrumentation, systems supporting high-volume transaction processing frequently during the day and night, such as credit-card-processing database servers, are often inherently better monitored than systems that experience a periodic lull in demand. Currently, custom scripts can be developed in conjunction with third-party tools to provide a measure of availability. One such tool that we recommend for monitoring database, server, and application availability is that provided by Oracle Grid Control, which also includes Oracle Enterprise Manager.

Oracle Grid Control provides instrumentation via agents and plugin modules to measure availability and performance on a system-wide enterprise level, thereby greatly aiding the Oracle database professional to measure, track, and report to management and users on the status of availability with all mission-critical applications and system components. However, the current version of Oracle Enterprise Manager will not provide a true picture of availability until 11g Grid Control is released in the future.

Recovery time and high availability

Recovery time is closely related to the concept of high availability. Recovery time varies based on system design and failure experienced, in that a full recovery may well be impossible if the system design prevents such recovery options. For example, if the data center is not designed correctly with the required system and database backups and a standby disaster recovery site in place, then a major catastrophe such as a fire or earthquake will almost always result in complete unavailability until a complete MAA solution is implemented. In this case, only a partial recovery may be possible. This drives home the point that for all major data center operations, you should always have a backup plan with an offsite secondary disaster-recovery data center to protect against losing all critical systems and data.

In terms of database administration for Oracle data centers, the concept of data availability is essential when dealing with recovery time and planning for highly available options. Data availability references the degree to which databases such as Oracle record and report transactions. Data management professionals often focus just on data availability in order to judge what constitutes an acceptable data loss with different types of failure events. While application service interruptions are inconvenient and sometimes permitted, data loss is not to be tolerated. As one Chief Information Officer (CIO) and executive once told us while working for a large financial brokerage, you can have the system down to perform maintenance but never ever lose my data!

The next item related to high availability and recovery standards is that of Service Level Agreements or SLAs for data center operations. The purpose of the Service Level Agreement is to actualize the availability objectives and requirements for a data center environment per business requirements into a standard corporate information technology (IT) policy.

System design for high availability

Ironically, by adding further components to the overall system and database architecture design, you may actually undermine your efforts to achieve true high availability for your Oracle data center environment. The reason for this is by their very nature, complex systems inherently have more potential failure points and thus are more difficult to implement properly. The most highly available systems for Oracle adhere to a simple design pattern that makes use of a single, high quality, multipurpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second like system at a separate physical location. An example would be to have a primary Oracle RAC clustered site with a second Disaster Recovery site at another location with Oracle Data Guard and perhaps dual Oracle RAC clusters at both sites connected by stretch clusters. The best possible way to implement an active standby site with Oracle would be to have Oracle Streams and Oracle Data Guard. Large commercial banking and insurance institutions would benefit from this model for Oracle data center design to maximize system availability.

Business Continuity and high availability

Business Continuity Planning (BCP) refers to the creation and validation of a rehearsed operations plan for the IT organization that explains the procedures of how the data center and business unit will recover and restore, partially or completely, interrupted business functions within a predetermined time after a major disaster.

In its simplest terms, BCP is the foundation for the IT data center operations team to maintain critical systems in the event of disaster. Major incidents could include events such as fires, earthquakes, or national acts of terrorism.

BCP may also encompass corporate training efforts to help reduce operational risk factors associated with the lack of information technology (IT) management controls. These BCP processes may also be integrated with IT standards and practices to improve security and corporate risk management practices. An example would be to implement BCP controls as part of Sarbanes-Oxley (SOX) compliance requirements for publicly traded corporations.

The origins for BCP standards arose from the British Standards Institution (BSI) in 2006 when the BSI released a new independent standard for business continuity named BS 25999-1. Prior to the introduction of this standard for BCP, IT professionals had to rely on the previous BSI information security standard, BS 7799, which provided only limited standards for business continuity compliance procedures. One of the key benefits of these new standards was to extend additional practices for business continuity to a wider variety of organizations, to cover needs for public sector, government, non-profit, and private corporations.

Disaster Recovery

Disaster Recovery (DR) is the process, policies, and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after either a natural or human-caused disaster.

Disaster Recovery Planning (DRP) is a subset of larger processes such as Business Continuity and should include planning for resumption of applications, databases, hardware, networking, and other IT infrastructure components. A Business Continuity Plan includes planning for non-IT-related aspects, such as staff member activities, during a major disaster as well as site facility operations, and it should reference the Disaster Recovery Plan for IT-related infrastructure recovery and business continuity procedures and guidelines.

Business Continuity and Disaster Recovery guidelines

The following recommendations will provide you with a blueprint to formulate your requirements and implementation for a robust Business Continuity and Disaster Recovery plan:

Identifying the scope and boundaries of your Business Continuity Plan:
The first step enables you to define the scope of your new Business Continuity Plan. It provides you with an idea of the limitations and boundaries of the Business Continuity Plan. It also includes important audit and risk analysis reports for corporate assets.
Conducting a Business Impact Analysis session:
Business Impact Analysis (BIA) is the assessment of financial losses to institutions, which usually results as the consequence of destructive events such as the loss or unavailability of mission-critical business services.
Obtaining support for your business continuity plans and goals from the executive management team:
You will need to convince senior management to approve your business continuity plan, so that you can flawlessly execute your disaster recovery planning. Assign stakeholders as representatives on the project planning committee team, once approval is obtained from the corporate executive team.
Understanding its specific role:
In the possible event of a major disaster, each of your departments must be prepared to take immediate action. In order to successfully recover your mission-critical database systems with minimal loss, each team must understand the BCP and DRP plans, as well as follow them correctly. Furthermore, it is also important to maintain your DRP and BCP plans, as well as conduct periodic training of your IT staff members on a regular basis to have successful response time for emergencies. Such "smoke tests" to train and keep your IT staff members up to date on the correct procedures and communications will pay major dividends in the event of an unforeseen disaster.

One useful tool for creating and managing BCP plans is available from the National Institute of Standards and Technologies (NIST). The NIST documentation can be used to generate templates that can be used as an excellent starting point for your Business Continuity and Disaster Recovery planning. We highly recommend that you download and review the following NIST publication for creating and evaluating BCP plans, Contingency Planning Guide for Information Technology Systems, which is available online at http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf.

Additional NIST documents may also provide insight into how best to manage new or current BCP or DRP plans. A complete listing of NIST publications is available online at http://csrc.nist.gov/publications/PubsSPs.html.

Oracle 11g R1/R2 Real Application Clusters Essentials

Oracle 11g R1/R2 Real Application Clusters Essentials

Overview of this book

Related Content you might be interested in

Current Title:

Oracle 11g R1/R2 Real Application Clusters Essentials

High availability concepts

Planned versus unplanned downtime

Service Level Agreements for high availability

Note

High availability interpretations

Recovery time and high availability

System design for high availability

Business Continuity and high availability

Disaster Recovery

Business Continuity and Disaster Recovery guidelines