Understanding Common Causes of Oracle DB Downtime

 RoadToMaster – Availability is the degree to which an application, service, or functionality is available upon demand. Availability is measured by the perception of an application’s end user. End users experience frustration when their data is unavailable, and they do not understand or care to differentiate between the complex components of an overall solution. Performance failures due to higher than expected usage create the same havoc as the failure of critical components in the solution. Both small businesses and global enterprises have users all over the world who require access to data 24 hours a day.

Reliability, recoverability, timely error detection, and continuous operations are primary characteristics of a highly available solution:

■ Reliability: Reliable hardware is one component of a high availability solution. Reliable software—including the database, Web servers, and application—is just as critical to implementing a highly available solution.

■ Recoverability: There may be many choices in recovering from a failure if one occurs. It is important to determine what types of failures may occur in your high availability environment, and how to recover from those failures in the time that meets your business requirements. For example, if a critical table is accidentally deleted from the database, what action should you take to recover it? Does your architecture provide the ability to recover in the time specified in a service level agreement (SLA)?

■ Timely error detection: If a component in your architecture fails, then fast detection is another essential component in recovering from a possible unexpected failure. While you may be able to recover quickly from an outage, if it takes an additional 90 minutes to discover the problem, then you may not meet your SLA. Monitoring the health of your environment requires reliable software to view it quickly and the ability to notify the DBA of a problem.

■ Continuous operations: Continuous access to your data is essential when very little or no downtime is acceptable to perform maintenance activities. Activities such as moving a table to another location within the database, or even adding additional CPUs to your hardware, should be transparent to the end user in a high availability architecture.
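
The point about timely error detection comes down to simple arithmetic: the downtime the user experiences is detection time plus recovery time, and that sum has to fit within the SLA. The sketch below illustrates this in Python. The class and function names, the 20-minute recovery time, and the 60-minute SLA limit are all hypothetical values chosen for illustration; only the 90-minute detection delay comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage, split into its two phases (both in minutes)."""
    detection_minutes: float  # time until the failure is noticed
    recovery_minutes: float   # time to restore service once it is noticed

    @property
    def total_minutes(self) -> float:
        # The end user experiences both phases as downtime.
        return self.detection_minutes + self.recovery_minutes


def meets_sla(outage: Outage, sla_max_downtime_minutes: float) -> bool:
    """Return True if the whole outage fits within the SLA downtime limit."""
    return outage.total_minutes <= sla_max_downtime_minutes


# Hypothetical numbers: the same 20-minute recovery against a 60-minute limit.
fast_detection = Outage(detection_minutes=5, recovery_minutes=20)
slow_detection = Outage(detection_minutes=90, recovery_minutes=20)

print(meets_sla(fast_detection, 60))  # True: 25 minutes in total
print(meets_sla(slow_detection, 60))  # False: 110 minutes in total
```

The recovery procedure is identical in both cases; only the detection delay decides whether the SLA is met.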

Causes Of Downtime

One of the challenges in designing a high availability solution is examining and addressing all the possible causes of downtime. It is important to consider causes of both unplanned and planned downtime when designing a fault tolerant and resilient IT infrastructure. Planned downtime can be just as disruptive to operations, especially in global enterprises that support users in multiple time zones.

Unplanned Outage : The list below describes each type of unplanned outage and gives examples of each type.

1. Computer Failure : A computer failure outage occurs when the system running the database becomes unavailable because it has shut down or is no longer accessible. Examples :

– Database system hardware failure
– Operating system failure
– Oracle instance failure
– Network interface failure

2. Storage Failure : A storage failure outage occurs when the storage holding some or all of the database contents becomes unavailable because it has shut down or is no longer accessible. Examples :

– Disk drive failure
– Disk controller failure
– Storage array failure

3. Human Error : A human error outage occurs when unintentional or malicious actions are committed that cause data within the database to become logically corrupt or unusable. The service level impact of a human error outage can vary significantly depending on the amount and critical nature of the affected data. Examples :

– Dropped database object
– Inadvertent data changes
– Malicious data changes

4. Data Corruption : A data corruption outage occurs when a hardware or software component causes corrupt data to be read or written to the database. The service level impact of a data corruption outage may vary, from a small portion of the database (down to a single database block) to a large portion of the database (making it essentially unusable). Examples :

– Operating system or storage device driver, host bus adapter, disk controller, or volume manager error causing bad disk reads or writes
– Stray writes by operating system or other application software

5. Site Failure : A site failure outage occurs when an event causes all or a significant portion of an application to stop processing or slow to an unusable service level. A site failure may affect all processing at a data center, or a subset of applications supported by a data center. Examples :

– Extended site-wide power failure
– Site-wide network failure
– Natural disaster making a data center inoperable
– Terrorist or malicious attack on operations or the site

Planned Outage : The list below describes each type of planned outage and gives examples of each type.

1. System changes : Planned system changes occur when performing routine and periodic maintenance operations and new deployments. Planned system changes include any scheduled changes to the operating environment that occur outside the organizational data structure within the database. The service level impact of a planned system change varies significantly depending on the nature and scope of the planned outage, the testing and validation efforts made prior to implementing the change, and the technologies and features in place to minimize the impact. Examples :

– Adding/removing processors to/from an SMP server
– Adding/removing nodes to/from a cluster
– Adding/removing disk drives or storage arrays
– Changing configuration parameters
– Upgrading/patching system hardware and software
– Upgrading/patching Oracle software
– Upgrading/patching application software
– System platform migration
– Database relocation

2. Data Changes : Planned data changes occur when there are changes to the logical structure or physical organization of Oracle Database objects. The primary objective of these changes is to improve performance or manageability. Examples :

– Table definition changes
– Adding table partitioning
– Creating and rebuilding indexes
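
When reviewing incident logs or planning maintenance windows, it can help to keep the outage types above in a small lookup table. The following Python sketch is one possible way to do that. The names `OutageKind`, `OUTAGE_TYPES`, `types_of`, and `find_type` are hypothetical and not part of any Oracle tool, and only a few of the examples listed above are included for brevity.

```python
from enum import Enum


class OutageKind(Enum):
    UNPLANNED = "unplanned"
    PLANNED = "planned"


# Outage types and a sample of examples, copied from the lists above.
OUTAGE_TYPES = {
    "Computer failure": (OutageKind.UNPLANNED, [
        "Database system hardware failure", "Operating system failure",
        "Oracle instance failure", "Network interface failure"]),
    "Storage failure": (OutageKind.UNPLANNED, [
        "Disk drive failure", "Disk controller failure",
        "Storage array failure"]),
    "Human error": (OutageKind.UNPLANNED, [
        "Dropped database object", "Inadvertent data changes",
        "Malicious data changes"]),
    "Data corruption": (OutageKind.UNPLANNED, [
        "Stray writes by operating system or other application software"]),
    "Site failure": (OutageKind.UNPLANNED, [
        "Extended site-wide power failure", "Site-wide network failure"]),
    "System changes": (OutageKind.PLANNED, [
        "Changing configuration parameters",
        "Upgrading/patching Oracle software", "Database relocation"]),
    "Data changes": (OutageKind.PLANNED, [
        "Table definition changes", "Adding table partitioning",
        "Creating and rebuilding indexes"]),
}


def types_of(kind):
    """List the outage type names that belong to the given kind."""
    return [name for name, (k, _) in OUTAGE_TYPES.items() if k is kind]


def find_type(example):
    """Case-insensitive lookup of the outage type that lists this example."""
    wanted = example.strip().lower()
    for name, (kind, examples) in OUTAGE_TYPES.items():
        if wanted in (e.lower() for e in examples):
            return name, kind
    return None  # not one of the examples in the table


print(types_of(OutageKind.PLANNED))     # ['System changes', 'Data changes']
print(find_type("disk drive failure"))
```

The lookup matches only the exact example strings stored in the table, so in practice it would need to be extended with the wording used in your own incident reports.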

In conclusion, given these common causes of Oracle DB downtime, a high availability architecture should have the following traits:
■ Be transparent to most failures
■ Provide built-in preventative measures
■ Provide proactive monitoring and fast detection of failures
■ Provide fast recoverability
■ Automate the recovery operation
■ Protect the data so that there is minimal or no data loss
■ Implement the operational best practices to manage your environment
■ Provide the high availability solution to meet your SLA
