So far we've talked mostly about configuration and very little about managing an OGG instance. In this section you'll get to know the management commands you're likely to use most often, both for displaying OGG status and for root cause analysis.
We investigate long-running transactions when the recovery checkpoint trails the current checkpoint by one or more log sequence numbers. We check the checkpoints using the following command:
GGSCI> info ex01sand showch
The output of the preceding command will be as follows:
…
Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
  Thread #: 1
  Sequence #: 10501    <-- way behind by 20 logs
…
Current Checkpoint (position of last record read in the data source):
  Thread #: 1
  Sequence #: 10521
…
You or your database administrator will need to troubleshoot which transaction is not committing its work. You might simply have to wait for the transaction to complete; or the user who owns it may have forgotten to commit, in which case you need to alert him/her to either commit the work or roll it back to clear the long-running transaction lag.
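If you monitor this regularly, the gap can be computed from the showch output itself. The following Python helper is a hypothetical sketch, not an OGG tool; it assumes the output contains the recovery and current checkpoint "Sequence #:" lines in the order shown above:

```python
import re

def checkpoint_gap(showch_output):
    """Return the recovery-vs-current sequence gap from `info ... showch` output.

    Hypothetical helper: assumes the first 'Sequence #:' match is the
    recovery checkpoint and the second is the current checkpoint.
    """
    seqs = re.findall(r"Sequence #:\s*(\d+)", showch_output)
    if len(seqs) < 2:
        raise ValueError("could not find both checkpoint sequence numbers")
    return int(seqs[1]) - int(seqs[0])

sample = """Recovery Checkpoint (position of oldest unprocessed transaction in the data source):
Thread #: 1 Sequence #: 10501
Current Checkpoint (position of last record read in the data source):
Thread #: 1 Sequence #: 10521"""

print(checkpoint_gap(sample))  # 20 logs behind
```

A gap of zero or one is normal; anything larger is your cue to go hunting for the uncommitted transaction.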
We can check the status of a running Extract or Replicat process using the following command:
GGSCI> send pp01sand status
The output of the preceding command will be as follows:
EXTRACT PP01SAND (PID 696398)
Current status: Recovery complete: At EOF

Current read position:
Sequence #: 11
RBA: 1096
Timestamp: 2012-12-17 11:05:44.778119
Extract Trail: /u01/app/oracle/gg/dirdat/pr

Current write position:
Sequence #: 6
RBA: 1408
Timestamp: 2012-12-17 11:20:03.446074
Extract Trail: /u01/app/oracle/goldengate/dirdat/rp
For measuring the true lag between the data source and the Extract process, use the following command:
GGSCI> send * getlag
The output of the preceding command will be as follows:
Sending GETLAG request to EXTRACT EX01SAND ...
Last record lag: 1 seconds.
At EOF, no more records to process.
…
However, the following command is more common:
GGSCI> info all
The output of the preceding command will be as follows:
Program     Status      Group       Lag at Chkpt   Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     EX01SAND    00:00:00       00:00:09
EXTRACT     RUNNING     PP01SAND    00:00:00       00:00:06
REPLICAT    RUNNING     RE01SAND    00:00:00       00:00:05
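Because info all has a stable columnar layout, it lends itself to scripted monitoring. The Python sketch below is hypothetical (not an OGG utility) and assumes the whitespace-separated layout above; the PP01SAND lag in the sample is deliberately altered to illustrate a breach of the threshold:

```python
def parse_info_all(text):
    """Parse `info all` output into (program, status, group, lag, since) rows.

    Minimal sketch: assumes whitespace-separated columns as shown above;
    the MANAGER row carries no group or checkpoint fields, so we pad it.
    """
    rows = []
    for line in text.strip().splitlines()[1:]:        # skip the header line
        parts = line.split()
        rows.append(tuple(parts) + (None,) * (5 - len(parts)))
    return rows

def lagging(rows, threshold="00:05:00"):
    """Group names whose 'Lag at Chkpt' exceeds threshold.

    HH:MM:SS strings compare correctly as plain strings (lexical order).
    """
    return [r[2] for r in rows if r[3] is not None and r[3] > threshold]

sample = """Program  Status   Group     Lag at Chkpt  Time Since Chkpt
MANAGER  RUNNING
EXTRACT  RUNNING  EX01SAND  00:00:00  00:00:09
EXTRACT  RUNNING  PP01SAND  01:12:43  00:00:06
REPLICAT RUNNING  RE01SAND  00:00:00  00:00:05"""

print(lagging(parse_info_all(sample)))  # ['PP01SAND']
```

A check like this, run from cron against saved GGSCI output, is a cheap first line of alerting before heavier tooling such as Director is in place.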
The following miscellaneous commands are also useful. The rollover command closes the current trail file and opens a new one:

GGSCI> send pp01sand rollover
GGSCI> send ex01sand stop
GGSCI> send ex01sand forcestop

When an Extract does not stop normally, you can abruptly stop it with the forcestop option. The following are the different ways to report stats on objects (tables):
GGSCI> stats ex01sand latest table scott.*
The output of the preceding command will be as follows:
Extracting from SCOTT.TCUSTMER to SCOTT.TCUSTMER:

*** Latest statistics since 2012-12-17 12:57:36 ***
        Total inserts        0.00
        Total updates        0.00
        Total deletes        1.00
        Total discards       0.00
        Total operations     1.00

End of Statistics.
Reporting statistics as a rate of inserts/updates/deletes per time interval is done using the following commands:

GGSCI> stats ex01sand table scott.tcustmer reportrate sec|min|hr
GGSCI> stats ex01sand totalsonly scott.*
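The counters in the stats output lend themselves to scripting as well. The following Python helper is a hypothetical sketch that assumes counter lines of the form "Total deletes 1.00", as in the output shown earlier:

```python
import re

def parse_stats(stats_output):
    """Scrape the Total-* counters from `stats` output into a dict.

    Hypothetical helper: assumes each counter appears as
    'Total <name> <number>' somewhere in the text.
    """
    return {m.group(1).lower(): float(m.group(2))
            for m in re.finditer(r"Total (\w+)\s+([\d.]+)", stats_output)}

sample = """*** Latest statistics since 2012-12-17 12:57:36 ***
Total inserts 0.00
Total updates 0.00
Total deletes 1.00
Total discards 0.00
Total operations 1.00"""

print(parse_stats(sample))
```

Collected over time, these counters let you trend operation volumes per table without touching the database itself.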
To kill a Replicat or delete an Extract/Replicat group, use the following commands:
GGSCI> kill replicat <group> GGSCI> delete <group> [!]
The trail commands are as follows. When you add trails to your Extracts, you can specify the trail size with the keyword megabytes n, where n is a number representing the size in megabytes:

GGSCI> add exttrail | rmttrail <trail name>, extract <group name>, megabytes n
GGSCI> alter exttrail | rmttrail <trail name>, extract <group name>, megabytes n
GGSCI> delete exttrail | rmttrail <trail name>
Long-running transactions (LRTs) are identified by OGG via the WARNLONGTRANS parameter, which defaults to 60 minutes with a check every 5 minutes. It is wise to set this in your Extract parameter file to a value appropriate for your environment. When checking for an LRT with the showch (show checkpoint) option of the info command, you ideally want the sequence numbers of the recovery and current checkpoints to be the same; under normal circumstances this will be the case. If the difference is more than one sequence, you need to investigate further to find which transaction is holding up progress. An LRT is not a true lag, because the transaction is still processing rows; it just does not commit until the end of the transaction. The SQL may be badly designed, so it is worth identifying the DML in question and suggesting a more scalable approach to its designer. OGG's monitoring tool, Director, will normally alert you to such LRTs and will also identify the long-running transaction automatically.
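As a sketch of where the parameter lives, the illustrative Extract parameter file below sets a tighter warning threshold. The group name, login, trail path, and chosen intervals are all example values, not prescriptions:

```
EXTRACT ex01sand
USERID ogguser, PASSWORD oggpwd
EXTTRAIL /u01/app/oracle/gg/dirdat/pr
-- Warn on transactions open longer than 30 minutes, checking every
-- 5 minutes. Both values are illustrative; the default warning
-- threshold is 60 minutes, checked every 5 minutes.
WARNLONGTRANS 30m, CHECKINTERVAL 5m
TABLE scott.*;
```

The warnings appear in the Extract report file and the ggserr.log, so whatever scrapes those logs can pick them up.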
The info <group name> detail command gives you detailed information about your environment. Try it; I usually save its output, along with that of info * showch, to a Notepad file for later reference in case it is needed for troubleshooting. I normally do this when I make structural changes to tables or add/remove tables from the OGG configuration.
Checking for lag with the info all command is a common quick check DBAs use to detect lag on the Extract, the Replicat, or both. If the lag is in the range of several minutes to hours, you need to investigate further. It could be due to an LRT, network latency, or any number of other issues with your database performance. Sometimes the cause is obvious; other times you really need to dig through the OGG logs and/or the database logs. One time I had a huge lag on the data pump, and the send <group name> status command helped me identify that the issue lay outside OGG and the database: it was network latency. The command output displays the current read position in the local trail file directory and the current write position in the remote trail. The current read position carries a sequence number (11 in the preceding example), which is part of the trail file name. An example directory listing of the trail files is as follows:
-rw-rw-rw- 1 oracle oinstall 1096 Dec 17 11:05 pr000010
-rw-rw-rw- 1 oracle oinstall 1231 Dec 17 12:57 pr000011
…
-rw-rw-rw- 1 oracle oinstall 1231 Dec 17 13:40 pr000030
So the pump is not shipping the trails as fast as it should; with sequence 30 written locally and sequence 11 still being read, we are roughly 19 trail files behind. At this point you need to focus on probable network latency by contacting your network administrator and working with him/her.
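You can estimate this backlog programmatically from such a listing. The helper below is a hypothetical sketch; the prefix-plus-six-digit-sequence naming scheme is taken from the listing above:

```python
import re

def trail_backlog(filenames, current_read_seq, prefix="pr"):
    """Count how many local trail files the pump still has ahead of it.

    Hypothetical helper: assumes trail file names are the two-character
    prefix followed by a six-digit sequence number, as shown above.
    """
    seqs = [int(f[len(prefix):]) for f in filenames
            if re.fullmatch(prefix + r"\d{6}", f)]
    return max(seqs) - current_read_seq if seqs else 0

files = ["pr%06d" % n for n in range(10, 31)]   # pr000010 .. pr000030
print(trail_backlog(files, current_read_seq=11))  # 19
```

A steadily growing backlog points at the network or the remote host; a flat one usually means the pump has simply stopped.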
The getlag command is also very useful when you stop the Extract and want to ensure that the pump and Replicat, while still running, get a chance to drain everything they are processing until no more records remain. Equally, while the Extract is stopped, you can issue the following command to either the pump or the Replicat to check whether any records are left to process:
GGSCI> send PP01SAND logend
The output of the preceding command will be as follows:
Sending LOGEND request to EXTRACT PP01SAND ...
YES.
YES means that all the records in the data source have been processed and there are no further records to process.
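If you script a graceful shutdown, the logend reply is easy to test for. The helper below is a hypothetical sketch, not part of OGG; it only assumes the reply format shown above:

```python
def drained(logend_reply):
    """True when a `send <group> logend` reply ends with YES.

    Hypothetical helper keyed to the reply format shown above: the
    last line of the reply is YES. or NO.
    """
    last = logend_reply.strip().splitlines()[-1].strip().upper()
    return last.startswith("YES")

print(drained("Sending LOGEND request to EXTRACT PP01SAND ...\nYES."))  # True
```

A shutdown script might poll this for each downstream process and only stop the pump and Replicat once both report YES.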
Finally, you can check out some of the stats commands to display statistics about your objects. Try them in your environment.