懸崖邊上的舞者,記7.2生產(chǎn)數(shù)據(jù)庫災(zāi)難事件
作者:張子良
版權(quán)所有,轉(zhuǎn)載請(qǐng)注明出處
引子:出事了
7月2日是一個(gè)難得的大晴天,一段時(shí)間以來桂林一直在下雨,一直下,害的我減肥的計(jì)劃一再的泡湯,因?yàn)橄掠昝刻煸缟系呐懿阶兂闪擞^雨。和往常一樣,準(zhǔn)時(shí)來到單位,卻不準(zhǔn)時(shí)的得到一個(gè)消息:昨晚前置生產(chǎn)數(shù)據(jù)庫出問題了,連接不上。本能的反應(yīng),日志滿了,或者服務(wù)器交換空間不足。急忙趕到運(yùn)維間,經(jīng)詢問夜間值班人員,發(fā)現(xiàn)后續(xù)的情況是:logsegment日志空間確實(shí)是滿了,發(fā)現(xiàn)問題后,運(yùn)維人員采用dump日志的方式進(jìn)行處理,卻發(fā)現(xiàn)dump根本不被執(zhí)行,無奈最后只得采用增加log設(shè)備空間的方式,進(jìn)行了緊急處理。現(xiàn)在的問題是到底是什么原因引起的log空間的陡然間增加呢?
一、日志分析:發(fā)現(xiàn)了問題,卻無法定位?
取了數(shù)據(jù)庫運(yùn)行日志,展開分析,發(fā)現(xiàn)了以下問題:
00:06:00000:00912:2013/07/02 03:31:53.98 server Space available in the log segment has fallen critically low in database 'XXXDB'. All future modifications to this database will be suspended until the log is successfully dumped and space becomes available.
00:01:00000:00046:2013/07/02 03:31:54.08 server Error: 1105, Severity: 17, State: 4
00:01:00000:00046:2013/07/02 03:31:54.09 server Can't allocate space for object 'syslogs' in database 'XXXDB' because 'logsegment' segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.
00:01:00000:00046:2013/07/02 03:31:54.09 server DUMP TRANSACTION for database 'XXXDB' could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database 'frtdb' shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server DUMP TRANSACTION for database 'XXXDB' could not truncate the log. Either extend the log using ALTER DATABASE ... LOG ON command or eliminate the oldest active transaction in database 'XXXDB' shown in syslogshold table.
00:01:00000:00046:2013/07/02 03:31:54.09 server background task message: LOG DUMP:'logsegment' for 'XXXDB' dumped!
問題很明顯:2013/07/02 03:31:53就已經(jīng)出現(xiàn)了空間不足的情況,2013/07/02 03:31:54.08出現(xiàn)了log空間已滿的情況。2013/07/02 03:31:54.09執(zhí)行了Dump操作,但是失敗了,無奈那么為什么會(huì)失敗呢?又是什么操作導(dǎo)致了日志空間的劇烈增長呢?繼續(xù)跟蹤日志。
二、發(fā)現(xiàn)死鎖:死鎖會(huì)導(dǎo)致日志空間滿嗎
繼續(xù)向上追溯發(fā)現(xiàn)在此之前曾經(jīng)發(fā)現(xiàn)出現(xiàn)過死鎖的情況,具體日志如下:
00:02:00000:00796:2013/07/01 07:13:24.33 server Deadlock Id 1 detected
Deadlock Id 1 detected. 1 deadlock chain(s) involved.
Deadlock Id 1: Process (Familyid 0, Spid 1117, Suid 4) was executing a UPDATE command in the procedure 'insprep0' at line 1 at nested level 1.
Deadlock Id 1: Process 1117 was involved in application ''.
Deadlock Id 1: Process 1117 was involved on host name ''.
Deadlock Id 1: Process 1117 was involved in transaction '$user_transaction'.
SQL Text:
Executing procedure: *00111700009969_3e59fe
Subordinate SQL Text:XXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 1: Process 796 was involved in application ''.
Deadlock Id 1: Process 796 was involved on host name ''.
SQL Text: XXXXXX
Deadlock Id 1: Process (Familyid 0, Spid 796) was waiting for a 'shared page' lock on page 60607 of table 't_merch_det' in database 'frtdb' but process (Familyid 0, Spid 1117) already held a 'exclusive page' lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 1117) was waiting for a 'exclusive page' lock on page 5866141 of table 't_merch_det' in database 'frtdb' but process (Familyid 0, Spid 796) already held a 'shared page' lock on it.
Deadlock Id 1: Process (Familyid 0, Spid 796) was chosen as the victim.
Victim process host = `', user = `sysadmin' program name = `' host processes = `13697416' .
End of deadlock information.
00:00:00000:01262:2013/07/01 14:49:42.27 server Deadlock Id 2 detected
Deadlock Id 2 detected. 1 deadlock chain(s) involved.
Deadlock Id 2: Process (Familyid 0, Spid 813, Suid 4) was executing a FETCH CURSOR command at line -1.
Deadlock Id 2: Process 813 was involved in application ''.
Deadlock Id 2: Process 813 was involved on host name ''.
Deadlock Id 2: Process 813 was involved in transaction '$user_transaction'.
SQL Text: XXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262, Suid 4) was executing a UPDATE command in the procedure 'insprep0' at line 1 at nested level 1.
Deadlock Id 2: Process 1262 was involved in application ''.
Deadlock Id 2: Process 1262 was involved on host name ''.
Deadlock Id 2: Process 1262 was involved in transaction '$user_transaction'.
SQL Text:
Executing procedure: *00126200040493_8f416a
Subordinate SQL Text: XXXXXXX
Deadlock Id 2: Process (Familyid 0, Spid 1262) was waiting for a 'exclusive row' lock on row 21 page 10073 of the 'XXXXXX' table in database 'frtdb' but process (Familyid 0, Spid 813) already held a 'exclusive row' lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 813) was waiting for a 'shared row' lock on row 11 page 18705629 of the 'XXXXXX' table in database 'frtdb' but process (Familyid 0, Spid 1262) already held a 'exclusive row' lock on it.
Deadlock Id 2: Process (Familyid 0, Spid 1262) was chosen as the victim.
Victim process host = `', user = `sysadmin' program name = `' host processes = `13500790' .
End of deadlock information.
根據(jù)分析發(fā)現(xiàn),死鎖的時(shí)間發(fā)生在2013/07/01 14:49:42.27,而發(fā)生死鎖的進(jìn)程也被終止,如果是由此原因?qū)е碌膯栴}的話,爆發(fā)時(shí)間對(duì)不上。根據(jù)日志清理的定時(shí)計(jì)劃,再次階段出現(xiàn)的問題,日志應(yīng)該已經(jīng)被清理,所以可以排除掉這種情況。
三、數(shù)據(jù)庫變更:重啟過,都變更了什么?
繼續(xù)向上追溯,發(fā)現(xiàn)6.30日數(shù)據(jù)庫曾經(jīng)被重啟過,發(fā)現(xiàn)數(shù)據(jù)庫參數(shù)也曾經(jīng)發(fā)生過變化:
00:00:00000:00013:2013/06/30 01:10:56.60 server Completed filling free space info for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:56.61 server Started cleaning up the default data cache for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:57.04 server Completed cleaning up the default data cache for database 'XXXDB'.
00:00:00000:00013:2013/06/30 01:10:57.04 server Checking external objects.
00:00:00000:00013:2013/06/30 01:10:57.46 server The transaction log in the database 'XXXDB' will use I/O size of 8 Kb.
00:00:00000:00013:2013/06/30 01:10:57.46 server Database 'XXXDB' is now online.
00:00:00000:00001:2013/06/30 01:10:57.47 server Recovery has restored the value of 'local async prefetch limit' for '64K' pool in 'default data cache' from '80' to 'DEFAULT'.
00:00:00000:00001:2013/06/30 01:10:57.47 server Recovery has restored the value of 'local async prefetch limit' for '8K' pool in 'default data cache' from '80' to 'DEFAULT'.
00:00:00000:00001:2013/06/30 01:10:57.57 server Recovery has restored the original size for '64K' pool and '8K' pool in 'default data cache'.
00:00:00000:00001:2013/06/30 01:10:57.58 server Recovery complete.
00:00:00000:00001:2013/06/30 01:10:57.59 server ASE's default unicode sort order is 'binary'.
00:00:00000:00001:2013/06/30 01:10:57.60 server ASE's default sort order is:
00:00:00000:00001:2013/06/30 01:10:57.60 server 'nocase_cp936' (ID = 52)
00:00:00000:00001:2013/06/30 01:10:57.60 server on top of default character set:
00:00:00000:00001:2013/06/30 01:10:57.60 server 'cp936' (ID = 171).
那么是否是數(shù)據(jù)庫參數(shù)變更導(dǎo)致的問題,這個(gè)地方是我重點(diǎn)懷疑的地方,但是始終沒有找到數(shù)據(jù)支撐,變更這些參數(shù),是否就是導(dǎo)致問題的根源呢?我需要更多的數(shù)據(jù)的支撐。
四、性能監(jiān)控:陡然增加的日志空間占用,究竟發(fā)生了什么?
好在部署了數(shù)據(jù)庫性能監(jiān)控監(jiān)控軟件,通過分析性能監(jiān)控圖表,發(fā)現(xiàn)7.2凌晨2點(diǎn)至3點(diǎn)半,數(shù)據(jù)庫log空間出現(xiàn)了激增,空間占用率呈直線60度上升,直至100%。那么可以肯定的是,在2點(diǎn)的時(shí)候,一定是有什么大事務(wù)的數(shù)據(jù)庫操作?那么究竟是什么操作呢?
五、柳暗花明:日志,還是日志
既然數(shù)據(jù)庫本身的日志無法滿足分析的需要,那么前置平臺(tái)日志,肯定會(huì)有相關(guān)的記錄,跟蹤前置日志,終于發(fā)現(xiàn),7.2日凌晨2點(diǎn)的-3點(diǎn)半期間確實(shí)有一個(gè)大的數(shù)據(jù)庫事務(wù)發(fā)生,那就是從當(dāng)前交易流水表復(fù)制交易記錄到歷史交易記錄表,記錄數(shù)1100w條,單表需要數(shù)據(jù)庫存儲(chǔ)空間5G,執(zhí)行完成插入操作后,需要進(jìn)行后續(xù)的刪除操作,而恰在此時(shí)時(shí)出現(xiàn)了日志空間不足的情況。新的疑問就出現(xiàn)了,為何這個(gè)問題之前沒有發(fā)生呢?
六、歷史問題:為何以前沒有發(fā)生問題呢
通過跟蹤6月30日以前的日志發(fā)現(xiàn),執(zhí)行當(dāng)前表到歷史表的數(shù)據(jù)拷貝的操作,都是失敗的。直接返回-1,失敗的原因是由于當(dāng)前數(shù)據(jù)庫可用空間不足或者其他原因,正在跟蹤分析中。。。
聯(lián)系客服