3Ware Controller Error Messages (3Ware Controller Problem Determination Procedures)

Author: Administrator (최고관리자) Date: 2010-04-30 (Fri) 14:43 Views: 9725

Source: https://twiki.cern.ch/twiki/bin/view/FIOgroup/DiskPrbTw

/var/log/messages

These messages can appear in the system syslog file. They are documented here to help distinguish real errors from false ones. Only the messages explicitly labeled as generating lemon events (e.g. RAID_TW_CTLR or RAID_TW_DISK) are reported to the operator. The RAID_TW lemon events defined here are obtained by running query commands rather than by looking at log history.
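
As a starting point for filtering, the 9xxx-style AEN lines listed below can be split into their fields. The following is a minimal sketch, not part of the original procedure: the line layout is inferred only from the sample messages in this document, and the names `AEN_9XXX`, `PARAM` and `parse_aen` are hypothetical. Older 3w-xxxx messages, which carry no hexadecimal code, are deliberately not matched.

```python
import re

# Assumed layout, inferred from the samples in this document:
#   [kernel: ][3w-]9xxx: [scsiN: ]AEN: SEVERITY (0xSS:0xCCCC): text[:key=value, ...]
AEN_9XXX = re.compile(
    r"(?:3w-)?9xxx: (?:scsi(?P<host>\d+): )?AEN: "
    r"(?P<severity>ERROR|WARNING|INFO)\s*"
    r"\((?P<source>0x[0-9A-Fa-f]+):(?P<code>0x[0-9A-Fa-f]+)\): "
    r"(?P<rest>.*)"
)
# Trailing parameters such as "unit=0" or "error=0x208".
PARAM = re.compile(r"(\w+)=(\w+)")


def parse_aen(line):
    """Return the fields of a 9xxx AEN syslog line, or None if it does not match."""
    m = AEN_9XXX.search(line)
    if m is None:
        return None
    rest = m.group("rest")
    host = m.group("host")
    return {
        "host": int(host) if host is not None else None,
        "severity": m.group("severity"),
        "source": int(m.group("source"), 16),
        "code": int(m.group("code"), 16),
        # The message text is everything before the first colon.
        "text": rest.split(":", 1)[0].strip(),
        # Parameter values are kept as strings because some are hexadecimal.
        "params": dict(PARAM.findall(rest)),
    }


if __name__ == "__main__":
    sample = ("kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): "
              "Degraded unit detected:unit=0, port=1")
    print(parse_aen(sample))
```

Keeping the numeric code separate from the text makes it possible to key later rules on the code, which may be more stable than the wording of the message.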

Message Action

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0002): Degraded unit detected:unit=0, port=1 A degraded disk has been found as part of the RAID array. Follow DiskWinTwMirrorRecover

kernel: 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. Message can be ignored. It indicates that the tw_cli start verify has been run.

kernel: 3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0. Message can be ignored. It indicates that the tw_cli start verify has been run and completed

kernel: 3w-9xxx: scsi1: ERROR: (0x03:0x010D): Invalid field in CDB:. According to the article on the 3ware web site, this indicates a request for a status page which does not exist. This is not an error with the adapter or disk and the message can be ignored.

kernel: 3w-xxxx: scsi3: Command failed: status = 0xc4, flags = 0x43, unit #8. A smartctl error listing command such as -l selftest has been issued against a disk which does not exist. If no disk should be present in this port, this error can be ignored. Otherwise, run lemon-host-check and then follow the 3ware problem determination procedures.

9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=5 A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using tw_cli start verify and then replaced if more errors occur.
Some corruption cases detected by fsprobe have been identified where these messages were also present. This was not conclusive as of 03/11/07. The message has since been observed alongside data corruption on the tt_07_1 models (25/01/08).
The recommended action is to replace the drive at the port listed in the message (port 5 in this case).

kernel: 3w-xxxx: scsi0: AEN: WARNING: Sector repair occurred: Port #1. A bad block has been found and re-located. This is not a serious problem and will occur from time to time. If there are a large number of these errors, the disk can be checked using tw_cli start verify and then replaced if more errors occur

9xxx: scsi1: AEN: WARNING (0x04:0x004B): Battery temperature is high The card has detected battery temperature problems. Follow the procedure in DiskPrbTwBbuFault

3w-xxxx: scsi2: AEN: WARNING: Unclean shutdown detected: Unit #6. This indicates the machine was powered down without a clean shutdown. While this is not an error in itself and can be ignored, it may explain other errors, such as a corrupted file system where the cache was not saved to disk.

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): <NULL>:. This message is not understood but seems to be caused by a memory space problem on the machine. It does not indicate a problem with the controller and can be ignored.

Spare capacity too small for some units: spare unit=3, RAID unit=2 The spare disk is too small to be incorporated into the RAID. See DiskWinTwUnitDiskSizeFix

3w-9xxx: scsi0: AEN: ERROR (0x04:0x0057): Battery charging fault:. A battery is failing. Run the DiskWinTwBbuTest procedure to check the battery and raise a vendor call if the messages recur.

kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0039): Buffer ECC error corrected:address=0x3449700. This is a serious error which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.

kernel: 3w-9xxx: scsi0: AEN: ERROR (0x04:0x0024): Buffer integrity test failed:error=0x3013. This is suspected as being a serious error message which merits a vendor call. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.

kernel: 3w-xxxx: scsi1: AEN: ERROR: Drive ECC error detected: Port #0. This error will generate a RAID_TW_DISK error. The disk at the specified port has failed, usually shown up by a scheduled verification or media scan. The disk should be replaced.

kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x005F): Cache synchronization failed; some data lost:unit=1. This is suspected as being a serious error message which merits a vendor call. Some file system corruption may also occur. This message will generate a RAID_TW_CTLR error to the operator and the controller should be replaced.

kernel: 3w-9xxx: scsi2: ERROR: (0x03:0x0101): Invalid command opcode:opcode=0x4D. This message occurs when SMART data is requested from a 'logical' disk rather than a physical one (such as /dev/sdb on a 3ware controller). Check the /etc/smartd.conf file and the CDB configuration.

kernel: 3w-xxxx: scsi1: Unit #0: Command (c4d44e00) timed out, resetting card. A vendor call should be raised. This problem has occurred around the same time as data corruption problems and seems to be related to cabling or enclosure problems.

kernel: 3w-9xxx: scsi1: AEN: ERROR (0x04:0x0047): Battery voltage is too low

kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0045): Battery voltage is low

kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0044): Battery voltage is normal

kernel: 3w-9xxx: scsi1: AEN: INFO (0x04:0x0056): Battery charging completed The battery on the cards runs down and is recharged automatically. On their own, these messages are normal. Under some circumstances, the card will do this in a loop, recharging every few minutes (rather than once a week or so as usual). If recharging occurs very often, a vendor call should be raised to replace the battery.

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0051): Battery health check started

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0052): Battery health check completed

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0055): Battery charging started As above

kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0036): Verify fixed data/parity mismatch:unit=0. A start verify operation has caused the unit to be checked and it has found a problem. This has been corrected automatically and no further action is required unless the problem occurs repeatedly.
If the problem occurs more than 5 times in an hour, a RAID_TW_DISK alarm is raised and a vendor call should be created.
See here for 3ware explanation

kernel: 3w-xxxx: scsi0: Command failed: status = 0xc7, flags = 0xd0, unit #7. This message has been seen when a disk has completely failed. Normally other monitoring such as the RAID_TW or SMART_SELFTEST alarm will detect the problem as well. The error indicates a SMART test is failing since the disk does not respond to SMART requests. The failing disk within the unit should be replaced

kernel: 3w-xxxx: scsi0: AEN drain failed, retrying. The exact cause of this message is not known. The message has been seen when a disk is determined to have failed and is rebuilding. The message does not by itself confirm a problem; a full verify or media scan is recommended to determine the extent of any problem.

kernel: 3w-9xxx: scsi2: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=7, unit=1. A drive has reported an ECC-error and the disk should be replaced. This will generally lead to a RAID_TW alarm and the vendor call will follow from the standard procedure.

kernel: 3w-9xxx: AEN: WARNING (0x04:0x0042): Primary DCB read error occurred:port=0, error=0x208. The unit has completely failed. Data loss is likely. Vendor call required

kernel: 3w-9xxx: scsi0: AEN: WARNING(0x04:0x0043): Backup DCB read error detected:port=9, error=0x1019. Exact cause not known but current recommendation is a vendor call

kernel: 3w-9xxx: scsi0: AEN: WARNING (0x04:0x002F): Verify not started; unit never initialized:RAID1 subunit=0. This message occurs when an array is verified for the first time. It can be ignored.

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x000C): Initialize started:unit=0. This message occurs when an array is verified for the first time. It can be ignored.

kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0007): Initialize completed:unit=0. This message occurs when an array is verified for the first time. It can be ignored.

kernel: 3w-9xxx: scsi1: WARNING: (0x06:0x000C): Character ioctl (0x108) timed out, resetting card. This message has been seen when there was a bus problem on the machine. Raise a vendor call

kernel: Call Trace:{__alloc_pages+768} {dma_alloc_pages+125}
kernel: {dma_alloc_coherent+97} {:3w_9xxx:twa_chrdev_ioctl+227}
kernel: {do_page_fault+575} {autoremove_wake_function+0}
kernel: {dput+56} {strncpy_from_user+74}
kernel: {sys_ioctl+853} {system_call+126}
There is a problem with the amount of DMA memory available. This was seen on 3ware 95XX cards with an old version of the firmware (3.04). Try a newer version of the firmware to see if this resolves the problem.

Flash file system repaired:. This message has been seen on a few machines, usually followed by a controller reset. The root cause is not known, but the controller reset justifies a vendor call. Follow the procedure for the controller reset.

kernel: 3w-9xxx: scsi1: ERROR: (0x06:0x000C): PCI Parity Error: clearing. Suspect a problem with the controller card. This has been seen on s0 series of machines. Raise a vendor call for a check of motherboard and potential controller replacement.

kernel: 3w-9xxx: scsi0: ERROR: (0x06:0x0010): Microcontroller Error: clearing. Suspect a problem with the controller card. This has been seen on e5 series machines along with an fsprobe corruption. Raise a vendor call for a check of motherboard and potential controller replacement.

scsi0: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=21. A disk has shown higher than expected levels of SMART errors. It should be replaced. Raise a vendor call for a check of motherboard and potential controller replacement.
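
Part of the table above can be turned into a simple log classifier. The sketch below is illustrative only and covers a subset of the rows: it maps a few message fragments to "ignore", RAID_TW_CTLR or RAID_TW_DISK, and applies the rule that more than 5 "Verify fixed data/parity mismatch" messages within an hour raise RAID_TW_DISK. The class name `Classifier`, the "ignore" and "unclassified" labels, and the use of timestamps in seconds are assumptions made for this example; the actual lemon sensors query the controller rather than read the log.

```python
from collections import deque

IGNORE = "ignore"        # hypothetical label for messages the table says can be ignored
CTLR = "RAID_TW_CTLR"    # lemon event named in the table
DISK = "RAID_TW_DISK"    # lemon event named in the table

# Subset of the table: message fragment -> outcome.
RULES = [
    ("Verify started", IGNORE),
    ("Verify complete", IGNORE),
    ("Invalid field in CDB", IGNORE),
    ("Initialize started", IGNORE),
    ("Initialize completed", IGNORE),
    ("Buffer ECC error corrected", CTLR),
    ("Buffer integrity test failed", CTLR),
    ("Cache synchronization failed", CTLR),
    ("Drive ECC error detected", DISK),
]

MISMATCH = "Verify fixed data/parity mismatch"
WINDOW_SECONDS = 3600    # "in an hour"
MAX_MISMATCHES = 5       # alarm only when this count is exceeded


class Classifier:
    """Classify syslog lines; timestamps are seconds and must not decrease."""

    def __init__(self):
        self._mismatch_times = deque()

    def classify(self, timestamp, line):
        if MISMATCH in line:
            times = self._mismatch_times
            times.append(timestamp)
            # Drop mismatch events that are an hour old or older.
            while times and timestamp - times[0] >= WINDOW_SECONDS:
                times.popleft()
            return DISK if len(times) > MAX_MISMATCHES else IGNORE
        for fragment, outcome in RULES:
            if fragment in line:
                return outcome
        return "unclassified"


if __name__ == "__main__":
    c = Classifier()
    line = ("kernel: 3w-9xxx: scsi1: AEN: WARNING (0x04:0x0036): "
            "Verify fixed data/parity mismatch:unit=0.")
    for minute in range(6):
        print(minute, c.classify(minute * 60, line))
```

Rows that call for a vendor call or a named procedure are left out here; they could be added to `RULES` with their own labels. Returning "unclassified" for anything unknown keeps new messages visible instead of silently dropping them.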
