Saturday, June 9, 2012

Plexes in DISABLED RECOVER/STALE state while the volume is in ENABLED ACTIVE state

Recently I had a situation where plexes went into the DISABLED RECOVER state while their volumes stayed in the ENABLED ACTIVE state. The volumes remained ENABLED ACTIVE because only one side of each mirror was disabled; the plex on the other side was still in a good state (ENABLED ACTIVE). This is rarely seen (at least in my environment), so I thought I would write a blog entry about it.

I believe this was caused by a temporary I/O failure on disk c1t1d0; the diagnostic messages contain log entries showing drive c1t1d0 going offline. See the vxprint output below:

root:XXXXXXXX:/root # vxprint -htg rootdg
DG NAME NCONFIG NLOG MINORS GROUP-ID
ST NAME STATE DM_CNT SPARE_CNT APPVOL_CNT
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
CO NAME CACHEVOL KSTATE STATE
VT NAME NVOLUME KSTATE STATE
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO

dg rootdg default default 29000 1218716511.6.XXXXXXXX

dm rootdisk c1t0d0s2 auto 16384 143328960 -
dm rootmirror c1t1d0s2 auto 20095 143318784 -

v crash - ENABLED ACTIVE 16780224 ROUND - fsgen
pl crash-01 crash ENABLED ACTIVE 16780224 CONCAT - RW
sd rootdisk-03 crash-01 rootdisk 67131071 16780224 0 c1t0d0 ENA
pl crash-02 crash DISABLED RECOVER 16780224 CONCAT - RW
sd rootmirror-02 crash-02 rootmirror 16780224 16780224 0 c1t1d0 ENA

v home - ENABLED ACTIVE 2097152 SELECT - fsgen
pl home-01 home ENABLED ACTIVE 2106432 CONCAT - RW
sd rootdisk-09 home-01 rootdisk 102808127 2106432 0 c1t0d0 ENA
pl home-02 home DISABLED RECOVER 2106432 CONCAT - RW
sd rootmirror-04 home-02 rootmirror 35666880 2106432 0 c1t1d0 ENA

v networker - ENABLED ACTIVE 10485760 SELECT - fsgen
pl networker-01 networker ENABLED ACTIVE 10491456 CONCAT - RW
sd rootdisk-08 networker-01 rootdisk 92316671 10491456 0 c1t0d0 ENA
pl networker-02 networker DISABLED RECOVER 10491456 CONCAT - RW
sd rootmirror-05 networker-02 rootmirror 37773312 10491456 0 c1t1d0 ENA

[... Many lines, skipped for brevity ...]
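With many volumes in a disk group, the affected plexes are easy to miss in the full listing. Here is a minimal sketch that filters them out, assuming the "pl" record layout shown in the header above (NAME, VOLUME, KSTATE, STATE as fields 2 to 5). The function name list_unhealthy_plexes is my own and not part of VxVM.

```shell
#!/bin/sh
# Print every plex whose kernel state or state is not ENABLED/ACTIVE.
# Reads `vxprint -ht` output on stdin; "pl" records have the layout:
#   pl NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
# (list_unhealthy_plexes is a hypothetical helper name.)
list_unhealthy_plexes() {
    awk '$1 == "pl" && ($4 != "ENABLED" || $5 != "ACTIVE") {
        printf "%s %s %s %s\n", $2, $3, $4, $5
    }'
}

# On a live system this would be fed by:
#   vxprint -htg rootdg | list_unhealthy_plexes
# Here a few lines from the output above are used instead.
list_unhealthy_plexes <<'EOF'
v crash - ENABLED ACTIVE 16780224 ROUND - fsgen
pl crash-01 crash ENABLED ACTIVE 16780224 CONCAT - RW
sd rootdisk-03 crash-01 rootdisk 67131071 16780224 0 c1t0d0 ENA
pl crash-02 crash DISABLED RECOVER 16780224 CONCAT - RW
sd rootmirror-02 crash-02 rootmirror 16780224 16780224 0 c1t1d0 ENA
EOF
```

For the sample lines this prints only the crash-02 plex with its volume and both state columns.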

To recover from such an incident, I would suggest trying the command below first:

# vxrecover -bsE -g diskgroup_name

# vxrecover -bsE -g rootdg

Use the vxtask list or vxtask -l list command to check the status of the sync operation.
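Because the recovery runs in the background, it can be handy to wait until no tasks are left before checking the plexes again. The sketch below polls vxtask list, assuming its output is one column-header line followed by one line per running task. The helper names and the 30-second default interval are my own choices.

```shell
#!/bin/sh
# Count task lines in `vxtask list` output read from stdin.
# Assumption: line 1 is the column header and every later
# non-empty line describes one task.
count_tasks() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}

# Poll until no tasks remain. $1 is the polling interval in seconds
# (30 is an arbitrary default, not a VxVM recommendation).
wait_for_vx_tasks() {
    interval="${1:-30}"
    while [ "$(vxtask list | count_tasks)" -gt 0 ]; do
        sleep "$interval"
    done
    echo "no VxVM tasks running"
}

# Hypothetical usage on a live system:
#   wait_for_vx_tasks 60
```

Note that this also reports zero tasks if vxtask itself fails, so treat it as a convenience rather than a proof that the sync completed.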
In the vxrecover command above:

vxrecover - perform volume recovery operations

-b => Performs recovery operations in the background. With this option, vxrecover runs in the background to attach stale plexes and subdisks, and to resynchronize mirrored volumes and RAID-5 parity. If this is used with -s, volumes are started before recovery begins in the background.

-s => Starts disabled volumes that are selected by the operation. Volumes are started before any other recovery actions are taken.

-E => Starts disabled volumes or plexes even when they are in the EMPTY state. This is useful for starting up volumes restored by the vxmake utility when specified along with the -s option.

The vxrecover command tries to recover the plexes and resynchronize them. If that doesn't work, I would detach the affected plex(es) and attach them again; you can do so with "vxplex det" and "vxplex att".
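For the detach/attach fallback, here is a rough sketch that only prints the command pairs for review instead of running them. It assumes the usual forms "vxplex -g diskgroup det plex" and "vxplex -g diskgroup att volume plex", and it reads the same vxprint -ht output as above. The function name print_reattach_commands is hypothetical.

```shell
#!/bin/sh
# Print (do not run) a "vxplex det" / "vxplex att" pair for every
# DISABLED plex found in `vxprint -ht` output on stdin.
# $1 is the disk group name. On "pl" records, field 2 is the plex
# name, field 3 the volume name and field 4 the kernel state.
# (print_reattach_commands is a hypothetical helper name.)
print_reattach_commands() {
    awk -v dg="$1" '$1 == "pl" && $4 == "DISABLED" {
        printf "vxplex -g %s det %s\n", dg, $2
        printf "vxplex -g %s att %s %s\n", dg, $3, $2
    }'
}

# On a live system: vxprint -htg rootdg | print_reattach_commands rootdg
print_reattach_commands rootdg <<'EOF'
v home - ENABLED ACTIVE 2097152 SELECT - fsgen
pl home-01 home ENABLED ACTIVE 2106432 CONCAT - RW
pl home-02 home DISABLED RECOVER 2106432 CONCAT - RW
EOF
```

Printing the commands first gives a chance to check the plex and volume names before touching a root disk group.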

In my case the first option worked like a charm:

root:XXXXXXXX:/root # vxprint -htg rootdg
DG NAME NCONFIG NLOG MINORS GROUP-ID
ST NAME STATE DM_CNT SPARE_CNT APPVOL_CNT
DM NAME DEVICE TYPE PRIVLEN PUBLEN STATE
RV NAME RLINK_CNT KSTATE STATE PRIMARY DATAVOLS SRL
RL NAME RVG KSTATE STATE REM_HOST REM_DG REM_RLNK
CO NAME CACHEVOL KSTATE STATE
VT NAME NVOLUME KSTATE STATE
V NAME RVG/VSET/CO KSTATE STATE LENGTH READPOL PREFPLEX UTYPE
PL NAME VOLUME KSTATE STATE LENGTH LAYOUT NCOL/WID MODE
SD NAME PLEX DISK DISKOFFS LENGTH [COL/]OFF DEVICE MODE
SV NAME PLEX VOLNAME NVOLLAYR LENGTH [COL/]OFF AM/NM MODE
SC NAME PLEX CACHE DISKOFFS LENGTH [COL/]OFF DEVICE MODE
DC NAME PARENTVOL LOGVOL
SP NAME SNAPVOL DCO

dg rootdg default default 29000 1218716511.6.XXXXXXXX

dm rootdisk c1t0d0s2 auto 16384 143328960 -
dm rootmirror c1t1d0s2 auto 20095 143318784 -

v crash - ENABLED ACTIVE 16780224 ROUND - fsgen
pl crash-01 crash ENABLED ACTIVE 16780224 CONCAT - RW
sd rootdisk-03 crash-01 rootdisk 67131071 16780224 0 c1t0d0 ENA
pl crash-02 crash ENABLED ACTIVE 16780224 CONCAT - RW
sd rootmirror-02 crash-02 rootmirror 16780224 16780224 0 c1t1d0 ENA

v home - ENABLED ACTIVE 2097152 SELECT - fsgen
pl home-01 home ENABLED ACTIVE 2106432 CONCAT - RW
sd rootdisk-09 home-01 rootdisk 102808127 2106432 0 c1t0d0 ENA
pl home-02 home ENABLED ACTIVE 2106432 CONCAT - RW
sd rootmirror-04 home-02 rootmirror 35666880 2106432 0 c1t1d0 ENA

v networker - ENABLED ACTIVE 10485760 SELECT - fsgen
pl networker-01 networker ENABLED ACTIVE 10491456 CONCAT - RW
sd rootdisk-08 networker-01 rootdisk 92316671 10491456 0 c1t0d0 ENA
pl networker-02 networker ENABLED ACTIVE 10491456 CONCAT - RW
sd rootmirror-05 networker-02 rootmirror 37773312 10491456 0 c1t1d0 ENA

[... Many lines, skipped for brevity ...]

Hope this helps someone.

BTW, just to refresh the basics, let's take a look at plex states and condition flags.

EMPTY: This state indicates that you have not yet defined which plex has the good data (CLEAN), and which plex does not have the good data (STALE).

CLEAN: This state is normal and indicates that the plex has a copy of the data that represents the volume. CLEAN also means that the volume is not started and is not currently able to handle I/O (by the administrator's control).

ACTIVE: This state is the same as CLEAN, but the volume is or was started, and the volume is or was able to perform I/O.

SNAPDONE: This state is the same as ACTIVE or CLEAN, but is a plex that has been synchronized with the volume as a result of a “vxassist snapstart” operation. After a reboot or a manual start of the volume, a plex in the SNAPDONE state is removed along with its subdisks.

STALE: This state indicates that VxVM has reason to believe that the data in the plex is not synchronized with the data in the CLEAN plexes. This state is usually caused by taking the plex offline or by a disk failure.

SNAPATT: This state indicates that the object is a snapshot that is currently being synchronized but does not yet have a complete copy of the data.

OFFLINE: This state indicates that the administrator has issued the “vxmend off” command on the plex. When the administrator brings the plex back online using the “vxmend on” command, the plex changes to the STALE state.

TEMP: The TEMP state flags (TEMP, TEMPRM, TEMPRMSD) usually indicate that the data was never a copy of the volume’s data, and you should not use these plexes. These temporary states indicate that the plex is currently involved in a synchronization operation with the volume.

NODEVICE: This flag indicates that the disk drive below the plex has failed.

REMOVED: This flag has the same meaning as NODEVICE, but the system admin has requested that the device appear as failed.

IOFAIL: This flag is similar to NODEVICE, but it indicates that an unrecoverable failure occurred on the device, and VxVM has not yet verified whether the disk is actually bad.

Note: I/O to both the public and the private regions must fail to change the state from IOFAIL to NODEVICE.

RECOVER: This flag is set on a plex when two conditions are met:

1) A failed disk has been fixed (by using vxreattach or the vxdiskadm option, “Replace a failed or removed disk”).
2) The plex was in the ACTIVE state prior to the failure.

I hope the notes above prove helpful.