DR Best Practices In a Virtualized Environment?

L. Mark Stone
Ambassador
Posts: 2796
Joined: Wed Oct 09, 2013 11:35 am
Location: Portland, Maine, US
ZCS/ZD Version: 10.0.6 Network Edition

Post by L. Mark Stone »

We are a Zimbra hosting provider running a virtualised Zimbra farm on 7.2.1 presently; we also support multiple clients' virtualised Zimbra multi- and single-server environments. All of our storage is SAN-backed.
We are looking to improve our DR methodologies in the event a Sandy/Irene/Katrina or similar event takes, or is expected to take, down the primary data center site, and a failover to the secondary data center at least several hundred miles away is made.
Specifically, we are looking for a Zimbra-supported process which:



Is hypervisor agnostic (but which may rely on hypervisor-specific data transport tools).

Is data-safe (e.g. does not rely on rsyncing the entire /opt/zimbra tree without using LDIF to copy LDAP from the production to the DR site with Zimbra versions prior to 8).

Is, to the extent possible, Zimbra server-version agnostic.

By leveraging intra-day transport of the redo logs, minimizes loss of data when failing over to the DR site.

Minimizes the time to "light up" a working DR site once a decision is made to failover.




Background

Historically our DR has relied on having cold standby Zimbra servers ready in the secondary data center; rsyncing the Zimbra NE backups from the primary to the secondary data center; running zmrestoreldap and zmrestoreoffline followed by making changes to public DNS to complete the failover. As the store sizes grow, this process consumes a great deal of inter-data center bandwidth (to rsync the backups) and increases the time to complete the failover as zmrestoreoffline takes a fair amount of time to do its work.
The Admin Guide points out (NE 7.2, April 2012 edition, page 239) that storage layer snapshots, in conjunction with zmplayredo, may be used as an alternative to Zimbra's own backup and restore feature by leveraging Zimbra's redo logs. But the Admin Guide does not lay out the entire redolog DR process to the extent it does for a Zimbra server replacement.
Consequently, we are starting this thread to try to document this redo log-based DR process. Once done, I'm happy to write a wiki article and suggest Zimbra mark it as Certified Documentation once they have QA'd it. We are seeing hardly any new Zimbra installs going on bare metal, so upgrading the documented DR processes to take advantage of hypervisor-related technologies seems timely. Please help!


Proposed High-Level DR Process

Reading between the lines of the above-mentioned Admin Guide process, we are contemplating testing the following process, and would be grateful for feedback from the community before we get too far ahead of ourselves here. At a high level, the process, it seems to us, would comprise:


One-Time or Infrequent Initialization Tasks

1. Shorten the TTLs for all Zimbra server public DNS records, and/or use a third-party DNS provider who can orchestrate rapid DNS failover-based changes.

2. On the production Zimbra servers, as root do a "chkconfig zimbra off", then as the zimbra user do a zmcontrol stop, shut down the virtual machine, note the exact time, take a hypervisor-level snapshot of the Zimbra virtual machines and then restart the virtual machines and Zimbra, and then as root run "chkconfig zimbra on".

3. Depending on the hypervisor, use appropriate procedures to create clones of the Zimbra servers from the snapshots and transfer them to the DR data center.

(As and when production Zimbra is upgraded/patched, steps 2. and 3. here will need to be repeated.)

4. At the DR site, boot the cloned Zimbra mailbox servers only (Zimbra itself will not start and should not be started).
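
As a rough sketch of the snapshot sequence described in the initialization tasks above, the ordering can be written down as a dry-run plan that an operator reviews before doing anything. Nothing here is executed. The function name `cold_snapshot_plan`, the VM label `mbox1`, and the wording of the operator/hypervisor steps are hypothetical, and `zmcontrol start` is an assumption about how Zimbra is brought back up after the VM restarts.

```python
def cold_snapshot_plan(vm_name):
    """Return ordered (actor, action) pairs for one cold snapshot of a VM.

    The order follows the post: disable autostart, stop Zimbra, power off,
    note the time, snapshot, power on, start Zimbra, re-enable autostart.
    """
    return [
        ("root", "chkconfig zimbra off"),
        ("zimbra", "zmcontrol stop"),
        ("operator", f"shut down VM {vm_name}"),
        ("operator", "record the exact time"),
        ("hypervisor", f"snapshot VM {vm_name}"),
        ("operator", f"start VM {vm_name}"),
        ("zimbra", "zmcontrol start"),  # assumed restart command
        ("root", "chkconfig zimbra on"),
    ]


if __name__ == "__main__":
    # Print a numbered checklist for a hypothetical mailbox server VM.
    for number, (actor, action) in enumerate(cold_snapshot_plan("mbox1"), start=1):
        print(f"{number}. [{actor}] {action}")
```

Because autostart is switched off before the snapshot is taken, the clone carries that setting with it, which is why Zimbra does not start when the clone is booted at the DR site.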


Routine Syncing Between the Production and DR Servers

1. Schedule a cron job which, after each zmbackup on the production servers, rsyncs from /opt/zimbra/backup on the production servers to the DR servers just /opt/zimbra/backup/ldap and /opt/zimbra/backup/sys (for full backups) and /opt/zimbra/backup/ldap, /opt/zimbra/backup/sys and /opt/zimbra/backup/redologs (for incremental backups).

2. After the above rsync is complete, run zmrestoreldap daily and zmplayredo --logfile=/opt/zimbra/backup/sessions/incr-[sessionid] after the incremental backup rsyncs only.

3. In Zimbra's crontab at the DR site, comment out the lines which run full and incremental zmbackup, but keep the line which prunes older backups.

4. Schedule a cron job which, every hour or so, rsyncs /opt/zimbra/redolog with the --delete switch from the production Zimbra mailbox servers to the DR site Zimbra mailbox servers.
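
To make the two rsync jobs above concrete, here is a minimal sketch that only builds the command lines and prints them; it does not run anything. The host name `dr-mbox1.example.com`, the function names, and the use of rsync's `-a` (archive) option are assumptions for illustration; the paths and the `--delete` switch come from the list above.

```python
import shlex

BACKUP_ROOT = "/opt/zimbra/backup"
REDOLOG_DIR = "/opt/zimbra/redolog"


def backup_rsync_commands(dr_host, incremental):
    """Build one rsync command per backup subdirectory to ship to the DR host.

    Full backups ship ldap and sys; incremental backups also ship redologs.
    """
    subdirs = ["ldap", "sys"]
    if incremental:
        subdirs.append("redologs")
    return [
        ["rsync", "-a", f"{BACKUP_ROOT}/{d}/", f"{dr_host}:{BACKUP_ROOT}/{d}/"]
        for d in subdirs
    ]


def redolog_rsync_command(dr_host):
    """Build the hourly live-redolog sync, mirroring deletions with --delete."""
    return ["rsync", "-a", "--delete", f"{REDOLOG_DIR}/", f"{dr_host}:{REDOLOG_DIR}/"]


if __name__ == "__main__":
    host = "dr-mbox1.example.com"  # hypothetical DR mailbox server
    for cmd in backup_rsync_commands(host, incremental=True):
        print(shlex.join(cmd))
    print(shlex.join(redolog_rsync_command(host)))
```

Keeping the commands as argument lists makes it straightforward to hand them to a scheduler or to `subprocess.run` later without shell-quoting surprises.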


Failover to DR Site Process

1. On the DR Zimbra mailbox servers, run zmplayredo --logfile=/opt/zimbra/redolog one last time, then start Zimbra on all DR servers.

2. If the production site is still reachable, shut down the Zimbra servers there.

3. Update public DNS to point to the DR Zimbra servers.
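
The failover order above has one conditional step, so it may help to see it as a small runbook generator. This is only a sketch: the function name `failover_steps` is hypothetical, `zmcontrol start` is assumed as the way Zimbra is started on the DR servers, and the zmplayredo option is spelled as in this post and should be checked against the installed version's help output before use.

```python
def failover_steps(production_reachable, redolog_dir="/opt/zimbra/redolog"):
    """Return the failover runbook as an ordered list of human-readable steps."""
    steps = [
        # Option spelling taken from the post; verify with zmplayredo's help.
        f"DR mailbox servers: zmplayredo --logfile={redolog_dir}",
        "All DR servers: zmcontrol start",  # assumed start command
    ]
    if production_reachable:
        # Avoid two live sites accepting mail for the same accounts.
        steps.append("Production: shut down the Zimbra servers")
    steps.append("Public DNS: point records at the DR Zimbra servers")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(failover_steps(production_reachable=True), start=1):
        print(f"{number}. {step}")
```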


Questions and Missing Bits

Is it true that, once a Zimbra server is snapshotted as above and transported to the DR site, keeping this DR Zimbra server periodically in sync with the production site requires only restoring a recent version of LDAP and replaying the redo logs? Is there any other data which needs to be copied over to avoid having to ship the entire /opt/zimbra/backup tree between the production and DR sites?

Is anyone using this technique (or a variation thereof) presently, and have you ever tested, or actually had to, fail over to the DR site?

What else needs to be included from a process standpoint that we haven't thought of?

Thanks!

Mark
___________________________________
L. Mark Stone
Mission Critical Email - Zimbra VAR/BSP/Training Partner https://www.missioncriticalemail.com/
AWS Certified Solutions Architect-Associate
quanah
Zimbra Alumni
Posts: 1668
Joined: Fri Sep 12, 2014 10:33 pm

Post by quanah »

Hi Mark,
On the LDAP DR portion, what we did at my previous job, and what would work well with Zimbra too, is to simply configure off-site replicas. That way, in the case of something like Katrina, we had an active backup of all the user data. This was for a university, so it included helpful bits like the students' home contact information, etc.
If you did this with ZCS, I would recommend having an offsite server that is LDAP-only, and configuring it either as a replica or (with ZCS 8) as a backup master as part of an MMR configuration. You noted you wanted to be ZCS version agnostic with relation to the LDAP server. The only issue there is that the Zimbra schema can, and does, change between versions, even minor versions like 7.2.x to 7.2.y. So for things to remain usable, it is best to keep the ZCS versions in sync. On the plus side, upgrading an LDAP-only server is quick and fairly painless.
If you go the replica-only route, and there is a disaster, it is fairly simple to later promote the replica to be a master:
https://wiki.zimbra.com/wiki/Promoting_ ... DAP_Master
--Quanah
--
Quanah Gibson-Mount
Product Architect, Symas http://www.symas.com/
OpenLDAP Core team http://www.openldap.org/project/
quanah
Zimbra Alumni
Posts: 1668
Joined: Fri Sep 12, 2014 10:33 pm

Post by quanah »

Two other thoughts.
a) rsync -- You can do this safely on ZCS 7 and prior if you stop the LDAP server, and then run db_recover on all the databases (the master has two). This will write out all pending changes. Live rsync, or rsync where db_recover has not been run, continues to be unsafe. ;) On ZCS 8 and later, LDAP simply needs to be stopped when running rsync (with -S).
b) I'm guessing you have a fairly large database, since you seem to want to avoid slapcat/slapadd (the official method of backing up LDAP). I would note that in ZCS 8, write performance for the LDAP server is more than 2 times greater than it is in ZCS 7 and prior. This significantly decreases the amount of time it takes to slapadd a database. For example, with a 6 million entry database from a customer that I use for testing, slapadd time went from 140 minutes to 45 minutes.
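
The version-dependent rule in (a) can be summarised as a small checklist function. This is a sketch only: the function name `ldap_rsync_prep` is hypothetical, it returns descriptions rather than running commands, and it encodes nothing beyond what is stated above.

```python
def ldap_rsync_prep(zcs_major_version):
    """Return the steps needed before the LDAP data can be rsynced safely.

    ZCS 7 and earlier: stop LDAP, then run db_recover on every database.
    ZCS 8 and later: stopping LDAP is enough, and rsync should be run with -S.
    """
    if zcs_major_version <= 7:
        return [
            "stop the LDAP server",
            "run db_recover on every LDAP database (the master has two)",
            "rsync the LDAP data",
        ]
    return [
        "stop the LDAP server",
        "rsync the LDAP data with -S",
    ]


if __name__ == "__main__":
    for version in (7, 8):
        print(f"ZCS {version}:")
        for step in ldap_rsync_prep(version):
            print(f"  - {step}")
```

In both branches the first step is the same: a live rsync of a running LDAP server is never on the list.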
--
Quanah Gibson-Mount
Product Architect, Symas http://www.symas.com/
OpenLDAP Core team http://www.openldap.org/project/
Krishopper
Outstanding Member
Posts: 769
Joined: Fri Sep 12, 2014 10:23 pm

Post by Krishopper »

Mark,
[quote]
Questions and Missing Bits

Is it true that, once a Zimbra server is snapshotted as above and transported to the DR site, keeping this DR Zimbra server periodically in sync with the production site requires only restoring a recent version of LDAP and replaying the redo logs? Is there any other data which needs to be copied over to avoid having to ship the entire /opt/zimbra/backup tree between the production and DR sites?

Is anyone using this technique (or a variation thereof) presently, and have you ever tested, or actually had to, fail over to the DR site?

What else needs to be included from a process standpoint that we haven't thought of?

Thanks!

Mark
[/quote]
I am doing the same type of thing with Amazon AWS. /opt/zimbra gets snapshotted less often (every 4 hours), and I have mounts for LDAP and redologs that get snapshotted much more often (every 2 minutes). Since I am running Zimbra on Amazon EC2, the snapshots use Amazon's EBS service and are very fast. I am also using the ec2-consistent-snapshot script, which forces a flush of the MySQL (mailbox) databases and then freezes the file system. The whole process takes less than 1 second and there are no user interruptions in the process.
I haven't actually had to do a real DR, but I have tested and was quickly able to restore the server and replay the last 4 hours of redo logs, and did not see any issues. It puts my RPO (recovery point objective) at 2 minutes, which, IMO, is a great target.
In addition, I also used this to restore a deleted account, as I was able to restore the snapshot to a temporary location, play the redo logs up to a certain point in time, and then use zmmailbox with getRestURL/postRestURL to get the old account data and restore it into the production server.
Long story short, I've been doing just that since July 2nd without any issues. Offline testing showed success (restoring and verifying that the most recent mail and accounts were available in a test offline system that wasn't put into production), as well as restoring content from an account.
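
To illustrate how the two snapshot schedules in this post translate into recovery numbers, here is a minimal sketch that computes, for a given failure time, how much redo has to be replayed on top of the last /opt/zimbra snapshot and how much data is lost. The function name `recovery_window` and the example timestamps are hypothetical; only the 4-hour and 2-minute intervals come from the post.

```python
from datetime import datetime


def recovery_window(failure, last_base_snapshot, last_log_snapshot):
    """Return (replay_span, data_lost) as timedeltas.

    replay_span: redo logs to replay on top of the last base snapshot.
    data_lost:   changes made after the last redolog snapshot (the actual loss,
                 bounded above by the redolog snapshot interval, i.e. the RPO).
    """
    if not last_base_snapshot <= last_log_snapshot <= failure:
        raise ValueError("expected base snapshot <= log snapshot <= failure time")
    replay_span = last_log_snapshot - last_base_snapshot
    data_lost = failure - last_log_snapshot
    return replay_span, data_lost


if __name__ == "__main__":
    # Illustrative times: base snapshot at 08:00, last redolog snapshot at 12:00,
    # failure 90 seconds later.
    replay, lost = recovery_window(
        failure=datetime(2013, 1, 1, 12, 1, 30),
        last_base_snapshot=datetime(2013, 1, 1, 8, 0),
        last_log_snapshot=datetime(2013, 1, 1, 12, 0),
    )
    print("redo to replay:", replay)
    print("data lost:", lost)
```

With redologs snapshotted every 2 minutes the loss can never exceed 2 minutes, while the replay span can approach the full 4 hours between /opt/zimbra snapshots; shortening the base interval trades snapshot cost for faster recovery.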
Travis Kensil
Posts: 17
Joined: Fri Sep 12, 2014 10:36 pm

Post by Travis Kensil »

Mark,
We have attacked this problem at the hypervisor level (in our case VMware) with Veeam Backup and Replication. We use Veeam to replicate the Zimbra VM to our DR site; in case of a disaster or other issue, we simply fail over to our Zimbra server copy at the DR site. This gives us great flexibility: depending on one's data loss tolerance, the number of restore points per hour, day, week, month, etc. can be adjusted.
Also, based on the Zimbra/VMware relationship and what I have read of their change from supporting Red Hat cluster services to instead encouraging native VMware abilities to provide HA/FT, this approach would seem to be in line with that direction as well.
Also, in terms of process, reverting to a DR replica is much quicker and simpler than some of the process you have discussed, which sounds semi-manual and prone to issues/errors.
Veeam also gives us the ability to schedule testing of these VM replicas to ensure data integrity; so far we have had no issues.
Just our $0.02 from what we are doing. Veeam introduces a lot of flexibility, but it needs large quantities of raw storage (or deduplication) as well as large bandwidth capacity for replication.