3,000 deployments a day
Operating in 29 countries
12 Cloud Foundries in five regions
- Open source software—significant operational and licensing cost savings.
- Configuration as code—enables easier standup of new CFs and expansion into new regions as well as platform component customization.
- Platform automation—open source Concourse enables more frequent and smoother upgrades to CF, more effortless patching of AMIs, the recreation of container instances, and credential rotation.
- Platform monitoring—better alerts and visualizations of all platform metrics through the implementation of Prometheus, Grafana, and alertmanager.
Liberty Mutual is the sixth-largest global property and casualty insurer, with almost 45,000 employees operating in 29 countries. The global enterprise has 12 Cloud Foundries in five regions, with seven in public cloud and five running on-premises. The largest Cloud Foundry (CF) handles 54% of the workloads. Liberty Mutual employs 2,000 developers and has 12,000 applications running on the CF platform. On average, there are 3,000 deployments a day. This whole operation is supported by the Cloud Foundry Platform Team, composed of eight engineers.
Liberty Mutual supports various frameworks and buildpacks for its 12,000 CF applications, but most CF users develop in Java. The platform team also supports around 1,500 CF data service instances for Redis and RabbitMQ.
At the beginning of 2020, Liberty Mutual’s CF Platform Team wanted to assess whether it would be feasible to migrate its entire Cloud Foundry platform to open source Cloud Foundry. The migration promised significant cost benefits: Liberty Mutual would reduce its commercial per-app-instance license fees and could discard seven-digit support contracts.
However, an internal review highlighted the need to build confidence among not only users but the platform team itself.
Only 62% of applications running in CF were deployed frequently, reflecting that a sizable proportion of the running workload wasn’t actively maintained or written for a cloud-native environment. This meant migrating some individual apps would be problematic.
The focus for the platform team was to put customers first, which meant there could be no impact on applications or the business. Because developers needed to stay focused on application-specific priorities, the migration also had to require zero developer effort or involvement.
As the data services were not included in the initial migration, they had to remain managed by the existing brokers in the foundries post-migration, and newly migrated applications had to be capable of provisioning new service instances and maintaining connectivity to existing services.
Security requirements also meant that the new CF components had to be separate from the existing CFs while using the same subnets, to avoid firewall and security group-related issues for application traffic.
Liberty Mutual’s first decision was to bring in Stark & Wayne to train the team to build out, instrument, and maintain several scalable cloud foundries. S&W supplied practical hands-on training, including the toolkits required for buildouts and assisting with troubleshooting.
Additionally, S&W worked with the Liberty Mutual CF Platform Team to construct its migration plan to fulfill requirements with as little application downtime and developer impact as possible.
The most straightforward approach from an operator perspective would be to apply a ‘new city’ migration. This is where a new Cloud Foundry is built, and CF users can migrate their applications to the new Cloud Foundry during low-impact change windows. While being a safe approach, it can potentially take a significant amount of time for a large enterprise.
However, based on all Liberty Mutual’s requirements for its thousands of app developers and engineers, a ‘snap migration’ approach was selected, targeting the organization’s largest cloud foundry, running 54% of the workload.
This migration method leaves no room for error but can be achieved within an acceptable time frame. It also requires more hands-on operator work: a snapshot of the existing platform is taken and imported into the target platform during a change freeze. This approach has zero user involvement, but because it relies on a change freeze, most of the work has to be done during the change window. The methods employed to snapshot a live Cloud Foundry are technically more complex than a ‘new city’ method.
In preparation for migration, a parallel Cloud Foundry was built as the migration target. S&W used an internally developed open source tool called Genesis to pre-package and distribute pre-tested configuration options for BOSH manifests and customer configurations.
As this was a new tool to the Liberty Mutual platform team, training was provided on its use, including configuration customizations that might be required internally.
The buildout required several additional open source software stacks alongside Cloud Foundry to aid and augment its capabilities. For example, Concourse was implemented to automate deployments, Stratos was used as CF’s front-end, and Prometheus, Grafana, and alertmanager supplied platform status monitoring and alerts when necessary.
S&W also deployed Redis and RabbitMQ service brokers so that CF applications had the option to use them as needed.
A separate AWS account was used for new CF components to achieve security compliance, but subnets were paired across both the source and target accounts.
As requested by the client, the data services had to remain managed by the existing brokers in the foundries post-migration. This revolved around the proper configuration of firewall rules in the new AWS account to enable app traffic. Additional CF and BOSH configurations were also applied to enable the new platform to locate service instances and brokers in the original platform. S&W also undertook work on the Redis service instances to ensure they would be shared from the source platform’s BOSH Director to the post-migration target Director. This work involved using an NGINX server to continuously merge hostnames from old to new and allow apps access to DNS hostnames as needed. Extra work was required to create new router registers for new components and map them to the old CF to enable GoRouter to know where to reverse proxy traffic from old to new.
The migration process took place in a change window and required a code freeze to prevent any customer changes to the CF. The freeze was followed by cloning the CF state to the new CF, verification and validation, and then the migration itself.
The data migration essentially only took 10-15 minutes, as S&W had performed an initial S3 sync before migration, which took 10 hours. The bulk of the migration focused on validating the target platform.
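The split between a long initial sync and a short final pass can be illustrated with a minimal sketch. It assumes each bucket listing is available as a mapping from object key to a content fingerprint such as an ETag; the function name, keys, and fingerprints below are hypothetical, and this is not the tooling S&W used.

```python
def plan_delta(source, target):
    """Work out what a final sync pass still has to do.

    `source` and `target` map object key -> content fingerprint (assumed
    here to be an ETag-like string). Returns the keys to copy and the
    keys to delete so that `target` ends up matching `source`.
    """
    # Copy anything missing on the target or whose fingerprint differs.
    to_copy = sorted(k for k, tag in source.items() if target.get(k) != tag)
    # Delete anything the target still holds that the source no longer has.
    to_delete = sorted(k for k in target if k not in source)
    return to_copy, to_delete


# Hypothetical listings taken at the start of the change window.
source = {"droplets/a": "e1", "droplets/b": "e2", "packages/c": "e3"}
target = {"droplets/a": "e1", "droplets/b": "old", "stale/x": "e9"}

to_copy, to_delete = plan_delta(source, target)
print(to_copy)    # ['droplets/b', 'packages/c']
print(to_delete)  # ['stale/x']
```

Because most objects already match after the initial sync, the final pass only touches what changed in between, which is consistent with a 10-15 minute window following a 10-hour pre-sync.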
To snap all the databases, S&W ran a MySQL dump of the CF databases from the Cloud Controller API VM, copied the dump to the target CF, and imported all the data into the CF databases on the target. Once the data was migrated, Cloud Controller and Diego were started and all Diego Cells were restarted.
Validation involved comparing metadata between the source and the target CFs, monitoring the desired Long Running Process (LRP) count and the running LRP count to understand if there were crashing apps, along with monitoring Diego Cell capacity to optimize the cell sizes post-migration.
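The desired-versus-running comparison can be sketched as follows. This is an illustration under stated assumptions, not the team's actual validation tooling: it assumes the per-app LRP counts have already been collected from each foundation into plain dictionaries, and the `LrpSnapshot` type, the `compare_foundations` function, and the app identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LrpSnapshot:
    desired: int  # instances the platform is asked to run
    running: int  # instances actually reported as running


def compare_foundations(source, target):
    """Return {app_id: problem} for apps that look unhealthy on the target.

    `source` and `target` map an app identifier to its LrpSnapshot.
    """
    report = {}
    for app_id, src in source.items():
        tgt = target.get(app_id)
        if tgt is None:
            report[app_id] = "missing on target"
        elif tgt.desired != src.desired:
            report[app_id] = (
                f"desired count differs: source {src.desired}, target {tgt.desired}"
            )
        elif tgt.running < tgt.desired:
            # A gap between desired and running suggests crashing instances.
            short = tgt.desired - tgt.running
            report[app_id] = f"{short} of {tgt.desired} instances not running"
    return report


# Hypothetical data for three apps.
source = {
    "app-1": LrpSnapshot(3, 3),
    "app-2": LrpSnapshot(2, 2),
    "app-3": LrpSnapshot(1, 1),
}
target = {
    "app-1": LrpSnapshot(3, 3),
    "app-2": LrpSnapshot(2, 0),
}

for app_id, problem in compare_foundations(source, target).items():
    print(app_id, "->", problem)
```

An empty report means the target matches the source's desired state and every desired instance is running.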
Smoke tests covered the most important functionality of the platform to ensure the platform was healthy and managing the new workload well.
The active migration, which involved cutting over DNS, was achieved by changing and directing 5,000 platform URLs and 2,000 custom ‘vanity’ URLs to a single A record before migration.
The migration also required shortening DNS TTLs (time to live) and cutting connections to old load balancers to force a reconnect to the new ones.
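One way to confirm that a cutover has taken effect is to resolve each hostname and check that it now points at the new load balancers. The sketch below assumes the new load balancer addresses are known in advance; the hostnames, IP addresses, and the `check_cutover` function are hypothetical, and the resolver is injectable so the check can be exercised without real DNS.

```python
import socket


def check_cutover(hostnames, expected_ips, resolve=socket.gethostbyname):
    """Return {hostname: observed} for names not yet pointing at the new LBs.

    `expected_ips` is the set of addresses of the new load balancers.
    `resolve` defaults to the system resolver but can be replaced.
    """
    stale = {}
    for host in hostnames:
        try:
            ip = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            stale[host] = f"unresolved: {exc}"
            continue
        if ip not in expected_ips:
            stale[host] = ip
    return stale


# Hypothetical records, used in place of real DNS for the example.
FAKE_RECORDS = {
    "app.example.internal": "10.0.0.5",
    "vanity.example.internal": "10.9.9.9",
}


def fake_resolve(host):
    if host not in FAKE_RECORDS:
        raise socket.gaierror("name not found")
    return FAKE_RECORDS[host]


stale = check_cutover(
    ["app.example.internal", "vanity.example.internal", "gone.example.internal"],
    expected_ips={"10.0.0.5"},
    resolve=fake_resolve,
)
print(stale)
```

Shorter TTLs matter here because a cached answer keeps returning the old address until it expires, so a check like this only converges once the old records have aged out.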
Validations were carried out before the impactful changes that the DNS cutover would introduce. This allowed the project to be rolled back without any impact on the apps or the business.
The extensive testing did indicate discrepancies; for instance, smoke tests consistently failed after the migration, and there were discrepancies between LRPs running on source and target, but the team was able to perform rollbacks to allow time to deal with these issues.
The rollbacks indicated that better data was needed. A network tester was built to identify all egress traffic destinations, which were provided as endpoints to an app we deployed to the CF we were migrating. We then attempted to hit these endpoints to identify any unreachable network paths. We also established better network monitoring and metrics to help us identify the apps that were crashing. Ultimately, using the new metrics and logs, we were able to ascertain the main cause of the app crashes, which revolved around the use of the new AWS account and the ingress access rules linked to a security group.
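The idea behind the network tester can be sketched in a few lines. This is a minimal illustration, not the tool that was built: it assumes the egress destinations are available as host and port pairs and treats a successful TCP connection as proof that the network path is open. The function name and the example endpoints are hypothetical.

```python
import socket


def unreachable_endpoints(endpoints, timeout=2.0):
    """Try a TCP connection to each (host, port) and return the failures.

    Each failure is reported as (host, port, reason). A refused or
    timed-out connection usually points at a firewall or security
    group rule rather than at the application itself.
    """
    failures = []
    for host, port in endpoints:
        try:
            # Open and immediately close the connection; only reachability matters.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:  # covers refusals, timeouts, and DNS errors
            failures.append((host, port, str(exc)))
    return failures
```

Run from inside an app container on the target platform, a call such as `unreachable_endpoints([("db.example.internal", 5432), ("queue.example.internal", 5671)])` would list the destinations that the new network path cannot reach.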
To successfully migrate all CF applications to the target CF, we temporarily deployed the Diego cells into the old AWS account. After the migration, S&W worked with application teams to update their external AWS resources access rules to use a new security group defined in both AWS accounts.
A migration this large requires strong support from management and application teams, and that support was a significant factor in the project’s success. Fortunately, there was immediate buy-in from management, who were very supportive throughout the process. S&W was able to work in partnership with Liberty Mutual’s many stakeholders, ensuring transparency, especially around the challenges, while prioritizing the needs of Liberty Mutual and its customers. The migration project was completed without any detrimental effect on the business.