Blog posts by Dr. Xiujiao Gao 高秀娇, Stark & Wayne

When You Have to Seek Truth in the Code

When Documentation is Misleading

Have you ever doubted documentation? Have you ever dug into the code and found that the documentation is misleading?

If the answer to either question is yes, you have come to the right place.

Please do not get me wrong: I love documentation. I learn a lot from it and solve issues by following it. Maintaining accurate, readable, and up-to-date documentation is not easy. We tend to improve our tools and software faster than their documentation, and sometimes the people who write the documentation are not the same people who develop the toolset, so details get lost along the way. How to write and maintain good, up-to-date documentation is a whole different topic. In this blog, we focus on how to find the truth beneath not-so-accurate or confusing documentation.

Peeling Onions Layer by Layer

I call this approach "peeling the onion layer by layer"; the one thing the two have in common is that you will likely end up with tears. I will use a simplified real case from our work as an example to walk through how this method works.

One of our clients asked us a question: "Which port does the BOSH vSphere CPI use to talk to vCenter, 80 or 443?" You could capture network traffic if you have access, but in our case, we were expected to find the answer through documentation and code.
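
If you do have access, a quick packet capture on the machine running the CPI settles the question immediately. A minimal sketch, assuming you can reach the vCenter host (the IP is a placeholder):

# Capture CPI traffic to vCenter; traffic on 443 (and none on 80) answers the question.
sudo tcpdump -nn -i any host VCENTER_IP and '(port 443 or port 80)'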

We peeled three layers (documentation => configs => source code) and found the truth at the core: the code.

A quick search through the documentation turned up the following:

The vSphere CPI requires access to port 80/443 for all the ESXi hosts in your vSphere resource pool(s)

This is helpful, yet it still did not tell us exactly which port was being used. Next, we needed to look at the configs to see which port was configured.

From further documentation, we found an example CPI config. Unfortunately, it does not have a field that lets you configure the port.

cpis:
- name: ((vcenter_identifier))
  type: vsphere
  properties:
    host: ((vcenter_ip))
    user: ((vcenter_user))
    password: ((vcenter_password))
    datacenters:
    - clusters:
      - { ((vcenter_cluster)): {}}
      datastore_pattern: ((vcenter_datastores_pattern))
      disk_path: ((folder_to_put_disks_in))
      name: ((vcenter_datacenter))
      persistent_datastore_pattern: ((vcenter_persistent_datastores_pattern))
      template_folder: ((folder_to_put_templates_in))
      vm_folder: ((folder_to_put_vms_in))

We double-checked the spec file in case the port was being set through a default value. As in the example above, it was not set, nor was the port listed as a configurable property in the spec.

From the job template, we found that the port never gets passed in.

params = {
    "cloud" => {
      "plugin" => "vsphere",
      "properties" => {
        "vcenters" => [
          {
            "host" => vcenter_host,
            "user" => p('vcenter.user'),
            "password" => p('vcenter.password'),
            "datacenters" => [],
            "default_disk_type" => p('vcenter.default_disk_type'),
            "enable_auto_anti_affinity_drs_rules" => p('vcenter.enable_auto_anti_affinity_drs_rules'),
            "upgrade_hw_version" => p('vcenter.upgrade_hw_version'),
            "enable_human_readable_name" => p('vcenter.enable_human_readable_name'),
            "http_logging" => p('vcenter.http_logging')
          }
        ],
        "agent" => {
          "ntp" => p('ntp')
        }
      }
    }
  }

Since the port is not configurable, we needed to find out what default value is set in the code. We believed it should be 443; however, we dug into the code to verify our assumption.

Starting from the Gemfile in the source code, we traced how the HTTP(S) request is created and where the port is set to 443 by default.
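
If you want to chase the default yourself, one way is to pull down the CPI source and search it. A rough sketch, assuming the usual repository layout (adjust paths as needed):

# Fetch the vSphere CPI release source and look for where the port is defaulted.
git clone https://github.com/cloudfoundry/bosh-vsphere-cpi-release.git
grep -rn --include='*.rb' '443' bosh-vsphere-cpi-release/src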

That was such a fun experience!

Thanks

Thank you for reading my blog; I hope you enjoyed it.

Great appreciation goes to @jdee; this blog was inspired by watching him figure out the problem we described in this post.

CF Push App: ERR Downloading Failed

What you can easily see is usually a consequence or symptom, not the root cause.

"The Dev environment is down!!!" We heard louder screams from developers when the busy dev env was down than when the production environment was down. We dropped whatever we were doing and wanted to stop those screams as fast as we can. We pushed an app and saw the following logs:

[cell/0] Creating container for app xx, container successfully created
[cell/0] ERR Downloading Failed
[cell/0] OUT cell-xxxxx stopping instance, destroying the container
[api/0] OUT process crashed with type: "web"
[api/0] OUT app instance exited

If you care more about the root cause than the process of figuring it out, skip ahead to the end of this post.

It told us that "Downloading Failed", but it will never directly tell us what failed to download. With some knowledge of how an app is pushed, staged, and run, we were easily able to guess that it was the droplet download that had failed, because the next step after creating the container would be the cell fetching the droplet and then running the app in it, if everything worked as expected. However, we still did not know the root cause of "Downloading Failed".

That is where the fun comes from: it is our chance to feel smart again by figuring it out! :)

We ran "bosh ssh" to the cell node and looked at the logs, bad tls showed up in the log entries. With this bad tls information, we knew that the certificates had some issues. Unfortunately, the logs will never tell you exactly which certificates are the problematic ones.

In our case, we use the safe CLI tool to manage all of the certificates, which are stored in Vault. safe has a command, "safe x509 validate [path to the cert]", which we can use to inspect and validate certificates. With a simple script, we looped through all of the certificates used in the misbehaving CF environment and ran "safe x509 validate" against each one.
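
A minimal sketch of such a script, assuming safe is already targeted at the right Vault and the environment's secrets live under secret/my-env/ (the prefix is a placeholder):

# Walk every path under the environment's prefix and validate any stored certificate.
for path in $(safe paths secret/my-env/); do
  safe get "${path}:certificate" >/dev/null 2>&1 || continue   # skip paths with no cert
  echo "=== ${path}"
  safe x509 validate "${path}"
done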

The output told us that the following certificates were expired (the root cause!):


api/cf_networking/policy_server_internal/server_cert
syslogger/scalablesyslog/adapter/tls/cert
syslogger/scalablesyslog/adapter_rlp/tls/cert
loggregator_trafficcontroller/reverse_log_proxy/loggregator/tls/reverse_log_proxy/cert
bbs/silk-controller/cf_networking/silk_controller/server_cert
bbs/silk-controller/cf_networking/silk_daemon/server_cert
bbs/locket/tls/cert
diego/scheduler/scalablesyslog/scheduler/tls/api/cert
diego/scheduler/scalablesyslog/scheduler/tls/client/cert
cell/rep/diego/rep/server_cert
cell/rep/tls/cert
cell/vxlan-policy-agent/cf_networking/vxlan_policy_agent/client_cert
cell/silk-daemon/cf_networking/silk_daemon/client_cert

If you are not using safe, you can also use the openssl command or other such commands to view the dates for certificates.

$ openssl x509 -noout  -dates -in cert_file
notBefore=Jul 13 22:25:49 2018 GMT
notAfter=Jul 12 22:25:49 2019 GMT

We then ran "safe x509 renew" against all of the expired certificates. After double checking that all of the expired certificates were successfully renewed, we then redeployed the CF in order to update the certificates.

The redeployment went well for the most part, except when it came to the cell instances: it hung on the first one forever. We then tried redeploying with the "--skip-drain" flag; unfortunately, this did not solve the issue. We next observed that the certificates on the bbs node had been successfully updated to the new ones, while the certificates on the cell nodes were still the old ones. Hmm... this would mean that the bbs and cell nodes could not talk to each other.

Everyone needs a little help sometimes, so does BOSH.

Without digging further into what exactly was making the cell updates hang forever, we decided to give BOSH a little help. We ran "bosh ssh" into the cell that was hanging, replaced all of the expired certificates in the config files manually, and then ran "monit restart all" on the cell. This nudged the redeploy into moving forward happily. We got a happy, running dev CF back, and the world finally quieted down.

The story should never end here, because a good engineer will always try to fix the problem before it becomes a real issue.

Our awesome coworker Tom Mitchell wrote Doomsday.

Doomsday is a server (and also a CLI) which can be configured to track certificates from different storage backends (Vault, Credhub, Pivotal Ops Manager, or actual websites) and provide a tidy view into when certificates will expire.

Deploy Doomsday, rotate your certs before they expire, and live a happier life!

Default Password for BOSH VMs

The default username for BOSH VMs is vcap. We have two options when it comes to the vcap password for BOSH and the VMs it deploys. One is to harden the vcap password, and the other is to let BOSH generate random vcap passwords for the VMs it deploys.

Harden Password in Manifest/Cloud Config

We can use env.bosh.password to set a password in resource pools or VM types in cloud configs. All the VMs associated with the resource pool or VM type will use the same password. If we only want to set a password for a specific instance, we can set it in instance groups.

The password configured in the manifest has to be the SHA-512 hash of the password. You can run mkpasswd -s -m sha-512 to generate one. You will need to run apt install whois on a Linux VM to get mkpasswd if you don't have it.
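
For example, on a Debian/Ubuntu VM:

# Generate the SHA-512 crypt hash that goes into env.bosh.password.
apt install whois        # provides mkpasswd
mkpasswd -s -m sha-512   # enter the password when prompted
# Output looks like: $6$<salt>$<hash>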

Example of setting a password in resource_pools:

resource_pools:
  - name: my-job
    cloud_properties: {}
    network: default
    env:
      bosh:
        password: sha-512 HASH

Example of setting a password in vm_types:

vm_types:
- name: medium
  cloud_properties: {}
  env:
    bosh:
      password: sha-512 HASH

Example of setting a password for a specific instance:

instance_groups:
- name: my-instance-name
  env:
    bosh:
      password: sha-512 HASH

Let BOSH Generate Random Passwords for the VMs It Deploys

BOSH v255.4 and above supports automatically generating a random password for each VM that BOSH deploys. You can simply enable this feature in the BOSH manifest as below.

properties:
  director:
    generate_vm_passwords: true

How to Use Both Options in a Smart Way

Given these two options, I suggest that for bosh create-env we harden the password, since there is no bosh ssh available when you need to get into the BOSH Director itself. For all other BOSH VMs, we can let BOSH generate passwords randomly; most of the time we can use bosh ssh to access the deployed VMs when needed.

However, there are situations where you cannot run bosh ssh successfully. For example, on AWS your deployment may fail the first time you try to deploy.

You will need to ssh to the VM to look at the agent logs. Unfortunately, VMs are terminated and deleted when a deployment fails, so you cannot run bosh ssh. You cannot ssh even if you have the private key for the VM.

In order to keep the failed deployment VM alive, we can set it in the BOSH manifest as follows:

instance_groups:
- name: bosh
  properties:
    director:
      debug:
        keep_unreachable_vms: true

Now the VM is not deleted even when the deployment fails. We can ssh to the VM as the vcap user using the private key, but we still cannot sudo since we do not know the vcap password. This is where the method from the first section comes in handy: we can simply configure env.bosh.password in our instance group and redeploy.

I would like to point out that the same method also works for compilation VMs, which is very helpful when we need to debug them.

Migrate BOSH/Cloud Foundry (CF) Disks from vSphere Datastore(s) to Different Ones

Migrating disks for BOSH and Cloud Foundry (CF) VMs from the current datastore(s) to new datastore(s) can be painless if you do it right. Otherwise, you may end up losing your persistent disks and then your mind.

The following steps will help you avoid unnecessary pain during this process.

1) Attach the new datastore(s) to the hosts where the BOSH and CF VMs are running (do not detach the old datastores yet).

2) Change the deployment manifest for the BOSH Director to configure the vSphere CPI to reference the new datastore(s):

properties:
  vsphere:
    host: your_host
    user: root
    password: something_secret
    datacenters:
    - name: BOSH_DC
      vm_folder: sandbox-vms
      template_folder: sandbox-templates
      disk_path: sandbox-disks
      datastore_pattern: '\new-sandbox\z' # <---
      persistent_datastore_pattern: '\new-sandbox\z' # <---
      clusters: [SANDBOX]

3) Redeploy the BOSH Director. Depending on how you deploy the BOSH Director, the command you run in this step can vary (a rough sketch follows this list).

4) Verify that the BOSH Director VM's root, ephemeral, and persistent disks are all now on the new datastore(s).

5) Run bosh deploy --recreate for the CF deployments so that VMs are recreated and persistent disks are reattached.

6) Verify that the persistent disks and VMs were moved to the new datastore(s) and no disks remain in the old datastore(s).
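
A rough command sketch for steps 3, 5, and 6, assuming a create-env-based Director and a CF deployment named cf (file names and the environment alias are placeholders):

# Step 3: redeploy the BOSH Director with the updated CPI configuration.
bosh create-env bosh.yml --state=state.json --vars-store=creds.yml
# Step 5: recreate the CF VMs so persistent disks are reattached on the new datastore(s).
bosh -e my-env -d cf deploy --recreate cf.yml
# Step 6: confirm all VMs are running and no disks were left behind.
bosh -e my-env -d cf vms
bosh -e my-env disks --orphaned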

How to Migrate Your CF from One vSphere Cluster to Another

Recently, one of our clients had to migrate their CF from one vSphere cluster to another. Here is the story: the client bought some more modern UCS chassis that they would like to add to the existing cluster. Enhanced vMotion Compatibility (EVC) must be enabled to support mixed processors in the same cluster, and you can't enable EVC while a single VM is running on it. In order to enable the EVC feature, they needed to migrate the whole CF to a new cluster. Since it is a heavily used CF environment, they wanted zero or minimal downtime for the migration.

The solution should have been simple, since vSphere has the vMotion feature. The steps we came up with were: disable BOSH resurrection, create a new cluster in the same vCenter, vMotion the CF VMs to the new cluster, enable EVC on the old cluster, vMotion the CF back to the old cluster, and re-enable BOSH resurrection. Zero downtime, and no user should have noticed that we migrated CF between the two clusters.

A small challenge arose when we found out that we can't vMotion between the two clusters while VMs are running, due to CPU compatibility issues between them (for more details, see vMotion between vSphere Clusters). This means we had to power off CF VMs before we could vMotion them, which means possible downtime for the platform.

Given the situation, we came up with the following working solution:

1) Turn off BOSH resurrection; otherwise, BOSH will try to self-recover/recreate the VMs that are down while you migrate.

2) Run bosh stop on a subgroup of the VMs so that VMs of the same type are still running to keep the platform working. By default (without the --hard flag), bosh stop stops the jobs while keeping the VM and its persistent disk. A command sketch follows this list.

3) Power off those VMs and vMotion them to the new cluster created beforehand.

4) After vMotion, bring the VMs in the new cluster up.

5) Repeat the above process until you have migrated all of the VMs over to the new cluster.

6) Delete or rename the old cluster.

7) Rename the new cluster to the old cluster's name.

8) Turn the BOSH resurrection back on.
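
A sketch of the BOSH side of this process, assuming a CF deployment named cf and diego-cell/0 as the first instance in a subgroup (both placeholders):

bosh -e my-env update-resurrection off      # step 1
bosh -e my-env -d cf stop diego-cell/0      # step 2: soft stop, jobs stopped, VM and disk kept
# step 3: power off the VM in vCenter and vMotion it to the new cluster
bosh -e my-env -d cf start diego-cell/0     # step 4: bring it back up in the new cluster
# repeat for the remaining VMs (step 5), then:
bosh -e my-env update-resurrection on       # step 8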

BOSH Director and CF VMs Time Drift

When the time on the BOSH Director and CF VMs (such as cells) is off, it may throw off some of your applications with unusual or unexpected errors. One cause of time drift is updating the NTP servers on your BOSH Director without recreating the CF VMs that are already deployed.

If you see time drift between your BOSH Director and CF VMs, there are some steps you can follow to fix it. The basic idea is to first make sure the correct NTP servers are configured in the BOSH Director's manifest and that BOSH deploys successfully, then make sure the NTP settings propagate to the CF VMs deployed by BOSH.

More specifically, first make sure the right NTP servers are configured in the manifest for the BOSH Director itself, and redeploy the Director if you changed the manifest. When you ssh to the BOSH Director, take a look at /var/vcap/bosh/etc/ntpserver; you should see the same NTP servers listed there. Also check /var/vcap/bosh/settings.json; you should find an ntp block with the correct NTP information.

Next, bosh ssh to the CF VMs that are affected by time drift and take a look at /var/vcap/bosh/etc/ntpserver; you will probably see the old or default NTP servers. Also check /var/vcap/bosh/settings.json; you will probably find an ntp block with the old or default NTP servers there as well.

In this case, you can bosh recreate the CF VMs that have time drift. After recreation, you will see that the NTP servers on the CF VMs have been updated to the same ones configured on the BOSH Director, and the time drift issue should now be fixed. You can run bosh recreate with the --max-in-flight= flag to control how many VMs are updated at the same time.
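
A quick sketch of the check and the fix, assuming a deployment named cf and an affected cell instance group (both placeholders):

# Inspect the NTP servers currently configured on a cell.
bosh -e my-env -d cf ssh diego-cell/0 -c 'cat /var/vcap/bosh/etc/ntpserver'
# Recreate the affected instance group, a few VMs at a time.
bosh -e my-env -d cf recreate diego-cell --max-in-flight=2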

Deploy HA CF with Anti-Affinity DRS Rules in vSphere

VM-VM Affinity Rules

vSphere VM-VM Affinity Rules specify whether selected individual VMs should run on the same host or be kept on separate hosts. You can either set a Distributed Resource Scheduler (DRS) Affinity Rule, which keeps all the VMs on the same host, or a DRS Anti-Affinity Rule, which requires that each VM runs on its own host. A DRS Anti-Affinity Rule can be used to achieve host-level High Availability (HA): when the VMs that run the same jobs sit on different hosts, if one host goes down, you still have other nodes running and working.

Note: the number of hosts must be equal to or greater than the number of VMs you want to place under an anti-affinity DRS rule.

HA in CF Deployment

To achieve HA in a CF deployment, you can horizontally scale most Cloud Foundry components to multiple instances across different Availability Zones (AZs), in this scenario, across different ESXi hosts. Using DRS Anti-Affinity Rules to achieve host-level HA has two limitations: first, the number of hosts must not be less than the number of instances you want to apply the rule to; second, you have to set a DRS Anti-Affinity Rule for each type of instance you want host-level HA for, while you should not create too many Anti-Affinity Rules in one cluster.

You could also use a vSphere Resource Pool to achieve host-level HA, but then you have to deal with load balancing among all the hosts, since the number of instances for different CF components can be 2, 3, or many more. How to use a vSphere Resource Pool to achieve HA in CF is out of this blog's scope.

Let's continue with how to use a DRS Anti-Affinity Rule in a CF deployment. We will use the consul_etcd instances in a CF deployment as an example to show how to configure a VM-VM anti-affinity DRS rule so that each consul_etcd node lands on its own host.

Before we can apply an Anti-Affinity DRS Rule to the consul_etcd job in a CF deployment, we need to enable DRS Automation for the cluster we are deploying VMs into.

Enable DRS Automation

In your vSphere Web Client, right-click on the cluster, choose Settings, then click vSphere DRS, turn on DRS automation, and pick an automation level such as Fully Automated or Partially Automated.

Next, we need to configure a vm_type with an Anti-Affinity DRS rule in the cloud config and apply it to the consul_etcd job in the CF deployment manifest.

Define vm_type with Anti-Affinity DRS Rule in Cloud Config

In your cloud config, define a vm_type named consul_etcd as follows. This adds an Anti-Affinity DRS rule with the separate_vms rule type to the vSphere cluster.

vm_types:
- name: consul_etcd
  cloud_properties:
    datacenters:
    - name: my-dc
      clusters:
      - my-vsphere-cluster:
          drs_rules:
          - name: separate-consul-etcd-rule
            type: separate_vms
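
After editing, upload the cloud config to the Director so the new vm_type becomes available (the environment alias is a placeholder):

bosh -e my-env update-cloud-config cloud-config.yml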

Configure Manifest

In the manifest that deploys CF, configure the consul_etcd instance group to use consul_etcd as its vm_type. In a deployment repo generated using genesis, you can add the following configuration to your environment yml file. This configuration will create 3 VMs on 3 different ESXi hosts in the cluster specified in the cloud config above. Without the DRS Anti-Affinity Rule, it quite often places the 3 VMs on only 2 hosts, even when the total number of hosts is 3.

instance_groups:
- name: consul_etcd
  instances: 3
  vm_type: consul_etcd
  persistent_disk_type: consul
  networks:
  - your_consul_network_in_cc
  jobs:
    - name: etcd
      release: etcd
      properties:
        etcd:
        ...
    - name: metron_agent
    ....

Now you are ready to deploy. You will see that the 3 consul nodes are placed on 3 different ESXi hosts.

A Handy S3 CLI

Do you ever get annoyed that you have to install Python, pip, and then AWS CLI in order to simply access your S3 storage to manage your buckets?

I know once in a while, I do.

Then this awesome guy, James Hunt, showed me a handy S3 CLI tool he wrote. It is simple, but it gets the job done. It supports listing buckets, creating/deleting buckets, uploading/deleting files, etc.

Download the binary for your OS from the project's releases page, name it s3, make it executable, and put it in your PATH. Even cooler, s3 is available through Homebrew. Simply run the following commands:

brew tap jhunt/hacks
brew install s3

You can now run s3 (or s3 commands) to see all of the available sub-commands.

General usage: s3 COMMAND [OPTIONS...]
  acls            List known ACLs and their purposes / access rules.
  commands        List known sub-commands of this s3 client.
  list-buckets    List all S3 buckets owned by you.
  create-bucket   Create a new bucket.
  delete-bucket   Delete an empty bucket.
  put             Upload a new file to S3.
  get             Download a file from S3.
  cat             Print the contents of a file in S3.
  url             Print the HTTPS URL for a file in S3.
  rm              Delete file from a bucket.
  ls              List the files in a bucket.
  chacl           Change the ACL on a bucket or a file.
  lsacl           List the ACL on a bucket or a file.

To access your Amazon S3, you can set up the following environment variables.

S3_AKI: Your Access Key ID.
S3_KEY: Your secret access key.
S3_REGION: The name of the AWS region.
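
A typical session might look like the following; the credential values and bucket name are placeholders, and you can run s3 commands to confirm the exact argument forms:

export S3_AKI=AKIAIOSFODNN7EXAMPLE
export S3_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export S3_REGION=us-east-1

s3 list-buckets           # list all buckets owned by this key
s3 create-bucket backups  # create a new, empty bucket
s3 ls backups             # list the files in it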

If you are looking for more complicated operations in your Amazon S3 and other cloud storage service providers that use the S3 protocol, such as Google Cloud Storage or DreamHost DreamObjects, check s3cmd out.

Configure UAA in CF with SAML as A Service Provider

Before we start going through how to configure UAA in CF with SAML as a Service Provider, let's make sure we have common terminology.

UAA

The User Account and Authentication (UAA) is the OAuth2 server used as the identity management service for Cloud Foundry (CF).

UAA supports standard protocols such as the Security Assertion Markup Language (SAML) and Lightweight Directory Access Protocol (LDAP) to provide Single Sign-On (SSO) service.

SAML

SAML is an XML-based, open-standard data format for exchanging authentication and authorization data between a Service Provider (SP) and an Identity Provider (IDP).

The SP trusts the IDP to authenticate users, and the IDP generates an authentication assertion which is sent to the SP to indicate that a user has been authenticated.

A common case is setting Active Directory Federation Services (ADFS) as the IDP and using SAML for single sign-on (SSO) to the SP.

SAML Integration on UAA

UAA can be configured as either an SP or an IDP. Typically, UAA is the SP, and an external provider, such as Okta or Active Directory Federation Services (ADFS) (https://msdn.microsoft.com/en-us/library/bb897402.aspx), is the IDP.

We must configure both the UAA SP and the external SAML IDP when we set up SAML integration on UAA. A misconfiguration on either side will cause authentication to fail.

Now that we have the basic concepts defined, we will walk you through how to configure UAA in CF with SAML as an SP.

Configure UAA in CF with SAML as An SP

Configure IDP

First, obtain the UAA SP metadata from the following endpoint and save it into a file.

https://login.YOUR-CF-SYSTEM-DOMAIN/saml/metadata
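
For example, you can fetch it with curl (the system domain is a placeholder):

# Save the UAA SP metadata so it can be imported into the IDP.
curl -sk "https://login.YOUR-CF-SYSTEM-DOMAIN/saml/metadata" -o uaa-sp-metadata.xml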

Next, import this SAML SP configuration into your external IDP. Different IDPs have different instructions for importing SP metadata, so we will skip the details of this step.

Configure UAA SP

First, obtain the IDP metadata from your external IDP provider.

Next, we will configure the UAA SP in the CF manifest and redeploy CF to make the configuration take effect. No matter how you manage your CF manifests and deployments, the following configuration is needed in your CF manifest.

Note that the key/cert are usually generated automatically as part of the CF manifest, using the same root CA used by UAA.

name: uaa
jobs:
- name: uaa
  properties:
    login:
      saml:
        # Provider Information Configs
        providers:
          # Example
          myProvider:
            nameID: urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress
            idpMetadata: the metadata itself or a link to it
            showSamlLoginLink: true
            linkText: Log in with XX IDP
            metadataTrustCheck: false
        #The active key is used for signing messages and the key to be used to encrypt messages.
        activeKeyId: key1
        keys:
          key1:
            key: #uaa login saml key
            certificate: # uaa login saml certificate
            passphrase: ""
          # you can add multiple keys such as key1, key2...

After you complete the configuration on both sides, you can go ahead and verify that your SAML integration with UAA in CF works.
