VCP-NV: Troubleshooting NSX Controllers and ESXi Connectivity

admin, June 9, 2015

This post is the last part of my series looking at the VCP-NV objectives. It covers some of the tools and commands you can use to troubleshoot an NSX implementation, starting with commands that help with troubleshooting NSX controllers, then moving on to commands that can be run on ESXi hosts to verify connectivity to the controllers.

Troubleshooting NSX Controllers

As the control plane of your NSX virtual network, NSX controllers are an extremely important component. If your controllers aren’t available or working correctly then NSX will not function. To help avoid this scenario, controllers are generally deployed in a group of three (or five, for added resilience), which allows for a controller failure without impacting the virtual network (two controller failures, if five controllers are deployed).
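The failure figures above follow from simple majority arithmetic. The sketch below is illustrative only, and assumes the controller cluster stays available as long as a strict majority of its nodes is up (the function name is my own, not part of any NSX tooling):

```python
def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can fail while a strict majority survives."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    # A strict majority is more than half of the nodes.
    majority = cluster_size // 2 + 1
    return cluster_size - majority


for size in (3, 5):
    print(size, "controllers tolerate", tolerated_failures(size), "failure(s)")
```

Under that assumption, an even-sized cluster tolerates no more failures than the odd-sized cluster one node smaller, which is one reason odd numbers are the usual choice.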

The first place to check the status of the NSX controllers is the ‘Installation’ page, under ‘Networking & Security’ in the vSphere Web Client.

To get more detail on the NSX controller nodes, we will need to use the command line interface. Once connected to an NSX controller, via SSH, run ‘show control-cluster status’ to view the controller cluster status:

nsx-controller # show control-cluster status
Type                Status                                       Since
--------------------------------------------------------------------------------
Join status:        Join complete                                06/08 07:34:54
Majority status:    Connected to cluster majority                06/08 07:37:19
Restart status:     This controller can be safely restarted      06/08 07:37:12
Cluster ID:         c122291b-9f04-42a6-bf18-932a13f7385a
Node UUID:          c122291b-9f04-42a6-bf18-932a13f7385a

Role                Configured status   Active status
--------------------------------------------------------------------------------
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
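If you capture this output as text (for example from a scripted SSH session), the first table can be checked automatically. The following is a minimal sketch, not an official tool: the function names are hypothetical, and treating ‘Join complete’ plus ‘Connected to cluster majority’ as the healthy state is my own assumption based on the output above.

```python
import re

# The first table from 'show control-cluster status', as shown above.
STATUS_OUTPUT = """\
Type                Status                                       Since
--------------------------------------------------------------------------------
Join status:        Join complete                                06/08 07:34:54
Majority status:    Connected to cluster majority                06/08 07:37:19
Restart status:     This controller can be safely restarted      06/08 07:37:12
"""


def parse_status(text):
    """Map each '... status:' row to the text in its Status column."""
    statuses = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 2 and parts[0].endswith("status:"):
            statuses[parts[0].rstrip(":")] = parts[1]
    return statuses


def is_healthy(statuses):
    """Assumed health rule: node has joined and can see the majority."""
    return (
        statuses.get("Join status") == "Join complete"
        and statuses.get("Majority status") == "Connected to cluster majority"
    )


print(is_healthy(parse_status(STATUS_OUTPUT)))
```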

To check recent events, you can run the ‘show control-cluster history’ command:

nsx-controller # show control-cluster history
===================================
Host nsx-controller
Node c122291b-9f04-42a6-bf18-932a13f7385a (172.16.1.70, nicira-nvp-controller.4.0.5.39275)
  05/11 11:55:43: Node started for the first time
  05/11 11:55:45: Joining cluster via node 172.16.1.70
  05/11 11:55:45: Waiting to join cluster
  05/11 11:55:45: Role api_provider configured
  05/11 11:55:45: Role directory_server configured
  05/11 11:55:45: Role switch_manager configured
  05/11 11:55:45: Role logical_manager configured
  05/11 11:55:45: Role persistence_server configured
  05/11 11:55:45: Joined cluster; initializing local components
  05/11 11:55:45: Disconnected from cluster majority
  05/11 11:55:55: Connected to cluster majority
  05/11 11:55:58: Initializing data contact with cluster

You can list the controllers that make up the controller cluster by running ‘show control-cluster startup-nodes’:

nsx-controller # show control-cluster startup-nodes
172.16.1.70,172.16.1.71,172.16.1.73

And you can list the controller roles with ‘show control-cluster roles’:

The output shows whether the controller is master for a given role. The controller in the above example isn’t master for any. You can list connections to the controller with:

nsx-controller # show control-cluster connections
role                port            listening open conns
--------------------------------------------------------
api_provider        api/443         Y         1
--------------------------------------------------------
persistence_server  server/2878     -         0
                    client/2888     Y         1
                    election/3888   -         0
--------------------------------------------------------
switch_manager      ovsmgmt/6632    Y         0
                    openflow/6633   Y         0
--------------------------------------------------------
system              cluster/7777    Y         0
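The connections table is easy to turn into structured data, which helps when comparing several controllers. This is a rough sketch under the assumption that the layout matches the output above, with continuation rows leaving the role column blank. The function name and dictionary keys are my own.

```python
# Output of 'show control-cluster connections', as shown above.
CONNECTIONS_OUTPUT = """\
role                port            listening open conns
--------------------------------------------------------
api_provider        api/443         Y         1
--------------------------------------------------------
persistence_server  server/2878     -         0
                    client/2888     Y         1
                    election/3888   -         0
--------------------------------------------------------
switch_manager      ovsmgmt/6632    Y         0
                    openflow/6633   Y         0
--------------------------------------------------------
system              cluster/7777    Y         0
"""


def parse_connections(text):
    """Return one dict per port row, carrying the role down to blank rows."""
    rows = []
    current_role = None
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 4 and "/" in tokens[1]:
            # Row that names its role: role, name/port, listening, conns.
            current_role = tokens[0]
            tokens = tokens[1:]
        elif not (len(tokens) == 3 and "/" in tokens[0]):
            continue  # header or separator line
        name, port = tokens[0].split("/")
        rows.append({
            "role": current_role,
            "name": name,
            "port": int(port),
            "listening": tokens[1] == "Y",
            "open_conns": int(tokens[2]),
        })
    return rows


for row in parse_connections(CONNECTIONS_OUTPUT):
    if not row["listening"]:
        print(row["role"], row["name"], row["port"], "is not listening")
```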

And you can view controller statistics with:

nsx-controller # show control-cluster core stats
messages.received               0
messages.received.dropped       0
messages.transmitted            200
messages.transmit.dropped       0
messages.processing.dropped     0
connections.up                  129
connections.down                129
connections.timeout             0
connections.active              0
connections.sharding.subscribed 0
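When watching these counters over time, it can be handy to load them into a dictionary and flag any non-zero ‘dropped’ values. The sketch below assumes the simple name/value layout shown above. The idea that non-zero dropped counters deserve a closer look is my own rule of thumb, not documented guidance.

```python
# Output of 'show control-cluster core stats', as shown above.
STATS_OUTPUT = """\
messages.received               0
messages.received.dropped       0
messages.transmitted            200
messages.transmit.dropped       0
messages.processing.dropped     0
connections.up                  129
connections.down                129
connections.timeout             0
connections.active              0
connections.sharding.subscribed 0
"""


def parse_stats(text):
    """Turn 'name value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2 and tokens[1].isdigit():
            stats[tokens[0]] = int(tokens[1])
    return stats


stats = parse_stats(STATS_OUTPUT)
# Collect any drop counters that are above zero.
dropped = {name: value for name, value in stats.items()
           if name.endswith("dropped") and value > 0}
print(dropped or "no dropped messages")
```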

You can get more detail on the connections to and from a controller by running:

nsx-controller # show network connections of-type tcp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:9998          0.0.0.0:*               LISTEN      1817/domain
tcp        0      0 127.0.0.1:9999          0.0.0.0:*               LISTEN      1799/java
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1103/python
tcp        0      0 127.0.0.1:8081          0.0.0.0:*               LISTEN      1072/python
tcp        0      0 0.0.0.0:30865           0.0.0.0:*               LISTEN      983/csync2
tcp        0      0 0.0.0.0:1234            0.0.0.0:*               LISTEN      1799/java
tcp        0      0 127.0.0.1:2003          0.0.0.0:*               LISTEN      1100/python
tcp        0      0 127.0.0.1:2004          0.0.0.0:*               LISTEN      1100/python

This is similar to running ‘netstat’.

Troubleshooting NSX Compute Nodes

When an ESXi host/cluster is configured for NSX, a number of VIBs are installed on the host so that it can participate in NSX virtual networks. These are called esx-vxlan, esx-vsip and esx-dvfilter-switch-security.

You can check these VIBs have been installed by running the following command on the ESXi host:
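One common way to list installed VIBs is ‘esxcli software vib list’. As a hedged sketch, the helper below takes the text of such a listing and reports which of the three VIB names are absent. It assumes the VIB name is the first column of each row, and the function name is hypothetical.

```python
# The three NSX VIBs named above.
REQUIRED_VIBS = ("esx-vxlan", "esx-vsip", "esx-dvfilter-switch-security")


def missing_vibs(vib_list_output, required=REQUIRED_VIBS):
    """Return the required VIB names not found in a VIB listing.

    Assumes each row of the listing starts with the VIB name.
    """
    installed = set()
    for line in vib_list_output.splitlines():
        tokens = line.split()
        if tokens:
            installed.add(tokens[0])
    return [name for name in required if name not in installed]
```

Feed it the captured listing from a host. An empty list back means all three VIBs were found.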

To check controller connectivity from the ESXi host you can run:

Or:

# esxcli network vswitch dvs vmware vxlan network list --vds-name 
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count  MTEP Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------  ----------
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)            1                1                0           0
    5004  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)            1                0
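To check many hosts, the ‘Controller Connection’ column can be pulled out of this table programmatically. The following is a minimal sketch that assumes the column layout shown above, with columns separated by at least two spaces. The function name is my own.

```python
import re

# Output of the 'vxlan network list' command, as shown above.
VXLAN_OUTPUT = """\
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count  MTEP Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------  ----------
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)            1                1                0           0
    5004  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)            1                0
"""


def controller_states(text):
    """Map each VXLAN ID to (controller IP, connection state)."""
    states = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # header or separator line
        match = re.fullmatch(r"(\S+)\s+\((\w+)\)", parts[3])
        if match:
            states[int(parts[0])] = (match.group(1), match.group(2))
    return states


for vni, (controller, state) in controller_states(VXLAN_OUTPUT).items():
    flag = "" if state == "up" else "  <-- check this host"
    print(vni, controller, state + flag)
```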

In the ‘Controller Connection’ column you can see the controller IP address and its status. If all is healthy, you should also see some established connections on port 1234, which the ‘netcpad’ service uses to connect to the NSX controller instance:

# esxcli network ip connection list| grep tcp | grep 1234
tcp         0       0  172.16.1.90:43954  172.16.1.70:1234  ESTABLISHED     44754  netcpa-worker
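The same check can be scripted against captured output. This sketch assumes the column order seen above (protocol, two queue counters, local address, foreign address, state), and the function name is hypothetical.

```python
# One line of 'esxcli network ip connection list' output, as shown above.
CONNECTION_LIST = (
    "tcp         0       0  172.16.1.90:43954  172.16.1.70:1234  "
    "ESTABLISHED     44754  netcpa-worker\n"
)


def established_controllers(text, port=1234):
    """Return remote IPs with an ESTABLISHED TCP session to the given port."""
    controllers = set()
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 6 or tokens[0] != "tcp":
            continue
        # Foreign address is the fifth column, in host:port form.
        host, _, remote_port = tokens[4].rpartition(":")
        if remote_port == str(port) and tokens[5] == "ESTABLISHED":
            controllers.add(host)
    return controllers


print(established_controllers(CONNECTION_LIST))
```

An empty set back would suggest the host has no live session to any controller on that port.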

If you find that a host isn’t connected to the controller, one possible step would be to restart the netcpad service on the host:

# /etc/init.d/netcpad restart

The service has its own log file, /var/log/netcpa.log, which is useful for troubleshooting.

Useful Links and Resources

https://pubs.vmware.com/NSX-6/topic/com.vmware.ICbase/PDF/nsx_60_cli.pdf
