This post is the last in my series looking at the VCP-NV objectives. It covers some of the tools and commands you can use to troubleshoot an NSX implementation, starting with those that help when troubleshooting NSX controllers, then moving on to commands you can run on ESXi hosts to verify connectivity to the controllers.
Troubleshooting NSX Controllers
As the control plane of your NSX virtual network, NSX controllers are an extremely important component: if your controllers aren’t available or aren’t working correctly, NSX will not function. To help avoid this scenario, controllers are generally deployed in a cluster of three (or five, for added resilience), which allows one controller to fail without impacting the virtual network (or two controllers, if five are deployed).
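The failure tolerance quoted above follows from majority arithmetic: a cluster of n controllers needs a strict majority of ⌊n/2⌋ + 1 nodes, so it can lose the remainder. The following is a minimal Python sketch, assuming the controller cluster simply requires a strict majority (as the ‘Connected to cluster majority’ status reported by the controller suggests). The helper names are hypothetical.

```python
# Hypothetical helpers: majority arithmetic for a cluster that needs a strict majority.


def majority_size(controllers: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if controllers < 1:
        raise ValueError("need at least one controller")
    return controllers // 2 + 1


def tolerated_failures(controllers: int) -> int:
    """How many nodes can fail while the survivors still form a majority."""
    return controllers - majority_size(controllers)


for n in (3, 4, 5):
    print(f"{n} controllers: majority={majority_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Note that a fourth controller tolerates no more failures than three, which is one reason cluster sizes are odd.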
The first place to check the status of the NSX controllers is the ‘Installation’ page, under ‘Networking & Security’ in the vSphere Web Client.
To get more detail on the NSX controller nodes, you will need to use the command line interface. Once connected to an NSX controller via SSH, run ‘show control-cluster status’ to view the status of the controller cluster:
nsx-controller # show control-cluster status
Type                Status                                  Since
--------------------------------------------------------------------------------
Join status:        Join complete                           06/08 07:34:54
Majority status:    Connected to cluster majority           06/08 07:37:19
Restart status:     This controller can be safely restarted 06/08 07:37:12
Cluster ID:         c122291b-9f04-42a6-bf18-932a13f7385a
Node UUID:          c122291b-9f04-42a6-bf18-932a13f7385a

Role                Configured status   Active status
--------------------------------------------------------------------------------
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
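If you want to check this output from a script (for example, across several controllers), a small parser is enough. This is a sketch only: it assumes the column layout shown above, treats the healthy values in this example as the expected ones, and `check_cluster_status` is a hypothetical helper, not part of the NSX CLI.

```python
import re
from typing import Dict, List

# Sketch; assumes the layout of 'show control-cluster status' shown above.
# Status rows look like "Join status:   Join complete   06/08 07:34:54".
STATUS_RE = re.compile(
    r"^(Join status|Majority status|Restart status):\s+(.+?)\s+\d{2}/\d{2} \d{2}:\d{2}:\d{2}$"
)
# Role rows look like "api_provider   enabled   activated".
ROLE_RE = re.compile(r"^([a-z_]+)\s+(\S+)\s+(\S+)$")

# Expected values taken from the healthy example output.
EXPECTED: Dict[str, str] = {
    "Join status": "Join complete",
    "Majority status": "Connected to cluster majority",
}


def check_cluster_status(output: str) -> List[str]:
    """Hypothetical helper: return problems found in the status output."""
    statuses: Dict[str, str] = {}
    problems: List[str] = []
    for raw in output.splitlines():
        line = raw.strip()
        status = STATUS_RE.match(line)
        role = ROLE_RE.match(line)
        if status:
            statuses[status.group(1)] = status.group(2)
        elif role:
            name, configured, active = role.groups()
            # Flag roles that are configured but not running.
            if configured == "enabled" and active != "activated":
                problems.append(f"role {name} is enabled but {active}")
    for key, wanted in EXPECTED.items():
        if statuses.get(key) != wanted:
            problems.append(f"{key}: {statuses.get(key)!r}, expected {wanted!r}")
    return problems


SAMPLE = """\
Join status:        Join complete                           06/08 07:34:54
Majority status:    Connected to cluster majority           06/08 07:37:19
Restart status:     This controller can be safely restarted 06/08 07:37:12
Role                Configured status   Active status
api_provider        enabled             activated
persistence_server  enabled             activated
switch_manager      enabled             activated
logical_manager     enabled             activated
directory_server    enabled             activated
"""

print(check_cluster_status(SAMPLE) or "controller cluster status looks healthy")
```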
To check recent events, you can run the ‘show control-cluster history’ command:
nsx-controller # show control-cluster history
===================================
Host nsx-controller
Node c122291b-9f04-42a6-bf18-932a13f7385a (172.16.1.70, nicira-nvp-controller.4.0.5.39275)
  05/11 11:55:43: Node started for the first time
  05/11 11:55:45: Joining cluster via node 172.16.1.70
  05/11 11:55:45: Waiting to join cluster
  05/11 11:55:45: Role api_provider configured
  05/11 11:55:45: Role directory_server configured
  05/11 11:55:45: Role switch_manager configured
  05/11 11:55:45: Role logical_manager configured
  05/11 11:55:45: Role persistence_server configured
  05/11 11:55:45: Joined cluster; initializing local components
  05/11 11:55:45: Disconnected from cluster majority
  05/11 11:55:55: Connected to cluster majority
  05/11 11:55:58: Initializing data contact with cluster
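The history is most useful for spotting majority flaps. In the example, the node briefly disconnected from the cluster majority while joining and reconnected ten seconds later. A minimal sketch that extracts the events and reports the latest majority state, assuming the ‘MM/DD HH:MM:SS: message’ layout above and that events are listed in chronological order (the helper names are hypothetical):

```python
import re
from typing import List, Optional, Tuple

# Sketch; assumes event rows look like "05/11 11:55:45: Joining cluster via node ...".
EVENT_RE = re.compile(r"^\s*(\d{2}/\d{2} \d{2}:\d{2}:\d{2}): (.+?)\s*$")
MAJORITY_EVENTS = ("Connected to cluster majority", "Disconnected from cluster majority")


def parse_history(output: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (timestamp, message) pairs in output order."""
    events = []
    for line in output.splitlines():
        match = EVENT_RE.match(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


def last_majority_event(events: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Most recent connect/disconnect-from-majority event (assumes chronological order)."""
    majority = [event for event in events if event[1] in MAJORITY_EVENTS]
    return majority[-1] if majority else None


SAMPLE = """\
Host nsx-controller
Node c122291b-9f04-42a6-bf18-932a13f7385a (172.16.1.70, nicira-nvp-controller.4.0.5.39275)
  05/11 11:55:43: Node started for the first time
  05/11 11:55:45: Joining cluster via node 172.16.1.70
  05/11 11:55:45: Waiting to join cluster
  05/11 11:55:45: Role api_provider configured
  05/11 11:55:45: Role directory_server configured
  05/11 11:55:45: Role switch_manager configured
  05/11 11:55:45: Role logical_manager configured
  05/11 11:55:45: Role persistence_server configured
  05/11 11:55:45: Joined cluster; initializing local components
  05/11 11:55:45: Disconnected from cluster majority
  05/11 11:55:55: Connected to cluster majority
  05/11 11:55:58: Initializing data contact with cluster
"""

events = parse_history(SAMPLE)
disconnects = [ts for ts, msg in events if msg == "Disconnected from cluster majority"]
print(f"{len(events)} events, {len(disconnects)} majority disconnect(s)")
print("latest majority event:", last_majority_event(events))
```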
You can list the controllers that make up the controller cluster by running ‘show control-cluster startup-nodes’:
nsx-controller # show control-cluster startup-nodes
172.16.1.70,172.16.1.71,172.16.1.73
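Since the node list is a single comma-separated line, it is easy to validate. The sketch below, assuming the output format shown above, checks that every entry is a valid IP address and warns on duplicates or an even cluster size; `parse_startup_nodes` is a hypothetical helper.

```python
import ipaddress
from typing import List

# Sketch; assumes the node list is the last line of the command output.


def parse_startup_nodes(output: str) -> List[str]:
    """Hypothetical helper: return controller IPs from 'startup-nodes' output."""
    last_line = output.strip().splitlines()[-1]
    nodes = []
    for part in last_line.split(","):
        part = part.strip()
        if part:
            # ip_address() raises ValueError for malformed entries.
            nodes.append(str(ipaddress.ip_address(part)))
    return nodes


output = """nsx-controller # show control-cluster startup-nodes
172.16.1.70,172.16.1.71,172.16.1.73"""

nodes = parse_startup_nodes(output)
print(len(nodes), "controllers:", ", ".join(nodes))
if len(nodes) != len(set(nodes)):
    print("warning: duplicate controller addresses")
if len(nodes) % 2 == 0:
    print("warning: an even-sized cluster tolerates no more failures than one node fewer")
```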
And you can list the controller roles with ‘show control-cluster roles’:
The output shows whether the controller is the master for a given role; the controller in the above example isn’t master for any role. You can list connections to the controller with ‘show control-cluster connections’:
nsx-controller # show control-cluster connections
role                port            listening open conns
--------------------------------------------------------
api_provider        api/443         Y         1
--------------------------------------------------------
persistence_server  server/2878     -         0
                    client/2888     Y         1
                    election/3888   -         0
--------------------------------------------------------
switch_manager      ovsmgmt/6632    Y         0
                    openflow/6633   Y         0
--------------------------------------------------------
system              cluster/7777    Y         0
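Because rows after the first in each role omit the role name, a parser has to carry the current role forward. The following is a sketch assuming the layout above; it lists the ports marked ‘-’ in the listening column without judging whether that is expected for a given role. `parse_connections` and `PortEntry` are hypothetical names.

```python
from typing import List, NamedTuple, Optional

# Sketch; assumes the layout of 'show control-cluster connections' shown above.


class PortEntry(NamedTuple):
    role: str
    name: str
    port: int
    listening: bool
    open_conns: int


def parse_connections(output: str) -> List[PortEntry]:
    """Hypothetical helper: one entry per port, with the role carried forward."""
    entries: List[PortEntry] = []
    role: Optional[str] = None
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) == 4 and "/" in tokens[1]:
            # First port of a role: "api_provider  api/443  Y  1"
            role, fields = tokens[0], tokens[1:]
        elif len(tokens) == 3 and "/" in tokens[0] and role is not None:
            # Further ports of the same role: "client/2888  Y  1"
            fields = tokens
        else:
            continue  # prompt, header or separator line
        name, port = fields[0].split("/")
        entries.append(PortEntry(role, name, int(port), fields[1] == "Y", int(fields[2])))
    return entries


SAMPLE = """\
role                port            listening open conns
--------------------------------------------------------
api_provider        api/443         Y         1
--------------------------------------------------------
persistence_server  server/2878     -         0
                    client/2888     Y         1
                    election/3888   -         0
--------------------------------------------------------
switch_manager      ovsmgmt/6632    Y         0
                    openflow/6633   Y         0
--------------------------------------------------------
system              cluster/7777    Y         0
"""

for entry in parse_connections(SAMPLE):
    if not entry.listening:
        print(f"not listening: {entry.role} {entry.name}/{entry.port}")
```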
And you can view controller statistics with ‘show control-cluster core stats’:
nsx-controller # show control-cluster core stats
messages.received                   0
messages.received.dropped           0
messages.transmitted                200
messages.transmit.dropped           0
messages.processing.dropped         0
connections.up                      129
connections.down                    129
connections.timeout                 0
connections.active                  0
connections.sharding.subscribed     0
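The counters are simple name/value pairs, so any non-zero ‘dropped’ counter can be flagged automatically. A minimal sketch, assuming one counter per line as shown above; `parse_core_stats` and `dropped_counters` are hypothetical helpers.

```python
from typing import Dict, List

# Sketch; assumes one "counter.name  value" pair per line.


def parse_core_stats(output: str) -> Dict[str, int]:
    """Hypothetical helper: map counter names to integer values."""
    stats: Dict[str, int] = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def dropped_counters(stats: Dict[str, int]) -> List[str]:
    """Names of any 'dropped' counters that are above zero."""
    return [name for name, value in stats.items() if "dropped" in name and value > 0]


SAMPLE = """\
messages.received                   0
messages.received.dropped           0
messages.transmitted                200
messages.transmit.dropped           0
messages.processing.dropped         0
connections.up                      129
connections.down                    129
connections.timeout                 0
connections.active                  0
connections.sharding.subscribed     0
"""

stats = parse_core_stats(SAMPLE)
print(dropped_counters(stats) or "no dropped messages")
```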
You can get more detail on the connections to and from a controller by running:
nsx-controller # show network connections of-type tcp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:9998          0.0.0.0:*               LISTEN      1817/domain
tcp        0      0 127.0.0.1:9999          0.0.0.0:*               LISTEN      1799/java
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1103/python
tcp        0      0 127.0.0.1:8081          0.0.0.0:*               LISTEN      1072/python
tcp        0      0 0.0.0.0:30865           0.0.0.0:*               LISTEN      983/csync2
tcp        0      0 0.0.0.0:1234            0.0.0.0:*               LISTEN      1799/java
tcp        0      0 127.0.0.1:2003          0.0.0.0:*               LISTEN      1100/python
tcp        0      0 127.0.0.1:2004          0.0.0.0:*               LISTEN      1100/python
The output is similar to that of the ‘netstat’ command.
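One quick check here is which sockets are bound to all interfaces rather than to loopback only. In the example, that includes port 1234, the port the ESXi hosts’ netcpad service connects to, as described later in this post. A sketch assuming the netstat-style columns above; `parse_tcp_listeners` and `Listener` are hypothetical names.

```python
from typing import List, NamedTuple

# Sketch; assumes netstat-style rows:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program


class Listener(NamedTuple):
    address: str
    port: int
    program: str


def parse_tcp_listeners(output: str) -> List[Listener]:
    """Hypothetical helper: return the TCP sockets in LISTEN state."""
    listeners = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[0] == "tcp" and fields[5] == "LISTEN":
            address, _, port = fields[3].rpartition(":")
            program = fields[6].split("/", 1)[-1]
            listeners.append(Listener(address, int(port), program))
    return listeners


SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:9998          0.0.0.0:*               LISTEN      1817/domain
tcp        0      0 127.0.0.1:9999          0.0.0.0:*               LISTEN      1799/java
tcp        0      0 127.0.0.1:8080          0.0.0.0:*               LISTEN      1103/python
tcp        0      0 127.0.0.1:8081          0.0.0.0:*               LISTEN      1072/python
tcp        0      0 0.0.0.0:30865           0.0.0.0:*               LISTEN      983/csync2
tcp        0      0 0.0.0.0:1234            0.0.0.0:*               LISTEN      1799/java
tcp        0      0 127.0.0.1:2003          0.0.0.0:*               LISTEN      1100/python
tcp        0      0 127.0.0.1:2004          0.0.0.0:*               LISTEN      1100/python
"""

listeners = parse_tcp_listeners(SAMPLE)
external = [l for l in listeners if l.address != "127.0.0.1"]
for l in external:
    print(f"{l.address}:{l.port} ({l.program})")
```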
Troubleshooting NSX Compute Nodes
When an ESXi host or cluster is configured for NSX, a number of VIBs are installed on the host to give it the ability to participate in NSX virtual networks: esx-vxlan, esx-vsip and esx-dvfilter-switch-security.
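Given a VIB listing from a host, confirming that all three are present is a simple set check. This sketch assumes the VIB name is the first column of each line (as it is in `esxcli software vib list` output); `missing_nsx_vibs` is a hypothetical helper, and the example input is hypothetical and reduced to names only.

```python
from typing import List

# VIB names listed in the text above.
NSX_VIBS = ("esx-vxlan", "esx-vsip", "esx-dvfilter-switch-security")


def missing_nsx_vibs(vib_list_output: str) -> List[str]:
    """Hypothetical helper: return the NSX VIBs that don't appear in a VIB listing.

    Assumes the VIB name is the first whitespace-separated column of each line.
    """
    installed = {line.split()[0] for line in vib_list_output.splitlines() if line.strip()}
    return [vib for vib in NSX_VIBS if vib not in installed]


# Hypothetical listing, reduced to the name column for illustration.
example = """\
esx-vxlan
esx-vsip
"""
print(missing_nsx_vibs(example))
```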
You can check that these VIBs have been installed by running the following command on the ESXi host:
To check controller connectivity from the ESXi host you can run:
Or:
# esxcli network vswitch dvs vmware vxlan network list --vds-name
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count  MAC Entry Count  ARP Entry Count  MTEP Count
--------  -------------------------  -----------------------------------  ---------------------  ----------  ---------------  ---------------  ----------
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)                1                1                0           0
    5004  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)                1                0
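The ‘Controller Connection’ column can also be checked from a script, listing any VXLAN whose controller connection isn’t ‘up’. A minimal sketch, assuming rows start with the VXLAN ID and that the controller column is the last ‘IP (state)’ pair on the row; `controller_connections` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Sketch; matches "172.16.1.70 (up)" style "IP (state)" pairs.
CONTROLLER_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}) \(([^)]*)\)")


def controller_connections(output: str) -> List[Tuple[int, str, str]]:
    """Hypothetical helper: (VXLAN ID, controller IP, state) for each row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # header, separator or command line
        matches = CONTROLLER_RE.findall(line)
        if matches:
            # Assumes the controller column is the last "IP (state)" pair on the row.
            ip, state = matches[-1]
            rows.append((int(fields[0]), ip, state))
    return rows


SAMPLE = """\
VXLAN ID  Multicast IP               Control Plane                        Controller Connection  Port Count
--------  -------------------------  -----------------------------------  ---------------------  ----------
    5000  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)                1
    5004  N/A (headend replication)  Enabled (multicast proxy,ARP proxy)  172.16.1.70 (up)                1
"""

rows = controller_connections(SAMPLE)
not_up = [row for row in rows if row[2] != "up"]
print(not_up or f"all {len(rows)} VXLANs have an 'up' controller connection")
```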
In the ‘Controller Connection’ column you can see the controller IP address and its status. If all is healthy, you should see some established connections on port 1234, which the ‘netcpad’ service uses to connect to the NSX controller instance:
# esxcli network ip connection list | grep tcp | grep 1234
tcp         0       0  172.16.1.90:43954                  172.16.1.70:1234    ESTABLISHED     44754  netcpa-worker
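The same filter can be expressed in a script that returns the controllers a host currently has established connections to on port 1234. This is a sketch assuming the column order shown above (protocol, receive queue, send queue, local address, foreign address, state, ...); `connected_controllers` is a hypothetical helper.

```python
from typing import List

# Port used by netcpad to reach the controllers, per the text above.
CONTROLLER_PORT = 1234


def connected_controllers(output: str, port: int = CONTROLLER_PORT) -> List[str]:
    """Hypothetical helper: foreign IPs of ESTABLISHED TCP connections to `port`."""
    controllers = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] != "tcp":
            continue
        foreign, state = fields[4], fields[5]
        host, _, foreign_port = foreign.rpartition(":")
        if state == "ESTABLISHED" and foreign_port == str(port):
            controllers.append(host)
    return controllers


SAMPLE = """\
tcp         0       0  172.16.1.90:43954                  172.16.1.70:1234    ESTABLISHED     44754  netcpa-worker
"""

print(connected_controllers(SAMPLE) or "no established controller connections on port 1234")
```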
If you find that a host isn’t connected to the controller, one possible step would be to restart the netcpad service on the host:
# /etc/init.d/netcpad restart
The service has its own log file, found at /var/log/netcpa.log, which is useful for troubleshooting.
Useful Links and Resources
https://pubs.vmware.com/NSX-6/topic/com.vmware.ICbase/PDF/nsx_60_cli.pdf