Sunday, January 16, 2022

How to resolve half of the network links showing as down on Oracle SuperCluster M8 with the Oracle Quad 10Gb/Dual 40Gb Ethernet Adapter

Recently I had an issue with an Oracle Quad 10Gb/Dual 40Gb Ethernet Adapter: the operating system showed only half of its connected links as up (online), and the other half as down. In fact, the links that Solaris reported as down were physically connected and showed a steady green light on both the switch and the adapter sides.

root@hostname:~# dladm show-phys                                                           
LINK            MEDIA         STATE      SPEED  DUPLEX    DEVICE
net0            Ethernet      down       0      unknown   i40e0
net1            Ethernet      down       0      unknown   i40e1
net2            Ethernet      up         10000  full      i40e2
net3            Ethernet      down       0      unknown   i40e3
net4            Ethernet      down       0      unknown   i40e4
net5            Ethernet      down       0      unknown   i40e5
net6            Ethernet      up         10000  full      i40e6
net7            Ethernet      down       0      unknown   i40e7
net8            Ethernet      up         1000   full      vnet0
net9            Ethernet      up         1000   full      vnet1
net10           Infiniband    up         32000  unknown   ibp0
net11           Infiniband    up         32000  unknown   ibp1
net14           Infiniband    up         32000  unknown   ibp2
net15           Infiniband    up         32000  unknown   ibp3

Here we have an LDom with two CMIOUs. Each CMIOU provides four i40e devices, because each Ethernet adapter was configured to split into four 10Gb links (4x10Gb=40Gb). net0, net2, net4 and net6 were physically connected, so all four should have been up, but we can see that net0 and net4 were down. Support recommended downgrading the firmware of those adapters, power cycling the server, and then upgrading the firmware again (followed by another power cycle).
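If you want to spot this condition without reading the table by eye, the check can be scripted. Below is a minimal sketch in Python that parses text in the `dladm show-phys` format shown above and reports which of the links you expect to be up are not. The function names and the shortened sample are my own illustration, not part of any Oracle tool.

```python
def parse_show_phys(output):
    """Parse `dladm show-phys` text into a {link: state} dictionary."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (LINK MEDIA STATE ...).
        if len(fields) < 6 or fields[0] == "LINK":
            continue
        # Column order follows the output above: LINK, MEDIA, STATE, ...
        states[fields[0]] = fields[2]
    return states


def links_not_up(output, expected_up):
    """Return the links from expected_up that are missing or not 'up'."""
    states = parse_show_phys(output)
    return [link for link in expected_up if states.get(link) != "up"]


# Shortened sample taken from the output above.
SAMPLE = """\
LINK            MEDIA         STATE      SPEED  DUPLEX    DEVICE
net0            Ethernet      down       0      unknown   i40e0
net2            Ethernet      up         10000  full      i40e2
net4            Ethernet      down       0      unknown   i40e4
net6            Ethernet      up         10000  full      i40e6
"""
```

Calling `links_not_up(SAMPLE, ["net0", "net2", "net4", "net6"])` returns `['net0', 'net4']`, which matches the two cabled links that Solaris reported as down.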

Every firmware operation requires a power cycle of the PDom. Take this into account when planning your work.

The sequence of steps was as follows:

1. Shut down any non-primary domains first (use ldm list to identify them).

# init 0

2. Save the LDom configuration if needed, and boot into the factory-default SP config:

# ldm add-config backup
# ldm set-config factory-default

# ldm list-config

3. Shut down the primary domain.

# init 0

4. In ILOM, power off the host (or the platform, depending on your hardware) by issuing stop against the HOST (or SYS) target.

5. Configure booting from an ISO over the network (in ILOM):

-> set /Servers/PDomains/PDomain_0/SP/services/kvms/host_storage_device/ mode=disabled
-> set /Servers/PDomains/PDomain_0/SP/services/kvms/host_storage_device/remote/ server_URI=nfs://10.xx.xx.xxx:/private/boot.iso
-> set /Servers/PDomains/PDomain_0/SP/services/kvms/host_storage_device/ mode=remote
-> set /SP/cli timeout=0
-> show /Servers/PDomains/PDomain_0/SP/services/kvms/host_storage_device/

Adjust and set these parameters carefully: do not insert extra spaces, dashes, semicolons and so on. Also keep in mind that you need an NFS v3 server reachable over the network.

Another way to configure the ISO file to boot from is the BUI (not covered in this post).
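Since a stray space or semicolon in these ILOM lines is easy to miss, it can help to generate them instead of typing them by hand. The sketch below is only an illustration: the function name is hypothetical, the rejected character set (whitespace and semicolons) is my assumption about what counts as "extra", and the paths are copied from the commands above. It only builds the text to paste at the `->` prompt; it does not talk to ILOM.

```python
import re

# Whitespace and semicolons are rejected; this set is an assumption based on
# the warning in the post, not an official ILOM rule.
_SUSPICIOUS = re.compile(r"[\s;]")


def host_storage_commands(pdomain, server_uri):
    """Build the ILOM command lines for booting from a remote ISO."""
    for name, value in (("pdomain", pdomain), ("server_uri", server_uri)):
        if not value or _SUSPICIOUS.search(value):
            raise ValueError(f"{name} is empty or contains spaces/semicolons: {value!r}")
    base = f"/Servers/PDomains/{pdomain}/SP/services/kvms/host_storage_device/"
    return [
        f"set {base} mode=disabled",
        f"set {base}remote/ server_URI={server_uri}",
        f"set {base} mode=remote",
        "set /SP/cli timeout=0",
        f"show {base}",
    ]
```

For example, `host_storage_commands("PDomain_0", "nfs://10.xx.xx.xxx:/private/boot.iso")` returns the same five lines as in step 5, while a URI with a trailing space raises `ValueError` before anything reaches the SP.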

6. Start the HOST target

7. At the OpenBoot prompt, check that the rcdrom device is available and boot from it:

{0} ok devalias                                                                              
fallback-miniroot        /pci@304/pci@1/usb@0/storage@2/disk@0                               
rcdrom                   /pci@304/pci@1/usb@0/storage@2/disk@0                               
virtual-console          /virtual-devices/console@1                                          
name                     aliases                                                             
{0} ok boot rcdrom                                                                           
Boot device: /pci@304/pci@1/usb@0/storage@2/disk@0  File and args: 

You might receive new fault events, such as the host being unable to use the interconnect to the SP. This is expected in this boot mode; just be aware of them.

8. Log in to the system. I used the Oracle VTS image (MOS patch 26091982). The username and password are jack/jack; the root password is solaris.

9. Create a network address for connecting to the NFS server to download the firmware (if it is not already in the ISO). Use the network and link from a working LDom.

# ipadm create-addr -T static -a local=10.xx.xx.xxx/24 net1

10. Mount the NFS share:

# mount -F nfs -o rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=3,proto=tcp,forcedirectio,nocto 10.xx.xx.xxx:/export/test-share /mnt/tst

11. Perform whatever other operations you need (downgrade/upgrade the firmware, reconfigure the equipment, etc.).

12. Revert the LDom configuration via ILOM and disable booting from the remote ISO, for example:

-> set /Servers/PDomains/PDomain_0/HOST/bootmode config=backup

-> set /Servers/PDomains/PDomain_0/SP/services/kvms/host_storage_device/ mode=miniroot

13. Shut down the OS and stop the HOST target. Wait until status_detail of the HOST target reads 'Host is off', then start the HOST target.
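The waiting part of this step is a plain polling loop, so here is a rough sketch of it in Python. How you actually read status_detail (for example over SSH to the SP) is left out: `get_status` is a hypothetical callable that you would have to supply, and the timeout and interval values are arbitrary assumptions.

```python
import time


def wait_for_status(get_status, wanted="Host is off",
                    timeout=600.0, interval=10.0, sleep=time.sleep):
    """Poll get_status() until it returns `wanted` or `timeout` seconds pass.

    get_status is a hypothetical callable returning the current
    status_detail string; sleep is injectable so the loop can be tested.
    Returns True if the wanted status was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Only start the HOST target again once this returns True; if it returns False, check the target by hand rather than assuming the host is off.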

14. The operating system should boot. Check the status of the links:

# dladm show-phys
LINK            MEDIA         STATE      SPEED  DUPLEX    DEVICE
net0            Ethernet      up         10000  full      i40e0
net1            Ethernet      down       0      unknown   i40e1
net2            Ethernet      up         10000  full      i40e2
net3            Ethernet      down       0      unknown   i40e3
net4            Ethernet      up         10000  full      i40e4
net5            Ethernet      down       0      unknown   i40e5
net6            Ethernet      up         10000  full      i40e6
net7            Ethernet      down       0      unknown   i40e7
net8            Ethernet      up         1000   full      vnet0
net9            Ethernet      up         1000   full      vnet1
net10           Infiniband    up         32000  unknown   ibp0
net11           Infiniband    up         32000  unknown   ibp1
net14           Infiniband    up         32000  unknown   ibp2
net15           Infiniband    up         32000  unknown   ibp3

 

Looks good :) Repeat these steps to upgrade the firmware (if needed).

That's it :) Good luck!


