Stella Administration Guide
I am writing this guide quickly to leave some pointers for Kostas before I leave on holiday. I will complete it in 2007.
Lilian
Creating a New User
- Use the same username as on the department's NFS/Unix system.
- Find the user's UID by using "cd ~USERNAME; ls -ln"
- Log onto Stella as root
- Use the following command with the appropriate UID and USERNAME: "useradd -u UID -g 800 -d /users/USERNAME -m USERNAME".
Note: 800 is the group ID for the APT group.
- Update the NIS database for the compute nodes: "cd /var/yp && make"
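The steps above can be sketched as a small dry-run helper that only prints the two commands, so they can be checked before being pasted into a root shell. The function name print_new_user_commands and the example user "jsmith" with UID 12345 are made up for illustration; the group ID 800, the /users home prefix and the NIS update command are taken from the steps above.

```shell
#!/bin/sh
# Dry run: prints the commands for creating a user, it does not run them.
# APT_GID and the /users prefix come from the guide; everything else is illustrative.

APT_GID=800

print_new_user_commands() {
    uid="$1"
    username="$2"
    # Reject a UID that is empty or not a plain number.
    case "$uid" in
        ''|*[!0-9]*) echo "error: UID must be numeric" >&2; return 1 ;;
    esac
    if [ -z "$username" ]; then
        echo "error: USERNAME is required" >&2
        return 1
    fi
    echo "useradd -u $uid -g $APT_GID -d /users/$username -m $username"
    echo "cd /var/yp && make"
}

# Hypothetical example user: jsmith with UID 12345
print_new_user_commands 12345 jsmith
```

Once the printed commands look right, run them by hand as root on Stella.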
Starting Matlab's job manager on Stella
- Log onto Stella as root
- cd /usr/local/appl/matlab/toolbox/distcomp/bin
- ./mdce start
- ./startjobmanager -name MyJobManager -v
- You can check the status of the job manager with ./nodestatus
Starting Matlab workers on the compute nodes
- Log onto Stella as root
- "rsh compXX", where XX is the compute node's number: 00, 01, 02, 03, 04 or 05
- cd /usr/local/appl/matlab/toolbox/distcomp/bin
- ./mdce start
- Return to Stella (possibly as a normal user): "exit" from the node
- Start as many workers as desired (usually 8 workers, 1 per core), replacing "comp00" with the desired node:
- cd /usr/local/appl/matlab/toolbox/distcomp/bin
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_1
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_2
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_3
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_4
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_5
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_6
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_7
- ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_8
- You can check the status of the job manager with ./nodestatus or ./nodestatus -remotehost comp00.cs.man.ac.uk
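Since the eight startworker lines differ only in the worker number, they can be generated with a loop. The following is a minimal sketch that prints the commands rather than running them; the function name print_startworker_commands is made up, while the host names, job manager name and flags are copied from the list above.

```shell
#!/bin/sh
# Dry run: prints one startworker command per worker for a given node.
# The default of 8 workers follows the "1 per core" rule of thumb above.

print_startworker_commands() {
    node="$1"          # e.g. comp00
    count="${2:-8}"    # number of workers to start on that node
    i=1
    while [ "$i" -le "$count" ]; do
        echo "./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost ${node}.cs.man.ac.uk -v -name worker_${node}_${i}"
        i=$((i + 1))
    done
}

print_startworker_commands comp00
```

To start the workers for real, one option is to run the function from /usr/local/appl/matlab/toolbox/distcomp/bin and pipe its output to sh after reviewing it.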
Monitoring a node's console output while rebooting or switching it off
- Log onto Stella as a user
- "console XX", where XX is the compute node's number: 00, 01, 02, 03, 04 or 05
- It will show you the Serial over Ethernet output for the specified node... which is everything Linux and the BIOS print on the console during a reboot
- To disconnect the console, press '~' followed by '.' (if it doesn't work, press Enter and try ~. again).
Rebooting one compute node
Method 1:
- Log onto Stella as root
- Optional: Monitor the node's console output during reboot by following the instructions above
- "rsh compXX", where XX is the compute node's number (between 00 and 05)
- Check that you are logged on to the desired compute node
- "reboot"
- If the node hasn't finished rebooting after 5 minutes, it may be stuck on "Unmounting NFS"... Follow the instructions in Method 2 to do an "IPMI Power CYCLE"
Method 2:
- Log onto Stella as a normal user with X available
- run "firefox" or another web browser
- Open http://cma
- Log in as admin/admin
- In Node View, choose the desired node
- Choose the desired Action ("Soft reboot" or "IPMI Power CYCLE")
- Optional: Monitor the node's console output during reboot by following the instructions above
- If the node hasn't finished rebooting after 5 minutes, it may be stuck on "Unmounting NFS"... Go back to firefox and choose "IPMI Power CYCLE"
Method 3:
- Walk to the machine room
- Unplug the big cable (or better... ask somebody else to unplug it and stay away)
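The "wait 5 minutes, then power cycle" rule used in this section can be sketched as a polling loop. This is only an illustration: wait_for_node and probe_rsh are made-up names, and using "rsh NODE true" as the sign that a node is back is an assumption, not something the cluster tools provide.

```shell
#!/bin/sh
# Polls a node until it answers or a timeout expires (default 300 s = 5 minutes).
# The probe command is passed in so it can be swapped for whatever check works locally.

# Assumed probe: the node counts as "back" when a trivial rsh command succeeds.
probe_rsh() {
    rsh "$1" true
}

wait_for_node() {
    probe="$1"
    node="$2"
    timeout="${3:-300}"    # seconds to wait in total
    interval="${4:-10}"    # seconds between two attempts
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if $probe "$node" >/dev/null 2>&1; then
            echo "$node is back after about ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "$node did not come back within ${timeout}s: try an IPMI Power CYCLE"
    return 1
}

# Usage on Stella (not run here): wait_for_node probe_rsh comp00
```

A probe that hangs on a dead node would stall the loop, so a check with its own timeout may be a better choice in practice.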
Rebooting all the compute nodes
Method 1:
- Log onto Stella as root
- "doall /sbin/reboot"
- Optional: Monitor the node's console output during reboot by following the instructions above for each of the nodes
Method 2:
- Follow any one of the methods for rebooting one compute node, and apply it one by one to comp00, comp01, comp02, comp03, comp04 and comp05.
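Applying a command to the nodes one by one can be sketched as a loop over the six node names. The function names below are made up, and passing /sbin/reboot directly on the rsh command line (instead of logging in and typing "reboot") is an assumption about how one might script the manual steps. The sketch only prints the commands.

```shell
#!/bin/sh
# Dry run: prints one reboot command per compute node, comp00 to comp05.

# Lists the compute node names with zero-padded numbers.
node_names() {
    n=0
    while [ "$n" -le 5 ]; do
        printf 'comp%02d\n' "$n"
        n=$((n + 1))
    done
}

print_reboot_commands() {
    for node in $(node_names); do
        echo "rsh $node /sbin/reboot"
    done
}

print_reboot_commands
```

For switching nodes off instead, the same loop would apply with a poweroff command in place of /sbin/reboot.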
Switching off the whole cluster
- Follow the instructions for "Rebooting all the compute nodes", but ask for a poweroff instead of a reboot.
- On stella as root: "poweroff"
- Go to the machine room and switch off the 2 big blue switches
- Switch off the UPS
Switching the whole cluster back on
- Go to the machine room
- Switch on the big blue switches one by one. Each one of them should add a new noise. If it doesn't, it means the power has tripped and you need to find Hilary, who will find the electricians... good luck
- If you reached this step, you are very lucky and the cluster will soon be back on
- All the lights blinking on every node only mean that the network is activated... They don't mean the nodes are already switched on
- Switch on the front-end node (the lowest 4U machine) by pressing its power button
- Once it is booted, do the same for every node one by one.
- After a few minutes, try to ssh/rsh from stella to each node. If a node doesn't work, monitor its console output by following the instructions above and reboot it (either manually with the power button, or using the IPMI CYCLE technique described above)
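The final check of each node from Stella can be sketched as a sweep that reports which nodes answer. The names check_nodes and probe_rsh are made up, and treating a successful "rsh NODE true" as "the node works" is an assumption; any other check can be passed in as the first argument.

```shell
#!/bin/sh
# Reports, for each node given, whether a probe command succeeds on it.
# Returns 0 only if every node answered.

# Assumed probe: run a trivial command on the node through rsh.
probe_rsh() {
    rsh "$1" true
}

check_nodes() {
    check_cmd="$1"
    shift
    failed=""
    for node in "$@"; do
        if $check_cmd "$node" >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: NOT responding"
            failed="$failed $node"
        fi
    done
    [ -z "$failed" ]
}

# Usage on Stella (not run here):
#   check_nodes probe_rsh comp00 comp01 comp02 comp03 comp04 comp05
```

Any node reported as not responding is a candidate for the console monitoring and reboot steps described above.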