School of Computer Science Intranet

Stella Administration Guide

I am writing this guide quickly to leave some pointers for Kostas before I go on holiday. I will complete it in 2007.
Lilian

Use the same username as on the department's NFS/Unix system.
Find the user's numeric UID on the department's system with "cd ~USERNAME; ls -ln" (the UID is the number in the owner column of the listing)
Log onto Stella as root
Use the following command with the appropriate UID and USERNAME: "useradd -u UID -g 800 -d /users/USERNAME -m USERNAME".
Note: 800 is the group ID of the APT group
Update the NIS database for compute nodes: "cd /var/yp && make"
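
The steps above could be collected into a small helper. The following is only a sketch: the function name stella_adduser_cmds is hypothetical, it assumes the UID has already been looked up on the department's system, and it prints the commands instead of running them so they can be checked before being pasted into a root shell on Stella.

```shell
# Hypothetical helper: print (do not run) the commands for adding a user.
# Assumes the numeric UID was already looked up on the department's system.
stella_adduser_cmds() {
    username=$1
    uid=$2
    [ -n "$username" ] || { echo "missing USERNAME" >&2; return 1; }
    # Refuse anything that is not a plain number, to catch a mistyped UID.
    case $uid in
        ''|*[!0-9]*)
            echo "UID must be numeric" >&2
            return 1
            ;;
    esac
    # 800 is the APT group ID quoted in this guide.
    echo "useradd -u $uid -g 800 -d /users/$username -m $username"
    # Rebuild the NIS maps so the compute nodes see the new account.
    echo "cd /var/yp && make"
}
```

Calling it as "stella_adduser_cmds USERNAME UID" prints the two commands in the order they should be run.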

Log onto Stella as root
"rsh compXX", where XX is the compute node's number: 00, 01, 02, 03, 04 or 05
cd /usr/local/appl/matlab/toolbox/distcomp/bin
./mdce start
Return to Stella (possibly as a normal user): "exit" from the node
Start as many workers as desired (usually 8, one per core), replacing "comp00" with the desired node:
cd /usr/local/appl/matlab/toolbox/distcomp/bin
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_1
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_2
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_3
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_4
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_5
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_6
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_7
./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_8
You can check the status of the job manager with ./nodestatus or ./nodestatus -remotehost comp00.cs.man.ac.uk
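
The eight startworker lines above differ only in the worker number, so they can be generated by a loop. This is a minimal sketch: the function name print_startworker_cmds is hypothetical, the default of 8 workers comes from the "1 per core" note above, and the commands are only printed so they can be reviewed before being run from the distcomp/bin directory.

```shell
# Hypothetical helper: print the startworker commands for one node.
# $1 = node name (e.g. comp00), $2 = number of workers (default 8).
print_startworker_cmds() {
    node=$1
    count=${2:-8}
    i=1
    while [ "$i" -le "$count" ]; do
        # Same flags as in the guide; only the node and worker number vary.
        echo "./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost $node.cs.man.ac.uk -v -name worker_${node}_$i"
        i=$((i + 1))
    done
}
```

For example, "print_startworker_cmds comp03" prints the eight commands for comp03; piping the output to "sh" from within /usr/local/appl/matlab/toolbox/distcomp/bin would run them.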

Log onto Stella as a user
"console XX", where XX is the compute node's number: 00, 01, 02, 03, 04 or 05
This shows the Serial over Ethernet output for the specified node, that is, everything Linux and the BIOS print on the console during a reboot
To disconnect the console, press '~' followed by '.' (if it doesn't work, press Enter and try ~. again).

Rebooting one compute node

Method 1:

Log onto Stella as root
Optional: Monitor the node's console output during reboot by following the instructions above
"rsh compXX", where XX is the compute node's number (between 00 and 05)
Check that you are logged on to the desired compute node
"reboot"
If the node hasn't finished rebooting after 5 minutes, it may be stuck on "Unmounting NFS". Follow the instructions in method 2 to do an "IPMI Power CYCLE"
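
The 5-minute wait can be automated with a polling loop. The sketch below is an assumption-laden illustration, not part of the cluster's tooling: the function name wait_for_node and the WAIT_PROBE variable are hypothetical, and it uses ping as a stand-in check. A node that has not yet gone down still answers ping, so start it only once the node has actually begun rebooting.

```shell
# Hypothetical helper: wait until a node answers a probe, or give up.
# $1 = node, $2 = timeout in seconds (default 300), $3 = poll interval (default 10).
# WAIT_PROBE can override the check; by default one ping with a 2 s timeout.
wait_for_node() {
    node=$1
    timeout=${2:-300}
    interval=${3:-10}
    probe=${WAIT_PROBE:-ping -c 1 -W 2}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # $probe is deliberately unquoted so its flags are split into words.
        if $probe "$node" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    # Timed out: the node may be stuck and need an IPMI power cycle.
    return 1
}
```

A possible use is: wait_for_node comp00 || echo "comp00 still down after 5 minutes".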

Method 2:

Log onto Stella as a normal user with X available
run "firefox" or another web browser
Open http://cma
Log in as admin/admin
In Node View, choose the desired node
Choose the desired Action ("Soft reboot" or "IPMI Power CYCLE")
Optional: Monitor the node's console output during reboot by following the instructions above
If the node hasn't finished rebooting after 5 minutes, it may be stuck on "Unmounting NFS". Go back to firefox and choose "IPMI Power CYCLE"

Method 3:

Walk to the machine room
Unplug the big cable (or better... ask somebody else to unplug it and stay away)

Rebooting all the compute nodes

Method 1:

Log onto Stella as root
"doall /sbin/reboot"
Optional: Monitor the node's console output during reboot by following the instructions above for each of the nodes

Method 2:

Follow any one of the methods for rebooting one compute node, and apply it in turn to comp00, comp01, comp02, comp03, comp04 and comp05.
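
Applying a command to each node in turn amounts to a loop over the six node names. The sketch below is not how "doall" is implemented (that is not documented here); the function names compute_nodes and print_per_node_cmds are hypothetical, and the commands are only printed so that nothing is rebooted by accident.

```shell
# Hypothetical helper: list the six compute nodes named in this guide.
compute_nodes() {
    for n in 00 01 02 03 04 05; do
        echo "comp$n"
    done
}

# Hypothetical helper: print one rsh command per node for the given command.
print_per_node_cmds() {
    for node in $(compute_nodes); do
        echo "rsh $node $*"
    done
}
```

For example, "print_per_node_cmds /sbin/reboot" prints six lines, from "rsh comp00 /sbin/reboot" to "rsh comp05 /sbin/reboot", which can then be run one at a time as root on Stella.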

Follow the instructions for "Rebooting all the compute nodes", but ask for a poweroff instead of a reboot.
On Stella as root: "poweroff"
Go to the machine room and switch off the 2 big blue switches
Switch off the UPS

Go to the machine room
Switch on the big blue switches one by one. Each one of them should add a new noise. If it doesn't, it means the power has tripped and you need to find Hilary, who will find the electricians... good luck
If you reached this step, you are very lucky and the cluster will soon be back on
The blinking lights on every node only mean that the network is active. They don't mean the nodes are already switched on
Switch on the front-end node (the lowest 4U machine) by pressing its power button
Once it has booted, switch on each compute node in the same way, one by one.
After a few minutes, try to ssh/rsh from stella to each node. If a node doesn't work, monitor its console output by following the instructions above and reboot it (either manually with the power button, or using the IPMI CYCLE technique described above)
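
The final check can be done for all nodes in one pass. This is a rough sketch under stated assumptions: the function name report_nodes and the NODE_PROBE variable are hypothetical, and ping is used here as a simple stand-in for the ssh/rsh test described above (a node can answer ping before logins work, so a node reported as up may still need a moment).

```shell
# Hypothetical helper: report which compute nodes answer a probe.
# NODE_PROBE can override the check; by default one ping with a 2 s timeout.
# Returns non-zero if at least one node did not answer.
report_nodes() {
    probe=${NODE_PROBE:-ping -c 1 -W 2}
    failed=0
    for n in 00 01 02 03 04 05; do
        node="comp$n"
        # $probe is deliberately unquoted so its flags are split into words.
        if $probe "$node" >/dev/null 2>&1; then
            echo "$node up"
        else
            echo "$node DOWN"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

Any node listed as DOWN is a candidate for the console monitoring and reboot steps described earlier.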