School of Computer Science Intranet

Stella Administration Guide

I am writing this guide quickly to leave some pointers for Kostas before I go on holiday. I will complete it in 2007.
Lilian

Creating a New User

  • Use the same username as on the department's NFS/Unix system.
  • Find the user's UID with "cd ~USERNAME; ls -ln" (the numeric UID is the third column of the listing)
  • Log onto Stella as root
  • Use the following command with the appropriate UID and USERNAME: "useradd -u UID -g 800 -d /users/USERNAME -m USERNAME".
    Note: 800 is the group ID of the APT group
  • Update the NIS database for compute nodes: "cd /var/yp && make"
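The steps above can be collected into a small helper. This is a minimal sketch, not a tested admin tool: the function name print_new_user_cmds is hypothetical, and it only prints the two commands from this guide so they can be reviewed before being run as root on Stella.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the commands from this guide
# for creating a user with a given UID and username.
APT_GID=800   # group ID of the APT group, as noted in the guide

print_new_user_cmds() {
    uid="$1"
    username="$2"
    # Refuse anything that is not a plain decimal number as the UID.
    case "$uid" in
        ''|*[!0-9]*) echo "error: UID must be numeric" >&2; return 1 ;;
    esac
    if [ -z "$username" ]; then
        echo "error: USERNAME is required" >&2
        return 1
    fi
    # Create the account with the same UID as on the department's system.
    echo "useradd -u $uid -g $APT_GID -d /users/$username -m $username"
    # Rebuild the NIS maps so the compute nodes see the new user.
    echo "cd /var/yp && make"
}
```

For example, "print_new_user_cmds 1234 alice" prints the useradd line followed by the NIS update line; the UID and username here are made up.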

Starting Matlab's job manager on Stella

  • Log onto Stella as root
  • cd /usr/local/appl/matlab/toolbox/distcomp/bin
  • ./mdce start
  • ./startjobmanager -name MyJobManager -v
  • You can check the status of the job manager with ./nodestatus

Starting Matlab workers on the compute nodes

  • Log onto Stella as root
  • "rsh compXX", where XX is the compute node's number: 00, 01, 02, 03, 04 or 05
  • cd /usr/local/appl/matlab/toolbox/distcomp/bin
  • ./mdce start
  • Return to Stella (possibly as a normal user): "exit" from the node
  • Start as many workers as desired (usually 8 workers, 1 per core), replacing "comp00" with the desired node:
  • cd /usr/local/appl/matlab/toolbox/distcomp/bin
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_1
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_2
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_3
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_4
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_5
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_6
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_7
  • ./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost comp00.cs.man.ac.uk -v -name worker_comp00_8
  • You can check the status of the job manager and its workers with ./nodestatus or ./nodestatus -remotehost comp00.cs.man.ac.uk
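The eight startworker lines differ only in the worker number, so they can be generated with a loop. The sketch below assumes the same flags and naming scheme as the list above; the function name print_startworker_cmds and its optional worker-count argument are hypothetical. It prints the commands rather than running them.

```shell
#!/bin/sh
# Prints one startworker command per worker for a given node (e.g. comp00).
# The flags are copied from the list in this guide; the function name and
# the optional second argument are assumptions made for illustration.
print_startworker_cmds() {
    node="$1"
    count="${2:-8}"    # 8 workers by default: 1 per core
    i=1
    while [ "$i" -le "$count" ]; do
        echo "./startworker -jobmanagerhost stella -jobmanager MyJobManager -remotehost $node.cs.man.ac.uk -v -name worker_${node}_$i"
        i=$((i + 1))
    done
}
```

From /usr/local/appl/matlab/toolbox/distcomp/bin, the output of "print_startworker_cmds comp00" could be reviewed and then piped to sh to run the commands in turn.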

Monitoring a node's console output while rebooting or switching it off

  • Log onto Stella as a user
  • "console XX", where XX is the compute node's number: 00, 01, 02, 03, 04 or 05
  • It shows the Serial over Ethernet output for the specified node, which is everything Linux and the BIOS print on the console during a reboot
  • To disconnect the console, press '~' followed by '.' (if it doesn't work, press Enter and try ~. again).

Rebooting one compute node

Method 1:
  • Log onto Stella as root
  • Optional: Monitor the node's console output during reboot by following the instructions above
  • "rsh compXX", where XX is the compute node's number (between 00 and 05)
  • Check that you are logged on to the desired compute node
  • "reboot"
  • If the node hasn't finished rebooting after 5 minutes, it may be stuck on "Unmounting NFS". Follow the instructions in Method 2 to do an "IPMI Power CYCLE"
Method 2:
  • Log onto Stella as a normal user with X available
  • Run "firefox" or another web browser
  • Open http://cma
  • Log in as admin/admin
  • In Node View, choose the desired node
  • Choose the desired Action ("Soft reboot" or "IPMI Power CYCLE")
  • Optional: Monitor the node's console output during reboot by following the instructions above
  • If the node hasn't finished rebooting after 5 minutes, it may be stuck on "Unmounting NFS". Go back to Firefox and choose "IPMI Power CYCLE"
Method 3:
  • Walk to the machine room
  • Unplug the big cable (or better... ask somebody else to unplug it and stay away)

Rebooting all the compute nodes

Method 1:
  • Log onto Stella as root
  • "doall /sbin/reboot"
  • Optional: Monitor each node's console output during reboot by following the instructions above
Method 2:
  • Follow any one of the methods for rebooting one compute node, and apply it to comp00, comp01, comp02, comp03, comp04 and comp05 one by one.
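Going through the nodes one by one can be scripted. The sketch below generates the node names comp00 to comp05 and prints one reboot command per node. It assumes that "rsh compXX /sbin/reboot" is an acceptable non-interactive form of Method 1 for a single node; the function names are hypothetical, and nothing is run until the printed commands are executed.

```shell
#!/bin/sh
# Prints the node names comp00 .. comp05, zero-padded to two digits.
node_names() {
    n=0
    while [ "$n" -le 5 ]; do
        printf 'comp%02d\n' "$n"
        n=$((n + 1))
    done
}

# Prints (does not run) one reboot command per compute node.
# Assumption: running /sbin/reboot through rsh is equivalent to logging
# on to the node and typing "reboot".
print_reboot_cmds() {
    for node in $(node_names); do
        echo "rsh $node /sbin/reboot"
    done
}
```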

Switching off the whole cluster

  • Follow the instructions for "Rebooting all the compute nodes", but ask for a poweroff instead of a reboot.
  • On stella as root: "poweroff"
  • Go to the machine room and switch off the 2 big blue switches
  • Switch off the UPS

Switching the whole cluster back on

  • Go to the machine room
  • Switch on the big blue switches one by one. Each one of them should add a new noise. If it doesn't, it means the power has tripped and you need to find Hilary, who will find the electricians... good luck
  • If you reached this step, you are very lucky and the cluster will soon be back on
  • All the lights blinking on every node only mean that the network is activated... They don't mean the nodes are already switched on
  • Switch on the front-end node (the lowest 4U machine) by pressing its power button
  • Once it is booted, do the same for every node one by one.
  • After a few minutes, try to ssh/rsh from Stella to each node. If a node doesn't respond, monitor its console output by following the instructions above and reboot it (either manually with the power button, or using the "IPMI Power CYCLE" action described above)
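The final check can be sketched as a loop that tries to run a trivial command on each node. This is only an illustration: the function name check_nodes is hypothetical, and the probe command (rsh or ssh) is passed in as the first argument. A node that is still booting may make the probe wait for a while before it fails.

```shell
#!/bin/sh
# Runs "<probe> <node> true" for each node and reports whether it succeeded.
# The probe is deliberately left unquoted so that it may carry options,
# e.g. "ssh -o ConnectTimeout=5".
check_nodes() {
    probe="$1"
    shift
    for node in "$@"; do
        if $probe "$node" true >/dev/null 2>&1; then
            echo "$node up"
        else
            echo "$node DOWN"
        fi
    done
}
```

For example, "check_nodes ssh comp00 comp01 comp02 comp03 comp04 comp05" would print one line per node. Any node reported as DOWN is a candidate for the console monitoring and reboot steps above.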