Is tping enough to assume a cluster is working?

Ricardo Guerreiro

Hi:

I am quite confused. Can you explain where the problem is? I think I need some orientation to know where to look.

I am trying to set up a temporary cluster, to be used a few times a year. For now I am practicing with just two machines (Athlon 64 X2, M2N-SX motherboard). They are connected through a switch, and I am not connected to the university network.



I boot the first machine from the CD (I have to use the "noapic" option; I also use keyb=es, since Spanish is my language).
Everything is OK up to this point. Could "noapic" have any consequences for the cluster?

Then I start the X environment (it starts automatically in v1.8; in v1.9 I log in with the user account and run startx), open a terminal and run pelican_setup.

I press YES at "Start Pelican HPC netboot services".
I press YES to bring up the compute nodes.


Then I boot the other machine from the network (I have to go to its keyboard, press TAB and type "live noapic"; otherwise it will not boot).

I get the Debian login prompt.

I return to the main machine and press "NO" so that Pelican can recognize the second machine. I get the message saying that at the moment 1 compute node (not counting the front node) is available.

So I press YES and get the message that "your cluster of 2 nodes is (probably) lambooted."



But in the process, I can see:

        LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University

        ERROR: LAM/MPI unexpectedly received the following on stderr:
        Warning: Permanently added '10.11.12.2' (RSA) to the list of known hosts.
        -----------------------------------------------------------------------------
        LAM attempted to execute a process on the remote node "10.11.12.2",
        but received some output on the standard error.  This heuristic
        assumes that any output on the standard error indicates a fatal error,
        and therefore aborts.  You can disable this behavior (i.e., have LAM
        ignore output on standard error) in the rsh boot module by setting the
        SSI parameter boot_rsh_ignore_stderr to 1.

        LAM tried to use the remote agent command "/usr/bin/rsh"
        to invoke "echo $SHELL" on the remote node.

        *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
        *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
        *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
        *** MAILING LIST.

        This can indicate an authentication error with the remote agent, or
        can indicate an error in your $HOME/.cshrc, $HOME/.login, or
        $HOME/.profile files.  The following is a (non-inclusive) list of items
        that you should check on the remote node:

                - You have an account and can login to the remote machine
                - Incorrect permissions on your home directory (should
                  probably be 0755)
                - Incorrect permissions on your $HOME/.rhosts file (if you are
                  using rsh -- they should probably be 0644)
                - You have an entry in the remote $HOME/.rhosts file (if you
                  are using rsh) for the machine and username that you are
                  running from
                - Your .cshrc/.profile must not print anything out to the
                  standard error
                - Your .cshrc/.profile should set a correct TERM type
                - Your .cshrc/.profile should set the SHELL environment
                  variable to your default shell

        Try invoking the following command at the unix command line:

                /usr/bin/rsh 10.11.12.2 -n 'echo $SHELL'

        You will need to configure your local setup such that you will *not*
        be prompted for a password to invoke this command on the remote node.
        No output should be printed from the remote node before the output of
        the command is displayed.


Do I have a problem? To find out, I try:




        user@pelican: ~$ tping n1
          1 byte from n1 (o): 0.000 secs
          1 byte from n1 (o): 0.000 secs
        ^C
        2 messages, 2 bytes (0.002K), 0.000 secs (19.249K/sec)
        roundtrip min/avg/max: 0.000/0.000/0.000
        user@pelican:~$ tping n0
          1 byte from n0: 0.000 secs
          1 byte from n0: 0.000 secs
          1 byte from n0: 0.000 secs
        ^C

then with lamnodes I get

           user@pelican:~$ lamnodes
          n0      10.11.12.2:1:
          n1      10.11.12.1:1:origin,this_node



Then I try the rsh command suggested in the error message, for both machines, and get:

        user@pelican:~$ /usr/bin/rsh 10.11.12.2 -n 'echo $SHELL'
        /bin/bash
        user@pelican:~$ /usr/bin/rsh 10.11.12.1 -n 'echo $SHELL'
        Warning: Permanently added '10.11.12.1' (RSA) to the list of known hosts.
        /bin/bash
        user@pelican:~$

        user@pelican:~/mpir$ tping n0-1
          1 byte from 1 remote node and 1 local node: 0.000 secs
          1 byte from 1 remote node and 1 local node: 0.000 secs
          1 byte from 1 remote node and 1 local node: 0.000 secs
          1 byte from 1 remote node and 1 local node: 0.000 secs
        ^C



if I type less /home/user/tmp/bhosts I get

        10.11.12.2
        10.11.12.1
        bhosts (END)



so I think the cluster is running.


But then I create a test directory inside /home/user,
and there I create the following file (r1.c):

        #include "mpi.h"
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, size;

            MPI_Init(&argc, &argv);                /* start MPI */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
            MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
            printf("Hello I am %d of %d\n", rank, size);
            MPI_Finalize();
            return 0;
        }


I go to that directory, compile with "mpicc r1.c -o r1", get the executable, run it and get:


        user@pelican:~$ mpirun N r1
        Hello I am 1 of 2
        user@pelican:~$    
 

I think this is not correct: I expected to see a line from rank 0 as well.

if I try,
        user@pelican:~$ mpirun -c 8 r1

I get


        Hello I am 3 of 8
        Hello I am 5 of 8
        Hello I am 7 of 8
        Hello I am 1 of 8
        user@pelican:~$    



I think this is not correct either: only the odd ranks appear.

but if I do

lamboot

(I discover that only my local node is in the cluster now),

and try,
        mpirun -c 8 r1


I get
        Hello I am 0 of 8
        Hello I am 2 of 8
        Hello I am 6 of 8
        Hello I am 1 of 8
        Hello I am 3 of 8
        Hello I am 4 of 8
        Hello I am 5 of 8
        Hello I am 7 of 8
        user@pelican:~$

So it looks like my 2-machine cluster was NOT working.

I thought the problem was in the printf, so I tried:

        #include <mpi.h>
        #include <math.h>
        #include <stdio.h>
        /* Prototype */
        float integral(float ai, float h, int n);
        int main(int argc, char* argv[])
        {
        /*###############################################################################
        #                                                                              #
        # This is an MPI example on parallel integration to demonstrate the use of:    #
        #                                                                              #
        # * MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize                       #
        # * MPI_Recv, MPI_Send                                                         #
        #                                                                              #
        # Dr. Kadin Tseng                                                              #
        # Scientific Computing and Visualization                                       #
        # Boston University                                                            #
        # 1998                                                                         #
        #                                                                              #
        ###############################################################################*/
        int n, p, myid, tag, proc, ierr;
        float h, integral_sum, a, b, ai, pi, my_int;
        int master = 0;  /* processor performing total sum */
        MPI_Comm comm;
        MPI_Status status;
       
        comm = MPI_COMM_WORLD;      
        ierr = MPI_Init(&argc,&argv);           /* starts MPI */
        MPI_Comm_rank(comm, &myid);            /* get current process id */
        MPI_Comm_size(comm, &p);               /* get number of processes */
       
        pi = acos(-1.0);  /* = 3.14159... */
        a = 0.;           /* lower limit of integration */
        b = pi*1./2.;     /* upper limit of integration */
        n = 500;          /* number of increment within each process */
        tag = 123;        /* set the tag to identify this particular job */
        h = (b-a)/n/p;    /* length of increment */
       
        ai = a + myid*n*h;  /* lower limit of integration for partition myid */
        my_int = integral(ai, h, n);   /* 0<=myid<=p-1 */
       
        printf("Process %d has the partial integral of %f\n", myid,my_int);
       
        MPI_Send(
                &my_int, 1, MPI_FLOAT,
                master,        /* message destination */
                tag,           /* message tag */
                comm);
       
        if(myid == master) {  /* Receives serialized */
                integral_sum = 0.0;
                for (proc=0;proc<p;proc++) {
                MPI_Recv(
                        &my_int, 1, MPI_FLOAT,
                        proc,        /* message source */
                        tag,         /* message tag */
                        comm, &status);     /* status reports source, tag */
                integral_sum += my_int;
                }
                printf("The Integral =%f\n",integral_sum); /* sum of my_int */
        }
        MPI_Finalize();                        /* let MPI finish up ... */
        }
        float integral(float ai, float h, int n)
        {
        int j;
        float aij, integ;
       
        integ = 0.0;                 /* initialize */
        for (j=0;j<n;j++) {          /* sum integrals */
                aij = ai + (j+0.5)*h;      /* mid-point */
                integ += cos(aij)*h;
        }
        return integ;
        }

If I run it on node n1 only ("mpirun n1 tst2"), I get the correct answer.

If I run "mpirun N tst2", I get a wrong answer.


Is the problem in the cluster, or in the C programs I am running?


Is it correct to create a directory inside /home/user, and compile and run the program from there?
What am I missing?

Regards, Ricardo
Michael Creel

Re: Is tping enough to assume a cluster is working?

Hi Ricardo,
I believe that everything is working properly. Seeing LAM complain the first time pelican_restarthpc is run (it is called during pelican_setup, and it can be called independently, too, to resize or re-initialize the environment) is normal, because the ~/.ssh/known_hosts file needs to be created. Also, you only see output from the frontend node, which is why part of the output of your hello world program is not seen on screen. To verify that things run on the compute node, you can set up a monitor as is described in a thread on this forum (see homepage for a link) or just open a terminal, ssh 10.11.12.2, and run htop there. Then launch something on the frontend, and you will see cpu activity on the compute node.

Try running pelican_restarthpc another time on the frontend. It should not show any error messages the second time, because known_hosts already exists.

About the results of the longer program, I believe that any difference in the results is due to the program, but I haven't checked carefully.

Cheers. Michael