MPI_THREAD_MULTIPLE support broken #157

@ompiteam

Description

Support for MPI_THREAD_MULTIPLE is currently broken on trunk. Even very simple programs that just do an MPI_Init_thread with MPI_THREAD_MULTIPLE and then call MPI_Barrier() will hang.
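
For concreteness, a minimal reproducer of the sort described (a sketch; the file name and line numbers only loosely match the test2.c in the backtrace below):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Requesting MPI_THREAD_MULTIPLE makes opal_using_threads() true,
       so opal_condition_wait() takes the thread-aware path. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Hangs on trunk: the main thread blocks in a condition wait and
       nothing else drives opal_progress() to complete the barrier. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}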

Example of a backtrace when --mpi-preconnect_mpi 1 is passed to mpirun (with preconnect enabled, the hang occurs inside MPI_Init_thread itself):

(gdb) bt
#0  0x0000008039720d6c in .pthread_cond_wait () from /lib64/power6/libpthread.so.0
#1  0x00000400001299d8 in opal_condition_wait (c=0x400004763f8, m=0x40000476460)
    at ../../ompi-trunk.chris2/opal/threads/condition.h:79
#2  0x000004000012a08c in ompi_request_default_wait_all (count=2, requests=0xfffffa9db20, 
    statuses=0x0) at ../../ompi-trunk.chris2/ompi/request/req_wait.c:281
#3  0x000004000012f56c in ompi_init_preconnect_mpi ()
    at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_preconnect.c:72
#4  0x000004000012c738 in ompi_mpi_init (argc=1, argv=0xfffffa9f278, requested=3, 
    provided=0xfffffa9edd8) at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_init.c:800
#5  0x000004000017a064 in PMPI_Init_thread (argc=0xfffffa9ee20, argv=0xfffffa9ee28, required=3, 
    provided=0xfffffa9edd8) at pinit_thread.c:84
#6  0x0000000010000ae4 in main (argc=1, argv=0xfffffa9f278) at test2.c:15

Running without MPI_THREAD_MULTIPLE but against the same build works fine.

I think the problem is due to changes between 1.6 and trunk in opal/threads/condition.h. From the 1.6 branch:

    if (opal_using_threads()) {
#if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_PROGRESS_THREADS
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
#elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_PROGRESS_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
#else
        if (c->c_signaled) {
            c->c_waiting--;
            opal_mutex_unlock(m);
            opal_progress();

and from trunk:

    if (opal_using_threads()) {
#if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_MULTI_THREADS
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
#elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_MULTI_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
#else
        if (c->c_signaled) {
            c->c_waiting--;
            opal_mutex_unlock(m);
            opal_progress();

Now, in 1.6, OPAL_ENABLE_PROGRESS_THREADS is hardcoded off by
configure. So even with MPI threads enabled, when we are in
ompi_request_default_wait_all and call opal_condition_wait, we take
the #else branch and still call opal_progress.
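
That #else branch continues past the excerpt above with a polling loop roughly like the following (paraphrased from the 1.6 source, details omitted), so the waiting thread itself keeps the progress engine running:

        while (c->c_signaled == 0) {
            opal_mutex_unlock(m);
            opal_progress();
            opal_mutex_lock(m);
        }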

In trunk, OPAL_ENABLE_MULTI_THREADS is set to 1 if MPI threads are
enabled. Note that OPAL_ENABLE_MULTI_THREADS also exists in 1.6 and is
likewise set to 1 if MPI threads are enabled, but as seen above it is
not used there to control how opal_condition_wait behaves.

So in trunk, when MPI_THREAD_MULTIPLE is requested in init, the
pthread_cond_wait path is taken. MPI programs get stuck because the
main thread sits in pthread_cond_wait and there appears to be no one
around to call opal_progress. I've looked around the OMPI code for
where a thread should be spawned to service opal_progress, but I
haven't been able to find it.
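
In other words (a toy illustration, not OMPI code), a condition-variable wait deadlocks when the only thread that could do the work leading to the signal is the one blocked waiting:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool done = false;

void wait_for_request(void)
{
    pthread_mutex_lock(&lock);
    while (!done) {
        /* Blocks forever: with no progress thread, nobody completes
           the request and signals cond. The 1.6 #else path instead
           dropped the lock and called opal_progress() here, letting
           this same thread do the work itself. */
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}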

Between 1.6 and trunk, OPAL_ENABLE_PROGRESS_THREADS seems to have
disappeared and OMPI_ENABLE_PROGRESS_THREADS has appeared. The latter
is hardcoded off. I tried to compile with OMPI_ENABLE_PROGRESS_THREADS
set, but there are compile errors (presumably why it's turned off). So
I'm wondering whether, in opal_condition_wait and a few other places,
OPAL_ENABLE_MULTI_THREADS should in fact be OMPI_ENABLE_PROGRESS_THREADS.

If I change a few of those OPAL_ENABLE_MULTI_THREADS occurrences to
OMPI_ENABLE_PROGRESS_THREADS (I don't know whether I changed all that
need changing; a sketch of the change follows), then I can start
running threaded MPI programs again.
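
Concretely, the kind of change I mean in opal/threads/condition.h, shown as an illustrative diff (I haven't audited every site that tests OPAL_ENABLE_MULTI_THREADS):

--- a/opal/threads/condition.h
+++ b/opal/threads/condition.h
     if (opal_using_threads()) {
-#if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_MULTI_THREADS
+#if OPAL_HAVE_POSIX_THREADS && OMPI_ENABLE_PROGRESS_THREADS
         rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
-#elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_MULTI_THREADS
+#elif OPAL_HAVE_SOLARIS_THREADS && OMPI_ENABLE_PROGRESS_THREADS
         rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
 #else

With OMPI_ENABLE_PROGRESS_THREADS hardcoded off, this makes the preprocessor fall into the #else branch again, restoring the 1.6 behavior of polling opal_progress() from the waiting thread.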
