MVAPICH2 Release Information

The following is reproduced essentially verbatim from files contained within the MVAPICH2 tarball downloaded from http://mvapich.cse.ohio-state.edu/

The MVAPICH2 User Guide is available at http://mvapich.cse.ohio-state.edu/support/.

MVAPICH2-2.1 introduces an algorithm to determine CPU topology on the node. This algorithm does not work properly with older Mellanox controllers and firmware, so software threads do not spread out across a node's cores by default. The problem is fixed in MVAPICH2-2.2 and later.

Before updating to MVAPICH2-2.1, the cluster administrator should determine whether the cluster is vulnerable to this problem. On each node that contains an InfiniBand controller, execute ibstat. If the first output line is:

CA 'mthca0'

then that node may exhibit the problem. The cluster administrator has two choices. The first is to avoid updating the mvapich2-scyld packages; the mvapich2-psm-scyld packages can still be updated, because they are used only by QLogic InfiniBand controllers, which do not have the problem. The second is to update mvapich2-scyld, run tests to determine whether the problem exists on the Mellanox mthca nodes, and, if it does, instruct users to employ explicit CPU mapping. See http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-540006.5 for details.
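
The ibstat check described above can be scripted. The following is a minimal sketch in Python, assuming the ibstat output for a node has already been captured as text. The function name and the sample strings are illustrative only and are not part of MVAPICH2 or the Scyld packages.

```python
def node_may_be_affected(ibstat_output):
    """Return True if the first non-blank line of ibstat output is CA 'mthca0'."""
    for line in ibstat_output.splitlines():
        stripped = line.strip()
        if stripped:
            # Only the first non-blank line matters for this check.
            return stripped == "CA 'mthca0'"
    # Empty output: no InfiniBand controller was reported.
    return False


if __name__ == "__main__":
    # Hypothetical first lines, used only to exercise the function.
    for sample in ("CA 'mthca0'\n", "CA 'mlx4_0'\n", ""):
        print(repr(sample), node_may_be_affected(sample))
```

On a real node the text could be obtained with something like subprocess.run(["ibstat"], capture_output=True, text=True).stdout, assuming ibstat is installed and on the PATH.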

MVAPICH2 Changelog
------------------
This file briefly describes the changes to the MVAPICH2 software
package.  The logs are arranged in the "most recent first" order.

MVAPICH2 2.3.1 (03/01/2019)

* Features and Enhancements (since 2.3):
    - Add support for JSM and Flux resource managers
    - Architecture detection, enhanced point-to-point and collective tuning for
      AMD Epyc system
    - Enhanced point-to-point and collective tuning for IBM POWER9 and ARM
      systems
    - Add support of DDN Infinite Memory Engine (IME) to ROMIO
        - Thanks to Sylvain Didelot @DDN for the patch
    - Optimize performance of MPI_Wait operation
    - Update to hwloc 1.11.11

* Bug Fixes (since 2.3):
    - Fix autogen error with Flang compiler on ARM systems
        - Thanks to Nathan Sircombe @ARM for the patch
    - Fix issues with shmem collectives on ARM architecture
        - Thanks to Pavel Shamis @ARM for the patch
    - Fix issues with MPI-3 shared memory windows for PSM-CH3 and PSM2-CH3
      channel
        - Thanks to Adam Moody @LLNL for the report
    - Fix segfault in MPI_Reduce
        - Thanks to Samuel Khuvis @OSC for the report
    - Fix compilation issues with IBM XLC compiler
        - Thanks to Ken Raffenetti and Yanfei Guo @ANL for the patch
    - Fix issues with MPI_Mprobe/Improbe and MPI_Mrecv/Imrecv for PSM-CH3 and
      PSM2-CH3 channel
        - Thanks to Adam Moody @LLNL for the report
    - Fix compilation issues with PGI compilers for CUDA-enabled builds
    - Fix potential hangs in MPI_Finalize
    - Fix issues in handling very large messages with RGET protocol
    - Fix issues with handling GPU buffers
    - Fix issue with hardware multicast based Allreduce
    - Fix build issue with TCP/IP-CH3 channel
    - Fix memory leaks exposed by TotalView
        - Thanks to Adam Moody @LLNL for the report
    - Fix issues with cleaning up temporary files generated in CUDA builds
    - Fix compilation warnings

MVAPICH2 2.3 (07/23/2018)

* Features and Enhancements (since 2.3rc2):
    - Add point-to-point and collective tuning for IBM POWER9 CPUs
    - Enhanced collective tuning for IBM POWER8, Intel Skylake, Intel KNL, Intel
      Broadwell architectures

* Bug Fixes (since 2.3rc2):
    - Fix issues in CH3-TCP/IP channel
    - Fix build and runtime issues with CUDA support
    - Fix error when XRC and RoCE were enabled at the same time
    - Fix issue with XRC connection establishment
    - Fix for failure at finalize seen on iWARP enabled devices
    - Fix issue with MPI_IN_PLACE-based communication in MPI_Reduce and
      MPI_Reduce_scatter
    - Fix issue with allocating large number of shared memory based MPI3-RMA
      windows
    - Fix failure in mpirun_rsh with large number of nodes
    - Fix singleton initialization issue with SLURM/PMI2 and PSM/Omni-Path
        - Thanks to Adam Moody @LLNL for the report
    - Fix build failure when enabling GPFS support in ROMIO
        - Thanks to Doug Johnson @OHTech for the report
    - Fix issues with architecture detection in PSM-CH3 and PSM2-CH3 channels
    - Fix failures with CMA read at very large message sizes
    - Fix failures with MV2_SHOW_HCA_BINDING on single-node jobs
    - Fix compilation warnings and memory leaks

MVAPICH2 2.3rc2 (04/30/2018)

* Features and Enhancements (since 2.3rc1):
    - Based on MPICH v3.2.1
    - Enhanced small message performance for MPI_Alltoallv
    - Improve performance for host-based transfers when CUDA is enabled
    - Add architecture detection for IBM POWER9 CPUs
    - Enhance architecture detection for Intel Skylake CPUs
    - Enhance MPI initialization to gracefully handle RDMA_CM failures
    - Improve algorithm selection of several collectives
    - Enhance detection of number and IP addresses of IB devices
    - Tested with CLANG v5.0.0

* Bug Fixes (since 2.3rc1):
    - Fix issue in autogen step with duplicate error messages
    - Fix issue with XRC connection establishment
    - Fix build issue with SLES 15 and Perl 5.26.1
        - Thanks to Matias A Cabral @Intel for the report and patch
    - Fix segfault when manually selecting collective algorithms
    - Fix cleanup of preallocated RDMA_FP regions at RDMA_CM finalize
    - Fix compilation warnings and memory leaks

MVAPICH2 2.3rc1 (02/19/2018)

* Features and Enhancements (since 2.3b):
    - Enhanced performance for Allreduce, Reduce_scatter_block, Allgather,
      Allgatherv through new algorithms
        - Thanks to Danielle Sikich and Adam Moody @ LLNL for the patch
    - Enhance support for MPI_T PVARs and CVARs
    - Improved job startup time for OFA-IB-CH3, PSM-CH3, and PSM2-CH3
    - Support to automatically detect IP address of IB/RoCE interfaces when
      RDMA_CM is enabled without relying on mv2.conf file
    - Enhance HCA detection to handle cases where node has both IB and RoCE HCAs
    - Automatically detect and use maximum supported MTU by the HCA
    - Added logic to detect heterogeneous CPU/HFI configurations in PSM-CH3 and
      PSM2-CH3 channels
        - Thanks to Matias Cabral@Intel for the report
    - Enhanced intra-node and inter-node tuning for PSM-CH3 and PSM2-CH3
      channels
    - Enhanced HFI selection logic for systems with multiple Omni-Path HFIs
    - Enhanced tuning and architecture detection for OpenPOWER, Intel Skylake
      and Cavium ARM (ThunderX) systems
    - Added 'SPREAD', 'BUNCH', and 'SCATTER' binding options for hybrid CPU
      binding policy
    - Rename MV2_THREADS_BINDING_POLICY to MV2_HYBRID_BINDING_POLICY
    - Added support for MV2_SHOW_CPU_BINDING to display number of OMP threads
    - Update to hwloc version 1.11.9
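
The 2.3rc1 entry above renames MV2_THREADS_BINDING_POLICY to MV2_HYBRID_BINDING_POLICY. Job scripts written for 2.3b may still set the old name. The following is a minimal sketch of a helper that rewrites an environment mapping to the new name. The helper is hypothetical, and the changelog does not say whether the old name is still honored or whether old values such as linear remain valid, so the sketch only renames the key and leaves the value untouched.

```python
OLD_NAME = "MV2_THREADS_BINDING_POLICY"  # name introduced in 2.3b
NEW_NAME = "MV2_HYBRID_BINDING_POLICY"   # name used from 2.3rc1 onward


def migrate_binding_env(env):
    """Return a copy of env in which the old variable name is replaced."""
    updated = dict(env)
    if OLD_NAME in updated:
        old_value = updated.pop(OLD_NAME)
        # If the new name is already set, it takes precedence.
        updated.setdefault(NEW_NAME, old_value)
    return updated


if __name__ == "__main__":
    print(migrate_binding_env({OLD_NAME: "linear", "OMP_NUM_THREADS": "4"}))
```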

* Bug Fixes (since 2.3b):
    - Fix issue with RDMA_CM in multi-rail scenario
    - Fix issues in nullpscw RMA test
    - Fix issue with reduce and allreduce algorithms for large message sizes
    - Fix hang issue in hydra when no SLURM environment is present
        - Thanks to Vaibhav Sundriyal for the report
    - Fix issue to test Fortran KIND with FFLAGS
        - Thanks to Rob Latham@mcs.anl.gov for the patch
    - Fix issue in parsing environment variables
    - Fix issue in displaying process to HCA binding
    - Enhance CPU binding logic to handle vendor specific core mappings
    - Fix compilation warnings and memory leaks

MVAPICH2 2.3b (08/10/2017)

* Features and Enhancements (since 2.3a):
    - Enhance performance of point-to-point operations for CH3-Gen2 (InfiniBand),
      CH3-PSM, and CH3-PSM2 (Omni-Path) channels
    - Improve performance for MPI-3 RMA operations
    - Introduce support for Cavium ARM (ThunderX) systems
    - Improve support for process to core mapping on many-core systems
        - New environment variable MV2_THREADS_BINDING_POLICY for
          multi-threaded MPI and MPI+OpenMP applications
        - Support `linear' and `compact' placement of threads
        - Warn user if oversubscription of core is detected
    - Improve launch time for large-scale jobs with mpirun_rsh
    - Add support for non-blocking Allreduce using Mellanox SHARP
    - Efficient support for different Intel Knights Landing (KNL) models
    - Improve performance for Intra- and Inter-node communication for OpenPOWER
      architecture
    - Improve support for large processes per node and hugepages on SMP systems
    - Enhance collective tuning for Intel Knights Landing and Intel Omni-Path
      based systems
    - Enhance collective tuning for Bebop@ANL, Bridges@PSC, and Stampede2@TACC
      systems
    - Enhance large message intra-node performance with CH3-IB-Gen2 channel on
      Intel Knights Landing
    - Enhance support for MPI_T PVARs and CVARs

* Bug Fixes (since 2.3a):
    - Fix issue with bcast algorithm selection
    - Fix issue with large message transfers using CMA
    - Fix issue in Scatter and Gather with large messages
    - Fix tuning tables for various collectives
    - Fix issue with launching single-process MPI jobs
    - Fix compilation error in the CH3-TCP/IP channel
        - Thanks to Isaac Carroll@Lightfleet for the patch
    - Fix issue with memory barrier instructions on ARM
        - Thanks to Pavel (Pasha) Shamis@ARM for reporting the issue
    - Fix compilation warnings and memory leaks

MVAPICH2 2.3a (03/29/2017)

* Features and Enhancements (since 2.2):
    - Based on and ABI compatible with MPICH 3.2
    - Support collective offload using Mellanox's SHArP for Allreduce
        - Enhance tuning framework for Allreduce using SHArP
    - Introduce capability to run MPI jobs across multiple InfiniBand subnets
    - Introduce basic support for executing MPI jobs in Singularity
    - Enhance collective tuning for Intel Knights Landing and Intel Omni-Path
    - Enhance process mapping support for multi-threaded MPI applications
        - Introduce MV2_CPU_BINDING_POLICY=hybrid
        - Introduce MV2_THREADS_PER_PROCESS
    - On-demand connection management for PSM-CH3 and PSM2-CH3 channels
    - Enhance PSM-CH3 and PSM2-CH3 job startup to use non-blocking PMI calls
    - Enhance debugging support for PSM-CH3 and PSM2-CH3 channels
    - Improve performance of architecture detection
    - Introduce run time parameter MV2_SHOW_HCA_BINDING to show process to HCA
      bindings
    - Enhance MV2_SHOW_CPU_BINDING to enable display of CPU bindings on all
      nodes
    - Deprecate OFA-IB-Nemesis channel
    - Update to hwloc version 1.11.6
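
The 2.3a entries above introduce MV2_CPU_BINDING_POLICY=hybrid and MV2_THREADS_PER_PROCESS for multi-threaded MPI applications. The following is a minimal sketch of how a launch wrapper might assemble these settings. The helper name is hypothetical, and the value "1" used to switch on MV2_SHOW_CPU_BINDING is an assumption, not something stated in this changelog.

```python
def hybrid_binding_env(threads_per_process, show_binding=False):
    """Build MVAPICH2 environment settings for hybrid CPU binding."""
    if threads_per_process < 1:
        raise ValueError("threads_per_process must be at least 1")
    env = {
        "MV2_CPU_BINDING_POLICY": "hybrid",
        "MV2_THREADS_PER_PROCESS": str(threads_per_process),
    }
    if show_binding:
        # Assumption: "1" enables the binding report.
        env["MV2_SHOW_CPU_BINDING"] = "1"
    return env


if __name__ == "__main__":
    print(hybrid_binding_env(4, show_binding=True))
```

The resulting mapping could be merged into a copy of os.environ and passed to the job launcher. How the launcher forwards environment variables to remote ranks depends on the launcher and is not covered here.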

* Bug Fixes (since 2.2):
    - Fix issue with ring startup in multi-rail systems
    - Fix startup issue with SLURM and PMI-1
        - Thanks to Manuel Rodriguez for the report
    - Fix startup issue caused by fix for bash `shellshock' bug
    - Fix issue with very large messages in PSM
    - Fix issue with singleton jobs and PMI-2
        - Thanks to Adam T. Moody@LLNL for the report
    - Fix incorrect reporting of non-existing files with Lustre ADIO
        - Thanks to Wei Kang@NWU for the report
    - Fix hang in MPI_Probe
        - Thanks to John Westlund@Intel for the report
    - Fix issue while setting affinity with Torque Cgroups
        - Thanks to Doug Johnson@OSC for the report
    - Fix runtime errors observed when running MVAPICH2 on aarch64 platforms
        - Thanks to Sreenidhi Bharathkar Ramesh@Broadcom for posting
          the original patch
        - Thanks to Michal Schmidt@RedHat for reposting it
    - Fix failure in mv2_show_cpu_affinity with affinity disabled
        - Thanks to Carlos Rosales-Fernandez@TACC for the report
    - Fix mpirun_rsh error when running short-lived non-MPI jobs
        - Thanks to Kevin Manalo@OSC for the report
    - Fix comment and spelling mistake
        - Thanks to Maksym Planeta for the report
    - Ignore cpusets and cgroups that may have been set by resource manager
        - Thanks to Adam T. Moody@LLNL for the report and the patch
    - Fix reduce tuning table entry for 2ppn 2node
    - Fix compilation issues due to inline keyword with GCC 5 and newer
    - Fix compilation warnings and memory leaks

MVAPICH2 2.2 (09/07/2016)

* Features and Enhancements (since 2.2rc2):
    - Single node collective tuning for Bridges@PSC, Stampede@TACC and other
      architectures
    - Enable PSM builds when both PSM and PSM2 libraries are present
        - Thanks to Adam T. Moody@LLNL for the report and patch
    - Add support for HCAs that return result of atomics in big endian notation
    - Establish loopback connections by default if HCA supports atomics

* Bug Fixes (since 2.2rc2):
    - Fix minor error in use of communicator object in collectives
    - Fix missing u_int64_t declaration with PGI compilers
        - Thanks to Adam T. Moody@LLNL for the report and patch
    - Fix memory leak in RMA rendezvous code path
        - Thanks to Min Si@ANL for the report and patch

MVAPICH2 2.2rc2 (08/08/2016)

* Features and Enhancements (since 2.2rc1):
    - Enhanced performance for MPI_Comm_split through new bitonic algorithm
        - Thanks to Adam T. Moody@LLNL for the patch
    - Enable graceful fallback to Shared Memory if LiMIC2 or CMA transfer fails
    - Enable support for multiple MPI initializations
    - Unify process affinity support in Gen2, PSM and PSM2 channels
    - Remove verbs dependency when building the PSM and PSM2 channels
    - Allow processes to request MPI_THREAD_MULTIPLE when socket or NUMA node
      level affinity is specified
    - Point-to-point and collective performance optimization for Intel Knights
      Landing
    - Automatic detection and tuning for InfiniBand EDR HCAs
    - Warn user to reconfigure library if rank type is not large enough to
      represent all ranks in job
    - Collective tuning for Opal@LLNL, Bridges@PSC, and Stampede-1.5@TACC
    - Tuning and architecture detection for Intel Broadwell processors
    - Add ability to avoid using --enable-new-dtags with ld
        - Thanks to Adam T. Moody@LLNL for the suggestion
    - Add LIBTVMPICH specific CFLAGS and LDFLAGS
        - Thanks to Adam T. Moody@LLNL for the suggestion
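
The 2.2rc2 entry above mentions a bitonic algorithm for MPI_Comm_split. The changelog does not describe the implementation, so the following is only a generic textbook bitonic sorting network applied to (color, key, old rank) triples, which is the ordering MPI_Comm_split requires: ranks within a color are ordered by key, with ties broken by the old rank. It is a serial sketch that assumes a power-of-two input length, and it is not MVAPICH2's code. All names are hypothetical.

```python
def bitonic_sort(items):
    """Sort a power-of-two-length sequence with a bitonic sorting network."""
    n = len(items)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(items)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if ascending:
                        out_of_order = a[i] > a[partner]
                    else:
                        out_of_order = a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def split_ranks(entries):
    """Map old_rank -> (color, new_rank) for (color, key, old_rank) triples."""
    result = {}
    next_rank = {}
    for color, _key, old_rank in bitonic_sort(entries):
        new_rank = next_rank.get(color, 0)
        result[old_rank] = (color, new_rank)
        next_rank[color] = new_rank + 1
    return result


if __name__ == "__main__":
    print(split_ranks([(1, 0, 0), (0, 5, 1), (1, -1, 2), (0, 5, 3)]))
```

In a sorting network the compare-exchange pattern is fixed in advance and does not depend on the data, which is why this family of algorithms suits distributed settings where each process holds one element.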

* Bug Fixes (since 2.2rc1):
    - Disable optimization that removes use of calloc in ptmalloc hook
      detection code
        - Thanks to Karl W. Schulz@Intel
    - Fix weak alias typos (allows successful compilation with CLANG compiler)
        - Thanks to Min Dong@Old Dominion University for the patch
    - Fix issues in PSM large message gather operations
        - Thanks to Adam T. Moody@LLNL for the report
    - Enhance error checking in collective tuning code
        - Thanks to Jan Bierbaum@Technical University of Dresden for the patch
    - Fix issues with UD based communication in RoCE mode
    - Fix issues with PMI2 support in singleton mode
    - Fix default binding bug in hydra launcher
    - Fix issues with Checkpoint Restart when launched with mpirun_rsh
    - Fix fortran binding issues with Intel 2016 compilers
    - Fix issues with socket/NUMA node level binding
    - Disable atomics when using Connect-IB with RDMA_CM
    - Fix hang in MPI_Finalize when using hybrid channel
    - Fix memory leaks

MVAPICH2 2.2rc1 (03/29/2016)

* Features and Enhancements (since 2.2b):
    - Support for OpenPower architecture
        - Optimized inter-node and intra-node communication
    - Support for Intel Omni-Path architecture
        - Thanks to Intel for contributing the patch
        - Introduction of a new PSM2 channel for Omni-Path
    - Support for RoCEv2
    - Architecture detection for PSC Bridges system with Omni-Path
    - Enhanced startup performance and reduced memory footprint for storing
      InfiniBand end-point information with SLURM
        - Support for shared memory based PMI operations
        - Availability of an updated patch from the MVAPICH project website
          with this support for SLURM installations
    - Optimized pt-to-pt and collective tuning for Chameleon InfiniBand
      systems at TACC/UoC
    - Enable affinity by default for TrueScale(PSM) and Omni-Path(PSM2)
      channels
    - Enhanced tuning for shared-memory based MPI_Bcast
    - Enhanced debugging support and error messages
    - Update to hwloc version 1.11.2

* Bug Fixes (since 2.2b):
    - Fix issue in some of the internal algorithms used for MPI_Bcast,
      MPI_Alltoall and MPI_Reduce
    - Fix hang in one of the internal algorithms used for MPI_Scatter
        - Thanks to Ivan Raikov@Stanford for reporting this issue
    - Fix issue with rdma_connect operation
    - Fix issue with Dynamic Process Management feature
    - Fix issue with de-allocating InfiniBand resources in blocking mode
    - Fix build errors caused due to improper compile time guards
        - Thanks to Adam Moody@LLNL for the report
    - Fix finalize hang when running in hybrid or UD-only mode
        - Thanks to Jerome Vienne@TACC for reporting this issue
    - Fix issue in MPI_Win_flush operation
        - Thanks to Nenad Vukicevic for reporting this issue
    - Fix out of memory issues with non-blocking collectives code
        - Thanks to Phanisri Pradeep Pratapa and Fang Liu@GaTech for
          reporting this issue
    - Fix fall-through bug in external32 pack
        - Thanks to Adam Moody@LLNL for the report and patch
    - Fix issue with on-demand connection establishment and blocking mode
        - Thanks to Maksym Planeta@TU Dresden for the report
    - Fix memory leaks in hardware multicast based broadcast code
    - Fix memory leaks in TrueScale(PSM) channel
    - Fix compilation warnings

MVAPICH2 2.2b (11/12/2015)

* Features and Enhancements (since 2.2a):
    - Enhanced performance for small messages
    - Enhanced startup performance with SLURM
        - Support for PMIX_Iallgather and PMIX_Ifence
    - Support to enable affinity with asynchronous progress thread
    - Enhanced support for MPIT based performance variables
    - Tuned VBUF size for performance
    - Improved startup performance for QLogic PSM-CH3 channel
        - Thanks to Maksym Planeta@TU Dresden for the patch

* Bug Fixes (since 2.2a):
    - Fix issue with MPI_Get_count in QLogic PSM-CH3 channel with very large
      messages (>2GB)
    - Fix issues with shared memory collectives and checkpoint-restart
    - Fix hang with checkpoint-restart
    - Fix issue with unlinking shared memory files
    - Fix memory leak with MPIT
    - Fix minor typos and usage of inline and static keywords
        - Thanks to Maksym Planeta@TU Dresden for the patch and suggestions
    - Fix missing MPIDI_FUNC_EXIT
        - Thanks to Maksym Planeta@TU Dresden for the patch
    - Remove unused code
        - Thanks to Maksym Planeta@TU Dresden for the patch
    - Continue with warning if user asks to enable XRC when the system does not
      support XRC

MVAPICH2 2.2a (08/17/2015)

* Features and Enhancements (since 2.1 GA):

  - Based on MPICH 3.1.4
  - Support for backing on-demand UD CM information with shared memory
    for minimizing memory footprint
  - Reorganized HCA-aware process mapping
  - Dynamic identification of maximum read/atomic operations supported by HCA
  - Enabling support for intra-node communications in RoCE mode without
    shared memory
  - Updated to hwloc 1.11.0
  - Updated to sm_20 kernel optimizations for MPI Datatypes
  - Automatic detection and tuning for 24-core Haswell architecture

* Bug Fixes (since 2.1 GA):

  - Fix for error with multi-vbuf design for GPU based communication
  - Fix bugs with hybrid UD/RC/XRC communications
  - Fix for MPICH putfence/getfence for large messages
  - Fix for error in collective tuning framework
  - Fix validation failure with Alltoall with IN_PLACE option
     - Thanks to Mahidhar Tatineni @SDSC for the report
  - Fix bug with MPI_Reduce with IN_PLACE option
     - Thanks to Markus Geimer for the report
  - Fix for compilation failures with multicast disabled
     - Thanks to Devesh Sharma @Emulex for the report
  - Fix bug with MPI_Bcast
  - Fix IPC selection for shared GPU mode systems
  - Fix for build time warnings and memory leaks
  - Fix issues with Dynamic Process Management
     - Thanks to Neil Spruit for the report
  - Fix bug in architecture detection code
     - Thanks to Adam Moody @LLNL for the report

MVAPICH2-2.1 (04/03/2015)

* Features and Enhancements (since 2.1rc2):
    - Tuning for EDR adapters
    - Optimization of collectives for SDSC Comet system

* Bug-Fixes (since 2.1rc2):
    - Relocate reading environment variables in PSM
        - Thanks to Adam Moody@LLNL for the suggestion
    - Fix issue with automatic process mapping
    - Fix issue with checkpoint restart when full path is not given
    - Fix issue with Dynamic Process Management
    - Fix issue in CUDA IPC code path
    - Fix corner case in CMA runtime detection

MVAPICH2-2.1rc2 (03/12/2015)

* Features and Enhancements (since 2.1rc1):
    - Based on MPICH-3.1.4
    - Enhanced startup performance with mpirun_rsh
    - Checkpoint-Restart Support with DMTCP (Distributed MultiThreaded
      CheckPointing)
        - Thanks to the DMTCP project team (http://dmtcp.sourceforge.net/)
    - Support for handling very large messages in RMA
    - Optimize size of buffer requested for control messages in large message
      transfer
    - Enhanced automatic detection of atomic support
    - Optimized collectives (bcast, reduce, and allreduce) for 4K processes
    - Introduce support to sleep for user specified period before aborting
        - Thanks to Adam Moody@LLNL for the suggestion
    - Disable PSM from setting CPU affinity
        - Thanks to Adam Moody@LLNL for providing the patch
    - Install PSM error handler to print more verbose error messages
        - Thanks to Adam Moody@LLNL for providing the patch
    - Introduce retry mechanism to perform psm_ep_open in PSM channel
        - Thanks to Adam Moody@LLNL for providing the patch

* Bug-Fixes (since 2.1rc1):
    - Fix failures with shared memory collectives with checkpoint-restart
    - Fix failures with checkpoint-restart when using internal communication
      buffers of different size
    - Fix undeclared variable error when --disable-cxx is specified with
      configure
        - Thanks to Chris Green from FNAL for the patch
    - Fix segfault seen during connect/accept with dynamic processes
        - Thanks to Neil Spruit for the fix
    - Fix errors with large messages pack/unpack operations in PSM channel
    - Fix for bcast collective tuning
    - Fix assertion errors in one-sided put operations in PSM channel
    - Fix issue with code getting stuck in infinite loop inside ptmalloc
        - Thanks to Adam Moody@LLNL for the suggested changes
    - Fix assertion error in shared memory large message transfers
        - Thanks to Adam Moody@LLNL for reporting the issue
    - Fix compilation warnings

MVAPICH2-2.1rc1 (12/18/2014)

* Features and Enhancements (since 2.1a):
    - Based on MPICH-3.1.3
    - Flexibility to use internal communication buffers of different size for
      improved performance and memory footprint
    - Improve communication performance by removing locks from critical path
    - Enhanced communication performance for small/medium message sizes
    - Support for linking Intel Trace Analyzer and Collector
    - Increase the number of connect retry attempts with RDMA_CM
    - Automatic detection and tuning for Haswell architecture

* Bug-Fixes (since 2.1a):
    - Fix automatic detection of support for atomics
    - Fix issue with void pointer arithmetic with PGI
    - Fix deadlock in ctxidup MPICH test in PSM channel
    - Fix compile warnings

MVAPICH2-2.1a (09/21/2014)

* Features and Enhancements (since 2.0):
    - Based on MPICH-3.1.2
    - Support for PMI-2 based startup with SLURM
    - Enhanced startup performance for Gen2/UD-Hybrid channel
    - GPU support for MPI_Scan and MPI_Exscan collective operations
    - Optimize creation of 2-level communicator
    - Collective optimization for PSM-CH3 channel
    - Tuning for IvyBridge architecture
    - Add -export-all option to mpirun_rsh
    - Support for additional MPI-T performance variables (PVARs)
      in the CH3 channel
    - Link with libstdc++ when building with GPU support
        (required by CUDA 6.5)

* Bug-Fixes (since 2.0):
    - Fix error in large message (>2GB) transfers in CMA code path
    - Fix memory leaks in OFA-IB-CH3 and OFA-IB-Nemesis channels
    - Fix issues with optimizations for broadcast and reduce collectives
    - Fix hang at finalize with Gen2-Hybrid/UD channel
    - Fix issues for collectives with non power-of-two process counts
        - Thanks to Evren Yurtesen for identifying the issue
    - Make ring startup use HCA selected by user
    - Increase counter length for shared-memory collectives

MVAPICH2-2.0 (06/20/2014)

* Features and Enhancements (since 2.0rc2):
    - Consider CMA in collective tuning framework

* Bug-Fixes (since 2.0rc2):
    - Fix bug when disabling registration cache
    - Fix shared memory window bug when shared memory collectives are disabled
    - Fix mpirun_rsh bug when running mpmd programs with no arguments

MVAPICH2-2.0rc2 (05/25/2014)

* Features and Enhancements (since 2.0rc1):
    - CMA support is now enabled by default
    - Optimization of collectives with CMA support
    - RMA optimizations for shared memory and atomic operations
    - Tuning RGET and Atomics operations
    - Tuning RDMA FP-based communication
    - MPI-T support for additional performance and control variables
    - The --enable-mpit-pvars=yes configuration option will now
      enable only MVAPICH2 specific variables
    - Large message transfer support for PSM interface
    - Optimization of collectives for PSM interface
    - Updated to hwloc v1.9

* Bug-Fixes (since 2.0rc1):
    - Fix multicast hang when there is a single process on one node
      and more than one process on other nodes
    - Fix non-power-of-two usage of scatter-doubling-allgather algorithm
    - Fix for bcastzero type hang during finalize
    - Enhanced handling of failures in RDMA_CM based
      connection establishment
    - Fix for a hang in finalize when using RDMA_CM
    - Finish receive request when RDMA READ completes in RGET protocol
    - Always use direct RDMA when flush is used
    - Fix compilation error with --enable-g=all in PSM interface
    - Fix warnings and memory leaks

MVAPICH2-2.0rc1 (03/24/2014)

* Features and Enhancements (since 2.0b):
    - Based on MPICH-3.1
    - Enhanced direct RDMA based designs for MPI_Put and MPI_Get operations in
      OFA-IB-CH3 channel
    - Optimized communication when using MPI_Win_allocate for OFA-IB-CH3
      channel
    - MPI-3 RMA support for CH3-PSM channel
    - Multi-rail support for UD-Hybrid channel
    - Optimized and tuned blocking and non-blocking collectives for OFA-IB-CH3,
      OFA-IB-Nemesis, and CH3-PSM channels
    - Improved hierarchical job startup performance
    - Optimized sub-array data-type processing for GPU-to-GPU communication
    - Tuning for Mellanox Connect-IB adapters
    - Updated hwloc to version 1.8
    - Added options to specify CUDA library paths
    - Deprecation of uDAPL-CH3 channel

* Bug-Fixes (since 2.0b):
    - Fix issues related to MPI-3 RMA locks
    - Fix an issue related to MPI-3 dynamic window
    - Fix issues related to MPI_Win_allocate backed by shared memory
    - Fix issues related to large message transfers for OFA-IB-CH3 and
      OFA-IB-Nemesis channels
    - Fix warning in job launch, when using DPM
    - Fix an issue related to MPI atomic operations on HCAs without atomics
      support
    - Fixed an issue related to selection of compiler. (We prefer the GNU,
      Intel, PGI, and Ekopath compilers in that order).
        - Thanks to Uday R Bondhugula from IISc for the report
    - Fix an issue in message coalescing
    - Prevent printing out inter-node runtime parameters for pure intra-node
      runs
        - Thanks to Jerome Vienne from TACC for the report
    - Fix an issue related to ordering of messages for GPU-to-GPU transfers
    - Fix a few memory leaks and warnings

MVAPICH2-2.0b (11/08/2013)

* Features and Enhancements (since 2.0a):
    - Based on MPICH-3.1b1
    - Multi-rail support for GPU communication
    - Non-blocking streams in asynchronous CUDA transfers for better overlap
    - Initialize GPU resources only when used by MPI transfer
    - Extended support for MPI-3 RMA in OFA-IB-CH3, OFA-IWARP-CH3, and
      OFA-RoCE-CH3
    - Additional MPIT counters and performance variables
    - Updated compiler wrappers to remove application dependency on network and
      other extra libraries
        - Thanks to Adam Moody from LLNL for the suggestion
    - Capability to checkpoint CH3 channel using the Hydra process manager
    - Optimized support for broadcast, reduce and other collectives
    - Tuning for IvyBridge architecture
    - Improved launch time for large-scale mpirun_rsh jobs
    - Introduced retry mechanism in mpirun_rsh for socket binding
    - Updated hwloc to version 1.7.2

* Bug-Fixes (since 2.0a):
    - Consider list provided by MV2_IBA_HCA when scanning device list
    - Fix issues in Nemesis interface with --with-ch3-rank-bits=32
    - Better cleanup of XRC files in corner cases
    - Initialize using better defaults for ibv_modify_qp (initial ring)
    - Add unconditional check and addition of pthread library
    - MPI_Get_library_version updated with proper MVAPICH2 branding
        - Thanks to Jerome Vienne from the TACC for the report

MVAPICH2-2.0a (08/24/2013)

* Features and Enhancements (since 1.9):
    - Based on MPICH-3.0.4
    - Dynamic CUDA initialization. Support GPU device selection after MPI_Init
    - Support for running on heterogeneous clusters with GPU and non-GPU nodes
    - Supporting MPI-3 RMA atomic operations and flush operations with CH3-Gen2
      interface
    - Exposing internal performance variables to MPI-3 Tools information
      interface (MPIT)
    - Enhanced MPI_Bcast performance
    - Enhanced performance for large message MPI_Scatter and MPI_Gather
    - Enhanced intra-node SMP performance
    - Tuned SMP eager threshold parameters
    - Reduced memory footprint
    - Improved job-startup performance
    - Warn and continue when ptmalloc fails to initialize
    - Enable hierarchical SSH-based startup with Checkpoint-Restart
    - Enable the use of Hydra launcher with Checkpoint-Restart

* Bug-Fixes (since 1.9):
    - Fix data validation issue with MPI_Bcast
        - Thanks to Claudio J. Margulis from University of Iowa for the report
    - Fix buffer alignment for large message shared memory transfers
    - Fix a bug in One-Sided shared memory backed windows
    - Fix a flow-control bug in UD transport
        - Thanks to Benjamin M. Auer from NASA for the report
    - Fix bugs with MPI-3 RMA in Nemesis IB interface
    - Fix issue with very large message (>2GB bytes) MPI_Bcast
        - Thanks to Lu Qiyue for the report
    - Handle case where $HOME is not set during search for MV2 user config file
        - Thanks to Adam Moody from LLNL for the patch
    - Fix a hang in connection setup with RDMA-CM

MVAPICH2-1.9 (05/06/2013)

* Features and Enhancements (since 1.9rc1):
    - Updated to hwloc v1.7
    - Tuned Reduce, AllReduce, Scatter, Reduce-Scatter and
      Allgatherv Collectives

* Bug-Fixes (since 1.9rc1):
    - Fix cuda context issue with async progress thread
        - Thanks to Osuna Escamilla Carlos from env.ethz.ch for the report
    - Overwrite pre-existing PSM environment variables
        - Thanks to Adam Moody from LLNL for the patch
    - Fix several warnings
        - Thanks to Adam Moody from LLNL for some of the patches

MVAPICH2-1.9RC1 (04/16/2013)

* Features and Enhancements (since 1.9b):
    - Based on MPICH-3.0.3
    - Updated SCR to version 1.1.8
    - Install utility scripts included with SCR
    - Support for automatic detection of path to utilities used by mpirun_rsh
      during configuration
        - Utilities supported: rsh, ssh, xterm, totalview
    - Support for launching jobs on heterogeneous networks with mpirun_rsh
    - Tuned Bcast, Reduce, Scatter Collectives
    - Tuned MPI performance on Kepler GPUs
    - Introduced MV2_RDMA_CM_CONF_FILE_PATH parameter which specifies path to
      mv2.conf

* Bug-Fixes (since 1.9b):
    - Fix autoconf issue with LiMIC2 source-code
        - Thanks to Doug Johnson from OH-TECH for the report
    - Fix build errors with --enable-thread-cs=per-object and
      --enable-refcount=lock-free
        - Thanks to Marcin Zalewski from Indiana University for the report
    - Fix MPI_Scatter failure with MPI_IN_PLACE
        - Thanks to Mellanox for the report
    - Fix MPI_Scatter failure with cyclic host files
    - Fix deadlocks in PSM interface for multi-threaded jobs
        - Thanks to Marcin Zalewski from Indiana University for the report
    - Fix MPI_Bcast failures in SCALAPACK
        - Thanks to Jerome Vienne from TACC for the report
    - Fix build errors with newer Ekopath compiler
    - Fix a bug with shmem collectives in PSM interface
    - Fix memory corruption when more entries specified in mv2.conf than the
      requested number of rails
        - Thanks to Akihiro Nomura from Tokyo Institute of Technology for the
          report
    - Fix memory corruption with CR configuration in Nemesis interface

MVAPICH2-1.9b (02/28/2013)

* Features and Enhancements (since 1.9a2):
    - Based on MPICH-3.0.2
        - Support for all MPI-3 features
    - Support for single copy intra-node communication using Linux supported
      CMA (Cross Memory Attach)
        - Provides flexibility for intra-node communication: shared memory,
          LiMIC2, and CMA
    - Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart Library (SCR)
        - Support for application-level checkpointing
        - Support for hierarchical system-level checkpointing
    - Improved job startup time
        - Provided a new runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized
          startup on homogeneous clusters
    - New version of LiMIC2 (v0.5.6)
        - Provides support for unlocked ioctl calls
    - Tuned Reduce, Allgather, Reduce_Scatter, Allgatherv collectives
    - Introduced option to export environment variables automatically with
      mpirun_rsh
    - Updated to HWLOC v1.6.1
    - Provided option to use CUDA library call instead of CUDA driver to check
      buffer pointer type
        - Thanks to Christian Robert from Sandia for the suggestion
    - Improved debug messages and error reporting

* Bug-Fixes (since 1.9a2):
    - Fix page fault with memory access violation with LiMIC2 exposed by newer
      Linux kernels
        - Thanks to Karl Schulz from TACC for the report
    - Fix a failure when lazy memory registration is disabled and CUDA is
      enabled
        - Thanks to Jens Glaser from University of Minnesota for the report
    - Fix an issue with variable initialization related to DPM support
    - Rename a few internal variables to avoid name conflicts with external
      applications
        - Thanks to Adam Moody from LLNL for the report
    - Check for libattr during configuration when Checkpoint/Restart and
      Process Migration are requested
        - Thanks to John Gilmore from Vastech for the report
    - Fix build issue with --disable-cxx
    - Set intra-node eager threshold correctly when configured with LiMIC2
    - Fix an issue with MV2_DEFAULT_PKEY in partitioned InfiniBand network
        - Thanks to Jesper Larsen from FCOO for the report
    - Improve makefile rules to use automake macros
        - Thanks to Carmelo Ponti from CSCS for the report
    - Fix configure error with automake conditionals
        - Thanks to Evren Yurtesen from Abo Akademi for the report
    - Fix a few memory leaks and warnings
    - Properly cleanup shared memory files (used by XRC) when applications fail

MVAPICH2-1.9a2 (11/08/2012)

* Features and Enhancements (since 1.9a):
    - Based on MPICH2-1.5
    - Initial support for MPI-3:
      (Available for all interfaces: OFA-IB-CH3, OFA-IWARP-CH3, OFA-RoCE-CH3,
       uDAPL-CH3, OFA-IB-Nemesis, PSM-CH3)
        - Nonblocking collective functions available as "MPIX_" functions
          (e.g., "MPIX_Ibcast")
        - Neighborhood collective routines available as "MPIX_" functions
          (e.g., "MPIX_Neighbor_allgather")
        - MPI_Comm_split_type function available as an "MPIX_" function
        - Support for MPIX_Type_create_hindexed_block
        - Nonblocking communicator duplication routine MPIX_Comm_idup (will
          only work for single-threaded programs)
        - MPIX_Comm_create_group support
        - Support for matched probe functionality (e.g., MPIX_Mprobe,
          MPIX_Improbe, MPIX_Mrecv, and MPIX_Imrecv),
          (Not Available for PSM)
        - Support for "Const" (disabled by default)
    - Efficient vector, hindexed datatype processing on GPU buffers
    - Tuned alltoall, Scatter and Allreduce collectives
    - Support for Mellanox Connect-IB HCA
    - Adaptive number of registration cache entries based on job size
    - Revamped Build system:
        - Uses automake instead of simplemake
        - Allows for parallel builds ("make -j8" and similar)

* Bug-Fixes (since 1.9a):
    - CPU frequency mismatch warning shown under debug
    - Fix issue with MPI_IN_PLACE buffers with CUDA
    - Fix ptmalloc initialization issue due to compiler optimization
        - Thanks to Kyle Sheumaker from ACT for the report
    - Adjustable MAX_NUM_PORTS at build time to support more than two ports
    - Fix issue with MPI_Allreduce with MPI_IN_PLACE send buffer
    - Fix memleak in MPI_Cancel with PSM interface
        - Thanks to Andrew Friedley from LLNL for the report

MVAPICH2-1.9a (09/07/2012)

* Features and Enhancements (since 1.8):
    - Support for InfiniBand hardware UD-multicast
    - UD-multicast-based designs for collectives
      (Bcast, Allreduce and Scatter)
    - Enhanced Bcast and Reduce collectives with pt-to-pt communication
    - LiMIC-based design for Gather collective
    - Improved performance for shared-memory-aware collectives
    - Improved intra-node communication performance with GPU buffers
      using pipelined design
    - Improved inter-node communication performance with GPU buffers
      with non-blocking CUDA copies
    - Improved small message communication performance with
      GPU buffers using CUDA IPC design
    - Improved automatic GPU device selection and CUDA context management
    - Optimal communication channel selection for different
      GPU communication modes (DD, DH and HD) in different
      configurations (intra-IOH and inter-IOH)
    - Removed libibumad dependency for building the library
    - Option for selecting non-default gid-index in a loss-less
      fabric setup in RoCE mode
    - Option to disable signal handler setup
    - Tuned thresholds for various architectures
    - Set DAPL-2.0 as the default version for the uDAPL interface
    - Updated to hwloc v1.5
    - Option to use IP address as a fallback if hostname
      cannot be resolved
    - Improved error reporting

* Bug-Fixes (since 1.8):
    - Fix issue in intra-node knomial bcast
    - Handle gethostbyname return values gracefully
    - Fix corner case issue in two-level gather code path
    - Fix bug in CUDA events/streams pool management
    - Fix ptmalloc initialization issue when MALLOC_CHECK_ is
      defined in the environment
        - Thanks to Mehmet Belgin from Georgia Institute of
          Technology for the report
    - Fix memory corruption and handle heterogeneous architectures
      in gather collective
    - Fix issue in detecting the correct HCA type
    - Fix issue in ring start-up to select correct HCA when
      MV2_IBA_HCA is specified
    - Fix SEGFAULT in MPI_Finalize when IB loop-back is used
    - Fix memory corruption on nodes with 64-cores
        - Thanks to M Xie for the report
    - Fix hang in MPI_Finalize with Nemesis interface when
      ptmalloc initialization fails
        - Thanks to Carson Holt from OICR for the report
    - Fix memory corruption in shared memory communication
        - Thanks to Craig Tierney from NOAA for the report
          and testing the patch
    - Fix issue in IB ring start-up selection with mpiexec.hydra
    - Fix issue in selecting CUDA run-time variables when running
      on single node in SMP only mode
    - Fix a few memory leaks and warnings

MVAPICH2-1.8 (04/30/2012)

* Features and Enhancements (since 1.8rc1):
    - Introduced a unified run time parameter MV2_USE_ONLY_UD to enable UD only
      mode
    - Enhanced designs for Alltoall and Allgather collective communication from
      GPU device buffers
    - Tuned collective communication from GPU device buffers
    - Tuned Gather collective
    - Introduced a run time parameter MV2_SHOW_CPU_BINDING to show current CPU
      bindings
    - Updated to hwloc v1.4.1
    - Remove dependency on LEX and YACC


* Bug-Fixes (since 1.8rc1):
    - Fix hang with multiple GPU configuration
        - Thanks to Jens Glaser from University of Minnesota for the report
    - Fix buffer alignment issues to improve intra-node performance
    - Fix a DPM multispawn behavior
    - Enhanced error reporting in DPM functionality
    - Quote environment variables in job startup to protect from shell
    - Fix hang when LIMIC is enabled
    - Fix hang in environments with heterogeneous HCAs
    - Fix issue when using multiple HCA ports in RDMA_CM mode
        - Thanks to Steve Wise from Open Grid Computing for the report
    - Fix hang during MPI_Finalize in Nemesis IB netmod
    - Fix for a start-up issue in Nemesis with heterogeneous architectures
    - Fix a few memory leaks and warnings

MVAPICH2-1.8rc1 (03/22/2012)

* Features and Enhancements (since 1.8a2):
    - New design for intra-node communication from GPU Device buffers using
      CUDA IPC for better performance and correctness
        - Thanks to Joel Scherpelz from NVIDIA for his suggestions
    - Enabled shared memory communication for host transfers when CUDA is
      enabled
    - Optimized and tuned collectives for GPU device buffers
    - Enhanced pipelined inter-node device transfers
    - Enhanced shared memory design for GPU device transfers for large messages
    - Enhanced support for CPU binding with socket and numanode level
      granularity
    - Support suspend/resume functionality with mpirun_rsh
    - Exporting local rank, local size, global rank and global size through
      environment variables (both mpirun_rsh and hydra)
    - Update to hwloc v1.4
    - Checkpoint-Restart support in OFA-IB-Nemesis interface
    - Enabling run-through stabilization support to handle process failures in
      OFA-IB-Nemesis interface
    - Enhancing OFA-IB-Nemesis interface to handle IB errors gracefully
    - Performance tuning on various architecture clusters
    - Support for Mellanox IB FDR adapter
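
As an illustration of the rank and size environment variables mentioned above,
the following minimal Python sketch reads a process's node-local rank and uses
it to choose a GPU index. The variable name MV2_COMM_WORLD_LOCAL_RANK is an
assumption based on the MVAPICH2 user guide and should be checked against the
installed version; pick_device and gpus_per_node are hypothetical names used
only for this example.

```python
import os


def pick_device(gpus_per_node, env=None):
    """Return a GPU index for this process based on its node-local rank."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    env = os.environ if env is None else env
    # Assumed variable name; default to rank 0 when not launched by
    # mpirun_rsh or hydra (for example, when run as a plain script).
    local_rank = int(env.get("MV2_COMM_WORLD_LOCAL_RANK", "0"))
    # Round-robin assignment of local ranks to the GPUs on the node.
    return local_rank % gpus_per_node


if __name__ == "__main__":
    print(pick_device(gpus_per_node=2))
```

Passing the environment as an argument keeps the function easy to exercise
outside of an MPI job.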

* Bug-Fixes (since 1.8a2):
    - Fix a hang issue on InfiniHost SDR/DDR cards
        - Thanks to Nirmal Seenu from Fermilab for the report
    - Fix an issue with runtime parameter MV2_USE_COALESCE usage
    - Fix an issue with LiMIC2 when CUDA is enabled
    - Fix an issue with intra-node communication using datatypes and GPU device
      buffers
    - Fix an issue with Dynamic Process Management when launching processes on
      multiple nodes
        - Thanks to Rutger Hofman from VU Amsterdam for the report
    - Fix build issue in hwloc source with mcmodel=medium flags
        - Thanks to Nirmal Seenu from Fermilab for the report
    - Fix a build issue in hwloc with --disable-shared or --disable-static
      options
    - Use portable stdout and stderr redirection
        - Thanks to Dr. Axel Philipp from MTU Aero Engines for the patch
    - Fix a build issue with PGI 12.2
        - Thanks to Thomas Rothrock from U.S. Army SMDC for the patch
    - Fix an issue with send message queue in OFA-IB-Nemesis interface
    - Fix a process cleanup issue in Hydra when MPI_ABORT is called (upstream
      MPICH2 patch)
    - Fix an issue with non-contiguous datatypes in MPI_Gather
    - Fix a few memory leaks and warnings

MVAPICH2-1.8a2 (02/02/2012)

* Features and Enhancements (since 1.8a1p1):
    - Support for collective communication from GPU buffers
    - Non-contiguous datatype support in point-to-point and collective
      communication from GPU buffers
    - Efficient GPU-GPU transfers within a node using CUDA IPC (for CUDA 4.1)
    - Alternate synchronization mechanism using CUDA Events for pipelined device
      data transfers
    - Exporting a process's local rank within a node through an environment
      variable
    - Adjust shared-memory communication block size at runtime
    - Enable XRC by default at configure time
    - New shared memory design for enhanced intra-node small message performance
    - Tuned inter-node and intra-node performance on different cluster
      architectures
    - Update to hwloc v1.3.1
    - Support for fallback to R3 rendezvous protocol if RGET fails
    - SLURM integration with mpiexec.mpirun_rsh to use SLURM allocated hosts
      without specifying a hostfile
    - Support added to automatically use PBS_NODEFILE in Torque and PBS
      environments
    - Enable signal-triggered (SIGUSR2) migration

* Bug Fixes (since 1.8a1p1):
    - Set process affinity independently of SMP enable/disable to control the
      affinity in loopback mode
    - Report error and exit if user requests MV2_USE_CUDA=1 in non-cuda
      configuration
    - Fix for data validation error with GPU buffers
    - Updated WRAPPER_CPPFLAGS when using --with-cuda. Users should not have to
      explicitly specify CPPFLAGS or LDFLAGS to build applications
    - Fix for several compilation warnings
    - Report an error message if user requests MV2_USE_XRC=1 in non-XRC
      configuration
    - Remove debug prints in regular code path with MV2_USE_BLOCKING=1
        - Thanks to Vaibhav Dutt for the report
    - Handling shared memory collective buffers in a dynamic manner to eliminate
      static setting of maximum CPU core count
    - Fix for validation issue in MPICH2 strided_get_indexed.c
    - Fix a bug in packetized transfers on heterogeneous clusters
    - Fix for deadlock between psm_ep_connect and PMGR_COLLECTIVE calls on
      QLogic systems
        - Thanks to Adam T. Moody for the patch
    - Fix a bug in MPI_Allocate_mem when it is called with size 0
        - Thanks to Michele De Stefano for reporting this issue
    - Create vendor for Open64 compilers and add rpath for unknown compilers
        - Thanks to Martin Hilgemen from Dell Inc. for the initial patch
    - Fix issue due to overlapping buffers with sprintf
        - Thanks to Mark Debbage from QLogic for reporting this issue
    - Fallback to using GNU options for unknown f90 compilers
    - Fix hang in PMI_Barrier due to incorrect handling of the socket return
      values in mpirun_rsh
    - Unify the redundant FTB events used to initiate a migration
    - Fix memory leaks when mpirun_rsh reads hostfiles
    - Fix a bug where the library attempts to use an inactive rail in a
      multi-rail scenario

MVAPICH2-1.8a1p1 (11/14/2011)

* Bug Fixes (since 1.8a1)
    - Fix for a data validation issue in GPU transfers
        - Thanks to Massimiliano Fatica, NVIDIA, for reporting this issue
    - Tuned CUDA block size to 256K for better performance
    - Enhanced error checking for CUDA library calls
    - Fix for mpirun_rsh issue while launching applications on Linux Kernels
      (3.x)

MVAPICH2-1.8a1 (11/09/2011)

* Features and Enhancements (since 1.7):
    - Support for MPI communication from NVIDIA GPU device memory
        - High performance RDMA-based inter-node point-to-point communication
          (GPU-GPU, GPU-Host and Host-GPU)
        - High performance intra-node point-to-point communication for
          multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
        - Communication with contiguous datatype
    - Reduced memory footprint of the library
    - Enhanced one-sided communication design with reduced memory requirement
    - Enhancements and tuned collectives (Bcast and Alltoallv)
    - Update to hwloc v1.3.0
    - Flexible HCA selection with Nemesis interface
        - Thanks to Grigori Inozemtsev, Queens University
    - Support iWARP interoperability between Intel NE020 and Chelsio T4 Adapters
    - RoCE enable environment variable name is changed from MV2_USE_RDMAOE to
      MV2_USE_RoCE

* Bug Fixes (since 1.7):
    - Fix for a bug in mpirun_rsh while doing process clean-up in abort and
      other error scenarios
    - Fixes for code compilation warnings
    - Fix for memory leaks in RDMA CM code path

MVAPICH2-1.7 (10/14/2011)

* Features and Enhancements (since 1.7rc2):
    - Support SHMEM collectives up to 64 cores/node
    - Update to hwloc v1.2.2
    - Enhanced and tuned collective (GatherV)

* Bug Fixes:
    - Fixes for code compilation warnings
    - Fix job clean-up issues with mpirun_rsh
    - Fix a hang with RDMA CM

MVAPICH2-1.7rc2 (09/19/2011)

* Features and Enhancements (since 1.7rc1):
    - Based on MPICH2-1.4.1p1
    - Integrated Hybrid (UD-RC/XRC) design to get best performance
      on large-scale systems with reduced/constant memory footprint
    - Shared memory backed Windows for One-Sided Communication
    - Support for truly passive locking for intra-node RMA in shared
      memory and LIMIC based windows
    - Integrated with Portable Hardware Locality (hwloc v1.2.1)
    - Integrated with latest OSU Micro-Benchmarks (3.4)
    - Enhancements and tuned collectives (Allreduce and Allgatherv)
    - MPI_THREAD_SINGLE provided by default and MPI_THREAD_MULTIPLE as an
      option
    - Enabling Checkpoint/Restart support in pure SMP mode
    - Optimization for QDR cards
    - On-demand connection management support with IB CM (RoCE interface)
    - Optimization to limit number of RDMA Fast Path connections for very large
      clusters (Nemesis interface)
    - Multi-core-aware collective support (QLogic PSM interface)

* Bug Fixes:
    - Fixes for code compilation warnings
    - Compiler preference lists reordered to avoid mixing GCC and Intel
      compilers if both are found by configure
    - Fix a bug in transferring very large messages (>2GB)
        - Thanks to Tibor Pausz from Univ. of Frankfurt for reporting it
    - Fix a hang with One-Sided Put operation
    - Fix a bug in ptmalloc integration
    - Avoid double-free crash with mpispawn
    - Avoid crash and print an error message in mpirun_rsh when the hostfile is
      empty
    - Checking for error codes in PMI design
    - Verify programs can link with LiMIC2 at runtime
    - Fix for compilation issue when BLCR or FTB installed in non-system paths
    - Fix an issue with RDMA-Migration
    - Fix for memory leaks
    - Fix an issue in supporting RoCE with the second port available on the HCA
        - Thanks to Jeffrey Konz from HP for reporting it
    - Fix for a hang with passive RMA tests (QLogic PSM interface)

MVAPICH2-1.7rc1 (07/20/2011)

* Features and Enhancements (since 1.7a2)
    - Based on MPICH2-1.4
    - CH3 shared memory channel for standalone hosts (including laptops)
      without any InfiniBand adapters
    - HugePage support
    - Improved on-demand InfiniBand connection setup
    - Optimized Fence synchronization (with and without LIMIC2 support)
    - Enhanced mpirun_rsh design to avoid race conditions and support for
      improved debug messages
    - Optimized design for collectives (Bcast and Reduce)
    - Improved performance for medium size messages for QLogic PSM
    - Support for Ekopath Compiler

* Bug Fixes
    - Fixes in Dynamic Process Management (DPM) support
    - Fixes in Checkpoint/Restart and Migration support
    - Fix Restart when using automatic checkpoint
        - Thanks to Alexandr for reporting this
    - Compilation warnings fixes
    - Handling very large one-sided transfers using RDMA
    - Fixes for memory leaks
    - Graceful handling of unknown HCAs
    - Better handling of shmem file creation errors
    - Fix for a hang in intra-node transfer
    - Fix for a build error with --disable-weak-symbols
        - Thanks to Peter Willis for reporting this issue
    - Fixes for one-sided communication with passive target synchronization
    - Proper error reporting when a program is linked with both static and
      shared MVAPICH2 libraries

MVAPICH2-1.7a2 (06/03/2011)

* Features and Enhancements (Since 1.7a)
    - Improved intra-node shared memory communication performance
    - Tuned RDMA Fast Path Buffer size to get better performance
      with less memory footprint (CH3 and Nemesis)
    - Fast process migration using RDMA
    - Automatic inter-node communication parameter tuning
      based on platform and adapter detection (Nemesis)
    - Automatic intra-node communication parameter tuning
      based on platform
    - Efficient connection set-up for multi-core systems
    - Enhancements for collectives (barrier, gather and allgather)
    - Compact and shorthand way to specify blocks of processes on the same
      host with mpirun_rsh
    - Support for latest stable version of HWLOC v1.2
    - Improved debug message output in process management and fault tolerance
      functionality
    - Better handling of process signals and error management in mpispawn
    - Performance tuning for pt-to-pt and several collective operations

* Bug fixes
    - Fixes for memory leaks
    - Fixes in CR/migration
    - Better handling of memory allocation and registration failures
    - Fixes for compilation warnings
    - Fix a bug that disallows '=' from mpirun_rsh arguments
    - Handling of non-contiguous transfer in Nemesis interface
    - Bug fix in gather collective when ranks are in cyclic order
    - Fix for the ignore_locks bug in MPI-IO with Lustre

MVAPICH2-1.7a (04/19/2011)

* Features and Enhancements

    - Based on MPICH2-1.3.2p1
    - Integrated with Portable Hardware Locality (hwloc v1.1.1)
    - Supporting Large Data transfers (>2GB)
    - Integrated with Enhanced LiMIC2 (v0.5.5) to support Intra-node
      large message (>2GB) transfers
    - Optimized and tuned algorithm for AlltoAll
    - Enhanced debugging config options to generate
      core files and back-traces
    - Support for Chelsio's T4 Adapter

MVAPICH2-1.6 (03/09/2011)

* Features and Enhancements (since 1.6-RC3)
    - Improved configure help for MVAPICH2 features
    - Updated Hydra launcher with MPICH2-1.3.3 Hydra process manager
    - Building and installation of OSU micro benchmarks during default
      MVAPICH2 installation
    - Hydra is the default mpiexec process manager

* Bug fixes (since 1.6-RC3)
    - Fix hang issues in RMA
    - Fix memory leaks
    - Fix in RDMA_FP

MVAPICH2-1.6-RC3 (02/15/2011)

* Features and Enhancements
    - Support for 3D torus topology with appropriate SL settings
        - For both CH3 and Nemesis interfaces
        - Thanks to Jim Schutt, Marcus Epperson and John Nagle from
          Sandia for the initial patch
    - Quality of Service (QoS) support with multiple InfiniBand SL
        - For both CH3 and Nemesis interfaces
    - Configuration file support (similar to the one available in MVAPICH).
      Provides a convenient method for handling all runtime variables
      through a configuration file.
    - Improved job-startup performance on large-scale systems
    - Optimization in MPI_Finalize
    - Improved pt-to-pt communication performance for small and
      medium messages
    - Optimized and tuned algorithms for Gather and Scatter collective
      operations
    - Optimized thresholds for one-sided RMA operations
    - User-friendly configuration options to enable/disable various
      checkpoint/restart and migration features
    - Enabled ROMIO's auto detection scheme for filetypes
      on Lustre file system
    - Improved error checking for system and BLCR calls in
      checkpoint-restart and migration codepath
    - Enhanced OSU Micro-benchmarks suite (version 3.3)

* Bug Fixes
    - Fix in aggregate ADIO alignment
    - Fix for an issue with LiMIC2 header
    - XRC connection management
    - Fixes in registration cache
    - IB card detection with MV2_IBA_HCA runtime option in
      multi rail design
    - Fix for a bug in multi-rail design while opening multiple HCAs
    - Fixes for multiple memory leaks
    - Fix for a bug in mpirun_rsh
    - Checks before enabling aggregation and migration
    - Fixing the build errors with --disable-cxx
        - Thanks to Bright Yang for reporting this issue
    - Fixing the build errors related to "pthread_spinlock_t"
      seen on RHEL systems

MVAPICH2-1.6-RC2 (12/22/2010)

* Features and Enhancements
    - Optimization and enhanced performance for clusters with NVIDIA
      GPU adapters (with and without GPUDirect technology)
    - Enhanced R3 rendezvous protocol
        - For both CH3 and Nemesis interfaces
    - Robust RDMA Fast Path setup to avoid memory allocation
      failures
        - For both CH3 and Nemesis interfaces
    - Multiple design enhancements for better performance of
      medium sized messages
    - Enhancements and optimizations for one sided Put and Get operations
    - Enhancements and tuning of Allgather for small and medium
      sized messages
    - Optimization of AllReduce
    - Enhancements to Multi-rail Design and features including striping
      of one-sided messages
    - Enhancements to mpirun_rsh job start-up scheme
    - Enhanced designs for automatic detection of various
      architectures and adapters

* Bug fixes
    - Fix a bug in Post-Wait/Start-Complete path for one-sided
      operations
    - Resolving a hang in mpirun_rsh termination when CR is enabled
    - Fixing issue in MPI_Allreduce and Reduce when called with MPI_IN_PLACE
        - Thanks to Alexander Alekhin for the initial patch
    - Fix for an issue in rail selection for small RMA messages
    - Fix for threading related errors with comm_dup
    - Fix for alignment issues in RDMA Fast Path
    - Fix for extra memcpy in header caching
    - Fix for an issue to use the correct HCA when a process to rail binding
      scheme is used in combination with XRC
    - Fix for an RMA issue when configured with enable-g=meminit
        - Thanks to James Dinan of Argonne for reporting this issue
    - Only set FC and F77 if gfortran is executable


MVAPICH2-1.6RC1 (11/12/2010)

* Features and Enhancements
    - Using LiMIC2 for efficient intra-node RMA transfer to avoid extra
      memory copies
    - Upgraded to LiMIC2 version 0.5.4
    - Removing the limitation on number of concurrent windows in RMA
      operations
    - Support for InfiniBand Quality of Service (QoS) with multiple lanes
    - Enhanced support for multi-threaded applications
    - Fast Checkpoint-Restart support with aggregation scheme
    - Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance
    - Support for new standardized Fault Tolerant Backplane (FTB) Events
      for Checkpoint-Restart and Job Pause-Migration-Restart Framework
    - Dynamic detection of multiple InfiniBand adapters and using these
      by default in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and
      OFA-RoCE-CH3 interfaces)
    - Support for process-to-rail binding policy (bunch, scatter and
      user-defined) in multi-rail configurations (OFA-IB-CH3, OFA-iWARP-CH3 and
      OFA-RoCE-CH3 interfaces)
    - Enhanced and optimized algorithms for MPI_Reduce and MPI_AllReduce
      operations for small and medium message sizes.
    - XRC support with Hydra Process Manager
    - Improved usability of process to CPU mapping with support of
      delimiters (',' , '-') in CPU listing
        - Thanks to Gilles Civario for the initial patch
    - Use of gfortran as the default F77 compiler
    - Support of Shared-Memory-Nemesis interface on multi-core platforms
      requiring intra-node communication only (SMP-only systems, laptops, etc.)
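
To illustrate the ',' and '-' delimiters in CPU listings mentioned above, here
is a minimal Python sketch that expands such a listing into explicit core
numbers. It is not the library's parser. The use of ':' to separate
per-process entries is an assumption based on the MV2_CPU_MAPPING parameter
described in the user guide, and expand_cpu_list and parse_cpu_mapping are
hypothetical names.

```python
def expand_cpu_list(spec):
    """Expand a listing such as '0,2-4' into [0, 2, 3, 4]."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # 'a-b' denotes the inclusive range of cores a..b.
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError("invalid CPU range: " + part)
            cpus.extend(range(low, high + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_cpu_mapping(mapping):
    """Return one core list per process (assumed ':'-separated entries)."""
    return [expand_cpu_list(entry) for entry in mapping.split(":")]


if __name__ == "__main__":
    print(parse_cpu_mapping("0,2-4:1:5-6"))
```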

* Bug fixes
    - Fix for memory leak in one-sided code with --enable-g=all
       --enable-error-messages=all
    - Fix for memory leak in getting the context of intra-communicator
    - Fix for shmat() return code check
    - Fix for issues with inter-communicator collectives in Nemesis
    - KNEM patch for osu_bibw issue with KNEM version 0.9.2
    - Fix for osu_bibw error with Shared-memory-Nemesis interface
    - Fix for Win_test error for one-sided RDMA
    - Fix for a hang in collective when thread level is set to multiple
    - Fix for intel test errors with rsend, bsend and ssend operations in Nemesis
    - Fix for memory free issue when the memory is allocated by scandir
    - Fix for a hang in Finalize
    - Fix for issue with MPIU_Find_local_and_external when it is called
      from MPIDI_CH3I_comm_create
    - Fix for handling CPPFLAGS values with spaces
    - Dynamic Process Management to work with XRC support
    - Fix related to disabling CPU affinity when shared memory is turned off at run time

MVAPICH2-1.5.1 (09/14/10)

* Features and Enhancements
    - Significantly reduce memory footprint on some systems by changing the
      stack size setting for multi-rail configurations
    - Optimization to the number of RDMA Fast Path connections
    - Performance improvements in Scatterv and Gatherv collectives for CH3
      interface (Thanks to Dan Kokran and Max Suarez of NASA for identifying
      the issue)
    - Tuning of Broadcast Collective
    - Support for tuning of eager thresholds based on both adapter and platform
      type
    - Environment variables for message sizes can now be expressed in short
      form K=Kilobytes and M=Megabytes (e.g.  MV2_IBA_EAGER_THRESHOLD=12K)
    - Ability to selectively use some or all HCAs using colon separated lists.
      e.g. MV2_IBA_HCA=mlx4_0:mlx4_1
    - Improved Bunch/Scatter mapping for process binding with HWLOC and SMT
      support (Thanks to Dr. Bernd Kallies of ZIB for ideas and suggestions)
    - Update to Hydra code from MPICH2-1.3b1
    - Auto-detection of various iWARP adapters
    - Specifying MV2_USE_IWARP=1 is no longer needed when using iWARP
    - Changing automatic eager threshold selection and tuning for iWARP
      adapters based on number of nodes in the system instead of the number of
      processes
    - PSM progress loop optimization for QLogic Adapters (Thanks to Dr.
      Avneesh Pant of QLogic for the patch)
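
The short-form message sizes and colon-separated HCA lists listed above can be
illustrated with a minimal Python sketch. It shows only how such values might
be interpreted and is not the library's own parsing code. It assumes that K
and M are binary multiples (1024 and 1024*1024 bytes) and that lowercase
suffixes are accepted; neither is stated in the changelog. parse_size and
parse_hca_list are hypothetical names.

```python
def parse_size(value):
    """Convert '12K' or '2M' to bytes; plain integers pass through."""
    value = value.strip()
    # Assumption: binary multiples (1K = 1024 bytes, 1M = 1024 * 1024 bytes).
    multipliers = {"K": 1024, "M": 1024 * 1024}
    suffix = value[-1:].upper()
    if suffix in multipliers:
        return int(value[:-1]) * multipliers[suffix]
    return int(value)


def parse_hca_list(value):
    """Split a colon-separated list such as 'mlx4_0:mlx4_1' into names."""
    return [name for name in value.split(":") if name]


if __name__ == "__main__":
    print(parse_size("12K"))                # 12288
    print(parse_hca_list("mlx4_0:mlx4_1"))  # ['mlx4_0', 'mlx4_1']
```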

* Bug fixes
    - Fix memory leak in registration cache with --enable-g=all
    - Fix memory leak in operations using datatype modules
    - Fix for rdma_cross_connect issue for RDMA CM. The server is prevented
      from initiating a connection.
    - Don't fail during build if RDMA CM is unavailable
    - Various mpirun_rsh bug fixes for CH3, Nemesis and uDAPL interfaces
    - ROMIO panfs build fix
    - Update panfs for not-so-new ADIO file function pointers
    - Shared libraries can be generated with unknown compilers
    - Explicitly link against DL library to prevent build error due to DSO link
      change in Fedora 13 (introduced with gcc-4.4.3-5.fc13)
    - Fix regression that prevents the proper use of our internal HWLOC
      component
    - Remove spurious debug flags when certain options are selected at build
      time
    - Error code added for situation when received eager SMP message is larger
      than receive buffer
    - Fix for Gather and GatherV back-to-back hang problem with LiMIC2
    - Fix for packetized send in Nemesis
    - Fix related to eager threshold in nemesis ib-netmod
    - Fix initialization parameter for Nemesis based on adapter type
    - Fix for uDAPL one sided operations (Thanks to Jakub Fedoruk from Intel
      for reporting this)
    - Fix an issue with out-of-order message handling for iWARP
    - Fixes for memory leak and Shared context Handling in PSM for QLogic
      Adapters (Thanks to Dr. Avneesh Pant of QLogic for the patch)


MVAPICH2-1.5 (07/09/10)

* Features and Enhancements (since 1.5-RC2)
    - SRQ turned on by default for Nemesis interface
    - Performance tuning - adjusted eager thresholds for a
      variety of architectures, vbuf size based on adapter
      types and vbuf pool sizes
    - Tuning for Intel iWARP NE020 adapter, thanks to Harry
      Cropper of Intel
    - Introduction of a retry mechanism for RDMA_CM connection
      establishment

* Bug fixes (since 1.5-RC2)
    - Fix in build process with hwloc (for some Distros)
    - Fix for memory leak (Nemesis interface)


MVAPICH2-1.5-RC2 (06/21/10)

* Features and Enhancements (since 1.5-RC1)
    - Support for hwloc library (1.0.1) for defining CPU affinity
    - Deprecating the PLPA support for defining CPU affinity
    - Efficient CPU affinity policies (bunch and scatter) to
      specify CPU affinity per job for modern multi-core platforms
    - New flag in mpirun_rsh to execute tasks with different group IDs
    - Enhancement to the design of Win_complete for RMA operations
    - Flexibility to support variable number of RMA windows
    - Support for Intel iWARP NE020 adapter
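
The difference between the bunch and scatter policies named above can be
sketched as follows. This is a simplified model, not the MVAPICH2
implementation (which derives the topology from hwloc): it assumes identical
sockets and cores numbered socket by socket, and the function name map_ranks
is hypothetical.

```python
from typing import List


def map_ranks(num_ranks: int, sockets: int, cores_per_socket: int,
              policy: str) -> List[int]:
    """Return a list whose entry i is the core id assigned to local rank i.

    Assumption: core id = socket_index * cores_per_socket + core_index.
    """
    if num_ranks > sockets * cores_per_socket:
        raise ValueError("more ranks than cores")
    if policy == "bunch":
        # Pack ranks onto consecutive cores so neighbors share a socket.
        return list(range(num_ranks))
    if policy == "scatter":
        # Deal ranks round-robin across sockets to spread them out.
        return [(rank % sockets) * cores_per_socket + rank // sockets
                for rank in range(num_ranks)]
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    print(map_ranks(4, 2, 4, "bunch"))    # [0, 1, 2, 3]
    print(map_ranks(4, 2, 4, "scatter"))  # [0, 4, 1, 5]
```

Bunch tends to favor ranks that communicate heavily with their neighbors,
while scatter gives each rank a larger share of per-socket resources.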

* Bug fixes (since 1.5-RC1)
    - Compilation issue with the ROMIO adio-lustre driver, thanks
      to Adam Moody of LLNL for reporting the issue
    - Allowing checkpoint-restart for large-scale systems
    - Correcting a bug in clear_kvc function. Thanks to T J (Chris) Ward,
      IBM Research, for reporting and providing the resolving patch
    - Shared lock operations with RMA with scatter process distribution.
      Thanks to Pavan Balaji of Argonne for reporting this issue
    - Fix a bug during window creation in uDAPL
    - Compilation issue with --enable-alloca. Thanks to E. Borisch
      for reporting and providing the patch
    - Improved error message for ibv_poll_cq failures
    - Fix an issue that prevents mpirun_rsh from executing programs found
      in PATH directories when no path is specified
    - Fix an issue of mpirun_rsh with Dynamic Process Migration (DPM)
    - Fix for memory leaks (both CH3 and Nemesis interfaces)
    - Updatefiles correctly update LiMIC2
    - Several fixes to the registration cache
      (CH3, Nemesis and uDAPL interfaces)
    - Fix to multi-rail communication
    - Fix to Shared Memory communication Progress Engine
    - Fix to all-to-all collective for large number of processes


MVAPICH2-1.5-RC1 (05/04/10)

* Features and Enhancements
    - MPI 2.2 compliant
    - Based on MPICH2-1.2.1p1
    - OFA-IB-Nemesis interface design
        - OpenFabrics InfiniBand network module support for
          MPICH2 Nemesis modular design
        - Support for high-performance intra-node shared memory
          communication provided by the Nemesis design
        - Adaptive RDMA Fastpath with Polling Set for high-performance
          inter-node communication
        - Shared Receive Queue (SRQ) support with flow control,
          uses significantly less memory for MPI library
        - Header caching
        - Advanced AVL tree-based Resource-aware registration cache
        - Memory Hook Support provided by integration with ptmalloc2
          library. This provides safe release of memory to the
          Operating System and is expected to benefit the memory
          usage of applications that heavily use malloc and free
          operations.
        - Support for TotalView debugger
        - Shared Library Support for existing binary MPI application
          programs to run
        - ROMIO Support for MPI-IO
        - Support for additional features (such as hwloc,
          hierarchical collectives, one-sided, multithreading, etc.),
          as included in the MPICH2 1.2.1p1 Nemesis channel
    - Flexible process manager support
        - mpirun_rsh to work with any of the eight interfaces
          (CH3 and Nemesis channel-based) including OFA-IB-Nemesis,
          TCP/IP-CH3 and TCP/IP-Nemesis
        - Hydra process manager to work with any of the eight interfaces
          (CH3 and Nemesis channel-based) including OFA-IB-CH3,
          OFA-iWARP-CH3, OFA-RoCE-CH3 and TCP/IP-CH3
    - MPIEXEC_TIMEOUT is honored by mpirun_rsh

* Bug fixes since 1.4.1
    - Fix compilation error when configured with
      `--enable-thread-funneled'
    - Fix MPE functionality, thanks to Anthony Chan for
      reporting and providing the resolving patch
    - Cleanup after a failure in the init phase is handled better by
      mpirun_rsh
    - Path determination is correctly handled by mpirun_rsh when DPM is
      used
    - Shared libraries are correctly built (again)


MVAPICH2-1.4.1

* Enhancements since mvapich2-1.4
   - MPMD launch capability to mpirun_rsh
   - Portable Hardware Locality (hwloc) support, patch suggested by
     Dr. Bernd Kallies <kallies@zib.de>
   - Multi-port support for iWARP
   - Enhanced iWARP design for scalability to higher process count
   - Ring based startup support for RDMAoE

* Bug fixes since mvapich2-1.4
   - Fixes for MPE and other profiling tools
     as suggested by Anthony Chan (chan@mcs.anl.gov)
   - Fixes for finalization issue with dynamic process management
   - Removed overrides to PSM_SHAREDCONTEXT, PSM_SHAREDCONTEXTS_MAX variables.
     Suggested by Ben Truscott <b.s.truscott@bristol.ac.uk>.
   - Fixing the error check for buffer aliasing in MPI_Reduce as
     suggested by Dr. Rajeev Thakur <thakur@mcs.anl.gov>
   - Fix Totalview integration for RHEL5
   - Update simplemake to handle build timestamp issues
   - Fixes for --enable-g={mem, meminit}
   - Improved logic to control the receive and send requests to handle the
     limitation of CQ Depth on iWARP
   - Fixing assertion failures with IMB-EXT tests
   - VBUF size for very small iWARP clusters bumped up to 33K
   - Replace internal mallocs with MPIU_Malloc uniformly for correct
     tracing with --enable-g=mem
   - Fixing multi-port for iWARP
   - Fix memory leaks
   - Shared-memory reduce fixes for MPI_Reduce invoked with MPI_IN_PLACE
   - Handling RDMA_CM_EVENT_TIMEWAIT_EXIT event
   - Fix for threaded-ctxdup mpich2 test
   - Detecting spawn errors, patch contributed by
     Dr. Bernd Kallies <kallies@zib.de>
   - IMB-EXT fixes reported by Yutaka from Cray Japan
   - Fix alltoall assertion error when limic is used

MVAPICH2-1.4

* Enhancements since mvapich2-1.4rc2
    - Efficient runtime CPU binding
    - Add an environment variable for controlling the use of multiple CQs for
      iWARP interface.
    - Add environmental variables to disable registration cache for All-to-All
      on large systems.
    - Performance tune for pt-to-pt Intra-node communication with LiMIC2
    - Performance tune for MPI_Broadcast

* Bug fixes since mvapich2-1.4rc2
    - Fix the reading error in lock_get_response by adding
      initialization to req->mrail.protocol
    - Fix mpirun_rsh scalability issue with hierarchical ssh scheme
      when launching greater than 8K processes.
    - Add mvapich_ prefix to yacc functions. This can avoid some namespace
      issues when linking with other libraries.  Thanks to Manhui Wang
      <wangm9@cardiff.ac.uk> for contributing the patch.

MVAPICH2-1.4-rc2

* Enhancements since mvapich2-1.4rc1
    - Added Feature: Check-point Restart with Fault-Tolerant Backplane Support
      (FTB_CR)
    - Added Feature: Multiple CQ-based design for Chelsio iWARP
    - Distribute LiMIC2-0.5.2 with MVAPICH2. Added flexibility for selecting
      and using a pre-existing installation of LiMIC2
    - Increase the command line length that mpirun_rsh can handle (Thanks
      to Bill Barth @ TACC for the suggestion)

* Bug fixes since mvapich2-1.4rc1
    - Fix for hang with packetized send using RDMA Fast path
    - Fix to allow the use of user-specified P_Keys (Thanks to Mike Heinz @
      QLogic)
    - Fix for allowing mpirun_rsh to accept parameters through the
      parameters file (Thanks to Mike Heinz @ QLogic)
    - Modify the default value of shmem_bcast_leaders to 4K
    - Fix for one-sided with XRC support
    - Fix hang with XRC
    - Fix to always enable MVAPICH2_Sync_Checkpoint functionality
    - Fix build error on RHEL 4 systems (Reported by Nathan Baca and Jonathan
      Atencio)
    - Fix issue with PGI compilation for PSM interface
    - Fix for one-sided accumulate function with user-defined contiguous
      datatypes
    - Fix linear/hierarchical switching logic and reduce threshold for the
      enhanced mpirun_rsh framework.
    - Clean up intra-node connection management code for iWARP
    - Fix --enable-g=all issue with uDAPL interface
    - Fix one sided operation with on demand CM.
    - Fix VPATH build

MVAPICH2-1.4-rc1

* Bugs fixed since MVAPICH2-1.2p1

  - Changed parameters for iWARP for increased scalability

  - Fix error with derived datatypes and Put and Accumulate operations.
    Request was being marked complete before data transfer
    had actually taken place when MV_RNDV_PROTOCOL=R3 was used

  - Unregister stale memory registrations earlier to prevent
    malloc failures

  - Fix for compilation issues with --enable-g=mem and --enable-g=all

  - Change dapl_prepost_noop_extra value from 5 to 8 to prevent
    credit flow issues.

  - Re-enable RGET (RDMA Read) functionality

  - Fix SRQ Finalize error
    Make sure that finalize does not hang when the srq_post_cond is
    being waited on.

  - Fix a multi-rail one-sided error when multiple QPs are used

  - PMI Lookup name failure with SLURM

  - Port auto-detection failure when the 1st HCA did
    not have an active port

  - Change default small message scheduling for multirail
    for higher performance

  - MPE support for shared memory collectives now available

MVAPICH2-1.2p1 (11/11/2008)

* Changes since MVAPICH2-1.2

  - Fix shared-memory communication issue for AMD Barcelona systems.

MVAPICH2-1.2 (11/06/2008)

* Bugs fixed since MVAPICH2-1.2-rc2

  - Ignore the last bit of the pkey and remove the pkey_ix option since the
    index can be different on different machines. Thanks to Pasha@Mellanox for
    the patch.

  - Fix data types for memory allocations. Thanks to Dr. Bill Barth from TACC
    for the patches.

  - Fix a bug when MV2_NUM_HCAS is larger than the number of active HCAs.

  - Allow builds on architectures for which tuning parameters do not exist.
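
The pkey item above can be pictured with the following minimal sketch. It
assumes that the ignored bit is the high-order membership bit of the 16-bit
InfiniBand P_Key (full versus limited membership); the changelog itself only
says "last bit". The names PKEY_BASE_MASK and find_pkey_index are hypothetical
and are not taken from the MVAPICH2 sources.

```python
from typing import List, Optional

# Assumption: bit 15 of a P_Key is the membership bit and the low 15 bits
# identify the partition.
PKEY_BASE_MASK = 0x7FFF


def find_pkey_index(pkey_table: List[int], wanted: int) -> Optional[int]:
    """Look up the local table index of a partition, ignoring membership.

    Searching by value on each node avoids relying on a fixed index, which
    may differ from machine to machine.
    """
    for index, entry in enumerate(pkey_table):
        if (entry & PKEY_BASE_MASK) == (wanted & PKEY_BASE_MASK):
            return index
    return None


if __name__ == "__main__":
    table = [0xFFFF, 0x8001]
    print(find_pkey_index(table, 0x0001))  # matches the 0x8001 entry
```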

* Changes related to the mpirun_rsh framework

  - Always build and install mpirun_rsh in addition to the process manager(s)
    selected through the --with-pm mechanism.

  - Cleaner job abort handling

  - Ability to detect the path to mpispawn if the Linux proc filesystem is
    available.

  - Added Totalview debugger support

  - Stdin is only available to rank 0.  Other ranks get /dev/null.
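
The stdin rule above can be sketched with a small launcher helper. This is an
illustration only and does not reflect how mpirun_rsh or mpispawn is
implemented; the function name launch_rank and its parameters are
hypothetical.

```python
import subprocess
import sys


def launch_rank(rank, argv, launcher_stdin=None):
    """Start one process; only rank 0 receives the launcher's stdin.

    launcher_stdin=None means "inherit the parent's stdin" for rank 0.
    Every other rank is connected to the null device (/dev/null on Linux).
    """
    stdin = launcher_stdin if rank == 0 else subprocess.DEVNULL
    return subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)


if __name__ == "__main__":
    child = [sys.executable, "-c",
             "import sys; print(len(sys.stdin.read()))"]
    # A non-zero rank reads end-of-file immediately.
    out, _ = launch_rank(1, child).communicate()
    print(out.decode().strip())
```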

* Other miscellaneous changes

  - Add sequence numbers for RPUT and RGET finish packets.

  - Increase the number of allowed nodes for shared memory broadcast to 4K.

  - Use /dev/shm on Linux as the default temporary file path for shared memory
    communication. Thanks to Doug Johnson@OSC for the patch.

  - MV2_DEFAULT_MAX_WQE has been replaced with MV2_DEFAULT_MAX_SEND_WQE and
    MV2_DEFAULT_MAX_RECV_WQE for send and recv wqes, respectively.

  - Fix compilation warnings.

MVAPICH2-1.2-RC2 (08/20/2008)

* Following bugs are fixed in RC2

  - Properly handle the scenario in shared memory broadcast code when the
    datatypes of different processes taking part in broadcast are different.

  - Fix a bug in Checkpoint-Restart code to determine whether a connection is a
    shared memory connection or a network connection.

  - Support non-standard path for BLCR header files.

  - Increase the maximum heap size to avoid race condition in realloc().

  - Use int32_t for rank for larger jobs with 32k processes or more.

  - Improve mvapich2-1.2 bandwidth to the same level as mvapich2-1.0.3.

  - An error handling patch for uDAPL interface. Thanks to Nilesh Awate for
    the patch.

  - Explicitly set some of the EP attributes when on demand connection is used
    in uDAPL interface.
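
The int32_t item above is easy to motivate with a small demonstration. The
sketch assumes that ranks were previously held in a signed 16-bit field, which
the changelog does not state explicitly; the helper name store_rank is
hypothetical.

```python
import ctypes


def store_rank(rank, ctype):
    """Store a rank in a fixed-width C integer and read it back."""
    return ctype(rank).value


if __name__ == "__main__":
    # A signed 16-bit field holds at most 32767, so rank 32768 wraps around.
    print(store_rank(32767, ctypes.c_int16))
    print(store_rank(32768, ctypes.c_int16))
    # A signed 32-bit field keeps the value intact.
    print(store_rank(32768, ctypes.c_int32))
```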

MVAPICH2-1.2-RC1 (07/02/08)

* Following features are added for this new mvapich2-1.2 release:

  - Based on MPICH2 1.0.7

  - Scalable and robust daemon-less job startup

       -- Enhanced and robust mpirun_rsh framework (non-MPD-based) to
          provide scalable job launching on multi-thousand core clusters

       -- Available for OpenFabrics (IB and iWARP) and uDAPL interfaces
          (including Solaris)

  - Adding support for intra-node shared memory communication with Checkpoint-restart

       -- Allows best performance and scalability with fault-tolerance
          support

  - Enhancement to software installation

       -- Change to full autoconf-based configuration
       -- Adding an application (mpiname) for querying the MVAPICH2 library
          version and configuration information

  - Enhanced processor affinity using PLPA for multi-core architectures

  - Allows user-defined flexible processor affinity

  - Enhanced scalability for RDMA-based direct one-sided communication
    with less communication resource

  - Shared memory optimized MPI_Bcast operations

  - Optimized and tuned MPI_Alltoall

MVAPICH2-1.0.2 (02/20/08)

* Change the default MV2_DAPL_PROVIDER to OpenIB-cma

* Remove extraneous parameter is_blocking from the gen2 interface for
  MPIDI_CH3I_MRAILI_Get_next_vbuf

* Explicitly name unions in struct ibv_wr_descriptor and reference the
  members in the code properly.

* Change "inline" functions to "static inline" properly.

* Increase the maximum number of buffer allocations for communication
  intensive applications

* Corrections for warnings from the Sun Studio 12 compiler.

* If malloc hook initialization fails, then turn off registration
  cache

* Add MV_R3_THRESHOLD and MV_R3_NOCACHE_THRESHOLD which allow
  R3 to be used for smaller messages instead of registering the
  buffer and using a zero-copy protocol.

* Fixed an error in message coalescing.

* Setting application initiated checkpoint as default if CR is turned on.
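
The R3 threshold item above describes a size-based protocol choice, which can
be sketched as follows. This is a hedged illustration, not the library's
logic: the function name choose_protocol is hypothetical, the strict
less-than comparison is an assumption, and so is the reading that the NOCACHE
threshold applies when the registration cache is turned off.

```python
def choose_protocol(size_bytes, r3_threshold, r3_nocache_threshold,
                    cache_enabled=True):
    """Pick a transfer path for a message of the given size.

    Below the applicable threshold the copy-based R3 path is used, which
    avoids registering the user buffer. Otherwise the buffer is registered
    and a zero-copy protocol is used.
    """
    limit = r3_threshold if cache_enabled else r3_nocache_threshold
    return "R3" if size_bytes < limit else "zero-copy"


if __name__ == "__main__":
    # Threshold values here are made up for the example.
    print(choose_protocol(2048, 4096, 32768))
    print(choose_protocol(16384, 4096, 32768))
    print(choose_protocol(16384, 4096, 32768, cache_enabled=False))
```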


MVAPICH2-1.0.1 (10/29/07)

* Enhance uDAPL initialization, set all ep_attr fields properly.
  Thanks to Kanoj Sarcar from NetXen for the patch.

* Fixing a bug that miscalculates the receive size when a complex
  datatype is used.
  Thanks to Patrice Martinez from Bull for reporting this problem.

* Minor patches for fixing (i) NBO for rdma-cm ports and (ii) rank
  variable usage in DEBUG_PRINT in rdma-cm.c
  Thanks to Steve Wise for reporting these.


MVAPICH2-1.0 (09/14/07)

* Following features and bug fixes are added in this new MVAPICH2-1.0 release:

- Message coalescing support to reduce per queue-pair send queues
  and thereby lower the memory requirement on large scale
  clusters. This design also increases the small message messaging
  rate significantly. Available for OpenFabrics Gen2-IB.

- Hot-Spot Avoidance Mechanism (HSAM) for alleviating
  network congestion in large scale clusters. Available for
  OpenFabrics Gen2-IB.

- RDMA CM based on-demand connection management for large scale
  clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP.

- uDAPL on-demand connection management for large scale clusters.
  Available for uDAPL interface (including Solaris IB implementation).

- RDMA Read support for increased overlap of computation and
  communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP.

- Application-initiated system-level (synchronous) checkpointing in
  addition to the user-transparent checkpointing. User application can
  now request a whole program checkpoint synchronously with BLCR by
  calling special functions within the application. Available for
  OpenFabrics Gen2-IB.

- Network-Level fault tolerance with Automatic Path Migration (APM)
  for tolerating intermittent network failures over InfiniBand.
  Available for OpenFabrics Gen2-IB.

- Integrated multi-rail communication support for OpenFabrics
  Gen2-iWARP.

- Blocking mode of communication progress. Available for OpenFabrics
  Gen2-IB.

- Based on MPICH2 1.0.5p4.


* Fix for hang while using IMB with -multi option.
  Thanks to Pasha (Mellanox) for reporting this.

* Fix for hang in memory allocations > 2^31 - 1.
  Thanks to Bryan Putnam (Purdue) for reporting this.

* Fix for RDMA_CM finalize rdma_destroy_id failure.
  Added Timeout env variable for RDMA_CM ARP.
  Thanks to Steve Wise for suggesting these.

* Fix for RDMA_CM invalid event in finalize. Thanks to Steve Wise and Sean Hefty.

* Fix for memory leaks related to shared memory collectives

* Updated src/mpi/romio/adio/ad_panfs/Makefile.in include path to find mpi.h.
  Contributed by David Gunter, Los Alamos National Laboratory.

* Fixed header caching error on handling datatype messages with small vector
  sizes.

* Change the finalization protocol for UD connection manager.

* Fix for the "command line too long" problem. Contributed by Xavier Bru
  <xavier.bru@bull.net> from Bull (http://www.bull.net/)

* Change the CKPT handling to invalidate all unused registration cache.

* Added ofed 1.2 interface change patch for iwarp/rdma_cm from Steve Wise.

* Fix for rdma_cm_get_event err in finalize. Reported by Steve Wise.

* Fix for when MV2_IBA_HCA is used. Contributed by Michael Schwind
  of Technical Univ. of Chemnitz (Germany).


MVAPICH2-0.9.8 (11/10/06)

* Following features are added in this new MVAPICH2-0.9.8 release:

- BLCR based Checkpoint/Restart support

- iWARP support: tested with Chelsio and Ammasso adapters and OpenFabrics/Gen2 stack

- RDMA CM connection management support

- Shared memory optimizations for collective communication operations

- uDAPL support for NetEffect 10GigE adapter.


MVAPICH2-0.9.6 (10/22/06)

* Following features and bug fixes are added in this new MVAPICH2-0.9.6 release:

- Added on demand connection management.

- Enhance shared memory communication support.

- Added ptmalloc memory hook support.

- Runtime selection for most configuration options.


MVAPICH2-0.9.5 (08/30/06)

* Following features and bug fixes are added in this new MVAPICH2-0.9.5 release:

- Added multi-rail support for both point to point and direct one-sided
  operations.

- Added adaptive RDMA fast path.

- Added shared receive queue support.

- Added TotalView debugger support

* Optimization of SMP startup information exchange for USE_MPD_RING to
  enhance performance for SLURM. Thanks to Don and team members from Bull
  and folks from LLNL for their feedback and comments.

* Added uDAPL build script functionality to set DAPL_DEFAULT_PROVIDER
  explicitly with default suggestions.
  Thanks to Harvey Richardson from Sun for suggesting this feature.


MVAPICH2-0.9.3 (05/20/06)

* Following features are added in this new MVAPICH2-0.9.3 release:

- Multi-threading support

- Integrated with MPICH2 1.0.3 stack

- Advanced AVL tree-based Resource-aware registration cache

- Tuning and Optimization of various collective algorithms

- Processor affinity for intra-node shared memory communication

- Auto-detection of InfiniBand adapters for Gen2


MVAPICH2-0.9.2 (01/15/06)

* Following features are added in this new MVAPICH2-0.9.2 release:

- InfiniBand support for OpenIB/Gen2

- High-performance and optimized support for many MPI-2
  functionalities (one-sided, collectives, datatype)

- Support for other MPI-2 functionalities (as provided by MPICH2 1.0.2p1)

- High-performance and optimized support for all MPI-1 functionalities


MVAPICH2-0.9.0 (11/01/05)

* Following features are added in this new MVAPICH2-0.9.0 release:

- Optimized two-sided operations with RDMA support

- Efficient memory registration/de-registration schemes for RDMA operations

- Optimized intra-node shared memory support (bus-based and NUMA)

- Shared library support

- ROMIO support

- Support for multiple compilers (gcc, icc, and pgi)



MVAPICH2-0.6.5 (07/02/05)

* Following features are added in this new MVAPICH2-0.6.5 release:

- uDAPL support (tested for InfiniBand, Myrinet, and Ammasso GigE)


MVAPICH2-0.6.0 (11/04/04)

* Following features are added in this new MVAPICH2-0.6.0 release:

- MPI-2 functionalities (one-sided, collectives, datatype)

- All MPI-1 functionalities

- Optimized one-sided operations (Get, Put, and Accumulate)

- Support for active and passive synchronization

- Optimized two-sided operations

- Scalable job start-up

- Optimized and tuned for the above platforms and different
  network interfaces (PCI-X and PCI-Express)

- Memory efficient scaling modes for medium and large clusters