Release Notes for QUDA v0.4.0 XX February 2011
-----------------------------
Overview:
QUDA is a library for performing calculations in lattice QCD on
graphics processing units (GPUs) using NVIDIA's "C for CUDA" API.
This release includes optimized kernels for applying several different
Dirac operators (Wilson, clover-improved Wilson, twisted mass,
improved staggered, and domain wall), kernels for performing various
BLAS-like operations, and full inverters built on these kernels.
Mixed-precision implementations of both CG and BiCGstab are provided,
with support for double, single, and half (16-bit fixed-point)
precision. The staggered implementation additionally includes support
for asqtad link fattening, force terms for the asqtad fermion action
and one-loop improved Symanzik gauge action, and a multi-shift CG
solver. Initial support for multiple GPUs is enabled in this release;
however, this is currently limited to Wilson, clover, and twisted mass
Dirac operators.
Software Compatibility:
The library has been tested under Linux (CentOS 5.5 and Ubuntu 10.04)
using release 3.2 of the CUDA toolkit. CUDA 2.x and earlier are no
longer supported. The library also seems to work under Mac OS X
10.6.5 ("Snow Leopard") on recent 64-bit Intel-based Macs.
See also "Known Issues" below.
Hardware Compatibility:
For a list of supported devices, see
http://www.nvidia.com/object/cuda_learn_products.html
Before building the library, you should determine the "compute
capability" of your card, either from NVIDIA's documentation or by
running the deviceQuery example in the CUDA SDK, and set GPU_ARCH in
make.inc appropriately. Setting GPU_ARCH to 'sm_13' or 'sm_20' will
enable double precision support.
Installation:
In the source directory, copy 'make.inc.example' to 'make.inc', and
edit the first few lines to specify the CUDA install path, the
platform (x86 or x86_64), and the GPU architecture (see "Hardware
Compatibility" above). Then type 'make' to build the library.
Alternatively, you can generate a 'make.inc' using the Autoconf
./configure script; see ./configure --help for the available options.
The ./configure script lets you set most user-configurable features
(hardware capability, multi-GPU build, QMP and MPI locations) with the
usual --enable-<feature> and --with-<package> options.
Once ./configure has created your make.inc, run 'make' to build the
library.
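For example, a multi-GPU build might be configured with a command
along the lines of the one below; the exact option names here are a
sketch and should be checked against ./configure --help.

  ./configure --enable-multi-gpu --with-qmp=/path/to/qmp --with-mpi=/path/to/mpi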
As an optional step, 'make tune' will invoke tests/blas_test to
perform autotuning of the various BLAS-like functions needed by the
inverters. This involves testing many combinations of launch
parameters (corresponding to different numbers of CUDA threads per
block and blocks per grid for each kernel) and writing the optimal
values to lib/blas_param.h. The new values will take effect the next
time the library is built. Ideally, the autotuning should be
performed on the machine where the library is to be used, since the
optimal parameters will depend on the CUDA device and host hardware.
They will also depend slightly on the lattice volume; if desired, the
volume used in the autotuning can be changed by editing
tests/blas_test.cu.
In summary, for an optimized install, run
make && make tune && make
(after optionally editing blas_test.cu). By default, the autotuning
is performed using CUDA device 0. To select a different device
number, set DEVICE in make.inc appropriately.
Building the multi-GPU library requires enabling the appropriate
flags in the make.inc file. Specifically, the BUILD_MULTI_GPU and
BUILD_QMP flags must be set to 'yes'.
Additionally, optimum performance will generally be obtained if the
OVERLAP_COMMS option is enabled to overlap kernel computation and
inter-GPU communication. Multiple GPUs are controlled using QMP, and
so QMP and MPI must be installed on the host systems, and the QMP_HOME
and MPI_HOME paths must be set.
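In make.inc, this corresponds to settings of the following form (the
paths are placeholders):

  BUILD_MULTI_GPU = yes        # enable the multi-GPU build
  BUILD_QMP = yes              # use QMP for inter-GPU communication
  OVERLAP_COMMS = yes          # overlap computation and communication
  QMP_HOME = /path/to/qmp      # QMP installation path
  MPI_HOME = /path/to/mpi      # MPI installation path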
Using the Library:
Include the header file include/quda.h in your application, link
against lib/libquda.a, and study tests/wilson_invert_test.c (or the
corresponding tests for staggered, domain wall, and twisted mass) for
an example of the interface. The various inverter options are
enumerated in include/enum_quda.h.
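The fragment below sketches the general shape of this interface as
used by the test programs. It is illustrative only: most required
fields of QudaGaugeParam and QudaInvertParam are omitted, and the
lattice dimensions and solver settings are assumed values, not
recommendations.

  #include <quda.h>

  /* Illustrative sketch only; see tests/wilson_invert_test.c for the
     full set of parameters that must be initialized. */
  void solve_example(void *gauge[4], void *spinor_out, void *spinor_in)
  {
    initQuda(0);                         /* initialize QUDA on CUDA device 0 */

    QudaGaugeParam gauge_param = newQudaGaugeParam();
    gauge_param.X[0] = 24;               /* local lattice dimensions (assumed) */
    gauge_param.X[1] = 24;
    gauge_param.X[2] = 24;
    gauge_param.X[3] = 48;
    gauge_param.cpu_prec = QUDA_DOUBLE_PRECISION;
    gauge_param.cuda_prec = QUDA_SINGLE_PRECISION;
    /* ... remaining gauge-field parameters ... */
    loadGaugeQuda((void *)gauge, &gauge_param);

    QudaInvertParam inv_param = newQudaInvertParam();
    inv_param.dslash_type = QUDA_WILSON_DSLASH;
    inv_param.inv_type = QUDA_BICGSTAB_INVERTER;  /* options enumerated in enum_quda.h */
    inv_param.tol = 1e-8;
    /* ... remaining solver parameters ... */
    invertQuda(spinor_out, spinor_in, &inv_param);

    endQuda();                           /* release device resources */
  }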
Known Issues:
* When the library is compiled with version 3.0 of the CUDA toolkit
and run on Fermi (GPU architecture sm_20), the RECONSTRUCT_8 option
(which enables reconstructing gauge matrices from 8 real numbers)
gives wrong results in double precision. This appears to be a bug
in CUDA 3.0, since CUDA 3.1 has no such issue. Note that this
problem isn't likely to matter in practice, since RECONSTRUCT_8
generally performs worse in double precision than RECONSTRUCT_12.
* Compiling in emulation mode with CUDA 3.0 does not work when
the GPU architecture is set to sm_10, sm_11, or sm_12. One should
compile for sm_13 instead. Note that NVIDIA has eliminated
emulation mode completely from CUDA 3.1 and later.
* For compatibility with CUDA, on 32-bit platforms the library is
compiled with the GCC option -malign-double. This differs from the
GCC default and may affect the alignment of various structures,
notably those of type QudaGaugeParam and QudaInvertParam, defined in
quda.h. Therefore, any code to be linked against QUDA should also
be compiled with this option (see the example compile line after
this list).
* At present, combining RECONSTRUCT_NO with half precision will give
wrong results unless all elements of the gauge field matrices have
magnitude bounded by 1. In particular, half precision should
not be used with asqtad or HISQ fermions, since the "fat links" do
not satisfy this condition in general.
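For instance, on a 32-bit platform an application might be compiled
and linked against QUDA roughly as follows; the paths and library
list are placeholders and will depend on the local installation.

  gcc -malign-double -I/path/to/quda/include -c my_app.c
  gcc -malign-double my_app.o -L/path/to/quda/lib -lquda \
      -L/path/to/cuda/lib -lcudart -lstdc++ -lm -o my_app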
Getting Help:
Please visit http://lattice.bu.edu/quda for contact information.
Bug reports are especially welcome.
Acknowledging QUDA:
If you find this code useful in your work, please cite:
M. A. Clark, R. Babich, K. Barros, R. Brower, and C. Rebbi, "Solving
Lattice QCD systems of equations using mixed precision solvers on GPUs,"
Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]].
If you use the multi-GPU version, please also cite:
R. Babich, M. A. Clark and B. Joo, "Parallelizing the QUDA Library for
Multi-GPU Calculations in Lattice Quantum Chromodynamics,"
International Conference for High Performance Computing, Networking,
Storage and Analysis (SC), 2010 [arXiv:1011.0024 [hep-lat]].
Authors:
Ronald Babich (Boston University)
Kipton Barros (Los Alamos National Laboratory)
Richard Brower (Boston University)
Michael Clark (Harvard University)
Joel Giedt (Rensselaer Polytechnic Institute)
Steven Gottlieb (Indiana University)
Balint Joo (Jefferson Laboratory)
Claudio Rebbi (Boston University)
Guochun Shi (NCSA)
Alexei Strelchenko (Cyprus Institute)
Portions of this software were developed at the Innovative Systems Lab,
National Center for Supercomputing Applications
http://www.ncsa.uiuc.edu/AboutUs/Directorates/ISL.html