Deep Learning Server: Performance Test of a LINKZOL GPU Compute Server with 8 GTX 1080 Ti Cards
The LINKZOL brand (联众集群公司) works closely with NVIDIA, the global leader in visual computing, and has joined the NVIDIA Partner Network (NPN). Drawing on nearly ten years of experience with universities and research institutes, it has developed a deep product line, successively releasing dedicated GPU compute servers and workstations for deep learning such as the LZ-743GR-2G/Q, LZ-748GT, and LZ-428GR-8G. For example, a 2U chassis can hold 6 NVIDIA Tesla K80/P40/P100, TITAN X, or GTX 1080 Ti cards, and a 4U chassis can hold 4 or 8 NVIDIA Tesla K80/P40/P100, TITAN X (Pascal), or GTX 1080 Ti compute cards. Built on NVIDIA's CUDA ecosystem and GPU-accelerated libraries such as cuDNN, these systems deliver "CPU + GPU" co-processing, allocate compute resources sensibly, and fully unlock compute capability, providing efficient, reliable, and stable performance for deep learning, artificial intelligence, and other workloads across industries. They also ship with the GNU C/C++/Fortran compilers, the MKL library, and the OpenMPI and MPICH message-passing environments, along with deep learning frameworks such as Caffe, TensorFlow, Theano, BIDMach, and Torch; by building Caffe's Python and MATLAB interfaces, DNN training and testing can be carried out and visualized through a browser/server (B/S) architecture.
Test platform:
LINKZOL 8-GPU deep learning server, model LZ-428GR-8G
Operating system: Ubuntu 16.04 LTS
Compilers: GNU compilers (C/C++/Fortran); Intel compilers (C/C++/Fortran) with MKL, MPI, etc.
Parallel environment: OpenMP configured. GPU development environment: latest CUDA driver, compiler, debugger, SDK, and sample files.
GPU libraries: cuDNN acceleration, CUDA FFT, CUDA BLAS, etc. Deep learning frameworks: Caffe, Torch, Theano, BIDMach, and TensorFlow pre-installed.
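Before running the benchmark it is worth confirming that the driver, the CUDA toolkit, and cuDNN are all visible and that all 8 cards are detected. The small program below is a minimal sketch of such a check (it is not part of the original test); it assumes the cuDNN headers and library are installed in their default locations and can be built with something like nvcc check_env.cu -lcudnn -o check_env.

// check_env.cu -- illustrative environment check (a sketch, not the vendor's code)
#include <cstdio>
#include <cuda_runtime.h>
#include <cudnn.h>

int main() {
    int driver = 0, runtime = 0, count = 0;
    cudaDriverGetVersion(&driver);     // CUDA driver API version
    cudaRuntimeGetVersion(&runtime);   // CUDA runtime (toolkit) version
    cudaGetDeviceCount(&count);        // number of visible GPUs (expect 8 here)
    printf("CUDA driver %d, runtime %d, %d GPU(s) visible\n", driver, runtime, count);
    printf("cuDNN version %zu\n", cudnnGetVersion());   // version of the linked cuDNN

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU[%d]: %s, compute capability %d.%d, %.0f MiB\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024.0 * 1024.0));
    }
    return 0;
}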
Test configuration:
2 × 10-core Intel Xeon E5-2630 v4 (2.2 GHz, 8.0 GT/s), 64 GB (4 × 16 GB) DDR4-2133 memory, one 512 GB enterprise SSD, one 2 TB enterprise hard drive, and 8 × GTX 1080 Ti (3584 CUDA cores and 11 GB GDDR5X memory each).
Note: installation and setup steps are not described here; the test machine is shown in Figure 2.
(Figure 2)
// The following command shows that the system has 8 GPU cards
lzhpc@ubuntu:~$ nvidia-smi
Mon Apr 24 22:21:32 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13 Driver Version: 378.13 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Graphics Device Off | 0000:04:00.0 Off | N/A |
| 23% 34C P0 59W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Graphics Device Off | 0000:05:00.0 Off | N/A |
| 23% 35C P0 60W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Graphics Device Off | 0000:08:00.0 Off | N/A |
| 23% 33C P0 60W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Graphics Device Off | 0000:09:00.0 Off | N/A |
| 23% 31C P0 60W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Graphics Device Off | 0000:84:00.0 Off | N/A |
| 23% 33C P0 59W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Graphics Device Off | 0000:85:00.0 Off | N/A |
| 23% 36C P0 59W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Graphics Device Off | 0000:88:00.0 Off | N/A |
| 23% 31C P0 59W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Graphics Device Off | 0000:89:00.0 Off | N/A |
| 23% 37C P0 61W / 250W | 0MiB / 11172MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
Test method:
A matrix-multiplication test case is run on the 8 GPU cards to compare compute times under the same workload.
Test command: ./matrixMul gpu_num loop_num
Here, gpu_num specifies how many GPU cards take part in the computation and loop_num sets the amount of work. Since the machine has at most 8 GPU cards, this test uses gpu_num = 8 with loop_num = 8. A rough sketch of how such a multi-GPU timing harness might be structured is given below, followed by the actual test run.
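The matrixMul program used here appears to be a multi-GPU variant of the CUDA matrix-multiplication sample; its source is not reproduced in this article. The sketch below is only an illustration, under assumptions, of how such a harness could be structured: one host thread per GPU, each repeating a 3200 x 3200 single-precision matrix multiplication and timing it with CUDA events. It uses cuBLAS rather than a hand-written kernel, and the mapping from loop_num to the number of multiplications (loop_num * 25 per GPU here) is an arbitrary stand-in, since the real program's mapping is not documented. It could be built with something like nvcc -O2 -std=c++11 -Xcompiler -pthread multi_gpu_sgemm.cu -lcublas -o multi_gpu_sgemm.

// multi_gpu_sgemm.cu -- illustrative multi-GPU GEMM timing harness (a sketch, not the original matrixMul)
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <thread>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static void run_on_gpu(int dev, int reps, int n) {
    cudaSetDevice(dev);                        // bind this host thread to one GPU
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemset(dA, 0, bytes);                  // matrix contents do not affect timing
    cudaMemset(dB, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const float alpha = 1.0f, beta = 0.0f;
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) {
        // C = A * B for n x n single-precision matrices
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU[%d]: %d multiplications of (%d,%d)*(%d,%d) in %.2f msec\n",
           dev, reps, n, n, n, n, ms);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

int main(int argc, char **argv) {
    int gpu_num  = (argc > 1) ? atoi(argv[1]) : 1;   // how many GPUs to use
    int loop_num = (argc > 2) ? atoi(argv[2]) : 1;   // scales the workload
    int reps = loop_num * 25;                        // assumed mapping, see note above
    const int n = 3200;                              // matrix dimension as in the log

    std::vector<std::thread> workers;
    for (int dev = 0; dev < gpu_num; ++dev)          // one host thread per GPU
        workers.emplace_back(run_on_gpu, dev, reps, n);
    for (auto &t : workers) t.join();                // wait for all GPUs to finish
    return 0;
}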
// Test using all 8 GPU cards; wall-clock time 0m20.487s
lzhpc@ubuntu:~$ time ./matrixMul 8 8
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[0]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[2]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[7]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[4]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[5]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[6]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[3]:"Graphics Device" with compute capability 6.1
MatrixA(3200,3200)*MatrixB(3200,3200) Using GPU[1]:"Graphics Device" with compute capability 6.1
Computing CUDA Kernel...
Computing CUDA Kernel...
Computing CUDA Kernel...
Computing CUDA Kernel...
Computing CUDA Kernel...
Computing CUDA Kernel...
Computing CUDA Kernel...
Computing CUDA Kernel...
Time= 7243.66 msec for one loop (two hundreds of matrinx*matrix)
Time= 7246.23 msec for one loop (two hundreds of matrinx*matrix)
Time= 7255.54 msec for one loop (two hundreds of matrinx*matrix)
Time= 7293.25 msec for one loop (two hundreds of matrinx*matrix)
Time= 7298.95 msec for one loop (two hundreds of matrinx*matrix)
Time= 7312.67 msec for one loop (two hundreds of matrinx*matrix)
Time= 7314.78 msec for one loop (two hundreds of matrinx*matrix)
Time= 7318.51 msec for one loop (two hundreds of matrinx*matrix)
main time:17897.691 seconds
real 0m20.487s
user 1m4.904s
sys 0m16.712s
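A few observations on these numbers, assuming each reported loop really consists of 200 multiplications of 3200 x 3200 single-precision matrices (about 2 * 3200^3 ≈ 6.55e10 floating-point operations each): a per-GPU loop time of roughly 7.25 s corresponds to about 200 * 6.55e10 / 7.25 ≈ 1.8 TFLOPS per card, and because the eight cards run their loops concurrently, the wall-clock time stays near 20 s instead of growing eightfold. The "main time:17897.691 seconds" line is also almost certainly printed in milliseconds (about 17.9 s), which is consistent with the measured real time of 0m20.487s.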
For more information about these products, visit the official LINKZOL website or call 400-630-7530.