C++/C++ CLR/C#/Java/Python/IronPython/JavaScript/Lua/Ruby/Squirrel性能测试
2009 10 23 10:08 PM 5839次查看
这是2007年英特尔多核平台编程优化大赛题,该文的作者拿到了最佳优化奖,但此处的代码并非最优化的,只是改进了乘方、自己实现随机数而已。(最优版本可参见蒋黎的代码和侯思松的代码,感觉很变态…)
其中,C++和C++ CLR的代码相同,只是采用的编译指令不同而已。
此外,C#代码从二维数组改为一维数组,我稍微测试了一下,C#二维数组和嵌套数组的速度确实很慢。原作者的代码似乎还把下标弄错了,所以也改了下。
而Python、Cython、JavaScript和Lua的代码是我自己照英特尔的原程序改编的,没有对随机数进行优化,附在文末。
Squirrel的代码由dwing提供,我只是将table改成了数组,RAND_MAX + 1.0改成了32768.0的常数,时间快了约5秒左右。
测试平台:
CPU:Intel Core2 Duo T9400 @ 2.53GHz
内存:3 GB
操作系统:Windows XP Pro SP2英文版
C++编译器:gcc version 4.4.1 (TDM-2 mingw32)(以下简称G++)、Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86(以下简称VC++ 2008)、VC++ 2010 Beta2
C++ CLR编译器:VC++ 2008、VC++ 2010 Beta2
C#编译器:Microsoft (R) Visual C# 2008 Compiler version 3.5.21022.8、Microsoft (R) Visual C# 2010 Compiler version 4.0.21006.1
Java:Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Python:Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32、Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32、Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32
Cython:0.11.3
Psyco:1.6
IronPython:IronPython 2.6 (2.6.10920.0) on .NET 2.0.50727.1433
JavaScript:ChromePlus 1.2.6.0 (Chromium 4.0.222.3 (Official Build 28644),V8 1.3.15)
Lua:5.1.4
LuaJIT:1.1.5、2.0.0-beta1
Ruby:ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Squirrel:Squirrel 3.0 alpha 2 Copyright (C) 2003-2009 Alberto Demichelis (32 bits)
顺带一提,G++是采用TDM版(目前尚无GCC 4.4.1和4.4.2的Windows版),VC++ 2008编译器是Microsoft Visual C++ Toolkit 2008里的,C#编译器是Microsoft .NET Framework 3.5和4.0 Beta2里的,Cython和LuaJIT用的C编译器是GCC 4.4.1,LuaJIT 2.0.0还测试了VC++ 2003编译的版本,Ruby没有采用更慢的稳定版。
测试时关掉了大部分占用资源的程序,每种搭配测试5次,取时间最少的一次作为最终结果。
从控制台运行时似乎会影响速度(此时是C++ CLR以超过第2名3%的优势胜出),所以对于exe文件,我改成了双击运行和控制台各测5次,取最小值。
测试结果:
语言/编译器 | 编译参数 | 运行时间(秒) | 相对速度 |
C++ (G++ 4.4.1) | -O3 -s | 1.765 | 85.84% |
C++ (G++ 4.4.1) | -O3 -s -mtune=native -march=native | 1.859 | 81.50% |
C++ (G++ 4.4.1) | -O3 -s -mfpmath=sse -mtune=native -march=native | 1.578 | 96.01% |
C++ (G++ 4.4.1) | -O3 -s -mfpmath=sse -mtune=native -march=native -ffast-math | 1.546 | 97.99% |
C++ (VC++ 2008) | /O2 /MD | 1.625 | 93.23% |
C++ (VC++ 2008) | /O2 /MD /arch:SSE2 | 1.593 | 95.10% |
C++ (VC++ 2008) | /O2 /MD /arch:SSE2 /fp:fast | 1.562 | 96.99% |
C++ (VC++ 2010 Beta2) | /O2 /MD | 1.64 | 92.38% |
C++ (VC++ 2010 Beta2) | /O2 /MD /arch:SSE2 | 1.562 | 96.99% |
C++ (VC++ 2010 Beta2) | /O2 /MD /arch:SSE2 /fp:fast | 1.531 | 98.95% |
C++ CLR (VC++ 2008) | /O2 /MD /clr | 1.578 | 96.01% |
C++ CLR (VC++ 2010 Beta2) | /O2 /MD /clr | 1.515 | 100.00% |
C# 3.5 | 无参数 | 3.281 | 46.17% |
C# 3.5 | /o | 3.219 | 47.06% |
C# 4.0 Beta2 | 无参数 | 2.203 | 68.77% |
C# 4.0 Beta2 | /o | 2.187 | 69.27% |
Java 1.6 | 无参数 | 2.093 | 72.38% |
Python 2.5.4 | 无参数 | 117.163 | 1.29% |
Python 2.6.4 | 无参数 | 91.137 | 1.66% |
Python 3.1.1 | 无参数 | 109.531 | 1.38% |
Cython 0.11.3 | -O3 | 11.795 | 12.84% |
Psyco 1.6 (bind) | 无参数 | 26.206 | 5.78% |
Psyco 1.6 (full) | 无参数 | 25.650 | 5.91% |
IronPython 2.6 | 无参数 | 59.630 | 2.54% |
JavaScript (V8 1.3.15) | 无参数 | 16.653 | 9.10% |
Lua 5.1.4 | 无参数 | 48.969 | 3.09% |
Lua 5.1.4 (Luac编译) | -s | 49.031 | 3.09% |
LuaJIT 1.1.5 | 无参数 | 24.141 | 6.28% |
LuaJIT 1.1.5 | 运行:-O2 | 9.641 | 15.71% |
LuaJIT 1.1.5 (Luac编译) | 编译:-s | 24.093 | 6.29% |
LuaJIT 1.1.5 (Luac编译) | 编译:-s 运行:-O2 | 9.641 | 15.71% |
LuaJIT 2.0.0-beta1 (GCC 4.4.1编译) | 无参数 | 1.546 | 97.99% |
LuaJIT 2.0.0-beta1 (GCC 4.4.1编译) | 运行:-O2 | 4.578 | 33.09% |
LuaJIT 2.0.0-beta1 (GCC 4.4.1编译) | 无参数 | 1.625 | 93.23% |
LuaJIT 2.0.0-beta1 (VC++ 2003编译) | 运行:-O2 | 4.578 | 33.09% |
Ruby 1.9.1 | 无参数 | 327.797 | 0.46% |
Squirrel 3.0 alpha 2 | 无参数 | 73.873 | 2.05% |
Squirrel 3.0 alpha 2 (编译成字节码) | 无参数 | 77.828 | 1.95% |
G++在最佳的参数下和VC++基本持平,不过单一的-O3参数并没有什么优势,要组合出那么多种参数也蛮难为人的,而且-ffast-math也是不符合IEEE规范的浮点实现。顺带一提,用该参数编译未优化版本的结果是1.578秒,和优化后几乎没有区别。
VC++ 2008各种参数的性能都差不多,所以不用过多考虑搭配。
VC++ 2010 Beta2略微超过了VC++ 2008和G++,不过需要附带一个新的动态链接库(msvcr100.dll,749 KB)。
让我无语的是C# 3.5居然远远落后于Java(慢了约53.8%),而且是在Windows操作系统上,也未测试可能更快的Java 7及IBM、DEA的JVM和GCJ。C# 4.0的性能提高了不少,只略低于Java 6了。(顺带一提,C# 2.0和3.5成绩几乎相同,看来微软在4.0上花了大气力来提升性能。)此外,C#生成的代码只有5KB,且无manifest文件。(但别忘了运行库的体积。)
而Python则仅快于Ruby,比静态编译的实现慢了约2个数量级,这也在我意料之中。而在性能方面,2.6.4 > 3.1.1 > 2.5.4,而3.x的表现还不错。顺带一提动态语言的多维数组都比较慢,所以改成了一维。
Cython目前仍无针对数组的优化(详见《CEP 517: Cython array type》),而本程序大量涉及对数组的操作,所以性能提升相当有限。此外,将列表设为了1维,速度比2维快了约50%。同时,由于Cython不能对乘方进行优化,因此我改成了乘法。
Python+Psyco的组合还算不错,只增加了2行,性能就提升了约5倍,凌驾于其他动态语言之上。同上,我将乘方改成了乘法,开方则调用系统函数。
IronPython的性能让我很惊讶,因为Jython的性能据说很糟糕,没想到在.net平台上会发挥得这么好。
JavaScript只测试了性能最好的V8引擎。由于输出是调用DOM API,较为影响速度,所以我直接删除了。优化同上,数组采用1维。
在接受dwing的建议,将全局变量和系统函数声明为local后,Lua的速度提升了1倍,表现很不错。不过Luac编译似乎没有帮助,以后不去用了。此外,同样也将table优化成了1维数组。
LuaJIT 1.1.5表现非常不错,改成local声明后,速度提升了3倍,在-O2开启时居然比Cython更快。速度大约是C++的1/6,Java的1/5,相差已经在一个数量级之内了。
而最近刚出的LuaJIT 2.0.0-beta1更是达到了C++的水平,这对动态语言来说简直是不可思议的。不过大部分测试是稳定在1.625秒,只有一次达到1.546。而VC++ 2003编译出来的是在1.625~1.640之间波动,但体积略小。至于兼容性,采用-O0~-O2来运行都会有大幅的性能下降(非常诡异),而且现在不支持luac编译的代码。此外,文档里也提到了,LuaJIT 2在数值计算方面接近于静态编译语言,而其他方面(例如字符串)仍略逊于后者,整体性能和JVM 1.6 Client持平。
Ruby是很久没碰了,于是测试了半天,选了种最快的实现,但仍比Python 2.6.4慢了59%,速度仅有C++的1/200。代码里我用while取代了for,因为后者(无论0...n还是0.upto(n))比前者慢了一半。(Python里正好相反,于是人们会倾向于用最优雅的方式得到最好的性能。)此外数组采用了1维,乘方改成了乘法,开方用系统函数。
Squirrel表现平平,和Lua相比完全出于劣势,而且我也不喜欢它的语法。顺带一提,第一次运行时为73秒,之后就一直是78秒了,编译成字节码也只能到77秒,这也是测出非编译版本更快的原因。
Python和IronPython:
from random import random
from time import clock
NPARTS = 1000
NPARTS2 = NPARTS * 2
NITER = 201
DIMS = 3
pot = .0
SIZE = NPARTS * DIMS
r = [.0] * SIZE
def computePot():
global NPARTS, DIMS, r, pot
for i in xrange(NPARTS):
for j in xrange(i - 1):
distx = r[j] - r[i]
disty = r[NPARTS + j] - r[NPARTS + i]
distz = r[NPARTS2 + j] - r[NPARTS2 + i]
dist = (distx * distx + disty * disty + distz * distz) ** .5
pot += 1.0 / dist
def initPositions():
global NPARTS, DIMS, r
for i in xrange(SIZE):
r[i] = 0.5 + random()
def updatePositions():
global NPARTS, DIMS, r
for i in xrange(SIZE):
r[i] -= 0.5 + random()
def main():
global pot, NITER
initPositions()
updatePositions()
start = clock()
for i in xrange(NITER):
pot = .0
computePot()
if not i % 10:
print "%5d: Potential: %10.7f" % (i, pot)
updatePositions()
stop = clock()
print "Seconds = %10.9f" % (stop - start)
if __name__ == '__main__':
main()
Python 3000:将Python的代码中,print改成函数调用形式(加括号),xrange改成range即可。Cython:
from random import random
from time import clock
cdef int NPARTS = 1000
cdef int NPARTS2 = NPARTS * 2
cdef int NITER = 201
cdef int DIMS = 3
cdef int SIZE = NPARTS * DIMS
cdef double pot = .0
cdef list r = [.0] * SIZE
cpdef computePot():
global NPARTS, DIMS, r, pot
cdef int i, j
cdef double distx, disty, distz, dist
for i in xrange(NPARTS):
for j in xrange(i - 1):
distx = r[j] - r[i]
disty = r[NPARTS + j] - r[NPARTS + i]
distz = r[NPARTS2 + j] - r[NPARTS2 + i]
dist = (distx * distx + disty * disty + distz * distz) ** .5
pot += 1.0 / dist
cpdef initPositions():
global NPARTS, DIMS, r
cdef int i, j
for i in xrange(SIZE):
r[i] = 0.5 + random()
cpdef updatePositions():
global NPARTS, DIMS, r
cdef int i, j
for i in xrange(SIZE):
r[i] -= 0.5 + random()
def main():
global pot, NITER
cdef int i
cdef double start, stop
initPositions()
updatePositions()
start = clock()
for i in xrange(NITER):
pot = .0
computePot()
if not i % 10:
print "%5d: Potential: %10.7f" % (i, pot)
updatePositions()
stop = clock()
print "Seconds = %10.9f" % (stop - start)
Psyco(使用bind()):from random import random
from time import clock
from math import sqrt
from psyco import bind
NPARTS = 1000
NITER = 201
DIMS = 3
pot = .0
r = [[.0] * NPARTS] * DIMS
def computePot():
global NPARTS, DIMS, r, pot
for i in xrange(NPARTS):
for j in xrange(i - 1):
distx = (r[0][j] - r[0][i])
distx *= distx
disty = (r[1][j] - r[1][i])
disty *= disty
distz = (r[2][j] - r[2][i])
distz *= distz
dist = sqrt(distx + disty + distz)
pot += 1.0 / dist
def initPositions():
global NPARTS, DIMS, r
for i in xrange(DIMS):
for j in xrange(NPARTS):
r[i][j] = 0.5 + random()
def updatePositions():
global NPARTS, DIMS, r
for i in xrange(DIMS):
for j in xrange(NPARTS):
r[i][j] -= 0.5 + random()
def main():
global pot, NITER
initPositions()
updatePositions()
start = clock()
for i in xrange(NITER):
pot = .0
computePot()
if not i % 10:
print "%5d: Potential: %10.7f" % (i, pot)
updatePositions()
stop = clock()
print "Seconds = %10.9f" % (stop - start)
bind(computePot)
bind(initPositions)
bind(updatePositions)
bind(main)
if __name__ == '__main__':
main()
Psyco(使用full()):from random import random
from time import clock
from math import sqrt
from psyco import full
full()
NPARTS = 1000
NITER = 201
DIMS = 3
pot = .0
r = [[.0] * NPARTS] * DIMS
def computePot():
global NPARTS, DIMS, r, pot
for i in xrange(NPARTS):
for j in xrange(i - 1):
distx = (r[0][j] - r[0][i])
distx *= distx
disty = (r[1][j] - r[1][i])
disty *= disty
distz = (r[2][j] - r[2][i])
distz *= distz
dist = sqrt(distx + disty + distz)
pot += 1.0 / dist
def initPositions():
global NPARTS, DIMS, r
for i in xrange(DIMS):
for j in xrange(NPARTS):
r[i][j] = 0.5 + random()
def updatePositions():
global NPARTS, DIMS, r
for i in xrange(DIMS):
for j in xrange(NPARTS):
r[i][j] -= 0.5 + random()
def main():
global pot, NITER
initPositions()
updatePositions()
start = clock()
for i in xrange(NITER):
pot = .0
computePot()
if not i % 10:
print "%5d: Potential: %10.7f" % (i, pot)
updatePositions()
stop = clock()
print "Seconds = %10.9f" % (stop - start)
if __name__ == '__main__':
main()
我顺便还写了个numpy的实现:from numpy import add, square
from numpy.random import rand
from time import clock
NPARTS = 1000
NITER = 201
DIMS = 3
r = rand(NPARTS, DIMS)
pot = .0
def computePot():
global NPARTS, DIMS, r, pot
for i in xrange(NPARTS):
for j in xrange(i - 1):
pot += 1.0 / (add.reduce(square(r[j] - r[i])) ** .5)
def initPositions():
global r
r += 0.5
def updatePositions():
c = clock()
print c
global DIMS, NPARTS, r
r -= (rand(NPARTS, DIMS) + .5)
print clock() -c
def main():
global pot, NITER
initPositions()
updatePositions()
start = clock()
for i in xrange(NITER):
pot = .0
computePot()
if not i % 10:
print "%5d: Potential: %10.7f" % (i, pot)
updatePositions()
stop = clock()
print "Seconds = %10.9f" % (stop - start)
if __name__ == '__main__':
main()
然而让我无语的是,numpy版比纯Python版还慢1个数量级。稍微测试了一下,发现了原因:numpy的函数调用开销很大,因此不适合对小数组进行操作。
简单来说,计算二维数组里2列间差的平方和时,numpy调用了3次函数,而纯Python则完全是内置操作符,所以调用开销较小。而这个二维数组每列只有3个数,所以即便numpy的函数计算很快,但总体来说仍慢于纯Python版。
不过对付大数组则是numpy占据优势了:在数组长度为3时,numpy比Python慢10倍;长度为33时,2者持平;长度为330时,numpy快8倍;长度为3300时,numpy快30倍。(纯Python版的时间复杂度基本可视为O(N)。)
JavaScript:
var NPARTS = 1000;
var NPARTS2 = 2000;
var NITER = 201;
var DIMS = 3;
var SIZE = DIMS * NPARTS;
var r = new Array(SIZE);
var pot;
function initPositions() {
for(var i = 0; i < SIZE; i++)
r[i] = 0.5 + Math.random();
}
function updatePositions() {
for (var i = 0; i < SIZE; i++)
r[i] -= 0.5 + Math.random();
}
function computePot() {
for (var i = 0; i < NPARTS; i++ ) {
for (var j = 0, k = i - 1; j < k; j++ ) {
var distx = r[j] - r[i];
var disty = r[NPARTS + j] - r[NPARTS + i];
var distz = r[NPARTS2 + j] - r[NPARTS2 + i];
var dist = Math.sqrt(distx * distx + disty * disty + distz * distz);
pot += 1.0 / dist;
}
}
}
initPositions();
updatePositions();
var start = new Date();
for (var i = 0; i < NITER; i++) {
pot = 0.0;
computePot();
updatePositions();
}
var stop = new Date();
alert(stop - start);
Lua:local NPARTS = 1000
local NITER = 201
local DIMS = 3
local NPARTS2 = 2000
local SIZE = DIMS * NPARTS
local pot = .0
local r = {}
local sqrt = math.sqrt
local random = math.random
function computePot()
for i = 1, NPARTS do
for j = 1, i - 1 do
local distx = r[j] - r[i]
local disty = r[NPARTS + j] - r[NPARTS + i]
local distz = r[NPARTS2 + j] - r[NPARTS2 + i]
pot = pot + 1.0 / sqrt(distx * distx + disty * disty + distz * distz)
end
end
end
function initPositions()
for i = 1, SIZE do
r[i] = 0.5 + random()
end
end
function updatePositions()
for i = 1, SIZE do
r[i] = r[i] - (0.5 + random())
end
end
initPositions()
updatePositions()
local start = os.clock()
for i = 1, NITER do
pot = .0
computePot()
if i % 10 == 0 then
print(string.format("%5d: Potential: %10.7f", i, pot))
end
updatePositions()
end
local stop = os.clock()
print(string.format("Seconds = %10.9f", stop - start))
Ruby:$NPARTS = 1000
$NPARTS2 = 2000
$NITER = 201
$DIMS = 3
$SIZE = $DIMS * $NPARTS
$pot = 0.0
$r = [0.0] * ($NPARTS * $DIMS)
def computePot()
i = 0
while i < $NPARTS
j = 0
max = i - 1
while j < max
distx = $r[j] - $r[i]
disty = $r[$NPARTS + j] - $r[$NPARTS + i]
distz = $r[$NPARTS2 + j] - $r[$NPARTS2 + i]
$pot = $pot + 1.0 / Math.sqrt(distx * distx + disty * disty + distz * distz)
j += 1
end
i += 1
end
end
def initPositions()
i = 0
while i < $SIZE
$r[i] = 0.5 + rand()
i += 1
end
end
def updatePositions()
i = 0
while i < $SIZE
$r[i] -= 0.5 + rand()
i += 1
end
end
initPositions()
updatePositions()
start = Time.now()
i = 0
while i < $NITER
pot = 0.0
computePot()
if i % 10 == 0
print "%5d Potential %10.7f\n" % [i, $pot]
end
updatePositions()
i += 1
end
stop = Time.now()
print "Seconds = %10.9f\n" % (stop - start)
Squirrel:const NPARTS = 1000
const NITER = 201
const DIMS = 3
const NPARTS2 = 2000
const RANDMAX = 32168.0
local SIZE = DIMS * NPARTS
local pot = 0.0
local r = []
local sqrt = sqrt
local rand = rand
local function computePot() {
for(local i = 0; i < NPARTS; ++i) {
for(local j = 0, k = i-1; j < k; ++j) {
local distx = r[j] - r[i]
local disty = r[NPARTS + j] - r[NPARTS + i]
local distz = r[NPARTS2 + j] - r[NPARTS2 + i]
pot += 1.0 / sqrt(distx * distx + disty * disty + distz * distz)
}
}
}
local function initPositions() {
for(local i = 0; i < SIZE; ++i) {
r.append(0.5 + rand() / RANDMAX)
}
}
local function updatePositions() {
for(local i = 0; i < SIZE; ++i) {
r[i] -= 0.5 + rand() / RANDMAX
}
}
initPositions()
updatePositions()
local start = clock()
for(local i = 0; i < NITER; ++i) {
pot = 0.0
computePot()
if(i % 10 == 0) {
print(format("%5d: Potential: %10.7f\n", i, pot))
}
updatePositions()
}
local stop = clock()
print(format("Seconds = %10.9f\n", stop - start))
向下滚动可载入更多评论,或者点这里禁止自动加载。