C++/C++ CLR/C#/Java/Python/IronPython/JavaScript/Lua/Ruby/Squirrel性能测试

标签:性能

今天蛋疼地看到一篇《原生代码与托管代码的一个简单性能对比》,考虑到已经是2年前的文章了,现在的编译器可能会进一步优化,所以自行测试了一遍。


这是2007年英特尔多核平台编程优化大赛题,该文的作者拿到了最佳优化奖,但此处的代码并非最优化的,只是改进了乘方、自己实现随机数而已。(最优版本可参见蒋黎的代码侯思松的代码,感觉很变态…)
其中,C++和C++ CLR的代码相同,只是采用的编译指令不同而已。
此外,C#代码从二维数组改为一维数组,我稍微测试了一下,C#二维数组和嵌套数组的速度确实很慢。原作者的代码似乎还把下标弄错了,所以也改了下。
而Python、Cython、JavaScript和Lua的代码是我自己照英特尔的原程序改编的,没有对随机数进行优化,附在文末。
Squirrel的代码由dwing提供,我只是将table改成了数组,RAND_MAX + 1.0改成了32768.0的常数,时间快了约5秒左右。

测试平台:
CPU:Intel Core2 Duo T9400 @ 2.53GHz
内存:3 GB
操作系统:Windows XP Pro SP2英文版
C++编译器:gcc version 4.4.1 (TDM-2 mingw32)(以下简称G++)、Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86(以下简称VC++ 2008)、VC++ 2010 Beta2
C++ CLR编译器:VC++ 2008、VC++ 2010 Beta2
C#编译器:Microsoft (R) Visual C# 2008 Compiler version 3.5.21022.8、Microsoft (R) Visual C# 2010 Compiler version 4.0.21006.1
Java:Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Python:Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32、Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32、Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32
Cython:0.11.3
Psyco:1.6
IronPython:IronPython 2.6 (2.6.10920.0) on .NET 2.0.50727.1433
JavaScript:ChromePlus 1.2.6.0 (Chromium    4.0.222.3 (Official Build 28644),V8 1.3.15)
Lua:5.1.4
LuaJIT:1.1.5、2.0.0-beta1
Ruby:ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-mswin32]
Squirrel:Squirrel 3.0 alpha 2 Copyright (C) 2003-2009 Alberto Demichelis (32 bits)
顺带一提,G++是采用TDM版(目前尚无GCC 4.4.1和4.4.2的Windows版),VC++ 2008编译器是Microsoft Visual C++ Toolkit 2008里的,C#编译器是Microsoft .NET Framework 3.5和4.0 Beta2里的,Cython和LuaJIT用的C编译器是GCC 4.4.1,LuaJIT 2.0.0还测试了VC++ 2003编译的版本,Ruby没有采用更慢的稳定版。

测试时关掉了大部分占用资源的程序,每种搭配测试5次,取时间最少的一次作为最终结果。
从控制台运行时似乎会影响速度(此时是C++ CLR以超过第2名3%的优势胜出),所以对于exe文件,我改成了双击运行和控制台各测5次,取最小值。

测试结果:
语言/编译器编译参数运行时间(秒)相对速度
C++ (G++ 4.4.1)-O3 -s1.76585.84%
C++ (G++ 4.4.1)-O3 -s -mtune=native -march=native1.85981.50%
C++ (G++ 4.4.1)-O3 -s -mfpmath=sse -mtune=native -march=native1.57896.01%
C++ (G++ 4.4.1)-O3 -s -mfpmath=sse -mtune=native -march=native -ffast-math1.54697.99%
C++ (VC++ 2008)/O2 /MD1.62593.23%
C++ (VC++ 2008)/O2 /MD /arch:SSE21.59395.10%
C++ (VC++ 2008)/O2 /MD /arch:SSE2 /fp:fast1.56296.99%
C++ (VC++ 2010 Beta2)/O2 /MD1.6492.38%
C++ (VC++ 2010 Beta2)/O2 /MD /arch:SSE21.56296.99%
C++ (VC++ 2010 Beta2)/O2 /MD /arch:SSE2 /fp:fast1.53198.95%
C++ CLR (VC++ 2008)/O2 /MD /clr1.57896.01%
C++ CLR (VC++ 2010 Beta2)/O2 /MD /clr1.515100.00%
C# 3.5无参数3.28146.17%
C# 3.5/o3.21947.06%
C# 4.0 Beta2无参数2.20368.77%
C# 4.0 Beta2/o2.18769.27%
Java 1.6无参数2.09372.38%
Python 2.5.4无参数117.1631.29%
Python 2.6.4无参数91.1371.66%
Python 3.1.1无参数109.5311.38%
Cython 0.11.3-O311.79512.84%
Psyco 1.6 (bind)无参数26.2065.78%
Psyco 1.6 (full)无参数25.6505.91%
IronPython 2.6无参数59.6302.54%
JavaScript (V8 1.3.15)无参数16.6539.10%
Lua 5.1.4无参数48.9693.09%
Lua 5.1.4 (Luac编译)-s49.0313.09%
LuaJIT 1.1.5无参数24.1416.28%
LuaJIT 1.1.5运行:-O29.64115.71%
LuaJIT 1.1.5 (Luac编译)编译:-s24.0936.29%
LuaJIT 1.1.5 (Luac编译)编译:-s 运行:-O29.64115.71%
LuaJIT 2.0.0-beta1 (GCC 4.4.1编译)无参数1.54697.99%
LuaJIT 2.0.0-beta1 (GCC 4.4.1编译)运行:-O24.57833.09%
LuaJIT 2.0.0-beta1 (GCC 4.4.1编译)无参数1.62593.23%
LuaJIT 2.0.0-beta1 (VC++ 2003编译)运行:-O24.57833.09%
Ruby 1.9.1无参数327.7970.46%
Squirrel 3.0 alpha 2无参数73.8732.05%
Squirrel 3.0 alpha 2 (编译成字节码)无参数77.8281.95%
冠军让C++ CLR夺走了,这点出乎我的意料。测试了多次,均比其他C++实现的时间要少,可见不属于测试误差。看来相对于原生代码,CLR会自动对当前平台进行优化。缺点就是体积稍大了点(29.5KB(VC++ 2010)/24KB(VC++ 2008),而其他的C++实现为7KB左右,VC++ 2008编译的还会生成1KB manifest文件),且需要安装.net framework才能运行。没有测试Intel编译器是一大遗憾,估计它才是性能王者。(由于运行库不兼容,我的电脑上无法测试;在别人的电脑上,Intel C++ 11编译器速度约为VC++2010 CLR的2倍,非常震惊。)
G++在最佳的参数下和VC++基本持平,不过单一的-O3参数并没有什么优势,要组合出那么多种参数也蛮难为人的,而且-ffast-math也是不符合IEEE规范的浮点实现。顺带一提,用该参数编译未优化版本的结果是1.578秒,和优化后几乎没有区别。
VC++ 2008各种参数的性能都差不多,所以不用过多考虑搭配。
VC++ 2010 Beta2略微超过了VC++ 2008和G++,不过需要附带一个新的动态链接库(msvcr100.dll,749 KB)。
让我无语的是C# 3.5居然远远落后于Java(慢了约53.8%),而且是在Windows操作系统上,也未测试可能更快的Java 7及IBM、DEA的JVM和GCJ。C# 4.0的性能提高了不少,只略低于Java 6了。(顺带一提,C# 2.0和3.5成绩几乎相同,看来微软在4.0上花了大气力来提升性能。)此外,C#生成的代码只有5KB,且无manifest文件。(但别忘了运行库的体积。)
而Python则仅快于Ruby,比静态编译的实现慢了约2个数量级,这也在我意料之中。而在性能方面,2.6.4 > 3.1.1 > 2.5.4,而3.x的表现还不错。顺带一提动态语言的多维数组都比较慢,所以改成了一维。
Cython目前仍无针对数组的优化(详见《CEP 517: Cython array type》),而本程序大量涉及对数组的操作,所以性能提升相当有限。此外,将列表设为了1维,速度比2维快了约50%。同时,由于Cython不能对乘方进行优化,因此我改成了乘法。
Python+Psyco的组合还算不错,只增加了2行,性能就提升了约5倍,凌驾于其他动态语言之上。同上,我将乘方改成了乘法,开方则调用系统函数。
IronPython的性能让我很惊讶,因为Jython的性能据说很糟糕,没想到在.net平台上会发挥得这么好。
JavaScript只测试了性能最好的V8引擎。由于输出是调用DOM API,较为影响速度,所以我直接删除了。优化同上,数组采用1维。
在接受dwing的建议,将全局变量和系统函数声明为local后,Lua的速度提升了1倍,表现很不错。不过Luac编译似乎没有帮助,以后不去用了。此外,同样也将table优化成了1维数组。
LuaJIT 1.1.5表现非常不错,改成local声明后,速度提升了3倍,在-O2开启时居然比Cython更快。速度大约是C++的1/6,Java的1/5,相差已经在一个数量级之内了。
而最近刚出的LuaJIT 2.0.0-beta1更是达到了C++的水平,这对动态语言来说简直是不可思议的。不过大部分测试是稳定在1.625秒,只有一次达到1.546。而VC++ 2003编译出来的是在1.625~1.640之间波动,但体积略小。至于兼容性,采用-O0~-O2来运行都会有大幅的性能下降(非常诡异),而且现在不支持luac编译的代码。此外,文档里也提到了,LuaJIT 2在数值计算方面接近于静态编译语言,而其他方面(例如字符串)仍略逊于后者,整体性能和JVM 1.6 Client持平。
Ruby是很久没碰了,于是测试了半天,选了种最快的实现,但仍比Python 2.6.4慢了59%,速度仅有C++的1/200。代码里我用while取代了for,因为后者(无论0...n还是0.upto(n))比前者慢了一半。(Python里正好相反,于是人们会倾向于用最优雅的方式得到最好的性能。)此外数组采用了1维,乘方改成了乘法,开方用系统函数。
Squirrel表现平平,和Lua相比完全出于劣势,而且我也不喜欢它的语法。顺带一提,第一次运行时为73秒,之后就一直是78秒了,编译成字节码也只能到77秒,这也是测出非编译版本更快的原因。

Python和IronPython:
from random import random
from time import clock

NPARTS = 1000
NPARTS2 = NPARTS * 2
NITER = 201
DIMS = 3
pot = .0
SIZE = NPARTS * DIMS
r = [.0]  * SIZE

def computePot():
  global NPARTS, DIMS, r, pot
  for i in xrange(NPARTS):
    for j in xrange(i - 1):
      distx = r[j] - r[i]
      disty = r[NPARTS + j] - r[NPARTS + i]
      distz = r[NPARTS2 + j] - r[NPARTS2 + i]
      dist = (distx * distx + disty * disty + distz * distz) ** .5
      pot += 1.0 / dist

def initPositions():
  global NPARTS, DIMS, r
  for i in xrange(SIZE):
    r[i] = 0.5 + random()

def updatePositions():
  global NPARTS, DIMS, r
  for i in xrange(SIZE):
    r[i] -= 0.5 + random()

def main():
  global pot, NITER
  initPositions()
  updatePositions()

  start = clock()
  for i in xrange(NITER):
    pot = .0
    computePot()
    if not i % 10:
      print "%5d: Potential: %10.7f" % (i, pot)
    updatePositions()
  stop = clock()
  print "Seconds = %10.9f" % (stop - start)

if __name__ == '__main__':
  main()
Python 3000:将Python的代码中,print改成函数调用形式(加括号),xrange改成range即可。

Cython:
from random import random
from time import clock

cdef int NPARTS = 1000
cdef int NPARTS2 = NPARTS * 2
cdef int NITER = 201
cdef int DIMS = 3
cdef int SIZE = NPARTS * DIMS
cdef double pot = .0
cdef list r = [.0]  * SIZE

cpdef computePot():
  global NPARTS, DIMS, r, pot
  cdef int i, j
  cdef double distx, disty, distz, dist
  for i in xrange(NPARTS):
    for j in xrange(i - 1):
        distx = r[j] - r[i]
        disty = r[NPARTS + j] - r[NPARTS + i]
        distz = r[NPARTS2 + j] - r[NPARTS2 + i]
        dist = (distx * distx + disty * disty + distz * distz) ** .5
        pot += 1.0 / dist

cpdef initPositions():
  global NPARTS, DIMS, r
  cdef int i, j
  for i in xrange(SIZE):
    r[i] = 0.5 + random()

cpdef updatePositions():
  global NPARTS, DIMS, r
  cdef int i, j
  for i in xrange(SIZE):
    r[i] -= 0.5 + random()

def main():
  global pot, NITER
  cdef int i
  cdef double start, stop
  initPositions()
  updatePositions()

  start = clock()
  for i in xrange(NITER):
    pot = .0
    computePot()
    if not i % 10:
      print "%5d: Potential: %10.7f" % (i, pot)
    updatePositions()
  stop = clock()

  print "Seconds = %10.9f" % (stop - start)
Psyco(使用bind()):
from random import random
from time import clock
from math import sqrt
from psyco import bind


NPARTS = 1000
NITER = 201
DIMS = 3
pot = .0
r = [[.0]  * NPARTS] * DIMS

def computePot():
  global NPARTS, DIMS, r, pot
  for i in xrange(NPARTS):
    for j in xrange(i - 1):
      distx = (r[0][j] - r[0][i])
      distx *= distx
      disty = (r[1][j] - r[1][i])
      disty *= disty
      distz = (r[2][j] - r[2][i])
      distz *= distz
      dist = sqrt(distx + disty + distz)
      pot += 1.0 / dist

def initPositions():
  global NPARTS, DIMS, r
  for i in xrange(DIMS):
    for j in xrange(NPARTS):
      r[i][j] = 0.5 + random()

def updatePositions():
  global NPARTS, DIMS, r
  for i in xrange(DIMS):
    for j in xrange(NPARTS):
      r[i][j] -= 0.5 + random()

def main():
  global pot, NITER
  initPositions()
  updatePositions()

  start = clock()
  for i in xrange(NITER):
    pot = .0
    computePot()
    if not i % 10:
      print "%5d: Potential: %10.7f" % (i, pot)
    updatePositions()
  stop = clock()
  print "Seconds = %10.9f" % (stop - start)

bind(computePot)
bind(initPositions)
bind(updatePositions)
bind(main)

if __name__ == '__main__':
  main()
Psyco(使用full()):
from random import random
from time import clock
from math import sqrt
from psyco import full


full()

NPARTS = 1000
NITER = 201
DIMS = 3
pot = .0
r = [[.0]  * NPARTS] * DIMS

def computePot():
  global NPARTS, DIMS, r, pot
  for i in xrange(NPARTS):
    for j in xrange(i - 1):
      distx = (r[0][j] - r[0][i])
      distx *= distx
      disty = (r[1][j] - r[1][i])
      disty *= disty
      distz = (r[2][j] - r[2][i])
      distz *= distz
      dist = sqrt(distx + disty + distz)
      pot += 1.0 / dist

def initPositions():
  global NPARTS, DIMS, r
  for i in xrange(DIMS):
    for j in xrange(NPARTS):
      r[i][j] = 0.5 + random()

def updatePositions():
  global NPARTS, DIMS, r
  for i in xrange(DIMS):
    for j in xrange(NPARTS):
      r[i][j] -= 0.5 + random()

def main():
  global pot, NITER
  initPositions()
  updatePositions()

  start = clock()
  for i in xrange(NITER):
    pot = .0
    computePot()
    if not i % 10:
      print "%5d: Potential: %10.7f" % (i, pot)
    updatePositions()
  stop = clock()
  print "Seconds = %10.9f" % (stop - start)

if __name__ == '__main__':
  main()
我顺便还写了个numpy的实现:
from numpy import add, square
from numpy.random import rand
from time import clock

NPARTS = 1000
NITER = 201
DIMS = 3
r = rand(NPARTS, DIMS)
pot = .0

def computePot():
  global NPARTS, DIMS, r, pot
  for i in xrange(NPARTS):
    for j in xrange(i - 1):
        pot += 1.0 / (add.reduce(square(r[j] - r[i])) ** .5)

def initPositions():
  global r
  r += 0.5

def updatePositions():
  c = clock()
  print c
  global DIMS, NPARTS, r
  r -= (rand(NPARTS, DIMS) + .5)
  print clock() -c

def main():
  global pot, NITER
  initPositions()
  updatePositions()

  start = clock()
  for i in xrange(NITER):
    pot = .0
    computePot()
    if not i % 10:
      print "%5d: Potential: %10.7f" % (i, pot)
    updatePositions()
  stop = clock()
  print "Seconds = %10.9f" % (stop - start)

if __name__ == '__main__':
  main()
然而让我无语的是,numpy版比纯Python版还慢1个数量级。
稍微测试了一下,发现了原因:numpy的函数调用开销很大,因此不适合对小数组进行操作。
简单来说,计算二维数组里2列间差的平方和时,numpy调用了3次函数,而纯Python则完全是内置操作符,所以调用开销较小。而这个二维数组每列只有3个数,所以即便numpy的函数计算很快,但总体来说仍慢于纯Python版。
不过对付大数组则是numpy占据优势了:在数组长度为3时,numpy比Python慢10倍;长度为33时,2者持平;长度为330时,numpy快8倍;长度为3300时,numpy快30倍。(纯Python版的时间复杂度基本可视为O(N)。)

JavaScript:
var NPARTS = 1000;
var NPARTS2 = 2000;
var NITER = 201;
var DIMS = 3;
var SIZE = DIMS * NPARTS;

var r = new Array(SIZE);
var pot;

function initPositions() {
   for(var i = 0; i < SIZE; i++)
     r[i] = 0.5 + Math.random();
}

function updatePositions() {
   for (var i = 0; i < SIZE; i++)
      r[i] -= 0.5 + Math.random();
}

function computePot() {
   for (var i = 0; i < NPARTS; i++ ) {
      for (var j = 0, k = i - 1; j < k; j++ ) {
        var distx = r[j] - r[i];
        var disty = r[NPARTS + j] - r[NPARTS + i];
        var distz = r[NPARTS2 + j] - r[NPARTS2 + i];
        var dist = Math.sqrt(distx * distx + disty * disty + distz * distz);
        pot += 1.0 / dist;
      }
   }
}

initPositions();
updatePositions();

var start = new Date();
for (var i = 0; i < NITER; i++) {
  pot = 0.0;
  computePot();
  updatePositions();
}
var stop = new Date();

alert(stop - start);
Lua:
local NPARTS = 1000
local NITER = 201
local DIMS = 3
local NPARTS2 = 2000
local SIZE = DIMS * NPARTS
local pot = .0
local r = {}
local sqrt = math.sqrt
local random = math.random


function computePot()
  for i = 1, NPARTS do
    for j = 1, i - 1 do
      local distx = r[j] - r[i]
      local disty = r[NPARTS + j] - r[NPARTS + i]
      local distz = r[NPARTS2 + j] - r[NPARTS2 + i]
      pot = pot + 1.0 / sqrt(distx * distx + disty * disty + distz * distz)
    end
  end
end

function initPositions()
  for i = 1, SIZE do
    r[i] = 0.5 + random()
  end
end

function updatePositions()
  for i = 1, SIZE do
    r[i] = r[i] - (0.5 + random())
  end
end

initPositions()
updatePositions()

local start = os.clock()
for i = 1, NITER do
  pot = .0
  computePot()
  if i % 10 == 0 then
    print(string.format("%5d: Potential: %10.7f", i, pot))
  end
  updatePositions()
end
local stop = os.clock()
print(string.format("Seconds = %10.9f", stop - start))
Ruby:
$NPARTS = 1000
$NPARTS2 = 2000
$NITER = 201
$DIMS = 3
$SIZE = $DIMS * $NPARTS
$pot = 0.0
$r = [0.0]  * ($NPARTS * $DIMS)

def computePot()
  i = 0
  while i < $NPARTS
    j = 0
    max = i - 1
    while j < max
      distx = $r[j] - $r[i]
      disty = $r[$NPARTS + j] - $r[$NPARTS + i]
      distz = $r[$NPARTS2 + j] - $r[$NPARTS2 + i]
      $pot = $pot + 1.0 / Math.sqrt(distx * distx + disty * disty + distz * distz)
      j += 1
    end
    i += 1
  end
end


def initPositions()
  i = 0
  while i < $SIZE
    $r[i] = 0.5 + rand()
    i += 1
  end
end

def updatePositions()
  i = 0
  while i < $SIZE
    $r[i] -= 0.5 + rand()
    i += 1
  end
end

initPositions()
updatePositions()

start = Time.now()
i = 0
while i < $NITER
	pot = 0.0
	computePot()
	if i % 10 == 0
		print "%5d Potential %10.7f\n" % [i, $pot]
  end
	updatePositions()
  i += 1
end
stop = Time.now()
print "Seconds = %10.9f\n" % (stop - start)
Squirrel:
const NPARTS = 1000
const NITER = 201
const DIMS = 3
const NPARTS2 = 2000
const RANDMAX  = 32168.0
local SIZE = DIMS * NPARTS
local pot = 0.0
local r = []
local sqrt = sqrt
local rand = rand

local function computePot() {
  for(local i = 0; i < NPARTS; ++i) {
    for(local j = 0, k = i-1; j < k; ++j) {
      local distx = r[j] - r[i]
      local disty = r[NPARTS  + j] - r[NPARTS  + i]
      local distz = r[NPARTS2 + j] - r[NPARTS2 + i]
      pot += 1.0 / sqrt(distx * distx + disty * disty + distz * distz)
    }
  }
}

local function initPositions() {
  for(local i = 0; i < SIZE; ++i) {
    r.append(0.5 + rand() / RANDMAX)
  }
}

local function updatePositions() {
  for(local i = 0; i < SIZE; ++i) {
    r[i] -= 0.5 + rand() / RANDMAX
  }
}

initPositions()
updatePositions()

local start = clock()
for(local i = 0; i < NITER; ++i) {
  pot = 0.0
  computePot()
  if(i % 10 == 0) {
    print(format("%5d: Potential: %10.7f\n", i, pot))
  }
  updatePositions()
}
local stop = clock()
print(format("Seconds = %10.9f\n", stop - start))

9条评论 你不来一发么↓ 顺序排列 倒序排列

    向下滚动可载入更多评论,或者点这里禁止自动加载

    想说点什么呢?