2D Delaunay Triangulation Performance – Example1

Fade is a high performance multithreaded Delaunay triangulation library, see the benchmark diagram below. Use the present benchmark code example1.cpp (contained in the download) to test the performance on your own hardware.

  1. The example source code starts with a call to Fade_2D::setNumCPU(int numCPU):
    • numCPU=0 means autodetect and use all available CPU cores
    • numCPU=1 means single-threaded execution (default behavior)
    • numCPU>1 activates multithreading

    The command returns the number of used CPU cores.

  1. Then it sets up the size of the point sets to be triangulated:
    • You can activate the 100 and 250 million points lines in the source if you have enough memory.
  2. The final loop creates random point sets and measures their triangulation time
// * 1 *   Create a Fade object and call setNumCPU(..)
Fade_2D dt;
int numUsedCPU=dt.setNumCPU(0); // 0 means: autodetect
cout<<"Number of CPU cores: "<<numUsedCPU<<endl;

// * 2 *   Set up how many points to test
std::vector<std::pair<std::string,int> > vNumPoints;
vNumPoints.push_back(make_pair("numPoints: 1k",1000));
vNumPoints.push_back(make_pair("numPoints: 50k",50000));
vNumPoints.push_back(make_pair("numPoints: 100k",100000));
vNumPoints.push_back(make_pair("numPoints: 500k",500000));
vNumPoints.push_back(make_pair("numPoints: 1 mio",1000000));
vNumPoints.push_back(make_pair("numPoints: 10 mio",10000000));
vNumPoints.push_back(make_pair("numPoints: 50 mio (10.7 GB)",50000000));
//vNumPoints.push_back(make_pair("numPoints: 100 mio (21.3 GB)",100000000));
//vNumPoints.push_back(make_pair("numPoints: 250 mio (53 GB)",250000000));


// * 3 *   Test
for(size_t i=0;i<vNumPoints.size();++i)
{
	std::string label(vNumPoints[i].first);
	int numPoints(vNumPoints[i].second);

	// Prepare a vector of random points
	vector<Point2> vInPoints;
	generateRandomPoints(numPoints,0,100,vInPoints,1);

	// Insert the points and erase afterwards. The total time for
	// triangulation (without destruction time) is measured.
	cout<<"\n"<<label<<": start"<<endl;
	double elapsed=dt.measureTriangulationTime(vInPoints);
	cout<<"Elapsed time: "<<elapsed<<endl<<endl;
}


Get true and comparable Benchmark Results

  • Compile an x64 application in Release mode. For better performance and because 32-bit applications suffer from the (MSVC 32-bit) 2GB memory limit.
  • By default Fade runs single-threaded to avoid nested multithreading in applications that are already multithreaded. Thus you need to explicitly enable multithreading using int Fade_2D::setNumCPU(..).
  • Do you have enough RAM for the benchmark? 100 mio. points take ~21 GB RAM on 64-bit systems. Avoid usage of slow swap space.
  • Prevent cpu throttling: Disable energy saving options during the test. On Linux use the Performance governor. If you use a laptop plug in the power chord.
  • Fade still supports very old compilers (VS2010, CentOS6.4) but multithreading is disabled there.

1 Million Points performance test on a Core i7 6800K

The above Delaunay triangulation benchmark has been run on a desktop CPU (Core i7 6800K, Linux x64): One million points (2 million triangles) take 0.17 seconds multithreaded (or 0.42 seconds single threaded).

Delaunay Triangulation Performance, fast single- and multithreaded execution

Delaunay Triangulation Performance, single- and multithreaded for 1 million points

Comparison: Multithreaded Delaunay vs. Singlethreaded Delaunay

The same machine has been used to test Fade2D with datasets up to 250 million points (500 million triangles). The runtime is roughly linear in the number of input points. The speedup from parallel execution on this 6-core machine is up to 4.39.

Delaunay, single- and multithreaded performance for huge point sets up to 250 million points

Delaunay Performance, single- and multithreaded for huge point sets, 250 Million Points


Raspberry PI2 – 1 Million Points on an ARM Cortex A7

Whether it is a fun benchmark or even an important one: Fade2D has been run on a comparably weak Raspberry PI2. It uses a Broadcom BCM2836 SoC containing a 32-bit ARM Cortex-A7 cpu.

This poor little 3-watt device has fought incredibly brave, it was only 16 times slower than an i7 desktop cpu with a 140-watt TDP. This result encourages us to provide support for further embedded devices. If you have any suggestions feel free to add your comments below.
Raspberry PI Delaunay Performance
Raspberry PI Delaunay Performance: 1 Million points on a weak embedded device

Delaunay Triangulation Performance on a Raspberry PI2

The next article Example2 – Traversing teaches you the important basics you need to access specific elements in a triangulation.

Spread the word. Share this post!

Leave A Reply

Your email address will not be published. Required fields are marked *

*

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close