Houdini Pro is intended for power users with high-end hardware.

The main differences from the Standard version are:

Houdini Pro supports up to 128 threads.
Houdini Pro supports up to 128 GB of hash memory (131072 MB).
Houdini Pro supports Large Memory Pages.
Houdini Pro is NUMA-aware.
Houdini Pro can use Nalimov endgame tablebases.

Large Memory Pages

Houdini Pro uses large memory pages if the operating system provides them. Depending on the hash table size, the speed gain may be between 5% and 15%.

To enable this feature in Windows, you need to modify the Group Policy for your account:

1. Run: gpedit.msc (or search for "Group Policy").
2. Under "Computer Configuration", "Windows Settings", "Security Settings", "Local Policies", click on "User Rights Assignment".
3. In the right pane, double-click the option "Lock Pages in Memory".
4. Click on "Add User or Group" and add your account or "Everyone".
5. You may have to log off or reboot for the change to take effect.

IMPORTANT: You'll also need to run your chess GUI with administrative rights ("Run as Administrator") or disable UAC in Windows.

Large memory pages are often available only shortly after Windows boots. After a while, Windows memory becomes too fragmented for large page allocation, and Houdini falls back to standard memory pages.

You can test the availability of Large Pages with the lp command. Run Houdini in a command window (simply by double-clicking on the executable) and type lp followed by Enter. Houdini will try to create large page memory blocks of increasing size and show a summary of the results.
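Independently of Houdini's lp command, you can ask Windows what its minimum large-page size is. The following is a minimal sketch, not part of Houdini: the function name large_page_minimum is hypothetical, and it relies only on the Windows API call GetLargePageMinimum in kernel32. A non-zero result means only that the processor and operating system support large pages; an allocation can still fail if the "Lock Pages in Memory" right is missing or memory is too fragmented.

```python
import ctypes
import sys


def large_page_minimum():
    """Return the minimum large-page size in bytes, or None when not on Windows.

    Hypothetical helper for illustration; it does not reproduce Houdini's lp test.
    A return value of 0 means the processor does not support large pages.
    """
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetLargePageMinimum.argtypes = []
    kernel32.GetLargePageMinimum.restype = ctypes.c_size_t
    return int(kernel32.GetLargePageMinimum())


if __name__ == "__main__":
    size = large_page_minimum()
    if size is None:
        print("Not running on Windows; nothing to check.")
    elif size == 0:
        print("Large pages are not supported on this system.")
    else:
        print(f"Minimum large-page size: {size // 1024} KB")
```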

NUMA-awareness

Most motherboards with multiple CPU sockets use the so-called "NUMA" (non-uniform memory access) architecture.

Houdini Pro detects the NUMA configuration at start-up and will adapt its memory management and thread interaction based on the different NUMA nodes that are available.

The speed gain depends on the number of cores, the motherboard and CPU brand.

Running Multiple Houdini Pro instances

If you run multiple Houdini Pro instances simultaneously, they will by default compete for the resources on the same NUMA nodes. To avoid this, set the NUMA Offset parameter to a different value in each Houdini instance.

For example, if you want to run two Houdini instances with 6 threads each on 12-core hardware, you should use NUMA Offset 1 for the second instance so that it will allocate its 6 threads on the second NUMA node. See also the NUMA Offset configuration.
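The example above can be sketched as a small helper that produces the settings for each instance. This is only an illustration under stated assumptions: that the settings are sent as UCI setoption commands, that the options are named exactly "Threads" and "NUMA Offset", and that the offset counts whole NUMA nodes. The function name uci_options_for_instance is hypothetical; check the NUMA Offset configuration for the actual semantics.

```python
import math


def uci_options_for_instance(instance_index, threads, cores_per_node):
    """Build UCI commands for one of several engine instances.

    Assumption: each instance occupies whole NUMA nodes, and the offset is the
    index of the first node used by that instance.
    """
    nodes_per_instance = math.ceil(threads / cores_per_node)
    offset = instance_index * nodes_per_instance
    return [
        f"setoption name Threads value {threads}",
        f"setoption name NUMA Offset value {offset}",
    ]


if __name__ == "__main__":
    # Two instances with 6 threads each on 12-core hardware with 6 cores per node.
    for index in range(2):
        print(f"Instance {index + 1}:")
        for command in uci_options_for_instance(index, threads=6, cores_per_node=6):
            print("  " + command)
```

For the second instance this yields NUMA Offset 1, matching the example in the text.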

Some Real Performance Data

40-core dual Intel Xeon v4 at 2.3 GHz

The test system is a dual Intel Xeon v4 machine with 40 cores at 2.3 GHz, exposing 80 virtual processors (40 cores with hyper-threading) under Windows 10.

The system has two CPUs and two NUMA nodes; each 20-core CPU resides on its own NUMA node.

A two-minute benchmark was run on a number of positions using 1, 6, 20 and 40 threads, with hash memory set to 2048 MB.

The impact of Large Pages and NUMA-awareness on the measured average node speed was as follows:

Configuration (speeds in kN/s)    1 thread    6 threads    20 threads    40 threads
Standard or Pro without NUMA      2,400       13,600       47,600        58,700
Pro with NUMA                     2,400       13,500       47,800        91,200
Pro with Large Pages              2,550       14,500       49,400        67,300
Pro with NUMA and Large Pages     2,550       14,600       49,500        96,400

When the engine can run fully on a single CPU, i.e. up to 20 threads, Windows and the Intel Xeons do a good job of providing excellent performance without any NUMA awareness.

NUMA-awareness becomes important only with 40 threads running across both CPUs of the system, where it raises the node speed from 58,700 to 91,200 kN/s.

For 2048 MB hash the speed improvement from using Large Pages is about 6%. The impact grows when the size of the Hash Memory becomes larger; repeating the same benchmark with 8192 MB of hash memory yields a speed increase from Large Pages of nearly 10%.

The numbers also show that Houdini Pro scales nearly perfectly with the number of threads: with NUMA and Large Pages enabled, the 20-thread benchmark is about 19.4 times faster than the single-thread result, and the 40-thread run is about 37.8 times faster.
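The scaling claim can be checked directly against the benchmark figures. The sketch below uses the "Pro with NUMA and Large Pages" results for the Intel system; the helper names speedup and efficiency are illustrative and not part of Houdini.

```python
# Node speeds in kN/s, taken from the "Pro with NUMA and Large Pages" results
# for the 40-core Intel system, keyed by thread count.
speeds = {1: 2550, 6: 14600, 20: 49500, 40: 96400}


def speedup(speeds, threads):
    """Node speed at the given thread count relative to the single-thread run."""
    return speeds[threads] / speeds[1]


def efficiency(speeds, threads):
    """Speedup divided by thread count; 1.0 would be perfect linear scaling."""
    return speedup(speeds, threads) / threads


if __name__ == "__main__":
    for n in sorted(speeds):
        print(f"{n:>2} threads: speedup {speedup(speeds, n):.1f}x, "
              f"efficiency {efficiency(speeds, n):.3f}")
```

With these figures the speedup comes out at about 19.4 for 20 threads and about 37.8 for 40 threads.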

 

24-core dual AMD Opteron 6174 at 2.3 GHz

This is a 24-core dual Opteron box with 4 NUMA nodes of 6 cores each; each 12-core Opteron 6174 processor has 2 NUMA nodes.

A two-minute benchmark was run on a number of positions using 1, 6, 12 and 24 threads, with hash memory set to 2048 MB.

The impact of Large Pages and NUMA-awareness on the measured average node speed was as follows:

Configuration (speeds in kN/s)    1 thread    6 threads    12 threads    24 threads
Standard or Pro without NUMA      1,430       8,350        16,400        31,500
Pro with NUMA                     1,440       8,900        17,700        35,200
Pro with Large Pages              1,600       9,100        18,100        -
Pro with NUMA and Large Pages     1,630       9,400        19,100        39,300

The AMD CPU benefits more clearly from Large Pages and NUMA support.

With 24 threads the performance benefit provided by NUMA-awareness and Large Pages is close to 25%.
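The 25% figure follows from the 24-thread results for the AMD system. As a small worked check (the function name gain_percent is illustrative only):

```python
def gain_percent(baseline, measured):
    """Relative improvement of measured over baseline, in percent."""
    return (measured - baseline) / baseline * 100.0


# 24-thread node speeds in kN/s from the AMD benchmark results.
baseline = 31500              # Standard or Pro without NUMA
numa_only = 35200             # Pro with NUMA
numa_and_large_pages = 39300  # Pro with NUMA and Large Pages

if __name__ == "__main__":
    print(f"NUMA only:            {gain_percent(baseline, numa_only):.1f}%")
    print(f"NUMA and Large Pages: {gain_percent(baseline, numa_and_large_pages):.1f}%")
```

NUMA-awareness alone accounts for a gain of roughly 11.7%, and the combination with Large Pages for roughly 24.8%.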

As above, the performance scaling remains nearly perfect up to the maximum number of threads.