Effect of Line Solution Caching on Pbnsolve Run-times

Jan Wolter

Pbnsolve is a program for solving paint-by-number puzzles (also called nonograms). This page describes the impact of caching line solutions on its performance.

I first saw this idea used in the BGU solver. Whenever we solve a line, we store the solution in a hash table, keyed by the row clues, and the initial state. We check that table before running the line solver each time, and if there is a solution stored there, we use that instead of running the line solver. I implemented this idea in version 1.07 of pbnsolve.

A few details of that implementation:

Normally pbnsolve uses two hash tables, one for rows, one for columns. However, if the puzzle is square, then just one table is used.
Before starting, it assigns "clue numbers" to each clue, such that rows or columns have the same clue number if and only if they have the same length and the same clues. These numbers are stored with the solutions in the hash table, so that solutions for one line can be reused for other lines with identical clues.
For each line solution, it coverts the initial and final state of the line into a compressed binary representation to minimize the storage used. The initial state and the clue number are concatinated to form the hash key. The final state is the data.
The hash tables are fixed size. Hashing is done with a standard double hashing algorithm.
When the hash tables reach 90% fill, they are simply emptied. This limits their memory consumption, but obviously negatively impacts the hit rates.
Each time the hash table is emptied, it's size is doubled. For rectangular puzzles we start with two tables of size 25,000. For square puzzles we start with one of size 50,000. They are doubled until they reach 800,000 and then the doubling stops. This keeps us from using too much memory for puzzles that don't need that much searching, while making cache flushes relatively infrequent for more difficult puzzles.
Hashing is not turned on until the first guess is made. If the puzzle can be solved entirely by the line solver, without doing any search, then it never gets used. This is again a memory saver, since hit rates are likely to be lower during that phase, and pbnsolve doesn't really need to be sped up on such puzzles. Note that though hit rates are low during that phase, they tend to be higher than one would expect, due to the large number of duplicate clues in many puzzles.

In later versions of pbnsolve I added a refinement to the algorithm. Sometimes you'll have clues that are identical except for being reversed. For example two rows of the same length might have clues "3 1 1" and "1 1 3". These rows are actually the same. If you recognized them during the "clue numbering" phase, then whenever you are saving or looking up a solution for one of those lines you can reverse the inital and final state strings of the reversed clues. This costs very little extra computation, and I was hoping it would reduce the size of the hash tables and improve the hit rate. It does, but only very slightly. Even on 9-Dom, which has lots of reversed clues, the hit rate was improved only 0.3%.

Here are some runtimes and hit rates on some sample puzzles. These statistics were collected on computer B using pbnsolve 1.07. (There are newer, faster versions of pbnsolve, but I have not rerun these statistics.)

Pbnsolve normally uses probing to make guesses, but there is some older code in it that uses a heuristic function (not an amazingly good one in version 1.07). Probing naturally involves more re-solving of similar problems, so I thought it would be interesting to see how the caching worked with a heuristic search algorithm as well. A "+" sign indicates a runtime longer than 30 minutes.

Pbnsolve 1.07 Run Times with and without Caching

puzzle	size	Probing				Heuristic
puzzle	size	No Caching	Caching	Hit Rate	Improvement	No Caching	Caching	Hit Rate	Improvement
pbnsolve command line flags:		-aLEP	-aLEHP			-aLEG	-aLEHG
#436: Petro*	40 x 35	0.12s	0.06s	88%	50%	2.2m	1.6m	86%	27%
#4645: M&M	50 x 70	0.10s	0.07s	90%	30%	0.14s	0.12s	70%	14%
#3541: Signed	60 x 50	0.04s	0.04s	72%	0%	2.0s	1.6s	74%	20%
#803: Light	50 x 45	0.95s	0.68s	98%	28%	+	+
#6574: Forever*	25 x 25	16.8m	8.0m	98%	52%	2.1s	1.4s	93%	33%
#2040: Hot	55 x 60	1.24s	0.56s	94%	54%	+	+
#6739: Karate	40 x 40	7.7s	3.2s	93%	58%	1.5m	47.2s	93%	48%
#8098: 9-Dom*	19 x 19	28.1s	20.3s	96%	28%	7.3m	5.7m	93%	22%
#2556: Flag	65 x 45	+	+			7.8s	5.2s	92%	33%
#2712: Lion	47 x 47	1.9m	44.9s	97%	61%	+	+

Note that the hit rates are very high. I would never have guessed that they would be as high as this. In solving the "Forever" puzzle with probing pbnsolve did 986 million line solves, of which 974 million were satisfied from the cache, even though 4 cache flushes were done.

As expected, the hit rates are a bit lower when we do heuristic search rather than probing, but still pretty high.

We get substantial run time improvements even when the hit rates are modest, but the hit rate alone doesn't really predict how much improvement we get in run time. Pbnsolve's line solver is very highly optimized, and on some lines probably isn't that much slower than a hash lookup (especially with the overhead of compressing and uncompressing lines). If we are mostly hitting on lines that would have solved fast anyway, then the speedup isn't going to be so great.

Note that the speedups tend to be higher on larger puzzles, which have longer lines, and thus more expensive line solves. The 9-Dom puzzle might seem like a natural candidate for caching, because it has many duplicate clues, but the lines are short and the clues are simple, so caching doesn't save us as much. (This may be why it is one of the puzzles that gives the BGU solver the most trouble.)

So this is a very effective enhancement that is easy to implement within almost any solver.