Tuesday, December 27, 2011

Crossword Puzzle Solution Generator



     In my artificial intelligence graduate course this semester, we had a term project to create a LISP crossword puzzle game. The game has two modes: automatic and manual. In automatic mode, the program fills in the grid by itself, generating as many solutions to the crossword puzzle as it can within a limit of 30 CPU seconds or until it reaches a maximum of 100 solutions; in manual mode, the user fills in the grid. The program takes two inputs: a wordlist of over 100,000 English words contained in the text file wordsEn.txt, and an arbitrary grid data file.
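Loading such a wordlist into memory is straightforward in Common Lisp. Here is a minimal sketch (the file name comes from the assignment; everything else is illustrative, not the project code):

```lisp
;; Read one word per line from the wordlist file into a list.
(defun load-wordlist (&optional (path "wordsEn.txt"))
  (with-open-file (in path :direction :input)
    (loop for line = (read-line in nil)
          while line
          collect (string-trim '(#\Return #\Space) line))))
```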

Technical Approach and Intelligence
     My program uses a depth-first backtracking search to fill in the grid with words from the wordlist, backtracking whenever no word fits an entry. An entry is a horizontal or vertical run of empty cells to be filled. The search starts at an empty entry (the root node), writes a word to it, and moves to its first neighbor (its first child node); the same process recurses through each neighbor's neighbors. If a neighbor has already been written to, the search moves on to the next neighbor, and once all of an entry's neighbors are filled it returns to the entry that called it. If no word fits an entry, we backtrack to the calling entry, try a different word there, and repeat the process until every entry is filled and the puzzle is solved.
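The core of this search can be sketched in Common Lisp roughly as follows. This is a simplified illustration rather than the project code: entry-candidates, write-entry, and erase-entry are hypothetical helpers that look up words fitting an entry's current pattern and write or undo them on the grid.

```lisp
;; Simplified depth-first backtracking fill. ENTRIES is assumed to be
;; ordered in advance by the heuristics described below.
(defun solve (entries)
  "Return T if every remaining entry can be filled, else NIL."
  (if (null entries)
      t                                  ; all entries filled: solved
      (let ((entry (first entries)))
        (dolist (word (entry-candidates entry) nil)
          (write-entry entry word)       ; tentatively place the word
          (if (solve (rest entries))
              (return t)                 ; success propagates upward
              (erase-entry entry))))))   ; dead end: undo, try next word
```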
     To speed up the search and reduce the time spent exploring dead ends, I applied a couple of heuristics to minimize the number of backtracks and improve performance.
     The problem was approached with the first-fail principle: try first where you are most likely to fail. One heuristic used was the most-constrained-variable (degree) heuristic, where the puzzle is solved by first filling in the entries with the greatest number of constraints. The constraint of most interest is the number of intersections an entry shares with entries running in the other direction: an entry with 3 neighbors has a constraint count of 3. However, because we are using a depth-first search approach, it is also important to sort each entry's neighbors by their constraint counts, so that neighbors with more intersecting positions are visited first. Another way of looking at this heuristic is that it examines one additional level of constraints, letting us start writing at the most heavily constrained position in the grid and limiting the number of backtracks.
     To break ties between entries with the same constraint count, the entry with the smaller domain (fewer candidate words) is chosen, to maximize choices for the remaining entries; a sketch of this ordering follows.
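As a rough sketch of this ordering, assuming each entry records its intersecting neighbors and a precomputed domain of fitting words (entry-neighbors and entry-domain are hypothetical accessors, not the project code):

```lisp
;; Order entries so the most constrained one (most intersections)
;; comes first; ties are broken by the smaller word domain.
(defun order-entries (entries)
  (sort (copy-list entries)
        (lambda (a b)
          (let ((ca (length (entry-neighbors a)))
                (cb (length (entry-neighbors b))))
            (if (= ca cb)
                (< (length (entry-domain a))
                   (length (entry-domain b)))
                (> ca cb))))))
```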

[Figure: letter frequency of English text]

    Another heuristic was to check whether the letter at each intersection is one of e, t, a, o, i, n, s, h, or r. These are the letters that appear most frequently in English text, as shown in the figure above. This approach was taken so that candidate words whose intersecting letters are rare are never chosen, since they are likely to lead to dead ends and more backtracking. Admittedly, as with other greedy heuristics, this one can have an adverse effect: the only workable intersecting letter for two entries might be one of the letters that were excluded. In that sense, we are trading completeness for speed.
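A minimal version of this filter might look like the following, assuming intersections are given as character positions within the candidate word (illustrative, not the project code):

```lisp
(defparameter *frequent-letters* '(#\e #\t #\a #\o #\i #\n #\s #\h #\r)
  "The nine most frequent letters in English text.")

;; Accept a candidate word only if every letter it would place at an
;; intersection is one of the frequent letters.
(defun frequent-intersections-p (word positions)
  (every (lambda (pos)
           (member (char-downcase (char word pos)) *frequent-letters*))
         positions))
```

For example, (frequent-intersections-p "apple" '(0 4)) passes because both a and e are frequent, while a word placing a q or z at an intersection would be rejected.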
     Finally, words within a domain were not shuffled but kept in alphabetical order. This is another strategy to limit the search for words to fill an entry, because words near the beginning of the alphabet appear with higher frequency.
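Keeping a domain in alphabetical order is a one-liner in Common Lisp, for example:

```lisp
;; Sort candidate words alphabetically so the search tries them in a
;; fixed, deterministic order.
(defun sort-domain (words)
  (sort (copy-list words) #'string<))
```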

Results
     The algorithm was tested on 14 test grids of varying degrees of complexity. For most grids, the program was able to generate 100 solutions, with an average run time of 0.3 seconds per grid. For extremely cluttered grids with high solving difficulty, the program was unable to generate a solution within the permitted 30-second time frame. Out of the 77 class projects, my algorithm ranked 8th in speed and completeness, which I thought was pretty impressive.
     Overall, this was a challenging but fun project. I especially enjoyed finding new and creative ways to improve my algorithm's efficiency. It was also my first time programming in LISP, and boy was it a major battle. Debugging and finding documentation on the language proved to be quite difficult. However, I discovered many benefits of the language along the way, such as the "nth" function and the overall straightforwardness of the code.
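For instance, nth retrieves a list element by position in a single call:

```lisp
(nth 2 '(10 20 30 40)) ; => 30 (zero-indexed access)
```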
