Coder Diary #11 -- Automated, Empirical A/I Testing

John Tiller's Campaign Series exemplifies tactical war-gaming at its finest by bringing you the entire collection of TalonSoft's award-winning campaign series. Containing TalonSoft's West Front, East Front, and Rising Sun platoon-level combat series, as well as all of the official add-ons and expansion packs, the Matrix Edition allows players to dictate the events of World War II from its tumultuous beginning to its climactic conclusion. We are working together with original programmer John Tiller to bring you this updated edition.

Moderators: Jason Petho, Peter Fisla, asiaticus, dogovich

berto
Posts: 21461
Joined: Wed Mar 13, 2002 1:15 am
Location: metro Chicago, Illinois, USA

Coder Diary #11 -- Automated, Empirical A/I Testing

Post by berto »




Again, if you are a die-hard PBEMer, read no further. The following will be of no relevance to you.

If you play the A/I, however, you might find the following discussion interesting.
ORIGINAL: berto

Back in the mid 1990s, when John Tiller developed and refined the Campaign Series and that game's A/I, how did he "optimize" the A/I? Through reason alone? Intuition? And/or empirically, by running test game after test game? If the latter, did he automate the process? This is something I would like to achieve: totally automating the testing process, having the ability to run the game in batch mode, testing hundreds and even thousands of trial games. Not just by reason alone, but also by way of discovery, we might chance on just the right mix of A/I parameters (and eventually also refined and new internal algorithms) that will optimize the game's A/I. (Optimize in the sense of maximizing Victory Points. Another metric might be plausibility of A/I behavior. Does it act more or less like a real commander might?)

Can I/we do better than John Tiller? Maybe, maybe not. We may never find the Secret Sauce or improve on Tiller's parameters. (And I may never be able to improve on his coded algorithms.) But it will be interesting to find out.

If you recall, in the new ai.ini A/I initialization file, we have these:

# AI parameters
#
# Maximum hot value.
# Hex hot trigger.
# Move trigger for AttackHigh order type.
# Move trigger for AttackLow order type.
# Move trigger for NoOrder order type.
# Move trigger for DefendHigh order type.
# Move trigger for DefendLow order type.
# Bad health value.
# Good health value.
# Unload distance.
# Stand-off distance.

(In future, we will be adding to this list.)
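The diary doesn't show the ai.ini file's actual layout, but the comment list above lines up one-for-one with the eleven per-side values passed on the -A command line later in this diary (12 6 0 2 0 4 6 70 90 2 4), which puts the NoOrder move trigger in the fifth position. Purely as an illustration -- the value-per-line layout here is my guess, not the real file syntax:

```
12   # Maximum hot value.
6    # Hex hot trigger.
0    # Move trigger for AttackHigh order type.
2    # Move trigger for AttackLow order type.
0    # Move trigger for NoOrder order type.  (changed to 3 in CS 1.04)
4    # Move trigger for DefendHigh order type.
6    # Move trigger for DefendLow order type.
70   # Bad health value.
90   # Good health value.
2    # Unload distance.
4    # Stand-off distance.
```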

In the earlier Campaign Series 1.04 update, the NoOrder move trigger had been changed from 0 to 3.

Why was that? Was it a change for the better? How might we find out?

One way is via empirical testing.

Consider this:

Robert@roberto /cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/West Front
$ ./wf.exe -W -T -A 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 Bootcamp1.scn
Bootcamp1.scn: 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 1 3 0 0 0 0 -3

What's all that?!
  • Using Cygwin, I can run Windows programs from the Cygwin command line (much like running programs from the (C:\) Windows Command Prompt).
  • (The -W option says to run the West Front EXE windowed.)
  • The -T option says to run the game in TestTrialPlay mode. In that mode, one can run test trial games entirely hands-off, one after the other, continuously, for as many trials as you specify.
  • The -A option passes various command-line arguments to wf.exe, among them:
    • 12 6 0 2 0 4 6 70 90 2 4 -- the Side A/Allies A/I parameters
    • 50 -- the Side A/Allies A/I Aggressiveness (as specified in the Bootcamp1.scn file)
    • 12 6 0 2 0 4 6 70 90 2 4 -- the Side B/Axis A/I parameters (for this one test, same as for Side A/Allies)
    • 30 -- the Side B/Axis A/I Aggressiveness (as specified in the Bootcamp1.scn file)
    • Bootcamp1.scn -- the scenario to be tested
For the subsequent output line, the 1 3 0 0 0 0 -3 are the:

  • Side A/Allies losses
  • Side A/Allies points
  • Side B/Axis losses
  • Side B/Axis points
  • the first side (0 for Side A/Allies; 1 for Side B/Axis)
  • objective points (for the first side)
  • victory points (for the first side; reflects objective points, also loss points, both sides)

As you can see, in that one A/I test trial of the Bootcamp1 scenario, side 0, the Allies, ended with a victory "score" of -3.
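As an aside, because the victory score is always the last field of the output line, it is easy to pull out in the shell. A minimal sketch (the variable names here are mine, not anything in the game):

```shell
# One TestTrialPlay output line (copied from the run above).
line="Bootcamp1.scn: 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 1 3 0 0 0 0 -3"

# The victory score is the last whitespace-separated field;
# strip the longest prefix ending in a space.
score="${line##* }"
echo "$score"   # -3
```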

What if we retry the test, but this time changing the Side A/Allies (only) NoOrder move trigger (and nothing else) to 3 (as was done in the CS 1.04 update)?

Robert@roberto /cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/West Front
$ ./wf.exe -W -T -A 12 6 0 2 3 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 Bootcamp1.scn
Bootcamp1.scn: 12 6 0 2 3 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 5 15 0 0 0 0 -15

Oh my. In this second test, the Allies performed worse, from -3 down to a -15 score.

So we conclude that changing the NoOrder move trigger from 0 to 3 makes the A/I a poorer performer, right?

Not so fast!

  • These are just two test games. From game to game, the CS A/I gives a wide disparity of outcomes. We need to run many test games, then compare average outcomes.
  • Bootcamp1.scn is just one simple scenario, an Allied assault against a fixed Axis position. We need to test a wide variety of scenarios -- assaults against fixed positions, but also: meeting engagement; delaying action; static line; pocket breakout; river crossing; mopping up; recon; and others besides.
  • We need to "turn the tables" -- change the NoOrder move trigger for the second side from 0 to 3 (against an unchanging first-side NoOrder move trigger).
  • We need to keep in mind that these tests are A/I vs. A/I, not A/I vs. human.

For all of those reasons and more, we must not jump to conclusions. We need to run many tests, analyze many inputs and outputs, and draw our conclusions judiciously.

I have a script, airun, to run my A/I test trials:

#!/usr/bin/bash

set +x # no trace by default

DEBUG=0

EFDIR="/cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/East Front"
WFDIR="/cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/West Front"
RSDIR="/cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/Rising Sun"
MEDIR="/cygdrive/c/Games/Matrix Games/Modern Wars/Middle East"
VNDIR="/cygdrive/c/Games/Matrix Games/Modern Wars/Vietnam"

AIP1="12 6 0 2 0 4 6 70 90 2 4" # NoOrder move trigger 0
AIP2="12 6 0 2 3 4 6 70 90 2 4" # NoOrder move trigger 3

TRIALS=10 # default

# Parse the command-line options.
while [ $# -gt 0 ]; do
    if [ "$1" = "+G" ]; then
        DEBUG=1
        set -x
        shift
    elif [ "$1" = "-g" ]; then
        shift
        GAME=$1
        shift
    elif [ "$1" = "-t" ]; then
        shift
        TRIALS=$1
        shift
    elif [ "$1" = "-d" ]; then
        shift
        DATE=$1
        shift
    else
        break
    fi
done

if [ "x$GAME" = "x" ]; then
    echo "Usage: airun [+G] -g <GAME> [-t <TRIALS>] [-d <DATE>]"
    exit 1
fi

if [ "x$DATE" = "x" ]; then
    DATE=`date '+%Y%m%d'`
fi

# Per-game EXE, test scenario, and per-side Aggressiveness settings.
if [ "$GAME" = "ef" ]; then
    EXE=ef.exe
    SCN=Farm79.scn
    AGRA=80
    AGRB=20
    #SCN=Butyrki.scn
    #AGRA=80
    #AGRB=90
    #SCN=Tutorial.scn
    #AGRA=100
    #AGRB=100
    cd "$EFDIR"
elif [ "$GAME" = "wf" ]; then
    EXE=wf.exe
    SCN=Omaha_East.scn
    AGRA=100
    AGRB=50
    #SCN=Gabr_es_Siaghi.scn
    #AGRA=60
    #AGRB=100
    cd "$WFDIR"
...
fi

declare -i T # integer attribute, so T=$T-1 below is evaluated arithmetically

# Trigger 0 vs. trigger 0
T=$TRIALS
while [ $T -gt 0 ]; do
    ./$EXE -W -T -A $AIP1 $AGRA $AIP1 $AGRB "$SCN"
    T=$T-1
done

# Trigger 3 (Side A) vs. trigger 0 (Side B)
T=$TRIALS
while [ $T -gt 0 ]; do
    ./$EXE -W -T -A $AIP2 $AGRA $AIP1 $AGRB "$SCN"
    T=$T-1
done

# Trigger 0 (Side A) vs. trigger 3 (Side B)
T=$TRIALS
while [ $T -gt 0 ]; do
    ./$EXE -W -T -A $AIP1 $AGRA $AIP2 $AGRB "$SCN"
    T=$T-1
done

# Trigger 3 vs. trigger 3
T=$TRIALS
while [ $T -gt 0 ]; do
    ./$EXE -W -T -A $AIP2 $AGRA $AIP2 $AGRB "$SCN"
    T=$T-1
done

exit 0

This is definitely a work-in-progress. Indeed, this whole testing methodology is a work-in-progress. (And let's not forget: In future, I intend to add more and more user-configurable A/I parameters into the mix.)

Using the above airun script, and just testing the 0 vs. 3 NoOrder move trigger change, I have run over 320 A/I test trial games across eight different scenarios (of various kinds) from four different games (EF, WF, RS & VN). Here are the results from the Rising Sun Asun.scn tests:

Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 48 149 57 177 1 0 -28
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 32 100 53 152 1 0 -52
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 31 95 49 146 1 0 -51
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 40 120 48 145 1 0 -25
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 37 103 37 110 1 0 -7
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 41 131 59 169 1 0 -38
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 38 98 60 180 1 0 -82
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 31 101 63 181 1 0 -80
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 33 93 49 135 1 0 -42
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 37 118 45 137 1 0 -19
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 36 93 65 208 1 0 -115
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 25 78 56 167 1 0 -89
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 40 102 55 161 1 0 -59
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 33 105 54 167 1 0 -62
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 31 89 43 127 1 0 -38
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 38 117 58 163 1 0 -46
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 54 164 41 117 1 0 47
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 27 87 58 169 1 0 -82
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 40 127 45 139 1 0 -12
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 41 130 53 165 1 0 -35
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 41 126 57 170 1 0 -44
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 44 126 41 117 1 0 9
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 43 125 57 164 1 0 -39
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 46 145 46 141 1 50 54
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 45 142 57 183 1 0 -41
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 41 117 39 110 1 0 7
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 50 150 40 113 1 0 37
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 40 105 53 161 1 0 -56
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 48 131 57 173 1 0 -42
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 40 120 55 172 1 0 -52
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 26 78 55 169 1 0 -91
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 31 96 43 121 1 0 -25
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 36 84 40 116 1 0 -32
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 39 114 56 172 1 0 -58
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 34 94 52 149 1 0 -55
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 40 125 63 197 1 0 -72
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 42 128 40 124 1 0 4
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 44 137 50 148 1 0 -11
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 48 131 58 177 1 0 -46
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 41 138 63 190 1 0 -52

Whoa! That's a lot of data, even just for one test scenario.

I have another (work-in-progress) script, airpt.pl (not shown here), to process the airun data output. For now, the script just reports average scores, but in future I might extend it to report other averages (losses, points), other game metrics, and perhaps also standard deviations, etc. There's no end to the sophisticated statistical analysis I might apply. (I might prepare some pretty graphs, too.)
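Since airpt.pl itself isn't shown, here is the core of the averaging step sketched in shell -- my illustrative simplification, not the actual script: feed in a batch of output lines for one parameter combination, average the last field.

```shell
# Sketch only (not the real airpt.pl): average the final victory
# score over a batch of TestTrialPlay output lines read from stdin.
avg_score() {
    local sum=0 n=0 line
    while IFS= read -r line; do
        sum=$(( sum + ${line##* } ))   # last field is the victory score
        n=$(( n + 1 ))
    done
    echo $(( sum / n ))                # integer average suffices here
}

# Only the last field matters to the sketch; the prefix fields are ignored.
avg_score <<'EOF'
Asun.scn: 1 0 -28
Asun.scn: 1 0 -52
Asun.scn: 1 0 -51
EOF
# prints -43
```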

Here are the airpt.pl results for the Asun.scn tests:

RS, Asun.scn: side 1, Side B/Axis, first side

Side A 12 6 0 2 0 4 6 70 90 2 4 75 vs
Side B 12 6 0 2 0 4 6 70 90 2 4 85 AI: -42

Side A 12 6 0 2 0 4 6 70 90 2 4 75 vs
Side B 12 6 0 2 3 4 6 70 90 2 4 85 AI: -16 -- 3 is very much better

----

Side A 12 6 0 2 3 4 6 70 90 2 4 75 vs
Side B 12 6 0 2 0 4 6 70 90 2 4 85 AI: -49

Side A 12 6 0 2 3 4 6 70 90 2 4 75 vs
Side B 12 6 0 2 3 4 6 70 90 2 4 85 AI: -43 -- 3 is slightly better

For the Rising Sun Asun.scn (river crossing) scenario, 3 is a "better" NoOrder move trigger than 0. What about the other 280 test results across the other seven tested scenarios (across four games)? In general, they seem to vindicate the CS 1.04 NoOrder move trigger change from 0 to 3. Not in all cases, not in all scenario types, not always for the attacker or the defender, ... -- I cannot say that 3 is always "better" than 0. But on balance, based on my results so far, 3 is probably better, certainly no worse, than 0. So, the CS 1.04 change of NoOrder move trigger from 0 to 3? For now, we keep it.

Question: Is 3 the "optimal" NoOrder move trigger? What about 2? 4? And so on.

More questions: What about the other A/I parameters? Are they "optimal"?

And still many, many more questions, and angles to look at this problem.

The $64,000 Question: What is just the right mix of data parameters, the "Secret Sauce", giving the all-around "best", "optimal" A/I?

Previously, I had said of this effort to improve the game's A/I:
It's a journey, not a destination.
Fortunately, I have three CS test systems to carry me forward on my journey:

  • Windows XP (which I will devote to round-the-clock, non-stop test trials)
  • Windows 7 (my main developer system)
  • Windows 8 (test trials, and other tests)

with another system on the way (my daughter's older system, abandoned when she recently left for a teaching position in South Africa):

  • Windows XP (which I will also devote to round-the-clock, non-stop test trials)

Four Windows test/development systems in total!

When I get these test trials going, it's kind of amusing to be surrounded by three (and soon four) computers each playing CS scenarios non-stop, one after another. [8D]

On my "faster" Windows 7 systems, I can typically run through ~40 test trial games of a Complexity Level 3 or 4 scenario in a day or so. On my slower Windows XP system(s), it takes a while longer.

The whole testing process is slowed down by graphics overhead: For now, even though I launch the test games via the airun script (or directly from the Cygwin command line), the games still run as Windows games normally would -- i.e., in a window, with full graphical display, units moving about, explosions, etc. A longer-term goal: to add still another, no-graphics game play mode and command-line switch. If I can run these test games more quickly sans graphics, I'll really be able to crank them out.

Again (it goes without saying, the sound of the endlessly playing broken record): This is all very much a work-in-progress, a developing effort. I anticipate many modifications in the months ahead. (And foresee running many thousands of test trials.)

But this whole automated testing methodology: It opens up worlds of possibilities. Not just for improving the A/I, but for QA (Quality Assurance) and other cool stuff.

There's the game, then there's also the meta game -- coding, testing, data analyzing, ...

Geeks Just Gotta Have Fuh-un.



(Am I weird or what? [;)])

Until the next time ...
Campaign Series Legion https://cslegion.com/
Campaign Series Lead Coder https://www.matrixgames.com/forums/view ... hp?f=10167
Panzer Campaigns, Panzer Battles Lead Coder https://wargameds.com
junk2drive
Posts: 12856
Joined: Thu Jun 27, 2002 7:27 am
Location: Arizona West Coast

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by junk2drive »

Thanks
Conflict of Heroes "Most games are like checkers or chess and some have dice and cards involved too. This game plays like checkers but you think like chess and the dice and cards can change everything in real time."
XLVIIIPzKorp
Posts: 224
Joined: Tue Oct 24, 2006 12:34 am

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by XLVIIIPzKorp »

Wow, I may even play the A/I again some day. Sounds good.
Crossroads
Posts: 17498
Joined: Sun Jul 05, 2009 8:57 am

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by Crossroads »

ORIGINAL: berto

On my "faster" Windows 7 systems, I can typically run through ~40 test trial games of a Complexity Level 3 or 4 scenario in a day or so.

Oh no! This starts to sound all War Games to me... [X(]
Visit us at: Campaign Series Legion
---
CS: Vietnam 1948-1967 < Available now
CS: Middle East 1948-1985 2.0 < 3.0 In the works
wings7
Posts: 4586
Joined: Mon Aug 11, 2003 4:59 am
Location: Phoenix, Arizona

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by wings7 »

An important issue that is being addressed!! Thanks! [:D]

Patrick
Please come and join and befriend me at the great Steam portal! There are quite a few Matrix/Slitherine players on Steam! My member page: http://steamcommunity.com/profiles/76561197988402427
pzgndr
Posts: 3486
Joined: Thu Mar 18, 2004 12:51 am
Location: Maryland

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by pzgndr »

The $64,000 Question: What is just the right mix of data parameters, the "Secret Sauce", giving the all-around "best", "optimal" A/I?

Good stuff. For "optimal" AI, are these parameters something we would adjust in the overall game defaults someplace or something to adjust for each individual scenario?
Bill Macon
Empires in Arms Developer
Strategic Command Developer
berto
Posts: 21461
Joined: Wed Mar 13, 2002 1:15 am
Location: metro Chicago, Illinois, USA

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by berto »

ORIGINAL: pzgndr
The $64,000 Question: What is just the right mix of data parameters, the "Secret Sauce", giving the all-around "best", "optimal" A/I?
Good stuff. For "optimal" AI, are these parameters something we would adjust in the overall game defaults someplace or something to adjust for each individual scenario?
As described here:

  • With fine granularity, pre-game, via the ai.ini file.
  • More coarsely, in-game, via the Aggressiveness, Audacious A/I, and Cautious A/I menu options.

Not specified on a per-scenario basis, directly in the .scn files, if that's what you mean. But a future possibility?
junk2drive
Posts: 12856
Joined: Thu Jun 27, 2002 7:27 am
Location: Arizona West Coast

RE: Coder Diary #11 -- Automated, Empirical A/I Testing

Post by junk2drive »

bump 11