Matrix Games Forums


Coder Diary #11 -- Automated, Empirical A/I Testing

 
Coder Diary #11 -- Automated, Empirical A/I Testing - 10/14/2013 10:46:07 PM   
berto


Posts: 18174
Joined: 3/13/2002
From: metro Chicago, Illinois, USA
Status: online



Again, if you are a die-hard PBEMer, read no further. The following will be of no relevance to you.

If you play the A/I, however, you might find the following discussion interesting.

quote:

ORIGINAL: berto

Back in the mid 1990s, when John Tiller developed and refined the Campaign Series and that game's A/I, how did he "optimize" the A/I? Through reason alone? Intuition? And/or empirically, by running test game after test game? If the latter, did he automate the process? This is something I would like to achieve: totally automating the testing process, having the ability to run the game in batch mode, testing hundreds and even thousands of trial games. Not just by reason alone, but also by way of discovery, we might chance on just the right mix of A/I parameters (and eventually also refined and new internal algorithms) that will optimize the game's A/I. (Optimize in the sense of maximizing Victory Points. Another metric might be plausibility of A/I behavior. Does it act more or less like a real commander might?)

Can I/we do better than John Tiller? Maybe, maybe not. We may never find the Secret Sauce or improve on Tiller's parameters. (And I may never be able to improve on his coded algorithms.) But it will be interesting to find out.

If you recall, in the new ai.ini A/I initialization file, we have these:



# AI parameters
#
# Maximum hot value.
# Hex hot trigger.
# Move trigger for AttackHigh order type.
# Move trigger for AttackLow order type.
# Move trigger for NoOrder order type.
# Move trigger for DefendHigh order type.
# Move trigger for DefendLow order type.
# Bad health value.
# Good health value.
# Unload distance.
# Stand-off distance.



(In future, we will be adding to this list.)
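
Matching the eleven comment lines to the eleven values passed on the command lines below (12 6 0 2 0 4 6 70 90 2 4 -- the NoOrder move trigger is the fifth value, which is the one the 1.04 update changed), a filled-in ai.ini would look something like this. The key names here are my own invention for illustration; only the comments above are from the actual file:

```ini
# AI parameters (hypothetical key names; values are those used in the test runs below)
MaxHotValue       = 12
HexHotTrigger     = 6
AttackHighTrigger = 0
AttackLowTrigger  = 2
NoOrderTrigger    = 0    # changed to 3 in the CS 1.04 update
DefendHighTrigger = 4
DefendLowTrigger  = 6
BadHealthValue    = 70
GoodHealthValue   = 90
UnloadDistance    = 2
StandoffDistance  = 4
```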

In the earlier Campaign Series 1.04 update, the NoOrder move trigger was changed from 0 to 3.

Why was that? Was it a change for the better? How might we find out?

One way is via empirical testing.

Consider this:



Robert@roberto /cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/West Front
$ ./wf.exe -W -T -A 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 Bootcamp1.scn
Bootcamp1.scn: 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 1 3 0 0 0 0 -3



What's all that?!

  • Using Cygwin, I can run Windows programs from the Cygwin command line (much as you would run them from the Windows Command Prompt).
  • (The -W option says to run the West Front EXE windowed.)
  • The -T option says to run the game in TestTrialPlay mode. In that mode, test trial games run entirely hands-off, one after another, continuously, for as many trials as you specify.
  • The -A option passes various command-line arguments to wf.exe, among them:

    • 12 6 0 2 0 4 6 70 90 2 4 -- the Side A/Allies A/I parameters
    • 50 -- the Side A/Allies A/I Aggressiveness (as specified in the Bootcamp1.scn file)
    • 12 6 0 2 0 4 6 70 90 2 4 -- the Side B/Axis A/I parameters (for this one test, same as for Side A/Allies)
    • 30 -- the Side B/Axis A/I Aggressiveness (as specified in the Bootcamp1.scn file)
    • Bootcamp1.scn -- the scenario to be tested
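
Because each -T run prints exactly one result line and exits, batching is just a shell loop. A generic helper (my own sketch; the wf.exe command line and flags are the ones shown above) might be:

```shell
# Run a command N times, appending each run's result line to a log file.
# Sketch only -- wf.exe and its -W/-T/-A flags are as described in the text.
run_trials () {
    local n=$1 cmd=$2 log=$3
    while [ "$n" -gt 0 ]; do
        $cmd >> "$log"     # one result line per trial
        n=$((n - 1))
    done
}

# e.g.: run_trials 100 "./wf.exe -W -T -A 12 6 0 2 0 4 6 70 90 2 4 50 \
#                       12 6 0 2 0 4 6 70 90 2 4 30 Bootcamp1.scn" trials.log
```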


In the subsequent output line, the trailing values 1 3 0 0 0 0 -3 are the

  • Side A/Allies losses
  • Side A/Allies points
  • Side B/Axis losses
  • Side B/Axis points
  • the first side (0 for Side A/Allies; 1 for Side B/Axis)
  • objective points (for the first side)
  • victory points (for the first side; reflects objective points, also loss points, both sides)

    As you can see, in that one A/I test trial of the Bootcamp1 scenario, side 0, the Allies, ended with a victory "score" of -3.
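
Since the victory score is always the last field of the result line, it is easy to pluck out from the shell (the line below is copied from the run above):

```shell
# The victory score is the last whitespace-separated field of a result line.
line='Bootcamp1.scn: 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 1 3 0 0 0 0 -3'
score=$(echo "$line" | awk '{ print $NF }')
echo "$score"    # -3
```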

    What if we retry the test, but this time changing the Side A/Allies (only) NoOrder move trigger (and nothing else) to 3 (as was done in the CS 1.04 update)?



    Robert@roberto /cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/West Front
    $ ./wf.exe -W -T -A 12 6 0 2 3 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 Bootcamp1.scn
    Bootcamp1.scn: 12 6 0 2 3 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 5 15 0 0 0 0 -15



    Oh my. In this second test, the Allies performed worse, dropping from a -3 to a -15 score.

    So we conclude that changing the NoOrder move trigger from 0 to 3 makes the A/I a poorer performer, right?

    Not so fast!

  • These are just two test games. From game to game, the CS A/I produces widely varying outcomes. We need to run many test games, then compare average outcomes.
  • Bootcamp1.scn is just one simple scenario, an Allied assault against a fixed Axis position. We need to test a wide variety of scenarios -- assaults against fixed positions, but also: meeting engagement; delaying action; static line; pocket breakout; river crossing; mopping up; recon; and others besides.
  • We need to "turn the tables" -- change the NoOrder move trigger for the second side from 0 to 3 (against an unchanging first side NoOrder move trigger).
  • We need to keep in mind that these tests are A/I vs. A/I, not A/I vs. human.

    For all of those reasons and more, we must not jump to conclusions. We need to run many tests, analyze many inputs and outputs, and draw our conclusions judiciously.
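
Comparing average outcomes is then a one-liner over the collected result lines. The two lines below are the single-trial results quoted above, so this only demonstrates the mechanics; a real comparison would average many trials of one configuration at a time:

```shell
# Average the victory score (the last field) over a batch of result lines.
printf '%s\n' \
  'Bootcamp1.scn: 12 6 0 2 0 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 1 3 0 0 0 0 -3' \
  'Bootcamp1.scn: 12 6 0 2 3 4 6 70 90 2 4 50 12 6 0 2 0 4 6 70 90 2 4 30 5 15 0 0 0 0 -15' |
awk '{ sum += $NF; n++ } END { printf "mean score: %.1f over %d trials\n", sum/n, n }'
```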

    I have a script, airun, to run my A/I test trials:



    #!/usr/bin/bash

    set +x # no trace by default

    DEBUG=0

    EFDIR="/cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/East Front"
    WFDIR="/cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/West Front"
    RSDIR="/cygdrive/c/Games/Matrix Games/John Tiller's Campaign Series/Rising Sun"
    MEDIR="/cygdrive/c/Games/Matrix Games/Modern Wars/Middle East"
    VNDIR="/cygdrive/c/Games/Matrix Games/Modern Wars/Vietnam"

    # A/I parameter sets: AIP1 has the NoOrder move trigger at 0, AIP2 at 3
    AIP1="12 6 0 2 0 4 6 70 90 2 4"
    AIP2="12 6 0 2 3 4 6 70 90 2 4"

    TRIALS=10 # default

    while [ $# -gt 0 ]; do
        if [ "$1" = "+G" ]; then
            DEBUG=1
            set -x
            shift
        elif [ "$1" = "-g" ]; then
            shift
            GAME=$1
            shift
        elif [ "$1" = "-t" ]; then
            shift
            TRIALS=$1
            shift
        elif [ "$1" = "-d" ]; then
            shift
            DATE=$1
            shift
        else
            break
        fi
    done

    if [ "x$GAME" = "x" ]; then
        echo "Usage: airun [+G] -g <GAME> [-t <TRIALS>] [-d <DATE>]"
        exit 1
    fi

    if [ "x$DATE" = "x" ]; then
        DATE=`date '+%Y%m%d'`
    fi

    if [ "$GAME" = "ef" ]; then
        EXE=ef.exe
        SCN=Farm79.scn
        AGRA=80
        AGRB=20
        #SCN=Butyrki.scn
        #AGRA=80
        #AGRB=90
        #SCN=Tutorial.scn
        #AGRA=100
        #AGRB=100
        cd "$EFDIR"
    elif [ "$GAME" = "wf" ]; then
        EXE=wf.exe
        SCN=Omaha_East.scn
        AGRA=100
        AGRB=50
        #SCN=Gabr_es_Siaghi.scn
        #AGRA=60
        #AGRB=100
        cd "$WFDIR"
    ...
    fi

    declare -i T

    # Run TRIALS games for each of the four pairings of AIP1/AIP2.
    # (With declare -i, T=$T-1 is evaluated arithmetically.)
    T=$TRIALS
    while [ $T -gt 0 ]; do
        ./$EXE -W -T -A $AIP1 $AGRA $AIP1 $AGRB "$SCN"
        T=$T-1
    done

    T=$TRIALS
    while [ $T -gt 0 ]; do
        ./$EXE -W -T -A $AIP2 $AGRA $AIP1 $AGRB "$SCN"
        T=$T-1
    done

    T=$TRIALS
    while [ $T -gt 0 ]; do
        ./$EXE -W -T -A $AIP1 $AGRA $AIP2 $AGRB "$SCN"
        T=$T-1
    done

    T=$TRIALS
    while [ $T -gt 0 ]; do
        ./$EXE -W -T -A $AIP2 $AGRA $AIP2 $AGRB "$SCN"
        T=$T-1
    done

    exit 0



    This is definitely a work-in-progress. Indeed, this whole testing methodology is a work-in-progress. (And let's not forget: In future, I intend to add more and more user-configurable A/I parameters into the mix.)

    Using the above airun script, and just testing the 0 vs. 3 NoOrder move trigger change, I have run over 320 A/I test trial games across eight different scenarios (of various kinds) from four different games (EF, WF, RS & VN). Here are the results from the Rising Sun Asun.scn tests:



    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 48 149 57 177 1 0 -28
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 32 100 53 152 1 0 -52
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 31 95 49 146 1 0 -51
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 40 120 48 145 1 0 -25
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 37 103 37 110 1 0 -7
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 41 131 59 169 1 0 -38
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 38 98 60 180 1 0 -82
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 31 101 63 181 1 0 -80
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 33 93 49 135 1 0 -42
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 37 118 45 137 1 0 -19
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 36 93 65 208 1 0 -115
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 25 78 56 167 1 0 -89
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 40 102 55 161 1 0 -59
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 33 105 54 167 1 0 -62
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 31 89 43 127 1 0 -38
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 38 117 58 163 1 0 -46
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 54 164 41 117 1 0 47
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 27 87 58 169 1 0 -82
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 40 127 45 139 1 0 -12
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 41 130 53 165 1 0 -35
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 41 126 57 170 1 0 -44
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 44 126 41 117 1 0 9
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 43 125 57 164 1 0 -39
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 46 145 46 141 1 50 54
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 45 142 57 183 1 0 -41
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 41 117 39 110 1 0 7
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 50 150 40 113 1 0 37
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 40 105 53 161 1 0 -56
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 48 131 57 173 1 0 -42
    Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 40 120 55 172 1 0 -52
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 26 78 55 169 1 0 -91
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 31 96 43 121 1 0 -25
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 36 84 40 116 1 0 -32
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 39 114 56 172 1 0 -58
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 34 94 52 149 1 0 -55
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 40 125 63 197 1 0 -72
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 42 128 40 124 1 0 4
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 44 137 50 148 1 0 -11
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 48 131 58 177 1 0 -46
    Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 3 4 6 70 90 2 4 85 41 138 63 190 1 0 -52



    Whoa! That's a lot of data, even just for one test scenario.

    I have another (work-in-progress) script, airpt.pl (not shown here), to process the airun data output. For now, the script just reports average scores, but in future I might extend it to report other averages (losses, points), other game metrics, and perhaps also standard deviations, etc. There's no end to the sophisticated statistical analysis I might apply. (I might prepare some pretty graphs, too.)
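
airpt.pl itself isn't shown, but its core can be sketched in a few lines of awk: group the result lines by their parameter fields (everything except the trailing seven result fields, per the field layout described earlier) and average the score per group. The sample lines below are the first two trials of two of the Asun.scn batches above:

```shell
# Hypothetical stand-in for the core of airpt.pl: per-configuration mean score.
cat > scores.log <<'EOF'
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 48 149 57 177 1 0 -28
Asun.scn: 12 6 0 2 0 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 32 100 53 152 1 0 -52
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 36 93 65 208 1 0 -115
Asun.scn: 12 6 0 2 3 4 6 70 90 2 4 75 12 6 0 2 0 4 6 70 90 2 4 85 25 78 56 167 1 0 -89
EOF
awk '{
    key = ""
    for (i = 1; i <= NF - 7; i++) key = key $i " "   # config = all but the 7 result fields
    sum[key] += $NF; n[key]++
}
END { for (k in sum) printf "%sAI: %.0f (n=%d)\n", k, sum[k]/n[k], n[k] }' scores.log
```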

    Here are the airpt.pl results for the Asun.scn tests:



    RS, Asun.scn: side 1, Side B/Axis, first side

    Side A 12 6 0 2 0 4 6 70 90 2 4 75 vs
    Side B 12 6 0 2 0 4 6 70 90 2 4 85 AI: -42

    Side A 12 6 0 2 0 4 6 70 90 2 4 75 vs
    Side B 12 6 0 2 3 4 6 70 90 2 4 85 AI: -16 -- 3 is very much better

    ----

    Side A 12 6 0 2 3 4 6 70 90 2 4 75 vs
    Side B 12 6 0 2 0 4 6 70 90 2 4 85 AI: -49

    Side A 12 6 0 2 3 4 6 70 90 2 4 75 vs
    Side B 12 6 0 2 3 4 6 70 90 2 4 85 AI: -43 -- 3 is slightly better



    For the Rising Sun Asun.scn (river crossing) scenario, 3 is a "better" NoOrder move trigger than 0. What about the other 280 test results across the other seven tested scenarios (across four games)? In general, they seem to vindicate the CS 1.04 NoOrder move trigger change from 0 to 3. Not in all cases, not in all scenario types, not always for the attacker or the defender, ... -- I cannot say that 3 is always "better" than 0. But on balance, based on my results so far, 3 is probably better, certainly no worse, than 0. So, the CS 1.04 change of NoOrder move trigger from 0 to 3? For now, we keep it.

    Question: Is 3 the "optimal" NoOrder move trigger? What about 2? 4? And so on.

    More questions: What about the other A/I parameters? Are they "optimal"?

    And still many, many more questions, and angles to look at this problem.

    The $64,000 Question: What is just the right mix of data parameters, the "Secret Sauce", giving the all-around "best", "optimal" A/I?

    Previously, I had said of this effort to improve the game's A/I:

    quote:

    It's a journey, not a destination.

    Fortunately, I have three CS test systems to carry me forward on my journey:

  • Windows XP (which I will devote to round-the-clock, non-stop test trials)
  • Windows 7 (my main developer system)
  • Windows 8 (test trials, and other tests)

    with another system on the way (my daughter's older machine, abandoned when she recently left for a teaching position in South Africa):

  • Windows XP (which I will also devote to round-the-clock, non-stop test trials)

    Four Windows test/development systems in total!

    When I get these test trials going, it's kind of amusing to be surrounded by three (and soon four) computers each playing CS scenarios non-stop, one after another.

    On my "faster" Windows 7 systems, I can typically run through ~40 test trial games of a Complexity Level 3 or 4 scenario in a day or so. On my slower Windows XP system(s), it takes a while longer.

    The whole testing process is slowed down by graphics overhead: For now, even though I launch the test games via the airun script (or directly from the Cygwin command line), the games still run as Windows games normally would -- i.e., in a window, with full graphical display, units moving about, explosions, etc. A longer-term goal is to add yet another no-graphics game play mode and command-line switch. If I can run these test games more quickly sans graphics, I'll really be able to crank them out.

    Again (it goes without saying, the sound of the endlessly playing broken record): This is all very much a work-in-progress, a developing effort. I anticipate many modifications in the months ahead. (And foresee running many thousands of test trials.)

    But this whole automated testing methodology: It opens up worlds of possibilities. Not just for improving the A/I, but for QA (Quality Assurance) and other cool stuff.

    There's the game, then there's also the meta game -- coding, testing, data analyzing, ...

    Geeks Just Gotta Have Fuh-un.



    (Am I weird or what?)

    Until the next time ...

    < Message edited by berto -- 10/18/2013 6:55:44 PM >


    _____________________________

    Post #: 1
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 10/15/2013 2:37:24 AM   
    junk2drive


    Posts: 12907
    Joined: 6/27/2002
    From: Arizona West Coast
    Status: offline
    Thanks

    (in reply to berto)
    Post #: 2
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 10/15/2013 6:56:37 AM   
    XLVIIIPzKorp


    Posts: 216
    Joined: 10/24/2006
    Status: offline
    Wow, I may even play the A/I again some day. Sounds good.

    (in reply to berto)
    Post #: 3
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 10/15/2013 2:18:47 PM   
    Crossroads


    Posts: 15076
    Joined: 7/5/2009
    Status: offline

    quote:

    ORIGINAL: berto

    On my "faster" Windows 7 systems, I can typically run through ~40 test trial games of a Complexity Level 3 or 4 scenario in a day or so.


    Oh no! This starts to sound all War Games to me...

    _____________________________

    Visit us at: Campaign Series Legion
    ---
    CS: Vietnam | CS: East Front 1939-1941 IN-THE-WORKS
    CS: Middle East 1948-1985 Fully reimaged v2.0 available now!

    (in reply to berto)
    Post #: 4
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 10/15/2013 8:42:32 PM   
    wings7


    Posts: 4608
    Joined: 8/11/2003
    From: Phoenix, Arizona
    Status: offline
    An important issue that is being addressed!! Thanks!

    Patrick

    (in reply to Crossroads)
    Post #: 5
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 10/29/2013 2:46:36 PM   
    pzgndr

     

    Posts: 2587
    Joined: 3/18/2004
    From: Maryland
    Status: offline
    quote:

    The $64,000 Question: What is just the right mix of data parameters, the "Secret Sauce", giving the all-around "best", "optimal" A/I?


    Good stuff. For "optimal" AI, are these parameters something we would adjust in the overall game defaults someplace or something to adjust for each individual scenario?

    (in reply to wings7)
    Post #: 6
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 10/29/2013 6:35:29 PM   
    berto


    Posts: 18174
    Joined: 3/13/2002
    From: metro Chicago, Illinois, USA
    Status: online

    quote:

    ORIGINAL: pzgndr

    quote:

    The $64,000 Question: What is just the right mix of data parameters, the "Secret Sauce", giving the all-around "best", "optimal" A/I?

    Good stuff. For "optimal" AI, are these parameters something we would adjust in the overall game defaults someplace or something to adjust for each individual scenario?

    As described here:

  • With fine granularity, pre-game, via the ai.ini file.
  • More coarsely, in-game, via the Aggressiveness, Audacious A/I, and Cautious A/I menu options.

    Not specified on a per-scenario basis, directly in the .scn files, if that's what you mean. But a future possibility?

    _____________________________


    (in reply to pzgndr)
    Post #: 7
    RE: Coder Diary #11 -- Automated, Empirical A/I Testing - 11/3/2013 12:19:35 AM   
    junk2drive


    Posts: 12907
    Joined: 6/27/2002
    From: Arizona West Coast
    Status: offline
    bump 11

    (in reply to berto)
    Post #: 8