Battleship Game as Integrity Illustration

For a number of years now, every time I play Battleship with my son, I've thought about how many examples the game offers of Integrity as it relates to data quality. As is the objective of this blog, we'll use fun everyday topics to learn about data quality. We'll take a very common tabletop game and explain how it illustrates the importance of connectedness and, for that matter, how you can ensure an argument-free game between your kids.

In business we often say that ‘we are only as good as the decisions that we make’, and poor information leads to poor decision making. When playing Battleship, it's pretty common to misplace a peg or to have one fall out when you accidentally jiggle the game board. This can lead to incomplete information and, in the worst-case scenario, to someone playing the same move again and wasting a turn.

To begin with, I decided to record each of the moves my son and I make when we play Battleship. It's a simple, narrow table with the x and y position, followed by the status: hit or miss. The illustration below shows only the first 4 rounds.

Player 1 (Dan)              Player 2 (Son)
Position    Hit or Miss     Position    Hit or Miss
B4          Hit             E6          Miss
B5          Hit             B9          Miss
B6          Hit             C3          Hit
B3          Hit             C4          Hit

The natural outcome of charting your own moves and your opponent’s moves on the same line is that you ensure a level of integrity called Cardinality. If we use the grid to record our gameplay, it's easy to see who has played and who hasn't for a given turn (row). This relationship of two plays per turn, each with a location and a hit status, defines the Cardinality. It prevents the shift in the data that happens if one person plays twice. This is just one of the aspects of Integrity. Let's look at the others, too.
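As a rough sketch of that rule, assume the moves are logged as (turn, player, position, result) tuples. That layout is an assumption made here for illustration; the chart above keeps both players on one row instead. A check that every turn holds exactly one move per player might then look like this:

```python
from collections import Counter

# Hypothetical move log: (turn, player, position, result).
moves = [
    (1, "Dan", "B4", "Hit"), (1, "Son", "E6", "Miss"),
    (2, "Dan", "B5", "Hit"), (2, "Son", "B9", "Miss"),
    (3, "Dan", "B6", "Hit"), (3, "Son", "C3", "Hit"),
    (4, "Dan", "B3", "Hit"), (4, "Son", "C4", "Hit"),
]

def cardinality_violations(moves, players=("Dan", "Son")):
    """Return (turn, player, count) wherever a player does not have exactly one move in a turn."""
    counts = Counter((turn, player) for turn, player, _, _ in moves)
    turns = sorted({turn for turn, _, _, _ in moves})
    return [
        (turn, player, counts[(turn, player)])
        for turn in turns
        for player in players
        if counts[(turn, player)] != 1  # 0 = skipped turn, 2+ = played twice
    ]

print(cardinality_violations(moves))  # [] for the four rounds above
```

A count of 0 points to a skipped turn and a count of 2 or more to someone playing twice, which is exactly the shift in the data described above.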

The Integrity Dimension includes the following Underlying Concepts:

Referential Integrity - Referential integrity measures whether a value used as a foreign key references an existing key (primary key) in the parent table.
Uniqueness - Uniqueness measures whether each fact is uniquely represented.
Cardinality - Cardinality describes the relationship between one data set and another, such as one-to-one, one-to-many, or many-to-many.

Another unique component of Battleship is the fact that you and your opponent each store information about previous hits and misses on the other's battlefield. If you've never played Battleship, it might help to know that you have a grid of positions, lying flat on the table, on which you are allowed to place your ships. An additional grid of the same size stands vertically in front of you, between you and your opponent, sort of like a radar screen in a real battleship. On the vertical screen you place pegs for the locations where you hit or missed, relative to where you think your opponent is located.

On the horizontal grid that lies flat on the table you place pegs for your opponent's attempts to hit your ships. If you think about it, your horizontal grid and your opponent's vertical grid should be mirror images with respect to hits and misses. Note the Orange and Green arrows showing the mirror image on the opponent's board (below). The only exception is that your horizontal grid also has the location of your ships. See the pictures below of the game after we finished.

Battleship Gameboards

From a data perspective this is duplicate data, and in data architecture we avoid storing data twice because we run the risk of confusion or disagreement when the two sources fall out of sync. From a parent’s perspective, we really don't like this data to fall out of sync because it causes arguments between siblings. From a business perspective we avoid this situation for a similar reason: when the data isn't in agreement, departments often don't act that much differently from siblings.
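One way to catch the two copies drifting apart is to compare them directly. The sketch below assumes each grid has been captured as a dictionary from position to result; the function name, variable names, and the missing C4 peg are made up for illustration.

```python
def board_mismatches(vertical_grid, horizontal_grid):
    """Compare the shooter's vertical grid with the defender's horizontal grid.

    Both arguments map a position such as "C3" to "Hit" or "Miss".
    Returns {position: (shooter_value, defender_value)} where the two disagree;
    None marks a peg that is missing on that side.
    """
    positions = set(vertical_grid) | set(horizontal_grid)
    return {
        pos: (vertical_grid.get(pos), horizontal_grid.get(pos))
        for pos in sorted(positions)
        if vertical_grid.get(pos) != horizontal_grid.get(pos)
    }

# Son's record of his own shots versus Dan's record of those same shots.
son_vertical = {"E6": "Miss", "B9": "Miss", "C3": "Hit"}  # pretend the C4 peg fell out
dan_horizontal = {"E6": "Miss", "B9": "Miss", "C3": "Hit", "C4": "Hit"}

print(board_mismatches(son_vertical, dan_horizontal))
# {'C4': (None, 'Hit')}
```

The check only says that the copies disagree, not which one is right. Deciding that still takes a trusted source of record, or a conversation.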

I have to admit this doesn't happen as often anymore because my son is 12 now. When he accidentally bumped his game board, and the pegs fell out, he just asked me which pegs were recorded for such and such locations and then corrected his board. In theory, I had that same information on my board, but because I had it written on paper, as well, he trusted me more. This was additionally handy because in a sense I had a table of all the moves so I could scan through them looking to see which columns had been populated or not. For instance, if he asked which pegs were in the “D” column, I could tell him which moves were completed relating to “D”.

The definition of the CDDQ Integrity dimension is, "Integrity measures the structural or relational quality of data sets." Here we can see that the focus is on the concept of connection between things. Often this is because of the way that we store data in relational tables. You have to join multiple tables to understand the full context of a transaction or concept that is stored within the database. These concepts apply in other data structures as well, such as object orientation with persistence based in documents, which require relationships between pieces of information within a single document.

So far, we have discussed the way that the structure of the data storage for our Battleship game ensures Cardinality. In addition to row-level integrity, the first stored column holds the X and Y values of game-play, which are actually a composite key. You need both the vertical and horizontal parts to identify the location. This tightly coupled information is the sign that Integrity is at play. This is a welcome behavior because it enables a natural method of checking that everything is in order. Both of these values must be Not Null. Additionally, there is a Uniqueness requirement in this game. You should never take a turn shooting at a location that you've previously fired upon. So in the list of shots made by one player, there should only be unique combinations in the game-play table (e.g. you should never see B3 in row 4 for Dan and then again in row 25). Look for these in your data and develop data quality rules around them!
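Both rules are easy to express in code. Here is a minimal sketch, assuming each shot is logged as a (player, position, result) tuple in the order it was played; the function names and the two bad rows at the end of the sample are invented for illustration.

```python
from collections import defaultdict

def repeated_shots(moves):
    """Uniqueness: return {(player, position): [row numbers]} for positions fired on more than once."""
    seen = defaultdict(list)
    for row, (player, position, _result) in enumerate(moves, start=1):
        seen[(player, position)].append(row)
    return {key: rows for key, rows in seen.items() if len(rows) > 1}

def incomplete_positions(moves):
    """Not Null: return row numbers where the position lacks its letter or its number."""
    bad = []
    for row, (_player, position, _result) in enumerate(moves, start=1):
        text = position or ""
        letter, number = text[:1], text[1:]
        if not letter.isalpha() or not number.isdigit():
            bad.append(row)
    return bad

moves = [
    ("Dan", "B4", "Hit"),
    ("Dan", "B5", "Hit"),
    ("Dan", "B6", "Hit"),
    ("Dan", "B3", "Hit"),
    ("Dan", "B3", "Hit"),   # invented repeat of row 4
    ("Dan", "7", "Miss"),   # invented row with the letter missing
]

print(repeated_shots(moves))        # {('Dan', 'B3'): [4, 5]}
print(incomplete_positions(moves))  # [6]
```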

Another interesting point is that the “X” and “Y” values fall within a domain, a specific allowable range from “A” to “J” and 1 to 10. And the values of results are either “hit” or “miss”. This can be measured with the Referential Integrity concept within Integrity, assuming that your data is modeled in a relational database where the valid values for these domains are stored in separate lookup tables. We often find that Validity comes hand-in-hand with Integrity. In this case, you can identify the DQ issue with either dimension. You can therefore apply different data quality rules accordingly: a.) Referential Integrity: checking that every XY value that occurs in the game-play table is also in the associated lookup table. Or b.) Validity: comparing the values against the enterprise master (maybe the same reference table or some global reference data master).
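A small sketch of the Referential Integrity version of this rule follows. The three sets stand in for lookup tables, and the sample rows with K2, C11, and "Sunk" are invented to show what a failure looks like.

```python
# Stand-ins for lookup tables (assumed for illustration): the allowed values per domain.
VALID_LETTERS = set("ABCDEFGHIJ")
VALID_NUMBERS = {str(n) for n in range(1, 11)}
VALID_RESULTS = {"Hit", "Miss"}

def orphaned_rows(moves):
    """Return (row number, position, result) for rows whose values are missing from the lookups."""
    bad = []
    for row, (position, result) in enumerate(moves, start=1):
        letter, number = position[:1], position[1:]
        if (
            letter not in VALID_LETTERS
            or number not in VALID_NUMBERS
            or result not in VALID_RESULTS
        ):
            bad.append((row, position, result))
    return bad

moves = [("B4", "Hit"), ("K2", "Miss"), ("C11", "Hit"), ("J10", "Sunk")]
print(orphaned_rows(moves))
# [(2, 'K2', 'Miss'), (3, 'C11', 'Hit'), (4, 'J10', 'Sunk')]
```

In a relational database the same idea would usually be enforced with foreign keys to the lookup tables, or measured with a join that looks for game-play rows that have no match.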

Until now, we've mostly talked about situations that have a right or wrong answer. Often, data quality is more about situations that are unlikely to occur and therefore need to be flagged and reviewed by a person to determine whether they are correct. Relating back to the Battleship game, I can tell you that my strategy is typically to spread out my ships on the game board, so it would be uncommon to have too many hits in a row. For example, the largest ship in our Battleship set is five pegs long (the aircraft carrier), so it would be highly unlikely to find a line of hits greater than or equal to 7 pegs.

In other words, the data quality rule translation for this would evaluate whether there were any seven pegs in a row that were classified as hits. This would trigger an alert for a data quality analyst to review.
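Here is one possible sketch of that rule. It assumes the hits for one board are held as a set of positions and that "in a row" means a straight line along a single letter or a single number; the function names and the default threshold parameter are illustrative choices.

```python
LETTERS = "ABCDEFGHIJ"   # the "A" to "J" domain
NUMBERS = range(1, 11)   # the 1 to 10 domain

def longest_hit_run(hits):
    """Return the length of the longest straight line of adjacent hits on the board."""
    lines = [[f"{c}{n}" for n in NUMBERS] for c in LETTERS]    # same letter, numbers 1..10
    lines += [[f"{c}{n}" for c in LETTERS] for n in NUMBERS]   # same number, letters A..J
    longest = 0
    for line in lines:
        run = 0
        for position in line:
            run = run + 1 if position in hits else 0  # a non-hit breaks the run
            longest = max(longest, run)
    return longest

def needs_review(hits, threshold=7):
    """Flag the board for an analyst when a run of hits reaches the threshold."""
    return longest_hit_run(hits) >= threshold

hits = {"B3", "B4", "B5", "B6"}  # Dan's four hits from the table above
print(longest_hit_run(hits), needs_review(hits))  # 4 False
```

The rule only raises a flag. Whether a long run is an error or just tightly packed ships is still a question for the person reviewing it.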

Now, just because I spread out my ships doesn't necessarily mean that all players follow this strategy. When my son was younger, he would often cram all the ships into one corner of the game board. Inevitably, when you do this, a miss on one ship is actually a hit on another ship. This can complicate execution of the rule we just discussed. In other words, if you have a carrier (5) next to a battleship (3) which is next to an airplane (3), it might be quite valid to have seven hits or more in a row.

This starts to add complexity because you have to account for entities, aka battleship types. In order to validate the length of a ship you have to identify what type of ship it is. I have to say at this point that we play “Gentlemen's Battleship”: when my son asks for confirmation that my ship was sunk, I answer with the truth. So he knows, based on the number of hits and the declaration of sinking, which ship it was. Others may not play this way.

You can see examples of this at J6 and G5 on my board. There are no pegs there because he took for granted that I was telling the truth about the ship being sunk and didn't waste another shot to validate that. Also, I think he knows that I generally don't cram my ships together, otherwise it might behoove him to take one more turn and shoot at those coordinates.

What is the translation for the data quality world? Well, assumptions are typically our worst enemy. Luckily with modern-day data profiling tools it doesn't hurt to extend our analysis and validate our assumption. If you think X is not likely, do a query and see, or go ahead and write the rule to make sure you’ve covered that scenario.

Now back to entity analysis. When we talk about integrity, it typically refers to structural comparison where there's a hard and fast answer, like each turn must include an X and Y followed by a hit or miss. The data quality rules that are much more complex typically require generating entities, as we called them, and then searching for copies of these that happen more often than expected. These are logical as opposed to physical in nature, such as a situation or event in the data.

This is the area where advanced AI techniques will become crucial to scaling up the number of data quality rules that organizations are able to create. In other words, we won't just have basic business rules codified but intricate event-based validations that identify unique circumstances, search for them and alert staff when they occur. The challenge is that we need human-in-the-middle validations of computer-generated circumstances or events to ensure they are realistic. More on this in a future blog post.

I think we'll wrap things up here, but in a separate blog post in the near future I'll discuss why most companies don't get to this advanced level of data quality analysis and resolution. Hint: it's because of people, not technology. It's because we either hire a mathematician to build simple data quality rules (too costly and demoralizing) or we hire a data governance focused individual and expect them to understand machine learning, statistical programming or software development. If you have a preferred topic for the blog, send it my way.