Simpson’s Paradox

Statistics can be weird. Just when you’ve done the game show paradox, and the birthday paradox, there’s this. I think people in general need to realize that we as humans are just not that good at intuitively knowing probability.

From John Rice’s Statistics Textbook:

A black urn contains 5 red and 6 green balls, and a white urn contains 3 red and 4 green balls. You are allowed to choose an urn and then choose a ball at random from the urn. If you choose a red ball, you get a prize. Which urn should you choose to draw from? If you draw from the black urn, the probability of choosing a red ball is 5/11 (the number of ways you can draw a red ball divided by the total number of outcomes). If you choose to draw from the white urn, the probability of choosing a red ball is 3/7, so you should choose to dra from the black urn.

Now consider another game in which a second black urn has 6 red and 3 green balls, and a second white urn has 9 red and 5 green balls. If you draw from the black urn, the probability of a red ball is 6/9, whereas if you choose to draw from the white urn, the probability is 9/14. Again, you should choose to draw from the black urn.

In the final game, the contents of the second black urn are added to the first black urn, and the contents of the second white urn are added to the first white urn. Again, you can choose which urn to draw from. Which should you choose? Intuition says choose the black urn, but let’s calculate the probabilities. The black urn now contains 11 red and 9 green, so the probability of drawing a red ball from it is 11/20 = .55. The white urn now contains 12 red and 9 green balls, so the probability of drawing a red ball from it is 12/21 = .571.  So, you should choose the white urn.

When you think about it, it actually makes sense. Because the number is greater in the second one it sort of evens out. Still, a little weird though.

Another common example is in batting averages. Here is an example wikipedia gives. In this example, David Justice has a better batting average than Mike Jeter for two years in a row, but his cumulative batting average is worse.

1995 1996 Combined
Derek Jeter 12/48 .250 183/582 .314 195/630 .310
David Justice 104/411 .253 45/140 .321 149/551 .270

Actually, the batting averages make it more intuitive for me.

modular exponentiation python program

This is a simple – not efficient – but doable way to do modular exponentiation

#!/usr/bin/python

def modexp(base, pow, mod):
  exponent = 1
  i = 0
  while i < pow:
    exponent = (exponent * base) % mod
    i += 1
  return exponent

if __name__ == "__main__":
  import sys
  print modexp(int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))

Example usage

$ modexponent.py 2 1234 789
481

For large numbers this method isn’t practical. The below program implements the improved python algorithms for modular exponents.

#!/usr/bin/python
#Let x be an integer written in binary, y and n be integers
#I'm sure this isn't the best, but this is only called once per program
def binconvert(n):
  barray = []
  if n < 0: 
    raise ValueError, "must be positive"
  if n == 0:
    return 0
  while n > 0:
    #barray = n%2 + barray[:]
    barray.append(n%2)
    n = n >> 1
  barray.reverse()
  return barray
#this is the intuitive but slow way
def modexp(base, pow, mod):
  exponent = 1
  i = 0
  while i < pow:
    exponent = (exponent * base) % mod  
    i += 1
  return exponent
#y**x mod n
def modexp1(y, x, n):
  #convert x to a binary list
  x = binconvert(x)
  
  s = [1]
  r = x[:]
  for k in range (0, len(x)):
    if x[k] == 1:
      r[k] = (s[k] * y) % n
    else:
      r[k] = s[k]
    s.append ((r[k]**2)%n)
  #print s
  #print r
  return r[-1]
#similar to modexp1 except uses the bits in reverse order
def modexp2(y, x, n):
  a = x
  b = 1
  c = y
  while a != 0:
    if a % 2 == 0:
      a = a/2
      c = (c**2) % n
    else:
      a = a -1
      b = (b * c) % n
  return b
if __name__ == "__main__":
import sys
  print modexp(int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))
  print modexp1(int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))
  print modexp2(int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3]))

Using the above program, some interesting things are easy to demonstrate. For example, here is a test that shows for 100,000 tries, the 2 algorithms get the correct answer.

import prob23
import random
import sys

BASEMAX = 60000
EXPMAX = 60000
MODMAX = 100000
ITERATIONS = 100000

print "beginning program"
print "This will take awhile... you need", ITERATIONS, "of these dots to complete."
p
for i in range (0, ITERATIONS):
  if i % 10 == 0:
    sys.stdout.write(".")
    sys.stdout.flush()
  r_base = random.randint(1, BASEMAX)
  r_exp = random.randint(1, EXPMAX)
  r_mod = random.randint(max(r_base, r_exp), MODMAX)
  trueval = prob23.modexp(r_base, r_exp, r_mod)
  testval1 = prob23.modexp1(r_base, r_exp, r_mod)
  testval2 = prob23.modexp2(r_base, r_exp, r_mod)
  if trueval != testval1 or trueval != testval2:
    print "One of these exponent things doesn't work"
    #sys.exit(1)
print "Done, all of these answers were the same."

Running this program several times with various parameters verifies the correctness of the algorithm (like how an Engineer verifies things).

Also, it’s interesting to time these algorithms when compared to the brute force method (multiplying every number). All three algorithms were timed to find a solution. Below is some example code that accomplishes this (this particular code is for the brute force version, but it is almost identical to the other algorithms – the only change is the modexponent call).

import modexponent

for i in range(1, 65538):
  thistry = modexponent.modexp(3, i, 65537)
  if thistry == 2:
    print i
    break
  if i % 15000 == 0:
    print "trying", i

Running all three algorithms, the results are shown below.

me@Opteron:~/math/chp3$ time python bruteforce.py
55296
real 11m28.131s
user 11m18.710s
sys 0m1.170s
me@Opteron:~/math/chp3$ time python algorithm1.py
55296
real 0m3.285s
user 0m3.290s
sys 0m0.000s
me@Opteron:~/math/chp3$ time python algorithm2.py
55296
real 0m2.621s
user 0m2.610s
sys 0m0.010s

This is a huge speedup – from 11 minutes in the case of brute forcing to under 4 seconds for the two algorithms.

Golay G24

This is a dirty implemenation of Golay correcting code using python.

This is a solution to 18.13 problem 1 from Trappe and Washington’s Crytography book. To run this, you need bash, python, and the numpy libraries. To run, run golay.sh.  The algorithm is located in golay.py

golay.py

#!/usr/bin/env python
 
from numpy import *
import sys
 
geng24 = array([[1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,1,1,0,0,0,1,0],
                [0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,0,0,0,1],
                [0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,1,1,1,0,0,0],
                [0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,1,0,1,1,1,0,0],
                [0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,1,0,1,1,1,0],
                [0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,1,1,0,1,1,1],
                [0,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,1,1,0,1,1],
                [0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,0,1,0,1,1,0,1],
                [0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,0,0,0,1,0,1,1,0],
                [0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,1,1,0,0,0,1,0,1,1],
                [0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,0,0,0,1,0,1],
                [0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1]])
 
B = array([[1,1,1,0,1,1,1,0,0,0,1,0],
           [1,0,1,1,0,1,1,1,0,0,0,1],
           [1,1,0,1,1,0,1,1,1,0,0,0],
           [1,0,1,0,1,1,0,1,1,1,0,0],
           [1,0,0,1,0,1,1,0,1,1,1,0],
           [1,0,0,0,1,0,1,1,0,1,1,1],
           [1,1,0,0,0,1,0,1,1,0,1,1],
           [1,1,1,0,0,0,1,0,1,1,0,1],
           [1,1,1,1,0,0,0,1,0,1,1,0],
           [1,0,1,1,1,0,0,0,1,0,1,1],
           [1,1,0,1,1,1,0,0,0,1,0,1],
           [0,1,1,1,1,1,1,1,1,1,1,1]])
 
errorvector = array([0,0,0,0,0,0,0,0,0,0,0,0])
 
def weight(obj):
  wt = 0
  for i in obj:
    wt += i
  return wt
 
geng24_transpose = geng24.transpose()
 
#r = array([1,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0])
#r = array([0,1,0,0,0,0,1,1,0,1,0,1,1,0,1,1,0,1,0,0,0,0,1,1])
#r = array([0,0,1,0,1,0,1,0,1,0,1,0,0,1,0,0,1,1,1,0,1,1,0,0])
#r = array([1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1])
 
r = []
for i in sys.argv[1].split(','):
  r.append(int(i))
r = array(r)
 
s = (dot(r,geng24_transpose) % 2 )
 
sdotb = (dot(s,B) % 2 )
 
#e = array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
e = zeros( (1,24), dtype=int8)
e = e.flat
 
if weight(s) <= 3:
  print "ERROR CORRECTION 3"
  e = array([s, array([0,0,0,0,0,0,0,0,0,0,0,0])])
  e.shape = (1,24)
  e = e.flat
 
if weight(sdotb) <= 3:
  print "ERROR CORRECTION 4"
  e_new = array([array([0,0,0,0,0,0,0,0,0,0,0,0]), sdotb])
  e_new.shape = (1,24)
  e = (e + e_new) % 2
  e = e.flat
 
for j in range(12,24):
  if weight((geng24[:,j] + s)%2) < 2:
    print "ERROR CORRECTION 5"
    e[j] = 1
    e_new = array([(geng24[:,j]+s)%2, array([0,0,0,0,0,0,0,0,0,0,0,0])])
    e_new.shape = (1,24)
    e = (e + e_new) % 2
    e = e.flat
 
for j in range(0,12):
  if weight((sdotb + B[j,:])%2) <= 2:
    print "ERROR CORRECTION 6"
    e[j] = 1
    e_new = array([array([0,0,0,0,0,0,0,0,0,0,0,0]),((sdotb + B[j,:])%2)])
    e_new.shape = (1, 24)
    e = (e + e_new) % 2
    e = e.flat
 
print "\ne =  [",
for i in e:
  print i,
print "]\n"
 
c = ((e + r)%2)
m = array(c[:12])
 
print "r = ", r
print "c = ", c
print "m = ", m

Here is golay.sh

#!/bin/bash
 
echo "Running with Example Problem"
echo "==========================="
./golay.py 1,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0
 
echo "
Running with Problem1"
echo "==========================="
./golay.py 0,1,0,0,0,0,1,1,0,1,0,1,1,0,1,1,0,1,0,0,0,0,1,1
 
echo "
Running with Problem2"
echo "==========================="
./golay.py 0,0,1,0,1,0,1,0,1,0,1,0,0,1,0,0,1,1,1,0,1,1,0,0
 
echo "
Running with Problem3"
echo "==========================="
./golay.py 1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1