Friday, January 26, 2018

An exercise in in-database AI: Are women underpaid?

TensorFlow is tightly integrated into the Deepgreen data warehouse, so our users can do true in-database AI sitting in front of a SQL console. As an exercise, we studied the official TensorFlow wide tutorial and the wide and deep tutorial.

The tutorial uses census data such as age, education, and hours per week to train a model that predicts whether a person's annual income is more than $50K. Playing with the data using simple SQL aggregates, we find that the number of years of education is extremely important, which is no surprise. How about gender?

Surprisingly, the original tutorial does not use gender in the model at all. There is a good reason: the originally learned model has an accuracy of about 86%, and if we add gender as a feature column, the model has an accuracy of about 85%. Adding gender as a feature does not improve accuracy.

The following simple SQL tells us that only about 10% of females made more than $50K/year, while 30% of males did.
  
select gender, income, count(*) from widedeep_train 
group by gender, income
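
The query returns one count per (gender, income) pair, so the percentages quoted above still need a division per gender. Here is a minimal sketch of that step in Python. It assumes the rows come back as (gender, income, count) tuples and that the high-income label is the string ">50K". Both are assumptions about how widedeep_train is encoded, not something this post confirms.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def share_above_50k(
    rows: Iterable[Tuple[str, str, int]],
    high_label: str = ">50K",  # assumed encoding of the income column
) -> Dict[str, float]:
    """Turn grouped (gender, income, count) rows into a per-gender share."""
    totals: Dict[str, int] = defaultdict(int)
    high: Dict[str, int] = defaultdict(int)
    for gender, income, count in rows:
        totals[gender] += count
        if income == high_label:
            high[gender] += count
    # Fraction of each gender whose income carries the high-income label.
    return {gender: high[gender] / totals[gender] for gender in totals}
```

Feeding it the four rows produced by the query gives one fraction per gender, which is what the 10% and 30% figures refer to.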

OK, unfair. But if you look at hours per week, males work 42 hrs/week on average while females work 36 hrs/week. Is it because females work fewer hours?

The real question people ask is: under otherwise the same conditions, are women underpaid? This question is difficult because "the same" is rather hard to define or test.

Let's first add gender as a feature column and train the model. As said before, accuracy is about 85%. If we use the model to predict only females,
 
+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|        314 |      4734 |         97 |       276 |   0.924184 |
+------------+-----------+------------+-----------+------------+

And only males, 
+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|       1256 |      6842 |        762 |      2000 |    0.81418 |
+------------+-----------+------------+-----------+------------+
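
The accuracy column in these tables is simply (trueneg + truepos) divided by the total of all four counts. A small Python sketch that recomputes it from the numbers shown above (the class name Confusion is only an illustrative helper, not part of the notebook):

```python
from typing import NamedTuple


class Confusion(NamedTuple):
    """The four counts, in the same order as the table columns."""
    falseneg: int
    trueneg: int
    falsepos: int
    truepos: int

    def accuracy(self) -> float:
        # Correct predictions over all predictions.
        return (self.trueneg + self.truepos) / sum(self)


females = Confusion(falseneg=314, trueneg=4734, falsepos=97, truepos=276)
males = Confusion(falseneg=1256, trueneg=6842, falsepos=762, truepos=2000)

print(f"females: {females.accuracy():.6f}")
print(f"males:   {males.accuracy():.6f}")
```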


So the model is much better at predicting females' income. What's more interesting: if we disguise females as males (this is the "otherwise same" condition), what will the model say?

+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|        280 |      4699 |        132 |       310 |   0.923999 |
+------------+-----------+------------+-----------+------------+

And males as females,
+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|       1442 |      7036 |        568 |      1814 |   0.814917 |
+------------+-----------+------------+-----------+------------+
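
The disguise experiment amounts to overwriting the gender column, leaving every other feature untouched, and counting how many rows the model now predicts as above $50K. A rough sketch of that idea in Python follows. The predict callable stands in for the trained model, and the column name "gender" and the value "Male" are assumed encodings. None of these names are taken from the actual notebook.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def disguise(rows: List[Row], column: str = "gender", as_value: str = "Male") -> List[Row]:
    """Return copies of the rows with one column overwritten; inputs are not modified."""
    return [{**row, column: as_value} for row in rows]


def count_predicted_positive(predict: Callable[[Row], bool], rows: List[Row]) -> int:
    """Count rows the model labels as above $50K (truepos + falsepos)."""
    return sum(1 for row in rows if predict(row))
```

Comparing count_predicted_positive on the original rows with the count on disguise(rows) isolates the effect of the gender feature alone, because all other columns are identical in both runs.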

The accuracy does not change much. But look carefully: if we disguise females as males, the model predicts (true positive + false positive) that 442 will make more than $50K, compared to 373 before. And if we disguise males as females, the number drops from 2762 to 2382.
Therefore, under otherwise the same conditions, the model believes a male has about a 16-19% better chance of reaching $50K/year than a female.
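
For readers who want to check the 16-19% range, it comes from the ratio of predicted positives (truepos + falsepos) in the four tables above. A quick arithmetic sketch in Python (the function name is just for illustration):

```python
def relative_increase(as_male: int, as_female: int) -> float:
    """How many more people the model predicts above $50K when labeled male."""
    return as_male / as_female - 1.0


# Actual females: 97 + 276 = 373 as female, 132 + 310 = 442 when disguised as male.
females = relative_increase(as_male=132 + 310, as_female=97 + 276)

# Actual males: 762 + 2000 = 2762 as male, 568 + 1814 = 2382 when disguised as female.
males = relative_increase(as_male=762 + 2000, as_female=568 + 1814)

print(f"female population: {females:.1%}")  # 18.5%
print(f"male population:   {males:.1%}")    # 16.0%
```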
AI is fun. The code to run the model is in the wide and deep notebook. Enjoy.

