Friday, January 26, 2018

An exercise in in-database AI: Are women underpaid?

TensorFlow is tightly integrated into the Deepgreen data warehouse, so our users can do true in-database AI from a SQL console.   As an exercise, we studied the official TensorFlow tutorials, wide and wide & deep.

The tutorial uses census data such as age, education, and hours per week to train a model and predict whether a person's annual income is more than $50K.   Playing with the data using simple SQL aggregates, we find that the number of years of education is extremely important -- not a surprise.   How about gender?

Surprisingly, in the original tutorial, gender is not even used by the model.   There is a good reason: the original model has an accuracy of about 86%, and if we add gender as a feature column, the accuracy is about 85%.   Adding gender as a feature does not improve accuracy.

The following simple SQL tells us that only about 10% of females made more than $50K/year, while about 30% of males did.
  
select gender, income, count(*) from widedeep_train
group by gender, income;
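
To get the percentages directly, a query along these lines works (assuming the income column holds the two label values '>50K' and '<=50K', as in the census data set):

-- fraction of each gender earning more than $50K, as a percentage
select gender,
       round(100.0 * avg(case when income = '>50K' then 1 else 0 end), 1)
           as pct_over_50k
from widedeep_train
group by gender;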

OK, unfair.   But if you look at hours per week, on average males work 42 hours/week while females work 36 hours/week.    Is it because females work fewer hours?

The real question people ask is: all else being equal, are women underpaid?   This question is difficult because "all else being equal" is rather hard to define, or to test.

Let's first add gender as a feature column and train the model.  As said before, accuracy is about 85%.  If we use the model to predict only females,
 
+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|        314 |      4734 |         97 |       276 |   0.924184 |
+------------+-----------+------------+-----------+------------+

And only males, 
+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|       1256 |      6842 |        762 |      2000 |    0.81418 |
+------------+-----------+------------+-----------+------------+


So the model is much better at predicting females' income.   What's more interesting: if we disguise females as male (this is the "all else being equal" part), what will the model say?

+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|        280 |      4699 |        132 |       310 |   0.923999 |
+------------+-----------+------------+-----------+------------+

And males as female, 
+------------+-----------+------------+-----------+------------+
|   falseneg |   trueneg |   falsepos |   truepos |   accuracy |
|------------+-----------+------------+-----------+------------|
|       1442 |      7036 |        568 |      1814 |   0.814917 |
+------------+-----------+------------+-----------+------------+

The accuracy does not change much.    But look carefully: if we disguise females as male, the model predicts that 442 (true positives + false positives) will make more than $50K, compared to 373 before.   And if we disguise males as female, the number drops from 2762 to 2382.
Therefore, all else being equal, the model believes a male has roughly a 16-19% better chance of reaching $50K/year than a female (442/373 ≈ 1.19, 2762/2382 ≈ 1.16).
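
For reference, the disguised test sets are trivial to build in SQL: copy the rows and override only the gender column, leaving everything else untouched. A minimal sketch -- the table name widedeep_female_as_male is hypothetical, and the prediction call itself depends on how the model is exposed in your Deepgreen setup:

-- take every female row unchanged, then flip only the gender column
create table widedeep_female_as_male as
    select * from widedeep_train where gender = 'Female';
update widedeep_female_as_male set gender = 'Male';
-- build the male-as-female table the same way, then run the model's
-- prediction over both tables and count (truepos + falsepos)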
AI is fun.   The code to run the model is in the wide and deep notebook.    Enjoy.


Thursday, May 21, 2015

Comparing two tables

Say you have carefully tuned your database and wow! it is ten times faster -- but before you push it to the production cluster, how do you know the query answers are still correct?   Or at least the same as before?   :-)    You may have the same question if you want to upgrade or migrate your database, or, for us, when we want to make sure the new great join algorithm actually produces correct answers.   Well, let's just put the results from before and after in two tables, and compare the two tables.

This is great, but it turns out to be quite tricky.   You need to take care of the cases where:
  • Either table may not have a primary key.
  • Rows in the two tables may be physically stored in different orders.
  • Columns could contain nulls.
  • The two tables could be huge, so performance must be good.
Here are some things you could do.

Method 1: You can dump the tables to text or csv files, then run diff -- but you need to take care of ordering, so you have to copy the tables out with an order by clause.    And good luck diff-ing a 2TB text file.
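
If you do go this route, the ordering can be forced in the COPY itself; a minimal sketch (server-side COPY, paths are just illustrative):

-- order by every column (ordinals 1, 2, 3, ... up to the column count)
copy (select * from ta order by 1, 2, 3) to '/tmp/ta.csv' with (format csv);
copy (select * from tb order by 1, 2, 3) to '/tmp/tb.csv' with (format csv);
-- then:  diff /tmp/ta.csv /tmp/tb.csv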

Method 2: Use except.  How about this query?  
select * from (
    select * from ta except select * from tb 
    union all 
    select * from tb except select * from ta
) foo; 

It is totally wrong!   (EXCEPT and UNION ALL have the same precedence and associate left to right, so this does not parse the way you might hope.)   See the attached sql script for some surprises.   And this?

select * from (
    select * from (select * from ta except select * from tb) xxx
    union all
    select * from (select * from tb except select * from ta) yyy
) foo;
Better, but it does not take care of duplicate rows -- EXCEPT is a set operation and removes duplicates -- and checking select count(*) separately is not good enough either.
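
A tiny demonstration of the duplicate-row problem (temp tables, just for illustration):

create temp table ta(x int);
create temp table tb(x int);
insert into ta values (1), (1);   -- the row appears twice in ta
insert into tb values (1);        -- but only once in tb

-- both EXCEPT branches come back empty, so the tables look "equal"
select * from (select * from ta except select * from tb) xxx
union all
select * from (select * from tb except select * from ta) yyy;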

Method 3: Join by ourselves!  Here is a script from my friend CK.

-- hash the text form of each row and count how many times it occurs;
-- a (hash, count) pair present on only one side is a difference
with
A as (
    select hashtext(textin(record_out(ta))) as h, count(*) as c
    from ta group by h
),
B as (
    select hashtext(textin(record_out(tb))) as h, count(*) as c
    from tb group by h
)
select * from A full outer join B on (A.h = B.h and A.c = B.c)
where A.h is null or B.h is null limit 5;

The hashtext part is optional -- for very wide rows, using hashtext will save some cycles and memory.   This is great except it has a full outer join.    Can we do better?   

Method 4: Group it ourselves,
-- rows from ta count +1, rows from tb count -1; any row whose counts
-- do not cancel out is a difference (duplicates included)
select r, sum(cnt) from (
    select textin(record_out(ta)) as r, 1 as cnt from ta
    union all
    select textin(record_out(tb)) as r, -1 as cnt from tb
) foo
group by r having sum(cnt) <> 0;

No joins!  
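
And if the rows are very wide, the hashtext trick from Method 3 combines naturally with this grouping; a sketch:

-- hash the text form of each row so the GROUP BY key stays small
select h, sum(cnt) from (
    select hashtext(textin(record_out(ta))) as h, 1 as cnt from ta
    union all
    select hashtext(textin(record_out(tb))) as h, -1 as cnt from tb
) foo
group by h having sum(cnt) <> 0;

This only tells you that a mismatch exists (and its hash); keep the textual form as well if you want to see the offending rows.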





Tuesday, February 3, 2015

Julia, Postgres on Mac

I took Julia and PostgreSQL for a test drive on my MacBook.    It worked like a charm.

First, the Juno IDE is quite nice.   Download, drag, drop, open -- it runs!    The connection to Postgres via libpq is not very usable at the moment, so I went the ODBC route.    I used MacPorts to install psqlODBC and unixODBC, and configured the DSN in ~/.odbc.ini:

Fengs-MBP:pg ftian$ cat ~/.odbc.ini 
[ftian]
Driver          = /opt/local/lib/psqlodbcw.so 
ServerName      = localhost
Port            = 5432
Username        = ftian
Database        = ftian

Tested with isql, the DSN works.   Nice.   Next, I tried ODBC in Julia: it cannot find any DSN.  Oops.   It turns out Julia's ODBC package needs some help locating libodbc.   The file to edit is hidden at

/Users/ftian/.julia/v0.3/ODBC/src/ODBC_Types.jl

After that, everything works -- time to play with some data.




So I loaded a CSV file (TPCH 1G, lineitem) in Julia; it took about 2 minutes.  I am quite impressed -- compare this to R!   TPCH data is actually '|'-separated, not comma-separated, but Julia still got the lineitem row count right.   My favorite query language is still SQL, though, so let's pipe the file through PostgreSQL using the wonderful file_fdw.
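
For anyone reproducing this, the extension, foreign server, and schema have to exist before the foreign table below can be created; a minimal sketch (names chosen to match the DDL that follows):

create extension if not exists file_fdw;
create server filefdw foreign data wrapper file_fdw;
create schema csvfdw;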


set search_path=csvfdw;
CREATE FOREIGN TABLE LINEITEM ( L_ORDERKEY    INTEGER NOT NULL,
                             L_PARTKEY     INTEGER NOT NULL,
                             L_SUPPKEY     INTEGER NOT NULL,
                             L_LINENUMBER  INTEGER NOT NULL,
                             L_QUANTITY    INTEGER /*DECIMAL(15,2)*/ NOT NULL,
                             L_EXTENDEDPRICE  MONEY/*DECIMAL(15,2)*/ NOT NULL,
                             L_DISCOUNT    DOUBLE PRECISION /*DECIMAL(15,2)*/ NOT NULL,
                             L_TAX         DOUBLE PRECISION /*DECIMAL(15,2)*/ NOT NULL,
                             L_RETURNFLAG  VARCHAR(1) /*CHAR(1)*/ NOT NULL,
                             L_LINESTATUS  VARCHAR(1) /*CHAR(1)*/ NOT NULL,
                             L_SHIPDATE    DATE NOT NULL,
                             L_COMMITDATE  DATE NOT NULL,
                             L_RECEIPTDATE DATE NOT NULL,
                             L_SHIPINSTRUCT VARCHAR(25) /*CHAR(25)*/ NOT NULL,
                             L_SHIPMODE     VARCHAR(10) /*CHAR(10)*/ NOT NULL,
                             L_COMMENT      VARCHAR(44) NOT NULL)
        server filefdw
        options ( filename '_PWD_/data/lineitem.tbl', 
                  format 'csv',
                  delimiter '|');


Now, let's ask: on average, how many items are in an order?    About 4.   PostgreSQL answered this query in about 10 seconds from the foreign table.    If we load the data into a regular table first, it takes about 5 seconds.

ftian=# select avg(lno) from ( select max(l_linenumber) as lno from csvfdw.lineitem group by l_orderkey) tmpt;
        avg         
--------------------
 4.0008100000000000
(1 row)

Time: 9912.140 ms
ftian=# select avg(lno) from ( select max(l_linenumber) as lno from tpch1.lineitem group by l_orderkey) tmpt;
        avg         
--------------------
 4.0008100000000000
(1 row)

Time: 5598.877 ms

Of course, we cannot help showing off our own code :-)   With vdb_jit enabled, we run the query from the CSV file faster than original PG does after loading the data!

ftian=# set vdb_jit = 1;
SET
Time: 0.140 ms
ftian=# select avg(lno) from ( select max(l_linenumber) as lno from csvfdw.lineitem group by l_orderkey) tmpt;
        avg         
--------------------
 4.0008100000000000
(1 row)

Time: 3755.514 ms

ftian=# select avg(lno) from ( select max(l_linenumber) as lno from tpch1.lineitem group by l_orderkey) tmpt;
        avg         
--------------------
 4.0008100000000000
(1 row)

Time: 1863.698 ms


Monday, November 10, 2014

TPCH on PostgreSQL (part 3)

Today is the biggest internet shopping day.  If you read Chinese, you may have noticed that Alibaba posts job openings for PostgreSQL DBAs from time to time.   It would be interesting to find out how much of the transactional and/or analytic workload inside Alibaba is handled by PostgreSQL.

Back to TPCH.  Q16 is particularly tough for us to optimize.  Again, it is better to look at a simpler example,

ftian=# create table t as select x, x % 10 as i, x % 100 as j, 'random string' || (x % 100) as s from generate_series(1, 1000000) x;
SELECT 1000000
Time: 784.034 ms


ftian=# select count(distinct i) from t group by s, j;
Time: 7991.994 ms

Grouping one million rows in 8 seconds is kind of slow.  So let's try,

ftian=# select count(i) from t group by s, j;
Time: 99.029 ms

So it must be count(distinct).  Well, we wasted quite some time chasing the distinct -- we should have profiled before optimizing.  The real reason is that count(distinct) triggers a sort aggregate instead of a hash aggregate.   Distinct needs extra resources in the aggregate function, and it is very hard to cap that consumption in a hash agg, so a sort agg is used -- which actually is a fine plan.

ftian=# explain select count(distinct i) from t group by s, j;
                               QUERY PLAN                               
------------------------------------------------------------------------
 GroupAggregate  (cost=117010.84..127110.84 rows=10000 width=23)
   ->  Sort  (cost=117010.84..119510.84 rows=1000000 width=23)
         Sort Key: s, j
         ->  Seq Scan on t  (cost=0.00..17353.00 rows=1000000 width=23)
(4 rows)

Is sort that slow?

ftian=# select count(distinct i) from t group by j, s;
Time: 1418.938 ms

5x faster.  What is going on?   Note that we switched the grouping order from group by s, j to group by j, s.    These two queries are equivalent, except for the order of the results -- and if you really care about ordering, you should have an order by clause anyway.

The true cost lies in string comparison with collation.   The database uses UTF-8 encoding, and sorting text uses string comparison with collation.   When grouping by (s, j), we always compare the two strings.  When grouping by (j, s), a string comparison is only executed when the j values are equal.
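
One way to test this hypothesis without reordering the grouping keys is to take collation out of the comparison; a sketch (timing left to the reader):

-- compare s byte-wise with the "C" collation instead of the utf-8 locale
select count(distinct i) from t group by s collate "C", j;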

Finally, to see how expensive comparison with collation is,

ftian=# select count(distinct i) from t group by decode(s, 'escape'), j;
Time: 2831.271 ms

Even though we go through the not-so-cheap decode hoop, it is still almost 3x faster.

So what can we do in this situation?
  • If your group by clause is group by t, i, consider rewriting it as group by i, t. Basically, put the cheap-to-compare data types (int, float, etc.) with many distinct values first.
  • Wait for a patch from Peter Geoghegan.  If the strings have random prefixes, the patch will speed up string comparison considerably.  But it won't help this example, because we have a common prefix -- and TPCH data has the same common-prefix problem.  On the other hand, these are synthetic data, and Peter's patch should help a lot in many real-life situations.
A final remark.  Can we get rid of the sort completely?  Yes, if there is only one distinct aggregate,

ftian=# explain select count(i) from (select i, j, s from t group by i, j, s) tmpt group by s, j;
                               QUERY PLAN                               
------------------------------------------------------------------------
 HashAggregate  (cost=27603.00..27703.00 rows=10000 width=23)
   ->  HashAggregate  (cost=24853.00..25853.00 rows=100000 width=23)
         ->  Seq Scan on t  (cost=0.00..17353.00 rows=1000000 width=23)
(3 rows)

ftian=# select count(i) from (select i, j, s from t group by i, j, s) tmpt group by s, j;
Time: 106.915 ms

This is how fast the query can be, even though we have two HashAggregate nodes.   I wish that one day the PostgreSQL planner could do this trick for me.

P.S. We did not rewrite Q16 in our benchmark after all.  It is an interesting case that we want to keep an eye on.


Tuesday, November 4, 2014

TPCH on PostgreSQL (Part 2)

CTE is a great feature.

We also had to rewrite Q20 extensively.   We haven't really tried our best -- we believe we could unroll the three subqueries into outer joins, but it is mind-boggling.   The WITH clause (Common Table Expression) comes to the rescue.   CTEs let one write simple, clear, easy-to-understand steps in SQL.  This is very much like writing a query in Pig [1] (or Quel, for old-timers), except with better syntax (and it is standard).

One thing to be aware of is that the WITH clause is an optimization barrier in PostgreSQL: the planner optimizes each CTE separately and will not consider global optimizations.  This is both good and bad.   One can use it to reduce the planner's search space and "force" a plan (for example, a join order).   But on the other hand, more often than not, the optimizer (or planner) is smarter than the programmer.

Let's look at the following simple example,

create table t as select i, i + 1 as j from generate_series(1, 10000000) i;
create index ti on t(i);

ftian=# explain select * from t where i = 10 and j = 100;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=927.50..48029.26 rows=250 width=8)
   Recheck Cond: (i = 10)
   Filter: (j = 100)
   ->  Bitmap Index Scan on ti  (cost=0.00..927.43 rows=50000 width=0)
         Index Cond: (i = 10)
(5 rows)

ftian=# explain with x as (select * from t where j = 100) select * from x where i = 10;
                          QUERY PLAN                          
--------------------------------------------------------------
 CTE Scan on x  (cost=169247.81..169247.83 rows=1 width=8)
   Filter: (i = 10)
   CTE x
     ->  Seq Scan on t  (cost=0.00..169247.81 rows=1 width=8)
           Filter: (j = 100)
(5 rows)

Note that in the second query, the planner is not able to utilize the index.

Closely related to the WITH clause are views.   One can create views one by one to achieve what WITH clauses do, but this is much more intrusive in the sense that it modifies the catalog.   On the other hand, a view is not an optimization barrier.  For example,

create view vt as select * from t where j = 100;
ftian=# explain select * from vt where i = 10;
                         QUERY PLAN                         
------------------------------------------------------------
 Index Scan using ti on t  (cost=0.43..8.46 rows=1 width=8)
   Index Cond: (i = 10)
   Filter: (j = 100)
(3 rows)

The index ti is used, no problem.   But can someone explain why this produces a different plan?  It uses an Index Scan instead of a Bitmap Index Scan.   Well, optimizers are really fun.



[1] http://pig.apache.org

Friday, October 17, 2014

Running TPCH on PostgreSQL (Part 1)

We have just released our 9.3.5.S for public beta testing.  Together with the product, we released a benchmark based on TPCH.   The modifications to data types are easy to understand -- money and double precision are faster than numeric (and no one on this planet has a bank account that overflows the money type, not any time soon).   The modifications to the queries are more interesting.

We modified the queries not because we want to show off how fast Vitesse DB is; without these modifications, some queries would never finish.    We have seen similar queries, and similar modifications required, in the field.   Overall, PostgreSQL is well capable of running the TPCH workload, as long as developers pay attention to some "tricks".

Now let's look at them,

Q2: The skeleton of Q2 looks like
select xxx from t1, t2, ... where foo = (select min(bar) from tx, ty, .. where tx.field = t1.field ...);

This is a correlated subquery (tx.field = t1.field) with an aggregate.   You can pull the subquery out as a join,

select xxx from t1, t2, ...,
          (select tx.field, min(bar) as min_bar from tx, ty ... group by tx.field) tmpt
where t1.field = tmpt.field and foo = min_bar ...

The performance of the join (in this case a hash join, very fast) is two orders of magnitude better than with the subplan.
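
As a toy illustration of the rewrite (hypothetical tables part and supply, not the actual Q2 text):

-- original shape: correlated subquery with an aggregate
select p.name
from part p, supply s
where s.part_id = p.id
  and s.cost = (select min(s2.cost) from supply s2 where s2.part_id = p.id);

-- rewritten: compute the aggregate once in a derived table and join to it
select p.name
from part p, supply s,
     (select part_id, min(cost) as min_cost from supply group by part_id) m
where s.part_id = p.id
  and m.part_id = p.id
  and s.cost = m.min_cost;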

Same trick is applied to Q17.

In the next post, I will examine Q20 which uses CTE (with clause).