Thursday, May 21, 2015

Comparing two tables

Say you have carefully tuned your database and wow! it is ten times faster -- but, before you push it to the production cluster, how do you know the answer of the query is correct?   Or at least, same as before?   :-)    You may have same question if you want to upgrade or migrate your database, or, for us, we want to make sure the new great join algorithm actually produce correct answer.   Well, let's just put the result before, and after, in two tables, and compare the two tables.

This is great, but turns out to be quite tricky.   You need to take care of the cases that,
  •  Either table may not have a primary key
  •  Rows in the two tables may be physically stored in different orders.
  • Columns could contain nulls. 
  • The two tables could be huge, so performance must be good.
Here are something you could do.   

Method 1: You can dump the tables to text or csv files, the run diff -- but you need to take care of ordering, so you have to copy the table out with order by clause.    And good luck diff-ing 2TB of text file.

Method 2: Use except.  How about this query?  
select * from (
    select * from ta except select * from tb 
    union all 
    select * from tb except select * from ta
) foo; 

It is totally wrong!   See attached sql script for some surprise.   And this?

select * from (
    select * from (select * from ta except select * from tb) xxx
    union all
    select * from (select * from tb except select * from ta) yyy
) foo;
Better, but it did not take care of duplicate rows -- and, checking select count(*) is not good enough.

Method 3: Join by ourselves!  Here is a script from my friend CK.

with 
A as ( 
    select hashtext(textin(record_out(ta))) as h, count(*) as c 
    from ta group by h
),
B as (
    select hashtext(textin(record_out(tb))) as h, count(*) as c 
    from tb group by h
)
select * from A full outer join B on (A.h + A.c= B.h + B.c)
where A.h is null or B.h is null limit 5;

The hashtext part is optional -- for very wide rows, using hashtext will save some cycles and memory.   This is great except it has a full outer join.    Can we do better?   

Method 4: Group it ourselves,
select r, sum(cnt) from (
    select textin(record_out(ta)) as r, 1 as cnt from ta 
    union all
    select textin(record_out(tb) as r, -1 as cnt from tb) foo
group by r having sum(cnt) <> 0;

No joins!  





No comments:

Post a Comment