This is great, but turns out to be quite tricky. You need to take care of the cases that,
- Either table may not have a primary key
- Rows in the two tables may be physically stored in different orders.
- Columns could contain nulls.
- The two tables could be huge, so performance must be good.
Here are something you could do.
Method 1: You can dump the tables to text or csv files, the run diff -- but you need to take care of ordering, so you have to copy the table out with order by clause. And good luck diff-ing 2TB of text file.
Method 2: Use except. How about this query?
select * from (
select * from ta except select * from tb
select * from tb except select * from ta
It is totally wrong! See attached sql script for some surprise. And this?
select * from (
select * from (select * from ta except select * from tb) xxx
select * from (select * from tb except select * from ta) yyy
Better, but it did not take care of duplicate rows -- and, checking select count(*) is not good enough.
Method 3: Join by ourselves! Here is a script from my friend CK.
A as (
select hashtext(textin(record_out(ta))) as h, count(*) as c
from ta group by h
B as (
select hashtext(textin(record_out(tb))) as h, count(*) as c
from tb group by h
select * from A full outer join B on (A.h + A.c= B.h + B.c)
where A.h is null or B.h is null limit 5;
The hashtext part is optional -- for very wide rows, using hashtext will save some cycles and memory. This is great except it has a full outer join. Can we do better?
Method 4: Group it ourselves,
select r, sum(cnt) from (
select textin(record_out(ta)) as r, 1 as cnt from ta
select textin(record_out(tb) as r, -1 as cnt from tb) foo
group by r having sum(cnt) <> 0;