18 July 2010

Bill & Ted's Excellent Adventure, Part 5

In today's episode, our fearless heroes set out to build some larger tables. The Sample database tables, and The Mother Lode, both demonstrate the thesis. But let's create some more complex, and larger, scenarios just to see for ourselves.

We'll start with Birchall's figures 989 and 990, wherein we'll use the magic of CTE to populate our test data, running to 100,000 rather than 10,000 as he does. Aside: I've always wished that CTE could be used to insert both independent and dependent at one go, but alas, not that I can divine. One can buy software (Red Gate, the folks who run simple-talk, are one vendor) which generates test data; if that's not niche I don't know what is.

We start with the base single table:

CREATE TABLE personnel
(emp# INTEGER NOT NULL
,socsec# CHAR(11) NOT NULL
,job_ftn CHAR(4) NOT NULL
,dept SMALLINT NOT NULL
,salary DECIMAL(7,2) NOT NULL
,date_bn DATE NOT NULL WITH DEFAULT
,fst_name VARCHAR(20)
,lst_name VARCHAR(20)
,CONSTRAINT pex1 PRIMARY KEY (emp#)
,CONSTRAINT pe01 CHECK (emp# > 0)
,CONSTRAINT pe02 CHECK (LOCATE(' ',socsec#) = 0)
,CONSTRAINT pe03 CHECK (LOCATE('-',socsec#,1) = 4)
,CONSTRAINT pe04 CHECK (LOCATE('-',socsec#,5) = 7)
,CONSTRAINT pe05 CHECK (job_ftn <> '')
,CONSTRAINT pe06 CHECK (dept BETWEEN 1 AND 99)
,CONSTRAINT pe07 CHECK (salary BETWEEN 0 AND 99999)
,CONSTRAINT pe08 CHECK (fst_name <> '')
,CONSTRAINT pe09 CHECK (lst_name <> '')
,CONSTRAINT pe10 CHECK (date_bn >= '1900-01-01' ));

-- CREATE UNIQUE INDEX PEX2 ON PERSONNEL (SOCSEC#); we'll skip this one from Birchall, in that it generates collisions and we really don't need it for these purposes
CREATE UNIQUE INDEX PEX3 ON PERSONNEL (DEPT, EMP#);


Next, we need both the flatfile version and the joined version of some dependent table, and Dependents is the obvious candidate. We'll use 12 for the number of Kids (we'll ignore Spouses for this exercise; we could just add an Attribute...), they're cheaper at that amount.


Here's, truncated, the PersonnelFlat table:

CREATE TABLE PersonnelFlat
(emp# INTEGER NOT NULL
,socsec# CHAR(11) NOT NULL
,job_ftn CHAR(4) NOT NULL
,dept SMALLINT NOT NULL
,salary DECIMAL(7,2) NOT NULL
,date_bn DATE NOT NULL WITH DEFAULT
,fst_name VARCHAR(20)
,lst_name VARCHAR(20)
,dep1ID VARCHAR(30) NOT NULL
,dep1_fst_name VARCHAR(30)
,dep1_lst_name VARCHAR(30)
...
,dep12ID VARCHAR(30) NOT NULL
,dep12_fst_name VARCHAR(30)
,dep12_lst_name VARCHAR(30)
,CONSTRAINT pexf1 PRIMARY KEY (emp#)
,CONSTRAINT pe01 CHECK (emp# > 0)
,CONSTRAINT pe02 CHECK (LOCATE(' ',socsec#) = 0)
,CONSTRAINT pe03 CHECK (LOCATE('-',socsec#,1) = 4)
,CONSTRAINT pe04 CHECK (LOCATE('-',socsec#,5) = 7)
,CONSTRAINT pe05 CHECK (job_ftn <> '')
,CONSTRAINT pe06 CHECK (dept BETWEEN 1 AND 99)
,CONSTRAINT pe07 CHECK (salary BETWEEN 0 AND 99999)
,CONSTRAINT pe08 CHECK (fst_name <> '')
,CONSTRAINT pe09 CHECK (lst_name <> '')
,CONSTRAINT pe10 CHECK (date_bn >= '1900-01-01' ));

-- CREATE UNIQUE INDEX PEXF2 ON PERSONNELFlat (SOCSEC#); as above
CREATE UNIQUE INDEX PEXF3 ON PERSONNELFlat (DEPT, EMP#);


(For the loads, I had to put the bufferpools back to 1,000, otherwise DB2 locked up for both versions.)
Here's the, truncated, data generator:

INSERT INTO personnelFlat
WITH temp1 (s1,r1,r2,r3,r4) AS
(VALUES (0
,RAND(2)
,RAND()+(RAND()/1E6)
,RAND()* RAND()
,RAND()* RAND()* RAND())
UNION ALL
SELECT s1 + 1
,RAND()
,RAND()+(RAND()/1E6)
,RAND()* RAND()
,RAND()* RAND()* RAND()
FROM temp1
WHERE s1 < 100000)
SELECT 1000000 + s1
,SUBSTR(DIGITS(INT(r2*988+10)),8) || '-' || SUBSTR(DIGITS(INT(r1*88+10)),9) || '-' || TRANSLATE(SUBSTR(DIGITS(s1),7),'9873450126','0123456789')
,CASE
WHEN INT(r4*9) > 7 THEN 'MGR'
WHEN INT(r4*9) > 5 THEN 'SUPR'
WHEN INT(r4*9) > 3 THEN 'PGMR'
WHEN INT(R4*9) > 1 THEN 'SEC'
ELSE 'WKR'
END
,INT(r3*98+1)
,DECIMAL(r4*99999,7,2)
,DATE('1930-01-01') + INT(50-(r4*50)) YEARS
+ INT(r4*11) MONTHS
+ INT(r4*27) DAYS
,CHR(INT(r1*26+65))|| CHR(INT(r2*26+97))|| CHR(INT(r3*26+97))|| CHR(INT(r4*26+97))|| CHR(INT(r3*10+97))|| CHR(INT(r3*11+97))
,CHR(INT(r2*26+65))|| TRANSLATE(CHAR(INT(r2*1E7)),'aaeeiibmty','0123456789')

,'A'
,CHR(INT(r1*26+65))|| CHR(INT(r2*26+97))|| CHR(INT(r3*26+97))|| CHR(INT(r4*26+97))|| CHR(INT(r3*10+97))|| CHR(INT(r3*11+97)) || 'A'
,CHR(INT(r2*26+65))|| TRANSLATE(CHAR(INT(r2*1E7)),'aaeeiibmty','0123456789') || 'A'
...
,'L'
,CHR(INT(r1*26+65))|| CHR(INT(r2*26+97))|| CHR(INT(r3*26+97))|| CHR(INT(r4*26+97))|| CHR(INT(r3*10+97))|| CHR(INT(r3*11+97)) || 'L'
,CHR(INT(r2*26+65))|| TRANSLATE(CHAR(INT(r2*1E7)),'aaeeiibmty','0123456789') || 'L'

FROM temp1;


Now, we need the table for the dependents, named, Dependents:

CREATE TABLE Dependents
(
emp# INTEGER NOT NULL
,dependent# CHAR(1) NOT NULL
,dep_fst_name VARCHAR(20)
,dep_lst_name VARCHAR(20)
,CONSTRAINT dep1 PRIMARY KEY (emp#, dependent#)
)


To keep the joined and flat file tables in sync, we'll load both the Personnel and Dependents data from PersonnelFlat.

INSERT INTO personnel
SELECT emp#, socsec#, job_ftn ,dept ,salary ,date_bn ,fst_name , lst_name FROM personnelflat;


And, the, truncated, data loader for Dependents (the NOT NULL is there from habit, not really needed here since they all got a dozen kids, but would be when converting some arbitrary flat file):

INSERT INTO dependents (emp#, dependent#, dep_fst_name, dep_lst_name)
SELECT emp#,
dep1ID AS dep,
dep1_fst_name,
dep1_lst_name
FROM personnelFlat
WHERE dep1ID IS NOT NULL
...
UNION ALL
SELECT
emp#,
dep12ID AS dep,
dep12_fst_name,
dep12_lst_name
FROM personnelFlat
WHERE dep12ID IS NOT NULL
ORDER BY emp#, dep;


For grins, here's the timing (same buffering, self-tuning on, and all logging is on a HDD directory) for the Dependents data:
SSD: 21s
HDD: 2m 17s

Now, return to 5 bufferpools, 1 prefetch, self-tuning off. Let's do a simple range select on Personnel.

select * from personnel where 1090000 < emp# and emp# < 1100000 -- for the join-ed table on SSD
19ms

select * from personnelflat where 1090000 < emp# and emp# < 1100000 -- for the flat table on HDD
140ms

Lastly, the joined select:

SELECT p.*
,d.dependent#
,d.dep_fst_name
,d.dep_lst_name
FROM personnel AS p
JOIN dependents AS d ON d.emp# = p.emp#
WHERE 1090000 < p.emp# AND p.emp# < 1100000
25ms -- SSD
257ms -- HDD


While not TPC-C exercising, more evidence that SSD/BCNF datastores are more better. With transactional data that would otherwise span hundreds of gigs in flatfiles, one can have a few gigs in relational. And faster, too.

In this example, I didn't fake up a large independent table to couple to a small dependent table. In the canonical customer/order/orderline example, the size difference between them is significant; at least in the systems I've worked on over the years. This is where much of the byte bloat is saved. Note, too, that since I load a dozen kids across the board, the Dependents table is full; in the Real World, the flatfile would have, depending on how the engine works, allocated space for the dozen as either empty space or NULLs, but the Dependents table would have only allocated for living Kids.

No comments: