2017-1 Data Modelling And Databases
INTRODUCTION
============
DataBase Management System (DBMS):
tool that helps run & develop data intensive applications
push the complexity of dealing with data (storage, processing, persistence) to the database
share the database
database is a tool
many forms of engines: relational being the most common
usages:
avoid redundancy & inconsistencies
rich access to data
synchronize concurrent access to data
recovery after system failures
security & privacy
facilitate reuse of data
reduce cost & pain of doing something useful
database abstraction layers
data independence
logical data independence
DATA MODELING
=============
data modeling:
conceptual model:
captures the world to be represented (domain), collection of entities & their relations (Entity-Relationship)
manual modelling of ER schema, semi-automatic transformation to XML, relational, hierarchical, object-oriented, ...
entity relationship, UML
logical model (schema):
mapping of the concepts to concrete logical representations
flat file (SQLite), relational model (SQL), network model (COBOL), hierarchical model (IBM IMS/FastPath), object-oriented model (ODMG 2.0), semi-structured model (XML), deductive model (datalog, prolog)
physical model: implementation in concrete physical hardware
ER MODELING
===========
database design:
factors:
a) information requirements
b) processing requirements
c) DBMS
d) Hardware/OS
steps:
requirements engineering:
create a book of duty:
describe information requirements: objects used, identifiers, relationships, attributes
describe processes: examination, degree, ...
describe processing requirements: cardinalities (how many entries), distribution (how many relationships), workload (how often a process is carried out), priorities & service level agreements
conceptual modeling:
create a conceptual design (ER)
logical modeling:
create a logical design (schema)
physical modeling:
hardware / OS does this
building blocks:
ellipses (yellow): properties
rectangles (orange): objects
skewed rectangles (light green): relations
underline: key
underline with dashes: secondary key
text beside relationships: roles (for example "attends")
double border on relations & objects: weak entities, must have a secondary key, relation AND object are marked weak
is-a (hexagon, blue): "inheritance", generalization, arrow points to the more general object (the less specific one), e.g. employee(salary) -> is-a -> person (id, name); the specific entity inherits the attributes of the general one
part-of (looks like a relation, but green): frame - part of - bicycle (no arrows drawn)
creating models:
avoid redundancy
KISS
two binary vs one ternary: different use cases
attribute vs entity: attribute if 1:1 relationship, else entity
partitioning of ER models by domains good practice
do not: redundancy, performance improvements
do: less is better, concise, correct, complete, comprehensible
good to know about:
1:1: rel(_a_,b), additional key _b_
n:1: A(_a_, b), B(a)
rule of thumb: cover all but one of the connected entities with a finger; if the remaining one has cardinality 1, the covered entities fully determine it (they form a key of the relationship)
min/max: (2,*), (1,6), ...
weak entities: key is defined by the primary key of the owning entity & an own secondary key. happens for example for building -> room. double borders for relationship, entity & connecting line
weak relations: do not exist! Only weak entity matters; and then mark the defining relation as weak too. (order_entry is not weak; even if order is)
generalization: using the is-a structure, more specific points to more general. one entity can have multiple specializations
aggregation: using the part-of structure
Professor 1 --- <gives> --- N Lecture
min max: (1:N) => (0,*) (0,1)
participation in a relationship is optional; but once a relationship instance is established, it must connect one entity from every participating entity set
ternary: MyTert (_a_, _b_, c); additional key: _a_, _c_
same underline for same key, even if it spans multiple attributes! R(_a , b_, c)
ER to Schema:
1) entities to tables
2) relationships to tables (relations)
3) combine entities which have a generalization (copy all fields of the more general into the more specific)
4) combine relationship tables into entity tables where possible (same primary key)
difficulties:
create global schema from different views:
want: no redundancy, no conflicts, avoid synonyms & homonyms
difficulties: detect generalizations, synonyms, different concepts of attribute domain
UNIFIED MODELING LANGUAGE
=========================
basics:
short: UML
ER vs UML
attribute, generalization same
entity vs class
relationship vs association
weak entity vs composition
key differences:
methods are part of classes
keys are not part of UML
UML models explicitly aggregation
UML supports instance modeling
additional material
use cases
sequence diagram
object diagram
building blocks:
create small tables for classes
first row: name of object (professor)
second row: property list & their type (PersNr: String, Age: Int)
last row: method names (promote())
associations:
arrows, with min/max notation at the end (1, 1..*, 0..*, 0..1)
example: Professor 1 --- gives ---> 0..* Lecture
other way around as compared with ER
aggregation:
arrow (with an unfilled diamond as arrow head)
example: Professor -----<WHITE> Group
"belongs to"
composition:
arrow (with a filled diamond as arrow head)
example: Room -----<BLACK> Building
"is part of"
generalization:
arrows, with a triangle as head
example: Professor ----|> Person
"is a"
RELATIONAL DATA MODEL
=====================
definitions:
relation:
R subset_of D1 x D2 x D3
example: Addressbook subset_of string x string x int
tuple:
t element_of R
example: ("hi","mom")
schema:
associates labels to domains
AddrBook:{[_Id_: int, Name: string, Number: int]} -> underline keys, done here with _KEY_
instance: the current content of the database (a set of tuples per relation)
key: minimal set of attributes that identifies each tuple uniquely
primary key: select one key, use it to refer to tuples
transform ER to RM:
entities to relations -> Student:{[_Number_: string, Name: string]}
relationships to relations -> attends:{[_Number_: string, _Lecture_: string]}
naming attributes: use names of roles if provided, else use key attribute name, else invent new names
merge relations with same key -> {[_A_: string, B: int]} + {[_A_: string, C: string]} -> {[_A_: string, B: int, C: string]}
generalization: copy all attributes of generalization into more specific relation, or not. but keep all tables (do not remove general tables like Person)!
weak entities: must contain all keys (also from strong one)
relational algebra:
atoms:
basic expressions
relation in database
constant relation
operators:
composite expression
selection, projection, cartesian product, rename, union, minus
no recursion!
selection s (sigma):
condition
s(semester > 10)(Student)
projection p (pi):
selection of columns
p(semester)(Student)
cartesian product cp (x):
all combinations of tuples (n*m tuples)
(natural) join j (|><|):
S nj R = p(An, R.Bn, Cn)(s(R.Bn = S.Bn)(S x R))
theta join tj (|><| theta):
two relations (typically with no common attributes) + an arbitrary comparison predicate theta (=, <, >, ...)
S tj R = s(theta)(S x R)
rename r (rho):
rename relation names to join multiple same tables
s(L1.age = L2.age)(r(L1)(Student) x r(L2)(Student))
rename attribute names
r(newName <- oldname)(Student)
set minus m (-):
on relations with the same schema: R-S gives all tuples in R which are not in S
careful: in the division rewrite below, p(R-S) denotes projection on the attribute difference (all attributes of R which are not in S)
relational division rd (./.):
R rd S
the columns of S are a subset of the columns of R
S typically contains multiple rows
outputs all tuples from R which have all the values supplied by S -> so if S is (v1,v2) then R outputs m1 if R contains ([m1,v1], [m1,v2]) (so both v1 & v2 connected to m)
result set contains all columns from R which are not in S
can be rewritten as: p(R-S)(R) - p(R-S)((p(R-S)(R) x S) - R)
"student who attends all lectures" R:attends, S:lecture
union u (U):
combines two sets with same attributes
intersection i (turned U):
only works if both relations have same schema (same attribute names & attribute domains)
i = R-(R-S)
semi-join (left) sjl (|><):
tuples from left matching tuples from right (only left columns)
semi-join (right) sjr (><|):
tuples from right matching tuples from left (only right columns)
left outer join loj (_|><|):
natural join + unmatched tuples from left (left+right columns)
right outer join roj (|><|_):
natural join + unmatched tuples from right (left+right columns)
full outer join foj:
natural join + unmatched tuples from left/right (left+right columns)
tuple relational calculus:
{t | P(t)}
more advanced: {t | t element_of Student ^ there_is a element_of attends(a.is_valid_entry => a.student_id = t.student_id ^ a.is_too_late)}
tuple relational calculus: standard calculus; atoms, formulas, constants, ...
safety:
restrict to finite answers (semantic not syntactic property! { n | not (n element_of Table) } would not be valid)
result must be a subset of the domain of the formula (domain: all constants, all relations used in formula)
domain relational calculus:
similar to the tuple relational calculus, but tuples are "written out" as lists of domain variables and specific positions are selected
example:
{[l, n] | there_is s ([l, n, s] element_of Student ^ there_is v, p, g ([l, v, p, g] element_of tests ^ there_is a,r, b([p, a, r, b] element_of Professor ^ a = 'Curie')))}
safety:
same as relational calculus
Codd's theorem:
relational algebra, the tuple relational calculus (safe formulas only) and the domain relational calculus (safe formulas only) are all equally expressive
SQL is based on relational calculus
SQL implementation is based on relational algebra
SQL
====
definitions:
SQL: Structured Query Language
DDL: Data Definition Language
DML: Data Manipulation Language
Query: Query language
DDL:
base:
CREATE TABLE person (id INTEGER PRIMARY KEY, nr INTEGER NOT NULL, street VARCHAR(50) DEFAULT 'hi mom', other_person_id INTEGER, CONSTRAINT my_const CHECK (nr > 0), CONSTRAINT my_const_2 UNIQUE(nr), CONSTRAINT my_frkey FOREIGN KEY (other_person_id) REFERENCES person(id));
can add: new columns, CONSTRAINT, FOREIGN KEY, PRIMARY KEY
CREATE TABLE person (id INTEGER, email VARCHAR(50), CHECK (email IS NOT NULL));
DROP TABLE person
ALTER TABLE person ADD COLUMN age INTEGER;
indexes:
CREATE INDEX my_index ON person (age, nr)
DROP INDEX my_index
columns:
ALTER TABLE person ADD COLUMN age INTEGER NOT NULL DEFAULT 0;
ALTER TABLE person ALTER COLUMN age TYPE varchar(50);
ALTER TABLE person ALTER COLUMN age SET NOT NULL;
ALTER TABLE person DROP COLUMN age;
constraints:
ALTER TABLE person ADD CONSTRAINT age_bigger CHECK (age > 12);
DML:
inserts:
INSERT INTO person (age, street) VALUES (12, 'street');
INSERT INTO copy_table (age, street) SELECT age, street FROM person;
INSERT INTO copy_table SELECT * FROM person;
sequence:
CREATE SEQUENCE my_sql INCREMENT BY 1 START WITH 1;
my_sql.nextval
DROP SEQUENCE my_sql;
updates:
snapshot semantics: first mark the tuples affected by the update; then apply the update to the marked tuples
UPDATE person SET age = 12 WHERE age = 11;
UPDATE person SET age = age + 1;
deletes:
DELETE FROM person WHERE age = 11;
ETL
extract, transform, load
populating a database needs:
extract: data from file
transform: into a form the database can use (type transformation, formatting)
load: insert into db as a bulk operation
Queries:
select / from:
SELECT * FROM person;
SELECT age FROM person;
where
SELECT * FROM person WHERE age = 11;
order
SELECT * FROM person ORDER BY age ASC, name DESC;
distinct
SELECT DISTINCT age FROM person;
rename
SELECT * FROM person p;
SELECT age as person_age FROM person;
join:
SELECT * FROM person p, hero h WHERE h.id = p.id;
set operations:
(SELECT age FROM person) UNION (SELECT age FROM hero) //removes duplicates
(SELECT age FROM person) UNION ALL (SELECT age FROM hero)
(SELECT age FROM person) INTERSECT (SELECT age FROM hero)
(SELECT age FROM person) MINUS (SELECT age FROM hero) //same as except
(SELECT age FROM person) EXCEPT (SELECT age FROM hero) //same as minus
functions:
SELECT * FROM person WHERE SUBSTRING(name, 1, 3) = 'flo';
SELECT * FROM person WHERE date < NOW() AND date_part('year',date) = 2017 OR date = date('12.07.08');
EXTRACT(YEAR FROM orderdate);
aggregate:
SELECT AVG(age) FROM person;
SELECT MAX(age) FROM person;
SELECT MIN(age) FROM person;
SELECT COUNT(age) FROM person;
SELECT SUM(age) FROM person;
with:
creates temporary views only available in this query -> separate multiple definitions with a comma (be careful with the ,)
WITH my_table AS (SELECT * FROM person), my_table_2 AS (SELECT * FROM person) SELECT * FROM my_table, my_table_2
EXCEPT:
distinct rows from left which are not part of right
INTERSECT:
distinct rows which are both in right & left set
subqueries:
SELECT age FROM person AS p1 WHERE (SELECT MIN(p2.age) FROM person p2 WHERE p2.name = p1.name) > 20
WITH my_range AS (SELECT age FROM (SELECT * FROM person pe ORDER BY pe.age LIMIT 0.55 * (SELECT COUNT(*) FROM person)) AS lowest55 ORDER BY age DESC LIMIT 0.1 * (SELECT COUNT(*) FROM person))
SELECT name, (SELECT AVG(age) FROM person) FROM person, (SELECT * FROM hero WHERE name = 'hi mom') AS h
grouping:
SELECT AVG(age), name FROM person GROUP BY name;
having:
SELECT AVG(age), name FROM person GROUP BY name HAVING AVG(age) > 11;
exists:
SELECT name FROM person p WHERE NOT EXISTS (SELECT * FROM hero h WHERE p.name = h.name)
in:
SELECT name FROM person p WHERE name NOT IN (SELECT name FROM hero h)
all:
SELECT name FROM person p WHERE age >= ALL(SELECT age FROM hero h)
for all tricks:
SELECT a.Legi FROM attends a GROUP BY a.Legi HAVING COUNT(*) = (SELECT COUNT(*) FROM Lecture);
NULL:
arithmetic with NULL results in NULL: NULL * 2 -> NULL
comparisons with NULL result in UNKNOWN: NULL > 2 -> UNKNOWN
NULL does not equal NULL: NULL = NULL and NULL <> NULL both evaluate to UNKNOWN (so a UNIQUE column may contain several NULLs)
NULLs group into one group: GROUP BY maybe_column puts all NULL values into a single group
where: only rows evaluating to true are returned (UNKNOWN is filtered out)
group by: if at least one NULL exists there will be one group for NULL
UNKNOWN:
not: results in UNKNOWN
and: false and UNKNOWN -> false; other and UNKNOWN -> UNKNOWN
or: true or UNKNOWN -> true; other or UNKNOWN -> UNKNOWN
between
SELECT age FROM person WHERE age BETWEEN 2 AND 3;
in
SELECT age FROM person WHERE age IN (1,2,3);
case
SELECT age, (case when (age > 12) then 'old' when (age < 12) then 'young' else '12' end) FROM person;
like:
%: any number of characters
_: exactly one character
SELECT name FROM person WHERE name LIKE 'Marga%' OR name LIKE 'M_rgareth'
you can choose an escape character using ESCAPE: LIKE '80!%' ESCAPE '!' matches the literal text '80%'
joins:
SELECT * FROM person, hero WHERE person.id = hero.id; crossproduct then filter
SELECT * FROM person JOIN hero ON person.id = hero.id; same as above
SELECT * FROM person NATURAL JOIN hero; joins on all common attribute names, removes duplicate columns
SELECT * FROM person LEFT JOIN hero ON person.id = hero.id; left table part fully filled out (n entries)
SELECT * FROM person RIGHT JOIN hero ON person.id = hero.id; right table part fully filled out (m entries)
SELECT * FROM person FULL OUTER JOIN hero ON person.id = hero.id; both tables in there, missing column values are null (m+n entries)
recursion:
WITH RECURSIVE temp (n, fact) AS (SELECT 0, 1 UNION ALL SELECT n+1, (n+1)*fact FROM temp WHERE n < 9) SELECT * FROM temp;
WITH RECURSIVE lectures (first, next) AS (SELECT prerequisite, follow_up FROM requires UNION ALL SELECT t.first, r.follow_up FROM lectures t, requires r WHERE t.next = r.prerequisite) SELECT * FROM lectures;
WITH RECURSIVE R(r) AS (SELECT 1 UNION SELECT r+1 FROM R) SELECT r FROM R LIMIT 10;
views:
CREATE VIEW person_view AS (SELECT age, name FROM person)
updateable iff the view involves: only one base relation; the key of that relation; no aggregates, group-by or duplicate-elimination
-> view columns do not change (not even if SELECT *) if the columns of base table change
INTEGRITY CONSTRAINTS
=====================
ways to constraint
schema (defines domain of the data & the captured concepts)
types (defines format & space reserved for values)
constraints (add additional constraints to attributes & relations)
constraints:
pre- & postconditions
way to make sure changes are consistent and do not cause trouble later on
control content of data and its consistency as part of the schema
avoiding problems:
inserting data without a key
adding references to non-existing tuples
nonsensical values for attributes
conflicting tuples
example constraints
keys
multiplicity of relationships
attribute domains
subset relationship of generalization
referential integrity (foreign keys do indeed reference an existing key)
static constraints:
constraints any instance of the DB must meet
dynamic constraints:
constraints on a state transition of the DB
why in database
good way to annotate schema
db is a central point; so it has to be done once & for all applications
safety net: in case the check is missing in the app
useful for DB optimization
constraint examples:
UNIQUE:
a UNIQUE column can serve as a key; NULLs each count as distinct, so several NULLs are allowed
CREATE TABLE person (id INTEGER PRIMARY KEY, email TEXT UNIQUE)
PRIMARY KEY:
entity integrity
CREATE TABLE person (id INTEGER PRIMARY KEY)
CREATE TABLE person (id INTEGER, PRIMARY KEY (id))
CREATE TABLE person (id INTEGER, age INTEGER, PRIMARY KEY (id, age))
FOREIGN KEY:
referential integrity: for every foreign key the value must be either NULL or the one of an existing referenced tuple
CREATE TABLE person (id INTEGER PRIMARY KEY, other_person_id INTEGER REFERENCES person)
CREATE TABLE person (id INTEGER PRIMARY KEY, other_person_id INTEGER, FOREIGN KEY (other_person_id) REFERENCES person ON DELETE SET NULL)
CREATE TABLE person (id INTEGER PRIMARY KEY, other_person_id INTEGER, FOREIGN KEY (other_person_id) REFERENCES person(id) ON DELETE SET NULL)
maintain integrity actions:
CREATE TABLE person (id INTEGER PRIMARY KEY, other_person_id INTEGER, FOREIGN KEY (other_person_id) REFERENCES person(id) ON DELETE SET NULL ON UPDATE SET NULL);
-> set foreign key to null if changed/removed
CREATE TABLE person (id INTEGER PRIMARY KEY, other_person_id INTEGER, FOREIGN KEY (other_person_id) REFERENCES person(id) ON DELETE CASCADE ON UPDATE SET NULL);
-> remove if foreign element is removed, set null if foreign element is updated
CREATE TABLE person (id INTEGER PRIMARY KEY, other_person_id INTEGER, FOREIGN KEY (other_person_id) REFERENCES person(id) ON DELETE NO ACTION ON UPDATE CASCADE);
cascade: propagate update & delete
set default, set null: set references to null or default value
restrict: prevents deletion of primary key if it still is referenced somewhere (checked immediately)
no-action: same as restrict, but checked at the end
implementation with triggers
ECA rule
Event -> check Condition -> execute Action
checks
CREATE TABLE person (age INTEGER, CHECK (age BETWEEN 1 and 5));
CREATE TABLE person (gender varchar(1), CHECK (gender IN ('m','f')));
CREATE TABLE person (born INTEGER, died INTEGER, CHECK (born < died));
triggers:
CREATE TRIGGER enforce_feminism BEFORE UPDATE ON Professor FOR EACH ROW WHEN (old.age != 1) BEGIN ... END; -> the body can reference the new & old row versions and use IF ... THEN ... END IF; control flow
NORMAL FORMS
============
redundancy:
problems:
waste of storage space
need to keep duplicates up to date
advantages:
improve locality
space is not so much of a problem anymore, time is
fault tolerance, availability
multi-version database
storage is cheap -> never throw anything away
consequence 1: no delete (simply mark as deleted)
consequence 2: no update in place (create new version of tuple)
NoSQL Movement: denormalized data
functional dependency:
if two tuples agree on the attributes in A, they also agree on the attributes in B
{A} -> {B}, A -> B as convention
armstrong axioms:
reflexivity: b subset_of a => a -> b
augmentation: a -> b => ay -> by (don't forget: a -> y implies a -> ay)
transitivity: a -> b ^ b -> y => a -> y
complete, all other rules can be inferred from those three
union: a -> b ^ a -> y => a -> by
decomposition: a -> by => a -> b ^ a -> y
pseudo-transitivity: a -> b ^ yb -> g => ay -> g
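worked example: deriving the union rule from the three axioms:
a -> b, augment with a: a -> ab
a -> y, augment with b: ab -> by
transitivity on a -> ab and ab -> by: a -> by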
keys:
superkey:
(a determines whole relation)
a -> R
either superset of minimal key or candidate key
example:
trivial: all attributes!
non-trivial: take a candidate key and add an unnecessary attribute
minimal key:
for_all A element_of a: not ((a - A) -> b) (no entry from key can be removed)
a ->. b
example:
the key that identifies the row, e.g. the id
(candidate) key:
(a is minimal and determines whole relation)
a ->. R
example:
a minimal set of attributes whose closure contains all attributes of the relation
cardinalities:
cardinalities define functional dependencies
functional dependencies determine keys
but not all functional dependencies are derived from cardinality information
closure of attributes:
input: F (set of functional dependencies (A->B)), a_set (set of attributes)
formula:
result = a_set;
while (resultChanges) {
foreach (F as a -> b) {
if (a subset_of result) {
result = result union b;
}
}
}
//closure is deterministic & terminates
//correctly done with sets: a and result are attribute sets, so test a subset_of result and add b via set union
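a runnable Python version of the same closure computation (attribute sets as Python sets, F as a list of (lhs, rhs) pairs; names and sample FDs are illustrative):
def closure(F, attrs):
    # F: list of (lhs, rhs) pairs, each side a set of attribute names
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in F:
            if lhs <= result and not rhs <= result:
                result |= rhs        # a subset_of result -> add b
                changed = True
    return result

F = [({'A'}, {'B', 'C'}), ({'C', 'D'}, {'E'})]
print(closure(F, {'A'}))        # {'A', 'B', 'C'}
print(closure(F, {'A', 'D'}))   # all of A..E -> AD is a (super)key of R(ABCDE)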
minimal basis
Fc is a minimal basis for F iff:
Fc is equivalent to F (the closure of any attribute set under Fc equals its closure under F)
all functional dependencies in Fc are minimal (left- and right-reduced):
for_all A element_of a: (Fc - (a->b)) union ((a-A)->b) is not equivalent to Fc
for_all B element_of b: (Fc - (a->b)) union (a->(b-B)) is not equivalent to Fc
in Fc, there are no two functional dependencies with the same left side
formula:
given: a->b, A element_of a, B element_of b, set of functional dependencies F
reduction of left sides: if (b subset_of Closure(F, a-A)) then replace a->b with (a-A)->b
reduction of right sides: if (B element_of Closure((F - (a->b)) union (a->(b-B)), a)) then replace a->b with a->(b-B)
example: C element_of Closure((F - (a->bc)) union (a->b), a) -> c can be dropped from the right side of a->bc
remove FDs whose right side b is empty
apply the union rule to FDs with the same left side (a1 == a2)
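small worked example (FDs constructed for illustration): F = {A -> B, B -> C, A -> C}
left reduction: nothing to do, all left sides are single attributes
right reduction of A -> C: C element_of Closure({A -> B, B -> C}, A) = {A,B,C}, so A -> C shrinks to A -> empty
remove the empty FD A -> empty
result: Fc = {A -> B, B -> C}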
decomposition of relations:
bad relations capture multiple concepts, so decompose them
lossless:
S = S1 union S2
S = S1 natural_join S2
prove that it is lossless (direct proof):
show that the common attributes determine one side: (R1 intersect R2) -> R1 or (R1 intersect R2) -> R2
prove that it is lossy (proof by counterexample):
let R have two entries
construct R1 and R2 from it
construct R1 natural_join R2
show that the constructed table is not identical to R
prove that R = R1 natural_join R2:
assume R1 is of the form (a, b) and R2 is of the form (a, c)
prove R subset_of R1 natural_join R2 (if (a,b,c) element_of R, then ....)
prove R1 natural_join R2 subset_of R (if (a,b,c) element_of R1 natural_join R2, then show that it is in R1 and R2)
preservation of dependencies:
FD(R)+ == (FD(R1) union FD(R2))+
-> a decomposition can be lossless but not dependency preserving if dependencies span the two tables
remember stuff:
the key (1NF), the whole key (2NF) and nothing but the key (3NF), so help me codd
normal forms
to investigate the normal form of a relation, first find the candidate keys (minimal keys) from the functional dependencies
for all FD examples below assume R(ABCD) with key AB
minimal basis:
iff functional dependencies cannot further be reduced
algorithm:
left reduction: remove unnecessary part of keys from left
right reduction: remove unnecessary results from the right
remove empty FD: remove all FD which point to nothing
first normal form:
why:
only atomic domains (no JSON as value)
short:
no json
formal:
only atomic domains as values (no JSON, array or similar)
"no repeating groups"
disprove 1NF:
find JSON
"find repeating groups"
prove 1NF:
only atomic domains
"columns are all different (no columns which should be rows)"
example in 1NF:
[12, "hallo", "hi"]
example not in 1NF
[12, '['de': 'hallo', 'en': 'hi']']
(name, order1, order2, order3)
second normal form:
why:
eliminate update anomalies
short:
must depend on whole key
formal:
no functional dependency exists which depends only on part of the candidate key
"each column must depend on the entire primary key"
prove 2NF:
check that no non-key attribute is functionally dependent on a proper subset of a candidate key (left sides are either a full candidate key or contain no part of one)
example:
valid: AB -> C, C -> D, C -> A
invalid: A -> C
third normal form:
why:
Eliminating functional dependencies on non-key fields
short:
a is a superkey or B is part of a candidate key
formal (for every FD a -> B at least one of the following must hold):
B is part of a candidate key (B is prime)
B element_of a (a -> B is trivial, e.g. AB -> B)
a is a superkey of R
prove 3NF:
for every FD check that the left side a is a superkey
for the remaining FDs, check that the right side is part of a candidate key
example:
valid: AB -> C; C -> A
invalid: C -> D;
synthesis algorithm (produces a lossless, dependency-preserving 3NF decomposition):
compute the minimal basis Fc (A -> B,C), (D,A are keys)
create relations for the FDs: for all a -> b create Ra := a union b ([A,B,C])
create a relation for a key (if needed): if no Ra contains a candidate key k of R, create Rk := k ([A,B,C], [A,D])
eliminate redundant (subset) relations: eliminate Ra if there exists Ra' with Ra subset_of Ra' (eliminate unnecessary tables)
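small worked example (constructed for illustration): R(A,B,C,D), Fc = {A -> BC, D -> A}
candidate key: D, since Closure(Fc, D) = {D,A,B,C} covers all attributes
relations from the FDs: RA := {A,B,C}, RD := {D,A}
key relation: not needed, RD already contains the candidate key D
elimination: neither relation is a subset of the other
result: R1(_A_, B, C) and R2(_D_, A)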
boyce codd normal form:
why:
eliminate FDs whose left side is not a (super)key
short:
a is candidate key
formal:
B element_of a (a->B, trivial)
a is superkey of R
prove BCNF:
for every non-trivial FD a -> B check that a is a superkey
example:
valid: AB -> C
invalid: C -> A
decomposition algorithm:
any schema can be decomposed into BCNF, but dependency preservation is not guaranteed
decompose repeatedly: create a new relation for the violating FD, and remove its right-hand side from the old relation
result = {R}
while (exists a element_of result that is not in BCNF because of a violating FD y -> z) {
a1 = y union z;
a2 = a - z;
result = (result - a) union {a1, a2};
}
fourth normal form:
why:
eliminate MVD anomalies
short:
only one concept per table
formal:
for every MVD the left side must be superkey
prove 4NF:
for all a->->B check that a is superkey
example:
valid: AB ->-> C
invalid: C ->-> A
decomposition algorithm (same as for BCNF, but splitting on the violating MVD y ->-> z):
result = {R}
while (exists a element_of result that is not in 4NF because of a violating MVD y ->-> z) {
a1 = y union z;
a2 = a - z;
result = (result - a) union {a1, a2};
}
not dependency preserving!
MVD (Multi Value Dependency):
written as a ->-> b, person_id ->-> phone
if table can be restructured; so for example (person_id, language, skill)
if two independent 1:n relationships are stored in the same table; so (person_id, email, children_name)
formal:
split R into attribute sets a, b and c := R - a - b; then a ->-> b holds iff:
for_all t1, t2 element_of R with t1.a = t2.a there exist t3, t4 element_of R such that:
t3 = (a, b1, c2) and t4 = (a, b2, c1)
(as a table with columns a | b | c: t1 = a b1 c1, t2 = a b2 c2, t3 = a b1 c2, t4 = a b2 c1)
a ->-> b holds iff a ->-> c holds (the b and c values can be swapped freely)
proof technique: write down t1 and t2, postulate t3 and t4 from the definition, then derive the required conclusions
laws:
generalization, promotion: a -> b => a->->b
reflexivity, augmentation, transitivity also hold for MVDs
multi-value augmentation: a ->-> b ^ g subset_of h => ah ->-> bg; multi-value transitivity: a ->-> b ^ b ->-> g => a ->-> (g - b)
complement: a ->-> b => a ->-> R-b-a
coalescence: a ->-> b ^ (y subset_of b) ^ (z intersect b = empty) ^ z -> y => a -> y
multi-value union: a ->-> b ^ a ->-> c => a ->-> bc
intersection: a ->-> b ^ a ->-> c => a ->-> (b intersect c)
minus: a ->-> b ^ a ->-> c => a ->-> (b - c) ^ a ->-> (c - b)
WRONG: a ->-> by => a ->-> b ^ a ->-> y (example: a: person_id, b: language, c: skill)
trivial MVD:
b subset_of a OR b = R-a
lossless decomposition:
(R1 intersect R2) ->-> R1
database schema:
captures:
concepts represented
attributes
constraints & dependencies over all attributes
good schemas:
ER as the basis
functional dependencies for constraining
normal forms for removing redundancies & anomalies
algorithms for decomposing & deriving normalized tables
OLTP
on-line transaction processing
workload with updates, online (time-critical), high volume of small transactions
integrity important (3NF+)
banking, shops
TPC-C:
wholesale supplier, stocks of products in stores, districts, warehouses
OLAP
on-line analytical processing
workload with complex queries, data updates in batches
de-normalized schemas, with approaches as star/snowflake
marketing analysis
TPC-H:
pricing, analysis of sales
star schemas (not normalized tables): fact + dimension tables (sales table with customer_id, part_id as fact table, customer + part tables as dimension tables)
why denormalized: not updated by hand, constraints are checked during the data loading process (the application is now responsible!)
snowflake schema (some dimension tables are normalized): dimension tables can themselves have dimension tables
TPC-DS:
sell analytics over multiple channels
seven fact tables, 17 dimension, 38 columns. large!
modern trends
working set in main memory
column stores at physical level
OLTP & OLAP in same system
specialization for the use case (denormalization etc.)
data cubes:
analysis & reporting
preaggregated data over several dimensions
example cube dimension: year, product category, store
support:
slicing: fixing one dimension (select a specific year) (removes one dimension of the cube)
dicing: select some values over all dimensions of the cube (sales of product by product_nr, time, region) (generalization of slicing; select multiple times & places)
roll up & drill down: group-by at different granularities (sales by year or month, products by region or shop) (fewer or more cells in the cube)
DBMS
====
basis:
input: SQL
output: tuples
process:
application -> DML-Compiler -> query optimizer, schema -> runtime -> query processor, data manager (Indexes, Records), Storage Manager (pages)
performance:
now assuming all data in main memory, before assumed data on disk
random vs sequential access: expensive (pollutes caches, generates faults, not predictable) vs fast (hardware prefetching works well)
storage system:
storage systems basic:
organized in a hierarchy: combine different types to get desired result
can be distributed: organized in arrays, use cheap hardware, parallel processing, replication
non-uniform: varying distance to memory, sequential vs random access
storage manager:
controls access to disks: implements storage hierarchy (ssd -> tapes -> disks), caching, optimizes storage
management of files & blocks: keeps them in pages (granularity of access), keeps track of pages from DBMS (catalog)
buffer management: segmentation of the buffer pool, clever replacement policies, pins pages (no replacement)
bottleneck:
cache hierarchies require careful placement of data
NUMA has big effect: memory access time not uniform (distance), memory on other cores may not be the same
store data:
records as tuples
collections of tuples of the same table are stored in pages
parts/collections of pages form blocks
a tablespace is the place on disk for the db to use
data manager:
maps tuples to pages
buffer manager:
maps pages in memory to ones on disk
catalog:
knows where which files are
oracle logical data organization:
extent: blocks of data (~pages) for the same purpose
segment: collection of extents stored in the same tablespace
persist stuff:
record structure:
fixed length fields:
direct access
stored in system catalogs
no need to scan to find i-th entry
variable length fields:
access in two steps (retrieve info about the variable-length field (length, pointer), then retrieve the value)
field count + delimiter or array of field offsets at the beginning of the space
NULL:
bit set to indicate value is null
page structure:
fixed size:
header: slot directory (1 if valid, 0 if invalid), number of records in page, other stuff
entry: consists of rid (record identifier), payload
position of record in page: slotno * record size
->records can move inside page (on CRUD), do not need to regenerate indexes!
packed vs unpacked: no space between records (only at the end of page) vs does not care (may be better for access)
variable size:
header: slot directory (length of entries), number of slots in page, pointer to start of free space
record identifier as <page_id, record_nr>
full page but record grows:
row chaining: placeholder PID (which points to another page); flexible, no need to update other references, but expensive
fixed-sized sequence of blocks
header page: references data pages (full and empty) & next header page
file:
variable-sized sequence of blocks
tablespace:
one or more datafiles
each datafile is associated with exactly one tablespace
an object in the tablespace may span multiple datafiles
buffer management:
target: keep pages in memory as long as possible
critical: replacement policy (LRU, 2Q (active & inactive queues, replace from inactive, accessed pages move to the active queue)), when does writeback of updated pages occur?
hides the fact that not all pages being operated on are in main memory (the query processor assumes they are)
Buffer management of DBMS vs OS:
DBMS has more insight than the OS (knows the access patterns)
problems: double page fault (as the DBMS runs on top of the OS, replacing a page in the buffer may trigger another replacement in the OS)
access patterns:
sequential: 1 - 1000
hierachical: index navigation
random: index lookup
cyclic: nested loops (repeated access pattern)
replace policies:
LRU: Least Recently Used, replace the least recently used page in the cache -> problem: sequential flooding when #buffer pages < #pages of the scanned file (see the sketch below)
MRU: Most Recently Used, replace newest used page in cache (example: buffer size 4, access: 1234512345)
DBMin:
observations: many concurrent queries, queries are made from operators
buffer segmentation; each operation has its own buffer size, replacement policy
examples: scan: 4 pages, MRU; index scan: 100 pages, LRU
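a minimal Python sketch of an LRU buffer pool and the sequential-flooding effect (class name, capacity and page ids are made up for illustration):
from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> frame; insertion order = recency
        self.misses = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # hit: mark as most recently used
        else:
            self.misses += 1                    # miss: "fetch" the page
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page
            self.pages[page_id] = "frame"

# repeatedly scanning a 5-page file with a 4-page buffer: every access misses,
# because LRU always evicts exactly the page that is needed next
buf = LRUBuffer(4)
for _ in range(3):
    for page in range(5):
        buf.access(page)
print(buf.misses)   # 15 of 15 accesses are misses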
meta-data management
stored in tables, accessed internally with SQL
contains:
schema (to compile queries)
tablespaces (files)
histograms (statistics for query optimizations)
parameters (cost of IO, CPU speed, for optimizations)
compiled queries (used for prepared statements)
configuration (isolation level, AppHeap size)
Users (login, passwd)
Workload statistics (index advisors)
QUERY PROCESSING
================
architecture:
compiler:
sql input
parser produces query graph model
rewrite produces QGM
optimizer produces QGM++
CodeGen produces the Execution Plan
runtime system:
Interpreter fetches the tuples
runtime system
compile query into machine code: better performance, today's approach
compile query into relational algebra & interpret: easier debugging, portability
hybrid; e.g. compile predicates
relational algebra algorithms
query selectivity: #tuples in result vs #tuples in table
attribute cardinality: #distinctive values for attribute
skew: probability distribution of the possible values of an attribute
selectivity of index (over an attribute): #diff values / #total rows -> high for unique keys; low for boolean
indexes:
B-Tree of values;
good on primary/foreign keys, attributes with high cardinality, attributes used in joins, conditions
needs space, maintenance, not useful if low index selectivity
clustered index: the leaves of the B-Tree hold the data (and not just pointers). Only one per table
partitioning:
divide a table into smaller chunks (by hash, range) for parallel access, cache fit, increased concurrency, distributed hot-spots
needs good heuristics to get right
table access:
iterate over every tuple in table; match tuple against predicate
not so expensive because: selectivity, IO, block/page sizes
predictable runtime
scan:
no indexes relevant; load each page
min / max pages: 1 / # of pages which can be served in one IO request
index scan:
relevant index found; need to scan whole index to find corresponding results
min / max pages: 2 / entire B-Tree + all pages -> use LRU
index seek:
relevant index found; only part of the index needs to be scanned (e.g. for a BETWEEN query)
sorting:
sort vs hash: both O(nlogn); hash done before for partitioning, lower constants for CPU vs sorting more robust
needed for results; intermediate step for other operations
expensive in CPU, space
two-phase external sorting: data does not fit in memory / memory already used by other queries
N = size of the input in pages, M = size of the buffer in pages
One-Pass Merge:
phase I (create runs): load buffer space with tuples, sort tuples in buffer, write to disk, redo until all are done
phase II (merge runs): use priority heap to merge tuples from runs
special: M >= N: no merge needed, M < sqrt(N): multiple merges needed
merge runs:
keep pointer in each run (start at 0);
load items from pointer into memory
choose lowest item in memory
advance the pointer of the corresponding run by one and take the new item into memory -> back to choosing lowest
Multi-Way Merge:
generalization of one-pass merge
input file -> 1-page run (pass 0) -> produce 1-page sized runs
each consecutive run (called "pass") produces double sized runs (with merges)
analysis IO:
O(n) if m >= sqrt(n)
2n if m > n
4n if n > m >= sqrt(n)
O(n logm n) if m < sqrt(n)
analysis CPU:
if (m > sqrt(N)): create N/M runs of size M (O(N*log2 m)), merge n tuples (O(N*log2 N/M))
complexity: N*log(N) is correct, but db cares more about CPU/IO cost, constants, buffer allocation
two-way sort:
parallelize sorting and run concurrent sorts; external sorting remains relevant despite large main memories
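a minimal Python sketch of the two-phase external sort (runs are kept as in-memory lists here instead of being written to disk; names and sample data are illustrative):
import heapq

def external_sort(records, m):
    # phase I (create runs): sort buffer-sized chunks of m records; each sorted
    # chunk plays the role of a run written back to disk
    runs = [sorted(records[i:i + m]) for i in range(0, len(records), m)]
    # phase II (merge runs): a priority heap repeatedly picks the smallest head
    # element across all runs (one-pass merge; works as long as m >= number of runs)
    return list(heapq.merge(*runs))

print(external_sort([5, 3, 8, 1, 9, 2, 7, 4], m=3))   # [1, 2, 3, 4, 5, 7, 8, 9]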
joins:
very common, expensive
performance factors: join algorithm, relative table sizes, number of tables to join, order of joins, selectivity, predicates of the query, memory hierarchy, indexes
R table 1, S table 2
nested-loops NLJ:
two for loops: while more get tuple from R, scan S, output if match found
optimization with blocks: read a block of R at a time, hash it & compare against S (needs fewer scans of S)
comparisons: R*S
IO cost: p(R) + p(S)*p(R)
min / max pages: 2 pages / all pages of inner, one of outer -> use MRU
good for: two small tables
canonical hash join:
build phase: build a hash table from all tuples of the smaller relation R
probe phase: scan S and probe the hash table with each tuple's join key
easily parallelizable
comparisons: S
IO cost: p(R) + p(S) (read R once to build the hash table, read S once to probe)
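a minimal Python sketch of the canonical hash join (build on the smaller input, then probe; function and table names are made up):
from collections import defaultdict

def hash_join(r, s, r_key, s_key):
    # build phase: hash table over the (smaller) relation r
    table = defaultdict(list)
    for rt in r:
        table[rt[r_key]].append(rt)
    # probe phase: scan s once and look up matching r tuples by join key
    return [rt + st for st in s for rt in table[st[s_key]]]

person = [(1, "alice"), (2, "bob")]
hero   = [(1, "batgirl"), (3, "robin")]
print(hash_join(person, hero, r_key=0, s_key=0))   # [(1, 'alice', 1, 'batgirl')]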
grace hash join GHJ:
build partitions of the same hash value;
scan R and S with the hash function and put tuple in correct partition
foreach partition do the canonical hash join
comparisons: max(R,S)
IO cost: 3(p(S) + p(R)) (read/write while hashing; afterwards read to compare)
min / max pages: sqrt(p(R)) + 1 page of bigger relation / p(R) + p(S)
good for: big tables
partitioned hash join PHJ:
create p partitions; then build & probe each partition
if p is too big: TLB misses or cache thrashing
sort merge join SMJ:
sort both tables
merge them
comparisons: r*log(r) + s*log(s) + r + s
IO: 5(p(R)+p(S)) (read/write for sort, read/write for merge, read for match phase)
min / max pages: 2 pages for external sort & multiple merge steps or sqrt(p(R)) for one merge step / p(R) + p(S)
good for: small + big table
multi-pass radix partitioning:
recursively apply the partitioned hash join;
partition key is determined by log2 p bits of hash value
parallel radix join:
1st scan: local histograms
2nd scan: global histogram & prefix sum -> each thread knows where to output its tuples
group-by:
hash on group-by attribute; aggregate on hash collision
sort on group-by attribute; aggregate sorted ranges
-> choose what is best for the situation
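a minimal Python sketch of hash-based grouping with aggregation (corresponds roughly to SELECT AVG(age), name FROM person GROUP BY name; helper name and data are made up):
from collections import defaultdict

def hash_group_by_avg(rows, key_idx, val_idx):
    # hash on the group-by attribute; aggregate (sum, count) when keys collide
    acc = defaultdict(lambda: [0, 0])
    for row in rows:
        acc[row[key_idx]][0] += row[val_idx]
        acc[row[key_idx]][1] += 1
    return {k: s / c for k, (s, c) in acc.items()}

person = [("alice", 30), ("bob", 40), ("alice", 50)]
print(hash_group_by_avg(person, key_idx=0, val_idx=1))   # {'alice': 40.0, 'bob': 40.0}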
OPTIMIZER
=========
optimizer
plan is a tree of operators; the implementation of each operator can be chosen
step 1 (execution model):
pipeline execution of the tree (dataflow graph with records as unit of exchange):
define each operator independently
generic interface for each operator: open (cascades down the tree), next(produce next answer), close(cascades down the tree)
each operator implemented by iterator
iterator model:
open goes down the tree, as soon as everything is open:
next goes down the tree, data is passed from the very bottom to all nodes (which each call next)
from left bottom to right bottom, up & down in between
data flow bottom up; control top down
advantages: generic interface, supports buffer management, no overhead in main memory, pipelining & parallelism
disadvantages: poor cache locality, high overhead of method calls
alternatives: vectorized; use blocks instead of tuples, fast in column stores
improvements: adaptive execution, non-blocking operators, pull/push, query compilation techniques, streaming/partial results
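a minimal Python sketch of the iterator model with open/next/close operators; control flows top-down, tuples flow bottom-up (class names and the sample plan are made up):
class Scan:
    def __init__(self, table):
        self.table = table
    def open(self):
        self.pos = 0                       # leaf operator: nothing to cascade to
    def next(self):
        if self.pos >= len(self.table):
            return None                    # no more tuples
        t = self.table[self.pos]
        self.pos += 1
        return t
    def close(self):
        pass

class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def open(self):
        self.child.open()                  # open cascades down the tree
    def next(self):
        while True:                        # pull tuples from the child until one matches
            t = self.child.next()
            if t is None or self.predicate(t):
                return t
    def close(self):
        self.child.close()

# plan for: SELECT * FROM student WHERE semester > 10
student = [("anna", 12), ("ben", 4), ("cleo", 11)]
plan = Select(Scan(student), lambda t: t[1] > 10)
plan.open()
while (t := plan.next()) is not None:
    print(t)                               # ('anna', 12) then ('cleo', 11)
plan.close()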