# CPSC 424 Lecture Notes
- Need to know up to chapter 11.
- The bottleneck for traditional DBs is disk I/O
- B+ trees: wide fanout; keep nodes at least half full
- Primary index vs. secondary index
- Selection (σ in relational algebra) = SQL WHERE
- Look at the assigned reading on the back of the road map.
- SSDs are changing the traditional disk-I/O bottleneck: good for read-heavy workloads, not write-heavy ones.
Selections Using Indices:
- Why does low selectivity default to a linear scan?
- Review covering indexes: pros and cons.
- Overview query processing
- Selection operation
- Sorting
- Foundation building block
- If it fits in main memory, use quicksort
- If not, optimize the sort algorithm for I/O
- External sort means writing intermediate results to disk
- The book assumes a row is a block, to keep things simple
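The two-phase external sort idea above can be sketched in Python. This is a toy: in-memory lists stand in for the temporary run files on disk, and the names `external_sort` and `memory_size` are made up for illustration.

```python
import heapq

def external_sort(records, memory_size):
    """External merge sort sketch: sort runs that fit in 'memory',
    then k-way merge the runs. Here each run is an in-memory list
    standing in for a temporary file written to disk."""
    # Phase 1: produce sorted runs of at most memory_size records each.
    runs = [sorted(records[i:i + memory_size])
            for i in range(0, len(records), memory_size)]
    # Phase 2: k-way merge the runs with a min-heap
    # (conceptually one buffer page per run).
    return list(heapq.merge(*runs))

print(external_sort([5, 1, 9, 3, 7, 2, 8, 6, 4], memory_size=3))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With b_b buffer pages you can merge up to b_b − 1 runs per pass, which is why wide merges keep the number of passes low.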
- Nested-loop join: problem - inefficient; should iterate block-wise instead of by tuple
- In block NLJ you want the smaller relation to be the outer relation
- Indexed NLJ: want high selectivity on the outer relation?
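The block-wise iteration can be sketched as follows (a toy in Python; `block_nested_loop_join` is a made-up name, and Python lists stand in for relations on disk):

```python
def block_nested_loop_join(r, s, block_size, join_pred):
    """Block nested-loop join sketch: iterate the OUTER relation r
    block by block; for each outer block, scan the inner relation s
    once and pair tuples block-wise rather than tuple-by-tuple.
    Making the smaller relation the outer one minimizes inner scans."""
    out = []
    for i in range(0, len(r), block_size):   # one pass over r
        r_block = r[i:i + block_size]
        for s_tuple in s:                    # one scan of s per r-block
            for r_tuple in r_block:
                if join_pred(r_tuple, s_tuple):
                    out.append((r_tuple, s_tuple))
    return out

pairs = block_nested_loop_join([(1, 'a'), (2, 'b')], [(2, 'x')], 1,
                               lambda a, b: a[0] == b[0])
print(pairs)  # -> [((2, 'b'), (2, 'x'))]
```

The inner relation is scanned once per outer *block* instead of once per outer *tuple*, which is the whole saving over plain NLJ.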
- Assume both relations are sorted on the join attribute
- Options for sort, see previous sections
- There is also a hybrid merge-join: one relation has a sorted primary index and the other has a secondary index; get RIDs from the secondary index and use those for merging.
- A clustering index on a PK is always sorted
- If possible, use secondary indexes that are covering.
Look up secondary indexes
- Divide and conquer
- Split into buckets, each of which can fit into memory
- A bucket does not correspond to a block
- Hash join algorithm *
- Partition s and r
- For each i...
- Build input and probe input
- Why do we need a separate hash function in step 3a?
- Recursive partitioning
- With today's memory sizes, you don't really need it
- Handling of skew (bad hash function or lots of repeated values)
- Overflow resolution is reactive
- Overflow avoidance is proactive: start with small buckets and then merge them
- If too many tuples land in one bucket, fall back to block nested-loop join
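The partition/build/probe structure can be sketched like this (a toy; `hash_join` and `n_parts` are made-up names, and Python's built-in dict hashing plays the role of the second hash function h2). It also answers why step 3a needs a different hash function: h1 already made every tuple inside a partition collide, so reusing it would put the whole partition into one bucket.

```python
from collections import defaultdict

def hash_join(r, s, key_r, key_s, n_parts=4):
    """Grace hash join sketch. Phase 1 partitions both relations
    with h1; phase 2 builds an in-memory table on each s-partition
    using a DIFFERENT hash function (h2) and probes it with the
    matching r-partition."""
    h1 = lambda k: hash(('h1', k)) % n_parts
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:                       # partition r with h1
        r_parts[h1(key_r(t))].append(t)
    for t in s:                       # partition s with h1
        s_parts[h1(key_s(t))].append(t)
    out = []
    for i in range(n_parts):
        build = defaultdict(list)     # build input: s-partition i
        for t in s_parts[i]:
            build[key_s(t)].append(t) # dict hashing acts as h2
        for t in r_parts[i]:          # probe input: r-partition i
            for m in build.get(key_r(t), []):
                out.append((t, m))
    return out
```

Only matching partitions are ever compared, so each relation is read once per phase rather than repeatedly.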
- Cost of hash join:
- Partitioning: read and write both relations once each >> 2(b_r + b_s) transfers
- Step 3a: read each partition of s into memory, b_s transfers
- Step 3b: read each partition of r, b_r transfers >> 3(b_r + b_s)
- Worst case: the n_h partitions are underfull (partially filled blocks) >> + 4*n_h
- Total transfers = 3(b_r + b_s) + 4*n_h
- == Seeks (disk accesses)
- For step 1: ceil(b_r/b_b) + ceil(b_s/b_b)
- For step 2: double the above
- For step 3: 2*n_h
- Total seeks = 2(ceil(b_r/b_b) + ceil(b_s/b_b)) + 2*n_h
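Plugging numbers into the two formulas above (a worked example; the relation sizes are made up):

```python
from math import ceil

def hash_join_cost(br, bs, bb, nh):
    """Hash join cost (no recursive partitioning), per the notes:
    transfers = 3(b_r + b_s) + 4*n_h  (partition read/write plus
    possibly underfull partition blocks), and
    seeks = 2*(ceil(b_r/b_b) + ceil(b_s/b_b)) + 2*n_h."""
    transfers = 3 * (br + bs) + 4 * nh
    seeks = 2 * (ceil(br / bb) + ceil(bs / bb)) + 2 * nh
    return transfers, seeks

# e.g. b_r = 1000 blocks, b_s = 500 blocks, b_b = 20 buffer blocks, n_h = 10
print(hash_join_cost(br=1000, bs=500, bb=20, nh=10))  # -> (4540, 170)
```

Note the 3 in the transfer formula: each relation is written once during partitioning and read twice (once while partitioning, once while building/probing).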
- If you don't have to push it out of main memory, don't
- Conjunctive condition (ANDs)
- Disjunctive condition (ORs)
- If there is no index for every condition, is there any value in the approach?
- No; you need an index on each condition
Duplicate Elimination
Projection
- In SQL, does projection do duplicate elimination? No, you need DISTINCT; this is performance-driven
Aggregation
Set Operations
- Does SQL retain duplicates for set operations? No, it removes duplicates; you need the keyword ALL to keep them. This matches the set-theory definitions; it would not make sense to be different
INTERSECT and EXCEPT
Outer Join
Modifying Merge Join
- Can we use this for any join? No, only natural joins and equi-joins
Modify hash join
- Materialization - always applicable
- Pipelining
- Can we pipeline everything? Blocking operations are impossible to pipeline, such as a SUM aggregation (can't send results until done), hash join, merge join
- Reality is you get a mixture of blocking and non-blocking >> double buffering
- Demand-driven (lazy) vs. producer-driven (eager)
- Iterator: open > next > close > ...
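The demand-driven iterator interface can be sketched with two toy operators (class and method names here mirror the open/next/close convention from the notes; the in-memory list stands in for a table scan):

```python
class Scan:
    """Leaf iterator over an in-memory 'relation'."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.rows):
            return None                 # end of stream
        row = self.rows[self.i]
        self.i += 1
        return row
    def close(self): pass

class Select:
    """Non-blocking operator: pulls one tuple at a time on demand."""
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self): self.child.close()

plan = Select(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 0)
plan.open()
rows = []
while (t := plan.next()) is not None:
    rows.append(t)
plan.close()
print(rows)  # -> [2, 4]
```

A blocking operator (e.g. SUM) would have to drain its child entirely inside `open()` before `next()` could return anything, which is exactly why it breaks the pipeline.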
Height = tall vs. short (# of tuples)
- Shorter is better for memory and CPU
Width = fat vs. skinny (# of columns). Note: projecting columns away can eliminate index usability.
- Outer joins are not commutative
- A Cartesian join is a theta join without ...
- Cartesian joins are associative
- 6b) - ...
- Rule 6, heuristics: order your joins to keep temporary relations small
- Rules 1, 2, & 4, heuristics: push down your selections
- Rule 7a, combined with rule 1: push both conjunctive selections in
- Rule 8b: preserve the attributes you need to make the joins; this allows skinny temporary relations
- Always do selections before projections, as projections can destroy index usability and make future joins slow
- Set difference: use a Venn diagram to describe it
- Come up with a table of concerns for commutative, associative, and distributive properties
- Selection is unary
- Dynamic programming; DRY
- Going down a branch, if its cost is greater than another plan's, abort (prune)
- Leverage existing indexes if they are already there (left-deep)
- If a plan costs 1 ms, don't look any further
- What other statistics may be useful? Height of the B+ tree, number of leaf pages
- Problem with V(A, r): it counts distinct values, not how many of each value; thus histograms are useful
- The c calculation is essentially an average. For A >= v it would be (max(A,r) - v) over the same denominator
- Selectivity, disjunction: Prob. satisfied = 1 - Prob. none satisfied
- Slide 32: the formulas produce two different numbers when V(A,r) != V(A,s); therefore take the minimum of the two estimates
- Note a natural join projects out the attributes that are in both relations, so n_{r ⋈ s} implies a selection and projection...
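The "take the minimum" rule from the slide 32 note amounts to dividing by the larger V. A worked example (made-up statistics, hypothetical function name):

```python
def est_join_size(n_r, n_s, v_a_r, v_a_s):
    """Estimated size of r JOIN s on attribute A from catalog stats.
    The two formulas n_r*n_s/V(A,s) and n_r*n_s/V(A,r) can disagree
    when V(A,r) != V(A,s); taking the minimum of the two estimates
    means dividing by max(V(A,r), V(A,s))."""
    return n_r * n_s // max(v_a_r, v_a_s)

# n_r = 5000 tuples, n_s = 10000 tuples, V(A,r) = 40, V(A,s) = 50:
print(est_join_size(5000, 10000, 40, 50))  # -> 1000000
```

Dividing by the larger distinct-value count gives the smaller (more conservative) estimate of the two.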
- Slide 40: interesting sort orders; the type of join chosen can now improve an outer aggregation
- Slide 40: heuristics, what are some...
- Slide 44: an index is better for pipelining; a left-deep join tree is only better for a single core
- Slide 48: the theory of a budget; if the query is costly, increase the budget and take more time to optimize it
- Transactions from slides
- ACID is the main takeaway from chapter 14; the next few chapters revolve around it
- Cascadeless implies recoverable; recoverable does not imply cascadeless
- This reduces concurrency; however, how often do transactions conflict at the same time? Not very often
- Locking techniques: two-phase locking. Phase 1 (growing phase): don't release any locks. Problem: can introduce deadlock
- Shared locks (like a semaphore)
Lock Based Protocols
- Lock compatibility matrix: comp(S,S) = true; any combination involving X is false
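The matrix and the grant rule can be written out directly (a minimal sketch; `COMPAT` and `can_grant` are made-up names):

```python
# Lock compatibility matrix: COMPAT[(requested, held)].
# Only shared locks are mutually compatible.
COMPAT = {
    ('S', 'S'): True,
    ('S', 'X'): False,
    ('X', 'S'): False,
    ('X', 'X'): False,
}

def can_grant(requested, held_modes):
    """Grant a lock iff the requested mode is compatible with EVERY
    mode already held on the data item by other transactions."""
    return all(COMPAT[(requested, h)] for h in held_modes)

print(can_grant('S', ['S', 'S']))  # -> True  (readers share)
print(can_grant('X', ['S']))       # -> False (writer must wait)
```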
- Cascading rollback is possible with 2PL
- No notion of fairness with 2PL
- S2PL (strict 2PL) prevents non-recoverable schedules
- SS2PL, strong strict 2PL (also known as rigorous)
- SS2PL restricts concurrency even more
- Lock manager: what should be in the table? Item id, mode of lock, timestamp of the transaction, granted or not granted
- Deadlock prevention has a cost: reduced concurrency. E.g., predeclaration, graph-based protocols
- Overhead in a low-contention situation would be minimal with preemptive deadlock prevention
- Wait-die is overly pessimistic: just because an older T holds the lock doesn't mean there is a deadlock (no cycle)
- Wound-wait: the younger can wait on the older, but since waits only run from younger to older, a cycle cannot form
- Timeout-based schemes: how to set the timeout? Too short leads to unnecessary rollbacks; too long leads to unnecessary wait times. It needs to change dynamically. Not a good solution
- Prevention is only good if you expect a high level of contention and concurrency; otherwise detection will have less overhead
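The two timestamp-based prevention rules fit in a few lines (a sketch; smaller timestamp = older transaction, and the function names are made up):

```python
def wait_die(ts_requester, ts_holder):
    """Wait-die (non-preemptive): an OLDER requester (smaller
    timestamp) waits for the younger holder; a YOUNGER requester
    dies, i.e. rolls back and retries with its ORIGINAL timestamp."""
    return 'wait' if ts_requester < ts_holder else 'die'

def wound_wait(ts_requester, ts_holder):
    """Wound-wait (preemptive): an OLDER requester wounds (preempts)
    the younger holder; a YOUNGER requester waits."""
    return 'wound' if ts_requester < ts_holder else 'wait'

# Older T (ts=1) requests a lock held by younger T (ts=2), and vice versa:
print(wait_die(1, 2), wound_wait(1, 2))  # -> wait wound
print(wait_die(2, 1), wound_wait(2, 1))  # -> die wait
```

In both schemes waits only ever run in one age direction, which is what rules out wait-for cycles.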
- Timestamp-based protocols prevent deadlocks, but can have starvation
- Rolled-back transactions get a newer timestamp
- For reads, we only care about competing writes
- Blind writes
- Look at the aborts in the example. Why does T2 abort? Because T3 is newer and previously wrote to Z. Why does T3 abort? T4 has a more recent write on W
- Why does T1 succeed? It only does reads on X and Y, with no previous writes
- Why does T5 succeed? It reads from Z after the writes, and writes to Y and Z after the other transactions
- Talked about the table on slide 40 again; said you should know this. MAYBE ON EXAM
- Cascade freedom implies recoverability
- Thomas' write rule: Ti: read(Q) -> Tj: write(Q) -> Ti: write(Q). (The second op, Tj: write(Q), is a blind write; it doesn't read first)
- Talked about view serializability again. READ CHAPTER
- Conflict serializable => view serializable; however, view serializable does not imply conflict serializable (blind writes)
- Why is the timestamp protocol not used in practice?
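The write check with and without Thomas' rule can be sketched as follows (toy code; the dict-based item with `r_ts`/`w_ts` fields and the function name are made up for illustration):

```python
def timestamp_write(ts_t, item, thomas=True):
    """Timestamp-ordering write check sketch. item carries R-ts(Q)
    and W-ts(Q). A write arriving with ts < W-ts(Q) is an obsolete
    blind write: Thomas' write rule simply skips it instead of
    rolling the transaction back."""
    if ts_t < item['r_ts']:
        return 'rollback'   # a younger transaction already read Q
    if ts_t < item['w_ts']:
        # obsolete write: ignored under Thomas' rule
        return 'skip' if thomas else 'rollback'
    item['w_ts'] = ts_t     # write proceeds, advance W-ts(Q)
    return 'write'

q = {'r_ts': 5, 'w_ts': 10}
print(timestamp_write(7, q))                # -> skip (blind write ignored)
print(timestamp_write(7, q, thomas=False))  # -> rollback under plain TSO
```

Skipping the obsolete write is what lets Thomas' rule admit some view-serializable schedules that plain timestamp ordering would reject.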
- Optimistic concurrency control
- Improves concurrency if conflicts are low
- Used in practice? No; it is short-sighted when selecting items for reads. This is what multiversion addresses
- Read-optimized; writes are the victim
Missed Class
- Atomicity must be preserved, along with consistency
- Durability must be preserved: if something commits, it must persist
- Transaction failure: logical vs. system error
- System crash (failure)
- Must have enough fail-safe hardware to ensure non-volatile data is not lost
- Disk failure
- Checksums are used to check whether a sector is bad (detection)
- Protect from independent failures
- RAID and controlled updates
- For dependent failures, we need remote (offsite) backups
- Buffer manager
#### Stable Storage
| 1st block | 2nd block | Action |
|---|---|---|
| NC (no change) | NC | rewrite (log) |
| corrupt | NC | copy 2nd -> 1st, rewrite (log) |
| updated | NC | copy 1st -> 2nd |
| updated | corrupt | copy 1st -> 2nd |
| updated | updated | do nothing |
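The decision logic in the table above can be written out as a small function (a sketch; the state strings and the function name are made up):

```python
def stable_storage_recovery(first, second):
    """Recovery action for the two mirrored copies of a block,
    following the stable-storage table. States: 'nc' (no change),
    'updated', 'corrupt'. Writes always go 1st copy then 2nd copy,
    so the 1st copy tells us how far the write got."""
    if first == 'corrupt':
        return 'copy 2nd -> 1st, then rewrite'   # write died mid-1st
    if first == 'nc':
        return 'rewrite'                          # write never started
    if second in ('nc', 'corrupt'):
        return 'copy 1st -> 2nd'                  # 1st done, 2nd wasn't
    return 'do nothing'                           # both copies updated

print(stable_storage_recovery('updated', 'corrupt'))  # -> copy 1st -> 2nd
```

Because the two copies are never written simultaneously, at least one of them is always intact, which is the whole point of the mirroring.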
- Work areas are the thread stacks (slide 10)
- Undo: something in flight needs to be rolled back
- The commit log record needs to have enough detail for A, C, and D
- Committing the log entry does not mean outputting the data item. It does mean the transaction is committed
<Ti start>
|
# Update example
<Ti, X, V1, V2>
|
<Ti commit> OR <Ti abort>
# Redo-only log record
<Tj, X, V> *assuming V = old value (V1); also known as a compensation log record
- A checkpoint is a point in the log where you know everything before it was output; therefore you only have to repeat history from that record forward
- There could be transactions in flight at the time the checkpoint is created; therefore the checkpoint record carries a list of transactions in flight
- Performance-wise, it will be much faster
- Need to shrink your log file: keep back to the last checkpoint, plus the start of the oldest transaction in flight
- How to do a rollback: just write the old value back, or something similar
- Redo-only log record <Ti, X, V1>: written during rollback, after a crash or an abort when some business logic is violated
- Slide 21a: repeat history, roll back incomplete transactions
- Slide 21b: repeat history; T0 is removed from the rollback list; T1 gets rolled back: add a T1 redo-only log record, then <T1 abort>
- Slide 21c: repeat history; the rollback list is empty
- Note that we don't worry about the data getting to stable storage, because we are redoing history and it will be in the data buffer, which will eventually get output
- Why checkpoints? Performance: recovery time
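The repeat-history-then-rollback scheme from the slide 21 walkthrough can be sketched as follows (toy Python; `recover` and the tuple-based log record format are made up, and the dict stands in for the buffered database):

```python
def recover(log):
    """Recovery sketch. Phase 1 (redo): repeat history by replaying
    every update record in log order. Phase 2 (undo): transactions
    with a <start> but no <commit>/<abort> go on the rollback list;
    their updates are undone in reverse order, each emitting a
    redo-only (compensation) record, followed by an <abort>."""
    db, started, finished = {}, set(), set()
    for rec in log:                          # redo phase
        if rec[0] == 'start':
            started.add(rec[1])
        elif rec[0] == 'update':             # ('update', Ti, X, V_old, V_new)
            _, t, x, v_old, v_new = rec
            db[x] = v_new                    # repeat history
        elif rec[0] in ('commit', 'abort'):
            finished.add(rec[1])
    rollback_list = started - finished
    new_records = []
    for rec in reversed(log):                # undo phase
        if rec[0] == 'update' and rec[1] in rollback_list:
            _, t, x, v_old, v_new = rec
            db[x] = v_old                    # restore old value
            new_records.append(('redo-only', t, x, v_old))
    for t in rollback_list:
        new_records.append(('abort', t))
    return db, new_records

log = [('start', 'T0'), ('update', 'T0', 'A', 0, 10), ('commit', 'T0'),
       ('start', 'T1'), ('update', 'T1', 'B', 5, 50)]
db, recs = recover(log)
print(db)  # -> {'A': 10, 'B': 5}
```

T0 committed, so its write to A survives; T1 was in flight, so its write to B is undone and a compensation record plus abort are logged.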
- Why output a block of log records when it is full?
- When is the output of a log block blocked? When the previous log blocks have not been output. LATCHES are used
- LATCHES are used for big blocks, not smaller ones
- What happens if too much memory is statically reserved? Transactions suffer
- What happens if not enough? A performance hit, with lots of waiting for room
- Outputting the log is quick; outputting the data is slow. This allows a quick release of the hold on updates
- Client-server systems
- Interfaces: JDBC, ODBC
- Transaction servers, data servers
- Lock requests and approvals use predefined procedures across the processes (on the lock table, with a latch). The lock manager process performs deadlock detection and resolution
- Prefetching: page shipping versus item shipping
- Delay will be in processing and queueing, not in propagation
- IaaS (hardware; not as easily scalable as PaaS), PaaS, SaaS. Know the differences
- Parallel systems: coarse-grain = small # of CPUs; massively parallel / fine-grain = thousands of CPUs == throughput vs. speedup
- Scaleup: problem size; batch vs. transaction