# CPSC 424 Lecture Notes
- Need to know up to chapter 11.
- The bottleneck for traditional DBs is disk I/O
- B+ trees: wide fanout; keep nodes at least half full
- Primary index vs. secondary index
- Selection (σ in relational algebra) = SQL WHERE
- Look at the assigned reading on the back of the road map.
- SSDs are changing the traditional disk-I/O bottleneck: good for read-heavy workloads, not write-heavy ones.
Selections Using Indices:
- Why does low selectivity default to a linear scan?
- Review covering indexes: pros and cons.
- Overview query processing
- Selection operation
- Sorting
- Foundation building block
- If it fits in main memory, use quicksort
- If not, optimize the sort algorithm for I/O
- External sort means writing intermediate results to disk
- The book assumes a row is a block, to keep things simple
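The two-phase external sort idea above can be sketched in Python. This is a toy: in-memory lists stand in for the temporary run files on disk, and the names `external_sort` and `memory_size` are made up for illustration.

```python
import heapq

def external_sort(records, memory_size):
    """External merge sort sketch: sort runs that fit in 'memory',
    then k-way merge the runs. Here each run is an in-memory list
    standing in for a temporary file written to disk."""
    # Phase 1: produce sorted runs of at most memory_size records each.
    runs = [sorted(records[i:i + memory_size])
            for i in range(0, len(records), memory_size)]
    # Phase 2: k-way merge the runs with a min-heap
    # (conceptually one buffer page per run).
    return list(heapq.merge(*runs))

print(external_sort([5, 1, 9, 3, 7, 2, 8, 6, 4], memory_size=3))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With b_b buffer pages you can merge up to b_b − 1 runs per pass, which is why wide merges keep the number of passes low.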
- Nested-loop join: problem - inefficient; should iterate block-wise instead of by tuple
- In block NLJ you want the smaller relation to be the outer relation
- Indexed NLJ: want high selectivity on the outer relation?
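The block-wise iteration can be sketched as follows (a toy in Python; `block_nested_loop_join` is a made-up name, and Python lists stand in for relations on disk):

```python
def block_nested_loop_join(r, s, block_size, join_pred):
    """Block nested-loop join sketch: iterate the OUTER relation r
    block by block; for each outer block, scan the inner relation s
    once and pair tuples block-wise rather than tuple-by-tuple.
    Making the smaller relation the outer one minimizes inner scans."""
    out = []
    for i in range(0, len(r), block_size):   # one pass over r
        r_block = r[i:i + block_size]
        for s_tuple in s:                    # one scan of s per r-block
            for r_tuple in r_block:
                if join_pred(r_tuple, s_tuple):
                    out.append((r_tuple, s_tuple))
    return out

pairs = block_nested_loop_join([(1, 'a'), (2, 'b')], [(2, 'x')], 1,
                               lambda a, b: a[0] == b[0])
print(pairs)  # -> [((2, 'b'), (2, 'x'))]
```

The inner relation is scanned once per outer *block* instead of once per outer *tuple*, which is the whole saving over plain NLJ.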
- Assume both relations are sorted on the join attribute
- Options for sort, see previous sections
- There is also a hybrid merge-join: one relation has a sorted primary index and the other has a secondary index; get RIDs from the secondary index and use those for merging.
- A clustering index on a PK is always sorted
- If possible, use secondary indexes that are covering.
Look up secondary indexes
- Divide and conquer
- Split into buckets, each of which can fit into memory
- A bucket does not correspond to a block
- Hash join algorithm *
- Partition s and r
- For each i...
- Build input and probe input
- Why do we need a separate hash function in step 3a?
- Recursive partitioning
- With today's memory sizes, you don't really need it
- Handling of skew (bad hash function or lots of repeated values)
- Overflow resolution is reactive
- Overflow avoidance is proactive: start with small buckets and then merge them
- If too many tuples land in one bucket, fall back to block nested-loop join
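The partition/build/probe structure can be sketched like this (a toy; `hash_join` and `n_parts` are made-up names, and Python's built-in dict hashing plays the role of the second hash function h2). It also answers why step 3a needs a different hash function: h1 already made every tuple inside a partition collide, so reusing it would put the whole partition into one bucket.

```python
from collections import defaultdict

def hash_join(r, s, key_r, key_s, n_parts=4):
    """Grace hash join sketch. Phase 1 partitions both relations
    with h1; phase 2 builds an in-memory table on each s-partition
    using a DIFFERENT hash function (h2) and probes it with the
    matching r-partition."""
    h1 = lambda k: hash(('h1', k)) % n_parts
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:                       # partition r with h1
        r_parts[h1(key_r(t))].append(t)
    for t in s:                       # partition s with h1
        s_parts[h1(key_s(t))].append(t)
    out = []
    for i in range(n_parts):
        build = defaultdict(list)     # build input: s-partition i
        for t in s_parts[i]:
            build[key_s(t)].append(t) # dict hashing acts as h2
        for t in r_parts[i]:          # probe input: r-partition i
            for m in build.get(key_r(t), []):
                out.append((t, m))
    return out
```

Only matching partitions are ever compared, so each relation is read once per phase rather than repeatedly.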
- Cost of hash join:
- Partitioning: read and write both relations once each >> 2(b_r + b_s) transfers
- Step 3a: read each partition of s into memory, b_s transfers
- Step 3b: read each partition of r, b_r transfers >> 3(b_r + b_s)
- Worst case: the n_h partitions are underfull (partially filled blocks) >> + 4*n_h
- Total transfers = 3(b_r + b_s) + 4*n_h
- == Seeks (disk accesses)
- For step 1: ceil(b_r/b_b) + ceil(b_s/b_b)
- For step 2: double the above
- For step 3: 2*n_h
- Total seeks = 2(ceil(b_r/b_b) + ceil(b_s/b_b)) + 2*n_h
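Plugging numbers into the two formulas above (a worked example; the relation sizes are made up):

```python
from math import ceil

def hash_join_cost(br, bs, bb, nh):
    """Hash join cost (no recursive partitioning), per the notes:
    transfers = 3(b_r + b_s) + 4*n_h  (partition read/write plus
    possibly underfull partition blocks), and
    seeks = 2*(ceil(b_r/b_b) + ceil(b_s/b_b)) + 2*n_h."""
    transfers = 3 * (br + bs) + 4 * nh
    seeks = 2 * (ceil(br / bb) + ceil(bs / bb)) + 2 * nh
    return transfers, seeks

# e.g. b_r = 1000 blocks, b_s = 500 blocks, b_b = 20 buffer blocks, n_h = 10
print(hash_join_cost(br=1000, bs=500, bb=20, nh=10))  # -> (4540, 170)
```

Note the 3 in the transfer formula: each relation is written once during partitioning and read twice (once while partitioning, once while building/probing).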
- If you don't have to push it out of main memory, don't
- Conjunctive condition (ANDs)
- Disjunctive condition (ORs)
- If there is no index for every condition, is there any value in the approach?
- No; you need an index on each condition
Duplicate Elimination
Projection
- In SQL, does projection do duplicate elimination? No, you need DISTINCT; this is performance-driven
Aggregation
Set Operations
- Does SQL retain duplicates for set operations? No, it removes duplicates; you need the keyword ALL to keep them. This matches the set-theory definitions; it would not make sense to be different
INTERSECT and EXCEPT
Outer Join
Modifying Merge Join
- Can we use this for any join? No, only natural joins and equi-joins
Modify hash join
- Materialization - always applicable
- Pipelining
- Can we pipeline everything? Blocking operations are impossible to pipeline, such as a SUM aggregation (can't send results until done), hash join, merge join
- Reality is you get a mixture of blocking and non-blocking >> double buffering
- Demand-driven (lazy) vs. producer-driven (eager)
- Iterator: open > next > close > ...
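The demand-driven iterator interface can be sketched with two toy operators (class and method names here mirror the open/next/close convention from the notes; the in-memory list stands in for a table scan):

```python
class Scan:
    """Leaf iterator over an in-memory 'relation'."""
    def __init__(self, rows): self.rows = rows
    def open(self): self.i = 0
    def next(self):
        if self.i >= len(self.rows):
            return None                 # end of stream
        row = self.rows[self.i]
        self.i += 1
        return row
    def close(self): pass

class Select:
    """Non-blocking operator: pulls one tuple at a time on demand."""
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def open(self): self.child.open()
    def next(self):
        while (row := self.child.next()) is not None:
            if self.pred(row):
                return row
        return None
    def close(self): self.child.close()

plan = Select(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 0)
plan.open()
rows = []
while (t := plan.next()) is not None:
    rows.append(t)
plan.close()
print(rows)  # -> [2, 4]
```

A blocking operator (e.g. SUM) would have to drain its child entirely inside `open()` before `next()` could return anything, which is exactly why it breaks the pipeline.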
Height = tall vs. short (# of tuples)
- Shorter is better for memory and CPU
Width = fat vs. skinny (# of columns). Note: projecting columns away can eliminate index usability.
- Outer joins are not commutative
- A Cartesian join is a theta join without ...
- Cartesian joins are associative
- 6b) - ...
- Rule 6, heuristics: order your joins to keep temporary relations small
- Rules 1, 2, & 4, heuristics: push down your selections
- Rule 7a, combined with rule 1: push both conjunctive selections in
- Rule 8b: preserve the attributes you need to make the joins; this allows skinny temporary relations
- Always do selections before projections, as projections can destroy index usability and make future joins slow
- Set difference: use a Venn diagram to describe it
- Come up with a table of concerns for commutative, associative, and distributive properties
- Selection is unary
- Dynamic programming; DRY
- Going down a branch, if its cost is greater than another plan's, abort (prune)
- Leverage existing indexes if they are already there (left-deep)
- If a plan costs 1 ms, don't look any further
- What other statistics may be useful? Height of the B+ tree, number of leaf pages
- Problem with V(A, r): it counts distinct values, not how many of each value; thus histograms are useful
- The c calculation is essentially an average. For A >= v it would be (max(A,r) - v) over the same denominator
- Selectivity, disjunction: Prob. satisfied = 1 - Prob. none satisfied
- Slide 32: the formulas produce two different numbers when V(A,r) != V(A,s); therefore take the minimum of the two estimates
- Note a natural join projects out the attributes that are in both relations, so n_{r ⋈ s} implies a selection and projection...
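The "take the minimum" rule from the slide 32 note amounts to dividing by the larger V. A worked example (made-up statistics, hypothetical function name):

```python
def est_join_size(n_r, n_s, v_a_r, v_a_s):
    """Estimated size of r JOIN s on attribute A from catalog stats.
    The two formulas n_r*n_s/V(A,s) and n_r*n_s/V(A,r) can disagree
    when V(A,r) != V(A,s); taking the minimum of the two estimates
    means dividing by max(V(A,r), V(A,s))."""
    return n_r * n_s // max(v_a_r, v_a_s)

# n_r = 5000 tuples, n_s = 10000 tuples, V(A,r) = 40, V(A,s) = 50:
print(est_join_size(5000, 10000, 40, 50))  # -> 1000000
```

Dividing by the larger distinct-value count gives the smaller (more conservative) estimate of the two.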
- Slide 40: interesting sort orders; the type of join chosen can now improve an outer aggregation
- Slide 40: heuristics, what are some...
- Slide 44: an index is better for pipelining; a left-deep join tree is only better for a single core
- Slide 48: the theory of a budget; if the query is costly, increase the budget and take more time to optimize it
- Transactions from slides
- ACID is the main takeaway from chapter 14; the next few chapters revolve around it
- Cascadeless implies recoverable; recoverable does not imply cascadeless
- This reduces concurrency; however, how often do transactions conflict at the same time? Not very often
- Locking techniques: two-phase locking. Phase 1 (growing phase): don't release any locks. Problem: can introduce deadlock
- Shared locks (like a semaphore)
Lock Based Protocols
- Lock compatibility matrix: comp(S,S) = true; any combination involving X is false
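The matrix and the grant rule can be written out directly (a minimal sketch; `COMPAT` and `can_grant` are made-up names):

```python
# Lock compatibility matrix: COMPAT[(requested, held)].
# Only shared locks are mutually compatible.
COMPAT = {
    ('S', 'S'): True,
    ('S', 'X'): False,
    ('X', 'S'): False,
    ('X', 'X'): False,
}

def can_grant(requested, held_modes):
    """Grant a lock iff the requested mode is compatible with EVERY
    mode already held on the data item by other transactions."""
    return all(COMPAT[(requested, h)] for h in held_modes)

print(can_grant('S', ['S', 'S']))  # -> True  (readers share)
print(can_grant('X', ['S']))       # -> False (writer must wait)
```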
- Cascading rollback is possible with 2PL
- No notion of fairness with 2PL
- S2PL (strict 2PL) prevents non-recoverable schedules
- SS2PL, strong strict 2PL (also known as rigorous)
- SS2PL restricts concurrency even more
- Lock manager: what should be in the table? Item id, mode of lock, timestamp of the transaction, granted or not granted
- Deadlock prevention has a cost: reduced concurrency. E.g., predeclaration, graph-based protocols
- Overhead in a low-contention situation would be minimal with preemptive deadlock prevention
- Wait-die is overly pessimistic: just because an older T holds the lock doesn't mean there is a deadlock (no cycle)
- Wound-wait: the younger can wait on the older, but since waits only run from younger to older, a cycle cannot form
- Timeout-based schemes: how to set the timeout? Too short leads to unnecessary rollbacks; too long leads to unnecessary wait times. It needs to change dynamically. Not a good solution
- Prevention is only good if you expect a high level of contention and concurrency; otherwise detection will have less overhead
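The two timestamp-based prevention rules fit in a few lines (a sketch; smaller timestamp = older transaction, and the function names are made up):

```python
def wait_die(ts_requester, ts_holder):
    """Wait-die (non-preemptive): an OLDER requester (smaller
    timestamp) waits for the younger holder; a YOUNGER requester
    dies, i.e. rolls back and retries with its ORIGINAL timestamp."""
    return 'wait' if ts_requester < ts_holder else 'die'

def wound_wait(ts_requester, ts_holder):
    """Wound-wait (preemptive): an OLDER requester wounds (preempts)
    the younger holder; a YOUNGER requester waits."""
    return 'wound' if ts_requester < ts_holder else 'wait'

# Older T (ts=1) requests a lock held by younger T (ts=2), and vice versa:
print(wait_die(1, 2), wound_wait(1, 2))  # -> wait wound
print(wait_die(2, 1), wound_wait(2, 1))  # -> die wait
```

In both schemes waits only ever run in one age direction, which is what rules out wait-for cycles.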
- Timestamp-based protocols prevent deadlocks, but can have starvation
- Rolled-back transactions get a newer timestamp
- For reads, we only care about competing writes
- Blind writes
- Look at the aborts in the example. Why does T2 abort? Because T3 is newer and previously wrote to Z. Why does T3 abort? T4 has a more recent write on W
- Why does T1 succeed? It only does reads on X and Y, with no previous writes
- Why does T5 succeed? It reads from Z after the writes, and writes to Y and Z after the other transactions
- Talked about the table on slide 40 again; said you should know this. MAYBE ON EXAM
- Cascade freedom implies recoverability
- Thomas' write rule: Ti: read(Q) -> Tj: write(Q) -> Ti: write(Q). (The second op, Tj: write(Q), is a blind write; it doesn't read first)
- Talked about view serializability again. READ CHAPTER
- Conflict serializable => view serializable; however, view serializable does not imply conflict serializable (blind writes)
- Why is the timestamp protocol not used in practice?
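The write check with and without Thomas' rule can be sketched as follows (toy code; the dict-based item with `r_ts`/`w_ts` fields and the function name are made up for illustration):

```python
def timestamp_write(ts_t, item, thomas=True):
    """Timestamp-ordering write check sketch. item carries R-ts(Q)
    and W-ts(Q). A write arriving with ts < W-ts(Q) is an obsolete
    blind write: Thomas' write rule simply skips it instead of
    rolling the transaction back."""
    if ts_t < item['r_ts']:
        return 'rollback'   # a younger transaction already read Q
    if ts_t < item['w_ts']:
        # obsolete write: ignored under Thomas' rule
        return 'skip' if thomas else 'rollback'
    item['w_ts'] = ts_t     # write proceeds, advance W-ts(Q)
    return 'write'

q = {'r_ts': 5, 'w_ts': 10}
print(timestamp_write(7, q))                # -> skip (blind write ignored)
print(timestamp_write(7, q, thomas=False))  # -> rollback under plain TSO
```

Skipping the obsolete write is what lets Thomas' rule admit some view-serializable schedules that plain timestamp ordering would reject.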
- Optimistic concurrency control
- Improves concurrency if conflicts are low
- Used in practice? No; it is short-sighted when selecting items for reads. This is what multiversion addresses
- Read-optimized; writes are the victim
Missed Class
- Atomicity must be preserved, along with consistency
- Durability must be preserved: if something commits, it must persist
- Transaction failure: logical vs. system error
- System crash (failure)
- Must have enough fail-safe hardware to ensure non-volatile data is not lost
- Disk failure
- Checksums are used to check whether a sector is bad (detection)
- Protect from independent failures
- RAID and controlled updates
- For dependent failures, we need remote (offsite) backups
- Buffer manager
#### Stable Storage
| 1st block | 2nd block | Action |
|---|---|---|
| NC (no change) | NC | rewrite (log) |
| corrupt | NC | copy 2nd -> 1st, rewrite (log) |
| updated | NC | copy 1st -> 2nd |
| updated | corrupt | copy 1st -> 2nd |
| updated | updated | do nothing |
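The decision logic in the table above can be written out as a small function (a sketch; the state strings and the function name are made up):

```python
def stable_storage_recovery(first, second):
    """Recovery action for the two mirrored copies of a block,
    following the stable-storage table. States: 'nc' (no change),
    'updated', 'corrupt'. Writes always go 1st copy then 2nd copy,
    so the 1st copy tells us how far the write got."""
    if first == 'corrupt':
        return 'copy 2nd -> 1st, then rewrite'   # write died mid-1st
    if first == 'nc':
        return 'rewrite'                          # write never started
    if second in ('nc', 'corrupt'):
        return 'copy 1st -> 2nd'                  # 1st done, 2nd wasn't
    return 'do nothing'                           # both copies updated

print(stable_storage_recovery('updated', 'corrupt'))  # -> copy 1st -> 2nd
```

Because the two copies are never written simultaneously, at least one of them is always intact, which is the whole point of the mirroring.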
- Work areas are the thread stacks (slide 10)
- Undo: something in flight needs to be rolled back
- The commit log record needs to have enough detail for A, C, and D
- Committing the log entry does not mean outputting the data item. It does mean the transaction is committed
<Ti start>
|
# Update example
<Ti, X, V1, V2>
|
<Ti commit> OR <Ti abort>
# Redo-only log record
<Tj, X, V> *assuming V = old value (V1); also known as a compensation log record
- A checkpoint is a point in the log where you know everything before it was output; therefore you only have to repeat history from that record forward
- There could be transactions in flight at the time the checkpoint is created; therefore the checkpoint record carries a list of transactions in flight
- Performance-wise, it will be much faster
- Need to shrink your log file: keep back to the last checkpoint, plus the start of the oldest transaction in flight
- How to do a rollback: just write the old value back, or something similar
- Redo-only log record <Ti, X, V1>: written during rollback, after a crash or an abort when some business logic is violated
- Slide 21a: repeat history, roll back incomplete transactions
- Slide 21b: repeat history; T0 is removed from the rollback list; T1 gets rolled back: add a T1 redo-only log record, then <T1 abort>
- Slide 21c: repeat history; the rollback list is empty
- Note that we don't worry about the data getting to stable storage, because we are redoing history and it will be in the data buffer, which will eventually get output
- Why checkpoints? Performance: recovery time
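The repeat-history-then-rollback scheme from the slide 21 walkthrough can be sketched as follows (toy Python; `recover` and the tuple-based log record format are made up, and the dict stands in for the buffered database):

```python
def recover(log):
    """Recovery sketch. Phase 1 (redo): repeat history by replaying
    every update record in log order. Phase 2 (undo): transactions
    with a <start> but no <commit>/<abort> go on the rollback list;
    their updates are undone in reverse order, each emitting a
    redo-only (compensation) record, followed by an <abort>."""
    db, started, finished = {}, set(), set()
    for rec in log:                          # redo phase
        if rec[0] == 'start':
            started.add(rec[1])
        elif rec[0] == 'update':             # ('update', Ti, X, V_old, V_new)
            _, t, x, v_old, v_new = rec
            db[x] = v_new                    # repeat history
        elif rec[0] in ('commit', 'abort'):
            finished.add(rec[1])
    rollback_list = started - finished
    new_records = []
    for rec in reversed(log):                # undo phase
        if rec[0] == 'update' and rec[1] in rollback_list:
            _, t, x, v_old, v_new = rec
            db[x] = v_old                    # restore old value
            new_records.append(('redo-only', t, x, v_old))
    for t in rollback_list:
        new_records.append(('abort', t))
    return db, new_records

log = [('start', 'T0'), ('update', 'T0', 'A', 0, 10), ('commit', 'T0'),
       ('start', 'T1'), ('update', 'T1', 'B', 5, 50)]
db, recs = recover(log)
print(db)  # -> {'A': 10, 'B': 5}
```

T0 committed, so its write to A survives; T1 was in flight, so its write to B is undone and a compensation record plus abort are logged.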
- Why output a block of log records when it is full?
- When is the output of a log block blocked? When the previous log blocks have not been output. LATCHES are used
- LATCHES are used for big blocks, not smaller ones
- What happens if too much memory is statically reserved? Transactions suffer
- What happens if not enough? A performance hit, with lots of waiting for room
- Outputting the log is quick; outputting the data is slow. This allows a quick release of the hold on updates
- Client-server systems
- Interfaces: JDBC, ODBC
- Transaction servers, data servers
- Lock requests and approvals use predefined procedures across the processes (on the lock table, with a latch). The lock manager process performs deadlock detection and resolution
- Prefetching: page shipping versus item shipping
- Delay will be in processing and queueing, not in propagation
- IaaS (hardware; not as easily scalable as PaaS), PaaS, SaaS. Know the differences
- Parallel systems: coarse-grain = small # of CPUs; massively parallel / fine-grain = thousands of CPUs == throughput vs. speedup
- Scaleup: problem size; batch vs. transaction