Programmer: Lifelong Learning: Databases Notes

Dirty Reads - Designing Data-Intensive Web Applications
- see changes made by other database transaction which is not yet commited and mybe rollbacked later
Nonrepeatable Read
- a transaction reads the same query multiple times and results are not the same each time.
Phantom read
- one transcation executes same query multiple times, additional rows may appear or rows may disappear

Read uncommitted
Read committed - most basic
When reading from the database, you will only see data that has been committed (no dirty reads).
When writing to the database, you will only overwrite data that has been committed (no dirty writes).

to prevent dirty writes - using row-level locks
to prevent dirty reads
- db remembers the old committed value, return the old value while the write transaction has not committed
- also the overwritten-but-not-yet-committed version.

Repeatable Read
- only sees data committed before the transaction began
Snapshot isolation - multiversion concurrency control (MVCC),
- create a new version every time a value is changed instead of update in place
- good for long-running, read-only queries such as backups and analytics
- db keeps several different committed versions of an object
- data tagged with transaction ID of the writer
- a created by field, containing the ID of the transaction that inserted this row into the table.
- a deleted by field, the ID of the transaction that requested the deletion.
- a garbage collection process removes any rows marked for deletion if it's sure that on transaction need access the delelted data
- update is translated into a delete and a create
- transaction IDs are used to decide which objects a transaction can see, and which are invisible

Serializable - highest isolation level
- emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently

Non-correlated SubQuery
- In non correlated subquery, inner query doesn't depend on outer query and can run as stand alone query
- inner query executes before outer query
- mostly used IN or NOT IN
Correlated SubQuery
- Correlated subqueries are the one in which inner query or subquery reference outer query.
- Outer query needs to be executed before inner query
- slow, avoid it in favor of sql joins.
- mostly used exits and not exists

prepared statements
- better performance: db will parse, compile, and perform query optimization then store the optimized query plan
- prevent SQL injection attacks

1NF - all underlying domains contain atomic values only
2NF - every non key attribute is fully dependent on the primary key
3NF - every non key attribute is non-transitively dependent on the primary key

truncate vs delete - link
- truncate doesn't support where clause, delete does.
- rollback is possible with delete not with truncate
- truncate is fast delete is slow.
- truncate doesn't do logging delete logs on per row basis.
- delete acquires lock on table, truncate doesn't need lock
- truncate doesn't fire trigger, delete does.
- Don't delete, truncate it when it comes to purge tables.
- truncate reset identity column in table if any, delete doesn't.

Why Normalize?

1. Reduce Redundancy

I. Redundancy would waste more space.

One obvious drawback of data repetition is that it consumes more space and resources than is necessary.

II. Redundancy may cause inconsistency.

Moreover, redundancy introduces the possibility for error, and exposes us to inconsistencies whenever the data is not maintained in the same way. for example, these redundant data are not updated at same time.

2. Prevent insert, delete, update anomaly.

Insert Anomaly

The insert anomaly refers to a situation wherein one cannot insert a new row into a relation because of an artificial dependency on another relation.

Delete Anomaly

It refers to a situation wherein a deletion of data about one particular entity causes unintended loss of data that characterizes another entity.

Update Anomaly

The update anomaly refers to a situation where an update of a single data value requires multiple rows of data to be updated.

Example: Product, Customer, Invoice.

First Normal Form: Eliminating Repeating Data

A relation is said to be in first normal form when it contains no multivalued attributes.

To transform unnormalized relations into first normal form, we must move multivalued attributes and repeating groups to new relations

Second Normal Form: Eliminating Partial Dependencies

Arelationissaidtobein second normal formif itmeets both the following criteria:

• The relation is in first normal form.

• All non-key attributes are functionally dependent on the entire primary key.

Second normal form only applies to relations where we have concatenated primary keys.

Once we find a second normal form violation, the solution is to move the attribute(s) that is (are) partially dependent to a new relation where it depends on the entire key instead of part of the key.

Attribute B is functionally dependent on attribute A if at any moment in time, there is no more than one value of attribute B associated with a given value of attribute A.

In another word, it means A determines attribute B, or that A is a determinant (unique identifier) of attribute B.

In the INVOICE relation,we can easily see thatCustomerNumber is functionally dependent on Invoice Number because at any point in time, there can be only one value of Customer Number associated with a given value of Invoice Number.

Anti-example

INVOICE LINE ITEM: # Invoice Number, # Product Number, Product Description(only dependent on product number), Quantity, Unit Price, Extended Amount

Third Normal Form: Eliminating Transitive Dependencie

Arelationissaidtobein third normal formif itmeets both the following criteria:

• The relation is in second normal form.

• There is no transitive dependence (that is, all the non-key attributes depend only on the primary key).

To transforma second normal formrelation into third normal form, simplymove any transitively dependent attributes to relations where they depend only on the primary key. Be careful to leave the attribute on which they depend in the original relation as a foreign key.Youwill need it to reconstruct the original user viewvia a join.

An attribute that depends on another attribute that is not the primary key of the relation is said to be transitively dependent.

Anti-example

INVOICE: # Invoice Number, Customer Number, Customer Name(dependent on primary key attribute: Customer Number), Customer Address..., Terms, Ship Via, Order Date, Total Order Amount

The third normal form violation—a non-key attribute determining another non-key attribute

Boyce-Codd Normal Form - BCNF Form

Boyce-Codd normal form (BCNF) is a stronger version of third normal form. It addresses anomalies that occur when a non-key attribute is a determinant of an attribute that is part of the primary key (that is, when an attribute that is part of the primary key is functionally dependent on a non-key attribute).

The Boyce-Codd normal form has two requirements:

• The relation must be in third normal form.

• No determinants exist that are not either the primary key or a candidate key for the table. That is, a non-key attribute may not uniquely identify (determine) any other attribute, including one that participates in the primary key.

The solution is to split the unwanted determinant to a different table.

Anti-example

#Customer Number, #Product Line, Support Specialist Number

Fourth Normal Form

An additional anomaly surfaces when two or more multivalued attributes are included in the same relation.

Denormalization

Normalization leads tomore relations,which translates tomore tables and more joins.When database users suffer performance problems that cannot be resolved by other means, then denormalization may be required. Most database experts consider denormalization a last resort, if not an act of desperation.

Possible denormalization steps include the following:

• Recombining relations that were split to satisfy normalization rules

• Storing redundant data in tables

• Storing summarized data in tables

What Is a Transaction?

A transaction is a unit of work that is composed of a discrete series of actions that must be either completely processed or not processed at all.

Transactions have properties ACID (Atomicity, Consistency, Isolation, Durability).

• Atomicity

A transaction must remain whole. That is, it must completely succeed or completely fail. When it succeeds, all changes that were made by the transaction must be preserved by the system. Should a transaction fail, all changes that were made by it must be completely undone.

• Consistency

A transaction should transform the database from one consistent state to another consistent state.

• Isolation

Each transaction should carry out its work independent of any other transaction that might occur at the same time.

• Durability

Changes made by completed transactions should remain permanent

Read Committed Isolation Level

Read Committed is the default isolation level in PostgreSQL. When a transaction runs on this isolation level, a SELECT query sees only data committed before the query began; it never sees either uncommitted data or changes committed during query execution by concurrent transactions.

In effect, a SELECT query sees a snapshot of the database as of the instant that that query begins to run. Notice that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes during execution of the first SELECT.

Serializable Isolation Level

The level Serializable provides the strictest transaction isolation. This level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently.

In REPEATABLE READ, one transaction will not read committed change of another transaction.

set session transaction isolation level serializable;

Client 1:

mysql> start transaction;

mysql> select * from tt;

CLIENT 2:

mysql> update tt set d=2 where d=4;

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

This time, client 2 cannot update the row. After a few seconds of waiting, the attempt to update a row ends with the error message “Lock wait timeout exceeded”.

Both REPEATABLE READ and SERIALIZABLE don’t allow dirty reads, nonrepeatable reads and phantom reads to happen.

While REPEATABLE READ still allows another client to modify data, while he performs a transaction. The change made by concurrent transaction would be invisible to current transaction.

SERIALIZABLE strictly disallows other transaction change data current transaction operates on.

Serializable vs. Snapshot

They are SERIALIZABLE and SNAPSHOT. They are both made available in order to avoid dirty, non-repeatable or phantom reads, but they do so using different methods.

SERIALIZABLE

The SERIALIZABLE level prevents phantom reads by using range locks.

SERIALIZABLE transactions take range locks in order to prevent Phantom Reads.

SNAPSHOT

While SERIALIZABLE uses locks, instead SNAPSHOT uses a copy of committed data. Since no locks are taken, when subsequent changes are made by concurrent transactions, those changes are allowed and not blocked.

If you try to make a change to some data and that data has already been changed by concurrent transaction, you will get update conflict error message.

Types of Indexes

http://it.toolbox.com/blogs/programming-life/sql-performance-get-the-right-index-7996

http://blog.csdn.net/laodao1/archive/2009/10/21/4709516.aspx

http://zhidao.baidu.com/question/87078855.html

Non-unique indexes

A non-unique index is one in which any key value may occur multiple times. This type of index is defined with the keyword INDEX or KEY.

ALTER TABLE department ADD INDEX dept_name_idx (name);

Unique indexes

Unique indexes are indexes that help maintain data integrity by ensuring that no two rows of data in a table have identical key values, the exception is that NULL values may occur multiple times.

ALTER TABLE department ADD UNIQUE dept_name_idx (name);

A PRIMARY KEY also is a unique-valued index. It is similar to a UNIQUE index, but has additional restrictions:

A table may have multiple UNIQUE indexes, but at most one PRIMARY KEY.

A UNIQUE index can contain NULL values, whereas a PRIMARY KEY cannot.

A FULLTEXT index is specially designed for text searching.

SHOW INDEX FROM department \G

Index implementation

B-tree indexes

Branch nodes are used for navigating the tree, while leaf nodes hold the actual values and location information.

Bitmap indexes

Although B-tree indexes are great at handling columns that contain many different values, they can become unwieldy when built on a column that allows only a small number of values.

For columns that contain only a small number of values across a large number of rows (known as low-cardinality data), Oracle Database includes bitmap indexes, which generate a bitmap for each value stored in the column.

CREATE BITMAP INDEX acc_prod_idx ON account (product_cd);

Clustered Indexes vs Non-clustered Indexes

http://www.devtoolshed.com/content/clustered-index-vs-non-clustered-index-sql-server

Clustered Index Structures

Nonclustered Index Structures

Clustered Indexes

A clustered index determines the physical storage order of the data in a table.

A table can only have a single clustered index.

The leaf nodes of a clustered index contain the data pages of the underlying table.

By default, primary key creates a clustered index.

Query on clustered index is fast, as it only requires one lookup.

Insert is slower, as the insert must be added in the exact right place in the clustered index.

Non-clustered Index

Non-clustered indexes have the same B-tree structure as clustered indexes, except:

The data rows of the underlying table are not sorted and stored in order based on their non0clustered keys.

A table can contain multiple (up to 255) non-clustered index.

The leaf nodes of non-clustered Index do not consist of the data pages, only contain index pages:

If clustered index present, non-clustered index points back to the data pages in the clustered index.

If no clustered index, non-clustered indexes point to the actual data in the table.

Logical order of the index does not match the physical stored order of the rows on disk.

A unique key by default creates a non-clustered index.

A non-clustered Index has separate storage space different from the table storage.

This means that non-clustered indexes require a lot more hard disk space compared to clustered indexes.

Query on non-clustered indexes requires two lookups. First a lookup for the non-clustered index itself, then a lookup for the primary key.

Advantages and disadvantages of indexes

http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005052.htm

http://www.mysqlfaqs.net/mysql-faqs/Indexes/What-are-advantages-and-disadvantages-of-indexes-in-MySQL

The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan.

Advantages of Index

Query optimization: Indexes make search queries much faster, because normally Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.

Uniqueness: Indexes like primary key index and unique index help to avoid duplicate row data.

Disadvantages of indexes

Every time a row is added, removed or updated, all indexes on that table must be modified. Therefore, the more indexes you have the more work the server needs to do to keep all schema objects up-to-date, which tends to slow things down.

Each index requires storage or disk space. The exact amount depends on the size of the table and the size and number of columns in the index.

Each index potentially adds an alternative access path for a query for the optimizer to consider, which increases the compilation time.

The best strategy is to add an index when a clear need arises. If you need an index for only special purposes, such as a monthly maintenance routine, you can always add the index, run the routine, and then drop the index until you need it again.

Types of Constraints

Constraint is a rule which can not be violated by end users.

There are five types of constraints:

Not null constraints:-which does not allows NULL values.

Unique constraints:-which does not allow duplication but allows NULL values.

Primary key constraints:-the key which does not allow duplication and null values. One table can only have one primary key.

Foreign key constraints:-the key used to refer primary key defined field in another table and it allows duplication.

Check constraint

Restrict the allowable values for a column

Practice

CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)

REFERENCES product_type (product_type_cd),

CONSTRAINT pk_product PRIMARY KEY (product_cd)

ALTER TABLE product

ADD CONSTRAINT pk_product PRIMARY KEY (product_cd);

ALTER TABLE product

ADD CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)

REFERENCES product_type (product_type_cd);

Constraints and Indexes

Constraint creation sometimes involves the automatic generation of an index. However, database servers behave differently regarding the relationship between constraints and indexes.

Cascading Constraints

ALTER TABLE product

ADD CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)

REFERENCES product_type (product_type_cd)

ON UPDATE CASCADE

ON DELETE CASCADE;

View

http://en.wikipedia.org/wiki/View_%28database%29

http://manuals.sybase.com/onlinebooks/group-as/asg1250e/sqlug/@Generic__BookTextView/28457;pt=28420/*

http://wiki.answers.com/Q/What_are_the_advantages_and_disadvantages_of_views_in_a_database

A view consists of a stored query accessible as a virtual table composed of the result set of a query. Unlike ordinary tables (base tables), a view is a virtual table, does not form part of the physical schema: it is a dynamic, virtual table computed or collated from data in the database.

Advantages of View

Views can provide advantages over tables:

* Views can be used to make complex queries easy. A user can use a simple query on a view to display data from multiple tables without having the knowledge of how to join tables in queries.

* Views can simplify Statements for User, we can define frequently used joins, projections, and selections as views so that users do not have to specify all the conditions and qualifications each time an operation is performed on that data.

* Focus on the data that interests them and on the tasks for which they are responsible. Data that is not of interest to a user can be left out of the view.

* Different views can be created from the same data per the requirements, in this way we can display different data for different users.

* Provide additional level of security, through a view, users can query and modify only the data they can see. The rest of the database is neither visible nor accessible, so we can hide sensitive data from certain groups of users.

* Views take very little space to store; the database contains only the definition of a view, not a copy of all the data it presents.

Disadvantages of Views

* View affects performance, querying from view takes more time than directly querying from the table.

* View depends on the table objects, when table is dropped, the view becomes inactive.

* Rows available through a view are not sorted and are not ordered either, we can not use order by when define view.

Read-only vs. updatable views

We can define views as read-only or updatable.

Materialized views

Oracle, DB2 supports materialized views - pre-executed, non-virtual views commonly used in data warehousing.

In materialized view, the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out-of-date. The accuracy of a materialized view depends on the frequency or trigger mechanisms behind its updates.

It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive.

Stored Procedure

http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.dc.doc/dc/c_spbenefits.htm

http://searchsqlserver.techtarget.com/news/1052737/Why-use-stored-procedures

http://en.wikipedia.org/wiki/Stored_procedure

A stored procedure is a precompiled subroutine on in database. Stored procedures are used to consolidate and centralize logic that was originally implemented in applications. Large or complex processing that might require the execution of several SQL statements can be moved into stored procedures, and all applications call the procedures only.

Benefits of using stored procedures

Reduced network usage between clients and servers

The stored procedure performs processing on the database server, without transmitting unnecessary data across the network. Only the records that are actually required by the client application are transmitted. Using a stored procedure can result in reduced network usage and better overall performance.

The more SQL statements that you group together in a stored procedure, the more you reduce network usage and the time that database locks are held.

Stored procedures are tunable. By having procedures that handle the database work for your interface, you eliminate the need to modify application source code to improve a query's performance. Changes can be made to the stored procedures that are transparent to the front-end interface.

Stored procedures are usually written by database developers/administrators. Persons holding these roles are usually more experienced in writing efficient queries and SQL statements. If you have your people performing the tasks to which they are best suited, then you will ultimately produce a better overall application.

Improved security, by including database privileges with stored procedures that use static SQL, the database administrator (DBA) can improve security, because you can enable controlled access to sensitive data by appropriate selection of the privileges a program has when it executes.

Centralized security, administration, and maintenance for common routines, by managing shared logic in one place at the server, you can simplify security, administration, and maintenance.

Disadvantages:

We encapsulate business logic in stored procedure, but stored procedures are incredibly tightly coupled to the specific database, it would be hard to switch from one database to another database vendor.

Trigger

http://en.wikipedia.org/wiki/Database_trigger

http://www.allinterview.com/showanswers/20053.html

A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. Triggers are commonly used to prevent changes, log changes, audit changes, enhance changes, enforce or execute business rules.

Advantages of Triggers

The Main advantage of the trigger is automatic, whenever the table is affected by inserts update, delete, or query that time, the triggers will implicitly call.

Disadvantages of Triggers

Triggers run on every update made to the table therefore it adds more load to the database and cause the system to run slower.

It is not possible to track or debug triggers.

Viewing a trigger is difficult compared to tables, views stored procedures.

Triggers execution is invisible; it is easy to forget about triggers.

Difference between stored-procedure and trigger

They serve two different purposes. A procedure executes only when called. A trigger is 'triggered' by a system event and allows to intercede on insert, update, delete.

We can write a trigger that calls a procedure. We can't really write a procedure that 'calls' a trigger.

MySQL
Innodb
- Row level locking
- default port 3306
- /etc/init.d/mysql start

How does InnoDB behave without a Primary Key?
if a table is declared with no PRIMARY KEY and no non-nullable UNIQUE KEY, InnoDB will automatically add a 6-byte (48-bit) integer column called ROW_ID to the table, and cluster the data based on that column.

all tables using such ROW_ID columns share the same global sequence counter which is part of the data dictionary. The maximum used value for all row IDs (well, technically the next ID to be used) is stored in the system tablespace (e.g. ibdata1) in page 7 (type SYS), within the data dictionary header (field DICT_HDR_ROW_ID).

This global sequence counter is protected by dict_sys->mutex, even for incrementing (as opposed to using atomic increment).

AUTO_INCREMENT Handling in InnoDB
innodb_autoinc_lock_mode = 0 (“traditional” lock mode)
- Table Lock until the statement finishes
innodb_autoinc_lock_mode = 1 (“consecutive” lock mode) - default
- “Simple inserts” (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes.
innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)

SQL
IS NULL
SELECT e1.Name, e2.Name FROM Employee e1 INNER JOIN Employee e2 ON e1.ManagerID = e2.EmployeeID
DELETE e1 FROM EMPLOYEE e1, EMPLOYEE e2 WHERE e1.name = e2.name AND e1.id > e2.id;

finding the 2nd highest salary - link
SELECT MAX(Salary) FROM Employee WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee )
SELECT salary FROM (SELECT salary FROM Employee ORDER BY salary DESC LIMIT 2) AS emp ORDER BY salary LIMIT 1;

Find nth highest salary
SELECT Salary FROM Employee
ORDER BY Salary DESC LIMIT n-1,1

inefficeint
SELECT FROM Employee Emp1
WHERE (N-1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > Emp1.Salary)

Databases Notes

Labels