Interview Questions – Databases
Why Normalize?
1. Reduce Redundancy
I. Redundancy wastes space.
One obvious drawback of data repetition is that it
consumes more space and resources than necessary.
II. Redundancy may cause inconsistency.
Moreover, redundancy introduces the possibility of error
and exposes us to inconsistencies whenever the copies are not maintained in the
same way, for example when the redundant data are not all updated at the same time.
2. Prevent insert, delete, update anomaly.
Insert Anomaly
The insert anomaly refers to a situation wherein one
cannot insert a new row into a relation because of an artificial dependency on
another relation.
Delete Anomaly
It refers to a situation wherein a deletion of data about
one particular entity causes unintended loss of data that characterizes another
entity.
Update Anomaly
The update anomaly refers to a situation where an update
of a single data value requires multiple rows of data to be updated.
Example: Product, Customer, Invoice.
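To make the update anomaly concrete, here is a small sketch using Python's built-in `sqlite3`. The denormalized `invoice` table and its column names are illustrative, not from the text: because the customer's name is repeated on every invoice row, a name change must touch many rows, and missing one leaves the data inconsistent.

```python
import sqlite3

# Hypothetical denormalized INVOICE table: the customer's name is repeated
# on every invoice row, so changing it requires updating many rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoice (invoice_no INTEGER, customer_no INTEGER, customer_name TEXT)")
con.executemany("INSERT INTO invoice VALUES (?, ?, ?)",
                [(1, 100, 'Acme'), (2, 100, 'Acme'), (3, 200, 'Globex')])

# Update only one of Acme's rows -- the data is now inconsistent.
con.execute("UPDATE invoice SET customer_name = 'Acme Corp' WHERE invoice_no = 1")
names = {row[0] for row in con.execute(
    "SELECT DISTINCT customer_name FROM invoice WHERE customer_no = 100")}
# Two different names now exist for the same customer number.
```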
First Normal Form: Eliminating Repeating Data
A relation is said to be in first normal form when it
contains no multivalued attributes.
To transform unnormalized relations into first normal
form, we must move multivalued attributes and repeating groups to new relations.
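As a sketch of that transformation, again using `sqlite3` with illustrative table and column names (a customer with multiple phone numbers is an assumed example, not one from the text): the repeating group moves to its own relation keyed by the parent's key plus the repeating value.

```python
import sqlite3

# A CUSTOMER row with a multivalued "phones" column would violate 1NF.
# Moving the repeating group into its own relation restores 1NF.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE customer_phone (
                   customer_no INTEGER REFERENCES customer(customer_no),
                   phone TEXT,
                   PRIMARY KEY (customer_no, phone))""")
con.execute("INSERT INTO customer VALUES (1, 'Acme')")
con.executemany("INSERT INTO customer_phone VALUES (?, ?)",
                [(1, '555-0100'), (1, '555-0101')])
# Each phone number is now a single-valued attribute in its own row.
phones = [r[0] for r in con.execute(
    "SELECT phone FROM customer_phone WHERE customer_no = 1 ORDER BY phone")]
```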
Second Normal Form: Eliminating Partial Dependencies
A relation is said to be in second normal form if it meets both
of the following criteria:
• The relation is in first normal form.
• All non-key attributes are functionally dependent on the entire primary key.
Second normal form only applies to relations where we have
concatenated primary keys.
Once we find a second normal form violation, the solution
is to move the attribute(s) that is (are) partially dependent to a new relation
where it depends on the entire key instead of part of the key.
Attribute B is functionally dependent on attribute A if at
any moment in time, there is no more than one value of attribute B associated
with a given value of attribute A.
In other words, A determines attribute B, or
A is a determinant (unique identifier) of attribute B.
In the INVOICE relation, we can easily see that Customer Number
is functionally dependent on Invoice Number because at any point in time, there
can be only one value of Customer Number associated with a given value of
Invoice Number.
Anti-example
INVOICE LINE ITEM: # Invoice Number, # Product Number,
Product Description (dependent only on Product Number), Quantity, Unit Price,
Extended Amount
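The 2NF fix for this anti-example can be sketched as follows (using `sqlite3`; the exact column names are illustrative): the partially dependent attribute, Product Description, depends only on Product Number, so it moves to its own PRODUCT relation and is recovered with a join.

```python
import sqlite3

# Product Description depends only on Product Number, not on the full
# {Invoice Number, Product Number} key, so it moves to PRODUCT.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (product_no INTEGER PRIMARY KEY, description TEXT)")
con.execute("""CREATE TABLE invoice_line_item (
                   invoice_no INTEGER,
                   product_no INTEGER REFERENCES product(product_no),
                   quantity INTEGER, unit_price REAL,
                   PRIMARY KEY (invoice_no, product_no))""")
con.execute("INSERT INTO product VALUES (10, 'Widget')")
con.execute("INSERT INTO invoice_line_item VALUES (1, 10, 3, 2.50)")
# The description is stored once and reconstructed via a join.
row = con.execute("""SELECT li.quantity, p.description
                     FROM invoice_line_item li
                     JOIN product p ON p.product_no = li.product_no""").fetchone()
```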
Third Normal Form: Eliminating Transitive Dependencies
A relation is said to be in third normal form if it meets both of the
following criteria:
• The relation is in second normal form.
• There is no transitive dependence (that is, all the non-key attributes depend only on the primary key).
To transform a second normal form relation into third normal
form, simply move any transitively dependent attributes to relations where they
depend only on the primary key. Be careful to leave the attribute on which they
depend in the original relation as a foreign key. You will need it to reconstruct
the original user view via a join.
An attribute that depends on another attribute that is not
the primary key of the relation is said to be transitively dependent.
Anti-example
INVOICE: # Invoice Number, Customer Number, Customer
Name (dependent on the non-key attribute Customer Number), Customer Address...,
Terms, Ship Via, Order Date, Total Order Amount
The third normal form violation—a non-key attribute
determining another non-key attribute
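The 3NF fix can be sketched like this (again with `sqlite3` and illustrative columns): the transitively dependent customer attributes move to a CUSTOMER relation, while Customer Number stays behind in INVOICE as a foreign key so the original view can be rebuilt with a join.

```python
import sqlite3

# Customer Name depends on Customer Number, a non-key attribute of INVOICE,
# so it moves to CUSTOMER; the determinant remains as a foreign key.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE invoice (
                   invoice_no INTEGER PRIMARY KEY,
                   customer_no INTEGER REFERENCES customer(customer_no),
                   order_date TEXT)""")
con.execute("INSERT INTO customer VALUES (100, 'Acme')")
con.execute("INSERT INTO invoice VALUES (1, 100, '2024-01-15')")
# Reconstruct the original user view with a join on the foreign key.
row = con.execute("""SELECT i.invoice_no, c.name
                     FROM invoice i JOIN customer c
                       ON c.customer_no = i.customer_no""").fetchone()
```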
Boyce-Codd Normal Form - BCNF Form
Boyce-Codd normal form (BCNF) is a stronger version of
third normal form. It addresses anomalies that occur when a non-key attribute
is a determinant of an attribute that is part of the primary key (that is, when
an attribute that is part of the primary key is functionally dependent on a
non-key attribute).
The Boyce-Codd normal form has two requirements:
• The relation must be in third normal form.
• No determinants exist that are not either the primary key or a candidate key for the
table. That is, a non-key attribute may not uniquely identify (determine) any other
attribute, including one that participates in the primary key.
The solution is to move the unwanted determinant, together with
the attribute(s) it determines, to a different table.
Anti-example
#Customer Number, #Product Line, Support Specialist Number
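A sketch of the BCNF fix for this anti-example, assuming (as the anti-example implies) that Support Specialist Number determines Product Line. The determinant gets its own relation, and the original relation keeps only Customer Number and Support Specialist Number. Table names and sample values are illustrative.

```python
import sqlite3

# The determinant (specialist -> product line) moves to its own relation.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE specialist (
                   specialist_no INTEGER PRIMARY KEY,
                   product_line TEXT)""")
con.execute("""CREATE TABLE customer_specialist (
                   customer_no INTEGER,
                   specialist_no INTEGER REFERENCES specialist(specialist_no),
                   PRIMARY KEY (customer_no, specialist_no))""")
con.execute("INSERT INTO specialist VALUES (7, 'Checking')")
con.execute("INSERT INTO customer_specialist VALUES (100, 7)")
# The product line a customer is supported on is recovered via a join.
line = con.execute("""SELECT s.product_line FROM customer_specialist cs
                      JOIN specialist s ON s.specialist_no = cs.specialist_no
                      WHERE cs.customer_no = 100""").fetchone()[0]
```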
Fourth Normal Form
An additional anomaly surfaces when two or more independent
multivalued attributes are included in the same relation; the remedy is to place
each multivalued attribute in a relation of its own.
Denormalization
Normalization leads to more relations, which translates
to more tables and more joins. When database users suffer performance problems
that cannot be resolved by other means, then denormalization may be required.
Most database experts consider denormalization a last resort, if not an act of
desperation.
Possible denormalization steps include the following:
• Recombining relations that were split to satisfy normalization rules
• Storing redundant data in tables
• Storing summarized data in tables
What Is a Transaction?
A transaction is a unit of work that is composed of a
discrete series of actions that must be either completely processed or not
processed at all.
Transactions have properties ACID (Atomicity, Consistency,
Isolation, Durability).
• Atomicity
A transaction must remain whole. That is, it must
completely succeed or completely fail. When it succeeds, all changes that were
made by the transaction must be preserved by the system. Should a transaction fail,
all changes that were made by it must be completely undone.
• Consistency
A transaction should transform the database from one consistent
state to another consistent state.
• Isolation
Each transaction should carry out its work independent of
any other transaction that might occur at the same time.
• Durability
Changes made by completed transactions must remain permanent, even in the event of a system failure.
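Atomicity can be demonstrated with Python's built-in `sqlite3` (the account table and the simulated failure are illustrative): either both halves of the transfer commit, or the rollback undoes the partial work.

```python
import sqlite3

# Atomicity sketch: all statements in the transaction succeed together,
# or ROLLBACK undoes them together.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 0)])
con.commit()

try:
    con.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    # Simulate a crash before the matching credit is applied:
    raise RuntimeError("simulated failure")
except RuntimeError:
    con.rollback()  # the debit is undone; the database stays consistent

balances = [r[0] for r in con.execute("SELECT balance FROM account ORDER BY id")]
# Both balances are unchanged -- the half-finished transfer left no trace.
```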
Read Committed Isolation Level
Read Committed is the default isolation level in
PostgreSQL. When a transaction runs on this isolation level, a SELECT query
sees only data committed before the query began; it never sees either
uncommitted data or changes committed during query execution by concurrent
transactions.
In effect, a SELECT query sees a snapshot of the database
as of the instant the query begins to run. Notice that two successive
SELECT commands can see different data, even though they are within a single
transaction, if other transactions commit changes during execution of the first
SELECT.
The level Serializable provides the strictest transaction
isolation. This level emulates serial transaction execution, as if transactions
had been executed one after another, serially, rather than concurrently.
Under REPEATABLE READ, a transaction does not see changes
committed by other transactions after its snapshot is taken; repeated reads
within the transaction return the same data.
set session transaction isolation level serializable;
Client 1:
mysql> start transaction;
mysql> select * from tt;
CLIENT 2:
mysql> update tt set d=2 where d=4;
ERROR 1205 (HY000): Lock wait timeout exceeded; try
restarting transaction
This time, client 2 cannot update the row. After a few
seconds of waiting, the attempt to update a row ends with the error message
“Lock wait timeout exceeded”.
Neither REPEATABLE READ nor SERIALIZABLE allows dirty
reads, non-repeatable reads, or phantom reads.
REPEATABLE READ still allows another client to
modify data while the current transaction runs; the changes made by the concurrent
transaction are simply invisible to the current transaction.
SERIALIZABLE strictly disallows other transactions from changing
data the current transaction operates on.
If the table has no primary key (and no usable index on d), a query such as
SELECT * FROM tt WHERE d BETWEEN 1 AND 4 locks all rows, because the whole
table must be scanned.
Serializable vs. Snapshot
These are SERIALIZABLE and SNAPSHOT (the latter offered by SQL Server). Both are
made available in order to avoid dirty, non-repeatable, or phantom reads, but they
do so using different methods.
SERIALIZABLE
The SERIALIZABLE level prevents phantom reads by taking range locks, which
block concurrent transactions from inserting rows into the ranges a
transaction has read.
SNAPSHOT
While SERIALIZABLE uses locks, SNAPSHOT instead uses a
copy of committed data. Since no locks are taken, subsequent changes
made by concurrent transactions are allowed rather than blocked.
If you try to change data that has already been changed by a concurrent
transaction, you will get an update conflict error.
Non-unique indexes
A non-unique index is one in which any key value may occur
multiple times. This type of index is defined with the keyword INDEX or KEY.
ALTER TABLE department ADD INDEX dept_name_idx (name);
Unique indexes
Unique indexes are indexes that help maintain data
integrity by ensuring that no two rows of data in a table have identical key
values; the exception is that NULL values may occur multiple times.
ALTER TABLE department ADD UNIQUE dept_name_idx (name);
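Both behaviors (duplicates rejected, NULLs repeatable) can be checked with a quick `sqlite3` sketch; SQLite, like MySQL and PostgreSQL, allows multiple NULLs under a unique index. The `department` table mirrors the example above.

```python
import sqlite3

# A unique index rejects duplicate key values but permits multiple NULLs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE UNIQUE INDEX dept_name_idx ON department (name)")
con.execute("INSERT INTO department (name) VALUES ('Loans')")
con.execute("INSERT INTO department (name) VALUES (NULL)")
con.execute("INSERT INTO department (name) VALUES (NULL)")  # NULLs may repeat
try:
    con.execute("INSERT INTO department (name) VALUES ('Loans')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # duplicate non-NULL key refused
null_count = con.execute(
    "SELECT COUNT(*) FROM department WHERE name IS NULL").fetchone()[0]
```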
A PRIMARY KEY also is a unique-valued
index. It is similar to a UNIQUE index, but has additional restrictions:
A table may have multiple UNIQUE indexes, but at most one
PRIMARY KEY.
A UNIQUE index can contain NULL values, whereas a PRIMARY
KEY cannot.
A FULLTEXT index is specially designed for
text searching.
SHOW INDEX FROM department \G
Index implementation
B-tree indexes
Branch nodes are used for navigating the tree, while leaf
nodes hold the actual values and location information.
Bitmap indexes
Although B-tree indexes are great at handling columns that
contain many different values, they can become unwieldy when built on a column
that allows only a small number of values.
For columns that contain only a small number of values
across a large number of rows (known as low-cardinality data), Oracle Database
includes bitmap indexes, which generate a bitmap for each value stored in the
column.
CREATE BITMAP INDEX acc_prod_idx ON account (product_cd);
Clustered Indexes
A clustered index determines the physical storage order of
the data in a table.
A table can only have a single clustered index.
The leaf nodes of a clustered index contain the data pages
of the underlying table.
By default, primary key creates a clustered index.
Query on clustered index is fast, as it only requires one
lookup.
Insert is slower, as the insert must be added in the exact
right place in the clustered index.
Non-clustered Index
Non-clustered indexes have the same B-tree structure as
clustered indexes, except:
The data rows of the underlying table are not sorted and
stored in order based on their non-clustered keys.
A table can contain multiple non-clustered indexes (the exact limit
varies by database product and version).
The leaf nodes of a non-clustered index do not contain
the data pages; they contain only index pages:
If a clustered index is present, the non-clustered index points
back to the rows via the clustered index.
If there is no clustered index, non-clustered indexes point to the
actual data rows in the table (the heap).
The logical order of the index does not match the physical
order of the rows on disk.
A unique key by default creates a non-clustered index.
A non-clustered index has its own storage space, separate
from the table's storage.
This means non-clustered indexes require extra disk space, whereas a
clustered index is stored with the table data itself.
A query on a non-clustered index requires two lookups: first
a lookup in the non-clustered index itself, then a lookup via the clustering
key (or row locator) to fetch the actual row.
Advantages and disadvantages of indexes
The optimizer chooses an
index scan if the index columns are referenced in the SELECT statement and if
the optimizer estimates that an index scan will be faster than a table scan.
Advantages of indexes
Query optimization:
Indexes make search queries much faster, because index files
are generally smaller and require less time to read than an entire table,
particularly as tables grow larger. In addition, the entire index may not need
to be scanned: the predicates applied to the index reduce the number of rows to
be read from the data pages.
Uniqueness: Indexes like
primary key index and unique index help to avoid duplicate row data.
Disadvantages of indexes
Every time a row is
added, removed or updated, all indexes on that table must be modified.
Therefore, the more indexes you have the more work the server needs to do to
keep all schema objects up-to-date, which tends to slow things down.
Each index requires
storage or disk space. The exact amount depends on the size of the table and
the size and number of columns in the index.
Each index potentially
adds an alternative access path for a query for the optimizer to consider,
which increases the compilation time.
The best strategy is to
add an index when a clear need arises. If you need an index for only special
purposes, such as a monthly maintenance routine, you can always add the index,
run the routine, and then drop the index until you need it again.
Constraints
A constraint is a rule that end users cannot violate.
There are five types of constraints:
Not null constraint: does not allow NULL values.
Unique constraint: does not allow duplicate values but allows NULLs.
Primary key constraint: does not allow duplicates or NULL values; one
table can have only one primary key.
Foreign key constraint: refers to a primary key (or unique) column in
another table; duplicate values are allowed.
Check constraint: restricts the allowable values for a column.
Practice
CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)
    REFERENCES product_type (product_type_cd),
CONSTRAINT pk_product PRIMARY KEY (product_cd)

ALTER TABLE product
    ADD CONSTRAINT pk_product PRIMARY KEY (product_cd);

ALTER TABLE product
    ADD CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)
    REFERENCES product_type (product_type_cd);
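The effect of the foreign key above can be demonstrated with `sqlite3` (note that SQLite only enforces foreign keys once the `foreign_keys` pragma is switched on; the sample values are illustrative):

```python
import sqlite3

# Foreign key enforcement sketch: a child row must reference an existing parent.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checks
con.execute("CREATE TABLE product_type (product_type_cd TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE product (
                   product_cd TEXT PRIMARY KEY,
                   product_type_cd TEXT,
                   CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)
                       REFERENCES product_type (product_type_cd))""")
con.execute("INSERT INTO product_type VALUES ('ACCOUNT')")
con.execute("INSERT INTO product VALUES ('CHK', 'ACCOUNT')")  # parent exists: OK
try:
    con.execute("INSERT INTO product VALUES ('SAV', 'MISSING')")  # no parent
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True  # the orphan row is rejected
```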
Constraints and Indexes
Constraint creation sometimes involves the automatic
generation of an index. However, database servers behave differently regarding
the relationship between constraints and indexes.
Cascading Constraints
ALTER TABLE product
    ADD CONSTRAINT fk_product_type_cd FOREIGN KEY (product_type_cd)
    REFERENCES product_type (product_type_cd)
    ON UPDATE CASCADE
    ON DELETE CASCADE;
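A runnable sketch of the cascade, using `sqlite3` with illustrative data: deleting the parent `product_type` row automatically removes the dependent `product` rows.

```python
import sqlite3

# ON DELETE CASCADE: deleting a parent row removes its child rows.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK checks
con.execute("CREATE TABLE product_type (product_type_cd TEXT PRIMARY KEY)")
con.execute("""CREATE TABLE product (
                   product_cd TEXT PRIMARY KEY,
                   product_type_cd TEXT REFERENCES product_type (product_type_cd)
                       ON UPDATE CASCADE ON DELETE CASCADE)""")
con.execute("INSERT INTO product_type VALUES ('ACCOUNT')")
con.executemany("INSERT INTO product VALUES (?, 'ACCOUNT')", [('CHK',), ('SAV',)])
con.execute("DELETE FROM product_type WHERE product_type_cd = 'ACCOUNT'")
remaining = con.execute("SELECT COUNT(*) FROM product").fetchone()[0]
# Both child rows were deleted along with the parent.
```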
View
A view consists of a stored query accessible as a virtual table
composed of the result set of that query. Unlike ordinary (base) tables, a view
does not form part of the physical schema: it is a dynamic, virtual table
computed or collated from data in the database.
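Because only the query is stored, the view reflects the base table's current contents on every read. A quick `sqlite3` sketch (table and view names are illustrative):

```python
import sqlite3

# A view stores only its defining query; selecting from it recomputes the
# result from the base table each time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (account_id INTEGER, product_cd TEXT, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?, ?)",
                [(1, 'CHK', 100.0), (2, 'CHK', 50.0), (3, 'SAV', 25.0)])
con.execute("""CREATE VIEW checking_accounts AS
               SELECT account_id, balance FROM account WHERE product_cd = 'CHK'""")
before = con.execute("SELECT COUNT(*) FROM checking_accounts").fetchone()[0]
con.execute("INSERT INTO account VALUES (4, 'CHK', 75.0)")
# The view immediately reflects the new base-table row.
after = con.execute("SELECT COUNT(*) FROM checking_accounts").fetchone()[0]
```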
Advantages of Views
Views can provide advantages over tables:
* Views can be used to make complex queries easy. A user can run a simple query
against a view to display data from multiple tables without knowing how to
join tables in queries.
* Views can simplify statements for users: we can define frequently used joins,
projections, and selections as views so that users do not have to specify all
the conditions and qualifications each time an operation is performed on that
data.
* Views let users focus on the data that interests them and on the tasks for
which they are responsible. Data that is not of interest to a user can be left
out of the view.
* Different views can be created from the same data according to requirements;
in this way we can present different data to different users.
* Views provide an additional level of security: through a view, users can
query and modify only the data they can see. The rest of the database is
neither visible nor accessible, so we can hide sensitive data from certain
groups of users.
* Views take
very little space to store; the database contains only the definition of a
view, not a copy of all the data it presents.
Disadvantages of Views
* Views affect performance: querying through a view can take more time than
querying the underlying table directly.
* A view depends on its underlying table objects: if a table is dropped, the
view becomes invalid.
* Rows retrieved through a view are not guaranteed to be ordered; in many
database systems an ORDER BY clause cannot be used (or is not honored) in a
view definition.
Read-only vs. updatable views
We can define views as
read-only or updatable.
Oracle and DB2 support materialized views: pre-executed,
non-virtual views commonly used in data warehousing.
In a materialized view, the query result is cached as a concrete table that
may be updated from the original base tables from time to time. This enables
much more efficient access, at the cost of the data being potentially out of
date. The freshness of a materialized view depends on the frequency or trigger
mechanisms behind its updates.
It is most useful in data
warehousing scenarios, where frequent queries of the actual base tables can be
extremely expensive.
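SQLite has no materialized views, but the idea can be simulated with a summary table that is refreshed on demand (a sketch, with illustrative names): reads against the summary are cheap, and the data is stale until the next refresh.

```python
import sqlite3

# Simulated materialized view: a summary table refreshed on demand.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (region TEXT, amount REAL)")
con.executemany("INSERT INTO sale VALUES (?, ?)", [('east', 10.0), ('east', 5.0)])
con.execute("CREATE TABLE sale_summary (region TEXT PRIMARY KEY, total REAL)")

def refresh_summary():
    # Rebuild the cached aggregate from the base table.
    con.execute("DELETE FROM sale_summary")
    con.execute("""INSERT INTO sale_summary
                   SELECT region, SUM(amount) FROM sale GROUP BY region""")

refresh_summary()
stale = con.execute("SELECT total FROM sale_summary WHERE region='east'").fetchone()[0]
con.execute("INSERT INTO sale VALUES ('east', 20.0)")  # summary now out of date
refresh_summary()
fresh = con.execute("SELECT total FROM sale_summary WHERE region='east'").fetchone()[0]
```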
Stored Procedure
A stored procedure is a precompiled subroutine stored in the database. Stored
procedures are used to consolidate and centralize logic that was originally
implemented in applications. Large or complex processing that might require
the execution of several SQL statements can be moved into stored procedures,
and all applications then simply call the procedures.
Benefits of using stored procedures
Reduced network usage between clients and servers
The stored procedure
performs processing on the database server, without transmitting unnecessary
data across the network. Only the records that are actually required by the
client application are transmitted. Using a stored procedure can result in
reduced network usage and better overall performance.
The more SQL statements that you group together in a
stored procedure, the more you reduce network usage and the time that database
locks are held.
Stored
procedures are tunable. By having procedures that handle the database work for
your interface, you eliminate the need to modify application source code to
improve a query's performance. Changes can be made to the stored procedures
that are transparent to the front-end interface.
Stored
procedures are usually written by database developers/administrators. Persons
holding these roles are usually more experienced in writing efficient queries
and SQL statements. If you have your people performing the tasks to which
they are best suited, then you will ultimately produce a better overall
application.
Improved security: by including database privileges with stored procedures
that use static SQL, the database administrator (DBA) can improve security,
because controlled access to sensitive data can be enabled through appropriate
selection of the privileges a program has when it executes.
Centralized security,
administration, and maintenance for common routines, by managing shared logic
in one place at the server, you can simplify security, administration, and
maintenance.
Disadvantages:
Stored procedures encapsulate business logic, but they are tightly coupled
to the specific database, making it hard to switch from one database vendor
to another.
Trigger
A database trigger is
procedural code that is automatically executed in response to certain events on
a particular table or view in a database. Triggers are commonly used to prevent
changes, log changes, audit changes, enhance changes, enforce or execute business
rules.
Advantages of Triggers
The main advantage of triggers is that they are automatic: whenever the table
is affected by an INSERT, UPDATE, or DELETE, the trigger is invoked implicitly.
Disadvantages of Triggers
Triggers run on every qualifying change made to the table; this adds load to
the database and can cause the system to run more slowly.
Triggers are difficult to trace and debug.
Viewing a trigger's definition is harder than viewing tables, views, or
stored procedures.
Triggers execution is invisible; it is easy to forget
about triggers.
They serve two different purposes: a procedure executes only when called,
while a trigger is 'triggered' by a system event and lets you intercede on
INSERT, UPDATE, or DELETE.
We can write a trigger
that calls a procedure. We can't really write a procedure that 'calls' a
trigger.
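SQLite supports triggers directly, so the implicit-firing behavior can be demonstrated with `sqlite3` (the audit-log schema is an illustrative example): no application code calls the trigger; the UPDATE alone causes the audit row to appear.

```python
import sqlite3

# An AFTER UPDATE trigger fires implicitly and writes an audit row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.execute("CREATE TABLE audit_log (account_id INTEGER, old_balance REAL, new_balance REAL)")
con.execute("""CREATE TRIGGER trg_account_update
               AFTER UPDATE OF balance ON account
               BEGIN
                   INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
               END""")
con.execute("INSERT INTO account VALUES (1, 100.0)")
con.execute("UPDATE account SET balance = 80.0 WHERE id = 1")
# The trigger recorded the change without any explicit call.
log = con.execute("SELECT * FROM audit_log").fetchone()
```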