Purpose
|
Define the detailed physical design of the database.
|
The physical database design includes model elements (such as tables, views, and stored procedures) that
represent the detailed physical structure of the database and model elements (such as schemas and
tablespaces) that represent the underlying data storage design of the database. Collectively, these
model elements comprise the Physical Data Model of the database. This Physical Data Model is
contained in the Work Product: Data Model and is not a separate model
work product.
The detailed steps for developing the physical database design are as follows:
Purpose
|
To define reusable user-defined types.
|
Domains might be used by the database designer to enforce type standards throughout the database design.
Domains are user-defined data types that can be applied to a column in a table. Domains have the
properties of a column without the name.
Purpose
|
Create the initial database tables and relationships.
|
The database designer models the Physical Data Model elements using tables and columns in tables, as
described in Guideline: Data Model.
If a Logical Data Model has been created, then its logical entities can be used as the basis for an initial
set of tables.
Alternatively, the database designer might jump-start the Physical Data Model by using the persistent
classes in the Design Model as a starting point for tables in the Physical Data Model. The database
designer models the persistent classes and their attributes as tables and columns respectively.
The database designer also needs to define the relationships between the tables based on the associations
between the persistent classes in the Design Model. A description of how the Design Model elements
and relationships map to Data Model elements and relationships is provided in Work Product Guideline: Forward Engineering Relational Databases.
If you are starting the model from persistent classes rather than from a normalized Logical Data Model,
then you will generally need to apply some normalization in order to eliminate data redundancies and
non-key field dependencies. See Concept: Normalization for more information on database normalization.
Purpose
|
To define standard reference tables used across the project.
|
Often there are standard look-up tables, validation tables, or reference tables used throughout the
project. Since the data in these tables tends be frequently accessed but seldom-changing, that data is
worth special consideration. In the Design Model, these tables might contain standard product codes, state
or province codes, postal or zip codes, tax tables, area code validation tables, or other frequently
accessed information. In financial systems, these tables might contain lists of policy codes, insurance
policy rating categories, or conversion rates. Look in the Design Model for classes that are primarily
read-only, providing validation information for a large number of clients.
If the reference table is small, do not bother to index it, since indexing might actually add additional
overhead for small tables. A small, frequently accessed table also tends to remain in memory, because
caching algorithms often keep frequently accessed tables in the data cache.
If possible, make sure that the database cache is large enough to keep all reference tables in memory,
along with normal "working set space" for queries and transactions. Often the secret to increasing database
performance is reducing disk I/O.
Once the reference table structures are defined, determine a strategy for populating the reference tables.
Since these tables are accessed near the beginning of the project, determining the reference values and
loading the tables often need to occur relatively early during application runtime. While the database
designer is not responsible for obtaining the data, he or she is responsible for determining how and when
the reference tables will be refreshed.
Purpose
|
To define the one or more columns that uniquely identify a row in the table.
To define constraints on columns that guarantee the uniqueness of the data or collection of
data.
|
A primary key is one or more columns that uniquely identify rows in a table. A table has a single primary
key. There is often a "natural" key that can be used to uniquely identify a row of data (for example, the
postal code in a reference table). The primary key should not contain data that might change with the
business environment. If the "natural" key is a value that can change (for example a person's name), then
it is recommended that the database designer create a single non-meaningful, non-user-entered column when
creating a primary key. This creates a data structure that has greater adaptability to changes in the
business structure, rules, or environment.
The use of a non-meaningful, non-user-entered column as the primary key is an essential concept in
designing a data warehouse. Transactional systems often choose a "natural" primary key that might be
subject to minimal change over a non-meaningful, non-user-entered column.
A unique constraint designates that the data in the column or collection of columns is unique per row. If
the unique constraint is on a column, the data in a specific row in the specified column must be unique
from the data in a different row in the same column.
When a unique constraint is defined for a group of columns, the uniqueness is based on the collective whole
of the data in the columns that make up that unique constraint. The data in a specific row in a specific
column does not have to be unique from the data in a different row in the same column. The database
designer uses the unique constraint to ensure uniqueness of business data.
Purpose
|
To ensure the integrity of the database.
|
Data integrity rules, also known as constraints, ensure that data values lie within defined ranges. Where
these ranges can be identified, the database can enforce them. (This is not to say that data validation
should not be done in the application, but only that the database can serve as a "validator of last resort"
in the event that the application does not work correctly.) Where data validation rules exist, the database
constraints must be designed to enforce them.
A foreign key is one or more columns in a table that map to the primary key in another table. One table
might have many foreign keys, and each foreign key is a map to a different table. This mapping, or
relationship, between the tables is often referred to as a parent-child relationship. The child table
contains the foreign key, which maps to the primary key in the parent table.
The definition of foreign key constraints is also often used by the query optimizer to accelerate query
performance. In many cases, the foreign key enforcement rules use reference tables.
Purpose
|
To optimize the database data structures for performance.
|
In the case of a relational Data Model, the initial mapping generally yields a simple class-to-table
mapping. If objects from different classes need to be retrieved at the same time, the RDBMS uses an
operation called a "table join" to retrieve the rows related to the objects of interest. For frequently
accessed data, join operations can be computationally expensive. To eliminate the cost of the join, a
standard relational technique called "de-normalization" is often employed.
De-normalization combines columns from two or more different tables into the same table, effectively
pre-joining the information. De-normalization reflects a tradeoff between more-expensive update operations
in favor of less-expensive retrieval operations. This technique also reduces the performance of the system
in queries that are interested only in the attributes of one of the objects that are effectively joined in
the de-normalized table, since all attributes are normally retrieved on every query. For cases in which the
application normally wants all attributes, there can be a significant performance improvement.
De-normalizing more than two tables is rare and increases the cost of inserts and updates as well as the
cost of non-join queries. Limiting de-normalization to two tables is a good policy unless strong and
convincing evidence can be produced regarding the benefits.
De-normalization can be inferred from the design classes in cases in which classes are nested.
Nested classes can be mapped to a de-normalized table.
Some object databases allow a concept similar to de-normalization, in which related objects are clustered
together on disk and retrieved in single operations. The concept in use is similar: Reduce object retrieval
time by reducing the work the system must do in order to retrieve related objects from the database.
In some cases, optimizing the Data Model can unmask problems in the Design Model, including performance bottlenecks, poor
modeling, or incomplete designs. In this event, discuss the problems with the Designer of the class, triggering change requests where
appropriate.
Purpose
|
To provide for efficient data access using indexing.
To provide for efficient data access using database views.
|
Once the table structure has been designed, you must determine the types of queries that will be performed
against the data. Indexing is used by the database to speed access. Indexing is most effective when the
data values in the column being indexed are relatively distinct.
Consider the following indexing principles:
-
The primary key column of the table must always be indexed. Primary key columns are used frequently as
search keys and for join operations.
-
Tables smaller than 100 rows in size with only a few columns benefit little from indexing. Small tables
generally fit easily in the database cache.
-
Indexes should also be defined for frequently executed queries or for queries that must retrieve data
quickly (generally, any searches done while a person might be waiting). An index should be defined for
each set of attributes that are used together as search criteria. For example, if the system needs the
ability to find all Orders on which a particular product is ordered, an index on the Line Item table on
the product number column would be necessary.
-
Indexes should generally be defined only on columns used as identifiers, not on numeric values, such as
account balances or textual information such as order comments. Identifier column values tend to be
assigned when the object is created and then remain unchanged for the life of the object.
-
Indexes on simple numbers (integer and number data types) are much simpler and faster than indexes on
strings. Given the large data volumes processed on a query or a large join, small savings add up
quickly. Indexes on numeric columns tend to take significantly less space than indexes on
characters.
On the down side, the use of indexes is not free; the more indexes on a table, the longer inserts and
updates take to process. When contemplating the use of indexes, bear in mind the following precautions:
-
Do not index just to speed up an infrequently executed query, unless that query occurs at a critical
point, making maximum speed essential.
-
In some systems, update and insertion performance is more important than query performance. A common
example is in factory data acquisition applications in which quality data is captured in real time. In
these systems, only occasional online queries are executed, and most of the data is analyzed
periodically by batch reporting applications that perform statistical analysis on it. For
data-acquisition systems, remove all indexes to achieve maximum throughput. If indexes are needed, they
can be rebuilt just before the batch reporting and analysis applications run, then dropped when the
reporting and analysis is complete.
-
Always remember that indexes have hidden costs. For example, they take time to update (a tax paid on
every insert, update, or delete) and occupy disk space. Be sure you get value from using them.
Many databases offer a choice of index types. The most common include:
-
B-tree indexes-The most frequently used kind are based on balanced b-tree index data structures.
They are useful when the index key values are randomly distributed and tend to have wide variability.
They tend to perform poorly, however, when data being indexed is already in sequential order.
-
Hashed indexes-Less frequently, index key values are hashed. Hashing offers better performance
when the range of index key values is known, relatively unchanging, and unique. This technique relies
upon the use of the key value to calculate the address of the data of interest. Because of the need for
predictability, hash indexes tend to be useful only for medium-sized lookup tables that change very
infrequently.
Your choice of indexing strategy and timing of index creation can have a large impact on performance. Bulk
data loads should be performed without indexes (this can be achieved by dropping the index, loading
the data, and then re-creating the index). The reason for this is that the index structure is re-balanced
as each row is added. Since subsequent rows will change the optimal index structure, the work done
re-balancing the index as each row is inserted is largely wasted. It is faster and more efficient to load
data without indexes, then re-create the index when the data load is done. Some databases provide bulk
data-loaders to do this automatically.
Another strategy for optimizing database access performance is the use of views. Database views are virtual
tables that have no independent storage of their own. To the calling program (or user), however, a view
behaves like a table. A view supports retrieval of data, and it can be used to update data as
well-depending on the database structure and database vendor. The view contains data from one or more
tables that can be accessed through a single select statement. The performance gain occurs during the
selection of data, especially in frequently queried tables. The data is retrieved from a single
location-the view-instead of by searching the multiple or large tables that exist in the database.
Views also play a significant role in database security. A view containing parts of a table can restrict
access to sensitive data contained in the base table.
Purpose
|
To design the space allocation and disk page organization of the database.
|
A database designer uses tablespaces to represent the amount of storage space that is allocated to tables,
indexes, stored procedures, and so forth. One or more tablespaces are mapped to a database. The database
designer must analyze the tables in the Data Model to determine how to distribute them, along with other
support database elements, across the storage space in the database.
In determining the tablespace structures for the database, bear in mind that databases do not perform I/O
on rows, records, or even whole tables. Instead they perform I/O on disk blocks. The reason for this is
simple: Block I/O operations are usually optimized in the software and hardware on the system. As a result,
the physical organization of the tables and indexes in the database can have a dramatic impact on the
performance of the system.
When planning the space allocation and disk page organization of the database, consider the following
factors:
-
the density of information in the disk pages
-
the location of disk pages on disk or across multiple disks
-
the amount of disk space to allocate to the table
These factors are discussed in the sections that follow.
Disk Page Density
The density of disk pages depends on the extent to which data is expected to change over time.
Basically, a less-dense page is more capable of accepting changes in values or the addition of data
over time, while a fuller data page provides better read performance, since more data is retrieved per
block read.
To simplify disk management, the database designer can group tables by the extent to which they tend to
change. The following three groups constitute a good beginning for this type of organization:
-
highly dynamic tables
-
somewhat dynamic tables
-
mostly static tables
The highly dynamic tables should be mapped onto disk pages that have a great deal of empty space in
them (perhaps 30%); the somewhat dynamic tables should be mapped onto disk pages that have less empty
space (perhaps 15%); and the mostly static should be mapped onto disk pages that have very little empty
space (perhaps 5%). The indexes for the tables must be similarly mapped.
Disk Page Location
After the groups of tables are mapped, the database designer must determine where to put the disk
pages. The goal here is to try to balance the workload across a number of different drives and heads to
reduce or eliminate bottlenecks. Consider the following guidelines:
-
Never put data on the same disk as the operating system, its temporary files, or the swap devices.
These drives are busy enough without the addition of further workload to them.
-
Put data that is accessed simultaneously on different drives in order to balance the workload. Some
systems support parallel I/O channels. If this is the case, put the data on different channels.
-
Put the indexes on a different drive from the data that it indexes in order to spread out the
workload.
-
Refer to the database vendor's documentation for guidelines.
-
The type of storage used (for example, RAID-5, RAID-10, SAN, NAS, and channel attached) affects
database performance. Make use of the performance guidelines provided by the storage provider.
Database I/O is generally the limiting factor in database performance. I/O balancing is an iterative,
experimental process. By prototyping database access performance during the elaboration phase, coupled
with appropriate instrumentation to monitor physical and logical I/O, you can uncover performance
problems early while there is still time to adjust the database design.
Disk Space Allocation
Using the characteristics of the persistence design mechanism, estimate the number of objects that must
be stored. The amount of disk space required to store the objects varies from RDBMS to RDBMS. When
calculating disk space, make sure to account for growth due to additions of data. To estimate the
disk space for a database, first estimate the disk space required for each table, and then calculate
the space requirements for all tables. Consult the database administrator manual for the specific
RDBMS product to determine the precise size estimation formula. Here are some general steps for
estimating the space requirements for a table:
-
Calculate average row size. This calculation should include any control information at the
record level, as well as any control information required for variable-length columns.
-
Calculate the number of rows that will fit into a page or block of I/O. Because most databases
store only complete records on a page or I/O block, this should be the integer number of rows that
will fit into a page or block of I/O.
-
Calculate the number of pages or I/O blocks required to store the estimated number of records in
the database. The estimated number of records must include any load factors.
-
Multiply the number of pages or I/O blocks required by the size of the page or I/O block.
-
Add any overhead for additional indexes.
-
Add any fixed overhead for the table.
Once the table space requirements have been defined:
-
Compute the sum of the space required by the tables.
-
Add in any required fixed amount of space for database management.
-
Add in disk space required for the transaction log and audit trail.
In a frequently updated environment, the retention requirements for the audit trail require significant
amounts of storage. The documentation for major commercial database management systems usually provides
detailed sizing instructions. Be sure to refer to these instructions when calculating your estimates of
the database disk space requirements.
Purpose
|
To determine if the stored procedures or triggers should be used to implement data access class
operations.
|
Most databases support a stored procedure capability. A stored procedure is executable code that runs
within the process space of the database management system. It provides the ability to perform
database-related actions on the server without having to transfer data across a network. The judicious use
of stored procedures can the improve performance of the system.
Stored procedures are usually one of these two types: actual procedures or triggers. Procedures are
executed explicitly by an application, generally have parameters, and provide an explicit return value.
Triggers, on the other hand, are invoked implicitly when some database event occurs (for example, insert a
row, update a row, or delete a row), have no parameters other than the row being modified (since they are
invoked implicitly), and do not provide an explicit return value.
In database systems that lack constraints, triggers are often used to enforce referential and data
integrity. Otherwise, they tend to be used when an event needs to trigger (or cause) another event.
Triggers are also frequently used for security purposes by auditing the trigger event.
The design classes in the Design Model must be examined to see if they have operations that should be
implemented using the stored procedure or trigger facility. Candidates include:
-
any operations that primarily deal with persistent data (creating, updating, retrieving, or deleting
it).
-
any operations in which a query is involved in a computation (such as calculating the average quantity
and value of a product in inventory).
-
operations that must access the database in order to validate data.
Remember that improving database performance usually means reducing I/O. Therefore, if performing a
computation on the DBMS server will reduce the amount of data passed over the network, the computation
should probably be performed on the server.
Work with the designer of the design class to discuss how the database can be used to improve performance.
The designer will update the operation method to indicate whether one or more stored procedures can be used
to implement the operation.
|