XML repositories
With the proliferation of XML as a common data format, managing XML documents
has become more critical. New technologies now allow organizations to manage
their information as XML documents more effectively. In this TechMail, we'll
examine the technology of XML repositories and how they help drive the future
of extensible shared data.
Overview
An XML repository is a system for storing and retrieving XML data. This data is
usually in the form of XML documents and their associated Document Type
Definitions (DTDs) or XML Schemas. Because XML data lends itself to a
hierarchical structure rather than a relational one, it can be difficult to
store in traditional relational database systems. The repository itself may be
built on a relational database system, but it is more likely a custom storage
system built specifically for XML (or hierarchical) data.
The method used to store the data varies with the specific system. The process
for storing and retrieving data varies as well: documents can be stored and
retrieved through a key-based indexing system, a query-based retrieval system,
or both.
Finally, XML repositories may use a variety of access methods. Some systems use
a proprietary API based on COM, CORBA, or Enterprise JavaBeans (EJB), while
others use the more open ODBC standard. Most repositories can also be accessed
over a network.
Storing XML data
Storing XML data consists of two different tasks. One task is adding a new XML
document to the repository. The other is updating an existing document.
Removing a document from the repository is considered a special case of
updating an existing document.
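To make the two tasks concrete, here is a minimal sketch of an in-memory
repository, written in Python purely for illustration. The class name
InMemoryXmlRepository and its methods are hypothetical and do not correspond
to any particular product's API. Removal is modeled as an update that passes
None, mirroring the idea that removing is a special case of updating.

```python
import xml.etree.ElementTree as ET


class InMemoryXmlRepository:
    """Toy repository (hypothetical): maps a document key to XML text."""

    def __init__(self):
        self._documents = {}

    def add(self, key, xml_text):
        # Adding: the key must not already be in use.
        if key in self._documents:
            raise KeyError(f"document {key!r} already exists")
        ET.fromstring(xml_text)  # raises ParseError if not well-formed
        self._documents[key] = xml_text

    def update(self, key, xml_text):
        # Updating: the key must already exist. Passing None removes the
        # document, treating removal as a special case of updating.
        if key not in self._documents:
            raise KeyError(f"document {key!r} not found")
        if xml_text is None:
            del self._documents[key]
            return
        ET.fromstring(xml_text)
        self._documents[key] = xml_text

    def get(self, key):
        # Returns None when no document is stored under the key.
        return self._documents.get(key)


repo = InMemoryXmlRepository()
repo.add("order-1", "<order><item>pen</item></order>")
repo.update("order-1", "<order><item>pencil</item></order>")
print(repo.get("order-1"))
repo.update("order-1", None)  # removal expressed as an update
print(repo.get("order-1"))
```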
Because XML data is not based on a traditional relational model, implementing
XML repositories on traditional relational databases can be complex and
cumbersome. For example, in a straightforward mapping, each level of the XML
hierarchy requires its own relational table. As your XML documents become more
complex, so does your relational database.
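The following sketch illustrates this mapping with Python's built-in sqlite3
module. The order document, the table names (orders, items), and the
shred_order helper are all assumptions made up for this example. Note that a
document with just two levels already needs two tables and a foreign key to
link them.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document with two hierarchy levels.
ORDER_XML = """
<order id="1001">
  <customer>Acme</customer>
  <item sku="A1"><qty>2</qty></item>
  <item sku="B7"><qty>5</qty></item>
</order>
""".strip()


def shred_order(conn, xml_text):
    """Split one <order> document across one table per hierarchy level."""
    order = ET.fromstring(xml_text)
    order_id = int(order.get("id"))
    conn.execute(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        (order_id, order.findtext("customer")),
    )
    # Each nested <item> becomes a row in a second table that points
    # back to its parent order.
    for item in order.findall("item"):
        conn.execute(
            "INSERT INTO items (order_id, sku, qty) VALUES (?, ?, ?)",
            (order_id, item.get("sku"), int(item.findtext("qty"))),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE items ("
    "order_id INTEGER REFERENCES orders(id), sku TEXT, qty INTEGER)"
)
shred_order(conn, ORDER_XML)
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Adding a third level, such as options nested under each item, would require a
third table and another foreign key.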
Storage systems built around a hierarchical model accept XML data more easily,
and they do so natively rather than through an adaptation of a relational
model. Hierarchical systems also give the added benefit of allowing the use of
XQL and XPath expressions for accessing whole and partial documents.
Retrieving XML data
The method used to retrieve XML documents is related to the storage method. For
relational systems, retrieval is usually through SQL or stored procedures.
These methods have the disadvantage of accessing and returning data as a
relational result set rather than as a hierarchical XML structure.
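As an illustration of this drawback, the sketch below queries a small SQLite
database and then has to rebuild the hierarchy by hand. The schema, the sample
rows, and the rows_to_order helper are assumptions invented for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (order_id INTEGER, sku TEXT, qty INTEGER);
    INSERT INTO orders VALUES (1001, 'Acme');
    INSERT INTO items VALUES (1001, 'A1', 2);
    INSERT INTO items VALUES (1001, 'B7', 5);
""")

# The join returns one flat row per item; the parent's columns repeat.
rows = conn.execute(
    "SELECT o.id, o.customer, i.sku, i.qty "
    "FROM orders o JOIN items i ON i.order_id = o.id "
    "WHERE o.id = ? ORDER BY i.sku",
    (1001,),
).fetchall()


def rows_to_order(rows):
    """Rebuild the hierarchical <order> element from flat result rows."""
    order_id, customer = rows[0][0], rows[0][1]
    order = ET.Element("order", id=str(order_id))
    ET.SubElement(order, "customer").text = customer
    for _, _, sku, qty in rows:
        item = ET.SubElement(order, "item", sku=sku)
        ET.SubElement(item, "qty").text = str(qty)
    return order


order_xml = ET.tostring(rows_to_order(rows), encoding="unicode")
print(rows)
print(order_xml)
```

The reassembly step is extra application code that a hierarchical store would
not need.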
Hierarchical systems usually provide an XQL or XPath method for accessing XML
data. These query languages more closely match the kinds of queries made
against XML data. They also return the data in a hierarchical format.
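A short sketch of path-based retrieval, using Python's xml.etree.ElementTree
module, which supports only a limited subset of XPath. The catalog document is
invented for this example, and a real repository would expose its own query
interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
catalog = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>XML Basics</title><price>20</price></book>"
    "<book id='b2'><title>Data Stores</title><price>35</price></book>"
    "</catalog>"
)

# Partial retrieval: every <title> element anywhere under the root.
titles = [t.text for t in catalog.findall(".//title")]

# Partial retrieval with a predicate: the <book> whose id attribute is 'b2'.
book = catalog.find("./book[@id='b2']")

# The result is still a hierarchy, so it serializes back to XML directly.
fragment = ET.tostring(book, encoding="unicode")

print(titles)
print(fragment)
```

Unlike a SQL result set, the matched book keeps its nested title and price
elements, so no reassembly step is needed.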
Indexing XML data
When storing data in relational systems, an external primary key may be
attached to each XML document to identify it. The data storage and retrieval
process uses these keys to determine which document is being stored or
retrieved. More advanced systems extract the primary key from an XML element or
attribute within the document itself.
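Extracting a key from the document itself might look like the following
sketch. The extract_key function, its key_path and attribute parameters, and
the invoice document are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def extract_key(xml_text, key_path, attribute=None):
    """Return a document key taken from an element's text or attribute.

    key_path and attribute are hypothetical configuration values that
    tell the store where the key lives inside each document.
    """
    root = ET.fromstring(xml_text)
    node = root if key_path == "." else root.find(key_path)
    if node is None:
        raise ValueError(f"no element matches {key_path!r}")
    value = node.get(attribute) if attribute else node.text
    if not value:
        raise ValueError("document has no usable key")
    return value.strip()


store = {}
doc = "<invoice number='INV-42'><customer><id>C-7</id></customer></invoice>"

# Key taken from an attribute on the root element.
store[extract_key(doc, ".", attribute="number")] = doc
print(list(store))

# Key taken from the text of a nested element instead.
print(extract_key(doc, "customer/id"))
```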
Indexes on data stored in relational tables are based on a single table (that
is, a single hierarchy level). Hierarchical systems also let you use an element
or attribute as the primary key, and in addition let you create indexes at
different levels based on data within the hierarchy.
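The idea of indexing at different levels can be sketched as follows. The
build_index helper and the sample documents are assumptions for illustration.
One index is built from a top-level element and another from an attribute one
level further down.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical stored documents, keyed by document ID.
documents = {
    "o1": "<order><customer>Acme</customer>"
          "<item sku='A1'/><item sku='B7'/></order>",
    "o2": "<order><customer>Zenith</customer><item sku='A1'/></order>",
}


def build_index(documents, path, attribute=None):
    """Map each value found at `path` to the keys of documents containing it."""
    index = defaultdict(set)
    for key, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for node in root.findall(path):
            value = node.get(attribute) if attribute else node.text
            if value is not None:
                index[value].add(key)
    return index


# Index on an element at the first level of the hierarchy.
by_customer = build_index(documents, "customer")
# Index on an attribute of a repeating, nested element.
by_sku = build_index(documents, "item", attribute="sku")

print(sorted(by_sku["A1"]))
print(sorted(by_customer["Zenith"]))
```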
Validating data
One of the most important aspects of XML documents is the option of data
validation. Using a variety of technologies, including DTDs and XML Schemas,
validating XML parsers can determine whether an XML document conforms to a
declared structure. Because repositories can interpret a DTD or XML Schema,
they can validate data as it is stored and updated.
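The sketch below shows the general shape of validating on store. It is a
simplified stand-in, not DTD or XML Schema validation: Python's standard
xml.etree.ElementTree parser does not validate, so the example checks a
hand-written, hypothetical rule set (REQUIRED_CHILDREN) instead. A real
repository would apply the document's actual DTD or schema at this point.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set standing in for a DTD or XML Schema:
# each element name maps to the child elements it must contain.
REQUIRED_CHILDREN = {
    "order": ["customer", "item"],
    "item": ["qty"],
}


def validation_errors(xml_text):
    """Return a list of problems; an empty list means the document passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    errors = []
    for element in root.iter():
        for child in REQUIRED_CHILDREN.get(element.tag, []):
            if element.find(child) is None:
                errors.append(f"<{element.tag}> is missing <{child}>")
    return errors


def store_validated(store, key, xml_text):
    """Store the document only if it passes the checks above."""
    errors = validation_errors(xml_text)
    if errors:
        raise ValueError("; ".join(errors))
    store[key] = xml_text


store = {}
store_validated(
    store, "o1",
    "<order><customer>Acme</customer><item><qty>2</qty></item></order>",
)
print(validation_errors("<order><item/></order>"))
```

Rejecting the document before it is written means the store never holds data
that violates the rules, which is the benefit the repository's own validation
provides.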
Summary
As XML documents continue to become more common, organizations will need to create a repository for managing hierarchical data. These repositories will offer new technology for storing, accessing, and optimizing XML documents. Here we've discussed how this new technology is implemented and how it relates to traditional data management systems.