This
document discusses the rationale, design, and implementation of the JDBC back-end
for Protégé.
Why we did it the way we did
There are a number of ways to map an object-oriented schema (essentially part of what Protégé produces) to an
RDBMS schema. These different ways all involve a number of tradeoffs between optimizing for space, time, queries, modifications,
insertion, deletion, etc. We wanted to support the possibility that our users might want to map onto a legacy, perhaps
even non-relational, database. Different types of mappings also seem more appropriate at different development stages.
Early
on, when you are just trying to get the class structure right, you want something where the schema is easy to change. Later on,
when you are working with instances, you may want something that is optimal for insertion, modification, and removal
of instances and perhaps has a "natural" feel about the mapping of classes to tables. Even later, when you are
concentrating on applications, you may want something that is optimized for exactly the queries that are most important
to you.
It was clear that no matter which approach we took for the database back-end, we were unlikely either to make very many people
happy, or to keep them happy for very long. Thus, there seemed to be two possible approaches:
- Develop (or use) a mapping language that would allow us to map the ontology to any RDBMS schema
- Allow users to write code to plugin into our system to access their database with whatever schema and
optimizations they choose, and then provide a simple example implementation of this plug-in.
There are a number of commercial products that address the first option by parsing Java source files and producing RDBMS
schema and interface classes of various types. We don't have Java source laying around for classes created inside of
Protégé. More importantly, we allow arbitrary changes to the class structure after we have acquired instances.
When the classes are changed we try not to lose any more information from the instances than necessary. We also allow arbitrary
"type migration" of instances from one class to another. These features are really central to the way users interact
with our tool and we were not aware of any commercial support for such a thing.
Thus, developing something along the lines of the first option was a lot of work, as well as being orthogonal to the main thrust
of our research here. We see ourselves as providing usable tools for knowledge base development and use and not as the world's
experts on OO-to-relational mapping tools. Also, our bias is against doing such a large project having no experience with simpler
systems. We prefer to build big things by extending little things that have proven to be inadequate. In this case we
didn't have any experience building the corresponding little things, inadequate or otherwise.
So, we chose the second option, allowing users to create their own database plug-ins. This choice is also consistent with the
rest of the Protégé plug-in architecture. We chose the simplest schema that we could think of and focused on the
"maximal change" usage where the class structure and hierarchy is undergoing constant change. In this design, therefore,
there is no attention paid to things such as query performance of any type.
What we did
We have a single table that stores the entire contents of the knowledge base. This
includes classes, slots, facets and instances. The table has a fixed number
of columns which are listed below. The Protégé metaclass architecture (which
is modeled after the CLOS Meta-Object Protocol) is used explicitly in the table
to simplify things: we treat all classes, slots, and facets as frames. Each
entry in the database corresponds to a frame in Protege. Classes have slots
such as ":DIRECT_SUPERCLASS" to maintain the inheritance hierarchy.
All frames have a :NAME slot which contains the name of the frame. The
database format is:
| Column |
Description |
Can be null |
Can be empty |
frame [integer] |
frame id Frame ID's < 10000 are reserved for the system. The frame ids for system frames are declared in the file:
edu.stanford.smi.protege.model.Model.java |
No |
No |
frame_type [smallint] |
same as "value_type" but for the frame column |
No |
No |
slot [integer] |
slot frame id |
No |
No |
facet [integer] |
facet frame id (0 if not a facet value) |
No |
No |
is_template [smallint] |
0 => value is OKBC "own", 1 => value is OKBC "template" |
No |
No |
value_index [integer] |
number used to maintain relative ordering of slot_or_facet_value entries
for a frame-slot(-facet) combination |
No |
No |
value_type [smallint] |
number used to indicate the "type" of the value stored in
slot_or_facet_value. The number-to-type conversion is given in the file:
edu.stanford.smi.protege.storage.database.DatabaseUtils.java |
No |
No |
slot_or_facet_value [varchar(N)] |
facet value if facet is not 0, slot value otherwise. Holds values of
length that will fit in a varchar (typically <= 255) |
Yes |
No |
long_slot_or_facet_value [longvarchar] |
same as slot_or_facet_value but holds values too long to fit
in slot_or_facet_value |
Yes |
No |
Note
If
you browse a Protégé database you may notice that, in the case of the
superclass and subclass relations, we appear to store duplicate information.
For example with class A we store that its subclass is B and with
B we store that its superclass is A. Maintaining separate records
for these relations is necessary because we want to be able to maintain
the ordering of both subclasses and superclasses. So while the "slot
value" information is indeed duplicated in these records, the "index"
information is unique. (Subclass ordering is a user-interface feature
that a number of our users have requested. Protégé attaches no
meaning to the ordering of superclasses or subclasses.)