Web Server forum
Back To The Forum Home!Search!Private Messaging System

This is Interesting: Free IT Magazines Now Free shipping to California  
Web Server Talk Web Server Talk > Free Databases support forum > Databases Forum > Database Theory > The TransRelational Model: Performance Concerns




Pages (5): [1] 2 3 4 5 »   Last Thread   Next Thread Next
  Show Printable Version Email this Page Subscribe to this Thread      Post New Thread    Post A Reply      

    The TransRelational Model: Performance Concerns  
Josh Hewitt


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

Recently there has been some excitement created amongst database
enthusiasts (though not researchers) by the patented technology called
the TransRelational(TM) Model.  It was brought to the attention of the
general public by C.J. Date who gave a brief description of it in the
appendix of the 8th edition of his classic book "An Introduction to
Database Systems".  Elsewhere he talks about it as "a radically new and
dramatically different technology" which would let us build an
implementation of the relational model which "would be orders of
magnitude faster than, and would deliver a far greater degree of data
independence than, today's SQL products."

Unfortunately, the available documentation on the TransRelational Model
is quite scarce.  Besides C.J. Date's appendix the only source of
information that an interested person can turn to is the published
patent.  (A side note: if we do a search on 'TransRelational' on Google
we can get to the web site of the patent assignee, Required Technologies
Inc. [http://www.requiredtech.com/].  The site is a quick read.)

I took the time and read the published patent (and patent applications)
because I wanted to know more about the technology that (the otherwise
often skeptical) C.J. Date calls "the most significant advance in this
field since Codd gave us the relational model, nearly 35 years ago."
This is a long email so I copied the 'Conclusions' section from the end
of the text so you can read it as a kind of an 'Abstract' of the
analysis.

As for the published patent description, it does not provide more
information than C.J. Date's appendix but it is much harder to read.  (As
it is usual with patent publications.)

In the appendix C.J. Date mentions a book in the works, called "Go
Faster! The TransRelational(TM) Approach to DBMS Implementation."  When I
contacted the Database Debunkings site for the publication date of the
book I was told that it is unlikely to appear in the near future due to
legal issues (of undisclosed nature.)

Source of C.J. Date quotes:
http://www.dbdebunk.com/page/page/619867.htm

Links to the relevant patents:
http://www.dbdebunk.com/page/page/618862.htm



As a part of my preparation for my Ph.D. dissertation I had to read a
couple of articles on the physical implementation issues of relational
DBMS's.  In particular, I was interested in indexing techniques
(costs/benefits) and ways of improving natural join performance.  These
articles made me think a little bit about the TransRelational Model
(abbreviated as 'TRM' from now on) and what I came up with is a bit
discouraging.

Here follows the detailed analysis:

('disk page', 'page', 'disk block', and 'block' are used interchangeably
in the text.)


First, in the TRM 'columns' of a relation are stored separately.  To
support effective searching, these columns are sorted individually based
on attribute values.  If we want to access more than one attribute of
a given record we have to be able to 'reconstruct' the record.  This is
the Record Reconstruction (RR) step.  RR in the TRM is handled in the
following way: each attribute value of a given record has a 'link' to the
value of the 'next' attribute of the same record.  ('next' is between
quotes because the actual order in which attributes are linked together
is more or less arbitrary.)  These links form a cycle so we can start the
reconstruction of the record from any attribute value.

Example:

relation EMP:

E#    ENAME        HIREYEAR    SALARY

30    Johnson        1993       5000
20    Smith          2000       6000
40    Hewitt         1998       9000
50    Bailey         1990       7000

TRM representation of relation EMP:

Field-Values (FV) Table (notice that each column is sorted):

ROW  E#    ENAME       HIREYEAR    SALARY

1   20    Bailey       1990        5000
2   30    Hewitt       1993        6000
3   40    Johnson      1998        7000
4   50    Smith        2000        9000

Record Reconstruction (RR) Table:

E#       ENAME     HIREYEAR    SALARY
4        1           3     +--->  2  --+
+->  3        3    +-->   1  ---+      1    |
|    2       (2) --+      4            4    |
|    1        4           2            3    |
|                                           |
+-------------------------------------------+

Fig. 1: RR for the record where ENAME = 'Johnson'

What are the advertised benefits of this data organization?

- All attributes are stored in sorted order so practically the
relation is simultaneously indexed by _all_ of its attributes.  This
makes range queries and equivalence queries quite efficient since
we can do an initial binary search on the search attribute.

- Performing ORDER BY queries that involve one attribute of the relation
becomes trivial.  Even in the case of multiple attributes the bulk of
the sorting is already done since we can traverse the FV table using
the first attribute in the sort order.

- Since values that are the same are stored consecutively, performing
GROUP BY (or SUMMARIZE BY) queries becomes much more efficient.  In
addition, various compression techniques (e.g. collapsing runs of values
that are the same) become feasible so significant storage savings can be
achieved.

(The organization can also be extended to support joins using 'merged
columns' but in this analysis I will stick to the single relation case.)


How do we use the TRM?

Let's suppose we would like to bring up the records of employees whose
name is 'Johnson' (using Tutorial D):

EMP WHERE ENAME = 'Johnson'


How to compute the result in the TRM?  Well, first we perform a binary
search on the ENAME column in the FV table (cost: log ||EMP||, where
||EMP|| is the number of tuples in relation EMP) and then follow the
attribute value links in the RR table (shown on Fig. 1)


So far so good.  This is how the TRM is supposed to work.

Now let's compare the TRM's performance on this example to that of any of
today's major DBMS's.  For the sake of the example, let's say that in the
major DBMS we have a B-Tree index defined on the E# attribute (this is a
primary key so it is a reasonable assumption) and also that we have a
secondary B-Tree index on the attribute ENAME.  HIREDATE and SALARY are
not indexed.  To make the example more realistic, let's suppose that we
have four additional attributes (non-indexed) in EMP: DEPT#, EMAIL,
ADDRESS and PHONENO.


First, notice that in any 'standard' DBMS implementation, attribute values
that belong to the same record are usually stored on the same page (disk
block.)  In the standard setting record reconstruction is trivial (a
single block read) once we located the record.

In the analysis that follows I will use the following symbols:

ACNT               8        number of attributes in EMP
|EMP|          1`000        number of pages of relation EMP
||EMP||      100`000        number of tuples in relation EMP
||RS||                      number of tuples in the Result Set

Let ||EMP|| be quite large (say, around 100`000).  Let's say we store 100
employee records per page on average.  This means that |EMP| is 1000.

The TRM stores each column separately so we can think of them as binary
relations where the first attribute is the attribute from the original
record and the second attribute is the link from the RR table.  Each
binary relation is sorted on the attribute values.  Let's call each such
table with the same name as the original attribute with '_FV' added as
suffix.  For example: E#_FV, ENAME_FV, etc.

The ENAME_FV Table:

ENAME    LINK
Bailey    1
Hewitt    3
Johnson   2
Smith     4


The assumption that the TRM stores each FV column separately along with
its corresponding links is not an assumption that is very important from
the point of view of the analysis.  What is important is that no matter
how the TRM actually chooses to physically store column values the
storage must facilitate binary search on the values and then provide easy
access to the link that points to the next attribute value.  If we want
to do binary search on disk (since we cannot fit the whole relation in
main memory) it is a good idea to keep as many attribute values on one
disk page as possible to reduce the number of disk accesses.  It is also
a good idea to keep the next-attribute-link close to the attribute value
so that once we found the attribute value we were looking for we can
immediately begin the reconstruction of the record.  Otherwise we would
need an additional block read to fetch the link corresponding to the
attribute value.  The design of _FV tables shows these desirable
properties and any other design for storing FV columns has either the
same performance or worse.  (For example, if fewer ENAME values are
stored on a single page, then any binary search on ENAME will use more
block reads than it would use on the ENAME_FV table.)


[[ A little side note, regarding the column compression feature of T
RM:

The column compression feature of the TRM *does not* reduce the size of
the RR table only the FV table.  Also, the FV table will be reduced in
size only if the cardinality of the given attribute is low.  In our
example only the DEPT# and, to some extent, the SALARY columns seem to
lend themselves for compression.  Compressing these two columns would
modify the results of the analysis only slightly, so for the sake of
simplicity we will ignore this possibility for now.)  ]]

Obviously, the number of records in each _FV table is the same as in the
original relation:

||EMP|| = ||E#_FV|| = ||ENAME_FV|| = ...

However, since the record size is smaller the _FV tables will occupy less
number of disk pages.  If we could store 100 records of the original
relation with 8 attributes on one page then we can store 800 records of
the _FV tables on one page (on average).  This means that each _FV table
will use 125 pages on average:

|EMP| / 8 = 1000 / 8 = 125 = |E#_FV| = |ENAME_FV| = ...


Now let's compare the cost of some simple retrieval queries on both
storage organizations (ie: Traditional and TRM)

First, observe that we have at least two factors that contribute to the
total cost of executing a query: (L) the cost of locating the record,
(RR) the cost of reconstructing the record.


Query 1:

EMP WHERE E# = 30

This is a restriction on the primary key so the result set is of
size one at most.  Let's suppose we have an employee with E# equal to 30.

Traditional RDBMS:

L: B-Tree search that results in a physical RowID.

Cost: at most 2 blocks (the branching factor of a B-Tree is very high)

RR: Accessing the record using the RowID.

Cost: 1 block

TRM:

L: Binary search in table E#_FV.

Cost: log |E#_FV| blocks = log 125 blocks = 7 blocks

RR: Accessing all _FV tables for the remaining attributes of the record.

Cost: (ACNT - 1) blocks = 7 blocks


Total cost of fetching a record based on its single attribute primary key
(assuming a B-Tree index in the traditional case):

Trad:        3 blocks
TRM:        14 blocks



Query 2:

EMP WHERE ENAME = 'Johnson'

Let's suppose that the size of the result set is 20.  (We have twenty
employees whose last name is Johnson.)


Traditional DBMS:

L: B-Tree search that results in the physical RowID of the first record
that has 'Johnson' as its ENAME attribute.  Since RowIDs in the leaf
nodes of the B-Tree are kept in key order all we have to do is
sequentially read the RowIDs that follow until we hit a key with a
different ENAME.

Cost: at most 2 blocks for search (the branching factor is very high)

RR: Accessing the records using their RowIDs.

Cost: ||RS|| blocks = 20 blocks

TRM:

L: Binary search in table ENAME_FV.

Cost: log |E#_FV| blocks = log 125 blocks = 7 blocks

RR: Accessing all _FV tables for the remaining attributes of the records.

There is no guarantee that two records that follow each other when we
sort the relation on one attribute will follow each other (or will be
even close to each other) if we sort the records on some other
attribute.  The consequence is that when we follow the links to the
next attribute in the RR phase it will be more or less random access
of _FV tables.  So for each record that appears in the result set we
have to read 1 block of each of the different attribute value tables.
This is why the cost of RR is so high in this case.

Cost: (ACNT - 1) blocks * ||RS|| = 7 * 20 blocks = 140 blocks


Total cost of fetching a record based on single non-key attribute
(assuming a B-Tree index in the traditional case):

Trad:         3 blocks
TRM:        147 blocks (!)


Query 3:

EMP WHERE SALARY = 5000

The final case is when we the attribute that appears in the query is not
indexed in the traditional RDBMS.  Let's suppose that the result set is
again of size 20.

Total cost:

Trad:        |EMP| blocks = 1000 blocks (full table-scan)
TRM:          147 blocks

It is easy to see that the TRM does not care about which attribute we
search on (it is 'unbiased') so the its execution cost is the same as
for the second query (EMP WHERE ENAME = 'Johnson').

The total cost function of the TRM based on this analysis is the
following:

Locating the result set using binary search (equality or range):

log (|EMP|/ACNT)  blocks

Record Reconstruction:

(ACNT - 1) * ||RS|| blocks


Two important observations:

- After all, binary search is not that cheap on large relations.
- Record reconstruction cost is a linear function of *both* the attribute
count and the size of the result set.

For example, in Q2, where ACNT was 8 and ||RS|| was 20 the total cost of
generating the result set using the TRM turned out to be 147 blocks!

The TRM had to fetch 15% of the total relation *after* locating the
records in question to generate the result set that consisted of 20
records, 0.02% (!) of the relation.

As the result set grows the performance of TRM degrades rapidly:

||RS||    % of ||EMP||       Total Cost     % of |EMP|

40         0.04            287 blocks        28.7
60         0.06            427 blocks        42.7
80         0.08            567 blocks        56.7

Keep in mind that we have 100000 records in EMP.  40 records in the
result set will make the TRM read in almost one third of the database!

Of course, as the result set grows the likelihood that we 'hit' the same
field-value block is getting higher so the actual numbers will be
somewhat smaller.  Nevertheless, for small to medium sized result sets
the overhead of the TRM is enormous.  Of course, if the query is such
that it has to process almost the entire relation than the TRM is not
worse than the traditional approach.


Notice, that we are still talking about retrieval costs where the TRM
supposedly shines because it keeps the database records sorted by all
attributes at the same time.


Let's look at insertion costs now.

INSERT NEW_EMP_REC INTO EMP

(NEW_EMP_REC is a record that contains the data of a new employee.)

Traditional RDBMS:

Insertion: Two B-Tree inserts (the E# and ENAME indexes) and a single
block access for inserting the record itself.

Cost:        2 * 2 + 1 = 5 blocks

TRM:

Insertion: If we want to insert a new tuple into the EMP relvar the TRM
has to insert _every_ attribute value into its corresponding _FV table
using binary search.  But this is not enough in itself.  Since the TRM
uses positional links we are forced to deal with the ordered array
insert problem.  We have to make room for the new value which means
that we have to shift the rest of the table and we have to update the
links in the 'previous' _FV table.  Let's be generous and ignore the
costs of making room/updating links step.  As we will see, even without
this additional overhead the cost of insertion is quite high.

Cost:
Binary searches:

ACNT * log (|EMP|/ACNT) = 8 * log (125) = 8 * 7 = 56 blocks (!)



Conclusions

Although the TransRelational Model seems to be very appealing at first
glance, closer analysis brings out its shortcomings.  To summarize: (1)
retrieval is prohibitively expensive for any reasonably sized relation
because of the huge record reconstruction costs incurred by the vertical
decomposition of tuples into 'Field-Value' tables, (2) insertion is
prohibitively expensive because of the many binary searches and value
insertions that have to be performed for each attribute of the record to
be inserted.

Another observation is that the reward offered for all the pain, binary
search on all attributes, is 'just' a log(n) search which is too slow
compared to B-Trees.  Not to mention that binary search is also a random
access search which is not very cache friendly.

Also, the current analysis completely ignored the cost of making room for
new values on insertion and maintaining the links between the _FV tables
after deletion and insertion.

In my opinion, in its current form (as described in the appendix of
C.J. Date's "An Introduction to Database Systems (8th ed)" and the
published patent) the TRM is not a feasible candidate for implementing a
relational DBMS.

Regards,

Josh Hewitt
Ph.D. Student
Florida Institute of Technology





[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Alistair Bayley


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

Alfredo Novoa wrote:
>
> The patent does not prescribe anything about how to physically
> implement the tables.
>
> Binary search does not imply a binary tree.
>
> http://c2.com/cgi/wiki?BinarySearch

I assumed that the tables would be implemented by something that gave
O(1) lookup time (amortized perhaps, but still more-or-less constant).
And following from this, reconstructing a record would also be a
constant time operation (or rather, O(n) where n is the number of
columns), as you'd have (say) to fetch 1 block per column from the value
table, and 1 block per column from the reconstruction table. Is this
more-or-less correct, or have I misunderstood?


> The ring structure is only one of the possible structures we may use.

Can tell me some of the others, please?


Thanks,
Alistair.





[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Troels Arvin


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

On Wed, 10 Nov 2004 23:54:59 -0800, Josh Hewitt wrote:

> In my opinion, in its current form (as described in the appendix of
> C.J. Date's "An Introduction to Database Systems (8th ed)" and the
> published patent) the TRM is not a feasible candidate for implementing a
> relational DBMS.

Thanks for an interesting posting.

IO-intensiveness of insertions in TRM seems rather unavoidable.

But I thought that some of the IO overhead in searches could maybe be
avoided by adding some kind(s) of indexes to the model? I know that this
sounds rather strange, because one of the ideas of TRM seems to be that
all columns are clustered, enabling (somewhat) quick searching by
"themselves".

Another thought: All your examples seem to be use of fully projected
relations, corresponding to SELECT * FROM ...
Maybe TRM has value if most of the actual queries project to only a
fraction of the existing columns?

--
Greetings from Troels Arvin, Copenhagen, Denmark






[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Costin Cozianu


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

Thanks very much for your post. A fresh note in this otherwise decayed
newsgroup. Boy that patent language is really annoying, thanks for your
tranlation into English 

I'll ask you a further question. Studying the patent superficially I
noticed no big difference from professor  Naphtali Ruishe's Semantic
Binary Model prosal (actually it looks like a particular implementation
choice of SBM). He also uses an inverted store by value scheme.

Do you know whether there are any major differences ? I mean not as far
as the patent office is concerned (I'm, sure they will find some
relevant differences), but from a realistic engineering perspective.

Thank you,
Costin





[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Gene Wirchenko


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

Alfredo Novoa <anovoa@ncs.es> wrote:

>On Fri, 12 Nov 2004 15:13:25 GMT, "Marshall Spight" <mspight@dnai.com>
>wrote:
> 
>
>I disagree a little bit. If you learn studying books written by
>anybody else, you are not teaching yourself in the strict sense.
>
>In fact, I don't see a fundamental difference between to study a book
>and to attend to a seminar.

One fundamental difference is that with a book, the communication
is one-way only.  At a seminar, there is the opportunity for two-way
communication.  This makes for easier debugging of problems.

Sincerely,

Gene Wirchenko

Computerese Irregular Verb Conjugation:
I have preferences.
You have biases.
He/She has prejudices.





[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Alfredo Novoa


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

On Fri, 12 Nov 2004 17:06:38 GMT, "Dan" <guntermann@verizon.com>
wrote:

> that 
>
>Wasn't the paper quite explicit that binary searches over binary relations
>were used in contrast to B*Trees?

The patent does not prescribe anything about how to phisically
implement the tables.

Binary search does not imply a binary tree.

http://c2.com/cgi/wiki?BinarySearch

>This is obviously not the case after closer reading.  But a linked, ordered
>reconstruction table bothers me.  For relations of large degree, there seem
s
>to me to be a fixed overhead in traversing all attributes.

The ring structure is only one of the possible structures we may use.

>, that a relation or a significant part of a 100000 tuples 
>But he makes the same assumption with the conventional relational DBMS as
>well.  This puts it on equal footing doesn't it?

But this is not a realistic assumption and he did not study what
happens if we have a lot of RAM. The analisys is very biased.

It is unthinkable to have a DB server with less than several GBs of
RAM in these days. This is more than enough for most little company's
databases.

It would be rather affordable (for a big company) to build a Tera Byte
RAM DB Server with a cluster of PC boards, and it will be a lot more
affordable in a near future with the 64 bits PC architectures.

Although we only need RAM enough for a significant part of the most
used tables to avoid the most part of the disk loads.
 
>If field value tables are really ordered domains, any new value in each one
>will require overhead to place the new value in the correct spot, no?

We can find an empty slot quickly with an efficient search algorithm.


Regards








[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Marshall Spight


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

"Alfredo Novoa" <alfredo@ncs.es> wrote in message news:4194ae55.6409453@news.wanadoo.es...[v
bcol=seagreen]
> On Fri, 12 Nov 2004 11:41:29 GMT, Jan Hidders
> <jan.hidders@REMOVETHIS.pandora.be> wrote:
> 
>
> I have found this:
>
> http://www.cs.duke.edu/~junyang/cou...graefe-1993.pdf[
/vbcol]

By the way, many thanks to Jan Hidders for mentioning the paper
and Alfredo Novoa for finding it online! It looks to be just about
exactly what I was looking for.


Marshall







[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Alfredo Novoa


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

On Fri, 12 Nov 2004 11:41:29 GMT, Jan Hidders
<jan.hidders@REMOVETHIS.pandora.be> wrote:

>G. Graefe. Query evaluation techniques for large databases. ACM
>Computing Surveys, 25(2): 73--170, June 1993.

I have found this:

[url]http://www.cs.duke.edu/~junyang/courses/cps216-2003-spring/papers/graefe-1993.pdf[
/url]



Regards





[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Alfredo Novoa


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

On Sat, 13 Nov 2004 22:55:32 GMT, Alistair Bayley
<alistair@abayley.org> wrote:

>I assumed that the tables would be implemented by something that gave
>O(1) lookup time (amortized perhaps, but still more-or-less constant).

This is an option but hash tables consume a lot of memory.

O(log2 n) lookup time is enough for many applications.

>And following from this, reconstructing a record would also be a
>constant time operation (or rather, O(n) where n is the number of
>columns), as you'd have (say) to fetch 1 block per column from the value
>table, and 1 block per column from the reconstruction table. Is this
>more-or-less correct, or have I misunderstood?

It depends on many things.

If the blocks are indexed with a B-Tree like structure and that
indices are loaded in the cache what you said is more or less correct.

The number of "sorted" attributes is always optional. If you only want
to make lookups over a single attribute you don't need a
reconstruction table at all.

The same about a single and fixed sequence of attributes.
 
>
>Can tell me some of the others, please?

Any structure that links all the sorted attributes like: double rings,
stars, bidirectional rings, etc.


Regards





[ Post a follow-up to this message ]



    Re: The TransRelational Model: Performance Concerns  
Alfredo Novoa


View Ip Address Report This Message To A Moderator Edit/Delete Message


 
11-16-04 07:15 PM

On Fri, 12 Nov 2004 21:20:18 GMT, Jan Hidders
<jan.hidders@REMOVETHIS.pandora.be> wrote:
 
>
>If we assume a lot of RAM then we are essentially talking about a
>main-memory database in which case you should compare it to the usual
>techniques for those types of databases.

Main memory database or main memory DBMS?

Most non main memory DBMS perform a lot better if you have a lot of
RAM.

It is becoming very frequent to have SQL Server databases that fit in
RAM.

> One that springs to my mind is
>the Monet DB from the CWI in Amsterdam where relations are essentially
>split in binary relations between tuple identifiers and attribute
>values. For such relations you can then apply the usual techniques of
>hashing, indexes et cetera. I strongly suspect that the space and time
>complexity will be essentially equivalent with the TRM approach, e.g.,
>single attribute equi-joins are also here extremely fast, but it is much
>less complex and doesn't require fancy allocation algorithms, leaving
>gaps for inserts, nightly reorganizations or whatever.

Agreed, but the TRM is more general and flexible (the traditional
approach is also one of the options), and might perform very well with
a broader range of cases.


Regards





[ Post a follow-up to this message ]



    Sponsored Links  




 





   All times are GMT. The time now is 03:46 AM.      Post New Thread    Post A Reply      
Pages (5): [1] 2 3 4 5 »   Last Thread   Next Thread Next


Most Popular forums 

Forum Jump:
Rate This Thread:

Forum Rules:
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is OFF
vB code is ON
Smilies are ON
[IMG] code is OFF
 
Medical and Health forum | Computer Games Reviews | Graphics design forum

Back To The Top
Home | Usercp | Faq | Register