
Q: Techniques for a mini-database

mbbrutman

For the professional programmers out there ..

I started working on a user file for my telnet BBS. The file is organized as fixed length records, so random access to a specific record number is easy. I would like to be able to handle somewhere between 500 and 1000 records efficiently. (Assume a record is around 128 bytes.)

I am worried about the time it takes to search for a record. For example, when adding a user one must ensure that the name is not already taken. The naive method is to scan the entire file. With a few entries this is not bad, but if the number of entries were in the hundreds then this would be noticeable. Especially with a floppy drive. :)

One solution is to implement some sort of index to organize the file in alphabetical order. That would let you do a binary search of the records, except that could really be poor because of the seeking back and forth through the file. For a small number of records it would make sense to table scan, but for a larger number of records it might make sense to go this route. Having the name and record number in the index would cut down on the disk seeks by allowing you to determine if a name was in use or not just by reading the index. Unfortunately, that would take a lot of space in memory if I really did get 500 users.

I'd really like any solution to not store data in memory, so the index in memory is probably not going to work. Memory is great, but it's not scalable. Even just storing names and record numbers in memory can consume more memory than I want to devote to this function. Having an index purely on disk means 2x the number of seeks and reads, and adding new entries requires rewriting the index file. That is better than rewriting the whole user file, but still expensive when TCP/IP packets are flying in and out on multiple connections.

I've been thinking about hybrid techniques too, like where the file is reorganized in alphabetical order nightly. That would allow one to use a binary search on the file, and then use a separate table scan for new users.

Thoughts?
 
One thought: break the index file into alpha groups and load them into a RAMdisk; let DOS do the main search by first letter(s), and update the diskette(s) when convenient.

m
 
If the names are the only criterion you want to search for, you could consider using some hashing technique.

To do so, you first need a hash function which maps alphabetic names to a (small) number. One such function could simply be: add the ASCII values of the characters of the name, modulo 127. This would map any name to a number in the range 0..126. Of course, this mapping is not unique; more than one name can map to the same number. But if the function works well, the names will be spread evenly over all possible numbers, so that each number represents an equally small set of names.

Next you need a (small, 127 entries in this case) table mapping each hash value to the first data record holding a name which in turn maps to that hash value.

Third, each record needs a pointer to the next record belonging to the same hash value.

Now, searching for a particular name means calculating the proper hash value first, then walking the corresponding chain. This walk is a linear search, but if the hash function is sufficiently balanced, the chain is rather short: only some eight or nine records instead of 1000.

Entering a new record also requires calculating the hash value; then you simply append the new record to the end of the proper chain.
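
A minimal sketch in C of what I mean (the record layout, the field sizes and the read_record() helper are just assumptions for illustration, not taken from your program):

Code:
    /* Sketch only: record layout, sizes and read_record() are illustrative assumptions. */
    #include <stdio.h>
    #include <string.h>

    #define NUM_BUCKETS 127
    #define NIL 0xFFFFu                /* marks the end of a chain */

    struct Record {                    /* one fixed-length 128-byte record */
        char name[32];
        unsigned short next;           /* record number of the next record in the same chain */
        char payload[94];              /* the rest of the user data */
    };

    /* Record number of the first record in each chain; entries start out as NIL. */
    unsigned short hashTable[NUM_BUCKETS];

    /* Add the ASCII values of the characters of the name, modulo 127. */
    unsigned short hash_name(const char *name)
    {
        unsigned int sum = 0;
        while (*name) sum += (unsigned char)*name++;
        return (unsigned short)(sum % NUM_BUCKETS);
    }

    /* Read one record by record number (fixed-length records, so this is one seek). */
    int read_record(FILE *f, unsigned short recnum, struct Record *rec)
    {
        if (fseek(f, (long)recnum * sizeof(struct Record), SEEK_SET) != 0) return -1;
        return (fread(rec, sizeof(struct Record), 1, f) == 1) ? 0 : -1;
    }

    /* Walk the chain belonging to the name's hash value; return the record number or NIL. */
    unsigned short find_user(FILE *f, const char *name)
    {
        struct Record rec;
        unsigned short r = hashTable[hash_name(name)];
        while (r != NIL) {
            if (read_record(f, r, &rec) != 0) break;
            if (strcmp(rec.name, name) == 0) return r;
            r = rec.next;              /* only a few reads if the hash distributes well */
        }
        return NIL;
    }
Appending a new user is then just writing the new record at the end of the file and patching either the table entry or the next field of the last record in its chain.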

I hope my explanation is clear enough. If not, ask again.
 
One thought: break the index file into alpha groups and load them into a RAMdisk; let DOS do the main search by first letter(s), and update the diskette(s) when convenient.

m

I had thought about partitioning the file into multiple files, but I'm going to be updating the file fairly often, and in the worst case I would need to have a file handle open for each connection to the machine.

Memory is tight, so a RAM disk is out of the question. The most that I want to think about keeping in main memory is a small index. Assume two bytes per entry for the index, and that is reasonable.


Mike
 
WiWa64,

I have thought about using a hash table, but I think that it breaks down for the same reasons that an index does.

Let's assume there is no prebuilt index on disk, and that all we have is an unsorted file. At startup I am willing to pay the penalty for building an in memory index or hash table. It is only used for finding records, so it can be thrown away each time the program ends.

Indexing technique:

As I read the file in and look at user names I build an index in memory. Each element of the index is a two-byte record number in the file, and it will take a lot of disk activity to get the elements of the index into the correct order. If I wanted to use some temporary memory I could store the names in memory while building the index, and then throw the names (and the extra memory) away when the index was done. All that would remain is the array of two-byte entries.

At run time, if I wanted to find a name I would use a binary search on the index. I'd still have to read each record I compare against from the file. (If I could keep the names in memory I wouldn't have to do that.) With 1000 records in the file this only requires 2K of storage, which is reasonable. Worst case to find a record is 10 reads from the main file.

If I want to add a name to the index the procedure is similar. After I find where it should be (and confirm it does not already exist) I have to 'make a hole' for the new entry and do a large memory copy (on average 1K in size) to move the other entries over.
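
Roughly what I have in mind, as a sketch (the 128-byte/32-byte sizes are my illustrative assumptions, and it assumes names are stored NUL-terminated):

Code:
    /* Sketch only: record and name sizes are illustrative assumptions. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_USERS 1000

    unsigned short indexTable[MAX_USERS];   /* record numbers, kept sorted by user name */
    int indexCount = 0;

    /* Read just the 32-byte name field of a 128-byte record; returns 0 on success. */
    static int read_name(FILE *f, unsigned short recnum, char *name)
    {
        if (fseek(f, (long)recnum * 128L, SEEK_SET) != 0) return -1;
        return (fread(name, 1, 32, f) == 32) ? 0 : -1;
    }

    /* Binary search: about log2(n) record reads, so roughly 10 for 1000 users.
       Returns 1 and fills *recnum if found, else 0 and fills *insertPos. */
    int find_in_index(FILE *f, const char *name, unsigned short *recnum, int *insertPos)
    {
        char candidate[33];
        int lo = 0, hi = indexCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            int cmp;
            if (read_name(f, indexTable[mid], candidate) != 0) break;
            candidate[32] = '\0';
            cmp = strcmp(name, candidate);
            if (cmp == 0) { *recnum = indexTable[mid]; return 1; }
            if (cmp < 0) hi = mid - 1; else lo = mid + 1;
        }
        *insertPos = lo;
        return 0;
    }

    /* 'Make a hole' for a new entry: one memmove (on average about 1K), then drop it in. */
    void insert_in_index(int pos, unsigned short recnum)
    {
        memmove(&indexTable[pos + 1], &indexTable[pos],
                (size_t)(indexCount - pos) * sizeof(unsigned short));
        indexTable[pos] = recnum;
        indexCount++;
    }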


Hashing technique:

The hash table gets built once at program startup. Each element in the hash table is a record number on disk and then a pointer to the next record in memory.

The more 'buckets' you use to store the hash table, the quicker the searches are. At a minimum you still have to have an element for each name in the user file. But hash tables use more memory because there will be holes in the table and each element is at least four bytes, not two. (Disk record number and next element pointer/index)

Hash tables require less work to add an entry. But hash tables are a more complex algorithm.


If I implement the index as a linked list, the storage required becomes almost as big as that required by a hash table. But then I eliminate the memory copy each time a new record is added. Since new records are not added as often as existing records are searched, I probably wouldn't do this.
 
Ah well, yes, I was assuming that you could put the RAMdisk into non-conventional memory like the Juko boards or with EMS remapped as extended, etc.

m
 
I have a program called VGAdisk that makes a 176 KB RAMdisk from the VGA card's memory and it's really fast, though if you're doing this on the PCjr that's out of the question.

My first thought was also a hash technique. In the BBS I've been working on it just scans the entire file, and it does cause a bit of a delay, although it's not much.

You have to consider how many people are going to be connected at once and how many total users would sign up when you devote resources, not how many you'd like it to handle. I personally think 4 to 6 possible connections at once and 500 or so max users is incredibly generous for a BBS here in 2008.
 
I have a program called VGAdisk that makes a 176 KB RAMdisk from the VGA card's memory and it's really fast, though if you're doing this on the PCjr that's out of the question.

My first thought was also a hash technique. In the BBS I've been working on it just scans the entire file, and it does cause a bit of a delay, although it's not much.

You have to consider how many people are going to be connected at once and how many total users would sign up when you devote resources, not how many you'd like it to handle. I personally think 4 to 6 possible connections at once and 500 or so max users is incredibly generous for a BBS here in 2008.

The key to making a system that performs exceptionally well with 4 connections, 50 users, and a few hundred messages is to think and design for 20 connections, 500 users and a few thousand messages. ;-0

Scanning a file sequentially, even if it only takes 0.5 seconds, represents a delay not just for the connected user but for all connected users. In the days of a dial-up BBS this was OK, but for a telnet BBS it's probably not.

(During my last large scale test I noticed that disk I/O is particularly troublesome. Minimizing disk I/O time is a big part of ensuring good performance.)
 
WiWa64,

Hashing technique:

The hash table gets built once at program startup. Each element in the hash table is a record number on disk and then a pointer to the next record in memory.

The more 'buckets' you use to store the hash table, the quicker the searches are. At a minimum you still have to have an element for each name in the user file. But hash tables use more memory because there will be holes in the table and each element is at least four bytes, not two. (Disk record number and next element pointer/index)

Hash tables require less work to add an entry. But hash tables are a more complex algorithm.

Please excuse me for perhaps not having explained the idea clearly enough. I have the impression you didn't fully understand how it works.


First, the size of an entry depends on how you address your records on disk. If you use absolute positioning within the file, you would probably need four-byte pointers. If, however, your records are all the same size and you can ensure that you will never have more than 65536 records, then you can use "abstract" record numbers, from which the absolute disk position will be calculated (either by you or by the run-time system). This way you only need a two-byte quantity to address a record in your file.

Second, I would not build the hash table each time at program startup; rather, I would store it within the file itself. This way you only have to read it at start-up. If your memory is very tight you could even refrain from holding the table in memory. However, the price for this saving of memory is slower execution.

Assuming two-byte record numbers, the size of the table will be two bytes times the number of "buckets", in my example above 127 * 2 = 254 bytes. You could store this easily in the first two records of your file.

Another misunderstanding concerns the "next" pointers. They are not stored inside the table (the table only holds one pointer to the first record of each chain). Rather, the next pointer is a field in each record. So you would have to set aside two bytes for this purpose in each of your records, but in the file, not in memory.

One could think of the whole file as a set of 127 linked lists. The hash table only points to the beginning of each of these 127 lists. Searching an entry means consulting the hash table to find the right list, then linearly scanning the (relatively short) respective list, instead of scanning the whole file.
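
For example (just a sketch; the numbers follow the example above: 127 buckets, 128-byte records, two-byte record numbers):

Code:
    /* Sketch only: 127 buckets, 128-byte records and two-byte record numbers as above. */
    #include <stdio.h>

    #define NUM_BUCKETS   127
    #define RECORD_SIZE   128L
    #define TABLE_RECORDS 2          /* 127 * 2 = 254 bytes fits in records #0 and #1 */

    unsigned short hashTable[NUM_BUCKETS];   /* first record number of each chain */

    /* An "abstract" record number becomes an absolute file position. */
    long record_offset(unsigned short recnum)
    {
        return (long)recnum * RECORD_SIZE;
    }

    /* Read the table once at start-up from the first two records of the file.
       Data records then start at record #TABLE_RECORDS. */
    int load_hash_table(FILE *f)
    {
        if (fseek(f, 0L, SEEK_SET) != 0) return -1;
        return (fread(hashTable, sizeof(unsigned short), NUM_BUCKETS, f) == NUM_BUCKETS) ? 0 : -1;
    }

    /* Write it back whenever the head of a chain changes (e.g. a brand-new user). */
    int save_hash_table(FILE *f)
    {
        if (fseek(f, 0L, SEEK_SET) != 0) return -1;
        return (fwrite(hashTable, sizeof(unsigned short), NUM_BUCKETS, f) == NUM_BUCKETS) ? 0 : -1;
    }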
 
Another misunderstanding concerns the "next" pointers. They are not stored inside the table (the table only holds one pointer to the first record of each chain). Rather, the next pointer is a field in each record. So you would have to set aside two bytes for this purpose in each of your records, but in the file, not in memory.

One could think of the whole file as a set of 127 linked lists. The hash table only points to the beginning of each of these 127 lists. Searching an entry means consulting the hash table to find the right list, then linearly scanning the (relatively short) respective list, instead of scanning the whole file.

I think that was the key misunderstanding - the next pointer is in each user record, not purely in the hash table structure. In a normal database structure one would always separate metadata such as indexes and 'next record' pointers, which is why I didn't understand. We don't have a normal database structure here. :)

Your method is superior because it gets the chains of records out of main memory without requiring multiple reads (one for the index structure/hash table and one for the actual data). I will think about it some more before coding.

Thanks!
-Mike
 
Oh, you're welcome! :)


Perhaps this sketch of a possible layout of the file makes the idea a bit clearer:

The hash table is stored in records #0 and #1. From there you find pointers (record numbers) to the various data records. Record #x+1 has a non-zero next pointer, which means that the hash function maps the names "Barney" and "Wilma" to the same hash value, so the two records belong to the same chain.



Code:
                    +-------+-------+-------+- - - -+ .......+- - - -+-------+
Record # 0          |       |       |       |       |        |       |       |
                    +-------+-------+-------+- - - -+ .......+- - - -+-------+
Record # 1          |   y   |  ...  |  x+1  |       |        |       |   x   |
                    +-------+-------+-------+- - - -+ .......+- - - -+-------+
                    :   |               |                                |
             +----------+               |                                |
             |      :                   |                                |
             | +------------------------+                                |
             | |    :                                                    |
             | | +-------------------------------------------------------+
             | | |  :
             | | |  +-------+-------+------- - - - -  ....... - - - -+-------+
Record # x   | | +->| Fred  | (NIL) |                                |       |
             | |    +-------+-------+------- - - - -  ....... - - - -+-------+
Record # x+1 | +--->| Barney|   y   |                                |       |
             |      +-------+-------+------- - - - -  ....... - - - -+-------+
             |      :           |
             |   +--------------+
             |   |  :
             |   |  :
             |   |  +-------+-------+------- - - - -  ....... - - - - -------+
Record # y   |   +->| Wilma | (NIL) |                                        |
             |      +-------+-------+------- - - - -  ....... - - - - -------+
             |      :
             |      :
             |      :
             |      :
             |      +-------+-------+------- - - - -  ....... - - - - -------+
Record # z   +----->|Pebbles| (NIL) |                                        |
                    +-------+-------+------- - - - -  ....... - - - - -------+
                    :
                    :
                    :
 
I am quite familiar with linked lists - about 20 years now. :)

I was just contaminated by too many years of 'big' database thinking, where stuff like indexing metadata does not get mixed into the table structure.
 
I've been thinking about this with my sensible head for a couple of days. As I see it, the way to look at the problem is to:
  • Minimise the time to search
  • Minimise the file size handled at any one time

If you want to search the whole of the information in the database then there are no short cuts: the complete information has to go into a file or set of files, and a search of the whole lot is necessary.
If memory space is limited and the major search is on the names, then some sort of simple hashing algorithm on the names is a good way to break the file sizes into manageable chunks, e.g. add up all the ASCII codes in the name string and take the sum mod 10 to give 10 possible alternative file names, each of which can be loaded and searched separately.
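
Something along these lines (the "USERS0.DAT".."USERS9.DAT" naming is just an example of mine):

Code:
    /* Sketch only: the file-name pattern is illustrative. */
    #include <stdio.h>

    /* Add up the ASCII codes of the name and take the sum mod 10 to pick one of ten files. */
    void bucket_filename(const char *name, char *out)
    {
        unsigned int sum = 0;
        while (*name) sum += (unsigned char)*name++;
        sprintf(out, "USERS%u.DAT", sum % 10u);
    }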

If you only need to search part of the total data, e.g. name/address/ZIP, then the names, addresses and ZIPs could be put into an index file and used to reference a (set of) data file(s) containing the rest of the unsearchable information.

That's how I'd do it, hope it makes sense!
 
Splitting the database into separate files for names starting with each letter of the alphabet would improve the speed quite a bit, of course.
 
If I implement the index as a linked list, the storage required becomes almost as big as that required by a hash table. But then I eliminate the memory copy each time a new record is added. Since new records are not added as often as existing records are searched, I probably wouldn't do this.

I think you're overthinking this. The goal is to take a key (string, number, whatever) and use it to pull the entire 128-byte record off of disk. So for the key of each record, run it through a hashing algorithm (even a CRC32 would be enough) and record the hash + offset in memory. When you want to pull the entire record, hash the provided key and use that to look up the data.

Code:
    hash|(offset * 128)
2887EB76|0000h
AB0A0747|0001h
993B110D|0002h

...etc. For 1000 entries like you mentioned, the entire lookup table would take up 1000*6=6000 bytes. That's trivial.

What about collisions? If you don't care about complexity, then just compare the retrieved record's key with the provided key -- if they don't match, start your table scan. Yes, you could go with linked lists of hash buckets, but geez, that's crazy overkill for a floppy-based BBS system.
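
A sketch of what that lookup could look like (FNV-1a is used here as a stand-in for CRC32, and the record/name sizes are my assumptions):

Code:
    /* Sketch only: FNV-1a stands in for CRC32; sizes and layout are illustrative. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_USERS   1000
    #define RECORD_SIZE 128L
    #define NAME_SIZE   32

    struct Entry {                 /* the 6 bytes per entry mentioned above (a compiler may pad) */
        unsigned long  hash;       /* 32-bit hash of the key */
        unsigned short recnum;     /* record number, i.e. offset / 128 */
    };

    struct Entry lookupTable[MAX_USERS];
    int entryCount = 0;

    /* Any decent 32-bit hash will do; FNV-1a shown here as a stand-in for CRC32. */
    unsigned long hash32(const char *key)
    {
        unsigned long h = 2166136261UL;
        while (*key) {
            h ^= (unsigned char)*key++;
            h = (h * 16777619UL) & 0xFFFFFFFFUL;
        }
        return h;
    }

    /* Hash the key, find a matching hash in the table, then verify against the record
       itself.  On a mismatch (collision) or a miss, the caller falls back to a table scan. */
    int find_record(FILE *f, const char *key, unsigned short *recnum)
    {
        char name[NAME_SIZE + 1];
        unsigned long h = hash32(key);
        int i;
        for (i = 0; i < entryCount; i++) {
            if (lookupTable[i].hash != h) continue;
            if (fseek(f, (long)lookupTable[i].recnum * RECORD_SIZE, SEEK_SET) != 0) break;
            if (fread(name, 1, NAME_SIZE, f) != NAME_SIZE) break;
            name[NAME_SIZE] = '\0';
            if (strcmp(name, key) == 0) {       /* confirmed: not a collision */
                *recnum = lookupTable[i].recnum;
                return 1;
            }
        }
        return 0;   /* not found via the table: start the table scan of the file */
    }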
 
Splitting the database into separate files for names starting with each letter of the alphabet would improve the speed quite a bit, of course.

This isn't as good an idea as it may seem. Actually, this already is "some kind of" hash algorithm, but a very poor one.

If you do so, you will end up with some very large files for names starting with letters like "E" or "S" and very small ones for "Q" or "X". As a result you will improve search times for some very rare names, but for the majority of the very common names search times will not improve significantly.

Therefore it is better to use a somewhat more "sophisticated" hash function, one that distributes the names more uniformly over the various "buckets".

As already mentioned, summing up the characters of a name is an already much better yet still simple enough approach. To keep the resulting number within the range of available "buckets" (e.g. files), this summing has to be done modulo the number of "buckets", and using a prime number (such as 127) helps to distribute the results uniformly.

Whether one stores the different "buckets" as separate files or as separate linked chains within one file is more a matter of taste.
 
Wiwa64 - still out there?

I revisited the problem and implemented your original solution:

  • Add a 'next in chain' field to each record on disk
  • Use an in memory hash table over the user names

I did a little research on hash functions and decided to use a function described by Dan Bernstein about 15 years ago. Here is the function:

Code:
    unsigned long hash(unsigned char *str) {
        unsigned long hash = 5381;
        int c;

        while (c = *str++) {
            /* hash = hash * 33 + c */
            hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
        }

        return hash;
    }
This hash is supposed to work well for human readable words - it distributes nicely, even though some letters occur more frequently than others.

I'm using a hash table with 64 entries in it to start with. This should be fine for anything I am going to do.
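
Roughly how I'm tying it together (the bucketHeads name is made up for the sketch):

Code:
    /* Sketch only: folding the 32-bit hash above into one of 64 buckets. */
    #define NUM_BUCKETS 64

    unsigned short bucketHeads[NUM_BUCKETS];   /* record number of the first record in each chain */

    unsigned int bucket_for(unsigned char *str)
    {
        /* 64 is a power of two, so the modulo is just masking off the low 6 bits. */
        return (unsigned int)(hash(str) & (NUM_BUCKETS - 1));
    }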

Thanks again for the idea.


Mike
 
Yes, I'm still around. Actually, I peek into the forum every now and then.

Did you measure the distribution over the various buckets? And if so, was the number of entries large enough to give you significant statistics? (The number of entries should be at least, say, five times the number of buckets.) If so, and if your results were satisfactory, then just be happy :)

My personal experience with this subject is 25 years in the past. In those days I was involved in programming a database for an electronic CAD system. In that case many of the "names" were very similar, e.g. "R1" to "R35" for the resistors, "C1" to "C47" for the capacitors, and so on.

I used a very simple hash function which just summed up the values of the ASCII characters, modulo the number of buckets. When experimenting with the optimal number of buckets, I found that a prime number, in this case 127, gave the best results. I am not a mathematician and I cannot give you a "proof" for that, but I assume that the fact that 127 is not divisible by any (printable) ASCII code causes the resulting hash values to distribute evenly over the 127 buckets. I gathered statistics about the distribution and found that it was distributed perfectly uniformly, even (or especially) for large PCB boards with thousands of components. But perhaps I was just lucky? :confused:
 