Design spec for Genealogy Database

Version	1.3
Date	1st Dec 02
Author	Jeya Balaji
What is new	Most of the sections went through considerable changes.
Earlier version	genealogy12.html
Recent version	genealogy14.html

Content

Aim
Functionality
Technology
Design

Table design
Class design

Web admin
Cost estimation
Issues

Aim

Main theme: To collect details of ancestors and current generation, so that, genealogy charts will be available online.

Supplementary idea: This database can be utilized for matrimonial search; ie with person info database, complete family background of a girl or a boy can be known.

Website functionality

The functionalities available in this website are listed here :

Add/edit person records
Search one or more person
See the genealogy chart for a family
Find relationship between 2 person
Search bride/groom
Match bride/groom by horoscope

Technology

Oracle database holds the data of people and maintains relationship between them.
Java beans are used to pull the data from the db.
Jsp is used to present the data and accept user queries.
Servlets mediate beans and jsps
xml, xsl (if required)

Design

MVC model is adapted for implementing the web application. Jsp pages are used to display the records and collect data. servlets and beans are used in the back-stage to help this process.

Database table design

The main table, personTable, holds the records of individual. personTable and spouseTable stores the relationship between people.
The basic relationship stored are father, mother and spouse. With these three relations, we should be able to retrieve all other relatives.

matrimonialTable is used to place matrimonial info. The table horoscopeMatchingFactor determines the possibility of match between two horoscope.

50k to 100k people (records) are expected to be available in the personTable.
Here is the person detail -


create table personTable (
	personId     mediumint not null auto_increment,
	name         varchar(40),
	initial      varchar(10),
	familiarName varchar(40),
	sex          enum ('m', 'f'),
	birthDate    date,
	expireDate   date,

	fatherId     mediumInt,
	motherId     mediumInt,

	religionId   mediumInt   fk classificationTable,
	sectorId     mediumInt   fk classificationTable,		 

	education    varchar(100),
	profession   varchar(100),

	street       varchar(40),
	placeId      mediumInt   fk placeTable,
	native       mediumInt   fk placeTable,

	homePhone    varchar(20),
	workPhone    varchar(20),
	mobile       varchar(20),


	email        varchar(70),
	hideEmail    enum ('0', '1'),
	webpage      varchar(100),
	photo        varchar(100),

	bloodgroup   enum('A+', 'A-' ...),
      
	maritalStatus enum('single', 'married', 'divorce'),
	rasi          enum(0 - 12),
	nakshatram    enum(0-27),


	hobbies      varchar(200),
	description  varchar(500),

	recordStatus enum('available', 'pending'),
     
	primary key (personId)
);

// do we need a separate table or can we use varray in the personTable to hold list of spouse?
create table spouseTable (
	husbandId    mediumInt,
	wifeId       mediumInt,
	marriageDate date,

	primary key (husbandId, wifeId)
);

create table matrimonialTable (
	matriId     mediumInt,
	personId    mediumInt,
	
	height      int,
	weight      int,
	
	complexion  enum('fair', 'medium', 'black'),
	smoking     enum('never', 'sometimes', 'often'),
	drinking    enum('never', 'sometimes', 'often'),
	
	description varchar(500),
	expectation varchar(500),

	primary key matriId
);

HoroscopeMatchingFactor determines matching criteria of all possible combinations of rasi and nakshatram. This table will be useful for searching matrimonial table based on good horoscope match for a particular horoscope.

create table HoroscopeMatchingFactor (
	girlRasi          enum(0-12),
	girlNakshatram    enum(0-27),
	boyRasi           enum(0-12),
	boyNakshatram     enum(0-27),
	matchingFactor    float,

	primary key   (girlRasi, girlNakshatram, boyRasi, boyNakshatram)
)

HoroscopeMatchingFactor table can have 2 indexes based on (girlRasi, girlNakshatram) and (boyRasi, boyNakshatram)

As per social norms, person can be classified based on different belief and sectors (religion, caste, sub divisions). These divisions are maintained in an auxiliary table with the following structure.

create table classificationTable (
	divisionId      mediumInt,
	baseDivisionId  mediumInt,
	description     varchar(100),
	divisionType    enum('belief', 'sector'),

	primary key divisionId
);

Two divisionId (one for belief and one for sector) alone will be stored in the person record. Using this id, other divisions (parent, sub divisions) can be retrieved from the classificationTable.

The city, state, country names are stored in placeTable.

create table placeTable (
	placeId         mediumInt,
	basePlaceId     mediumInt,
	pincode         varchar(10),
	placeType       enum('country', 'state', 'city', 'town', 'village', 'area'),

	primary key placeId
);

Class design

Person is a java bean, which is vital object for all operations. Person bean is filled with the data from the database.
Most of the search are done with the primary key (personId) lookup.

Instead of having the attributes of a person (like name, addr etc) as separate variables, we can have them in a hashtable. Pros and cons are discussed in the issue section.

Person bean has all the attribute related to a person. Note that zodiac sign and star (rasi and nakshatram) are available in a separate class Horoscope. Person class has Horoscope class.

These classes are accessed from servlets and jsp.
(TODO : yet to plan the jsp, servlets)

Planning to provide SessionManager which will hold a pool of recently accessed person records.
(TODO : have to sketch other responsibilities of SessionManager)


class Person
{
	insertPersonRecord()
	{ 
	}
	
	updatePersonRecord()
	{
	}

	deletePersonRecord()
	{

	}

		
	retrievePersonRecord()   // from db
	{

	}

	retrievePersonName()
	{

	}

	formatPersonRecordForDisplay()
	{

	}

	formatPersonNameForDisplay()
	{

	}

	protected int personId;
	protected String name;

	protected Hashtable personData;  // all other info of a person

	protected Horoscope horoscope;
}



class Horoscope
{
	

	matchHoroscope(Horoscope otherHoroscope)
	{
		girlHoroscope = this;
		boyHoroscope = otherHoroscope;
		for(int i = 0; i < 10; i++)
		{
			call match[I](girlHoroscope, boyHoroscope);
		}

		return numberOfMatches, crispDesc, detailDesc;
	}

	// there r 10 match function.  
	// Each function returns a float indicating the matching 
	// criteria.  Also these function place their comment into  
	// crispDesc & detailDesc.
	match[0-9](Horoscope girlHoroscope, boyHoroscope)
	{

	}

	private static String rasiName[] = { "Mesham", � };
	private static String nakshtramName[] = {"Ashwini", �};

	private byte rasi;
	private byte nakshatram;

	private float numberOfMatches = 0;
	private String[] crispDesc, detailDesc;
}


class SessionManager
{
 	retrievePerson(int personId)
	{
		Person person = personPool.retrievePerson(personId);
	  	if (person == null)  // retrieve from db
	  	 	person = Person.retrievePerson(personId);
	  	return person;
	}
	  
	PersonPool personPool;
}


// a container to hold recently accessed person records in this session
class PersonPool
{
	retrievePerson(int personId)
	{
	}
	 
	addPerson(Person person)
	{
	  	// implement lru algorithm to handle person object
	}
}

Web admin

Tasks for Web Admin

Analyze the daily report generated by House keeping module and fix the issues
Go thru ALL new/edited records and approve them

Criteria for a good Webhost

Here are some points to consider while looking for a webhost.

Web space
- File space to hold html, script, class files
- Database space
- Runtime memory
Are these size specification flexible?
Number of ip addr (port) provided
Number of schemas (users?) allowed in the database
Backup, recovery procedure
Software version
- Jdk, jsp
- Web server
- Database
- Other api
Bandwidth, scalability
Statistical analyses of nature and behavior of users

Development environment

The development environment I have in my local machine is as follows :

java 1.3
Tomcat 4.0
- JSP 1.2
- Servlet 2.3
Oracle 8 - Trying for a recent version (>= Oracle 8i)
Ant 1.5.1 - a java based build tool from Apache
JCreater - a java IDE from Xinox
CoffeeCup - a html editor
Winston Veerender, a friend, suggested Eclipse from www.eclipse.org as a java IDE; Will try that if it is more suitable than JCreater for this application.

Cost estimation

Article	Amount (in Rs)	Description
Investment
Books	2000	Got the books Professional JSP Beginning XML Oracle Bible Planning to buy couple of books of this nature (bpbpub.com) 81 87105 86 0 Creating Commercial webpages 81 7635 046 X Dynamic Html unleashed 81 7656 271 8 XHTML 81 7656 306 4 Advanced Html with DHtml 81 7656 274 2 Commercial Application
Printing	1000	Letter pad Visiting card ...
Hard disk	????	Not sure if a hard disk is needed in the local machine. If needed, it may cost around 5k for 40GB
Misc	2000
Running cost (per month)
Article	Amount (in Rs)	Description
Domain name	250
Web space	1500
Internet/Power	1500
Domain name promotion	2500	Cable TV advt Stickers in cafes Notice in newspapers Expecting this cost will be cut in 3 to 6 months.
Salary	????	Expecting people from different fields like : Marketing, sales Data collection Web admin

Issues

Hashing

Ideally, Person class should be a entity bean mapping to one record of the PersonTable. I am thinking about a lazy way to code Person class.
Instead of having all the field names as an attribute(in a separate variable), we can have one hashtable which will hold all the name, value pairs.

Advantage:

Ease of coding and maintenance
Saves lots of space, because expecting most of the fields to be null. These null fields will not be in the hash.
Modifying the table will affect only the xsl file (or any presentation layer)

Disadvantage:

Accessing the attributes of the Person class will be 2 or 3 times slower than directly accessing a variable.
Space for the field name stored along with actual value should be considered

A question concerning the field names in the hashtable. Let us say that there are 100 Person objects in the jvm at a particular time. Each object has a hashtable which has almost same set of field names (we can expect all of the objects having many common fields). Take the scenario: there are 15 fields in each hash, out of which 10 are common to all the 100 hashtables. Will the jvm hold 15 * 100 strings (field names) in the memory or will it hold say 15 to 100 strings in the memory and use some indexing mechanism in the hashtable ...

Finding the relationship

A person record has three basic relationships : father, mother and a list of spouse.
We can view the table as a graph structure, each person as a node in the graph.
Our aim is to traverse through the graph (breadth first search) and find a path between two nodes.

Instead of viewing as a single graph, it will be easy to picturize them as a collection of well connected graphs. Nodes in a graph has a field to mention which group they belong. This field will help us to take a quick guess on the feasibility of finding a path. However, we take this info just as a suggestion; irrespective of the field suggesting positive or negative about the path, we explore the graph to find the path.

Is this field essential? Apart from space for such field in each record, we have to code for maintaining the definition. ie, when a new node enters the system, we have to check if different groups can be clubbed together through this new node. Also, deletion of a node should be considered for splitting groups.

A temp table can be used to do the search. The nodes visited is placed in the temp table. Traversing the graph takes place parallely from both the direction, ie, from person 1 and person 2.

Can we implement this logic in plsql?
Do we require java stored proc?
Will these proc handle multi-thread?
Do we have to use different temp table names?

The idea of navigation is simple. Get the relative's id and put them in the table, until a path is found. Is there any possibility of "heuristic" search?

The above logic will fetch one path. How can we find other genuine paths between those 2 nodes? ie the new path shouldn't be a slight deviation from the first path; instead it should imply a new relationship between the nodes. Note that some of the nodes may be common between two different paths.

Data entry

The webserver is expected to hold the main database. When there is a need to enter bulk data, it will be wise (in India) to enter them offline. (Recent trend - in cities, dishnet has come up with a scheme for unlimited direct net access without using phone line. If we have such connection, then we don't have to worry about this cost.)

For bulk data entry, a local database is provided with facility to enter and upload records.

The PersonId may be different for a record going from local to main db. The transferring code changes all the PersonId columns in the local db to that of main db and then transfers the records. This will ensure the id consistency between the 2 db. After uploading, the data collector will be prompted to delete the uploaded records (of the local db) or restore them.

Note that the local db will hold a small fraction of the main db's PersonTable records.

Important point to remember : only the personTable can be uploaded from local to main db. None of the other table will be uploaded. So, the data collector is not expected to add items to other tables of the local db. Instead, she/he should modify the main database and then synchronize the local db with that of main.

This logic seems cumbersome. If there is a way to avoid this functionality, I would like to do that. So, I will try to get a unlimited connection (without using phone). Or collect the data in xml format (or which ever format is easy to type) and then try to upload the data to the main db using some automated/semi-automated process.

House keeping

Periodical check of the database is required to ensure the validity and correctness of the entered data. This operation can be done in a semi-automatic fashion.

For each new/edited person record, see if there is a similar record
Send a report about new records and other relevant details to the web admin

Choosing the classification

We have numerous sectors, believes, and sub divisions (religion, caste, subdivisions etc) to which a person belongs.
Here is a way to display that list and allow the user to choose out of the list.

The division and subdivision relationship is retrieved and placed in xml files. These files will be used by the end user to choose appropriate belief and sector.

xml file displays the content in a tree structure. If there is an easier method to display classifications as a tree structure, that could be used.

Misc

In the personTable, can I use varray to hold spouse Ids? This will avoid a separate spouseTable.
What is the webspace we need for this application?
Is it legal, ethical to mention dowry range, so that matrimonial search can be done based on it?

Resolved issues

What size should the personId be? (mediumInt or longInt ?)
- mediumInt should be ok
In the personTable, I am expecting most of the fields to be null for most of the records (fields like webpage, email, addr etc). Will storage be utilization be optimal in such case or is it advisable to have two tables (first table with vital info & auxiliary table with sparsial data)?
- Ram: There will be minor a difference if you split the tables and the aux table will not have a row if all the cols in it are null. But it is not worth splitting the table. Go ahead with one if you expect only 50k to 100k rows.
Will the table be indexed on primary key automatically or do we have to do that?
- Ram: Table will be indexed on PK automatically. But you may want to add additional indexes based on the queried cols. e.g. Religion, age etc. this will improve the performance.
What is index-only table?
- Ram: IOTs are special kind of tables where data and index entries are stored in the same segment. (Normal tables will have different data and index segments). IOTs improve query performance. Generally used in look up tables which does not involve high dml activity but get large number of queries. If you have many dmls on this table the performance may get affected.

Hosted by www.Geocities.ws