Dynamic Document Search Engine - Part 1
Introduction:
I started working with PHP six months ago, and the many articles I read on the Internet gave me a better understanding of the language. I then began developing software for “Online Journals” that can search the contents of documents. You can find articles on devarticles.com that perform keyword, title, and author searches. This article gives you a brief introduction to document-based search.
What is Document Search?
In a Dynamic Document Search, every word in the document is parsed (read) and matched against the search words. Results are displayed based on the matches found.
Reading every word of every article and matching it against the search words across thousands or even hundreds of thousands of documents is a very expensive task. In addition, PHP by default limits a script to 30 seconds of execution time (the max_execution_time setting).
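The 30-second limit can be inspected and, where the host permits, raised for a single long-running script. The following is a minimal sketch; the value 120 is an arbitrary example, and some hosting setups may refuse the change.

```php
<?php
// Read the configured limit. The command-line interpreter normally reports 0,
// which means "no limit"; a web server typically reports 30.
$limit = ini_get('max_execution_time');
echo "Current limit: {$limit} seconds\n";

// Ask for 120 seconds for this script only. Returns false if the host forbids it.
$raised = set_time_limit(120);
echo $raised ? "Limit raised\n" : "Limit could not be changed\n";
```

Raising the limit only postpones the problem, which is why the rest of this article builds a keyword index instead of scanning every document at search time.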
Prerequisites:
To understand this article, you should have a fair knowledge of PHP. To run the examples on your machine, you need Apache, PHP, and MySQL installed and configured. I used PHP version 4.3.1 and MySQL 2.2.3.
Building Database:
The database consists of three tables: the Content table, the Keyword table, and the Link table. The Content table holds each article’s title and abstract. The Keyword table holds the keywords, and its keyword field is indexed. The Link table holds pairs of keyword id and content id.
The SQL statements for creating these three tables are shown below.
Content Table:
CREATE TABLE content (
  contid mediumint(9) NOT NULL auto_increment,
  title text,
  abstract longtext,
  PRIMARY KEY (contid)
) TYPE=MyISAM;
Keyword Table:
CREATE TABLE keytable (
  keyid mediumint NOT NULL auto_increment,
  keyword varchar(100) default NULL,
  PRIMARY KEY (keyid),
  KEY keyword (keyword)
) TYPE=MyISAM;
Link Table:
CREATE TABLE link (
  keyid mediumint NOT NULL,
  contid mediumint NOT NULL
) TYPE=MyISAM;
Preparing Database:
An input interface with an HTML form is created for entering the title and the document. After the form is filled in and submitted, the title and the abstract are stored in the content table, and the newly generated content id is held temporarily in a variable. In the next step, an ‘Upload Engine’ parses each word in the abstract and processes the whole text. It removes common words such as is, was, and, if, so, else, then, etc., and stores each remaining word in the wordmap array, making sure that every word has only one entry there.
For every word in the wordmap array, the keyword table is searched for a match. If there is a match, the existing key id and the content id generated earlier are stored in the link table. Otherwise, the new keyword is inserted into the keyword table, and the link table is updated with the newly generated key id and the content id. This completes the preparation of our database.
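To make the flow concrete, here is a small in-memory sketch of the same idea. It is only an illustration: plain PHP arrays stand in for the three MySQL tables, and the helper name indexDocument() and the sample documents are assumptions, not part of the upload program.

```php
<?php
// In-memory stand-ins for the three tables (assumption: arrays instead of MySQL).
$content  = array();   // contid => title
$keytable = array();   // keyword => keyid
$link     = array();   // list of array(keyid, contid) pairs

// Hypothetical helper: index one document from its already-filtered word list.
function indexDocument($title, $words, &$content, &$keytable, &$link) {
    $contId = count($content) + 1;                     // plays the role of auto_increment
    $content[$contId] = $title;
    foreach (array_unique($words) as $word) {
        if (!isset($keytable[$word])) {
            $keytable[$word] = count($keytable) + 1;   // new keyword, new key id
        }
        $link[] = array($keytable[$word], $contId);    // one link row per word and document
    }
    return $contId;
}

indexDocument('Paging in PHP', array('paging', 'php', 'programs'), $content, $keytable, $link);
indexDocument('PHP sessions',  array('php', 'sessions'),           $content, $keytable, $link);

print_r($keytable);               // each keyword appears once, shared by both documents
echo count($link), " link rows\n";
```

The word ‘php’ is stored once in the keyword list but linked to both documents, which is what lets a search find documents from keywords without rereading the text.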
The code snippets given below explain every step of the program.
Searching the keyword table once for every word is a slow process that reduces the efficiency of the program. To avoid this, all the keywords in the keyword table are loaded into an associative array, $allWords. PHP implements associative arrays as hash tables, so looking up a key is very fast. Here is the function.
<?php
// Load every keyword and its key id into the global lookup array $allWords.
function LoadCurrentWords() {
    global $allWords;
    $allWords = array();
    $result = mysql_query( "select keyid, keyword from keytable" )
        or die( "Error in executing mysql query" );
    while ( $row = mysql_fetch_array($result) ) {
        $allWords[$row['keyword']] = $row['keyid'];
    }
}
?>
Common Words:
$COMMON_WORDS is an associative array whose keys are words that are commonly used in the English language. These words have to be removed while parsing the file.
$COMMON_WORDS = array("a" => 1, "as" => 1);
You can add as many common words as you like. See the source code for the full list of common words.
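Storing the words as array keys means a membership test is a single key lookup. The sketch below shows one way to do that check; the helper name isCommonWord() and the four-word list are illustrative assumptions.

```php
<?php
// Sample stop list (assumption: the real list in the source code is longer).
$COMMON_WORDS = array('a' => 1, 'as' => 1, 'is' => 1, 'the' => 1);

// isset() tests for the key directly and raises no notice for unknown words.
function isCommonWord($word, $commonWords) {
    return isset($commonWords[strtolower($word)]);
}

var_dump(isCommonWord('The', $COMMON_WORDS));      // bool(true)
var_dump(isCommonWord('search', $COMMON_WORDS));   // bool(false)
```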
ExtractWords() Function:
This function filters words by allowing only alphabetic characters. To implement this, I used a technique called a STATE MACHINE that filters the characters.
The machine is in STATE1 while it is reading alphabetic characters and in STATE0 while it is reading any other characters (numeric and special characters). Initially the machine is in STATE0. When it encounters an alphabetic character, it switches to STATE1 and starts a new word; it stays in STATE1 while letters continue, and the first non-alphabetic character ends the word and returns the machine to STATE0. As a result, we get words made up of only alphabetic characters.
<?php
// Split $text into lowercase words made up only of alphabetic characters.
function ExtractWords($text) {
    $STATE0 = 0; // numeric / other characters
    $STATE1 = 1; // alphabetic characters
    $state = $STATE0;
    $wordList = array();
    $curWord = "";
    $length = strlen($text);
    for ( $i = 0; $i < $length; ++$i ) {
        $ch = $text[$i];
        $isAlpha = ctype_alpha( $ch );
        if ( $state == $STATE0 ) {
            if ( $isAlpha ) {
                // first letter of a new word
                $curWord = $ch;
                $state = $STATE1;
            }
        }
        else if ( $state == $STATE1 ) {
            if ( $isAlpha ) {
                $curWord .= $ch;
            }
            else {
                // a non-letter ends the current word
                $wordList[] = strtolower( $curWord );
                $state = $STATE0;
            }
        }
    }
    // store the last word if the text ended in the middle of one
    if ( $state == $STATE1 ) {
        $wordList[] = strtolower( $curWord );
    }
    return $wordList;
}
?>
As a result, we get a list of lowercase words stored in an array, which is returned to the calling function.
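For comparison, the same word list can be produced with a regular expression instead of a state machine. This is a swapped-in technique, not the method ExtractWords() uses; the function name extractWordsRegex() is hypothetical, and the pattern assumes plain ASCII letters.

```php
<?php
// Regex equivalent of the state machine: every run of ASCII letters is a word.
function extractWordsRegex($text) {
    preg_match_all('/[A-Za-z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

// Digits and punctuation split words, so "PHP4" yields "php" and "isn't" yields "isn" and "t".
print_r(extractWordsRegex("PHP4 is fast, isn't it?"));
```

The single-letter fragment "t" produced here shows why the filtering step that follows discards words of length one.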
FilterCommonAndDuplicateWords() Function:
This function is called after the ExtractWords() function. It goes through the extracted words and removes common words such as ‘a’, ‘is’, ‘was’, ‘and’, and so on. The remaining words are treated as valid words; duplicates among them are removed, and the result is stored in an associative array, $wordMap, which is returned to the calling function.
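<?php
// $COMMON_WORDS and $MAX_WORD_LENGTH are globals set elsewhere in the script.
function FilterCommonAndDuplicateWords( $wordList ) {
    global $COMMON_WORDS;
    global $MAX_WORD_LENGTH;
    $wordMap = array();
    foreach ( $wordList as $word ) {
        $len = strlen( $word );
        if ( ($len > 1) && ($len < $MAX_WORD_LENGTH) ) {
            // skip words already stored and words on the common list
            if ( !isset($wordMap[$word]) ) {
                if ( !isset($COMMON_WORDS[$word]) ) {
                    $wordMap[$word] = 1;
                }
            }
        }
    }
    return $wordMap;
}
?>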
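The following standalone sketch shows the filter applied to a short word list. It is a variant for illustration only: the name filterWords() is hypothetical, the settings are passed as parameters rather than globals, and the stop list and the length limit of 50 are assumed values.

```php
<?php
$COMMON_WORDS    = array('a' => 1, 'is' => 1, 'the' => 1);   // sample stop list
$MAX_WORD_LENGTH = 50;                                        // assumed limit

// Hypothetical variant of the filter that takes its settings as parameters.
function filterWords($wordList, $commonWords, $maxLen) {
    $wordMap = array();
    foreach ($wordList as $word) {
        $len = strlen($word);
        if ($len > 1 && $len < $maxLen && !isset($commonWords[$word])) {
            $wordMap[$word] = 1;   // assigning the same key twice removes duplicates
        }
    }
    return $wordMap;
}

$kept = array_keys(filterWords(
    array('the', 'search', 'is', 'a', 'search', 'engine', 'x'),
    $COMMON_WORDS,
    $MAX_WORD_LENGTH
));
print_r($kept);   // search, engine
```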
ProcessForm() Function:
This is the core part of the upload program. First, it extracts the words from the abstract and filters out the common and duplicate words using the two functions above. It then inserts the title and abstract into the content table and stores the newly generated content id in $contentId. Finally, it updates the keyword and link tables. For every word in the word map, if the word already exists in the keyword table, it inserts the existing key id and the content id into the link table. Otherwise, it inserts the new word into the keyword table, stores the newly generated key id in $keyId, and then inserts the key id and content id into the link table.
<?php
function ProcessForm($title, $body) {
    global $allWords;
    $tempWordList = ExtractWords( $body );
    $wordList = FilterCommonAndDuplicateWords( $tempWordList );
    // insert into content
    mysql_query( sprintf( "INSERT INTO content (title, abstract) VALUES ('%s', '%s')",
        mysql_escape_string($title), mysql_escape_string($body) ) );
    // store the newly generated content id in $contentId
    $contentId = mysql_insert_id();
    // insert all the new words and links
    foreach ( $wordList as $word => $val ) {
        if ( !isset($allWords[$word]) ) {
            // new keyword: add it to keytable and remember its key id
            mysql_query( sprintf( "INSERT INTO keytable ( keyword ) VALUES ( '%s' )",
                mysql_escape_string($word) ) );
            $keyId = mysql_insert_id();
            $allWords[$word] = $keyId;
        }
        else {
            $keyId = $allWords[$word];
        }
        // insert the link
        mysql_query( sprintf( "INSERT INTO link (keyid, contid) VALUES ( %d, %d )",
            $keyId, $contentId ) );
    }
    // End of ProcessForm.
}
?>
The following code snippet is the starting point of execution and calls all the functions above. When the form is submitted, it connects to the database server, selects the database, and processes the input. Otherwise, the form() function is called, which lets you enter the title and abstract of the document.
<?php
// Form fields are read from $_POST, so register_globals is not required.
if ( isset($_POST['submit']) ) {
    $title = trim($_POST['title']);
    $body  = trim($_POST['body']);
    mysql_connect( "localhost", "root", "" ) or die( "Unable to connect to database" );
    mysql_select_db( "kpp" ) or die( "Unable to select database" );
    LoadCurrentWords();
    if ( $title and $body ) {
        ProcessForm( $title, $body );
    } else {
        form( "Please fill in both fields to upload\n" );
    }
} else { // nothing submitted yet: show the form
    $err = "Please fill in the fields to upload\n";
    form($err);
}
function form($errmsg)
{ ?>
<h4 align="center">File Parser &amp; Uploader</h4>
<b><?php echo $errmsg; ?></b>
<center>
<form method="POST" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
Title: <input type="text" name="title"><p>
Abstract: <input type="text" name="body"><p>
<input type="submit" name="submit" value="Start Parsing and Upload Content">
</form>
</center>
<?php
}
?>
Search Engine:
A PHP script makes it possible to query the database through an HTML form. It works like any other search engine: the user enters a word in a textbox, hits enter, and the interface presents a result page with links to the pages that contain the word searched for.
In this example, the order in which the pages are presented is determined by the number of search words that appear in each document.
Declare an associative array $CommonWords that contains common words like ‘is’, ‘in’, ‘was’, etc. Removing these words from the search string refines our search criteria.
First, convert all the search words to lower case.
$search_keywords = strtolower(trim($keywords));
Next, perform an explode operation on the search string, which stores each search word in an array. The code is shown here.
$arrWords = explode(" ", $search_keywords);
Next, remove duplicate words in $arrWords.
$arrWords = array_unique($arrWords);
Next, separate the common words from the search words. Search words are stored in $searchWords and common words are stored in $junkWords. Here is the code.
<?php
$searchWords = array();
$junkWords = array();
foreach ( $arrWords as $word ) {
    // remove common words
    if ( !isset($CommonWords[$word]) ) {
        $searchWords[] = $word;
    } else {
        $junkWords[] = $word;
    }
}
?>
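The query-preparation steps can also be gathered into one function. The sketch below is an illustration under stated assumptions: the name splitQuery() is hypothetical, the stop list is a three-word sample, and it splits on any run of whitespace (via preg_split) rather than on a single space, so repeated spaces do not produce empty words.

```php
<?php
// Hypothetical helper combining the steps above: lowercase, split, de-duplicate,
// then separate common words from real search words.
function splitQuery($keywords, $commonWords) {
    $words = preg_split('/\s+/', strtolower(trim($keywords)), -1, PREG_SPLIT_NO_EMPTY);
    $searchWords = array();
    $junkWords   = array();
    foreach (array_unique($words) as $word) {
        if (isset($commonWords[$word])) {
            $junkWords[] = $word;
        } else {
            $searchWords[] = $word;
        }
    }
    return array($searchWords, $junkWords);
}

list($searchWords, $junkWords) = splitQuery(
    '  Paging is  in PHP paging ',
    array('is' => 1, 'in' => 1, 'was' => 1)   // sample stop list
);
print_r($searchWords);   // paging, php
print_r($junkWords);     // is, in
```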
We can display results in two ways.
Type 1: Display the document if all the search words are present in the document.
Type 2: Display the document if any one of the search words is present.
If you want to perform the Type 1 operation, include the following code snippet in your program.
// count the number of search words and store it in a variable
$noofSearchWords = count($searchWords);
$noofSearchWords stores the number of search words. Later, after looking up the search words in the keyword table, we get a set of records, on which we can perform a logical AND operation to display the desired results. If $noofSearchWords is equal to the number of records returned, the next part of the program is executed. Otherwise, “NO SEARCH RESULT FOUND” is displayed.
In the next step, we have to look up the words of the $searchWords array in the keyword table. The following code snippet returns the list of key ids that match the query.
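<?php
// escape each word, then implode the words into one string of OR conditions
$escapedWords = array_map( 'mysql_escape_string', $searchWords );
$arrWords = implode( "' OR keyword='", $escapedWords );
// get the key ids from the key table
$query = "select * from keytable where keyword='$arrWords'";
$kResult = mysql_query($query);
?>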
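An alternative to joining the words with OR is an IN list with one placeholder per word. The sketch below only builds the SQL text and the parameter list; it assumes the query would later be run through a prepared-statement API such as PDO, which is not shown, and the function name buildKeywordQuery() is hypothetical.

```php
<?php
// Build "keyword IN (?, ?, ...)" with one placeholder per search word.
// The words travel separately as parameters, so quotes in user input
// cannot change the SQL text.
function buildKeywordQuery($searchWords) {
    if (count($searchWords) === 0) {
        return null;   // nothing to search for
    }
    $placeholders = implode(', ', array_fill(0, count($searchWords), '?'));
    $sql = "SELECT keyid, keyword FROM keytable WHERE keyword IN ($placeholders)";
    return array($sql, array_values($searchWords));
}

list($sql, $params) = buildKeywordQuery(array('paging', "o'reilly"));
echo $sql, "\n";
print_r($params);
```

Returning null for an empty word list lets the caller skip the query entirely when the user typed only common words.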
As discussed earlier, if you need to perform the Type 1 operation, you have to check whether the number of search words equals the number of records returned by the query. If they are equal, you can proceed to the next step; otherwise, display a message that no search result was found. Here is the code.
<?php
if ( mysql_num_rows($kResult) == $noofSearchWords ) {
    // search for the key ids in the link table and get the content ids
    // fetch the title and the first 200 characters of the abstract into an array
    // display the result
} else {
    echo "NO SEARCH RESULT FOUND";
}
?>
The following code searches the link table for occurrences of the key ids. It builds an array, $contArray, whose keys are the content ids and whose values count how many of the search words each document contains.
<?php
$contArray = array(); // content id => number of search words found in it
while ( $kRow = mysql_fetch_array($kResult) )
{
    // get the link rows for each key id
    $kid = (int) $kRow['keyid'];
    $query = "SELECT * FROM link WHERE keyid=$kid";
    $lResult = mysql_query($query);
    while ( $lRow = mysql_fetch_array($lResult) )
    {
        $thisContentId = $lRow["contid"];
        if ( !isset($contArray[$thisContentId]) ) {
            $contArray[$thisContentId] = 1;
        } else {
            $contArray[$thisContentId]++;
        }
    }
} // end of while
?>
Sort the array in descending order of its values, keeping the content ids as keys. This orders the documents from the highest number of matched words to the lowest. For example, if the number of search words is four, documents matching 4 words come first, then 3, then 2, and last 1.
//Sort array in descending order of value, preserving the keys
arsort($contArray);
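The counting and sorting steps can be tried without a database. In the sketch below the link rows are made-up sample data, and the final loop shows one possible way to apply the Type 1 (all words) rule per document by comparing each document's count with $noofSearchWords; it assumes each (keyid, contid) pair appears only once in the link table, as the upload code arranges.

```php
<?php
// Sample link rows (keyid, contid) for three matched key ids (assumed data).
$linkRows = array(
    array(1, 10), array(2, 10), array(3, 10),   // document 10 has all three words
    array(1, 11), array(3, 11),                 // document 11 has two
    array(2, 12),                               // document 12 has one
);
$noofSearchWords = 3;

// Count how many distinct search words each document contains.
$contArray = array();
foreach ($linkRows as $row) {
    $contId = $row[1];
    $contArray[$contId] = isset($contArray[$contId]) ? $contArray[$contId] + 1 : 1;
}

arsort($contArray);   // highest count first, content ids kept as keys
print_r($contArray);

// Type 1 (AND): keep only documents that matched every search word.
$allWordsDocs = array();
foreach ($contArray as $contId => $count) {
    if ($count == $noofSearchWords) {
        $allWordsDocs[] = $contId;
    }
}
print_r($allWordsDocs);
```

Checking the count per document is stricter than comparing the number of rows found in the keyword table, which only confirms that each word exists somewhere in the index.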
In the next step, we have to fetch the title and the first 200 characters of the abstract from the content table into an array, $FoundRef.
<?php
// declare an array to store the results
$FoundRef = array();
foreach ( $contArray as $contentId => $occurances ) {
    $aQuery = "select contid, title, left(abstract,200) as summary from content where contid = "
        . (int) $contentId;
    $aResult = mysql_query($aQuery);
    if ( mysql_num_rows($aResult) > 0 ) {
        $aRow = mysql_fetch_array($aResult);
        $FoundRef[] = array (
            "contid" => $aRow["contid"],
            "title" => $aRow["title"],
            "summary" => $aRow["summary"],
            "occurance" => $occurances
        );
    } // end of if
}
?>
Finally we have to display the results in the browser. Here is
the code.
<?php
if ( isset($FoundRef) )
{
    echo "<table width=\"100%\"><tr><th class=\"title\">Search Result</th></tr></table>";
    echo "<a href=\"#\" onclick=\"history.back()\">Back</a>";
    echo "<br />";
    echo sizeof($FoundRef);
    echo (sizeof($FoundRef) == 1 ? " reference" : " references");
    echo " found";
    echo "<p>";
    if ( $junkWords ) {
        echo "Common words like";
        foreach ( $junkWords as $jWords ) {
            echo " '" . htmlspecialchars($jWords) . "'";
        }
        echo " are removed from the search string";
    }
    echo "</p>";
    foreach ( $FoundRef as $a => $value )
    {
        echo "<table>";
        echo "<tr><td valign=\"top\">";
?>
<a href="showref.php?refid=<?php echo $FoundRef[$a]["contid"] ?>"><em><b><?php echo htmlspecialchars($FoundRef[$a]["title"]) ?></b></em></a>
<div align="right"> Occurrence(s): <?php echo $FoundRef[$a]["occurance"] ?></div>
<br /><small><?php echo htmlspecialchars($FoundRef[$a]["summary"]) ?>...</small><br /><br />
<?php
        echo "</td></tr>";
        echo "</table>";
    }
} // end of isset FoundRef
?>
The HTML page that gets input from the user is given below.
<html>
<head>
<title>Search Engine</title>
<style type="text/css">
body{ font-size:20px; font-weight:bold;
font-stretch:semi-expanded;
font-family:MSserif; color:#0066CC;
text-align:center;
background-color:white }
h4{ background-color:#0066CC; color:#FFFFFF; font-family:verdana; }
h3{ color:#0066CC; }
th{ background-color:#6996ED; color:#FFFFFF; font-family:Arial; }
a{text-decoration:none;}
</style>
</head>
<body>
<?php
// Form fields are read from $_POST, so register_globals is not required.
if ( isset($_POST['submit']) )
{
    $keywords = trim($_POST['keywords']);
    if ( !$keywords ) {
        $errmsg = "Sorry, please fill in the search field";
        form($errmsg);
    } else {
        // Start timer
        $start = getmicrotime();
        // PERFORM SEARCH OPERATION (the steps described above fill $FoundRef)
        // End timer
        $end = getmicrotime();
        // TOTAL TIME TAKEN TO DO SEARCH OPERATION
        $time_taken = (float)($end - $start);
        $time_taken = number_format($time_taken, 2, '.', '');
        echo "<p>Your Query Executed in $time_taken Seconds";
        if ( !empty($FoundRef) ) {
            // DISPLAY RESULT
        } else {
            $errmsg = "<p>No Search result found for '" . htmlspecialchars($keywords) . "'";
            echo $errmsg;
            echo "<br /><a href=\"#\" onclick=\"history.back()\">Back</a>";
        } // end of if results found
    } // end of if key word exists
} else { // display the form
    form("");
} // END OF FORM DISPLAY
?>
</body>
</html>
<?php
function form($errmsg)
{ ?>
<h4 align="center">Search Engine</h4>
<b><?php echo $errmsg; ?></b>
<form method="POST" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
Enter keywords to search on:
<input type="text" name="keywords" maxlength="100" />
<input type="submit" name="submit" value="Search" />
</form>
<?php
}
// Return the current time in seconds as a float (with microsecond precision).
function getmicrotime()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
?>
The getmicrotime() function returns the current time in seconds as a floating-point number with microsecond precision. It is called at the start and at the end of the search process.
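On PHP 5 and later, the same timing can be done without a helper, because microtime() accepts a flag that makes it return a float directly. This is a sketch for newer PHP versions than the one used in this article; the usleep() call is only a stand-in for the search work.

```php
<?php
// microtime(true) returns the current Unix time as a float number of seconds.
$start = microtime(true);

usleep(10000);   // stand-in for the search work: sleep for 10 milliseconds

$elapsed = microtime(true) - $start;
echo 'Your query executed in ', number_format($elapsed, 2, '.', ''), " seconds\n";
```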
Conclusion:
In this Part 1, the search engine searches for the occurrence of words in the document. In Part 2, the upload step is modified slightly so that, when we upload a document, the number of occurrences of each word is stored in the link table. The search engine then ranks documents by the number of occurrences of each search word in the document. For example, if the word ‘paging’ occurs 11 times and ‘programs’ occurs 12 times, then the rank for the document is 11 + 12 = 23.
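The ranking idea can be previewed with a few lines of PHP. This is only a sketch of the described scheme, not the Part 2 code: the word list is made-up sample data, the function name rankDocument() is hypothetical, and the counts are kept in an array rather than in the link table.

```php
<?php
// Hypothetical word list for one document, as ExtractWords() might return it.
$words = array('paging', 'programs', 'paging', 'php', 'programs', 'programs');

// array_count_values() maps each word to the number of times it occurs.
$occurrences = array_count_values($words);

// Rank for a query = sum of the stored counts of the query words found.
function rankDocument($occurrences, $searchWords) {
    $rank = 0;
    foreach ($searchWords as $word) {
        if (isset($occurrences[$word])) {
            $rank += $occurrences[$word];
        }
    }
    return $rank;
}

echo rankDocument($occurrences, array('paging', 'programs')), "\n";   // 2 + 3 = 5
```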
Source Code: