General :

We designed and implemented a caching web proxy server that its administrator can configure remotely over the Internet. The proxy handles only HTTP communications. The source is written in Java, which makes the code portable across compilers and the application cross-platform. Java also lets the main proxy application send Java applets to a remote administrator, who sits at a different machine on a remote network and is reached over the Internet by standard TCP/IP communication.

The proxy includes two applications. The main application starts up and serves as a proxy server: it listens for client requests on a specific port, forwards each request to a web server or to another web proxy (the father proxy), and then sends the replies back to the clients. This application will be referred to as the proxy application.

The proxy application also does caching. When a client requests an object, the proxy checks whether the object is cached. If so, it does not forward the request and replies to the client with the cached object (this is called a cache hit). If the requested object is not cached (a cache miss), the request is forwarded, and when the reply arrives the proxy both sends it back to the client and caches it on its own machine for future use. Thus, if a client requests an object that has already been requested (by the same client or by another one), the proxy retrieves the object from its local cache and sends it back without going out to the Internet for it. This gives a performance boost to the proxy itself, which forwards less traffic, and to the client (and to all other clients who request these cached objects in the future). Objects are cached in the host machine's file system, so the cache size is limited by the free hard disk space on the local system.
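The hit/miss flow described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the project's code: the class name `CachingFetcher` is hypothetical, an in-memory map stands in for the file-system cache, and the `upstream` function stands in for forwarding to the father proxy or web server.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the cache hit/miss flow; not the project's actual class. */
class CachingFetcher {
    // Assumption: an in-memory map replaces the on-disk cache (URL -> object bytes).
    private final Map<String, byte[]> cache = new HashMap<>();
    // Assumption: this function replaces forwarding to the father proxy or web server.
    private final Function<String, byte[]> upstream;
    private int hits = 0;
    private int misses = 0;

    CachingFetcher(Function<String, byte[]> upstream) {
        this.upstream = upstream;
    }

    /** Returns the object for the URL, forwarding the request only on a cache miss. */
    synchronized byte[] fetch(String url) {
        byte[] cached = cache.get(url);
        if (cached != null) {
            hits++;               // cache hit: reply without forwarding
            return cached;
        }
        misses++;                 // cache miss: forward, then keep a copy
        byte[] reply = upstream.apply(url);
        cache.put(url, reply);
        return reply;
    }

    synchronized int hits() { return hits; }
    synchronized int misses() { return misses; }
}
```

Requesting the same URL twice through one `CachingFetcher` calls `upstream` once: the first call is a miss and the second is a hit.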
The ability to forward requests to another web proxy or to a web server, together with the caching behavior, makes it possible to use a hierarchy of caching proxy servers. Modern client browsers cache on the client's local machine, so caching is really done in levels (a hierarchy). The first level is the RAM of the client machine (the browser usually keeps a few web pages in memory). The size of this level depends on the memory resources of the browser machine. The second level, also maintained by the browser, keeps objects in persistent storage, using the local file system of the browser machine. The third level is the caching done by the proxy on its own machine. If the proxy forwards requests to another proxy, creating a chain or hierarchy of proxies, each proxy caches on its own machine and so adds a further caching level. If the requested object cannot be found anywhere along the cache chain, the request is finally forwarded to the web server, which delivers the bits back through the chain of proxies, enabling each of them to cache the reply for future use.

Its importance :

It seems like a lot of disk space is being spent that way, so what's the point? The answer is that disk space can be cheaper than the time spent fetching the requested objects from the Internet, and access to the Internet is very time consuming. In any case, the proxy administrator can choose whether to enable this feature, so it is up to him to decide - the technology is there for him.

This is especially important when considering Intranets. Today, organizations manage their internetworking by forming a local network (an Intranet) that has a gateway to the Internet. A proxy server usually runs on this gateway machine. For big organizations, a flat local Intranet is not a good enough solution.
Instead, the Intranet is built as a hierarchy of sub-Intranets: each sub-net is gatewayed to the main Intranet through a proxy, and the entire Intranet is gatewayed to the Internet through a proxy. For example, let's say the corporation Turtle Inc. has several branches. The branch in Italy has an Intranet, and so do the branches in France and New York. The Turtle corporation has a hierarchy of Intranets and proxies as follows: the Italy branch is gatewayed to the Europe Intranet via a proxy server, and so is the France branch. The Europe Intranet is gatewayed to the Corp Intranet (in the U.S.) through a proxy server, and the New York branch is gatewayed to the Corp Intranet through its own proxy server.

Now, when a client browser in Italy requests an object that was requested earlier by a French client, it gets the reply from the cache in the Europe Intranet proxy. When a client browser in New York requests an object that was requested earlier by the French client, it gets the reply from the cache in the Corp Intranet proxy. And when a client in France requests an object previously requested by himself or by another French client, he gets the reply from the cache in the France Intranet proxy. Only the first request goes out to the Internet; all future requests for the same object are satisfied from the caches of the proxy hierarchy. Designing an organization's local network in this structure leads to a performance boost all over the enterprise, and that's what makes the caching and chaining features so cool!

Remote administration :

The second application is a Java applet, referred to as the applet. It enables the proxy administrator to manage the proxy remotely, from his own machine, via his web browser. To access the desired proxy server, the administrator types the proxy machine's IP address (or machine name) followed by the suffix '/admin' into the browser's URL address field (for instance, 'techst02/admin').
The browser considers this to be a valid client request and forwards it to the proxy (or chain of proxies). When the proxy sees this request, it compares the requested address at run time with the IP address (or machine name) of the host machine on which it runs. If they do not match, the proxy forwards the request as a normal request to the father proxy (or web server). If they do match, the proxy assumes that an administrator is trying to attach. It responds by sending back a Java applet that handles all the remote configuration and the necessary security issues (such as password login), and of course it does not forward the request.

Back on the administrator's machine, the browser receives the Java applet and starts it. The applet begins by requesting a password from the administrator and sending it to the proxy application. The applet and the proxy now talk full duplex. If the password is correct, the proxy sends an Ack to the applet, and the applet responds by presenting all the parameters and the status of the proxy. It lets the administrator alter parameters, and thus change the proxy's behavior (in terms of traffic management and cache activities), and it sends the new parameters to the proxy.

All operations on the administrator's machine take place in his browser, through the applet, which therefore offers full Graphical User Interface (GUI) support and is not restricted to HTML web forms of the kind seen on search engines, for example. It is the power of Java applet technology that gives the administrator the ability to control the proxy remotely, plus a friendly graphical environment in which to do so. If we or anyone else decide in the future to extend the set of configurable parameters or the administrator's user interface, the infrastructure is there to be used and enhanced. We're talking applets here, not just a dull HTML page form.

A note on security: the applet only presents the GUI to the administrator and handles communication with the proxy application.
When the administrator enters the login password, the applet sends it over the Internet to the proxy. The proxy checks the validity of the password and sends an Ack or Nack back to the applet, and based on that answer the applet either logs the administrator in or does not. Because the check is done by the proxy, attackers can learn nothing about the password from inspecting the applet's operations.

Accessing multiple proxies remotely :

We talked about a hierarchy of proxies and how the proxy and cache behavior support it. The remote access feature also supports this design. The administrator can access one specific proxy by specifying its IP address or machine name in the browser's URL field. The requested proxy is the only one that responds to the administrator by sending the applet. The applet and the requested proxy then talk full duplex over the Internet, through all the other chained proxies in the hierarchy. When the administrator alters the proxy parameters, only the requested proxy is affected; all the other proxies stay unchanged. This is important because it enables one administrator to fully control the behavior of each and every proxy in the structure. It also enables a group of administrators to control their own proxies (the US administrator controls the proxies in the US and, independently, the France administrator controls the proxies in France. Now, is that hot or what??).

Multithreading :

All client requests, plus the administrator's remote configuration, are handled in parallel by using threads. A given proxy can handle multiple client requests at the same time while also being remotely controlled by an administrator. For instance, in a given time frame the proxy can handle a request from client A, a request from client B, two requests from client C (this can happen because modern browsers use threads too...) and communication with a remote applet running on the administrator's machine. None of these users (clients and administrator) would notice the difference.
Using threads leads to special problems concerning shared resources and synchronization, which we discuss later (see 'Overview').

The development environment :

The development was done on the Windows 95 and Windows NT platforms. Developing in Java means that the code is compiler independent and platform independent, because the Java language encapsulates the details of the host operating system and presents a uniform, standard interface to the developer. Code written in Java should compile with any tool that compiles Java, on any platform. In addition, the Java byte code (the binary result of the compilation) should run on any machine that supports the Java Virtual Machine. That's a plus when developing Internet-related applications.

The main proxy application was developed using Microsoft Visual J++ 1.0. We found it convenient to write the code with this tool because it is a code-based development tool (as opposed to tools that give the developer a more 'visual' view), and the application is basically an engine that runs without a GUI and without human interaction (unless it is done remotely through a browser). The applet was developed with Symantec Visual Cafe. Much of its code is related to the GUI, and we found the Visual Cafe environment very friendly and powerful for that purpose (it is a visual tool).

Overview :

Its mechanism :

When the proxy application starts, one thread listens on the main socket at all times and dispatches other threads to handle each client request. This dispatcher thread is referred to as the web daemon, while the connection-handling threads are simply called proxies. A proxy thread is also in charge of catching the administrator's special request and sending the applet back to him. Each proxy thread is in charge of handling one request; handling means that it should reply to the client through a socket.
The proxy thread checks whether the requested object is cached by invoking a method (service) on the cache manager. If the object is cached, the cache manager returns it and the proxy sends it to the client. If the object is not cached, the proxy forwards the request to the father proxy or web server and waits for the reply (by listening on a socket). When the reply arrives, the proxy thread delivers it to the client and caches it. Again, caching is done by calling methods of the cache manager. When this work is done, the proxy thread terminates.

While a proxy thread is running, the web daemon keeps listening on the main socket for requests, and for each one it creates another proxy thread. That enables many requests to be handled simultaneously. The web daemon is created at startup and never dies. When it starts, it performs general initializations (such as creating the cache manager object and cleaning junk from the cache directory), and then enters an endless loop of listening on the main socket and creating a proxy thread for each request.

To help the proxy threads use the HTTP protocol, we use two different classes: HttpRequestHdr and HttpReplyHdr. Sending or receiving an HTTP message body is simple: we read or write the bits through the socket. But HTTP requires special header fields to be sent with a message. HttpRequestHdr does the work of creating these fields upon sending, and HttpReplyHdr does the work of receiving those fields upon arrival. For example, if the proxy fails to forward a request because the web server could not be reached, it generates an HTML web page with a proper message and constructs an HTTP message to send back to the client, informing him of the error and giving him the correct headers for this error (such as the return status code header).

Caching :

The cache manager is a static object. It hides all the details of caching from the other components in the application.
For example, the proxy thread is not aware of the file name in which the requested object is cached; it just passes a URL argument to the cache manager, which in turn generates a file name from that URL. Thus, if we want to change the mechanism of file name generation in the future, the only code we need to change is the code of the cache manager; the proxy thread's code is not affected. This kind of encapsulation is supported by object-oriented development environments, such as Java.

Objects are cached in the file system of the host machine (on which the proxy runs). When the cache manager is called to cache an object, it first generates a file name from the URL of the object. The stream of bits coming from the father proxy or web server is written to a file. In addition, the cache manager holds a hash table data structure. Each entry in the hash table has two fields: a key and a value. The key is the file name, and the value is a date structure. When a new object is cached, it is stored on disk, and then a new entry is created in the hash table, into which the cache manager inserts the file name and the date it was created (year, month, day, hour, minute, etc.). Later, when the cache manager is asked to check whether an object is cached, it first generates a file name from the URL of the object, then enumerates the hash table to find the file name. If the file name is found, the cache manager returns it to the proxy thread; otherwise, it returns a status indicating 'not cached'.

If a requested object is found in the cache (a cache hit), its entry in the hash table is updated with the current date. This enables the Least Recently Used (LRU) algorithm to be applied when the cache is full and a file needs to be deleted. So if the free space of the cache drops below a minimum level, the LRU algorithm enumerates the hash table and deletes the least recently used file.
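The hash table bookkeeping and the LRU choice described above can be sketched as follows. This is a minimal sketch under assumptions, not the project's cache manager: the class name `LruIndex` and its methods are hypothetical, timestamps are plain millisecond values instead of a date structure, file I/O is left out, and the file naming scheme (hex of the URL's hash code) is invented for illustration and can collide for different URLs.

```java
import java.util.Hashtable;
import java.util.Map;

/** Hypothetical sketch of the LRU bookkeeping; reading, writing and deleting files is left out. */
class LruIndex {
    // File name -> time of last use in milliseconds. Hashtable synchronizes each call internally.
    private final Hashtable<String, Long> lastUsed = new Hashtable<>();

    /** Assumed naming scheme (not the project's): hex of the URL's hash code. */
    static String fileNameFor(String url) {
        return Integer.toHexString(url.hashCode()) + ".cache";
    }

    /** Records a newly cached object, or refreshes the timestamp on a cache hit. */
    void touch(String url, long nowMillis) {
        lastUsed.put(fileNameFor(url), nowMillis);
    }

    boolean isCached(String url) {
        return lastUsed.containsKey(fileNameFor(url));
    }

    /** Enumerates the table, removes the least recently used entry, returns its file name (null if empty). */
    String evictLeastRecentlyUsed() {
        // Hold the table's own lock so no other thread modifies it during the enumeration.
        synchronized (lastUsed) {
            String oldestName = null;
            long oldestTime = Long.MAX_VALUE;
            for (Map.Entry<String, Long> e : lastUsed.entrySet()) {
                if (oldestName == null || e.getValue() < oldestTime) {
                    oldestTime = e.getValue();
                    oldestName = e.getKey();
                }
            }
            if (oldestName != null) {
                lastUsed.remove(oldestName); // the caller would also delete the file on disk
            }
            return oldestName;
        }
    }
}
```

Because a hit refreshes the timestamp, an object that was cached first but used recently survives, and the entry with the oldest timestamp is the one evicted.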
If we want to change this policy in the future (maybe to Least Frequently Used, for instance), we should change the value field of the hash table to some other class and change the code of the methods in charge of making more free space; all other components are unaware of this mechanism.

The cache manager methods are called from multiple proxy threads. This could raise problems of synchronization and protection of shared resources. We solve these problems by putting synchronization locks on all shared resources, as supported by Java's synchronized keyword. Methods involving the hash table are thread-safe because the hash table object internally synchronizes all actions performed on it (this is supported by the Java hash table methods). Files are protected against concurrent access by the Java File object and by our code. For example, there is a chance that one thread will read from a file while a second thread tries to delete it. Such a scenario could happen if the first thread is reading a cached object A while the second thread is caching object B, making the cache grow and causing the cache manager to free some space; the algorithm will detect that object A is the least recently used and try to delete it. In the worst case this would break the proxy; in the best case either the read or the delete operation would fail. We should not allow it, so we added code that checks whether write and delete operations are allowed before performing them.

Cacheable Vs. Non-Cacheable objects :

Not all replies should be cached. For example, when a client sends a request to a search engine, the search engine typically treats that request as a query and generates an HTML web page with the results of the query. This result page should not be cached on the proxy, because the next time an identical query is sent to the engine, different results are likely to be returned (because of the frequent changes in the engine's database).
These generated pages are called dynamic pages, while regular web pages that sit somewhere on a web server waiting to be retrieved are called static pages. The problem is to identify that a certain web page is a dynamic one. We do not know of a 100 percent solution to this problem, so we can only follow conventions. For example, dynamic pages can be generated by CGI scripts, and there is no way for the proxy to know that a page is the result of a CGI script unless it gets some help from the HTTP reply headers or from the URL of the page. By convention, CGI-generated pages are taken from a URL that contains the sub-string "cgi-bin", so we added code to check the URL of each request, and if it contains this sub-string we do not cache the reply. We also check for URLs containing special characters that indicate that the URL is a query. For example, the question mark "?" is a typical character used to submit queries. The proxy also gets help from the reply's return code. If the reply contains a return code other than OK, for example, we do not cache it.

Let us just note that this problem is not as serious for the browser's cache mechanism as it is for the proxy's cache manager. Modern browsers manage their own cache, and theoretically they could face the same problem of what should not be cached. But even if the browser does cache a non-cacheable object, the user can always instruct it to "refresh" or "reload" the object from the Internet; in that case, the browser re-sends the request and waits for the reply. Proxies, on the other hand, behave very differently, and the main reason is that the user should not be aware of them. So, if a proxy caches a non-cacheable object and the client asks to refresh (reload) on his machine, the browser re-sends the request to the proxy, and the proxy treats that request as a cache hit (because it has the requested object in its cache), does not forward the request, and replies with the cached bits.
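The URL and return code conventions listed in this section (the "cgi-bin" sub-string, the "?" character, and a return code other than OK) can be sketched as a single check. The class and method names are hypothetical, and the sketch assumes that "OK" means HTTP status 200 and that the question mark is the only special character tested.

```java
/** Hypothetical sketch of the cacheability conventions; names are illustrative only. */
class CachePolicy {
    /** Returns true when a reply may be cached under the conventions described in the text. */
    static boolean isCacheable(String url, int statusCode) {
        if (statusCode != 200) {
            return false;                 // assumption: only replies with status 200 (OK) are cached
        }
        if (url.contains("cgi-bin")) {
            return false;                 // conventional location of CGI scripts: dynamic page
        }
        if (url.indexOf('?') >= 0) {
            return false;                 // query string: the reply is likely generated per request
        }
        return true;
    }
}
```

These are conventions rather than guarantees: a dynamic page served from a URL without "cgi-bin" or "?" would still pass this check.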
So it is extremely important for proxies to try to identify which objects should not be cached.

The admin thread :

One of the initialization operations done at startup by the web daemon is constructing the admin thread. This thread handles communication with a remote administrator, and sets and gets parameters of other components (such as the cache manager). It first creates a socket (the admin socket) and listens on a special (admin) port. When the administrator accesses the proxy, the proxy thread catches the event and sends him back a web page with the admin applet. The applet starts by presenting a login dialog box and waiting for the administrator to enter a password. It then sends the password to the proxy application on the admin port, so that the admin thread in the main application can catch that request without interfering with the handling of client requests. The admin thread in the main application processes the password and sends an answer (Ack/Nack) back to the applet on the admin port. Again, having the applet and the main application "talking" full duplex does not affect the activities going on in the main application (handling client requests).

At this phase, if another administrator accesses the main application, the proxy thread will catch that and send an admin applet to him too. The applet will run on his machine and will try to talk to the main application. However, the admin thread in the main application talks only with the first administrator and ignores the second. As soon as the first administrator finishes, the admin thread will serve the second one. This protects the main application from multiple simultaneous administration sessions, preventing the scenario of two or more remote administrators trying to control the same proxy and thereby causing inconsistent behavior of the proxy or unfriendly behavior in the applets on their machines (see above for a full description of the potential problem).
The second administrator does get the applet (otherwise he could think the proxy is unreachable), but sending admin requests to the proxy is blocked until the first administrator finishes. We thought this was the best design approach to the problem, and implemented the solution that way.

After the password check, the administrator gets a full GUI window with the status of the proxy and the parameters that can be altered. This is not just an HTML-based form, but a powerful Java applet with all the GUI that we wanted (or that any developer will want in the future), including dialogs, check boxes, etc., and with those user interface controls displayed in their own windows, outside the browser area.

The config class :

When the administrator changes parameters and chooses to send them to the main application, the applet sends the bits to the proxy, and the proxy receives them and alters its behavior. To help both the proxy and the applet with the job of getting and setting parameters, we designed a special class called config. This class appears both in the applet code and in the proxy code, and handles all the get/set methods involved with the configurable parameters. The config class calls methods on other objects to retrieve the status of their parameters (these are get methods), and calls a different set of methods on those objects to change their parameters (these are set methods). It also does the job of packaging these parameters and sending them over sockets to a remote machine. So, the applet uses the get ability of the config class to learn about the status of the main application and present it to the administrator, providing him with the interface required to change this status, while the main application uses the set ability of the config class to change its parameters. Both the applet and the main application use the config class's ability to send the parameters to each other.
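The packaging role of the config class can be sketched as follows. Everything specific here is an assumption made for illustration: the class name `ConfigSketch`, the three parameters, and the wire format (Java's `DataOutputStream` encoding) are not taken from the project. The point is only that one class, shared by the applet and the proxy, both writes and reads the parameters, so the two sides cannot disagree about the format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch of a shared config class; the parameter set and wire format are assumed. */
class ConfigSketch {
    boolean cachingEnabled;
    String fatherProxyHost = "";  // assumption: empty string means "forward directly to the web server"
    int fatherProxyPort;

    /** Packages the parameters and writes them to a stream, for example a socket's output stream. */
    void writeTo(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeBoolean(cachingEnabled);
        data.writeUTF(fatherProxyHost);
        data.writeInt(fatherProxyPort);
        data.flush();
    }

    /** Reads the parameters in the same order in which writeTo wrote them. */
    static ConfigSketch readFrom(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        ConfigSketch c = new ConfigSketch();
        c.cachingEnabled = data.readBoolean();
        c.fatherProxyHost = data.readUTF();
        c.fatherProxyPort = data.readInt();
        return c;
    }
}
```

In this sketch the applet would call `writeTo` on the stream of its socket to the admin port, and the proxy would call `readFrom` on the other end and then apply the values through its set methods.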
We thought this was a good design, for a couple of reasons. First, we write the code in charge of managing the configurable parameters only once, and let both the proxy and the applet use it. Second, it supports encapsulation and the object-oriented philosophy: all the details are in the config class and not spread all over the code, so future changes can be made more easily. For example, if we decide to encrypt the parameters before sending them and decrypt them on receipt, making the communication between the applet and the proxy more secure, we need to add code for that only in the config class; all other objects would be unaware of the change.

******* THE END *******