Research Plan


0 Abstract

The aim of my research is to invent a generic methodology for automatic program generation from very high level, declarative specifications (Section 1), and to write software systems that implement some of that methodology for particular application domains (Section 3). My thesis is that Case-Based Reasoning (CBR) will usefully extend the capabilities of automatic programming systems, so I am looking to take a case-based approach. As a by-product, I hope to contribute to the development of CBR (Section 2). The objectives of my research, my progress so far and plans for the future will be discussed in Section 4.

1 Automatic Programming

My research is concerned with automatic programming - the generation of executable programs, in ordinary programming languages, from higher-level descriptions of programs (Novak [13]).

1.1 High Level Descriptions of Programs

I seek descriptions that are very high level compared to programs in conventional languages, not too concerned with details such as algorithms and data structures. Such descriptions would be extremely declarative, with emphasis on what the problem is rather than how it is to be solved. Ideally, a computer could generate an efficient program from a statement of the problem to be solved, but this is very ambitious and difficult to achieve. More realistically, the algorithms and data structures might be specified at an abstract level, like a design that has not been implemented yet.

For example, a description of quicksort (Section 7.7 of Weiss [19]) at this level might describe the input as an unordered bag, without committing to representing it as a linked list or an array (data abstraction). It might state that an element is to be chosen from the bag to be used as the pivot, but without committing to implementing pivot selection as choosing the first element in the list or as median-of-three partitioning (algorithmic abstraction).

If programmers could work at this sort of level of abstraction rather than programming in a conventional high level language, an increase in programmer productivity would occur comparable to that which occurred when high level languages were introduced and programmers no longer had to program in machine code (Waters [18]).

1.2 Forward, Reverse, and Re-Engineering

The generation of program code from a high level description is forward engineering. The opposite is reverse engineering, the extraction from program code (and any other available information) of a high level description.

The advantages of the high level description are that it is easier for users to specify programs at this level, and also easier to for a computer to manipulate programs - as software evolves due to changes in requirements, an automatic programming system can take existing code, reverse-engineer a high level description of the program, transform that, and then forward-engineer the new program code, a process known as reengineering. We do not transform the program code directly, because we need to understand it at a higher level in order to do so. Reverse engineering is useful because it promotes software reuse, and reengineering can be used to transform legacy systems written in archaic programming languages to more modern ones.

Forward engineering, reverse engineering and reengineering are defined in Chikofsky & Cross [8].

2 Case-Based Reasoning

Case-Based Reasoning (CBR) is the enterprise of solving new problems by analogy with old ones, and was the subject of my first CS700 presentation. A Case-Based Reasoner remembers a previous similar problem and adapts its solution to solve the new problem. Given a description of a new problem, it retrieves a case (an old problem together with its solution) from its case library, adapts the old solution to better fit the new problem, and then, having solved the new problem, stores it together with its solution in the case library so that it can be used to solve similar problems in the future.

It is beyond the scope of this paper to describe CBR in detail - Broad [4], Kolodner [11] or the first two chapters of Riesbeck & Schank [15] are recommended for a good overview.

2.1 CBR applied to Automatic Programming

CBR is my secret weapon for automatic programming, in that it's a cross-fertilisation very few researchers have tried before, although Smyth & Cunningham [16] have attempted a very simple, yet instructive, form of case-based software design, and some researchers have taken approaches to automatic programming that clearly have some affinity with CBR, though they do not point this out themselves, for example the use of clichés in the Programmer's Apprentice (Waters [18]).

A case-based approach to automatic programming means writing programs by being reminded of old ones and modifying them (forward engineering). Here, a `problem' is high level description of a program and the `solution' is the program code. Cases might also be useful for reverse engineering if it is feasible to match the code to be reverse-engineered against code stored in cases and adapt the corresponding high level descriptions to fit the code being analysed.

2.2 Advantages and Disadvantages of CBR

The advantages and disadvantages of CBR are discussed in Section 1.5.2 of Kolodner [11] - here, that discussion is adapted to the context of automatic programming.

The advantages of CBR are that it should be quicker and easier to solve problems (write programs) in a case-based way rather than starting from first principles, cases can suggest solutions even in poorly-understood domains (there is no hard-and-fast way of writing software to solve a given problem except in very restricted domains, such as Lex and Yacc (a scanner generator and a parser generator respectively, see Sections 3.5 and 4.9 of Aho, Sethi & Ullman [1])), and cases can also warn of potential mistakes (e.g. bugs, like forgetting to increment the iterator in a loop, or not checking whether a pointer is null before attempting to follow it), allowing the reasoner to avoid them.

In sum, the major advantage of CBR is that it can learn from experience and improve both its efficiency (saving time by not having to construct solutions from scratch) and its competence (avoiding previous mistakes).

The fact that cases can suggest solutions in poorly-understood domains is the key to why CBR should be so beneficial for automatic programming. The current state-of-the-art in automatic programming is very limited compared to its full potential, and has only had success in very restricted domains. CBR has the potential to extend the capability of automatic programming systems beyond what they can do without CBR, though precisely what the added capability is going to be is a difficult research question which requires much analysis of the automatic programming literature. It is a question I am hoping to find some answers to for my MPhil!

CBR could also be useful for prototyping, as research has shown that CBR systems are also quicker to develop and easier to maintain than rule-based or model-based systems (Section 3.4 of Kolodner [11]). Effectively, an experienced CBR system has an ossified set of heuristics (over time, cases tend to become rule-like in nature (Chapter 1 of Riesbeck & Schank [15])), which could then be reimplemented as an efficient, hard-wired algorithm (this would be a good opportunity for automatic programming, by the way), although the hard-wired algorithm could not learn any more by experience.

The obvious disadvantage of CBR is that large case libraries will inevitably take up a lot of space, which is the computer's price to pay for having a lot of experience! Also, if retrieval is not efficient, it could outweigh the time benefit of not having to solve problems from first principles; CBR should have a greater pay-off for complex problems with intractable search spaces, rather than easy problems that can be solved efficiently without CBR. Finally, if an inappropriate case is used to propose a solution, it is not helpful for solving the problem - this could happen as a result of poor retrieval which fails to find the best (most similar) case, or just not having a good enough case in the case library, in which case the problem has to be solved by non-CBR means and then inserted into the case library so that it's there when a similar problem is to be solved in the future.

3 Transforming Constraints in Information Models

So, I am looking to apply CBR to automatic programming, to give such systems capabilities over and above what can be achieved using non-CBR techniques, but I need a concrete exemplar to apply it to - a fairly restricted domain where I make the progress that will hopefully give me an MPhil by the end of September 1998. The chosen problem is that of generating programs to map data from one model to another, respecting the constraints on those models, which, as we shall see, requires those constraints to be transformed.

3.1 What is Information Modelling?

An information model (Kahn & Williams [10]) is a representation of a domain. It defines the entities (objects) that exist in the domain, the attributes each of those entities have, the relationships that hold between those entities and attributes (such as containment, reference and inheritance) and the constraints on those entities and attributes, which constrain what instances of the entities are valid in a database that conforms to the model.

This sounds like a description of the records and fields in a database, but information models are more conceptual and less concerned with implementation details such as the exact format of the database. An information model can be implemented as a database or in a programming language, but it is not one in itself.

A schema is a collection of related entities. It may also contain other things such as data types and global constraints (those not tied to a particular entity). For example, if I were to write an information model of this university, I might separate out my world view by having a schema for courses, one for people and another for books, each being fairly self-contained and reusable, but having some links to other schemata. The Courses schema might have entities such as departments, courses and lectures. The People schema might include students and lecturers. The Books schema could include books and other reading material.

EXPRESS (Kahn & Williams [10]) is an information modelling language. Here's the Courses schema written in EXPRESS, albeit incomplete and rather ugly:

SCHEMA Courses1;
	ENTITY Course;
		code : STRING;
		department : STRING;
		number_of_students : INTEGER;
	UNIQUE course_code_must_be_unique:
This schema only contains one entity, Course. It says that a course has a code, a title (which is optional), is associated with a department, and has a certain number of students taking it. The UNIQUE clause says that no two instances of Course may have the same code - it's an example of a constraint. The other constraints are the non-optionality of attributes other than title. Of course, if I wanted to do this properly, I would not model the course code as a simple string - course codes have a well-defined, meaningful format, so this attribute needs to be constrained and should perhaps be broken into several attributes for the encoded information (department, year, programme, number for uniqueness and semester). And the department should be an entity in its own right rather than an unconstrained string. But never mind, this gives me the opportunity to write another schema later so that I can discuss mapping between the two, which is the object of the exercise.

3.2 Schema-to-Schema Mapping

The scenario here is that we have two data repositories - databases if you will - which conform to different models (i.e. the data is in a different format), and we wish to map data from one to the other (Brown, Filer & Moosa [6]). We want a system to generate a program to transform records from one repository to the other, provided that records from the source repository are not going to violate the constraints on the target repository.

This problem has received a great detail of attention in the CAD Group, due to their desire to achieve interoperability between CAD tools, which requires sharing data. It is also a real problem outside academic circles, because databases all over the world often store the same information in different formats, and we would like to be able to generate mappings from one to another.

The source model may differ structurally from the target model. The two models may also differ semantically, with one having different constraints on it than the other. This semantic heterogeneity is a real problem, because it means that some of the records in the source repository should not legitimately be mapped to the target repository because they would violate the constraints on the target repository. The response to this problem so far has been to ignore it!

As an example, consider the following alternative Courses schema:

SCHEMA Courses2;
	ENTITY Department;
		name : STRING;
		courses : SET [1 : ?] OF Course;

	ENTITY Course;
		code : STRING;
		title : STRING;
		students : INTEGER;
		the_department : Department FOR courses;
		course_code_must_be_unique: code;
		sensible_intake: students >= 10 AND students <= 250;
The differences between this schema and Courses1 are that Department is now represented as an entity in its own right, each instance of which has a name and a set of courses, of which there must be at least one. Entity Course now has an INVERSE clause, which says that each instance of Course can only exist as a member of the set of courses for a particular department - the_department - there is no course in this world that is not part of some department. This difference is really only structural - there aren't going to be any records from Courses1 that cannot be mapped to Courses2 on this count, it's just that an instance of Department would be created for every distinct value of department in Courses1.

Another change is that, in Courses2, the attribute title has ceased to be optional, so any records from Courses1 that do not have a title cannot be mapped to Courses2, because they violate the optionality constraint.

The attribute number_of_students has been renamed students - a trivial change, but I have also added a constraint on entity Course which says that the number of students must be between ten (otherwise the course would not be worth running) and 250 (the number of seats in the biggest lecture theatre) inclusive - this is a domain rule, which is expressed as a WHERE clause. Again, this causes a problem with the mapping, because any records from Courses1 whose number_of_students is outside this range cannot be mapped to Courses2.

So far, in the CAD Group, methodologies and tools have been developed which can map from one model to another only structurally (see Brown, Filer & Moosa [6]). The current state of the art is a system that can generate a program to map from one model to another, given the two models written in EXPRESS, plus a specification of the structural mapping (the correspondences between entities and attributes in the two models are human decisions).

What's missing is code in the generated mapping program to check that records from the source repository are not going to violate constraints on the target repository when they are mapped. This has proven to be a very difficult problem, and I have the chance to make a significant contribution by solving it with the help of my magic weapon, CBR. The program generator needs to transform constraints on the target repository so as to plant code which can check, for a given source repository record, whether it violates those constraints before it attempts to map it to the target repository, and if so, raise an exception, pinpointing in what way the constraints violated.

The transformation of constraints on the target repository to apply to records in the source repository can be done by reengineering. As discussed in Section 1.2, this involves reverse-engineering each target repository constraint into a high level description, which is then transformed and forward-engineered into code to check that constraint.

4 Research Plan

4.1 Research Questions

The main goal of my research this year is to develop a generic methodology for automatic programming, working in the context of a concrete exemplar - understanding and transforming constraints in information models. The methodology also needs to be extensible, so that I can build on it as I work towards my PhD, which will presumably involve more ambitious exemplars of automatic programming in wider generality.

There are several issues that need to be addressed in the course of this research. They all merge into two categories. Firstly, how should higher-level descriptions of programs (cases) be represented and what knowledge should be included in these representations? Secondly, what is a suitable architecture for the automatic programming systems I seek to build? These two categories are inextricably intertwined - research on one will help drive the other forward.

The knowledge needed by an automatic programming system can be classified along two orthogonal dimensions: the 'what' and the 'how', and the specific and the general. This gives four possible classifications: knowledge of the problem (specific 'what': a specification of what the program is to achieve with no regard for algorithms), knowledge of the domain (general 'what': background facts - knowledge of this type is widely considered to be important for automatic programming, e.g. Barstow [2]), knowledge of the program (specific 'how': e.g. graphs of control/data flow, code topology) and knowledge of programming itself (general 'how': knowledge of languages and programming techniques).

Note that these boundaries are not clear-cut, especially with a case-based approach - knowledge of specific programs migrates into knowledge of programming in general, because cases teach lessons that exemplify general principles; knowledge of the domain can be brought to bear in order to better understand the problem; the distinction between 'what' and 'how' is often blurred in reality, and so on.

A related issue is what knowledge can be captured automatically from existing programs (reverse engineering). A search of the Reverse Engineering literature has suggested several answers, including call graphs, data flow and control flow graphs (Cimitile, De Lucia & Munro [9]), variable-reference graphs (Canfora, Cimitile & Munro [7]), resource-flow graphs and module dependency graphs (Müller, Orgun et al [12]), state-transition diagrams and inheritance hierarchies (Booch [3]) and various software engineering metrics such as cohesion, coupling and reusability (Cimitile, De Lucia & Munro [9] and Müller, Orgun et al [12]). Clichés in programs can be recognised to reconstruct a program's design (Rich & Wills [14]) as well as to forward-engineer programs (Waters [18]).

Consideration of the architecture of an automatic programming system gives rise to several subissues. What components should be included in such an architecture? How much of the system's behaviour can be automated and what aspects cannot be fully automated? It is unrealistic to assume we can achieve fully automatic programming in its full generality in the short term, yet we should strive to improve the extent of automation in the long term, to reduce the need for human intervention.

Another set of issues arises from the question: which parts of the architecture should be case-based and which not? It may be that CBR is not appropriate for everything, and other technologies will have to be used as well.

For those parts that are case-based, a number of CBR issues arise (these are all discussed at length in Kolodner [11]):

  • How should cases be represented? This question actually falls into the category of knowledge needed for automatic programming, discussed earlier.
  • How should cases in memory be indexed so that they will be retrieved under appropriate circumstances in the future?
  • How can retrieval be made efficient without compromising its effectiveness and flexibility? How should the case library be organised to support such a retrieval process?
  • Is situation assessment needed prior to retrieval? (The description of a new problem might not be a useful memory probe if it is not conformant with the case library's indexing structure, in which case the description of the new problem needs to be elaborated, calculating indexes into the case library to guide retrieval.)
  • How should similarity be measured?
  • Is analogical (interdomain) reasoning needed?
  • Is adaptation needed, and if so, what kinds?
  • Is evaluation of solutions needed, with consequent repair of failing solutions?
  • What is a sensible policy for updating memory when a new problem has been solved? (Storing cases profligately wastes space - a case should only be retained if it teaches a lesson fundamental to the reasoner's goals that could not easily be inferred without that case in the library.)

    Another issue is whether a multi-stage CBR is needed, as discussed in Smyth & Cunningham [16]. This can arise due to the multi-level nature of programming, whereby you can put a chunk of code into an abstraction (such as a function or a procedure), which can then be called. This means that there is not just one specification and implementation for the whole program, but nested specifications and their implementations for the various abstractions in the program. A multi-stage CBR process for automatic programming would retrieve a case for a program as a whole and adapt it, and the CBR process recurses for each abstraction it comes across.

    4.2 Past Progress, Current Position, and Future Plan

    So far, I have been thinking in general about CBR and automatic programming, reading around the subjects to gain a broad overview of what the issues are and try to get a feel of what current approaches can and can't do, so that I will be in a position to analyse what CBR can add to automatic programming (this is not easy to judge from merely reading about other people's work!). I have been looking at the Reverse Engineering literature to see what kinds of knowledge are being used, and compiling a list of knowledge, organised in the two-by-two ontology discussed in Section 4.1.

    This literature search must continue, but I am also moving into more practical work - it's just this month (January 1998) that I'm starting to tackle the constraints problem presented in Section 3. I'm at the stage now where I am trying to better understand the problem (by thinking about it and trying hand-worked examples), develop the architecture (I have already worked it out at the top level, I need to expand the components in detail) and develop a representation of the knowledge of information models, programs and in particular constraints, that the system will use (that is going to be the major difficulty). These three activities are evolving in parallel with each other, since effort in one will feed progress in the others.

    My plan for the next few months is to move forward in these directions and concentrate primarily on the constraints problem. To begin with, I might consider easy constraints such as INVERSE, OPTIONAL and UNIQUE, then move into harder ones such as WHERE rules, which can involve expressions of arbitrary complexity, including calls to defined functions and procedures. I also hope to get a paper or two published this year (in LOPSTR'98: the eighth international workshop on LOgic-based Program Synthesis and TRansformation, and EWCBR-98: the fourth European Workshop on Case-Based Reasoning).

    To test some CBR ideas, and to practice programming in Perl (Wall, Christiansen & Schwartz [17]), a language I might find useful for implementing or at least prototyping my systems in, I have written a program called REF, which keeps track of bibliographical information, retrieves references in response to queries and automatically generates a bibliography for them, such as the one for this paper. It is awfully useful for keeping track of the bibliographies for my theses and publications, and I have released it on the Internet (Broad [5]). It does not directly contribute anything to my thesis work, though.

    5 Bibliography

    [1] Aho A.V., Sethi R. and Ullman J.D. (1986). Compilers: Principles, Techniques, and Tools. Addison-Wesley Publishing Company, Reading, MA. ISBN 0-201-10194-7.

    [2] Barstow D.R. (1985). Domain-Specific Automatic Programming. Pages 1321-1336 in IEEE Transactions on Software Engineering 11(11), November 1985.

    [3] Booch G. (1994). Object-Oriented Analysis and Design with Applications, Second Edition. Benjamin Cummings Publishing Company, Inc., Redwood City, CA. ISBN 0-8053-5340-2.

    [4] Broad A. (1997). Case-Based Reasoning. CS3411 essay, Department of Computer Science, University of Manchester. On the Internet, (last modified 26th January 1998).

    [5] Broad A., Department of Computer Science, University of Manchester. Andrew Broad's Website: Computer Science: Perl: REF. On the Internet, (last modified 26th January 1998).

    [6] Brown M., Filer N. and Moosa Z. (1995). KRU - Generic Operators for Schema-to-Schema Mapping. Department of Computer Science, University of Manchester, technical report JCF/MAN/108-01.

    [7] Canfora G., Cimitile A. and Munro M. (1996). An Improved Algorithm for Identifying Objects in Code. Pages 25-48 in Software - Practice & Experience 26(1), January 1996.

    [8] Chikofsky E.J. and Cross J.H. II (1990). Reverse Engineering and Design Recovery: A Taxonomy. Pages 13-17 in IEEE Software 7(1), January 1990.

    [9] Cimitile A., De Lucia A. and Munro M. (1995). An Overview of Structural and Specification Driven Candidature Criteria for Reuse Reengineering Processes. Department of Computer Science, University of Durham, technical report 7/95. On the Internet, (last modified 30th October 1996).

    [10] Kahn H.J. and Williams A.R. (1997). An Introduction to EXPRESS: Producing an Information Model. Handout for CS3241, Department of Computer Science, University of Manchester.

    [11] Kolodner J.L. (1993). Case-Based Reasoning. Morgan Kaufmann Publishers, Inc., San Mateo, CA. ISBN 1-55860-237-2.

    [12] Müller H.A., Orgun M.A., Tilley S.R. and Uhl J.S. (1993). A Reverse Engineering Approach To Subsystem Structure Identification. Pages 181-204 in Software Maintenance: Research and Practice 5(4), December 1993.

    [13] Novak G.S. Jr., Computer Sciences Department, University of Texas at Austin. CS 394P: Automatic Programming. On the Internet, (last modified 24th October 1997).

    [14] Rich C. and Wills L.M. (1990). Recognizing a program's design: a graph-parsing approach. Pages 82-89 in IEEE Software 7(1), January 1990.

    [15] Riesbeck C.K. and Schank R.C., editors. (1989). Inside Case-Based Reasoning. Lawrence Erlbaum Associates, Hillsdale, NJ. ISBN 0-89859-767-6.

    [16] Smyth B. and Cunningham P. (1992). Déjà Vu: A Hierarchical Case-Based Reasoning System for Software Design. Pages 587-589 in Neumann B., editor. Proceedings of the Tenth European Conference on Artificial Intelligence (ECAI 92), Vienna, Austria, 3rd-7th August 1992. John Wiley & Sons Ltd, Chichester. ISBN 0-471-93608-1.

    [17] Wall L., Christiansen T. and Schwartz R.L. (1996). Programming Perl. O'Reilly & Associates, Sebastopol, CA. ISBN 1-56592-149-6.

    [18] Waters R.C. (1985). The Programmer's Apprentice: A Session with KBEmacs. Pages 1296-1320 in IEEE Transactions on Software Engineering 11(11), November 1985.

    [19] Weiss M.A. (1993). Data Structures and Algorithm Analysis in C. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA. ISBN 0-8053-5440-9.

    Back to CS700
    Hosted by