Comparative Code Understanding

Dr. Andrew Broad
Computer Science
Comparative Code Understanding


Comparative code understanding is the enterprise of comparing two pieces of code at a semantic level rather than a purely structural level. This entails combining comparison with the extraction of higher-level knowledge about the code, using techniques from the field of code understanding.


Introduction

Imagine two pieces of code that perform the same task using different algorithms, e.g. two different sorting-algorithms. Their semantic equivalence could be assessed by:

  1. applying code-understanding techniques to recognise that both codes perform sorting - that is, to extract from each code a "sort" clich� (a data-structure encoding the fact that a specified code-fragment performs sorting, what data are being sorted, and the nature of the sort - i.e. the order-relation);
  2. comparing the clich�s to assess the extent to which they are semantically equivalent, or what the semantic differences are.

Such combination of extraction and comparison is the essence of comparative code understanding.

Comparative code understanding could be applied in the following areas:


Comparative code understanding of information models: an illustrative example

My PhD project entailed the comparative code understanding of EXPRESS information models (similar to database-schemata, but conceptual rather than implementation-oriented).

Specifically, the task was to compare two EXPRESS models and assess the semantic equivalence of the constraints in those models. This was achieved by extracting and comparing higher-level constraints (HLCs), triggered by syntactic differences in the models. For example:

SCHEMA model1;                     SCHEMA model2;
  ENTITY person;                     ENTITY person;
    name : STRING;                     name : STRING;
    age : INTEGER;                     age : INTEGER;
  WHERE                              WHERE
    (age >= 0) AND (age <= 130);        {0 <= age < 130};
  END_ENTITY;                        END_ENTITY;
END_SCHEMA;                        END_SCHEMA;

The first issue is inferring correspondences between parts of the codes. The above example is easy because the names are the same. In general, name-equivalence has to be armed with a thesaurus in order to cope with synonyms and homonyms.

The WHERE-rules are syntactically different, so they are submitted to a code-understanding system, which extracts a "range" HLC from each rule, encoding the abstract knowledge that age must be between 0 and 130.

The two "range" HLCs are then compared, and inferred to correspond by semantic equivalence - and hence the WHERE-rules that they cover are also inferred to correspond. In fact {0 <= age < 130} is slightly stronger (more constrained) than (age >= 0) AND (age <= 130) because the latter admits age = 130 while the former does not.

This assessment of semantic equivalence and semantic strength is another major issue in comparative code understanding, intertwined with the problem of inferring correspondences.


Selective extraction and comparison

The other major issue is that comparative code understanding enables the selective extraction of higher-level knowledge from code, whereas most code-understanding systems extract all the knowledge that they know how to extract. Comparative code understanding should only extract the knowledge needed to assess semantic equivalence, avoiding the extraction of extraneous knowledge.

Selective extraction can be achieved, in part, by being triggered by syntactic differences: localised bouts of extraction from code-fragments for which there is a difference. This assumes, of course, that syntactically identical code-fragments are semantically equivalent (an assumption which could be fooled by homonyms or computational mapping).

Comparison of HLCs should also be selective, because comparing every HLC extracted from one code with every HLC extracted from the other code would result in gigantic quadratic explosion! Firstly, only HLCs of the same type should be compared. Secondly, the HLC-comparisons can be pruned using heuristics such as monogamy (the assumption that HLC-correspondences can only be one-to-one) and locality.

The challenge of selective comparison is to reduce the comparison-effort vastly without losing useful correspondences.


Comparative code generation

Comparative code generation is the enterprise of generating code to bridge the gap between two codes which have semantic differences.

One potential application of this is schema-to-schema mapping. Consider again the EXPRESS models above.

SCHEMA model1;                     SCHEMA model2;
  ENTITY person;                     ENTITY person;
    name : STRING;                     name : STRING;
    age : INTEGER;                     age : INTEGER;
  WHERE                              WHERE
    (age >= 0) AND (age <= 130);        {0 <= age < 130};
  END_ENTITY;                        END_ENTITY;
END_SCHEMA;                        END_SCHEMA;
Imagine a program that maps instances of person in model1 to person in model2. What happens if age = 130, which is allowed in model1 but not model2?

Answer: The mapping-program had better detect when a constraint in the target-model is stronger than the corresponding constraint in the source-model. It could perform the mapping and then check the constraints on the target, but it would be more efficient to check whether the source-instances satisfy the target-constraints before attempting to map them.

Effectively, this entails transferring constraints from the target-model to the source, when either there is no corresponding constraint in the source-model, or the constraint is semantically weaker (less constrained) in the source-model than in the target-model - as is the case when mapping from model1 to model2. Effectively, the source-model has to be tweaked so that age < 130 - so that the mapping-program can check that age < 130 for each source-instance before attempting to map it to model2.


Email me
Hosted by www.Geocities.ws

1