Advanced Topics in Software Engineering
Project: An incremential unsupervised pattern generator for English sentences
Links
10/14/03 - Automatic acquisition of Terminological Resources for Information Extraction Applications
Author: Byron Georgantopoulos
  The paper presents a method of automating the process of locating domain-specific terminology from machine-readable text. The method aims to extract multi-word terms from the domain that will be validates by domain experts. The method is stochastic and supervised, and seems to work very well in the author's tests.

The paper mentions the two methodological approaches to extracting domain-specific information. The first is using a term grammar, which is a CFG that extracts recognized phrases from annotated text. The second is the usage of statistical and heuristic formulas to extract likely candidates. The statistical approach assumes that words that are domain-specific tend to appear more frequently. Between these methods are many approaches which seem to combine them.

The approach specified by the paper includes several phases. The first is done by morphosyntactic annotation using a part of speech tagger and disambiguator. The second part is based on a pattern grammer, and the third is performed by a morphological lexicon. The algorithm processes the text with the three parts sequentially, extracting useful words, and submitting them for validation by a evaluation metric, and possibly domain experts.
10/14/03 - Maude versus Haskell: an Experimental Comparison in Security Protocol Analysis
Author: David Basin
  This paper describes the Maude programming language as it is used to model systems. Maude is compared to Haskell. Maude has several strengths which include equational logic and rewrite, but Haskell also has features similar to equational logic, and provides higher-order polymorphic functions and lazy data types.

A security protocol is a protocol that specifies how agents should behave. The protocols are generally described using an informal notation describing the actions that can be taken to achieve some goal. A standard method of modelling a security protocol is to abstract away the implementation details of the steps and instead model the global state of the system, and how the global state changes with respect to time. Then, the protocol is modeled as a tree of global states, with the root modelling starting configurations, and leaves modelling solutions. These trees are generally infinite. Maude can model these sort of systems efficiently through rewrite logic, but Haskell's lazy data types are also useful in being able to represent infinite trees.

One goal for a model is that it should be usable with a checking system, so that the model can be explored to find potential attacks.

The paper then shows implementations in Maude and Haskell of a model checker, with heuristics to find attacks. The models and languages are compared with respect to three criteria - separation of concerns, reusability, and readability.

In terms of lines of code, the Maude model seems to be much larger than the Haskell model. The paper concludes that the languages have their own strengths, with Maude excelling in generality, and Haskell excelling in speed, space, and the usefulness of aforementioned features.

10/13/03 - Stochasic Recurrent Networks Training by the Local Backward-Forward Algorithm
Author: Athanasios Kehagias
  Stochastic Recurrent Networks are networks of finite state machines, that, at each time step, enter a new state based on a probabilistic function with parameters of the states of the neighboring machines. The paper introduces an algorithm that improves the ability of these machines to learn; the algorithm is an improvement of a technique used in hidden markov models.

This paper isn't particularly related to the topic; but it was interesting. The author gives a good overview of a lot of concepts; a SRN can be represented as a hidden markov model, and a HMM can be represented as a SRN.

10/13/03 - Mapping ontologies into Cyc
Author: Stephen Reed
  This paper discusses some of the difficulties and successes of Cycorp in mapping various ontologies into their commercial common-sense knowledge base, Cyc.

Cyc is a large ontological knowledge base, with over 100k terms and 1 million hand-entered assertions in predicate calculus. The Cyc system also has generic theorem provers and a large suite of heuristic level modules. The knowledge base is divided into locally-consistent contexts called micro-theories.

FIPS 10-4 is a set of standards that describe document processing codes. This mapping was successful.

MeSH is Medical Subject Headings. It was used to create a large number of new terms and relationships in Cyc.

SENSUS is an ontology derived from WordNet and the Pennman Upper Model that was partially mapped into Cyc.

Difficulty in mapping Open Directory led to an abadonment of this effort, due to the chaotic nature of the information in the ever-growing open directory.

WordNet was mapped into Cyc, and like Open Directory, WordNet changed frequently. However, a utility program was created that preserves backwords compatibility.

Other schemas such as XML, UML, and DAML were also mapped.

Many other things were mapped. Conclusion: there is a lot of stuff available that can be mapped into an ontology to make a useful knowledge base.

10/11/03 - Generality in artificial intelligence
Author: John McCarthy
  This is a revised version of a Turing award lecture.

Behavior can be represented by a program through a genetic approach. The program is modified in small ways with better programs kept for further modification, and worse programs discarded.

The General Problem Solver system was a system that tried to represent problems of an unknown class by applying sufficient transformation rules to transform the problem into a known class. One of the problems is that problems do not take on a form so general as to be transformed easily, and because the rules necessary can not easily be represented as problem transformation rules.

A third technique called production systems is mentioned. Production systems represent knowledge in the form of facts and rules. The facts usually correspond to logical statements, whereas rules tend to have a pattern and an action. Production systems rarely have domain-specific knowledge. Production systems do not infer general propositions, because they work in terms of variables, not specifics, and thus can not generalize.

Other statements about the usefulness of declarative logic and data are made, as well as more examples, and commonsense databases.

Nonmonotonicity and reification are mentioned, as well is formalizing the notion of context.

10/10/03 - Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging
Author: Eric Brill
  This paper describes an unsupervised learning algorithm for automatically training a rule-based part of speech tagger without using a manually tagged corpus.

The paper starts by describing the usefulness of stochastic taggers. He mentions the Baum-Welch algorithm, but doesn't go into details. Essentially, the algorithm seems to work by iteratively modifying probabilities in rules until a local maximum for tagging is found for the corpus.

The author continues to mention that the best taggers use stochastic methods to analyze hand-tagged texts. The algorithm presented is a fast algorithm that accurately generates a set of rules through a stochastic, unsupervised method on an untagged corpus.

The system is based on transformation-based error-driven learning. First, un-annotated text is annotated through some method. Then, the text is tagged by the tagger, and the tagger is automatically modified to make it conform to the tags in the annotated text. The method is greedy and iterative; at each iteration, a transformation is chosen in such a way that the tagger has the highest score. Once transformations are learned, text is annotated by passing the text through an initial annotator, then through each of the learned transformations in order.

The author then works through examples of his algorithm. Finally, he discusses results. The author claims to see a substantial increase in accuracy, raising from an initial state tagging accuracy ot 89.9% to a learned accuracy of 95.6%. With a larger training set, accuracy increases to 96.0^.
10/9/03 - Acquiring disambiguation rules from text
Author: Donald Hindle
  This paper presents a procedure for automatically acquiring a set of disambiguation rules for a parser by using tagged text. Performance of the rules is better than existing hand-made rules. The success of the rules depends on using the linguistic information encoded in the parser, improving the parser improves the rule set.

One of the most difficult aspects of NLP is constructing a grammar that can process a language comprehensively. The paper attempts to provide a step towards a solution by describing a training procedure for automatically acquiring symbolic rules for a deterministic parser. The author describes experiments in acquiring rules for determining part of speech. The acquired rules are approximately comparable to probabilistic approaches.

More than a third of the words in written English are categorically ambiguous. Syntactic analysis depends partly on correctly disambiguating the part of speech of each word in a sentence. Disambiguation seems to be a very difficult problem, but it is a problem that needs to be solved using simple algorithms. However, it seems that part of speech disambiguation can occur using syntactic patterns. The most successful projects in automatically determining these patterns use a probabilistic approach. Systems based on categorical rules and patterns are much less successful; probabilistic programs have 90% or better accuracy for part of speech disambiguation.

The author then promotes the idea that categorical rules can be generated through stochastic methods to generate a system that is superior to either stochastic or heuristic systems alone.

The author then describes the Fidditch deterministic parser. Of note, Fidditch has about 700 rules, 350 to parse phrase grammar and 350 to disambiguate lexical category. Fidditch also has 46 lexical categories.

The author discusses the reasons his approach may be better; this primarily consists of his assumption that categorical rules are superior, but that humans can not accurately make a large enough body of them to be fully useful.

The author continues to discuss his training procedure.

10/7/03 - Maude Primer
Author: Theodore McCombs
  This is actually a small book on the language Maude. Maude is a declarative equational logic and rewrite language. It is extremely powerful and flexible, and actually includes an implementation of itself, with object oriented extensions, written in Maude.

  The book is fairly comprehensive, but is a primer -- there is a seperate manual which can be used as a reference. The primer covers equations, conditional equations, which are equations only applied under certain circumstances, rewrite rules, which work to allow algebras to be defined, conditional rewrite rules, sorts which are types, kinds which are classifiers for sorts, variables, object orientation, and much more.

10/6/03 - Evolutionary Incremential Development for Case-Based Reasoning
  This is a neat little paper that discusses a nice alignment-based pattern identifier. The paper is actually about an incremential system for developing a case-based reasoning system, but the pattern identifier is more interesting. The system works by starting with an initial case, then adding an additional case, and extrapolating generalizations by detecting patterns. The process is then repeated.

  The system applies two strategies to developing new rules -- one is the creation of new rules from cases for which no existing pattern can be modified to suit the addition of the case, and the second is an extension strategy in which an existing rule is extended in terms of the patterns it matches so that the new case is covered. Extensions only happen in such a way that variable targets for actions in cases are covered, so the system is bound.

9/16/03 - Notes On Formalizing Context
Author: John McCarthy
  This paper provides notes from John McCarthy on the explicit formalization of contexts as first class objects. These notes delve into the possibility of a framework that will allow rules based in a particular context to extend to a seperate context.

9/15/03 - Improving Part of Speech Disambiguation Rules by Adding Linguistic Knowledge
  This paper is an overview of work in producing a "part of speech tagger" for the Swedish language. A part of speech tagger is a system that is capable of determining the contextually correct definition of a word in a text. The system was capable of performing, using computer-generated rules, with better than 99.4% accuracy, based on an automated training process from unedited text.

9/14/03 - On the state of the art in machine learning: A personal review
  This paper is an overview of books related to machine learning, and attempts to predict based on the content of the books, the future direction of research into machine learning.

9/13/03 - Data Mining: An Overview from a Database Perspective
  This paper is a survey of modern data mining techniques, and defines data mining as knowledge discovery in databases. It details requirements of data mining techniques and provides a very broad overview of modern data minind techniques.

9/11/03 - Welcome!
  As part of our assignments, members of our class are required to read research papers related to our project and to software engineering, and post blurbs on this site. This page contains these blurbs.

Marc Santoro