Parsing and Computational Linguistics

Tuesday, June 22, 2010

Lately I've been studying Tensor Calculus and how to use tensors for computations in The Harmonic Mind by Smolenksy and Legendre. I find the Harmonic Mind book to give not enough and too much information. They don't seem to assume any prior knowledge in the domain of linear algebra since pp.150-156 gives a tutorial, but as early as page 65 there are concepts introduced that use the material introduced several chapters later. I think they are trying to convince you that the subject is worth studying by showing you that it works without showing you how it works. I find this mode of argumentation to be a waste of time.

Later, in Chapter 8, they finally get into the gritty details of how their system works, but it focuses too much on formalisms. There is an example of how to separate a constituent and its role binding (p. 293-294), however it is extremely elementary. Ideally what I would have liked is one large, concrete example implemented in something like Mathematica, which is well suited to the task. I'm constructing such an example on my own but I have been hampered by ambiguities and thick use of mathematical rather than computational syntax. For instance, on p. 188 filler/role binding is defined as "A x r1 + B x r2" for binding A to r1 and B to r2. The next definition is of recursive embedding: "A x r0 + (B x r0 + C x r1) x r1." Notice how the indices for r in the first definition are one higher than in definition 2. I spent at least a few hours reading and reading to try and find out how they were different only to figure out that they're not; merely the use of indices is inconsistent. The same inconsistency is found on p. 295.

Overall though I've learned a lot. I have PassiveNet from p. 197 mostly implemented. The bug right now is that I don't have unbinding working quite right.

Since I began studying The Harmonic Mind my General Problem Solver, Regular Expression, and Vague Predicate Semantics projects have been on hold. The Regex compiler doesn't have an implementation of backreferences working inside of lookarounds. What it does have working is variable length lookarounds in both directions, which is something the Perl regex module doesn't do. On the other hand, I'm sure my implementation is slower.

My final goals for my Regex machine is to generalize over strings so that any linear type can be used for matching and then allow the matching machine to be used as a computational device in the sense of "(?{code})" in Perl. The second goal might require me to re-engineer my machine in terms of a proper backtracking state machine, but I would rather not do that since my machine (like Perl's) is more powerful than a finite state machine.

For my GPS I don't have any particular long term goals. It works as far as I wanted it to. I would like to find out if a form of GPS can handle recursive clause structure such as CP. I have had no luck making such a grammar work in my GPS via grammar rewriting. I believe it would require special tags on the grammar to tell the GPS machine to push an empty stack when going into a particular structure, and then pop it when retreating.

Details about my research in vague predicate semantics will remain private for now as I develop the mathematical and computational tools I need. I'm having to start a new line of inquiry since my studies in Tensor Analysis have provided me with fresh tools and ideas. When I feel the material is of a quality worth publishing I have a few academic friends I can call on for help.

On a side note, I believe I have stumbled on a use for tensors for analyzing character abilities in role playing games. Specifically, a character in an RPG game uses multiple "stats", usually just numbers such as "agility", within a system of combat to determine their character's offensive and defensive abilities. Players of such games try to optimize the numbers to maximize their combat abilities. Many simulators exist to assist players in deciding how to develop their characters, but I believe Tensors could give a form of static analysis that is potentially much more powerful.

Thursday, May 20, 2010

This is a story+tutorial about how I learned to marry OCaml and GNU make. Since I'll be leaving out my mistakes I'll appear to be unusally intelligent. Excellent!

Some of my projects that I write in OCaml have gotten too large for a simple "run these commands" build script to be convenient. There are many, many options for building OCaml projects and some are designed specifically for OCaml. OMake, Ocamlbuild, and GODI are the ones I know of. I find that these are overkill for my projects and not trivial to learn.

Another option is GNU make, which is not designed for OCaml, but is general enough that it works. Its syntax is a little on the shell scripting side but it has the advantage that if you learn to use it with OCaml, it's easy to apply your knowledge to projects in other languages. GNU make also supports parallel building. If your Makefile is written properly, make can spawn multiple "jobs" to potentially increase build speed on systems with multiple CPU's. This is done with the -jn flag where n is the number of jobs.

make's syntax is very much like shell scripting but its semantics aren't entirely procedural. It's much like a combination between a General Problem Solver, a database with rewriting rules and a shell. What it does is create a dependency graph to the first target listed in the file and then runs commands in order to satisfy the target's dependencies. For instance, if you have a list of object files that must be compiled before linking some executable, "executable_file", you would need the rule "executable_file: object1.o object2.o object3.o ..." The command to perform the linking goes on the next line after a tab. Like Python and Haskell, make makes use of white space for its syntax. So keeping mind that the tab is required, a common sort of entry in a Makefile is

executable_file: object1.o object2.o object3.o
    linker -o executable_file object1.o object2.o object3.o

You can replace the target with "$@" and the dependencies with "$^". "$@" and "$^" are automatic variables.

executable_file: object1.o object2.o object3.o
    linker -o $@ $^

For debugging, I found the most useful thing to do was "make -n", which tells make to print what commands it would run but not run them. Here is what my Makefile prints:

jeff$ make -n
ocamlc atn.mli
ocamlc -c -o atn.cmo atn.ml
ocamlc iMap.mli
ocamlc -c -o iMap.cmo iMap.ml
ocamlc q.mli
ocamlc -c -o q.cmo q.ml
ocamlc matching.mli
ocamlc -c -o matching.cmo matching.ml
ocamlyacc regex.mly
ocamlc regexTypes.mli
ocamlc regex.mli
ocamlc -c -o regex.cmo regex.ml
ocamllex regexLex.mll
ocamlc -i regexLex.ml > regexLex.mli
ocamlc regexLex.mli
ocamlc -c -o regexLex.cmo regexLex.ml
ocamlc regexATN.mli
ocamlc -c -o regexATN.cmo regexATN.ml
ocamlc -a -o regexLib.cma atn.cmo iMap.cmo q.cmo matching.cmo regex.cmo regexLex.cmo regexATN.cmo
rm regex.ml

As you might have guessed, my project involves regular expressions. The final file is a library of several modules, regexLib.cma. Interestingly, make will try to deduce what temporary files can be cleaned up after it's done. In the above case it's deleting the file generated with ocamlyacc, which is fine with me. Unfortunately there was one file it wanted to delete that I wanted to keep for debugging reasons. To keep it I simply added it to the list of dependencies for regexLib.cma.

To switch between compiling bytecode or native code I just change the variable TARGET before running make or give make the argument TARGET="BYTECODE" or TARGET="OPT". My Makefile assumes every file will be a module and so it creates an .mli for .ml files that have no corresponding .mli file. I'm also building a library, not an executable, but for an executable only minor changes would be needed.

#TARGET is BYTE or OPT (i.e. machine code).
#On the command line, set this with TARGET="OPT" or TARGET="BYTE".
TARGET = BYTE

#Set OC to the compiler being used and file suffixes be appropriate
#for TARGET.
ifeq ($(TARGET),OPT)
OC = ocamlopt
SUFFIX = .cmx
SUFFIX_LIB = .cmxa
else
OC = ocamlc
SUFFIX = .cmo
SUFFIX_LIB = .cma
endif

#Command to compile object code file.
OC_OBJ = $(OC) -c -o
#Command to compile a library.
OC_LIB = $(OC) -a -o
#Command to compile an interface to a source file.
OC_I = ocamlc -i

I hope the syntax is obvious so far. Referencing variables will look odd to some. It's done with $(name) and the names are case sensitive. Technically, the above will set OC to ocamlc for any value of TARGET that is not OPT, but since it's a binary choice I don't see that is a bug. Having references to SUFFIX and such later in the file make it look a bit messy, but it saves the trouble of making a new Makefile or other complicated procedure when I want to compile to machine code. Having compile commands in variables is makes it easier to globally change them.

Several guides I looked at recommended putting all of your object files in one variable so that they are easy to put into depency rules. For OCaml, generated interface files (.mli) might be necessary to keep for reference. In addition, some compiled interface files (.cmi) should be kept for users of the library. I'll put such files in the INTERFACE variable.

OBJ = atn$(SUFFIX) iMap$(SUFFIX) q$(SUFFIX) matching$(SUFFIX) \
    regex$(SUFFIX_LIB) regexLex$(SUFFIX) regexATN$(SUFFIX)

INTERFACE = q.cmi matching.cmi regexTypes.cmi regex.cmi regex.mli \
    regexLex.cmi regexLex.mli regexATN.cmi

$(SUFFIX) will be replaced with either .cmo or .cmx depending on the definition of TARGET.

So far I haven't told make to do anything. Since the first target in the Makefile is the one built by default, I list the final library file first.

regexLib$(SUFFIX): $(OBJ) $(INTERFACE)
    $(OC_LIB) regexLib$(SUFFIX_LIB) $(OBJ)

The first line says that the file regexLib.(cma|cmxa) depends on all the object code files and all the compiled interface files. The second line gives the command to create the file regexLib.(cma|cmxa). For bytecode, you can see the full expansion on the second to last line of the log (the above output from make -n).

It would be tedious and repetitive to manually write the dependency rules and associated commands for every file. Like the following:

regex.cmo: regex.ml regex.cmi
    $(OC) $@ $<

regex.cmi: regex.mli regexTypes.cmi
    $(OC_I) $<

regex.ml regex.mli: regex.mly
    ocamlyacc regex.mly

so I don't. Luckily I discovered make's template system. (I think of these rules as templates, but GNU calls them Static Patterns.) Targets that match a specified prefix and suffix can have an automatically generated dependence rule. Here is the macro for compiling a bytecode object file:

%.cmo: %.ml %.cmi
    ocamlc -c -o $@ $<

"%.cmo" means "match any target such that it has the prefix ".cmo"." The % signs on both sides of ":" are expanded to whatever the target is. So since regexLib.cma depends on q.cmo, if I have not specified a rule for q.cmo, make uses the template and fills in "q" wherever there is a "%", "q.cmo" wherever there is a "$@", and "$q.ml" wherever there is a "$<". Note that "%" cannot appear in the commands, only in the target and dependancy lists. I say lists because you can have more than one target for a particular pair of dependancy and command lists. Since ocamlyacc generates both .ml and .mli files from a .mly file, the following template will compile any ocamlyacc grammar files:

%.ml %.mli: %.mly
    ocamlyacc $<

Here are the rest of my templates:

%.cma: %.ml %.cmi
    ocamlc -a -o $@ $<

%.cmx: %.ml %.cmi
    ocamlopt -c -o $@ $<

%.cmi: %.mli
    ocamlc $<

%.ml: %.mll
    ocamllex $<

#Create an interface for any .ml file that doesn't have one.
%.mli: %.ml
    ocamlc -i $< > $@

These templates are nice for simple cases but Make requires that we manually enter dependencies for each object file. For my project. regex.cmi requires regexTypes.cmi, so the rule to compile regex.cmi would be

regex.cmi: regex.mli regexTypes.cmi
    ocamlc regex.mli

Writing something like that manually for every file defeats the purpose of having templates. This is where the sometimes-not-procedural part of make comes in handy. In addition to the template, I include "regex.cmi: regexTypes.cmi". When using the %.cmi template to target regex.cmi, make will add regexTypes.cmi to regex.cmi's dependency list. Thanks to this useful functionality, the only special case in the Makefile is the final library.

There is one downside to how make handles these rules. In OCaml, to compile an executable or object code file, we often need to have available an interface file (.cmi) to some library, but OCaml compilers fail when given .cmi files as arguments. To further automate the Makefile, I would like to use a template for making library files.

%.cma: %.ml %.cmi
    ocamlc -a -o %@ $^

The problem with this is that any .cmi files that the target is dependent upon are given to ocamlc and ocamlc then chokes on them.

jeff$ ocamlc -a -o regexATN.cma regexATN.ml regexTypes.cmi
/usr/bin/ocamlc: don't know what to do with regexTypes.cmi.

Furthermore, for an executable, the %.ml file will need to be the last one on the argument list. There is perhaps a work-around involving functions, but I have yet to play with those to find out.

The final thing to add is a "clean" target, so that "make clean" will remove any files used during compilation. Since clean is a target that does not create a file, it will confuse make unless we include the ".PHONEY: clean" directive, to tell make that clean does not produce a file.

.PHONY: clean
clean:
    rm -f regex.ml regex.mli regexLex.mli regexLex.ml
    rm -f *.cma *.cmx *.cmxa *.cmi *.o *.a *.cmo

That's the end of my Makefile. You can download it if you like. It should be easy to adapt it to other simple OCaml projects. I hope I helped make make your life a little easier.

Monday, May 03, 2010

I found another strange case in the perl regex machine. Consider the following match:

"aab" =~ /(.)(?!(\1)(\1))/

This match succeeds. Perl allows you to pull groups out from inside lookarounds, so $2 is "a". If you think about it, that's a little odd, since group 2 matches while inside an environment where booleans are inverted. (?!(\1)(\1)) is prima facie homomorphic with "not (A and B)" or rewritten with DeMorgan's, "(not A) or (not B)". More explicitly, "(do not match group 1) or (do not group match 1)." The first assertion "do not match group 1" fails but $2 is given the value "a" anyway. After that, the assertion "do not match group 1" succeeds and yet the successful match is not bound to any characters.

I do not think this implementation of negative lookahead is as useful as it could be. Group 3 is where the assertion succeeds. It wouldn't be terribly difficult to bind the text being matched for group 3 to $3.

Sunday, May 02, 2010

Perl does not allow variable length lookbehind. The restriction is easily found when attempting to use a backreference in a lookbehind. Perl's regex compiler is dumb about backreferences though, since some backreferences are a known length at compile time. When this is the case, Perl's error message error is confusing. For instance, you would expect group 1 in

$str =~ /(.)(?<=\1)/

to match a character that is not preceded by itself. In this case, group 1 is always one character long, but the perl regex compiler doesn't know that. It treats it the same as it would

$str =~ /(.+)(?<=\1)/

and gives the same error message.

Monday, September 17, 2007

Here I present a General Problem Solver (GPS) written in OCaml. You can find some references at the Wikipedia article and a Scheme implementation by John Stone here. The example I give at the end is from Peter Norvig. Please note that this is the naive version of the GPS (Peter Norvig's version 1, gps1.lisp); it solves the goals in order but does not guarantee that the goals will all be in the stack when the function returns. It suffers from what Norvig calls the "Clobbered Sibling Goal" problem.

You may download the source code.

First, a couple of definitions.

exception Failed_path

type op =
{
  name : string;
  preconditions : string list;
  add : string list;
  delete : string list;
}

Failed_path isn't too special, we simply use it when a path of decisions is faulty and we need to backtrack and use a different path to our goal. An op, or operator, is a named structure that will "operate" on string lists, treating them as sets. Preconditions are what must be in the state for the op to be followed, add is what is added to the state when you follow the op, and delete is what you remove from the state when you follow the op. List.append works as union, and cons works to add an element, but we will need a difference function ( \ or - in set algebra). For small problems this is okay.

let difference a b =
  List.fold_left
    (fun accum a ->
      if List.mem a b
      then accum
      else a::accum
    )
    []
    a

Now a note on how the GPS works generally. A GPS requires a starting state (in this case, a list of strings), a list of goals, and a list of ops. The goals are attained one by one via a depth-first search through the ops. The ops constitute a digraph and the initial state determines where in the graph you can start your search. The GPS algorithm travels the digraph until it finds a path such that the stack contains the goals in order. Successive goals may be removed from the stack on the way to the next goal. The GPS function returns a tuple containing the path taken and the final state.

First lets consider the case of trying to satisfy a single goal:

let rec achieve goal state ops =

We know that if the goal is already in the state, then there is no work to do

if List.mem goal state
then ([],state)
else

Returning ([],state) means that we've taken no path through the ops and the state is unmodified.

Next we consider what to do when the goal is not yet reached. We search the ops for nodes such that they point to the goal:

let to_try =
  List.find_all
    (fun op ->
      List.mem goal op.add
    )
    ops
    in

If the list to_try is empty, we can forget this path through the ops.

if to_try = []
then raise Failed_path
else

The next step requires an understanding of the digraph defined by the ops and the state. We know where we want to be in the digraph, but we have to find a path through the digraph such that the state eventually contains the prerequisites for one of the ops that will add our goal to the state. This is a rather complex task, but the "achieve" function is co-recursive with an "achieve_all" function that happens to do exactly what we want.

Now I have to apologize to the functional purists out there, but I found the next portion of "achieve" to be easiest to write imperatively. It sets up an empty tuple ([],[]) that will be replaced if "achieve_all" is able to complete a path to our target op such that the state meets the target op's prerequisites. We'll pick the first matching path.

let choice = ref ([],[]) in
let len = List.length to_try in
let i = ref 0 in
while !i < len && !choice = ([],[]) do
  begin
    let op = List.nth to_try !i in
    try
      let (path,result) = achieve_all op.preconditions state ops in
      choice := (path @ [op.name], op.add @ (difference result op.delete))
    with
      Failed_path -> ()
  end;
  incr i;
done;
if !choice = ([],[]) then raise Failed_path else !choice

Notice the line that begins "choice :="; this is the only place where an op modifies the stack and is recorded on the path. The rest of the above code block is a while loop that checks the ops in "to_try" and modifies "choice" if it finds an acceptable path. Note that the while loop is where we catch Failed_path. Any failed path is ignored and the next candidate op is tried.

Save the definition of "achive_all", the GPS can achieve any single goal. Achieving a list of goals is the task of "achieve_all".

If there are no goals, there is nothing to do.

and achieve_all goals state ops =
  if goals = []
  then ([],state)
  else

The alternative is to fold_left over the list of goals, using "achieve" to solve them one at a time.

List.fold_left
  (fun (oldpath,oldresult) g ->
    let (path,result_state) = achieve g oldresult ops in
    (oldpath @ path, result_state)
  )
  ([],state)
  goals

That's it! Now we can try an example:

let driving_ops =
[make_op "drive son to school" ["son at home"; "car works"] ["son at school"] ["son at home"];
  make_op "shop installs battery" ["car needs battery";"shop knows problem";"shop has money"] ["car works"] ["car needs battery";"shop knows problem"];
  make_op "tell shop problem" ["in communication with shop"] ["shop knows problem"] [];
  make_op "telephone shop" ["know phone number"] ["in communication with shop"] [];
  make_op "look up number" ["have phone book"] ["know phone number"] [];
  make_op "give shop money" ["have money"] ["shop has money"] ["have money"]
]

let _ =
  achieve_all ["son at school"] ["son at home"; "car needs battery";"have money";"have phone book"] driving_ops

- : string list * string list =
(["look up number"; "telephone shop"; "tell shop problem"; "give shop money";
"shop installs battery"; "drive son to school"],
["son at school"; "shop has money"; "in communication with shop";
"have phone book"; "know phone number"; "car works"])

Monday, September 10, 2007

I've been looking at a paper by Jean-Cristophe Filliatre, Backtracking Iterators. The goal of them is to be able to traverse a tree, going up or down, without having to use exceptions. In other words, instead of using the caml stack to do the backtracking you do the backtracking explicitly.

I'm not going to go into details about how the following works right now, but it's a backtracking iterator I wrote for n-ary trees.

type path = Down | Up | Entry of int

exception End_of_Traversal

(* Preorder traversal of an n-ary tree. checked *)

let preorder t =
  let rec aux = function
      E -> []
    | N (x, []) -> [Entry x]
    | N (x, children) ->
        let result =
          List.fold_left
            (fun accum c ->
              match (aux c) with
                  [] -> accum
                | x ->
                    Up::x @ Down::accum
            )
            []
            children
        in
        result @ [Entry x]
  in
  List.rev (aux t)

(* postorder ie depth-first, checked *)

let postorder t =
  let rec aux accum = function
      E -> []
    | N (x, []) -> [Entry x]
    | N (x, children) ->
        let result =
          List.fold_left
            (fun accum c ->
               match (aux [] c) with
                   [] -> accum
                 | x ->
                     Up::x @ Down::accum
            )
            []
            children
        in
        (Entry x)::result
  in
  List.rev (aux [] t)

(*Now the same functions can be used for going forward and backward in
the traversal, whatever traversal you chose.*)

let rec forward = function
    ([],_) -> raise End_of_Traversal
  | Up::(Entry x)::xx, history ->
      (Entry x)::xx, Down::history
  | Down::(Entry x)::xx, history ->
      (Entry x)::xx, Up::history
  | Up::x::xx, history ->
      forward (x::xx, Down::history)
  | Down::x::xx, history ->
      forward (x::xx, Up::history)
  | x, history -> forward (List. tl x, (List.hd x)::history)

let rec backward = function
    (_, []) -> raise End_of_Traversal
  | (current, Up::history) -> backward (Down::current, history)
  | (current, Down::history) -> backward (Up::current, history)
  | (current, x::xx) -> (x::current, xx)

let rewind (current, history) =
  List.rev_append (invert history) current

let extract (current, history) =
  try
    match List.hd current with
        Up | Down -> None
      | Entry x -> Some x
  with
    (Failure "hd") -> raise End_of_Traversal

Friday, May 25, 2007

Labltk is somewhat more primitive than Tk, but Tk still doesn't have native support for a tabbed interface, or notebook. This is a function which will make a widget that works pretty much like a tabbed interface.

open Tk

let create_notebook frames top =
  let tv = Textvariable.create () in
  let currentframe = ref None in
  let bounding = Frame.create ~borderwidth:2 ~relief:`Raised top in
  let buttonframe = Frame.create bounding in
  let bottomframe = Frame.create ~relief:`Groove bounding in
  let buttons = 
    List.map
      (fun (name,f) ->
        let f = f bottomframe in
        name,f,Button.create ~text:name ~command:(fun () -> Textvariable.set tv name) buttonframe
      )
      frames
  in
  let trd (_,_,c) = c in
  if frames <> [] then pack ~side:`Left (List.map trd buttons);
  let rec toggle () =
    (
     match !currentframe with
       None -> ()
     | Some f -> Pack.forget [f]
    );
    let b = Textvariable.get tv in
    List.iter
      (fun (n,f,button) ->
        if n = b then
          (
            Button.configure ~state:`Disabled button;
            if Pack.slaves f <> [] then
              Frame.configure ~borderwidth:2 bottomframe
            else
              Frame.configure ~borderwidth:0 bottomframe;
            pack [f];
            currentframe := Some f
          )
        else
          Button.configure ~state:`Normal button
      )
      buttons;
    Textvariable.handle tv ~callback:toggle;
  in
  Textvariable.handle tv ~callback:toggle;
  pack ~anchor:`W [buttonframe];
  pack ~fill:`Both ~expand:true [bottomframe];
  if buttons <> [] then Button.invoke (trd (List.hd buttons));
  bounding

Here is a function you can use to test it:

let test () =
  let top = openTk () in
  let frames =
    ["one",(fun f -> let f2 = Frame.create f in let l = Label.create ~text:"one" f2 in pack [l]; f2);
    "two",(fun f -> let f2 = Frame.create f in let l = Label.create ~text:"two" f2 in pack [l]; f2);
   "nothing",(fun f -> Frame.create f)]
  in
  let nb = create_notebook frames top in
  pack ~fill:`Both ~expand:true [nb];
  mainLoop ()