6.1.5 Any number of

For an regular expression that consists of an atom and the Kleene closure, we first examine the DFA representing the atom. Let’s call that M1 = (S1,Σ,T1,s1,A1), and we assume M1 has no ε transitions. The composite machine is M = (S1,Σ,T,s1,A). Let us define a new set U = {σ|(s,σ,x) T1}. In other words, U is the set of all symbols that lead us out of the start state of M1. Then, we add new transitions to one accept state A = {a} so that σ Σ U ∖{ε} : T(s1) = a.

This takes care of the fact that we do not need one instance of the atom to accept the input.

Next, we need to make the DFA recognize the longest possible sequence. To do this, we first create new accept transitions. b A1Σ U ∖{ε} : T(b,σ) = a. This means we only get to the accept state if we cannot extend the recognized string any longer.

Then, we specify b A1U : T(b,σ) = s1.

We now have a problem because each outgoing transition from an original accept state b A1 of the composite machine “eats” a symbol just to get to the next state. This symbol, however, should serve as the first symbol for the next DFA in a sequence.

The easiest solution is to “hack” the DFA and say that we can “unget” a symbol so that the symbol is available for the next DFA. In other words, all outgoing transitions from b A1 (an accept state of the component machine) need to unget the input symbol so that it is available for the following transition.

We can collect these special “unget” transitions into a subset T′⊆ T. This way, if we want to rid ourselves of the “unget” transitions, we can do it later. However, for the purpose of constructing a DFA for a regular expression, the ability to use “unget” transitions is invaluable. Otherwise, we lose the ability to cleanly separate component DFAs.