COMPILER DESIGN (MAY-2017) UNIT-II

Question:- 4(a) How input buffering helps lexical analyzer in compilation process? Discuss with an example.

Answer:- The Role of the Lexical Analyzer:

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. The lexical analyzer not only identifies the lexemes but also pre-processes the source text, for example by removing comments and white space.

 

Lexical analyzers are divided into a cascade of two processes:

  1. Scanning – simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive white-space characters into one.
  2. Lexical analysis – the more complex portion, which produces the sequence of tokens from the scanner's output.

 

Tokens, Patterns and Lexemes:

  1. A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword or an identifier.
  2. A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword, the pattern is just the sequence of characters that forms the keyword.
  3. A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. For example, in the statement count = 10, the lexeme count matches the identifier pattern and is returned as the token ⟨id, pointer to the symbol-table entry for count⟩.

 

 

Input Buffering:

 

Buffer Pairs:

 

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character.

Two pointers to the input are maintained:

  1. The pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
  2. The pointer forward scans ahead until a pattern match is found.

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.

Sentinels:

If we use the scheme of Buffer pairs we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character EOF.

Note that EOF retains its use as a marker for the end of the entire input. Any EOF that appears other than at the end of a buffer means that the input is at an end.
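The sentinel scheme above can be sketched in code. The following is a minimal, illustrative sketch (class and field names are my own, not from the original text): each half of a double buffer of size N is followed by a sentinel slot, so the inner loop performs only one test per character. It assumes the source text never contains the character \uFFFF, which is used as the sentinel.

```java
import java.io.*;

// Illustrative sketch of buffer pairs with sentinels.
// Layout: buf[0..n-1] first half, buf[n] sentinel,
//         buf[n+1..2n] second half, buf[2n+1] sentinel.
public class DoubleBuffer {
    static final char EOF = (char) 0xFFFF; // sentinel; assumed absent from source
    final int n;                            // size of each buffer half
    final char[] buf;                       // two halves plus two sentinel slots
    final Reader in;
    int forward = -1;                       // the forward pointer

    DoubleBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        buf = new char[2 * n + 2];
        fill(0);                            // load the first half
    }

    // Reload one half and plant the sentinel right after the last valid char.
    private void fill(int start) throws IOException {
        int read = in.read(buf, start, n);  // -1 at end of input
        buf[start + Math.max(read, 0)] = EOF;
    }

    // Advance forward. The sentinel combines the buffer-end test with the
    // end-of-input test: only when we actually hit EOF do we ask which case it is.
    char next() throws IOException {
        char c = buf[++forward];
        if (c == EOF) {
            if (forward == n) {             // end of first half: reload second
                fill(n + 1);
                return next();
            } else if (forward == 2 * n + 1) { // end of second half: reload first
                fill(0);
                forward = -1;
                return next();
            } else {
                return EOF;                 // real end of the entire input
            }
        }
        return c;
    }
}
```

With a buffer half of size 2 and input "abcdef", next() yields the six characters in order, reloading each half transparently, and then returns the sentinel to signal end of input.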




Question :- 4(b) What is a recursive descent parser? Construct a recursive descent parser for the following:

E → E + T | T

T → T F | F

F → F* | a | b

Answer:- Parsing

Recall that parsing is the problem of taking a string of terminal symbols and finding a derivation for that string in a context-free grammar. Parsing is a critical task in implementing an interpreter or compiler for a programming language, since the syntax of a program contains essential information about its meaning.

A parser is the module of an interpreter or compiler which performs parsing. It takes a sequence of tokens from the lexical analyzer (you know how to write one of those!), finds a derivation for the sequence of tokens, and builds a parse tree (also known as a syntax tree) representing the derivation. We have seen that parse trees are very important in figuring out the meaning of a program (or part of a program).

Recursive Descent

Recursive descent is a simple parsing algorithm that is very easy to implement. It is a top-down parsing algorithm because it builds the parse tree from the top (the start symbol) down.

The main limitation of recursive descent parsing (and of top-down parsing algorithms in general) is that it works only on grammars with certain properties. For example, if a grammar contains any left recursion, recursive descent parsing doesn’t work.

Eliminating Left Recursion

Here’s our simple expression grammar we discussed earlier:

S → E

E → T | E + T | E - T

T → F | T * F | T / F

F → a | b | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

[S, E, T, and F are nonterminal symbols, and a, b, and the digits 0-9 are terminal symbols.]

Unfortunately, this grammar is not suitable for parsing by recursive descent, because it uses left recursion. For example, consider the production

E → E + T

This production is left recursive because the nonterminal on the left hand side of the production, E, is the first symbol on the right hand side of the production.

To adapt this grammar to use with a recursive descent parser, we need to eliminate the left recursion. There is a simple technique for eliminating immediate instances of left recursion. [This technique won’t handle indirect instances of left recursion.]

Given an occurrence of left-recursion:

A → A α

A → β

Note that some non-recursive production of the form A → β must exist; otherwise, we could never eliminate occurrences of the nonterminal A from our working string when constructing a derivation.

We can rewrite the rules to eliminate the left recursion as follows:

A → β A’

A’ → α A’

A’ → ε

So, for example,

E → E + T

E → T

becomes

E → T E’

E’ → + T E’

E’ → ε

Here’s the entire expression grammar, with left-recursion eliminated.

E → T E’

E’ → + T E’

E’ → - T E’

E’ → ε

T → F T’

T’ → * F T’

T’ → / F T’

T’ → ε

F → a | b | 0 | 1 | 2 | … | 9

Left-Factoring

Another property required of a grammar to be suitable for top-down parsing is that the grammar is left-factored. A left-factored grammar is one where for each nonterminal, there are no two productions on that nonterminal which have a common nonempty prefix of symbols on the right-hand side of the production.

For example, here is a grammar that is not left-factored:

A → a b c

A → a b d

Both productions share the common prefix a b on the right hand side of the production.

We can left-factor the grammar by making a new nonterminal to represent the alternatives for the symbols following the common prefix:

A → a b A’

A’ → c

A’ → d

Our non-left-recursive expression grammar is already left-factored, so we don’t need to change it.

Recursive Descent Parsing

Once you have a non-left-recursive, left-factored grammar, recursive descent parsing is extremely easy to implement.

Each nonterminal symbol has a parsing function. The purpose of the parsing function for a nonterminal is to consume a string of terminal symbols that are “generated” by an occurrence of that nonterminal.

Terminal symbols are consumed either by directly reading them from the lexer, or by calling the parse methods of another nonterminal (or nonterminals). In general, the behavior of the parse method for a particular nonterminal is governed by what string of symbols (nonterminal and terminal) is legal for the right-hand side of productions on the nonterminal.

Typically, any parser will construct a parse tree as a side-effect of parsing a string of input symbols. The nodes of a parse tree represent the nonterminal and terminal symbols generated in the derivation of the input string. So, we can have the parse method for each nonterminal return a reference to a parse tree node for that particular nonterminal symbol.

Example

Here’s what a parse method for the E nonterminal in our revised expression grammar might look like:

public Symbol parseE() throws IOException {
    NonterminalSymbol e = new NonterminalSymbol("E");

    e.addChild(parseT());
    e.addChild(parseEPrime());

    return e;
}

The parseE method internally calls parseT and parseEPrime methods, because the only production on E is

E → T E’

The parseT method is similar.

The parseEPrime method is more interesting. There are three productions on E’:

E’ → ε

E’ → + T E’

E’ → - T E’

When parseEPrime is called, we should ask the lexical analyzer if we’ve reached the end of the input string. If so, then the epsilon production must be applied.

If the lexer is not at end-of-input, we should ask the lexical analyzer for a single terminal symbol. If we see a symbol other than + or -, then we’ll again assume that the epsilon production should be applied. Otherwise, we’ll add the terminal symbol to the parse node we’re creating for the E’ symbol, and continue by parsing the T and E’ nonterminal symbols by calling their parse methods:

private Symbol parseEPrime() throws IOException {
    NonterminalSymbol ePrime = new NonterminalSymbol("E'");

    TerminalSymbol op = lexer.peek(); // look ahead
    if (op == null) {
        // end of input: epsilon production is applied
        System.out.println("end of input");
    } else {
        if (!op.getValue().equals("+") && !op.getValue().equals("-")) {
            // we saw a terminal symbol other than + or -:
            // apply epsilon production
        } else {
            // consume the operator
            lexer.get();

            // saw + or -
            // production is
            //    E' -> + T E'
            // or
            //    E' -> - T E'
            ePrime.addChild(op);
            ePrime.addChild(parseT());
            ePrime.addChild(parseEPrime());
        }
    }

    return ePrime;
}

Note a few interesting things going on in parseEPrime:

  • The lexical analyzer is the lexer object. We’re using two of its methods: peek, which asks for the next token without consuming it, and get, which asks for and consumes the next token. Both methods return a TerminalSymbol object. peek returns the null value when the end of input is reached.
  • Applying the epsilon production means we don’t add any child symbols to the parse node being created.
  • The parseEPrime method can call itself recursively, because the

E’ → + T E’

E’ → – T E’

productions contain the symbol E’ on the right hand side. That’s why it’s called recursive descent!

To use a recursive descent parser to parse an entire input string, simply call the parse method for the grammar’s start symbol. It will return a reference to the root node of the parse tree.
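Putting all the pieces together, here is a compact, self-contained sketch of a recursive descent recognizer for the whole left-recursion-free expression grammar. Unlike the fragments above, it reads single-character tokens directly from a string rather than from a lexer object, and it recognizes rather than builds a tree; the class and method names are illustrative, not from the original article:

```java
// Self-contained recursive descent recognizer for:
//   E  -> T E'        E' -> + T E' | - T E' | epsilon
//   T  -> F T'        T' -> * F T' | / F T' | epsilon
//   F  -> a | b | 0 | 1 | ... | 9
public class RDParser {
    private final String in;
    private int pos = 0;

    private RDParser(String s) { in = s; }

    private Character peek() { return pos < in.length() ? in.charAt(pos) : null; }

    // Returns true iff the whole input string is derivable from E.
    static boolean accepts(String s) {
        RDParser p = new RDParser(s);
        try { p.E(); return p.pos == s.length(); }
        catch (RuntimeException e) { return false; }
    }

    private void E() { T(); EPrime(); }            // E  -> T E'

    private void EPrime() {                        // E' -> + T E' | - T E' | epsilon
        Character c = peek();
        if (c != null && (c == '+' || c == '-')) { pos++; T(); EPrime(); }
        // otherwise: apply the epsilon production (consume nothing)
    }

    private void T() { F(); TPrime(); }            // T  -> F T'

    private void TPrime() {                        // T' -> * F T' | / F T' | epsilon
        Character c = peek();
        if (c != null && (c == '*' || c == '/')) { pos++; F(); TPrime(); }
    }

    private void F() {                             // F  -> a | b | digit
        Character c = peek();
        if (c != null && (c == 'a' || c == 'b' || Character.isDigit(c))) pos++;
        else throw new RuntimeException("expected a, b, or digit at position " + pos);
    }

    public static void main(String[] args) {
        System.out.println(accepts("a+b*3")); // true
        System.out.println(accepts("a+"));    // false
    }
}
```

Note how each parsing method mirrors one nonterminal's productions, exactly as described above: the lookahead character decides whether to consume an operator and recurse or to apply the epsilon production.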




Question: -5(a) Generate the SLR parsing table for the following grammar

S → Aa | bAc | Bc | bBa

A → d

B → d

and parse the sentences “bdc” and “dd”.

Answer:- Introduction to the SLR parsing table

This answer describes the construction of a parse table for an LR (left-to-right, bottom-up) parser called an SLR, or Simple LR, parser.

One practical motivation for programming such a table generator is to export the parse table in a reusable format for other parsing programs.

Parsing

Syntax analysis recognizes sentences in a language and discovers the structure of a document/program. It constructs (implicitly or explicitly) a tree, called a parse tree, to represent that structure.

LR Parsers

  • Recognize the CFGs of most programming languages; no need to rewrite the grammar.
  • The most general O(n) shift-reduce method.
  • Optimal syntax error detection.
  • LR grammars are a superset of LL grammars.

Bottom-Up Parsing

  • Construct the parse tree from the leaves to the root.
  • Corresponds to a rightmost derivation in reverse.

Shift-Reduce Parsing

Parser performs actions based on parse table information:

  1. Shift. Shift the next input symbol onto the top of the stack.
  2. Reduce. The right end of the string to be reduced must be at the top of the stack. Locate the left end of the string within the stack and decide with what nonterminal to replace the string.
  3. Accept. Announce successful completion of parsing.
  4. Error. Discover a syntax error and call an error recovery routine.

The states of the grammar are built by the following state creation rules:

1- State 0 is built from the augmented production S’ → S $, where S is the start symbol of the grammar, with a dot ( . ) placed before S.

2- If the dot is before a nonterminal, add every production that has this nonterminal on its left-hand side, with the dot placed before the first symbol of its right-hand side.

3- If a state with the same productions and the same dot positions already exists, use that state instead of creating a new one.

4- For each state not yet processed, find the set of terminals and nonterminals that appear immediately after a dot.

5- If the set from step 4 is non-empty, go to step 6; otherwise, go to step 8.

6- For each terminal/nonterminal in the set from step 4, create a new state from all productions in the reference state whose dot is before that symbol, advancing the dot past that symbol.

7- Go to step 2.

8- End of state building.

The parse table is then filled by the SLR algorithm rules:

Goto: in the row for state SN and the column for nonterminal V, set GX when there is a transition on nonterminal V from state SN to state SX.

Reduce: in the row for state SN and the column for terminal T, set RX when state SN contains production number X with the dot at the end of its right-hand side, and T is in FOLLOW of that production's left-hand side.

Shift: in the row for state SN and the column for terminal T, set SY when state SN contains a production with the dot before terminal T, and state SN has a transition on T to state Y.

Accept: in the row for state SN and the column for the end marker $ (the extra terminal added to the real terminals), set Accept when S’ → S . (dot after the start symbol) is a member of state SN.

Error: all empty parse table cells are parse errors.

A grammar can produce a conflict between shift and reduce when both conditions hold for the same cell, which is then set to RX/SY (a shift-reduce conflict); likewise, two completed productions in one state with overlapping FOLLOW sets give a reduce-reduce conflict.
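Applying these rules to the question's grammar exposes a problem: every state containing both A → d . and B → d . must reduce on FOLLOW(A) = FOLLOW(B) = {a, c}, a reduce-reduce conflict, so this grammar is not SLR(1) (it is LR(1), though). To still illustrate how a filled table drives the parsing of “bdc” (accepted) and “dd” (rejected), the following sketch hard-codes the SLR table of the conflict-free subset S → Aa | bAc, A → d; all class and table names are illustrative, not part of any standard API:

```java
import java.util.*;

// Table-driven shift-reduce (SLR) driver for the conflict-free subset
//   1: S -> A a     2: S -> b A c     3: A -> d
// (the full grammar with B -> d has a reduce-reduce conflict in SLR).
public class SlrDemo {
    static final String[] LHS = { "", "S", "S", "A" }; // production left-hand sides
    static final int[]    LEN = {  0,   2,   3,   1 }; // right-hand-side lengths

    // ACTION[state] maps a terminal to "sN" (shift), "rN" (reduce), or "acc".
    static final List<Map<Character, String>> ACTION = new ArrayList<>();
    // GOTO[state] maps a nonterminal to the next state.
    static final List<Map<String, Integer>> GOTO = new ArrayList<>();
    static {
        for (int i = 0; i < 8; i++) { ACTION.add(new HashMap<>()); GOTO.add(new HashMap<>()); }
        ACTION.get(0).put('b', "s3"); ACTION.get(0).put('d', "s4");
        GOTO.get(0).put("S", 1);      GOTO.get(0).put("A", 2);
        ACTION.get(1).put('$', "acc");                              // S' -> S .
        ACTION.get(2).put('a', "s5");
        ACTION.get(3).put('d', "s4"); GOTO.get(3).put("A", 6);
        ACTION.get(4).put('a', "r3"); ACTION.get(4).put('c', "r3"); // A -> d . on FOLLOW(A)
        ACTION.get(5).put('$', "r1");                               // S -> A a .
        ACTION.get(6).put('c', "s7");
        ACTION.get(7).put('$', "r2");                               // S -> b A c .
    }

    static boolean parse(String input) {
        String w = input + "$";
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);
        int ip = 0;
        while (true) {
            String act = ACTION.get(stack.peek()).get(w.charAt(ip));
            if (act == null) return false;              // Error: empty table cell
            if (act.equals("acc")) return true;         // Accept
            if (act.charAt(0) == 's') {                 // Shift to state N
                stack.push(Integer.parseInt(act.substring(1)));
                ip++;
            } else {                                    // Reduce by production p
                int p = Integer.parseInt(act.substring(1));
                for (int i = 0; i < LEN[p]; i++) stack.pop();
                stack.push(GOTO.get(stack.peek()).get(LHS[p]));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("bdc")); // true  -> "bdc" is accepted
        System.out.println(parse("dd"));  // false -> "dd" is a syntax error
    }
}
```

For “bdc”, the driver shifts b and d, reduces A → d on lookahead c, shifts c, reduces S → bAc, and accepts. For “dd”, the second d finds an empty ACTION cell in state 4 and the parse fails, which matches the fact that “dd” is not derivable from the grammar.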

Using the code

To avoid constraints on the number of grammar production rules, nonterminals, terminals, and so on, all classes are implemented over linked-list data structures.

Following this logic, all classes are implemented on 3 levels:

1- an item fields struct

2- a node class

3- a list class

Class definitions:

1- ParsHead

This class is the top level of all the detail classes and contains the main properties:

  • the first (head) of the nonterminal list object
  • the total terminal count
  • the total nonterminal count
  • a stack for terminal symbols

2- NonTerminals

This class manages nonterminals; its parent is ParsHead.

Sample method description:

Grammar productions are processed, and the left-hand-side nonterminal and production number are sent to this method. The nonterminal's pure name is built by splitting on # and taking the rest of the name. Then, if the nonterminal list is empty, the first node (the extra nonterminal) is added along with production number 0 (the augmented production), and the method exits.

If the first item is not null, the list is searched for a nonterminal node whose name equals the nonterminal being looked up; if none is found, a new NonterminalNode is created and added to the list.

Finally, the production number is added to the LawLink of the nonterminal.




Question: -5(b)  How can errors be recovered in lexical phase of a compiler? Explain.

Answer:-

Program errors are detected and reported by the parser. The parser handles the errors encountered and parses the rest of the input. Errors may be encountered at various stages of the compilation process; at the various stages, the following kinds of errors occur:

  • Lexical: the name of some identifier typed incorrectly
  • Syntactic: a missing semicolon or unbalanced parentheses
  • Semantic: an incompatible value assignment
  • Logical: unreachable code, an infinite loop

Error recovery strategies implemented in compiler design

In order to deal with errors in the code, the following are the four common error-recovery strategies:

Panic mode

When an error is encountered anywhere in a statement, the rest of the statement is ignored: input is skipped from the erroneous token up to a delimiter such as a semicolon. This mode prevents the parser from developing infinite loops and is considered the easiest strategy for error recovery.
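Panic mode can be sketched in a few lines. This is a minimal illustration, assuming the token stream is a list of strings and ';' is the synchronizing delimiter (the class and method names are hypothetical):

```java
import java.util.List;

// Sketch of panic-mode recovery: on a syntax error at position pos,
// discard tokens until the synchronizing delimiter ";" is found,
// then resume parsing at the start of the next statement.
public class PanicMode {
    public static int synchronize(List<String> tokens, int pos) {
        // Skip everything up to the next ";" (or to end of input).
        while (pos < tokens.size() && !tokens.get(pos).equals(";")) pos++;
        // Consume the delimiter itself, if present, and resume after it.
        return pos < tokens.size() ? pos + 1 : pos;
    }

    public static void main(String[] args) {
        // Error detected at the stray "@" (index 2); recovery resumes at "y".
        List<String> toks = List.of("x", "=", "@", "1", ";", "y", "=", "2", ";");
        System.out.println(synchronize(toks, 2)); // 5, the index of "y"
    }
}
```

A real parser would call such a routine from its error handler, then continue its normal loop from the returned position, reporting one error per skipped region instead of looping forever.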

Statement mode

When an error is encountered by the parser, corrective measures are taken which allow the parser to parse the rest of the statement. For example, inserting a missing semicolon, replacing a comma with a semicolon, etc. Here more attention is required, as one wrong correction may lead to an infinite loop.

Error productions

Compiler designers sometimes know in advance that certain errors are likely to occur in the code. In such cases, the designers augment the grammar with error productions, which generate the erroneous constructs so that the parser can recognize them and recover.

Global correction

The program at hand is considered as a whole, and the closest error-free match for the intended program is figured out. When an erroneous input (statement) X is fed, the parser creates a parse tree for some closest error-free statement Y. This enables the parser to make minimal changes to the source code.

Abstract Syntax Trees

The full parse tree is not easily used by later phases of the compiler, because it contains too much detail. For instance, in the parse tree for a single operand such as a, the parent nodes E, T, and F each have a single child leaf node. When feeding the tree to the next phase, this information can be eliminated; hiding the extra information yields an abstract syntax tree (AST).

ASTs are the important data structures in a compiler, carrying a minimum of unnecessary information. Because they are more compact than a parse tree, ASTs are easily used by the compiler.
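As a tiny illustration of the difference (the class and method names here are my own), the parse tree for the single operand a is the chain E → T → F → a, four nodes, while the AST collapses single-child chains down to the one leaf:

```java
import java.util.*;

// Illustrative comparison of a parse tree vs. an AST for the operand "a".
public class TreeDemo {
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
        // Total number of nodes in this tree.
        int size() { int s = 1; for (Node c : children) s += c.size(); return s; }
        // Collapse chains of single-child interior nodes into their child.
        Node toAst() { return children.size() == 1 ? children.get(0).toAst() : this; }
    }

    public static void main(String[] args) {
        // Parse tree for "a": E -> T -> F -> a
        Node parseTree = new Node("E", new Node("T", new Node("F", new Node("a"))));
        System.out.println(parseTree.size());        // 4 nodes in the parse tree
        System.out.println(parseTree.toAst().label); // "a" -- the AST is one leaf
    }
}
```

The same collapsing idea, applied during or after parsing, is what turns a verbose parse tree into the compact AST that later compiler phases consume.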




This article is contributed by Tarun Jangra.





Author: Susheel kumar
