Role of Lexical Analyzer – Compiler Design

Works as an interface between the source program and parser.
It examines input text character by character and separates the source program into pieces called tokens.
Acts as a subroutine which is called by the parser for the new token.
It returns to the parser a representation for the token is found.
Representation is an integer code if the token is simple to construct (parenthesis, comma, colon, etc.).
Representation is a pair consisting of integer code and a pointer to a table if the token is a complex element (identifier, constant, etc.).
Integer code gives the token type, the pointer points to the value of that token.
Remove the content free characters (comment, white spaces, etc.).
Insert line number during analysis.
Evaluating constants.
Detect lexical errors such as numeric literals are too long, identifiers are too long.
The output of the lexical analysis is input to syntax analysis.

Input Buffering

Lexical analyzer scans one character at a time of source program to discover tokens.
Many characters beyond token may have to be examined to determine the token itself.
So lexical analyzer reads its input from the input buffer.

Example
Consider two-pointe

One pointer marks the beginning of the token being discovered.
A look-ahead pointer scans ahead of the beginning point until the token is discovered.

Tokens

A program can be partitioned at the lowest level into a sequence of substrings called tokens.
Each token is a sequence of characters whose significance is possessed collectively rather than individually.
Example of tokens:

Constants (1, 2.3, 4.5E6)
Identifiers (A, H2035B, SPEED)
Operator symbols (+, -, **, :=, .EQ.)
Keywords (IF, GOTO, SUBROUTINE)
Punctuation symbols (parenthesis, brackets, comma, and semicolon).

Design of Lexical Analyzer

The behavior of any program can be represented by a flowchart.
We represent the behavior of the lexical analyzer by transition diagram.
In the transition diagram, boxes of flowcharts are drawn as circles called states.
States are connected by arrows called edges.
Labels on the edges leaving the state indicate the input characters that can appear after that state.

Description of the transition diagram

The figure shows a transition diagram for an identifier.
The identifier is a letter followed by any number of letters or digits.
The starting state is 0.
Edge from 0 indicates that the first character must be a letter (enter state 1).
Look for the next input character, if letter or digit, reenter state 1.
Continue reading letters or digits and making a transition from state 1 to itself.
Continue reading letters or digits and making the transition from state
1 to itself.
If the input character is a delimiter (not a letter or digit) for an identifier, enter state 2.

Construction of code for transition diagram

Construct a segment of code for each state.
The first step is: obtain the next character from the input buffer (using GETCHAR function).
GETCHAR function returns the next character and advancing the lookahead pointer at each call.
Determine edge, out of state is labeled by a character or class of character.
Determine edge, out of state is labeled by a character or class of character.
If such edge is found, control is transferred to the state pointed to by that edge.
If no such edge is found, and the state is not one that indicates that the token has been found (final state), we have failed to find this token.
The lookahead pointer must be retracted to where the beginning pointer is, and search for another token.

Finite Automata:

These are machines that represent a computer by providing a medium, which executes a finite no. of instructions sequentially as per algorithm accepting valid input and producing an output if the input is accepted these machines have input tape, reading head register recording the state for a machine, a set of instructions that govern the transition from one state to another