SLang Scanner (Improved)
A. Second Edition
This is second edition of my SLang Scanner for Comp442 assignment 1.
In this assignment, you have to design a lexical analyzer for tokens of a programming language SLANG (our Source
LANGuage). In SLANG, the following token types exist: (
l represents any letter, d any digit, and c any character).identifer
specified by `l( l + d + _(`l + d))*numrical constant
specifed by dd*character constant
specified by 'c'//THE FOLLOWING ARE SYMBOL TYPE 18
"(", ")", ";", "+", "-", "*",
"/", ":=", "<", ">", "=", "<=",
">=", "!=", "[", "]", ",",
":",
//THE FOLLOWING ARE RESERVED TYPE 15
"begin", "end", "program", "variables","integer", "array", "char",
"module", "if", "then", "else", "loop", "exit", "read", "write"
E.Further improvement
F.File listing
1. scanner.h
2. errorNo.h
3. scanner.cpp
4. main.cpp (main)
5. initialize.cpp
file name: scanner.h
///////////////////////////////////////////////////////////////////////////////////////////
//Program: SLang Scanner
//Author: Qingzhe Huang
//Date: Jan. 18, 2004
//FileName: scanner.h
//Features:
// 1. I want to improve efficiency of scanning, so I used table-driven method.
// 2. I used enum to represent character of all ASCII---CharType---where "space, tab,
// end of line, end of file are all considered to be White Space.
// 3. All legal token is represented by enum TokenType.
// 4. I defined a huge amount of TokenState which is basically the state of a DFA. As
// I don't want to search reserved keyword with linear search or whatever, I have
// many states for the reserved words.
// 5. I deliberately make the sequence of first 38 TokenState elements exactly same as
// all that of TokenType, so that each final state of DFA has a 1-1 correspondence with
// type of token.
// 6. I defined a struct of Token which may be used in future parser.
// 7. I defined an errorNo variable to represent various errors. And a series error string
// for displaying information.
// 8. When class Scanner is created, it will initialize the big "state-charType" table.
// 9. When readFromFile is called, it will first read one char in advance.
// 10. When an error is encountered, the caller of Scanner should understand that no further
// char is read in. So, stop calling "nextToken()". This is a bit controvercial, and I
// plan to change it in next version.
////////////////////////////////////////////////////////////////////////////////////////////
/*////////////////////////////////////////////////////////////////////////////
Program: SLang Scanner
Author: Qingzhe Huang
Date: Jan. 21, 2004
FileName: scanner.h
Features:
1. I restructured the struct Token, to make it a union field in order to store int
value for number.
2. I restructured the function "nextToken()" in order to give out correct line no.
when error occurs.
*////////////////////////////////////////////////////////////////////////////////
#ifndef SCANNER_H
#define SCANNER_H
#include <iostream>
using namespace std;
extern enum ErrorCode;
const int TokenStateCount=138;
const int CharTypeCount=72;
const int MaxTokenLength=255;
enum CharType
{
//all small letters 26
SMALLA,SMALLB,SMALLC,SMALLD,SMALLE,SMALLF,SMALLG,SMALLH,SMALLI,SMALLJ,SMALLK,SMALLL,
SMALLM,SMALLN,SMALLO,SMALLP,SMALLQ,SMALLR,SMALLS,SMALLT,SMALLU,SMALLV,SMALLW,SMALLX,
SMALLY,SMALLZ,
//all big letters 26
BIGA,BIGB,BIGC,BIGD,BIGE,BIGF,BIGG,BIGH,BIGI,BIGJ,BIGK,BIGL,BIGM,BIGN,BIGO,BIGP,BIGQ,
BIGR,BIGS,BIGT,BIGU,BIGV,BIGW,BIGX,BIGY,BIGZ,
//all digit 1
DIGIT,
//all symbols 16
QUOTE, OPENPAR, CLOSEPAR, SEMICOLON,PLUS, MINUS, TIMES, SLASH, COLON,
EQUAL,SMALLER,GREATER,EXCLAIM,OPENBRACKET, CLOSEBRACKET,COMMA,
//space, tab, end of line are regarded as whitespace, 1
WHITESPACE,
//UNDERSCORE IS A SPECIAL SYMBOL 1
UNDERSCORE,
//all other ASCII is regarded as illigal 1
ILLIGAL
};
//TOTAL 38, JUST 1-1 WITH THE FIRST 38 OF TOKENSTATE
enum TokenType
{
//GENERAL TYPE 5
IDTYPE, NUMBERTYPE, CHARCONSTTYPE, COMMENTTYPE, ERRORTYPE,
//THE FOLLOWING ARE SYMBOL TYPE 18
OPENPARTYPE, CLOSEPARTYPE, SEMICOLONTYPE, PLUSTYPE, MINUSTYPE, TIMESTYPE,
SLASHTYPE, ASSIGNMENTTYPE, SMALLERTYPE, GREATERTYPE, EQUALTYPE, SMALLEREQUALTYPE,
GREATEREQUALTYPE, NOTEQUALTYPE, OPENBRACKETTYPE, CLOSEBRACKETTYPE, COMMATYPE,
COLONTYPE,
//THE FOLLOWING ARE RESERVED TYPE 15
BEGINTYPE, ENDTYPE, PROGRAMTYPE, VARIABLESTYPE,INTEGERTYPE, ARRAYTYPE, CHARTYPE,
MODULETYPE, IFTYPE, THENTYPE, ELSETYPE, LOOPTYPE, EXITTYPE, READTYPE, WRITETYPE
};
enum TokenState
{
//THE FINAL STATE 38, in order to easy initialize "finalState", I put them in beginning
//5 generals
IDEND, NUMBEREND, CONSTCHAREND, COMMENTEND, ERROR,
//18 symbols
OPENPAREND, CLOSEPAREND, SEMICOLONEND, PLUSEND, MINUSEND, TIMESEND,
SLASHEND, ASSIGNMENTEND, SMALLEREND, GREATEREND, EQUALEND, SMALLEREQUALEND,
GREATEREQUALEND, NOTEQUALEND, OPENBRACKETEND, CLOSEBRACKETEND, COMMAEND,
COLONEND,
//15 reserved
BEGINEND, ENDEND, PROGRAMEND, VARIABLESEND, INTEGEREND, ARRAYEND, CHAREND,
MODULEEND, IFEND, THENEND, ELSEEND, LOOPEND, EXITEND, READEND, WRITEEND,
//THE FOLLOWING ARE ALL NON-FINAL STATES
//THE very FIRST CHAR 1
READY,
//THE FOLLOWING ARE ALL RESERVED STATE
//the first char 12
ARRAY1, BEGIN1, CHAR1, E1, I1, LOOP1, MODULE1, PROGRAM1, READ1, THEN1, VARIABLES1,
WRITE1,
//THE SECOND CHAR 15
ARRAY2, BEGIN2, CHAR2, ELSE2, END2, EXIT2, IF2, INTEGER2, LOOP2, MODULE2, PROGRAM2,
READ2, THEN2, VARIABLES2, WRITE2,
//THE THIRD CHAR 14
ARRAY3, BEGIN3, CHAR3, ELSE3, END3, EXIT3, INTEGER3, LOOP3, MODULE3, PROGRAM3, READ3,
THEN3, VARIABLES3, WRITE3,
//THE FOURTH CHAR 13
ARRAY4, BEGIN4, CHAR4, ELSE4, EXIT4, INTEGER4, LOOP4, MODULE4, PROGRAM4, READ4, THEN4,
VARIABLES4, WRITE4,
//THE FIFTH CHAR 7
ARRAY5, BEGIN5, INTEGER5, MODULE5, PROGRAM5, VARIABLES5, WRITE5,
//THE SIXTH CHAR 4
INTEGER6, MODULE6, PROGRAM6, VARIABLES6,
//THE SEVENTH CHAR 3
INTEGER7, PROGRAM7, VARIABLES7,
//THE EIGHTH CHAR 1
VARIABLES8,
//THE NINETH CHAR 1
VARIABLES9,
//THESE ARE NON-RESERVED
//THESE ARE GENERAL 9
IDBEGIN, IDUNDERSCORE, NUMBERBEGIN, CONSTCHARQUOTEBEGIN, CONSTCHARBEGIN, COMMENTSTARBEGIN,
COMMENTBEGIN, COMMENTSTAREND, COMMENTSLASHBEGIN,
//the SINGLE symbols 16
QUOTEBEGIN, OPENPARBEGIN, CLOSEPARBEGIN, SEMICOLONBEGIN,
PLUSBEGIN, MINUSBEGIN, TIMESBEGIN, SLASHBEGIN, COLONBEGIN, SMALLERBEGIN, GREATERBEGIN,
EQUALBEGIN, EXCLAIMBEGIN, OPENBRACKETBEGIN, CLOSEBRACKETBEGIN, COMMABEGIN,
//MULTI SYMBOL 4
ASSIGNMENTBEGIN, SMALLEREQUALBEGIN,
GREATEREQUALBEGIN, NOTEQUALBEGIN
};
//extern ErrorCode errorNo;
extern void errorHandle(ErrorCode errorNo);
struct Token
{
TokenType type;
union
{
char name[MaxTokenLength+1];
int number;
};
};
class Scanner
{
private:
int tokenCount;
unsigned char ch;
void printLineNo();
FILE* stream;
bool nextChar();
void initialize();
bool resume();
public:
Scanner();
static Token token;
bool readFromFile(const char* fileName, const char* listFileName="c:\\nickList.txt");
const char* getToken(){return token.name;}
bool nextToken();
void report();
};
void initialTokenState();
#endif
file name: errorNo.h
#ifndef ERRORNO_H
#define ERRORNO_H
extern char* errorStr[];
const int ScannerErrorCount=6;
enum ErrorCode
{IllegalToken, TokenTooLong, UnexpectedReachEOF, FileEmptyError, CannotOpenFile,
ExceedNumberLimit};
#endif
file name: scanner.cpp
/*////////////////////////////////////////////////////////////////////////////
Program: SLang Scanner
Author: Qingzhe Huang
Date: Jan. 21, 2004
FileName: scanner.cpp
Features:
1. As Dr. Optrany said, the number should be stored as int or double whatever.
2. I restructured the function "nextToken()" in order to give out correct line no.
when error occurs.
*////////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <fstream>
#include "scanner.h"
#include "errorNo.h"
using namespace std;
ofstream fList;
//this will determine how many errors of maximum the scanner will tolerant
const int MaxErrortolerant=10;
//as integer usually have max 12 digit roughly
const int MaxNumberLength=12;
int errorCount=0;
int lineCount=1;
//static memeber
Token Scanner::token;
const int ErrorCount=6;
const int TokenTypeCount=38;
void errorHandle(ErrorCode errorNo);
char* errorStr[ErrorCount]=
{"IllegalToken", "TokenTooLong", "UnexpectedReachEOF", "FileEmptyError",
"CannotOpenFile", "ExceedNumberLimit"};
//this is purely for displaying purpose
char* tokenTypeStr[TokenTypeCount]=
{
//GENERAL TYPE 5
"ID", "NUMBER", "CHARACTER CONSTANT", "COMMENT", "ERROR",
//THE FOLLOWING ARE SYMBOL TYPE 18
"(", ")", ";", "+", "-", "*",
"/", ":=", "<", ">", "=", "<=",
">=", "!=", "[", "]", ",",
":",
//THE FOLLOWING ARE RESERVED TYPE 15
"begin", "end", "program", "variables","integer", "array", "char",
"module", "if", "then", "else", "loop", "exit", "read", "write"
};
CharType charType[256];
TokenState tokenState[TokenStateCount][CharTypeCount];
//this is going to be improved in future as parser need to
//call it, too. so, more parameter should be added?
//No! the error no. itself specifies the error and it is
//error handler to try to find necessary info to display.
void errorHandle(ErrorCode errorNo)
{
if (errorNo<ScannerErrorCount)
{
errorCount++;
//the illegal token may be for various reason and I only suggest
//a possible nearby place to spot the error occurs.
fList<<"\nerror of "<<errorStr[errorNo]<<" occurred at line "
<<lineCount<<" near token "<<Scanner::token.name<<endl;
}
}
//when error occurs, no message is immediately output, it is
//postponed to next time, because when '\n' is read in,
//lineCount is not incremented until token is decided.
//Therefore, when a token is ended with '\n', the line no is not updated
//until next round. So, we can keep the correct line no. for each token
bool Scanner::nextToken()
{
TokenState state=READY;
int digitCount=0;
int value=0;
int count=0;//to count the length of token
char* ptr=token.name;
bool isComment=false;
do
{
//map ch to CharType reducing 256 ASCII to 73 CharTypes
//the table for "state" and "CharType is 138x73, each entry is a
//index for state.
state=tokenState[state][charType[ch]];
if (state==NUMBERBEGIN)
{
digitCount++;
if (digitCount>=MaxNumberLength)
{
errorHandle(ExceedNumberLimit);
return resume();
}
//to accumulate the value
value*=10;
value+=ch-'0';
}
//because I put all final state in the first 38 positions
if (state<38)
{
//This is a dirty trick! Because I make the "TokenType" 1-1 with
//TokenState for the 38 finals.
*ptr='\0';
token.type=(TokenType)(state);
if (state==ERROR)
{
errorHandle(IllegalToken);
//printLineNo();
return resume();
}
tokenCount++;
if (state==NUMBEREND)
{
token.number=value;
}
//printLineNo();
return true;
}
if (state==COMMENTBEGIN)
{
isComment=true;
}
if (count>=MaxTokenLength)
{
errorHandle(TokenTooLong);
token.type=ERRORTYPE;
//printLineNo();
return false;
}
//cout<<ch;
if (!isComment&&state!=READY)
{
*ptr=ch;
ptr++;
count++;
}
//it is only at end to update line no.
printLineNo();
}while (nextChar());
state=tokenState[state][charType[ch]];
//at this point, it is either in ready state, or error state
if (state==ERROR)
{
token.type=(TokenType)(state);
errorHandle(UnexpectedReachEOF);
}
//but in all case it means end of file, so return false
return false;
}
bool Scanner::resume()
{
//the scanner will try to continue if error number is within 10
if (errorCount==MaxErrortolerant)
{
return false;
}
return nextChar();
}
void Scanner::report()
{
fList<<"\ntotal number of tokens is "<<tokenCount;
fList<<"\ntotal number of errors is "<<errorCount;
}
void Scanner::printLineNo()
{
if (ch=='\n')
{
fList<<++lineCount<<" ";
}
}
Scanner::Scanner()
{
initialize();
}
void Scanner::initialize()
{
errorCount=0;
lineCount=1;
tokenCount=0;
initialTokenState();
}
bool Scanner::readFromFile(const char* fileName, const char* listFileName)
{
if ((stream=fopen(fileName, "r"))==NULL)
{
errorHandle(CannotOpenFile);
return false;
}
else
{
fList.open(listFileName);
fList<<lineCount<<" ";
//this is to prevent the empty file situation in which
//you cannot even read one single char because my scanner need to read
//one char ahead
if (!nextChar())
{
errorHandle(FileEmptyError);
return false;
}
}
return true;
}
bool Scanner::nextChar()
{
ch=fgetc(stream);
fList<<ch;
return ch!=255;
}
file name: initialize.cpp
/*////////////////////////////////////////////////////////////////////////////
Program: SLang Scanner
Author: Qingzhe Huang
Date: Jan. 18, 2004
FileName: initialize.cpp
Features:
1. This is purely mechnical job, you know to initialize a huge state table:
138x72 is really a boring, routine job.
2. For EOF, I want "ch" to be able to be an index in "CharType" array, so, it
cannot be -1, but 255 for "unsigned char" which is declared in class Scanner.
*////////////////////////////////////////////////////////////////////////////////
#include "scanner.h"
extern enum CharType;
extern enum TokenState;
extern CharType charType[256];
extern TokenState tokenState[TokenStateCount][CharTypeCount];
void finalSymbolToken(TokenState state, TokenState endState);
void finalReservedToken(TokenState state, TokenState endState);
void initialCharType();
void setFinalTokenState();
void initialReserved(TokenState state);
void setRange(TokenState state, CharType start, CharType end, TokenState target);
void setState(TokenState state, TokenState targetState);
void setDefaultState();
void setDefaultState()
{
//all states are by default error
for (int i=0; i<TokenStateCount; i++)
{
setState((TokenState)i, ERROR);
}
//the default for all letters are IDBEGIN
setRange(READY, SMALLA, BIGZ, IDBEGIN);
//THIS IS another dirty trick, since I put all reserved states together
//so you can initialize them together.
for (i=ARRAY1; i<=VARIABLES9; i++)
{
initialReserved((TokenState)i);
}
setFinalTokenState();
}
void setFinalTokenState()
{
//FOR ID
finalReservedToken(IDBEGIN, IDEND);
//for number
finalReservedToken(NUMBERBEGIN, NUMBEREND);
//THESE FOR RESERVED WORDS
finalReservedToken(ARRAY5, ARRAYEND);
finalReservedToken(BEGIN5, BEGINEND);
finalReservedToken(CHAR4, CHAREND);
finalReservedToken(ELSE4, ELSEEND);
finalReservedToken(END3, ENDEND);
finalReservedToken(EXIT4, EXITEND);
finalReservedToken(IF2, IFEND);
finalReservedToken(INTEGER7, INTEGEREND);
finalReservedToken(LOOP4, LOOPEND);
finalReservedToken(MODULE6, MODULEEND);
finalReservedToken(PROGRAM7, PROGRAMEND);
finalReservedToken(READ4, READEND);
finalReservedToken(THEN4, THENEND);
finalReservedToken(VARIABLES9, VARIABLESEND);
finalReservedToken(WRITE5, WRITEEND);
//THESE FOR SYMBOLS
finalSymbolToken(OPENPARBEGIN, OPENPAREND);
finalSymbolToken(CLOSEPARBEGIN, CLOSEPAREND);
finalSymbolToken(SEMICOLONBEGIN, SEMICOLONEND);
finalSymbolToken(PLUSBEGIN, PLUSEND);
finalSymbolToken(MINUSBEGIN, MINUSEND);
finalSymbolToken(TIMESBEGIN, TIMESEND);
finalSymbolToken(SLASHBEGIN, SLASHEND);
finalSymbolToken(ASSIGNMENTBEGIN, ASSIGNMENTEND);
finalSymbolToken(SMALLERBEGIN, SMALLEREND);
finalSymbolToken(GREATERBEGIN, GREATEREND);
finalSymbolToken(EQUALBEGIN, EQUALEND);
finalSymbolToken(SMALLEREQUALBEGIN, SMALLEREQUALEND);
finalSymbolToken(GREATEREQUALBEGIN, GREATEREQUALEND);
finalSymbolToken(NOTEQUALBEGIN, NOTEQUALEND);
finalSymbolToken(OPENBRACKETBEGIN, OPENBRACKETEND);
finalSymbolToken(CLOSEBRACKETBEGIN, CLOSEBRACKETEND);
finalSymbolToken(COMMABEGIN, COMMAEND);
finalSymbolToken(COLONBEGIN, COLONEND);
//COMMENT
finalSymbolToken(COMMENTSLASHBEGIN, COMMENTEND);
//CONSTCHAR
finalSymbolToken(CONSTCHARQUOTEBEGIN, CONSTCHAREND);
}
void initialTokenState()
{
//initialize all charType
initialCharType();
//default is always error
setDefaultState();
//loop
tokenState[READY][WHITESPACE]=READY;
//number
tokenState[READY][DIGIT]=NUMBERBEGIN;
tokenState[NUMBERBEGIN][DIGIT]=NUMBERBEGIN;//HOW LONG SHOULD NUMBER BE?
//ID
//setRange(READY, SMALLA, BIGZ, IDBEGIN); THIS IS IN DEFAULT
setRange(IDBEGIN, SMALLA, DIGIT, IDBEGIN);
tokenState[IDBEGIN][UNDERSCORE]=IDUNDERSCORE;
setRange(IDUNDERSCORE, SMALLA, DIGIT, IDBEGIN);
//reserved words
//ARRAY1, BEGIN1, CHAR1, E1, I1, LOOP1, MODULE1, PROGRAM1, READ1, THEN1, WRITE1,
//VARIABLES1,
tokenState[READY][SMALLA]=ARRAY1;
tokenState[READY][SMALLB]=BEGIN1;
tokenState[READY][SMALLC]=CHAR1;
tokenState[READY][SMALLE]=E1;
tokenState[READY][SMALLI]=I1;
tokenState[READY][SMALLL]=LOOP1;
tokenState[READY][SMALLM]=MODULE1;
tokenState[READY][SMALLP]=PROGRAM1;
tokenState[READY][SMALLR]=READ1;
tokenState[READY][SMALLT]=THEN1;
tokenState[READY][SMALLV]=VARIABLES1;
tokenState[READY][SMALLW]=WRITE1;
/* RESERVED WORDS
ARRAY2 */
tokenState[ARRAY1][SMALLR]=ARRAY2;
//BEGIN2
tokenState[BEGIN1][SMALLE]=BEGIN2;
//CHAR2
tokenState[CHAR1][SMALLH]=CHAR2;
//ELSE2,
tokenState[E1][SMALLL]=ELSE2;
//EXIT2
tokenState[E1][SMALLX]=EXIT2;
//END2
tokenState[E1][SMALLN]=END2;
//IF2
tokenState[I1][SMALLF]=IF2;
//INTEGER2
tokenState[I1][SMALLN]=INTEGER2;
//LOOP2
tokenState[LOOP1][SMALLO]=LOOP2;
//MODULE2
tokenState[MODULE1][SMALLO]=MODULE2;
//PROGRAM2
tokenState[PROGRAM1][SMALLR]=PROGRAM2;
//READ2
tokenState[READ1][SMALLE]=READ2;
//THEN2
tokenState[THEN1][SMALLH]=THEN2;
//VARIABLES2
tokenState[VARIABLES1][SMALLA]=VARIABLES2;
//WRITE2
tokenState[WRITE1][SMALLR]=WRITE2;
/* RESERVED WORDS
ARRAY3 */
tokenState[ARRAY2][SMALLR]=ARRAY3;
//BEGIN2
tokenState[BEGIN2][SMALLG]=BEGIN3;
//CHAR2
tokenState[CHAR2][SMALLA]=CHAR3;
//ELSE2,
tokenState[ELSE2][SMALLS]=ELSE3;
//END2
tokenState[END2][SMALLD]=END3;
//EXIT2
tokenState[EXIT2][SMALLI]=EXIT3;
//INTEGER2
tokenState[INTEGER2][SMALLT]=INTEGER3;
//LOOP2
tokenState[LOOP2][SMALLO]=LOOP3;
//MODULE2
tokenState[MODULE2][SMALLD]=MODULE3;
//PROGRAM2
tokenState[PROGRAM2][SMALLO]=PROGRAM3;
//READ2
tokenState[READ2][SMALLA]=READ3;
//THEN2
tokenState[THEN2][SMALLE]=THEN3;
//VARIABLES2
tokenState[VARIABLES2][SMALLR]=VARIABLES3;
//WRITE2
tokenState[WRITE2][SMALLI]=WRITE3;
/* RESERVED WORDS
ARRAY3 */
tokenState[ARRAY3][SMALLA]=ARRAY4;
//BEGIN2
tokenState[BEGIN3][SMALLI]=BEGIN4;
//CHAR2
tokenState[CHAR3][SMALLR]=CHAR4;
//ELSE2,
tokenState[ELSE3][SMALLE]=ELSE4;
//EXIT2
tokenState[EXIT3][SMALLT]=EXIT4;
//INTEGER2
tokenState[INTEGER3][SMALLE]=INTEGER4;
//LOOP2
tokenState[LOOP3][SMALLP]=LOOP4;
//MODULE2
tokenState[MODULE3][SMALLU]=MODULE4;
//PROGRAM2
tokenState[PROGRAM3][SMALLG]=PROGRAM4;
//READ2
tokenState[READ3][SMALLD]=READ4;
//THEN2
tokenState[THEN3][SMALLN]=THEN4;
//VARIABLES2
tokenState[VARIABLES3][SMALLI]=VARIABLES4;
//WRITE2
tokenState[WRITE3][SMALLT]=WRITE4;
/* RESERVED WORDS
ARRAY */
tokenState[ARRAY4][SMALLY]=ARRAY5;
//BEGIN2
tokenState[BEGIN4][SMALLN]=BEGIN5;
//INTEGER2
tokenState[INTEGER4][SMALLG]=INTEGER5;
//MODULE2
tokenState[MODULE4][SMALLL]=MODULE5;
//PROGRAM2
tokenState[PROGRAM4][SMALLR]=PROGRAM5;
//VARIABLES2
tokenState[VARIABLES4][SMALLA]=VARIABLES5;
//WRITE2
tokenState[WRITE4][SMALLE]=WRITE5;
// RESERVED WORDS*/
//INTEGER2
tokenState[INTEGER5][SMALLE]=INTEGER6;
//MODULE2
tokenState[MODULE5][SMALLE]=MODULE6;
//PROGRAM2
tokenState[PROGRAM5][SMALLA]=PROGRAM6;
//VARIABLES2
tokenState[VARIABLES5][SMALLB]=VARIABLES6;
// RESERVED WORDS*/
//INTEGER2
tokenState[INTEGER6][SMALLR]=INTEGER7;
//PROGRAM2
tokenState[PROGRAM6][SMALLM]=PROGRAM7;
//VARIABLES2
tokenState[VARIABLES6][SMALLL]=VARIABLES7;
// RESERVED WORDS*/
//VARIABLES2
tokenState[VARIABLES7][SMALLE]=VARIABLES8;
//VARIABLES2
tokenState[VARIABLES8][SMALLS]=VARIABLES9;
/*
CONSTCHAR, UNDERSCOREBEGIN, ASSIGNMENTBEGIN, SMALLEREQUALBEGIN,
GREATEREQUALBEGIN, NOTEQUAL, COMMENTBEGIN, IDUNDERSCORE,*/
//now is the symbols
//QUOTEBEGIN, OPENPARBEGIN, CLOSEPARBEGIN, SEMICOLONBEGIN,
//PLUSBEGIN, MINUSBEGIN, TIMESBEGIN, SLASHBEGIN, COLONBEGIN, SMALLERBEGIN, GREATERBEGIN,
//EQUALBEGIN, EXCLAIMBEGIN, OPENBRACKETBEGIN, CLOSEBRACKETBEGIN, COMMABEGIN,
//'
tokenState[READY][QUOTE]=QUOTEBEGIN;
//(
tokenState[READY][OPENPAR]=OPENPARBEGIN;
//)
tokenState[READY][CLOSEPAR]=CLOSEPARBEGIN;
//;
tokenState[READY][SEMICOLON]=SEMICOLONBEGIN;
//+
tokenState[READY][PLUS]=PLUSBEGIN;
//-
tokenState[READY][MINUS]=MINUSBEGIN;
//*
tokenState[READY][TIMES]=TIMESBEGIN;
///
tokenState[READY][SLASH]=SLASHBEGIN;
//:
tokenState[READY][COLON]=COLONBEGIN;
//<
tokenState[READY][SMALLER]=SMALLERBEGIN;
//>
tokenState[READY][GREATER]=GREATERBEGIN;
//=
tokenState[READY][EQUAL]=EQUALBEGIN;
//!
tokenState[READY][EXCLAIM]=EXCLAIMBEGIN;
//[
tokenState[READY][OPENBRACKET]=OPENBRACKETBEGIN;
//]
tokenState[READY][CLOSEBRACKET]=CLOSEBRACKETBEGIN;
//,
tokenState[READY][COMMA]=COMMABEGIN;
//AFTER QUOTE IT CAN BE ANY CHARACTER, INCLUDING ILLIGAL CHAR
setRange(QUOTEBEGIN, SMALLA, ILLIGAL, CONSTCHARBEGIN);
//ANY OTHER STATE IS BY DEFAULT ERROR
tokenState[CONSTCHARBEGIN][QUOTE]=CONSTCHARQUOTEBEGIN;
//FOR /, DEFAULT IS SLASHEND, EXCEPT * WHICH IS COMMENTSTARBEGIN
tokenState[SLASHBEGIN][TIMES]= COMMENTSTARBEGIN;
//FOR :, DEFAULT IS COLONEND, EXCEPT FOR = WHICH IS ASSIGNMENTBEGIN
tokenState[COLONBEGIN][EQUAL]= ASSIGNMENTBEGIN;
//FOR <, DEFAULT IS SMALLEREND, EXCEPT FOR= WHICH IS SMALLEREQAULBEGIN
tokenState[SMALLERBEGIN][EQUAL]=SMALLEREQUALBEGIN;
//FOR >, DEFAULT IS GREATEREND, EXCEPT FOR= WHICH IS GREATEREQAULBEGIN
tokenState[GREATERBEGIN][EQUAL]= GREATEREQUALBEGIN;
tokenState[EXCLAIMBEGIN][EQUAL]= NOTEQUALBEGIN;
//WITHIN COMMENT IT IS A LOOP, EXCEPT FOR * WHICH IS POSSIBLE FOR END OF COMMENT
setRange(COMMENTSTARBEGIN, SMALLA, ILLIGAL, COMMENTBEGIN);
tokenState[COMMENTSTARBEGIN][TIMES]=COMMENTSTAREND;
setRange(COMMENTBEGIN, SMALLA, ILLIGAL, COMMENTBEGIN);
tokenState[COMMENTBEGIN][TIMES]=COMMENTSTAREND;
//FROM COMMENTSTARBEGIN, ALL IS BACK TO COMMENTBEGIN, EXCEPT / WHICH IS END OF COMMENT
setRange(COMMENTSTAREND, SMALLA, ILLIGAL, COMMENTBEGIN);
tokenState[COMMENTSTAREND][SLASH]=COMMENTSLASHBEGIN;
//
}
void initialReserved(TokenState state)
{
setRange(state, SMALLA, DIGIT, IDBEGIN);
finalReservedToken(state, IDEND);
tokenState[state][UNDERSCORE]=IDUNDERSCORE;//a_
}
void finalSymbolToken(TokenState state, TokenState endState)
{
for (int i=SMALLA; i<=WHITESPACE; i++)
{
tokenState[state][(CharType)i]=endState;
}
}
void finalReservedToken(TokenState state, TokenState endState)
{
//all non-letter, non-digit is regarded to be delimeter
for (int i=QUOTE; i<=WHITESPACE; i++)
{
tokenState[state][(CharType)i]=endState;
}
}
//the default charType is ILLIGAL
void initialCharType()
{
int chType;
//the default charType is ILLIGAL
for (int i=0; i<256; i++)
{
charType[i]=ILLIGAL;
}
//chType is SMALLA
chType=SMALLA;
for (i='a'; i<='z'; i++)
{
charType[i]=(CharType)(chType);
chType++;
}
//chType is now BIGA
chType=BIGA;//I don't want to rely on the trick.
for (i='A'; i<='Z'; i++)
{
charType[i]=(CharType)(chType);
chType++;
}
chType=DIGIT;
for (i='0'; i<='9'; i++)
{
charType[i]=(CharType)(chType);
}
/*
UNDERSCORE, QUOTE, OPENPAR, CLOSEPAR, SEMICOLON,PLUS, MINUS, TIMES, SLASH, COLON,
EQUAL,SMALLER,GREATER,EXCLAIM,OPENBRACKET, CLOSEBRACKET,COMMA,
SPACE,TAB, ENDLINE, ILLIGAL
*/
charType['_']=UNDERSCORE;
charType['\'']=QUOTE;
charType['(']=OPENPAR;
charType[')']=CLOSEPAR;
charType[';']=SEMICOLON;
charType['+']=PLUS;
charType['-']=MINUS;
charType['*']=TIMES;
charType['/']=SLASH;
charType[':']=COLON;
charType['=']=EQUAL;
charType['<']=SMALLER;
charType['>']=GREATER;
charType['!']=EXCLAIM;
charType['[']=OPENBRACKET;
charType[']']=CLOSEBRACKET;
charType[',']=COMMA;
charType[' ']=WHITESPACE;
charType['\t']=WHITESPACE;
charType[10]=WHITESPACE;
charType[13]=WHITESPACE;
//pls note, since I changed the type of "ch" to be "unsigned char"
//the EOF now is not -1, but 255
charType[255]=WHITESPACE;//IT IS A KIND OF DELIMETER
}
void setRange(TokenState state, CharType start, CharType end, TokenState target)
{
for (int i=start; i<=end; i++)
{
tokenState[state][i]=target;
}
}
void setState(TokenState state, TokenState targetState)
{
for (int i=0; i<CharTypeCount; i++)
{
tokenState[state][i]=targetState;
}
}
file name: initialize.cpp
/*////////////////////////////////////////////////////////////////////////////
Program: SLang Scanner
Author: Qingzhe Huang
Date: Jan. 18, 2004
FileName: main.cpp
Features:
1. User can input source file for scanning by giving file name in command line.
If no name is given, default file name is used.
2. The output file for token is different from "listing file" which is set as
default parameter in class Scanner.
And error type will be skipped and only in the listing file
error message will be displayed.
*////////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <fstream>
#include "scanner.h"
#include "errorno.h"
using namespace std;
extern char* tokenTypeStr[];
char* defaultInputFile="c:\\scannerSource.txt";
//the default input file name is declared as above
//user can input designated input source file from command line
int main(int argc, char *argv[ ])
{
Scanner S;
ofstream f;
char* fileName=defaultInputFile;
if (argc==2)
{
fileName=argv[1];
}
if (argc>2)
{
exit(1);
}
if (!S.readFromFile(fileName))
{
errorHandle(CannotOpenFile);
}
//this is the file for token type.
f.open("c:\\nickType.txt");
while (S.nextToken())
{
if (S.token.type==ERRORTYPE)
{
continue;
}
f<<"\nthe type of token is:"<<tokenTypeStr[S.token.type]<<endl;
if (S.token.type==NUMBERTYPE)
{
f<<S.token.number<<endl;
}
else
{
if (S.token.type==IDTYPE)
{
f<<S.token.name<<endl;
}
}
}
S.report();
return 0;
}
Here is the result: The input file is "c:\scannerSource.txt".
Sorry I don't give you the text file, you can even use the source code of program itself, except you have
to remove some illegal symbol, like ", #, {, }. etc.