1 \documentclass{article}
4 \def\pgfsysdriver{dvisvgm4ht/pgfsys-dvisvgm4ht.def}
7 \usepackage[square,numbers]{natbib}
8 \bibliographystyle{unsrtnat}
13 \usetikzlibrary{chains,matrix,graphs,scopes,decorations.shapes,arrows,shapes}
25 bottom color=red!50!black!20,
32 very thick,draw=black!50,
33 top color=white,bottom color=black!20,
41 top color=white,bottom color=white},
42 skip loop/.style={to path={-- ++(0,#1) -| (\tikztotarget)}}
45 \tikzset{terminal/.append style={text height=1.5ex,text depth=.25ex}}
46 \tikzset{nonterminal/.append style={text height=1.5ex,text depth=.25ex}}
47 \tikzset{instruction/.append style={text height=1.5ex,text depth=.25ex}}
52 \title{T7 compiler manual}
53 \author{Nicolas Mauger}
62 T7 is a project started in August 2018 originally written by Nicolas Mauger.
63 It's a step-by-step approach to compiler development mostly based on the books
64 \textit{Compiler Design In C \cite{CompilerDesignInC}} and \textit{Systematic Software
65 Development Using VDM \cite{VDMdevelopment}}. \\
66 All documents in this project are subject to the CeCILL license. You should have receive a
67 copy of the CeCILL free software license agreement version 2.1 along with this program and
68 documents; if not, please visit
69 \href{http://www.cecill.info/licences/Licence_CeCILL_V2.1-en.txt}{www.cecill.info} or send
70 an email at \href{mailto:nicolas@mauger.cafe}{nicolas@mauger.cafe}.
75 \section{Basic concepts and naive implementation}
76 Our very first compiler is split in three parts : a lexical analyzer (see \texttt{lex.*}
77 files), a parser (see \texttt{parse.*} files) and a code generator.
79 \subsection{Short lexical analyzer}
80 The lexical analyzer will simply \textit{tokenize} all the sources and translates all
81 the lexemes\footnote{A lexeme is a sequence of characters in the source program that
82 matches the pattern for a token and is identified by the lexical analyzer as an instance
83 of that token.} into a simple computable representation. Their representations will be
86 \subsection{Our very first parser}
87 We start by parsing simple arithmetical operation. First we decompose statements in four
88 simple syntax who can be found in any basic programmation langague who accept
89 arethmetical operation. Moreover, expressions, terms and actors can be used recursively:
90 expression contains terms which contain factors which contain expressions...etc. At this
91 step we can already catch error if the operation does not respect theirs following
94 \begin{figure}[!ht] % Statment syntax
95 \label{fig:statmentSyntax}
97 \begin{tikzpicture}[point/.style={coordinate},>=stealth',thick,draw=black!50,
98 tip/.style={->,shorten >=0.007pt},every join/.style={rounded corners},
99 hv path/.style={to path={-| (\tikztotarget)}},
100 vh path/.style={to path={|- (\tikztotarget)}},
101 text height=1.5ex,text depth=.25ex]
102 \matrix[ampersand replacement=\&,column sep=4mm] {
103 \node (statment) [instruction] {\textbf{statement}}; \&
104 \node (p1) [point] {}; \& \node (p1a) [point] {}; \&
105 \node (p2) [point] {}; \& \node (expr1) [nonterminal] {expression}; \&
106 \node (p3) [point] {}; \& \node (dot) [terminal] {;}; \&
107 \node (p4) [point] {}; \& \node (p4a) [point] {}; \&
108 \node (p5) [point] {}; \& \node (p5a) [point] {}; \\
112 \chainin (statment)[];
114 \chainin (p1a) [join,join=with p4a by {skip loop=-11mm,tip}];
115 \chainin (p2) [join];
116 \chainin (expr1) [join=by tip];
117 \chainin (p3) [join];
118 \chainin (dot) [join=by tip];
119 \chainin (p4) [join];
120 \chainin (p4a) [join];
121 \chainin (p5) [join];
122 \chainin (p5a) [join=by tip];
127 \begin{figure}[!ht] % Expression syntax
128 \label{fig:expressionSyntax}
130 \begin{tikzpicture}[point/.style={coordinate},>=stealth',thick,draw=black!50,
131 tip/.style={->,shorten >=0.007pt},every join/.style={rounded corners},
132 hv path/.style={to path={-| (\tikztotarget)}},
133 vh path/.style={to path={|- (\tikztotarget)}},
134 text height=1.5ex,text depth=.25ex]
135 \matrix[ampersand replacement=\&,column sep=4mm] {
136 \node (expression) [instruction] {\textbf{expression}}; \&
137 \node (p1) [point] {}; \& \node (p1a) [point] {}; \&
138 \node (p2) [point] {}; \& \node (term1)[nonterminal] {term}; \&
139 \node (p3) [point] {}; \& \node (plus) [terminal] {$+$}; \&
140 \node (p4) [point] {}; \& \node (term2)[nonterminal] {term}; \&
141 \node (p5) [point] {}; \& \node (p5a) [point] {}; \\
145 \chainin (expression)[];
147 \chainin (p1a) [join];
148 \chainin (p2) [join];
149 \chainin (term1) [join=by tip];
150 \chainin (p3) [join,join=with p5 by {skip loop=-11mm,tip}];
151 \chainin (plus) [join=by tip];
152 \chainin (p4) [join];
153 \chainin (term2) [join=by tip];
154 \chainin (p5) [join];
155 \chainin (p5a) [join=by tip];
159 \begin{figure}[!ht] % Term syntax
160 \label{fig:TermSyntax}
162 \begin{tikzpicture}[point/.style={coordinate},>=stealth',thick,draw=black!50,
163 tip/.style={->,shorten >=0.007pt},every join/.style={rounded corners},
164 hv path/.style={to path={-| (\tikztotarget)}},
165 vh path/.style={to path={|- (\tikztotarget)}},
166 text height=1.5ex,text depth=.25ex]
167 \matrix[ampersand replacement=\&,column sep=4mm] {
168 \node (expression) [instruction] {\textbf{term}}; \&
169 \node (p1) [point] {}; \& \node (p1a) [point] {}; \&
170 \node (p2) [point] {}; \& \node (factor1)[nonterminal] {factor}; \&
171 \node (p3) [point] {}; \& \node (times) [terminal] {$*$}; \&
172 \node (p4) [point] {}; \& \node (factor2)[nonterminal] {factor}; \&
173 \node (p5) [point] {}; \& \node (p5a) [point] {}; \&
174 \node (p6) [point] {}; \& \node (p6a) [point] {}; \\
178 \chainin (expression)[];
180 \chainin (p1a) [join];
181 \chainin (p2) [join];
182 \chainin (factor1) [join=by tip];
183 \chainin (p3) [join,join=with p5a by {skip loop=-11mm,tip}];
184 \chainin (times) [join=by tip];
185 \chainin (p4) [join];
186 \chainin (factor2) [join=by tip];
187 \chainin (p5) [join];
188 \chainin (p5a) [join];
189 \chainin (p6) [join=by tip];
193 \begin{figure}[!ht] % Factor syntax
194 \label{fig:FactorSyntax}
196 \begin{tikzpicture}[point/.style={coordinate},>=stealth',thick,draw=black!50,
197 tip/.style={->,shorten >=0.01pt},every join/.style={rounded corners},
198 hv path/.style={to path={-| (\tikztotarget)}},
199 vh path/.style={to path={|- (\tikztotarget)}},
200 text height=1.5ex,text depth=.25ex]
202 \matrix[ampersand replacement=\&,column sep=4mm] {
204 \node (factor) [instruction] {\textbf{factor}}; \&
205 \node (p1) [point] {}; \& \node (p1a) [point] {}; \&
206 \node (p2) [point] {}; \& \node (number) [terminal] {number}; \&
207 \node (p3) [point] {}; \& \node (p3a) [point] {}; \\
211 \& \& \& \node (LP) [terminal]{$($}; \&
212 \node (ex) [nonterminal]{expression}; \& \node (RP) [terminal]{$)$}; \\
218 \chainin (p1a) [join];
219 \chainin (p2) [join];
221 \chainin (LP) [join=by {vh path,tip}];
222 \chainin (ex) [join=by tip];
223 \chainin (RP) [join=by tip];
224 \chainin (p3) [join=by {hv path,tip}];
226 \chainin (number)[join=by tip];
227 \chainin (p3) [join];
228 \chainin (p3a) [join=by tip];
234 \subsection{The code generator}
235 The parse tree created in the previous step can be simply used by a different program to
236 generate code. But most of the modern compiler, the code generation is a part of parse
237 process. Indeed, the parse tree created can be used to create an intermediate language,
238 like a super-assembly, to perform specific tasks like optimisation, compression and
239 platform specific binary generation. This is this intermediate language method how will
242 \section{First improvments}
243 \subsection{Improving the Parser}
244 The first improvement constitutes of changing recursive expressions to loops to decrease
245 the size of subroutines stack. The second improvement is the merge of our four baci
246 syntaxes: statements, expressions, terms, factors and their subroutines.
248 \subsection{Temporary variables in code generation}
249 In the real world, a compiler use registers or the run-time stack for temporaries. For
250 the moments we just use global variables: the expression is evaluated operator by
251 operator, with each temporary holding the result of evaluating the current subexpression
252 . If the number of subexpressions exceeds the number of temporaries, an error is raised.
254 \subsection{Generation using subroutine arguments}
255 The subroutine that reads the factor and calls procedure to get an anonymous temporary
256 name, generates the code to copy the input number into the temporary, and then returns
257 the name to its parent. Similarly, the best place to generate multiplication
258 instructions is the place where the times signs are read, in the
259 term~\ref{fig:TermSyntax} operation. After the factor~\ref{fig:FactorSyntax} calls,
260 two globals are called to join the names of the two temporaries. Code is generated to do
261 the multiplication, one of the temporaries is freed, and the other (which holds the
262 result of the multiply) is passed backup to the parent. So, this temporary, the one that
263 holds the result of the subexpression evaluation, is used in the next step. Addition is
264 handled the same way in expression~\ref{fig:expressionSyntax}.