- The LLVM Target-Independent Code Generator
- Introduction
- Target description classes
- Machine code description classes
- The “MC” Layer
- Target-independent code generation algorithms
- Instruction Selection
- Introduction to SelectionDAGs
- SelectionDAG Instruction Selection Process
- Initial SelectionDAG Construction
- SelectionDAG LegalizeTypes Phase
- SelectionDAG Legalize Phase
- SelectionDAG Optimization Phase: the DAG Combiner
- SelectionDAG Select Phase
- SelectionDAG Scheduling and Formation Phase
- Future directions for the SelectionDAG
- SSA-based Machine Code Optimizations
- Live Intervals
- Register Allocation
- Prolog/Epilog Code Insertion
- Late Machine Code Optimizations
- Code Emission
- VLIW Packetizer
- Implementing a Native Assembler
- Target-specific Implementation Notes
The LLVM Target-Independent Code Generator
Warning
This is a work in progress.
Introduction
The LLVM target-independent code generator is a framework that provides a suite of reusable components for translating the LLVM internal representation to the machine code for a specified target—either in assembly form (suitable for a static compiler) or in binary machine code format (usable for a JIT compiler). The LLVM target-independent code generator consists of six main components:

- Abstract target description interfaces which capture important properties about various aspects of the machine, independently of how they will be used. These interfaces are defined in include/llvm/Target/.
- Classes used to represent the code being generated for a target. These classes are intended to be abstract enough to represent the machine code for any target machine. These classes are defined in include/llvm/CodeGen/. At this level, concepts like “constant pool entries” and “jump tables” are explicitly exposed.
- Classes and algorithms used to represent code at the object file level, the MC Layer. These classes represent assembly level constructs like labels, sections, and instructions. At this level, concepts like “constant pool entries” and “jump tables” don’t exist.
- Target-independent algorithms used to implement various phases of native code generation (register allocation, scheduling, stack frame representation, etc). This code lives in lib/CodeGen/.
- Implementations of the abstract target description interfaces for particular targets. These machine descriptions make use of the components provided by LLVM, and can optionally provide custom target-specific passes, to build complete code generators for a specific target. Target descriptions live in lib/Target/.
- The target-independent JIT components. The LLVM JIT is completely target-independent (it uses the TargetJITInfo structure to interface for target-specific issues). The code for the target-independent JIT lives in lib/ExecutionEngine/JIT.

Depending on which part of the code generator you are interested in working on, different pieces of this will be useful to you. In any case, you should be familiar with the target description and machine code representation classes. If you want to add a backend for a new target, you will need to implement the target description classes for your new target and understand the LLVM code representation. If you are interested in implementing a new code generation algorithm, it should only depend on the target-description and machine code representation classes, ensuring that it is portable.
Required components in the code generator
The two pieces of the LLVM code generator are the high-level interface to the code generator and the set of reusable components that can be used to build target-specific backends. The two most important interfaces (TargetMachine and DataLayout) are the only ones that are required to be defined for a backend to fit into the LLVM system, but the others must be defined if the reusable code generator components are going to be used.
This design has two important implications. The first is that LLVM can support completely non-traditional code generation targets. For example, the C backend does not require register allocation, instruction selection, or any of the other standard components provided by the system. As such, it only implements these two interfaces, and does its own thing. Note that the C backend was removed from trunk as of the LLVM 3.1 release. Another example of a code generator like this is a (purely hypothetical) backend that converts LLVM to the GCC RTL form and uses GCC to emit machine code for a target.
This design also implies that it is possible to design and implement radically different code generators in the LLVM system that do not make use of any of the built-in components. Doing so is not recommended at all, but could be required for radically different targets that do not fit into the LLVM machine description model: FPGAs for example.
The high-level design of the code generator
The LLVM target-independent code generator is designed to support efficient, high-quality code generation for standard register-based microprocessors. Code generation in this model is divided into the following stages:
- Instruction Selection — This phase determines an efficient way to express the input LLVM code in the target instruction set. This stage produces the initial code for the program in the target instruction set, then makes use of virtual registers in SSA form and physical registers that represent any required register assignments due to target constraints or calling conventions. This step turns the LLVM code into a DAG of target instructions.
- Scheduling and Formation — This phase takes the DAG of target instructions produced by the instruction selection phase, determines an ordering of the instructions, then emits the instructions as MachineInstrs with that ordering. Note that we describe this in the instruction selection section because it operates on a SelectionDAG.
- SSA-based Machine Code Optimizations — This optional stage consists of a series of machine-code optimizations that operate on the SSA form produced by the instruction selector. Optimizations like modulo-scheduling or peephole optimization work here.
- Register Allocation — The target code is transformed from an infinite virtual register file in SSA form to the concrete register file used by the target. This phase introduces spill code and eliminates all virtual register references from the program.
- Prolog/Epilog Code Insertion — Once the machine code has been generated for the function and the amount of stack space required is known (used for LLVM alloca’s and spill slots), the prolog and epilog code for the function can be inserted and “abstract stack location references” can be eliminated. This stage is responsible for implementing optimizations like frame-pointer elimination and stack packing.
- Late Machine Code Optimizations — Optimizations that operate on “final” machine code can go here, such as spill code scheduling and peephole optimizations.
- Code Emission — The final stage actually puts out the code for the current function, either in the target assembler format or in machine code.

The code generator is based on the assumption that the instruction selector will use an optimal pattern matching selector to create high-quality sequences of native instructions. Alternative code generator designs based on pattern expansion and aggressive iterative peephole optimization are much slower. This design permits efficient compilation (important for JIT environments) and aggressive optimization (used when generating code offline) by allowing components of varying levels of sophistication to be used for any step of compilation.
In addition to these stages, target implementations can insert arbitrary target-specific passes into the flow. For example, the X86 target uses a special pass to handle the 80x87 floating point stack architecture. Other targets with unusual requirements can be supported with custom passes as needed.
Using TableGen for target description
The target description classes require a detailed description of the target architecture. These target descriptions often have a large amount of common information (e.g., an add instruction is almost identical to a sub instruction). In order to allow the maximum amount of commonality to be factored out, the LLVM code generator uses the TableGen tool to describe big chunks of the target machine, which allows the use of domain-specific and target-specific abstractions to reduce the amount of repetition.

As LLVM continues to be developed and refined, we plan to move more and more of the target description to the .td form. Doing so gives us a number of advantages. The most important is that it makes it easier to port LLVM because it reduces the amount of C++ code that has to be written, and the surface area of the code generator that needs to be understood before someone can get something working. Second, it makes it easier to change things. In particular, if tables and other things are all emitted by tblgen, we only need a change in one place (tblgen) to update all of the targets to a new interface.
Target description classes
The LLVM target description classes (located in the include/llvm/Target directory) provide an abstract description of the target machine, independent of any particular client. These classes are designed to capture the abstract properties of the target (such as the instructions and registers it has), and do not incorporate any particular pieces of code generation algorithms.

All of the target description classes (except the DataLayout class) are designed to be subclassed by the concrete target implementation, and have virtual methods implemented. To get to these implementations, the TargetMachine class provides accessors that should be implemented by the target.
The TargetMachine class
The TargetMachine class provides virtual methods that are used to access the target-specific implementations of the various target description classes via the get*Info methods (getInstrInfo, getRegisterInfo, getFrameInfo, etc.). This class is designed to be specialized by a concrete target implementation (e.g., X86TargetMachine) which implements the various virtual methods. The only required target description class is the DataLayout class, but if the code generator components are to be used, the other interfaces should be implemented as well.
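As a concrete illustration, here is a minimal sketch of how a machine-function pass might reach these interfaces. In recent LLVM releases the get*Info accessors are reached through the subtarget rather than through TargetMachine itself, so treat the exact access path as version-dependent:

    // Sketch: fetching target description interfaces for the function being
    // compiled. In recent LLVM the accessors live on TargetSubtargetInfo;
    // older releases exposed them directly on TargetMachine as described above.
    #include "llvm/CodeGen/MachineFunction.h"
    #include "llvm/CodeGen/TargetInstrInfo.h"
    #include "llvm/CodeGen/TargetSubtargetInfo.h"

    static void useTargetInfo(llvm::MachineFunction &MF) {
      const llvm::TargetSubtargetInfo &ST = MF.getSubtarget();
      const llvm::TargetInstrInfo *TII = ST.getInstrInfo();
      const llvm::TargetRegisterInfo *TRI = ST.getRegisterInfo();
      (void)TII; (void)TRI; // a real pass would query these interfaces
    }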
The DataLayout class
The DataLayout class is the only required target description class, and it is the only class that is not extensible (you cannot derive a new class from it). DataLayout specifies information about how the target lays out memory for structures, the alignment requirements for various data types, the size of pointers in the target, and whether the target is little-endian or big-endian.
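A small sketch of querying these properties; the layout string below is a made-up example (little-endian, 64-bit pointers), not any particular target’s description:

    // Sketch: DataLayout answers questions about target memory layout.
    #include "llvm/IR/DataLayout.h"

    void queryLayout() {
      llvm::DataLayout DL("e-p:64:64");          // example layout string
      bool Little = DL.isLittleEndian();         // true for this string
      unsigned PtrBytes = DL.getPointerSize();   // 8 bytes here
      (void)Little; (void)PtrBytes;
    }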
The TargetLowering class
The TargetLowering class is used by SelectionDAG based instruction selectors primarily to describe how LLVM code should be lowered to SelectionDAG operations. Among other things, this class indicates:

- an initial register class to use for various ValueTypes,
- which operations are natively supported by the target machine,
- the return type of setcc operations,
- the type to use for shift amounts, and
- various high-level characteristics, like whether it is profitable to turn division by a constant into a multiplication sequence.
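These properties are typically declared in the target’s TargetLowering constructor. The sketch below uses an invented target (the MyTarget class and register class names are hypothetical); addRegisterClass and setOperationAction are the real hooks, described further in the Legalize sections below:

    // Sketch of a hypothetical target's TargetLowering constructor.
    MyTargetLowering::MyTargetLowering(const TargetMachine &TM,
                                       const MyTargetSubtarget &STI)
        : TargetLowering(TM) {
      // Register class used for i32 values.
      addRegisterClass(MVT::i32, &MyTarget::GPR32RegClass);
      // This (hypothetical) target has no native i32 SREM: expand it.
      setOperationAction(ISD::SREM, MVT::i32, Expand);
      // setcc produces a zero-extended 0/1 value on this target.
      setBooleanContents(ZeroOrOneBooleanContent);
      // Must run after all register classes have been added.
      computeRegisterProperties(STI.getRegisterInfo());
    }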
The TargetRegisterInfo class
The TargetRegisterInfo class is used to describe the register file of the target and any interactions between the registers.

Registers are represented in the code generator by unsigned integers. Physical registers (those that actually exist in the target description) are unique small numbers, and virtual registers are generally large. Note that register #0 is reserved as a flag value.
Each register in the processor description has an associated TargetRegisterDesc entry, which provides a textual name for the register (used for assembly output and debugging dumps) and a set of aliases (used to indicate whether one register overlaps with another).
In addition to the per-register description, the TargetRegisterInfo class exposes a set of processor specific register classes (instances of the TargetRegisterClass class). Each register class contains sets of registers that have the same properties (for example, they are all 32-bit integer registers). Each SSA virtual register created by the instruction selector has an associated register class. When the register allocator runs, it replaces virtual registers with a physical register in the set.
The target-specific implementations of these classes are auto-generated from a TableGen description of the register file.
The TargetInstrInfo class
The TargetInstrInfo class is used to describe the machine instructions supported by the target. Descriptions define things like the mnemonic for the opcode, the number of operands, the list of implicit register uses and defs, whether the instruction has certain target-independent properties (accesses memory, is commutable, etc), and holds any target-specific flags.
The TargetFrameLowering class
The TargetFrameLowering class is used to provide information about the stack frame layout of the target. It holds the direction of stack growth, the known stack alignment on entry to each function, and the offset to the local area. The offset to the local area is the offset from the stack pointer on function entry to the first location where function data (local variables, spill locations) can be stored.
The TargetSubtarget class
The TargetSubtarget class is used to provide information about the specific chip set being targeted. A sub-target informs code generation of which instructions are supported, instruction latencies, and instruction execution itinerary; i.e., which processing units are used, in what order, and for how long.
The TargetJITInfo class
The TargetJITInfo class exposes an abstract interface used by the Just-In-Time code generator to perform target-specific activities, such as emitting stubs. If a TargetMachine supports JIT code generation, it should provide one of these objects through the getJITInfo method.
Machine code description classes
At the high-level, LLVM code is translated to a machine specific representation formed out of MachineFunction, MachineBasicBlock, and MachineInstr instances (defined in include/llvm/CodeGen). This representation is completely target agnostic, representing instructions in their most abstract form: an opcode and a series of operands. This representation is designed to support both an SSA representation for machine code, as well as a register allocated, non-SSA form.
The MachineInstr class
Target machine instructions are represented as instances of the MachineInstr class. This class is an extremely abstract way of representing machine instructions. In particular, it only keeps track of an opcode number and a set of operands.

The opcode number is a simple unsigned integer that only has meaning to a specific backend. All of the instructions for a target should be defined in the *InstrInfo.td file for the target. The opcode enum values are auto-generated from this description. The MachineInstr class does not have any information about how to interpret the instruction (i.e., what the semantics of the instruction are); for that you must refer to the TargetInstrInfo class.
The operands of a machine instruction can be of several different types: a register reference, a constant integer, a basic block reference, etc. In addition, a machine operand should be marked as a def or a use of the value (though only registers are allowed to be defs).
By convention, the LLVM code generator orders instruction operands so that all register definitions come before the register uses, even on architectures that are normally printed in other orders. For example, the SPARC add instruction “add %i1, %i2, %i3” adds the “%i1” and “%i2” registers and stores the result into the “%i3” register. In the LLVM code generator, the operands should be stored as “%i3, %i1, %i2”: with the destination first.

Keeping destination (definition) operands at the beginning of the operand list has several advantages. In particular, the debugging printer will print the instruction like this:

    %r3 = add %i1, %i2

Also, if the first operand is a def, it is easier to create instructions whose only def is the first operand.
Using the MachineInstrBuilder.h functions
Machine instructions are created by using the BuildMI functions, located in the include/llvm/CodeGen/MachineInstrBuilder.h file. The BuildMI functions make it easy to build arbitrary machine instructions. Usage of the BuildMI functions looks like this:
    // Create a 'DestReg = mov 42' (rendered in X86 assembly as 'mov DestReg, 42')
    // instruction and insert it at the end of the given MachineBasicBlock.
    const TargetInstrInfo &TII = ...
    MachineBasicBlock &MBB = ...
    DebugLoc DL;
    MachineInstr *MI = BuildMI(MBB, DL, TII.get(X86::MOV32ri), DestReg).addImm(42);

    // Create the same instr, but insert it before a specified iterator point.
    MachineBasicBlock::iterator MBBI = ...
    BuildMI(MBB, MBBI, DL, TII.get(X86::MOV32ri), DestReg).addImm(42);

    // Create a 'cmp Reg, 42' instruction, no destination reg.
    MI = BuildMI(MBB, DL, TII.get(X86::CMP32ri8)).addReg(Reg).addImm(42);

    // Create an 'sahf' instruction which takes no operands and stores nothing.
    MI = BuildMI(MBB, DL, TII.get(X86::SAHF));

    // Create a self-looping branch instruction.
    BuildMI(MBB, DL, TII.get(X86::JNE)).addMBB(&MBB);
If you need to add a definition operand (other than the optional destination register), you must explicitly mark it as such:

    MI.addReg(Reg, RegState::Define);
Fixed (preassigned) registers
One important issue that the code generator needs to be aware of is the presence of fixed registers. In particular, there are often places in the instruction stream where the register allocator must arrange for a particular value to be in a particular register. This can occur due to limitations of the instruction set (e.g., the X86 can only do a 32-bit divide with the EAX/EDX registers), or external factors like calling conventions. In any case, the instruction selector should emit code that copies a virtual register into or out of a physical register when needed.
For example, consider this simple LLVM example:
    define i32 @test(i32 %X, i32 %Y) {
      %Z = sdiv i32 %X, %Y
      ret i32 %Z
    }
The X86 instruction selector might produce this machine code for the div and ret:
    ;; Start of div
    %EAX = mov %reg1024             ;; Copy X (in reg1024) into EAX
    %reg1027 = sar %reg1024, 31
    %EDX = mov %reg1027             ;; Sign extend X into EDX
    idiv %reg1025                   ;; Divide by Y (in reg1025)
    %reg1026 = mov %EAX             ;; Read the result (Z) out of EAX

    ;; Start of ret
    %EAX = mov %reg1026             ;; 32-bit return value goes in EAX
    ret
By the end of code generation, the register allocator would coalesce the registers and delete the resultant identity moves, producing the following code:
    ;; X is in EAX, Y is in ECX
    mov %EAX, %EDX
    sar %EDX, 31
    idiv %ECX
    ret
This approach is extremely general (if it can handle the X86 architecture, it can handle anything!) and allows all of the target specific knowledge about the instruction stream to be isolated in the instruction selector. Note that physical registers should have a short lifetime for good code generation, and all physical registers are assumed dead on entry to and exit from basic blocks (before register allocation). Thus, if you need a value to be live across basic block boundaries, it must live in a virtual register.
Call-clobbered registers
Some machine instructions, like calls, clobber a large number of physical registers. Rather than adding <def,dead> operands for all of them, it is possible to use an MO_RegisterMask operand instead. The register mask operand holds a bit mask of preserved registers, and everything else is considered to be clobbered by the instruction.
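For instance, when building a call, a single register-mask operand can stand in for the whole clobber set. A sketch in the style of the BuildMI examples above (CallOpcode and Callee are placeholders):

    // Attach a register-mask operand instead of listing every clobber.
    // getCallPreservedMask returns the registers preserved by the given
    // calling convention; everything outside the mask is clobbered.
    MachineInstrBuilder MIB =
        BuildMI(MBB, DL, TII.get(CallOpcode)).addGlobalAddress(Callee);
    MIB.addRegMask(TRI->getCallPreservedMask(MF, CallingConv::C));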
Machine code in SSA form
MachineInstrs are initially selected in SSA form, and are maintained in SSA form until register allocation happens. For the most part, this is trivially simple since LLVM is already in SSA form; LLVM PHI nodes become machine code PHI nodes, and virtual registers are only allowed to have a single definition.

After register allocation, machine code is no longer in SSA form because there are no virtual registers left in the code.
The MachineBasicBlock class
The MachineBasicBlock class contains a list of machine instructions (MachineInstr instances). It roughly corresponds to the LLVM code input to the instruction selector, but there can be a one-to-many mapping (i.e. one LLVM basic block can map to multiple machine basic blocks). The MachineBasicBlock class has a “getBasicBlock” method, which returns the LLVM basic block that it comes from.
The MachineFunction class
The MachineFunction class contains a list of machine basic blocks (MachineBasicBlock instances). It corresponds one-to-one with the LLVM function input to the instruction selector. In addition to a list of basic blocks, the MachineFunction contains a MachineConstantPool, a MachineFrameInfo, a MachineFunctionInfo, and a MachineRegisterInfo. See include/llvm/CodeGen/MachineFunction.h for more information.
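Together these classes form a simple containment hierarchy that passes walk with ordinary range-based iteration; a minimal sketch:

    // Sketch: MachineFunction -> MachineBasicBlock -> MachineInstr.
    void walk(llvm::MachineFunction &MF) {
      for (llvm::MachineBasicBlock &MBB : MF)
        for (llvm::MachineInstr &MI : MBB)
          (void)MI; // inspect or transform each instruction here
    }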
MachineInstr Bundles
The LLVM code generator can model sequences of instructions as MachineInstr bundles. An MI bundle can model a VLIW group / pack which contains an arbitrary number of parallel instructions. It can also be used to model a sequential list of instructions (potentially with data dependencies) that cannot be legally separated (e.g. ARM Thumb2 IT blocks).

Conceptually, an MI bundle is an MI with a number of other MIs nested within it:
    --------------
    |   Bundle   | ---------
    --------------          \
           |           ----------------
           |           |      MI      |
           |           ----------------
           |                   |
           |           ----------------
           |           |      MI      |
           |           ----------------
           |                   |
           |           ----------------
           |           |      MI      |
           |           ----------------
           |
    --------------
    |   Bundle   | --------
    --------------         \
           |           ----------------
           |           |      MI      |
           |           ----------------
           |                   |
           |           ----------------
           |           |      MI      |
           |           ----------------
           |                   |
           |                  ...
           |
    --------------
    |   Bundle   | --------
    --------------         \
           |
          ...
MI bundle support does not change the physical representations of MachineBasicBlock and MachineInstr. All the MIs (including top level and nested ones) are stored as a sequential list of MIs. The “bundled” MIs are marked with the ‘InsideBundle’ flag. A top level MI with the special BUNDLE opcode is used to represent the start of a bundle. It’s legal to mix BUNDLE MIs with individual MIs that are neither inside bundles nor themselves bundles.
MachineInstr passes should operate on an MI bundle as a single unit. Member methods have been taught to correctly handle bundles and MIs inside bundles. The MachineBasicBlock iterator has been modified to skip over bundled MIs to enforce the bundle-as-a-single-unit concept. An alternative iterator, instr_iterator, has been added to MachineBasicBlock to allow passes to iterate over all of the MIs in a MachineBasicBlock, including those which are nested inside bundles. The top level BUNDLE instruction must have the correct set of register MachineOperands that represent the cumulative inputs and outputs of the bundled MIs.
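A sketch of the two iteration granularities this gives you:

    // Default iterator: one step per unbundled MI or per whole bundle.
    for (llvm::MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
         I != E; ++I) {
      // *I is either a standalone MI or a top-level BUNDLE.
    }

    // instr_iterator: one step per MI, including MIs nested inside bundles.
    for (llvm::MachineBasicBlock::instr_iterator I = MBB.instr_begin(),
                                                 E = MBB.instr_end();
         I != E; ++I) {
      // *I may be a BUNDLE, a bundled MI, or a standalone MI.
    }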
Packing / bundling of MachineInstrs for VLIW architectures should generally be done as part of the register allocation super-pass. More specifically, the pass which determines what MIs should be bundled together should be run after the code generator exits SSA form (i.e. after the two-address pass, PHI elimination, and copy coalescing). Such bundles should be finalized (i.e. adding BUNDLE MIs and input and output register MachineOperands) after virtual registers have been rewritten into physical registers. This eliminates the need to add virtual register operands to BUNDLE instructions, which would effectively double the virtual register def and use lists. Bundles may use virtual registers and be formed in SSA form, but this may not be appropriate for all use cases.
The “MC” Layer
The MC Layer is used to represent and process code at the raw machine code level, devoid of “high level” information like “constant pools”, “jump tables”, “global variables” or anything like that. At this level, LLVM handles things like label names, machine instructions, and sections in the object file. The code in this layer is used for a number of important purposes: the tail end of the code generator uses it to write a .s or .o file, and it is also used by the llvm-mc tool to implement standalone machine code assemblers and disassemblers.
This section describes some of the important classes. There are also a number of important subsystems that interact at this layer; they are described later in this manual.
The MCStreamer API
MCStreamer is best thought of as an assembler API. It is an abstract API which is implemented in different ways (e.g. to output a .s file, output an ELF .o file, etc.) but whose API corresponds directly to what you see in a .s file. MCStreamer has one method per directive, such as EmitLabel, EmitSymbolAttribute, SwitchSection, EmitValue (for .byte, .word), etc., which directly correspond to assembly level directives. It also has an EmitInstruction method, which is used to output an MCInst to the streamer.
This API is most important for two clients: the llvm-mc stand-alone assembler is effectively a parser that parses a line, then invokes a method on MCStreamer. In the code generator, the Code Emission phase of the code generator lowers higher-level LLVM IR and Machine* constructs down to the MC layer, emitting directives through MCStreamer.
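A sketch of what the directive-per-method correspondence looks like from a client’s perspective (method capitalization has changed across LLVM versions, e.g. EmitLabel became emitLabel; the spellings below follow this document, and the label name is invented):

    // Sketch: each MCStreamer call corresponds to one assembly directive.
    void emitByteAtLabel(llvm::MCStreamer &S, llvm::MCContext &Ctx) {
      llvm::MCSymbol *L = Ctx.getOrCreateSymbol("my_label");
      S.EmitLabel(L);          // my_label:
      S.EmitIntValue(4, 1);    // .byte 4
    }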
On the implementation side of MCStreamer, there are two major implementations: one for writing out a .s file (MCAsmStreamer), and one for writing out a .o file (MCObjectStreamer). MCAsmStreamer is a straightforward implementation that prints out a directive for each method (e.g. EmitValue -> .byte), but MCObjectStreamer implements a full assembler.
For target specific directives, the MCStreamer has a MCTargetStreamer instance. Each target that needs it defines a class that inherits from it and is a lot like MCStreamer itself: it has one method per directive and two classes that inherit from it, a target object streamer and a target asm streamer. The target asm streamer just prints the directive (emitFnStart -> .fnstart), and the object streamer implements the assembler logic for it.
To make LLVM use these classes, the target initialization must call TargetRegistry::RegisterAsmStreamer and TargetRegistry::RegisterMCObjectStreamer, passing callbacks that allocate the corresponding target streamer and pass it to createAsmStreamer or to the appropriate object streamer constructor.
The MCContext class
The MCContext class is the owner of a variety of uniqued data structures at the MC layer, including symbols, sections, etc. As such, this is the class that you interact with to create symbols and sections. This class cannot be subclassed.
The MCSymbol class
The MCSymbol class represents a symbol (aka label) in the assembly file. There are two interesting kinds of symbols: assembler temporary symbols, and normal symbols. Assembler temporary symbols are used and processed by the assembler but are discarded when the object file is produced. The distinction is usually represented by adding a prefix to the label, for example “L” labels are assembler temporary labels in MachO.

MCSymbols are created by MCContext and uniqued there. This means that MCSymbols can be compared for pointer equivalence to find out if they are the same symbol. Note that pointer inequality does not guarantee the labels will end up at different addresses, though. It’s perfectly legal to output something like this to the .s file:
    foo:
    bar:
      .byte 4
In this case, both the foo and bar symbols will have the same address.
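Because of the uniquing, requesting the same name twice from MCContext yields the same pointer; a small sketch (symbol name invented):

    // Sketch: MCContext uniques symbols, so pointer equality identifies them.
    llvm::MCSymbol *A = Ctx.getOrCreateSymbol("foo");
    llvm::MCSymbol *B = Ctx.getOrCreateSymbol("foo");
    assert(A == B && "same name, same uniqued MCSymbol");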
The MCSection class
The MCSection class represents an object-file specific section. It is subclassed by object file specific implementations (e.g. MCSectionMachO, MCSectionCOFF, MCSectionELF) and these are created and uniqued by MCContext. The MCStreamer has a notion of the current section, which can be changed with the SwitchToSection method (which corresponds to a “.section” directive in a .s file).
The MCInst class
The MCInst class is a target-independent representation of an instruction. It is a simple class (much more so than MachineInstr) that holds a target-specific opcode and a vector of MCOperands. MCOperand, in turn, is a simple discriminated union of three cases: 1) a simple immediate, 2) a target register ID, 3) a symbolic expression (e.g. “Lfoo-Lbar+42”) as an MCExpr.

MCInst is the common currency used to represent machine instructions at the MC layer. It is the type used by the instruction encoder, the instruction printer, and the type generated by the assembly parser and disassembler.
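Building one by hand is straightforward; a sketch (the factory spellings createReg/createImm follow recent LLVM; older releases used CreateReg/CreateImm):

    // Sketch: an MCInst is just an opcode plus MCOperands.
    llvm::MCInst Inst;
    Inst.setOpcode(X86::MOV32ri);                     // target opcode
    Inst.addOperand(llvm::MCOperand::createReg(X86::EAX));
    Inst.addOperand(llvm::MCOperand::createImm(42));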
Target-independent code generation algorithms
This section documents the phases described in the high-level design of the code generator. It explains how they work and some of the rationale behind their design.
Instruction Selection
Instruction Selection is the process of translating LLVM code presented to the code generator into target-specific machine instructions. There are several well-known ways to do this in the literature. LLVM uses a SelectionDAG based instruction selector.
Portions of the DAG instruction selector are generated from the target description (*.td) files. Our goal is for the entire instruction selector to be generated from these .td files, though currently there are still things that require custom C++ code.
Introduction to SelectionDAGs
The SelectionDAG provides an abstraction for code representation in a way that is amenable to instruction selection using automatic techniques (e.g. dynamic-programming based optimal pattern matching selectors). It is also well-suited to other phases of code generation; in particular, instruction scheduling (SelectionDAGs are very close to scheduling DAGs post-selection). Additionally, the SelectionDAG provides a host representation where a large variety of very-low-level (but target-independent) optimizations may be performed; ones which require extensive information about the instructions efficiently supported by the target.
The SelectionDAG is a Directed-Acyclic-Graph whose nodes are instances of the SDNode class. The primary payload of the SDNode is its operation code (Opcode), which indicates what operation the node performs, and the operands to the operation. The various operation node types are described at the top of the include/llvm/CodeGen/ISDOpcodes.h file.
Although most operations define a single value, each node in the graph may define multiple values. For example, a combined div/rem operation will define both the quotient and the remainder. Many other situations require multiple values as well. Each node also has some number of operands, which are edges to the node defining the used value. Because nodes may define multiple values, edges are represented by instances of the SDValue class, which is a <SDNode, unsigned> pair, indicating the node and result value being used, respectively. Each value produced by an SDNode has an associated MVT (Machine Value Type) indicating what the type of the value is.
SelectionDAGs contain two different kinds of values: those that represent data flow and those that represent control flow dependencies. Data values are simple edges with an integer or floating point value type. Control edges are represented as “chain” edges which are of type MVT::Other. These edges provide an ordering between nodes that have side effects (such as loads, stores, calls, returns, etc). All nodes that have side effects should take a token chain as input and produce a new one as output. By convention, token chain inputs are always operand #0, and chain results are always the last value produced by an operation. However, after instruction selection, the machine nodes have their chain after the instruction’s operands, and may be followed by glue nodes.
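For instance, a load created during lowering consumes the incoming chain and produces both its data result and a fresh chain; a sketch (DL, Addr, and the surrounding lowering context are assumed):

    // Sketch: chains order side-effecting nodes. A load yields two results:
    // result 0 is the loaded value, result 1 is the new token chain.
    SDValue Chain = DAG.getEntryNode();
    SDValue Load = DAG.getLoad(MVT::i32, DL, Chain, Addr,
                               MachinePointerInfo());
    SDValue Data = Load.getValue(0);      // the i32 value
    SDValue NewChain = Load.getValue(1);  // threads to later side effects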
A SelectionDAG has designated “Entry” and “Root” nodes. The Entry node is always a marker node with an Opcode of ISD::EntryToken. The Root node is the final side-effecting node in the token chain. For example, in a single basic block function it would be the return node.
One important concept for SelectionDAGs is the notion of a “legal” vs. “illegal” DAG. A legal DAG for a target is one that only uses supported operations and supported types. On a 32-bit PowerPC, for example, a DAG with a value of type i1, i8, i16, or i64 would be illegal, as would a DAG that uses a SREM or UREM operation. The legalize types and legalize operations phases are responsible for turning an illegal DAG into a legal DAG.
SelectionDAG Instruction Selection Process
SelectionDAG-based instruction selection consists of the following steps:
- Build initial DAG — This stage performs a simple translation from the input LLVM code to an illegal SelectionDAG.
- Optimize SelectionDAG — This stage performs simple optimizations on the SelectionDAG to simplify it, and recognize meta instructions (like rotates and div/rem pairs) for targets that support these meta operations. This makes the resultant code more efficient and the select instructions from DAG phase (below) simpler.
- Legalize SelectionDAG Types — This stage transforms SelectionDAG nodes to eliminate any types that are unsupported on the target.
- Optimize SelectionDAG — The SelectionDAG optimizer is run to clean up redundancies exposed by type legalization.
- Legalize SelectionDAG Ops — This stage transforms SelectionDAG nodes to eliminate any operations that are unsupported on the target.
- Optimize SelectionDAG — The SelectionDAG optimizer is run to eliminate inefficiencies introduced by operation legalization.
- Select instructions from DAG — Finally, the target instruction selector matches the DAG operations to target instructions. This process translates the target-independent input DAG into another DAG of target instructions.
- SelectionDAG Scheduling and Formation — The last phase assigns a linear order to the instructions in the target-instruction DAG and emits them into the MachineFunction being compiled. This step uses traditional prepass scheduling techniques.

After all of these steps are complete, the SelectionDAG is destroyed and the rest of the code generation passes are run.
One great way to visualize what is going on here is to take advantage of a few LLC command line options. The following options pop up a window displaying the SelectionDAG at specific times (if you only get errors printed to the console while using this, you probably need to configure your system to add support for it).
- -view-dag-combine1-dags displays the DAG after being built, before the first optimization pass.
- -view-legalize-dags displays the DAG before Legalization.
- -view-dag-combine2-dags displays the DAG before the second optimization pass.
- -view-isel-dags displays the DAG before the Select phase.
- -view-sched-dags displays the DAG before Scheduling.
The -view-sunit-dags option displays the Scheduler’s dependency graph. This graph is based on the final SelectionDAG, with nodes that must be scheduled together bundled into a single scheduling-unit node, and with immediate operands and other nodes that aren’t relevant for scheduling omitted.
The option -filter-view-dags allows you to select the name of the basic block that you are interested in visualizing, and filters all of the previous view-*-dags options.
Initial SelectionDAG Construction
The initial SelectionDAG is naïvely peephole expanded from the LLVM input by the SelectionDAGBuilder class. The intent of this pass is to expose as much low-level, target-specific detail to the SelectionDAG as possible. This pass is mostly hard-coded (e.g. an LLVM add turns into an SDNode add while a getelementptr is expanded into the obvious arithmetic). This pass requires target-specific hooks to lower calls, returns, varargs, etc. For these features, the TargetLowering interface is used.
SelectionDAG LegalizeTypes Phase
The Legalize phase is in charge of converting a DAG to only use the types that are natively supported by the target.

There are two main ways of converting values of unsupported scalar types to values of supported types: converting small types to larger types (“promoting”), and breaking up large integer types into smaller ones (“expanding”). For example, a target might require that all f32 values are promoted to f64 and that all i1/i8/i16 values are promoted to i32. The same target might require that all i64 values be expanded into pairs of i32 values. These changes can insert sign and zero extensions as needed to make sure that the final code has the same behavior as the input.
There are two main ways of converting values of unsupported vector types to values of supported types: splitting vector types, multiple times if necessary, until a legal type is found, and extending vector types by adding elements to the end to round them out to legal types (“widening”). If a vector gets split all the way down to single-element parts with no supported vector type being found, the elements are converted to scalars (“scalarizing”).
A target implementation tells the legalizer which types are supported (and which register class to use for them) by calling the addRegisterClass method in its TargetLowering constructor.
SelectionDAG Legalize Phase
The Legalize phase is in charge of converting a DAG to only use the operations that are natively supported by the target.

Targets often have weird constraints, such as not supporting every operation on every supported datatype (e.g. X86 does not support byte conditional moves and PowerPC does not support sign-extending loads from a 16-bit memory location). Legalize takes care of this by open-coding another sequence of operations to emulate the operation (“expansion”), by promoting one type to a larger type that supports the operation (“promotion”), or by using a target-specific hook to implement the legalization (“custom”).
A target implementation tells the legalizer which operations are not supported (and which of the above three actions to take) by calling the setOperationAction method in its TargetLowering constructor.
If a target has legal vector types, it is expected to produce efficient machine code for common forms of the shufflevector IR instruction using those types. This may require custom legalization for SelectionDAG vector operations that are created from the shufflevector IR. The shufflevector forms that should be handled include:
- Vector select — Each element of the vector is chosen from either of the corresponding elements of the 2 input vectors. This operation may also be known as a “blend” or “bitwise select” in target assembly. This type of shuffle maps directly to the shuffle_vector SelectionDAG node.
- Insert subvector — A vector is placed into a longer vector type starting at index 0. This type of shuffle maps directly to the insert_subvector SelectionDAG node with the index operand set to 0.
- Extract subvector — A vector is pulled from a longer vector type starting at index 0. This type of shuffle maps directly to the extract_subvector SelectionDAG node with the index operand set to 0.
- Splat — All elements of the vector have identical scalar elements. This operation may also be known as a “broadcast” or “duplicate” in target assembly. The shufflevector IR instruction may change the vector length, so this operation may map to multiple SelectionDAG nodes including shuffle_vector, concat_vectors, insert_subvector, and extract_subvector.
Prior to the existence of the Legalize passes, we required that every target selector supported and handled every operator and type even if they are not natively supported. The introduction of the Legalize phases allows all of the canonicalization patterns to be shared across targets, and makes it very easy to optimize the canonicalized code because it is still in the form of a DAG.
SelectionDAG Optimization Phase: the DAG Combiner
The SelectionDAG optimization phase is run multiple times for code generation, immediately after the DAG is built and once after each legalization. The first run of the pass allows the initial code to be cleaned up (e.g. performing optimizations that depend on knowing that the operators have restricted type inputs). Subsequent runs of the pass clean up the messy code generated by the Legalize passes, which allows Legalize to be very simple (it can focus on making code legal instead of focusing on generating good and legal code).
One important class of optimizations performed is optimizing inserted sign and zero extension instructions. We currently use ad-hoc techniques, but could move to more rigorous techniques in the future. Here are some good papers on the subject:
“Widening integer arithmetic”, Kevin Redwine and Norman Ramsey, International Conference on Compiler Construction (CC), 2004.

“Effective sign extension elimination”, Motohiro Kawahito, Hideaki Komatsu, and Toshio Nakatani, Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation.
SelectionDAG Select Phase
The Select phase is the bulk of the target-specific code for instruction selection. This phase takes a legal SelectionDAG as input, pattern matches the instructions supported by the target to this DAG, and produces a new DAG of target code. For example, consider the following LLVM fragment:
    %t1 = fadd float %W, %X
    %t2 = fmul float %t1, %Y
    %t3 = fadd float %t2, %Z
This LLVM code corresponds to a SelectionDAG that looks basically like this:
    (fadd:f32 (fmul:f32 (fadd:f32 W, X), Y), Z)
If a target supports floating point multiply-and-add (FMA) operations, one of the adds can be merged with the multiply. On the PowerPC, for example, the output of the instruction selector might look like this DAG:
    (FMADDS (FADDS W, X), Y, Z)
The FMADDS instruction is a ternary instruction that multiplies its first two operands and adds the third (as single-precision floating-point numbers). The FADDS instruction is a simple binary single-precision add instruction. To perform this pattern match, the PowerPC backend includes the following instruction definitions:
    def FMADDS : AForm_1<59, 29,
                        (ops F4RC:$FRT, F4RC:$FRA, F4RC:$FRC, F4RC:$FRB),
                        "fmadds $FRT, $FRA, $FRC, $FRB",
                        [(set F4RC:$FRT, (fadd (fmul F4RC:$FRA, F4RC:$FRC),
                                               F4RC:$FRB))]>;
    def FADDS : AForm_2<59, 21,
                        (ops F4RC:$FRT, F4RC:$FRA, F4RC:$FRB),
                        "fadds $FRT, $FRA, $FRB",
                        [(set F4RC:$FRT, (fadd F4RC:$FRA, F4RC:$FRB))]>;
The bracketed [(set ...)] portion at the end of each instruction definition is the pattern used to match the instruction. The DAG operators (like fmul/fadd) are defined in the include/llvm/Target/TargetSelectionDAG.td file. “F4RC” is the register class of the input and result values.
The TableGen DAG instruction selector generator reads the instruction patterns in the .td file and automatically builds parts of the pattern matching code for your target. It has the following strengths:
- At compiler-compile time, it analyzes your instruction patterns and tells you if your patterns make sense or not.
- It can handle arbitrary constraints on operands for the pattern match. In particular, it is straight-forward to say things like “match any immediate that is a 13-bit sign-extended value”. For examples, see the immSExt16 and related tblgen classes in the PowerPC backend.
- It knows several important identities for the patterns defined. For example, it knows that addition is commutative, so it allows the FMADDS pattern above to match “(fadd X, (fmul Y, Z))” as well as “(fadd (fmul X, Y), Z)”, without the target author having to specially handle this case.
- It has a full-featured type-inferencing system. In particular, you should rarely have to explicitly tell the system what type parts of your patterns are. In the FMADDS case above, we didn’t have to tell tblgen that all of the nodes in the pattern are of type ‘f32’. It was able to infer and propagate this knowledge from the fact that F4RC has type ‘f32’.
- Targets can define their own (and rely on built-in) “pattern fragments”. Pattern fragments are chunks of reusable patterns that get inlined into your patterns during compiler-compile time. For example, the integer “(not x)” operation is actually defined as a pattern fragment that expands as “(xor x, -1)”, since the SelectionDAG does not have a native ‘not’ operation. Targets can define their own short-hand fragments as they see fit. See the definition of ‘not’ and ‘ineg’ for examples.
- In addition to instructions, targets can specify arbitrary patterns that map to one or more instructions using the ‘Pat’ class. For example, the PowerPC has no way to load an arbitrary integer immediate into a register in one instruction. To tell tblgen how to do this, it defines:
    // Arbitrary immediate support.  Implement in terms of LIS/ORI.
    def : Pat<(i32 imm:$imm),
              (ORI (LIS (HI16 imm:$imm)), (LO16 imm:$imm))>;
If none of the single-instruction patterns for loading an immediate into a register match, this will be used. This rule says “match an arbitrary i32 immediate, turning it into an ORI (‘or a 16-bit immediate’) and an LIS (‘load 16-bit immediate, where the immediate is shifted to the left 16 bits’) instruction”. To make this work, the LO16/HI16 node transformations are used to manipulate the input immediate (in this case, take the high or low 16-bits of the immediate).
- When using the ‘Pat’ class to map a pattern to an instruction that has one or more complex operands (like e.g. X86 addressing mode), the pattern may either specify the operand as a whole using a ComplexPattern, or else it may specify the components of the complex operand separately. The latter is done e.g. for pre-increment instructions by the PowerPC back end:
    def STWU  : DForm_1<37, (outs ptr_rc:$ea_res), (ins GPRC:$rS, memri:$dst),
                    "stwu $rS, $dst", LdStStoreUpd, []>,
                    RegConstraint<"$dst.reg = $ea_res">, NoEncode<"$ea_res">;

    def : Pat<(pre_store GPRC:$rS, ptr_rc:$ptrreg, iaddroff:$ptroff),
              (STWU GPRC:$rS, iaddroff:$ptroff, ptr_rc:$ptrreg)>;
Here, the pair of ptroff and ptrreg operands is matched onto the complex operand dst of class memri in the STWU instruction.

- While the system does automate a lot, it still allows you to write custom C++ code to match special cases if there is something that is hard to express.
While it has many strengths, the system currently has some limitations, primarily because it is a work in progress and is not yet finished:
- Overall, there is no way to define or match SelectionDAG nodes that define multiple values (e.g. SMUL_LOHI, LOAD, CALL, etc). This is the biggest reason that you currently still have to write custom C++ code for your instruction selector.
- There is no great way to support matching complex addressing modes yet. In the future, we will extend pattern fragments to allow them to define multiple values (e.g. the four operands of the X86 addressing mode, which are currently matched with custom C++ code). In addition, we’ll extend fragments so that a fragment can match multiple different patterns.
- We don’t automatically infer flags like isStore/isLoad yet.
- We don’t automatically generate the set of supported registers and operations for the Legalizer yet.
- We don’t have a way of tying in custom legalized nodes yet.
Despite these limitations, the instruction selector generator is still quite useful for most of the binary and logical operations in typical instruction sets. If you run into any problems or can’t figure out how to do something, please let Chris know!
SelectionDAG Scheduling and Formation Phase
The scheduling phase takes the DAG of target instructions from the selection phase and assigns an order. The scheduler can pick an order depending on various constraints of the machines (i.e. an order for minimal register pressure, or one that tries to cover instruction latencies). Once an order is established, the DAG is converted to a list of MachineInstrs and the SelectionDAG is destroyed.

Note that this phase is logically separate from the instruction selection phase, but is tied to it closely in the code because it operates on SelectionDAGs.
Future directions for the SelectionDAG
- Optional function-at-a-time selection.
- Auto-generate entire selector from .td file.
SSA-based Machine Code Optimizations
To Be Written
Live Intervals
Live Intervals are the ranges (intervals) where a variable is live. They are used by some register allocator passes to determine if two or more virtual registers which require the same physical register are live at the same point in the program (i.e., they conflict). When this situation occurs, one virtual register must be spilled.
Live Variable Analysis
The first step in determining the live intervals of variables is to calculate the set of registers that are immediately dead after the instruction (i.e., the instruction calculates the value, but it is never used) and the set of registers that are used by the instruction, but are never used after the instruction (i.e., they are killed). Live variable information is computed for each virtual register and register allocatable physical register in the function. This is done in a very efficient manner because it uses SSA to sparsely compute lifetime information for virtual registers (which are in SSA form) and only has to track physical registers within a block. Before register allocation, LLVM can assume that physical registers are only live within a single basic block. This allows it to do a single, local analysis to resolve physical register lifetimes within each basic block. If a physical register is not register allocatable (e.g., a stack pointer or condition codes), it is not tracked.
Physical registers may be live in to or out of a function. Live in values are typically arguments in registers. Live out values are typically return values in registers. Live in values are marked as such, and are given a dummy “defining” instruction during live intervals analysis. If the last basic block of a function is a return, then it’s marked as using all live out values in the function.
PHI nodes need to be handled specially, because the calculation of the live variable information from a depth first traversal of the CFG of the function won’t guarantee that a virtual register used by the PHI node is defined before it’s used. When a PHI node is encountered, only the definition is handled, because the uses will be handled in other basic blocks.
For each PHI node of the current basic block, we simulate an assignment at the end of the current basic block and traverse the successor basic blocks. If a successor basic block has a PHI node and one of the PHI node’s operands is coming from the current basic block, then the variable is marked as alive within the current basic block and all of its predecessor basic blocks, until the basic block with the defining instruction is encountered.
Live Intervals Analysis
We now have the information available to perform the live intervals analysis and build the live intervals themselves. We start off by numbering the basic blocks and machine instructions. We then handle the “live-in” values. These are in physical registers, so the physical register is assumed to be killed by the end of the basic block. Live intervals for virtual registers are computed for some ordering of the machine instructions [1, N]. A live interval is an interval [i, j), where 1 <= i <= j <= N, for which a variable is live.
Note
More to come…
Register Allocation
The Register Allocation problem consists of mapping a program Pv, which can use an unbounded number of virtual registers, to a program Pp that contains a finite (possibly small) number of physical registers. Each target architecture has a different number of physical registers. If the number of physical registers is not enough to accommodate all the virtual registers, some of them will have to be mapped into memory. These virtuals are called spilled virtuals.
How registers are represented in LLVM
In LLVM, physical registers are denoted by integer numbers that normally range from 1 to 1023. To see how this numbering is defined for a particular architecture, you can read the GenRegisterNames.inc file for that architecture. For instance, by inspecting lib/Target/X86/X86GenRegisterInfo.inc we see that the 32-bit register EAX is denoted by 43, and the MMX register MM0 is mapped to 65.
Some architectures contain registers that share the same physical location. A notable example is the X86 platform. For instance, in the X86 architecture, the registers EAX, AX and AL share the first eight bits. These physical registers are marked as aliased in LLVM. Given a particular architecture, you can check which registers are aliased by inspecting its RegisterInfo.td file. Moreover, the class MCRegAliasIterator enumerates all the physical registers aliased to a register.
Physical registers, in LLVM, are grouped in Register Classes. Elements in the same register class are functionally equivalent, and can be interchangeably used. Each virtual register can only be mapped to physical registers of a particular class. For instance, in the X86 architecture, some virtuals can only be allocated to 8 bit registers. A register class is described by TargetRegisterClass objects. To discover if a virtual register is compatible with a given physical register, this code can be used:
    bool RegMapping_Fer::compatible_class(MachineFunction &mf,
                                          unsigned v_reg,
                                          unsigned p_reg) {
      assert(TargetRegisterInfo::isPhysicalRegister(p_reg) &&
             "Target register must be physical");
      const TargetRegisterClass *trc = mf.getRegInfo().getRegClass(v_reg);
      return trc->contains(p_reg);
    }
Sometimes, mostly for debugging purposes, it is useful to change the number of physical registers available in the target architecture. This must be done statically, inside the TargetRegisterInfo.td file. Just grep for RegisterClass, the last parameter of which is a list of registers. Just commenting some out is one simple way to avoid them being used. A more polite way is to explicitly exclude some registers from the allocation order. See the definition of the GR8 register class in lib/Target/X86/X86RegisterInfo.td for an example of this.
Virtual registers are also denoted by integer numbers. Contrary to physical registers, different virtual registers never share the same number. Whereas physical registers are statically defined in a TargetRegisterInfo.td file and cannot be created by the application developer, that is not the case with virtual registers. In order to create new virtual registers, use the method MachineRegisterInfo::createVirtualRegister(). This method will return a new virtual register. Use an IndexedMap<Foo, VirtReg2IndexFunctor> to hold information per virtual register. If you need to enumerate all virtual registers, use the function TargetRegisterInfo::index2VirtReg() to find the virtual register numbers:
    for (unsigned i = 0, e = MRI->getNumVirtRegs(); i != e; ++i) {
      unsigned VirtReg = TargetRegisterInfo::index2VirtReg(i);
      stuff(VirtReg);
    }
Before register allocation, the operands of an instruction are mostly virtual registers, although physical registers may also be used. In order to check if a given machine operand is a register, use the boolean function MachineOperand::isRegister(). To obtain the integer code of a register, use MachineOperand::getReg(). An instruction may define or use a register. For instance, ADD reg:1026 := reg:1025 reg:1024 defines register 1026, and uses registers 1025 and 1024. Given a register operand, the method MachineOperand::isUse() informs if that register is being used by the instruction. The method MachineOperand::isDef() informs if that register is being defined.
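Putting these together, a sketch of scanning an instruction’s register operands (note that recent LLVM spells the register test MachineOperand::isReg()):

    // Sketch: classify each register operand of an instruction as a def
    // or a use.
    for (unsigned i = 0, e = MI.getNumOperands(); i != e; ++i) {
      const llvm::MachineOperand &MO = MI.getOperand(i);
      if (!MO.isReg())
        continue;                 // skip immediates, MBB references, etc.
      unsigned Reg = MO.getReg();
      if (MO.isDef()) {
        // Reg is defined by MI.
      } else {
        // Reg is used by MI.
      }
      (void)Reg;
    }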
We will call physical registers present in the LLVM bitcode before register allocation pre-colored registers. Pre-colored registers are used in many different situations, for instance, to pass parameters of function calls, and to store results of particular instructions. There are two types of pre-colored registers: the ones implicitly defined, and those explicitly defined. Explicitly defined registers are normal operands, and can be accessed with MachineInstr::getOperand(int)::getReg(). In order to check which registers are implicitly defined by an instruction, use TargetInstrInfo::get(opcode)::ImplicitDefs, where opcode is the opcode of the target instruction. One important difference between explicit and implicit physical registers is that the latter are defined statically for each instruction, whereas the former may vary depending on the program being compiled. For example, an instruction that represents a function call will always implicitly define or use the same set of physical registers. To read the registers implicitly used by an instruction, use TargetInstrInfo::get(opcode)::ImplicitUses. Pre-colored registers impose constraints on any register allocation algorithm. The register allocator must make sure that none of them are overwritten by the values of virtual registers while still alive.
Mapping virtual registers to physical registers
There are two ways to map virtual registers to physical registers (or to memory slots). The first way, that we will call direct mapping, is based on the use of methods of the classes TargetRegisterInfo and MachineOperand. The second way, that we will call indirect mapping, relies on the VirtRegMap class in order to insert loads and stores sending and getting values to and from memory.
The direct mapping provides more flexibility to the developer of the register allocator; however, it is more error prone, and demands more implementation work. Basically, the programmer will have to specify where load and store instructions should be inserted in the target function being compiled in order to get and store values in memory. To assign a physical register to a virtual register present in a given operand, use `MachineOperand::setReg(p_reg)`. To insert a store instruction, use `TargetInstrInfo::storeRegToStackSlot(…)`, and to insert a load instruction, use `TargetInstrInfo::loadRegFromStackSlot`.
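A sketch of the direct approach follows. Note that the exact parameter list of `storeRegToStackSlot` has varied across LLVM releases, and the helper shown here is hypothetical; treat it as an outline of the bookkeeping rather than a drop-in implementation:

```cpp
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include <iterator>

using namespace llvm;

// Hypothetical direct-mapping step: rewrite operand OpNo of MI to a
// physical register and spill that register to a stack slot after MI.
static void assignAndSpill(MachineInstr &MI, unsigned OpNo, Register PhysReg,
                           int FrameIndex, const TargetInstrInfo *TII,
                           const TargetRegisterClass *RC,
                           const TargetRegisterInfo *TRI) {
  MI.getOperand(OpNo).setReg(PhysReg);   // the direct assignment
  MachineBasicBlock &MBB = *MI.getParent();
  // Insert the store immediately after MI; 'true' marks the value killed.
  TII->storeRegToStackSlot(MBB, std::next(MI.getIterator()), PhysReg,
                           /*isKill=*/true, FrameIndex, RC, TRI);
}
```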
The indirect mapping shields the application developer from the complexities of inserting load and store instructions. In order to map a virtual register to a physical one, use `VirtRegMap::assignVirt2Phys(vreg, preg)`. In order to map a certain virtual register to memory, use `VirtRegMap::assignVirt2StackSlot(vreg)`. This method will return the stack slot where `vreg`'s value will be located. If it is necessary to map another virtual register to the same stack slot, use `VirtRegMap::assignVirt2StackSlot(vreg, stack_location)`. One important point to consider when using the indirect mapping is that even if a virtual register is mapped to memory, it still needs to be mapped to a physical register. This physical register is the location where the virtual register is supposed to be found before being stored or after being reloaded.
If the indirect strategy is used, after all the virtual registers have been mapped to physical registers or stack slots, it is necessary to use a spiller object to place load and store instructions in the code. Every virtual register that has been mapped to a stack slot will be stored to memory after being defined and will be loaded before being used. The implementation of the spiller tries to recycle load/store instructions, avoiding unnecessary instructions. For an example of how to invoke the spiller, see `RegAllocLinearScan::runOnMachineFunction` in `lib/CodeGen/RegAllocLinearScan.cpp`.
Handling two address instructions
With very rare exceptions (e.g., function calls), the LLVM machine code instructions are three address instructions. That is, each instruction is expected to define at most one register, and to use at most two registers. However, some architectures use two address instructions. In this case, the defined register is also one of the used registers. For instance, an X86 instruction such as `ADD %EAX, %EBX` is actually equivalent to `%EAX = %EAX + %EBX`.
In order to produce correct code, LLVM must convert three address instructions that represent two address instructions into true two address instructions. LLVM provides the pass `TwoAddressInstructionPass` for this specific purpose. It must be run before register allocation takes place. After its execution, the resulting code may no longer be in SSA form. This happens, for instance, in situations where an instruction such as `%a = ADD %b %c` is converted to two instructions such as:
```text
%a = MOVE %b
%a = ADD %a %c
```
Notice that, internally, the second instruction is represented as `ADD %a[def/use] %c`. I.e., the register operand `%a` is both used and defined by the instruction.
The SSA deconstruction phase
An important transformation that happens during register allocation is calledthe SSA Deconstruction Phase. The SSA form simplifies many analyses that areperformed on the control flow graph of programs. However, traditionalinstruction sets do not implement PHI instructions. Thus, in order to generateexecutable code, compilers must replace PHI instructions with other instructionsthat preserve their semantics.
There are many ways in which PHI instructions can safely be removed from the target code. The most traditional PHI deconstruction algorithm replaces PHI instructions with copy instructions. That is the strategy adopted by LLVM. The SSA deconstruction algorithm is implemented in `lib/CodeGen/PHIElimination.cpp`. In order to invoke this pass, the identifier `PHIEliminationID` must be marked as required in the code of the register allocator.
Instruction folding
Instruction folding is an optimization performed during register allocationthat removes unnecessary copy instructions. For instance, a sequence ofinstructions such as:
```text
%EBX = LOAD %mem_address
%EAX = COPY %EBX
```
can be safely substituted by the single instruction:
```text
%EAX = LOAD %mem_address
```
Instructions can be folded with the `TargetRegisterInfo::foldMemoryOperand(…)` method. Care must be taken when folding instructions; a folded instruction can be quite different from the original instruction. See `LiveIntervals::addIntervalsForSpills` in `lib/CodeGen/LiveIntervalAnalysis.cpp` for an example of its use.
Built-in register allocators
The LLVM infrastructure provides the application developer with four different register allocators:
- Fast — This register allocator is the default for debug builds. Itallocates registers on a basic block level, attempting to keep values inregisters and reusing registers as appropriate.
- Basic — This is an incremental approach to register allocation. Liveranges are assigned to registers one at a time in an order that is driven byheuristics. Since code can be rewritten on-the-fly during allocation, thisframework allows interesting allocators to be developed as extensions. It isnot itself a production register allocator but is a potentially usefulstand-alone mode for triaging bugs and as a performance baseline.
- Greedy — The default allocator. This is a highly tuned implementation ofthe Basic allocator that incorporates global live range splitting. Thisallocator works hard to minimize the cost of spill code.
- PBQP — A Partitioned Boolean Quadratic Programming (PBQP) based registerallocator. This allocator works by constructing a PBQP problem representingthe register allocation problem under consideration, solving this using a PBQPsolver, and mapping the solution back to a register assignment.
The type of register allocator used in `llc` can be chosen with the command line option `-regalloc=…`:
```bash
$ llc -regalloc=greedy file.bc -o gr.s
$ llc -regalloc=fast file.bc -o fa.s
$ llc -regalloc=pbqp file.bc -o pbqp.s
```
Prolog/Epilog Code Insertion
Compact Unwind
Throwing an exception requires unwinding out of a function. The information on how to unwind a given function is traditionally expressed in DWARF unwind (a.k.a. frame) info. But that format was originally developed for debuggers to backtrace, and each Frame Description Entry (FDE) requires ~20-30 bytes per function. There is also the cost of mapping from an address in a function to the corresponding FDE at runtime. An alternative unwind encoding is called compact unwind and requires just 4 bytes per function.
The compact unwind encoding is a 32-bit value, which is encoded in an architecture-specific way. It specifies which registers to restore and from where, and how to unwind out of the function. When the linker creates a final linked image, it will create a `__TEXT,__unwind_info` section. This section is a small and fast way for the runtime to access unwind info for any given function. If we emit compact unwind info for the function, that compact unwind info will be encoded in the `__TEXT,__unwind_info` section. If we emit DWARF unwind info, the `__TEXT,__unwind_info` section will contain the offset of the FDE in the `__TEXT,__eh_frame` section in the final linked image.
For X86, there are three modes for the compact unwind encoding:
- Function with a Frame Pointer (`EBP` or `RBP`)

  `EBP/RBP`-based frame, where `EBP/RBP` is pushed onto the stack immediately after the return address, then `ESP/RSP` is moved to `EBP/RBP`. Thus to unwind, `ESP/RSP` is restored with the current `EBP/RBP` value, then `EBP/RBP` is restored by popping the stack, and the return is done by popping the stack once more into the PC. All non-volatile registers that need to be restored must have been saved in a small range on the stack that starts `EBP-4` to `EBP-1020` (`RBP-8` to `RBP-1020`). The offset (divided by 4 in 32-bit mode and 8 in 64-bit mode) is encoded in bits 16-23 (mask: `0x00FF0000`). The registers saved are encoded in bits 0-14 (mask: `0x00007FFF`) as five 3-bit entries from the following table:
  | Compact Number | i386 Register | x86-64 Register |
  |---|---|---|
  | 1 | `EBX` | `RBX` |
  | 2 | `ECX` | `R12` |
  | 3 | `EDX` | `R13` |
  | 4 | `EDI` | `R14` |
  | 5 | `ESI` | `R15` |
  | 6 | `EBP` | `RBP` |
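To make the bit layout concrete, here is a small illustrative decoder (not LLVM's actual implementation) that extracts the frame-pointer-mode fields using the masks given above:

```cpp
#include <cstdint>

// Fields of a frame-pointer-mode compact unwind encoding (sketch only).
struct FrameModeFields {
  unsigned StackOffset;  // bits 16-23: save-area offset, scaled by 4 or 8
  unsigned SavedRegs[5]; // bits 0-14: five 3-bit compact register numbers
};

static FrameModeFields decodeFrameMode(uint32_t Encoding) {
  FrameModeFields F;
  F.StackOffset = (Encoding & 0x00FF0000) >> 16;
  for (int i = 0; i < 5; ++i)               // entry 0 sits in the low bits
    F.SavedRegs[i] = (Encoding >> (3 * i)) & 0x7;
  return F;
}
```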
- Frameless with a Small Constant Stack Size (`EBP` or `RBP` is not used as a frame pointer)

  To return, a constant (encoded in the compact unwind encoding) is added to the `ESP/RSP`. Then the return is done by popping the stack into the PC. All non-volatile registers that need to be restored must have been saved on the stack immediately after the return address. The stack size (divided by 4 in 32-bit mode and 8 in 64-bit mode) is encoded in bits 16-23 (mask: `0x00FF0000`). There is a maximum stack size of 1024 bytes in 32-bit mode and 2048 in 64-bit mode. The number of registers saved is encoded in bits 10-12 (mask: `0x00001C00`). Bits 0-9 (mask: `0x000003FF`) contain which registers were saved and their order. (See the `encodeCompactUnwindRegistersWithoutFrame()` function in `lib/Target/X86/X86FrameLowering.cpp` for the encoding algorithm.)
EBP
orRBP
is not used as a frame pointer) - This case is like the “Frameless with a Small Constant Stack Size” case, butthe stack size is too large to encode in the compact unwind encoding. Insteadit requires that the function contains “
subl $nnnnnn, %esp
” in itsprolog. The compact encoding contains the offset to the$nnnnnn
value inthe function in bits 9-12 (mask:0x00001C00
).
Late Machine Code Optimizations
Note
To Be Written
Code Emission
The code emission step of code generation is responsible for lowering from thecode generator abstractions (like MachineFunction, MachineInstr, etc) downto the abstractions used by the MC layer (MCInst, MCStreamer, etc). Thisis done with a combination of several different classes: the (misnamed)target-independent AsmPrinter class, target-specific subclasses of AsmPrinter(such as SparcAsmPrinter), and the TargetLoweringObjectFile class.
Since the MC layer works at the level of abstraction of object files, it doesn’thave a notion of functions, global variables etc. Instead, it thinks aboutlabels, directives, and instructions. A key class used at this time is theMCStreamer class. This is an abstract API that is implemented in different ways(e.g. to output a .s file, output an ELF .o file, etc) that is effectively an“assembler API”. MCStreamer has one method per directive, such as EmitLabel,EmitSymbolAttribute, SwitchSection, etc, which directly correspond to assemblylevel directives.
If you are interested in implementing a code generator for a target, there arethree important things that you have to implement for your target:
- First, you need a subclass of AsmPrinter for your target. This classimplements the general lowering process converting MachineFunction’s into MClabel constructs. The AsmPrinter base class provides a number of usefulmethods and routines, and also allows you to override the lowering process insome important ways. You should get much of the lowering for free if you areimplementing an ELF, COFF, or MachO target, because theTargetLoweringObjectFile class implements much of the common logic.
- Second, you need to implement an instruction printer for your target. The instruction printer takes an `MCInst` and renders it to a `raw_ostream` as text. Most of this is automatically generated from the .td file (when you specify something like "`add $dst, $src1, $src2`" in the instructions), but you need to implement routines to print operands.
- Third, you need to implement code that lowers a `MachineInstr` to an `MCInst`, usually implemented in "`<target>MCInstLower.cpp`". This lowering process is often target specific, and is responsible for turning jump table entries, constant pool indices, global variable addresses, etc into `MCLabel`s as appropriate. This translation layer is also responsible for expanding pseudo ops used by the code generator into the actual machine instructions they correspond to. The `MCInst`s that are generated by this are fed into the instruction printer or the encoder.

Finally, at your choosing, you can also implement a subclass of `MCCodeEmitter` which lowers `MCInst`s into machine code bytes and relocations. This is important if you want to support direct .o file emission, or would like to implement an assembler for your target.
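To illustrate the MachineInstr-to-MCInst step, here is a simplified sketch for a hypothetical target (real lowerings also handle symbol operands, target flags, and pseudo expansion):

```cpp
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/MC/MCInst.h"

using namespace llvm;

// Sketch: translate a MachineInstr into an MCInst for printing/encoding.
static void lowerToMCInst(const MachineInstr &MI, MCInst &OutMI) {
  OutMI.setOpcode(MI.getOpcode());
  for (const MachineOperand &MO : MI.operands()) {
    if (MO.isReg())
      OutMI.addOperand(MCOperand::createReg(MO.getReg()));
    else if (MO.isImm())
      OutMI.addOperand(MCOperand::createImm(MO.getImm()));
    // A real target lowers globals, jump-table and constant-pool
    // indices, and external symbols to MCExpr-based operands here.
  }
}
```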
Emitting function stack size information
A section containing metadata on function stack sizes will be emitted when `TargetLoweringObjectFile::StackSizesSection` is not null, and `TargetOptions::EmitStackSizeSection` is set (`-stack-size-section`). The section will contain an array of pairs of function symbol values (pointer size) and stack sizes (unsigned LEB128). The stack size values only include the space allocated in the function prologue. Functions with dynamic stack allocations are not included.
VLIW Packetizer
In a Very Long Instruction Word (VLIW) architecture, the compiler is responsible for mapping instructions to functional units available on the architecture. To that end, the compiler creates groups of instructions called packets or bundles. The VLIW packetizer in LLVM is a target-independent mechanism to enable the packetization of machine instructions.
Mapping from instructions to functional units
Instructions in a VLIW target can typically be mapped to multiple functionalunits. During the process of packetizing, the compiler must be able to reasonabout whether an instruction can be added to a packet. This decision can becomplex since the compiler has to examine all possible mappings of instructionsto functional units. Therefore to alleviate compilation-time complexity, theVLIW packetizer parses the instruction classes of a target and generates tablesat compiler build time. These tables can then be queried by the providedmachine-independent API to determine if an instruction can be accommodated in apacket.
How the packetization tables are generated and used
The packetizer reads instruction classes from a target’s itineraries and createsa deterministic finite automaton (DFA) to represent the state of a packet. A DFAconsists of three major elements: inputs, states, and transitions. The set ofinputs for the generated DFA represents the instruction being added to apacket. The states represent the possible consumption of functional units byinstructions in a packet. In the DFA, transitions from one state to anotheroccur on the addition of an instruction to an existing packet. If there is alegal mapping of functional units to instructions, then the DFA contains acorresponding transition. The absence of a transition indicates that a legalmapping does not exist and that the instruction cannot be added to the packet.
To generate tables for a VLIW target, add `<Target>GenDFAPacketizer.inc` as a target to the Makefile in the target directory. The exported API provides three functions: `DFAPacketizer::clearResources()`, `DFAPacketizer::reserveResources(MachineInstr &MI)`, and `DFAPacketizer::canReserveResources(MachineInstr &MI)`. These functions allow a target packetizer to add an instruction to an existing packet and to check whether an instruction can be added to a packet. See `llvm/CodeGen/DFAPacketizer.h` for more information.
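In use, a target's packetizer drives this API roughly as follows (a sketch; the packet bookkeeping and surrounding pass are hypothetical, and signatures are shown as in recent LLVM trees):

```cpp
#include "llvm/CodeGen/DFAPacketizer.h"
#include "llvm/CodeGen/MachineInstr.h"
#include <vector>

using namespace llvm;

// Greedily add MI to the current packet, or start a new packet if the
// DFA reports that no legal functional-unit assignment exists.
static void addToPacket(DFAPacketizer &DFA, MachineInstr &MI,
                        std::vector<MachineInstr *> &CurrentPacket) {
  if (!DFA.canReserveResources(MI)) { // no legal transition: close packet
    CurrentPacket.clear();
    DFA.clearResources();
  }
  DFA.reserveResources(MI);           // consume the units MI needs
  CurrentPacket.push_back(&MI);
}
```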
Implementing a Native Assembler
Though you’re probably reading this because you want to write or maintain acompiler backend, LLVM also fully supports building a native assembler.We’ve tried hard to automate the generation of the assembler from the .td files(in particular the instruction syntax and encodings), which means that a largepart of the manual and repetitive data entry can be factored and shared with thecompiler.
Instruction Parsing
Note
To Be Written
Instruction Alias Processing
Once the instruction is parsed, it enters the `MatchInstructionImpl` function, which performs alias processing and then does the actual matching.
Alias processing is the phase that canonicalizes different lexical forms of thesame instructions down to one representation. There are several different kindsof alias that are possible to implement and they are listed below in the orderthat they are processed (which is in order from simplest/weakest to mostcomplex/powerful). Generally you want to use the first alias mechanism thatmeets the needs of your instruction, because it will allow a more concisedescription.
Mnemonic Aliases
The first phase of alias processing is simple instruction mnemonic remapping for classes of instructions which are allowed with two different mnemonics. This phase is a simple and unconditional remapping from one input mnemonic to one output mnemonic. It isn't possible for this form of alias to look at the operands at all, so the remapping must apply for all forms of a given mnemonic. Mnemonic aliases are defined simply; for example, X86 has:
- def : MnemonicAlias<"cbw", "cbtw">;
- def : MnemonicAlias<"smovq", "movsq">;
- def : MnemonicAlias<"fldcww", "fldcw">;
- def : MnemonicAlias<"fucompi", "fucomip">;
- def : MnemonicAlias<"ud2a", "ud2">;
… and many others. With a MnemonicAlias definition, the mnemonic is remappedsimply and directly. Though MnemonicAlias’s can’t look at any aspect of theinstruction (such as the operands) they can depend on global modes (the sameones supported by the matcher), through a Requires clause:
- def : MnemonicAlias<"pushf", "pushfq">, Requires<[In64BitMode]>;
- def : MnemonicAlias<"pushf", "pushfl">, Requires<[In32BitMode]>;
In this example, the mnemonic gets mapped into a different one depending onthe current instruction set.
Instruction Aliases
The most general phase of alias processing occurs while matching is happening:it provides new forms for the matcher to match along with a specific instructionto generate. An instruction alias has two parts: the string to match and theinstruction to generate. For example:
- def : InstAlias<"movsx $src, $dst", (MOVSX16rr8W GR16:$dst, GR8 :$src)>;
- def : InstAlias<"movsx $src, $dst", (MOVSX16rm8W GR16:$dst, i8mem:$src)>;
- def : InstAlias<"movsx $src, $dst", (MOVSX32rr8 GR32:$dst, GR8 :$src)>;
- def : InstAlias<"movsx $src, $dst", (MOVSX32rr16 GR32:$dst, GR16 :$src)>;
- def : InstAlias<"movsx $src, $dst", (MOVSX64rr8 GR64:$dst, GR8 :$src)>;
- def : InstAlias<"movsx $src, $dst", (MOVSX64rr16 GR64:$dst, GR16 :$src)>;
- def : InstAlias<"movsx $src, $dst", (MOVSX64rr32 GR64:$dst, GR32 :$src)>;
This shows a powerful example of the instruction aliases, matching the samemnemonic in multiple different ways depending on what operands are present inthe assembly. The result of instruction aliases can include operands in adifferent order than the destination instruction, and can use an input multipletimes, for example:
- def : InstAlias<"clrb $reg", (XOR8rr GR8 :$reg, GR8 :$reg)>;
- def : InstAlias<"clrw $reg", (XOR16rr GR16:$reg, GR16:$reg)>;
- def : InstAlias<"clrl $reg", (XOR32rr GR32:$reg, GR32:$reg)>;
- def : InstAlias<"clrq $reg", (XOR64rr GR64:$reg, GR64:$reg)>;
This example also shows that tied operands are only listed once. In the X86backend, XOR8rr has two input GR8’s and one output GR8 (where an input is tiedto the output). InstAliases take a flattened operand list without duplicatesfor tied operands. The result of an instruction alias can also use immediatesand fixed physical registers which are added as simple immediate operands in theresult, for example:
```tablegen
// Fixed Immediate operand.
def : InstAlias<"aad", (AAD8i8 10)>;

// Fixed register operand.
def : InstAlias<"fcomi", (COM_FIr ST1)>;

// Simple alias.
def : InstAlias<"fcomi $reg", (COM_FIr RST:$reg)>;
```
Instruction aliases can also have a Requires clause to make them subtargetspecific.
If the back-end supports it, the instruction printer can automatically emit thealias rather than what’s being aliased. It typically leads to better, morereadable code. If it’s better to print out what’s being aliased, then pass a ‘0’as the third parameter to the InstAlias definition.
Instruction Matching
Note
To Be Written
Target-specific Implementation Notes
This section of the document explains features or design decisions that arespecific to the code generator for a particular target. First we start with atable that summarizes what features are supported by each target.
Target Feature Matrix
Note that this table does not list features that are not supported fully by any target yet. It considers a feature to be supported if at least one subtarget supports it. A feature being supported means that it is useful and works for most cases; it does not indicate that there are zero known bugs in the implementation. Here is the key:
| Unknown | Not Applicable | No support | Partial Support | Complete Support |
|---|---|---|---|---|
Here is the table:
| Feature | ARM | Hexagon | MSP430 | Mips | NVPTX | PowerPC | Sparc | SystemZ | X86 | XCore | eBPF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| is generally reliable |  |  |  |  |  |  |  |  |  |  |  |
| assembly parser |  |  |  |  |  |  |  |  |  |  |  |
| disassembler |  |  |  |  |  |  |  |  |  |  |  |
| inline asm |  |  |  |  |  |  |  |  |  |  |  |
| jit | * |  |  |  |  |  |  |  |  |  |  |
| .o file writing |  |  |  |  |  |  |  |  |  |  |  |
| tail calls |  |  |  |  |  |  |  |  |  |  |  |
| segmented stacks |  |  |  |  |  |  |  |  | * |  |  |

(In the rendered documentation the support level of each cell is conveyed by its color; those values do not survive in this text rendering. The `*` markers refer to the notes below.)
Is Generally Reliable
This box indicates whether the target is considered to be production quality.This indicates that the target has been used as a static compiler to compilelarge amounts of code by a variety of different people and is in continuous use.
Assembly Parser
This box indicates whether the target supports parsing target specific .s filesby implementing the MCAsmParser interface. This is required for llvm-mc to beable to act as a native assembler and is required for inline assembly support inthe native .o file writer.
Disassembler
This box indicates whether the target supports the MCDisassembler API fordisassembling machine opcode bytes into MCInst’s.
Inline Asm
This box indicates whether the target supports most popular inline assemblyconstraints and modifiers.
JIT Support
This box indicates whether the target supports the JIT compiler through theExecutionEngine interface.
The ARM backend has basic support for integer code in ARM codegen mode, butlacks NEON and full Thumb support.
.o File Writing
This box indicates whether the target supports writing .o files (e.g. MachO, ELF, and/or COFF) directly from the target. Note that the target also must include an assembly parser and general inline assembly support for full inline assembly support in the .o writer.
Targets that don’t support this feature can obviously still write out .o files,they just rely on having an external assembler to translate from a .s file to a.o file (as is the case for many C compilers).
Tail Calls
This box indicates whether the target supports guaranteed tail calls. These are calls marked "tail" that use the fastcc calling convention. Please see the tail call section for more details.
Segmented Stacks
This box indicates whether the target supports segmented stacks. This replaces the traditional large C stack with many linked segments. It is compatible with the gcc implementation used by the Go front end.
Basic support exists on the X86 backend. Currently vararg doesn’t work and theobject files are not marked the way the gold linker expects, but simple Goprograms can be built by dragonegg.
Tail call optimization
Tail call optimization, in which the callee reuses the stack of the caller, is currently supported on x86/x86-64, PowerPC, and WebAssembly. It is performed on x86/x86-64 and PowerPC if:
- Caller and callee have the calling convention `fastcc`, `cc 10` (GHC calling convention), `cc 11` (HiPE calling convention), or `tailcc`.
- The call is a tail call - in tail position (ret immediately follows call and ret uses value of call or is void).
- Option `-tailcallopt` is enabled or the calling convention is `tailcc`.
- Platform-specific constraints are met.
x86/x86-64 constraints:
- No variable argument lists are used.
- On x86-64 when generating GOT/PIC code only module-local calls (visibility =hidden or protected) are supported.
PowerPC constraints:
- No variable argument lists are used.
- No byval parameters are used.
- On ppc32/64 GOT/PIC only module-local calls (visibility = hidden or protected)are supported.
WebAssembly constraints:
- No variable argument lists are used.
- The ‘tail-call’ target attribute is enabled.
- The caller and callee’s return types must match. The caller cannotbe void unless the callee is, too.
Example:
Call as `llc -tailcallopt test.ll`.
```llvm
declare fastcc i32 @tailcallee(i32 inreg %a1, i32 inreg %a2, i32 %a3, i32 %a4)

define fastcc i32 @tailcaller(i32 %in1, i32 %in2) {
  %l1 = add i32 %in1, %in2
  %tmp = tail call fastcc i32 @tailcallee(i32 inreg %in1, i32 inreg %in2, i32 %in1, i32 %l1)
  ret i32 %tmp
}
```
Implications of `-tailcallopt`:

To support tail call optimization in situations where the callee has more arguments than the caller a 'callee pops arguments' convention is used. This currently causes each `fastcc` call that is not tail call optimized (because one or more of the above constraints are not met) to be followed by a readjustment of the stack. So performance might be worse in such cases.
Sibling call optimization
Sibling call optimization is a restricted form of tail call optimization. Unlike tail call optimization described in the previous section, it can be performed automatically on any tail calls when the `-tailcallopt` option is not specified.
Sibling call optimization is currently performed on x86/x86-64 when thefollowing constraints are met:
- Caller and callee have the same calling convention. It can be either `c` or `fastcc`.
- The call is a tail call - in tail position (ret immediately follows call and ret uses value of call or is void).
- Caller and callee have matching return type or the callee result is not used.
- If any of the callee arguments are being passed on the stack, they must be available in the caller's own incoming argument stack and the frame offsets must be the same.
Example:
```llvm
declare i32 @bar(i32, i32)

define i32 @foo(i32 %a, i32 %b, i32 %c) {
entry:
  %0 = tail call i32 @bar(i32 %a, i32 %b)
  ret i32 %0
}
```
The X86 backend
The X86 code generator lives in the `lib/Target/X86` directory. This code generator is capable of targeting a variety of x86-32 and x86-64 processors, and includes support for ISA extensions such as MMX and SSE.
X86 Target Triples supported
The following are the known target triples that are supported by the X86backend. This is not an exhaustive list, and it would be useful to add thosethat people test.
- i686-pc-linux-gnu — Linux
- i386-unknown-freebsd5.3 — FreeBSD 5.3
- i686-pc-cygwin — Cygwin on Win32
- i686-pc-mingw32 — MinGW on Win32
- i386-pc-mingw32msvc — MinGW cross-compiler on Linux
- i686-apple-darwin* — Apple Darwin on X86
- x86_64-unknown-linux-gnu — Linux
X86 Calling Conventions supported
The following target-specific calling conventions are known to the backend:
- x86_StdCall — stdcall calling convention seen on Microsoft Windowsplatform (CC ID = 64).
- x86_FastCall — fastcall calling convention seen on Microsoft Windowsplatform (CC ID = 65).
- x86_ThisCall — Similar to X86_StdCall. Passes first argument in ECX,others via stack. Callee is responsible for stack cleaning. This convention isused by MSVC by default for methods in its ABI (CC ID = 70).
Representing X86 addressing modes in MachineInstrs
The x86 has a very flexible way of accessing memory. It is capable of formingmemory addresses of the following expression directly in integer instructions(which use ModR/M addressing):
```text
SegmentReg: Base + [1,2,4,8] * IndexReg + Disp32
```
In order to represent this, LLVM tracks no less than 5 operands for each memory operand of this form. This means that the "load" form of '`mov`' has the following `MachineOperand`s in this order:
```text
Index:        0     |    1        2       3           4          5
Meaning:   DestReg, | BaseReg,  Scale, IndexReg, Displacement Segment
OperandTy: VirtReg, | VirtReg, UnsImm, VirtReg,   SignExtImm  PhysReg
```
Stores, and all other instructions, treat the five memory operands in the same way and in the same order. If the segment register is unspecified (regno = 0), then no segment override is generated. "Lea" operations do not have a segment register specified, so they only have 4 operands for their memory reference.
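For illustration, here is a sketch of reading those five operands, assuming the operand-index enum from the X86 backend's `X86BaseInfo.h` (`X86::AddrBaseReg` and friends) and a caller-supplied index `MemOpStart` of the first memory operand:

```cpp
#include "MCTargetDesc/X86BaseInfo.h" // X86::AddrBaseReg etc. (backend-internal)
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

// Decompose one x86 memory reference beginning at operand MemOpStart.
static void readX86MemOperands(const MachineInstr &MI, unsigned MemOpStart) {
  Register Base  = MI.getOperand(MemOpStart + X86::AddrBaseReg).getReg();
  int64_t  Scale = MI.getOperand(MemOpStart + X86::AddrScaleAmt).getImm();
  Register Index = MI.getOperand(MemOpStart + X86::AddrIndexReg).getReg();
  // Note: the displacement may also be a symbolic operand (global,
  // constant-pool index, ...) rather than a plain immediate.
  int64_t  Disp  = MI.getOperand(MemOpStart + X86::AddrDisp).getImm();
  Register Seg   = MI.getOperand(MemOpStart + X86::AddrSegmentReg).getReg();
  // Seg == 0 (no register) means no segment override is emitted.
  (void)Base; (void)Scale; (void)Index; (void)Disp; (void)Seg;
}
```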
X86 address spaces supported
x86 has a feature which provides the ability to perform loads and stores todifferent address spaces via the x86 segment registers. A segment overrideprefix byte on an instruction causes the instruction’s memory access to go tothe specified segment. LLVM address space 0 is the default address space, whichincludes the stack, and any unqualified memory accesses in a program. Addressspaces 1-255 are currently reserved for user-defined code. The GS-segment isrepresented by address space 256, the FS-segment is represented by address space257, and the SS-segment is represented by address space 258. Other x86 segmentshave yet to be allocated address space numbers.
While these address spaces may seem similar to TLS via the `thread_local` keyword, and often use the same underlying hardware, there are some fundamental differences.

The `thread_local` keyword applies to global variables and specifies that they are to be allocated in thread-local memory. There are no type qualifiers involved, and these variables can be pointed to with normal pointers and accessed with normal loads and stores. The `thread_local` keyword is target-independent at the LLVM IR level (though LLVM doesn't yet have implementations of it for some configurations).
Special address spaces, in contrast, apply to static types. Every load and storehas a particular address space in its address operand type, and this is whatdetermines which address space is accessed. LLVM ignores these special addressspace qualifiers on global variables, and does not provide a way to directlyallocate storage in them. At the LLVM IR level, the behavior of these specialaddress spaces depends in part on the underlying OS or runtime environment, andthey are specific to x86 (and LLVM doesn’t yet handle them correctly in somecases).
Some operating systems and runtime environments use (or may in the future use)the FS/GS-segment registers for various low-level purposes, so care should betaken when considering them.
Instruction naming
An instruction name consists of the base name, a default operand size, and a character per operand with an optional special size. For example:
```text
ADD8rr      -> add, 8-bit register, 8-bit register
IMUL16rmi   -> imul, 16-bit register, 16-bit memory, 16-bit immediate
IMUL16rmi8  -> imul, 16-bit register, 16-bit memory, 8-bit immediate
MOVSX32rm16 -> movsx, 32-bit register, 16-bit memory
```
The PowerPC backend
The PowerPC code generator lives in the `lib/Target/PowerPC` directory. The code generation is retargetable to several variations or subtargets of the PowerPC ISA, including ppc32, ppc64, and altivec.
LLVM PowerPC ABI
LLVM follows the AIX PowerPC ABI, with two deviations. First, LLVM uses PC-relative (PIC) or static addressing for accessing global values, so no TOC (r2) is used. Second, r31 is used as a frame pointer to allow dynamic growth of a stack frame. LLVM takes advantage of having no TOC to provide space to save the frame pointer in the PowerPC linkage area of the caller frame. Other details of the PowerPC ABI can be found at PowerPC ABI. Note: This link describes the 32 bit ABI. The 64 bit ABI is similar except space for GPRs is 8 bytes wide (not 4) and r13 is reserved for system use.
Frame Layout
The size of a PowerPC frame is usually fixed for the duration of a function's invocation. Since the frame is fixed size, all references into the frame can be accessed via fixed offsets from the stack pointer. The exception to this is when dynamic alloca or variable sized arrays are present, then a base pointer (r31) is used as a proxy for the stack pointer and the stack pointer is free to grow or shrink. A base pointer is also used if llvm-gcc is not passed the -fomit-frame-pointer flag. The stack pointer is always aligned to 16 bytes, so that space allocated for altivec vectors will be properly aligned.
An invocation frame is laid out as follows (low memory at top):
| Stack region |
|---|
| Linkage |
| Parameter area |
| Dynamic area |
| Locals area |
| Saved registers area |
| Previous Frame |
The linkage area is used by a callee to save special registers prior to allocating its own frame. Only three entries are relevant to LLVM. The first entry is the previous stack pointer (sp), aka link. This allows probing tools like gdb or exception handlers to quickly scan the frames in the stack. A function epilog can also use the link to pop the frame from the stack. The third entry in the linkage area is used to save the return address from the lr register. Finally, as mentioned above, the last entry is used to save the previous frame pointer (r31). The entries in the linkage area are the size of a GPR, thus the linkage area is 24 bytes long in 32 bit mode and 48 bytes in 64 bit mode.
32 bit linkage area:
| Offset | Contents |
|---|---|
| 0 | Saved SP (r1) |
| 4 | Saved CR |
| 8 | Saved LR |
| 12 | Reserved |
| 16 | Reserved |
| 20 | Saved FP (r31) |
64 bit linkage area:
| Offset | Contents |
|---|---|
| 0 | Saved SP (r1) |
| 8 | Saved CR |
| 16 | Saved LR |
| 24 | Reserved |
| 32 | Reserved |
| 40 | Saved FP (r31) |
The parameter area is used to store arguments being passed to a callee function. Following the PowerPC ABI, the first few arguments are actually passed in registers, with the space in the parameter area unused. However, if there are not enough registers or the callee is a thunk or vararg function, these register arguments can be spilled into the parameter area. Thus, the parameter area must be large enough to store all the parameters for the largest call sequence made by the caller. The size must also be minimally large enough to spill registers r3-r10. This allows callees blind to the call signature, such as thunks and vararg functions, enough space to cache the argument registers. Therefore, the parameter area is minimally 32 bytes (64 bytes in 64 bit mode). Also note that since the parameter area is a fixed offset from the top of the frame, a callee can access its split arguments using fixed offsets from the stack pointer (or base pointer).
Combining the information about the linkage and parameter areas and their alignment, a stack frame is minimally 64 bytes in 32 bit mode and 128 bytes in 64 bit mode.
The dynamic area starts out as size zero. If a function uses dynamic alloca then space is added to the stack, the linkage and parameter areas are shifted to the top of the stack, and the new space is available immediately below the linkage and parameter areas. The cost of shifting the linkage and parameter areas is minor since only the link value needs to be copied. The link value can be easily fetched by adding the original frame size to the base pointer. Note that allocations in the dynamic space need to observe 16 byte alignment.
The locals area is where the llvm compiler reserves space for local variables.
The saved registers area is where the llvm compiler spills callee savedregisters on entry to the callee.
Prolog/Epilog
The llvm prolog and epilog are the same as described in the PowerPC ABI, with the following exceptions. Callee saved registers are spilled after the frame is created. This allows the llvm epilog/prolog support to be common with other targets. The base pointer callee saved register r31 is saved in the TOC slot of the linkage area. This simplifies allocation of space for the base pointer and makes it convenient to locate programmatically and during debugging.
Dynamic Allocation
Note
TODO - More to come.
The NVPTX backend
The NVPTX code generator under `lib/Target/NVPTX` is an open-source version of the NVIDIA NVPTX code generator for LLVM. It is contributed by NVIDIA and is a port of the code generator used in the CUDA compiler (nvcc). It targets the PTX 3.0/3.1 ISA and can target any compute capability greater than or equal to 2.0 (Fermi).
This target is of production quality and should be completely compatible withthe official NVIDIA toolchain.
Code Generator Options:
Option | Description |
---|---|
sm_20 | Set shader model/compute capability to 2.0 |
sm_21 | Set shader model/compute capability to 2.1 |
sm_30 | Set shader model/compute capability to 3.0 |
sm_35 | Set shader model/compute capability to 3.5 |
ptx30 | Target PTX 3.0 |
ptx31 | Target PTX 3.1 |
The extended Berkeley Packet Filter (eBPF) backend
Extended BPF (or eBPF) is similar to the original ("classic") BPF (cBPF) used to filter network packets. The bpf() system call performs a range of operations related to eBPF. For both cBPF and eBPF programs, the Linux kernel statically analyzes the programs before loading them, in order to ensure that they cannot harm the running system. eBPF is a 64-bit RISC instruction set designed for one-to-one mapping to 64-bit CPUs. Opcodes are 8-bit encoded, and 87 instructions are defined. There are 10 registers, grouped by function as outlined below.
- R0 return value from in-kernel functions; exit value for eBPF program
- R1 - R5 function call arguments to in-kernel functions
- R6 - R9 callee-saved registers preserved by in-kernel functions
- R10 stack frame pointer (read only)
Instruction encoding (arithmetic and jump)
eBPF reuses most of the opcode encoding from classic BPF to simplify conversion of classic BPF programs to eBPF. For arithmetic and jump instructions the 8-bit 'code' field is divided into three parts:
```text
+----------------+--------+--------------------+
|   4 bits       |  1 bit |   3 bits           |
| operation code | source | instruction class  |
+----------------+--------+--------------------+
(MSB)                                      (LSB)
```
Three LSB bits store instruction class which is one of:
```text
BPF_LD     0x0
BPF_LDX    0x1
BPF_ST     0x2
BPF_STX    0x3
BPF_ALU    0x4
BPF_JMP    0x5
(unused)   0x6
BPF_ALU64  0x7
```
When BPF_CLASS(code) == BPF_ALU or BPF_ALU64 or BPF_JMP, the 4th bit encodes the source operand:
```text
BPF_X  0x1  use src_reg register as source operand
BPF_K  0x0  use 32 bit immediate as source operand
```
and the four MSB bits store the operation code:
```text
BPF_ADD   0x0  add
BPF_SUB   0x1  subtract
BPF_MUL   0x2  multiply
BPF_DIV   0x3  divide
BPF_OR    0x4  bitwise logical OR
BPF_AND   0x5  bitwise logical AND
BPF_LSH   0x6  left shift
BPF_RSH   0x7  right shift (zero extended)
BPF_NEG   0x8  arithmetic negation
BPF_MOD   0x9  modulo
BPF_XOR   0xa  bitwise logical XOR
BPF_MOV   0xb  move register to register
BPF_ARSH  0xc  right shift (sign extended)
BPF_END   0xd  endianness conversion
```
If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
```text
BPF_JA    0x0  unconditional jump
BPF_JEQ   0x1  jump ==
BPF_JGT   0x2  jump >
BPF_JGE   0x3  jump >=
BPF_JSET  0x4  jump if (DST & SRC)
BPF_JNE   0x5  jump !=
BPF_JSGT  0x6  jump signed >
BPF_JSGE  0x7  jump signed >=
BPF_CALL  0x8  function call
BPF_EXIT  0x9  function return
```
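The field extraction described above maps directly to a few shifts and masks; here is a small illustrative decoder (mirroring the kernel's BPF_CLASS/BPF_SRC/BPF_OP macros):

```cpp
#include <cstdint>

// Split the 8-bit 'code' field of an eBPF arithmetic or jump instruction.
struct BPFOpcodeFields {
  uint8_t InstClass; // 3 LSBs: BPF_LD ... BPF_ALU64
  uint8_t Src;       // 1 bit:  BPF_K (immediate) or BPF_X (register)
  uint8_t Op;        // 4 MSBs: BPF_ADD, BPF_JEQ, ...
};

static BPFOpcodeFields decodeALUOrJmp(uint8_t Code) {
  BPFOpcodeFields F;
  F.InstClass = Code & 0x07;        // instruction class
  F.Src       = (Code >> 3) & 0x01; // source operand selector
  F.Op        = (Code >> 4) & 0x0F; // operation code
  return F;
}
```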
Instruction encoding (load, store)
For load and store instructions the 8-bit ‘code’ field is divided as:
```text
+--------+--------+-------------------+
| 3 bits | 2 bits |      3 bits       |
|  mode  |  size  | instruction class |
+--------+--------+-------------------+
(MSB)                             (LSB)
```
Size modifier is one of:

```text
BPF_W   0x0  word
BPF_H   0x1  half word
BPF_B   0x2  byte
BPF_DW  0x3  double word
```
Mode modifier is one of:

```text
BPF_IMM     0x0  immediate
BPF_ABS     0x1  used to access packet data
BPF_IND     0x2  used to access packet data
BPF_MEM     0x3  memory
(reserved)  0x4
(reserved)  0x5
BPF_XADD    0x6  exclusive add
```
Packet data access (BPF_ABS, BPF_IND)
Two non-generic instructions, (BPF_ABS | <size> | BPF_LD) and (BPF_IND | <size> | BPF_LD), are used to access packet data. Register R6 is an implicit input that must contain a pointer to the sk_buff. Register R0 is an implicit output which contains the data fetched from the packet. Registers R1-R5 are scratch registers and must not be used to store data across BPF_ABS | BPF_LD or BPF_IND | BPF_LD instructions. These instructions also have an implicit program exit condition: when an eBPF program tries to access data beyond the packet boundary, the interpreter aborts execution of the program.
```text
BPF_IND | BPF_W | BPF_LD is equivalent to:
  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
```
eBPF maps
eBPF maps are provided for sharing data between kernel and user-space.Currently implemented types are hash and array, with potential extension tosupport bloom filters, radix trees, etc. A map is defined by its type,maximum number of elements, key size and value size in bytes. eBPF syscallsupports create, update, find and delete functions on maps.
Function calls
Function call arguments are passed using up to five registers (R1 - R5). The return value is passed in a dedicated register (R0). Four additional registers (R6 - R9) are callee-saved, and the values in these registers are preserved within kernel functions. R0 - R5 are scratch registers within kernel functions, and eBPF programs must therefore save/restore values in these registers if needed across function calls. The stack can be accessed using the read-only frame pointer R10. eBPF registers map 1:1 to hardware registers on x86_64 and other 64-bit architectures. For example, the x86_64 in-kernel JIT maps them as:
```text
R0  - rax
R1  - rdi
R2  - rsi
R3  - rdx
R4  - rcx
R5  - r8
R6  - rbx
R7  - r13
R8  - r14
R9  - r15
R10 - rbp
```
since the x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing, and rbx, r12 - r15 are callee saved.
Program start
An eBPF program receives a single argument and containsa single eBPF main routine; the program does not contain eBPF functions.Function calls are limited to a predefined set of kernel functions. The sizeof a program is limited to 4K instructions: this ensures fast termination anda limited number of kernel function calls. Prior to running an eBPF program,a verifier performs static analysis to prevent loops in the code andto ensure valid register usage and operand types.
The AMDGPU backend
The AMDGPU code generator lives in the `lib/Target/AMDGPU` directory. This code generator is capable of targeting a variety of AMD GPU processors. Refer to User Guide for AMDGPU Backend for more information.