-------------------------------
Cortex M0 (ARMv6-M):
-------------------------------
Cortex M0 is the simplest, smallest and most popular core used in devices worldwide. NOTE: all registers are 32 bits wide here, even though most inst are only 16 bits wide.

For the Cortex M0 inst set, look in the ARMv6-M Architecture Reference Manual. (For Cortex M3, look in the ARMv7-M Architecture Reference Manual, chapters A4 and A5 (page 85 - page 417) for details and encoding.)

ARMv6-M supports a subset of the T32 ISA: here all inst are 16 bit except these 32 bit inst: BL, DMB, DSB, ISB, MRS and MSR. Not all inst in the T32 ISA are supported by v6-M (it doesn't support the CBZ, CBNZ and IT inst from the T32 ISA, which are supported by v7-M). As can be seen, the T32 ISA (or even the subset shown below) has a lot more inst than the Thumb-1 ISA.

These groups of inst are supported by M0 (total = 86 inst, listed in the ARM Cortex-M0 reference manual):
1. Arithmetic: ADD, ADDS, ADCS (add with carry), SUB, SUBS, SBCS (subtract with carry), MULS, RSBS (same as NEG in Thumb-1)
2. Branch: B, BAL (unconditional), BL, BX, Bxx (conditional, various flavors = BEQ, BNE, BCS/BHS, BCC/BLO, BMI, BPL, BVS/BVC, BHI, BLS, BGE/BGT/BLE/BLT), BLX (link and exchange)
3. Data Xfer (load/store): MOV/MOVS/MVNS, MSR/MRS (MRS=rd special reg, MSR=wrt special reg), unsigned load/store (LDR/STR, LDRH/STRH, LDRB/STRB), signed load (LDRSH, LDRSB), load/store multiple (LDM/STM, LDMIA/STMIA, LDMFD, STMEA, PUSH/POP)
4. Logical: ANDS, ORRS, EORS, BICS (bit clear), ASRS (arithmetic shift), logical shift (LSLS/LSRS), RORS (rotate right), TST, reverse bytes (REV/REV16/REVSH)
5. Bit oriented (compare): CMP/CMN
6. Pack/unpack: SXTB, SXTH, UXTB, UXTH => these inst are not there in 8051.
7. Barrier: DMB/DSB/ISB (memory barrier)
8. Hint: SEV (send event), WFE (wait for event), WFI (wait for interrupt; this inst puts the processor in sleep until a wakeup event happens), NOP
9. Misc: ADR, CPSIE/CPSID (enable/disable interrupt), BKPT (breakpoint), SVC (supervisor call). NOTE: none of these inst are present in the Thumb-1 ISA.
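The pack/unpack inst in group 6 are easy to model on a host. A minimal C sketch of what SXTB/SXTH/UXTB/UXTH compute on a 32 bit reg value (function names are just illustrative, not an ARM API):

```c
#include <assert.h>
#include <stdint.h>

/* S* sign-extend from bit 7 (byte) or bit 15 (halfword);
   U* zero-extend. Each takes and returns a 32-bit register value. */
uint32_t sxtb(uint32_t r) { return (uint32_t)(int32_t)(int8_t)r; }
uint32_t sxth(uint32_t r) { return (uint32_t)(int32_t)(int16_t)r; }
uint32_t uxtb(uint32_t r) { return r & 0xFFu; }
uint32_t uxth(uint32_t r) { return r & 0xFFFFu; }
```

For example, sxtb(0x80) replicates bit 7 into the upper 24 bits, while uxtb(0x80) clears them.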

Inst:
----
B (branch):
-----------
causes branch to target addr (PC) = current_addr + offset. The offset must be an even number (lsb=0), since all inst are halfword aligned (the inst itself is 16 bit in T1/T2 encodings but 32 bit in T3/T4).
 T1 encoding: 16 bit [15:0]: [15:12]=1101, [11:8]=cond, [7:0]=8 bit imm value (the imm8 encodes -128 to +127, and is shifted left by 1, so the offset range is -256 to +254).
 T2 encoding: 16 bit [15:0]: [15:11]=11100, [10:0]=11 bit imm value (offset range -2048 to +2046).
 T3 encoding: 32 bit [31:0]: imm value = {S,J2,J1,imm6,imm11} = -2^20 to (+2^20-2)
   HW1 (lower 16 bits: [15:0]): [15:11]=11110, [10]=S, [9:6]=cond, [5:0]=imm6
   HW2 (upper 16 bits: [31:16]): [15:14]=10, [13]=J1, [12]=0, [11]=J2, [10:0]=imm11
 T4 encoding: 32 bit [31:0]: imm value = {S, NOT(J1 EOR S), NOT(J2 EOR S),imm10,imm11} = -2^24 to (+2^24-2)
   HW1 (lower 16 bits: [15:0]): [15:11]=11110, [10]=S, [9:0]=imm10
   HW2 (upper 16 bits: [31:16]): [15:14]=10, [13]=J1, [12]=1, [11]=J2, [10:0]=imm11
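The T4 imm reassembly can be sketched in C (a hypothetical helper, following the {S, NOT(J1 EOR S), NOT(J2 EOR S), imm10, imm11} formula, with a final shift left by 1 since offsets are halfword aligned):

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruct the 25-bit signed T4 branch offset
   {S, I1, I2, imm10, imm11, 0}, where I1 = NOT(J1 EOR S), I2 = NOT(J2 EOR S). */
int32_t t4_branch_offset(uint32_t S, uint32_t J1, uint32_t J2,
                         uint32_t imm10, uint32_t imm11)
{
    uint32_t I1 = (~(J1 ^ S)) & 1u;
    uint32_t I2 = (~(J2 ^ S)) & 1u;
    uint32_t raw = (S << 24) | (I1 << 23) | (I2 << 22) |
                   (imm10 << 12) | (imm11 << 1);
    return (int32_t)(raw << 7) >> 7;   /* sign-extend from bit 24 */
}
```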

-------------------------------------------------
3 addressing modes: [Rn, offset]: Rn=base reg
1. Offset addressing: offset value is added to/subtracted from the addr in the base reg and used as the addr for mem access. Base reg is unaltered.
2. Pre-index addressing: offset value is added to/subtracted from the addr in the base reg and used as the addr for mem access, but here the base reg is written with the new addr.
3. Post-index addressing: addr in the base reg is used as the addr for mem access, but here the base reg is written with the new value, which is the offset value added to/subtracted from the addr in the base reg.
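The 3 modes can be sketched in host-runnable C, with a byte array standing in for memory and *rn modeling the base reg Rn (function names are illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Word load, offset addressing: Rn is left unaltered. */
uint32_t ld_offset(const uint8_t *mem, uint32_t *rn, int32_t off) {
    uint32_t v;
    memcpy(&v, mem + *rn + off, 4);
    return v;
}
/* Pre-index: base reg updated first, then used for the access. */
uint32_t ld_preindex(const uint8_t *mem, uint32_t *rn, int32_t off) {
    uint32_t v;
    *rn += (uint32_t)off;
    memcpy(&v, mem + *rn, 4);
    return v;
}
/* Post-index: old base addr used for the access, then base reg updated. */
uint32_t ld_postindex(const uint8_t *mem, uint32_t *rn, int32_t off) {
    uint32_t v;
    memcpy(&v, mem + *rn, 4);
    *rn += (uint32_t)off;
    return v;
}
```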



Alignment:
---------
For ARMv6-M, instruction fetches are always halfword-aligned and data accesses are always naturally aligned. So, inst fetch can only be done from an addr whose bottom bit=0. However, in compiled code we'll still see some addresses with the bottom bit set. The bottom bit is used to show that the destination address is a Thumb instruction; it is not treated as part of the address.

Each inst fetch brings in 4 bytes of inst. Since each inst is min 2 bytes, each fetch can bring in 2 16-bit inst, or 1 32-bit inst, or 1 16-bit inst and part of 1 32-bit inst, or parts of 2 32-bit inst. Inst prefetches are done in advance whenever there is an empty slot on the AHB bus. The inst to execute next is calculated as (address_of_current_instruction) + (size_of_executed_instruction) after each inst. If that inst is already prefetched, it's executed, else an inst fetch is done for that address. Inst fetches are done with the "WRITE" line set low. Data accesses are done with "WRITE" set high for writing to mem (inst STRB, STRH, STR), and set low for reading from mem (inst LDRB, LDRH, LDR).

Naturally aligned means that an access will be aligned to its size. So a 4-byte access will be on a 4-byte boundary. So, for STRB/LDRB, STRH/LDRH, STR/LDR, the compiler will generate ld/st with addr aligned on a byte, halfword or word boundary. So, if we do an STRH from addr 0xF1 in C code (because a struct pointer points to that addr), then the compiler will generate code for STRH from addr 0xF2 to make it HW aligned. (It didn't do the str from addr 0xF0, as the compiler will only move to a forward addr and not backward.) Should software attempt an unaligned data access, a fault will be generated.
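The "naturally aligned" check boils down to a simple bit test; a minimal sketch (access sizes assumed to be powers of two):

```c
#include <assert.h>
#include <stdint.h>

/* An access of size 2^k is naturally aligned iff the low k address
   bits are zero, i.e. addr % size == 0. */
int is_naturally_aligned(uint32_t addr, uint32_t size) {
    return (addr & (size - 1u)) == 0u;
}
```

For example, a halfword (2-byte) access at 0xF1 is unaligned, while at 0xF2 it is aligned.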

For C structs, the C standard says that a struct has the same alignment as its most aligned member. So, if the struct has something defined as uint32_t, then its starting addr will be word aligned. But individual members of the struct will then be aligned as per the member's data size.
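A sketch of this rule, assuming a typical ABI where uint32_t is 4-byte aligned (the struct name and members are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Most-aligned member is uint32_t, so the struct is word-aligned and
   padding is inserted so each member meets its own natural alignment. */
struct sensor {
    uint8_t  id;      /* offset 0 */
    /* 3 bytes of padding so 'count' lands on a word boundary */
    uint32_t count;   /* offset 4 */
    uint16_t flags;   /* offset 8 */
    /* 2 bytes of tail padding so arrays of 'sensor' stay word-aligned */
};
```

offsetof() and sizeof make the padding visible: on a typical ABI, count sits at offset 4 and the struct occupies 12 bytes.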

 

Cortex M0 pins:
-------------
i/p pins
----
reset pins: PORESETn(for jtag/sw), DBGRESETn, HRESETn, SYSRESETREQ(o/p)
clks: FCLK(for WIC), SCLK, HCLK, DCLK
irq: IRQ[31:0]
scan: SE (scan enable H during shifting and L during normal op) RSTBYPASS (bypasses internal reset sync so that ATPG tool can have controllability of reset flops)

o/p pins:
----
HALTED: indicates that uP is in debug state.
LOCKUP: indicates that uP is in architected lockup state, as the result of an unrecoverable exception.
SLEEPING: indicates that uP is idle, waiting for interrupt on either IRQ[31:0], NMI or internal Systick (int entered due to WFI), or HIGH on RXEV (int entered due to WFE)
DEEPSLEEP: active when SLEEPDEEP bit in SCR set to 1. and SLEEPING is HIGH.
WAKEUP:

AHB bus:
i/p: HADDR[31:0], HBURST[2:0], HMASTLOCK, HPROT[3:0], HSIZE[2:0], HTRANS[1:0], HWDATA[31:0], HWRITE
o/p: HRDATA[31:0], HREADY, HRESP

M0 debug i/f can be configured for SW i/f or JTAG i/f but not both.
-----
debug: CDBGPWRUPACK(i/p), CDBGPWRUPREQ(o/p)
serial wire i/f: SWCLKTCK (sw clk), SWDITMS (sw data), SWDO (o/p), SWDOEN (swd o/p pad ctl signal)
JTAG i/f:
i/p: nTRST, TDI, SWCLKTCK (jtag TCK), SWDITMS (jtag TMS)
o/p: TDO, nTDOEN (jtag TDO o/p ctrl signal)

 

ISA:

Any ISA needs a minimum of 3 inst types: load/store inst, conditional branch, and logical (and|or|not) functions. Load/store inst are needed to move contents from one place to another. Branch inst are needed to implement if-else conditions, which form the basis of doing intelligent things based on conditions. Logical functions are needed to implement any arithmetic/logical operations such as add, multiply, etc (since basic gates as and, or, not are sufficient to implement any logic function). However, such an ISA would require very large code size to do simple operations like ADD, SUB, CMP, etc. So, we extend the ISA to include more commonly used inst, so that code size and compute cycles are reduced.

ARM has supported multiple inst sets, and over so many products, it became very confusing. Details of the inst sets are provided in the ARM Architecture Reference Manual (ARM ARM).

Initially the ARM inst set (called the ARM ISA or A32) was the only ISA in ARM processors, which was a fixed 32 bit inst set. Bits 27 to 20 stored various opcodes, while other fixed bits stored source/dest reg numbers. There are 16 registers in user space (R0-R15). However, code size for the ARM ISA is large due to inst being 32 bit, even though it has good perf with low power. So, later 16 bit inst (called the Thumb ISA) were added to improve code density. The 1st popular chip to include the Thumb ISA was ARM7TDMI (the T in the name implies Thumb ISA). Initially the Thumb ISA included only 16 bit inst, but later more inst were added to both the ARM ISA and the Thumb ISA, resulting in a few 32 bit inst in the Thumb ISA. The ARM ISA, being fixed inst size, was RISC style, while the Thumb ISA, having both 16 and 32 bit inst sizes, was more CISC style.

Thumb ISA: these were mostly 16 bit. To be able to use Thumb inst in ARM processors (which support the ARM ISA), there is a decompressor in ARM hardware, which decompresses the 16 bit Thumb inst into 32 bit ARM inst, which are then passed on to the ARM instruction decoder. Each 16 bit Thumb inst had an equiv 32 bit ARM inst. The main motivation for introducing the Thumb ISA was to reduce code size, by encoding the most commonly used 32 bit ARM instructions in 16 bits. This was very useful in embedded designs, where memory is limited and expensive. Thus the inst length became variable: while most inst were 16 bit, a few inst were encoded in 32 bit, where a 16 bit encoding wasn't possible (but the inst was needed to improve cycle time). There were 2 sets of Thumb ISA introduced.

  • Thumb-1: had 35 inst, of which only the 'BL' inst is 32 bit. All 16 bit inst can only access the lower 8 registers (R0-R7), since there is only a 3 bit encoding for registers. Since most inst were 16 bit, this reduced code size by 30% and reduced I-cache misses, but increased cycle counts. The solution was to blend Thumb-1 and ARM inst, where critical sections of code were written in the ARM ISA and the rest in the Thumb-1 ISA. This gave rise to the Thumb-2 ISA. Loosely speaking, "Thumb ISA" usually refers to the Thumb-1 ISA.
  • Thumb-2: So, the Thumb-2 ISA was introduced, which had all of the Thumb-1 ISA + some new 16 bit inst for code size wins. A few new 32 bit inst were also added (DMB, DSB, ISB, MRS, MSR, BL/BLX). On top of this, 32 bit equiv inst for the corresponding 16 bit inst were also provided. Thus Thumb-2 provided most of the ARM inst too. This resulted in a total of 100's of inst. Virtually all inst available in the ARM ISA (with the exception of a few) were now available in Thumb-2, which allowed a unified assembly language (UAL) for ARM and Thumb-2, which can then be compiled to generate binaries for either ISA. Thumb-2 is also known as the T32 ISA (Thumb 32 bit ISA). T32 provided the flexibility to programmers to code sections of their pgm in 16 bit as well as 32 bit, depending on whether code size or performance was more important. There is no mode switch b/w 16 bit and 32 bit; all T32 inst (whether they are 16 bit or 32 bit) are decoded by the ARM core the same way, to generate internal ARM 32 bit inst. So, "Thumb-2" is a confusing name; instead we'll use T32 for it. NOTE: T32 or Thumb-2 includes all 16/32 bit Thumb inst, as well as equiv 32 bit ARM inst. Thumb-2 was introduced in Cortex M3, which was the 1st Cortex processor. Thumb-2 kind of unified the Thumb-1 ISA and ARM ISA into one, which allowed Cortex processors to run in 1 operation state, instead of switching b/w ARM state (when running the 32 bit ARM ISA) and Thumb state (when running the 16 bit Thumb ISA). This was a huge advantage in terms of perf for Cortex cores.

UAL (unified assembly language): This assembly language syntax is for ARM assembly tools. T32 had its own assembly language syntax, while A32 had its own. This caused confusion. Since most inst were almost the same b/w T32 and A32, UAL was developed to allow both ISAs to have the same assembly language syntax. This allowed easier porting b/w the 2. We'll use the new UAL syntax for any assembly language code. The "THUMB" directive in an assembly file indicates that the code is in UAL syntax (the "CODE16" directive implies it's in traditional Thumb syntax). Since most inst in T32 have 16 bit and 32 bit variants, compilers choose which variant to use to generate assembly code. The suffix ".W" after any inst indicates the 32 bit inst (W=wide), while ".N" indicates the 16 bit inst (N=narrow). If no suffix is provided, then the assembler can choose b/w 16 bit or 32 bit (but usually defaults to 16 bit to get smaller code size).


Since Thumb-2 was required to be backward compatible with Thumb-1 (meaning any code in Thumb-1 should run on a Thumb-2 m/c), this implied that the 16 bit inst from Thumb-1 could not be changed. The trick was to make the processor recognize new 16 bit inst as well as new 32 bit inst. On looking at the original Thumb-1 ISA, it was seen that bits [15:13] of only 2 inst were 111. These 2 inst were "B" (unconditional branch), which was a 16 bit inst, and "BL" (long branch with link), which was a 32 bit inst. So, to accommodate 32 bit inst, the 3 MSB [15:13] of a 16 bit inst were used to indicate if it was a 32 bit inst. The process is as follows:

Look at bits [15:13] of the first halfword (HW): If it's anything other than "111", it's a current 16 bit Thumb inst. If it's "111", it may be B, BL or some other new inst. Now, look at bits [12:11]:

1. 00 => If bits[12:11]=00, it's the current Thumb-1 unconditional branch (B), which is a 16 bit inst.

2. 01, 10, 11 => If bits[12:11]=anything else, it's a Thumb-2 32 bit instruction (including BL, which was the Thumb-1 32 bit inst).

New Thumb-2 inst which were 16 bit were encoded in the remaining unused 16 bit encodings.
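The halfword test above can be sketched as a small C helper (0xE000 masks bits [15:13] and 0x1800 masks bits [12:11]):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if this halfword begins a 32-bit Thumb inst:
   bits[15:13] == 0b111 AND bits[12:11] != 0b00
   (bits[12:11] == 00 would be the 16-bit unconditional B). */
int is_32bit_thumb(uint16_t hw1) {
    return (hw1 & 0xE000u) == 0xE000u && (hw1 & 0x1800u) != 0u;
}
```

For example, 0xF000 (the first halfword of a BL) is classified as 32 bit, while 0xE000 (B) and 0x4770 (BX LR) are 16 bit.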


NOTE: Thumb instruction execution enforces 16-bit alignment on all inst.  This means that 32-bit inst are treated as two halfwords, hw1 and hw2, with hw1 at the lower address. So, 32 bit inst is as follows:
Data: 31:24  23:16  15:8  7:0 => HW2=[31:16], HW1=[15:0]
Addr:  A+3    A+2    A+1   A  => Addr can only be 16 bit aligned, so lsb of Addr is always 0.

Thumb-2 introduced the CBZ (compare and branch on zero) inst, which previously required 2 separate inst. It also introduced predication: the if-then (IT) inst, which causes the next 1-4 inst in memory to be conditional. Thumb-2 performance was 98% of ARM perf, and code size was 30% less than the ARM ISA. So, Thumb-2 became the ISA of choice.

Later, an ARM ISA for 64 bit (aka A64) was added to the inst set. So, in a nutshell, the A64 ISA is for 64 bit processors, while the A32/T32 ISAs are for 32 bit processors. When we talk about the Thumb ISA, we'll mean the Thumb-2 ISA, or refer to it as T32. There is no more Thumb-1 ISA; it's all Thumb-2 or T32. A32 is the other 32 bit ISA, used in the A and R profiles. A32 and T32 are almost the same ISA, with the processor decompressing T32 into A32 internally. The T32 ISA is just the compressed version of the A32 ISA to save memory space (where some 32 bit inst from A32 were encoded as 16 bit inst in T32). Thus there is not much diff b/w the T32 and A32 ISAs.

Thumb-1 Instructions:

There are 35 total Thumb-1 inst (34 are 16 bit while 1 is 32 bit). Out of the 16 bits, a few msb bits are used for opcode encoding, while the lower bits specify reg, const, etc. There are 19 instruction formats for these 35 inst. Instruction format refers to the opcode and reg num location (i.e. the same opcode ADD may appear in 3 or 4 formats, depending on whether it's adding 2 reg, or adding a constant number, etc). Below we list all 35 inst based on their type, and NOT on their format (all inst listed in the ARM7TDMI manual):

1. Arithmetic: 6 inst = ADD, ADC (add with carry bit), SUB, SBC (sub with carry bit), MUL (multiply 2 reg), NEG (2's complement, Rd=-Rs). These arithmetic ops can be b/w 2 reg or b/w a reg and a constant.

ADD has 4 formats:

- add 9 bit signed constant to stack pointer

- add 10 bit constant to either PC or SP, and load the resulting addr into a reg. So, this is a "load addr" instead of a "load data"

- add 8 bit constant to one reg and store in another reg

- add 2 reg

NEG: do negative of 1 reg and store in other reg, i.e Rd = -Rs

2. load from mem: 7 inst = LDRB (load byte), LDRH (load half word or 2 bytes), LDR (aka LDRW, load full word or 4 bytes), LDM/LDMIA (load multiple), LDSB (load sign extended byte), LDSH (load sign extended half word), POP. NOTE: load/store inst are not there in 8051; the move inst in 8051 does the load/store func. The move inst in ARM moves from reg to reg, not from/to mem.

LDMIA: load multiple reg  (only 8 reg possible from R0 to R7, since 8 bits allocated in inst) from contents of mem, specified by addr contained in a base reg (3 bits allocated for base reg, 000=R0 ... 111=R7).

POP: pop reg specified by the list (optionally LR also, depending on opcode bit for LR), from the stack in mem (i.e load contents from stack mem to reg). Only 8 reg possible in the list from R0 to R7, since 8 bits allocated in inst. used during function/subroutine calls

3. store to mem: 5 inst = STRB, STRH, STR (aka STRW or store word), STM/STMIA (store multiple), PUSH. These store don't have sign extended version as in load. These inst same as those of load above, except they store to mem (instead of loading from mem)

PUSH: same as pop, except it does push of reg contents to the stack (i.e store contents from reg to stack mem). used during function/subroutine calls

4. move from reg to reg: 2 inst

MOV (move one reg to another reg, or move constant to another reg),

MVN (move NOT of one reg to another reg),

5. logical: 8 inst = AND, ORR (or), EOR (xor), LSL (logical shift left, <<), LSR (logical shift right >>), ASR (arithmetic shift right), ROR (rotate right), BIC,  TST (AND test)

BIC: bit clear = AND NOT of 2 reg, i.e Rd = Rd AND NOT Rs

TST: (AND test) = set condition codes (N,Z,C,V flags in PSR reg) on Rd AND Rs; so similar to AND, but it only sets condition codes (the result is discarded)

6. Branch: 4 inst = B, Bxx, BL, BX. branch may be conditional or unconditional.

B: unconditional PC relative branch. offset is bit[10:0], so 11 bits, but it's shifted left (<< 1) by one, since addr is always HW aligned. So, the offset actually becomes 12 bit 2's complement offset, so range of addr that can be jumped to is PC +/- 2048 bytes.

BX: branch indirect. performs unconditional branch to addr specified in LO or HI reg.

BL: long unconditional branch with link. This is the only 32 bit inst in Thumb-1. This is similar to B, except that the offset is a 23 bit 2's complement value, where the upper 11 bits of the offset are stored in the 1st 16 bit half, and the lower 11 bits in the 2nd 16 bit half (and then the whole thing is shifted left by one, so the offset becomes 23 bits). The addr of the inst following the BL is placed in LR, so that at the end of the subroutine, the PC can return back to where it was.

Bxx: This performs a conditional branch depending on the state of the CPSR condition codes. The condition codes N,Z,C,V can each be set or clear. So, 8 opcodes are allocated to each bit (N,Z,C,V) either set or clear. The remaining 6 opcodes are allocated to combinations of set/clear bits. There are 4 bits for the opcode, from 0000 to 1111, but only 14 opcodes are coded (BEQ, BNE, BCS, BCC, etc).
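The 14 condition tests map onto the N,Z,C,V flags as below; a host-runnable C sketch (encoding values follow the standard ARM condition field, with cond 0xE as the always case):

```c
#include <assert.h>

/* Evaluate an ARM condition code against the N,Z,C,V flags. */
int cond_pass(unsigned cond, int N, int Z, int C, int V) {
    switch (cond) {
    case 0x0: return Z;             /* EQ: equal */
    case 0x1: return !Z;            /* NE: not equal */
    case 0x2: return C;             /* CS/HS: carry set / unsigned >= */
    case 0x3: return !C;            /* CC/LO: carry clear / unsigned < */
    case 0x4: return N;             /* MI: negative */
    case 0x5: return !N;            /* PL: positive or zero */
    case 0x6: return V;             /* VS: overflow */
    case 0x7: return !V;            /* VC: no overflow */
    case 0x8: return C && !Z;       /* HI: unsigned > */
    case 0x9: return !C || Z;       /* LS: unsigned <= */
    case 0xA: return N == V;        /* GE: signed >= */
    case 0xB: return N != V;        /* LT: signed < */
    case 0xC: return !Z && N == V;  /* GT: signed > */
    case 0xD: return Z || N != V;   /* LE: signed <= */
    default:  return 1;             /* AL: always */
    }
}
```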

7. compare b/w 2 reg: 2 inst = CMP, CMN.

CMP: this inst used in 3 ways: compare b/w 2 reg and set condition flag, or subtract 2 reg, or compare b/w reg and constant. These all set condition flag (N,Z,C,V flag in PSR reg)

CMN: add 2 reg and set condition flag

8. software interrupt: 1 inst = SWI (it's not a hardware interrupt). It causes the processor to switch to ARM state and enter supervisor (SVC) mode. It loads the SWI vector addr (addr 0x08 in the vector table) into the PC. This 16 bit inst has an 8 bit comment field which can be used by the SWI handler; it is ignored by the processor.

Thumb-2 instructions: These include all of the Thumb-1 ISA + new 16 bit inst + new 32 bit inst + 32 bit equiv inst for all 16 bit inst. The total inst count is over 100. Some inst here are part of the v7-M arch only (i.e. they are not supported in the v6-M arch).

1. Arithmetic: 16 bit

ADR

SDIV/UDIV = signed/unsigned divide

CPY

RSB

REV/REV16/REVSH => reverse byte order (individual bytes are not reversed or modified). REV is for the full word, REV16 reverses the bytes within each half word (both half words are reversed separately), while REVSH reverses the bytes of the lower half word, and then sign extends the result with the MSB.
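These three can be modeled in C (a host-runnable sketch; REVSH's result is shown as the sign-extended 32 bit reg value):

```c
#include <assert.h>
#include <stdint.h>

/* REV: reverse all 4 bytes of the word. */
uint32_t rev(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}
/* REV16: reverse the bytes within each halfword separately. */
uint32_t rev16(uint32_t x) {
    return ((x >> 8) & 0x00FF00FFu) | ((x << 8) & 0xFF00FF00u);
}
/* REVSH: reverse the bytes of the lower halfword, then sign-extend. */
uint32_t revsh(uint32_t x) {
    uint16_t hw = (uint16_t)(((x & 0xFFu) << 8) | ((x >> 8) & 0xFFu));
    return (uint32_t)(int32_t)(int16_t)hw;
}
```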

RBIT => reverses bit order in data word. Useful for processing serial bit streams in data communication, where the entire stream needs to be reversed

BFC/BFI => bit field clear(BFC), bit field insert (BFI)

SBFX/UBFX => signed and unsigned bit field extract

SXTB/SXTH, UXTB/UXTH = used to extend a byte/HW into a word. S=sign extend with the MSB (bit [7] for byte or bit [15] for HW), U=unsigned; the value is 0 extended to 32 bits

2. barrier instructions: new 32 bit inst, to force a memory/inst barrier. A barrier forces all mem accesses/inst before it to complete, before allowing mem accesses/inst coming after it to complete. This may be needed in complex mem systems, where out of order execution can cause race conditions. All 3 inst below can't be coded in a high level language, so they are accessed via functions defined in a CMSIS compliant device driver library, i.e. void __DMB(void); //function defn for DMB inst

DMB: data mem barrier. It forces all mem access before it to complete, before new mem access can be done. This is helpful in multi processor systems, where shared mem is used.

DSB: data sync barrier. It forces all mem access before it to complete, before allowing inst coming after it to complete.

ISB: inst sync barrier. It forces all inst before it to complete, before allowing inst coming after it to complete.
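A typical use of DMB is publishing data to a flag watched by another bus master. In the sketch below, __DMB() is stubbed as a compiler-only barrier so it compiles and runs on a host; on a real Cortex-M build it would come from the CMSIS core header instead, and the stub shows only the ordering intent, not the hardware barrier itself:

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-in for the CMSIS __DMB() intrinsic: a compiler-only barrier.
   On a Cortex-M part, include the CMSIS core header and drop this macro. */
#define __DMB() __asm__ volatile("" ::: "memory")

static uint32_t shared_data;
static volatile uint32_t data_ready;

/* Producer: ensure the data store is observable before the flag store. */
void publish(uint32_t v) {
    shared_data = v;
    __DMB();            /* order the data store before the flag store */
    data_ready = 1u;
}
```

A consumer would poll data_ready and, once it reads 1, can safely read shared_data.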

3. move inst to rd/wrt special reg:

MRS: move contents of special reg (i.e APSR, IPSR, PSR, MSP, PSP, etc) to general purpose reg. This causes rd of special purpose reg.

MSR: move contents of general purpose reg to special reg. This causes wrt of special purpose reg. MRS is used in conjunction with MSR as part of rd-modify-wrt seq (ex: to update a PSR to clear Q flag)

4. hint inst: 16 bit

SEV = send event, causes an event to be signaled to all processors within a multiprocessor system. It also sets local event reg to 1.

WFE = sleep and wait for event,

WFI = sleep and wait for interrupt. this inst puts processor in sleep until wakeup event happens,

NOP = no operation

5. branch: 16 bit

CBZ

CBNZ

BLX = branch indirect with link. The immediate form of BLX (which switched to ARM state) is unsupported here, but existed in traditional ARM processors; the register form (branch to addr in a reg, with link) is supported.

IT = if-then. Allows up to 4 succeeding inst to be conditionally executed. It avoids branch penalties, as there is no change to pgm flow.

6. misc: 16 bit

SVC = supervisor call. causes SVC exception

BKPT = breakpoint

CPS (CPSIE/CPSID) = change processor state

7. All 32 bit equiv inst for the 16 bit inst above (these include 16 bit inst from Thumb-1 as well as from Thumb-2; ex: if the total number of 16 bit inst were 50, then there are 50 equiv 32 bit inst in the Thumb-2 ISA)

ARM instructions:

A32 is pretty similar to T32, except that it has no 16 bit inst. So, it can be considered a subset of T32 with some minor changes. A64 is a more complex ISA, and competes with x86_64.

 

ARM profile:

Starting in the early 2000s, ARM decided to diversify its product portfolio to cater to different segments of the market. It developed architecture version 7 (called v7), and defined 3 profiles, each catering to a different market segment. These became the 3 ARM profiles: A=Application profile, R=Realtime profile and M=Microcontroller profile. It branded the families of processors in these 3 profiles as "Cortex" processors. The arch version before v7 was v6, which was used in the ARM11 family of processors.

ARM's very first Cortex core was the 32 bit Cortex-M3, introduced in 2004. It's a microcontroller core (M), for embedded use in microcontrollers. It was based on the v7-M arch. Cortex-M0 and Cortex-M1 were designed later with the smallest instruction set possible for the Cortex family, to become the smallest silicon die (a Cortex-M0 core can be designed in less than 10K gates, where a gate refers to a 2 i/p NAND gate, which is very impressive; in contrast, modern x86 processors have billions of gates). M0 and M1 were based on the older v6-M arch, which evolved from the v6 arch. These are the only 2 Cortex processors from the v6 arch; the rest of the Cortex processors are from arch v7 and above.

Then in 2005, it came up with 32 bit Application core (A), known as Cortex-A8. It was basically a full blown processor core for use in high performing SOC. The Cortex-A8 was the first Cortex design to be adopted on a large scale in consumer devices. In 2012, it introduced 1st 64 bit cores. It introduced the powerful Cortex-A57 core, and the energy efficient alternative Cortex-A53 core.

In 2011, it introduced Cortex real time core (R), known as Cortex-R5. These are optimized for hard realtime and safety critical applications. It's similar to A profile, but adds features which make it more fault tolerant.

 M profile is 32 bit, and is the most popular core present in billions of embedded devices. 64 bit A profile is most popular in consumer hand held devices as phones, tablets, wearables, etc.

These are the 3 profiles:

1. Application profile     (A): Cortex A5, A8, A9, A15, A32, A53 thru A77 series used for complex OS and user apps. It is the only profile to support a 64 bit ISA. A5, A8, A9, A15, A32 are 32 bit, while A53 onwards are 64 bit. Around the late 2000s, more and more applications were developed for 64 bit processors from Intel/AMD, and ARM had to move to 64 bit. With 64 bit, ARM moved to the newer v8 architecture. All these A profiles support the virtual memory system arch (VMSA) based on a memory management unit (MMU). This is essential, as some OS require the presence of an MMU in order to work. These 64 bit ARM processors are the ones used in most of the SOC chips you see in phones, watches, handhelds, etc. These easily run over 1GHz clk frequency. For ex, the Raspberry Pi Broadcom SOC has an A53 in Pi 3, and an A72 in Pi 4.

Many companies now prefer to design their own cpu based on the v8-A arch by licensing the arch, instead of licensing RTL for Cortex A cores. This is done to differentiate their products, as anyone using a standard Cortex core from ARM can't be much better than a competitor using the same Cortex ARM core. For ex, the SOCs in the earliest iPads and iPhones used Cortex A8 and A9 cores from ARM, but starting with the iPhone 5, Apple started designing their own cores based on the ARM v7-A arch, called "Swift" (it had 2 ARM cores). In the iPhone 5S and later iPad Air and Mini, Apple designed an in house 2 core processor called "Cyclone" based on the ARM v8-A arch. Later SOCs from Apple had multiple performance cores and multiple efficiency cores on the same chip (for ex, the A12X Bionic SOC from Apple has four high-performance cores and four high-efficiency cores).

More details on various ARM v8-A cores: https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores

More info on Apple processors here: https://en.wikipedia.org/wiki/Apple-designed_processors

A profile supports both ARM (A64, A32) and THUMB (T32) inst set.

2. Real Time profile       (R): Cortex R4-R8 series used for real time apps in embedded systems. Has an FPU for high perf. These support the protected memory system arch (PMSA) based on a memory protection unit (MPU).

It supports both ARM (A32) and THUMB (T32) inst set.

3. MicroController profile (M): Cortex M0-M3 series used for embedded microcontrollers (instead of 8051). M3 was the very first Cortex core (2004), followed by M1, M0 and then M0+ (2012). There are other less commonly used variants as M4, M7, M23, etc. The M profile supports only the Thumb ISA (no support for the ARM ISA). It supports 2 variants: Thumb-1 and Thumb-2. Thumb-2, being a superset of Thumb-1, is what is supported by almost all M profile cores, so Thumb-1 is more or less obsolete.
Cortex M0/M1: supports a subset of the Thumb-2 (T32) inst set.
Cortex M3: supports a more complete set of the Thumb-2 (T32) inst set. More inst are supported here (mostly 32 bit equiv for 16 bit inst), as it's more powerful.

So, the T32 ISA is the most widely used (as it's used in the M profile), so we'll concentrate on this. A32 and A64 are more complex ISAs with a lot more instructions. NOTE: when we say that a Cortex core supports the T32, A32 or A64 ISA, it supports only a subset of that ISA (not every instruction in it). We need to look in the reference manual for that processor to know exactly what inst it supports.

ARM architecture:

Besides the 3 profiles and different ISAs, ARM defines diff versions of the arch. These arch are still evolving. A few ARM arch are: ARMv4T, ARMv5E, ARMv6, ARMv6-M, ARMv7-A, ARMv7-R, ARMv7-M, and ARMv8-A => architectures get more complex, but faster (with pipelining), as we go from v4T to v8. Arch v4T thru v6 were used in classic ARM cores, and so are not relevant for the Cortex family. The suffix A, R or M after the version num indicates in which profile that arch is used. Starting from version v7, arch were defined per profile. These arch capture the ISA and profile info, so each Cortex processor is tied to one of these arch:

ARMv6-M = For M profile. implemented in Cortex M0/M0+/M1. Simplest arch. These are the only cortex cores that are based on v6. All other cortex cores based on v7/v8 arch.

ARMv7-M = For M profile. implemented in Cortex M3

ARM v7-R = For R profile. implemented in Cortex R4 thru R8

ARM v7-A = For A profile. implemented in Cortex A8 thru A17 which were all 32 bit

ARM v8-A = For A profile. Implemented in Cortex A32, and A53 thru A77, which are all 64 bit (except A32, which is 32 bit but still has the v8 arch). Most complex arch. Minor revisions as v8.1, etc have been released. This competes with the most advanced processors developed by Intel/AMD.

So, M profile uses v6, v7. R profile uses v7, while A profile uses v7,v8.

General ARM arch: All reg in the ARM arch are 32 bit, irrespective of ISA. The ARM arch is RISC: it has a uniform register file, and a register load/store model.

13 general purpose registers - R0 to R12

R0 to R7 are LO reg, and can be accessed by all 16 bit Thumb inst and all 32 bit Thumb-2 inst.

R8 to R12 are HI reg, and can be accessed by all 32 bit Thumb-2 inst, but not by all 16 bit Thumb inst.

3 special meaning register -R13, R14, R15

R13  = can be MSP=Main stack pointer or PSP=Process stack pointer; only one of these 2 reg can be accessed at any time. Usually we would expect only 1 SP, but having 2 SPs allows 2 separate stack memories to be set up. MSP is the default SP and can be used by any code that requires privileged access, while PSP is used by unprivileged code. So, MSP is used by the OS kernel as well as exception handlers, while PSP is used by user application code. Since PUSH and POP operations are always word aligned (i.e. addr = 0x0, 0x4, 0x8, etc), SP has its lowest 2 bits tied to 00. Having 2 SPs prevents a stack error in a user application (thread mode) from corrupting the stack used by the OS (handler mode).

R14  = LR=Link reg. Used to store the return pgm counter when a subroutine/function is called. Even though PC bit 0 is always 0, LR bit 0 is readable/writable and is therefore not guaranteed to be 0. This LSB set to 0 indicates ARM state, while 1 means Thumb state.

R15  = PC=pgm counter. Bit 0 of PC is always 0, as inst addr are half-word aligned. However, in branching, either by writing to PC or using a branch inst, the LSB of the target addr is always set to 1 to indicate Thumb state operation. Setting it to 0 will cause a switch to ARM state, which may not be supported, causing a fault exception. NOTE: even though the LSB is written as 1, the branch takes place with LSB=0, as the LSB bit is tied to 0 and can't be changed.

A few special purpose reg: These reg can only be accessed via the MRS and MSR inst. These reg are not mapped to mem; they are just like the R0-R15 reg.

PSR = pgm status reg. Divided into 3 = APSR (application psr), IPSR (intr psr), EPSR (execution psr)

Interrupt mask reg = PRIMASK, FAULTMASK, BASEPRI, etc

Control reg = CONTROL. Bit 1 of this reg controls which SP is used in thread mode: if it's 1, PSP is used in thread mode; if 0, MSP is used for both thread and handler mode.

memory map: Since the addr is 32 bits for T32/A32, addr from 0x0000_0000 to 0xFFFF_FFFF can be accessed, resulting in 4GB of mem. This is divided into several segments. Each segment has particular attributes, like whether it can be written, can be cached, etc.

1. 0x0000_0000 to 0x1FFF_FFFF (code segment) => Bottom 0.5GB is for pgm code. This is where we have flash mem or ROM to store our whole pgm

2. 0x2000_0000 to 0x3FFF_FFFF (on chip sram segment) => Next 0.5GB is for pgm data. This is where we have sram or volatile mem to rd/wrt our data (as stack, var, etc)

3. 0x4000_0000 to 0x5FFF_FFFF (on chip peripheral segment) => Next 0.5GB is for peripheral devices. This is where all AHB/APB reg for all peripheral devices are stored

4. 0x6000_0000 to 0x9FFF_FFFF (off chip sram segment) => Next 1.0GB is for external volatile mem. Mem regions 1,2 and 4 above are the only ones from where code execution is allowed

5. 0xA000_0000 to 0xDFFF_FFFF (off chip peripheral segment) => Next 1.0GB is for external peripheral devices

6. 0xE000_0000 to 0xFFFF_FFFF (system segment) => Last 0.5GB is for mem mapped reg. This contains all system reg (i.e IPR, SCR, CPUID, etc), ROM tables, and some vendor specific area. Except for small part of this space, most of this mem space transactions don't appear as rd/wrt on AHB bus, as all these reg are in NVIC which is internal to processor. The processor supports only word size accesses in the range 0xE0000000 - 0xEFFFFFFF.
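The six segments above reduce to a small decode function (a sketch; the region numbers follow the list above, and mem_region is a made-up name):

```c
#include <stdint.h>

/* map a 32-bit addr to one of the six ARMv6-M memory map segments */
int mem_region(uint32_t addr) {
    if (addr <= 0x1FFFFFFFu) return 1;  /* code (flash/ROM) */
    if (addr <= 0x3FFFFFFFu) return 2;  /* on-chip SRAM */
    if (addr <= 0x5FFFFFFFu) return 3;  /* on-chip peripheral */
    if (addr <= 0x9FFFFFFFu) return 4;  /* off-chip SRAM */
    if (addr <= 0xDFFFFFFFu) return 5;  /* off-chip peripheral */
    return 6;                           /* system (NVIC, system reg, etc) */
}
```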

The processor contains a bus matrix that arbitrates the processor core and optional Debug Access Port (DAP) memory accesses to both the external memory system and to the internal NVIC and debug components. Transactions are routed as follows:
1. All accesses below 0xE0000000 or above 0xF0000000 appear as AHB-Lite transactions on the AHB-Lite master port of the processor.
2. Accesses in the range 0xE0000000 to 0xEFFFFFFF are handled within the processor and do not appear on the AHB-Lite master port of the processor.

NVIC: Nested Vectored Interrupt Controller: NVIC provides nested intr support, i.e. intr can be programmed to different priority levels, and depending on priority levels, a new intr can preempt the currently running intr. Whenever the processor gets an intr, it jumps to the appr intr handler and executes that code. The addr of the intr handler is stored in the vector table.

The very bottom of the code segment of mem has this vector table. Addr 0x0000_0000 holds the initial value of MSP. From addr 0x0000_0004 onwards, we store the jump addr for exception #1 upward. Addr 0x0000_0004 stores the jump addr for the reset exception, addr 0x0000_0008 stores the jump addr for the NMI (non maskable interrupt) exception, and so on. Up to exception #15 are system exceptions generated due to some system event or error. Exception #16 onwards are external intr, which are activated when an external intr line is pulled active by an external device. Cortex-M0 supports up to 32 external intr lines (other Cortex-M cores support more, up to around 240 on M3/M4). Depending on which one is activated, the processor jumps via that entry in the vector table, which has the addr of that exception handler stored there.
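The vector table addressing described above is simple arithmetic (a sketch with made-up helper names):

```c
#include <stdint.h>

/* entry n of the vector table lives at addr 4*n: entry 0 holds the
   initial MSP, entry 1 the reset handler addr, entry 2 the NMI handler
   addr, and so on; external IRQ n is exception number 16+n */
uint32_t vector_addr(unsigned exception_num) { return 4u * exception_num; }

unsigned irq_to_exception(unsigned irq_num)  { return 16u + irq_num; }
```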

--------

MicroProcessor vs Microcontroller:

Companies like Intel and AMD make microprocessors, while TI, ST, Microchip, etc make microcontrollers (ARM itself licenses processor cores rather than making chips). The core component of both of these is the processor core. It's the unit that runs instructions and performs computations. A processor core by itself is not very helpful, as it needs a memory from which to fetch instructions, memory to write results, as well as connections to other peripherals.

A microprocessor is an advanced processor core with the bare minimum connections to make it functional. It will usually have a memory connection (DDR2, DDR3, etc) and a high speed communication channel (i.e PCIE, HyperTransport, etc), which allows it to be connected to any peripheral that supports that communication protocol. Thus, a microprocessor is not a complete system by itself, but needs memory and other peripheral connections in order to operate. It is fabricated as a chip, and sold to be used in bigger systems.

A microcontroller on the other hand contains a simple processor core, along with memory and a peripheral bus to which all peripherals can be connected. Thus it is a complete system by itself. It is fabricated as a chip, and depending on what peripherals it has, it can be used for that specific purpose.

Processor Types:

In digital hardware design, processors are the backbone. They are present in most digital chips as embedded processors. They are also sold as standalone chips.

Since most digital designs involve some form of micro-controller or micro-processor, we will start with these CPU cores. These CPU cores run on their own instruction set (ISA), which tells the CPU core what to do next. Big companies designing CPU cores have their own proprietary ISA. Intel/AMD have the x86 ISA, ARM has the Thumb/ARM ISA, TI has the MSP430 ISA and so on. We can't use these ISA in our CPU designs freely without violating some patent. However, there are still open source implementations of CPU cores around these ISA. Open source implementations of these ISA have GPL/BSD licenses, so should be safe to use. To see all such open source processors, check this link: http://parallel.princeton.edu/openpiton/open_source_processors.php.

Fortunately, there is an open source ISA available that you are free to use in your designs. It isn't derived from any of these proprietary ISA, and hence is going to be free forever. It's the RISC-V ISA, co-designed by the famous David Patterson (author of the Computer Architecture book that is used worldwide in graduate computer architecture courses). More info here: https://en.wikipedia.org/wiki/RISC-V

Website for RISC-V is here: https://riscv.org/

Many cores have been implemented using this ISA. You can download source code for Cores, SOC or even buy the processors which are built using this ISA. Link to download here: https://github.com/riscv/riscv-cores-list

Outside of processors, if you want to get code for other cores or IP such as DSP cores, Filter, USB, Bus, etc, you can download such code written in verilog, vhdl, SystemVerilog here: https://opencores.org/

So, there is really no need to write any piece of code from scratch, unless for learning purpose.

As explained above, processors can be open source, or they can be proprietary processors. The most popular processors are:

1. x86 based processors: These are Intel and AMD processors. They are based off Intel's old x86 design. Besides Intel, only AMD and VIA hold patent/licensing rights to make x86 chips.

2. ARM based processors: These are processors based off the ARM architecture. ARM is a British company that licenses processor designs to other companies which want to embed a processor in their own chip. These processors are small and use very little power, so are most suitable for embedded designs. Intel's x86 processors were meant for higher end usage, as in desktops, laptops, etc, where performance was more important than power.

3. Open source designs: Apart from the above mentioned proprietary designs, many open source processor designs have sprung up. Especially in the last 10 years, there has been a tremendous push to have open source processors, so that anyone can design processors free of royalty or licensing fees. The biggest hurdle for open source processors was finding one that could be adopted by the masses and be scalable for the future. RISC-V is the one gaining the most momentum right now.


Bus Architectures:

This section lists the most popular bus architectures in existence today. All chips use some form of bus protocol to communicate between different blocks on the same chip, or to talk to other chips. Buses are simple physical wires that connect one chip to another. So, the first question we might have is: why are there so many different kinds of buses, when ultimately they all just transfer signals over the same kind of physical wires? Won't they all perform more or less the same and achieve the same result? For the physical part, that's true: wires don't change their transmission capabilities based on the bus protocol. The difference in performance between these buses comes mainly from the protocol part: how the bus protocol is designed, how much of the work is being done in software vs hardware, how scalable the protocol is, etc.

This is how a new bus architecture gets introduced: a bus architecture is proposed by a company or a group of companies, and if it gains enough traction, it starts getting implemented in devices. This is how all basic buses work. When we connect 1 device to another, data xfer takes place between them. It's based on a protocol that defines how the receiver and the sender transport data and interpret it. For ex, when we connect an HDMI cable on the back of an HD TV to a DVD player, the DVD player and the TV talk using the HDMI protocol on that cable. That protocol defines exactly the cmds and data sequence to be followed. Many of these standards are from old times and don't have any patents or owner. However, many of the recent protocols are proprietary, and if you want to use them in your chip, you have to pay a royalty to the owner. You can also have your own protocol, since what is needed is basically a clock line and a data line; you can have the simplest protocol and name the bus after yourself. Note that all these bus protocols refer to digital buses, as we use the buses to xfer 0's and 1's.

We'll look at the brief history of various bus protocols:

1st generation bus: These were simple buses such as MCA, ISA, EISA, VESA bus.

2nd generation bus: These were more complicated buses, but had more functionality and achieved better speeds. These buses include PCI, AGP, PCI-X, etc.

3rd generation bus: These were radically new designs that take advantage of improvements in physical wire transmission capabilities, and could achieve clock speeds in GHz. PCI Express, HDMI, etc fall in this bucket.

Serial Vs Parallel Bus Architectures:

Parallel bus architecture: In the early days, we had parallel bus architectures, and that is what 1st and 2nd generation buses used. A parallel bus architecture is one where different bits of a bus are transmitted in parallel. They are all transmitted on a clock edge, and then received on the next clock edge. This architecture made sense, as the more bits you transfer in parallel, the higher your throughput would be. This worked for clock speeds up to 500MHz or so.

However, in the early 2000s, there were significant advances in bus technology wrt physical layout, transistor design, etc, which allowed for much higher speeds (> 1GHz). Parallel bus architecture wasn't suited for these very high speeds, for multiple reasons:

  1. Cross talk between adjacent parallel data lines: A data signal travelling over one data line corrupts data on the adjacent line. This is OK at low speeds, but at high speeds the coupling capacitance between lines behaves like a short circuit, so a signal on one line basically starts traveling on the adjacent line too. One solution is to space out the parallel lines, but that takes more area, and hence higher cost.
  2. Skew between various bits: As the number of bits in a bus increased, there were more lines in parallel. They all need to remain within a certain skew, i.e. one bit of a bus can't arrive much earlier or much later than all the other bits of the bus. They all have to arrive at the same time in order to achieve high speed. If they arrive skewed with respect to each other, then the clock speed has to be reduced by the skew amount to allow all bits of the parallel bus to be captured correctly. This put a limit on the number of bits of a bus that could be sent out in parallel.
  3. ElectroMagnetic interference (EMI): This is unwanted interference when high frequency signals are transmitted. These radiate energy, and can couple to other signals and distort their waveforms. One way to prevent EMI from disturbing your signal is to enclose your signal in a cylindrical wire shield. However, doing it for all wires of the buses becomes expensive.

Serial bus architecture: Serial bus architecture got rid of all these issues. However, it required a much higher clock speed to maintain the same throughput as a parallel bus architecture. In fact, if we had a 16 bit parallel bus running at 500MHz, then for a serial bus we needed a 16*500MHz = 8GHz clock to maintain the same throughput. This is a very fast clk, but now there is no issue of cross talk or skew, as there are no adjacent lines as in a parallel bus. EMI is still there, but can be mitigated by shielding the wire (as there is only 1 wire, it's much simpler now).

SERDES in serial bus architecture: Internally on chip, data is transferred in parallel, but just before it comes to the pads of the chip, it's converted to serial data. It's transferred as serial 1 wire data on the motherboard, and goes to the pads of the other chip as serial data. Right after the data enters the pads of the other chip, it's converted back to parallel data. Why do we do this conversion, i.e. why not just transfer data serially all the way? The reason is that parallel data transfer on chip is still faster and requires a much lower clock speed. Most of the problems that we had with parallel data transfer were on the motherboard, where it's harder to manage crosstalk and skew. The logic on the transmitting side chip that converts parallel data into serial data is called a serializer (or SER). Similarly, the logic on the receiving side chip that converts serial data into parallel data is called a deserializer (or DES). Both of these components combined are called a SERDES (serializer deserializer). You will hear this term a lot with all serial bus arch. Most of the modern bus architectures are serial bus architectures, and are also called SERDES arch. There are also Equalizer circuits on both TX and RX, just before the chip pins connect to the cable/channel. These Equalizers are needed since the cable starts attenuating the signals at relatively low freq, but we need to transmit signals at much higher freq. The Equalizer on the TX side boosts the signal bandwidth, so that even after attenuation, the signal maintains its signal levels at higher freq, which can then be recovered by the RX equalizer.
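A toy model of what the SER and DES blocks do (pure C, no timing, line coding or equalization; the function names are made up for illustration):

```c
#include <stdint.h>

/* serializer: shift a 16-bit parallel word out MSB-first as 16
   single-bit symbols on the "wire" */
void serialize(uint16_t word, uint8_t wire[16]) {
    for (int i = 0; i < 16; i++)
        wire[i] = (uint8_t)((word >> (15 - i)) & 1u);
}

/* deserializer: reassemble the 16 received bits back into a parallel word */
uint16_t deserialize(const uint8_t wire[16]) {
    uint16_t word = 0;
    for (int i = 0; i < 16; i++)
        word = (uint16_t)((word << 1) | (wire[i] & 1u));
    return word;
}
```

A round trip through serialize then deserialize should give back the original word, which is exactly what the TX SER plus RX DES pair does on a real link.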

A full diagram of the SERDES logic is shown below. FIXME: attach picture below:

 

There's a very good video lecture on youtube (by "analog layout and design"):

https://www.youtube.com/watch?v=FGzQV4a9KAw

 

PIPE (Phy Interface for PCI Express and USB SuperSpeed Architecture):

There are many SERDES bus architectures in use, such as USB, DDR, PCIe, etc. They all have the same kind of design: an analog phy that has the TX/RX, a digital controller that communicates with the analog phy, and then software and device drivers that interact with the digital controller.

PIPE: This is a spec developed by Intel for the Phy interface. It enables development of a phy as a discrete IC or as a macrocell for inclusion in ASIC designs. It defines a standard interface between such a PHY and a Media Access Layer (MAC) & Link Layer ASIC. One of the motivations behind separating the Phy and the digital i/f is that peripheral and IP vendors can develop and validate their designs insulated from the high-speed and analog circuitry issues associated with the PCI Express or USB SuperSpeed PHY interfaces, thus minimizing the time and risk of their development cycles. Though PIPE refers to PCI Express in its name, it's applicable to USB too. Below is a pic of a PIPE i/f.

 

 

Common Bus Architecture and Protocols:
 

These are some of the common Bus Standards in use today:

SPI and I2C are not only the 2 oldest bus standards, but also the simplest. They are still widely used today between 2 chips to transfer data between them.

SPI: This is a 4 wire bus protocol (clock, master-out data, master-in data, chip select). Very simple design. Widely used as a simple communication interface between 2 chips.
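A toy model of one SPI mode 0 byte exchange (full duplex: the master shifts a byte out on MOSI while the slave shifts one back on MISO, MSB first; this is a behavioral sketch with made-up names, not a real driver):

```c
#include <stdint.h>

/* exchange one byte over a modeled SPI link: 8 clocks, MSB first;
   *slave_reg models the slave's 8-bit shift register */
uint8_t spi_transfer(uint8_t mosi_byte, uint8_t *slave_reg) {
    uint8_t miso_byte = 0;
    for (int i = 7; i >= 0; i--) {
        uint8_t mosi = (uint8_t)((mosi_byte >> i) & 1u);   /* master drives MOSI */
        uint8_t miso = (uint8_t)((*slave_reg >> 7) & 1u);  /* slave drives MISO */
        miso_byte = (uint8_t)((miso_byte << 1) | miso);
        *slave_reg = (uint8_t)((*slave_reg << 1) | mosi);
    }
    return miso_byte;
}
```

After 8 clocks the two shift registers have effectively swapped contents, which is why SPI reads and writes always happen together.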

I2C: This is a 2 wire interface (clock and data). It's slightly more complicated, but the advantage is that it only uses 2 pins, so it reduces overall area and cost.
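For ex, the first byte on an I2C bus after the START condition carries the 7-bit slave addr plus the R/W bit; a minimal sketch (i2c_addr_byte is a made-up helper name):

```c
#include <stdint.h>

/* build the I2C addr byte: 7-bit slave addr in bits 7:1, R/W flag in
   bit 0 (1 = read, 0 = write) */
uint8_t i2c_addr_byte(uint8_t addr7, int read) {
    return (uint8_t)(((addr7 & 0x7Fu) << 1) | (read ? 1u : 0u));
}
```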

USB: Universal Serial Bus. The most common bus interface that you see in all devices.

PCI: Peripheral Component Interconnect Bus. Another very popular bus format

HDMI: Mainly used to transfer audio/video signals from one electronic device to another for display purposes.

MIPI: Mobile Industry Processor Interface. A family of serial interfaces (i.e CSI for cameras, DSI for displays) used mainly in mobile devices.

DDR: Double Data Rate memory interface. Used to connect a processor to DRAM; data is transferred on both clock edges.