Repairing a dead PDP-11/35 KD11A

How it all began ...

I have had a PDP-11/35 in a BA11K box in my collection for a long time, but I never really tested whether it was in working condition. As I had already moved it to 3 different locations, I thought it was about time to check out this box. So, I moved approximately 50 kilos for the fourth time, but now to my test corner. First I opened the box by removing the top lid and made a note of all slots and positions where boards were installed. Then I pulled all boards and put them safely in ESD bags. I cleaned the backplane slots and removed dust from the inside. It was remarkable how clean this system was.
Moving on to the bottom. As long as the H742 power supply is at the rear side of the box this thing remains a really heavy beast. Removed the bottom cover, and again I was looking at a very clean interior. There were no bent pins and all wiring looked fine. So, time to connect mains power. I always use an 861 power controller on my desk. That way, I can turn on/off the system with the front panel key, and disconnect mains power using the circuit breaker on the 861.
A first time "power on" is always a "moment of truth", but when I turned the key both fans started to run immediately and I did not hear "additional" sounds which could be an indication of bad bearings in the fans. With my Fluke meter I measured all power supply voltages of which +5V (logic) and +20V (core memory) are the most important ones. Of course, AC LO and DC LO are also checked, and they looked fine as well. The +5V was slightly higher (5.25V), but as the power supply was working under "no load" condition, this slightly higher output voltage can be expected.

All looked fine, so I switched power off. Being confident that this system was in general in good shape, I installed all boards in the backplane again. Normally, I would not do that, but start with a minimum configuration. When that configuration checks out OK, I add a few boards at a time until the whole system runs fine. In the case of the PDP-11/35, the minimum CPU consists of 5 boards, but if the CPU has options installed, like floating point or memory management, you must install the options as well, because there are jumpers on several basic CPU boards configuring the options. Especially the 11/35 (and 11/40) uses jumpers "all over the place" to configure the processor options.
Anyway, I installed all boards and switched on power ... all fine! I tried to store some numbers in several memory locations, and they read back all correctly. Now I toggled in the simple "chasing light" pattern program.

    address    data     instruction      comment
   ------------------------------------------------------------
    001000     012700   mov #1,R0
    001002     000001
    001004     006100   rol R0
    001006     000005   reset            bus reset takes 70 ms   
    001010     000775   br .-4           back to rol R0
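
What this loop does to R0 can be sketched in a few lines of Python. This is only a model of the ROL instruction (rotate left through the carry bit), not an emulator; the assumption that the carry bit is clear when the program starts is mine.

```python
MASK16 = 0xFFFF  # PDP-11 registers are 16 bits wide


def rol(value, carry):
    """ROL: rotate a 16-bit word left through the carry bit."""
    new_carry = (value >> 15) & 1
    result = ((value << 1) | carry) & MASK16
    return result, new_carry


def chasing_light(passes):
    """Return R0 after each pass through the ROL / RESET / BR loop."""
    r0, carry = 1, 0  # MOV #1,R0; a clear carry at start is an assumption
    seen = []
    for _ in range(passes):
        r0, carry = rol(r0, carry)  # 001004: ROL R0
        seen.append(r0)             # 001006: RESET only adds the delay here
    return seen


if __name__ == "__main__":
    for word in chasing_light(17):
        print(f"{word:06o}  {word:016b}")
```

Because the rotate goes through the carry bit, the single "1" leaves R0 for one pass before it re-enters at bit 0, so in this model the pattern repeats every 17 passes.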

Set the program counter to the start address and toggled START. The "chasing light" pattern was also running fine. So, now I connected a VT510 terminal to the console interface and used the switch register to DEP octal 65 in location 777566 (the console transmit buffer). A single "5" appeared on the screen. Hitting the "3" key on the keyboard and reading 777562 showed 63 octal on the DATA lamps. So, the console connection is also working. Very happy, I switched off the 11/35.
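
The octal values used in this console test are plain ASCII codes. A small Python helper (the function names are my own, just for illustration) converts between the octal value on the switches or lamps and the character on the terminal:

```python
def octal_to_char(octal_text):
    """Convert an octal code, as read from the DATA lamps, to its ASCII character."""
    return chr(int(octal_text, 8))


def char_to_octal(char):
    """Convert a character to the octal code to set on the switch register."""
    return format(ord(char), "o")


if __name__ == "__main__":
    print(octal_to_char("65"), octal_to_char("63"), char_to_octal("5"))
```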

A week later, I turned on the PDP-11/35. If the core memory was OK, I should be able to simply load the start address and toggle START. The "chasing light" program should run, as core memory does not lose its last stored data. And indeed, the "chasing light" program worked.
I installed the bottom and top cover and moved the box to another location. A few weeks later I switched on the PDP-11/35 again, but now the "chasing light" program did not run, and worse, the front panel toggles responded more and more erratically, up to the point that nothing worked any more. Not happy with this situation, I opened the top cover, but there was nothing "weird". So, I placed the box on its side ... and I heard a sound ... opened the bottom lid and a long screw fell out. Apparently, one of the long screws that mount a system unit (backplane) in the box was not tightened and moving the box "vibrated" the screw loose. The distance between the bottom cover and the pins of the backplane is just a few millimeters, so this screw had short-circuited some of the logic of the processor. This fault would not go away by itself. If I wanted to get this 11/35 back in working condition, I had to find the one or more damaged ICs.


Observations

Back to the test bench. After removing the top and bottom cover, I first checked the power supply voltages. They were all fine, so at least the damage was not "major", like many ICs fried because of a way too high supply voltage. The +20V for the core memory is only connected to the core system backplanes, thus the short circuit could not have connected +20V to the CPU logic.

So, what is the system doing? Maybe there is a pattern, possibly leading to some solution.
Before switching on power, I set the switch register to 000077, and the ENA/HALT switch to HALT. When the 11/35 is switched on, the ADRS lamps are all off and the DATA lamps show some value, but that value is never the same from one power-on to the next. BUS and CONS are also always on. When I set the switch register to any value and then toggle LOAD ADRS, the ADRS lamps remain all off (they should show the setting of the switch register), and the DATA lamps all go on. That's wrong as well. Further, RUN, PROC, BUS, and CONS go on. After that, the front panel is completely unresponsive. While the ENA/HALT switch is in the HALT position, toggling START (initialize, "reset") has no effect either.


Documentation

Needless to say, you need documentation to debug this system. From bitsavers /pdf/dec/pdp11/1140 you can download the PDP-11/35 (/40) documentation. There are several documents, and it does no harm to read them all, but the manuals that you really need are

  1. KD11-A_Maint.pdf     (KD11-A processor maintenance manual, EK-KD11A-MM-001)
  2. DEC-11-HKDAA-A-D_KD11-A_Processor_Manual.pdf     (KD11-A processor manual, DEC-11-HKDAA-A-D)
  3. KD11-A_RevN_Engineering_Drawings_Nov77.pdf     (KD11-A Field Maintenance Print Set)
The first two manuals are almost 400 pages together. It is too much to read them through, but have a look "here and there" to get an idea of the components of the processor, what the main parts are and their relationship. That gives you a good start on understanding the 11/35 processor, and helps you find a way through the schematic diagrams.


Fault finding - where to begin?

The 11/35 is an overwhelming piece of hardware. The basic processor consists of 5 boards, 4 "hex width" and one "quad width" board. That brings the count of ICs to almost 500! This 11/35 also has the memory management option, which adds another 100 ICs, and more complexity ... and one or maybe more are defective. So, where to begin? I can only describe how I went along. Maybe that was not the best approach, but it taught me a lot about the PDP-11/35 and I did things I had not done before - an experience from which I learned much.

After reading, I figured that it would be very useful to have a KM11 diagnostic panel. I do have an original KM11, but from previous use (to check a dead PDP-11/10, but that's another story), I know that several bulbs are dead. So, the displayed microcode bits are not correct. I decided that after some 7 years (!) it was time to build the KM11 replica kit designed by Guy Sotomayor. You can put the KM11 diagnostic panel on an "extender" in slot 1 position E or position F. When the KM11 is in position F you can execute the 11/35 microcode stepping and see relevant signals on the 28 LEDs. When the KM11 is installed in position E you can see information of the floating point and memory management module. A first "quick check" showed that the processor is actually executing the microcode, so the CPU is not completely "dead".

[Picture: KD11-A template for KM11]
[Picture: part of a stepped microcode sequence, written down on a "KM11 diagnostic paper"]

What I also learned is that you really need to have the documentation available. I made copies of what seemed to me the most relevant pages at this point. I can make notes on the copies and keep the original documentation in "virgin" condition.
I had never tried microcode stepping before, so this was a new learning moment for me. Reading the LEDs of the KM11 and then writing them down, for every microcode step again, was a time-consuming task. I have drawn a small table with the KD11-A mask layout for the KM11 and duplicated that 25 times on a single A4 sheet of paper. This way, I can step the microcode, put a cross on the sheet for each lamp that is lit and continue. Afterwards, at home, I can decode the patterns, look up which microcode mnemonic each pattern represents and check the logic flow. (It is just 11 degrees Celsius at the moment in my "museum").


First measurements - are the switches read?

Not knowing where to start, I figured looking at the executed microcode steps might give me clues as to what the processor is doing. After switching power on I stepped through the microcode, marking down the lit "lamps" on the "KM11 diagnostic paper". The results are in the table below. PUPP is the abbreviation for "Previous Microcode Program pointer" and BUPP is the abbreviation for "Base Microcode Program pointer". In the KD11-A documentation and schematic diagrams the word "microcode" is abbreviated to the single letter "U". Reading the crossed lamps of the "KM11 diagnostic paper" and writing the value down in octal, I came to this table. Each new line is the result of "toggling" the MCLK switch down and up again on the KM11. Using the microcode listing (see page 37, KD11-A Field Maintenance Print Set), I can add the microcode mnemonics. It is clear that the microcode is executing a loop.
     PUPP  BUPP   mnemonic
    -----------------------   
      030   315
      315   046    CON06
      046   026    CON04
      026   046    CON06
      046   026    CON04
      026   046    CON06
      046   026    CON04

Searching the microcode flow diagrams (pages 12-19, KD11-A Field Maintenance Print Set, "FMPS" from here on), I found the executed microcode sequence on page 16. The loop tests for a switch from the front panel. So, to fall through this loop I have to toggle one of the switches. LOAD ADRS is an obvious choice ... I toggled LOAD ADRS and clocked MCLK several times. The result is in the table further down.

The "CON06 - CON04" loop is exited with CON07. After a few more microcode steps the processor is back in the "CON06 - CON04" loop. Otherwise nothing has changed. To get more insight into what is happening, I read chapter 4 of the KD11-A processor maintenance manual. Very enlightening!

Looking at the microcode flow diagram (page 16 FMPS), the microcode LAD00 should have been executed after CON11, but instead CON05 and CON13 are executed. At the right side of page 16 you can see that the microcodes CON05 and CON13 are executed after power up. It seems that the microcode to process the LOAD ADRS toggle is never executed.

     PUPP  BUPP   mnemonic
    -----------------------   
      026   046    CON06
      046   026    CON04
      026   027    CON07
      027   044    CON08
      044   047    CON09
      047   045    CON10
      045   050    CON11
      050   030    CON05
      030   315    CON13
      315   046    CON06
      046   026    CON04
      026   046    CON06
      046   026    CON04
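
Decoding such a trace by hand is error-prone, so here is a small Python sketch that checks a written-down trace and finds the loop in it. The mnemonic table is just what appears next to the BUPP values in this article (it is not the full microcode listing), and the function names are my own.

```python
# Mnemonics as they appear next to the BUPP values in the tables of this article.
MNEMONICS = {
    0o026: "CON04", 0o030: "CON05", 0o046: "CON06", 0o027: "CON07",
    0o044: "CON08", 0o047: "CON09", 0o045: "CON10", 0o050: "CON11",
    0o315: "CON13",
}


def annotate(trace):
    """Check that a PUPP/BUPP trace is contiguous and add the mnemonics.

    trace is a list of (pupp, bupp) pairs, one per MCLK step.
    """
    rows = []
    for index, (pupp, bupp) in enumerate(trace):
        if index > 0 and trace[index - 1][1] != pupp:
            raise ValueError(f"step {index}: PUPP {pupp:03o} does not follow "
                             f"BUPP {trace[index - 1][1]:03o}")
        rows.append((pupp, bupp, MNEMONICS.get(bupp, "?")))
    return rows


def find_loop(trace):
    """Return the BUPP values between the first repeated step and its repeat."""
    first_seen = {}
    for index, step in enumerate(trace):
        if step in first_seen:
            return [bupp for _, bupp in trace[first_seen[step]:index]]
        first_seen[step] = index
    return []


if __name__ == "__main__":
    power_up = [(0o030, 0o315), (0o315, 0o046), (0o046, 0o026),
                (0o026, 0o046), (0o046, 0o026)]
    for pupp, bupp, name in annotate(power_up):
        print(f"{pupp:03o}   {bupp:03o}    {name}")
    print("loop:", [MNEMONICS[b] for b in find_loop(power_up)])
```

The contiguity check (each PUPP must equal the BUPP of the previous step) catches a misread lamp before you start searching the flow diagrams for a sequence that never happened.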

Some initial conclusions
On page 16 FMPS there is a branch after the microcode CON06, tagged SWITCH. The "leg" -SWITCH is taken if no toggle has been pressed, and that is where the processor loops waiting for an activated toggle from the front panel. This signal SWITCH can be found in the schematic diagram on page 61 FMPS, the output of the 7474 flipflop E12 pin 8. As the microcode flow executes as expected at this point, I can assume that the "famous" 7474 flipflop is working. Looking at the signals, you see that LOAD ADRS, CONT, EXAM, and DEP can trigger the flipflop via the 7430 E9.

    switch       BUBC2   BUBC1   BUBC0   octal
    -------------------------------------------
    LOAD ADRS      1       1       1       7
    CONT           1       1       0       6
    EXAM           1       0       1       5
    DEP            1       0       0       4
    START          0       1       0       2
    -              0       0       0       0

Further, gates E5, E6, E7, E8, and E23 generate the signals BUBC0(BUT30), BUBC1(BUT30), and BUBC2(BUT30). I will get back to these signals further down ... These 3 signals form a binary code unique for the pressed toggle switch, see the table above. I measured these signals with the Fluke voltmeter.

At microcode CON10 it is checked which toggle has been activated. The BUT code is 30. BUT stands for "Basic Microcode Test". At microcode CON11 the test changes the BUT code 30 to 37, 35, 36, 34, or 32 depending on the activated toggle switch. Note that the low 3-bit code matches the code in the table! At the bottom of the microcode flow diagram on page 16 FMPS, you see the decision "twisties" for the activated toggle switch and each branch continues on page 17.

But, unfortunately, whatever toggle switch is pressed, the BUT code does not change. As a result, the "twisty" for code 30 "CONSOLE RECYCLE" is executed. So, the activated toggle is not "seen". The "good" news is that while stepping through the microcode the ADDRESS and DATA lights on the front panel light on and off. So, more of the processor hardware is OK. The question is "how (where) is the microcode changed?"


How do toggle switches change the microcode?

To understand more of the power-up sequence of the processor, I had set the switch register to 000077, and this time the HALT/ENA switch to ENABLE. After switching on, the ADDRESS lamps on the console are all off and the DATA lamps show 162700 this time.
The executed microcode sequence is now 337 - 334 - 335 - 332 - 333 - 002. This sequence is described on page 11 FMPS.
Depending on the position of the HALT/ENA switch, the branch to SERVICE D or CONSOLE A is taken. Could it be that SERVICE D on page 11 is a typo?
I find microcode 002 on page 15 top left, but there it is named SERVICE B. The sequence continues 015 - 010 - 216 - 215 - 115 - 326. This looks like the TRAP A flow on page 11, to label -MM FAULT, and then continuing in the flow of TRAP D: 327 - 113 - 330 - 331 - 077 - 140 - 332 - 333.
And then we are back at the decision HALT/ENA switch on ENABLE or on HALT. During these sequences the signal MSYN is also pulsed, so again, more hardware seems to be fine.

But how are the Base Microcode Test (BUT) low 3 bits changed?
On page 17 FMPS you can see the loop that is executed for the front panel toggle switches. CON04 tests whether a toggle switch is active. For a detailed inspection you use the microcode of CON04 (026) and look at page 37 FMPS. That table lists, for each of the 256 microcodes, the state of all 56 bits of the microcode instruction. The bits that are relevant at this point are the UBF bits. UBF stands for Microcode Branch Field. These bits control the multiplexers that modify the BUT value (Basic Microcode Test), see page 42 FMPS at the lower left side.
The SWITCH signal (from page 61 FMPS) is connected to multiplexer E97, input D6. The UBF bits for microcode CON04 are 06. If the UBF bits are 06, then the output of multiplexer E97 equals the state of the signal SWITCH.
Back to page 16 FMPS. Microcodes CON06 and CON07 test whether the signal SWITCH is active. (Why CON06? Read the maintenance manual ...). As the flow continues to CON07, we can conclude that E97 is OK (at least, input D6 appears at the output and this signal reaches the correct destination).

Microcodes CON08 and CON09 form a switch debounce loop. I assume that when you step through the microcode the delay loop time is expired, and thus microcode CON10 is reached. All this confirms my understanding and the link between the microcode flow diagrams and the schematic diagrams.

Microcode CON10 checks which toggle switch was activated. Note that next to the rectangle that describes CON10 the number "030" is written. That is the Basic Microcode Test number. As described earlier, the toggle switches are not "seen" and microcode CON05 is executed. The value of microcode CON05 is "030".
The value of microcode CON10 is "045". Back to page 37 FMPS. Microcode CON10 sets the UBF bits to 30 (octal). Back to page 42 FMPS. UBF bits set to 30, thus the signals UBF0, UBF1, UBF2 are "0" and UBF3, UBF4 are "1". These 5 signals are used to enable the multiplexers and connect a specific input to the output. To be exact, E72 and E81 are enabled and connect input D8 to the output; E98 and E90 are also enabled, and, relevant here, input D0 of multiplexer E98 appears at the output.
The output of E81 generates signal BUBC0.
The output of E72 generates signal BUBC1.
The output of E98 generates signal BUBC2.
The output of E90 generates signal BUBC3.
The output of E82 generates signals BUBC4 and BUBC5.
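
To avoid mistakes when going from an octal UBF value in the listing to the individual UBF signals in the schematic, the split can be written down as a tiny Python sketch. That the low four bits form the input number of a 16-input multiplexer is my reading of the examples in this text (06 selects D6, 30 selects D8); which multiplexers are enabled for a given UBF value is not modelled here.

```python
def ubf_bits(ubf):
    """Return the state of the signals UBF0..UBF4 for a 5-bit UBF value."""
    return {f"UBF{bit}": (ubf >> bit) & 1 for bit in range(5)}


def selected_input_16(ubf):
    """Input number selected on a 16-input multiplexer (assumed: UBF0-UBF3)."""
    return ubf & 0o17


if __name__ == "__main__":
    for ubf in (0o06, 0o30):
        print(f"UBF {ubf:02o}: {ubf_bits(ubf)} -> D{selected_input_16(ubf)}")
```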

The BUT code 030 is the "base code" which can be modified by the executed test (result). I have not checked, but I guess that the BUBCx signals are all wired-OR signals. As soon as one of the connections of a wired-OR becomes active, the combined signal becomes active. Thus, by switching an input of the multiplexer to the output (based on the UBF bits), the next to be executed microcode instruction is determined. Let's see if this is correct for UBF bits equal to "030".
BUBC0 = E81/D8 = K5-6 BUBC0(BUT30)
BUBC1 = E72/D8 = K5-6 BUBC1(BUT30)
BUBC2 = E98/D0 = K5-6 BUBC2(BUT30)
These signals all come from page K5-6, that is page 61 FMPS, top right side! Checking the logic circuits you can see that
BUBC0(BUT30) = EXAM + LOAD
BUBC1(BUT30) = CONT + LOAD + START
BUBC2(BUT30) = EXAM + CONT + LOAD + DEP
and the table of the front panel toggle switches matches.
If we add the octal value of the activated toggle switch to the Base Microcode Test number (030), we get the microcode numbers for LOAD ADRS, EXAM, CONT, DEP, START, and CONSOLE RECYCLE (page 16-17 FMPS).
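
The three logic equations and the resulting microcode numbers can be checked with a short Python sketch. The OR with the base code reflects my wired-OR guess above; it gives the same result as adding, because the low three bits of 030 are zero.

```python
BUT_BASE = 0o030  # Basic Microcode Test number next to CON10


def bubc_code(switch):
    """Return the 3-bit BUBC2..BUBC0 code for the name of a pressed toggle."""
    bubc0 = int(switch in ("EXAM", "LOAD ADRS"))
    bubc1 = int(switch in ("CONT", "LOAD ADRS", "START"))
    bubc2 = int(switch in ("EXAM", "CONT", "LOAD ADRS", "DEP"))
    return (bubc2 << 2) | (bubc1 << 1) | bubc0


def next_microcode(switch):
    """Microcode number reached after the test at CON11."""
    return BUT_BASE | bubc_code(switch)


if __name__ == "__main__":
    for name in ("LOAD ADRS", "CONT", "EXAM", "DEP", "START", "-"):
        print(f"{name:10s} {bubc_code(name):o}  {next_microcode(name):03o}")
```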


Keeping a toggle switch pressed while stepping the microcode

I measured the output signals of the gates E8, E7, E23 and E12 (page 61 FMPS), and they are fine. The signal SWITCH, output E12, is also OK. So, continuing on page 42 FMPS. The signals BUBCx remain "0" as the BUT remains "030". But I measured that the gate output signals arrive on the inputs of the multiplexers and that the UBF bits are as expected. So, could it be that the multiplexers are defective? But it would be a weird coincidence that all 3 are defective, although all signals are tied to pins on the backplane ...

New idea
The combined signal SWITCH is latched by a 7474 flipflop, but the signals of the front panel toggle switches are _not_ latched. As far as I know, in my tests, I stepped the microcode, toggled the LOAD ADRS toggle switch and continued stepping, seeing the (incorrect) result. However, as the toggle switch signals are not latched, at the moment they are evaluated their signals are no longer active!
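
The difference between the latched SWITCH signal and the unlatched switch code can be shown with a toy model in Python. It is a deliberate simplification: the microsteps are just numbered, and clearing of the flipflop is not modelled.

```python
BUT_BASE = 0o030   # base code tested at CON10
LOAD_ADRS = 0o7    # BUBC2..BUBC0 code for LOAD ADRS from the table


def sample_switch(pressed_steps, test_step, code=LOAD_ADRS):
    """Toy model: return (SWITCH, BUT code) as seen at microstep test_step.

    pressed_steps is the set of microstep numbers during which the toggle
    is physically held down.
    """
    # The 7474 remembers any press that happened up to the test step.
    switch_latched = any(step <= test_step for step in pressed_steps)
    # The BUBC lines are not latched: they only carry the code while held.
    live_code = code if test_step in pressed_steps else 0
    return switch_latched, BUT_BASE | live_code


if __name__ == "__main__":
    print(sample_switch({0}, 5))            # momentary press, then stepping
    print(sample_switch(set(range(6)), 5))  # toggle held while stepping
```

With a momentary press the model leaves the wait loop (SWITCH is set) but still ends up at 030; only when the toggle is held down until the test step does the code become 037.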

In the table below you can see the microcode execution when the LOAD ADRS toggle switch is pressed and kept pressed while the following microcode instructions are stepped. After CON11, the microcode 030 is now changed to 037, LAD00!

The microcode instructions that follow are exactly matching the flow diagram on page 17 FMPS.

What is also striking, is that after execution of LAD01 the front panel DATA lamps show 000077, the setting of the switch register. After the execution of LAD03 the front panel DATA lamps still show 000077, but now the ADDRESS lamps also show 000077. You can see this sequence in the picture "KM11 diagnostic paper" above.
After CON09 the front panel DATA lamps are all off, but the front panel ADDRESS lamps still show 000077.
And that is exactly the correct behavior!

I repeated this test with the EXAM toggle switch. If the toggle switch is not kept pressed, but only momentarily, the BUT code remains 030. When the EXAM toggle switch is kept pressed, the BUT code changes to 035, and the next microcode instruction executed is 053 (EXM01). And the complete microcode instruction flow for EXAM is correctly executed as described on page 17 FMPS.

     PUPP  BUPP   mnemonic
    -----------------------   
       :     :          
      046   027    CON07
      027   044    CON08
      044   047    CON09
      047   045    CON10
      045   050    CON11
      050   037    LAD00
      037   051    LAD01
      051   052    LAD02
      052   033    LAD03
      033   030    CON09
      030   315
      315   046
      046   026


More reading - new idea to test

Although I had already read most of the KD11-A processor maintenance manual, I only vaguely remembered that you can do something with the microcode address and the console Switch Register. Reading that part again gave me a new idea to test the console switches. Till now I used the switches "MCLK" and "MCLK ENAB" of the KM11 diagnostic panel to step through the microcode. However, there is a comparator that compares the 8-bit microcode address "PUPP" with the 8 lower Switch Register switches. When the microcode address matches the 8 Switch Register switches, the signal "UPP MATCH H" is activated. If the "MSTOP" switch of the KM11 diagnostic panel is active and the signal "UPP MATCH H" becomes active, the execution is halted. My next tests will be the following.
When I step through the console switch microcode loop while holding the LOAD ADRS toggle pressed, the CPU will execute that path. When the microcode execution of LAD00 is done and I let the CPU run "full speed", the console becomes unresponsive. Unfortunately, I do not know what is executing (if there is any execution at all).
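
The stop condition of that comparator is simple enough to write down as a Python sketch. The function names are mine; the behavior is only what I described above (low 8 Switch Register bits compared with PUPP, gated by MSTOP).

```python
def upp_match(pupp, switch_register):
    """UPP MATCH: PUPP equals the 8 lower switches of the Switch Register."""
    return (pupp & 0xFF) == (switch_register & 0xFF)


def halts(pupp, switch_register, mstop):
    """Execution is halted when MSTOP is active and UPP MATCH becomes active."""
    return mstop and upp_match(pupp, switch_register)


if __name__ == "__main__":
    # Stop when the microcode address 037 (LAD00) is reached.
    for pupp in (0o045, 0o050, 0o037):
        print(f"{pupp:03o}", halts(pupp, 0o000037, mstop=True))
```

Only the 8 lower switches take part in the comparison, so the upper switches can stay in whatever position they are.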


Swapping the M7233 (Instruction Decode) with another M7233 (of unknown status) showed some interesting points. First of all, the front panel switches are now operating as they should. However, there are bits "stuck" at "1" that seem to flip back to "0" when "neighbour" switches are also switched, but not always. OK, the swapped M7233 is not OK, but at least the console switches are working. By comparing measurements of the switches' signals on both M7233 modules I hope to localize the issue.
With the original M7233 board installed, I still get the same behavior: the console switches become inoperative after the first use of one of them. Checking the microstepping counter, I see that the CPU is looping on microstep 044 / 047 (CON08 / CON09). This is the delay timer loop for switch debouncing. This loop is never ending. Looking at page 16 FMPS shows that in 044/CON08 BUT 12 (Branch on microcode Test 12) is used. The UBF bits (microbranch field) are 012 when the microcode is 044. Turning to diagram K3-2 (FMPS page 42), the UBF bits 0-4 are used as address selectors for the multiplexers. The 16 to 1 multiplexer 74150 (E97) connects input D10 to the output when UBF0-4 = 12. Input D10 connects to K1-7 (FMPS page 26) E65 74H40 output, which is "1" when D15-D00 is all zero. Thus, the microcode loops CON08 / CON09 until, during the CON08 microcode (when UBF = 012), the 16 data line signals are all zero. When that happens the D10 input of multiplexer E97 is "1" and the output of E97 becomes "0". This output determines bit 0 of the BUBC (Branch Microcode) which is used for BUT (Branch Microcode Test) 15:00. On D15-00 = 0 the microcode address 044 changes to 045, leaving the debounce loop.
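
The exit condition of the debounce loop, as I understand it from the diagrams, can be summarized in a Python sketch. The function names are mine, and the "stuck low" option is only there to show what a D10 input that never goes high does to the loop.

```python
def zero_detect(data):
    """E65: output is 1 only when all 16 data lines D15-D00 are 0."""
    return 1 if (data & 0xFFFF) == 0 else 0


def e97_output(d10):
    """74150 with input D10 selected: the output is the inverted input."""
    return 0 if d10 else 1


def loop_address(data, d10_stuck_low=False):
    """Address the debounce loop continues at: 044 (stay) or 045 (leave)."""
    d10 = 0 if d10_stuck_low else zero_detect(data)
    bubc0 = 1 if e97_output(d10) == 0 else 0  # a low output sets bit 0
    return 0o044 | bubc0


if __name__ == "__main__":
    print(f"{loop_address(0o000077):03o}")                # still counting
    print(f"{loop_address(0):03o}")                       # data all zero
    print(f"{loop_address(0, d10_stuck_low=True):03o}")   # D10 never high
```

In this model a D10 input that stays "0" keeps the address at 044, whatever the data lines do.
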
Using my HP1650A logic analyser, I see that the input D10 on E97 is constantly "0", and when the address selector inputs of E97 are 01010 (octal 12), the output of E97 [BUBC0(BUT17:00)] stays "1". When I swap the M7233 I see something different! I see the input D10 change to "1" and during CON08 (UBF = 012) there is a short pulse going "0" on the output of E97. Every time I toggle one of the switches this short "0" appears on the output. When I install the original M7233, I see no change on D10 during UBF = 012, thus no change on the output of E97. That output modifies bit 0 of the BUT, so that CON08 (044) changes to CON10 (045), getting out of the loop. As D10 does not change, the microcode will loop here forever.
The question is "how does the M7233 (IR DECODE) influence the signal D(15:00)=0 H on M7231 (DATA PATH) ADRS DECODE on K1-7 (FMPS page 26)"?

. . . to be continued . . .