Kendall Square Research (KSR) was a supercomputer company headquartered originally in Kendall Square in Cambridge, Massachusetts in 1986, near Massachusetts Institute of Technology (MIT). It was co-founded by Steven Frank and Henry Burkhardt III, who had formerly helped found Data General and Encore Computer and was one of the original team that designed the PDP-8. KSR produced two models of supercomputer, the KSR1 and KSR2.
As the company scaled up quickly to enter production, they moved in the late 1980s to 170 Tracer Lane, Waltham, Massachusetts. KSR refocused its efforts from the scientific to the commercial marketplace, with emphasis on parallel relational databases and OLTP operations. It then got out of the hardware business, but continued to market some of its data warehousing and analysis software products.
The first KSR1 system was installed in 1991. With new processor hardware, new memory hardware and a novel memory architecture, a new compiler port, a new port of a relatively new operating system, and exposed memory hazards, early systems were noted for frequent system crashes. KSR called their cache-only memory architecture (COMA) by the trade name Allcache; reliability problems with early systems earned it the nickname Allcrash, although memory was not necessarily the root cause of crashes. A few KSR1 models were sold, and as the KSR2 was being rolled out, the company collapsed amid accounting irregularities involving the overstatement of revenue.
KSR used a proprietary processor because 64-bit processors were not commercially available. However, this put the small company in the difficult position of doing both processor design and system design. The KSR processors were introduced in 1991 at 20 MHz and 40 MFlops. At that time, the 32-bit Intel 80486 ran at 50 MHz and 50 MFlops. When the 64-bit DEC Alpha was introduced in 1992, it ran at up to 192 MHz and 192 MFlops, while the 1992 KSR2 ran at 40 MHz and 80 MFlops.
One customer of the KSR2, the Pacific Northwest National Laboratory, a United States Department of Energy facility, purchased an enormous number of spare parts, and kept their machines running for years after the demise of KSR.
KSR, along with many of its competitors (see below), went bankrupt during the collapse of the supercomputer market in the early 1990s. KSR went out of business in February 1994, when their stock was delisted from the stock exchange.
The KSR systems ran a specially customized version of the OSF/1 operating system, a Unix variant, with programs compiled by a KSR-specific port of the Green Hills Software C and FORTRAN compilers. The architecture was shared memory implemented as a cache-only memory architecture or “COMA”. Being all cache, memory dynamically migrated and replicated in a coherent manner based on the access pattern of individual processors. The processors were arranged in a hierarchy of rings, and the operating system mediated process migration and device access. Instruction decode was hardwired, and pipelining was used. Each KSR1 processor was a custom 64-bit reduced instruction set computing (RISC) CPU clocked at 20 MHz and capable of a peak output of 20 million instructions per second (MIPS) and 40 million floating-point operations per second (MFLOPS). Up to 1088 of these processors could be arranged in a single system, with a minimum of eight. The KSR2 doubled the clock rate to 40 MHz and supported over 5000 processors. The KSR-1 chipset was fabricated by Sharp Corporation while the KSR-2 chipset was built by Hewlett-Packard.
Besides the traditional scientific applications, KSR with Oracle Corporation, addressed the massively parallel database market for commercial applications. The KSR-1 and -2 supported Micro Focus COBOL and C/C++ programming languages, and the Oracle PRDBMS and the MATISSE OODBMS from ADB, Inc. Their own product, the KSR Query Decomposer, complemented the functions of the Oracle product for SQL uses. The TUXEDO transaction monitor for OLTP was also provided. The KAP program (Kuck & Associate Preprocessor) provided for pre-processing for source code analysis and parallelization. The runtime environment was termed PRESTO, and was a POSIX compliant multithreading manager.
The KSR-1 processor was implemented as a four-chip set in 1.2 micrometer complementary metal–oxide–semiconductor (CMOS). These chips were: the cell execution unit, the floating point unit, the arithmetic logic unit, and the external I/O unit (XIO). The CEU handled instruction fetch (two per clock), and all operations involving memory, such as loads and stores. 40-bit addresses were used, going to full 64-bit addresses later. The integer unit had 32, 64-bit-wide registers. The floating point unit is discussed below. The XIO had the capacity of 30 MB/s throughput to I/O devices. It included 64 control and data registers.
The KSR processor was a 2-wide VLIW, with instructions of 6 types: memory reference (load and store), execute, control flow, memory control, I/O, and inserted. Execute instructions included arithmetic, logical, and type conversion. They were usually triadic register in format. Control flow refers to branches and jumps. Branch instructions were two cycles. The programmer (or compiler) could implicitly control the quashing behavior of the subsequent two instructions that would be initiated during the branch. The choices were: always retain the results, retain results if branch test is true, or retain results if branch test is false. Memory control provided synchronization primitives. I/O instructions were provided. Inserted instructions were forced into a flow by a coprocessor. Inserted load and store were used for direct memory access (DMA) transfers. Inserted memory instructions were used to maintain cache coherency. New coprocessors could be interfaced with the inserted instruction mechanism. IEEE standard floating point arithmetic was supported. Sixty-four 64-bit wide registers were included.
The following example of KSR assembly performs an indirect procedure call to an address held in the procedure’s constant block, saving the return address in register c14. It also saves the frame pointer, loads integer register zero with the value 3, and increments integer register 31 without changing the condition codes. Most instructions have a delay slot of 2 cycles and the delay slots are not interlocked, so must be scheduled explicitly, else the resulting hazard means wrong values are sometimes loaded.
In the KSR design, all of the memory was treated as cache. The design called for no home location- to reduce storage overheads and to software transparently, dynamically migrate/replicate memory based on where it was be utilized; A Harvard architecture, separate bus for instructions and memory was used. Each node board contained 256 kB of I-cache and D-cache, essentially primary cache. At each node was 32 MB of memory for main cache. The system level architecture was shared virtual memory, which was physically distributed in the machine. The programmer or application only saw one contiguous address space, which was spanned by a 40-bit address. Traffic between nodes traveled at up to 4 gigabytes per second. The 32 megabytes per node, in aggregate, formed the physical memory of the machine.
Specialized input/output processors could be used in the system, providing scalable I/O. A 1088 node KSR1 could have 510 I/O channels with an aggregate in excess of 15 GB/s. Interfaces such as Ethernet, FDDI, and HIPPI were supported.