PS3 Media Server is a DLNA-compliant UPnP media server for the PS3, written in Java, designed to stream or transcode any kind of media file with minimal configuration. Universal Media Server is a media server capable of serving video, audio, and images to any DLNA-capable device. It is free, regularly updated, and has more features than any other media server, including paid media servers. It streams to many devices, including the Sony PlayStation 3 (PS3).

Cell is a multi-core microprocessor microarchitecture that combines a general-purpose PowerPC core of modest performance with streamlined coprocessing elements [1] which greatly accelerate multimedia and vector processing applications, as well as many other forms of dedicated computation.

The first major commercial application of Cell was in Sony's PlayStation 3 game console. The Cell architecture includes a memory coherence architecture that emphasizes power efficiency, prioritizes bandwidth over low latency, and favors peak computational throughput over simplicity of program code.

For these reasons, Cell is widely regarded as a challenging environment for software development. Engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers.

An early patent version of the Broadband Engine was shown to be a chip package comprising four "Processing Elements", which was the patent's description for what is now known as the Power Processing Element (PPE).

The world's three most energy-efficient supercomputers, as represented by the Green500 list, are similarly based on the PowerXCell 8i. On May 17, Sony Computer Entertainment confirmed some specifications of the Cell processor that would be shipping in the then-forthcoming PlayStation 3 console. The relationship between cores and threads is a common source of confusion.

The PPE core is dual-threaded and manifests in software as two independent threads of execution, while each active SPE manifests as a single thread. In the PlayStation 3 configuration as described by Sony, the Cell processor provides nine independent threads of execution. On June 28, IBM and Mercury Computer Systems announced a partnership agreement to build Cell-based computer systems for embedded applications such as medical imaging, industrial inspection, aerospace and defense, seismic processing, and telecommunications.
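The core-to-thread relationship described above reduces to one line of arithmetic. The sketch below assumes seven active SPEs in the PlayStation 3 configuration, a figure not stated in the text but implied by the nine-thread total; the function name is hypothetical.

```python
PPE_THREADS = 2  # the PPE is dual-threaded


def hardware_threads(active_spes: int) -> int:
    """Independent threads of execution: two from the PPE plus one per active SPE."""
    if active_spes < 0:
        raise ValueError("active_spes must be non-negative")
    return PPE_THREADS + active_spes


# Assumed PlayStation 3 configuration: seven active SPEs.
print(hardware_threads(7))  # 9
```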

Sony's high-performance media computing server ZEGO uses a Cell/B.E. processor. The longer name indicates its intended use, namely as a component in current and future online distribution systems; as such it may be utilized in high-definition displays and recording equipment, as well as HDTV systems.

Additionally, the processor may be suited to digital imaging systems (medical, scientific, etc.). In a simple analysis, the Cell processor can be split into four components: external input and output structures; the main processor, called the Power Processing Element (PPE), a two-way simultaneous-multithreaded PowerPC core; the coprocessors, called Synergistic Processing Elements (SPEs); and the Element Interconnect Bus (EIB) that connects them. A DMA operation can transfer either a single block of up to 16 KB or a list of two or more such blocks. One of the major design decisions in the architecture of Cell is the use of DMA as a central means of intra-chip data transfer, with a view to enabling maximal asynchrony and concurrency in data processing inside a chip.
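Because a single DMA block is capped at 16 KB, a larger buffer has to be described as a list of blocks. The following is a minimal sketch of that splitting step only; the function name is hypothetical, and it deliberately ignores the alignment rules and list-length limits that real hardware imposes.

```python
MAX_DMA_BLOCK = 16 * 1024  # single-block limit stated above: 16 KB


def split_into_dma_blocks(total_bytes: int, max_block: int = MAX_DMA_BLOCK):
    """Return (offset, size) pairs that cover total_bytes, each at most max_block."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    return [
        (offset, min(max_block, total_bytes - offset))
        for offset in range(0, total_bytes, max_block)
    ]


# A 40 KB buffer needs two full blocks and one 8 KB remainder.
print(split_into_dma_blocks(40 * 1024))
# [(0, 16384), (16384, 16384), (32768, 8192)]
```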

The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. Despite having Turing complete architectures, the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work. As most of the "horsepower" of the system comes from the synergistic processing elements, the use of DMA as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge to software developers who wish to make the most of this horsepower, demanding careful hand-tuning of programs to extract maximal performance from this CPU.

The PPE and bus architecture includes various modes of operation giving different levels of memory protection, allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8 bits to 128 bits in size or for SIMD computations on a variety of integer and floating-point formats.

System memory addresses for both the PPE and SPE are expressed as 64-bit values for a theoretical address range of 2^64 bytes (16 exabytes, or 16,777,216 terabytes). In practice, not all of these bits are implemented in hardware. In documentation relating to Cell, a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
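The address-range figures above can be verified directly. This short check assumes binary units (an exabyte taken as 2^60 bytes and a terabyte as 2^40 bytes), which is what makes the quoted numbers come out exactly.

```python
ADDRESS_BITS = 64

address_space_bytes = 2 ** ADDRESS_BITS
exabytes = address_space_bytes // 2 ** 60   # binary exabytes
terabytes = address_space_bytes // 2 ** 40  # binary terabytes

print(exabytes)          # 16
print(f"{terabytes:,}")  # 16,777,216

# Cell documentation data sizes, in bits.
WORD_BITS, DOUBLEWORD_BITS, QUADWORD_BITS = 32, 64, 128
```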

The PPE [30] [31] [32] is the PowerPC-based, dual-issue, in-order, two-way simultaneous-multithreaded core that acts as the controller for the eight SPEs, which handle most of the computational workload.

The PPE has limited out-of-order execution capabilities; it can perform loads out of order and has delayed execution pipelines. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating-point code execution.

The size of a cache line is 128 bytes. Additionally, IBM has included an AltiVec (VMX) unit [33] which is fully pipelined for single-precision floating point (AltiVec 1 does not support double-precision floating-point vectors).

The IU contains the L1 instruction cache, branch prediction hardware, instruction buffers, and dependency-checking logic. Each PPE can complete two double-precision operations per clock cycle using a scalar fused-multiply-add instruction. SPEs do not have any branch prediction hardware, hence there is a heavy burden on the compiler.
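The "two operations per clock cycle" figure for the fused multiply-add mentioned above converts to a peak rate once a clock frequency is chosen. A minimal sketch follows; the function name and the 1 GHz example clock are illustrative assumptions, not figures from the text.

```python
def peak_flops(clock_hz: float, fma_per_cycle: int = 1) -> float:
    """Peak floating-point operations per second.

    Each fused multiply-add counts as two operations (one multiply, one add).
    """
    ops_per_fma = 2
    return clock_hz * fma_per_cycle * ops_per_fma


example_clock_hz = 1.0e9  # hypothetical clock, chosen only for round numbers
print(peak_flops(example_clock_hz) / 1e9)  # 2.0 (GFLOPS)
```

The result scales linearly with clock frequency, so doubling the clock doubles the peak.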

The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load. Each SPE contains a 128-bit register file. An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation.
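The element counts above all follow from dividing one register width by the element width. The sketch below assumes a 128-bit register, which is what sixteen 8-bit elements imply; the function name is hypothetical.

```python
REGISTER_BITS = 128  # assumption: implied by sixteen 8-bit elements per operation


def simd_lanes(element_bits: int, register_bits: int = REGISTER_BITS) -> int:
    """Number of elements of the given width that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


for bits in (8, 16, 32):
    print(f"{bits}-bit elements: {simd_lanes(bits)} lanes")
# 8-bit elements: 16 lanes
# 16-bit elements: 8 lanes
# 32-bit elements: 4 lanes
```

Single-precision floats are 32 bits wide, which is why they share the four-lane count with 32-bit integers.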

In one typical usage scenario, the system will load the SPEs with small programs (similar to threads), chaining the SPEs together to handle each step in a complex operation. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. Compared to its personal computer contemporaries, the relatively high overall floating-point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in CPUs like the Pentium 4 and the Athlon. However, comparing only the floating-point abilities of a system is a one-dimensional and application-specific metric.
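The two usage scenarios described above, chaining and partitioning, can be contrasted in a few lines. This is a host-side illustration in plain Python, not SPE code; both function names are hypothetical, and ordinary functions stand in for the small programs loaded onto SPEs.

```python
from functools import reduce


def run_pipeline(stages, item):
    """Chained model: each stage (one per SPE) feeds its result to the next."""
    return reduce(lambda value, stage: stage(value), stages, item)


def partition(data, workers):
    """Data-parallel model: split data into `workers` near-equal contiguous chunks."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    base, extra = divmod(len(data), workers)
    chunks, start = [], 0
    for i in range(workers):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


# Chaining: two stages applied in sequence to one item.
print(run_pipeline([lambda x: x + 1, lambda x: x * 2], 3))  # 8

# Partitioning: ten items spread over four workers running the same operation.
print(partition(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```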

Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictors.

The Cell is designed to compensate for this with compiler assistance, in which prepare-to-branch instructions are created. For double-precision floating-point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude. The PowerXCell 8i variant was specifically designed for double precision. The EIB also includes an arbitration unit which functions as a set of traffic lights.

The EIB is presently implemented as a circular ring consisting of four 16-byte-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate, the effective channel rate is 16 bytes every two system clocks. At maximum concurrency, with three active transactions on each of the four channels, the peak instantaneous EIB bandwidth is 96 bytes per system clock (12 concurrent transactions × 16 bytes / 2 system clocks per transfer). While this figure is often quoted in IBM literature, it is unrealistic to simply scale this number by processor clock speed.
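The peak instantaneous figure follows from the channel count, per-channel concurrency, and channel rate given above. A minimal sketch of that arithmetic follows; the constant and function names are hypothetical, and the 16-byte channel width is taken from the stated rate of 16 bytes every two system clocks.

```python
CHANNELS = 4                    # unidirectional channels in the ring
TRANSACTIONS_PER_CHANNEL = 3    # concurrent transactions when traffic permits
CHANNEL_WIDTH_BYTES = 16        # bytes moved per EIB clock on one transaction
SYSTEM_CLOCKS_PER_EIB_CLOCK = 2 # the EIB runs at half the system clock rate


def eib_peak_bytes_per_system_clock() -> int:
    """Peak instantaneous EIB throughput, ignoring arbitration constraints."""
    concurrent = CHANNELS * TRANSACTIONS_PER_CHANNEL
    return concurrent * CHANNEL_WIDTH_BYTES // SYSTEM_CLOCKS_PER_EIB_CLOCK


print(eib_peak_bytes_per_system_clock())  # 96
```

This is an upper bound only: it assumes every channel carries its maximum number of non-conflicting transactions at once.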

The arbitration unit imposes additional constraints. The designers describe the concurrency model as follows: "A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it does not work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track." Each participant on the EIB has one 16-byte read port and one 16-byte write port.

The limit for a single participant is to read and write at a rate of 16 bytes per EIB clock (for simplicity, often regarded as 8 bytes per system clock).

Each SPU processor contains a dedicated DMA management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model. Data flows on an EIB channel stepwise around the ring.

Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations.
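The six-step rule above amounts to always taking the shorter way around the ring. The sketch below numbers the twelve participants 0 to 11 around the ring and labels the two directions "clockwise" and "counterclockwise"; that numbering, the labels, and the function name are illustrative assumptions and do not reflect the physical layout of the chip.

```python
PARTICIPANTS = 12
MAX_STEPS = PARTICIPANTS // 2  # six steps is the longest permitted distance


def route(src: int, dst: int, participants: int = PARTICIPANTS):
    """Return (direction, steps) for the shorter way around the ring."""
    forward = (dst - src) % participants
    backward = (src - dst) % participants
    if forward <= backward:
        return ("clockwise", forward)
    return ("counterclockwise", backward)


print(route(0, 5))  # ('clockwise', 5)
print(route(0, 8))  # ('counterclockwise', 4)
print(route(0, 6))  # ('clockwise', 6)
```

Ties at exactly six steps are resolved arbitrarily here; how the real arbiter chooses is not described in the text.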

However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency. Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole.

In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels. The designers explain the choice: "Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth."

Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions works out to an abstract EIB bandwidth figure that reflects the peak instantaneous EIB bandwidth scaled by processor frequency.

However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. Each unit on the EIB can simultaneously send and receive 16 bytes of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle.

Since each snooped address request can potentially transfer up to 128 bytes, the theoretical peak data bandwidth on the EIB is 128 bytes per bus cycle.
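The snoop-rate limit can be sketched as follows. It assumes a transfer of eight beats of 16 bytes per snooped address and a bus clock at half the system clock, as described earlier; the function names and the 2 GHz example clock are hypothetical and chosen only for round numbers.

```python
BYTES_PER_BEAT = 16        # each unit sends or receives 16 bytes per bus cycle
BEATS_PER_TRANSFER = 8     # "each transfer always takes eight beats"
SNOOPS_PER_BUS_CYCLE = 1   # address snoop rate across all units


def snoop_limited_bytes_per_bus_cycle() -> int:
    """Upper bound on data moved per bus cycle under the snoop-rate limit."""
    return SNOOPS_PER_BUS_CYCLE * BYTES_PER_BEAT * BEATS_PER_TRANSFER


def snoop_limited_bandwidth(system_clock_hz: float) -> float:
    """Bytes per second, with the bus running at half the system clock."""
    bus_clock_hz = system_clock_hz / 2
    return snoop_limited_bytes_per_bus_cycle() * bus_clock_hz


print(snoop_limited_bytes_per_bus_cycle())   # 128
print(snoop_limited_bandwidth(2.0e9) / 1e9)  # 128.0 (GB/s at a hypothetical 2 GHz)
```

Expressed per system clock this bound is 64 bytes, which is lower than the peak instantaneous figure obtained by counting concurrent transactions alone.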

This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.

In practice, effective EIB bandwidth can also be limited by the ring participants involved. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit-wide point-to-point path.

Five 8-bit-wide point-to-point paths are inbound lanes to Cell, while the remaining seven are outbound. The FlexIO interface can be clocked independently. Sony's PlayStation 3 video game console was the first production application of the Cell processor.

This system assumed the #1 spot on the June TOP500 list as the first supercomputer to run at petaFLOPS speeds. Clusters of PlayStation 3 consoles are an attractive alternative to high-end systems based on Cell blades. Innovative Computing Laboratory, a group led by Jack Dongarra in the Computer Science Department at the University of Tennessee, investigated such an application in depth.

As first reported by Wired on October 17 [57], an interesting application of PlayStation 3 consoles in a cluster configuration was implemented by astrophysicist Gaurav Khanna, from the Physics department of the University of Massachusetts Dartmouth, who replaced time used on supercomputers with a cluster of eight PlayStation 3s.

Subsequently, the next generation of this machine, now called the PlayStation 3 Gravity Grid, uses a network of 16 machines and exploits the Cell processor for the intended application, which is binary black hole coalescence using perturbation theory.

