English: SEAForth

Hishnik

Please use only English in this discussion. Here we plan to talk about SEAForth, maybe with English-speaking folks...

mOleg

http://www.intellasys.net/products/seaforth/index.php

"Multicore Processors, which represent the culmination of more than 200 years of collective development by some of the most experienced software/hardware technologists in the semiconductor industry. Based on the proprietary Scalable Embedded Array™ (SEA) Platform, SEAforth system-on-chip solutions are poised to raise the performance- per-watt bar in a host of embedded consumer electronics applications."

forther

deleted

garbler

This is a diagram of the Enumera CPU [http://www.dnull.com/~sokol/amorp/emtalk.ppt].
Enumera was a startup that somehow failed and was a predecessor of IntellaSys's efforts to build
multicore Forth CPUs (at least, this is the opinion of John Sokol, who was the initiator
of this startup).

p.s. Amorphous OS (for Enumera CPU) [http://www.dnull.com/~sokol/amorp/amorphous1.ppt]

Hishnik

Some words from Hishnik (hishnik means 'predator')

It is a good time to combine two important things: trying to write in English and discussing the SEAForth project.

Indeed, SEAForth is a very complex object to analyze and discuss, so I just want to start the discussion, not to explain any particular aspect of this system. Also, if this (English-language) branch of our forum is not continued by somebody English-speaking, I’ll return to my native language…

I don’t know how to start. There are at least two huge, independent aspects to discuss: the commercial one and the electronics/Forth programming one. I have serious, deep questions about both of them.

OK, let’s start with economics. The field of processor design moves forward year after year. It is not only ‘more designs every year’, it is also ‘easier to get into design, harder to get into production’. As a result, it is very, very simple to create some kind of processor core, but harder and harder to find customers for it. The situation is similar to programming: will anyone be impressed by a program? I mean ‘just a program’, not ‘a good program’. Programmers are no longer only the elite of high tech; even a schoolboy can create a Windows application. You can say ‘but chip design is very different’… but that is wrong. With a PLD you can construct any realistic digital device and then download it into a real programmable chip. More importantly, hardware description languages are quite similar to general-purpose programming languages, so someone who is familiar with programming has a good chance of becoming a chip designer.
What does all of the above mean? The answer is simple: with a rich set of processors around, we are waiting for a good new one. Not a ‘new Forth processor’, but a ‘new good Forth processor’. There is another point to discuss: with smaller circuit line widths, we need larger and larger volumes to make a new chip profitable. So we must look at applicability, compatibility, price etc., not only at the preferred architecture and brand (if we count ‘Forth’ as a brand). Looking at the history of the TF16 (a Russian FPGA-based Forth CPU, now implemented in silicon), I was surprised by the whole policy of the developer team. Instead of constantly searching for application areas, the ‘Forth CPU’ was just thrown out to a small group of enthusiasts. Of course, they say ‘Great! A new Forth CPU!’… and start waiting for more good news.

Errr… and what about programs, compilers, shells… ordering? I don’t know exactly how things are with SEAForth, but I feel the situation is the same. It has been more than a year since this chip was first announced, but there is still only one application and one simulator on the web. Also, I’ve seen a message to the Forth groups with an invitation to start writing applications for the new chip. I agree, it is an interesting solution, but why was it put into hard silicon without serious prototyping and testing by many groups of Forth programmers? There is a ‘black hole’ where the developers say ‘we have a great Forth CPU (how about purchasing it?)’, but the Forthers say ‘yes, it is a great Forth CPU… let us read some new success stories, we love Forth and appreciate its spreading!’. I wonder how much of this I have right.

I repeat: just putting a new processor into silicon is not enough to make it popular.

Now let’s look at the architecture and features. 24 cores running at 1 GHz is impressive. It is ‘24 billion operations per second’. Moreover, it is ‘24 billion Forth words per second’. Really impressive. Remember, an x86 must spend 5-10 cycles per Forth word and runs at 3-4 GHz. This means SEAForth outperforms the mainstream Intel solutions. Furthermore, Intel is lowering the clock speed of its new Core architecture compared to the Pentium 4. It is strange, isn’t it? Yes, the Core Duo has two processor cores, which is a step toward a multicore solution (SEAForth is the winner here). But Core also has a powerful memory interface! Would you enjoy a device that performs 10E20 actions per second _inside_ but delivers only 1000 bytes… for example, to the display? Inside the chip we may have a beautiful photorealistic 3D world, but how can we observe it? Let’s look at a typical mainboard with a Core 2 Duo: two channels of high-speed DDR2 memory, high-speed serial interfaces, a PCI Express bus, dedicated high-speed video memory… and a large (very large, compared to 64 words per SEAForth core) cache memory. Is it enough? No, Intel continues to extend memory bandwidth. To summarize, SEAForth implements quite a strange concept: in general, programs will constantly stall, and typical (not peak) performance will be determined by the external memory bandwidth. One or two memory interfaces are not enough.
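
The peak-versus-sustained argument above can be put into numbers. The following is only a back-of-envelope sketch in Python: the core count, clock rates and cycles-per-word figures come from the post, while the function names and the external memory figures in the last example are hypothetical and not taken from any datasheet.

```python
# Back-of-envelope comparison using the figures quoted in the post.
# Assumption (the post's own reading): SEAforth executes one Forth word
# per core per clock cycle at peak.

def forth_words_per_second(cores, clock_hz, cycles_per_word):
    """Peak Forth words per second, ignoring any memory stalls."""
    return cores * clock_hz / cycles_per_word

seaforth_peak = forth_words_per_second(cores=24, clock_hz=1e9, cycles_per_word=1)

# x86 range from the post: 5-10 cycles per Forth word at 3-4 GHz, one core.
x86_low = forth_words_per_second(cores=1, clock_hz=3e9, cycles_per_word=10)
x86_high = forth_words_per_second(cores=1, clock_hz=4e9, cycles_per_word=5)

def sustained_rate(peak_rate, external_words_per_op, memory_words_per_second):
    """Rate actually reachable when every operation needs some off-chip traffic."""
    if external_words_per_op == 0:
        return peak_rate
    return min(peak_rate, memory_words_per_second / external_words_per_op)

# HYPOTHETICAL numbers for illustration only: an external interface moving
# 100 M words/s, and a program needing one external word per 10 operations.
example = sustained_rate(seaforth_peak, 0.1, 100e6)

print(f"SEAforth peak:        {seaforth_peak:.2e} words/s")
print(f"x86 estimate:         {x86_low:.2e} to {x86_high:.2e} words/s")
print(f"memory-bound example: {example:.2e} words/s")
```

With these made-up memory figures the sustained rate drops to roughly the x86 range, which is the point of the paragraph: the result depends almost entirely on how much off-chip traffic the program needs.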
All right, let’s try to find an area where SF can be applied successfully (just look at my avatar: I’m smiling so wide… I’m an optimist!). Is there any task that does not need huge off-chip traffic? Happily, yes! Many kinds of DSP applications. They require a relatively narrow bandwidth of input data, then transform the input stream into something useful, then put it out, also through not very hungry hardware interfaces. We may have about 10 MSPS for the input stream, but billions of operations inside the chip. It is all we need! Unfortunately, many DSP algorithms need about 2-4-8-16 kwords to implement (and this could have been determined at the prototype testing stage!), so we must find something uncommon. That’s the main problem for DSP. I see some words about H.264. Yes, it’s possible, if we outperform (or offer a lower price than) the tons of other decoder devices. Hmmm, my children love cartoons from a DVD/mp3/tape/FM/audio/video player… indeed, I don’t know the entire list of supported media and standards. It costs about 80 USD, including the case, cables, other chips and components on the board (and the board itself).
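
The DSP budget in this paragraph can be worked out explicitly. This is a rough sketch, assuming that a kword means 1024 words and that all of a core's 64 words could be used for data (in reality code has to live there too); the function names are made up for the example.

```python
CORES = 24
WORDS_PER_CORE = 64          # local RAM per core, as quoted in the thread
PEAK_OPS_PER_SECOND = 24e9   # 24 cores at 1 GHz, figure from the post
SAMPLE_RATE_HZ = 10e6        # "about 10 MSPS"

def ops_per_sample(ops_per_second, sample_rate_hz):
    """How many operations the whole chip can spend on each input sample."""
    return ops_per_second / sample_rate_hz

def cores_needed_for(words, words_per_core=WORDS_PER_CORE):
    """Cores required just to hold `words` of data (ceiling division)."""
    return -(-words // words_per_core)

budget = ops_per_sample(PEAK_OPS_PER_SECOND, SAMPLE_RATE_HZ)
total_ram_words = CORES * WORDS_PER_CORE

print(f"operations available per sample: {budget:.0f}")
print(f"total on-chip RAM: {total_ram_words} words")
for kwords in (2, 4, 8, 16):
    words = kwords * 1024  # assumption: 1 kword = 1024 words
    print(f"{kwords:2d} kwords -> at least {cores_needed_for(words)} cores of RAM")
```

So the compute budget is generous, but even the smallest of the quoted working sets is larger than the RAM of all 24 cores together.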

What else? FFT, wavelets, digital filtering, neural networks. All of them need MAC units and storage for coefficient tables. Please, increase the memory sizes… Stop… not increase. Multiply them!!! :))

One area does seem like a good fit: genetic algorithms. Parallel computation, a small amount of memory: it is enough… as a starting point.

Ufff… I’m tired. My English is quite ugly, I’m constantly using Russian grammar… and I want a cup of tea!

in4

About this design path: I think there were some reasons for it.
But the result is rather interesting! It's like a concept car. You can't use all of its ideas in everyday life, only some of them. I'll use some of these solutions in my work!

I agree with Hishnik: I want more RAM on board in a SEAForth-like chip! From 32 kWords (enough for typing text, as on a PDA) and more, more... Flash as external memory is a bad choice due to its low endurance, only 300 000 writes. Using FRAM or MRAM is better, but their speed is too low.

I want to use a SEAForth-like system as a SoC.

forther

It looks like it is no longer possible to post in this forum without registration. And that makes it impossible for non-Russian speakers to post here at all, because the registration form is entirely in Russian.

We have to address this issue ASAP (or find another place for this discussion). Meanwhile, I'll post this message, which was classified as "spam", from my account.

=================================================
From: Guze

Hi, a good Russian friend pointed me to this forum, and after reading the previous posts, I had to get in.

Keep in mind that the first chip, with 24 cores, was done as a proof of concept demonstrating how a bunch of tiny cores could cooperate together to do big things. They are combined with a set of A/D and D/A converters to provide analog interfaces. And as you've noted they have a FLASH interface, but we use that only for loading the programs into the cores... applications that need additional RAM use the chip's external RAM interface -- typically for data buffers for audio (or perhaps video) blocks. Your comments about 64 words of local memory being limiting are true, but it's surprising what you can do with it... for instance, our lead coder has written FIR filters using only about 16 words of that space. The key is to spread your algorithms over multiple cores. For instance, instead of trying to hold an entire 8x8 matrix of data in one core and operate on it there, store each row in a separate core and operate on it in parallel -- 8 times faster.
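
The row-per-core idea from the 8x8 example can be illustrated with a toy model. This is a sketch in Python, not SEAforth code: the `Core` class, the step counting and the matrix-vector product chosen as the operation are all assumptions made for the illustration. Only the 64-word local memory and the "one row per core" layout come from the post.

```python
WORDS_PER_CORE = 64  # local RAM figure quoted in the thread

class Core:
    """Toy model of one node: a tiny local memory and a step counter."""

    def __init__(self, row):
        assert len(row) <= WORDS_PER_CORE, "row does not fit in local memory"
        self.row = list(row)
        self.steps = 0

    def dot(self, vector):
        """Multiply-accumulate the stored row against a vector."""
        acc = 0
        for a, b in zip(self.row, vector):
            acc += a * b
            self.steps += 1  # count one multiply-accumulate per step
        return acc

def matvec_one_core(matrix, vector):
    """Single core: rows are processed one after another, so steps add up."""
    total_steps = 0
    result = []
    for row in matrix:
        core = Core(row)
        result.append(core.dot(vector))
        total_steps += core.steps
    return result, total_steps

def matvec_row_per_core(matrix, vector):
    """One core per row: all cores work at once, so elapsed time is the
    step count of the slowest core, not the sum."""
    cores = [Core(row) for row in matrix]
    result = [core.dot(vector) for core in cores]  # conceptually simultaneous
    return result, max(core.steps for core in cores)

matrix = [[r * 8 + c for c in range(8)] for r in range(8)]
vector = [1] * 8

seq_result, seq_steps = matvec_one_core(matrix, vector)
par_result, par_steps = matvec_row_per_core(matrix, vector)
print(seq_steps, par_steps, seq_steps // par_steps)
```

The model ignores the cost of moving the vector and the results between cores, so the factor of 8 is an upper bound for this layout.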

We are rapidly developing new chips in this family with more cores... a LOT more cores... and bigger local memories (though not in the 32k size). Our position is that if you need more memory for your application, then you're not spreading your code over enough cores. To us that local memory is precious -- but the cores are so cheap they're basically free!!!!! Use as many cores as you need. Combine them to make FIFOs, DSPs, H.264 codecs (as we're doing internally). Remember the key to using this family of chips is to spread your code over LOTS of cores -- we've tested our concept at 1024 cores and with that computing power, the sky is the limit on your applications.

Enjoy.

in4

Programming such an out-of-the-ordinary chip requires something like a shift in thinking (away from traditional CPU programming).
In yesterday's discussion forther suggested that I look at SEAForth's internal memory as the registers of a conventional CPU!
A rather new viewpoint for me...

Before that I thought of SEAForth as an i8080-based computer on one chip, with improvements...

jeff

As system architect at IntellaSys I can say a few things about the SEAforth design. I came to Chuck in 1990 and began paying to develop and build a parallel Forth chip. As my understanding grew of how Chuck's ideas on Forth software had evolved after making several generations of Forth chips, the instruction set of the proposed chip evolved to reflect the more modern ideas. We intended to make a multi-core chip after we had the single-core design with coprocessors working, in order to keep development costs in the six-figure range. We did P21, I21, F21 and later, at IntellaSys, the SEAforth design. F21 prototyped on-chip ROM but demonstrated the bottleneck of having only external RAM. The biggest issue in optimizing code on F21 was keeping things on-chip or on-page, as memory access was ten to one hundred times slower than stack access.

With on-chip memory in SEAforth, stack access is only about four times faster than RAM, but that is all you need to keep up full-speed execution of stack opcodes from memory without expensive pipelining or caching. While pricing has not been announced and may vary according to volume and what software is bundled with a purchase, one should think of these individual processors as being in the class of processors that cost way under a dollar each. Most processors that only cost a few cents can't do hundreds of millions of opcodes a second, do tens of millions of analog samples a second, respond to real-time events in a few picoseconds, or communicate with other processors at over 4 gigabits per second.

Much of the low cost and high speed is directly related to small size. More memory means slower, bigger, more expensive and fewer nodes for a given sized chip. The real estate area for ROM is much less than for RAM. The modest 64 words of RAM for holding up to 256 CPU instructions per node is about half the size of the node. So if each processor had ten times the memory, 640 words, there would be one tenth as many processors on a chip. If each processor had a hundred times as much memory, 6.4K words, then there would be one hundredth as many processors on a chip of the same size and cost.
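
The memory-versus-node-count trade-off can be sketched as a simple area model. This is only an illustration under stated assumptions: RAM area is taken to scale linearly with word count, the rest of the node stays fixed, and `relative_node_count` is a name invented for the example. When memory dominates the node area the model reproduces the round figures of one tenth and one hundredth; taking "about half the size of the node" literally gives somewhat more nodes than that.

```python
def relative_node_count(memory_factor, ram_fraction=0.5):
    """Nodes per fixed-size die, relative to the 64-word baseline,
    when the RAM of each node is multiplied by `memory_factor`.

    ram_fraction is the share of baseline node area taken by RAM
    ("about half" in the post). Linear area scaling is an assumption.
    """
    node_area = (1 - ram_fraction) + ram_fraction * memory_factor
    return 1 / node_area

# 64 words holding up to 256 instructions implies 4 opcodes per word.
INSTRUCTIONS_PER_WORD = 256 // 64

for factor in (1, 10, 100):
    half = relative_node_count(factor)                      # RAM is half the node
    dominated = relative_node_count(factor, ram_fraction=1.0)  # RAM is all that matters
    print(f"{64 * factor:5d} words/node: "
          f"{half:.3f} (RAM = half the node), {dominated:.3f} (RAM dominates)")
```

Either way the conclusion is the same: every multiple of local memory is paid for with a large fraction of the node count.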

So why such a small number as 64? The number was chosen based on analysis of code. I found it interesting that in 1970 Chuck said that most applications fit in one K, and people seemed to accept that for years. But by the nineties, when he made the same statement, many programmers objected, saying that their programs were huge and could never fit in 1K. I found it interesting that we had a lot of programs and modules in the iTV products, like our embedded Internet browser/email package. We had teams coding in machineForth, using the 5-bit opcodes packed together, and teams using traditional (Standard) Forth. The difference in size was quite dramatic: after a dozen man-years of coding, one machineForth module had expanded to 1.1K and another to almost 2K, while traditional Forth modules tended to be an order of magnitude larger.

A number of things were done to further optimize the instruction set, to further reduce the size of programs and improve their execution efficiency. And since the idea was to put multiple cores on a single die, the decision was made to make the on-chip memory only as big as it needed to be, as shown by real programs being simulated. I found it amusing that when people said that software had bloated so that 1K was no longer sufficient to hold a program, Chuck decided to reduce memory to 64 words.

Of course many people say that efficient software coding is not important because processors are cheap and memory is cheap. But in the embedded world the goal is zero cost and zero power consumption and infinite volume. And if you paid for ten times more memory than you needed, you may not see the other disadvantages of having ten times more code than is needed. There is a cost for developing efficient code.
It is just a different cost than for developing inefficient code. You pay up front or you pay later.

The idea is that with multiprocessing on a multi-core chip one will want to get multiple cores to cooperate on a problem and have each stage in a pipeline perform only a part of the program. This allows programs to be faster and the hardware to be faster and smaller. The issue is that there is a tradeoff between on-chip memory and the number of nodes you get for a given cost. Unlike uncooperative software systems that require a big expensive OS to manage the software fights, Forth traditionally used cooperative software techniques to avoid the problems that most programmers in other environments have.
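
The idea of each pipeline stage performing only part of the program can be sketched with a FIR filter spread over stages, one tap per stage. This is a Python illustration using generators as stand-ins for cores; `tap_stage` and `fir_pipeline` are hypothetical names, and nothing here is claimed to match how the IntellaSys coders actually wrote their filters. Each stage holds only one coefficient and one delayed sample, and passes a (sample, partial sum) pair to its neighbour.

```python
def tap_stage(coeff, upstream):
    """One pipeline stage: holds a single coefficient and a single delayed
    sample, adds its term to the running sum, and passes the sample on
    one step late."""
    previous = 0  # the only data word this stage has to remember
    for sample, partial_sum in upstream:
        yield previous, partial_sum + coeff * sample
        previous = sample

def fir_pipeline(coeffs, samples):
    """Chain one stage per coefficient; the output of the last stage is
    y[n] = sum over k of coeffs[k] * x[n - k]."""
    stream = ((x, 0) for x in samples)
    for coeff in coeffs:
        stream = tap_stage(coeff, stream)
    return [partial_sum for _, partial_sum in stream]

def fir_reference(coeffs, samples):
    """Straightforward single-processor version, used only as a check."""
    out = []
    for n in range(len(samples)):
        out.append(sum(c * samples[n - k]
                       for k, c in enumerate(coeffs) if n - k >= 0))
    return out

coeffs = [1, 2, 3]
impulse = [1, 0, 0, 0]
print(fir_pipeline(coeffs, impulse))  # impulse response equals the coefficients
```

No stage ever needs more than a couple of words of data, which is why adding taps means adding stages rather than adding memory.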

Sure there are problems that require manipulation of very large random access data sets which are not easily optimized in the manner described above. For that people make chips that are hundreds or thousands of times bigger and more expensive. But they are intended for a completely different class of applications. The bigger chips tend to give up realtime performance and quickness for raw speed on big things because they are for things that are very different than what SEAforth is optimized to do.

And without interrupts and memory management, a processor can be made to respond to an event in the real world by running the code associated with that event much faster than could ever be possible with interrupts, since they require memory accesses. And it is impossible to respond to a bunch of different events quickly with a single processor and interrupts. But when you have processors dedicated to realtime event processing, as you need for things like single-chip soft-radio applications that require phase locked loops in software at R/F frequencies, the constraints are quite different than with single processor designs.

I saw a different multi-core chip recently that was designed to run C code, and which was designed to have enough memory to host a copy of Linux on each of the nodes on chip. So they needed 64K bytes of RAM for each on-chip node for C and Linux while we have a design with only 64 words on the first prototype chips because we are not intending to need to use C and Linux for what we do. Forth needs much less.
So for the same size, price, and power consumption one could get one processor with a copy of Linux to do multitasking or hundreds of processors running in parallel each at several times higher speed than the big expensive design. These two things are so different that they just don't or won't compete for the same application areas.

Traditional Forth decades ago was about the problem of matching Forth ideas to hardware that wasn't designed to do Forth. That sometimes meant an impedance mismatch, because Forth only needed a little and designers put in a lot. So sometimes Forth only used a few of the opcodes that a user had to pay for and power on a design. Chuck reasoned that if the processor executed Forth directly, without this translation layer where the impedance mismatch and inefficiency took place, Forth could be much simpler and more efficient. And things that were done in Forth for processors designed to run other things didn't have to be done in Forth for Forth; things could be done in a way that complemented Forth design and which could make Forth programs smaller and simpler. Chuck said you can't discuss software without a hardware context, and if you can define the hardware, you have just what you need and nothing more, and you can make the software much simpler and easier.

We have a number of groups working on various applications. They tend to be things that were not possible without this level of cheap and faster hardware. They also require software of course. Software is involved in creating the hardware and software for the hardware is needed to get any value from the hardware. That's where Forth comes in.

Best Wishes

Hishnik

I really wonder why streaming, highly parallel DSP algorithms cannot be run on SIMD architectures. Of course, such an architecture can be based on a Forth processor (for example, like my newest design, implemented in an FPGA). We may adopt a multicore design, but it is not the best solution. In this case, regular, streaming and very similar tasks for several cores may easily be replaced by the SIMD approach. A further step on the way of analysing hardware and software together may consist of replacing some actions performed by the processor with actions performed by hardware. In this case, the ratio between the data stream and the command stream may be 10, 50, 100 or greater. So I'm trying to imagine what I would lose in a scheme where one processor core controls 128 independent DSP microcores?
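
The difference between the two schemes can be shown with a toy model. This is a hedged sketch in Python: `simd_run` and `mimd_run` are invented names, instructions are modelled as plain functions, and only the figure of 128 microcores comes from the post. It counts how many instructions have to be issued in each scheme, and shows the one thing the SIMD scheme cannot do: give different lanes different programs.

```python
LANES = 128  # "128 independent DSP-microcores" from the post

def simd_run(program, lane_data):
    """One control core issues each instruction once; every lane applies it
    to its own data in lockstep."""
    issued = 0
    for op in program:
        lane_data = [op(x) for x in lane_data]
        issued += 1
    return lane_data, issued

def mimd_run(programs, lane_data):
    """Multicore: every core fetches and executes its own instruction stream."""
    issued = 0
    out = []
    for program, x in zip(programs, lane_data):
        for op in program:
            x = op(x)
            issued += 1
        out.append(x)
    return out, issued

scale_and_offset = [lambda x: x * 3, lambda x: x + 1]
data = list(range(LANES))

simd_out, simd_issued = simd_run(scale_and_offset, data)
mimd_out, mimd_issued = mimd_run([scale_and_offset] * LANES, data)
print(simd_issued, mimd_issued)  # same results, very different instruction traffic

# What SIMD gives up: here even and odd lanes run different programs,
# which the single shared instruction stream cannot express directly.
mixed = [scale_and_offset if i % 2 == 0 else [lambda x: -x] for i in range(LANES)]
mixed_out, _ = mimd_run(mixed, data)
```

For regular streaming work the data-to-command ratio equals the number of lanes; the multicore scheme pays for its extra instruction fetches with the freedom to make each core do something different.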

Also, 'hardware & software co-design' came up years ago. There are many development tools that aim to analyse the properties and requirements of the software and then help to generate an appropriate hardware platform. The question is 'how effectively can we implement this approach?'

Alexander

Take a minute to relax

C & SEA - Can you find 10 differences?

Yes, I wondered while reading the SEAForth datasheets

It reminds me of a scalable architecture from 20 years ago.

Wlad

Alexander wrote:

It reminds me of a scalable architecture from 20 years ago.

Whose?
