This thesis investigates the properties of systems composed of recurrent neural networks. Systems of networks with different learning time-dynamics are of special interest. The idea is to create a system that possesses a long-term memory (LTM) and a working memory (WM). The WM is implemented as a memory that works in a similar way to the LTM, but with learning and forgetting at much shorter time scales. The recurrent networks are trained with an incremental, Bayesian learning rule, which is based on Hebbian learning. The thesis contains a thorough investigation of how to design the connection between two neural networks with different time-dynamics. Another topic of interest is the possibility of compressing the memories in the working memory without any major loss in functionality. At the end of the thesis, these results are used to create a system that is aimed at modeling the cerebral cortex.










A Study of Interacting Bayesian Artificial Neural Networks with Incremental Learning


This report investigates the properties of systems built from recurrent neural networks. Systems built from networks that learn on different time scales are of special interest. The goal is to create a system with a long-term memory and a working memory. The working memory is realized in the same way as a long-term memory, but it operates on much shorter time scales. The recurrent neural networks are trained with an incremental, Bayesian learning rule. The learning rule is based on Hebbian learning. The report contains a thorough investigation of how two neural networks can be connected. The possibility of compressing the representation in the working memory, without degrading performance, is also studied. At the end of the report, the results of the initial experiments are used to create a system that mimics the cerebral cortex.










Acknowledgements and general information.



This master’s thesis in Computer Science was performed at Studies of Artificial Neural Systems (SANS), at the Department of Numerical Analysis and Computer Science (Nada), the Royal Institute of Technology (KTH). The work was done during the autumn of the year 2000. Professor Anders Lansner, head of the SANS research group, was the examiner of the project.


I would like to thank Anders Sandberg for his help and support during the project. Anders S. managed to answer all the questions I had during the work on the project. He also taught me a great deal about a diverse set of topics. Anders Lansner awakened my interest in the subject of research, and he has been encouraging throughout the work. I would also like to thank everyone else in the SANS group for an inspiring environment: Örjan Ekeberg, Erik Fransén, Pål Westermark, Peter Raicevic, Anders Fagergren, Jeanette Hellgren Kotaleski, Alexander Kozolov and Erik Aurell.







1.0       Introduction

1.1      Short-term Storage Process

1.2      Design Considerations and Concepts

1.3      Overview of the Thesis

2.0       Background

2.1      The Constituents of Memory

2.1.1     Long- and Short-term Memory

2.1.2     Explicit and Implicit Memory

2.2      The Nervous System

2.2.1     Nerve Cells

2.2.2     The Cerebral Cortex

2.3      Computational Structures Designed to Mimic Biological Neural Networks

2.3.1     Different Approaches to Associative Memories

2.3.2     The Hopfield Model

2.3.3     Extensions to the Hopfield Model

2.4      Bayesian Attractor Networks

2.4.1     The Bayesian Artificial Neural Network with Incremental Learning

2.4.2     Equations of the Incremental, Bayesian Learning Rule

2.4.3     A Biological Interpretation of the Bayesian Attractor Network

2.5      Interesting Concepts

2.5.1     Short-term Variable Binding

2.5.2     Chunking

3.0       Methods

3.1      Design and Input

3.1.1     Network and Systems

3.1.2     Input

3.2      Network Operation

3.2.1     Training

3.2.2     Testing

3.3      Parameters

4.0       Network Structures

4.1      Systems with High and Low Plasticity

4.1.1     One Network with Two Sets of Recurrent Connections

4.1.2     Two Networks, Each with One Set of Recurrent Connections

4.2      Plastic Connections

4.2.1     Plastic Connections

4.2.2     Sparse Plastic Connections

4.2.3     Differently Represented Patterns in LTM and STM

4.3      Summary

5.0       Properties of Connected Networks

5.1      Systems with Reduced Size of the STM

5.1.1     STM as a Subset of the Hypercolumns in LTM

5.1.2     STM as a Set of Sub-sampled Hypercolumns in LTM

5.1.3     STM is a Subset of Sub-sampled Hypercolumns

5.2      Interfering Effects

5.2.1     Effects of LTM on STM

5.2.2     Effects of STM on LTM

5.3      LTM Helped by STM on Retrieval

5.4      STM Ability to Suppress Old Information in the LTM

5.5      Summary

6.0       STM Used as Working Memory

6.1      System Based on LTM and STM

6.2      System Built with Modules of LTM and STM

6.3      Summary

7.0       Conclusions

8.0       References

8.1      Figure References

8.2      Bibliography









1.0     Introduction



It is fairly clear that the brain possesses several kinds of memory systems and processes. These can be divided into two major categories: long-term memory (LTM) and short-term memory (STM). This thesis will focus on how the STM can be constructed. The thesis will also be concerned with the interaction between LTM and STM and the effects that arise in systems that are comprised of these two sorts of memories. 



1.1         Short-term Storage Process


Long-term memory processes have long been considered to reside in the synapses. Long-term potentiation (LTP) and long-term depression (LTD) have been observed to occur in synapses. LTP and LTD in the synapses are thought to constitute the long-term memories that we possess [1]. Based on these observations, parallels have been drawn between artificial neural networks and populations of nerve cells, and it follows that an artificial neural network (ANN) can be used as a memory. This type of ANN is called an attractor network. Attractor networks have been suggested to constitute a good model of how the LTM works [2-6].


It is a common view in the research community that short-term storage of memories is based on a persistent activity in the neurons. In this view each short-term memory resides in the brain as a pattern of continuously active neurons. Approximately seven different memories can be stored in the STM [7]. These seven memories are maintained through some sort of cyclic reactivation process. Exactly how this cyclic reactivation process is implemented is unknown. The problem with this cyclic reactivation is that it is sensitive to noise and disturbances. If the cyclic reactivation process is disrupted, the memories are lost. Currently there is much research on the subject of how to construct a redundant reactivation process of the memories in the STM [8-10].


In this thesis, a different view is adopted of how the short-term memory process is attained. In the presented hypothesis, the short-term memories are stored in the synapses between neurons. This means that the short-term memory process is similar to the long-term memory process, but it works on a shorter time scale. In this storage scheme there is no need for a cyclic reactivation mechanism to maintain the memories. Instead, noise is used to randomly activate and maintain the memories in the STM. This process, for maintaining the memories, is more redundant than the cyclic process. Another good characteristic is that the memories are not lost if the system is disrupted. In this thesis, I have used an attractor network with high plasticity to simulate the STM.


The main focus of the thesis was to establish that an attractor network with high plasticity could be used as a model of STM, and that the STM could function as a working memory. 





1.2         Design Considerations and Concepts


The most important function of an STM is to hold information about the current situation. The STM also needs to be able to swiftly change its associations as new information arrives. The STM does not need to capture the details of the arriving information. The details are captured and stored by the LTM.  This means that it is more important for the STM to make the correct associations to the pattern in LTM than to be able to store the complete pattern.


An important question is how the memories stored in the LTM most effectively can be represented in the STM. The representation of memories in the STM does not need to be identical to the representation of the memories in the LTM. A requirement to make this possible is that there is a distinct connection from each memory in the STM to the corresponding memory in LTM. Several different methods can be used to create the compressed memories of the STM and associate these STM memories to their corresponding memories in the LTM. 


A central concept is working memory. The term working memory, introduced by Alan Baddeley [11], is a concept used in cognitive psychology. This thesis shows how the STM, based on the Bayesian neural network, can be used as a working memory. In systems with a working memory, short-term variable binding (STVB) can be made. STVB is sometimes referred to as role filling. STVB is a basic function needed to construct a logical reasoning system.


The basic function of an auto-associative memory is, as the name suggests, to associate the input to one of the stored memories. If an input pattern is presented to the auto-associative memory, the memory will respond with the stored memory that most closely resembles the input. The associative memories in this thesis were implemented with attractor networks. Attractor networks are constructed with recurrent networks of artificial neurons. In the recurrent network, each neuron has connections to all other neurons.  Each of these connections is equipped with a weight that controls the influence between the neurons. The connection-weights form a matrix. The attractor network stores the patterns by altering the values of the weights [12].


The attractor networks were constructed with artificial neurons. The artificial neural networks were implemented with a palimpsest, incremental, Bayesian learning rule [13]. This Bayesian learning rule allows the user to control the temporal properties of plasticity in the network by the modification of a single parameter.













1.3         Overview of the Thesis


Chapter 2 contains a basic introduction to cognitive neuroscience. There is a description of how memories are categorized into explicit and implicit memories. The chapter also contains an overview of the anatomy of the nervous system. The anatomy and function of nerve cells are briefly presented.  A short presentation of artificial neural networks is given. A closer look at attractor networks and the Hopfield model is made. An overview of associative memories is given, as they constitute a central concept in this thesis. Then, the incremental, Bayesian learning rule is introduced. A biological interpretation of this learning rule is also made. Finally, some interesting concepts that can be found in auto-associative memory systems are presented. 


In chapter 3 the implementation of the Bayesian network model is presented. The physical realization of the neural networks is presented along with the choice of parameter values. The basic behavior of attractor networks based on the Bayesian learning rule is described. The environment in which the networks operate is also presented, along with how the networks were tested.


The concern of chapter 4 is to illustrate how a system can be made out of two networks and how these networks can be made to cooperate. A few basic design ideas are studied.


The focus of chapter 5 is on how the representation of the memories can be compressed in the STM. A couple of different alternatives are studied. Then, interest is turned to some important functions that can be found in a system composed of an LTM and an STM.


Finally, in chapter 6, there is a demonstration of how a working memory can be useful. STM is used as a working memory in the systems. The systems are presented with a task that requires the use of both an LTM and STM. It is also shown how larger systems can be built with the use of smaller modules constructed out of a single LTM and STM.






















2.0     Background



2.1         The Constituents of Memory


During the latter part of the 20th century, the study of the brain moved from a peripheral position within both the biological and psychological sciences to become an interdisciplinary field called neuroscience that now occupies a central position within each discipline. This realignment occurred because the biological study of the brain became incorporated into a common framework with cell and molecular biology on the one side and psychology on the other. In recent years, neuroscientists and cognitive psychologists have recognized many important aspects of different kinds of memory. There is much speculation about how a biological memory is constructed and what functions it has. Since memory is a highly integrated system, it is hard to test specific parts or properties of it. The mixture of two disciplines is one of the reasons why there are so many ideas concerning how memory is constructed and why there is a jungle of terminology surrounding the subject.


Different memory systems have been distinguished according to several attributes or criteria. Here are some of the more important differences: the content or kind of information those systems mediate and store (episodic / semantic / procedural memory) and how they store and retrieve that information (explicit / implicit memory). Another distinction is the memory’s storage capacity and the duration of the information storage (LTM / STM).



Figure 1    An illustration of how the properties of memory can be viewed to be orthogonal. The horizontal axis represents the retention time span of memories and the vertical axis could be said to represent awareness of memories.  



The differentiation of memories into the categories LTM / STM and explicit / implicit memory has been made by psychologists. These categories are not quite relevant at the cellular level. When simulating memories with neural networks, it is more relevant to talk about the retention and induction times of memories. The retention time is the time period over which memories are stored and the induction time is the time needed to learn a new memory.

2.1.1       Long- and Short-term Memory


LTM can be thought of as a sturdy memory with almost unlimited capacity. The LTM is thought to reside in the different receptive areas of the cerebral cortex [14]. A closer description of the cortical areas of the cerebral cortex is presented in section 2.2. LTM can be seen to be composed of two different types of memory: declarative memories and nondeclarative memories. An example of a declarative memory is the name of your mother, while your cycling skill is a nondeclarative memory. The time scale for induction of long-term memory operations ranges from minutes to years. The time span of a memory depends on a number of factors. One of the most important factors is the number of times the memory is presented to you. 


The concept of an STM has been around for a long time. The time scale of short-term memory operations ranges from less than a second to minutes. It is an appealing idea that there exists some sort of temporary memory storage where sensory impressions could temporarily be stored before they are processed or before they become consolidated into LTM. Several kinds of STM have been described, again mainly on the basis of storage-time distinctions and phenomenal or neuropsychological data. The shortest form of STM would be iconic memory [15], which has the capacity to retain a visual image for up to 1 second after presentation. Echoic memory is used to store sounds, and has a slightly longer time span than iconic memory. Immediate memory would last a few seconds longer. Although different kinds of STM have been proposed, I will not go deeper into the subject of different kinds of STM. Instead I will adopt a broader view of the subject.


The definition of STM that transcends the temporal criterion is working memory. Working memory is a concept of STM that derives from cognitive psychology [11]. Working memory is thought to be a temporary storage used in the performance of cognitive behavioural tasks, such as reading, problem solving, and delay tasks (e.g., delayed response and delayed matching to sample), all of which require the integration of temporally separate items of information. Baddeley has more recently developed his view of working memory, and he now states that it consists of a phonological loop, a visuospatial sketchpad and the central executive [11].



2.1.2       Explicit and Implicit Memory




Figure 2    A hierarchical view of the constituents of explicit and implicit memory. Explicit memories are memories of which you are aware. Implicit memories are memories you possess, but are not aware of. Explicit memories can be divided into two categories: episodic and semantic memories. Episodic memories are whole scenarios. Semantic memories are lexical memories, i.e. words or memories of facts. A form of implicit memory is procedural memory. As mentioned earlier, the group of implicit memories are memories you are not aware of, e.g. the skill of cycling.



Explicit (or declarative) memory is the memory of events and facts; it is what is commonly understood as personal memory. One part of it contains the temporally and spatially encoded events of the subject’s life. For this reason it has alternately been called episodic memory [16, 17]. Another part contains the knowledge of facts that are no longer ascribable to any particular occasion in life; they are facts that, through single or repeated encounters, the subject has come to categorize as concepts, abstractions, and evidence of reality, without necessarily remembering when or where he or she acquired them. This is what Tulving has called semantic memory [17]. The retention time of explicit memories is often longer than the induction time of implicit memories.


Implicit (or nondeclarative) memory, the counterpart of declarative memory, is a somewhat difficult concept to grasp. It can be viewed as the memory for the development of motor skills although it encompasses a wide variety of skills and mental operations. Cohen and Squire called this type of memory procedural memory [18]. Implicit memory can also be viewed as the influence of recent experiences on behaviour, even though the recent experiences are not explicitly remembered. For example, if you have been reading the newspaper while ignoring a television talk show, you may not explicitly remember any of the words that they used in the talk show. But in a later discussion, you will more likely use the words that were used in the talk show.  Psychologists call this phenomenon priming, because hearing certain words “primes” you to use them yourself.



2.2         The Nervous System


The nervous system consists of the central nervous system and the peripheral nervous system. The central nervous system (CNS) is the spinal cord and the brain, which in turn includes a great many substructures. The peripheral nervous system (PNS) has two divisions: the somatic nervous system, which consists of the nerves that convey messages from the sense organs to the CNS and from the CNS to the muscles and glands, and the autonomic nervous system, a set of neurons that control the heart, the intestines and other organs.


The brain is the major component of the nervous system and it is a complex piece of “hardware”. Weighing approximately 1.4 kilograms in an adult human, it consists of more than 10^10 neurons and approximately 6x10^13 connections between these neurons [19]. The struggle to understand the brain has been made easier by the pioneering work of Ramón y Cajal [20], who introduced the idea of neurons as structural constituents of the brain. I will now make some comparisons that are far from exact, but quite illustrative. Typically, neurons are five to six orders of magnitude slower than silicon logic gates; events in a silicon chip happen in the nanosecond (10^-9 s) range, whereas neural events happen in the millisecond (10^-3 s) range. However, the brain makes up for the relatively slow rate of operation of a neuron by having a truly staggering number of neurons with a massive number of interconnections between them. Although the brain contains an incredibly large number of neurons, it is still very energy efficient. The brain uses approximately 10^-16 joules per operation per second, whereas the corresponding value for the computers in use today is about 10^-6 joules per operation per second [21]. If one assumes that the brain consumes 400 kilocalories per 24 hours, the brain has a power of about 20 watts, which is comparable to that of a modern processor.
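The power estimate can be checked with a short calculation. The sketch below assumes that the energy unit in the text is the kilocalorie and uses the conversion 1 kcal = 4184 J, which is not stated in the text; the function name is chosen here for illustration only.

```python
# Assumption (not from the text): 1 kilocalorie = 4184 joules.
JOULES_PER_KCAL = 4184.0
SECONDS_PER_DAY = 24 * 60 * 60


def average_power_watts(kcal_per_day: float) -> float:
    """Convert a daily energy consumption in kcal to an average power in watts."""
    return kcal_per_day * JOULES_PER_KCAL / SECONDS_PER_DAY


brain_power = average_power_watts(400.0)
print(f"{brain_power:.1f} W")  # a little under 20 W
```

Under these assumptions the result is slightly below 20 watts, which agrees with the rounded figure given above.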



2.2.1       Nerve Cells


What sets neurons apart from other cells is their shape and their ability to convey electrical signals. The anatomy of a neuron can be divided into three major components: the soma (cell body), dendrites and an axon. The soma contains a nucleus, mitochondria, ribosomes and the other structures typical of animal cells. Neurons come in a wide variety of shapes and sizes in different parts of the brain. The pyramidal cell is one of the most common types of cortical neurons. About 80% of all cortical neurons are pyramidal cells. The typical pyramidal cell can receive more than 10,000 synaptic contacts, and it can project onto thousands of target cells. Axons are the transmission lines from the soma to the synapse, and dendrites are the transmission lines from the synapses to the soma. These two types of cell filaments are often distinguished on morphological grounds. An axon often has few branches and greater length, whereas a dendrite has more branches and shorter length. There are some exceptions to this view. Dendrites may contain dendritic spines where specialized axons can attach [22].




Figure 3    A typical nerve cell. The nerve cell is depicted here with only its most important filaments: the dendrites, the axon with its synaptic terminals, and the cell body. Although the depicted cell has the most common characteristics of a nerve cell, there is a lot of variation in the appearance of nerve cells.  [Figure 3]



Synapses are elementary structural and functional units that mediate the interactions between neurons. The most common kind of synapse is a chemical synapse. A presynaptic process releases a transmitter substance that diffuses across the synaptic junction between neurons and then acts on postsynaptic receptors. Thus a synapse converts a presynaptic electrical signal into a chemical signal and then back again into a postsynaptic electrical signal. It is assumed that a synapse is a simple connection that can impose excitation or inhibition on the receptive neuron (but not both) [23]. It is established that synapses can store information about how easily signals should pass through them. The process that accounts for this ability is LTP (Long Term Potentiation). In the case of inhibitory synapses there is a similar process called LTD (Long Term Depression) [1].


The majority of neurons encode their outputs as a series of brief voltage pulses. These pulses, commonly known as action potentials or spikes, originate at or close to the soma (cell body) of neurons and then propagate across the individual neurons at constant velocity and amplitude. The reasons for the use of action potentials for communication among neurons are based on the physics of axons. The transportation of the action potentials is an active process. The axon is equipped with ion pumps that actively transport K+, Na+, Cl-, Ca2+ and other ions in and out through the axon’s cell membrane. The active transportation of action potentials is necessary when the axons span great distances; otherwise the action potentials would be attenuated too much. If the action potentials are reduced too much when they reach the synapses, they are not able to initiate the release of transmitter substances. The myelin, or fat, that surrounds the axons lessens the attenuation of the action potentials and increases the speed of transmission [24].



2.2.2       The Cerebral Cortex


The surface of the forebrain consists of two cerebral hemispheres: one on the left side and one on the right side that surround all the other forebrain structures. Each hemisphere is organized to receive sensory information, mostly from the contralateral side of the body, and to control muscles, mostly on the contralateral side, through axons to the spinal cord and cranial nerve nuclei. The cellular layers on the outer surface of the cerebral hemispheres form a grey matter known as the cerebral cortex. Large numbers of axons extend inward from the cortex, forming the white matter of the cerebral hemispheres. Neurons in each hemisphere communicate with neurons in the corresponding part of the other hemisphere through the corpus callosum, a large bundle of axons.



Figure 4    The cerebral cortex of a human brain. In the picture the cortex has been divided into its major functional areas. In the literature one can find various ways to divide the cortex into areas. There are descriptions of the cortex where it has been divided into more than 50 functional areas. [Figure 4]



The cerebral cortex has a very versatile functionality. At a glance, the cortex seems to be structurally highly uniform. This suggests that the functionality of the cortex is very general, but we know that different areas of the cortex handle specific tasks. This is supported by the fact that the microscopic structure of the cells of the cerebral cortex varies substantially from one cortical area to another. The differences in appearance relate to differences in the connections, and hence the function. Much research has been directed toward understanding the relationship between structure and function. The sensory and motor cortical areas have been found to have at least a partial, hierarchical order. In the case of the sensory cortical areas, there are many connections from the higher order sensory areas to the prefrontal cortex. In the case of the motor cortical areas, there are many connections leading from the prefrontal cortical area to the higher order motor cortical areas [25].




Figure 5    The cerebral cortex is divided into six layers. Layers 2 and 3 are often considered as a single layer, layer 2 & 3. The cortex is also thought to be divided into columns that run perpendicular to its layers. These columns can be grouped into larger structures, called hypercolumns. [Figure 5]



In humans and most other mammals, the cerebral cortex contains up to six distinct laminas, layers of cell bodies that are parallel to the surface of the cortex. Layers 2 and 3 are usually seen as one layer. Most of the incoming signals arrive in layer 4. The neurons in layer 4 send most of their output up to layer 2 & 3. Outgoing signals leave from layers 5 and 6.


In the sensory cortical areas, the cells or neurons with similar interests tend to be vertically arrayed in the cortex, forming cylinders known as cortical columns. The small structures, called mini-columns, are about 30 µm in diameter. These columns are grouped into larger structures called hypercolumns that are about 0.4-1.0 mm in size. In the artificial neural network used to run the simulations described later on, there will be a similar concept to the hypercolumns. Outside the sensory areas the structures of the columns are less distinct. Each column in a hypercolumn can be seen to perform a small and specific piece of the work that is performed by the entire hypercolumn. Within a hypercolumn the communication between the columns constituting the hypercolumn is very intensive [26].



2.3         Computational Structures Designed to Mimic Biological Neural Networks                                                                  


Neural networks are very interesting because they work in a completely different way than a conventional digital computer does. Neural networks process information using a vast number of non-linear computational units. This means that the computations are done in a non-linear and highly parallel manner. A conventional computer, based on the von Neumann machine, often only uses one computational unit, hence it processes the information in a sequential manner. It is often said that neural networks are superior to the standard von Neumann machines. This is not true, but it is a fact that neural networks and von Neumann machines are good at different forms of computations [27].  


The neural networks used in this thesis were implemented on regular desktop computers. This is usually the case since it is much easier to construct an implementation in software than in hardware, but a hardware implementation of a neural network would be much more resource efficient.



2.3.1       Different Approaches to Associative Memories


An associative memory is a memory that stores its inputs without labelling them (memories are not given an address). To recall a memory you need to present the associative memory with an input similar to the memory you want to retrieve. There are two types of associative memories: auto-associative memories (which are sometimes also referred to as content addressable memories) and hetero-associative memories. When a fragmented pattern is presented to an auto-associative memory, the memory tries to reconstruct the pattern. If a fragmented pattern is presented to a hetero-associative memory, the memory tries to associate the presented pattern to another pattern. Note that all the associations are learned in advance [28].


The basic idea behind an auto-associative memory is very simple. Each memory is represented by a pattern. A pattern is a vector containing N binary values corresponding to the states of the N neurons. When an auto-associative memory has been trained with a set of P patterns { x^µ } and then presented with a new pattern x^(P+1), the auto-associative memory will respond by producing whichever one of the stored patterns most closely resembles x^(P+1). This could of course be done with a conventional computer program that computes the Hamming distance between pattern x^(P+1) and each of the P stored patterns, where the Hamming distance between two binary vectors is the number of bits that are different in the two vectors. But if the patterns are large and very many (these two attributes usually come together), the auto-associative memory with its highly parallel structure will be immensely faster than the conventional computer program. An example application is image recognition. Imagine that you receive a very noisy image of your house. If this image previously has been stored in the auto-associative memory, the memory will produce a reconstruction of the image.
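The conventional, sequential approach mentioned above can be sketched in a few lines. This is only an illustration of nearest-pattern retrieval by Hamming distance, not of the attractor networks used in this thesis; the function names and the example patterns are hypothetical.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(x != y for x, y in zip(a, b))


def recall_nearest(stored: List[Sequence[int]], probe: Sequence[int]) -> Sequence[int]:
    """Return the stored pattern closest to the probe (the first one wins on ties)."""
    return min(stored, key=lambda pattern: hamming_distance(pattern, probe))


# Hypothetical example data: three stored patterns of N = 8 binary values.
stored_patterns = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
noisy_probe = [1, 1, 0, 1, 0, 0, 0, 1]  # first pattern with two bits flipped

print(recall_nearest(stored_patterns, noisy_probe))
```

The loop over all P stored patterns is what makes this approach sequential: its cost grows with both the number and the size of the patterns.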


Associative memories offer more than just noise removal. Associative memories have the very important ability to generalize. This makes it possible for associative memories to handle situations where they are presented with memories never before encountered. Another side of generalization is categorization of memories. This is also a feature handled by associative memories. Categorization means that similar memories are stored as one memory. The common features of the memories, stored in the same category, are stored robustly and are easy to retrieve. The individual details of each memory in the category leave only a minor trace in the memory [29]. When a memory is retrieved from the category, it is very likely to possess the details of the most recently stored memory.


The workings of the associative memory are usually explained by an energy abstraction. In this abstraction, the memories stored in the associative memory construct an energy landscape. The energy landscape has as many dimensions as the stored memories have attributes. This often means that the energy landscape has a high number of dimensions. The energy landscape in figure 6 only has two dimensions and thus the memories stored in the corresponding associative memory only have two attributes. Each learned memory creates a local minimum in the energy landscape. These local minimums are called attractors. In this concept, the input to the associative memory is a position in the energy landscape, where information is stored as a basin in the energy landscape. The retrieval of a memory can be seen as a search for a local minimum in the energy landscape. The starting point for this search is the input, which is similar to the memory that is going to be retrieved. Although this view can be quite elusive, since we are talking about a high dimensional space, it nonetheless gives an illustrative view of the way an associative memory works.



Figure 6    An illustration of the energy landscape that is produced by the associative memory. The basins, the lowest points in this energy landscape, are called attractors. The attractors could be said to constitute the memories in an associative memory. These types of networks are referred to as attractor networks.



There are several ways an associative memory can be constructed. The most common method is to use the Hopfield model to construct an associative memory. In this thesis I will use an advanced version of the Hopfield model, based on the laws of probabilities, to construct associative memories.



2.3.2       The Hopfield Model


The idea behind the Hopfield model is largely based on Donald Hebb’s well known work [30]: assume that we have a set of neurons, which are connected to each other through connection-weights (representing synapses). In the discrete Hopfield model, the neurons can either be active or non-active. When the neurons are stimulated with a pattern of activity, correlated activity causes connection-weights between them to grow, strengthening their connections. This makes it easier for neurons, which in the past have been associated, to activate each other. If the network is trained with a pattern and then presented with a partial pattern that fits the learned pattern, it will stimulate the remaining neurons of the pattern to become active, completing it. If two neurons are anti-correlated (one neuron is active while the other neuron is not) the connection-weights between them are weakened or become inhibitory. This form of learning is called Hebbian learning, and is one of the most commonly used non-supervised forms of learning in neural networks.    


The Hopfield network consists of a set of neurons and a corresponding set of unit delays, forming a multiple-loop feedback system [12]. If N is the number of neurons in the network, the number of feedback loops is equal to N² − N. The term −N represents the exclusion of self-feedback. Basically, the output of each neuron is fed back via a unit delay element to each of the other neurons in the network. Note that the neurons do not have self-feedback. The reason for this is that self-feedback would create a static network, which in turn means a non-functioning memory.


Each feedback loop in the Hopfield network is associated with a weight, wij (the weight between neuron i and neuron j). Since there are N² − N feedback loops, there are N² − N weights, where wij = wji. Imagine that we have P patterns, where each pattern, x^μ, is a vector containing the values 1 or −1. Then a weight matrix can be constructed in the following manner:
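The standard Hebbian (outer-product) prescription that matches the symbol definitions given in the text is shown here in LaTeX notation, with the pattern index written as \mu. It is a reconstruction under the assumption that the usual 1/N normalisation is intended.

```latex
w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} x_i^{\mu}\, x_j^{\mu}, \qquad i \neq j, \qquad w_{ii} = 0
```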




where μ is the index within a set of patterns, P is the number of patterns, and N is the number of units in a pattern (N is the size of the vectors in the set {x^μ}). The patterns represent activation of the neurons. The neurons can be in the states oi ∈ {−1, +1}.


To recall a pattern (of activation), oi, in this network we can use the following update rule:
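For the discrete model with states ±1, the usual form of this update is a sign (step) function applied to the weighted input. It is written here in LaTeX as a reconstruction that assumes a zero threshold.

```latex
o_i \leftarrow \operatorname{sgn}\Bigl( \sum_{j \neq i} w_{ij}\, o_j \Bigr)
```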



If the underlying network is recurrent, the process of recollection is iterative. This iterative process, in which the unstable and noisy memory becomes stable and clear, is called relaxation.
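Storage and relaxation in the discrete Hopfield model can be sketched as follows. The sketch assumes Hebbian outer-product weights with a zero diagonal, a zero threshold and sequential (one unit at a time) updating; the function names and the two eight-unit patterns are hypothetical.

```python
def train_hopfield(patterns):
    """Hebbian weights: w[i][j] = (1/N) * sum over patterns of x_i * x_j, zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-feedback
                    w[i][j] += x[i] * x[j] / n
    return w


def relax(w, state, max_sweeps=20):
    """Update units one at a time until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w[i][j] * state[j] for j in range(len(state)))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state


# Two hypothetical, mutually orthogonal patterns of +1/-1 values.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hopfield(stored)
probe = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with unit 0 flipped
recalled = relax(weights, probe)
```

With these two patterns the flipped unit is corrected in the first sweep and the second sweep confirms that the state no longer changes.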


Since the network has a symmetric weight matrix, wij, it is possible to define an energy function called a Lyapunov function [31]. The Lyapunov function is a finite-valued function that always decreases as the network changes states during relaxation. According to Lyapunov’s Theorem 1, the function will have a minimum somewhere in the energy landscape, which means the dynamics must end up in an attractor. The Lyapunov function for a pattern x is defined by:
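The energy function commonly used for the discrete Hopfield model, with symmetric weights and no threshold term, is written here in LaTeX as a reconstruction under those assumptions.

```latex
E(x) = -\frac{1}{2} \sum_{i} \sum_{j \neq i} w_{ij}\, x_i\, x_j
```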



The Hopfield model constitutes a very simple and appealing way to create an associative memory. However, the model suffers from a problem called catastrophic forgetting, which occurs when the Hopfield network is loaded with too many patterns. It can be said to occur when there are too many basins in the energy landscape. If the network is loaded with too many patterns, the errors in the recalled patterns become very severe. The storage capacity of the Hopfield network is approximately 0.14N patterns, where N is the number of neurons in the network [31].


The Hopfield model can also be made continuous. The model is then described by a system of non-linear first-order differential equations. These equations represent a trajectory in state space, which seeks out the minima of the energy (Lyapunov) function E and comes to an asymptotic stop at such fixed points (in analogy to the discrete Hopfield model presented where these fixed points are found instantaneously).



2.3.3       Extensions to the Hopfield Model


The standard correlation-based learning rule used in the Hopfield model suffers from catastrophic forgetting. To cope with this situation Nadal, Toulouse and Changeux [32] proposed a so-called marginalist learning paradigm where the acquisition intensity is tuned to the present level of cross-talk “noise” from other patterns. This makes the most recently learned pattern the most stable. New patterns are stored on top of older ones, which are gradually overwritten and become inaccessible, a so-called “palimpsest memory”. This system retains the capacity to learn at the price of forgetfulness.


Another smoothly forgetting learning scheme is learning within bounds, where the synaptic weights wij are bounded by −A ≤ wij ≤ A. A is the maximum value that the weights can attain. This learning scheme was proposed by Hopfield [12]. The learning rule for training patterns x^ν is



where ν is the pattern number and N is the number of nodes (neurons). c is a clipping function



The optimal capacity 0.05N is reached for A ≈ 0.4 [33]. For high values of A, catastrophic forgetting occurs; for low values the network remembers only the most recent pattern. This implies a decrease in storage capacity from the 0.14N of the standard Hopfield model. Total capacity has been sacrificed for long-term stability.
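Learning within bounds can be sketched as a Hebbian increment followed by clipping. The increment scaling 1/sqrt(N) used below is an assumption taken from the common formulation of this scheme, not from the text, and the function names and the four-unit example are hypothetical.

```python
import math


def clip(value, bound):
    """Clipping function c: keep the value inside [-bound, +bound]."""
    return max(-bound, min(bound, value))


def learn_within_bounds(weights, pattern, bound):
    """Add one Hebbian increment per connection, then clip to the bound A."""
    n = len(pattern)
    step = 1.0 / math.sqrt(n)  # assumed increment scaling
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = clip(weights[i][j] + step * pattern[i] * pattern[j], bound)
    return weights


n_units = 4
w = [[0.0] * n_units for _ in range(n_units)]
for _ in range(3):  # present the same +1/-1 pattern three times
    learn_within_bounds(w, [1, -1, 1, -1], bound=0.4)
```

Because the weights saturate at ±A, repeated presentations of one pattern cannot grow a weight without limit, and later patterns can always push a saturated weight back, which is what lets old memories fade gradually.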



2.4         Bayesian Attractor Networks


As previously discussed, there are several approaches to creating a memory in a neural network context. This thesis uses associative memories with palimpsest properties and a structure with hypercolumns based on a Bayesian attractor network with incremental learning. This memory model is used because it is a good model of the structures in the cerebral cortex and, at the same time, is comparably simple. The model also makes sense from a statistical viewpoint and avoids the need of threshold control, which is often necessary in recurrent networks.


The artificial neural network with hypercolumns and incremental Bayesian learning developed by Sandberg et al. [13] is a development of the original Bayesian artificial neural network model developed by Lansner et al. [34, 35], which was developed to be used with one-layer recurrent networks. Extensions to higher order networks with a hidden layer also exist [36].


The Bayesian learning method is a learning rule intended for units that add their inputs multiplied by weights and use that sum to determine, by using a non-linear function, their output (activation). This is similar to other algorithms for artificial neural networks. The weights in a Bayesian network are set in accordance with rules derived from Bayes’ expressions concerning conditional probabilities. This means that the activation of the units can be equated with the confidence in the feature it regenerates. The rule is local, i.e. it only uses data readily available at either end of a connection. The algorithm easily allows for an adjustment of the time span over which statistical data is collected. The time span is adjusted by a single variable, often called α. By regulating the value of α, and hence the time window for collecting statistical data, the memory span of the network is regulated, i.e. palimpsest properties are achieved.


The Bayesian learning rule can be extended to handle continuous valued attributes. This has been done by Holst and Lansner [36], using an extended network capable of handling graded inputs, i.e. probability distributions given as input, and mixed models. 


To deal with correlations between units that cause biases in the posterior probability estimates, hypercolumns were introduced [35]. A hypercolumn, named in analogy with cortical hypercolumns [37], is a module of units that represents all possible combinations of values of some primary features and hence provides an anti-correlated representation of the network input. The activation within a hypercolumn is normalized to sum to one.
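The normalisation within hypercolumns can be sketched as follows. The sketch assumes that the units of a hypercolumn are stored consecutively and that every hypercolumn has a positive total activity; the function name and the example values are hypothetical.

```python
def normalise_hypercolumns(activity, units_per_hypercolumn):
    """Rescale each consecutive block of units so that it sums to one."""
    normalised = []
    for start in range(0, len(activity), units_per_hypercolumn):
        block = activity[start:start + units_per_hypercolumn]
        total = sum(block)  # assumed to be positive
        normalised.extend(value / total for value in block)
    return normalised


# Six units in three hypercolumns of two units each.
raw_activity = [3.0, 1.0, 0.5, 0.5, 2.0, 6.0]
normalised_activity = normalise_hypercolumns(raw_activity, 2)
```

After normalisation the activity inside each hypercolumn can be read as a probability distribution over the mutually exclusive values that the hypercolumn represents.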



Figure 7    A small recurrent Bayesian neural network with the six neurons divided into three hypercolumns. Note that there are no recurrent connections within each hypercolumn. Instead, the activation within each hypercolumn is normalized. With some imagination it can also be seen how the weights wij form a matrix.  



I am now going to present a continuous, incremental Bayesian learning rule with palimpsest memory properties. The forgetfulness can conveniently be regulated by the time constant of the running averages used in the weight updating. This implies that it is easy to construct an STM or LTM memory with this learning rule.



2.4.1       The Bayesian Artificial Neural Network with Incremental Learning


Bayesian Confidence Propagation Neural Networks (BCPNN) are based on Hebbian learning and derived from Bayes theorem for conditional probabilities:



where m is an attribute value of a certain class x. The purpose of calculating the probabilities of the observed attributes for each class is to make as few classification errors as possible. The reason we want to use Bayes theorem is that it is often impossible to make a good estimate of P(x|m) directly from the training data set. On the other hand, a good estimate of P(m|x) is often possible to achieve. Next we will see how this can be implemented in a neural network context.


The input to the network is a binary vector, x. The vector x is composed of the smaller vectors x1, x2,…,xN. Each of these sections x1, x2,…,xN represents the input to a hypercolumn. This means that the input space, which represents all possible inputs to the network, can be written as X = X1, X2,…,XN. Each variable Xi can take on a set of Mi different values. This means xi will be composed of Mi binary component attributes xii′ (the i′-th possible state of the i-th attribute xi) with a normalised total probability




From the input, x, we want to estimate the probability of a class or set of attributes y. (The class y is the output of the network and the input, x, is seen as an attribute.) The vector y has the same structure as the vector x. If we condition on X (where unknown attributes retain their prior distributions) and assume the attributes xi, to be both independent, P(x) = P(x1)P(x2)…P(xN), and conditionally independent, P(x|y) = P(x1| y)P(x2| y)…P(xN| y), we get:







where oii′ = P(xii′|Xi).


Since y can just be regarded as another random variable, it can be included among the attributes xi, and there is no reason to distinguish the case of calculating yjj′ from that of calculating xii′. If X represents known or estimated information, we want to create a neural network which calculates P(y) from the given information. If we take the logarithm of the above formula we get:




Now, let the input X(t) to the network be viewed as a stochastic process X(t,·) in continuous time. Let Xii′(t) be component ii′ of X(t), the observed input. Then we can define Pii′(t) = P{Xii′(t) = 1} and Pii′jj′(t) = P{Xii′(t) = 1, Xjj′(t) = 1}. Equation (1) then becomes:




Given the information {X(t′), t′ < t} we now want to estimate Pii′(t) and Pii′jj′(t). This can be done by using the current unit activity oii′(n) at time n with the following two estimators, where τ is a suitable time constant.






The estimator in equation (3) estimates the probability per unit time that a single neuron becomes active. The estimator in equation (4) estimates the probability that two neurons are active simultaneously. Λ is the estimated probability per unit time, i.e. a rate estimate. This means that Λ is estimated from a subset of the events that have occurred, whereas P is estimated from all events that have occurred. The rate estimator explains the palimpsest property of the learning rule.
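The palimpsest effect of a rate estimator can be illustrated with a simple running average. The sketch below assumes an exponentially weighted average with time constant tau as a stand-in for estimators (3) and (4); the exact form used in the thesis may differ, and the function names and activity sequences are hypothetical.

```python
def update_rate(rate, activity, tau):
    """One step of a running average with time constant tau (assumed estimator form)."""
    return rate + (activity - rate) / tau


def estimate_rates(activity_i, activity_j, tau, initial=0.0):
    """Track both unit rates and the pairwise co-activation rate over a sequence."""
    rate_i = rate_j = rate_ij = initial
    for o_i, o_j in zip(activity_i, activity_j):
        rate_i = update_rate(rate_i, o_i, tau)
        rate_j = update_rate(rate_j, o_j, tau)
        rate_ij = update_rate(rate_ij, o_i * o_j, tau)
    return rate_i, rate_j, rate_ij


# Unit i was active early on, unit j is active now.
history_i = [1, 1, 1, 0, 0, 0]
history_j = [0, 0, 0, 1, 1, 1]
fast = estimate_rates(history_i, history_j, tau=2.0)    # short memory span
slow = estimate_rates(history_i, history_j, tau=20.0)   # long memory span
```

With the short time constant the estimate for the early unit has already dropped well below that of the recent unit, while with the long time constant the two estimates remain close; this is the sense in which the time window sets the memory span.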


These estimates can be combined into a connection-weight, which is updated over time. The bias can also easily be stated as






The base for the logarithms is irrelevant, but for performance reasons the natural logarithm is often the best choice. Logarithms with other bases are often derived from the computation of the natural logarithm. 


The usual equation for neural network activation is




where hj is the support value of unit j, bj is its bias, wij the weight from i to j and f(hi) the output of unit i calculated using the transfer function f. The output f(hi) equals oii¢ in equation (1) and (2). In the basic Hopfield model the activation function, f(), is, as we earlier saw, a step function.


The form in equation (2) is slightly more involved than (7), and has to be implemented as a pi-sigma neural network or approximated [35, 36]. The activation equation in the learning rule is:




Comparing terms in equations (8) and (2) we make the identifications









P(xii′|x) = oii′ = f(hii′) can be identified as the output of unit ii′, the probability that event xii′ has occurred or an inference that it has occurred. Since inferences are uncertain, it is reasonable to allow values between zero and one, which correspond to different levels of confidence in xii′.


Since the independence assumption is often only approximately fulfilled and we deal with approximations of probability, it is necessary to normalise the output within each hypercolumn:




The network is used with an encoding mode where the weights are set and a retrieval mode where inferences are made. Input to the network is introduced by setting the activations of the relevant units (representing known events or features). As the network is updated the activation spreads, creating a posterior inference of the likelihood of other features. 


As we discussed earlier, networks with update rules like equation (7) and symmetric weight matrices have an energy function that can be defined, and convergence to a fixed point is assured [33]. In this case this does not strictly apply, but for activation patterns leaving only one nonzero unit in each hypercolumn, it does apply. In practice it almost always converges, even though there is no input.


In the absence of any information, there is a risk of underflow in the calculations. Therefore we introduce a basic low rate λ0. In the absence of signals, Λii′(t) and Λjj′(t) now converge towards λ0 and Λii′jj′(t) towards λ0², producing wii′jj′(t) = 1 for large t (corresponding to uncoupled units). The smallest possible weight value, if the state variables are initialised to λ0 and λ0² respectively, is 4λ0², and the smallest possible bias is log(λ0). The upper bound on the weights becomes 1/λ0. This learning rule is hence a form of learning within bounds, although in practice the magnitude of the weights rarely comes close to the bounds.
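The resting state described above can be checked numerically. The sketch assumes that the weight is the pair rate divided by the product of the unit rates and that the bias is the logarithm of the unit rate; the starting values, the update rate and the helper name are hypothetical.

```python
import math


def relax_towards(rate, target, alpha, steps):
    """Discrete running average that pulls a rate estimate towards its target."""
    for _ in range(steps):
        rate += alpha * (target - rate)
    return rate


lambda0 = 0.001   # basic low rate, value used in the simulations
alpha = 0.5       # illustrative update rate

# With silent units the targets are lambda0 for unit rates and lambda0**2 for pair rates.
rate_i = relax_towards(0.3, lambda0, alpha, steps=200)
rate_j = relax_towards(0.2, lambda0, alpha, steps=200)
rate_ij = relax_towards(0.1, lambda0 ** 2, alpha, steps=200)

weight = rate_ij / (rate_i * rate_j)   # tends to 1, i.e. log(weight) = 0: uncoupled units
bias = math.log(rate_i)                # tends to log(lambda0)
```

Because the rates never fall below the basic low rate, both the division and the logarithm stay well defined even after long periods without input.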



2.4.2       Equations of the Incremental, Bayesian Learning Rule


The learning rule (equations (3)-(6)) of the preceding section can be used in an attractor network similar to the Hopfield model by combining them with an update rule similar to equations (8)-(12). The activities of the units can then be updated using a relaxation scheme (for example by sequentially changing the units with the largest discrepancies between their activity and their support from other units). One could also use a random or synchronous updating similar to an ordinary attractor neural network, moving it towards a more consistent state. This latter approach is used here. The continuous time version of the update and learning rule takes the following form (the discrete version of the equations used with Euler’s method are derived from the continuous version):
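A reconstruction of the continuous-time rule is sketched below in LaTeX, numbered to match the references to equations (13) to (18) elsewhere in the text. It assumes first-order dynamics with unit time constant \tau_0, learning rate \alpha and basic low rate \lambda_0, a softmax-style normalisation within each hypercolumn, and the weight and bias expressions built from the rate estimates; individual terms should be checked against [13].

```latex
\begin{align}
\tau_0 \frac{dh_{ii'}(t)}{dt} &= \beta_{ii'}(t) + \sum_{j}^{N} \log\Bigl( \sum_{j'}^{M_j} w_{ii'jj'}(t)\, o_{jj'}(t) \Bigr) - h_{ii'}(t) \tag{13} \\
o_{ii'}(t) &= \frac{e^{h_{ii'}(t)}}{\sum_{k} e^{h_{ik}(t)}} \tag{14} \\
\frac{d\Lambda_{ii'}(t)}{dt} &= \alpha \Bigl( \bigl[ (1 - \lambda_0)\, o_{ii'}(t) + \lambda_0 \bigr] - \Lambda_{ii'}(t) \Bigr) \tag{15} \\
\frac{d\Lambda_{ii'jj'}(t)}{dt} &= \alpha \Bigl( \bigl[ (1 - \lambda_0^2)\, o_{ii'}(t)\, o_{jj'}(t) + \lambda_0^2 \bigr] - \Lambda_{ii'jj'}(t) \Bigr) \tag{16} \\
\beta_{ii'}(t) &= \log \Lambda_{ii'}(t) \tag{17} \\
w_{ii'jj'}(t) &= \frac{\Lambda_{ii'jj'}(t)}{\Lambda_{ii'}(t)\, \Lambda_{jj'}(t)} \tag{18}
\end{align}
```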














where τ0 is the time constant of change in unit state. The variable α = 1/τ is the inverse of the learning time constant; it is a more convenient parameter than τ. By temporarily setting α to zero, the network activity can change with no corresponding weight changes, for example during retrieval mode.


The use of hypercolumns in the model presented implies that there will be no recurrent connections within a hypercolumn of the network. Recurrent connections within a hypercolumn are fully anti-correlated. The self-recurrent connection is fully correlated. Thus the weights connecting the neurons within a hypercolumn would either be set to their minimum or their maximum value. 


Each neuron in the network will have a bias that is derived from the basic set of recurrent connections. Connections projected from other populations of neurons will not add any bias to the receiving neurons, although it would make sense from a mathematical point of view to include the bias in the projection.



2.4.3       A Biological Interpretation of the Bayesian Attractor Network


Auto-associative memories based on artificial neural attractor networks, like for example early binary associative memories and the more recent Hopfield net, have been proposed as models for biological associative memory [12, 38]. They can be regarded as formalisations of Donald Hebb’s original ideas of synaptic plasticity and emerging cell assemblies. In this view each neuron in the artificial neural network is thought to equal a single nerve cell in the biological neural network. Figure 8 is an illustration of an artificial neuron. With some imagination it is possible to see the similarities with a nerve cell.




Figure 8    Depicted here is an artificial neuron, and its functions. Some parallels to a biological neuron are implied in the figure. Note that the output is conveyed to several other neurons. 



Each connection-weight, wij, in figure 8 can be interpreted as a synaptic connection between two neurons. In figure 9 one of these connections is depicted in more detail.



Figure 9    This figure shows a single synaptic connection between two neurons in the artificial neural network. The values of Λi (= Pi, Pj), Λij (= Pij), and wij, derived in equations (15), (16) and (18), can be interpreted as shown in the figure. Pj, Pi and Pij are values associated with the ability of synaptic terminals, synapses and dendrites to convey a signal from cell j to cell i.



Although the view presented above, in which each neuron corresponds to a nerve cell, is appealing, it is not realistic. Real neurons are not as versatile as our artificial neurons: a real neuron cannot impose both inhibition and excitation, as stated by Dale’s law [23]. A better view of the correspondence between artificial and real neurons is to think of each artificial neuron as corresponding to a cortical column of real neurons. In the Bayesian attractor neural network there is a structure of hypercolumns, where each hypercolumn corresponds to a group of cortical columns.



2.5         Interesting Concepts


There are a couple of interesting concepts or functions I hope to find in the simulations of the memory systems developed in the experiments.  These concepts originate from cognitive psychology.



2.5.1       Short-term Variable Binding


In a memory that is going to be implemented in a decision making system, there is a need not only to be able to recall earlier events, but also to be able to recall these events with current situation data. This type of process is usually called short-term variable binding (STVB) or role filling. To illustrate this concept I will give an example:


John is visiting his grandfather Sven. After he has visited his grandfather, John meets his two friends, Max and Sven. When Max talks about Sven with John, John knows that the Sven Max is talking about is not his grandfather. 


To achieve STVB in a system, the system will of course need an LTM, and also some sort of STM that can accommodate the temporary bindings. One of the main focuses of this thesis will be to investigate how STVB can be achieved.               



2.5.2       Chunking


The chunking process is a specialisation of the memory, which allows it to more effectively remember certain things. The chunking learning process recruits a new idea to represent each thought, and strengthens associations in both directions between the new chunk idea and its constituents. Thus, the inventory of ideas in the mind does not remain constant over time, but rather increases due to chunking. The representation of a chunk is constructed out of its constituents. An example of chunking is how the set { 1 2 3 } is remembered. The set can be remembered as 1, 2, and 3. However the chunked version of the set is remembered as the number 123.


There are two primary reasons for chunking: First, chunking helps us to overcome the limited attention span of thought by permitting us to represent thoughts of arbitrary complexity of constituent structure by a single (chunk) idea. Second, chunking permits us to have associations to and from a chunk idea that are different from the associations to and from its constituent ideas. This is very important for minimizing associative interference.


In this thesis I have studied how the representation of the short-term memories could be done more effectively. I have also studied how these efficient short-term representations associate to the long-term representations. Although the work of this thesis does not directly focus on the chunking process, I thought it was interesting to mention the similarities between the STM & LTM interaction and chunking.




3.0     Methods                                                              


Since the simulations in this thesis are based upon the Bayesian artificial neural network model developed in [13], I have tried to use similar settings and architectures. In all simulations, the neural networks were first trained on a set of patterns and then tested. This means some consideration must be taken before the artificial neural networks used in this thesis can be implemented in a real-time system.  


The Bayesian artificial neural network model was implemented in both Matlab and C code. All plots were made with Matlab. The experiments, either implemented in Matlab or C, were run on both Microsoft Windows 9x and SUN Solaris UNIX operating systems.



3.1         Design and Input


3.1.1       Network and Systems


The LTM was implemented as a recurrent network consisting of 100 neurons divided into 10 hypercolumns with 10 neurons in each hypercolumn. This LTM configuration was used throughout the thesis with no exceptions. As for the STM, there were a couple of different implementations with respect to the number of neurons and hypercolumns. The two most common implementations of the STM used 100 and 30 neurons, respectively. All the networks used had at least one set of recurrent connections. As mentioned earlier there were no recurrent connections within the hypercolumns of the networks, as the internal representation in a hypercolumn is supposed to be completely anti-correlated.
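The connectivity of the LTM can be sketched as a mask over the 100 × 100 weight matrix. The sketch assumes that the units of each hypercolumn are numbered consecutively; the function name is hypothetical.

```python
def intra_hypercolumn_mask(n_hypercolumns, units_per_hypercolumn):
    """mask[i][j] is True when units i and j belong to the same hypercolumn."""
    n_units = n_hypercolumns * units_per_hypercolumn
    return [
        [i // units_per_hypercolumn == j // units_per_hypercolumn for j in range(n_units)]
        for i in range(n_units)
    ]


mask = intra_hypercolumn_mask(n_hypercolumns=10, units_per_hypercolumn=10)
removed = sum(row.count(True) for row in mask)    # entries inside hypercolumns
kept = sum(row.count(False) for row in mask)      # entries between hypercolumns
```

For this configuration 1000 of the 10000 matrix entries fall inside a hypercolumn (including each unit's entry with itself) and 9000 remain as connections between hypercolumns.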


Almost all systems were constructed under the assumption that the input to the systems always passed the LTM before it entered the STM. The output from the system was always extracted through the LTM. When the systems are used and not only tested, all input/output is handled by the LTM. In a real-time system, the data presented to the STM will always be delayed. Since the simulations in this thesis were not run in real-time, there was no need to be concerned about this delay. The LTM and STM exerted a disruptive influence on each other during training and operation. In chapter 5 I investigated these interferences between networks.


A set of recurrent connections within a network and also a set of connections between networks are called projections. A projection does not only represent the physical connection, the concept also incorporates the connection-weights. Each connection between two neurons is equipped with two weights that represent the correlation between the neurons in both directions. Since the networks are auto-associative, the weights are equal in both directions. This does not apply to the connections between two networks, where hetero-associations may arise. In the models, a matrix represents the projections. The bias was not included in the projections.    



3.1.2       Input


The input to the artificial neural networks was vectors of binary numbers (0 and 1). These input vectors were constructed in respect to the hypercolumn structure of the LTM. This meant that only one out of ten neurons in a hypercolumn structure was activated, and this was always the case. So the entire input of ten hypercolumns only caused 10 out of 100 neurons to be activated. The input could therefore be considered sparse. The sparseness of the input affects the storage capacity of the network. If the input is too “dense”, it will affect the storage capacity negatively. In every run a new set of patterns was generated. The patterns were generated from a rectangular probability-density function.
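Generating such input patterns can be sketched as follows, reading "rectangular probability-density function" as a uniform choice of the active unit in each hypercolumn. The function name and the fixed seed are hypothetical choices made for the illustration.

```python
import random


def random_pattern(n_hypercolumns, units_per_hypercolumn, rng):
    """Binary vector with exactly one active unit, chosen uniformly, per hypercolumn."""
    pattern = []
    for _ in range(n_hypercolumns):
        block = [0] * units_per_hypercolumn
        block[rng.randrange(units_per_hypercolumn)] = 1
        pattern.extend(block)
    return pattern


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
training_set = [random_pattern(10, 10, rng) for _ in range(50)]
```

Every pattern produced this way has 10 active units out of 100, which is the sparse coding described above.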


In chapter 4 the input always consists of sets of 100 patterns. In chapter 5 the input always consists of sets of 50 patterns. Chapter 6 contains experiments with structured data, so there the input sets contain varying numbers of patterns. The LTM is affected by the size of the input set. This is explained more thoroughly in 3.2.



3.2         Network Operation                             


As mentioned earlier, the systems designed in this thesis were not operated in “real-time”. The systems had a training mode, where the memories were stored in the system. Then, during the operation mode, the memories were retrieved. The system design outlined in figure 10 was the most frequently used design. In biological memory systems the theta rhythm may control the switch between the training and operation mode of the memory network.



Figure 10  In all simulations, the artificial neural networks were first put in a Training mode and trained with a set of patterns.  When the training phase was completed, the networks were put in an Operation mode, and tested. Note that non-conducting static projections are not depicted in the figure (Training mode). 



The differential equations (13), (15) and (16) were solved using Euler’s method. The time-step was chosen to be h = 0.1, and the integrations lasted for 1 unit of time. (This meant that 10 steps were taken with Euler’s algorithm while integrating from 0 to 1.) It often took much longer than one time unit to train or retrieve a pattern properly. In the case of training, a strong memory of a single pattern was achieved through repeated presentations of the pattern. The relaxation process that occurs during the operation mode was almost never fully completed. (Fully completed means that no more changes in the neurons’ activity would occur if the relaxation process was extended.)
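The integration scheme can be illustrated on a single rate variable. The sketch assumes a first-order rate equation of the form dΛ/dt = α([(1 − λ0)·o + λ0] − Λ) with the unit held fully active during the presentation; the helper name and the example value of α are hypothetical.

```python
def euler_integrate(derivative, initial_value, step, duration):
    """Integrate dy/dt = derivative(y) with the explicit Euler method."""
    n_steps = round(duration / step)
    y = initial_value
    for _ in range(n_steps):
        y += step * derivative(y)
    return y, n_steps


alpha = 0.5       # example learning rate
lambda0 = 0.001   # basic low rate
activity = 1.0    # the unit is held fully active during the presentation


def rate_derivative(rate):
    # Assumed form of the rate equation: relaxation towards (1 - lambda0) * o + lambda0.
    return alpha * ((1.0 - lambda0) * activity + lambda0 - rate)


rate_after_one_unit, steps_taken = euler_integrate(
    rate_derivative, lambda0, step=0.1, duration=1.0
)
```

After the ten Euler steps the rate has covered only part of the distance to its target, which is consistent with the remark that one time unit is often too short to train a pattern properly.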



3.2.1       Training


During the training mode, equations (15) and (16) were solved for each network in the system. Equations (17) and (18) were then used to compute the bias and the projections for the networks. In the case of projections between two networks, the same equations were applied with the exception of equation (17). The bias was chosen not to be included in the projections between networks. A biological interpretation of this is that the entire dendritic tree of the synapse is given the same bias value. This means that synapses close to the soma are not given any priority. In a real neuron, synapses closer to the soma generate a stronger signal than synapses further out in the dendritic tree [24]. Mathematically it also makes sense to incorporate the bias values into the projection, even though this was not done here.


The three main parameters that controlled the network during training mode were the value of α, the number of patterns and the time spent training each pattern.



3.2.2       Testing


The main purpose of this thesis was not to maximise the retrieval performance of the networks but to prove that networks with different time-dynamics could be used in the same system. However, retrieval performance was of great importance for rating the different designs that were investigated.


To initiate the retrieval of patterns (memories), the networks were usually presented with a copy of the learned pattern containing two errors. (The content of two of the hypercolumns was altered.) In figures describing systems, this type of input is denoted “Input with errors”. In some experiments in chapters 5 and 6 the networks were presented with only a few of the hypercolumns of the learned patterns. The hypercolumns that were not presented to the network were filled with zeros.
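Both kinds of retrieval cue can be sketched as follows. The sketch assumes that "altering" a hypercolumn means moving its single active unit to another, randomly chosen unit in the same hypercolumn; the function names, the seed and the example pattern are hypothetical.

```python
import random


def corrupt_hypercolumns(pattern, units_per_hypercolumn, n_errors, rng):
    """Move the active unit of n_errors randomly chosen hypercolumns to another unit."""
    corrupted = list(pattern)
    n_hypercolumns = len(pattern) // units_per_hypercolumn
    for h in rng.sample(range(n_hypercolumns), n_errors):
        start = h * units_per_hypercolumn
        block = corrupted[start:start + units_per_hypercolumn]
        old_index = block.index(1)
        choices = [k for k in range(units_per_hypercolumn) if k != old_index]
        new_index = rng.choice(choices)
        corrupted[start + old_index] = 0
        corrupted[start + new_index] = 1
    return corrupted


def mask_hypercolumns(pattern, units_per_hypercolumn, hidden):
    """Fill the listed hypercolumns with zeros and leave the others unchanged."""
    masked = list(pattern)
    for h in hidden:
        start = h * units_per_hypercolumn
        masked[start:start + units_per_hypercolumn] = [0] * units_per_hypercolumn
    return masked


rng = random.Random(1)  # fixed seed only to make the sketch repeatable
original = [1 if k % 10 == 0 else 0 for k in range(100)]  # unit 0 active in every hypercolumn
cue_with_errors = corrupt_hypercolumns(original, 10, n_errors=2, rng=rng)
partial_cue = mask_hypercolumns(original, 10, hidden=range(5, 10))
```

The corrupted cue still has one active unit per hypercolumn, whereas the partial cue leaves the hidden hypercolumns silent.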


The plots over single networks were constructed from 50 runs. In the plots of several networks, each data point was often constructed from 20 runs. The data presented in the tables were accumulated from 100 runs of the networks.


During testing, the networks were put in operation mode. Equations (13) and (14) were used in order to perform the relaxation. The relaxation process was always one time unit long. When a network received a projection from another network, equation (13) was replaced with equation (19). Equation (19) introduces the constant gain factor g. The value of g varied around 1. The purpose of g was to provide a means of controlling the influence of projections between the networks in a system. The direction of the projection that g applied to was indicated by a subscript, e.g. gSTM→LTM (in this case the projection from the STM to the LTM is scaled with g).





Here w^s ii′jj′ denotes the connection-weights and o^s jj′ the activity pattern of the sending neurons. The variable Ns is the number of neurons in the sending network.


Successful retrieval was defined as the fraction of patterns that were correctly recalled after relaxation to a tolerance of 0.85 overlap. In the normal case where the input consisted of 10 hypercolumns, a recalled pattern was only allowed to differ in one hypercolumn from the original pattern to be classified as correct. The retrieval ratio of the system was often plotted as a continuous line. The retrieval ratio of the subsystems was often also plotted, i.e. the LTM (plotted as a dotted line) and the STM (plotted as a dash-dotted line).
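The retrieval criterion can be sketched as a per-hypercolumn comparison. The sketch assumes that a hypercolumn counts as matching when its most active unit coincides with the active unit of the target; this winner-take-all reading, the function names and the example patterns are assumptions made for the illustration.

```python
def hypercolumn_overlap(recalled, target, units_per_hypercolumn):
    """Fraction of hypercolumns whose most active unit matches the target's active unit."""
    n_hypercolumns = len(target) // units_per_hypercolumn
    matches = 0
    for h in range(n_hypercolumns):
        start = h * units_per_hypercolumn
        rec = recalled[start:start + units_per_hypercolumn]
        tgt = target[start:start + units_per_hypercolumn]
        # A silent hypercolumn never counts as a match.
        if max(rec) > 0 and rec.index(max(rec)) == tgt.index(max(tgt)):
            matches += 1
    return matches / n_hypercolumns


def is_correct(recalled, target, units_per_hypercolumn=10, tolerance=0.85):
    """True when the overlap reaches the tolerance."""
    return hypercolumn_overlap(recalled, target, units_per_hypercolumn) >= tolerance


target = [1 if k % 10 == 0 else 0 for k in range(100)]
one_wrong = list(target)
one_wrong[0], one_wrong[1] = 0, 1        # first hypercolumn altered
two_wrong = list(one_wrong)
two_wrong[10], two_wrong[13] = 0, 1      # second hypercolumn altered as well
```

With ten hypercolumns, one mismatch gives an overlap of 0.9 and passes the 0.85 tolerance, while two mismatches give 0.8 and fail, which matches the rule that only one hypercolumn may differ.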



3.3         Parameters


In the neural network model at hand there are several parameters that can be chosen more or less arbitrarily. As mentioned earlier, the choice of these parameters is consistent with [13]. In this section the default values of the parameters are listed. (These default values are common among many of the simulations.) It has been mentioned in the text when new constants are introduced or when the default values are altered.


In all simulations, the value 0.001 was used for λ0. The parameter λ0 can be seen as the background noise in the neurons. It also has implications for the maximum excitation that can be conveyed from one neuron to another.


The experiments in chapter 4 were run with 100 patterns. The capacity of an optimally trained LTM with 100 neurons divided into 10 hypercolumns is about 60 patterns. The LTM in the experiments in chapter 4 had α set to 0.0005. This low value of α implied that the LTM in chapter 4 could not form properly “deep” attractors after training for 1 unit of time. Together, these two conditions meant that the memories stored in the LTM had a small chance of correct retrieval. However, all trained memories left some sort of trace in the LTM. Chapter 4 investigates the possibility of using an STM to extract those memory traces.


Chapters 5 and 6 contain experiments where the systems were presented with 50 patterns and the LTM was run with α = 0.005. This setting of α allowed the LTM to learn all 50 patterns.


The STM networks in this thesis always had α set to 0.5. I tried to choose α so that the STM remembered the 10 most recent patterns presented to it. This means that the STM was not affected by whether the number of patterns presented to it was 50 or 100. In contrast to the LTM in chapter 4, the STM stored no memory traces of the first patterns in the training set.



[Figure 11: panels A1, B1 and C1 (top row); panels A2, B2 and C2 (bottom row)]


Figure 11  Three connection-weight / projection matrices. A1 is from an STM. B1 is from an LTM in chapters 5 and 6. C1 is from an LTM in chapter 4. The strength of the connection-weights is colour-coded in A1, B1 and C1 between the logarithmic values 0 and 5. The brighter a dot is, the stronger the connection. The diagonals of the matrices all have black squares, showing the absence of connections within a hypercolumn.  A2, B2 and C2 are the corresponding distributions of the connection-weights. The vertical line seen in A2, B2 and C2 represents the 1000 self-recurrent connection-weights that have been deleted (the deleted connection-weights were set to 1).



The projections in a network with a large value of α were set up differently from those in a network with a small value of α. In a net trained with a small value of α (LTM), I found that the inhibitory and excitatory weights formed two very distinct groups. In figures 11-B2 and 11-C2, one can see that the connection-weights are either inhibitory or excitatory, with few connection-weights taking values between the two groups. In a network trained with a large value of α (STM), by contrast, the values of the connection-weights were evenly distributed between inhibitory and excitatory. The value of α can, in this view, be said to regulate the speed at which the excitatory weights are returned to their λ0 value, their inhibitory state.


When I coupled an STM to an LTM, the STM had an interfering effect on the neurons in the LTM. This meant that the LTM had a smaller probability of relaxing to the correct pattern. To prevent this impairment of the LTM, I introduced a gain constant, g_STM→LTM, between the STM and the LTM (equation (19)). In the systems that were trained with 100 patterns the value of g_STM→LTM was set to 0.03, and in the systems trained with 50 patterns it was set to 0.1. These values were found by trial and error, and they seemed to give the STM a reasonable influence on the LTM.



4.0     Network Structures



Chapter 4 investigates the basic concepts of connected networks. The first part of the chapter studies the importance of keeping recurrent connections with different plasticity in different networks (having a separate STM and LTM). The second part examines how plastic connections are constructed and used.


The systems in the experiments in this chapter were trained with sets of 100 patterns. The LTM was trained with α = 0.0005. This choice of α meant that all patterns in the training set were remembered, but very poorly. The LTM had a poor retrieval-ratio of about 0.3.



4.1         Systems with High and Low Plasticity 


The information in a neural network is stored in the projections. Two systems were studied here. The two systems had approximately the same number of connections, but different numbers of neurons. The number of neurons, N, was equal to 100.


The first system, a network of N neurons, had two projections with a total of 2N² − 20N connections.



Figure 12 The networks A and B are connected with one-to-one connections. The connection-weights, wi, were usually set to a value around 10. 



The second system had two separate networks, with a total of 2N neurons and 2N² − 19N connections. The neurons of these two networks were connected with a one-to-one projection. This meant that all elements in the projection matrix, except the diagonal elements, were set to 1.
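
The two connection counts can be checked with a short calculation. The sketch assumes, in line with figure 11, that each network of N = 100 neurons is divided into 10 hypercolumns of 10 neurons and that connections within a hypercolumn are absent; the function and variable names are illustrative only.

```python
def recurrent_connections(n_neurons, hypercolumn_size):
    """Connections in one recurrent projection: the full N x N matrix
    minus the blocks that connect neurons within the same hypercolumn."""
    return n_neurons * n_neurons - n_neurons * hypercolumn_size


N = 100   # neurons per network
M = 10    # neurons per hypercolumn (assumed from the 10 x 10 layout)

# System 1: one network with two recurrent projections, 2N^2 - 20N.
single_network = 2 * recurrent_connections(N, M)

# System 2: two networks with one recurrent projection each,
# plus N one-to-one connections between them, 2N^2 - 19N.
two_networks = 2 * recurrent_connections(N, M) + N

print(single_network, two_networks)
```

For N = 100 this gives 18000 and 18100 connections, so the two systems differ by only the 100 one-to-one connections.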


The question was which of these two systems performed better. The systems used approximately the same number of connections. Since memories are stored in the connections, this comparison seemed justified. (Chapter 5 describes how the design of the second system is made more effective.)



4.1.1       One Network with Two Sets of Recurrent Connections


Naturally, a neuron takes much more space and uses many more resources than a connection between two neurons. It is therefore advantageous if the number of neurons in a network can be reduced at the expense of more connections. The system in this experiment used few neurons and a moderate number of connections.


Real synapses may possess both low and high plasticity properties. In this experiment, the two projections with different plasticity can be seen as forming a single projection that has both low and high plasticity properties.


The system was based on an LTM. An “STM” projection with high plasticity was added to the system’s existing “LTM” projection with low plasticity. The high plasticity projection used α = 0.5 and the low plasticity projection used α = 0.0005. The bias values were derived from training of the low plasticity projection, “LTM”. The projection with high plasticity was scaled down with g = 0.03. The value of g was chosen after evaluating the results of the experiment in section 4.2.1. The system was trained with equation (19) instead of equation (13).


Figure 13  The operation modes of the system. The bias values were derived from the projection with low plasticity. The system was constructed with two projections with different plasticity. Each of the projections was also treated as an individual network (LTM and STM).



The retrieval-ratio of the system is shown in figure 14. The system’s two projections, with high and low plasticity, were also used to create one separate LTM and one separate STM. In figure 14, the separate retrieval-ratio of the LTM is shown as a dotted line, and the retrieval-ratio of the STM is shown as a dash-dotted line. The following text refers to these two individualised memories (LTM and STM). The LTM and the STM were isolated to provide a comparison of the performance. Figure 14 shows that the system provides a compromise between the LTM and STM.


The retrieval-ratio of the first 90 patterns was slightly lower for the system than for the LTM. The retrieval-ratio of the last 10 patterns was lower for the system than for the STM. The system seemed to provide a compromise of the retrieval-ratio between the LTM and STM. Since the high and low plasticity projections in the system were interacting during the iterative process of relaxation, there was a problem with interference between the two projections. The STM interfered with the LTM during retrieval of the first 90 patterns. The STM did not have enough influence over the LTM to control the relaxation process completely during the last 10 patterns.


The system was able to retrieve patterns 85-90 with a slightly higher retrieval-ratio than the LTM or the STM. This implies that during the retrieval of these five patterns the LTM and STM were able to cooperate. This indicates that the basic idea of having several projections with different plasticities in a single system can be beneficial.


The compromise between a high retrieval-ratio of the first and the last patterns was controlled by the value of g. Adjustments of g could not improve the projections’ ability to cooperate. This suggested that the design with two projections and one population of neurons was not optimal.



Figure 14  During the retrieval of the five patterns between patterns 85 and 90 the LTM and STM, constituting the system, were able to cooperate. Note the increased retrieval-ratio of the system for the last 10 patterns.



4.1.2       Two Networks, Each with One Set of Recurrent Connections


This system had essentially the same two projections as the system in section 4.1.1. The main difference was that each of the projections in this system projected onto a separate group of neurons. The purpose of this experiment was to determine if it was beneficial to use two networks with different plasticity values.



Figure 15  The system had an STM and an LTM of equal size. 1-to-1 connections were used to connect the STM to the LTM. The input with errors was fed to both the LTM and STM. Output was extracted from the LTM.



The system was composed of an LTM and an STM of equal size. These two memories were connected with a 1-to-1 projection STM→LTM. The diagonal elements of the projection STM→LTM were set to 10. This value was derived from a trial and error process shown in figure 16. When the retrieval-ratio of the system was tested, both of the networks were fed with input.


The trial and error process to determine the value of the diagonal elements was performed with ten runs of the system. For each run, the diagonal elements were set to a different value. The result of these 10 runs is shown in figure 16. The diagonal elements could have been set to any value between approximately 7 and 500.


If the diagonal elements, or weights, had been set to 1, there would not have been a connection between the two memories. Figure 16 shows this at the point where the weights equal 1 (which is 0 on the logarithmic x-axis): there, the retrieval-ratio of the system became equal to that of the LTM. If the weights had been set to a value smaller than 1, there would have been an inhibitory effect on the neurons in the LTM. If the weights had been set to a value much larger than 500, the system would have shown good performance on the last learned patterns, but it would not have been able to recall the patterns learned at the beginning of the training set. This is caused by the strong input from the STM, which makes it impossible for the LTM to relax into a stable state.
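
The role of the value 1 can be illustrated with a small sketch. It assumes that the input from a connection enters the receiving neuron through the logarithm of the weighted activity, which is what the logarithmic weight scales of figures 11 and 16 suggest; this is a simplification for illustration and not a reproduction of equation (19). The function name is hypothetical.

```python
import math


def connection_contribution(weight, activity=1.0):
    """Assumed contribution of one active connection to the receiving neuron:
    the logarithm of the weighted activity."""
    return math.log(weight * activity)


for weight in (0.5, 1.0, 10.0, 500.0):
    print(weight, connection_contribution(weight))
```

Under this assumption a weight of 1 contributes nothing, a weight below 1 contributes a negative (inhibitory) term, and weights of 10 and 500 contribute increasingly strong excitatory terms.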



Figure 16  The plot shows 10 runs of the network, with different values of the connection-weights. Performance is measured separately for patterns 1-90 and patterns 91-100. The solid lines show the performance of the system.



In figure 17 the retrieval-ratio of the system is shown. During the first 80 patterns the retrieval-ratio of the system is equal to that of the LTM. Then, for patterns 80 to 90, the retrieval-ratio is better than both that of the LTM and STM. During the last 10 patterns the retrieval-ratio is equal to that of the STM.


It was interesting to see that the cooperation between the two memories (projections in the previous system in section 4.1.1) was functioning well. The disruptive influence of the STM on the LTM was almost negligible. The retrieval-ratio of the last 10 patterns was almost 1. A good retrieval-ratio of the most recent patterns is necessary when the STM is to be used as a working memory.


The STM has a strong influence on the LTM. The STM has the ability to both support and suppress memories in the LTM with great efficiency. This is an important feature since it provides a way to increase the importance of the last learned patterns. Later on in this thesis, these properties are used to generate useful functions in large memory-systems. Systems with STM are designed to prove the possibility of constructing a working memory. The STM also provides the possibility to make reinstatements (reactivation of previously learned memories) of the latest memories into the LTM.


The combination of these facts proved that the design with two individual networks with different plasticity was superior to the design of the system in section 4.1.1.


Figure 17  The system was based on two separate networks, an STM and an LTM of equal size. The STM and LTM were connected with one-to-one connections between the neurons of each network. 



4.2         Plastic Connections


The experiments presented in this section were designed to investigate how a system built upon two networks could be connected with plastic projections. The two networks, LTM and STM, were of equal size in all of the simulations performed. Different ideas of how to utilise the plastic projections were investigated.



4.2.1       Plastic Connections


There are many connections between the neurons in the cortex, especially between neurons that are close together. It seems very unlikely that these neurons are hardwired and unable to form new connections or delete old connections. In this experiment, the neurons of the two networks (STM & LTM) were allowed to form whatever connections they wanted. As with the recurrent connections, these connections can be made with different plasticities.   


The connections between the STM and LTM were plastic in this experiment. The projection STM→LTM matrix was no longer a diagonal of weights. Instead it was a full matrix of weights, representing all possible wirings between the neurons of the two networks. When the size of the STM differs from that of the LTM, or when the pattern representation in the STM differs from that in the LTM, there is a need for a plastic projection. The plastic projections were trained with the same Bayesian learning rule that was used to train the network’s recurrent projections. The recurrent projection STM→STM and the projection STM→LTM were trained with α = 0.5. The system’s training and operation modes are shown in figure 18. The projection STM→LTM was scaled down with g_STM→LTM = 0.03.



Figure 18  The system used in the experiments of this section. Note the added plastic projection from the STM to the LTM. The plastic projection is a full matrix (100x100) of weights.



To determine the value of g_STM→LTM a trial and error process was used. Figure 19 shows 10 runs of the system, with a different value of g_STM→LTM in each run. The gain g was set to the value 0.03, which corresponds to approximately -3.5 on the logarithmic scale of figure 19. If g is set to a smaller value, the retrieval-ratio of patterns 91-100 is decreased. If g is set to a larger value than 0.03, the retrieval-ratio of patterns 1-90 is decreased.
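
The correspondence between the gain and the logarithmic axis can be verified directly, using the natural logarithm as implied by the caption of figure 19. The variable names are illustrative only.

```python
import math

gain = 0.03

# Position of the chosen gain on a natural-logarithm axis.
log_gain = math.log(gain)

# Converting the quoted axis value back to a gain.
gain_from_axis = math.exp(-3.5)

print(round(log_gain, 2), round(gain_from_axis, 4))
```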



Figure 19  Ten runs of a system with a plastic projection. The system’s retrieval-ratio is plotted against the logarithmic value of g_STM→LTM. An optimum can be found around -3.5. (The exponential of -3.5 is approximately 0.03.) Compare this figure with figure 16.


Figure 20 shows the retrieval-ratio of this system. The performance is very similar to that of the system in section 4.1.2, where one-to-one connections were used. Comparing figure 20 with figure 17, one can see that the plastic projection STM→LTM interferes with the LTM more than the one-to-one projection STM→LTM did. If g_STM→LTM had been set to 1, this disruptive effect would have been very prominent. The disruptive effect that the STM exerts on the LTM depends on the number of elements that the projection STM→LTM contains.


The use of a plastic projection causes a small loss of retrieval-ratio performance, compared to the use of a 1-to-1 projection.  This performance loss is compensated by the versatility that the plastic projection provides. Plastic projections allow different representations of the same data in the system’s different networks. Later it will be shown that this can generate an increase of the system’s performance.



Figure 20  The performance for a system with a plastic connection between an STM and an LTM of equal size. Compare this figure with figure 17.              




4.2.2       Sparse Plastic Connections


Two groups of neurons that are far apart in the brain are usually very sparsely connected. This sounds reasonable since it minimizes the total wiring. It can easily be understood that all of the neurons in the brain cannot be connected to each other for volume reasons. The experiment I performed here was aimed at seeking out how the performance is affected when connections between the LTM and STM are deleted.


In the experiment, the projection that connected the STM to the LTM was made sparse. The sparse projection matrix was achieved through a random deletion of elements (deleted elements were set to 1) in the projection matrix after the projection had been trained. I made four runs of the system with a different value of g_STM→LTM in each run.
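
The deletion procedure can be sketched as follows. The sketch assumes that the trained projection is a 100 x 100 matrix stored as nested lists and uses random stand-in values instead of trained weights; the function name sparsify is hypothetical.

```python
import random


def sparsify(projection, fraction_deleted, rng):
    """Return a copy of the projection in which a random fraction of the
    elements has been deleted, i.e. set to 1 (no connection)."""
    rows, cols = len(projection), len(projection[0])
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    n_delete = round(fraction_deleted * len(positions))
    sparse = [row[:] for row in projection]
    for r, c in rng.sample(positions, n_delete):
        sparse[r][c] = 1.0
    return sparse


rng = random.Random(0)
# Stand-in for a trained projection (values chosen so that none equals 1).
trained = [[rng.uniform(2.0, 5.0) for _ in range(100)] for _ in range(100)]

sparse = sparsify(trained, 0.6, rng)
n_deleted = sum(value == 1.0 for row in sparse for value in row)
```

Because the deletion is applied after training, the remaining elements keep their trained values and only the density of the projection changes.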


The influence of the STM on the LTM was reduced when the number of connections was reduced. The influence was then made stronger through an increase of g_STM→LTM. The correlation between the sparseness of the projection matrix and the value of g_STM→LTM was of great interest. Figure 21 shows four plots with different values of g_STM→LTM.


The result of a system with g_STM→LTM = 0.03 is shown in figure 21A. The system’s long-term memory storage capacity was not compromised by the STM. About 40% of the elements in the projection STM→LTM could be deleted before the system’s performance was affected. When all of the elements of the projection STM→LTM had finally been deleted, the system’s performance was equal to the performance of the LTM alone.


In figure 21B, the value of g_STM→LTM was increased to 0.1. The increased value of g made it possible to eliminate 60% of all of the elements in the projection STM→LTM without any major loss in performance. The decrease in the performance for the last 10 patterns was steeper than in figures 17 and 19. Figure 21C shows a simulation with g_STM→LTM = 0.5.


Figure 21D shows the performance of the system where g_STM→LTM = 1. The STM suppresses the LTM very effectively. Almost all elements in the projection had to be removed before the suppressed retrieval-ratio of the LTM could rise.




[Figure 21: panels A and B (top row); panels C and D (bottom row)]


Figure 21  The performance for four different values of g_STM→LTM. The plot in the upper left has g = 0.03, the plot in the upper right has g = 0.1, the plot in the lower left has g = 0.5 and the plot in the lower right has g = 1. On the left hand side of each plot all of the connections between the networks are present, and on the right hand side all of the connections between the networks are removed.



It was very interesting to see that more than 60% of the connections could be deleted without any major loss in performance. This provides a hint that it would be possible to shrink the STM without any loss in performance. The most interesting feature of the experiment was that it clearly showed the need to have a scale-factor (g_STM→LTM) between the connected networks. When g_STM→LTM is set to 1 it is almost impossible to regulate the influence of the STM on the LTM with the density of elements (connections) in the projection. This is seen in figure 21D. The system in figure 21D acts either as an STM or as an LTM.



4.2.3     Differently Represented Patterns in LTM and STM  


An interesting question is what happens if the patterns are represented differently in the LTM and the STM? If the connections that provide the LTM with input are different from the connections that provide the STM with input, the representation of patterns may differ in the memories. In the experiment, I studied how the transformation of the patterns affected the performance of the system. Was it possible for the STM and LTM to cooperate with different representations of the memories?  



Figure 22  The system used two different sets of patterns. I studied how the STM could help the LTM to retrieve the correct patterns, although it had been trained with a different set of patterns.



In the experiment, I produced one set of patterns that was used to train the LTM. I produced another set of patterns that was used to train the STM. The hypercolumn structure of the input patterns existed in both sets of patterns. The projection STM→LTM was plastic. The constants of the system were set to the same values as in the previous experiments.


In the experiment, the representation of the data differed in the STM and LTM. This meant that a hetero-association had to occur between the patterns in the LTM and STM. This was done by the plastic projection STM→LTM. It was interesting to see that the performance of this system was slightly better than that of the system in 4.2.1, where we had the same representation of the patterns in both of the memories. The slightly better performance can be attributed to a more diverse and uncorrelated input.




Figure 23  The performance for a system with a plastic projection between the LTM and STM. Different representations of the data were used in the LTM and STM.



4.3         Summary


The basic question posed was whether two separate networks with different plasticity work better than a single network with two recurrent projections with different plasticity. It was concluded that separating the neurons into two networks with different plasticity was beneficial. The use of two networks allowed one of the networks to specialize in storing long-term memories and the other to specialize in storing short-term memories. This reduced the disruptive effects between the long- and short-term memories. It could also be established that two networks with different plasticities could be made to work together.


The concept with plastic projections between the LTM and STM was seen to work. It was also established that if an LTM and an STM were connected with plastic weights, the data could be represented differently in the two networks.


The constant g_STM→LTM was introduced to provide an instrument that could control the level of influence between networks with different plasticity and size. If g_STM→LTM was set to 1, the STM had a dominant influence on the LTM, and the LTM had problems retrieving old patterns that were not stored in the STM.




5.0     Properties of Connected Networks



The basic design criteria of systems comprising an LTM and an STM were developed in chapter 4. In this chapter I explain the further development of the system’s design. The goal of the experiments in this chapter is to provide an information base that can be used to design the systems of chapter 6, which incorporate an STM that functions as a working memory.



5.1         Systems with Reduced Size of the STM


The LTM stores all memories, but the most recently learned memories are not given precedence over older memories. The role of the STM is to give the latest learned memories such precedence. The STM can achieve this without having to store the entire memory. Remember that the STM in chapter 4 stored entire patterns. Instead of storing the patterns, the STM in this section holds “pointers” to the latest acquired memories. Each of these pointers in the STM points at a particular memory in the LTM. On retrieval of one of those particular memories the pointer becomes active and aids the retrieval of the memory. The compressed representation in the STM can be considered as a chunk representing an item in the LTM.


The experiments described in this section were designed to find out how a compressed STM could be constructed. Different representations in the STM were tried. An investigation of different sizes of the compressed STM was also performed. Note that when the system was in operating mode, the activity was first propagated from the LTM to the STM. Then the activity was propagated back into the LTM (figure 24). This was a big change from the systems in chapter 4, where the STM was directly fed with activity. 



Figure 24  The design outline of the system in sections 5.1.1-5.1.3. Note that during operation the activity in the LTM is propagated through plastic projection to the STM, then the activity is propagated back to the LTM. The STM consisted of 10-30 neurons.    



The systems in this section comprised an LTM and a smaller STM. The LTM and STM were connected in both directions with plastic projections. Each of the systems was designed with three different sizes of the STM: 10, 20 and 30 neurons.


The plastic projections were trained with α_projection = 0.5. The constant g was set to 1 in both directions. The systems were trained with 50 patterns. This implied that the LTM was able to learn all of the patterns.


The retrieval of the patterns was initiated by presenting the system with five hypercolumns of the patterns. The remaining five hypercolumns were left blank (the activity for all units was set to zero). 
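
The construction of such a retrieval cue can be sketched as follows. The sketch assumes that a pattern is a flat activity vector of 10 hypercolumns with 10 units each, and that the first five hypercolumns are the ones presented; the text does not state which five were used, so this choice and the function names are assumptions.

```python
def encode(winners, units_per_hypercolumn=10):
    """Build a flat activity vector with one active unit per hypercolumn."""
    pattern = []
    for winner in winners:
        column = [0.0] * units_per_hypercolumn
        column[winner] = 1.0
        pattern.extend(column)
    return pattern


def partial_cue(pattern, n_kept=5, units_per_hypercolumn=10):
    """Keep the first n_kept hypercolumns and blank the rest (all units zero)."""
    cut = n_kept * units_per_hypercolumn
    return pattern[:cut] + [0.0] * (len(pattern) - cut)


pattern = encode([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
cue = partial_cue(pattern)
```

The cue keeps the full length of 100 units, so it can be fed to the LTM in the same way as a complete pattern.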



5.1.1       STM as a Subset of the Hypercolumns in LTM


Here, the STM was constructed through a sub-sampling of the hypercolumns in the LTM. The STM with 10 neurons was constructed simply by copying the contents of the first hypercolumn of the LTM. The STM with 20 neurons was constructed out of the first two hypercolumns in the LTM. The STM with 30 neurons was constructed out of the first three hypercolumns in the LTM. This meant that just a few of the attributes (hypercolumns) of an object (memory) were accommodated by the STM. These few attributes were stored with the full depth of detail retained.
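
The construction of the STM input can be sketched as a simple slicing operation. The sketch assumes that a pattern is a flat activity vector in which each hypercolumn occupies 10 consecutive units; the function names are hypothetical.

```python
def encode(winners, units_per_hypercolumn=10):
    """Build a flat activity vector with one active unit per hypercolumn."""
    pattern = []
    for winner in winners:
        column = [0.0] * units_per_hypercolumn
        column[winner] = 1.0
        pattern.extend(column)
    return pattern


def stm_subset(ltm_pattern, n_hypercolumns, units_per_hypercolumn=10):
    """STM input: a copy of the first n_hypercolumns hypercolumns of the LTM input."""
    return ltm_pattern[: n_hypercolumns * units_per_hypercolumn]


ltm_pattern = encode([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
stm_10 = stm_subset(ltm_pattern, 1)   # STM with 10 neurons
stm_20 = stm_subset(ltm_pattern, 2)   # STM with 20 neurons
stm_30 = stm_subset(ltm_pattern, 3)   # STM with 30 neurons
```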


Figure 25  The outline of how the input patterns were constructed. The input to the 30 neurons of the STM was constructed out of the input to the first three hypercolumns of the LTM. When the STM was constructed with 10 or 20 neurons, 1 or 2 hypercolumns of the LTM were used.      



Figure 26 shows the system’s retrieval-ratio with three different sizes of the STM. The system with only 10 neurons in the STM  (dash-dotted line) allowed the LTM to recall old patterns and at the same time enabled the system to recall the latest learned patterns nearly perfectly.   


An STM with only 10 neurons has less influence on the LTM than an STM with 30 neurons. The STM consisting of 10 neurons generates a projection onto the LTM containing 10 × 100 = 1000 elements, while an STM consisting of 30 neurons generates a projection with 30 × 100 = 3000 elements. An STM consisting of 30 neurons generates a more distinct retrieval suggestion to the LTM than an STM consisting of 10 neurons. The influence of the STM can be adjusted by setting the constant g_STM→LTM to a value less than 1.


It was interesting to see that even a small STM was able to help the LTM to activate the 10 latest patterns correctly. This effect confirms the idea that the STM does not need to contain any information about the patterns, but instead can act as a pointer to the patterns stored in the LTM.   



Figure 26  Three systems with different sizes of the STM are shown in the plot. The retrieval-ratios of the systems were tested by presenting the systems with five out of ten hypercolumns.




5.1.2       STM as a Set of Sub-sampled Hypercolumns in LTM


In this experiment, the STM was constructed through a compression of each hypercolumn in the LTM. The STM contained the same number of hypercolumns as the LTM (10). Each hypercolumn in the STM comprised 1, 2 or 3 neurons. This meant that all of the attributes of an object were stored in the STM, but with less detail. This approach was the opposite of the approach taken in section 5.1.1.


When the STM consists of 10 neurons, the coding of the patterns in the LTM to the STM becomes trivial. Since the patterns stored in the LTM consist of 10 hypercolumns with at least one unit active in each hypercolumn, the resulting STM patterns will have 10 out of 10 units active.
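
The compression, including the trivial case, can be illustrated with a sketch. The exact mapping from LTM units to STM units is given by figure 28 and is not reproduced here; the sketch assumes that contiguous groups of LTM units within a hypercolumn share one STM unit, and it represents a pattern by the index of the active unit in each hypercolumn. The function names are hypothetical.

```python
def compress_winner(winner, stm_units, ltm_units=10):
    """Map the active LTM unit of a hypercolumn to an STM unit
    (assumed mapping: contiguous groups of LTM units share one STM unit)."""
    return winner * stm_units // ltm_units


def compress_pattern(winners, stm_units, ltm_units=10):
    """Compress every hypercolumn of an LTM pattern."""
    return [compress_winner(w, stm_units, ltm_units) for w in winners]


ltm_winners = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]

stm_3 = compress_pattern(ltm_winners, 3)   # STM with 30 neurons
stm_1 = compress_pattern(ltm_winners, 1)   # STM with 10 neurons (trivial coding)
```

With one unit per hypercolumn every LTM pattern maps to the same STM pattern, in which all 10 units are active. With three units per hypercolumn, different LTM patterns can still collide, which is why the STM patterns are highly correlated.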


Figure 27  The outline for how the input patterns were constructed. Data within each hypercolumn was compressed. All of the 10 hypercolumns of the LTM are represented in the STM with 1, 2 or 3 neurons.



The patterns stored in the STM were highly correlated since each hypercolumn only had 1, 2 or 3 different attribute values. Instinctively, this leads one to believe that the system should have a poor performance, especially when the STM is composed of 10 neurons. The STM that consisted of 10 neurons was not able to hold any information, and, therefore, one would expect the corresponding system to behave as a single LTM.




Figure 28  The figure shows how the hypercolumns in the LTM were transformed to the hypercolumns of the STM. The left figure corresponds to the case where the STM consisted of 30 neurons. The figure to the right corresponds to a system with an STM of 10 neurons. Note that the right figure portrays the trivial coding.



The system built with an STM that consisted of 10 neurons proved to perform better than a single LTM (figure 29). The retrieval-ratio of this system and of the systems in section 5.1.1 was very similar. This can be explained by the fact that much of the information is stored in the plastic projection between the STM and LTM. The information stored in the STM seems to be of less importance.



Figure 29  The performance of a system, where each hypercolumn of the LTM was compressed from 10 neurons to 3 neurons in the STM. Note that even the trivial case, where the STM consists of 10 neurons that are all active, can hold information in the projection.



5.1.3       STM is a Subset of Sub-sampled Hypercolumns


It has been suggested (A. Lansner personal communication) that the relationship between the number of hypercolumns and the number of neurons in a network, for maximal capacity, should be



where H is the number of hypercolumns and N is the number of neurons. In this section the compression of the LTM was achieved through a compromise of sub-sampling and adopting a subset of the hypercolumns. The number of hypercolumns in the STM was chosen to follow the hypothesised relation.


The STM with 30 neurons had 6 hypercolumns, the STM with 20 neurons had 4 hypercolumns and the STM with 10 neurons had 2 hypercolumns. Each hypercolumn in the STM consisted of 5 neurons. 


Figure 30  Shown here is the outline for how the input patterns were constructed when the STM was made of 30 neurons. When the STM was constructed with 20 neurons it had 4 hypercolumns, and when it was constructed with 10 neurons it had 2 hypercolumns.        




Figure 31  The hypercolumns of the LTM were compressed to half their size in the STM. This applied to all STMs, independently of the number of neurons.



Figure 32 shows that this design approach provides good performance. The STM constructed in this manner holds more information than the STMs in the two previous designs. More information means a more distinctive pointer. It was interesting to see that the size of the STM did not make any significant difference to the performance of the system.


Comparing the results of the experiments in 5.1, it is obvious that it is a good strategy to use an STM that has a sub-sampled representation.



Figure 32  The performance for a system where the STM is a subset of sub-sampled hypercolumns of the LTM. Note that the retrieval-ratio of the last five patterns is very similar for all of the three systems and that it does not seem to depend on the size of the STM.



5.2         Interfering Effects


When designing a system there are often several parameters that can be adjusted. This section investigates how some of the most important parameters affect the system.



5.2.1       Effects of LTM on STM


In these two experiments, the focus was on how the plastic projection from the LTM to the STM affected the performance of the entire system. As in section 5.1 the activity of the LTM was propagated to the STM and then back to the LTM. The phenomenon of interest in these two experiments was the self-induced interference generated by the LTM. In the two following experiments a plastic projection was used from the LTM to the STM. One of the experiments used a plastic projection with a high plasticity, and the other experiment used a plastic projection with low plasticity.


The STM and LTM were of equal size, 100 neurons. The STM had α = 0.5. The system had a one-to-one projection from the STM to the LTM. The diagonal elements of the projection were set to the value 10. The choice of the value 10 caused a small impairment of the LTM’s ability to recall old patterns.


There was a plastic projection from the LTM to the STM. In the case of a low plasticity projection, α was set to 0.005 and in the case of a high plasticity projection, α was set to 0.5. The plastic projection LTM→STM was scaled with γLTM→STM. Note that in this experiment, the constant γLTM→STM applied to the projection from the LTM to the STM. The value of γLTM→STM was varied between 0.025 and 400.
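The range of the scaling constant, 0.025 to 400, covers more than four orders of magnitude, so it is natural to sample it on a logarithmic axis. The sketch below generates such a sweep. The number of steps and the even spacing in the logarithm are assumptions made for illustration; the exact sampling used in the experiments is not stated.

```python
import math


def log_spaced_gains(low=0.025, high=400.0, steps=20):
    """Return `steps` gain values spaced evenly in ln(gain) between low and high.

    The number of steps is a hypothetical choice.
    """
    lo, hi = math.log(low), math.log(high)
    return [math.exp(lo + i * (hi - lo) / (steps - 1)) for i in range(steps)]


gains = log_spaced_gains()
print(round(math.log(0.025), 2), round(math.log(400.0), 2))  # -3.69 5.99
print(round(math.exp(3), 1))                                 # 20.1
```

On this axis the sweep runs from about -3.7 to 6.0 in natural-log units, and a value of 3 on the log axis corresponds to a scaling constant of about 20.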


The system was trained with 50 patterns and presented with noisy patterns to test the retrieval-ratio. 



Figure 33  The system was used to test the effects of the projection from the LTM to the STM. The projection from the STM to the LTM was one-to-one and the diagonal elements were set to the value 10. From the LTM to the STM there was a plastic projection.



In figure 34 and figure 35 the retrieval-ratio for the two systems is shown. The value of γLTM→STM did not seem to affect the system as long as it was small. When the value of γLTM→STM exceeded exp(3) ≈ 20, a steep fall in performance was seen for both systems. Most likely this performance drop can be attributed to too much excitation of the STM. Up to this point, γLTM→STM = exp(3) ≈ 20, the projection STM→STM had been able to suppress the activity imposed by the projection from the LTM.


The system with a high plasticity projection LTM→STM had a constant performance for the last 10 patterns, independent of the value of γLTM→STM, while the performance for the first 40 patterns slowly deteriorated as the value of γLTM→STM increased. When γLTM→STM exceeded 20, the performance of the system dropped drastically. That the performance for the last 10 patterns remained constant was logical, since a projection with high plasticity was used: as the influence of the LTM on the STM increased, the memories in the STM were reinforced.



Figure 34  The system with a high plasticity projection from the LTM to the STM.



The system with low plasticity projections (figure 35) had a slowly deteriorating performance for the last 10 patterns as the value of γLTM→STM increased, that is, as the influence of the LTM on the STM increased. The performance for the first 40 patterns was independent of the value of γLTM→STM. When γLTM→STM exceeded 20, the performance dropped drastically, as in the other system. The decrease for the last 10 patterns was expected, since the system had a low plasticity projection, which is not good at storing the most recently learned patterns.



Figure 35  A system with a low plasticity projection from the LTM to the STM.




5.2.2       Effects of STM on LTM


The projection in the direction from the STM to the LTM is more important than the reciprocal projection since the state of the LTM is equal to the system’s output. In this section I looked at the effects of the projection from the STM to the LTM. First a system with only a one-to-one projection between the STM and LTM was studied. Then a system composed of an LTM and an STM of equal size with a plastic projection was studied. Finally a system with a compressed STM and a plastic projection was studied. All of the LTMs had α = 0.005 and all of the STMs had α = 0.5. The systems were all trained with 50 patterns.
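The consequence of the two plasticity values, 0.005 for the LTMs and 0.5 for the STMs, can be illustrated with a deliberately simplified model. The sketch assumes that every newly learned pattern replaces a fixed fraction of the stored estimate, as in an exponential moving average. This is only a stand-in for the incremental Bayesian learning rule, and the function name is hypothetical.

```python
def remaining_trace(alpha, patterns_since):
    """Fraction of a memory trace left after `patterns_since` later patterns.

    Simplifying assumption: each new pattern replaces a fraction `alpha` of
    the stored estimate (exponential moving average), so the older trace is
    multiplied by (1 - alpha) for every pattern learned afterwards.
    """
    return (1.0 - alpha) ** patterns_since


for name, alpha in (("LTM", 0.005), ("STM", 0.5)):
    print(name, [round(remaining_trace(alpha, n), 3) for n in (1, 10, 49)])
```

Under this assumption the low plasticity network still holds most of the trace of the first pattern after all 50 patterns have been presented, whereas in the high plasticity network a trace has almost vanished after about 10 later patterns.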


The first system was constructed as shown in figure 15. The system had a fixed 1-to-1 projection from the STM to the LTM. The diagonal elements of the 1-to-1-projection matrix were set to 10. When the system was operated, noisy input was fed directly to both the LTM and the STM.
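A fixed 1-to-1 projection of this kind is simply a diagonal matrix. The sketch below builds one and applies it to a toy STM activity vector. The helper names are hypothetical, and the way the projected input is combined with the LTM's own recurrent input is not modelled here.

```python
def one_to_one_projection(size, weight=10.0):
    """Build a size x size matrix with `weight` on the diagonal and zeros elsewhere."""
    return [[weight if i == j else 0.0 for j in range(size)] for i in range(size)]


def project(matrix, activity):
    """Input delivered to each target unit: a plain matrix-vector product."""
    return [sum(w * a for w, a in zip(row, activity)) for row in matrix]


stm_activity = [0.0, 1.0, 0.0, 0.0]          # toy STM with 4 units, unit 1 active
print(project(one_to_one_projection(4), stm_activity))   # [0.0, 10.0, 0.0, 0.0]
```

Each STM unit drives only its counterpart in the LTM, so lowering the diagonal weight scales the STM's influence uniformly without changing which LTM units are addressed.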


Figure 36 shows the performance of the system with a 1-to-1 projection STM→LTM. When the value of the elements (weights) in the projection was smaller than exp(-2) ≈ 0.1, a decrease in the performance of the most recently learned patterns was seen. This decrease seemed to be linear with respect to the logarithm of the weights. It was also interesting to see that the retrieval-ratio of the older patterns was not affected when the weights were smaller than 0.1. A comparison between figures 36 and 37 shows that a 1-to-1 projection does not interfere with the LTM as much as a plastic projection does.



Figure 36  A system with a 1-to-1 projection from the STM to the LTM. The system was tested with different values of the weights in the 1-to-1 projection.



The second system was constructed with an STM and an LTM of equal size. The system design is shown in figure 18. The projection from the STM to the LTM was plastic. The plastic projection was trained with α = 0.5. The system was trained with 50 patterns. The retrieval-ratio was tested with noisy patterns that were fed to both the LTM and the STM.


The performance of the second system is shown in figure 37. The retrieval-ratio for the last 10 patterns falls sharply when the value of γSTM→LTM exceeds exp(3) ≈ 20. The retrieval-ratio for the first 40 patterns starts to fall when the value of γSTM→LTM exceeds exp(-3) ≈ 0.05.



Figure 37  A system with equal size of the LTM and STM. The plastic projection from the STM to the LTM was trained with α set to 0.5.



The third and last experiment was made with a system similar to the one in figure 18. The difference was that this system had a reduced size of the STM. When operated, the system was fed with input directly to both the LTM and STM. The input to the LTM had errors, while the input to the STM had no errors. The STM was made of 30 neurons. The input to the STM was the same as the input to the first three hypercolumns of the LTM.





Figure 38  In this experiment the system had a smaller STM than in the previous experiment.



The retrieval-ratio of the first 40 patterns is similar to that of the previous experiment. The retrieval-ratio of the last 10 patterns starts to drop earlier than in the previous experiment. The shallower drop of the retrieval-ratio of the last 10 patterns in this experiment, compared with the previous experiment, can be attributed to the reduced influence of the compressed STM.


5.3         LTM Helped by STM on Retrieval


How much information is needed to retrieve a pattern in the LTM, and how does an STM affect the retrieval? The experiments in this section address these two questions, which become very relevant when one designs a system where the patterns are divided into individual modules, as in chapter 6.


The same system as in 5.1.1 was used. The STM was composed of 30 neurons divided into three hypercolumns. The first three hypercolumns of the patterns were stored in the STM. The system is shown in figure 24. The system was run with four different values of γSTM→LTM. The constant γSTM→LTM scaled the projection from the STM to the LTM. The pattern retrieval was initiated with 1 to 5 of the hypercolumns constituting the learned patterns.
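The construction of such a retrieval cue can be sketched as follows. A pattern is represented as one active unit index per hypercolumn, and hypercolumns that receive no input are marked with None. The representation, the function name and the choice of always presenting the leading hypercolumns are assumptions made for illustration.

```python
def partial_cue(pattern, n_hypercolumns):
    """Keep the first `n_hypercolumns` entries of a pattern and blank the rest.

    None marks a hypercolumn that receives no input (illustrative convention).
    """
    return [unit if h < n_hypercolumns else None for h, unit in enumerate(pattern)]


pattern = [3, 7, 1, 0, 9, 4, 2, 8, 5, 6]     # 10 hypercolumns, one active unit each
for k in range(1, 6):
    print(k, partial_cue(pattern, k))
```

Retrieval then amounts to letting the network relax from the cue and comparing the resulting LTM state with the stored pattern.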


In figure 39A one can see how the STM suppresses the LTM. The STM has a positive effect on the retrieval of the most recently learned patterns. Even though only the information of one hypercolumn is presented to the system, it can retrieve the correct pattern.


Figure 39B and figure 39C show the retrieval-ratio of two systems with a value of γSTM→LTM between those used in the systems shown in figure 39A and figure 39D.


In figure 39D the retrieval-ratio of a system that does not have any STM is shown. If the system is only presented with one hypercolumn, the retrieval-ratio becomes very low. The system needs to be presented with at least four hypercolumns before the retrieval-ratio becomes good.




Figure 39  Illustration of how the STM affects retrieval of memories (four plots, A-D). The system in plot A had γSTM→LTM = 1, plot B had γSTM→LTM = 0.5, plot C had γSTM→LTM = 0.1 and in plot D the system had no STM.



5.4         STM Ability to Suppress Old Information in the LTM


This experiment was designed to verify that a system composed of an LTM and an STM puts the most significance on the latest learned patterns. This means that if two patterns are very similar, and the system is asked to retrieve one of these patterns, it should retrieve the most recently learned pattern. This also connects to the concept of STVB (Short Term Variable Binding).


The system used was identical to the system in section 5.1.1. The patterns used as input to the system consisted of two parts, called “Tag” and “Content”. The input to the STM was the part of the pattern called “Tag”. The “Tag” can be seen to represent a variable, while the “Content” represents the content of the variable.



Figure 40  A pattern, with the parts tag and content defined. The tag is represented by the first three hypercolumns, and the content is represented by hypercolumns 4 to 10.   



The system was trained with 50 patterns. The first pattern presented to the system, pattern A, was repeated a number of times. The 48th pattern was called pattern B. Patterns A and B were very similar; their first six hypercolumns were identical.


When all of the 50 patterns had been presented and learnt by the system, the first six hypercolumns of pattern A (or B; the first six hypercolumns of patterns A and B were identical) were presented to the system. The system now had the choice of converging to pattern A or B. The results are shown in figure 41. The system was run without the STM and the results of this run are shown in figure 42. It may look strange that the retrieval-ratio is sometimes larger than one. The cause of this odd characteristic is that sometimes the last 4 hypercolumns of the patterns A and B are similar enough to cause both patterns, A and B, to be collapsed into one single pattern.  
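The relation between patterns A and B can be sketched in a few lines. The sketch assumes that the last four hypercolumns of pattern B are drawn independently at random, which is an assumption made for illustration, and the function names are hypothetical.

```python
import random


def make_similar_pattern(pattern_a, shared=6, units=10, rng=random):
    """Copy the first `shared` hypercolumns of pattern_a and redraw the rest at random."""
    tail = [rng.randrange(units) for _ in pattern_a[shared:]]
    return pattern_a[:shared] + tail


def overlap(p, q):
    """Number of hypercolumns in which two patterns have the same active unit."""
    return sum(1 for a, b in zip(p, q) if a == b)


rng = random.Random(0)
pattern_a = [rng.randrange(10) for _ in range(10)]
pattern_b = make_similar_pattern(pattern_a, rng=rng)
print(overlap(pattern_a, pattern_b))   # at least 6; more if the random tails coincide
```

Because the tails are drawn at random, they occasionally agree in one or more hypercolumns, which makes the two patterns even harder to separate. This is the kind of coincidence referred to above as the likely cause of the retrieval-ratios larger than one.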


The system in this experiment performs an STVB task. The example of STVB given in section 2.5.1 described how John knew if he was talking to his grandfather, named Sven, or his friend, also named Sven. To manage this task, John had to know which Sven he had most recently met. The system in this experiment is presented with a similar task. Referring to the example given in section 2.5.1, the name of a person is in this experiment represented by the “Tag” and a physical person is represented by the “Content”. Pattern A can be seen to represent John’s grandfather, named Sven, and pattern B to represent John’s friend, also named Sven. The repeated training of pattern A (representing the grandfather) can be seen as many previous conversations with the grandfather. The problem the system now faces is that even if it has had many dialogs with the grandfather, as soon as it starts to talk to the friend, it must know directly that it is not talking to the grandfather anymore. To manage this task, the system needs to swiftly change its references. The STM plays a crucial role in the swift change of references.


In the first experiment where the STM was enabled, the system almost only retrieved the latest learned pattern, pattern B (figure 41).



Figure 41  This histogram shows the system’s retrieval-ratio of the latest learned pattern, B. The retrieval-ratio of pattern B was tested after different amounts of training with pattern A. Pattern A and B had the same “Tag”. The figure shows that even if pattern A had been trained 20 times, the system retrieved the most recent pattern, B.



When the system’s STM was disabled (figure 42), the system only retrieved pattern B as long as pattern A had not been trained extensively. Once pattern A had been trained 3-4 times, the system almost never retrieved pattern B.



Figure 42  The performance of the system when the STM was disabled. When pattern A has been repeated more than two times it is almost impossible for the system to retrieve pattern B.



This experiment shows that an STM can have a great impact on the system’s behaviour. With the STM, the system could easily change the binding from the name variable to a new content. Without the use of an STM, the system could not perform this task well.




5.5         Summary


The fundamental concept that the STM does not need to contain the complete patterns, which are stored in the LTM, was tried. Three different approaches were taken to the STM’s design. It was seen that the approaches that generated a sparse representation in the STM were generally good. It was further concluded that the STM, together with the projection from the STM to the LTM, holds the distinctive memory traces (pointers). The STM was able to hold about 10 distinct memory traces. The information of particular patterns was stored in the projection between the STM and LTM.


An investigation of how the LTM and STM interacted with each other was conducted. It was concluded that the interfering effect of the LTM on the STM was minimal. But if γSTM→LTM was made bigger than 20, a drastic drop in the retrieval-ratio occurred. How the STM interfered with the LTM was also studied. If the two networks were connected with a 1-to-1 projection, the interference was minimal. The disruptive effect became much larger when plastic projections were used. It was established that a smaller STM interfered less with the LTM than a large STM.


An experiment was performed to investigate how much data a system of an LTM and an STM needed to retrieve a memory (pattern). I concluded that recently learned memories could be retrieved after presenting the system with two hypercolumns of the particular memory. To retrieve an older memory, the system needed to be presented with information in more hypercolumns.


Finally I saw that a memory system with both an LTM and STM always gives precedence to the most recently learned memories.    





6.0     STM Used as Working Memory



This chapter concerns how an STM can be implemented as a working memory. Two different systems were studied. The system in section 6.1 was based on a single STM and a single LTM. The system in 6.2 was constructed with modules that were built of one LTM and one STM. These two systems were tested on the task presented in figure 43. 


To demonstrate the need for both an STM and an LTM, I have designed the following test. There are four different places, Place 1-4. In each place a box can be placed. There are four different boxes, Box 1-4. Each box has a certain content. Since there are four boxes, there are four different contents, Content 1-4, one for each box. The task was to keep track of the boxes as they moved around to different places. The working memory is supposed to hold the information that Box 1 and Box 2 have switched places. The long-term memory holds the information about what content each box has. This means that the system needs both a long-term memory and a working memory to be able to perform the task.


Figure 43  Each of the Boxes contains a previously learned and unique Content. The system is first presented with Situation 1, then Situation 2. After these two situations have been presented, the system is asked to retrieve the Place where “Content 1” is stored. The 4 Places are supposed to be well known. The 4 different Boxes and their individual Content are also supposed to be well known. This task is presented to the two systems in sections 6.1 and 6.2.



6.1         System Based on LTM and STM


The purpose of this experiment was to show that an STM could function as a working memory. The system was built with one LTM and one STM. In section 6.2, a modified version of this system was used as a module in a larger system. 


The system used was almost identical to the system in section 5.1.1 with 30 neurons in the STM. There were two differences. The first difference was that the input to the STM was taken from hypercolumns 4-6 of the patterns, instead of the first three hypercolumns. This difference is negligible since a memory in the compressed STM acts as a “pointer”. The only reason why hypercolumns 4-6 were used instead of hypercolumns 1-3 was to prevent underflow of data in the STM. The second difference was that only the first six hypercolumns of the LTM were connected to the STM.


The patterns were composed of three different parts as seen in figure 44. There were four different Places. There were also four different Boxes. Each Box had a particular Content specific to that particular Box.



Figure 44  The input to the system had the structure outlined here. Each Box had a certain Content. Each box was placed in a certain Place. There were four places and four boxes with their individual contents.



The system was trained 2 times with all possible combinations of Boxes, and their contents, in different Places (16 different combinations). The purpose of this training was to teach the system each of the four “Box-Content” constellations. After these 2 × 16 = 32 patterns had been presented to the system, the system was presented with 10 patterns that contained noise. The last six patterns were more intricate. Patterns 43-46 corresponded to situation 1 in figure 43. Patterns 47-48 corresponded to situation 2 in figure 43 (only the novelties in the new situation were learned). All of the patterns, 1-48, are documented in table 1.
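The composition of the training set can be checked with a short enumeration. In the sketch, each Content is labelled with the same index as its Box, and the order of the combinations within a block is arbitrary; both are assumptions made for illustration.

```python
from itertools import product

boxes = places = (1, 2, 3, 4)

# Every Box (with its fixed Content) in every Place: 4 x 4 = 16 combinations,
# stored as (Place, Box, Content) with Content labelled by the Box index.
combinations = [(place, box, box) for place, box in product(places, boxes)]
training_block = combinations * 2        # presented twice -> patterns 1-32

n_noise = 10                             # patterns 33-42
n_situation_1, n_situation_2 = 4, 2      # patterns 43-46 and 47-48
total = len(training_block) + n_noise + n_situation_1 + n_situation_2
print(len(combinations), len(training_block), total)   # 16 32 48
```

The total of 48 agrees with the number of patterns stated for the training set.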



Pattern No


Table 1      This table shows the training set of 48 patterns. Patterns 1-16 contain all possible combinations of X and Y where X, Y ∈ {17-32}.



When the system was tested, it was presented with the four different contents, Content 1-4. The system was then asked to retrieve the corresponding Box and Place to each Content.


The system’s LTM associated each Content to its corresponding Box. The Box, in turn, associated itself to the correct Place using the working memory.


In table 2 the results of the run are shown. Naturally, the retrieval-ratios of the contents are 1. The retrieval-ratios for the Boxes and the Places are also almost 1. Note that the system has no problem keeping track of the last minute switch of Place between Box 1 and Box 2. 



Fraction of correct retrieval of:
Box 1
Box 2
Box 3
Box 4





Table 2      The performance of the system. The system was fed with a “content”, and then asked to retrieve the place where this content was.



The same system was also tested with the STM disabled. The results are shown in table 3. The retrieval-ratios of the content and box were still 1. This is what is expected since the retrieval of the Box is made with the LTM.


It was interesting to see what happened to the retrieval of the Place. The Places where Box 3 and Box 4 were put were retrieved approximately 25% of the time. This corresponds to the chance level when picking between four equally likely alternatives. The same retrieval-ratio, 25%, was expected for the Places where Box 1 and Box 2 were placed. Instead I found the retrieval-ratio to be zero in these two cases.



Fraction of correct retrieval of:
Box 1
Box 2
Box 3
Box 4





Table 3      The results of the system with the STM disabled. Note that the system cannot keep track of the switch between boxes 1 and 2.



I did not investigate this matter further since I did not find it relevant to the working memory issue. However, the zero retrieval-ratio was probably caused by slower convergence. All of the systems in this thesis used a fixed convergence time of 1 time-unit. The convergence time needed in this case was probably up to 10 time-units.



6.2         System Built with Modules of LTM and STM


The aim of this experiment was to show that a modular system could be designed and that this modular system could perform as well as the “integrated” system in 6.1. In this new modular system, the representations of the Box/Content and the Place were separated into separate modules. The approach of constructing the systems with LTM & STM modules has many benefits over the systems with a single LTM & STM. If the system is required to handle a new type of input or class of attributes, it is easy just to add a module. Furthermore, if the properties of a certain class of attributes are altered, it is easy to alter the corresponding module.


The modular system is based on two modules where each module contains an LTM and an STM. This system was tested on the same task as the system in section 6.1. Each of these modules is identical to the system in section 5.1.1. The STM contains 30 neurons and is connected to the LTM with a plastic projection. Figure 45 shows the modular system. The bi-directional projection between STM 1 and STM 2 had α = 0.5. The projections from STM 1 to LTM 2 and from STM 2 to LTM 1 also had α = 0.5. The bi-directional projection between LTM 1 and LTM 2 had α = 0.005. The projections between LTM 1 and LTM 2 were made sparse by random deletion of 70% of the elements in the projections. The deletion of these elements made the separation between the modules clearer. The aim was to minimize the number of connections between the modules.
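Random deletion of a fixed fraction of the elements in a projection can be sketched as below. The function name is hypothetical and the toy matrix is much smaller than the projections between LTM 1 and LTM 2. In a plastic projection the same deletion mask would presumably have to be kept fixed during learning, which is not shown here.

```python
import random


def sparsify(weights, fraction=0.7, rng=random):
    """Return a copy of a weight matrix with `fraction` of its elements set to zero."""
    rows, cols = len(weights), len(weights[0])
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    deleted = set(rng.sample(positions, round(fraction * len(positions))))
    return [[0.0 if (i, j) in deleted else weights[i][j] for j in range(cols)]
            for i in range(rows)]


rng = random.Random(1)
dense = [[1.0] * 10 for _ in range(10)]      # toy 10 x 10 projection
sparse = sparsify(dense, rng=rng)
print(sum(w != 0.0 for row in sparse for w in row))   # 30 of 100 elements remain
```

Sampling positions without replacement guarantees that exactly 70% of the elements are removed, rather than 70% on average.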



Figure 45  A system constructed of two smaller systems from section 5.1.1. This system shows how bigger systems can be constructed out of smaller modules. The system has two LTMs connected with sparse plastic projections, and two STMs connected with plastic projections.



In section 6.1, all of the information was stored in a single LTM. In this system, the Place and Box / Content memories are stored in two separate LTMs. Figure 46 shows how the Place memories are represented in LTM 1. Figure 47 shows how the Box / Content memories are represented in LTM 2. If the system is equipped with a third module, the representation of the Box / Content can be separated.




Figure 46  The input to module 1 of the system.





Figure 47  The input to module 2 of the system.



The system was trained with the set of patterns described in table 1. Each pattern was presented to the network during one time-unit. On retrieval the system was fed with each of the four Contents. Retrieval (relaxation) was also made during one time-unit. The output was taken from both LTM 2 and LTM 1.


The retrieval process of the Place memory started with input of the Content. The system then used the LTM 2 to activate the Box memory. The Box memory then activated the Place memory using the STM projections on LTM 1.


Table 4 shows the results of 100 runs. This result shows that the modular design is working well.



Fraction of correct retrieval of:
Content 1
Content 2
Content 3
Content 4





Table 4      The performance of the modular system when executing the task described in the beginning of chapter 6.



Table 5 shows the results of a run with the system where the two short-term memories, STM 1 and STM 2, have been disabled. The retrieval-ratios of the Place memories are very poor. This is expected, since the retrieval of these memories is dependent on the working memory.  If the results in table 5 are compared with the results in table 3, one finds that the retrieval-ratios for Content 1 and Content 2 are no longer zero, and the retrieval-ratios for Content 3 and Content 4 have increased. As earlier stated, these differences can be traced back to the relaxation time and the different network structures and data representations.









Fraction of correct retrieval of:
Content 1
Content 2
Content 3
Content 4





Table 5      The performance of the modular system when STM 1 and STM 2 were disabled.



This experiment shows that several Bayesian networks can be used in a modular designed system. The experiment also, once again, proved the usefulness of a working memory. It remains to be studied how these modular systems scale to larger sized modules.



6.3         Summary


It was shown that an STM based on an attractor network could function as a working memory. It was also shown that a system with both an LTM and an STM could solve problems that would not have been possible to solve with a system comprising only an LTM or only an STM.


Larger systems based on modules of LTM and STM were constructed. It was concluded that these systems could be applied to solve the same problems as the smaller system, which was based on a single LTM and STM. The advantages of the modular system were that its capabilities could easily be extended and modified.


7.0     Conclusions                                                      


The focus of this thesis was to find out if an STM based on fast changes in the synapses (weights) could be constructed with the incremental, Bayesian learning rule. After it had been established that it was possible to create an STM with fast changing synapses, the focus turned to how the design of the STM could be refined. Several designs of the STM were tried and evaluated. When this foundational work had been completed, the attention was turned to the concept of a working memory. The question was if it was possible to construct a system with a working memory out of an STM and LTM.  


Modelling the short-term storage process with fast changing synapses proved to be successful. The idea that the short-term memory process is similar to the long-term memory process allows one to adopt a concrete view of the STM. It also made it possible for me to implement the STM as a high plasticity version of the LTM. If the short-term memory process is based on some sort of persistent activity instead of fast changes in the synapses, this model can still be applicable when modelling the STM. The model proved that an STM could effectively be implemented with an existing LTM, with very few additional neurons.


On the network level, it was established that Bayesian networks could successfully be operated with several projections with different plasticity. This was a basic requirement to make it possible for networks to cooperate. Useful insights on how to design an STM were achieved.


Two different approaches to connect networks were tried. The first approach used a single population of neurons that had two projections with different plasticities. The second approach used two networks with different plasticities. This approach used 10% more neurons, but only 60% of the connections as compared to the first approach. The first approach could only retain the last one or two memories while the second approach could retain about 10 of the most recently presented memories. In the cerebral cortex, both of these types of memory may exist. The first type presented with a short memory span, may exist in the visual regions of the cortex. The second type of memory may be found in the prefrontal regions of the cortex where it may be used as a working memory [25].  
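The figures of 10% more neurons and about 60% of the connections can be reproduced under one plausible set of assumptions: an LTM of 100 neurons, an STM of 10 neurons, full connectivity within each network, and full projections in both directions between the two networks. These assumptions are made here for illustration and are not stated explicitly above.

```python
ltm, stm = 100, 10                  # assumed sizes: 100-neuron LTM, 10-neuron STM

# Approach 1: one population with two full recurrent projections
# of different plasticity.
neurons_1 = ltm
connections_1 = 2 * ltm * ltm

# Approach 2: separate LTM and STM, each fully recurrent, plus full
# projections in both directions between them.
neurons_2 = ltm + stm
connections_2 = ltm * ltm + stm * stm + 2 * ltm * stm

print(neurons_2 / neurons_1, connections_2 / connections_1)   # 1.1 0.605
```

With these numbers the second approach uses 110 instead of 100 neurons and 12100 instead of 20000 connections.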


The constant γ was introduced to control the influence of a projection between populations (networks) with different plasticities and sizes. The STM had a larger influence on the LTM than the LTM had on the STM. The advantage of a network with low plasticity was that more memories could be stored. The disadvantage was that the memory became more sensitive to interference. The size of a projection also determined how much influence it had. A projection composed of many weights had, naturally, more influence than a projection with few weights.


It was established that the performance of a system with two projections, with different plasticities, was improved if the system was divided into one high- and one low-plasticity network. It was also established that the STM could be made much smaller than the LTM. If the STM was to store the last 10 patterns, it needed to be able to distinctively store a pointer or an address to each of these last 10 memories. If the STM had hypercolumns of the same size as the LTM, the STM needed 10 neurons to store 10 patterns. The number of neurons could be reduced even further if the hypercolumns in the STM were smaller than those of the LTM.


Systems constructed with an LTM and an STM were shown to be able to use their STM as a working memory. This enabled these systems to perform operations that otherwise would have been impossible. The working memory made it possible for the systems to perform “role filling”.


Modules, or systems, of one LTM and one STM were used to construct a large memory system. It was proved that the functionality of a single module/system was also present in a large system composed of several modules. The system had more connections within each module than between the modules. This characteristic, of localized connections, conforms well to what has been seen in real neural systems [24].


There are several ways the presented system can be interpreted in a cortical sense. In the first interpretation of the system, a cortical hypercolumn corresponds to a single module. Each hypercolumn in the network corresponds to a cortical column. The individual neuron of the system corresponds to a small number of inhibitory and excitatory nerve cells. The other interpretation of the system is that a module corresponds to a whole sensory area of the cortex, e.g. the visual area. Each hypercolumn in the network corresponds to a cortical hypercolumn and each neuron in the system corresponds to a cortical column.


An interesting concept to be studied in the future is how the γ factor affects the systems, and also how a variable γ factor can be used when a system is extended with an attention control [39].


The incremental, Bayesian learning rule was created to deal with unlimited amounts of data. None of the systems in this thesis were run with continuous streams of input and output data. The systems were first put in training mode, and then they were put in operation mode. To enable systems to operate on continuous data streams, some sort of regulating rhythm is needed that can control the switch between the learning and retrieval modes of the system. The development of such an addition to the learning rule is underway. The brain is thought to operate in the same manner, switching between an input and an output mode. The theta rhythm is thought to control the switching between these two modes in the brain [40].


The working memory incorporates a notion of time into the network. With the help of the working memory, the system can keep track of the last occurring event. Even this small notion of time proved to be useful when the systems performed tasks that were not only those performed by associative memories. If a system is to become more than just associative memory, it needs to be able to incorporate the dimension of time.   

8.0     References


8.1         Figure References


Figure 3.          “Biological Psychology II: Brain Structure and Function”, 2000-10-10 http://psych.wisc.edu/faculty/pages/croberts/catonpic/topic4/neuron.gif


Figure 4.          “Biological Psychology II: Brain Structure and Function”, 2000-10-10 http://psych.wisc.edu/faculty/pages/croberts/catonpic/topic4/cortex2.gif


Figure 5.          “Biological Psychology II: Brain Structure and Function”, 2000-10-10 http://psych.wisc.edu/faculty/pages/croberts/catonpic/topic4/cortical.gif



8.2         Bibliography


1.             Lynch, G., 1999, Memory Consolidation and Long-Term Potentiation, in The new cognitive neurosciences. Bradford Books / MIT Press. p. 139.


2.             Amit, D.J. and N. Brunel, 1995, Learning internal representation in an attractor neural network. Network. 6: p. 359.


3.             Haberly, L.B. and J.M. Bower, 1989, Olfactory cortex: model circuit for study of associative memory. Trends Neurosci. 12(7): p. 258-64.


4.             Hasselmo, M.E., B.P. Anderson, and J.M. Bower, 1992, Cholinergic modulation of cortical associative memory function. J. Neurophysiol. 67: p. 1230-1246.


5.             Fransén, E. and A. Lansner, 1995, Low spiking rates in a population of mutually exciting pyramidal cells. Network: Computation in Neural Systems. 6: p. 271-288.


6.             Fransén, E. and A. Lansner, 1998, A model of cortical associative memory based on a horizontal network of connected columns. Network: Computation in Neural Systems. 9: p. 235-264.


7.             Miller, G.A., 1956, The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. The Psychological Review. 63: p. 81-97.


8.             Tanaka, S. and S. Okada, 1999, Functional prefrontal cortical circuitry for visuospatial working memory formation: A computational model. Neurocomputing. 26-27: p. 891-899.


9.             Seriès, P. and P. Tarroux, 1999, Synchrony and delay activity in cortical column models. Neurocomputing. 26-27: p. 505-510.


10.           Erickson, C., B. Jagadeesh, and R. Desimone, 1999, Learning and memory in the inferior temporal cortex of the Macaque, in The new cognitive neurosciences. Bradford Books / MIT Press. p. 743.


11.           Baddeley, A., 1983, Working Memory. Philos. Trans R. Soc. Lond. Biol. (302): p. 311-324.


12.           Hopfield, J.J., 1982, Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA. 79(8): p. 2554-8.


13.           Sandberg, A., et al., 1999, An incremental Bayesian learning rule, Nada, KTH.


14.           Fuster, J.M., 1995, Memory in the Cerebral Cortex. London: The MIT Press.


15.           Coltheart, M., 1983, Iconic memory. Philos. Trans R. Soc. Lond. Biol. (302): p. 283-294.


16.           Tulving, E., 1983, Elements of Episodic Memory. Oxford: Clarendon Press.


17.           Tulving, E., 1987, Multiple memory systems and consciousness. Hum. Neurobiol. 6: p.



18.           Cohen, N.J. and L.R. Squire, 1980, Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of knowing how and knowing that. Science. 210: p. 207-210.


19.           Shepherd, G.M. and C. Koch, 1990, The Synaptic Organization of the Brain. New York: Oxford University Press.


20.           Ramón y Cajal, S., 1911, Histologie du Système Nerveux de l'Homme et des Vertébrés.


21.           Faggin, F., 1991, VLSI Implementation of Neural Networks, in An Introduction to Neural and Electronic Networks.


22.           Freeman, W.J., 1975, Mass Action in the Nervous System. New York: Academic Press.


23.           Dale, H.H., 1935, Pharmacology and nerve endings. Proc. R. Soc. Med. 28: p. 319-332.


24.           Johnston, D. and S.M.-S. Wu, 1998, Foundations of Cellular Neurophysiology. Bradford Books / MIT Press.


25.           Fuster, J.M., 1989, The prefrontal cortex. 2 ed. New York: Raven Press.


26.           Calvin, W.H., 1995, Cortical Columns, Modules, and Hebbian Cell Assemblies, in The handbook of brain theory and neural networks. Bradford Books / MIT Press. p. 269-272.


27.           Churchland, P.S. and T.J. Sejnowski, 1992, The Computational Brain. Cambridge: MIT Press.


28.           Eggermont, J.J., 1990, The Correlative Brain: Theory and Experiment in Neural Interaction.


29.           Lansner, A. 1991. A recurrent Bayesian ANN capable of extracting prototypes from unlabeled and noisy examples. in Artificial Neural Networks. Espoo, Finland: Elsevier, Amsterdam.


30.           Hebb, D.O., 1949, The Organization of Behavior. New York: John Wiley Inc.


31.           Haykin, S., 1999, Neural networks: a comprehensive foundation. 2 ed: Prentice-Hall Inc.


32.           Nadal, J.P., et al., 1986, Networks of formal neurons and memory palimpsests. Europhysics Letters. 1(10): p. 535-542.


33.           Hertz, J., A. Krogh, and R.G. Palmer, 1991, Introduction to the Theory of Neural Computation: Addison-Wesley.


34.           Lansner, A. and Ö. Ekeberg. 1989. A One-Layered Feedback Artificial Neural Network with a Bayesian Learning Rule. in Nordic Symposium on Neural Computing. Hanasaari Culture Center, Espoo, Finland.


35.           Lansner, A. and A. Holst, 1996, A higher order Bayesian neural network with spiking units. Int. J. Neural Systems. 7(2): p. 115-128.


36.           Holst, A., 1997, The Use of a Bayesian Neural Network Model for Classification Tasks, Nada, KTH, Stockholm.


37.           Hubel, D.H. and T.N. Wiesel, 1974, Uniformity of monkey striate cortex: A parallel relationship between field size, scatter and magnification factor. J. Comp. Neurol. 158: p. 295-306.


38.           Amit, D.J., 1989, Modeling Brain Function: The world of attractor neural networks. Cambridge University Press.


39.           Hasselmo, M., B. Wyble, and G. Wallenstein, 1996, Encoding and Retrieval of Episodic Memories: Role of Cholinergic and GABAergic Modulation in the Hippocampus, in Hippocampus. Wiley-Liss Inc. p. 693-708.


40.           Kalat, J.W., 1998, Biological Psychology. 6 ed: Brooks/Cole Publishing Company.