Numenta Technology Discussion
From Anita Borg Institute Wiki
Presentation
Notes
Computers started as multi-purpose machines. They have been great at some tasks like math and databases and communications, but not very good at visual perception or languages, etc. The brain is a machine that does this very well. Turing - a computer can work as a model of any other machine. So Numenta is trying to make the computer model the brain.
Picture a brain: the old brain is the stem. The wrinkly stuff surrounding the brain in a thin sheet is the neocortex. The neocortex arrived later on the evolutionary time scale. The old brain provides helpful behaviours, but those behaviours are not very adaptive. The neocortex came along and now a lot of the behaviours have been taken over by the new brain. The cell structure of the neocortex is roughly the same whether you are looking at a vision area or an auditory area or even if you are looking at a bit of neocortex from a monkey. So it has been proposed that the neocortex uses the same algorithm everywhere. This is not necessarily believed by everyone, but there is some evidence for that. Ferret experiments by Michael Merzenich and Jitendra Sharma show rewiring can happen. Sensory substitution is further evidence. Kevin O'Regan and Alva Noe had a camera that converted visual input into tactile input (sensations on the subject's back). That can result in "seeing". Does this mean the neocortex is a universal learning machine? No. "The No Free Lunch Theorem: 'No learning algorithm has an inherent superiority over other learning algorithms for all problems' (Wolpert 95)" Algorithms that are successful exploit certain characteristics of the world (and are penalized in worlds where their assumptions do not hold). So, what assumptions does the neocortex make? "Are there assumptions that can be made about the world that are 1. General enough to be applied to a large class of problems. 2. Specific enough to make learning possible."
The neocortex takes in sensations and takes actions that affect the world. "Unsupervised nature of vision" - the pixels on a puppy's retina make up some picture as it walks towards a bowl of water or a bone. These pixels are not categorized in a supervised way: nobody tells the puppy it is moving towards a bone or towards a water bowl. But as the puppy moves closer it sees a sequence of images. If it can stitch them together in time, it gets a label in an unsupervised way. Vision is a spatial-temporal process, not merely a spatial process.
The neocortex is organized as a hierarchy. Neurons at the lowest level see only small pieces of the picture. Neurons at the highest level see the whole. If the sensory input (say, an image) shifts slightly, responses at the highest level do not change, but responses at the lowest level do. So the higher levels see more space and change more slowly, and lower levels see less space and change more quickly. This is the "multi-scale organization of the world".
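The shift behaviour described above can be illustrated with a small sketch. This is not Numenta's algorithm: the pooling rule (each unit is the maximum of two units below it), the 16-pixel input, and the names `build_hierarchy`, `units_changed` and `bar` are all assumptions made for illustration.

```python
def build_hierarchy(pixels, fan_in=2):
    """Return the responses at every level, raw input first, single top unit last.

    Assumption: each unit is simply the max (logical OR) of `fan_in` units
    in the level below. Real HTM nodes do far more than this.
    """
    levels = [list(pixels)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        above = [max(below[i:i + fan_in]) for i in range(0, len(below), fan_in)]
        levels.append(above)
    return levels


def units_changed(levels_a, levels_b):
    """Count, level by level, how many units differ between two hierarchies."""
    return [sum(x != y for x, y in zip(a, b)) for a, b in zip(levels_a, levels_b)]


def bar(start, width=2, size=16):
    """A 1-D 'image' containing a short bar of active pixels."""
    return [1 if start <= i < start + width else 0 for i in range(size)]


original = build_hierarchy(bar(5))
shifted = build_hierarchy(bar(6))   # same bar, moved one pixel to the right
changes = units_changed(original, shifted)
print(changes)  # [2, 1, 0, 0, 0]
```

In this toy case the one-pixel shift changes two raw pixels and one unit at the first level, while the upper levels stay the same. A shift that crosses a pooling boundary would travel further up, so the sketch shows the tendency and not a guarantee.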
HTM - "Hierarchical Temporal Memory"
- not a model at the level of each neuron, selective
- a theory of how the neocortex works
- also a technology
"1) Creates a model of its world 2) Recognizes new patterns 3) Predicts 4) Generates behavior"
How does each node discover causes?
- Assign causes to common spatial patterns
- Assign causes to common sequences
Each node talks to the higher level through sequence names
- All nodes do the same thing
- learns common spatial patterns
- learns common sequences (groups patterns with common cause)
- sequence names passed up
- predicted spatial patterns passed down
- creates hierarchical model of causes
- Bayesian methods resolve ambiguity
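The node behaviour listed above can be sketched in a few lines. This is a toy under stated assumptions, not the NuPIC API: the class `ToyNode` and its methods are hypothetical, patterns are matched exactly instead of being quantized, temporal groups are found as connected components of frequent transitions, and the Bayesian and top-down parts are left out.

```python
from collections import defaultdict


class ToyNode:
    """Hypothetical node: memorizes patterns, counts transitions, groups them."""

    def __init__(self):
        self.patterns = []                    # learned spatial patterns
        self.transitions = defaultdict(int)   # (previous index, next index) -> count
        self._prev = None

    def reset(self):
        """Mark a break between training sequences."""
        self._prev = None

    def learn(self, pattern):
        """Store the pattern if new and count the transition that led to it."""
        pattern = tuple(pattern)
        if pattern not in self.patterns:
            self.patterns.append(pattern)
        idx = self.patterns.index(pattern)
        if self._prev is not None and self._prev != idx:
            self.transitions[(self._prev, idx)] += 1
        self._prev = idx

    def temporal_groups(self, min_count=2):
        """Group patterns linked by transitions seen at least `min_count` times."""
        parent = list(range(len(self.patterns)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path compression
                i = parent[i]
            return i

        for (a, b), count in self.transitions.items():
            if count >= min_count:
                parent[find(a)] = find(b)
        roots = sorted({find(i) for i in range(len(self.patterns))})
        return {i: roots.index(find(i)) for i in range(len(self.patterns))}

    def infer(self, pattern, groups):
        """Return the group index (the 'sequence name') passed up the hierarchy.

        Raises ValueError for a pattern that was never learned.
        """
        return groups[self.patterns.index(tuple(pattern))]


node = ToyNode()
sliding_bar = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
sliding_edge = [(1, 1, 0), (0, 1, 1)]
for _ in range(3):
    for sequence in (sliding_bar, sliding_edge):
        node.reset()
        for pattern in sequence:
            node.learn(pattern)

groups = node.temporal_groups()
bar_name = node.infer((0, 1, 0), groups)
edge_name = node.infer((0, 1, 1), groups)
print(groups, bar_name, edge_name)
```

Patterns that follow each other in time end up under one name, so the parent node sees the same input no matter where the bar currently is. That is the sense in which a sequence acts as an unsupervised label.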
The system is trained from videos of the rough drawings moving in different directions. It has to learn only scale variations and translations. HTM passes information in both directions in the hierarchy. Lower levels pass cleaner images to the top level. Each level stores a transition matrix. They are trying this on greyscale recognition on the ALOI image data set. Computer is shown... There are papers on some of the applications. "Anything requiring precise timing or high order temporal data" are potential applications. Have not moved towards "learning while taking actions."
NuPIC "research release" can be downloaded from the website (www.numenta.com) for no charge. Partnered with several companies.
Q: Any academic partners? A: Not officially
Q: Maybe the human brain isn't good at learning. How is the training time? A: We are assuming humans are good at learning. There has been a trend in machine learning pointing towards how humans learn. The time is probably fast compared to neural networks. Recognition is almost instantaneous. But it takes a lot of training time. Used to be days, now is 10 minutes. But neural networks are universal learning machines, so he suspects this will be faster.
Q: What about text analytics? A: Don't know. Not in favour of trying this now. Language is not visual. "I cannot learn Russian by scanning it, taking a book and flipping through it." We need grounded, embodied learning first.
Q: System seems visual, hierarchical. What about other senses? A: Last started applying it to auditory signals and have also started looking at traffic on a freeway - not a modality humans have. "We think that all other modalities will turn out to be similar to visual"
Q: Say more about the training data: how many examples, what kind of transformations, cluttered images, etc. A: The line drawing example used only clean images and scale, no rotations. The number of training images ran into hundreds of thousands. Checked for overfitting.
Q: What operating platform? A: Mac, Windows and Linux (some flavours, not all).
Q: Training with noisy images? A: No. First versions on the learning side not good at noisy images. Now are trying it more.
Q: What is your background and other people's on the team? A: EE, undergrad in India, Master's at Stanford, PhD Stanford. Worked in wireless communication theory and switched to neuroscience. Work with lots of professors and scientists, most in machine learning. Not so many neuroscientists, because they are not thinking about algorithms.
Q: Methodology question: compare this to a neural network that uses features in space and time. A: In neural networks, the training signal is n to n. No way to train it. Backpropagation does not build in enough bias to have it learn slow features. If you build in the constraints, then it looks similar.
Q: Do you have an example where a neural network failed? A: I could try to do it, but I don't want to spend all my time making a neural network work. Things don't work the first time. So you should try.
Q: Does the type of data matter? Image, auditory? A: Do pre-processing on the greyscale images. We could learn it by putting a sparse prior on the features but it is computationally expensive and training already takes one week. It should not depend too much on pre-processing. "You have to have an idea about what are the higher level states that you are expecting to learn" but within that there is a lot of variation.
Q: Question about training. A: Supervising at the last level. It can be unsupervised. Mostly happy with having many levels unsupervised and then supervising the last level. Unsupervised all the way is hard to interpret.
Q: A problem with neural networks is that the model you end up with can be hard to interpret, it is kind of hidden. How about this method? A: It is interpretable. You can play the co-occurrences down through the hierarchy to find out what is triggering it.
Q: Are you pulling from biology to determine what to teach it when and in what sequence? A: Not too much. We can get some ideas about shortcuts for training just from the algorithm and the computation, and try to match with biology, but it is not a tight map.
Q: Relationship to Bruno Olshausen's work on sparse visual perception? A: Bruno is a scientist at Berkeley and works on sparse distributed representation of images. Example: you can recognize "yellow Volkswagen" with a single neuron, or as the co-occurrence of "yellow" and "Volkswagen". "It is better to have neurons be shared." "We use sparse distributed coding in our hierarchy." Something can be represented as the two neurons being active rather than creating a new one and needing to learn new transformations. We use the argument based on generalization: "even if you had capacity, you do not want to allocate a new neuron"
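The "yellow Volkswagen" argument can be made concrete with a small counting sketch. The feature lists and the names `dedicated_units`, `shared_units` and `encode_shared` are invented for illustration and are not taken from the talk or from NuPIC.

```python
from itertools import product

# Hypothetical feature vocabularies, chosen only for this example.
colors = ["yellow", "red", "blue"]
models = ["volkswagen", "truck", "bicycle"]

# Option 1: one dedicated unit per (color, model) combination.
dedicated_units = {combo: i for i, combo in enumerate(product(colors, models))}

# Option 2: one shared unit per feature; an object is the set of active units.
shared_units = {name: i for i, name in enumerate(colors + models)}


def encode_shared(color, model):
    """Represent an object as the co-occurrence of its two feature units."""
    return frozenset({shared_units[color], shared_units[model]})


yellow_vw = encode_shared("yellow", "volkswagen")
blue_vw = encode_shared("blue", "volkswagen")
overlap = yellow_vw & blue_vw

print(len(dedicated_units), len(shared_units))  # 9 6
print(sorted(yellow_vw), sorted(overlap))       # [0, 3] [3]
```

With shared units, a combination that was never seen before still gets a code without allocating anything new, and the overlap between two codes shows what the objects have in common. That overlap is what the generalization argument relies on.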
Q: What are your goals for the second version? A: Focusing on image recognition. Want to train on 100 objects and recognize wide variations on each of those objects.
Q: Seems like you're starting with the end result. What about taking raw data and finding patterns? A: We do assume the hierarchical structure and fill in the details at lower levels. We have an algorithm that discovers things but it is extremely expensive.
Q: Using McClelland's work on parallel distributed processing? A: No. Maybe the higher level ideas but none of the details.