The speed with which humans and monkeys can detect the presence of animals in complex natural scenes constitutes a major challenge for models of visual processing. Here we use simulations with SpikeNet (www.spikenet-technology.com) to demonstrate that even complex visual forms can be detected and localised using a feed-forward processing architecture in which information about the stimulus is coded by the order of firing in a single wave of spikes. Neurons in later recognition layers learn to recognise particular visual forms within their receptive field by increasing the synaptic weights of inputs that fire early in response to a stimulus. This concentration of weights on early-firing inputs is a natural consequence of spike-timing-dependent plasticity (STDP; see Guyonneau et al., 2005, Neural Computation, 17, 859). The resulting connectivity patterns produce neurons that respond selectively to arbitrary visual forms while retaining a remarkable degree of invariance to image transformations. For example, selective responses are obtained with image size changes of roughly ±20%, rotations of around ±12°, and viewing-angle variations of approximately ±30°. Furthermore, there is also very good tolerance to variations in contrast and luminance and to the addition of noise or blurring. The performance of this neurally inspired architecture raises the possibility that our ability to detect animals and other complex forms in natural scenes could depend on the existence of very large numbers of neurons in higher-order visual areas that have learned to respond to a wide range of image fragments, each of which is diagnostic for the presence of an animal part. The outputs of such a system could be used to trigger rapid behavioural responses, but could also be used to initiate complex and time-consuming processes that include scene segmentation, something that is not achieved during the initial feed-forward pass.
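The core mechanism described above can be illustrated with a minimal Python sketch. It is not the SpikeNet implementation: the rank-order activation (each input's weight attenuated by a modulation factor raised to its firing rank) and the toy weight-concentration rule standing in for full STDP dynamics are simplifying assumptions, and all names and constants (`MOD`, `N_INPUTS`, `n_earliest`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUTS = 100   # number of afferents (illustrative)
MOD = 0.9        # rank modulation factor (illustrative)

def rank_order_response(weights, latencies):
    """Activation from a single wave of spikes: each afferent contributes
    its weight attenuated by MOD**rank, so early spikes count most."""
    order = np.argsort(latencies)        # firing order within the wave
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(order))
    return np.sum(weights * MOD ** ranks)

def concentrate_on_early(weights, latencies, lr=0.1, n_earliest=10):
    """Toy STDP-like rule: potentiate the earliest-firing inputs,
    depress the rest, then renormalise the total synaptic weight."""
    order = np.argsort(latencies)
    w = weights.copy()
    w[order[:n_earliest]] += lr
    w[order[n_earliest:]] -= lr * n_earliest / (len(w) - n_earliest)
    w = np.clip(w, 0.0, None)
    return w / w.sum() * weights.sum()

# A fixed "stimulus": each afferent fires once, with some latency.
latencies = rng.uniform(0, 50, N_INPUTS)
weights = np.full(N_INPUTS, 1.0 / N_INPUTS)

before = rank_order_response(weights, latencies)
for _ in range(20):
    weights = concentrate_on_early(weights, latencies)
after = rank_order_response(weights, latencies)

print(after > before)  # response to the trained wave has grown
```

Because weights end up concentrated on the inputs that fire first, the neuron's response to its trained spike wave rises well above the uniform-weight baseline, while the renormalisation keeps total synaptic weight constant, a crude analogue of the selectivity-with-stability that the abstract attributes to STDP.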