Previous research on behavioural modelling of saccadic image interpretation (Henderson, 1982 Psychological Science 8 51 - 55) has emphasised the sampling of informative image parts under visual attention to guide visual perception. We propose a system of sequential attention for object recognition that (i) groups n-tuples of local gradient-based image descriptors (Lowe, 2004 International Journal of Computer Vision 60 91 - 110), which are tolerant to changes in scale and rotation and, to a high degree, illumination, thereby defining a vocabulary of prototypical code descriptors, (ii) selects only informative groupings for further processing, (iii) learns a predictive mapping from the current perceptual state in a Markov decision process to the next saccadic action, and (iv) integrates sequential information in a model of object recognition by minimising the entropy in the Bayesian modelling of object hypotheses. The innovative abstraction level of informative groupings provides perceptual meta-states in sensory-motor attention, enabling the system to learn a purposeful grammar that integrates atomic feature-saccade mappings into a meaningful recognition behaviour. We demonstrate highly accurate recognition of outdoor facades in a mobile vision application, using the sensory-motor context of trans-saccadic object recognition.
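The entropy-driven integration of sequential information in step (iv) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the likelihood values, and the entropy threshold are all hypothetical, and the observations are fused in a fixed order rather than chosen by a learned saccade policy. It shows only how a Bayesian posterior over object hypotheses can be updated after each observation, and how its entropy can serve as a stopping criterion.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def bayes_update(prior, likelihood):
    """Posterior over object hypotheses after one observation."""
    unnorm = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnorm)
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return [u / total for u in unnorm]

def recognise(likelihoods, n_objects, entropy_threshold=0.5):
    """Fuse observations in order until posterior entropy falls below the threshold.

    The threshold of 0.5 bits is an arbitrary illustrative value.
    """
    posterior = [1.0 / n_objects] * n_objects  # uniform prior (assumption)
    history = [entropy(posterior)]
    for lk in likelihoods:
        posterior = bayes_update(posterior, lk)
        history.append(entropy(posterior))
        if history[-1] < entropy_threshold:
            break
    best = max(range(n_objects), key=posterior.__getitem__)
    return best, posterior, history

# Hypothetical likelihoods P(observed grouping | object) for three objects,
# one row per attended image region.
observations = [
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
    [0.8, 0.1, 0.1],
]
best, posterior, history = recognise(observations, n_objects=3)
print(best, [round(h, 3) for h in history])
```

In this toy run each observation favours the same object, so the entropy falls at every step. In general a single observation can also increase the entropy, which is one reason a policy that selects the next informative region matters.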