Recent models of visual attention have successfully simulated the ocular behaviour of human observers performing a visual search task on scenes. Unfortunately, these models do not account for the visual information that is represented in memory. This issue is nevertheless crucial, since (i) having been previously attended does not guarantee that visual information is represented in memory, and (ii) the visual information that is represented in memory guides subsequent behaviour. Here we propose a multinomial processing-tree model of the visual representation of scenes when observers have to perform a task on them. Multinomial models are substantively motivated models that provide a means of measuring latent cognitive processes from observable raw data. The current model aims to determine which visual information is extracted, processed, and represented in memory when observers have to perform a complex task on visual scenes. More interestingly, the model assesses and weighs the cognitive processes underlying the visual representation of scenes under active viewing. The model was tested and validated with empirical data. Results and implications for theories of visual perception are discussed.
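To illustrate how a multinomial processing tree links latent cognitive parameters to observable response counts, the following sketch fits a deliberately minimal two-parameter tree by maximum likelihood. It is not the model proposed in the abstract: the parameters `d` (probability that an attended object's visual detail is represented in memory) and `g` (guessing rate), the response categories, and the counts are all hypothetical, chosen only to show the general MPT logic of expressing category probabilities as products and sums of latent-process probabilities.

```python
import math

def mpt_probs(d, g):
    # Hypothetical two-parameter MPT:
    #   d = prob. an attended object's detail is represented in memory
    #   g = guessing rate when no representation is available
    # P("remembered" | attended)     = d + (1 - d) * g
    # P("remembered" | not attended) = g
    return d + (1 - d) * g, g

def log_likelihood(d, g, counts):
    # counts = (hits, misses) for attended objects and
    #          (false alarms, correct rejections) for unattended ones
    p_att, p_unatt = mpt_probs(d, g)
    h, m, fa, cr = counts
    return (h * math.log(p_att) + m * math.log(1 - p_att)
            + fa * math.log(p_unatt) + cr * math.log(1 - p_unatt))

def fit(counts, step=0.01):
    # Coarse grid search for the maximum-likelihood estimates of (d, g);
    # real applications would use EM or a general-purpose optimiser.
    best_ll, best_d, best_g = -float("inf"), None, None
    grid = [i * step for i in range(1, int(1 / step))]
    for d in grid:
        for g in grid:
            ll = log_likelihood(d, g, counts)
            if ll > best_ll:
                best_ll, best_d, best_g = ll, d, g
    return best_d, best_g

# Simulated counts: 80 hits / 20 misses on attended objects,
# 30 false alarms / 70 correct rejections on unattended ones.
d_hat, g_hat = fit((80, 20, 30, 70))
```

With these invented counts the guessing rate recovered is close to the unattended "remembered" proportion (0.30), and `d` is then identified from the surplus of attended-object hits, which is exactly the measurement logic the abstract attributes to multinomial models.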