Abstract:
VIsual STorytelling (VIST) is the task of transforming a sequence of images into a narrative text story. Producing such a story requires understanding the contexts of the images and the relationships among them. This study introduces a story generation process that emphasizes narrative coherence by constructing both image contexts and a narrative context. First, image contexts are generated from the content of individual images, using image features and scene graphs that describe the elements of each image. Second, the narrative context is generated from the overall image sequence; ensuring that each caption fits coherently within the overall story maintains continuity. We also introduce a narrative concept summary, external knowledge represented as a knowledge graph, which encapsulates the narrative concept of an image sequence to improve understanding of its overall content. Both the image and narrative contexts are then used to generate a coherent and engaging narrative. The framework is based on Long Short-Term Memory (LSTM) with an attention mechanism. We evaluate the proposed method on the VIST dataset. The results highlight the importance of understanding the contexts of an image sequence, and in particular of involving the narrative context in the generation process, for generating coherent and engaging stories.
Type: 31st Intl. Conf. on MultiMedia Modeling (MMM2025)
Publication date: To be published in Jan 2025