Abstract:
VIsual STorytelling (VIST) is the task of transforming a sequence of images into a narrative text story. Generating such a story successfully requires understanding the contexts of, and relationships among, the images. Our study introduces a story generation framework based on an attention mechanism over Long Short-Term Memory (LSTM) networks. The generation process considers both the local and global contexts of the image sequence. First, the local context is derived from individual image content, using the image features and scene graph of each image; it focuses on generating a caption for each image and providing image-level detail. Second, the global context captures comprehensive information about the overall image sequence and is constructed by aggregating the individual image contents; it ensures that each caption fits cohesively within the overall story, maintaining continuity and coherence. Both contexts are combined to generate a cohesive and engaging narrative. The VIST dataset is used to train and evaluate the proposed framework. Preliminary results highlight the importance of understanding image sequence contexts for generating coherent and engaging stories.
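The sketch below shows one plausible way such a framework could be wired together in PyTorch. The abstract does not give architectural details, so all specifics here are assumptions rather than the authors' implementation: the feature dimensions, fusing image and scene-graph features with a linear projection, mean-pooling the local contexts into a global context, additive attention over the local contexts, and teacher-forced decoding.

```python
# Minimal sketch of a local/global-context story generator, NOT the authors' code.
# Assumed (not stated in the abstract): dimensions, linear fusion, mean-pooled
# global context, additive attention, greedy/teacher-forced LSTM decoding.
import torch
import torch.nn as nn


class AttentionLSTMStoryteller(nn.Module):
    """Attention-on-LSTM story generator over an image sequence.

    Local context:  per-image CNN features fused with a scene-graph embedding.
    Global context: mean-pooled local contexts, summarizing the whole sequence.
    """

    def __init__(self, img_dim=2048, sg_dim=512, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.local_proj = nn.Linear(img_dim + sg_dim, hid_dim)  # fuse image + scene graph
        self.embed = nn.Embedding(vocab_size, hid_dim)
        # Decoder input: previous word embedding + attended local + global context
        self.lstm = nn.LSTMCell(hid_dim * 3, hid_dim)
        self.attn = nn.Linear(hid_dim * 2, 1)                   # additive attention score
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, sg_feats, captions):
        # img_feats: (B, N, img_dim), sg_feats: (B, N, sg_dim), captions: (B, T)
        local = torch.tanh(self.local_proj(torch.cat([img_feats, sg_feats], dim=-1)))  # (B, N, H)
        global_ctx = local.mean(dim=1)                          # (B, H) sequence-level summary
        B, T = captions.shape
        h = c = local.new_zeros(B, local.size(-1))
        logits = []
        for t in range(T):
            # Attend over the N local contexts, conditioned on the decoder state
            scores = self.attn(torch.cat([local, h.unsqueeze(1).expand_as(local)], dim=-1))
            attended = (scores.softmax(dim=1) * local).sum(dim=1)  # (B, H)
            x = torch.cat([self.embed(captions[:, t]), attended, global_ctx], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                       # (B, T, vocab_size)


# Toy usage: a 5-image sequence (VIST stories pair 5 images with 5 sentences)
model = AttentionLSTMStoryteller()
img = torch.randn(2, 5, 2048)
sg = torch.randn(2, 5, 512)
caps = torch.randint(0, 10000, (2, 12))
print(model(img, sg, caps).shape)  # torch.Size([2, 12, 10000])
```

At inference time, decoding would proceed autoregressively, feeding back the previously generated token in place of the ground-truth caption, with the attention weights expected to shift across the images as the story progresses.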
Type: Poster at MIRU Symposium (画像の認識・理解シンポジウム)
Publication date: August 2024