Question-Answering on in-Video Content
Nan @ Jina AI
Oct. 24, 2021
Asking questions is a natural way to search. When you want to know the definition of a Document in Jina, you will naturally ask, “What is a Document in Jina?”. The expected answer can be found either in Jina’s docs or in the introduction videos on Jina’s YouTube channel. Thanks to the latest advances in NLP, AI models can automatically find these answers in the content.
The goal of this tutorial is to build a Question-Answering (QA) system for video content. Although most existing QA models only work on text, most videos contain speech, which carries rich information about the video and can be converted to text via speech-to-text (STT). Videos with speech therefore naturally lend themselves to question-answering over text.
In this tutorial, we will show you how to find and extract content from videos that answers a query question. Instead of just finding related videos and having the user skim through the whole video, QA models can tell the user which second they should start from to get the answer to their question.
Build the Flow
To convert speech information from the videos into text, we can rely on STT algorithms. Fortunately, for most videos on YouTube, you can download the subtitles that are generated automatically via STT. In this example, we assume the video files already have subtitles embedded. By loading these subtitles, we can get the text of the speech together with the beginning and ending timestamps.
Tips
You can use youtube-dl to download YouTube videos with embedded subtitles:
youtube-dl --write-auto-sub --embed-subs --recode-video mkv -o zvXkQkqd2I8 https://www.youtube.com/watch\?v\=zvXkQkqd2I8
Note
Subtitles generated with STT are not 100% accurate. Usually, you need to post-process the subtitles. For example, in the toy data we use an introduction video of Jina. In the auto-generated subtitles, Jina is misspelled as gena, gina, etc. Worse still, most of the sentences are broken and there is no punctuation.
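As a minimal, hypothetical sketch of such post-processing, the snippet below normalizes a few common misspellings and collapses stray whitespace; the replacement table and the sample sentence are illustrative only, not part of the example code:

```python
import re

# Hypothetical cleanup rules: map common STT misspellings of "Jina" back to the correct spelling.
REPLACEMENTS = {
    r'\bgena\b': 'Jina',
    r'\bgina\b': 'Jina',
    r'\bjena\b': 'Jina',
}

def clean_subtitle(text: str) -> str:
    for pattern, replacement in REPLACEMENTS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # collapse repeated whitespace left over from broken subtitle lines
    return re.sub(r'\s+', ' ', text).strip()

print(clean_subtitle('what is a document in  gena'))
# -> 'what is a document in Jina'
```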
With the subtitles of the videos, we further need a QA model. The input to the QA model usually has two parts: the question and the context. The context denotes the candidate texts that contain the answers. In our case, the context corresponds to the subtitles from which the answers are extracted.
To save computational cost, we want the context to be as short as possible. To generate such contexts, one can use either traditional information retrieval with sparse vectors or dense vectors. In this example, we use the dense vectors that ship together with the QA model.
Note
With traditional methods, retrieval can also be done using BM25, TF-IDF, etc.
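For comparison, here is a minimal sparse-retrieval sketch using scikit-learn’s TfidfVectorizer. It is not part of this example’s Flow, and the subtitle strings are made up; it only illustrates what the traditional alternative looks like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy subtitle snippets standing in for the real, extracted subtitles.
subtitles = [
    'a document is the basic data type in jina',
    'executors process documents inside a flow',
    'jina hub hosts reusable executors',
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(subtitles)

query_vector = vectorizer.transform(['what is a document in jina'])
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(subtitles[scores.argmax()])  # best-matching subtitle
```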
Choose Executors
We use VideoLoader to extract subtitles from the videos. It uses ffmpeg to extract the subtitles and then generates chunks from them using webvtt-py. The subtitles are stored in the chunks together with other meta-information in the tags, including timestamps and video information. Extracted subtitles have the following attributes:
| attribute | information |
|---|---|
| text | Text of the subtitle |
| location | Index of the subtitle in the video, starting from 0 |
| modality | Always set to text |
| tags['beg_in_seconds'] | Beginning of the subtitle in seconds |
| tags['end_in_seconds'] | End of the subtitle in seconds |
| tags['video_uri'] | URI of the video |
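As a rough sketch of what the loader does under the hood, the snippet below extracts the embedded subtitle track with ffmpeg and turns each caption into a chunk with the attributes above. It assumes the subtitle track is WebVTT, that ffmpeg and webvtt-py are installed, and the file names and exact Document construction are illustrative:

```python
import subprocess
import webvtt
from jina import Document

video_uri = 'toy-data/zvXkQkqd2I8.mkv'

# Extract the first embedded subtitle track into a standalone WebVTT file.
subprocess.run(
    ['ffmpeg', '-y', '-i', video_uri, '-map', '0:s:0', 'subtitles.vtt'],
    check=True,
)

def to_seconds(timestamp: str) -> float:
    # WebVTT timestamps look like 'HH:MM:SS.mmm'.
    hours, minutes, seconds = timestamp.split(':')
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# Turn each caption into a chunk carrying text, timestamps and the video URI.
root = Document(uri=video_uri)
for idx, caption in enumerate(webvtt.read('subtitles.vtt')):
    chunk = Document(
        text=caption.text,
        location=[idx],
        modality='text',
        tags={
            'beg_in_seconds': to_seconds(caption.start),
            'end_in_seconds': to_seconds(caption.end),
            'video_uri': video_uri,
        },
    )
    root.chunks.append(chunk)
```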
For the QA model, we choose the Dense Passage Retrieval (DPR) models that were originally proposed by Facebook Research. These consist of:
An embedding model to encode the questions and answers into vectors in the same space. This way, we can retrieve candidate sentences that are most likely to contain the answer.
A reader model that extracts exact answers from candidate sentences.
Note
DPR is a set of tools and models for open-domain Q&A tasks.
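To get a feel for the two DPR models, here is a minimal sketch using the Hugging Face transformers implementation of the same pretrained checkpoints referenced later in this tutorial. It runs outside the Flow, and the context strings are made up:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
q_encoder = DPRQuestionEncoder.from_pretrained('facebook/dpr-question_encoder-single-nq-base')
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')
ctx_encoder = DPRContextEncoder.from_pretrained('facebook/dpr-ctx_encoder-single-nq-base')

contexts = [
    'a document is the basic data type in jina',
    'executors process documents inside a flow',
]

with torch.no_grad():
    q_emb = q_encoder(**q_tokenizer('what is a document in jina?', return_tensors='pt')).pooler_output
    ctx_emb = ctx_encoder(**ctx_tokenizer(contexts, return_tensors='pt', padding=True)).pooler_output

# Dot-product similarity: the question and context encoders share one vector space.
scores = (q_emb @ ctx_emb.T).squeeze(0)
print(contexts[int(scores.argmax())])
```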
For the indexer, we choose SimpleIndexer for demonstration purposes. It stores both vectors and meta-information together. You can find more information on Jina Hub.
Go through the Flow
Because the indexing and querying Flows have only one shared Executor, we create separate Flows for each task.
Index
The index request contains Documents that have the path information of the video files stored in their uri attribute.
There are three Executors in the index Flow:
VideoLoader extracts the subtitles and stores them in chunks. Each chunk of the Document has one subtitle stored in its text attribute.
DPRTextEncoder encodes the subtitles into vectors.
SimpleIndexer stores the vectors and other meta-information.
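Putting it together, a sketch of running the index Flow from Python could look like the following. It assumes the Flow is defined in an index.yml like the one shown later, and the toy video path is illustrative:

```python
from jina import Flow, Document, DocumentArray

# Load the index Flow definition (VideoLoader -> DPRTextEncoder -> SimpleIndexer).
f = Flow.load_config('index.yml')

# Each Document carries the path of a video file in its `uri` attribute.
docs = DocumentArray([Document(uri='toy-data/zvXkQkqd2I8.mkv')])

with f:
    f.post(on='/index', inputs=docs)
```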
Query
There are four Executors in the query Flow:
DPRTextEncoder takes the question stored in the text attribute of the query Document and encodes it into a vector.
SimpleIndexer retrieves related subtitles by finding the nearest neighbours in the vector space. The retrieved results are stored in the matches attribute of the query Document. Each Document in the matches also has all the meta-information about the subtitles, which is retrieved by SimpleIndexer together with the subtitle text.
DPRReaderRanker finds exact answers by using the question and the candidate subtitles. The question and candidate subtitles are stored in the text attributes of the Document and its matches respectively. Replacing the existing matches, DPRReaderRanker stores the best-matched answers in the text attribute of the matches. Other meta-information is also copied into the new matches, including tags['beg_in_seconds'], tags['end_in_seconds'], and tags['video_uri']. DPRReaderRanker returns two types of scores: scores['relevance_score'] measures the relevance to the question of the subtitle from which the answer is extracted, and scores['span_score'] indicates the weight of the extracted answer among the subtitles.
Text2Frame gets the video frame information from the retrieved answers and prepares the Document matches for displaying in the frontend.
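A corresponding sketch of issuing a question against the query Flow and reading back the answers might look like this. It assumes a query.yml like the one shown below, and the response handling follows the Jina 2.x API:

```python
from jina import Flow, Document

f = Flow.load_config('query.yml')

with f:
    resp = f.post(
        on='/search',
        inputs=Document(text='What is a Document in Jina?'),
        return_results=True,
    )

# Each match holds an extracted answer plus the timestamps of the subtitle it came from.
for match in resp[0].docs[0].matches:
    print(
        match.text,
        match.tags['beg_in_seconds'],
        match.tags['end_in_seconds'],
        match.tags['video_uri'],
    )
```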
The query Flow chains these four Executors in the order described above.
Use DPRTextEncoder differently in two Flows
You might note that DPRTextEncoder is used in both the index and query Flows:
In the index Flow it encodes subtitle text
In the query Flow it encodes query questions
In these two cases, we need to choose different models to encode the different attributes of the Documents. To achieve this, we use different initialization settings for DPRTextEncoder by overriding the with arguments in the YAML file. To do this, we pass the new arguments to uses_with. You can find more information in Jina’s docs.
# index.yml
...
- name: encoder
  uses: jinahub://DPRTextEncoder/v0.2
  uses_with:
    pretrained_model_name_or_path: 'facebook/dpr-ctx_encoder-single-nq-base'
    encoder_type: 'context'
    traversal_paths:
      - 'c'
...
# query.yml
...
- name: encoder
  uses: jinahub://DPRTextEncoder/v0.2
  uses_with:
    pretrained_model_name_or_path: 'facebook/dpr-question_encoder-single-nq-base'
    encoder_type: 'question'
    batch_size: 1
...
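The same override can also be passed programmatically when building the Flow in Python, which may be convenient if you prefer not to maintain two YAML files. This is a sketch with the other Executors omitted:

```python
from jina import Flow

# Override the encoder's `with` arguments via `uses_with`, mirroring query.yml above.
f = Flow().add(
    name='encoder',
    uses='jinahub://DPRTextEncoder/v0.2',
    uses_with={
        'pretrained_model_name_or_path': 'facebook/dpr-question_encoder-single-nq-base',
        'encoder_type': 'question',
        'batch_size': 1,
    },
)
```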
Get the Source Code
You can find the code at example-video-qa.
Most of the Executors used in this tutorial are available on Jina Hub.
Next Steps
In this example, we rely on subtitles embedded in the video. For videos without subtitles, we need to build Executors that use STT models to extract the speech information. If the video contains sounds other than speech, you can use VADSpeechSegmenter to separate the speech beforehand.
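As a starting point for such an Executor, here is a minimal speech-to-text sketch with a pretrained Wav2Vec2 model from transformers. It assumes the audio track has already been extracted to a 16 kHz mono WAV file (for example with ffmpeg); the file name is illustrative:

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base-960h')
model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h')

# 16 kHz mono audio extracted from the video beforehand.
speech, sample_rate = sf.read('audio.wav')

inputs = processor(speech, sampling_rate=sample_rate, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(predicted_ids)[0]
print(transcript)
```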
Another direction to extend this example is to consider the videos’ other text information. While subtitles contain rich information about the video, not all text information is included in subtitles. A lot of videos have text information embedded in images. In such cases, we need to rely on OCR models to extract text information from the video frames.
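A quick sketch of that direction, using OpenCV to grab a single frame and pytesseract for OCR; both libraries are stand-ins for illustration, and a real pipeline would sample many frames and deduplicate the extracted text:

```python
import cv2
import pytesseract

cap = cv2.VideoCapture('toy-data/zvXkQkqd2I8.mkv')
cap.set(cv2.CAP_PROP_POS_MSEC, 30_000)  # jump to the frame at 30 seconds
ok, frame = cap.read()
cap.release()

if ok:
    # pytesseract expects RGB images, while OpenCV returns BGR.
    text = pytesseract.image_to_string(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    print(text)
```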
Overall, searching in-video content is a complex task and Jina makes it a lot easier.