Question-Answering Chatbot via Transformer
Susana @ Jina AI
June 15, 2021
We will use the hello world chatbot for this tutorial. You can find the complete code here and we will go step by step.
At the end of this tutorial, you will have your own chatbot. You will use text as an input and get a text result as output. For this example, we will use a covid dataset. You will understand how every part of this example works and how you can create new apps with different datasets on your own.
Define data and work directories
We can start by creating an empty folder, I’ll call mine tutorial
and that’s the name you’ll see through the tutorial but feel free to use whatever you wish.
We will display our results in our browser, so download the static folder from here, and paste it into your tutorial folder. This is only the CSS and HTML files to render our results. We will use a dataset in a .csv
format. We’ll use the COVID dataset from Kaggle.
Download it under your tutorial
directory:
wget https://static.jina.ai/chatbot/dataset.csv
Create Documents from a csv file
To create a Document in Jina, we do it like this:
doc = Document(content='hello, world!')
In our case, the content of our Document needs to be the dataset we want to use:
from jina.types.document.generators import from_csv
with open('dataset.csv') as fp:
docs = from_csv(fp, field_resolver={'question': 'text'})
So what happened there? We created a generator of Documents docs
, and we used from_csv to load our dataset. We use field_resolver
to map the text from our dataset to the Document attributes.
Finally, we can combine the 2 previous steps (loading the dataset into Documents and starting the context) and index like this:
from jina.types.document.generators import from_csv
with flow, open('dataset.csv') as fp:
flow.index(from_csv(fp, field_resolver={'question': 'text'}))
See Also
from_csv is a function that belongs to the jina.types.document.generators module. Feel free to check it to find more generators.
Important
flow.index
will send the data to the /index
endpoint. However, both of the added Executors do not have an /index
endpoint. In fact, MyTransformer
and MyIndexer
only expose endpoints /foo
and /bar
respectively:
class MyTransformer(Executor):
@requests(on='/foo')
def foo(self, **kwargs):
print(f'foo is doing cool stuff: {kwargs}')
class MyIndexer(Executor):
@requests(on='/bar')
def bar(self, **kwargs):
print(f'bar is doing cool stuff: {kwargs}')
This simply means that no endpoint will be triggered by flow.index
. Besides, our Executors are dummy and still do not have logic to index data. Later, we will modify Executors so that calling flow.index
does indeed store the dataset.
Create Flow
Let’s put the Executors and the Flow together and re-organize our code a little bit. First, we should import everything we need:
import os
import webbrowser
from pathlib import Path
from jina import Flow, Executor, requests
from jina.logging.predefined import default_logger
from jina.types.document.generators import from_csv
Then we should have our main
and a tutorial
function that contains all the code that we’ve done so far. tutorial
accepts one parameter that we’ll need later: port_expose
(the port used to expose our Flow)
def tutorial(port_expose):
class MyTransformer(Executor):
@requests(on='/foo')
def foo(self, **kwargs):
print(f'foo is doing cool stuff: {kwargs}')
class MyIndexer(Executor):
@requests(on='/bar')
def bar(self, **kwargs):
print(f'bar is doing cool stuff: {kwargs}')
flow = (
Flow()
.add(name='MyTransformer', uses=MyTransformer)
.add(name='MyIndexer', uses=MyIndexer)
)
with flow, open('dataset.csv') as fp:
flow.index(from_csv(fp, field_resolver={'question': 'text'}))
if __name__ == '__main__':
tutorial(8080)
If you run this, it should finish without errors. You won’t see much yet because we are not showing anything after we index.
To actually see something we need to specify how we will display it. For our tutorial we will do so in our browser. After indexing, we will open a web browser to serve the static html files. We also need to configure and serve our Flow on a specific port with the HTTP protocol so that the web browser can make requests to the Flow. So, we’ll use the parameter port_expose
to configure the Flow and set the protocol to HTTP. Modify the function tutorial
like so:
def tutorial(port_expose):
class MyTransformer(Executor):
@requests(on='/foo')
def foo(self, **kwargs):
print(f'foo is doing cool stuff: {kwargs}')
class MyIndexer(Executor):
@requests(on='/bar')
def bar(self, **kwargs):
print(f'bar is doing cool stuff: {kwargs}')
flow = (
Flow(cors=True)
.add(name='MyTransformer', uses=MyTransformer)
.add(name='MyIndexer', uses=MyIndexer)
)
with flow, open('dataset.csv') as fp:
flow.index(from_csv(fp, field_resolver={'question': 'text'}))
# switches the serving protocol to HTTP at runtime
flow.protocol = 'http'
flow.port_expose = port_expose
url_html_path = 'file://' + os.path.abspath(
os.path.join(
os.path.dirname(os.path.realpath(__file__)), 'static/index.html'
)
)
try:
webbrowser.open(url_html_path, new=2)
except:
pass # intentional pass, browser support isn't cross-platform
finally:
default_logger.success(
f'You should see a demo page opened in your browser, '
f'if not, you may open {url_html_path} manually'
)
flow.block()
See Also
For more information on what the Flow is doing, and how to serve the Flow with f.block()
and configure the protocol, check the Flow fundamentals section.
Important
Since we want to call our Flow from the browser, it’s important to enable Cross-Origin Resource Sharing with Flow(cors=True)
Ok, so it seems that we have plenty of work done already. If you run this you will see a new tab open in your browser, and there you will have a text box ready for you to input some text. However, if you try to enter anything you won’t get any results. This is because we are using dummy Executors. Our MyTransformer
and MyIndexer
aren’t actually doing anything. So far they only print a line when they are called. So we need real Executors.
Create Executors
We will be creating our Executors in a separate file: my_executors.py
.
Sentence Transformer
First, let’s import the following:
from typing import Dict
from jina import Executor, DocumentArray, requests
from jina.types.arrays.memmap import DocumentArrayMemmap
from sentence_transformers import SentenceTransformer
Now, let’s implement MyTransformer
:
class MyTransformer(Executor):
"""Transformer executor class """
def __init__(
self,
pretrained_model_name_or_path: str = 'paraphrase-mpnet-base-v2',
device: str = 'cpu',
*args,
**kwargs,
):
super().__init__(*args, **kwargs)
self.model = SentenceTransformer(pretrained_model_name_or_path, device=device)
self.model.to(device)
@requests
def encode(self, docs: 'DocumentArray', *args, **kwargs):
import torch
with torch.no_grad():
texts = docs.get_attributes("text")
embeddings = self.model.encode(texts, batch_size=32)
for doc, embedding in zip(docs, embeddings):
doc.embedding = embedding
MyTransformer
exposes only one endpoint: encode
. This will be called whenever we make a request to the Flow, either on query or index. The endpoint will create embeddings for the indexed or query Documents so that they can be used to get the closed matches.
Note
Encoding is a fundamental concept in neural search. It means representing the data in a vectorial form (embeddings).
Encoding is performed through a sentence-transformers model (paraphrase-mpnet-base-v2
by default). We get the text attributes of docs in batch and then compute embeddings. Later, we set the embedding attribute of each Document.
Simple Indexer
Now, let’s implement our indexer (MyIndexer
):
class MyIndexer(Executor):
"""Simple indexer class """
def __init__(self, **kwargs):
super().__init__(**kwargs)
self._docs = DocumentArrayMemmap(self.workspace + '/indexer')
@requests(on='/index')
def index(self, docs: 'DocumentArray', **kwargs):
self._docs.extend(docs)
@requests(on='/search')
def search(self, docs: 'DocumentArray', **kwargs):
"""Append best matches to each document in docs
:param docs: documents that are searched
:param kwargs: other keyword arguments
"""
docs.match(self._docs, metric='cosine', normalization=(1, 0), limit=1)
MyIndexer
exposes 2 endpoints: index
and search
. To perform indexing, we use DocumentArrayMemmap which is a Jina data type. Indexing is a simple as adding the Documents to the DocumentArrayMemmap
.
See Also
Learn more about DocumentArrayMemmap.
To perform the search operation, we use the method match
which will return the top match for the query Documents using the cosine similarity.
See Also
.match
is a method of both DocumentArray
and DocumentArrayMemmap
. Learn more about it in this section.
To import the Executors, just add this to the imports:
from my_executors import MyTransformer, MyIndexer
Put all together
Your app.py
should now look like this:
import os
import webbrowser
from pathlib import Path
from jina import Flow, Executor
from jina.logging.predefined import default_logger
from jina.types.document.generators import from_csv
from my_executors import MyTransformer, MyIndexer
def tutorial(port_expose):
flow = (
Flow(cors=True)
.add(name='MyTransformer', uses=MyTransformer)
.add(name='MyIndexer', uses=MyIndexer)
)
with flow, open('dataset.csv') as fp:
flow.index(from_csv(fp, field_resolver={'question': 'text'}))
# switch to REST gateway at runtime
flow.protocol = 'http'
flow.port_expose = port_expose
url_html_path = 'file://' + os.path.abspath(
os.path.join(
os.path.dirname(os.path.realpath(__file__)), 'static/index.html'
)
)
try:
webbrowser.open(url_html_path, new=2)
except:
pass # intentional pass, browser support isn't cross-platform
finally:
default_logger.success(
f'You should see a demo page opened in your browser, '
f'if not, you may open {url_html_path} manually'
)
flow.block()
if __name__ == '__main__':
tutorial(8080)
And your directory should be:
.
└── tutorial
├── app.py
├── my_executors.py
├── static/
├── our_flow.svg #This will be here if you used the .plot() function
└── dataset.csv
And we are done! If you followed all the steps, now you should have something like this in your browser: