Fuzzy String Matching in 30 Lines
Different behavior on Jupyter Notebook
Be aware of the following when running this tutorial in Jupyter Notebook: some Python built-in attributes, such as __file__, do not exist there. You can replace __file__ with the path of any other file that exists on your system.
Now that you understand all the fundamental concepts, let’s put them into practice and build a simple end-to-end demo.
We will use Jina to implement a fuzzy search solution on source code: given a snippet of source code and a query, find all lines that are similar to the query. It is like grep, but in fuzzy mode.
Preliminaries
Client-Server architecture
Server
Character embedding
Let’s first build a simple Executor for character embedding:
import numpy as np
from jina import DocumentArray, Executor, requests


class CharEmbed(Executor):  # a simple character embedding with mean-pooling
    offset = 32  # code point of the first printable ASCII character (space)
    dim = 127 - offset + 1  # last pos reserved for `UNK`
    char_embd = np.eye(dim) * 1  # one-hot embedding for all chars

    @requests
    def foo(self, docs: DocumentArray, **kwargs):
        for d in docs:
            # map each char to its row index; chars outside the range map to `UNK`
            r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
            d.embedding = self.char_embd[r_emb, :].mean(axis=0)  # average pooling
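Because every row of char_embd is one-hot, mean-pooling amounts to a normalized character histogram of the text. The following sketch mirrors that computation in plain Python, without Jina or NumPy, to show what the embedding captures. The helper name char_histogram is hypothetical and not part of Jina, and returning zeros for empty input is an assumption of this sketch.

```python
OFFSET = 32               # code point of the first printable ASCII character (space)
DIM = 127 - OFFSET + 1    # 96 slots; the last one is used for `UNK`


def char_histogram(text: str) -> list:
    """Return the mean of the one-hot vectors of all characters in `text`."""
    hist = [0.0] * DIM
    if not text:
        return hist  # assumption of this sketch: all zeros for empty input
    for c in text:
        code = ord(c)
        # same index rule as in CharEmbed: out-of-range chars go to the last slot
        idx = code - OFFSET if OFFSET <= code <= 127 else DIM - 1
        hist[idx] += 1.0 / len(text)
    return hist


emb = char_histogram('aab')
print(emb[ord('a') - OFFSET], emb[ord('b') - OFFSET])
```

Since only character frequencies survive the pooling, two strings with the same characters in a different order get the same embedding. This is one reason the search is fuzzy rather than exact.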
Indexer with Euclidean distance
from jina import DocumentArray, Executor, requests


class Indexer(Executor):
    _docs = DocumentArray()  # for storing all documents in memory

    @requests(on='/index')
    def foo(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)  # extend stored `docs`

    @requests(on='/search')
    def bar(self, docs: DocumentArray, **kwargs):
        docs.match(self._docs, metric='euclidean', limit=20)
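For intuition, the ranking requested by the match call with metric='euclidean' and limit=20 can be approximated by a brute-force nearest-neighbour search. The sketch below is plain Python and not Jina's implementation. The helper top_k_euclidean and the toy vectors are made up for illustration.

```python
import math


def top_k_euclidean(query, stored, limit=20):
    """Return up to `limit` (distance, key) pairs, closest first."""
    scored = [(math.dist(query, vec), key) for key, vec in stored.items()]
    return sorted(scored)[:limit]


# toy 2-dimensional "embeddings", keyed by a made-up label
stored = {
    'line-a': [1.0, 0.0],
    'line-b': [0.0, 1.0],
    'line-c': [0.6, 0.4],
}
for dist, key in top_k_euclidean([0.5, 0.5], stored, limit=2):
    print(f'{dist:.3f} {key}')
```

A smaller distance means a closer match, so the first entry is the most similar stored item.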
Put it together in a Flow
from jina import Flow

f = (Flow(port_expose=12345, protocol='http', cors=True)
     .add(uses=CharEmbed, replicas=2)
     .add(uses=Indexer))  # build a Flow with 2 replicas of CharEmbed, though unnecessary
Start the Flow and index data
from jina import Document

with f:
    f.post('/index', (Document(text=t.strip()) for t in open(__file__) if t.strip()))  # index all lines of _this_ file
    f.block()  # block and keep listening for requests
Caution
open(__file__) opens the current file and uses it for indexing. Note that in some environments, such as Jupyter Notebook and Google Colab, __file__ is not defined. In that case, replace it with the path of an existing file, for example open('my-source-code.py').
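One way to make the indexing step work both as a script and in a notebook is to fall back to an explicit path when __file__ is missing. This is a minimal sketch: the helper names source_path and nonempty_lines are hypothetical, and the fallback file name is just the placeholder from the caution above.

```python
def source_path(fallback='my-source-code.py'):
    """Return this script's path, or `fallback` where `__file__` is undefined."""
    return globals().get('__file__', fallback)


def nonempty_lines(lines):
    """Yield stripped lines and skip blank ones, like the filter in the indexing step."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            yield stripped


print(list(nonempty_lines(['import numpy as np\n', '\n', '    x = 1\n'])))
```

With these helpers, the file handle returned by open(source_path()) can be passed to nonempty_lines, and each yielded string wrapped in a Document before posting to /index.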
Query via SwaggerUI
Open http://localhost:12345/docs (an extended Swagger UI) in your browser, click the /search tab, and input:
{
  "data": [
    {
      "text": "@requests(on=something)"
    }
  ]
}
That means we want to find the lines from the above code snippet that are most similar to @requests(on=something). Now click the Execute button!
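The same query can presumably be sent without a browser. The sketch below only builds the HTTP request with the Python standard library. It assumes that the /search tab in the Swagger UI corresponds to a POST to http://localhost:12345/search carrying the JSON body shown above. build_search_request and send are hypothetical helper names.

```python
import json
import urllib.request


def build_search_request(text, host='localhost', port=12345):
    """Build (but do not send) a POST request carrying one query document."""
    body = json.dumps({'data': [{'text': text}]}).encode('utf-8')
    return urllib.request.Request(
        f'http://{host}:{port}/search',  # assumed endpoint, see the note above
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def send(request):
    """Send the request and decode the JSON reply; needs the Flow to be running."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode('utf-8'))


req = build_search_request('@requests(on=something)')
print(req.get_method(), req.full_url)
```

Calling send(req) only makes sense while the server from the previous step is still running. The exact shape of the reply is not specified here, so inspect it before relying on any field.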
Query from Python
Let’s do it in Python then! Keep the above server running and start a simple client:
from jina import Client, Document
from jina.types.request import Response


def print_matches(resp: Response):  # the callback function invoked when task is done
    for idx, d in enumerate(resp.docs[0].matches[:3]):  # print top-3 matches
        print(f'[{idx}]{d.scores["euclidean"].value:2f}: "{d.text}"')


c = Client(protocol='http', port=12345)  # connect to localhost:12345
c.post('/search', Document(text='request(on=something)'), on_done=print_matches)
This prints the following results:
Client@<pid>[S]:connected to the gateway at localhost:12345!
[0]0.168526: "@requests(on='/index')"
[1]0.181676: "@requests(on='/search')"
[2]0.218218: "from jina import Document, DocumentArray, Executor, Flow, requests"