DocumentArrayMemmap
When a DocumentArray object contains a large number of Documents, holding it in memory can be very demanding. DocumentArrayMemmap is a drop-in replacement for DocumentArray in this scenario.
Important
DocumentArrayMemmap shares almost the same API as DocumentArray, except for insert, in-place reverse, and in-place sort.
How does it work?
A DocumentArrayMemmap stores all Documents directly on disk, while keeping a small lookup table and a fixed-size buffer pool of Documents in memory. The lookup table contains the offset and length of each Document, so it is much smaller than the full DocumentArray. Elements are loaded into memory on demand when they are accessed. Documents loaded into memory are kept in the buffer pool so that they can be modified.
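The offset-and-length lookup idea can be illustrated with a minimal sketch that uses only the Python standard library. This is not Jina's implementation: the ToyDiskStore class, its JSON encoding, and the body.bin file name are all hypothetical, chosen only to show how a small in-memory index can point into a larger on-disk file.

```python
import json
import os
import tempfile


class ToyDiskStore:
    """Hypothetical append-only store: bodies on disk, (offset, length) index in memory."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # record id -> (offset, length)
        open(self.path, 'ab').close()  # create the file if it does not exist yet

    def append(self, record_id, record):
        payload = json.dumps(record).encode('utf-8')
        offset = os.path.getsize(self.path)  # new data starts at the current end of file
        with open(self.path, 'ab') as f:
            f.write(payload)
        self.index[record_id] = (offset, len(payload))

    def __getitem__(self, record_id):
        # Only the requested record is read from disk, on demand.
        offset, length = self.index[record_id]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return json.loads(f.read(length).decode('utf-8'))

    def __len__(self):
        return len(self.index)


store_path = os.path.join(tempfile.mkdtemp(), 'body.bin')
store = ToyDiskStore(store_path)
store.append('a', {'text': 'hello'})
store.append('b', {'text': 'world'})
loaded = store['b']
print(loaded)
```

The index holds two integers per record regardless of how large each record is, which is why it stays small compared with the data itself.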
Construct
from jina import DocumentArrayMemmap
dam = DocumentArrayMemmap() # use a local temporary folder as storage
dam2 = DocumentArrayMemmap('./my-memmap') # use './my-memmap' as storage
Delete
To delete all contents of a DocumentArrayMemmap object, simply call .clear(). It removes all content on disk.
You can also check the disk usage of a DocumentArrayMemmap with the .physical_size property.
Convert to/from DocumentArray
from jina import Document, DocumentArray, DocumentArrayMemmap
da = DocumentArray([Document(text='hello'), Document(text='world')])
# convert DocumentArray to DocumentArrayMemmap
dam = DocumentArrayMemmap()
dam.extend(da)
# convert DocumentArrayMemmap to DocumentArray
da = DocumentArray(dam)
Advanced
Warning
DocumentArrayMemmap is generally used for one-way access, either read-only or write-only. Interleaving reads and writes on a DocumentArrayMemmap is not safe and is not recommended in production.
Understand buffer pool
Recently added, modified or accessed Documents are kept in an in-memory buffer pool. This allows all changes to Documents to be applied first in memory and then persisted to disk lazily (i.e. when they leave the buffer pool or when the dam object’s destructor is called). If you want to persist the changed Documents immediately, you can call .flush().
The size of the buffer pool can be configured with the constructor argument buffer_pool_size (1,000 by default). Only the buffer_pool_size most recently accessed, modified or added Documents exist in the pool. Replacement of Documents follows the LRU (least recently used) strategy.
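To make the LRU behaviour concrete, here is a minimal standard-library sketch of a fixed-size pool that writes an entry back when it is evicted. It only illustrates the strategy described above and is not Jina's code: ToyBufferPool, touch, and the persist callback are hypothetical names, and a plain dict stands in for disk storage.

```python
from collections import OrderedDict


class ToyBufferPool:
    """Hypothetical fixed-size LRU pool that writes an entry back when it is evicted."""

    def __init__(self, size, persist):
        self.size = size
        self.persist = persist  # callback invoked as persist(key, value)
        self.pool = OrderedDict()  # ordered from least to most recently used

    def touch(self, key, value):
        """Add or refresh an entry, then evict least recently used entries if over size."""
        if key in self.pool:
            self.pool.move_to_end(key)
        self.pool[key] = value
        while len(self.pool) > self.size:
            old_key, old_value = self.pool.popitem(last=False)
            self.persist(old_key, old_value)

    def flush(self):
        """Write back every entry still in the pool without evicting it."""
        for key, value in self.pool.items():
            self.persist(key, value)


disk = {}  # stands in for on-disk storage
pool = ToyBufferPool(size=2, persist=disk.__setitem__)
pool.touch('a', 'doc-a')
pool.touch('b', 'doc-b')
pool.touch('a', 'doc-a')  # 'a' becomes the most recently used entry
pool.touch('c', 'doc-c')  # the pool is over size, so 'b' is evicted and persisted
evicted_first = list(disk)
pool.flush()  # 'a' and 'c' are written back but stay in the pool
```

In this toy model, entries reach the stand-in disk either on eviction or on an explicit flush, mirroring the lazy persistence described above.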
from jina import DocumentArrayMemmap
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
Warning
The buffer pool ensures that Documents modified in memory are persisted to disk. Therefore, you should not hold your own references to Documents and modify them if they might be outside the buffer pool. The next section explains the best practices when modifying Documents.
Modify elements
Modifying elements of a DocumentArrayMemmap is possible because accessed and modified Documents are kept in the buffer pool:
from jina import DocumentArrayMemmap, Document
d1 = Document(text='hello')
d2 = Document(text='world')
dam = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])
dam[0].text = 'goodbye'
print(dam[0].text)
goodbye
However, you should not modify Documents through references that you hold yourself if those Documents might no longer be in the buffer pool. Here are some practices to avoid:
Keep more references than the buffer pool size and modify them:
❌ Don’t
from jina import Document, DocumentArrayMemmap
docs = [Document(text='hello') for _ in range(100)]
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
dam.extend(docs)
for doc in docs:
    doc.text = 'goodbye'
dam[50].text
hello
✅ Do
Use the dam object to modify instead:
from jina import Document, DocumentArrayMemmap
docs = [Document(text='hello') for _ in range(100)]
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
dam.extend(docs)
for doc in dam:
    doc.text = 'goodbye'
dam[50].text
goodbye
It’s also okay if you hold references to fewer Documents than the buffer pool size:
from jina import Document, DocumentArrayMemmap
docs = [Document(text='hello') for _ in range(100)]
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=1000)
dam.extend(docs)
for doc in docs:
    doc.text = 'goodbye'
dam[50].text
goodbye
Modify a reference that might have left the buffer pool:
❌ Don’t
from jina import Document, DocumentArrayMemmap
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
my_doc = Document(text='hello')
dam.append(my_doc)
# my_doc leaves the buffer pool after extend
dam.extend([Document(text='hello') for _ in range(99)])
my_doc.text = 'goodbye'
dam[0].text
hello
✅ Do
Get the Document from the dam object and then modify it:
from jina import Document, DocumentArrayMemmap
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
my_doc = Document(text='hello')
dam.append(my_doc)
# my_doc leaves the buffer pool after extend
dam.extend([Document(text='hello') for _ in range(99)])
dam[my_doc.id].text = 'goodbye' # or dam[0].text = 'goodbye'
dam[0].text
goodbye
To summarize, the best practice is to reference the Documents that you modify through the dam object.
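Why a stale reference loses its changes can be shown with a toy model built on the standard library. This is a sketch under stated assumptions, not Jina's implementation: ToyStore is hypothetical, a dict of deep copies stands in for serialized data on disk, and plain dicts stand in for Documents.

```python
import copy
from collections import OrderedDict


class ToyStore:
    """Hypothetical model: a dict stands in for disk, an LRU pool holds live objects."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.disk = {}  # key -> snapshot, standing in for serialized bytes
        self.pool = OrderedDict()  # key -> live object, least recently used first

    def _admit(self, key, obj):
        self.pool[key] = obj
        self.pool.move_to_end(key)
        while len(self.pool) > self.pool_size:
            old_key, old_obj = self.pool.popitem(last=False)
            self.disk[old_key] = copy.deepcopy(old_obj)  # write back on eviction

    def add(self, key, obj):
        self.disk[key] = copy.deepcopy(obj)
        self._admit(key, obj)

    def get(self, key):
        if key not in self.pool:
            # Not pooled: build a fresh object from the stored snapshot.
            self._admit(key, copy.deepcopy(self.disk[key]))
        else:
            self.pool.move_to_end(key)
        return self.pool[key]


store = ToyStore(pool_size=2)
mine = {'text': 'hello'}
store.add('first', mine)
store.add('second', {'text': 'hello'})
store.add('third', {'text': 'hello'})  # 'first' is evicted from the pool here

mine['text'] = 'goodbye'  # stale reference: the store never sees this change
stale_result = store.get('first')['text']  # still 'hello'

store.get('first')['text'] = 'goodbye'  # modify through the store instead
fresh_result = store.get('first')['text']  # now 'goodbye'
```

Once an object has been evicted, the store hands out a new object rebuilt from the snapshot, so the old reference and the stored data no longer share state.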
Maintain consistency
Consider two DocumentArrayMemmap objects that share the same on-disk storage ./my-memmap but sit in different processes or threads. After some write operations, the lookup table and the buffer pool may become inconsistent with the storage, because each DocumentArrayMemmap object keeps its own version of the lookup table and buffer pool in memory. .reload() and .flush() solve this issue:
from jina import Document, DocumentArrayMemmap
d1 = Document(text='hello')
d2 = Document(text='world')
dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])
assert len(dam) == 2
assert len(dam2) == 0
dam2.reload()
assert len(dam2) == 2
dam.clear()
assert len(dam) == 0
assert len(dam2) == 2
dam2.reload()
assert len(dam2) == 0
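The reason a second object does not see new data until it reloads can be sketched with two handles over one shared file, each keeping a private in-memory index. The ToyHandle class, the JSON-lines layout, and the shared.jsonl file name below are hypothetical and only illustrate the idea; they do not reflect how Jina stores data.

```python
import json
import os
import tempfile


class ToyHandle:
    """Hypothetical handle: items live in a shared file, the index lives in memory."""

    def __init__(self, path):
        self.path = path
        self.index = []  # in-memory copy, private to this handle
        self.reload()

    def reload(self):
        """Rebuild the in-memory index from whatever is on disk right now."""
        if os.path.exists(self.path):
            with open(self.path, encoding='utf-8') as f:
                self.index = [json.loads(line) for line in f if line.strip()]
        else:
            self.index = []

    def extend(self, items):
        # Append to the shared file and update only this handle's index.
        with open(self.path, 'a', encoding='utf-8') as f:
            for item in items:
                f.write(json.dumps(item) + '\n')
        self.index.extend(items)

    def __len__(self):
        return len(self.index)


shared = os.path.join(tempfile.mkdtemp(), 'shared.jsonl')
writer = ToyHandle(shared)
reader = ToyHandle(shared)

writer.extend([{'text': 'hello'}, {'text': 'world'}])
before_reload = len(reader)  # the reader still holds its old, empty index
reader.reload()
after_reload = len(reader)  # the reader now sees both items
```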
You don’t need to call .flush() if you only add new Documents. However, if you have modified an attribute of a Document, you need to call it:
from jina import Document, DocumentArrayMemmap
d1 = Document(text='hello')
dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')
dam.append(d1)
d1.text = 'goodbye'
assert len(dam) == 1
assert len(dam2) == 0
dam2.reload()
assert len(dam2) == 1
assert dam2[0].text == 'hello'
dam.flush()
dam2.reload()
assert dam2[0].text == 'goodbye'