Methods

Embeddings

Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
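The core idea — similar concepts produce nearby vectors, compared by dot product on normalized vectors (cosine similarity, the same comparison `batchsimilarity` uses below) — can be sketched with plain NumPy. The vectors here are made-up toy values standing in for real model output:

```python
import numpy as np

# Toy embeddings: hypothetical 3-dimensional vectors for three phrases.
# Real models produce hundreds of dimensions; these values are illustrative only.
vectors = {
    "feline friends say meow": np.array([0.9, 0.1, 0.0]),
    "kittens purr softly": np.array([0.8, 0.2, 0.1]),
    "stock markets fell": np.array([0.0, 0.1, 0.9]),
}

def normalize(v):
    # Unit-length vectors make dot product equal to cosine similarity
    return v / np.linalg.norm(v)

# Pretend query embedding for "cats" - aligned with the first toy dimension
query = normalize(np.array([1.0, 0.0, 0.0]))

# Rank phrases by cosine similarity to the query, highest first
scores = sorted(
    ((text, float(np.dot(query, normalize(v)))) for text, v in vectors.items()),
    key=lambda x: x[1],
    reverse=True,
)
print(scores[0][0])  # → feline friends say meow
```

The point is that ranking happens in vector space: the top result shares meaning with the query even though none of its keywords match.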

Source code in txtai/embeddings/base.py

  1. class Embeddings:
  2. “””
  3. Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts
  4. will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results
  5. that have the same meaning, not necessarily the same keywords.
  6. “””
  7. # pylint: disable = W0231
  8. def init(self, config=None, models=None, kwargs):
  9. “””
  10. Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.
  11. Args:
  12. config: embeddings configuration
  13. models: models cache, used for model sharing between embeddings
  14. kwargs: additional configuration as keyword args
  15. “””
  16. # Index configuration
  17. self.config = None
  18. # Dimensionality reduction - word vectors only
  19. self.reducer = None
  20. # Dense vector model - transforms data into similarity vectors
  21. self.model = None
  22. # Approximate nearest neighbor index
  23. self.ann = None
  24. # Index ids when content is disabled
  25. self.ids = None
  26. # Document database
  27. self.database = None
  28. # Resolvable functions
  29. self.functions = None
  30. # Graph network
  31. self.graph = None
  32. # Sparse vectors
  33. self.scoring = None
  34. # Query model
  35. self.query = None
  36. # Index archive
  37. self.archive = None
  38. # Subindexes for this embeddings instance
  39. self.indexes = None
  40. # Models cache
  41. self.models = models
  42. # Merge configuration into single dictionary
  43. config = {config, kwargs} if config and kwargs else kwargs if kwargs else config
  44. # Set initial configuration
  45. self.configure(config)
  46. def score(self, documents):
  47. “””
  48. Builds a term weighting scoring index. Only used by word vectors models.
  49. Args:
  50. documents: iterable of (id, data, tags), (id, data) or data
  51. “””
  52. # Build scoring index for word vectors term weighting
  53. if self.isweighted():
  54. self.scoring.index(Stream(self)(documents))
  55. def index(self, documents, reindex=False):
  56. “””
  57. Builds an embeddings index. This method overwrites an existing index.
  58. Args:
  59. documents: iterable of (id, data, tags), (id, data) or data
  60. reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
  61. “””
  62. # Initialize index
  63. self.initindex(reindex)
  64. # Create transform and stream
  65. transform = Transform(self, Action.REINDEX if reindex else Action.INDEX)
  66. stream = Stream(self, Action.REINDEX if reindex else Action.INDEX)
  67. with tempfile.NamedTemporaryFile(mode=”wb”, suffix=”.npy”) as buffer:
  68. # Load documents into database and transform to vectors
  69. ids, dimensions, embeddings = transform(stream(documents), buffer)
  70. if embeddings is not None:
  71. # Build LSA model (if enabled). Remove principal components from embeddings.
  72. if self.config.get(“pca”):
  73. self.reducer = Reducer(embeddings, self.config[“pca”])
  74. self.reducer(embeddings)
  75. # Save index dimensions
  76. self.config[“dimensions”] = dimensions
  77. # Create approximate nearest neighbor index
  78. self.ann = self.createann()
  79. # Add embeddings to the index
  80. self.ann.index(embeddings)
  81. # Save indexids-ids mapping for indexes with no database, except when this is a reindex
  82. if ids and not reindex and not self.database:
  83. self.ids = self.createids(ids)
  84. # Index scoring, if necessary
  85. # This must occur before graph index in order to be available to the graph
  86. if self.issparse():
  87. self.scoring.index()
  88. # Index subindexes, if necessary
  89. if self.indexes:
  90. self.indexes.index()
  91. # Index graph, if necessary
  92. if self.graph:
  93. self.graph.index(Search(self, indexonly=True), Ids(self), self.batchsimilarity)
  94. def upsert(self, documents):
  95. “””
  96. Runs an embeddings upsert operation. If the index exists, new data is
  97. appended to the index, existing data is updated. If the index doesnt exist,
  98. this method runs a standard index operation.
  99. Args:
  100. documents: iterable of (id, data, tags), (id, data) or data
  101. “””
  102. # Run standard insert if index doesn’t exist or it has no records
  103. if not self.count():
  104. self.index(documents)
  105. return
  106. # Create transform and stream
  107. transform = Transform(self, Action.UPSERT)
  108. stream = Stream(self, Action.UPSERT)
  109. with tempfile.NamedTemporaryFile(mode=”wb”, suffix=”.npy”) as buffer:
  110. # Load documents into database and transform to vectors
  111. ids, , embeddings = transform(stream(documents), buffer)
  112. if embeddings is not None:
  113. # Remove principal components from embeddings, if necessary
  114. if self.reducer:
  115. self.reducer(embeddings)
  116. # Append embeddings to the index
  117. self.ann.append(embeddings)
  118. # Save indexids-ids mapping for indexes with no database
  119. if ids and not self.database:
  120. self.ids = self.createids(self.ids + ids)
  121. # Scoring upsert, if necessary
  122. # This must occur before graph upsert in order to be available to the graph
  123. if self.issparse():
  124. self.scoring.upsert()
  125. # Subindexes upsert, if necessary
  126. if self.indexes:
  127. self.indexes.upsert()
  128. # Graph upsert, if necessary
  129. if self.graph:
  130. self.graph.upsert(Search(self, indexonly=True), Ids(self), self.batchsimilarity)
  131. def delete(self, ids):
  132. “””
  133. Deletes from an embeddings index. Returns list of ids deleted.
  134. Args:
  135. ids: list of ids to delete
  136. Returns:
  137. list of ids deleted
  138. “””
  139. # List of internal indices for each candidate id to delete
  140. indices = []
  141. # List of deleted ids
  142. deletes = []
  143. if self.database:
  144. # Retrieve indexid-id mappings from database
  145. ids = self.database.ids(ids)
  146. # Parse out indices and ids to delete
  147. indices = [i for i, in ids]
  148. deletes = sorted(set(uid for _, uid in ids))
  149. # Delete ids from database
  150. self.database.delete(deletes)
  151. elif self.ann or self.scoring:
  152. # Find existing ids
  153. for uid in ids:
  154. indices.extend([index for index, value in enumerate(self.ids) if uid == value])
  155. # Clear embeddings ids
  156. for index in indices:
  157. deletes.append(self.ids[index])
  158. self.ids[index] = None
  159. # Delete indices for all indexes and data stores
  160. if indices:
  161. # Delete ids from ann
  162. if self.isdense():
  163. self.ann.delete(indices)
  164. # Delete ids from scoring
  165. if self.issparse():
  166. self.scoring.delete(indices)
  167. # Delete ids from subindexes
  168. if self.indexes:
  169. self.indexes.delete(indices)
  170. # Delete ids from graph
  171. if self.graph:
  172. self.graph.delete(indices)
  173. return deletes
  174. def reindex(self, config=None, function=None, kwargs):
  175. “””
  176. Recreates embeddings index using config. This method only works if document content storage is enabled.
  177. Args:
  178. config: new config
  179. function: optional function to prepare content for indexing
  180. kwargs: additional configuration as keyword args
  181. “””
  182. if self.database:
  183. # Merge configuration into single dictionary
  184. config = {config, kwargs} if config and kwargs else config if config else kwargs
  185. # Keep content and objects parameters to ensure database is preserved
  186. config[“content”] = self.config[“content”]
  187. if objects in self.config:
  188. config[“objects”] = self.config[“objects”]
  189. # Reset configuration
  190. self.configure(config)
  191. # Reset function references
  192. if self.functions:
  193. self.functions.reset()
  194. # Reindex
  195. if function:
  196. self.index(function(self.database.reindex(self.config)), True)
  197. else:
  198. self.index(self.database.reindex(self.config), True)
  199. def transform(self, document, category=None, index=None):
  200. “””
  201. Transforms document into an embeddings vector.
  202. Args:
  203. documents: iterable of (id, data, tags), (id, data) or data
  204. category: category for instruction-based embeddings
  205. index: index name, if applicable
  206. Returns:
  207. embeddings vector
  208. “””
  209. return self.batchtransform([document], category, index)[0]
  210. def batchtransform(self, documents, category=None, index=None):
  211. “””
  212. Transforms documents into embeddings vectors.
  213. Args:
  214. documents: iterable of (id, data, tags), (id, data) or data
  215. category: category for instruction-based embeddings
  216. index: index name, if applicable
  217. Returns:
  218. embeddings vectors
  219. “””
  220. # Initialize default parameters, if necessary
  221. self.defaults()
  222. # Get vector model
  223. model = self.indexes.model(index) if index and self.indexes else self.model if self.model else self.indexes.model()
  224. # Convert documents into embeddings
  225. embeddings = model.batchtransform(Stream(self)(documents), category)
  226. # Reduce the dimensionality of the embeddings. Scale the embeddings using this
  227. # model to reduce the noise of common but less relevant terms.
  228. if self.reducer:
  229. self.reducer(embeddings)
  230. return embeddings
  231. def count(self):
  232. “””
  233. Total number of elements in this embeddings index.
  234. Returns:
  235. number of elements in this embeddings index
  236. “””
  237. if self.ann:
  238. return self.ann.count()
  239. if self.scoring:
  240. return self.scoring.count()
  241. if self.database:
  242. return self.database.count()
  243. if self.ids:
  244. return len([uid for uid in self.ids if uid is not None])
  245. # Default to 0 when no suitable method found
  246. return 0
  247. def search(self, query, limit=None, weights=None, index=None, parameters=None, graph=False):
  248. “””
  249. Finds documents most similar to the input query. This method will run either an index search
  250. or an index + database search depending on if a database is available.
  251. Args:
  252. query: input query
  253. limit: maximum results
  254. weights: hybrid score weights, if applicable
  255. index: index name, if applicable
  256. parameters: dict of named parameters to bind to placeholders
  257. graph: return graph results if True
  258. Returns:
  259. list of (id, score) for index search
  260. list of dict for an index + database search
  261. graph when graph is set to True
  262. “””
  263. results = self.batchsearch([query], limit, weights, index, [parameters], graph)
  264. return results[0] if results else results
  265. def batchsearch(self, queries, limit=None, weights=None, index=None, parameters=None, graph=False):
  266. “””
  267. Finds documents most similar to the input queries. This method will run either an index search
  268. or an index + database search depending on if a database is available.
  269. Args:
  270. queries: input queries
  271. limit: maximum results
  272. weights: hybrid score weights, if applicable
  273. index: index name, if applicable
  274. parameters: list of dicts of named parameters to bind to placeholders
  275. graph: return graph results if True
  276. Returns:
  277. list of (id, score) per query for index search
  278. list of dict per query for an index + database search
  279. list of graph per query when graph is set to True
  280. “””
  281. # Determine if graphs should be returned
  282. graph = graph if self.graph else False
  283. # Execute search
  284. results = Search(self, indexids=graph)(queries, limit, weights, index, parameters)
  285. # Create subgraphs using results, if necessary
  286. return [self.graph.filter(x) for x in results] if graph else results
  287. def similarity(self, query, data):
  288. “””
  289. Computes the similarity between query and list of data. Returns a list of
  290. (id, score) sorted by highest score, where id is the index in data.
  291. Args:
  292. query: input query
  293. data: list of data
  294. Returns:
  295. list of (id, score)
  296. “””
  297. return self.batchsimilarity([query], data)[0]
  298. def batchsimilarity(self, queries, data):
  299. “””
  300. Computes the similarity between list of queries and list of data. Returns a list
  301. of (id, score) sorted by highest score per query, where id is the index in data.
  302. Args:
  303. queries: input queries
  304. data: list of data
  305. Returns:
  306. list of (id, score) per query
  307. “””
  308. # Convert queries to embedding vectors
  309. queries = self.batchtransform(((None, query, None) for query in queries), query”)
  310. data = self.batchtransform(((None, row, None) for row in data), data”)
  311. # Dot product on normalized vectors is equal to cosine similarity
  312. scores = np.dot(queries, data.T).tolist()
  313. # Add index and sort desc based on score
  314. return [sorted(enumerate(score), key=lambda x: x[1], reverse=True) for score in scores]
  315. def explain(self, query, texts=None, limit=None):
  316. “””
  317. Explains the importance of each input token in text for a query. This method requires either content to be enabled
  318. or texts to be provided.
  319. Args:
  320. query: input query
  321. texts: optional list of (text|list of tokens), otherwise runs search query
  322. limit: optional limit if texts is None
  323. Returns:
  324. list of dict per input text where a higher token scores represents higher importance relative to the query
  325. “””
  326. results = self.batchexplain([query], texts, limit)
  327. return results[0] if results else results
  328. def batchexplain(self, queries, texts=None, limit=None):
  329. “””
  330. Explains the importance of each input token in text for a list of queries. This method requires either content to be enabled
  331. or texts to be provided.
  332. Args:
  333. queries: input queries
  334. texts: optional list of (text|list of tokens), otherwise runs search queries
  335. limit: optional limit if texts is None
  336. Returns:
  337. list of dict per input text per query where a higher token scores represents higher importance relative to the query
  338. “””
  339. return Explain(self)(queries, texts, limit)
  340. def terms(self, query):
  341. “””
  342. Extracts keyword terms from a query.
  343. Args:
  344. query: input query
  345. Returns:
  346. query reduced down to keyword terms
  347. “””
  348. return self.batchterms([query])[0]
  349. def batchterms(self, queries):
  350. “””
  351. Extracts keyword terms from a list of queries.
  352. Args:
  353. queries: list of queries
  354. Returns:
  355. list of queries reduced down to keyword term strings
  356. “””
  357. return Terms(self)(queries)
  358. def exists(self, path=None, cloud=None, kwargs):
  359. “””
  360. Checks if an index exists at path.
  361. Args:
  362. path: input path
  363. cloud: cloud storage configuration
  364. kwargs: additional configuration as keyword args
  365. Returns:
  366. True if index exists, False otherwise
  367. “””
  368. # Check if this exists in a cloud instance
  369. cloud = self.createcloud(cloud=cloud, kwargs)
  370. if cloud:
  371. return cloud.exists(path)
  372. # Check if this is an archive file and exists
  373. path, apath = self.checkarchive(path)
  374. if apath:
  375. return os.path.exists(apath)
  376. # Return true if path has a config.json or config file with an offset set
  377. return path and (os.path.exists(f”{path}/config.json”) or os.path.exists(f”{path}/config”)) and offset in Configuration().load(path)
  378. def load(self, path=None, cloud=None, config=None, kwargs):
  379. “””
  380. Loads an existing index from path.
  381. Args:
  382. path: input path
  383. cloud: cloud storage configuration
  384. config: configuration overrides
  385. kwargs: additional configuration as keyword args
  386. “””
  387. # Load from cloud, if configured
  388. cloud = self.createcloud(cloud=cloud, kwargs)
  389. if cloud:
  390. path = cloud.load(path)
  391. # Check if this is an archive file and extract
  392. path, apath = self.checkarchive(path)
  393. if apath:
  394. self.archive.load(apath)
  395. # Load index configuration
  396. self.config = Configuration().load(path)
  397. # Apply config overrides
  398. self.config = {self.config, config} if config else self.config
  399. # Approximate nearest neighbor index - stores dense vectors
  400. self.ann = self.createann()
  401. if self.ann:
  402. self.ann.load(f”{path}/embeddings”)
  403. # Dimensionality reduction model - word vectors only
  404. if self.config.get(“pca”):
  405. self.reducer = Reducer()
  406. self.reducer.load(f”{path}/lsa”)
  407. # Index ids when content is disabled
  408. self.ids = self.createids()
  409. if self.ids:
  410. self.ids.load(f”{path}/ids”)
  411. # Document database - stores document content
  412. self.database = self.createdatabase()
  413. if self.database:
  414. self.database.load(f”{path}/documents”)
  415. # Sparse vectors - stores term sparse arrays
  416. self.scoring = self.createscoring()
  417. if self.scoring:
  418. self.scoring.load(f”{path}/scoring”)
  419. # Subindexes
  420. self.indexes = self.createindexes()
  421. if self.indexes:
  422. self.indexes.load(f”{path}/indexes”)
  423. # Graph network - stores relationships
  424. self.graph = self.creategraph()
  425. if self.graph:
  426. self.graph.load(f”{path}/graph”)
  427. # Dense vectors - transforms data to embeddings vectors
  428. self.model = self.loadvectors()
  429. # Query model
  430. self.query = self.loadquery()
  431. def save(self, path, cloud=None, kwargs):
  432. “””
  433. Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip.
  434. In those cases, the index is stored as a compressed file.
  435. Args:
  436. path: output path
  437. cloud: cloud storage configuration
  438. kwargs: additional configuration as keyword args
  439. “””
  440. if self.config:
  441. # Check if this is an archive file
  442. path, apath = self.checkarchive(path)
  443. # Create output directory, if necessary
  444. os.makedirs(path, exist_ok=True)
  445. # Copy vectors model
  446. if self.config.get(“storevectors”):
  447. shutil.copyfile(self.config[“path”], os.path.join(path, os.path.basename(self.config[“path”])))
  448. self.config[“path”] = os.path.basename(self.config[“path”])
  449. # Save index configuration
  450. Configuration().save(self.config, path)
  451. # Save approximate nearest neighbor index
  452. if self.ann:
  453. self.ann.save(f”{path}/embeddings”)
  454. # Save dimensionality reduction model (word vectors only)
  455. if self.reducer:
  456. self.reducer.save(f”{path}/lsa”)
  457. # Save index ids
  458. if self.ids:
    self.ids.save(f"{path}/ids")

    # Save document database
    if self.database:
        self.database.save(f"{path}/documents")

    # Save scoring index
    if self.scoring:
        self.scoring.save(f"{path}/scoring")

    # Save subindexes
    if self.indexes:
        self.indexes.save(f"{path}/indexes")

    # Save graph
    if self.graph:
        self.graph.save(f"{path}/graph")

    # If this is an archive, save it
    if apath:
        self.archive.save(apath)

    # Save to cloud, if configured
    cloud = self.createcloud(cloud=cloud, **kwargs)
    if cloud:
        cloud.save(apath if apath else path)

def close(self):
    """
    Closes this embeddings index and frees all resources.
    """

    self.config, self.archive = None, None
    self.reducer, self.query, self.model, self.models = None, None, None, None
    self.ids = None

    # Close ANN
    if self.ann:
        self.ann.close()
        self.ann = None

    # Close database
    if self.database:
        self.database.close()
        self.database, self.functions = None, None

    # Close scoring
    if self.scoring:
        self.scoring.close()
        self.scoring = None

    # Close graph
    if self.graph:
        self.graph.close()
        self.graph = None

    # Close indexes
    if self.indexes:
        self.indexes.close()
        self.indexes = None

def info(self):
    """
    Prints the current embeddings index configuration.
    """

    if self.config:
        # Print configuration
        print(json.dumps(self.config, sort_keys=True, default=str, indent=2))

def issparse(self):
    """
    Checks if this instance has an associated scoring instance with term indexing enabled.

    Returns:
        True if term index is enabled, False otherwise
    """

    return self.scoring and self.scoring.hasterms()

def isdense(self):
    """
    Checks if this instance has an associated ANN instance.

    Returns:
        True if this instance has an associated ANN, False otherwise
    """

    return self.ann is not None

def isweighted(self):
    """
    Checks if this instance has an associated scoring instance with term weighting enabled.

    Returns:
        True if term weighting is enabled, False otherwise
    """

    return self.scoring and not self.scoring.hasterms()

def configure(self, config):
    """
    Sets the configuration for this embeddings index and loads config-driven models.

    Args:
        config: embeddings configuration
    """

    # Configuration
    self.config = config

    # Dimensionality reduction model
    self.reducer = None

    # Create scoring instance for word vectors term weighting
    scoring = self.config.get("scoring") if self.config else None
    self.scoring = self.createscoring() if scoring and (not isinstance(scoring, dict) or not scoring.get("terms")) else None

    # Dense vectors - transforms data to embeddings vectors
    self.model = self.loadvectors() if self.config else None

    # Query model
    self.query = self.loadquery() if self.config else None

def initindex(self, reindex):
    """
    Initialize new index.

    Args:
        reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
    """

    # Initialize default parameters, if necessary
    self.defaults()

    # Initialize index ids, only created when content is disabled
    self.ids = None

    # Create document database, if necessary
    if not reindex:
        self.database = self.createdatabase()

        # Reset archive since this is a new index
        self.archive = None

    # Initialize ANN, will be created after index transformations complete
    self.ann = None

    # Create scoring only if term indexing is enabled
    scoring = self.config.get("scoring")
    if scoring and isinstance(scoring, dict) and self.config["scoring"].get("terms"):
        self.scoring = self.createscoring()

    # Create subindexes, if necessary
    self.indexes = self.createindexes()

    # Create graph, if necessary
    self.graph = self.creategraph()

def defaults(self):
    """
    Apply default parameters to current configuration.

    Returns:
        configuration with default parameters set
    """

    self.config = self.config if self.config else {}

    # Expand sparse index shortcuts
    if not self.config.get("scoring") and any(self.config.get(key) for key in ["keyword", "hybrid"]):
        self.config["scoring"] = {"method": "bm25", "terms": True, "normalize": True}

    # Check if default model should be loaded
    if not self.model and self.defaultallowed():
        self.config["path"] = "sentence-transformers/all-MiniLM-L6-v2"

        # Load dense vectors model
        self.model = self.loadvectors()

def defaultallowed(self):
    """
    Tests if this embeddings instance can use a default model if not otherwise provided.

    Returns:
        True if a default model is allowed, False otherwise
    """

    params = [("keyword", False), ("defaults", True)]
    return all(self.config.get(key, default) == default for key, default in params)

def loadvectors(self):
    """
    Loads a vector model set in config.

    Returns:
        vector model
    """

    # Create model cache if subindexes are enabled
    if "indexes" in self.config and self.models is None:
        self.models = {}

    # Load vector model
    return VectorsFactory.create(self.config, self.scoring, self.models)

def loadquery(self):
    """
    Loads a query model set in config.

    Returns:
        query model
    """

    if "query" in self.config:
        return Query(**self.config["query"])

    return None

def checkarchive(self, path):
    """
    Checks if path is an archive file.

    Args:
        path: path to check

    Returns:
        (working directory, current path) if this is an archive, original path otherwise
    """

    # Create archive instance, if necessary
    self.archive = ArchiveFactory.create()

    # Check if path is an archive file
    if self.archive.isarchive(path):
        # Return temporary archive working directory and original path
        return self.archive.path(), path

    return path, None

def createcloud(self, **cloud):
    """
    Creates a cloud instance from config.

    Args:
        cloud: cloud configuration
    """

    # Merge keyword args and keys under the cloud parameter
    config = cloud
    if "cloud" in config and config["cloud"]:
        config.update(config.pop("cloud"))

    # Create cloud instance from config and return
    return CloudFactory.create(config) if config else None

def createann(self):
    """
    Creates an ANN from config.

    Returns:
        new ANN, if enabled in config
    """

    return ANNFactory.create(self.config) if self.config.get("path") or self.defaultallowed() else None

def createdatabase(self):
    """
    Creates a database from config. This method will also close any existing database connection.

    Returns:
        new database, if enabled in config
    """

    # Free existing database resources
    if self.database:
        self.database.close()

    config = self.config.copy()

    # Create references to callable functions
    self.functions = Functions(self) if "functions" in config else None
    if self.functions:
        config["functions"] = self.functions(config)

    # Create database from config and return
    return DatabaseFactory.create(config)

def creategraph(self):
    """
    Creates a graph from config.

    Returns:
        new graph, if enabled in config
    """

    if "graph" in self.config:
        # Get or create graph configuration
        config = self.config["graph"] if "graph" in self.config else {}

        # Create configuration with custom columns, if necessary
        config = self.columns(config)
        return GraphFactory.create(config)

    return None

def createids(self, ids=None):
    """
    Creates indexids when content is disabled.

    Args:
        ids: optional ids to add

    Returns:
        new indexids, if content disabled
    """

    # Load index ids when content is disabled
    return IndexIds(self, ids) if not self.config.get("content") else None

def createindexes(self):
    """
    Creates subindexes from config.

    Returns:
        list of subindexes
    """

    # Load subindexes
    if "indexes" in self.config:
        indexes = {}
        for index, config in self.config["indexes"].items():
            # Create index with shared model cache
            indexes[index] = Embeddings(config, models=self.models)

        # Wrap as Indexes object
        return Indexes(self, indexes)

    return None

def createscoring(self):
    """
    Creates a scoring from config.

    Returns:
        new scoring, if enabled in config
    """

    # Free existing resources
    if self.scoring:
        self.scoring.close()

    if "scoring" in self.config:
        # Expand scoring to a dictionary, if necessary
        config = self.config["scoring"]
        config = config if isinstance(config, dict) else {"method": config}

        # Create configuration with custom columns, if necessary
        config = self.columns(config)
        return ScoringFactory.create(config)

    return None

def columns(self, config):
    """
    Adds custom text/object column information if it's provided.

    Args:
        config: input configuration

    Returns:
        config with column information added
    """

    # Add text/object columns if custom
    if "columns" in self.config:
        # Work on copy of configuration
        config = config.copy()

        # Copy columns to config
        config["columns"] = self.config["columns"]

    return config
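
The sparse index shortcut expansion in `defaults()` can be tried standalone. This sketch re-implements just that branch as a free function (the function name `expand_scoring` is a label for this example, not part of txtai); the key names and expanded scoring dictionary mirror the code above:

```python
def expand_scoring(config):
    """Mirrors the sparse index shortcut expansion in defaults()."""
    # keyword/hybrid flags expand to an explicit BM25 scoring configuration
    if not config.get("scoring") and any(config.get(key) for key in ["keyword", "hybrid"]):
        config["scoring"] = {"method": "bm25", "terms": True, "normalize": True}
    return config

print(expand_scoring({"keyword": True}))
# {'keyword': True, 'scoring': {'method': 'bm25', 'terms': True, 'normalize': True}}
```

An explicit `scoring` setting is left untouched, so the shortcut never overrides a user-supplied configuration.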

__init__(config=None, models=None, **kwargs)

Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.

Parameters:

    config: embeddings configuration (default: None)
    models: models cache, used for model sharing between embeddings (default: None)
    kwargs: additional configuration as keyword args (default: {})

Source code in txtai/embeddings/base.py

def __init__(self, config=None, models=None, **kwargs):
    """
    Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized.

    Args:
        config: embeddings configuration
        models: models cache, used for model sharing between embeddings
        kwargs: additional configuration as keyword args
    """

    # Index configuration
    self.config = None

    # Dimensionality reduction - word vectors only
    self.reducer = None

    # Dense vector model - transforms data into similarity vectors
    self.model = None

    # Approximate nearest neighbor index
    self.ann = None

    # Index ids when content is disabled
    self.ids = None

    # Document database
    self.database = None

    # Resolvable functions
    self.functions = None

    # Graph network
    self.graph = None

    # Sparse vectors
    self.scoring = None

    # Query model
    self.query = None

    # Index archive
    self.archive = None

    # Subindexes for this embeddings instance
    self.indexes = None

    # Models cache
    self.models = models

    # Merge configuration into single dictionary
    config = {**config, **kwargs} if config and kwargs else kwargs if kwargs else config

    # Set initial configuration
    self.configure(config)
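
The configuration merge on the last lines of `__init__` combines the `config` dict with keyword args, with keyword args taking precedence. A standalone sketch of that expression (the helper name `merge` is for illustration only):

```python
def merge(config=None, **kwargs):
    # Merge configuration into a single dictionary; kwargs override config keys
    return {**config, **kwargs} if config and kwargs else kwargs if kwargs else config

print(merge({"path": "model-a", "content": True}, path="model-b"))
# {'path': 'model-b', 'content': True}
```

This is why `Embeddings(config, path="...")` behaves the same as putting `path` directly in the config dictionary.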

batchexplain(queries, texts=None, limit=None)

Explains the importance of each input token in text for a list of queries. This method requires either content to be enabled or texts to be provided.

Parameters:

    queries: input queries (required)
    texts: optional list of (text|list of tokens), otherwise runs search queries (default: None)
    limit: optional limit if texts is None (default: None)

Returns:

    list of dict per input text per query where higher token scores represent higher importance relative to the query

Source code in txtai/embeddings/base.py

def batchexplain(self, queries, texts=None, limit=None):
    """
    Explains the importance of each input token in text for a list of queries. This method requires either content to be enabled
    or texts to be provided.

    Args:
        queries: input queries
        texts: optional list of (text|list of tokens), otherwise runs search queries
        limit: optional limit if texts is None

    Returns:
        list of dict per input text per query where higher token scores represent higher importance relative to the query
    """

    return Explain(self)(queries, texts, limit)

batchsearch(queries, limit=None, weights=None, index=None, parameters=None, graph=False)

Finds documents most similar to the input queries. This method will run either an index search or an index + database search depending on whether a database is available.

Parameters:

    queries: input queries (required)
    limit: maximum results (default: None)
    weights: hybrid score weights, if applicable (default: None)
    index: index name, if applicable (default: None)
    parameters: list of dicts of named parameters to bind to placeholders (default: None)
    graph: return graph results if True (default: False)

Returns:

    list of (id, score) per query for index search
    list of dict per query for an index + database search
    list of graph per query when graph is set to True

Source code in txtai/embeddings/base.py

def batchsearch(self, queries, limit=None, weights=None, index=None, parameters=None, graph=False):
    """
    Finds documents most similar to the input queries. This method will run either an index search
    or an index + database search depending on whether a database is available.

    Args:
        queries: input queries
        limit: maximum results
        weights: hybrid score weights, if applicable
        index: index name, if applicable
        parameters: list of dicts of named parameters to bind to placeholders
        graph: return graph results if True

    Returns:
        list of (id, score) per query for index search
        list of dict per query for an index + database search
        list of graph per query when graph is set to True
    """

    # Determine if graphs should be returned
    graph = graph if self.graph else False

    # Execute search
    results = Search(self, indexids=graph)(queries, limit, weights, index, parameters)

    # Create subgraphs using results, if necessary
    return [self.graph.filter(x) for x in results] if graph else results

batchsimilarity(queries, data)

Computes the similarity between a list of queries and a list of data. Returns a list of (id, score) sorted by highest score per query, where id is the index in data.

Parameters:

    queries: input queries (required)
    data: list of data (required)

Returns:

    list of (id, score) per query

Source code in txtai/embeddings/base.py

def batchsimilarity(self, queries, data):
    """
    Computes the similarity between a list of queries and a list of data. Returns a list
    of (id, score) sorted by highest score per query, where id is the index in data.

    Args:
        queries: input queries
        data: list of data

    Returns:
        list of (id, score) per query
    """

    # Convert queries to embedding vectors
    queries = self.batchtransform(((None, query, None) for query in queries), "query")
    data = self.batchtransform(((None, row, None) for row in data), "data")

    # Dot product on normalized vectors is equal to cosine similarity
    scores = np.dot(queries, data.T).tolist()

    # Add index and sort desc based on score
    return [sorted(enumerate(score), key=lambda x: x[1], reverse=True) for score in scores]
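
The comment in the code above notes that a dot product on normalized vectors equals cosine similarity. A small pure-Python check of that identity and of the per-query sorting, with no txtai or NumPy dependency (the vectors are arbitrary example values):

```python
import math

def normalize(vector):
    # Scale a vector to unit length
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

queries = [normalize([1.0, 2.0, 3.0])]
data = [normalize([1.0, 2.0, 3.0]), normalize([-1.0, 0.0, 1.0])]

# Score each data row against each query, then sort descending like batchsimilarity
scores = [sorted(enumerate(dot(q, d) for d in data), key=lambda x: x[1], reverse=True) for q in queries]
print(scores[0])  # best match is index 0, with score ~1.0 (identical direction)
```

Because all vectors are unit length, the dot product is exactly the cosine of the angle between them, which is why no separate normalization happens at query time.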

batchterms(queries)

Extracts keyword terms from a list of queries.

Parameters:

    queries: list of queries (required)

Returns:

    list of queries reduced down to keyword term strings

Source code in txtai/embeddings/base.py

def batchterms(self, queries):
    """
    Extracts keyword terms from a list of queries.

    Args:
        queries: list of queries

    Returns:
        list of queries reduced down to keyword term strings
    """

    return Terms(self)(queries)

batchtransform(documents, category=None, index=None)

Transforms documents into embeddings vectors.

Parameters:

    documents: iterable of (id, data, tags), (id, data) or data (required)
    category: category for instruction-based embeddings (default: None)
    index: index name, if applicable (default: None)

Returns:

    embeddings vectors

Source code in txtai/embeddings/base.py

def batchtransform(self, documents, category=None, index=None):
    """
    Transforms documents into embeddings vectors.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
        category: category for instruction-based embeddings
        index: index name, if applicable

    Returns:
        embeddings vectors
    """

    # Initialize default parameters, if necessary
    self.defaults()

    # Get vector model
    model = self.indexes.model(index) if index and self.indexes else self.model if self.model else self.indexes.model()

    # Convert documents into embeddings
    embeddings = model.batchtransform(Stream(self)(documents), category)

    # Reduce the dimensionality of the embeddings. Scale the embeddings using this
    # model to reduce the noise of common but less relevant terms.
    if self.reducer:
        self.reducer(embeddings)

    return embeddings

close()

Closes this embeddings index and frees all resources.

Source code in txtai/embeddings/base.py

def close(self):
    """
    Closes this embeddings index and frees all resources.
    """

    self.config, self.archive = None, None
    self.reducer, self.query, self.model, self.models = None, None, None, None
    self.ids = None

    # Close ANN
    if self.ann:
        self.ann.close()
        self.ann = None

    # Close database
    if self.database:
        self.database.close()
        self.database, self.functions = None, None

    # Close scoring
    if self.scoring:
        self.scoring.close()
        self.scoring = None

    # Close graph
    if self.graph:
        self.graph.close()
        self.graph = None

    # Close indexes
    if self.indexes:
        self.indexes.close()
        self.indexes = None

count()

Total number of elements in this embeddings index.

Returns:

    number of elements in this embeddings index

Source code in txtai/embeddings/base.py

def count(self):
    """
    Total number of elements in this embeddings index.

    Returns:
        number of elements in this embeddings index
    """

    if self.ann:
        return self.ann.count()
    if self.scoring:
        return self.scoring.count()
    if self.database:
        return self.database.count()
    if self.ids:
        return len([uid for uid in self.ids if uid is not None])

    # Default to 0 when no suitable method found
    return 0

delete(ids)

Deletes from an embeddings index. Returns list of ids deleted.

Parameters:

    ids: list of ids to delete (required)

Returns:

    list of ids deleted

Source code in txtai/embeddings/base.py

def delete(self, ids):
    """
    Deletes from an embeddings index. Returns list of ids deleted.

    Args:
        ids: list of ids to delete

    Returns:
        list of ids deleted
    """

    # List of internal indices for each candidate id to delete
    indices = []

    # List of deleted ids
    deletes = []

    if self.database:
        # Retrieve indexid-id mappings from database
        ids = self.database.ids(ids)

        # Parse out indices and ids to delete
        indices = [i for i, _ in ids]
        deletes = sorted(set(uid for _, uid in ids))

        # Delete ids from database
        self.database.delete(deletes)
    elif self.ann or self.scoring:
        # Find existing ids
        for uid in ids:
            indices.extend([index for index, value in enumerate(self.ids) if uid == value])

        # Clear embeddings ids
        for index in indices:
            deletes.append(self.ids[index])
            self.ids[index] = None

    # Delete indices for all indexes and data stores
    if indices:
        # Delete ids from ann
        if self.isdense():
            self.ann.delete(indices)

        # Delete ids from scoring
        if self.issparse():
            self.scoring.delete(indices)

        # Delete ids from subindexes
        if self.indexes:
            self.indexes.delete(indices)

        # Delete ids from graph
        if self.graph:
            self.graph.delete(indices)

    return deletes
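
When no database is present, the method above tombstones entries by setting their slot in the ids list to None rather than shrinking the list, so internal indices of the remaining entries stay stable. A sketch of that bookkeeping with plain lists standing in for the IndexIds object (the `targets` parameter is this example's stand-in for the ids argument):

```python
def delete(ids, targets):
    # Collect internal indices matching the target ids (duplicates included)
    indices = [index for index, value in enumerate(ids) if value in targets]

    # Record deletes and clear slots, keeping remaining indices stable
    deletes = []
    for index in indices:
        deletes.append(ids[index])
        ids[index] = None

    return deletes

ids = ["doc1", "doc2", "doc3", "doc2"]
print(delete(ids, {"doc2"}))  # ['doc2', 'doc2']
print(ids)                    # ['doc1', None, 'doc3', None]
```

The None sentinels are also why `count()` filters the ids list rather than taking its raw length.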

exists(path=None, cloud=None, **kwargs)

Checks if an index exists at path.

Parameters:

    path: input path (default: None)
    cloud: cloud storage configuration (default: None)
    kwargs: additional configuration as keyword args (default: {})

Returns:

    True if index exists, False otherwise

Source code in txtai/embeddings/base.py

def exists(self, path=None, cloud=None, **kwargs):
    """
    Checks if an index exists at path.

    Args:
        path: input path
        cloud: cloud storage configuration
        kwargs: additional configuration as keyword args

    Returns:
        True if index exists, False otherwise
    """

    # Check if this exists in a cloud instance
    cloud = self.createcloud(cloud=cloud, **kwargs)
    if cloud:
        return cloud.exists(path)

    # Check if this is an archive file and exists
    path, apath = self.checkarchive(path)
    if apath:
        return os.path.exists(apath)

    # Return true if path has a config.json or config file with an offset set
    return path and (os.path.exists(f"{path}/config.json") or os.path.exists(f"{path}/config")) and "offset" in Configuration().load(path)
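
The final check above treats a directory as an index only if it contains a `config.json` or `config` file. The filesystem part of that check can be tried with just the standard library (the helper name `hasconfig` and the temporary directory are for illustration; the real method additionally verifies an offset in the loaded configuration):

```python
import os
import tempfile

def hasconfig(path):
    # Mirrors the file layout check in exists(): config.json or legacy config file
    return os.path.exists(f"{path}/config.json") or os.path.exists(f"{path}/config")

with tempfile.TemporaryDirectory() as path:
    print(hasconfig(path))  # False

    with open(f"{path}/config.json", "w", encoding="utf-8") as f:
        f.write("{}")

    print(hasconfig(path))  # True
```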

explain(query, texts=None, limit=None)

Explains the importance of each input token in text for a query. This method requires either content to be enabled or texts to be provided.

Parameters:

    query: input query (required)
    texts: optional list of (text|list of tokens), otherwise runs search query (default: None)
    limit: optional limit if texts is None (default: None)

Returns:

    list of dict per input text where higher token scores represent higher importance relative to the query

Source code in txtai/embeddings/base.py

def explain(self, query, texts=None, limit=None):
    """
    Explains the importance of each input token in text for a query. This method requires either content to be enabled
    or texts to be provided.

    Args:
        query: input query
        texts: optional list of (text|list of tokens), otherwise runs search query
        limit: optional limit if texts is None

    Returns:
        list of dict per input text where higher token scores represent higher importance relative to the query
    """

    results = self.batchexplain([query], texts, limit)
    return results[0] if results else results

index(documents, reindex=False)

Builds an embeddings index. This method overwrites an existing index.

Parameters:

    documents: iterable of (id, data, tags), (id, data) or data (required)
    reindex: if this is a reindex operation in which case database creation is skipped (default: False)

Source code in txtai/embeddings/base.py

def index(self, documents, reindex=False):
    """
    Builds an embeddings index. This method overwrites an existing index.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
        reindex: if this is a reindex operation in which case database creation is skipped, defaults to False
    """

    # Initialize index
    self.initindex(reindex)

    # Create transform and stream
    transform = Transform(self, Action.REINDEX if reindex else Action.INDEX)
    stream = Stream(self, Action.REINDEX if reindex else Action.INDEX)

    with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
        # Load documents into database and transform to vectors
        ids, dimensions, embeddings = transform(stream(documents), buffer)

        if embeddings is not None:
            # Build LSA model (if enabled). Remove principal components from embeddings.
            if self.config.get("pca"):
                self.reducer = Reducer(embeddings, self.config["pca"])
                self.reducer(embeddings)

            # Save index dimensions
            self.config["dimensions"] = dimensions

            # Create approximate nearest neighbor index
            self.ann = self.createann()

            # Add embeddings to the index
            self.ann.index(embeddings)

        # Save indexids-ids mapping for indexes with no database, except when this is a reindex
        if ids and not reindex and not self.database:
            self.ids = self.createids(ids)

    # Index scoring, if necessary
    # This must occur before graph index in order to be available to the graph
    if self.issparse():
        self.scoring.index()

    # Index subindexes, if necessary
    if self.indexes:
        self.indexes.index()

    # Index graph, if necessary
    if self.graph:
        self.graph.index(Search(self, indexonly=True), Ids(self), self.batchsimilarity)

info()

Prints the current embeddings index configuration.

Source code in txtai/embeddings/base.py

def info(self):
    """
    Prints the current embeddings index configuration.
    """

    if self.config:
        # Print configuration
        print(json.dumps(self.config, sort_keys=True, default=str, indent=2))

isdense()

Checks if this instance has an associated ANN instance.

Returns:

    True if this instance has an associated ANN, False otherwise

Source code in txtai/embeddings/base.py

def isdense(self):
    """
    Checks if this instance has an associated ANN instance.

    Returns:
        True if this instance has an associated ANN, False otherwise
    """

    return self.ann is not None

issparse()

Checks if this instance has an associated scoring instance with term indexing enabled.

Returns:

    True if term index is enabled, False otherwise

Source code in txtai/embeddings/base.py

def issparse(self):
    """
    Checks if this instance has an associated scoring instance with term indexing enabled.

    Returns:
        True if term index is enabled, False otherwise
    """

    return self.scoring and self.scoring.hasterms()

isweighted()

Checks if this instance has an associated scoring instance with term weighting enabled.

Returns:

    True if term weighting is enabled, False otherwise

Source code in txtai/embeddings/base.py

def isweighted(self):
    """
    Checks if this instance has an associated scoring instance with term weighting enabled.

    Returns:
        True if term weighting is enabled, False otherwise
    """

    return self.scoring and not self.scoring.hasterms()

load(path=None, cloud=None, config=None, **kwargs)

Loads an existing index from path.

Parameters:

    path: input path (default: None)
    cloud: cloud storage configuration (default: None)
    config: configuration overrides (default: None)
    kwargs: additional configuration as keyword args (default: {})

Source code in txtai/embeddings/base.py

def load(self, path=None, cloud=None, config=None, **kwargs):
    """
    Loads an existing index from path.

    Args:
        path: input path
        cloud: cloud storage configuration
        config: configuration overrides
        kwargs: additional configuration as keyword args
    """

    # Load from cloud, if configured
    cloud = self.createcloud(cloud=cloud, **kwargs)
    if cloud:
        path = cloud.load(path)

    # Check if this is an archive file and extract
    path, apath = self.checkarchive(path)
    if apath:
        self.archive.load(apath)

    # Load index configuration
    self.config = Configuration().load(path)

    # Apply config overrides
    self.config = {**self.config, **config} if config else self.config

    # Approximate nearest neighbor index - stores dense vectors
    self.ann = self.createann()
    if self.ann:
        self.ann.load(f"{path}/embeddings")

    # Dimensionality reduction model - word vectors only
    if self.config.get("pca"):
        self.reducer = Reducer()
        self.reducer.load(f"{path}/lsa")

    # Index ids when content is disabled
    self.ids = self.createids()
    if self.ids:
        self.ids.load(f"{path}/ids")

    # Document database - stores document content
    self.database = self.createdatabase()
    if self.database:
        self.database.load(f"{path}/documents")

    # Sparse vectors - stores term sparse arrays
    self.scoring = self.createscoring()
    if self.scoring:
        self.scoring.load(f"{path}/scoring")

    # Subindexes
    self.indexes = self.createindexes()
    if self.indexes:
        self.indexes.load(f"{path}/indexes")

    # Graph network - stores relationships
    self.graph = self.creategraph()
    if self.graph:
        self.graph.load(f"{path}/graph")

    # Dense vectors - transforms data to embeddings vectors
    self.model = self.loadvectors()

    # Query model
    self.query = self.loadquery()

reindex(config=None, function=None, **kwargs)

Recreates embeddings index using config. This method only works if document content storage is enabled.

Parameters:

- config: new config (default: None)
- function: optional function to prepare content for indexing (default: None)
- kwargs: additional configuration as keyword args (default: {})

Source code in txtai/embeddings/base.py

Lines 255-285
def reindex(self, config=None, function=None, **kwargs):
    """
    Recreates embeddings index using config. This method only works if document content storage is enabled.

    Args:
        config: new config
        function: optional function to prepare content for indexing
        kwargs: additional configuration as keyword args
    """

    if self.database:
        # Merge configuration into single dictionary
        config = {**config, **kwargs} if config and kwargs else config if config else kwargs

        # Keep content and objects parameters to ensure database is preserved
        config["content"] = self.config["content"]
        if "objects" in self.config:
            config["objects"] = self.config["objects"]

        # Reset configuration
        self.configure(config)

        # Reset function references
        if self.functions:
            self.functions.reset()

        # Reindex
        if function:
            self.index(function(self.database.reindex(self.config)), True)
        else:
            self.index(self.database.reindex(self.config), True)
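The merge expression at the top of `reindex` combines the `config` dict and keyword args into one dictionary, with keyword args winning on conflicts. A standalone sketch of that expression (`merge` is a hypothetical helper name, not part of the txtai API):

```python
def merge(config=None, **kwargs):
    # Mirrors the merge expression in reindex: combine both sources when both
    # are present, otherwise fall back to whichever one was given.
    return {**config, **kwargs} if config and kwargs else config if config else kwargs

print(merge({"backend": "faiss"}, quantize=True))  # both sources combined
print(merge(backend="hnsw"))                       # kwargs only
print(merge({"backend": "annoy"}))                 # config only
```

Note that an empty `kwargs` dict is falsy, so the expression correctly falls through to the `config`-only branch when no keyword args are passed.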

save(path, cloud=None, **kwargs)

Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip. In those cases, the index is stored as a compressed file.

Parameters:

- path: output path (required)
- cloud: cloud storage configuration (default: None)
- kwargs: additional configuration as keyword args (default: {})

Source code in txtai/embeddings/base.py

Lines 592-654
def save(self, path, cloud=None, **kwargs):
    """
    Saves an index in a directory at path unless path ends with tar.gz, tar.bz2, tar.xz or zip.
    In those cases, the index is stored as a compressed file.

    Args:
        path: output path
        cloud: cloud storage configuration
        kwargs: additional configuration as keyword args
    """

    if self.config:
        # Check if this is an archive file
        path, apath = self.checkarchive(path)

        # Create output directory, if necessary
        os.makedirs(path, exist_ok=True)

        # Copy vectors model
        if self.config.get("storevectors"):
            shutil.copyfile(self.config["path"], os.path.join(path, os.path.basename(self.config["path"])))
            self.config["path"] = os.path.basename(self.config["path"])

        # Save index configuration
        Configuration().save(self.config, path)

        # Save approximate nearest neighbor index
        if self.ann:
            self.ann.save(f"{path}/embeddings")

        # Save dimensionality reduction model (word vectors only)
        if self.reducer:
            self.reducer.save(f"{path}/lsa")

        # Save index ids
        if self.ids:
            self.ids.save(f"{path}/ids")

        # Save document database
        if self.database:
            self.database.save(f"{path}/documents")

        # Save scoring index
        if self.scoring:
            self.scoring.save(f"{path}/scoring")

        # Save subindexes
        if self.indexes:
            self.indexes.save(f"{path}/indexes")

        # Save graph
        if self.graph:
            self.graph.save(f"{path}/graph")

        # If this is an archive, save it
        if apath:
            self.archive.save(apath)

        # Save to cloud, if configured
        cloud = self.createcloud(cloud=cloud, **kwargs)
        if cloud:
            cloud.save(apath if apath else path)
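The directory-versus-archive decision described above is driven purely by the path suffix. A minimal sketch of that check (`isarchive` is a hypothetical helper written for illustration, not part of the txtai API):

```python
def isarchive(path):
    # Paths ending in a supported archive extension are saved as a single
    # compressed file; anything else is treated as an output directory.
    return path.endswith(("tar.gz", "tar.bz2", "tar.xz", "zip"))

print(isarchive("backups/index.tar.gz"))  # True  -> compressed file
print(isarchive("backups/index.zip"))     # True  -> compressed file
print(isarchive("backups/index"))         # False -> directory
```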

score(documents)

Builds a term weighting scoring index. Only used by word vectors models.

Parameters:

NameTypeDescriptionDefault
documents

iterable of (id, data, tags), (id, data) or data

required

Source code in txtai/embeddings/base.py

Lines 88-98
def score(self, documents):
    """
    Builds a term weighting scoring index. Only used by word vectors models.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
    """

    # Build scoring index for word vectors term weighting
    if self.isweighted():
        self.scoring.index(Stream(self)(documents))

search(query, limit=None, weights=None, index=None, parameters=None, graph=False)

Finds documents most similar to the input query. This method will run either an index search or an index + database search depending on if a database is available.

Parameters:

NameTypeDescriptionDefault
query

input query

required
limit

maximum results

None
weights

hybrid score weights, if applicable

None
index

index name, if applicable

None
parameters

dict of named parameters to bind to placeholders

None
graph

return graph results if True

False

Returns:

TypeDescription

list of (id, score) for index search

list of dict for an index + database search

graph when graph is set to True

Source code in txtai/embeddings/base.py

Lines 351-371
def search(self, query, limit=None, weights=None, index=None, parameters=None, graph=False):
    """
    Finds documents most similar to the input query. This method will run either an index search
    or an index + database search depending on if a database is available.

    Args:
        query: input query
        limit: maximum results
        weights: hybrid score weights, if applicable
        index: index name, if applicable
        parameters: dict of named parameters to bind to placeholders
        graph: return graph results if True

    Returns:
        list of (id, score) for index search
        list of dict for an index + database search
        graph when graph is set to True
    """

    results = self.batchsearch([query], limit, weights, index, [parameters], graph)
    return results[0] if results else results
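`search` is a thin wrapper over `batchsearch`: the query is wrapped in a single-element list, the batch method runs, and the first (only) result list is unwrapped. A standalone sketch of that delegation pattern, with a stand-in batch function returning made-up scores:

```python
def batchsearch(queries):
    # Stand-in for the real batch search: one result list per input query.
    # Ids and scores here are illustrative, not real search output.
    return [[(0, 0.99), (3, 0.42)] for _ in queries]

def search(query):
    # Wrap the single query, delegate to the batch method, unwrap the result
    results = batchsearch([query])
    return results[0] if results else results

print(search("semantic query"))
```

Keeping a single batch implementation and deriving the single-item method from it avoids duplicating the search logic in two places.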

similarity(query, data)

Computes the similarity between query and list of data. Returns a list of (id, score) sorted by highest score, where id is the index in data.

Parameters:

NameTypeDescriptionDefault
query

input query

required
data

list of data

required

Returns:

TypeDescription

list of (id, score)

Source code in txtai/embeddings/base.py

Lines 401-414
def similarity(self, query, data):
    """
    Computes the similarity between query and list of data. Returns a list of
    (id, score) sorted by highest score, where id is the index in data.

    Args:
        query: input query
        data: list of data

    Returns:
        list of (id, score)
    """

    return self.batchsimilarity([query], data)[0]
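The returned ids are positions in the input `data` list, sorted by score descending. A sketch of that output contract using made-up scores (no model is involved here):

```python
data = ["feel good story", "climate change", "sports results"]
scores = [0.31, 0.88, 0.12]  # hypothetical similarity scores for one query

# Pair each position with its score, then sort highest score first -
# the same (id, score) shape similarity returns
results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
print(results)                  # [(1, 0.88), (0, 0.31), (2, 0.12)]
print(data[results[0][0]])      # best match: 'climate change'
```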

terms(query)

Extracts keyword terms from a query.

Parameters:

NameTypeDescriptionDefault
query

input query

required

Returns:

TypeDescription

query reduced down to keyword terms

Source code in txtai/embeddings/base.py

Lines 472-483
def terms(self, query):
    """
    Extracts keyword terms from a query.

    Args:
        query: input query

    Returns:
        query reduced down to keyword terms
    """

    return self.batchterms([query])[0]

transform(document, category=None, index=None)

Transforms document into an embeddings vector.

Parameters:

NameTypeDescriptionDefault
documents

iterable of (id, data, tags), (id, data) or data

required
category

category for instruction-based embeddings

None
index

index name, if applicable

None

Returns:

TypeDescription

embeddings vector

Source code in txtai/embeddings/base.py

Lines 287-300
def transform(self, document, category=None, index=None):
    """
    Transforms document into an embeddings vector.

    Args:
        document: iterable of (id, data, tags), (id, data) or data
        category: category for instruction-based embeddings
        index: index name, if applicable

    Returns:
        embeddings vector
    """

    return self.batchtransform([document], category, index)[0]

upsert(documents)

Runs an embeddings upsert operation. If the index exists, new data is appended to the index, existing data is updated. If the index doesn’t exist, this method runs a standard index operation.

Parameters:

NameTypeDescriptionDefault
documents

iterable of (id, data, tags), (id, data) or data

required

Source code in txtai/embeddings/base.py

Lines 151-196
def upsert(self, documents):
    """
    Runs an embeddings upsert operation. If the index exists, new data is
    appended to the index, existing data is updated. If the index doesn't exist,
    this method runs a standard index operation.

    Args:
        documents: iterable of (id, data, tags), (id, data) or data
    """

    # Run standard insert if index doesn't exist or it has no records
    if not self.count():
        self.index(documents)
        return

    # Create transform and stream
    transform = Transform(self, Action.UPSERT)
    stream = Stream(self, Action.UPSERT)

    with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy") as buffer:
        # Load documents into database and transform to vectors
        ids, _, embeddings = transform(stream(documents), buffer)

        if embeddings is not None:
            # Remove principal components from embeddings, if necessary
            if self.reducer:
                self.reducer(embeddings)

            # Append embeddings to the index
            self.ann.append(embeddings)

        # Save indexids-ids mapping for indexes with no database
        if ids and not self.database:
            self.ids = self.createids(self.ids + ids)

    # Scoring upsert, if necessary
    # This must occur before graph upsert in order to be available to the graph
    if self.issparse():
        self.scoring.upsert()

    # Subindexes upsert, if necessary
    if self.indexes:
        self.indexes.upsert()

    # Graph upsert, if necessary
    if self.graph:
        self.graph.upsert(Search(self, indexonly=True), Ids(self), self.batchsimilarity)
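The control flow above has two branches: an empty index triggers a full `index` build, otherwise new ids are appended and existing ids updated. A toy sketch of that fallback behavior using a plain dict as the store (`ToyIndex` is illustrative only, not the txtai implementation):

```python
class ToyIndex:
    def __init__(self):
        self.rows = {}

    def index(self, documents):
        # Standard index operation: rebuild the store from scratch
        self.rows = dict(documents)

    def upsert(self, documents):
        # Mirrors "if not self.count()": fall back to a full build when empty
        if not len(self.rows):
            self.index(documents)
            return

        # Otherwise append new ids and update existing ones in place
        self.rows.update(documents)

store = ToyIndex()
store.upsert([(0, "first")])                    # empty -> standard index build
store.upsert([(0, "updated"), (1, "new")])      # update id 0, append id 1
print(store.rows)
```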