Sparse Vector Datatype
====================
Background
====================
When you use arrays of floating point numbers for various calculations, you
will often have long runs of zeros. This is common in many kinds of
applications, such as scientific computing, retail optimization and text
processing. Each floating point number takes 8 bytes of storage in memory
and/or on disk, so storing all of those zeros explicitly is often impractical.
There are also many computations that benefit from skipping over the zeros.

Say for example that we have the following array of doubles stored in Postgres
as a "float8[]":

'{0, 33,...40,000 zeros..., 12, 22}'::float8[]

This array would occupy slightly more than 320KB of memory/disk, most of it
zeros. Also, if we perform various operations on it, we'll be doing work on
40,001 fields that aren't important. This kind of array arises often in text
processing, where the dictionary may have 40-100K terms and each vector stores
the count of each dictionary word in a particular document.
We could use the built-in "Array" datatype in Postgres/GP, which has a bitmap
for null values, but it is not optimized for float8[] or for long runs of
zeros (as opposed to nulls), and its bitmap is not RLE compressed, so it is a
poor choice for this use-case. If we stored the zeros as NULL in the
Postgres/GP array, the bitmap for nulls would need about 5KB of bits just to
mark the nulls, which is not nearly as efficient as we'd like. A simple RLE
scheme biased toward storing runs of zeros efficiently needs only 6 bytes of
bitmap storage for this example.

We also want to implement vector operators for our type that take advantage of
an efficient sparse storage format to make computations faster.
====================
What is it?
====================
The "Sparse Vector Datatype" is named "svec" and implements a vector with
compressed storage of zeros. You can take the above vector and cast it into an
svec like this:

('{0, 33,...40,000 zeros..., 12, 22}'::float8[])::svec
This is the same vector we input (I only used 5 zeros).
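
To see the compression directly, one can compare the stored size of the dense
array with the size of its svec cast. The statement below is a hedged sketch
rather than an excerpt from this README; it assumes pg_column_size() and
array_fill() are available in your Postgres/Greenplum version:

-- Hedged sketch (not from this README): compare dense vs. svec storage.
-- Assumes pg_column_size() and array_fill() are available.
select pg_column_size(a) as dense_bytes,
       pg_column_size(a::svec) as svec_bytes
from (select ('{0,33}'::float8[]
              || array_fill(0::float8, array[40000])
              || '{12,22}'::float8[]) as a) t;

The dense form should come out a little over 320KB, while the svec form should
be only a tiny fraction of that.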
We can use operations with the svec type like <, >, *, **, /, =, +, SUM and
COUNT_VEC, and they have meanings associated with typical vector operations.
The plus (+) operator adds each of the terms of two vectors of the same
dimension together. For instance, if we have a = {0,1,5} and b = {4,3,2}, then
a+b = {4,4,7}. We can see this using the svec type like this:

lukelonergan=# select
('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
 float8
---------
 {4,4,7}
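
The other operators listed above follow the same casting pattern. The
statements below are a hedged sketch, not output from this README, and assume
that '/' divides element-wise and '=' tests element-wise equality, analogous
to '+' and '*':

-- Hedged sketch (not from this README): assumed element-wise semantics.
select ('{4,3,2}'::float8[]::svec / '{2,3,2}'::float8[]::svec)::float8[];  -- expect {2,1,1}
select '{0,1,5}'::float8[]::svec = '{0,1,5}'::float8[]::svec;              -- expect t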
A vector dot product (%*%) between these two will result in a scalar result of
type float8. The dot product should be (0*4 + 1*3 + 5*2) = 13, like this:

lukelonergan=# select '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
 ?column?
----------
       13
A scalar product (*) between these two will result in the individual terms
being multiplied. The scalar product should be {0*4, 1*3, 5*2} = {0,3,10},
like this:

lukelonergan=# select
('{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec)::float8[];
  float8
----------
 {0,3,10}
Special vector aggregate functions are also useful. SUM is self explanatory.
COUNT_VEC (as opposed to COUNT) evaluates the count of non-zero terms found in
a set of svec and returns an svec with the counts. For instance, if we have
{0,1,5}, {10,0,3}, {0,0,3} and {0,1,0}, then COUNT_VEC() would return {1,2,3}:

lukelonergan=# create table list (a svec);
CREATE TABLE
lukelonergan=# insert into list values
('{0,1,5}'::float8[]),
('{10,0,3}'::float8[]),
('{0,0,3}'::float8[]),
('{0,1,0}'::float8[]);
INSERT 0 4
lukelonergan=# select COUNT_VEC(a)::float8[] from list;
 count_vec
-----------
 {1,2,3}
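
SUM adds the vectors element-wise. As a hedged sketch (the statement below is
not output from this README), on the same table we would expect:

-- Hedged sketch (not from this README): SUM aggregates element-wise.
select SUM(a)::float8[] from list;   -- expect {10,2,11}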
====================
Example
====================
For a text classification example, let's assume we have a dictionary composed
of words in a text array:

create table features (a text[][]) distributed randomly;
insert into features values
('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}');
We have a set of documents, also defined as arrays of words:

lukelonergan=# create table documents (a int, b text[]);
CREATE TABLE
lukelonergan=# insert into documents values
(1,'{this,is,one,document,in,the,corpus}'),
(2,'{i,am,the,second,document,in,the,corpus}'),
(3,'{being,third,never,really,bothered,me,until,now}'),
(4,'{the,document,before,me,is,the,third,document}');
Now that we have a dictionary and some documents, we would like to do some
document classification using vector arithmetic on word counts and proportions
of dictionary words in each document.

To start this process, we'll need to find the dictionary words in each
document. We'll prepare what is called a Sparse Feature Vector, or SFV, for
each document. An SFV is a vector of dimension N, where N is the number of
dictionary words, and each entry of the SFV is the count of the corresponding
dictionary word in the document.

Inside the sparse vector toolset, we have a function that will create an SFV
from a document, so we can just do this:
lukelonergan=# select gp_extract_feature_histogram(
(select a from features limit 1),b)::float8[] from documents;
gp_extract_feature_histogram
------------------------------------------
{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}
Note that the output of gp_extract_feature_histogram is an svec for each
document containing the count of each of the dictionary words in the ordinal
positions of the dictionary. This can more easily be understood by doing this:
lukelonergan=# select gp_extract_feature_histogram(
(select a from features limit 1),b)::float8[],b from documents;
gp_extract_feature_histogram | b
------------------------------------------+--------------------------------------------------
{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
lukelonergan=# select * from features;
                                                    a
---------------------------------------------------------------------------------------------------------
 {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}
Now when we look at the first document in the output above, "i am the second
document in the corpus", its SFV is {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}.
The word "am" is the first ordinate in the dictionary and there is 1 instance
of it in the SFV. "before" has no instances in the document, so its value is
"0", and so on.

gp_extract_feature_histogram is heavily optimized for speed. It is in essence
a single-routine version of a hash join, and should be able to process large
numbers of documents into their SFVs in parallel at the highest possible
speeds.
The rest of the classification process is all vector math. As Brian Dolan
describes the process:

The actual count is hardly ever used. Instead, it's turned into a weight. The
most common weight is called tf/idf, for "Term Frequency / Inverse Document
Frequency". The calculation for a given term in a given document is:

  {#Times in this document} * log {#Documents / #Documents the term appears in}

For instance, the term "document" in document A would have weight
1 * log(4/3). In document D it would have weight 2 * log(4/3). Terms that
appear in every document would have tf/idf weight 0, since
log(4/4) = log(1) = 0. (Our example has no term like that.) That usually sends
a lot of values to 0.
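
Note that the tf_idf values shown in the output further below correspond to
the natural logarithm. A quick check (not from this README, just plain SQL
arithmetic) reproduces the per-term weights of "document" in documents 1 and 4:

-- Check (not from this README): the weights below match natural logarithms.
select 1 * ln(4.0/3.0);   -- 0.2876820724... ("document" once, in 3 of 4 docs)
select 2 * ln(4.0/3.0);   -- 0.5753641449... ("document" twice, in 3 of 4 docs)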
For this part of the processing, we'll need a sparse vector of the dictionary
dimension (19) whose values are log(#documents / #documents each term appears
in). There will be one such vector for the whole list of documents (aka the
"corpus"). The #documents is just a count of all of the documents, in this
case 4, but there is one divisor for each dictionary word, and its value is
the number of documents that word appears in. This single vector for the whole
corpus can then be scalar-product multiplied by each document SFV to produce
the Term Frequency/Inverse Document Frequency weights.

This can be done as follows:
lukelonergan=# create table corpus as
(select a docnum,
gp_extract_feature_histogram(
(select a from features limit 1),
b) sfv from documents);
lukelonergan=# SELECT docnum, (sfv * logidf) tf_idf
FROM (SELECT log(count(sfv)/count_vec(sfv)) logidf FROM corpus) foo,
corpus
ORDER BY docnum;
 docnum | tf_idf
--------+---------------------------------------------------------------------------------------------------------------------------------------------------
1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.693147180559945,0.287682072451781,0,0.693147180559945,0,1.38629436111989,0,0.287682072451781,0,1.38629436111989,0}
2 | {1,3,1,1,1,1,6,1,1,3}:{1.38629436111989,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,1.38629436111989,0.575364144903562,0}
3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38629436111989,0,0.693147180559945,1.38629436111989,0,1.38629436111989,0,0.693147180559945,0,1.38629436111989}
4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38629436111989,0,0.575364144903562,0,0.693147180559945,0,0.575364144903562,0.693147180559945,0}
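
The tf_idf values above are displayed in svec's run-length text form: the
array before the colon holds run lengths and the array after it holds the
corresponding values, so each row's run lengths sum to the dictionary
dimension of 19. As a hedged sketch (not from this README), the same notation
can be used as svec input:

-- Hedged sketch (not from this README): run-length input notation.
-- This should expand to {1,1,0,0,0}: a run of two 1's, then three 0's.
select '{2,3}:{1,0}'::svec::float8[];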
We can now get the "angular distance" between one document and the rest of the
documents using the ACOS of the dot product of the document vectors:

lukelonergan=# CREATE TABLE weights AS
(SELECT docnum, (sfv * logidf) tf_idf
FROM (SELECT log(count(sfv)/count_vec(sfv)) logidf FROM corpus) foo,
corpus
ORDER BY docnum)
DISTRIBUTED RANDOMLY;
Calculate the angular distance between the first document and each of the
other documents:

lukelonergan=# SELECT docnum,
180. * (ACOS(dmin(1.,
(tf_idf %*% testdoc)/(svec_l2norm(tf_idf)
*svec_l2norm(testdoc))))/3.141592654) angular_distance
FROM weights,
(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
ORDER BY 1;
docnum | angular_distance
--------+------------------
1 | 0
2 | 78.8235846096986
3 | 89.9999999882484
4 | 80.0232034288617
We can see that the angular distance between document 1 and itself is 0
degrees, and between documents 1 and 3 it is 90 degrees, because they share no
features at all.
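
Because documents 1 and 3 share no features, their dot product is 0 and the
formula reduces to ACOS(0) converted to degrees. A quick arithmetic check (not
from this README) shows that the query's truncated value of pi is what keeps
the result just shy of 90:

-- Check (not from this README): ACOS(0) in degrees with the truncated pi
-- reproduces the figure shown for document 3.
select 180. * acos(0.) / 3.141592654;   -- ~89.9999999882484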
