Sparse Vector Datatype
====================
Background
====================
When you use arrays of floating point numbers for various calculations, you will
often have long runs of zeros. This is common in many kinds of applications
such as scientific computing, retail optimization and text processing. Each
floating point number takes 8 bytes of storage in memory and/or disk, so storing
all of those zeros explicitly is often impractical. There are also many
computations that benefit from skipping over the zeros.

Say for example that we have the following array of doubles in Postgres stored
as a "float8[]":

    '{0, 33,...40,000 zeros..., 12, 22 }'::float8[]

This array would occupy slightly more than 320KB of memory/disk, most of it
zeros. Also, if we perform various operations on it, we'll be doing work on
40,001 fields that aren't important. This kind of array arises often in text
processing, where the dictionary may have 40-100K terms and each vector stores
the count of each dictionary word in a particular document.
We could use the built-in "Array" datatype in Postgres/GP, which has a bitmap
for null values, but it is not optimized for float8[] or for long runs of zeros
instead of nulls, and the bitmap is not RLE compressed, so it is a poor choice
for this use-case. If we stored the zeros as NULL in the Postgres/GP array, the
bitmap for nulls would need 5KB of bits to mark the nulls, which is not nearly
as efficient as we'd like. By contrast, a simple RLE scheme biased toward
efficiently encoding runs of zeros needs only about 6 bytes of bitmap storage.

We also want to implement vector operators for our type that take advantage of
an efficient sparse storage format to make computations faster.
====================
What is it?
====================
The "Sparse Vector Datatype" is named "svec" and implements a vector with
compressed storage of zeros. You can take the above vector and cast it into an
svec like this:

    ('{0, 33,...40,000 zeros..., 12, 22 }'::float8[])::svec

This is the same vector we input (I only used 5 zeros).
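
To get a rough feel for the storage difference described in the Background
section, one could compare the size of a dense float8[] against the same value
cast to svec. The query below is only a sketch: it assumes a Postgres/Greenplum
version that provides array_fill() and pg_column_size(), and the exact byte
counts will vary with version and storage settings.

    -- Rough storage comparison (illustrative sketch, not captured output):
    select pg_column_size(array_fill(0::float8, ARRAY[40000]))       as dense_bytes,
           pg_column_size(array_fill(0::float8, ARRAY[40000])::svec) as sparse_bytes;

The dense array should come out around the 320KB figure mentioned above, while
the svec should collapse the long run of zeros to a handful of bytes plus
header overhead.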
We can use operations with the svec type like <, >, *, **, /, =, +, SUM,
COUNT_VEC etc., and they have the meanings associated with typical vector
operations. The plus (+) operator adds each of the terms of two vectors of the
same dimension together. For instance, if we have a = {0,1,5} and b = {4,3,2},
then a+b = {4,4,7}. We can see this using the svec type like this:

    lukelonergan=# select
                   ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
     float8
    ---------
     {4,4,7}
A vector dot product (%*%) between these two will result in a scalar result of
type float8. The dot product should be (0*4 + 1*3 + 5*2) = 13, like this:

    lukelonergan=# select '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
     ?column?
    ----------
           13
A scalar product (*) between these two will result in the individual terms
being multiplied. The scalar product should be {0*4, 1*3, 5*2} = {0,3,10}, like
this:

    lukelonergan=# select
                   ('{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec)::float8[];
      float8
    ----------
     {0,3,10}
Special vector aggregate functions are also useful. SUM is self explanatory.
COUNT_VEC (as opposed to COUNT) evaluates the count of non-zero terms found in
a set of svec and returns an svec with the counts. For instance, if we have
{0,1,5}, {10,0,3}, {0,0,3}, {0,1,0}, then COUNT_VEC() would return {1,2,3}:

    lukelonergan=# create table list (a svec);
    CREATE TABLE
    lukelonergan=# insert into list values
                   ('{0,1,5}'::float8[]),
                   ('{10,0,3}'::float8[]),
                   ('{0,0,3}'::float8[]),
                   ('{0,1,0}'::float8[]);
    INSERT 0 4
    lukelonergan=# select COUNT_VEC(a)::float8[] from list;
     count_vec
    -----------
     {1,2,3}
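
SUM is described above as self-explanatory; assuming it combines the vectors
term by term like the + operator, summing the same table would give {10,2,11}.
The session below is a sketch of the expected result rather than captured
output:

    -- Element-wise SUM over the table above (expected result shown as a comment):
    select SUM(a)::float8[] from list;
    -- expected: {10,2,11}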
====================
Example
====================
For a text classification example, let's assume we have a dictionary composed of
words in a text array:

    create table features (a text[][]) distributed randomly;
    insert into features
    values ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,'
            'one,really,second,the,third,this,until}');

We have a set of documents, also defined as arrays of words:

    lukelonergan=# create table documents(a int,b text[]);
    CREATE TABLE
    lukelonergan=# insert into documents values
                   (1,'{this,is,one,document,in,the,corpus}'),
                   (2,'{i,am,the,second,document,in,the,corpus}'),
                   (3,'{being,third,never,really,bothered,me,until,now}'),
                   (4,'{the,document,before,me,is,the,third,document}');

Now that we have a dictionary and some documents, we would like to do some
document classification using vector arithmetic on word counts and proportions
of dictionary words in each document.
To start this process, we'll need to find the dictionary words in each document.
We'll prepare what is called a Sparse Feature Vector, or SFV, for each document.
An SFV is a vector of dimension N, where N is the number of dictionary words,
and each entry of the SFV is the count of the corresponding dictionary word in
the document.

Inside the sparse vector toolset, we have a function that will create an SFV
from a document, so we can just do this:
    lukelonergan=# select gp_extract_feature_histogram(
                       (select a from features limit 1),b)::float8[] from documents;
             gp_extract_feature_histogram
    ------------------------------------------
     {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
     {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
     {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
     {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}

Note that the output of gp_extract_feature_histogram is an svec for each
document containing the count of each of the dictionary words in the ordinal
positions of the dictionary. This can more easily be understood by doing this:
    lukelonergan=# select gp_extract_feature_histogram(
                       (select a from features limit 1),b)::float8[],b from documents;
             gp_extract_feature_histogram        |                       b
    ------------------------------------------+--------------------------------------------------
     {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
     {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
     {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
     {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}

    lukelonergan=# select * from features;
                                                 a
    ---------------------------------------------------------------------------------------------------------
     {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}
Now when we look at the first document in the output above, "i am the second
document in the corpus", its SFV is {1,3*0,1,1,1,1,6*0,1,2,3*0}. The word "am"
is the first ordinate in the dictionary and there is 1 instance of it in the
SFV. "before" has no instances in the document, so its value is "0", and so on.
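
If it helps to see this word-to-count mapping directly, one way (a sketch that
assumes unnest() is available in your Postgres/Greenplum version) is to unnest
the dictionary and the SFV side by side for a single document:

    -- Pair each dictionary word with its count for document 2 (sketch):
    select unnest((select a from features limit 1)) as word,
           unnest(gp_extract_feature_histogram(
                      (select a from features limit 1), b)::float8[]) as cnt
    from documents
    where a = 2;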
gp_extract_feature_histogram is heavily optimized for speed - in essence it is a
single-routine version of a hash join - and should be able to process large
numbers of documents into their SFVs in parallel at the highest possible speeds.

The rest of the classification process is all vector math. Brian Dolan
describes the process as follows:
The actual count is hardly ever used. Instead, it's turned into a weight.
The most common weight is called tf/idf, for "Term Frequency / Inverse
Document Frequency". The calculation for a given term in a given document is:

    {#Times in this document} * log {#Documents / #Documents the term appears in}

For instance, the term "document" in document A would have weight
1 * log (4/3). In document D it would have weight 2 * log (4/3). Terms that
appear in every document would have tf/idf weight 0, since
log (4/4) = log(1) = 0. (Our example has no term like that.) That usually sends
a lot of values to 0.
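
The log here is the natural logarithm (the tf_idf output further below contains
values such as 0.287682... = ln(4/3) and 1.386294... = ln(4)), so the two
example weights can be checked with ordinary scalar arithmetic; ln() is the
standard Postgres natural-log function:

    -- Checking the example weights by hand (illustrative only):
    select 1 * ln(4/3.0) as weight_in_doc_A,   -- ~0.2877
           2 * ln(4/3.0) as weight_in_doc_D;   -- ~0.5754

These are exactly the 0.287682072451781 and 0.575364144903562 entries that show
up in the tf_idf vectors computed next.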
For this part of the processing, we'll need a sparse vector of the dictionary
dimension (19) whose values are log(#documents / #documents each term appears
in). There will be one such vector for the whole list of documents (aka the
"corpus"). The #documents is just a count of all of the documents, in this case
4, but there is one divisor for each dictionary word, and its value is the
number of documents in which that word appears. This single vector for the
whole corpus can then be scalar product multiplied by each document SFV to
produce the Term Frequency / Inverse Document Frequency weights.

This can be done as follows:
    lukelonergan=# create table corpus as
                   (select a docnum,
                           gp_extract_feature_histogram(
                               (select a from features limit 1),
                               b) a
                      from documents);
    lukelonergan=# SELECT docnum, (a*logidf) tf_idf
                     FROM (SELECT log(count(a)/count_vec(a)) logidf FROM corpus) foo,
                          corpus
                    ORDER BY docnum;
     docnum |                                  tf_idf
    --------+------------------------------------------------------------------------------------------------------------------------------------------------------
          1 | {4,1,1,1,2,3,1,2,1,1,1,1}:{0,0.693147180559945,0.287682072451781,0,0.693147180559945,0,1.38629436111989,0,0.287682072451781,0,1.38629436111989,0}
          2 | {1,3,1,1,1,1,6,1,1,3}:{1.38629436111989,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,1.38629436111989,0.575364144903562,0}
          3 | {2,2,5,1,2,1,1,2,1,1,1}:{0,1.38629436111989,0,0.693147180559945,1.38629436111989,0,1.38629436111989,0,0.693147180559945,0,1.38629436111989}
          4 | {1,1,3,1,2,2,5,1,1,2}:{0,1.38629436111989,0,0.575364144903562,0,0.693147180559945,0,0.575364144903562,0.693147180559945,0}
We can now get the "angular distance" between one document and the rest of the
documents using the ACOS of the dot product of the document vectors:

    lukelonergan=# CREATE TABLE weights AS
                   (SELECT docnum, (a*logidf) tf_idf
                      FROM (SELECT log(count(a)/count_vec(a)) logidf FROM corpus) foo,
                           corpus
                     ORDER BY docnum)
                   DISTRIBUTED RANDOMLY;
Calculate the angular distance between the first document and each other
document:

    lukelonergan=# SELECT docnum,
                          180. * (ACOS(dmin(1.,
                              (tf_idf %*% testdoc) / (svec_l2norm(tf_idf)
                               * svec_l2norm(testdoc)))) / 3.141592654) angular_distance
                     FROM weights,
                          (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
                    ORDER BY 1;
     docnum | angular_distance
    --------+------------------
          1 |                0
          2 | 78.8235846096986
          3 | 89.9999999882484
          4 | 80.0232034288617
We can see that the angular distance between document 1 and itself is 0 degrees,
and between documents 1 and 3 it is 90 degrees because they share no features
at all.
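
One quick way to confirm that last point is to take the dot product of the two
tf_idf vectors directly, which is zero whenever the documents share no
dictionary terms. This is a sketch against the weights table created above,
with the result shown as the expected value rather than captured output:

    -- Dot product of the tf_idf vectors for documents 1 and 3 (sketch):
    select w1.tf_idf %*% w3.tf_idf as overlap
      from weights w1, weights w3
     where w1.docnum = 1 and w3.docnum = 3;
    -- expected: 0, hence ACOS(0) = 90 degrees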