Description
Latent Dirichlet Allocation (LDA) is a topic model designed for text documents. Given a corpus of articles, the algorithm learns, for each word, the probability that it belongs to each topic, and the trained model can then predict the topics of new articles. It also reports the perplexity of the fitted model, a measure of how well the model fits the data.
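For reference, perplexity follows its standard definition in the LDA literature: the exponentiated negative per-token log-likelihood over the $M$ documents of the corpus (lower values indicate a better fit),

$$\mathrm{perplexity} = \exp\!\left(-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right),$$

where $\mathbf{w}_d$ are the tokens of document $d$ and $N_d$ is its token count.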
Parameters
| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| topicNum | Number of topics. | Integer | ✓ | |
| alpha | Concentration parameter (commonly named "alpha") for the prior placed on documents' distributions over topics ("theta"). | Double | | -1.0 |
| beta | Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics' distributions over terms. | Double | | -1.0 |
| method | Optimizer to use: "em" or "online". | String | | "em" |
| onlineLearningOffset | (For online optimizer) A positive learning parameter that downweights early iterations; larger values make early iterations count less. | Double | | 1024.0 |
| learningDecay | (For online optimizer) Learning rate, set as an exponential decay rate. Must lie in (0.5, 1.0] to guarantee asymptotic convergence. | Double | | 0.51 |
| subsamplingRate | (For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1]. | Double | | 0.05 |
| optimizeDocConcentration | (For online optimizer only, currently) Whether docConcentration (the Dirichlet parameter for the document-topic distribution) is optimized during training. | Boolean | | true |
| numIter | Number of iterations. | Integer | | 10 |
| vocabSize | Maximum size of the vocabulary. If the number of distinct words exceeds this value, the words with lower document frequency are filtered out. | Integer | | 262144 |
| selectedCol | Name of the selected column used for processing. | String | ✓ | |
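To illustrate how the online-optimizer parameters above combine, here is a minimal configuration sketch. The setter names are assumed to follow Alink's usual set&lt;ParamName&gt; convention (setOnlineLearningOffset, setLearningDecay, and so on, matching the setters used in the script below); verify them against your Alink version.

```python
from pyalink.alink import *

# Sketch only: an LDA trainer configured for the "online" optimizer,
# with the online-specific parameters from the table above set explicitly.
ldaOnline = (
    LdaTrainBatchOp()
    .setSelectedCol("doc")               # column holding the document text
    .setTopicNum(6)
    .setMethod("online")
    .setOnlineLearningOffset(1024.0)     # downweight early mini-batch iterations
    .setLearningDecay(0.51)              # exponential decay rate, must be in (0.5, 1.0]
    .setSubsamplingRate(0.05)            # fraction of the corpus per mini-batch, in (0, 1]
    .setOptimizeDocConcentration(True)   # also optimize the document-topic Dirichlet prior
    .setNumIter(20)
)
```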
Script Example
Code
import numpy as np
import pandas as pd
from pyalink.alink import *

useLocalEnv(1)  # start a local PyAlink session; adjust parallelism to your environment

# Twelve toy documents; tokens are single letters separated by spaces.
data = np.array(["a b b c c c c c c e e f f f g h k k k", \
                 "a b b b d e e e h h k", \
                 "a b b b b c f f f f g g g g g g g g g i j j", \
                 "a a b d d d g g g g g i i j j j k k k k k k k k k", \
                 "a a a b c d d d d d d d d d e e e g g j k k k", \
                 "a a a a b b d d d e e e e f f f f f g h i j j j j", \
                 "a a b d d d g g g g g i i j j k k k k k k k k k", \
                 "a b c d d d d d d d d d e e f g g j k k k", \
                 "a a a a b b b b d d d e e e e f f g h h h", \
                 "a a b b b b b b b b c c e e e g g i i j j j j j j j k k", \
                 "a b c d d d d d d d d d f f g g j j j k k k", \
                 "a a a a b e e e e f f f f f g h h h j"])
df = pd.DataFrame({"doc": data})
inOp = dataframe_to_operator(df, schema_str="doc string")

# Train a 6-topic LDA model on the "doc" column with the EM optimizer.
ldaTrain = LdaTrainBatchOp()\
    .setSelectedCol("doc")\
    .setTopicNum(6)\
    .setMethod("em")\
    .setSubsamplingRate(1.0)\
    .setOptimizeDocConcentration(True)\
    .setNumIter(50)

# Predict the most likely topic for each document; the predictor takes the
# trained model as its first input and the data to score as its second.
ldaPredict = LdaPredictBatchOp().setPredictionCol("pred").setSelectedCol("doc")
model = ldaTrain.linkFrom(inOp)
ldaPredict.linkFrom(model, inOp).collect_to_dataframe()
Results
Model
| model_id | model_info |
| --- | --- |
| 0 | {"logPerplexity":"22.332946259667825","betaArray":"[0.2,0.2,0.2,0.2,0.2]","logLikelihood":"-915.6507966463809","method":"\"online\"","alphaArray":"[0.16926092344987234,0.17828690973899627,0.17282213771078062,0.18555794554097874,0.15898463316059516]","topicNum":"5","vocabularySize":"11"} |
| 1048576 | {"m":5,"n":11,"data":[6135.5227952852865,7454.918734235136,6569.887273287071,7647.590029783137,7459.37196542985,6689.783286316853,8396.842418256507,7771.120258275389,7497.94247894282,7983.617922597562,7975.470848777338,7114.413879475893,8420.381073064213,6747.377398176922,6959.728145538011,7368.902852508116,7635.5968635989275,6734.522904998126,6792.566021565353,6487.885790775943,8086.932892160501,8443.888239756887,7227.0417299467745,7561.023252667202,6264.97808011349,6964.080980387547,8234.247108608217,8263.190977757107,7872.088651923572,7725.669369347696,7591.453097717432,7733.627117746213,6595.2753568320295,8158.346230399092,7765.777648163369,6456.891859572009,6814.768507000475,6612.17371610521,6506.877213010642,7166.140342089344,7588.370517354863,7645.016947338933,8929.620632942893,6855.855247335312,7263.088264847597,7993.009126022237,7302.794183756114,6074.524636118613,6386.578740892538,8465.84700774072,7242.276290933901,7257.474039179472,7676.72445702261,6733.70550536632,7577.265607033211]} |
| 2097152 | {"f0":"d","f1":0.36772478012531734,"f2":0} |
| 3145728 | {"f0":"k","f1":0.36772478012531734,"f2":1} |
| 4194304 | {"f0":"g","f1":0.08004270767353636,"f2":2} |
| 5242880 | {"f0":"b","f1":0.0,"f2":3} |
| 6291456 | {"f0":"a","f1":0.0,"f2":4} |
| 7340032 | {"f0":"e","f1":0.36772478012531734,"f2":5} |
| 8388608 | {"f0":"j","f1":0.26236426446749106,"f2":6} |
| 9437184 | {"f0":"f","f1":0.4855078157817008,"f2":7} |
| 10485760 | {"f0":"c","f1":0.6190392084062235,"f2":8} |
| 11534336 | {"f0":"h","f1":0.7731898882334817,"f2":9} |
| 12582912 | {"f0":"i","f1":0.7731898882334817,"f2":10} |
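The first row (model_id 0) stores a JSON summary of the fit. As a minimal sketch, assuming the model operator from the script above is collected with the same collect_to_dataframe call and keeps the model_id/model_info column names shown here, the reported logPerplexity can be read back like this:

```python
import json

# Collect the model rows (columns: model_id, model_info) into pandas,
# then parse the JSON summary stored under model_id == 0.
model_df = model.collect_to_dataframe()
summary = json.loads(model_df.loc[model_df["model_id"] == 0, "model_info"].iloc[0])
print(float(summary["logPerplexity"]))  # lower perplexity means a better fit
```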
Prediction
| doc | pred |
| --- | --- |
| a b b b d e e e h h k | 1 |
| a a b d d d g g g g g i i j j j k k k k k k k k k | 3 |
| a a a a b b d d d e e e e f f f f f g h i j j j j | 3 |
| a a b d d d g g g g g i i j j k k k k k k k k k | 1 |
| a a a a b b b b d d d e e e e f f g h h h | 3 |
| a b c d d d d d d d d d f f g g j j j k k k | 3 |
| a b b c c c c c c e e f f f g h k k k | 2 |
| a b b b b c f f f f g g g g g g g g g i j j | 0 |
| a a a b c d d d d d d d d d e e e g g j k k k | 3 |
| a b c d d d d d d d d d e e f g g j k k k | 3 |
| a a b b b b b b b b c c e e e g g i i j j j j j j j k k | 3 |
| a a a a b e e e e f f f f f g h h h j | 0 |