Greenplum GPText 1.1.0.0

GPText User’s Guide

Rev: A07


Copyright © 2013 EMC Corporation. All rights reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com

All other trademarks used herein are the property of their respective owners.


GPText User's Guide - Contents

Preface
    About Pivotal, Inc.
    About This Guide
    Text Conventions
    Command Syntax Conventions

Chapter 1: Introduction to Greenplum GPText
    GPText Sample Use Case
    GPText Workflow
        Data Loading and Indexing Workflow
        Querying Data Workflow
    Text Analysis

Chapter 2: Installing GPText
    Prerequisites
    Installing the GPText Binaries
        Setting Installation Parameters
    Update Symbolic Links After a Greenplum Database Upgrade
    Uninstall GPText

Chapter 3: Working With Indexes
    Using Indexing
    Working with Indexes
    Setting up Text Analysis Chains
        text_intl, the International Text Analyzer
        text_sm, the Social Media Text Analyzer
    Using Multiple Analyzer Chains

Chapter 4: Queries
    Creating a Simple Search Query
    Creating a Faceted Search Query
        Facet by field
        Facet by query
        Facet by range
    Using Advanced Querying Options
    Changing the Query Parser at Query Time
        Using the Solr localParams syntax

Chapter 5: Administering GPText
    Security and GPText Indexes
    Checking Solr Instance Status
    Gathering Solr Index Statistics
    Changing Global Configuration Values
    Troubleshooting
        Monitoring Logs
        Determining Segment Status with gptext-state

Chapter 6: k-means Analysis
    Performing k-means Analysis Using gptext-analytics
    Editing kmeans.yaml
        The kmeans.yaml file to edit
    Replacing the kmeans.yaml File
    Running the k-means Algorithm
    Running the k-means Algorithm More Than Once

Chapter 7: Support Vector Machine Analysis
    Training SVM
    Creating svm_train.yaml
    Editing svm_train.yaml
        The svm_train.yaml File Contents
    Replacing svm_train.yaml and Generating SQL Files
    Running the SVM Training Algorithm
    Running the SVM Training Algorithm More Than Once
    Testing the SVM
    Creating svm_test.yaml
    Editing svm_test.yaml
        The svm_test.yaml File Contents
    Replacing svm_test.yaml and Generating SQL Files
    Running the SVM Test
    Running the SVM Test Algorithm More Than Once

Chapter 8: GPText High Availability
    Normal GPText Running Conditions
    Solr Master Failure
    Primary Segment Failure
    Primary Segment and Solr Master Failure

Appendix A: Glossary


Preface

This guide provides information and instructions for installing, configuring, and using Greenplum GPText for data scientists, application builders, system administrators, and database administrators.

• About Pivotal, Inc.
• About This Guide
• Text Conventions
• Command Syntax Conventions

About Pivotal, Inc.

Greenplum is currently transitioning to a new corporate identity (Pivotal, Inc.). We estimate that this transition will be complete by Q2 2013. During this transition, there will be some legacy instances of our former corporate identity (Greenplum) appearing in our products and documentation. If you have any questions or concerns, please do not hesitate to contact us through our web site: http://www.greenplum.com/support-transition.

About This Guide

This guide explains how to install, configure, and use GPText. It describes key concepts such as text analysis, indexing and query workflows, the available analysis chains, querying with GPText, administration, troubleshooting, and high availability.

The documentation is intended for:

• System administrators and database managers who will install and maintain a Greenplum Database (GPDB). These individuals should know basic Linux administration and should have some experience administering GPDB clusters.

• Application builders (software developers) who will use GPText as a platform for building new software.

• Data scientists who are interested in solving business problems with advanced techniques.

Application builders and casual data scientists should have a good working knowledge of PostgreSQL (the SQL language of GPDB), should understand basic GPDB principles, and should have some familiarity with Lucene query syntax.

A moderate user should also have a basic understanding of natural language processing and some familiarity with configuring Solr analyzer chains.

Advanced users should have a thorough understanding of PostgreSQL and a solid background in Machine Learning algorithms, especially as they are implemented in MADlib.


Text Conventions

Table 0.1 Text Conventions

italics
    Usage: New terms where they are defined; database objects, such as schema, table, or column names.
    Examples:
        The master instance is the postgres process that accepts client connections.
        Catalog information for Greenplum Database resides in the pg_catalog schema.

monospace
    Usage: File names and path names; programs and executables; command names and syntax; parameter names.
    Examples:
        Edit the postgresql.conf file.
        Use gpstart to start Greenplum Database.

<monospace italics>
    Usage: Variable information within file paths and file names; variable information within command syntax.
    Examples:
        /home/gpadmin/<config_file>
        COPY tablename FROM '<filename>'

monospace bold
    Usage: Calls attention to a particular part of a command, parameter, or code snippet.
    Example:
        Change the host name, port, and database name in the JDBC connection URL:
        jdbc:postgresql://host:5432/mydb

UPPERCASE
    Usage: Environment variables; SQL commands; keyboard keys.
    Examples:
        Make sure that the Java /bin directory is in your $PATH.
        SELECT * FROM my_table;
        Press CTRL+C to escape.


Command Syntax Conventions

Table 0.2 Command Syntax Conventions

{ }
    Within command syntax, curly braces group related command options. Do not type the curly braces.
    Example: FROM { 'filename' | STDIN }

[ ]
    Within command syntax, square brackets denote optional arguments. Do not type the brackets.
    Example: TRUNCATE [ TABLE ] name

...
    Within command syntax, an ellipsis denotes repetition of a command, variable, or option. Do not type the ellipsis.
    Example: DROP TABLE name [, ...]

|
    Within command syntax, the pipe symbol denotes an "OR" relationship. Do not type the pipe symbol.
    Example: VACUUM [ FULL | FREEZE ]


1. Introduction to Greenplum GPText

Greenplum GPText enables you to process mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache Solr enterprise search and the MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis. GPText supports business decision making by offering:

• Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which greatly increases the kinds of information you can find.

• Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.

• Text analytics: GPText supports analysis of text data with machine learning algorithms. The MADlib Analytics Library is integrated with Greenplum Database and is available for use by GPText.

This chapter contains the following topics:

• GPText Sample Use Case
• GPText Workflow
• Text Analysis

GPText Sample Use Case

Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:

1. Load the email records into a Greenplum database.

2. Create a Solr index of the email records.

3. Run queries that look for text strings and their authors.

4. Refine the queries until they pair a dummy company name with the three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.


GPText Workflow

GPText works with Greenplum Database and Apache Solr to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.

This topic describes the following information:

• Data Loading and Indexing Workflow
• Querying Data Workflow

Data Loading and Indexing Workflow

The GPText workflow for loading and indexing data is as follows.

1. Load data into your Greenplum system. For details, see the Greenplum Database Administrator Guide, available on Support Zone.

2. Create an index targeted to your application’s requirements in the Solr instance attached to your data’s segment.


Querying Data Workflow

The high-level GPText query workflow is as follows.

1. Create a SQL query targeted to your indexed information.

2. The Greenplum Master dispatches the query to Greenplum Segments.

3. The segments search the appropriate indexes stored in their Solr repositories.

4. The segments gather the results and send them to the master.

Text Analysis

GPText enables analysis of Solr indexes with MADlib, an open source library for scalable in-database analytics. MADlib started as a collaboration between the University of California, Berkeley and EMC/Greenplum. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.

The virtual machine described in Getting Started With GPText contains demonstration analyses from the MADlib library:

• K-means Clustering
• Support Vector Machines

MADlib resources include:

• Source code: http://madlib.net
• Documentation and installation procedures on the MADlib Wiki: http://github.com/madlib/madlib/wiki
• MADlib K-means clustering module: http://doc.madlib.net/v0.3/group__grp__kmeans.html
• MADlib SVM module: http://doc.madlib.net/v0.3/group__grp__kernmach.html


2. Installing GPText

The GPText installation includes the installation of Solr.

Note: You cannot install GPText onto a shared NFS mount.

This chapter contains the following topics:

• Prerequisites
• Installing the GPText Binaries
• Update Symbolic Links After a Greenplum Database Upgrade
• Uninstall GPText

Prerequisites

Before you install GPText:

• Install and configure your Greenplum Database system, version 4.2.1 or higher. See the Greenplum Database Installation Guide, available on Support Zone.
• Install JRE 1.6.x and place it in the PATH on the master and all segment servers.
• Install MADlib on the segments (optional). For instructions, see the MADlib installation guide: https://github.com/madlib/madlib/wiki/Installation-Guide

Installing the GPText Binaries

Install GPText as the same user that installed Greenplum Database, for example, gpadmin.

1. Copy the GPText binary to a directory on the GPDB master server machine. For example, /home/gpadmin. The binary has a name similar to greenplum-text-<version>-<OS>.bin. For example, greenplum-text-1.1.0.1-rhel5_x86_64.bin.

2. Grant execute permission to the GPText binary. For example:

chmod +x /home/gpadmin/greenplum-text-1.1.0.1-rhel5_x86_64.bin

3. Create the installation directory as root and change the ownership to the GPText installer, gpadmin. This step is necessary because GPText is installed in the same directory as Greenplum Database, /usr/local.

4. To install to a directory where the user does not have write permissions (a combined sketch of these commands follows the steps):

a. Use gpssh to create a directory with the same file path on all hosts (mdw, smdw, and the segment hosts). For example:

/usr/local/<gptext-version>


b. As root, set the file permissions and owner. For example:

# chmod 775 /usr/local/<gptext-version>
# chown gpadmin:gpadmin /usr/local/<gptext-version>

5. Run the following script as gpadmin on the master server:

bash <gptext-version>.bin

6. Accept the EMC license agreement.
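The following is a combined sketch of steps 3 and 4, run from the master as root. The host file name and the version-numbered path are illustrative, not fixed by the installer:

    # all_hosts lists mdw, smdw, and every segment host
    gpssh -f all_hosts 'mkdir -p /usr/local/greenplum-text-1.1.0.1'
    gpssh -f all_hosts 'chmod 775 /usr/local/greenplum-text-1.1.0.1'
    gpssh -f all_hosts 'chown gpadmin:gpadmin /usr/local/greenplum-text-1.1.0.1'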

Setting Installation Parameters

1. Set the following installation parameters:

Table 2.1 GPText installation parameters

Installation path
    Path where GPText is installed.
    Default: /usr/local/<gptext-version>

Java virtual machine (JVM) options
    Sets the minimum and maximum memory that the JVM (Solr) can use.
    Default: -Xms1024M -Xmx2048M

Port offset
    The installation uses this value to initialize Solr ports. If the Greenplum Database segment uses port 50000, the Solr instance for that segment uses port 49900 (50000 - 100). If a calculated Solr port clashes with a port GPDB already uses, or if it falls outside the range 1024-65535, an error message appears.
    Default: -100

2. To get GPText commands in your path, source the following file, located in the installation directory:

source <install_dir>/<gptext-version>_path.sh

3. To install GPText objects, such as text indexes, into a database, run the following command:

gptext-installsql <database> [<database2> ... ]

4. Start GPText:

$ gptext-start

Update Symbolic Links After a Greenplum Database Upgrade

After you upgrade a Greenplum Database that has an existing installation of GPText, update the GPText symbolic links (symlinks) on all nodes as follows.

• Source: /usr/local/greenplum-text-1.1.0.1/lib/gptext-gpdb42-1.1.0.1.so
  Target: /usr/local/<gpdb_new_version>/lib/postgresql/gptext-1.1.0.1.so

• Source: /usr/local/greenplum-text-1.1.0.1/lib/python/gptextlib
  Target: /usr/local/<gpdb_new_version>/lib/python/gptextlib
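One way to refresh both links is with ln on each host. This is a sketch that assumes the example paths above, with <gpdb_new_version> standing for the new Greenplum installation directory:

    # Recreate the library symlink in the new Greenplum tree
    ln -sf /usr/local/greenplum-text-1.1.0.1/lib/gptext-gpdb42-1.1.0.1.so \
        /usr/local/<gpdb_new_version>/lib/postgresql/gptext-1.1.0.1.so
    # Recreate the Python library symlink (-n avoids descending into an old link)
    ln -sfn /usr/local/greenplum-text-1.1.0.1/lib/python/gptextlib \
        /usr/local/<gpdb_new_version>/lib/python/gptextlib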

Uninstall GPText

To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.

gptext-uninstall runs only if there is at least one database with a GPText schema.

Execute:

gptext-uninstall


3. Working With Indexes

Indexing is key to preparing documents for text analysis and to achieving the best possible query performance. How you set up and configure indexes can affect the success of your project.

This chapter contains the following topics:

• Using Indexing
• Working with Indexes
• Setting up Text Analysis Chains
• Using Multiple Analyzer Chains

Using Indexing

The general steps for creating and using a GPText index are:

1. To access a Greenplum Database

2. To create an empty Solr Index

3. To map Greenplum Database data types to Solr data types

4. To populate the Index

5. To commit the Index

6. To configure an Index

After you complete these steps, you can create and execute a search query or implement machine learning algorithms.

The examples in this section use a table called articles in a database called wikipedia that was created with a default public schema.

The articles table has five columns: id, date_time, title, content, and references. The contents of the id column must be of type bigint or int8.

The content column is the default search column—the column that will be searched if we do not name a different one in the search query.

See the GPText Function Reference for details about GPText functions.
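For reference, the following is a minimal sketch of the articles table that these examples assume. Only the id column's type is mandated by GPText; the other column types are illustrative:

    CREATE TABLE articles (
        id           bigint,     -- must be bigint/int8
        date_time    timestamp,
        title        text,
        content      text,       -- the default search column
        "references" text        -- quoted because REFERENCES is a reserved word
    ) DISTRIBUTED BY (id);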

To access a Greenplum Database

To open an interactive shell for executing queries on the wikipedia database, execute:

psql wikipedia


To create an empty Solr Index

To create a new index, use the function gptext.create_index(schema_name, table_name, id_col_name, default_search_col_name).

Example:

SELECT * FROM gptext.create_index('public', 'articles', 'id', 'content');

This creates an index named <database_name>.<schema_name>.<table_name>, in our case 'wikipedia.public.articles'.

To map Greenplum Database data types to Solr data types

If a Greenplum Database data type is an array it is mapped to a multi-value type in Solr. For example, INT[ ] maps to a multi-value int field.

GPDB Type       Solr Type
bool            boolean
bytea           binary
char            string
name            string
int8            long
int4            int
int2            int
int             int
text            text
point           point
float4          float
float8          double
money           string
bpchar          string
varchar         text
interval        string
date            tdate
time            string
timestamp       tdate
timetz          string
timestamptz     tdate
bit             string
varbit          string
numeric         double


If a GPDB data type is not listed, it maps to the Solr text type.

To populate the Index

To populate the index, use the table function gptext.index():

SELECT * FROM gptext.index(TABLE(SELECT * FROM table_name), index_name);

For example:

SELECT * FROM gptext.index(TABLE(SELECT * FROM articles), 'wikipedia.public.articles');

Note: The arguments to the gptext.index() function must be expressions. TABLE(SELECT * FROM articles) creates a "table-valued expression" from the articles table, using the table function TABLE. You can index or update selectively by changing the inner SELECT list in the query.
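For example, the following sketch indexes only rows added since a given date (the WHERE clause is illustrative) and then commits so the new documents become searchable:

    SELECT * FROM gptext.index(
        TABLE(SELECT * FROM articles WHERE date_time >= '2013-01-01'),
        'wikipedia.public.articles');
    SELECT * FROM gptext.commit_index('wikipedia.public.articles');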

Be careful about distribution policies:

• The first parameter of gptext.index() is TABLE(SELECT * FROM articles). The query in this parameter should have the same distribution policy as the table you are indexing. However, there are two cases where the query will not have the same distribution policy:

  1. Your query is a join of two tables.
  2. You are indexing an intermediate (staging) table that is distributed differently than the final table.

• When the distribution policies differ, you must specify SCATTER BY for the query:

  TABLE(SELECT * FROM articles SCATTER BY distrib_id)

  where distrib_id is the distribution id used when you created your primary/final table.

To commit the Index

After you create and populate an index, you must commit the index using gptext.commit_index(index_name).

For example:

SELECT * FROM gptext.commit_index('wikipedia.public.articles');

Note: The index picks up any new data added since your last index commit when you call this function.

To configure an Index

You can modify your indexing behavior globally by using gptext-config to edit a set of index configuration files. The files you can edit with gptext-config are:


• solrconfig.xml -- Contains most of the parameters for configuring Solr itself (see http://wiki.apache.org/solr/SolrConfigXml).
• schema.xml -- Defines the analysis chains that Solr uses for various types of search fields (see "Setting up Text Analysis Chains").
• stopwords.txt -- Lists words you want to eliminate from the final index.
• protwords.txt -- Lists protected words that you do not want the analysis chain to modify. For example, iPhone.
• synonyms.txt -- Lists words that you want replaced by synonyms in the analysis chain.
• elevate.xml -- Moves specific words to the top of your final index.
• emoticons.txt -- Defines emoticons for the text_sm social media analysis chain (see "The emoticons.txt file").

You can also use gptext-config to move files.
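For example, following the -f/-i pattern shown later in this chapter, this sketch opens the stop-word list for the sample index (assuming the index already exists):

    gptext-config -f stopwords.txt -i wikipedia.public.articles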

Working with Indexes

This topic describes how you can work with indexes. In particular, it covers the following tasks:

• To roll back an Index
• To optimize an Index
• To delete From an Index
• To drop an Index
• To list All Indexes

To roll back an Index

Use the function gptext.rollback_index(index_name) to undo any index operations, including delete, performed since the last time this index was committed.

Example:

SELECT * FROM gptext.rollback_index('wikipedia.public.articles');

To optimize an Index

The function gptext.optimize_index(index_name, max_segments) merges all segments into a small number of segments (max_segments) for increased efficiency.

Example:

SELECT * FROM gptext.optimize_index('wikipedia.public.articles', 10);


To delete From an Index

You can delete from an index using a query with the function gptext.delete(index_name, query). This deletes all documents that match the search query. To delete all documents, use the query '*:*'.

After a successful deletion, you must issue a gptext.commit_index(index_name).

Example that deletes all documents containing "sports" in the default search field:

SELECT * FROM gptext.delete('wikipedia.public.articles', 'sports');

SELECT * FROM gptext.commit_index('wikipedia.public.articles');

Example that deletes all documents from the index:

SELECT * FROM gptext.delete('wikipedia.public.articles', '*:*');

SELECT * FROM gptext.commit_index('wikipedia.public.articles');

To drop an Index

You can completely remove an index with the gptext.drop_index(schema_name, table_name) function.

Example:

SELECT * FROM gptext.drop_index('public', 'articles');

To list All Indexes

You can list all indexes using gptext-state. For example:

gptext-state --list

Setting up Text Analysis Chains

Text analysis chains determine how Solr indexes a document.

GPText provides the following text analysis chains.

• text_intl, the International Text Analyzer
• text_sm, the Social Media Text Analyzer

Analysis begins with a tokenizer that divides the document content into tokens. In Latin-based text documents, the tokens are words (also called terms). In Chinese, Japanese, and Korean (CJK) documents, the tokens are characters.

An analyzer has only one tokenizer. The tokenizer can be followed by one or more filters executed in succession. Filters restrict the query results, for example, by removing unnecessary terms (“a”, “an”, “the”), converting term formats, or by performing other actions to ensure that only important, relevant terms appear in the result set. Each filter operates on the output of the tokenizer or filter that precedes it.

The analysis chain can include a “stemmer”. The stemmer changes words to their “stems”. For example, “receive”, “receives”, “received”, “receiver”, and “receiving” are all stemmed to “receiv”.


Note: The order of the filters is important. You can specify different analyzers for indexing and querying, or use the same analyzer for both.

For example, the following shows the default GPText analyzer, text_intl. See "text_intl, the International Text Analyzer" for details.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
    name="text_intl" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory"
        han="true" hiragana="true" katakana="true" hangul="true"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
        ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same as for type="index" -->
  </analyzer>
</fieldType>

Following are the analysis steps for text_intl.

1. The analyzer chain for indexing begins with a tokenizer called WorldLexerTokenizerFactory. This tokenizer handles most modern languages. It separates CJK characters from other language text and identifies any currency tokens or symbols.

2. The solr.CJKWidthFilterFactory filter normalizes the CJK characters based on character width.

3. The solr.LowerCaseFilterFactory filter changes all letters to lower case.

4. The WorldLexerBigramFilterFactory filter generates a bigram for any CJK characters, leaves any non-CJK characters intact, and preserves original Korean-language words. Set the han, hiragana, katakana, and hangul attributes to “true” to generate bigrams for all supported CJK languages.

5. The solr.StopFilterFactory removes common words, such as "a", "an", and "the", that are listed in the stopwords.txt configuration file (see "To configure an Index"). If there are no words in the stopwords.txt file, no words are removed.


6. The solr.KeywordMarkerFilterFactory marks the English words to protect from stemming, using the words listed in the protwords.txt configuration file (see "To configure an Index"). If protwords.txt does not contain a list of words, all words in the document are stemmed.

7. The final filter is the stemmer, in this case solr.PorterStemFilterFactory, a fast stemmer for the English language.
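For reference, stopwords.txt and protwords.txt are plain lists with one term per line. Sample entries, drawn from the words mentioned above rather than from any shipped defaults:

    stopwords.txt:
        a
        an
        the

    protwords.txt:
        iPhone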

Note: The text_intl analyzer chain for querying is the same as its analyzer chain for indexing. An analysis chain named text is also included in GPText's Solr schema.xml and is based on Solr's default analysis chain. Because its tokenizer splits on white space, text cannot process CJK languages: white space is meaningless in CJK text. Best practice is to use the text_intl analyzer.

For information about using an analyzer chain other than the default, see "To use the text_sm Social Media Analyzer".

text_intl, the International Text Analyzer

text_intl is the default GPText analyzer. It is a multiple language text analyzer for text fields. It handles Latin-based words and Chinese, Japanese, and Korean (CJK) characters.

text_intl processes documents as follows.

1. Separates CJK characters from other language text.

2. Identifies currency tokens or symbols that were ignored in the first pass.

3. For any CJK characters, generates a bigram for the CJK character and, for Korean characters only, preserves the original word.

Note that CJK and non-CJK text are treated as separate tokens. Preserving the original Korean word increases the number of tokens in a document.

Following is the definition from the Solr schema.xml template.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
    name="text_intl" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory"
        han="true" hiragana="true" katakana="true" hangul="true"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
        ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.emc.solr.analysis.worldlexer.WorldLexerTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory"
        han="true" hiragana="true" katakana="true" hangul="true"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
        ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

GPText Language Processing

The root-level tokenizer, WorldLexerTokenizerFactory, tokenizes international languages, including CJK languages. WorldLexerTokenizerFactory tokenizes languages based on their Unicode points and, for Latin-based languages, white space.

Note: Unicode is the encoding for all text in the Greenplum Database.

The following are sample input to, and output from, GPText. Each line in the output corresponds to a term.


Bulgarian input: Състав на парламента: вж. протоколи

Bulgarian output:

състав

на

парламента

вж

протоколи

Danish input: Genoptagelse af sessionen

Danish output:

genoptagelse

af

sessionen

text_intl Filters

The text_intl analyzer uses the following filters:

• The CJKWidthFilterFactory normalizes width differences in CJK characters, normalizing all character widths to fullwidth.
• The WorldLexerBigramFilterFactory filter forms bigrams (pairs) of the CJK terms generated by WorldLexerTokenizerFactory; it does not modify non-CJK text. The filter accepts attributes that guide the creation of bigrams for CJK scripts. For example, if the input contains HANGUL script but the hangul attribute is set to false, the filter will not create bigrams for that script. To ensure that WorldLexerBigramFilterFactory creates bigrams as required, set the CJK attributes han, hiragana, katakana, and hangul to true.
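For example, this hypothetical variant of the filter declaration would bigram Chinese and Japanese scripts while leaving Korean text as whole words only:

    <filter class="com.emc.solr.analysis.worldlexer.WorldLexerBigramFilterFactory"
        han="true" hiragana="true" katakana="true" hangul="false"/>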

text_sm, the Social Media Text Analyzer

The GPText text_sm text analyzer analyzes text from sources such as social media feeds. text_sm consists of a tokenizer and two filters. To configure the text_sm text analyzer, use the gptext-config utility to edit the schema.xml file. See “To use the text_sm Social Media Analyzer” for details.

text_sm normalizes emoticons: it replaces emoticons with text using the emoticons.txt configuration file. For example, it replaces a happy face emoticon, :-), with the text “happy”.

The following is the definition from the Solr schema.xml template.

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField"
    name="text_sm" positionIncrementGap="100" termVectors="true"
    termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory"
        delimiter="\t" emoticons="emoticons.txt"/>
    <!-- Case insensitive stop word removal. Add enablePositionIncrements=true
         in both the index and query analyzers to leave a 'gap' for more
         accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
        ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
        delimiter="\t" emoticons="emoticons.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="com.emc.solr.analysis.text_sm.twitter.TwitterTokenizerFactory"
        delimiter="\t" emoticons="emoticons.txt"/>
    <filter class="solr.StopFilterFactory" enablePositionIncrements="true"
        ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.EmoticonsClassifierFilterFactory"
        delimiter="\t" emoticons="emoticons.txt"/>
    <filter class="com.emc.solr.analysis.text_sm.twitter.TwitterStemFilterFactory"/>
  </analyzer>
</fieldType>

The TwitterTokenizer

The Twitter tokenizer extends the English-language tokenizer, solr.WhitespaceTokenizerFactory, to recognize the following elements as terms:

• Emoticons
• Hyperlinks
• Hashtag keywords (for example, #keyword)
• User references (for example, @username)
• Numbers
• Floating point numbers
• Numbers including commas (for example, 10,000)
• Time expressions (for example, 9:30)

The text_sm filters

com.emc.solr.analysis.socialmedia.twitter.EmoticonsClassifierFilterFactory classifies emoticons as happy, sad, or wink. It is based on the emoticons.txt file (one of the files you can edit with gptext-config), and is intended for future use, such as in sentiment analysis.

The TwitterStemFilterFactory

com.emc.solr.analysis.socialmedia.twitter.TwitterStemFilterFactory extends the solr.PorterStemFilterFactory class to bypass stemming of the social media patterns recognized by the twitter.TwitterTokenizerFactory.

The emoticons.txt file

This file contains lists of emoticons for “happy,” “sad,” and “wink.” They are separated by a tab by default. You can change the separation to any character or string by changing the value of delimiter in the social media analysis chain. The following is a sample line from the text_sm analyzer chain:

<filter class = "com.emc.solr.analysis.text_sm.twitter. EmoticonsClassifierFilterFactory" delimiter="\t" emoticons="emoticons.txt"/>

To use the text_sm Social Media Analyzer

The Solr schema.xml file specifies the analyzer to use to index a field. The default analyzer is text_intl. To specify the text_sm social media analyzer, you use the gptext-config utility to modify the Solr schema.xml for your index.

The steps are:

1. Create an index using gptext.create_index(). The index's schema.xml file contains a line similar to the following near the end:

<field name="text_search_col" indexed="true" stored="false" type="text_intl"/>

The type attribute specifies the analyzer to use; text_intl is the default.

2. Use gptext-config to edit the schema.xml file:

gptext-config -f schema.xml -i <index_name>

3. Modify the line as follows:

<field name="text_search_col" indexed="true" stored="false" type="text_sm"/>

gptext-config fetches the schema.xml file from the configuration files directory for your index and opens it in the vi editor. After you edit the file, save it, and quit vi, gptext-config returns the file to its original configuration files directory.


Using Multiple Analyzer Chains

If you want to index a field using two different analyzer chains simultaneously, you can do this:

Create a new empty index. Then use the gptext-config utility to add a new field to the index that is a copy of the field you are interested in, but with a different name and analyzer chain.

Let us assume that your index, as initially created, includes a field to index named mytext. Also assume that this field will be indexed using the default international analyzer (text_intl).

You want to add a new field to the index’s schema.xml that is a copy of mytext and that will be indexed with a different analyzer (say the text_sm analyzer). To do so, follow these steps:

1. Create an empty index with gptext.create_index().

2. Open the index’s schema.xml file for editing with gptext-config.

3. Add a <field> in the schema.xml for a new field that will use a different analyzer chain. For example:

<field indexed="true" name="mytext2" stored="false" type="text_sm"/>

By defining the type of this new field to be text_sm, it will be indexed using the social media analyzer rather than the default text_intl.

4. Add a <copyField> in schema.xml to copy the original field to the new field. For example:

<copyField dest="mytext2" source="mytext"/>

5. Index and commit as you normally would.

The database column mytext is now in the index twice with two different analyzer chains. One column is mytext, which uses the default international analyzer chain, and the other is the newly created mytext2, which uses the social media analyzer chain.
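Once the index is populated and committed, a search can target either field using Lucene field syntax. The following sketch reuses this guide's gptext.search() pattern; the table and index names are placeholders:

    SELECT t.id, q.score
    FROM mytable t,
         gptext.search(TABLE(SELECT 1 SCATTER BY 1), '<db>.<schema>.mytable',
                       'mytext2:happy', null) q
    WHERE t.id = q.id;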


4. Queries

To retrieve data, you submit a query that performs a search based on criteria that you specify. Simple queries return straightforward results. You can specify filters to process the results, or perform a faceted search to break search results into categories. You can use the default query parser, or specify a different query parser at query time.

This chapter contains the following topics:

• Creating a Simple Search Query
• Creating a Faceted Search Query
• Using Advanced Querying Options
• Changing the Query Parser at Query Time

Creating a Simple Search Query

After a Solr index is committed, you can create simple queries with the gptext.search() function:

gptext.search(src_table, index_name, search_query, filter_queries[, options])

Example: Top 10 results, no filter query

SELECT w.id, w.date_time, w.title, q.score
FROM articles w,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'wikipedia.public.articles',
                   'Libya AND (Gadaffi OR Kadafi OR Qad*I)', null, 'rows=10') q
WHERE q.id = w.id;

Example: Top 100 results, no filter query

SELECT w.title, q.score
FROM wikipedia w,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'wikipedia.public.articles',
                   'solr search query', null, 'rows=100') q
WHERE w.id = q.id;

Example: All results, no filter query

SELECT w.title, q.score
FROM wikipedia w,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'wikipedia.public.articles',
                   'solr search query', null) q
WHERE w.id = q.id;

Example: Top 100 results, with filter query

SELECT w.title, q.score
FROM wikipedia w,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'wikipedia.public.articles',
                   'solr search query', 'solr filter query', 'rows=100') q
WHERE w.id = q.id;


Note: For now, the legacy search is still available through the function gptext.legacy_search().

Creating a Faceted Search Query

Faceting breaks up a search result into multiple categories and shows counts for each category. You can facet your search as follows:

• faceted_field_search() facets by field
• faceted_query_search() facets by query
• faceted_range_search() facets by range

Facet by field

The faceted_field_search() facet categories are the field names. The syntax is:

faceted_field_search(index_name, query, filter_queries, facet_fields, facet_limit, minimum)

In this example, the query runs on a set of social media feeds and eliminates spam and truncated fields:

SELECT * FROM gptext.faceted_field_search('twitter.public.message', '*:*', null, '{spam, truncated}', 0, 0);

In this example, the query searches all tweets in the data set by users (user_ids) who have created at least 5 tweets.

SELECT * FROM gptext.faceted_field_search('twitter.public.message', '*:*', null, '{author_id}', -1, 5);

Facet by query

The faceted_query_search() facet categories are the results of queries that you provide. The syntax is:

faceted_query_search(index_name, query, filter_queries, facet_queries)

This example separates results into different price categories.

SELECT * FROM gptext.faceted_query_search('store.public.catalog', 'product_type:camera', null, '{price:[* TO 100], price:[101 TO 200], price:[201 TO 300], price:[301 TO *]}');

Facet by range

The faceted_range_search() facet categories are ranges defined by the range_start, range_end, and range_gap parameters. The syntax is:


faceted_range_search(index_name, query, filter_queries, range_start, range_end, range_gap)

In this example, the query counts tweets per day for the previous calendar year ('NOW/YEAR' rounds down to the start of the current year).

SELECT * FROM gptext.faceted_range_search('twitter.public.message', '*:*', null, 'NOW/YEAR-1YEAR', 'NOW/YEAR', '+1DAY');

Using Advanced Querying Options

When you submit a query, Solr processes the query using a query parser. There are several Solr query parsers with different capabilities. For example, the ComplexPhraseQueryParser can parse wildcards, and the SurroundQueryParser supports span queries: finding words in the vicinity of a search word in a document.

You can use the most appropriate parser for your query (see "Changing the Query Parser at Query Time").

GPText supports these query parsers:

1. QParserPlugin, the default GPText query parser. QParserPlugin is a superset of the LuceneQParserPlugin, Solr’s native Lucene query parser. QParserPlugin is a general purpose query parser with broad capabilities. QParserPlugin does not support span queries and handles operator precedence in an unintuitive manner. The support for field selection is also rather weak. See http://wiki.apache.org/solr/SolrQuerySyntax.

2. The ComplexPhraseQueryParser, which supports wildcards, ORs, ranges, and fuzzies inside phrase queries. See https://issues.apache.org/jira/browse/SOLR-1604.

3. The SurroundQueryParser, which supports the family of span queries. See http://wiki.apache.org/solr/SurroundQueryParser.

4. The DisMax (or eDisMax) Query Parser, which handles operator precedence in an intuitive manner and is best suited for user queries. See http://wiki.apache.org/solr/DisMaxQParserPlugin.

A good general reference for query parsers can be found at: http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers.

Changing the Query Parser at Query Time

You can change the query parser at query time by passing the defType Solr option in the options parameter of a gptext.search() function that supports Solr options. For example, this query uses the dismax query parser to return the top 100 results for "olympics", with no filter query, using the default search field:

SELECT w.title, q.score
FROM wikipedia w,
     gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'wikipedia.public.articles',
                   'olympics', null, 'rows=100,defType=dismax') q
WHERE w.id = q.id;

The options parameter includes 'defType=dismax'.

Using the Solr localParams syntax

You can use the Solr localParams syntax with all GPText search functions to change the query parser at query time by replacing the <query> term with '{!type=dismax}<query>'.

For example, the following query has the query term 'iphone' and uses the default query parser:

SELECT * FROM gptext.faceted_field_search('twitter.public.message', 'iphone', null, '{spam, truncated}', 0, 0);

You can change the query to use the edismax parser as follows:

SELECT * FROM gptext.faceted_field_search('twitter.public.message', '{!type=dismax}iphone', null, '{spam, truncated}', 0, 0);

Note: The default query parser is specified in the requestHandler definitions in solrconfig.xml. You can edit solrconfig.xml with the management utility gptext-config.


5. Administering GPText

GPText administration includes security considerations, monitoring Solr index statistics, and troubleshooting.

This chapter contains the following topics:

• Security and GPText Indexes
• Checking Solr Instance Status
• Gathering Solr Index Statistics
• Changing Global Configuration Values
• Troubleshooting

Security and GPText Indexes

The security model for GPText indexes is based on the security model for Greenplum Database tables.

Your privileges to execute GPText functions depend on your Greenplum Database privileges for the table from which the index was generated. For example, if you have SELECT privileges for a table in the Greenplum database, then you have SELECT privileges for an index generated from that table.

Executing the GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.
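Because index privileges follow table privileges, access is managed with ordinary Greenplum GRANT statements. A sketch, with a hypothetical analyst role:

    -- Allows analyst to run gptext.search() against indexes built from articles
    GRANT SELECT ON articles TO analyst;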

Checking Solr Instance Status

You can check a Solr instance using the SQL function gptext.status(), or by running the gptext-state utility from the command line.

Sample SQL:

SELECT * FROM gptext.status();

Gathering Solr Index Statistics

You can gather Solr index statistics using the SQL function gptext.index_statistics(), or by running the gptext-state utility from the command line.

Sample SQL to obtain statistics for an index:

SELECT * FROM gptext.index_statistics('wikipedia.public.articles');


Sample SQL to obtain statistics for all indexes:

SELECT * FROM gptext.index_statistics(null);

A command line sample that retrieves all statistics for an index:

gptext-state --stats --index wikipedia.public.articles

A command line sample that retrieves the number of documents in an index:

gptext-state --stats --index wikipedia.public.articles --stats-columns num_docs

A command line sample that retrieves 'has_deletes' and the index size:

gptext-state --stats --index wikipedia.public.articles --stats-columns has_deletes,size

A command line sample that retrieves all statistics for all indexes:

gptext-state --stats

Changing Global Configuration Values

Global User Configurations (GUCs) are built into the GPText installation with default values. You can change the default values by editing the postgresql.conf file on the segment.

The following commented-out line is found near the end of the postgresql.conf file:

#custom_variable_classes = ' '

To change the value of one or more GUCs, remove the leading comment character (#) and supply gptext as the value.

custom_variable_classes = 'gptext'

Then, redefine the GUC values following the two examples below:

gptext.idx_buffer_size=10485760

gptext.idx_delim='|'
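A sketch of putting an edit into effect: reload the server configuration and confirm the value from psql. The database name comes from this guide's examples, and some parameters may require a full restart rather than a reload:

    gpstop -u
    psql wikipedia -c "SHOW gptext.idx_delim;"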

The following list shows each GUC with its minimum, maximum, and default values.

• idx_buffer_size -- Size of the indexing buffer, in bytes. Minimum 4096, maximum 67108864, default 4194304.
• idx_delim -- Delimiter to use during indexing. Default: comma (',').
• idx_escape -- Escape character to use for indexing. Default: backslash ('\').
• search_timeout -- Timeout, in seconds, for searches. Minimum 30, maximum INT_MAX, default 600.
• admin_timeout -- Timeout, in seconds, for admin requests (create_index, etc.). Minimum 30, maximum INT_MAX, default 3600.
• delete_timeout -- Timeout, in seconds, for delete requests. Minimum 30, maximum INT_MAX, default 3600.
• facet_timeout -- Timeout, in seconds, for faceting queries. Minimum 30, maximum INT_MAX, default 3600.
• index_timeout -- Timeout, in seconds, for receiving a response to an indexing operation. Minimum 30, maximum INT_MAX, default 3600.
• optimize_timeout -- Timeout, in seconds, for optimize operations. Minimum 30, maximum INT_MAX, default 3600.
• ping_timeout -- Timeout, in seconds, for ping requests. Minimum 30, maximum INT_MAX, default 120.
• replication_timeout -- Timeout, in seconds, for replication operations (backup, restore). Minimum 30, maximum INT_MAX, default 43200.
• rollback_timeout -- Timeout, in seconds, for rollback operations. Minimum 30, maximum INT_MAX, default 3600.
• stats_timeout -- Timeout, in seconds, for obtaining statistics. Minimum 30, maximum INT_MAX, default 600.
• commit_timeout -- Timeout, in seconds, for prepare-commit and commit operations. Minimum 30, maximum INT_MAX, default 3600.
• search_buffer_size -- Buffer size for search results, in bytes. Minimum 4096, maximum 67108864, default 4194304.
• search_post_buffer_size -- Post buffer size for search, in bytes. Minimum 512, maximum 4194304, default 4096.
• terms_batch_size -- Batch size for terms operations. Minimum 1, maximum INT_MAX, default 50000.
• search_batch_size -- Batch size for search requests. Minimum 1, maximum INT_MAX, default 2500000.


Troubleshooting

GPText errors are of the following types:

• Solr errors
• gptext errors

Most of the Solr errors are self-explanatory.

gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.

admin_timeout Timeout, in seconds, for admin requests (create_index, etc.).

30 INT_MAX 3600

delete_timeout Timeout, in seconds, for delete requests.

30 INT_MAX 3600

facet_timeout Timeout, in seconds, for faceting queries.

30 INT_MAX 3600

index_timeout Timeout, in seconds, for receiving response to indexing operation.

30 INT_MAX 3600

optimize_timeout Timeout, in seconds, for optimize operations.

30 INT_MAX 3600

ping_timeout Timeout, in seconds, for ping requests.

30 INT_MAX 120

replication_timeout Timeout, in seconds, for replication operations (backup, restore).

30 INT_MAX 43200

rollback_timeout Timeout, in seconds, for rollback operations.

30 INT_MAX 3600

stats_timeout Timeout, in seconds, for obtaining statistics.

30 INT_MAX 600

commit_timeout Timeout, in seconds, for prepare commit and commit operations.

30 INT_MAX 3600

search_buffer_size Buffer size for search results, in bytes.

4096 67108864 4194304

search_post_buffer _size

Post buffer size for search, in bytes.

512 4194304 4096

terms_batch_size Batch size for terms operations.

1 INT_MAX 50000

search_batch_size Batch size for search requests.

1 INT_MAX 2500000

GUC Name Description Minimum Maximum Default

Troubleshooting 29

Page 34: Greenplum GPText 1.1.0.0 User’s Guide · 2020-06-23 · GPText enables analysis of Solr indexes with MADlib, an open source library for scalable in-database analytics. MADlib started

Chapter 5: Administering GPText

Monitoring Logs

You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:

segment-directory/pg_log

Solr logs reside in:

<GPDB path>/solr/logs
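For example, to list the most recent Solr logs on a segment host (exact file names may vary):

ls -lt <GPDB path>/solr/logs | head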

Determining Segment Status with gptext-state

Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Function Reference.


6. k-means Analysis

The k-means algorithm is a well-known algorithm for grouping data into clusters, enabling classification of a data set. This is useful for applications such as data mining.

This chapter contains the following topics:

• Performing k-means Analysis Using gptext-analytics
• Editing kmeans.yaml
• Replacing the kmeans.yaml File
• Running the k-means Algorithm
• Running the k-means Algorithm More Than Once

Performing k-means Analysis Using gptext-analytics

You can use the gptext-analytics utility to run the k-means algorithm on an index of your own data. First, create a configuration file:

gptext-analytics -c kmeans -f kmeans.yaml

This creates a file called kmeans.yaml in the current directory.

Note: gptext-analytics runs the same sequence of SQL statements as the k-means demo described in Getting Started with GPText; see that guide for details about the procedure that is incorporated in gptext-analytics.

gptext-analytics generates a SQL file for executing the k-means algorithm, kmeans_plusplus.sql. The SQL file is based on the MADlib 0.4 API.

Provide the following parameters in the kmeans.yaml file:

• The name of your database (myDB).
• The name of the table you are indexing (myTable).
• The name of the schema for that table (mySchema).
• The name of the table's id column (myTableID).
• The name of the default column to search in the index, in the event that another column is not named in a query (myTableDocBody).

The general procedure is:

1. Edit the kmeans.yaml file and replace the placeholder parameters with your own.

2. Run:

gptext-analytics -d <sql_files_directory> -f kmeans.yaml

The -d parameter specifies the directory in which the SQL files are generated. kmeans.yaml is the file after you have edited it.


Running this command generates SQL files, one for each intermediate step as specified in the .yaml file, along with a main SQL file, execute.sql, to run all the intermediate files.

3. Run the SQL file execute.sql. This will run the sequence of SQL files to perform the analysis. It is the same sequence as the one in the k-means demo described in Getting Started with GPText.

Editing kmeans.yaml

The kmeans.yaml file is created when you run:

gptext-analytics -c kmeans -f kmeans.yaml

kmeans.yaml contains placeholder parameters that you must replace. For example, you can use the vi/vim command :%s/OLD/NEW/g to globally replace OLD with NEW.
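For example, to replace the myDB placeholder everywhere with a hypothetical database named wikipedia:

:%s/myDB/wikipedia/g

Repeat for each placeholder; the surrounding angle brackets mark the text to replace and presumably should not remain in the edited file.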

Placeholder      Replace with
myDB             The name of your database.
myTable          The name of the table to index.
mySchema         The name of the table's schema.
myTableID        The name of the table's id column.
myTableDocBody   The name of the table's default search column.

At the end of the kmeans.yaml file are the parameters that are used when running the k-means algorithm. For example, the default file includes:

distance_metric: <cosine>

num_iterations: <5>

convergence_threshold: <0.01>

You can change some of these parameters by editing the kmeans.yaml file.
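For example, a hypothetical edit that requests more clusters and allows more iterations (with the placeholder angle brackets removed) might set:

distance_metric: cosine

num_iterations: 10

convergence_threshold: 0.05

num_clusters: 20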

The kmeans.yaml File to Edit

# The following file corresponds to the configuration parameters for configuring the k-means clustering algorithm using GPText via the gptext-analytics utility.

#

# Name of the algorithm.

algorithm: kmeans

#

# Required parameters for create index operation.

# schema_name: Name of the schema in which the table exists


# table_name: Name of the table

# id_col_name: Name of the id column

# def_search_col_name: Name of the default search column name

#

create_index:

schema_name: <mySchema>

table_name: <myTable>

id_col_name: <myTableID>

def_search_col_name: <myTableDocBody>

#

# Required Parameters for the index data and term enabling operation.

# index_name: Name of the index to be used for indexing.

# table_name: Name of the table to index.

# field_name: Name of the field (column) for which to enable terms.

#

index_data:

index_name: <myDB.mySchema.myTable>

table_name: <mySchema.myTable>

field_name: <myTableDocBody>

#

# Required parameters for creating the terms table for the indexed column.

# terms_table_name: Name of the OUTPUT terms table to be created.

# index_name: Name of the index containing the indexed documents.

# field_name: Name of the field (column) that has been indexed and whose terms are required.

#

create_terms:

terms_table_name: <mySchema.terms>

index_name: <myDB.mySchema.myTable>

field_name: <myTableDocBody>

#

# Required parameters for creating the dictionary table storing the extracted features (or terms) from the terms_table.


# dict_table_name: Name of the OUTPUT dictionary table.

# terms_table_name: Name of the terms table containing the required terms.

#

create_dictionary:

dict_table_name: <mySchema.dictionary>

terms_table_name: <mySchema.terms>

#

# Required parameters for creating the corpus table for storing the sparse vector representation of the documents from the terms table and dictionary.

# corpus_table_name: Name of the OUTPUT corpus table to be created.

# terms_table_name: Name of the terms table containing required terms.

# dict_table_name: Name of the dictionary table containing extracted terms to be used as features.

#

create_corpus:

corpus_table_name: <mySchema.corpus>

terms_table_name: <mySchema.terms>

dict_table_name: <mySchema.dictionary>

#

# Required parameters for creating the tfidf table containing the TF-IDF vectors for the documents.

# tfidf_table_name: Name of the OUTPUT tfidf table to be created.

# corpus_table_name: Name of the corpus table containing the sparse vector representations for the documents.

#

create_tfidf:

tfidf_table_name: <mySchema.tfidf>

corpus_table_name: <mySchema.corpus>

#

# Required parameters for running the kmeans algorithm on the corpus created (tfidf table).

# Please refer to the MADlib k-means documentation for details about these parameters.


kmeans_plusplus:

src_relation_name: <kmeans.tfidf>

output_point_table: <kmeans.km_p>

output_centroid_table: <kmeans.km_c>

distance_metric: <cosine>

num_iterations: <5>

convergence_threshold: <0.01>

evaluate: <True>

verbose: <True>

num_clusters: <10>

sample_fraction: <null>

Replacing the kmeans.yaml File

Run:

gptext-analytics -d <sql_files_directory> -f kmeans.yaml

The -d parameter specifies the directory in which to generate the SQL files.

kmeans.yaml is the file after you have edited it.

Running this command generates SQL files, one for each intermediate step as specified in the .yaml file, and a main SQL file, execute.sql, to run all the intermediate files. The SQL files are based on the MADlib 0.4 API.

Running the k-means Algorithm

After you edit and replace the kmeans.yaml file, you can run the k-means algorithm on your data. Run execute.sql from the -d directory. For example:

psql -f execute.sql

Optionally, review the script first:

vi execute.sql
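A fuller invocation, assuming the SQL files were generated into a hypothetical /home/gpadmin/kmeans_sql directory and the target database is named wikipedia:

cd /home/gpadmin/kmeans_sql

psql -d wikipedia -f execute.sql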

Running the k-means Algorithm More Than Once

Many of the steps in execute.sql create tables, and by default they do not delete an old table before creating a new one. When you re-run the k-means algorithm, handle the existing tables in one of these ways (a sketch of the first option follows this list):

• Delete the old tables, then regenerate the SQL files with the original edited kmeans.yaml file:

  gptext-analytics -d <sql_files_directory> -f kmeans.yaml

• In the kmeans.yaml file, change the names of the tables to create, then run gptext-analytics -d <sql_files_directory> -f kmeans.yaml again. This allows execute.sql to run with new table names.
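A minimal sketch of the first option, dropping the output tables from a previous run (table names follow the placeholders used earlier in this chapter):

DROP TABLE IF EXISTS mySchema.terms;

DROP TABLE IF EXISTS mySchema.dictionary;

DROP TABLE IF EXISTS mySchema.corpus;

DROP TABLE IF EXISTS mySchema.tfidf;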


See Getting Started With GPText for details about the steps that are incorporated into the execute.sql script.


7. Support Vector Machine Analysis

The procedure for running the SVM algorithm is very similar to the k-means procedure described in Chapter 6, "k-means Analysis". The principal difference is that you run SVM twice: once for training and once for testing.

This chapter contains the following topics:

• Training SVM
• Creating svm_train.yaml
• Editing svm_train.yaml
• Replacing svm_train.yaml and Generating SQL Files
• Running the SVM Training Algorithm
• Running the SVM Training Algorithm More Than Once
• Testing the SVM
• Creating svm_test.yaml
• Editing svm_test.yaml
• Replacing svm_test.yaml and Generating SQL Files
• Running the SVM Test
• Running the SVM Test Algorithm More Than Once

Training SVM

You can use the gptext-analytics utility to train an SVM on an index of your own data.

Note: gptext-analytics runs the same sequence of SQL statements as the training portion of the SVM demo described in Getting Started with GPText. Refer to that document for a complete discussion of the procedure that is incorporated in gptext-analytics.

Specify the following parameters:

• The name of your database (myDB).
• The name of the table you are indexing (myTable).
• The name of the schema for that table (mySchema).
• The name of the table's id column (myTableID).
• The name of the default column to search in the index (myTableDocBody).

The procedure is:

1. Creating svm_train.yaml.

2. Editing svm_train.yaml to replace the placeholder parameters with your own.


3. Replacing svm_train.yaml and Generating SQL Files.

4. Running the SVM Training Algorithm.

Creating svm_train.yaml

Run:

gptext-analytics -c svm_train -f svm_train.yaml

This creates the svm_train.yaml file in the current directory.

Editing svm_train.yaml

The svm_train.yaml file contains placeholder parameters (enclosed in angle brackets) that you must replace. For example, you can use the vi/vim command :%s/OLD/NEW/g to globally replace OLD with NEW.

Placeholder      Replace with
myDB             The name of your database.
myTable          The name of the table to index.
mySchema         The name of the table's schema.
myTableID        The name of the table's id column.
myTableDocBody   The name of the table's default search column.

The svm_train.yaml File Contents

The contents of the svm_train.yaml file are similar to the following.

# The following file corresponds to the configuration parameters for configuring Support Vector Machines (SVM) training using GPText via the gptext-analytics utility.

#

# Name of the algorithm.

algorithm: svm_train

#

# Required parameters for create index operation.

# schema_name: Name of the schema in which the table exists

# table_name: Name of the table

# id_col_name: Name of the id column

# def_search_col_name: Name of the default search column name

#

create_index:


schema_name: <mySchema>

table_name: <myTable>

id_col_name: <myTableID>

def_search_col_name: <myTableDocBody>

#

# Required Parameters for the index data operation.

# index_name: Name of the index to be used for indexing.

# table_name: Name of the table to index.

# field_name: Name of the field (column) whose values are to be indexed.

#

index_data:

index_name: <myDB.mySchema.myTable>

table_name: <mySchema.myTable>

field_name: <myTableDocBody>

#

# Required parameters for creating terms table for the indexed column.

# terms_table_name: Name of the OUTPUT terms table to be created.

# index_name: Name of the index containing the indexed documents.

# field_name: Name of the field (column) that has been indexed and whose terms are required.

#

create_terms:

terms_table_name: <mySchema.train_terms>

index_name: <myDB.mySchema.myTable>

field_name: <myTableDocBody>

#

# Required parameters for creating the dictionary table storing the extracted features (or terms) from the terms_table.

# dict_table_name: Name of the OUTPUT dictionary table.

# terms_table_name: Name of the terms table containing the required terms.

#

create_dictionary:

dict_table_name: <mySchema.dictionary>

terms_table_name: <mySchema.train_terms>

#

# Required parameters for creating the corpus table for storing the Float8 vector representation of the documents from the terms table and dictionary.


# corpus_table_name: Name of the OUTPUT corpus table to be created.

# terms_table_name: Name of the terms table containing the required terms.

# dict_table_name: Name of the dictionary table containing the extracted terms to be used as features.

#

create_corpus:

corpus_table_name: <mySchema.train_corpus>

terms_table_name: <mySchema.train_terms>

dict_table_name: <mySchema.dictionary>

#

# Required parameters for creating the input table storing Float8 vectors of the documents along with the corresponding labels.

# input_table_name: Name of the OUTPUT input table.

# corpus_table_name: Name of the corpus table containing Float8 vectors.

# table_name: Name of the original data table having training documents with labels.

# label_col_name: Name of the column containing the labels for training documents.

#

create_input:

input_table_name: <mySchema.train_input_table>

corpus_table_name: <mySchema.train_corpus>

table_name: <mySchema.myTable>

label_col_name: <class_id>

#

# Required parameters for running SVM training.

# input_table_name: Name of the required input table containing documents (Float8 vectors) along with labels.

# model_table_name: Name of the OUTPUT (learned) model table to be created.

#

lsvm_train:

input_table_name: <mySchema.train_input_table>

model_table_name: <mySchema.model_table>

Replacing svm_train.yaml and Generating SQL Files

Run:

gptext-analytics -d <sql_files_directory> -f svm_train.yaml

The -d parameter specifies the directory in which to generate the SQL files.


svm_train.yaml is the edited file.

This command generates one SQL file for each intermediate step as specified in the .yaml file, as well as a main SQL file, execute.sql, to run all the intermediate files.

Running the SVM Training Algorithm

To run the SVM training algorithm on your data, run execute.sql from the -d directory:

psql -f execute.sql

Optionally, review the script first:

vi execute.sql
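After the script completes, you can sanity-check the result by confirming that the learned model table named in svm_train.yaml exists and is populated (placeholder names from this chapter):

SELECT count(*) FROM mySchema.model_table;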

Running the SVM Training Algorithm More Than Once

You can run the SVM training algorithm multiple times. However, many of the steps in execute.sql create tables and do not delete the old tables before creating new ones. Before you run the algorithm again, perform one of the following tasks:

• Delete the old tables, then regenerate the SQL files with the original edited svm_train.yaml file:

  gptext-analytics -d <sql_files_directory> -f svm_train.yaml

• Change the table names so that execute.sql runs with new table names: edit the svm_train.yaml file and change the names of the tables to create before you run gptext-analytics.

See Getting Started With GPText for details about the execute.sql script.

Testing the SVM

You can use the gptext-analytics utility to test an SVM on an index of your own data.

Note: gptext-analytics runs the same sequence of SQL statements as the testing portion of the SVM demo described in Getting Started with GPText. You can refer to that document for a complete discussion of the procedure that is incorporated in gptext-analytics.

Specify the following parameters:

• The name of your database (myDB).
• The name of the table you are indexing (myTable).
• The name of the schema for that table (mySchema).
• The name of the table's id column (myTableID).
• The name of the default column to search in the index (myTableDocBody).

The procedure is:


1. Creating svm_test.yaml.

2. Editing svm_test.yaml to replace the placeholder parameters with your own.

3. Replacing svm_test.yaml and Generating SQL Files.

4. Running the SVM Test.

Creating svm_test.yaml

Create svm_test.yaml in the current directory:

gptext-analytics -c svm_test -f svm_test.yaml

Editing svm_test.yaml

svm_test.yaml contains placeholder parameters (enclosed in angle brackets) that you must replace. For example, you can use the vi/vim command :%s/OLD/NEW/g to globally replace OLD with NEW.

Placeholder      Replace with
myDB             The name of your database.
myTable          The name of the table to index.
mySchema         The name of the table's schema.
myTableID        The name of the table's id column.
myTableDocBody   The name of the table's default search column.

The svm_test.yaml File Contents

The svm_test.yaml file is similar to the following.

# The following file corresponds to the configuration parameters for configuring Support Vector Machines (SVM) testing using GPText via the gptext-analytics utility.

#

# Name of the algorithm.

algorithm: svm_test

#

# Required parameters for create index operation.

# schema_name: Name of the schema in which the table exists

# table_name: Name of the table

# id_col_name: Name of the id column

# def_search_col_name: Name of the default search column name

#


create_index:

schema_name: <mySchema>

table_name: <test_data>

id_col_name: <myTableID>

def_search_col_name: <myTableDocBody>

#

# Required Parameters for the index data operation.

# index_name: Name of the index to be used for indexing.

# table_name: Name of the table to index.

# field_name: Name of the field (column) to be indexed.

#

index_data:

index_name: <myDB.mySchema.test_data>

table_name: <mySchema.test_data>

field_name: <myTableDocBody>

#

# Required parameters for creating the terms table for the indexed column.

# terms_table_name: Name of the OUTPUT terms table to be created.

# index_name: Name of the index containing the indexed documents.

# field_name: Name of the field (column) that has been indexed and whose terms are required.

#

create_terms:

terms_table_name: <mySchema.test_terms>

index_name: <myDB.mySchema.test_data>

field_name: <myTableDocBody>

#

# Required parameters for creating the corpus table for storing the Float8 vector representation of the documents from the terms table and dictionary.

# corpus_table_name: Name of the OUTPUT corpus table to be created.

# terms_table_name: Name of the terms table containing the required terms.

# dict_table_name: Name of the dictionary table containing the extracted terms to be used as features. Should be the same dictionary as extracted during SVM training.

#

create_corpus:

corpus_table_name: <mySchema.test_corpus>


terms_table_name: <mySchema.test_terms>

dict_table_name: <mySchema.dictionary>

#

# Required parameters for running SVM testing.

# corpus_table_name: Name of the table containing Float8 vector representations for the test documents.

# model_table_name: Name of the table containing the learned model from SVM training.

# output_table_name: Name of the OUTPUT table to be created containing the predicted values.

#

lsvm_predict_batch:

corpus_table_name: <mySchema.test_corpus>

model_table_name: <mySchema.model_table>

output_table_name: <mySchema.output_table>

Replacing svm_test.yaml and Generating SQL Files

This step generates one SQL file for each intermediate step specified in the .yaml file, as well as a main SQL file, execute.sql, that runs all the intermediate files. Run:

gptext-analytics -d <sql_files_directory> -f svm_test.yaml

The -d parameter specifies the directory in which to generate the SQL files.

svm_test.yaml is the edited file.

Running the SVM Test

To run the SVM testing algorithm on your data, run execute.sql from the -d directory. For example:

psql -f execute.sql

Optionally, review the script first:

vi execute.sql
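After the test completes, the predicted values are in the output table named in svm_test.yaml. A quick inspection, using the placeholder names from this chapter:

SELECT * FROM mySchema.output_table LIMIT 10;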

Running the SVM Test Algorithm More Than Once

You can run the SVM testing algorithm multiple times. However, many of the steps in execute.sql create tables and do not delete the old tables before creating new ones. Before you run the algorithm again, perform one of the following tasks:

• Delete the old tables, then regenerate the SQL files with the original edited svm_test.yaml file:

  gptext-analytics -d <sql_files_directory> -f svm_test.yaml

• Change the table names so that execute.sql runs with new table names: edit the svm_test.yaml file and change the names of the tables to create before you run gptext-analytics.

See Getting Started With GPText for details about the execute.sql script.


8. GPText High Availability

The GPText high availability feature ensures that, if a failure occurs, you can still query your data and continue working. However, certain operations are unavailable during a failure.

The following functions require that the Solr master and Solr mirror instances are running:

• gptext.create_index()

• gptext.drop_index()

• gptext.add_field()

• gptext.drop_field()

• gptext.reload_index()

The following functions require that the Solr master instance is running:

• gptext.index()

• gptext.commit_index()

• gptext.optimize_index()

• gptext.rollback_index()
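If one of these operations fails and you suspect an instance is down, you can check which Solr instances are running before retrying, using the status function described in Chapter 5:

SELECT * FROM gptext.status();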

Normal GPText Running Conditions

During normal running conditions, GPText queries execute on the primary Greenplum Database (GPDB) segment, which queries the local Solr master instance for text search-related tasks, as shown in the following figure.

[Figure: The GPDB master dispatches to the primary segment, which queries the Solr master instance; the mirror segment hosts the Solr mirror instance.]


Solr Master Failure

If the master Solr instance is down, the primary GPDB segment queries the Solr mirror instance located in the mirror segment data directory, as shown in the following figure.

[Figure: With the Solr master instance down, the primary segment queries the Solr mirror instance on the mirror segment.]

Primary Segment Failure

If the GPDB primary segment is down, the GPDB master fails over to the mirror segment to execute queries. The mirror segment queries the master Solr instance located in the primary GPDB segment data directory, as shown in the following figure.


[Figure: With the primary segment down, the GPDB master fails over to the mirror segment, which queries the Solr master instance in the primary segment's data directory.]

Primary Segment and Solr Master Failure

If the GPDB primary segment and the master Solr instance are both down, the GPDB master fails over to the mirror segment, which sends text search-related queries to the Solr mirror instance, as shown in the following figure.

[Figure: With both the primary segment and the Solr master instance down, the mirror segment queries the Solr mirror instance.]


A. Glossary

A

analyzer

Defines the set of terms for a Solr index field. An analyzer consists of a tokenizer and a set of optional filters to be applied to the input text. For example, an analyzer can consist of a WhiteSpaceTokenizerFactory followed by a LowerCaseFilterFactory as a filter. See also: tokenizer.

B

bigram

A sequence of two adjacent elements in a token string. A sequence of three consecutive tokens is a trigram and a sequence of n consecutive tokens is an n-gram.

binary classification

The process of sorting data into one of two categories, for example, classifying a given text according to whether it expresses a positive or negative sentiment. Classification problems with more than two categories are called multiclass classification problems.

C

centroid

In clustering problems, a centroid represents the approximate center of a cluster. A centroid does not have to map directly to a data point in the cluster. For example, in k-means clustering the coordinates of a centroid are the mean of the coordinates of the data points (documents) in that cluster, and they are updated as new data point assignments are made.

cluster

A set of similar data points. For example, if some documents are to be grouped into three clusters, the result of a machine learning algorithm is three clusters: all the documents within a particular cluster are similar to each other, but different from the documents in other clusters. The results and the quality of the clusters depend on various factors, including the algorithm used, the parameters that were configured, and the set of features used.

corpus

A collection of documents. Plural: corpora.


D

dictionary

A list of unique words or terms from the documents that comprise the vocabulary of the document collection.

dimension

A generalized term for a feature of data, such as word counts in a document. Dimensions are typically large in number. An n-dimensional vector expresses a document in an n-dimensional feature space. For example, if your dictionary contains n unique terms, a document could be expressed as an n-dimensional vector in which each position contains the number of times a particular term from the dictionary appears in that document. A feature space could be the entire dictionary or could be another dictionary (or set of features) extracted by using feature selection.

dimensionality reduction

The process of reducing the dimensions (or features) according to which the data (or documents) is expressed in a feature space. For example, selecting terms (or features) from the dictionary that appear in more than k documents in the entire corpus gives one set of reduced dimensions.

F

facet, faceting

A distinct feature directly attributed to a Solr field that can be used to group terms or data, usually a field name in GPText. For example, you can facet a document based on author_name or message_type, or a facet can be the size of the document. Facets are usually small in number. Faceting in Solr can be a way to dynamically create taxonomies. See taxonomy.

faceted search

A search based on specified aspects of a set of terms or data.

filter

A method of constraining a search result set, such as searching an initial set of results and selecting those that contain specified words, phrases, and symbols.

M

machine learning

A branch of artificial intelligence that focuses on the construction and study of systems that can learn from data.


N

natural language processing

A field of study that combines computer science, artificial intelligence, and linguistics to study interactions between computers and human languages.

O

ontology

The formal representation of knowledge as a set of concepts (ideas, entities, events) and their properties and relations according to a system of categories within a domain. Ontologies provide the structural frameworks for organizing information for fields such as artificial intelligence. Ontology is not a synonym for taxonomy.

P

proximity, term proximity

A search that looks for documents in which two or more separately matching term occurrences are within a specified distance (a number of intermediate words or characters).

Q

query parser

A component that parses the input queries provided for search.

S

sentiment analysis

Classifies opinions expressed in text documents into categories such as “positive” and “negative”.

silhouette coefficient (SC)

A quantitative measure of clustering performance. SC measures how tightly grouped the data in a cluster are. Its values range between –1 and 1. Values near 1 indicate that clustering was good; values near –1 indicate that clustering was poor and the data point should have been assigned to another cluster. Values near 0 indicate that the cluster assignment is ambiguous and the data point lies somewhere on the boundary between clusters.

sparse vector

A vector whose elements are mostly zeros or are unpopulated. See the Getting Started with GPText Guide for more information.


stem

The part of a word that is common to all its inflected variants (how you modify a word to express its different grammatical categories, for example, by conjugating a verb). For example, receives, receiving, and received all derive from the stem “receiv”.

stemming

The process for reducing an inflected or derived word to its stem, base, or root form. The stem is not necessarily the same as the root form. For example, receives, receiving, and received all derive from the stem “receiv”; the root form is receive.

support vector machine (SVM)

A supervised learning model that classifies data by analyzing the data, recognizing patterns in the data, and placing the data in specific classes. Applications include sentiment analysis, separating spam email from legitimate email, and, if the Sorting Hat were an SVM, determining the House to which new Hogwarts students are assigned.

T

taxonomy

A hierarchical system of classification; a method for dividing terms, concepts, or other entities into ordered groups or categories. Taxonomies differ from ontologies in that they are generally focused, simple tree relationships, and ontologies have wider, broader scopes.

term

A distinct word within a document or set of documents.

TF-IDF score

Term frequency-inverse document frequency. A numeric statistic that reflects how important a word is to a document in a set of documents. The TF-IDF score increases in proportion to the number of times a word appears in a document, offset by how frequently the word appears across the whole document set.
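A common formulation (not necessarily the exact one GPText uses) scores a term t in a document d as term frequency scaled by inverse document frequency:

tfidf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the number of times t appears in d, N is the total number of documents, and df(t) is the number of documents that contain t.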

TF-IDF vector

A vector containing tf-idf scores.

token

The units into which an input string is broken. For example, the individual terms that make up a bigram or trigram are tokens.

tokenizer

Breaks a stream of text into tokens based on delimiters (the separators that specify the characters to consider as token boundaries) or on regular expressions. For example, a delimiter could be a white space. Tokenizers are not aware of fields in a document.


token filter

Takes a stream of tokens produced by a tokenizer, examines each token, and either passes the token along or discards it. For example, a token filter may remove white space, unnecessary words such as “a”, “an”, or “the”, or remove dots from acronyms. Token filters produce another stream of tokens that can be input to other token filters.
