HBase powered Merchant Lookup Service at Intuit
Vrushali Channapattan, Intuit
Lightning Talk @ HBaseCon2012 (May 22nd, 2012)
About Intuit
Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers.
Problem: Duplicate Merchants
Company ABC
name: The Windsor Press, Inc.
address: PO Box 465 6 North Third Street
city: Hamburg
state: PA
zip: 19526
phone: (610) 562-2267
Company PQR
name: The Windsor Press
address: P.O. Box 465 6 North 3rd St.
city: Hamburg
state: PA
zip: 19526-0465
phone: (610) 562-2267
Both of the above vendor records map to the D&B business:
ID: 002114902
Name: The Windsor-Press Inc
Street: 6 N 3rd St
City: Hamburg
State: PA
Zip: 19526-1502
Phone: (610)-562-2267
Backend Architecture
[Architecture diagram. Labeled components: Input Data, Dun & Bradstreet, Loader, Full Table Scan, Various Matchers (Name, Address, Phone), Individual Matcher Scores, Score Combiner, Final Match Score, Merchant Splicer, Update, Applications, Internal Research Projects]
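The matcher and score-combiner stages named in the backend architecture can be pictured with a small sketch. The talk does not say how any matcher or the combiner is implemented, so everything here is an assumption: the class name MatcherSketch, the digits-only phone comparison, and the unweighted mean used to combine scores are all hypothetical. The phone values are the ones from the duplicate-merchant example, which differ only in punctuation.

```java
import java.util.Arrays;

/** Hypothetical sketch: one matcher plus a naive score combiner (not the talk's implementation). */
final class MatcherSketch {

    /** Strips everything except digits, so "(610)-562-2267" becomes "6105622267". */
    static String normalizePhone(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    /** Returns 1.0 when both phones normalize to the same non-empty digit string, else 0.0. */
    static double phoneScore(String a, String b) {
        String na = normalizePhone(a);
        String nb = normalizePhone(b);
        return (!na.isEmpty() && na.equals(nb)) ? 1.0 : 0.0;
    }

    /** Combines individual matcher scores with an unweighted mean (an assumed formula). */
    static double combine(double... scores) {
        return Arrays.stream(scores).average().orElse(0.0);
    }
}
```

A real combiner would likely weight name, address, and phone evidence differently; the mean is used here only to keep the sketch short.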
Data Model - Tables in HBase
Merchants: Master dataset of merchants
Sangria_id: Unique id generation coordination across mapper processes
Duplicates: Noting duplicate merchants after deduplication
SnapshotMerchants: Merging into master dataset
NewMerchants: The new merchant set that is to be added to the master dataset of merchants
Schema
Merchants
Row key: 25204939
Info (column family):
  name: Crepevine
  street: 367 University Avenue
  city: Palo Alto
  state: CA
  zip: 94031
  county: Santa Clara County
  country: United States of America
  website: www.crepevine.com
  phoneNumber: 16503233900
  latitude: 37.430211
  longitude: -122.098221
  source: internet
  mint_category: Food & Dining
  qbo_category: Restaurants
  NAICS: 722110
  SIC: 5182
Mapping (column family):
  sourcename: 10000048, 10000075
Schema
Sangria_id
Row key    Info (column family)
default    seed: 30000, comment: initial seed by vc of 1000
qbo        seed: 20550000, comment: initial seed by kf of 20000000
Duplicates
Row key    Info (column family)
10000043   25204921:0.998
10000048   25204939:0.78
10000075   25204939:0.95
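The Sangria_id table is described as coordinating unique id generation across mapper processes, with one seed row per source. The slides do not show how the seed is consumed. One common pattern, sketched here as an assumption, is for each mapper to atomically reserve a block of ids from the shared seed and hand them out locally. An AtomicLong stands in for the seed cell so the sketch runs without a cluster; against HBase the reservation step could be an atomic increment such as incrementColumnValue on the table handle, provided the seed is stored as an 8-byte long. The class name IdBlockAllocator and the block size are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of block-based id allocation; one instance per mapper, not thread-safe. */
final class IdBlockAllocator {
    private final AtomicLong sharedSeed; // stands in for the seed cell of a Sangria_id row
    private final long blockSize;        // how many ids to reserve per round trip
    private long next;                   // next id to hand out from the current block
    private long limit;                  // exclusive end of the current block

    IdBlockAllocator(AtomicLong sharedSeed, long blockSize) {
        this.sharedSeed = sharedSeed;
        this.blockSize = blockSize;
        // next == limit == 0 here, so the first nextId() call reserves a block.
    }

    /** Returns a unique id, reserving a fresh block from the shared seed when the local one is used up. */
    long nextId() {
        if (next == limit) {
            limit = sharedSeed.addAndGet(blockSize); // atomic reservation of [limit - blockSize, limit)
            next = limit - blockSize;
        }
        return next++;
    }
}
```

With a seed of 30000 and a block size of 1000, the first allocator to ask receives ids starting at 30000 and the second starts at 31000, so two mappers never hand out the same id.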
Optimizations (job level)
• For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
• Emitted a ‘put’ from the Mapper or Reducer instead of a regular HTable put
– Use context.write(rowKey, put)
• To make full table scans faster (HBase read-only Hadoop jobs: deduping matchers, Solr index generator)
scan.setCaching(500);
scan.setCacheBlocks(false);
• Used a customized TableInputFormat while scanning (custom number of splits for map tasks)
job.setInputFormatClass(CustomizedTableInputFormat.class);
– CustomizedTableInputFormat extends the TableInputFormat class and overrides the getSplits method
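The slides mention overriding getSplits to control the number of map tasks but do not show how the splits were computed. As a rough sketch of the arithmetic only, the helper below divides a numeric row-key range into a requested number of contiguous sub-ranges; in an actual override, each sub-range could become the start and stop row of one input split. The class name KeyRangeSplitter is hypothetical, and the sketch assumes fixed-width numeric row keys (such as the 8-digit merchant ids shown earlier) so that numeric order matches row-key order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides a numeric key range into a chosen number of sub-ranges. */
final class KeyRangeSplitter {

    /** Splits [start, end) into at most `parts` contiguous, non-overlapping ranges {lo, hi}. */
    static List<long[]> split(long start, long end, int parts) {
        if (parts <= 0 || end <= start) {
            throw new IllegalArgumentException("need parts > 0 and start < end");
        }
        long total = end - start;
        int n = (int) Math.min(parts, total); // never create empty ranges
        long base = total / n;
        long extra = total % n;               // the first `extra` ranges hold one more key
        List<long[]> ranges = new ArrayList<>();
        long lo = start;
        for (int i = 0; i < n; i++) {
            long hi = lo + base + (i < extra ? 1 : 0);
            ranges.add(new long[] {lo, hi});
            lo = hi;
        }
        return ranges;
    }
}
```

For example, split(0, 10, 3) yields the ranges [0, 4), [4, 7), and [7, 10), which together cover every key exactly once.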
Optimizations (code level)
• Stored frequently used column family and column names as byte arrays in a public interface
public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");
• Utility class for getting values from hbase.client.Result
HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);
// Returns the cell value for the given column family and column name as a String
public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
    return Bytes.toString(result.getValue(type, columnName));
}
• Writing a sample set of 31 million records into the HBase cluster dropped from 4 hours 37 minutes 47 seconds to 32 minutes 18 seconds (roughly 8.6 times faster)