HBase powered Merchant Lookup Service at Intuit
Vrushali Channapattan, Intuit
Lightning Talk @ HBaseCon2012 (May 22nd, 2012)



About Intuit


Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers.

Problem: Duplicate Merchants

Vendor records from Company ABC and Company PQR:

name: The Windsor Press, Inc.
address: PO Box 465 6 North Third Street
city: Hamburg
state: PA
zip: 19526
phone: (610) 562-2267

name: The Windsor Press
address: P.O. Box 465 6 North 3rd St.
city: Hamburg
state: PA
zip: 19526-0465
phone: (610) 562-2267

Both of the above vendor records map to the D&B (Dun & Bradstreet) business:

ID: 002114902
Name: The Windsor-Press Inc
Street: 6 N 3rd St
City: Hamburg
State: PA
Zip: 19526-1502
Phone: (610)-562-2267
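The records above differ mostly in formatting (punctuation in the phone number, ZIP+4 versus 5-digit ZIP). A minimal sketch of how such fields could be reduced to comparable keys follows. The class and method names are hypothetical, and the slides do not describe the normalization the matchers actually use.

```java
// Hypothetical helper, not from the talk: reduces formatted fields to plain keys.
final class MerchantKeys {
    private MerchantKeys() {}

    /** Drops every non-digit character, so "(610)-562-2267" and "(610) 562-2267" agree. */
    static String digitsOnly(String value) {
        return value.replaceAll("[^0-9]", "");
    }

    /** Keeps at most the first five digits, so "19526-1502" and "19526" agree. */
    static String zip5(String zip) {
        String digits = digitsOnly(zip);
        return digits.substring(0, Math.min(5, digits.length()));
    }

    public static void main(String[] args) {
        // Phone numbers taken from the D&B record and the vendor records on the slide.
        System.out.println(digitsOnly("(610)-562-2267").equals(digitsOnly("(610) 562-2267")));
        System.out.println(zip5("19526-1502").equals(zip5("19526-0465")));
    }
}
```

Keys like these only make exact comparison possible; names and street addresses ("North Third Street" versus "N 3rd St") would still need fuzzier matching.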

Applications of Merchant Lookup


Backend Architecture

[Architecture diagram. Labels recoverable from the slide: Input Data; Loader; Full Table Scan; Various Matchers (Name, Address, Phone); Individual Matcher Scores; Score Combiner; Final Match Score; Merchant Splicer; Update; Applications; Internal Research Projects.]
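The architecture names a Score Combiner that turns individual matcher scores into a final match score, but the slides do not say how the scores are combined. The sketch below assumes a weighted average purely for illustration; the class name, the matcher names used as map keys, and the weights are all hypothetical.

```java
import java.util.Map;

// Hypothetical combiner: a weighted average is an assumption, not the talk's method.
final class ScoreCombiner {
    private ScoreCombiner() {}

    /**
     * Combines per-matcher scores (each expected in [0, 1]) into one score.
     * Matchers without a configured weight are ignored.
     */
    static double combine(Map<String, Double> scores, Map<String, Double> weights) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            Double weight = weights.get(entry.getKey());
            if (weight == null) {
                continue; // no weight configured for this matcher
            }
            weightedSum += weight * entry.getValue();
            totalWeight += weight;
        }
        return totalWeight == 0.0 ? 0.0 : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Illustrative weights and scores only.
        Map<String, Double> weights = Map.of("name", 2.0, "address", 1.0, "phone", 1.0);
        Map<String, Double> scores = Map.of("name", 1.0, "address", 0.5, "phone", 1.0);
        System.out.println(combine(scores, weights));
    }
}
```

Dividing by the total weight keeps the result in the same [0, 1] range as the inputs, which matches the shape of the scores shown in the Duplicates table (for example 0.998 and 0.78).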

Data Model - Tables in HBase

• Merchants: Master dataset of merchants
• Sangria_id: Unique id generation coordination across mapper processes
• Duplicates: Noting duplicate merchants after deduplication
• SnapshotMerchants: Merging into master dataset
• NewMerchants: The new merchant set that is to be added to the master dataset of merchants

Schema: Merchants

Row key: 25204939

Info (column family):
  name: Crepevine
  street: 367 University Avenue
  city: Palo Alto
  state: CA
  zip: 94031
  county: Santa Clara County
  country: United States of America
  website: www.crepevine.com
  phoneNumber: 16503233900
  latitude: 37.430211
  longitude: -122.098221
  source: internet
  mint_category: Food & Dining
  qbo_category: Restaurants
  NAICS: 722110
  SIC: 5182

Mapping (column family):
  sourcename: 10000048, 10000075

Schema: Duplicates

Row key     Info (column family)
10000043    25204921:0.998
10000048    25204939:0.78
10000075    25204939:0.95

Schema: Sangria_id

Row key     Info (column family)
default     seed:30000, comment:initial seed by vc of 1000
qbo         seed:20550000, comment:initial seed by kf of 20000000
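The Sangria_id table keeps a seed per source so that mapper processes can coordinate unique id generation. The slides do not show the mechanism. One common approach, assumed here, is for each mapper to atomically reserve a block of ids from the shared seed. The sketch models the seed cell with an in-memory `AtomicLong`; the class name and block size are hypothetical. Against HBase, the atomic add could plausibly be done with the client's `incrementColumnValue` call, which is also an assumption rather than something the talk states.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for a shared "seed" cell; not the talk's code.
final class IdBlockAllocator {
    private final AtomicLong seed;

    IdBlockAllocator(long initialSeed) {
        this.seed = new AtomicLong(initialSeed);
    }

    /**
     * Atomically reserves blockSize consecutive ids and returns the first one.
     * The caller owns the ids in [first, first + blockSize).
     */
    long reserveBlock(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        return seed.getAndAdd(blockSize);
    }

    /** Next id that has not yet been handed out. */
    long currentSeed() {
        return seed.get();
    }

    public static void main(String[] args) {
        IdBlockAllocator allocator = new IdBlockAllocator(30000L); // illustrative start value
        long firstMapperStart = allocator.reserveBlock(1000);
        long secondMapperStart = allocator.reserveBlock(1000);
        System.out.println(firstMapperStart + " " + secondMapperStart);
    }
}
```

Reserving ids in blocks means a mapper touches the shared counter once per block rather than once per record, and no two mappers can receive overlapping ranges.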

Optimizations (job level)

• For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
• Emitted a ‘put’ from the Mapper or Reducer instead of a regular HTable put
– Use context.write(rowKey, put)
• To make the full table scan faster (HBase read-only Hadoop jobs: deduping matchers, Solr index generator)
    scan.setCaching(500);
    scan.setCacheBlocks(false);
• Used a customized TableInputFormat while scanning (custom number of splits for map tasks)
    job.setInputFormatClass(CustomizedTableInputFormat.class);
– CustomizedTableInputFormat extends the TableInputFormat class and overrides the getSplits method
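The last bullet relies on a getSplits override to control how many map tasks scan the table, but the slides do not show the split logic. The sketch below illustrates only the range arithmetic such an override might use, in plain Java without HBase types. It assumes row keys can be treated as numbers (as the fixed-width numeric keys in the schema slides suggest); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical range arithmetic for a custom number of splits; not the talk's code.
final class KeyRangeSplitter {
    private KeyRangeSplitter() {}

    /**
     * Divides [startInclusive, endExclusive) into at most numSplits contiguous
     * ranges of near-equal size. Each element is {rangeStart, rangeEndExclusive}.
     * Intended for moderate key spans; span * numSplits must fit in a long.
     */
    static List<long[]> split(long startInclusive, long endExclusive, int numSplits) {
        if (numSplits <= 0 || endExclusive <= startInclusive) {
            throw new IllegalArgumentException("need numSplits > 0 and a non-empty range");
        }
        long span = endExclusive - startInclusive;
        int count = (int) Math.min(numSplits, span); // never more splits than keys
        List<long[]> ranges = new ArrayList<>(count);
        long rangeStart = startInclusive;
        for (int i = 1; i <= count; i++) {
            long rangeEnd = startInclusive + (span * i) / count;
            ranges.add(new long[] {rangeStart, rangeEnd});
            rangeStart = rangeEnd;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] range : split(0, 100, 4)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

In an actual job, each range would become the start and stop row of one input split inside the customized input format, so the number of map tasks follows the requested split count rather than the number of regions.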

Optimizations (code level)

• Storing frequently used column family and column names as byte arrays in a public interface
    public static final byte[] COLUMN_NAME = Bytes.toBytes("name");
    public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info");
• Utility class for getting values from hbase.client.Result
    HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME);

    public static String getColumnValue(Result result, byte[] type, byte[] columnName) {
        return Bytes.toString(result.getValue(type, columnName));
    }

• Result: writing a sample set of 31 million records into the HBase cluster went from 4 hours 37 minutes 47 seconds to 32 minutes 18 seconds, roughly an 8.6x speedup

Vrushali Channapattan, Intuit Data Group (BIO)


Thank You!

Schema: SnapshotMerchants

Row key: merge

Info (column family):
  first: 1336813613
  start: 1337029113
  end: 1337120100
  comments: merging qbo against dandbmerchants initiated on May 14th 2012
  outcome: started (or) merge run successful

Schema: NewMerchants

Same as Merchants.