Copyright © 2013 NTT DATA Corporation 10/28/2013 NTT DATA Corporation Masatake Iwasaki Complex stories about Sqooping PostgreSQL data Presentation slide for Sqoop User Meetup (Strata + Hadoop World NYC 2013)


Page 1: Complex stories about Sqooping PostgreSQL data · Title: complex-stories-about-sqooping-postgresql-data.pptx Author: Masatake Iwasaki Created Date: 11/4/2013 3:34:46 PM

Copyright © 2013 NTT DATA Corporation

10/28/2013
NTT DATA Corporation
Masatake Iwasaki

Complex stories about Sqooping PostgreSQL data

Presentation slide for Sqoop User Meetup (Strata + Hadoop World NYC 2013)

Page 2:

Introduction

Page 3:

About Me

Masatake Iwasaki:
  Software Engineer @ NTT DATA
    NTT (Nippon Telegraph and Telephone Corporation): Telecommunication
    NTT DATA: Systems Integrator

Developed:
  Ludia: Fulltext search index for PostgreSQL using Senna

Authored:
  "A Complete Primer for Hadoop" (no official English title)

Patches for Sqoop:
  SQOOP-390: PostgreSQL connector for direct export with pg_bulkload
  SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM
  SQOOP-1155: Sqoop 2 documentation for connector development

Page 4:

Why PostgreSQL?

Enterprisey
  from earlier versions, compared to MySQL

Active community in Japan

NTT DATA commits itself to development

Page 5:

Sqooping PostgreSQL data

Page 6:

Working around PostgreSQL

Direct connector for PostgreSQL loader:
  SQOOP-390: PostgreSQL connector for direct export with pg_bulkload

Yet another direct connector for PostgreSQL JDBC:
  SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM

Supporting complex data types:
  SQOOP-1149: Support Custom Postgres Types

Page 7:

Direct connector for PostgreSQL loader

SQOOP-390: PostgreSQL connector for direct export with pg_bulkload

pg_bulkload:
  Data loader for PostgreSQL
  Server side plug-in library and client side command
  Providing filtering and transformation of data
  http://pgbulkload.projects.pgfoundry.org/

Page 8:

SQOOP-390: PostgreSQL connector for direct export with pg_bulkload

[Diagram: in HDFS, each File Split feeds one Mapper; each Mapper runs pg_bulkload (external process), which loads into that mapper's staging table (tmp1, tmp2, tmp3) in PostgreSQL; a Reducer then moves the staging tables into the Destination Table.]

Staging table creation (shown for tmp3):
    CREATE TABLE tmp3 (LIKE dest INCLUDING CONSTRAINTS)

Reducer:
    BEGIN
    INSERT INTO dest ( SELECT * FROM tmp1 )
    DROP TABLE tmp1
    INSERT INTO dest ( SELECT * FROM tmp2 )
    DROP TABLE tmp2
    INSERT INTO dest ( SELECT * FROM tmp3 )
    DROP TABLE tmp3
    COMMIT

A staging table per mapper is a must due to table level locks.
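The staging-table flow on this slide can be sketched as plain SQL generation. This is a minimal illustration only, not the SQOOP-390 code: the class and method names are hypothetical, table names are assumed to be trusted identifiers that need no quoting, and nothing is sent to a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds the SQL shown on the slide; it does not talk to a database. */
final class StagingMerge {

    /** Statement that creates one staging table shaped like the destination table. */
    static String createStaging(String dest, String staging) {
        // Assumption: names are trusted identifiers, so no quoting is applied.
        return "CREATE TABLE " + staging + " (LIKE " + dest + " INCLUDING CONSTRAINTS)";
    }

    /** Statements for one transaction that moves every staging table into dest. */
    static List<String> mergeStatements(String dest, List<String> stagingTables) {
        List<String> sql = new ArrayList<>();
        sql.add("BEGIN");
        for (String tmp : stagingTables) {
            sql.add("INSERT INTO " + dest + " ( SELECT * FROM " + tmp + " )");
            sql.add("DROP TABLE " + tmp);
        }
        sql.add("COMMIT");
        return sql;
    }

    public static void main(String[] args) {
        System.out.println(createStaging("dest", "tmp1"));
        for (String s : mergeStatements("dest", Arrays.asList("tmp1", "tmp2", "tmp3"))) {
            System.out.println(s);
        }
    }
}
```

Because every INSERT and DROP sits between a single BEGIN and COMMIT, the destination table receives either all staging data or none of it.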

Page 9:

Direct connector for PostgreSQL loader

Pros:
  Fast
    by short-circuiting server functionality
  Flexible
    filtering error records

Cons:
  Not so fast
    Bottleneck is not in client side but in DB side
    Built-in COPY functionality is fast enough
  Not general
    pg_bulkload supports only export
  Requiring setup on all slave nodes and client node
  Possible to require recovery on failure

Page 10:

Yet another direct connector for PostgreSQL JDBC

PostgreSQL provides a custom SQL command for data import/export

    COPY table_name [ ( column_name [, ...] ) ]
        FROM { 'filename' | STDIN }
        [ [ WITH ] ( option [, ...] ) ]

    COPY { table_name [ ( column_name [, ...] ) ] | ( query ) }
        TO { 'filename' | STDOUT }
        [ [ WITH ] ( option [, ...] ) ]

    where option can be one of:

        FORMAT format_name
        OIDS [ boolean ]
        DELIMITER 'delimiter_character'
        NULL 'null_string'
        HEADER [ boolean ]
        QUOTE 'quote_character'
        ESCAPE 'escape_character'
        FORCE_QUOTE { ( column_name [, ...] ) | * }
        FORCE_NOT_NULL ( column_name [, ...] )
        ENCODING 'encoding_name'

and a JDBC API: org.postgresql.copy.*
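As a rough illustration of the COPY ... FROM STDIN form above, the following sketch assembles such a statement as a string. The class name, method name, table name and option values are all hypothetical, and identifiers are assumed to be trusted (no quoting). With the PostgreSQL JDBC driver on the classpath, a string like this would typically be handed to the driver's CopyManager; the sketch itself only uses the JDK.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical builder for "COPY ... FROM STDIN WITH (...)" statements. */
final class CopyStatement {

    /**
     * Builds a COPY statement. Option values are inserted as given, so
     * string-valued options must already carry their single quotes.
     */
    static String copyFromStdin(String table, List<String> columns, Map<String, String> options) {
        StringBuilder sb = new StringBuilder("COPY ").append(table);
        if (!columns.isEmpty()) {
            sb.append(" (").append(String.join(", ", columns)).append(")");
        }
        sb.append(" FROM STDIN");
        if (!options.isEmpty()) {
            StringJoiner opts = new StringJoiner(", ", " WITH (", ")");
            for (Map.Entry<String, String> e : options.entrySet()) {
                opts.add(e.getKey() + " " + e.getValue());
            }
            sb.append(opts.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>(); // keeps insertion order
        options.put("FORMAT", "csv");
        options.put("DELIMITER", "','");
        options.put("NULL", "''");
        System.out.println(copyFromStdin("dest", Arrays.asList("id", "name"), options));
    }
}
```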

Page 11:

SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM

[Diagram: in HDFS, each File Split feeds one Mapper; each Mapper sends COPY FROM STDIN WITH ... through PostgreSQL JDBC into a staging table, then the Destination Table, in PostgreSQL.]

Using custom SQL command via JDBC API

only available in PostgreSQL

Page 12:

SQOOP-999: Support bulk load from HDFS to PostgreSQL using COPY ... FROM

    import org.postgresql.copy.CopyManager;
    import org.postgresql.copy.CopyIn;
    ...
    protected void setup(Context context)
    ...
      dbConf = new DBConfiguration(conf);
      CopyManager cm = null;
    ...
    public void map(LongWritable key, Writable value, Context context)
    ...
      if (value instanceof Text) {
        line.append(System.getProperty("line.separator"));
      }
      try {
        byte[] data = line.toString().getBytes("UTF-8");
        copyin.writeToCopy(data, 0, data.length);

Requiring PostgreSQL specific interface.

Just feeding lines of text
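Since the mapper just feeds lines of text, whoever produces those lines must follow the format COPY expects. The sketch below shows one way a record could be encoded for PostgreSQL's default COPY text format (tab-separated columns, `\N` for NULL, backslash escapes). The class and method names are hypothetical; this is not taken from the SQOOP-999 patch, which forwards lines as they are read.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical encoder for one row in PostgreSQL's default COPY text format. */
final class CopyTextLine {

    /** Escapes one field; null becomes the default NULL marker \N. */
    static String escape(String field) {
        if (field == null) {
            return "\\N";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // backslash must be doubled
                case '\t': sb.append("\\t"); break;  // tab is the column delimiter
                case '\n': sb.append("\\n"); break;  // newline is the row terminator
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Joins escaped fields with tabs, ends the row with a newline, and encodes as UTF-8. */
    static byte[] toCopyLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append('\t');
            }
            line.append(escape(fields.get(i)));
        }
        line.append('\n');
        return line.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = toCopyLine(Arrays.asList("1", null, "a\tb"));
        System.out.print(new String(data, StandardCharsets.UTF_8));
    }
}
```

The resulting byte array is the kind of buffer that the `writeToCopy(data, 0, data.length)` call on the slide would receive.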

Page 13:

Yet another direct connector for PostgreSQL JDBC

Pros:
  Fast enough
  Ease of use
    JDBC driver jar is distributed automatically by MR framework

Cons:
  Dependency on a non-generic (PostgreSQL specific) JDBC API
    possible licensing issue (PostgreSQL is OK, it's BSD License)
    build time requirement (PostgreSQL JDBC is available in Maven repo.)
        <dependency org="org.postgresql" name="postgresql" rev="${postgresql.version}" conf="common->default" />
  Error record causes rollback of whole transaction
  Still difficult to implement custom connector for IMPORT
    because of code generation part

Page 14:

Supporting complex data types

PostgreSQL supports a lot of complex data types

Geometric Types
  Points
  Line Segments
  Boxes
  Paths
  Polygons
  Circles

Network Address Types
  inet
  cidr
  macaddr

XML Type
JSON Type

Supporting complex data types:
  SQOOP-1149: Support Custom Postgres Types (not by me)

Page 15:

Constraints on JDBC data types in Sqoop framework

    protected Map<String, Integer> getColumnTypesForRawQuery(String stmt) {
      ...
      results = execute(stmt);
      ...
      ResultSetMetaData metadata = results.getMetaData();
      for (int i = 1; i < cols + 1; i++) {
        int typeId = metadata.getColumnType(i);

    public String toJavaType(int sqlType) {
      // Mappings taken from:
      // http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
      if (sqlType == Types.INTEGER) {
        return "Integer";
      } else if (sqlType == Types.VARCHAR) {
        return "String";
      ....
      } else {
        // TODO(aaron): Support DISTINCT, ARRAY, STRUCT, REF, JAVA_OBJECT.
        // Return null indicating database-specific manager should return a
        // java data type if it can find one for any nonstandard type.
        return null;

metadata.getColumnType(i) returns java.sql.Types.OTHER for types not mappable to basic Java data types
=> Losing type information

Such types reach the final else branch (return null).
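To make the limitation concrete, here is a minimal sketch of the mapping problem together with one possible workaround: falling back to the textual form of a value when JDBC reports `Types.OTHER`. The class name, the abbreviated mapping table and the String fallback are assumptions for illustration; this is not how Sqoop or SQOOP-1149 implements it. The database type name would typically come from `ResultSetMetaData.getColumnTypeName`.

```java
import java.sql.Types;

/** Hypothetical type resolver showing where Types.OTHER falls through. */
final class TypeFallback {

    /** Abbreviated standard mapping; returns null for anything it does not know. */
    static String toJavaType(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER: return "Integer";
            case Types.VARCHAR: return "String";
            case Types.BIGINT:  return "Long";
            case Types.BOOLEAN: return "Boolean";
            default:            return null; // Types.OTHER ends up here
        }
    }

    /**
     * Assumed workaround: when JDBC only reports OTHER but the database
     * still gives a type name (e.g. "inet"), carry the value as text.
     */
    static String toJavaType(int sqlType, String dbTypeName) {
        String mapped = toJavaType(sqlType);
        if (mapped == null && sqlType == Types.OTHER && dbTypeName != null) {
            return "String";
        }
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(toJavaType(Types.INTEGER));       // standard type
        System.out.println(toJavaType(Types.OTHER));         // type information lost
        System.out.println(toJavaType(Types.OTHER, "inet")); // text fallback
    }
}
```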

Page 16:

Sqoop 1 Summary

Pros:
  Simple standalone MapReduce driver
  Easy to understand for MR application developers,
    except for the ORM (SqoopRecord) code generation part
  Variety of connectors
  Lots of information

Cons:
  Complex command line and inconsistent options
    meaning of options depends on the connector
  Not modular enough
  Dependency on JDBC data model
  Security

Page 17:

Sqooping PostgreSQL Data 2

Page 18:

Sqoop 2

Everything is rewritten
Working on server side
More modular

Not compatible with Sqoop 1 at all
(Almost) only generic connector
Black box compared to Sqoop 1
Needs more documentation

SQOOP-1155: Sqoop 2 documentation for connector development

    Internal of Sqoop2 MapReduce Job
    ++++++++++++++++++++++++++++++++
    ...
    - OutputFormat invokes Loader's load method (via SqoopOutputFor
    .. todo: sequence diagram like figure.

Page 19:

Sqoop2: Initialization phase of IMPORT job

      ,----------------.          ,-----------.
      |SqoopInputFormat|          |Partitioner|
      `-------+--------'          `-----+-----'
   getSplits  |                         |
  ----------->|                         |
              |      getPartitions      |
              |------------------------>|
              |                         |         ,---------.
              |                         |-------> |Partition|
              |                         |         `----+----'
              |<- - - - - - - - - - - - |              |
              |                         |              |          ,----------.
              |-------------------------------------------------->|SqoopSplit|
              |                         |              |          `----+-----'

Implement this
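The partitioning step in this diagram can be pictured as splitting the source data into independent ranges, one per split. The sketch below divides an inclusive integer key range into at most n contiguous partitions. It is a generic illustration under stated assumptions: the class and method names are hypothetical, the Sqoop 2 Partitioner interface is not used, and ranges close to the limits of `long` are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical range splitter; each long[] is an inclusive {low, high} pair. */
final class RangePartitions {

    /** Splits [min, max] into at most n contiguous, non-overlapping ranges. */
    static List<long[]> split(long min, long max, int n) {
        if (n < 1 || max < min) {
            throw new IllegalArgumentException("need n >= 1 and max >= min");
        }
        List<long[]> parts = new ArrayList<>();
        long total = max - min + 1; // assumption: no overflow for the given range
        long base = total / n;      // minimum size of each partition
        long extra = total % n;     // the first 'extra' partitions get one more key
        long lo = min;
        for (int i = 0; i < n; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break; // fewer keys than partitions requested
            }
            long hi = lo + size - 1;
            parts.add(new long[] {lo, hi});
            lo = hi + 1;
        }
        return parts;
    }

    public static void main(String[] args) {
        for (long[] p : split(1, 10, 3)) {
            System.out.println(p[0] + ".." + p[1]);
        }
    }
}
```

Each resulting range could then back one split, so that each map task reads only its own slice of the table.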

Page 20:

Sqoop2: Map phase of IMPORT job

      ,-----------.
      |SqoopMapper|
      `-----+-----'
     run    |
  --------->|                                   ,-------------.
            |---------------------------------->|MapDataWriter|
            |                                   `------+------'
            |                ,---------.               |
            |--------------> |Extractor|               |
            |                `----+----'               |
            |      extract        |                    |
            |-------------------->|                    |
            |                     |                    |
           read from DB           |                    |
  <-------------------------------|      write*        |
            |                     |------------------->|
            |                     |                    |           ,----.
            |                     |                    |---------->|Data|
            |                     |                    |           `-+--'
            |                     |                    |
            |                     |                    |      context.write
            |                     |                    |-------------------------->

Conversion to Sqoop internal data format

Implement this

Page 21:

Sqoop2: Reduce phase of EXPORT job

    ,-------.  ,---------------------.
    |Reducer|  |SqoopNullOutputFormat|
    `---+---'  `----------+----------'
        |                 |   ,-----------------------------.
        |                 |-> |SqoopOutputFormatLoadExecutor|
        |                 |   `--------------+--------------'        ,----.
        |                 |                  |---------------------> |Data|
        |                 |                  |                       `-+--'
        |                 |                  |   ,-----------------.   |
        |                 |                  |-> |SqoopRecordWriter|   |
    getRecordWriter       |                  |   `--------+--------'   |
  ----------------------->| getRecordWriter  |            |            |
        |                 |----------------->|            |            |     ,--------------.
        |                 |                  |-----------------------------> |ConsumerThread|
        |                 |                  |            |            |     `------+-------'
        |                 |<- - - - - - - - -|            |            |            |    ,------.
  <- - - - - - - - - - - -|                  |            |            |            |--->|Loader|
        |                 |                  |            |            |            |    `--+---'
        |                 |                  |            |            |            |       |
        |                 |                  |            |            |            | load  |
    run |                 |                  |            |            |            |------>|
  ----->|                 |     write        |            |            |            |       |
        |------------------------------------------------>| setContent |            | read* |
        |                 |                  |            |----------->| getContent |<------|
        |                 |                  |            |            |<-----------|       |
        |                 |                  |            |            |            | - - ->|
        |                 |                  |            |            |            |       | write into DB
        |                 |                  |            |            |            |       |-------------->

Implement this

Page 22:

Summary

Page 23:

My interests for popularizing Sqoop 2

Complex data type support in Sqoop 2
Bridge to use Sqoop 1 connectors on Sqoop 2
Bridge to use Sqoop 2 connectors from Sqoop 1 CLI
