How to improve the performance of Spark batch reads from HBase

2025-05-19 16:31:21

Recommended answers (1)

Answer 1:

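    // Imports use HBase 1.x package names; sc is an existing JavaSparkContext.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
    import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
    import org.apache.hadoop.hbase.util.Base64;
    import org.apache.spark.api.java.JavaPairRDD;

    Configuration conf = HBaseConfiguration.create();
    String tableName = "testTable";
    Scan scan = new Scan();
    scan.setCaching(10000);      // rows requested per scanner RPC
    scan.setCacheBlocks(false);  // keep a full scan from churning the block cache
    conf.set(TableInputFormat.INPUT_TABLE, tableName);
    ClientProtos.Scan proto = ProtobufUtil.toScan(scan);  // may throw IOException
    String scanToString = Base64.encodeBytes(proto.toByteArray());
    conf.set(TableInputFormat.SCAN, scanToString);
    JavaPairRDD<ImmutableBytesWritable, Result> myRDD = sc
            .newAPIHadoopRDD(conf, TableInputFormat.class,
                    ImmutableBytesWritable.class, Result.class);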
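
The Scan is handed to TableInputFormat as a Base64 string because a Hadoop Configuration stores string values only. The round trip can be sketched with the JDK alone. This is a minimal illustration and not HBase's own code: the class `ScanConfigRoundTrip`, the key `example.scan`, the stand-in byte array, and the `HashMap` used in place of a Configuration are all hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class ScanConfigRoundTrip {
    // Hypothetical key, used only in this sketch (the real key is TableInputFormat.SCAN).
    static final String SCAN_KEY = "example.scan";

    // Encode serialized bytes so they survive storage in a string-only config.
    static String encode(byte[] serialized) {
        return Base64.getEncoder().encodeToString(serialized);
    }

    // Recover the original bytes on the reading side.
    static byte[] decode(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        // Stand-in for proto.toByteArray(); real protobuf bytes are arbitrary binary.
        byte[] serializedScan = {0x08, 0x01, (byte) 0xFF, 0x00, 0x10};

        // A plain map stands in for the string-valued Hadoop Configuration.
        Map<String, String> conf = new HashMap<>();
        conf.put(SCAN_KEY, encode(serializedScan));

        byte[] restored = decode(conf.get(SCAN_KEY));
        System.out.println("round trip intact: " + Arrays.equals(serializedScan, restored));
    }
}
```

The point of the encoding is only transport: it changes nothing about how the scan executes, so it is not where the read time goes.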
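Using the standard Hadoop interface in the code above (TableInputFormat with newAPIHadoopRDD) to do a full-table read of an HBase table from Spark, reading about 500 million records takes 20+ minutes, whereas reading the same data stored in Hive takes under 1 minute. The performance gap is very large.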
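
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the row count is exactly 500 million, that the two timings are exactly 20 minutes and 1 minute, and that every scanner RPC returns a full batch of 10000 rows (in practice result-size limits and region boundaries can make batches smaller, so the RPC count is a lower bound). The class and method names are hypothetical.

```java
public class ScanThroughput {
    // Average rows processed per second (integer division, rounded down).
    static long rowsPerSecond(long rows, long seconds) {
        return rows / seconds;
    }

    // Minimum number of scanner RPCs if each call returns `caching` rows (ceiling division).
    static long minScannerRpcs(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 500_000_000L;   // assumed exact row count
        long hbaseSeconds = 20 * 60; // assumed: 20 minutes via TableInputFormat
        long hiveSeconds = 60;       // assumed: 1 minute from Hive

        System.out.println("HBase rows/s: " + rowsPerSecond(rows, hbaseSeconds));
        System.out.println("Hive rows/s:  " + rowsPerSecond(rows, hiveSeconds));
        System.out.println("Min scanner RPCs at caching=10000: "
                + minScannerRpcs(rows, 10000));
    }
}
```

Under these assumptions the HBase path averages roughly 0.4 million rows per second across the whole job, against roughly 8.3 million for Hive, a gap of about 20x.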
Reposted; for reference only.