Commit 44e8e9b

Author: Shiva Verma
Merge branch 'release-8.0.6' into 8.0-master
2 parents e730eae + 3789cf5

35 files changed: +939 −377 lines

README.md

Lines changed: 13 additions & 2 deletions

@@ -19,6 +19,17 @@ The Hadoop Connector is an extension to Hadoop’s MapReduce framework that allo
 * Access MarkLogic text, geospatial, scalar, and document structure indexes to send only the most relevant data to Hadoop for processing
 * Write results from MapReduce jobs to MarkLogic in parallel
+
+## Release Note
+
+### What's New in mlcp and Hadoop Connector 8.0-6
+
+- mlcp distributed mode supports MapR 5.1, HDP 2.4 and CDH 5.8
+- significant performance improvement in archive export
+- mlcp honors user-specified InputFormat, OutputFormat and Mapper classes when creating jobs for import, export and copy
+- mlcp export now streams out binary documents
+- mlcp import now streams reading zip entries in compressed delimited texts, delimited JSON and aggregate XMLs
+- bug fixes
 
 ## Getting Started
 
 - [Getting Started with mlcp](http://docs.marklogic.com/guide/mlcp/getting-started)

@@ -55,7 +66,7 @@ The build writes to the respective **deliverable** directories under the top-lev
 
 Alternatively, you can build mlcp and the Hadoop Connector independently from each component’s root directory (i.e. `./mlcp/` and `./mapreduce/`) with the above command. *Note that mlcp depends on the Hadoop Connector*, so a successful build of the Hadoop Connector is required to build mlcp.
 
-For information on contributing to this project see [CONTRIBUTING.md](https://github.com/marklogic/marklogic-contentpump/blob/8.0-master/CONTRIBUTING.md). For information on working on development of this project see [project wiki page](https://github.com/marklogic/marklogic-contentpump/wiki).
+For information on contributing to this project see [CONTRIBUTING.md](https://github.com/marklogic/marklogic-contentpump/blob/8.0-develop/CONTRIBUTING.md). For information on working on development of this project see [project wiki page](https://github.com/marklogic/marklogic-contentpump/wiki).
 
 ## Tests
 

@@ -73,4 +84,4 @@ If you have questions about mlcp or the Hadoop Connector, ask on [StackOverflow]
 
 ## Support
 
-mlcp and the Hadoop Connector are maintained by MarkLogic Engineering and distributed under the [Apache 2.0 license](https://github.com/marklogic/marklogic-contentpump/blob/8.0-master/LICENSE). They are designed for use in production applications with MarkLogic Server. Everyone is encouraged [to file bug reports, feature requests, and pull requests through GitHub](https://github.com/marklogic/marklogic-contentpump/issues/new). This input is critical and will be carefully considered. However, we can’t promise a specific resolution or timeframe for any request. In addition, MarkLogic provides technical support for [release tags](https://github.com/marklogic/marklogic-contentpump/releases) of mlcp and the Hadoop Connector to licensed customers under the terms outlined in the [Support Handbook](http://www.marklogic.com/files/Mark_Logic_Support_Handbook.pdf). For more information or to sign up for support, visit [help.marklogic.com](http://help.marklogic.com).
+mlcp and the Hadoop Connector are maintained by MarkLogic Engineering and distributed under the [Apache 2.0 license](https://github.com/marklogic/marklogic-contentpump/blob/8.0-develop/LICENSE). They are designed for use in production applications with MarkLogic Server. Everyone is encouraged [to file bug reports, feature requests, and pull requests through GitHub](https://github.com/marklogic/marklogic-contentpump/issues/new). This input is critical and will be carefully considered. However, we can’t promise a specific resolution or timeframe for any request. In addition, MarkLogic provides technical support for [release tags](https://github.com/marklogic/marklogic-contentpump/releases) of mlcp and the Hadoop Connector to licensed customers under the terms outlined in the [Support Handbook](http://www.marklogic.com/files/Mark_Logic_Support_Handbook.pdf). For more information or to sign up for support, visit [help.marklogic.com](http://help.marklogic.com).

mapreduce/.gitignore

Lines changed: 2 additions & 0 deletions

@@ -1,3 +1,5 @@
 .classpath
 .project
 .settings
+target
+deliverable

mapreduce/pom.xml

Lines changed: 3 additions & 3 deletions

@@ -3,7 +3,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>com.marklogic</groupId>
   <artifactId>marklogic-mapreduce2</artifactId>
-  <version>2.1</version>
+  <version>2.1.6</version>
   <name>${mapreduce.product.name}</name>
   <description>MarkLogic Connector for Hadoop MapReduce</description>
   <url>https://github.com/marklogic/marklogic-contentpump</url>

@@ -18,12 +18,12 @@
   <!-- Global definitions -->
   <mapreduce.product.name>MarkLogic Connector for Hadoop</mapreduce.product.name>
   <mapreduce.product.name.short>MarkLogic Connector for Hadoop</mapreduce.product.name.short>
-  <version.number.string>2.1</version.number.string>
+  <version.number.string>2.1.6</version.number.string>
   <jar.version.number.string>${version.number.string}</jar.version.number.string>
   <date-string>${maven.build.timestamp}</date-string>
   <libdir>${basedir}/src/lib</libdir>
   <skipTests>false</skipTests>
-  <xccVersion>8.0</xccVersion>
+  <xccVersion>8.0.6</xccVersion>
   <deliverableName>Connector-for-Hadoop2</deliverableName>
   <!-- Static definitions of where things are relative to the root -->
   <java.source>src/main/java</java.source>

mapreduce/src/conf/log4j.properties

Lines changed: 1 addition & 1 deletion

@@ -5,4 +5,4 @@ log4j.appender.console.layout=org.apache.log4j.PatternLayout
 log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
 
 # To enable debug
-log4j.logger.com.marklogic.dom=INFO
+log4j.logger.com.marklogic.tree=TRACE
(Two binary files changed, 290 KB and 811 Bytes; binary contents not shown.)

mapreduce/src/main/java/com/marklogic/io/Decoder.java

Lines changed: 7 additions & 5 deletions

@@ -101,20 +101,22 @@ public void realign()
         }
     }
 
-    public void decode(int[] array, int i, int count) throws IOException {
+    public void decode(int[] array, int count) throws IOException {
         if (count <= 4) {
-            for (; i < count; i++) {
+            for (int i = 0; i < count; i++) {
                 array[i] = decode32bits();
             }
         } else {
+            int i = 0;
             realign();
             if (numBitsInReg==32) {
-                array[i++] = (int)reg;
+                ++i;
+                array[0] = (int)reg;
                 reg = 0;
                 numBitsInReg = 0;
             }
-            for (; i<count; ++i) {
-                if (load32(array, i)) break;
+            for ( ; i<count; ++i) {
+                if (!load32(array, i)) break;
             }
         }
     }

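The key fix in this hunk is the sense of the `load32` test: `load32` evidently reports whether a word was actually loaded, so the bulk loop must stop when it returns false, not true. A minimal, self-contained sketch of the corrected loop shape (the `DecodeSketch` class, its `ByteBuffer` source, and this word-at-a-time `load32` are hypothetical stand-ins for the real bit-packed decoder):

```java
import java.nio.ByteBuffer;

// Hypothetical miniature of Decoder.decode's bulk path: keep loading
// 32-bit words while load32 succeeds, and stop on the first failure.
public class DecodeSketch {
    private final ByteBuffer buf;

    public DecodeSketch(byte[] data) {
        this.buf = ByteBuffer.wrap(data);
    }

    // Returns true if a 32-bit word was loaded into array[i],
    // false when the input is exhausted.
    private boolean load32(int[] array, int i) {
        if (buf.remaining() < 4) return false;
        array[i] = buf.getInt();
        return true;
    }

    // Fill array[0..count) with decoded words; the corrected test
    // (!load32) exits the loop only when a load fails.
    public int decode(int[] array, int count) {
        int i = 0;
        for (; i < count; ++i) {
            if (!load32(array, i)) break;
        }
        return i; // number of words actually decoded
    }

    public static void main(String[] args) {
        byte[] data = {0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3};
        int[] out = new int[4];
        int n = new DecodeSketch(data).decode(out, 4);
        System.out.println(n + " words decoded");
    }
}
```

With the inverted test of the old code, such a loop would bail out after the first successful load, leaving the rest of the array unfilled.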
mapreduce/src/main/java/com/marklogic/mapreduce/ContentOutputFormat.java

Lines changed: 1 addition & 1 deletion

@@ -471,7 +471,7 @@ protected LinkedMapWritable queryForestInfo(ContentSource cs)
             }
         }
         if (forestStatusMap.size() == 0) {
-            throw new IOException("Number of forests is 0: "
+            throw new IOException("Target database has no forests attached: "
                 + "check forests in database");
         }
         am.initialize(policy, forestStatusMap, conf.getInt(BATCH_SIZE,10));

mapreduce/src/main/java/com/marklogic/mapreduce/DatabaseDocument.java

Lines changed: 44 additions & 5 deletions

@@ -17,6 +17,7 @@
 
 import java.io.ByteArrayInputStream;
 import java.io.DataInput;
+import java.io.DataInputStream;
 import java.io.DataOutput;
 import java.io.IOException;
 import java.io.InputStream;

@@ -27,6 +28,7 @@
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.WritableUtils;
 
+import com.marklogic.io.IOHelper;
 import com.marklogic.xcc.ResultItem;
 import com.marklogic.xcc.types.ValueType;
 import com.marklogic.xcc.types.XdmBinary;

@@ -38,10 +40,12 @@
  * @author jchen
  *
 */
-public class DatabaseDocument implements MarkLogicDocument {
+public class DatabaseDocument implements MarkLogicDocument,
+    InternalConstants {
     public static final Log LOG = LogFactory.getLog(
         DatabaseDocument.class);
     protected byte[] content;
+    protected InputStream is; // streaming binary
     protected ContentType contentType;
 
     public DatabaseDocument(){}

@@ -70,11 +74,23 @@ public Text getContentAsText() {
      * @see com.marklogic.mapreduce.MarkLogicDocument#getContentAsByteArray()
      */
     public byte[] getContentAsByteArray() {
+        if (content == null) {
+            try {
+                content = IOHelper.byteArrayFromStream(is);
+            } catch (IOException e) {
+                throw new RuntimeException("IOException buffering binary data",
+                    e);
+            }
+            is = null;
+        }
         return content;
     }
 
     @Override
     public InputStream getContentAsByteStream() {
+        if (is != null) {
+            return is;
+        }
         return new ByteArrayInputStream(getContentAsByteArray());
     }
 

@@ -116,7 +132,11 @@ public void set(ResultItem item){
             content = item.asString().getBytes("UTF-8");
             contentType = ContentType.TEXT;
         } else if (item.getValueType() == ValueType.BINARY) {
-            content = ((XdmBinary) item.getItem()).asBinaryData();
+            if (item.isCached()) {
+                content = ((XdmBinary) item.getItem()).asBinaryData();
+            } else {
+                is = item.asInputStream();
+            }
             contentType = ContentType.BINARY;
         } else if (item.getValueType() == ValueType.ARRAY_NODE ||
             item.getValueType() == ValueType.BOOLEAN_NODE ||

@@ -155,6 +175,10 @@ public void readFields(DataInput in) throws IOException {
         int ordinal = in.readInt();
         contentType = ContentType.valueOf(ordinal);
         int length = WritableUtils.readVInt(in);
+        if (length > MAX_BUFFER_SIZE) {
+            is = (DataInputStream)in;
+            return;
+        }
         content = new byte[length];
         in.readFully(content, 0, length);
     }

@@ -165,17 +189,32 @@ public void readFields(DataInput in) throws IOException {
     @Override
     public void write(DataOutput out) throws IOException {
         out.writeInt(contentType.ordinal());
-        WritableUtils.writeVInt(out, content.length);
-        out.write(content, 0, content.length);
+        if (content != null) {
+            WritableUtils.writeVInt(out, content.length);
+            out.write(content, 0, content.length);
+        } else if (is != null) {
+            content = new byte[MAX_BUFFER_SIZE];
+            int len = 0;
+            while ((len = is.read(content)) > 0) {
+                out.write(content, 0, len);
+            }
+        }
     }
 
     @Override
     public long getContentSize() {
-        return content.length;
+        if (content != null) {
+            return content.length;
+        } else {
+            return Integer.MAX_VALUE;
+        }
     }
 
     @Override
     public boolean isStreamable() {
+        if (content == null) {
+            return true;
+        }
         return false;
     }
 }

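The net effect of these hunks is that a `DatabaseDocument` can now hold either a fully buffered `byte[]` or a live `InputStream` for large binaries, and it buffers lazily only when a caller insists on a byte array. A small sketch of that lazy-buffering pattern, with a hypothetical `LazyDoc` class standing in for `DatabaseDocument` and a plain drain loop in place of `IOHelper.byteArrayFromStream`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical miniature of the streaming DatabaseDocument: exactly one of
// `content` (buffered) or `is` (streaming) is non-null at any time.
public class LazyDoc {
    private byte[] content;   // set when the value is small or cached
    private InputStream is;   // set when the value arrives as a stream

    public LazyDoc(InputStream is) {
        this.is = is;
    }

    // Drain the stream into a buffer on first byte-array access, then drop
    // the stream reference (mirrors getContentAsByteArray in the diff).
    public byte[] getContentAsByteArray() {
        if (content == null) {
            try {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int len;
                while ((len = is.read(buf)) != -1) {
                    bos.write(buf, 0, len);
                }
                content = bos.toByteArray();
            } catch (IOException e) {
                throw new RuntimeException("IOException buffering binary data", e);
            }
            is = null;
        }
        return content;
    }

    // Hand back the live stream when one exists; otherwise wrap the buffer.
    public InputStream getContentAsByteStream() {
        if (is != null) {
            return is;
        }
        return new ByteArrayInputStream(getContentAsByteArray());
    }

    // Streamable only while nothing has forced buffering yet.
    public boolean isStreamable() {
        return content == null;
    }
}
```

One design consequence visible in the diff itself: once the stream is drained it cannot be replayed, and `getContentSize()` can only report `Integer.MAX_VALUE` as a sentinel for an unknown streamed length.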
mapreduce/src/main/java/com/marklogic/mapreduce/ForestInputFormat.java

Lines changed: 0 additions & 1 deletion

@@ -66,7 +66,6 @@ public class ForestInputFormat<VALUE>
     extends FileInputFormat<DocumentURIWithSourceInfo, VALUE>
     implements MarkLogicConstants {
     public static final Log LOG = LogFactory.getLog(ForestInputFormat.class);
-    static final int STREAM_BUFFER_SIZE = 1 << 24;
 
     @Override
     public RecordReader<DocumentURIWithSourceInfo, VALUE> createRecordReader(
