News and WAT/WET compatibilities #32

Open · wants to merge 10 commits into main
Conversation

@jt55401 (Contributor) commented May 31, 2024

Modifications to the index-table routines to:

  1. Run more easily locally (devcontainer and plain Dockerfile)
  2. Make the logic which parses the filenames in the CDXJ compatible with WAT, WET, and news paths (a sketch follows this list)
  3. Add a test case for the above
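
For orientation, a minimal sketch of what point 2 amounts to, assuming the usual CC-NEWS WARC path layout crawl-data/CC-NEWS/YYYY/MM/CC-NEWS-<timestamp>-<serial>.warc.gz. The regex and the group-to-field mapping below are illustrative only; the PR's actual pattern and mapping are in the hunks reviewed further down.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NewsFilenameSketch {
        // Assumed CC-NEWS path layout; the PR's real pattern may differ.
        private static final Pattern NEWS = Pattern.compile(
                "CC-NEWS/(\\d{4})/(\\d{2})/CC-NEWS-(\\d{14}-\\d{5})\\.warc\\.gz");

        public static void main(String[] args) {
            String path = "crawl-data/CC-NEWS/2024/05/CC-NEWS-20240531123456-00123.warc.gz";
            Matcher m = NEWS.matcher(path);
            if (m.find()) {
                // One plausible mapping, consistent with the hunk discussed below:
                // crawl from year/month, segment from the timestamp-serial part.
                String crawl = String.format("CC-NEWS-%s-%s", m.group(1), m.group(2));
                String segment = m.group(3);
                System.out.println(crawl + " / " + segment);
            }
        }
    }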

@jt55401 requested a review from wumpus on May 31, 2024 01:58
@sebastian-nagel (Contributor) left a comment

Hi @jt55401, great! See my comments.

throw new FilenameParseError("Filename not parseable (tried normal and news): " + filename);
}
parts.crawl = String.format("CC-NEWS-%s-%s", newsParts.group(1), newsParts.group(2));
parts.segment = newsParts.group(3);
Contributor:

There is (now) no equivalent to the segment in the news crawl. I wouldn't use the WARC timestamp (the time when the WARC was started) and its serial number as an equivalent: the timestamps and serial numbers of the segments in the main crawl relate to the time of fetch-list generation, whereas in the news crawl they would relate to the fetch time.

If segment is left out, we would need to modify the schema for the news crawl because "segment" is defined as not nullable. Maybe it's better to provide a different schema than to try to force every data set into the same schema, independent of its scope, collection method and tools. This was one insight from adapting the CDX-to-Parquet converter for the End-of-Term archive; see eot-index-schema.json and EOTIndexTable.java.

Contributor Author:

Yes, I thought much about this as well.

I did it this way for the following reasons:

  • the redaction tools (and IMHO, all our tools) are going to be simpler if we're dealing with the same index format.
  • I think while news may have no natural "segment", I don't really see the harm in populating it with some potentially useful data. Perhaps I need to be made aware of what the harm might be. (What do people even use that field for?)
  • I also considered a single value, i.e. "0000000", but per the previous point, I think the way I did it is potentially a little more useful.

Perhaps it boils down to these options:

  • Leave my solution (probably not)
  • Populate only with serial number (seems a bit better based on what I think you're saying above)
  • Populate with static value (I'm not a fan of useless data, but I like it better than changing the schema)
  • New index format (makes many of our tools & future processes more complex "forever")

I'm going to ask a question below about news as well - that may be a part of the answer. See below.

Contributor:

See the discussion below about the subset "news-warc".

(what do people even use that field for?)

Only very few users use it, for example Henry Thompson in "Improved methodology ...". In order to use it, you need to know what it stands for: one part (one 100th) of the fetch list, with all data of one segment fetched together in the same time frame (as of now, 3 hours). We might explain it to our users, for example in the schema.

}
parts.crawl = String.format("CC-NEWS-%s-%s", newsParts.group(1), newsParts.group(2));
parts.segment = newsParts.group(3);
parts.subset = "news-warc";
Contributor:

Why shouldn't this just be "warc"? The value in the "crawl" partition column already makes it clear that it's the news crawl and not the main crawl.

Contributor Author:

Yes - I went back and forth on this a few times.
At the end of the day, it came down to a few comments Greg has made: we want to be extra careful about how users of the data may be sorting/filtering data, and not making changes that might upset their existing processes.

I'm not sure philosophically if we want to consider the news and main crawls "one thing" or if they should be separate. So, I chose a middle ground, where news is in the same index/schema, but it wouldn't likely get filtered into scope by anyone looking for the "warc" subset. (That's also why I chose to prefix with news rather than suffix, in case people might be doing "startswith warc" or "warc%" or something strange.)

I realize people COULD filter news in/out based on the crawl value - but, again, I erred on the side of making the default for existing users "out".

Thoughts?
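
For illustration, a hedged sketch of the filtering trade-off being weighed here, using the Spark Java API. The crawl and subset partition columns exist in the current table; the concrete queries are assumptions about how users typically filter, not taken from this thread.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SubsetFilterSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("subset-filter-sketch").getOrCreate();
            Dataset<Row> ccIndex = spark.read()
                    .parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/");

            // A typical existing query; with subset = "news-warc" it would keep excluding
            // news records by default:
            Dataset<Row> mainWarc = ccIndex.filter("subset = 'warc'");

            // Pulling news records in would then be an explicit opt-in, e.g. via the crawl column:
            Dataset<Row> newsWarc = ccIndex.filter("subset = 'news-warc' AND crawl LIKE 'CC-NEWS-%'");

            System.out.println(mainWarc.count() + " / " + newsWarc.count());
        }
    }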

Contributor:

we want to be extra careful about how users of the data may be sorting/filtering data, and not making changes that might upset their existing processes.

I fully agree. There shouldn't be any changes which cannot be handled by schema merging. That is, extending the table is possible, but not changing the semantics of existing columns.

My assumption was that any new table (for CC-NEWS, the old crawls' ARC files, but also the WAT/WET files) would break the semantics of the schema. That's why the location is:

s3://commoncrawl/cc-index/table/cc-main/warc/crawl=.../subset=.../

Consequently, a table for CC-NEWS would be on a different path. It would be a separate table which might use a different schema. Given the continuous release of the news crawl, the path pattern could be:

s3://commoncrawl/cc-index/table/cc-news/warc/year=.../month=.../

The news crawl has a different collection method: it's sampled by time, not on a global level, with no revisits.

In addition, the cc-main table is already quite big. Simply for practical reasons, I would keep the news crawl and especially the WAT/WET records in a separate table. It's also more practical to announce separate new tables to the users than to explain the extension of the existing one by new partitions, and how to rephrase queries so that the results do not change.

Over the long term, an independent schema also allows for schema upgrades which focus on the given dataset format or scope: text metrics for WET or news-related metadata (publication time, etc.).
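
To make the proposal concrete, a hypothetical sketch of reading such a separate CC-NEWS table with the Spark Java API. The path follows the pattern suggested above; the year/month partition columns and the selected column names are assumptions, not a fixed schema.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class NewsTableSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("cc-news-table-sketch").getOrCreate();

            Dataset<Row> news = spark.read()
                    .option("mergeSchema", "true")  // tolerate additive schema evolution
                    .parquet("s3a://commoncrawl/cc-index/table/cc-news/warc/");

            // Partition pruning on year/month keeps the scan limited to one month:
            news.filter("year = 2024 AND month = 5")
                .select("url", "fetch_time", "warc_filename")
                .show(10);
        }
    }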

Contributor Author:

OK - I understand what you are saying.

Still not sure I agree though - isn't it better for most users to have a comprehensive index of everything in one place?

@wumpus - I guess you've mentioned a few times these new indexes need not even be public - so, perhaps this is all a moot point if they're for internal use only.

I'm of the mind that a single index for all data is most useful, and there's little harm (and likely benefit) in making them public as well, but if that is considered ill-advised, just let me know which account the CDX files should end up in: our private/commoncrawl one, or the public dataset one.

Furthermore - if anyone has preferences on the final bucket/location/path within S3, please do let me know.

I'm going to try to do a few small test runs of the cdx job early this week, with a goal of getting a larger job running later in the week.

@@ -60,6 +60,8 @@ $SPARK_HOME/bin/spark-submit \
--conf spark.sql.hive.metastorePartitionPruning=true \
--conf spark.hadoop.parquet.enable.summary-metadata=false \
--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS \
--conf "spark.driver.userClassPathFirst=true" \
--conf "spark.executor.userClassPathFirst=true" \
Contributor:

Hmhh. Might be safe, because cc-index-table ships with very few dependencies, given that Spark is "provided" and the AWS SDK shades all transitive dependencies. In general, I'd be very cautious with this configuration option, which is still marked as "experimental": conflicts between transitive dependencies from the user jar and those used/required by Spark may cause errors (typically related to serialization or inter-process communication) which are difficult to trace.

Contributor Author (@jt55401, May 31, 2024):

I could not get the job to run to completion without these lines: the Spark environment (as seen in the Dockerfile) was overriding our libraries, and it was causing this error:

Caused by: java.lang.NoSuchMethodError: 'com.google.gson.JsonElement com.google.gson.JsonParser.parseReader(java.io.Reader)'

If you have documentation or hints you can give on how the real environment is set up, I'd appreciate a pointer; I'm happy to adjust.
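
For context on the error itself: JsonParser.parseReader(Reader) is a static method that only exists in newer Gson releases, while older Gson only offers the instance method parse(Reader). If an older gson jar from the Spark distribution wins on the classpath, code compiled against the newer API fails with exactly this NoSuchMethodError. A minimal illustration; which versions this particular Spark setup ships is an assumption here.

    import com.google.gson.JsonElement;
    import com.google.gson.JsonParser;
    import java.io.Reader;
    import java.io.StringReader;

    public class GsonCompatSketch {
        static JsonElement parse(Reader reader) {
            // Newer Gson: the static method the stack trace refers to.
            //   return JsonParser.parseReader(reader);
            // Older Gson: only the instance method exists (kept, but deprecated, in newer releases).
            return new JsonParser().parse(reader);
        }

        public static void main(String[] args) {
            System.out.println(parse(new StringReader("{\"ok\":true}")));
        }
    }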

Contributor:

I know, I've also run into the gson issue. The version used must be the same as (or API-compatible with) that provided by Spark. As said, I'm mostly optimistic that there will be no regressions with the user class path first. If there are (to be tested), we can roll this back later.

Alternatively, the script already allows adding extra Spark options via the environment variable SPARK_EXTRA_OPTS:

SPARK_EXTRA_OPTS="--conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true" ./src/script/convert_url_index.sh ...

Contributor:

The setup of Spark and the job configuration is in the crawl-tools repository. Ping me if more information is needed.

public static FilenameParts getParts(String filename) throws FilenameParseError {
FilenameParts parts = new FilenameParts();
Matcher m = filenameAnalyzer.matcher(filename);
if(m.find()){
Contributor:

Nit-picking: this project adds a space before opening parentheses or brackets. The default Eclipse code style is used. Shall we add a code formatter to the build?

Contributor Author:

default Eclipse code style is used

Ah, sure. Do you know (or can you provide a public URL for) what the default Eclipse formatter settings are?
Is it the Google one, or something different?
https://raw.githubusercontent.com/google/styleguide/gh-pages/eclipse-java-google-style.xml

Contributor Author:

Or, food for thought for a broader discussion: can we use EditorConfig so any IDE can work well in every language?
https://editorconfig.org/ Probably not a topic to discuss here, though.

Contributor:

Thanks for the pointer. I'll have a look. But, yes, let's discuss it separately.

@wumpus (Member) commented May 31, 2024

The main purpose of the news index is to checksum it for integrity reasons. I haven't given any thought to making the Parquet available, where people might want a single table in Athena.

@wumpus (Member) commented Jun 1, 2024

Why don't we just create the CDXJ index for news for now -- that's enough for integrity. Then we can kick the can of a potential future columnar index down the road.

@sebastian-nagel (Contributor) left a comment

Maybe a general remark: this PR accumulates more and more changes, which makes reviewing it difficult. Included are changes which may break the production version that converts the monthly CDX index files to the Parquet index. These will require lengthy and exhaustive testing.

So, would it make sense to split this PR into multiple smaller ones, each with a specific focus? E.g., the Dockerfile, Spark upgrade, Java 11 to fit with the latest crawler-commons releases, etc.

@@ -386,7 +386,8 @@ public void run(String inputPaths, String outputPath, Function<String, Row> mapI
LOG.info("Function to map CDX entries to table rows: {}", mapIndexEntries);

JavaRDD<String> input = spark.read().textFile(inputPaths).toJavaRDD();
JavaRDD<Row> output = input.map(mapIndexEntries);
JavaRDD<Row> output = input.map(mapIndexEntries)
Contributor:

... and then skipping them from the table?

If there are unparsable URIs in the CDX input, this is a serious issue which needs to be fixed. And if you do not read the executor log files, this would silently allow those issues to be overlooked. As a further consequence, the number of rows in the table would differ from the number of records in the CDX index. These numbers (and also the number of page captures shown in cc-crawl-statistics) must match.

Contributor Author:

This should also be removed, so the job fails if we have bad rows.
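
A hedged sketch of the fail-fast behaviour being agreed on here, not the PR's actual code: keep the mapping step without any filter and let an unparsable CDX record throw, so the Spark job fails and the table row count always matches the CDX record count. It assumes the mapper returns null (or throws) for bad input.

    // Inside run(...), using the mapIndexEntries function passed in above:
    JavaRDD<Row> output = input.map(line -> {
        Row row = mapIndexEntries.call(line);
        if (row == null) {
            // Fail the job instead of silently dropping the record.
            throw new IllegalArgumentException("Unparsable CDX record: " + line);
        }
        return row;
    });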

@@ -109,6 +110,10 @@ public static Row convertCdxLine(String line) {
cdx.crawl, cdx.subset);
} else {
Row h = cdx.uri.getHostName().asRow();
if( h.get(0) == null ) {
Contributor:

What is the objective of first accepting CDX input records with a URI without a hostname ...

Contributor Author (@jt55401, Sep 26, 2024):

Jason will re-test this as well; we should probably remove this. (It's better to let the job fail because we had bad data, etc.)

@@ -65,6 +65,10 @@ public CdxLine(String line) throws IOException {
length = getInt("length");
status = getHttpStatus("status");

crawl = "unknown";
Contributor:

If we need to support data sets which are not partitioned by crawl, subset and segment, this could also be done by providing a further schema and an adapted table converter; cf. the EOT schema and EOT converter.

Contributor Author:

I will try removing this - it was done for testing, and may not be needed now that I mirror the structure of our crawl data during testing.
