News and WAT/WET compatibilities #32

Open · wants to merge 10 commits into main
Conversation

@jt55401 (Contributor) commented May 31, 2024

Modifications to the index-table routines to:

  1. Run more easily locally (devcontainer and plain Dockerfile)
  2. Make the logic which parses the filenames in the CDXJ compatible with WAT, WET, and news paths (a sketch follows this list)
  3. Add a test case for the above
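
For orientation, a minimal sketch of what point 2 amounts to, assuming the usual CC-NEWS WARC path layout crawl-data/CC-NEWS/YYYY/MM/CC-NEWS-<timestamp>-<serial>.warc.gz. The regex and the group-to-field mapping below are illustrative only; the PR's actual pattern and mapping are in the hunks reviewed further down.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NewsFilenameSketch {
        // Assumed CC-NEWS path layout; the PR's real pattern may differ.
        private static final Pattern NEWS = Pattern.compile(
                "CC-NEWS/(\\d{4})/(\\d{2})/CC-NEWS-(\\d{14}-\\d{5})\\.warc\\.gz");

        public static void main(String[] args) {
            String path = "crawl-data/CC-NEWS/2024/05/CC-NEWS-20240531123456-00123.warc.gz";
            Matcher m = NEWS.matcher(path);
            if (m.find()) {
                // One plausible mapping, consistent with the hunk discussed below:
                // crawl from year/month, segment from the timestamp-serial part.
                String crawl = String.format("CC-NEWS-%s-%s", m.group(1), m.group(2));
                String segment = m.group(3);
                System.out.println(crawl + " / " + segment);
            }
        }
    }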

@jt55401 requested a review from wumpus on May 31, 2024 01:58
@sebastian-nagel (Contributor) left a comment

Hi @jt55401, great! See my comments.

throw new FilenameParseError("Filename not parseable (tried normal and news): " + filename);
}
parts.crawl = String.format("CC-NEWS-%s-%s", newsParts.group(1), newsParts.group(2));
parts.segment = newsParts.group(3);
Contributor:

There is (now) no equivalent to the segment in the news crawl. I wouldn't use the WARC timestamp (the time when the WARC was started) and its serial number as an equivalent: the timestamps and serial numbers of the segments in the main crawl relate to the time of fetch-list generation, whereas in the news crawl they would relate to the fetch time.

If segment is left out, we would need to modify the schema for the news crawl because "segment" is defined as not nullable. Maybe it's better to provide a different schema than to try to force every data set into the same schema, independent of its scope, collection method and tools. This was one insight from adapting the CDX-to-Parquet converter for the End-of-Term archive; see eot-index-schema.json and EOTIndexTable.java.

Contributor Author:

Yes, I thought much about this as well.

I did it this way for the following reasons:

  • the redaction tools (and IMHO, all our tools) are going to be simpler if we're dealing with the same index format.
  • I think while news may have no natural "segment", I don't really see the harm in populating it with some potentially useful data. Perhaps I need to be made aware of what the harm might be. (What do people even use that field for?)
  • I also considered a single value, i.e. "0000000", but per the previous point, I think the way I did it is potentially a little more useful.

Perhaps it boils down to these options:

  • Leave my solution (probably not)
  • Populate only with serial number (seems a bit better based on what I think you're saying above)
  • Populate with static value (I'm not a fan of useless data, but I like it better than changing the schema)
  • New index format (makes many of our tools & future processes more complex "forever")

I'm going to ask a question below about news as well - that may be a part of the answer. See below.

Contributor:

See the discussion below about the subset "news-warc".

(what do people even use that field for?)

Only very few users use it, for example Henry Thompson in "Improved methodology ...". In order to use it, you need to know what it stands for: one part (one 100th) of the fetch list, with all data of one segment fetched together in the same time frame (as of now, 3 hours). We might explain it to our users, for example in the schema.

}
parts.crawl = String.format("CC-NEWS-%s-%s", newsParts.group(1), newsParts.group(2));
parts.segment = newsParts.group(3);
parts.subset = "news-warc";
Contributor:

Why shouldn't this just be "warc"? The value in the "crawl" partition column already makes it clear that it's the news crawl and not the main crawl.

Contributor Author:

Yes - I went back and forth on this a few times.
At the end of the day, it came down to a few comments Greg has made: we want to be extra careful about how users of the data may be sorting/filtering data, and not making changes that might upset their existing processes.

I'm not sure philosophically if we want to consider the news and main crawls "one thing" or if they should be separate. So, I chose a middle ground, where news is in the same index/schema, but it wouldn't likely get filtered into scope by anyone looking for the "warc" subset. (That's also why I chose to prefix with news rather than suffix, in case people might be doing "startswith warc" or "warc%" or something strange.)

I realize people COULD filter news in/out based on the crawl value - but, again, I erred on the side of making the default for existing users "out".

Thoughts?
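
For illustration, a hedged sketch of the filtering trade-off being weighed here, using the Spark Java API. The crawl and subset partition columns exist in the current table; the concrete queries are assumptions about how users typically filter, not taken from this thread.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SubsetFilterSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("subset-filter-sketch").getOrCreate();
            Dataset<Row> ccIndex = spark.read()
                    .parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/");

            // A typical existing query; with subset = "news-warc" it would keep excluding
            // news records by default:
            Dataset<Row> mainWarc = ccIndex.filter("subset = 'warc'");

            // Pulling news records in would then be an explicit opt-in, e.g. via the crawl column:
            Dataset<Row> newsWarc = ccIndex.filter("subset = 'news-warc' AND crawl LIKE 'CC-NEWS-%'");

            System.out.println(mainWarc.count() + " / " + newsWarc.count());
        }
    }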

Contributor:

we want to be extra careful about how users of the data may be sorting/filtering data, and not making changes that might upset their existing processes.

I fully agree. There shouldn't be any changes which cannot be handled by schema merging. That is, extending the table is possible, but not changing the semantics of existing columns.

My assumption was that any new table (for CC-NEWS, the old crawls' ARC files, but also the WAT/WET files) would break the semantics of the schema. That's why the location is:

s3://commoncrawl/cc-index/table/cc-main/warc/crawl=.../subset=.../

Consequently, a table for CC-NEWS would be on a different path. It would be a separate table which might use a different schema. Given the continuous release of the news crawl, the path pattern could be:

s3://commoncrawl/cc-index/table/cc-news/warc/year=.../month=.../

The news crawl has a different collection method: it's sampled by time, not on a global level, with no revisits.

In addition, the cc-main table is already quite big. Simply for practical reasons, I would keep the news crawl and especially the WAT/WET records in a separate table. It's also more practical to announce separate new tables to the users than to explain the extension of the existing one by new partitions, and how to rephrase queries so that the results do not change.

Over the long term, an independent schema also allows for schema upgrades which focus on the given dataset format or scope: text metrics for WET or news-related metadata (publication time, etc.).
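
To make the proposal concrete, a hypothetical sketch of reading such a separate CC-NEWS table with the Spark Java API. The path follows the pattern suggested above; the year/month partition columns and the selected column names are assumptions, not a fixed schema.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class NewsTableSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("cc-news-table-sketch").getOrCreate();

            Dataset<Row> news = spark.read()
                    .option("mergeSchema", "true")  // tolerate additive schema evolution
                    .parquet("s3a://commoncrawl/cc-index/table/cc-news/warc/");

            // Partition pruning on year/month keeps the scan limited to one month:
            news.filter("year = 2024 AND month = 5")
                .select("url", "fetch_time", "warc_filename")
                .show(10);
        }
    }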

Contributor Author:

OK - I understand what you are saying.

Still not sure I agree though - isn't it better for most users to have a comprehensive index of everything in one place?

@wumpus - I guess you've mentioned a few times these new indexes need not even be public - so, perhaps this is all a moot point if they're for internal use only.

I'm of the mind that a single index for all data is most useful, and there's little harm (and likely benefit) in making them public as well, but if that is considered ill-advised, just let me know which account the CDX files should end up in: our private/commoncrawl one, or the public dataset one.

Furthermore - if anyone has preferences on the final bucket/location/path within S3, please do let me know.

I'm going to try to do a few small test runs of the cdx job early this week, with a goal of getting a larger job running later in the week.

@@ -60,6 +60,8 @@ $SPARK_HOME/bin/spark-submit \
--conf spark.sql.hive.metastorePartitionPruning=true \
--conf spark.hadoop.parquet.enable.summary-metadata=false \
--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS \
--conf "spark.driver.userClassPathFirst=true" \
--conf "spark.executor.userClassPathFirst=true" \
Contributor:

Hmhh. Might be safe, because cc-index-table ships with very few dependencies, given that Spark is "provided" and the AWS SDK shades all transitive dependencies. In general, I'd be very cautious with this configuration option, which is still marked as "experimental": conflicts between transitive dependencies from the user jar and those used/required by Spark may cause errors (typically related to serialization or inter-process communication) which are difficult to trace.

Contributor Author (@jt55401, May 31, 2024):

I could not get the job to run to completion without these lines: the Spark environment (as seen in the Dockerfile) was overriding our libraries, and it was causing this error:

Caused by: java.lang.NoSuchMethodError: 'com.google.gson.JsonElement com.google.gson.JsonParser.parseReader(java.io.Reader)'

If you have documentation or hints you can give on how the real environment is set up, I'd appreciate a pointer; I'm happy to adjust.
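
For context on the error itself: JsonParser.parseReader(Reader) is a static method that only exists in newer Gson releases, while older Gson only offers the instance method parse(Reader). If an older gson jar from the Spark distribution wins on the classpath, code compiled against the newer API fails with exactly this NoSuchMethodError. A minimal illustration; which versions this particular Spark setup ships is an assumption here.

    import com.google.gson.JsonElement;
    import com.google.gson.JsonParser;
    import java.io.Reader;
    import java.io.StringReader;

    public class GsonCompatSketch {
        static JsonElement parse(Reader reader) {
            // Newer Gson: the static method the stack trace refers to.
            //   return JsonParser.parseReader(reader);
            // Older Gson: only the instance method exists (kept, but deprecated, in newer releases).
            return new JsonParser().parse(reader);
        }

        public static void main(String[] args) {
            System.out.println(parse(new StringReader("{\"ok\":true}")));
        }
    }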

Contributor:

I know, I've also run into the gson issue. The version used must be the same as (or API-compatible with) that provided by Spark. As said, I'm mostly optimistic that there will be no regressions with the user class path first. If there are (to be tested), we can roll this back later.

Alternatively, the script already allows adding extra Spark options via the environment variable SPARK_EXTRA_OPTS:

SPARK_EXTRA_OPTS="--conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true" ./src/script/convert_url_index.sh ...

Contributor:

The setup of Spark and the job configuration is in the crawl-tools repository. Ping me if more information is needed.

public static FilenameParts getParts(String filename) throws FilenameParseError {
FilenameParts parts = new FilenameParts();
Matcher m = filenameAnalyzer.matcher(filename);
if(m.find()){
Contributor:

Nit-picking: this project adds a space before opening parentheses or brackets. The default Eclipse code style is used. Shall we add a code formatter to the build?

Contributor Author:

default Eclipse code style is used

Ah, sure. Do you know (or can you provide a public URL for) what the default Eclipse formatter settings are?
Is it the Google one, or something different?
https://raw.githubusercontent.com/google/styleguide/gh-pages/eclipse-java-google-style.xml

Contributor Author:

Or, food for thought for a broader discussion: can we use EditorConfig so any IDE can work well in every language?
https://editorconfig.org/ Probably not a topic to discuss here, though.

Contributor:

Thanks for the pointer. I'll have a look. But, yes, let's discuss it separately.

@wumpus (Member) commented May 31, 2024

The main purpose of the news index is to checksum it for integrity reasons. I haven't given any thought to making the Parquet available, where people might want a single table in Athena.

@wumpus (Member) commented Jun 1, 2024

Why don't we just create the CDXJ index for news for now -- that's enough for integrity. Then we can kick the can of a potential future columnar index down the road.

@sebastian-nagel (Contributor) left a comment

Maybe a general remark: this PR accumulates more and more changes, which makes reviewing it difficult. Included are changes which may break the production version that converts the monthly CDX index files to the Parquet index. These will require lengthy and exhaustive testing.

So, would it make sense to split this PR into multiple smaller ones, each with a specific focus? E.g., the Dockerfile, Spark upgrade, Java 11 to fit with the latest crawler-commons releases, etc.

@@ -386,7 +386,8 @@ public void run(String inputPaths, String outputPath, Function<String, Row> mapI
LOG.info("Function to map CDX entries to table rows: {}", mapIndexEntries);

JavaRDD<String> input = spark.read().textFile(inputPaths).toJavaRDD();
JavaRDD<Row> output = input.map(mapIndexEntries);
JavaRDD<Row> output = input.map(mapIndexEntries)
Contributor:

... and then skipping them from the table?

If there are unparsable URIs in the CDX input, this is a serious issue which needs to be fixed. And if you do not read the executor log files, this would silently allow those issues to be overlooked. As a further consequence, the number of rows in the table would differ from the number of records in the CDX index. These numbers (and also the number of page captures shown in cc-crawl-statistics) must match.

Contributor Author:

This should also be removed, so the job fails if we have bad rows.
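
A hedged sketch of the fail-fast behaviour being agreed on here, not the PR's actual code: keep the mapping step without any filter and let an unparsable CDX record throw, so the Spark job fails and the table row count always matches the CDX record count. It assumes the mapper returns null (or throws) for bad input.

    // Inside run(...), using the mapIndexEntries function passed in above:
    JavaRDD<Row> output = input.map(line -> {
        Row row = mapIndexEntries.call(line);
        if (row == null) {
            // Fail the job instead of silently dropping the record.
            throw new IllegalArgumentException("Unparsable CDX record: " + line);
        }
        return row;
    });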

@@ -109,6 +110,10 @@ public static Row convertCdxLine(String line) {
cdx.crawl, cdx.subset);
} else {
Row h = cdx.uri.getHostName().asRow();
if( h.get(0) == null ) {
Contributor:

What is the objective of first accepting CDX input records with a URI without a hostname ...

Contributor Author (@jt55401, Sep 26, 2024):

Jason will re-test this as well; we should probably remove this. (It's better to let the job fail because we had bad data, etc.)

@@ -65,6 +65,10 @@ public CdxLine(String line) throws IOException {
length = getInt("length");
status = getHttpStatus("status");

crawl = "unknown";
Contributor:

If we need to support data sets which are not partitioned by crawl, subset and segment, this could also be done by providing a further schema and an adapted table converter; cf. the EOT schema and EOT converter.

Contributor Author:

I will try removing this - it was done for testing, and may not be needed now that I mirror the structure of our crawl data during testing.
