Artwork Knowledge Panel parser #346

Closed
wants to merge 5 commits into from
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -49,3 +49,6 @@ build-iPhoneSimulator/
# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc
.DS_Store

# node
node_modules/
3 changes: 3 additions & 0 deletions .rspec
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
--format documentation
--color
--require spec_helper
4 changes: 4 additions & 0 deletions Gemfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
source 'https://rubygems.org'

gem 'nokogiri'
gem 'rspec'
51 changes: 51 additions & 0 deletions Gemfile.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
GEM
  remote: https://rubygems.org/
  specs:
    diff-lcs (1.6.2)
    nokogiri (1.18.8-aarch64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.8-aarch64-linux-musl)
      racc (~> 1.4)
    nokogiri (1.18.8-arm-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.8-arm-linux-musl)
      racc (~> 1.4)
    nokogiri (1.18.8-arm64-darwin)
      racc (~> 1.4)
    nokogiri (1.18.8-x86_64-darwin)
      racc (~> 1.4)
    nokogiri (1.18.8-x86_64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.18.8-x86_64-linux-musl)
      racc (~> 1.4)
    racc (1.8.1)
    rspec (3.13.1)
      rspec-core (~> 3.13.0)
      rspec-expectations (~> 3.13.0)
      rspec-mocks (~> 3.13.0)
    rspec-core (3.13.5)
      rspec-support (~> 3.13.0)
    rspec-expectations (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-mocks (3.13.5)
      diff-lcs (>= 1.2.0, < 2.0)
      rspec-support (~> 3.13.0)
    rspec-support (3.13.4)

PLATFORMS
  aarch64-linux-gnu
  aarch64-linux-musl
  arm-linux-gnu
  arm-linux-musl
  arm64-darwin
  x86_64-darwin
  x86_64-linux-gnu
  x86_64-linux-musl

DEPENDENCIES
  nokogiri
  rspec

BUNDLED WITH
   2.6.7
166 changes: 166 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -26,3 +26,169 @@ Add also to your array the painting thumbnails present in the result page file (
Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.

-------------------------------------------------------------------------------

# Solution

The page source contains the images for the artwork knowledge panel in two forms. Images above the fold, which are not
in the truncated part of the panel, are embedded in the page itself as base64-encoded data. Images that are not
initially visible carry links to the image files instead, which are fetched when the user expands the section.

Each artwork in the panel has the following structure:
![Artwork Structure](./files/artwork-structure.png)

```html
<div class="iELo6">
  <a href="{relativeURL}">
    <img
      id="for images with lazy rendering"
      class="taFZJe"
      src="base64 placeholder"
      data-src="remote url for image below the fold"
      alt="{name}"
      data-deferred="1 - for images with lazy rendering"
      // {...other attributes}
    />
    <div class="KHK6lb">
      <div class="pgNMRc">{name}</div>
      <!-- optional year -->
      <div class="cxzHyb">{year}</div>
    </div>
  </a>
</div>
```

## Implementation
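
As a rough illustration of how one entry with the structure above could become a record, here is a minimal Ruby
sketch. It assumes the name, relative URL and optional year have already been read from the elements shown. The
`build_artwork` helper, the key names and the `https://www.google.com` base URL are assumptions made for this example,
not the output shape used by the code in this PR.

```ruby
require 'uri'

# Hypothetical base used to resolve {relativeURL}; an assumption for this sketch.
BASE_URL = 'https://www.google.com'

# Builds one record from values read out of a single artwork entry.
# The key names are illustrative only.
def build_artwork(name:, relative_url:, year: nil, image: nil)
  record = {
    name: name.strip,
    link: URI.join(BASE_URL, relative_url).to_s,
    image: image
  }
  year = year.to_s.strip
  record[:year] = year unless year.empty? # the year <div> is optional
  record
end

p build_artwork(name: ' The Starry Night ', relative_url: '/search?q=The+Starry+Night', year: '1889')
p build_artwork(name: 'Untitled', relative_url: '/search?q=Untitled')
```

Leaving the `year` key out when the year `<div>` is missing keeps the optional field from showing up as an empty
string in the output.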

### Extracting Images

Extracting the other values is straightforward, but images require special handling because they come in two formats:

- For the remote images, the solution extracts the image URL from the `data-src` attribute.
- For the embedded images, the solution extracts the data from script tags. The source contains script tags with the
  base64-encoded image data for lazily rendered images. Each script tag can be matched to its artwork by comparing the
  `id` attribute on the image with the `ii` variable in the script. The image data is then read from the `s` variable
  in the same script tag.
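
The lookup described above can be sketched in plain Ruby as follows. This is a simplified, dependency-free
illustration rather than the code in `lib/extractor.rb`. It assumes the inline scripts have the form
`var s='data:image/...';var ii=['image_id'];` and that the data may contain JavaScript hex escapes such as `\x3d`. The
`EMBEDDED_IMAGE` pattern and both helper names are hypothetical.

```ruby
# Matches inline scripts of the assumed form:
#   var s='data:image/jpeg;base64,...';var ii=['image_id'];
EMBEDDED_IMAGE = /var\s+s\s*=\s*'(data:image[^']*)'\s*;\s*var\s+ii\s*=\s*\[([^\]]*)\]/

# Returns { image_id => data URI } for every matching script in the page source.
def embedded_images_by_id(html)
  html.scan(EMBEDDED_IMAGE).each_with_object({}) do |(data, ids), images|
    # Assumption: the script text may contain JS hex escapes such as \x3d for "=".
    data = data.gsub(/\\x(\h\h)/) { Regexp.last_match(1).hex.chr }
    ids.scan(/'([^']+)'/).flatten.each { |id| images[id] = data }
  end
end

# Picks the image for one <img>, given its attributes as a Hash:
# a remote URL in data-src wins, otherwise the embedded data is looked up by id.
def image_for(attrs, embedded)
  attrs['data-src'] || embedded[attrs['id']]
end

html = <<~'HTML'
  <img id="dimg_1" class="taFZJe" src="data:image/gif;base64,R0lGOD" alt="A">
  <script>(function(){var s='data:image/jpeg;base64,/9j/4AAQ\x3d\x3d';var ii=['dimg_1'];})();</script>
HTML

embedded = embedded_images_by_id(html)
puts image_for({ 'id' => 'dimg_1' }, embedded)
puts image_for({ 'data-src' => 'https://example.com/thumb.jpg' }, embedded)
```

Building the id-to-data map once per page keeps the per-image lookup a simple hash access. The placeholder in `src` is
never used as the result.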

## Code

Since I had no prior Ruby experience, I first implemented a solution in TypeScript, using the `cheerio` library for
HTML parsing.

After that I caught up on some Ruby basics and implemented a Ruby solution, using the TypeScript code as the
reference implementation.

### Files

```txt
├── bin
│   ├── extractor.rb - Ruby script to run the extractor
│   └── extractor.ts - TypeScript script to run the extractor
├── files - Test HTML files
│   ├── ...
│   ├── hokusai-artwork.html
│   ├── mc-escher-artwork.html
│   ├── van-gogh-paintings.html
│   └── ...
├── ...
├── lib
│   ├── extractor.rb - Ruby extractor functionality
│   └── extractor.ts - TypeScript extractor functionality
├── ...
├── spec
│   ├── extractor_spec.rb - Ruby RSpec tests for the extractor
│   ├── extractor.spec.ts - TypeScript tests using the node builtin test utils
│   └── spec_helper.rb
└── ...
```

## Running the code

### Setup

#### With Mise

If you have [Mise](https://github.com/jdx/mise), you can use it to set up the environment for both Node and Ruby.

```bash
mise install
```

#### Without Mise - Node

You need Node `23.6` or later (or Node `22.6` or later with the `--experimental-strip-types` flag) to run the
TypeScript code. Once you have that, install the dependencies:

```bash
npm install
```

#### Without Mise - Ruby

You need Ruby > `3.0` to run the Ruby code. Once you have that, install the dependencies:

```bash
bundle install
```

### Running

#### With Mise

Test the code with:

```bash
mise run test
```

This runs the tests for both the TypeScript version and the Ruby version. It then runs both extractors and compares
the output of each with the contents of the `expected-array.json` file.

Run the TypeScript version with:

```bash
mise run ts:extract <html file> # Optionally pipe through jq for syntax highlighting `| jq`
```

Run the Ruby version with:

```bash
mise run ruby:extract <html file> # Optionally pipe through jq for syntax highlighting `| jq`
```

#### Without Mise

Run the tests:

```bash
npm test # For typescript
bundle exec rspec # For ruby
```

Run the extractor.

For TypeScript:

```bash
npm run extract <html file>
```

For Ruby:

```bash
ruby bin/extractor.rb <html file>
```

#### Examples

```bash
mise run ts:extract files/van-gogh-paintings.html | jq
mise run ts:extract files/hokusai-artwork.html

npm run extract files/van-gogh-paintings.html

ruby bin/extractor.rb files/van-gogh-paintings.html | jq
ruby bin/extractor.rb files/hokusai-artwork.html

ruby bin/extractor.rb files/mc-escher-artwork.html | jq
```
26 changes: 26 additions & 0 deletions bin/extractor.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
require './lib/extractor'
require 'json'

def main
  if ARGV.empty?
    puts('Please provide the path to the Google search results page HTML file.')
    puts("\tUsage: ruby #{$PROGRAM_NAME} <path_to_serp_file>")
    puts("\tExample: ruby #{$PROGRAM_NAME} ./files/van-gogh-paintings.html")
    exit(1)
  end

  serp_path = ARGV[0]

  begin
    artworks = Extractor.extract_artworks_from_serp_file(serp_path)
    puts(JSON.pretty_generate({ artworks: artworks }))
  rescue StandardError => e
    puts("An error occurred while extracting artworks: #{e.message}")
    exit(1)
  end

end

if __FILE__ == $PROGRAM_NAME
  main
end
24 changes: 24 additions & 0 deletions bin/extractor.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
import {extractArtworksFromSERPFile} from "../lib/extractor.ts";

async function main() {
  const serpPath = process.argv[2];
  if (!serpPath) {
    console.error('Please provide the path to the Google search results page HTML file.');
    console.error('\tUsage: npm run extract <path_to_serp_file>');
    console.error('\tExample: npm run extract ./files/van-gogh-paintings.html');
    process.exit(1);
  }

  try {
    const artworks = await extractArtworksFromSERPFile(serpPath);
    console.log(JSON.stringify({artworks}, null, 2));
  } catch (error) {
    console.error('Error extracting artworks:', error);
    process.exit(1);
  }
}

if (import.meta.url === `file://${process.argv[1]}`) {
  await main();
}

Binary file added files/artwork-structure.png