Scrape Table

Shesh Ghimire edited this page Apr 20, 2025 · 3 revisions

By default, the library ships with the following abstract classes:

  • Single Table Scraper: Scrapes data from the source URL, traces the table structure, and generates data for each traced structure.
  • Accented Single Table Scraper: Extends Single Table Scraper to support transliteration of accented characters.

To extract table contents, extend either of the above abstract classes. Keep in mind that neither of them implements the Table Tracer methods. Two traits are provided that implement these methods:

  • Using PHP DOMNode: Uses PHP's DOM API and can also extract multiple tables from an HTML document.
  • Using only string: Uses regex and only supports extracting a single table from an HTML document.
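The difference between the two approaches can be sketched in plain PHP. This is an illustration only: it does not use the library's traits, and the sample markup, variable names, and regex patterns are assumptions made for the example.

```php
<?php
// Illustration only: plain PHP, not the library's traits.
$html = <<<'HTML'
<table id="first"><tr><td>A</td><td>B</td></tr></table>
<table id="second"><tr><td>C</td><td>D</td></tr></table>
HTML;

// DOM approach: every <table> node is reachable, so several tables can be handled.
$dom = new DOMDocument();
$dom->loadHTML( $html, LIBXML_NOERROR | LIBXML_NOWARNING );

$domTables = array();

foreach ( $dom->getElementsByTagName( 'table' ) as $table ) {
    $cells = array();

    foreach ( $table->getElementsByTagName( 'td' ) as $cell ) {
        $cells[] = trim( $cell->textContent );
    }

    $domTables[ $table->getAttribute( 'id' ) ] = $cells;
}

// String approach: a single regex match captures only the first table.
preg_match( '#<table\b[^>]*>(.*?)</table>#is', $html, $tableMatch );
preg_match_all( '#<td\b[^>]*>(.*?)</td>#is', $tableMatch[1], $cellMatches );

$stringCells = array_map( 'trim', $cellMatches[1] );
```

Here `$domTables` holds the cells of both tables keyed by their `id`, while `$stringCells` holds only the cells of the first table. For a page with a single table, the string approach avoids building a DOM tree.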

Steps

Step 1

Create an enum that maps to each table column respectively.

/** @template-implements BackedEnum<string> */
enum DeveloperDetails: string {
    case Name    = 'name';
    case Title   = 'title';
    case Address = 'address';
    case Age     = 'age';

    public function isValid( string $value ): bool {
        return match ( $this ) {
            self::Name, self::Title => strlen( $value ) < 20,
            self::Address           => strlen( $value ) === 3,
            self::Age               => is_numeric( $value ),
        };
    }
}
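As a quick check of how the enum behaves, the sketch below validates a whole row against it. The enum is repeated so that the snippet runs on its own (PHP 8.1+); `invalidColumns()` is a hypothetical helper written for this example and is not part of the library.

```php
<?php
// Repeated from Step 1 so that this snippet runs on its own.
enum DeveloperDetails: string {
    case Name    = 'name';
    case Title   = 'title';
    case Address = 'address';
    case Age     = 'age';

    public function isValid( string $value ): bool {
        return match ( $this ) {
            self::Name, self::Title => strlen( $value ) < 20,
            self::Address           => strlen( $value ) === 3,
            self::Age               => is_numeric( $value ),
        };
    }
}

/**
 * Hypothetical helper (not part of the library): lists columns whose value is invalid.
 *
 * @param array<string,string> $row Column values keyed by the enum's backing value.
 * @return list<string>
 */
function invalidColumns( array $row ): array {
    $invalid = array();

    foreach ( $row as $key => $value ) {
        // BackedEnum::from() resolves the case from its backing value ('age' => DeveloperDetails::Age).
        if ( ! DeveloperDetails::from( $key )->isValid( $value ) ) {
            $invalid[] = $key;
        }
    }

    return $invalid;
}

$passing = invalidColumns( array( 'name' => 'John Doe', 'address' => 'Ktm', 'age' => '22' ) );
$failing = invalidColumns( array( 'name' => 'John Doe', 'address' => 'Kathmandu', 'age' => 'n/a' ) );
```

`$passing` is an empty array, and `$failing` contains `'address'` and `'age'`, because the address is not exactly three characters long and the age is not numeric.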

Step 2

Then, create a concrete class that scrapes an HTML table containing accented characters. We want each table column to return a string value, so we use string as the generic return type.

use DeveloperDetails;
use TheWebSolver\Codegarage\Scraper\Enums\Table;
use TheWebSolver\Codegarage\Scraper\Error\ValidationFail;
use TheWebSolver\Codegarage\Scraper\Attributes\ScrapeFrom;
use TheWebSolver\Codegarage\Scraper\Attributes\CollectFrom;
use TheWebSolver\Codegarage\Scraper\Interfaces\Validatable;
use TheWebSolver\Codegarage\Scraper\Proxy\ItemValidatorProxy;
use TheWebSolver\Codegarage\Scraper\Traits\Table\HtmlTableFromString;
// Also import AccentedSingleTableScraper and MarshallTableRow from their namespaces in the library.

/** @template-extends AccentedSingleTableScraper<string> */
#[ScrapeFrom('Wiki Dev List', url: 'https://fake.wiki.org/dev-list', filename: 'developer-list.html')]
#[CollectFrom(DeveloperDetails::class)]
// To collect only the "name" and "age" table columns, use the attribute like so:
// #[CollectFrom(DeveloperDetails::class, DeveloperDetails::Name, DeveloperDetails::Age)]
class DevTable extends AccentedSingleTableScraper implements Validatable {
    /**
     * @use HtmlTableFromString<string>
     * Because there is only one table, we'll use string trait.
     */
    use HtmlTableFromString;

    // The title column needs to be transliterated, so we list it here.
    protected array $transliterationColumnNames = array( DeveloperDetails::Title->value );

    public function validate( $content ): void {
        $column = DeveloperDetails::from( $this->getCurrentItemIndex() );

        $column->isValid( $content ) || throw new ValidationFail( 'Fail for ' . $column->value );
    }

    protected function defaultCachePath(): string {
        // ...path/to/directory where "developer-list.html" to be cached.
    }

    protected function getInjectedOrDefaultTransformers(): array {
        $transformers = parent::getInjectedOrDefaultTransformers();

        if ( ! $this->hasDefaultTransformerProvided( for: Table::Row ) ) {
            $invalidCount    = $this->getScraperSource()->name . ' ' . self::INVALID_COUNT;
            $indexKey        = DeveloperDetails::Address->value; // Each dataset indexed by address.
            $transformers[0] = new MarshallTableRow( $invalidCount, $indexKey );
        }

        // We'll use a proxy that transliterates and validates each column.
        if ( ! $this->hasDefaultTransformerProvided( for: Table::Column ) ) {
            $transformers[1] = new ItemValidatorProxy();
        }

        return $transformers;
    }
}

Step 3

Let's say the data to scrape is:

<!DOCTYPE html>
<html>
    <!-- head and other shenanigans here... -->
    <table id="developer-list" class="sortable collapsible">
        <thead>
            <tr>
                <th>Developer Name</th>
                <th><span class="nowrap">Job Title</span></th>
                <th><span>Full Address</span></th>
                <th>Age</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>John Doe</td>
                <td>PHP Devel&ocirc;per</td>
                <td>Ktm</td>
                <td>22</td>
            </tr>
            <tr>
                <td>Lorem Ipsum</td>
                <td>JS Developer</td>
                <td>Bkt</td>
                <td>19</td>
            </tr>
        </tbody>
    </table>
</html>
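Note that the first row's job title contains the entity `&ocirc;`, which is why the title column is listed for transliteration in Step 2. The idea can be sketched in plain PHP as below. This is not the library's transliteration logic: `toAscii()` and its tiny character map are assumptions made purely for illustration.

```php
<?php
/**
 * Hypothetical helper (not the library's implementation): decodes HTML entities,
 * then replaces a few accented characters with their ASCII counterparts.
 */
function toAscii( string $cell ): string {
    // "&ocirc;" becomes "ô".
    $decoded = html_entity_decode( $cell, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

    // Tiny illustrative map; a real one would cover far more characters.
    return strtr( $decoded, array( 'ô' => 'o', 'é' => 'e', 'ñ' => 'n' ) );
}

$title = toAscii( 'PHP Devel&ocirc;per' );
```

With this input, `$title` is `'PHP Developer'`, which matches the value shown for the first row in Step 4.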

Step 4

Now, let's scrape and trace the table.

$scraper = new DevTable();

// If the table head does not need to be traced.
$scraper->traceWithout( Table::THead );

// If only the "name" and "age" table columns need to be traced (when passed to the "CollectFrom" attribute),
// then we need to provide the offsets (starting from 0) of the DeveloperDetails cases to skip.
$scraper->addEventListener(
    Table::Row,
    // Ignore the Title and Address cases at positions "1" and "2".
    fn( $tracer ) => $tracer->setItemsIndices( $scraper->collectSourceItems(), 1, 2 )
);

$table = $scraper->scrape();
$scraper->toCache( $table ); // Saved to "path/to/directory/developer-list.html".
$iterator = $scraper->parse();
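The effect of skipping offsets can be pictured with a small plain-PHP sketch. It only illustrates the idea of mapping collected keys onto cell positions; `mapKeysToPositions()` is a hypothetical helper, and the library's `setItemsIndices()` may work differently internally.

```php
<?php
/**
 * Hypothetical helper (not part of the library): assigns each collected key
 * to the next cell position that is not in the skipped offsets.
 *
 * @param list<string> $keys
 * @return array<int,string>
 */
function mapKeysToPositions( array $keys, int ...$skippedOffsets ): array {
    $map      = array();
    $position = 0;

    foreach ( $keys as $key ) {
        // Move past every position that should be ignored.
        while ( in_array( $position, $skippedOffsets, true ) ) {
            ++$position;
        }

        $map[ $position++ ] = $key;
    }

    return $map;
}

// Only "name" and "age" are collected; positions 1 (title) and 2 (address) are skipped.
$map   = mapKeysToPositions( array( 'name', 'age' ), 1, 2 );
$cells = array( 'John Doe', 'PHP Developer', 'Ktm', '22' );
$row   = array();

foreach ( $map as $position => $key ) {
    $row[ $key ] = $cells[ $position ];
}
```

Under these assumptions, `$map` pairs position 0 with `'name'` and position 3 with `'age'`, so `$row` contains only the name and age of the first table row.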

Getting first item:

$iterator->current()->getArrayCopy();
$johnDoe = ['name' => 'John Doe', 'title' => 'PHP Developer', 'address' => 'Ktm', 'age' => '22'];

Getting next item:

$iterator->next();

$iterator->current()->getArrayCopy();
$lorem = ['name' => 'Lorem Ipsum', 'title' => 'JS Developer', 'address' => 'Bkt', 'age' => '19'];

To get all data at once:

iterator_to_array( $iterator );

Each item is indexed by the value of DeveloperDetails::Address, as passed to the second parameter of MarshallTableRow. See DevTable::getInjectedOrDefaultTransformers() in Step 2 above.

$all = [
    'Ktm' => ['name' => 'John Doe', 'title' => 'PHP Developer', 'address' => 'Ktm', 'age' => '22'],
    'Bkt' => ['name' => 'Lorem Ipsum', 'title' => 'JS Developer', 'address' => 'Bkt', 'age' => '19']
];
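For comparison, the same kind of indexing can be reproduced on plain arrays with PHP's built-in `array_column()`. This is only a sketch of the resulting shape, not how MarshallTableRow is implemented; the `$rows` variable is an assumption for the example.

```php
<?php
// Rows as they would look without an index key (plain list).
$rows = array(
    array( 'name' => 'John Doe', 'title' => 'PHP Developer', 'address' => 'Ktm', 'age' => '22' ),
    array( 'name' => 'Lorem Ipsum', 'title' => 'JS Developer', 'address' => 'Bkt', 'age' => '19' ),
);

// Passing null as the column key keeps each full row; the third argument sets the index.
$indexed = array_column( $rows, null, 'address' );
```

`$indexed` is keyed by `'Ktm'` and `'Bkt'`, each pointing at the full row. If two rows shared the same address, the later row would overwrite the earlier one, so the index column should hold unique values.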