Drupal 8 - Search text across the whole site with a Drush 9 command

By k, 5 November, 2019

Authored on

Tue, 11/05/2019 - 07:47

Do you know which pages have links pointing to `.pdf` files? Kinda nice question, right? Well, it might be easy (not really) if you do not have thousands of nodes. Then I thought that perhaps a database query might help... yup, but how do I know which tables to look in?

To solve this particular case I made a Drush 9 command. First, I added its definition to the `drush.services.yml` file:

drush.services.yml

services:
  web_crawler.commands:
    class: Drupal\keboca_drush9\Commands\WebCrawlerCommands
    arguments: ['@database', '@entity_field.manager', '@entity_type.manager']
    tags:
      - { name: drush.command }

 

Then I used `@entity_field.manager` to grab all field definitions, keeping only the fields prefixed with `field_` (plus the `body` field) whose types store text/strings in the database. To figure out which field types I should review, I checked all the available `@FieldType` plugins by looking at the subclasses of `Drupal\Core\Field\FieldItemBase`. In the end, the command was named `WebCrawler`, such an original name, though!
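
The filtering described above can be sketched in plain PHP, without bootstrapping Drupal. This is only a minimal sketch, not the command itself: the function name `collectSearchableColumns()` and the sample `$fieldMap` are made up for illustration, and the array shape (entity type, then field name, then a `type` key) is assumed to mirror what the field map returns. The table and column names assume Drupal's default dedicated field tables.

```php
<?php

/**
 * Hypothetical helper: picks text-like fields from a field map.
 *
 * @param array $fieldMap
 *   Assumed shape: [entity_type => [field_name => ['type' => field_type]]].
 * @param array $typeColumns
 *   Field type => name of the property that stores the text.
 *
 * @return array
 *   List of ['table' => ..., 'column' => ...] pairs.
 */
function collectSearchableColumns(array $fieldMap, array $typeColumns): array {
  $columns = [];
  foreach ($fieldMap as $entityType => $fields) {
    foreach ($fields as $fieldName => $info) {
      // Keep only configurable fields (field_*) and the body field.
      $isCandidate = strpos($fieldName, 'field_') === 0 || $fieldName === 'body';
      if ($isCandidate && isset($typeColumns[$info['type']])) {
        $columns[] = [
          'table' => "{$entityType}__{$fieldName}",
          'column' => "{$fieldName}_{$typeColumns[$info['type']]}",
        ];
      }
    }
  }
  return $columns;
}

// Sample data, invented for this example.
$fieldMap = [
  'node' => [
    'body' => ['type' => 'text_with_summary'],
    'field_download' => ['type' => 'link'],
    'field_image' => ['type' => 'image'],
    'title' => ['type' => 'string'],
  ],
];
$typeColumns = [
  'text_with_summary' => 'value',
  'link' => 'uri',
  'string' => 'value',
];

$columns = collectSearchableColumns($fieldMap, $typeColumns);
print_r($columns);
```

With this sample data only `body` and `field_download` survive: `field_image` is dropped because of its type, and `title` because of its name.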

Because I was about to run the query against a MySQL database instance, I decided to check whether there was a way to run a regular expression. It turns out that's possible: https://dev.mysql.com/doc/refman/5.6/en/regexp.html
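
Before running a pattern against the database, it can help to try it on a few sample strings first. The snippet below is only a rough sketch using PHP's `preg_match()`, which is a PCRE engine and not MySQL's, so it is a fair approximation only for simple patterns like this one. The pattern and the sample strings are my own examples, not something taken from a real site.

```php
<?php

// Example pattern: a literal ".pdf" followed by a quote, "?", "#" or end of text.
$regexp = '\.pdf("|\?|#|$)';

// Sample field values, invented for this example.
$samples = [
  '<a href="/sites/default/files/report.pdf">Report</a>',
  'internal:/files/guide.pdf',
  '<a href="/about-pdfs">About PDFs</a>',
];

$matches = [];
foreach ($samples as $text) {
  // "~" is the delimiter, so slashes in a pattern need no escaping.
  // The "i" flag is an assumption: it mimics case-insensitive matching.
  $matches[] = preg_match("~{$regexp}~i", $text) === 1;
}

var_export($matches);
```

With these samples the first two strings match and the third one does not.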

Based on all of the above, here is what the command looks like:

WebCrawlerCommands

<?php

namespace Drupal\keboca_drush9\Commands;

use Drupal\Core\Database\Connection;
use Drupal\Core\Entity\EntityFieldManager;
use Drupal\Core\Entity\EntityTypeManager;
use Drupal\node\Entity\Node;
use Drush\Commands\DrushCommands;

/**
 * Class WebCrawlerCommands
 *
 * @package Drupal\keboca_drush9\Commands
 */
class WebCrawlerCommands extends DrushCommands {

  /** @var Connection */
  protected $database;

  /** @var EntityFieldManager */
  protected $entityFieldManager;

  /** @var EntityTypeManager */
  protected $entityTypeManager;

  /**
   * WebCrawlerCommands constructor.
   *
   * @param \Drupal\Core\Database\Connection $database
   * @param \Drupal\Core\Entity\EntityFieldManager $entity_field_manager
   * @param \Drupal\Core\Entity\EntityTypeManager $entity_type_manager
   */
  public function __construct(Connection $database, EntityFieldManager $entity_field_manager, EntityTypeManager $entity_type_manager) {
    $this->database = $database;
    $this->entityFieldManager = $entity_field_manager;
    $this->entityTypeManager = $entity_type_manager;
  }

  /**
   * Look-up paragraphs item field data to find node parent ID value.
   *
   * @param int $id
   *
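   * @return int|null
   *   The parent node ID, or NULL when no node parent can be found.
   */
  protected function getParentNode($id) {
    /** @var \stdClass|false $result */
    $result = $this->database->select('paragraphs_item_field_data', 'p')
      ->fields('p', ['parent_id', 'parent_type'])
      ->condition('id', $id)
      ->execute()
      ->fetchObject();

    // Stop when the paragraph row does not exist (e.g. orphaned paragraph).
    if (!$result) {
      return NULL;
    }

    // When node parent was found, then return it.
    if ('node' == $result->parent_type) {
      return $result->parent_id;
    }

    // Only keep climbing while the parent is itself a paragraph.
    if ('paragraph' != $result->parent_type) {
      return NULL;
    }

    // Otherwise check its parent.
    return $this->getParentNode($result->parent_id);
  }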

  /**
   * Run a regular expression against all text fields across the whole site.
   * https://dev.mysql.com/doc/refman/5.6/en/regexp.html
   *
   * @command web:crawl:regexp
   *
   * @param string $regexp Given regular expression to lookup.
   *
   * @aliases wcr
   *
   * @usage web:crawl:regexp {regexp}
   *   Find the given regular expression across the whole site.
   * @usage wcr {regexp}
   *   Execute it by using its alias.
   */
  public function runRegexp($regexp) {
    /** @var array $fieldTypes */
    $fieldTypes = [];
    /** @var array $typesAvailable */
    $typesAvailable = [
      'string_long' => 'value',
      'string' => 'value',
      'link' => 'uri',
      'uri' => 'value',
      'file_uri' => 'value',
      'list_string' => 'value',
      'text_long' => 'value',
      'text_with_summary' => 'value',
    ];

    // Walk through all available field types.
    foreach ($this->entityFieldManager->getFieldMap() as $content => $group) {
      // Walk-through group to retrieve only text fields.
      array_walk($group, function ($info, $key) use (&$fieldTypes, $typesAvailable, $content) {
        // Check field name and its type.
        if (array_key_exists($info['type'], $typesAvailable) &&
          (strpos($key, 'field_') === 0 || 'body' === $key)) {
          // Store field name and its type.
          array_push($fieldTypes, "{$content}__{$key}__{$typesAvailable[$info['type']]}");
        }
      });
    }

    /** @var Node[] $nodes */
    $nodes = [];
    /** @var array $report */
    $report = array_reduce($fieldTypes, function ($return, $definition) use (&$nodes, $regexp) {
      /** @var string $column */
      list($content, $field, $index) = explode('__', $definition);

      // Setup table and column names.
      $table = "{$content}__{$field}";
      $column = "{$field}_{$index}";

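      // Search given text on current table. The pattern is bound as a
      // placeholder instead of being interpolated into the SQL string.
      $result = $this->database->select($table, 't')
        ->fields('t', ['entity_id'])
        ->where("{$column} REGEXP :regexp", [':regexp' => $regexp])
        ->execute()
        ->fetchAll();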

      // When there is a result, then make it persistent.
      if ($result) {
        // Walk-through result returned.
        array_walk($result, function ($row) use (&$return, &$nodes, $content) {
          /** @var \Drupal\Core\Entity\EntityInterface|null $entity */
          $entity = $this->entityTypeManager->getStorage($content)
            ->load($row->entity_id);

          // Skip rows whose entity no longer exists.
          if (!$entity) {
            return;
          }

          // Check when $entity is `node` or `paragraph` entity type.
          if ('node' == $entity->getEntityTypeId()) {
            $nodes[$entity->id()] = $entity;
            $return["{$content}:{$entity->id()}"] = $entity->toUrl()
              ->toString();
          }
          elseif ('paragraph' == $entity->getEntityTypeId()) {
            // Retrieve node parent ID value.
            $nid = $this->getParentNode($entity->id());

            // Load node entity only once.
            if (!array_key_exists($nid, $nodes)) {
              /** @var Node $node */
              if (!$node = Node::load($nid)) {
                return;
              }
              $nodes[$node->id()] = $node;
            }

            // Store node reference.
            $return["node:{$nid}"] = $nodes[$nid]->toUrl()->toString();
          }
          else {
            $this->io()
              ->writeln("[warning]: Entity Type: {$content} is not supported yet.");
          }
        });
      }

      return $return;
    }, []);

    $this->io()->table(['node:key', 'url'], array_map(function ($key, $url) {
      return [$key, $url];
    }, array_keys($report), $report));
  }
}
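
The parent lookup in `getParentNode()` climbs from a nested paragraph up to the node that holds it. Here is a minimal sketch of that idea, with a plain array standing in for the `paragraphs_item_field_data` table. The function name `findParentNode()` and the sample rows are invented for illustration; the sketch returns NULL when a row is missing or when the chain ends in something other than a node.

```php
<?php

/**
 * Hypothetical stand-in for getParentNode(), reading rows from an array.
 *
 * @param array $rows
 *   Paragraph ID => ['parent_id' => int, 'parent_type' => string].
 * @param int $id
 *   Paragraph ID to start from.
 *
 * @return int|null
 *   The node ID, or NULL when the chain does not end in a node.
 */
function findParentNode(array $rows, int $id): ?int {
  // Missing row: nothing to climb.
  if (!isset($rows[$id])) {
    return NULL;
  }
  $row = $rows[$id];
  if ($row['parent_type'] === 'node') {
    return $row['parent_id'];
  }
  // Keep climbing only while the parent is another paragraph.
  if ($row['parent_type'] !== 'paragraph') {
    return NULL;
  }
  return findParentNode($rows, $row['parent_id']);
}

// Sample rows, invented for this example: 12 -> 11 -> 10 -> node 42.
$rows = [
  10 => ['parent_id' => 42, 'parent_type' => 'node'],
  11 => ['parent_id' => 10, 'parent_type' => 'paragraph'],
  12 => ['parent_id' => 11, 'parent_type' => 'paragraph'],
];

var_dump(findParentNode($rows, 12)); // int(42)
var_dump(findParentNode($rows, 99)); // NULL
```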

That said, I hope it helps you at some point!

Happy coding!
