Drupal 8 - Search text across the whole site with a Drush 9 command

By k, 5 November, 2019

Do you know which pages have links pointing to `.pdf` files? Kinda nice question, right? Well, it might be easy (not really) if you do not have thousands of nodes. Then I thought a database query might help... yup, but how do I know which tables to look up?

To solve this particular case I made a Drush 9 command. First, I added its definition to the `drush.services.yml` file:


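    services:
      # The service ID below is an assumption; any unique name works.
      keboca_drush9.commands:
        class: Drupal\keboca_drush9\Commands\WebCrawlerCommands
        arguments: ['@database', '@entity_field.manager', '@entity_type.manager']
        tags:
          - { name: drush.command }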


Then I used `@entity_field.manager` to grab all field definitions, keeping only the fields prefixed with `field_` (plus the `body` field) whose type stores text/strings in the database. To figure out which field types I should review, I checked all the `@FieldType` plugins available by looking at the subclasses of `Drupal\Core\Field\FieldItemBase`. In the end, the command was named `WebCrawler`. Such an original name, tho!
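To make that filtering step concrete, here is a minimal, framework-free sketch. It assumes the field map has the same shape that `getFieldMap()` returns (entity type, then field name, then a `type` key). The function name `collectTextColumns` and the sample data are made up for illustration only.

```php
<?php

/**
 * Collect "table => column" pairs for text-like fields.
 *
 * Hypothetical helper: $fieldMap only mimics the shape of
 * EntityFieldManager::getFieldMap(); it is not loaded from Drupal.
 */
function collectTextColumns(array $fieldMap, array $typesAvailable): array {
  $columns = [];
  foreach ($fieldMap as $entityType => $fields) {
    foreach ($fields as $name => $info) {
      // Keep only `field_*` fields and the `body` field.
      $isCandidate = strpos($name, 'field_') === 0 || $name === 'body';
      if ($isCandidate && isset($typesAvailable[$info['type']])) {
        // Table is {entity_type}__{field_name}, column is {field_name}_{property}.
        $columns["{$entityType}__{$name}"] = "{$name}_{$typesAvailable[$info['type']]}";
      }
    }
  }
  return $columns;
}

// Sample data (assumed), shaped like a tiny field map.
$fieldMap = [
  'node' => [
    'body' => ['type' => 'text_with_summary'],
    'field_link' => ['type' => 'link'],
    'title' => ['type' => 'string'],
    'field_image' => ['type' => 'image'],
  ],
];
$typesAvailable = [
  'text_with_summary' => 'value',
  'link' => 'uri',
  'string' => 'value',
];

print_r(collectTextColumns($fieldMap, $typesAvailable));
```

With this sample data only `node__body` (column `body_value`) and `node__field_link` (column `field_link_uri`) survive: `title` has a text type but no `field_` prefix, and `field_image` has the prefix but not a text type.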

Because I was about to run a query against a MySQL database, I checked whether there was a way to use a regular expression in it. It turns out that's possible: https://dev.mysql.com/doc/refman/5.6/en/regexp.html
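Before pointing a pattern at the database, it can help to try it on a couple of sample strings. The sketch below does that with PHP's own `preg_match()`, which is a different regex engine from MySQL's, so it is only a rough check and assumes the pattern sticks to syntax both engines share. The pattern and the sample markup are invented for illustration.

```php
<?php

// Assumed example pattern: an href attribute whose value ends in ".pdf".
// `[.]` is used instead of `\.` so the pattern does not depend on backslash
// escaping in the shell or in SQL string literals.
$pattern = 'href="[^"]+[.]pdf"';

// Made-up markup, similar to what a body field might store.
$samples = [
  '<a href="/files/report.pdf">Report</a>',
  '<a href="/about">About</a>',
];

foreach ($samples as $html) {
  // `~` is the delimiter, so quotes and slashes need no extra escaping.
  $matched = preg_match('~' . $pattern . '~', $html) === 1;
  printf("%s => %s\n", $html, $matched ? 'match' : 'no match');
}
```

The first sample should be reported as a match and the second should not. A pattern that behaves as expected here is a reasonable candidate to pass to the command.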

Based on all of the above, here is what the command looks like:



<?php

namespace Drupal\keboca_drush9\Commands;

use Drupal\Core\Database\Connection;
use Drupal\Core\Entity\EntityFieldManager;
use Drupal\Core\Entity\EntityTypeManager;
use Drupal\node\Entity\Node;
use Drush\Commands\DrushCommands;

/**
 * Class WebCrawlerCommands
 *
 * @package Drupal\keboca_drush9\Commands
 */
class WebCrawlerCommands extends DrushCommands {

  /** @var Connection */
  protected $database;

  /** @var EntityFieldManager */
  protected $entityFieldManager;

  /** @var EntityTypeManager */
  protected $entityTypeManager;

  /**
   * WebCrawlerCommands constructor.
   *
   * @param \Drupal\Core\Database\Connection $database
   * @param \Drupal\Core\Entity\EntityFieldManager $entity_field_manager
   * @param \Drupal\Core\Entity\EntityTypeManager $entity_type_manager
   */
  public function __construct(Connection $database, EntityFieldManager $entity_field_manager, EntityTypeManager $entity_type_manager) {
    $this->database = $database;
    $this->entityFieldManager = $entity_field_manager;
    $this->entityTypeManager = $entity_type_manager;
  }

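  /**
   * Look up the paragraphs item field data to find the parent node ID.
   *
   * @param int $id
   * @return int|null
   *   Parent node ID, or NULL when no parent node can be found.
   */
  protected function getParentNode($id) {
    /** @var \stdClass|false $result */
    $result = $this->database->select('paragraphs_item_field_data', 'p')
      ->fields('p', ['parent_id', 'parent_type'])
      ->condition('id', $id)
      ->execute()
      ->fetchObject();

    // Stop when the paragraph has no row or no parent at all.
    if (!$result || empty($result->parent_id)) {
      return NULL;
    }

    // When the parent is a node, return its ID.
    if ('node' == $result->parent_type) {
      return (int) $result->parent_id;
    }

    // Only nested paragraphs can be followed further up.
    if ('paragraph' != $result->parent_type) {
      return NULL;
    }

    // Otherwise check the parent of the parent paragraph.
    return $this->getParentNode($result->parent_id);
  }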

  /**
   * Run a regular expression against every text field on the whole site.
   *
   * @see https://dev.mysql.com/doc/refman/5.6/en/regexp.html
   *
   * @command web:crawl:regexp
   * @param string $regexp Regular expression to look up.
   * @aliases wcr
   * @usage web:crawl:regexp {regexp}
   *   Find the given regular expression across the whole site.
   * @usage wcr {regexp}
   *   Execute it by using its alias.
   */
  public function runRegexp($regexp) {
    /** @var array $fieldTypes */
    $fieldTypes = [];
    /** @var array $typesAvailable */
    $typesAvailable = [
      'string_long' => 'value',
      'string' => 'value',
      'link' => 'uri',
      'uri' => 'value',
      'file_uri' => 'value',
      'list_string' => 'value',
      'text_long' => 'value',
      'text_with_summary' => 'value',
    ];

    // Walk through every entity type in the field map.
    foreach ($this->entityFieldManager->getFieldMap() as $content => $group) {
      // Walk through the group to retrieve only text fields.
      array_walk($group, function ($info, $key) use (&$fieldTypes, $typesAvailable, $content) {
        // Check field name and its type.
        if (array_key_exists($info['type'], $typesAvailable) &&
          (strpos($key, 'field_') === 0 || 'body' === $key)) {
          // Store entity type, field name and column suffix.
          array_push($fieldTypes, "{$content}__{$key}__{$typesAvailable[$info['type']]}");
        }
      });
    }

    /** @var \Drupal\node\Entity\Node[] $nodes */
    $nodes = [];
    /** @var array $report */
    $report = array_reduce($fieldTypes, function ($return, $definition) use (&$nodes, $regexp) {
      list($content, $field, $index) = explode('__', $definition);

      // Set up table and column names.
      $table = "{$content}__{$field}";
      /** @var string $column */
      $column = "{$field}_{$index}";

      // Search the given expression in the current table. The pattern is
      // bound as a placeholder so it is never interpolated into the SQL.
      $result = $this->database->select($table, 't')
        ->fields('t', ['entity_id'])
        ->where("t.{$column} REGEXP :regexp", [':regexp' => $regexp])
        ->execute()
        ->fetchAll();

      // When there is a result, then make it persistent.
      if ($result) {
        // Walk through the rows returned.
        array_walk($result, function ($row) use (&$return, &$nodes, $content) {
          /** @var \Drupal\Core\Entity\EntityInterface|null $entity */
          $entity = $this->entityTypeManager->getStorage($content)
            ->load($row->entity_id);

          // Skip rows whose entity no longer exists.
          if (!$entity) {
            return;
          }

          // Check whether $entity is a `node` or a `paragraph`.
          if ('node' == $entity->getEntityTypeId()) {
            $nodes[$entity->id()] = $entity;
            $return["{$content}:{$entity->id()}"] = $entity->toUrl()
              ->toString();
          }
          elseif ('paragraph' == $entity->getEntityTypeId()) {
            // Retrieve the parent node ID; skip orphaned paragraphs.
            $nid = $this->getParentNode($entity->id());
            if (!$nid) {
              return;
            }

            // Load the node entity only once.
            if (!array_key_exists($nid, $nodes)) {
              /** @var \Drupal\node\Entity\Node $node */
              if (!$node = Node::load($nid)) {
                return;
              }
              $nodes[$node->id()] = $node;
            }

            // Store node reference.
            $return["node:{$nid}"] = $nodes[$nid]->toUrl()->toString();
          }
          else {
            $this->io()
              ->writeln("[warning]: Entity Type: {$content} is not supported yet.");
          }
        });
      }

      return $return;
    }, []);

    $this->io()->table(['node:key', 'url'], array_map(function ($key, $url) {
      return [$key, $url];
    }, array_keys($report), $report));
  }

}

That said, I hope it helps you at some point!

Happy coding!
