Do you know which pages have links pointing to `.pdf` files? Nice question, right? It might be easy to answer (not really) if you do not have thousands of nodes. Then I thought that perhaps a database query might help... yup, but how do I know which tables to look up?
To solve this particular case I wrote a Drush 9 command. First, I added its definition to the `drush.services.yml` file:
drush.services.yml
services:
  web_crawler.commands:
    class: Drupal\keboca_drush9\Commands\WebCrawlerCommands
    arguments: ['@database', '@entity_field.manager', '@entity_type.manager']
    tags:
      - { name: drush.command }
Then I used `@entity_field.manager` to grab all field definitions, keeping only those prefixed with `field_` (plus the `body` field) whose type stores text or strings in the database. To figure out which field types to review, I checked every available `@FieldType` by looking at the subclasses of `Drupal\Core\Field\FieldItemBase`. In the end, the command was named `WebCrawler`. Such an original name, tho!
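The filtering step can be sketched in plain PHP, without bootstrapping Drupal. This is only an illustration: `$fieldMap` is hand-written sample data that mimics the shape returned by `getFieldMap()`, and `pickTextColumns()` is a hypothetical helper name, not part of Drupal or of the command itself.

```php
<?php

/**
 * Hypothetical helper: map text-like fields to their table and column names.
 *
 * @param array $fieldMap       Entity type => field name => ['type' => ...].
 * @param array $typeToProperty Field type => column property holding the text.
 *
 * @return array Table name => column name.
 */
function pickTextColumns(array $fieldMap, array $typeToProperty): array {
  $columns = [];
  foreach ($fieldMap as $entityType => $fields) {
    foreach ($fields as $fieldName => $info) {
      // Keep only configurable fields (field_*) plus the body field.
      $isCandidate = strpos($fieldName, 'field_') === 0 || $fieldName === 'body';
      if (!$isCandidate || !isset($typeToProperty[$info['type']])) {
        continue;
      }
      $property = $typeToProperty[$info['type']];
      // Assumes the default "{entity_type}__{field_name}" table naming.
      $columns["{$entityType}__{$fieldName}"] = "{$fieldName}_{$property}";
    }
  }
  return $columns;
}

// Hand-written sample data (an assumption, not real site output).
$fieldMap = [
  'node' => [
    'body' => ['type' => 'text_with_summary'],
    'field_image' => ['type' => 'image'],
    'title' => ['type' => 'string'],
  ],
  'paragraph' => [
    'field_cta' => ['type' => 'link'],
  ],
];

$columns = pickTextColumns($fieldMap, [
  'text_with_summary' => 'value',
  'string' => 'value',
  'link' => 'uri',
]);
print_r($columns);
```

Here `title` is skipped because it has no `field_` prefix, and `field_image` is skipped because `image` is not one of the text-like types.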
Because I was about to run the query against a MySQL database, I checked whether there was a way to use a regular expression. It turns out that's possible: https://dev.mysql.com/doc/refman/5.6/en/regexp.html
Based on all of the above, here is what the command looks like:
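To make the idea concrete, here is a minimal sketch of the kind of statement this leads to. It assumes a field table named `node__body` with a `body_value` column; `buildRegexpQuery()` is a hypothetical helper, and the pattern is returned as a bound parameter rather than pasted into the SQL string.

```php
<?php

/**
 * Hypothetical helper: build a REGEXP lookup with the pattern as a parameter.
 *
 * @return array [SQL string, parameters].
 */
function buildRegexpQuery(string $table, string $column, string $pattern): array {
  // Table and column names cannot be bound, so only allow simple identifiers.
  foreach ([$table, $column] as $identifier) {
    if (!preg_match('/^[A-Za-z0-9_]+$/', $identifier)) {
      throw new InvalidArgumentException("Unexpected identifier: {$identifier}");
    }
  }
  $sql = "SELECT entity_id FROM {$table} WHERE {$column} REGEXP :regexp";
  return [$sql, [':regexp' => $pattern]];
}

// "[.]pdf" matches a literal dot followed by "pdf".
[$sql, $params] = buildRegexpQuery('node__body', 'body_value', '[.]pdf');
echo $sql, PHP_EOL;
```

The resulting pair could be handed to a PDO prepared statement or to Drupal's database layer. With the command from this post, the same pattern would be passed as `drush wcr '[.]pdf'`.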
WebCrawlerCommands
<?php

namespace Drupal\keboca_drush9\Commands;

use Drupal\Core\Database\Connection;
use Drupal\Core\Entity\EntityFieldManager;
use Drupal\Core\Entity\EntityTypeManager;
use Drupal\node\Entity\Node;
use Drush\Commands\DrushCommands;

/**
 * Class WebCrawlerCommands
 *
 * @package Drupal\keboca_drush9\Commands
 */
class WebCrawlerCommands extends DrushCommands {

  /** @var Connection */
  protected $database;

  /** @var EntityFieldManager */
  protected $entityFieldManager;

  /** @var EntityTypeManager */
  protected $entityTypeManager;

  /**
   * WebCrawlerCommands constructor.
   *
   * @param \Drupal\Core\Database\Connection $database
   * @param \Drupal\Core\Entity\EntityFieldManager $entity_field_manager
   * @param \Drupal\Core\Entity\EntityTypeManager $entity_type_manager
   */
  public function __construct(Connection $database, EntityFieldManager $entity_field_manager, EntityTypeManager $entity_type_manager) {
    $this->database = $database;
    $this->entityFieldManager = $entity_field_manager;
    $this->entityTypeManager = $entity_type_manager;
  }

  /**
   * Look up paragraph item field data to find the parent node ID.
   *
   * @param int $id
   *
   * @return int|null
   *   The parent node ID, or NULL when no row exists for the given ID.
   */
  protected function getParentNode($id) {
    /** @var \stdClass|false $result */
    $result = $this->database->select('paragraphs_item_field_data', 'p')
      ->fields('p', ['parent_id', 'parent_type'])
      ->condition('id', $id)
      ->execute()
      ->fetchObject();

    // Stop when no row was found, otherwise the recursion never ends.
    if (!$result) {
      return NULL;
    }

    // When the parent is a node, return its ID.
    if ('node' == $result->parent_type) {
      return $result->parent_id;
    }

    // Otherwise check the parent of the parent.
    return $this->getParentNode($result->parent_id);
  }

  /**
   * Run a regular expression against every text field on the site.
   * https://dev.mysql.com/doc/refman/5.6/en/regexp.html
   *
   * @command web:crawl:regexp
   *
   * @param string $regexp Regular expression to look up.
   *
   * @aliases wcr
   *
   * @usage web:crawl:regexp {regexp}
   *   Find the given regular expression across the whole site.
   * @usage wcr {regexp}
   *   Execute it by using its alias.
   */
  public function runRegexp($regexp) {
    /** @var array $fieldTypes */
    $fieldTypes = [];

    /** @var array $typesAvailable Field type => column property holding the text. */
    $typesAvailable = [
      'string_long' => 'value',
      'string' => 'value',
      'link' => 'uri',
      'uri' => 'value',
      'file_uri' => 'value',
      'list_string' => 'value',
      'text_long' => 'value',
      'text_with_summary' => 'value',
    ];

    // Walk through the field map of every entity type.
    foreach ($this->entityFieldManager->getFieldMap() as $content => $group) {
      // Walk through the group to keep only text fields.
      array_walk($group, function ($info, $key) use (&$fieldTypes, $typesAvailable, $content) {
        // Check the field name and its type.
        if (array_key_exists($info['type'], $typesAvailable) &&
          (strpos($key, 'field_') === 0 || 'body' === $key)) {
          // Store entity type, field name and column property.
          array_push($fieldTypes, "{$content}__{$key}__{$typesAvailable[$info['type']]}");
        }
      });
    }

    /** @var Node[] $nodes */
    $nodes = [];

    /** @var array $report */
    $report = array_reduce($fieldTypes, function ($return, $definition) use (&$nodes, $regexp) {
      list($content, $field, $index) = explode('__', $definition);

      // Set up table and column names.
      $table = "{$content}__{$field}";
      $column = "{$field}_{$index}";

      // Search the current table. The pattern is bound as a placeholder
      // instead of being concatenated into the SQL string.
      $result = $this->database->select($table, 't')
        ->fields('t', ['entity_id'])
        ->where("{$column} REGEXP :regexp", [':regexp' => $regexp])
        ->execute()
        ->fetchAll();

      // When there is a result, add it to the report.
      if ($result) {
        // Walk through the rows returned.
        array_walk($result, function ($row) use (&$return, &$nodes, $content) {
          /** @var \Drupal\Core\Entity\EntityInterface|null $entity */
          $entity = $this->entityTypeManager->getStorage($content)
            ->load($row->entity_id);

          // Skip rows whose entity can no longer be loaded.
          if (!$entity) {
            return;
          }

          // Check whether $entity is a `node` or a `paragraph`.
          if ('node' == $entity->getEntityTypeId()) {
            $nodes[$entity->id()] = $entity;
            $return["{$content}:{$entity->id()}"] = $entity->toUrl()
              ->toString();
          }
          elseif ('paragraph' == $entity->getEntityTypeId()) {
            // Retrieve the parent node ID.
            $nid = $this->getParentNode($entity->id());

            // Skip paragraphs without a node parent.
            if (!$nid) {
              return;
            }

            // Load the node entity only once.
            if (!array_key_exists($nid, $nodes)) {
              /** @var Node $node */
              if (!$node = Node::load($nid)) {
                return;
              }
              $nodes[$node->id()] = $node;
            }

            // Store the node reference.
            $return["node:{$nid}"] = $nodes[$nid]->toUrl()->toString();
          }
          else {
            $this->io()
              ->writeln("[warning]: Entity Type: {$content} is not supported yet.");
          }
        });
      }

      return $return;
    }, []);

    $this->io()->table(['node:key', 'url'], array_map(function ($key, $url) {
      return [$key, $url];
    }, array_keys($report), $report));
  }
}
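The parent lookup in `getParentNode()` is the part most likely to surprise you with nested paragraphs, so here is a small standalone sketch of the same walk. It uses an in-memory array instead of the `paragraphs_item_field_data` table; `findParentNode()` and the sample rows are hypothetical, and the loop stops on missing rows, non-paragraph parents, and cycles.

```php
<?php

/**
 * Hypothetical helper: walk up nested paragraphs until a node parent is found.
 *
 * @param array $rows Paragraph ID => ['parent_id' => ..., 'parent_type' => ...].
 * @param int $id     Paragraph ID to start from.
 *
 * @return int|null Node ID, or NULL when the chain is broken.
 */
function findParentNode(array $rows, int $id): ?int {
  $seen = [];
  while (isset($rows[$id]) && !isset($seen[$id])) {
    $seen[$id] = TRUE;
    $row = $rows[$id];
    if ($row['parent_type'] === 'node') {
      return (int) $row['parent_id'];
    }
    if ($row['parent_type'] !== 'paragraph') {
      // Some other host entity type: nothing to report here.
      return NULL;
    }
    // Keep climbing: the parent is itself a paragraph.
    $id = (int) $row['parent_id'];
  }
  return NULL;
}

// Sample rows (an assumption): paragraph 10 lives in paragraph 7, which lives in node 3.
$rows = [
  10 => ['parent_id' => 7, 'parent_type' => 'paragraph'],
  7 => ['parent_id' => 3, 'parent_type' => 'node'],
  20 => ['parent_id' => 99, 'parent_type' => 'paragraph'],
];

var_dump(findParentNode($rows, 10)); // int(3)
var_dump(findParentNode($rows, 20)); // NULL
```

An iterative loop is used here only so that the cycle guard is easy to see. The recursive version in the command does the same climb.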
That said, I hope it helps you at some point!
Happy coding!