Do1e


Setting Up Algolia Search for My Blog, with Chunked Records to Lift the Text-Length Limit

This post is synced to xLog by Mix Space.
For the best reading experience, please visit the original link:
https://do1e.cn/posts/code/algolia-search


How to Configure Algolia Search#

The mx-space documentation walks through the setup in reasonable detail; other blog frameworks should be broadly similar.

Index Record Size Limit#

Unfortunately, after finishing the setup per the docs, the log reported an error:

16:40:40  ERROR   [AlgoliaSearch]  algolia 推送错误
16:40:40  ERROR   [Event]  Record at the position 10 objectID=xxxxxxxx is too big size=12097/10000 bytes. Please have a look at
  https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits

The cause is clear enough: one post is too long, and Algolia's free tier allows only 10 KB per record. For someone determined to stay on the free plan, that was unacceptable, so I immediately went looking for a workaround.
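For reference, Algolia measures a record as the byte size of its serialized JSON, so CJK text weighs in at roughly three bytes per character. A quick standalone check of a record against the 10 KB free-tier budget (a sketch; the field values are made up):

```python
import json

def record_size(record: dict) -> int:
    # Algolia's limit applies to the serialized record; ensure_ascii=False
    # keeps CJK characters as raw UTF-8, matching what is actually uploaded
    return len(json.dumps(record, ensure_ascii=False).encode("utf-8"))

rec = {"objectID": "1", "title": "示例", "text": "正文" * 2000}
print(record_size(rec), "bytes; over 10 KB:", record_size(rec) > 10_000)
```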

Solution#

Approach#

With mx-space, once an API token is configured, the JSON file meant for manual submission to the Algolia index can be fetched from /api/v2/search/algolia/import-json.
It contains a list of posts, pages, and notes; a sample entry looks like this:

{
  "title": "南京大学IPv4地址范围",
  "text": "# 动机\n\n<details>\n<summary>动机来自于搭建的网页。由于校内和公网都有搭建....",
  "slug": "nju-ipv4",
  "categoryId": "abcdefg",
  "category": {
    "_id": "abcdefg",
    "name": "其他",
    "slug": "others",
    "id": "abcdefg"
  },
  "id": "1234567",
  "objectID": "1234567",
  "type": "post"
},

The objectID field is the critical part: every record submitted to Algolia must carry a unique one.

The idea I came up with was chunking: split any article whose text is too long, and give each chunk its own objectID. Problem solved, right?! (Clearly, at this point I had no inkling of the trouble ahead.)
Some of my pages also embed <style> and <script> blocks, which can simply be stripped with a regular expression.
That led to the following Python script, which edits the JSON downloaded from the endpoint above and submits the result to Algolia.

from algoliasearch.search.client import SearchClientSync
import requests
import json
import math
import os
from copy import deepcopy
import re

MAXSIZE = 9990
APPID = "..."
APPKey = "..."
MXSPACETOKEN = "..."
url = "https://www.do1e.cn/api/v2/search/algolia/import-json"
headers = {
  "Authorization": MXSPACETOKEN,
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0",
}

ret = requests.get(url, headers=headers)
ret = ret.json()
with open("data.json", "w", encoding="utf-8") as f:
  json.dump(ret, f, ensure_ascii=False, indent=2)
to_push = []

def json_length(item):
  content = json.dumps(item, ensure_ascii=False).encode("utf-8")
  return len(content)

def right_text(text):
  try:
    text.decode("utf-8")
    return True
  except UnicodeDecodeError:
    return False

def cut_json(item):
  length = json_length(item)
  text_length = len(item["text"].encode("utf-8"))
  # number of chunks needed
  n = math.ceil(text_length / (MAXSIZE - length + text_length))
  start = 0
  text_content = item["text"].encode("utf-8")
  for i in range(n):
    new_item = deepcopy(item)
    new_item["objectID"] = f"{item['objectID']}_{i}"
    # the final chunk takes all remaining bytes so no text is lost
    end = text_length if i == n - 1 else start + text_length // n
    # back off to a character boundary so the slice decodes cleanly
    # (CJK characters occupy multiple bytes in UTF-8)
    while not right_text(text_content[start:end]):
      end -= 1
    new_item["text"] = text_content[start:end].decode("utf-8")
    start = end
    to_push.append(new_item)

for item in ret:
  # strip <style> and <script> tags
  item["text"] = re.sub(r"<style.*?>.*?</style>", "", item["text"], flags=re.DOTALL)
  item["text"] = re.sub(r"<script.*?>.*?</script>", "", item["text"], flags=re.DOTALL)
  if json_length(item) > MAXSIZE:  # over the limit: split it
    print(f"{item['title']} is too large, cut it")
    cut_json(item)
  else:  # within the limit, but suffix the objectID anyway for consistency
    item["objectID"] = f"{item['objectID']}_0"
    to_push.append(item)

with open("topush.json", "w", encoding="utf-8") as f:
  json.dump(to_push, f, ensure_ascii=False, indent=2)

client = SearchClientSync(APPID, APPKey)
resp = client.replace_all_objects("mx-space", to_push)
print(resp)
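Before pushing for real, a round-trip check gives some confidence that the chunking is lossless: the decoded chunks must concatenate back to the original text, each chunk must fit the budget, and the objectIDs must stay unique. A self-contained sketch (it re-inlines the splitting logic above and runs it on a synthetic record):

```python
import json
import math
from copy import deepcopy

MAXSIZE = 9990

def json_length(item):
    # serialized record size in bytes, the quantity Algolia limits
    return len(json.dumps(item, ensure_ascii=False).encode("utf-8"))

def decodable(chunk: bytes) -> bool:
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def split_record(item):
    chunks = []
    raw = item["text"].encode("utf-8")
    text_length = len(raw)
    # per-chunk text budget = MAXSIZE minus the non-text JSON overhead
    n = math.ceil(text_length / (MAXSIZE - json_length(item) + text_length))
    start = 0
    for i in range(n):
        new_item = deepcopy(item)
        new_item["objectID"] = f"{item['objectID']}_{i}"
        # the final chunk takes every remaining byte
        end = text_length if i == n - 1 else start + text_length // n
        # back off to a character boundary so the slice decodes cleanly
        while not decodable(raw[start:end]):
            end -= 1
        new_item["text"] = raw[start:end].decode("utf-8")
        start = end
        chunks.append(new_item)
    return chunks

record = {"objectID": "abc", "title": "demo", "text": "汉字与ascii" * 2000}
chunks = split_record(record)
assert "".join(c["text"] for c in chunks) == record["text"]  # lossless
assert all(json_length(c) <= MAXSIZE for c in chunks)        # fits budget
assert len({c["objectID"] for c in chunks}) == len(chunks)   # unique IDs
print(f"{len(chunks)} chunks, all checks passed")
```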

If you use a different blog framework, this is as far as you need to read; I hope it gave you some ideas.

Great. With the search index rewritten in Python, re-submitted to Algolia, and search enabled in the mx-space admin panel, let's try searching the over-limit post about JPEG encoding details.
Why are there no results? And why is the backend throwing errors again?

17:03:46  ERROR   [Catch]  Cast to ObjectId failed for value "1234567_0" (type string) at path "_id" for model "posts"
  at SchemaObjectId.cast (entrypoints.js:1073:883)
  at SchemaType.applySetters (entrypoints.js:1187:226)
  at SchemaType.castForQuery (entrypoints.js:1199:338)
  at cast (entrypoints.js:159:5360)
  at Query.cast (entrypoints.js:799:583)
  at Query._castConditions (entrypoints.js:765:9879)
  at Hr.Query._findOne (entrypoints.js:768:4304)
  at Hr.Query.exec (entrypoints.js:784:5145)
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
  at async Promise.all (index 0)
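The failure is easy to reproduce in isolation: "1234567_0" can no longer be cast to an ObjectId. Since a MongoDB ObjectId is plain hexadecimal and never contains an underscore, cutting everything from the first underscore onward recovers the original id. A minimal sketch of the mapping (the helper name is my own, not mx-space's):

```python
def original_object_id(object_id: str) -> str:
    # chunk suffixes were appended as "_<n>"; ObjectIds are hex-only,
    # so splitting on the first "_" is always safe
    return object_id.split("_", 1)[0]

print(original_object_id("1234567_0"))  # → 1234567
```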

Time to Edit the mx-space Code#

The log above makes it easy to see that mx-space uses the document's ObjectId, not its id, as the Algolia objectID and looks records up by it, which the suffixed value breaks. The relevant spot is here (the listing below is my modified version of the file):

https://github.com/mx-space/core/blob/20a1eef/apps/core/src/modules/search/search.service.ts#L164-L165

import type { SearchResponse } from '@algolia/client-search'
import {
  BadRequestException,
  forwardRef,
  Inject,
  Injectable,
  Logger,
} from '@nestjs/common'
import { OnEvent } from '@nestjs/event-emitter'
import { CronExpression } from '@nestjs/schedule'
import { CronDescription } from '~/common/decorators/cron-description.decorator'
import { CronOnce } from '~/common/decorators/cron-once.decorator'
import { BusinessEvents } from '~/constants/business-event.constant'
import { EventBusEvents } from '~/constants/event-bus.constant'
import type { SearchDto } from '~/modules/search/search.dto'
import { DatabaseService } from '~/processors/database/database.service'
import type { Pagination } from '~/shared/interface/paginator.interface'
import { transformDataToPaginate } from '~/transformers/paginate.transformer'
import algoliasearch from 'algoliasearch'
import removeMdCodeblock from 'remove-md-codeblock'
import { ConfigsService } from '../configs/configs.service'
import { NoteService } from '../note/note.service'
import { PageService } from '../page/page.service'
import { PostService } from '../post/post.service'

@Injectable()
export class SearchService {
  private readonly logger = new Logger(SearchService.name)
  constructor(
    @Inject(forwardRef(() => NoteService))
    private readonly noteService: NoteService,

    @Inject(forwardRef(() => PostService))
    private readonly postService: PostService,

    @Inject(forwardRef(() => PageService))
    private readonly pageService: PageService,

    private readonly configs: ConfigsService,
    private readonly databaseService: DatabaseService,
  ) {}

  async searchNote(searchOption: SearchDto, showHidden: boolean) {
    const { keyword, page, size } = searchOption
    const select = '_id title created modified nid'

    const keywordArr = keyword
      .split(/\s+/)
      .map((item) => new RegExp(String(item), 'gi'))

    return transformDataToPaginate(
      await this.noteService.model.paginate(
        {
          $or: [{ title: { $in: keywordArr } }, { text: { $in: keywordArr } }],
          $and: [
            { password: { $not: null } },
            { isPublished: { $in: showHidden ? [false, true] : [true] } },
            {
              $or: [
                { publicAt: { $not: null } },
                { publicAt: { $lte: new Date() } },
              ],
            },
          ],
        },
        {
          limit: size,
          page,
          select,
        },
      ),
    )
  }

  async searchPost(searchOption: SearchDto) {
    const { keyword, page, size } = searchOption
    const select = '_id title created modified categoryId slug'
    const keywordArr = keyword
      .split(/\s+/)
      .map((item) => new RegExp(String(item), 'gi'))
    return await this.postService.model.paginate(
      {
        $or: [{ title: { $in: keywordArr } }, { text: { $in: keywordArr } }],
      },
      {
        limit: size,
        page,
        select,
      },
    )
  }

  public async getAlgoliaSearchIndex() {
    const { algoliaSearchOptions } = await this.configs.waitForConfigReady()
    if (!algoliaSearchOptions.enable) {
      throw new BadRequestException('algolia not enable.')
    }
    if (
      !algoliaSearchOptions.appId ||
      !algoliaSearchOptions.apiKey ||
      !algoliaSearchOptions.indexName
    ) {
      throw new BadRequestException('algolia not config.')
    }
    const client = algoliasearch(
      algoliaSearchOptions.appId,
      algoliaSearchOptions.apiKey,
    )
    const index = client.initIndex(algoliaSearchOptions.indexName)
    return index
  }

  async searchAlgolia(searchOption: SearchDto): Promise<
    | SearchResponse<{
        id: string
        text: string
        title: string
        type: 'post' | 'note' | 'page'
      }>
    | (Pagination<any> & {
        raw: SearchResponse<{
          id: string
          text: string
          title: string
          type: 'post' | 'note' | 'page'
        }>
      })
  > {
    const { keyword, size, page } = searchOption
    const index = await this.getAlgoliaSearchIndex()

    const search = await index.search<{
      id: string
      text: string
      title: string
      type: 'post' | 'note' | 'page'
    }>(keyword, {
      // start with 0
      page: page - 1,
      hitsPerPage: size,
      attributesToRetrieve: ['*'],
      snippetEllipsisText: '...',
      responseFields: ['*'],
      facets: ['*'],
    })
    if (searchOption.rawAlgolia) {
      return search
    }
    const data: any[] = []
    const tasks = search.hits.map((hit) => {
      const { type, objectID } = hit

      const model = this.databaseService.getModelByRefType(type as 'post')
      if (!model) {
        return Promise.resolve()
      }
      return model
        .findById(objectID.split('_')[0])
        .select('_id title created modified categoryId slug nid')
        .lean({
          getters: true,
          autopopulate: true,
        })
        .then((doc) => {
          if (doc) {
            Reflect.set(doc, 'type', type)
            data.push(doc)
          }
        })
    })
    await Promise.all(tasks)
    return {
      data,
      raw: search,
      pagination: {
        currentPage: page,
        total: search.nbHits,
        hasNextPage: search.nbPages > search.page,
        hasPrevPage: search.page > 1,
        size: search.hitsPerPage,
        totalPage: search.nbPages,
      },
    }
  }

  /**
   * @description 每天凌晨推送一遍 Algolia Search
   */
  @CronOnce(CronExpression.EVERY_DAY_AT_MIDNIGHT, {
    name: 'pushToAlgoliaSearch',
  })
  @CronDescription('推送到 Algolia Search')
  @OnEvent(EventBusEvents.PushSearch)
  @OnEvent(BusinessEvents.POST_CREATE)
  @OnEvent(BusinessEvents.POST_UPDATE)
  @OnEvent(BusinessEvents.POST_DELETE)
  @OnEvent(BusinessEvents.NOTE_CREATE)
  @OnEvent(BusinessEvents.NOTE_UPDATE)
  @OnEvent(BusinessEvents.NOTE_DELETE)
  async pushAllToAlgoliaSearch() {
    const configs = await this.configs.waitForConfigReady()
    if (!configs.algoliaSearchOptions.enable || isDev) {
      return
    }
    const index = await this.getAlgoliaSearchIndex()

    this.logger.log('--> 开始推送到 Algolia')

    const documents = await this.buildAlgoliaIndexData()
    try {
      await Promise.all([
        index.replaceAllObjects(documents, {
          autoGenerateObjectIDIfNotExist: false,
        }),
        index.setSettings({
          attributesToHighlight: ['text', 'title'],
        }),
      ])

      this.logger.log('--> 推送到 algoliasearch 成功')
    } catch (error) {
      Logger.error('algolia 推送错误', 'AlgoliaSearch')
      throw error
    }
  }

  private canBeDecoded(textEncoded: Uint8Array): boolean {
    try {
      new TextDecoder('utf-8', { fatal: true }).decode(textEncoded)
      return true
    } catch {
      return false
    }
  }

  async buildAlgoliaIndexData() {
    const combineDocuments = await Promise.all([
      this.postService.model
        .find()
        .select('title text categoryId category slug')
        .populate('category', 'name slug')
        .lean()

        .then((list) => {
          return list.map((data) => {
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            return {
              ...data,
              text: removeMdCodeblock(data.text),
              type: 'post',
            }
          })
        }),
      this.pageService.model
        .find({}, 'title text slug subtitle')
        .lean()
        .then((list) => {
          return list.map((data) => {
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            return {
              ...data,
              type: 'page',
            }
          })
        }),
      this.noteService.model
        .find(
          {
            isPublished: true,
            $or: [
              { password: undefined },
              { password: null },
              { password: { $exists: false } },
            ],
          },
          'title text nid',
        )
        .lean()
        .then((list) => {
          return list.map((data) => {
            const id = data.nid.toString()
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            Reflect.deleteProperty(data, 'nid')
            return {
              ...data,
              type: 'note',
              id,
            }
          })
        }),
    ])

    const { algoliaSearchOptions } = await this.configs.waitForConfigReady()

    const combineDocumentsSplited: any[] = []
    combineDocuments.flat().forEach((item) => {
      const objectToAdjust = JSON.parse(JSON.stringify(item))
      objectToAdjust.text = objectToAdjust.text.replaceAll(
        /<style[^>]*>[\s\S]*?<\/style>/gi,
        '',
      )
      objectToAdjust.text = objectToAdjust.text.replaceAll(
        /<script[^>]*>[\s\S]*?<\/script>/gi,
        '',
      )
      const encodedSize = new TextEncoder().encode(
        JSON.stringify(objectToAdjust),
      ).length
      if (encodedSize <= algoliaSearchOptions.maxTruncateSize) {
        objectToAdjust.objectID = `${objectToAdjust.objectID}_0`
        combineDocumentsSplited.push(objectToAdjust)
      } else {
        const textEncoded = new TextEncoder().encode(objectToAdjust.text)
        const textSize = textEncoded.length
        const n = Math.ceil(
          textSize /
            (algoliaSearchOptions.maxTruncateSize - encodedSize + textSize),
        )
        let start = 0
        for (let i = 0; i < n; i++) {
          const newObject = JSON.parse(JSON.stringify(objectToAdjust))
          let end = start + Math.floor(textSize / n)
          while (!this.canBeDecoded(textEncoded.slice(start, end))) {
            end--
          }
          newObject.text = new TextDecoder('utf-8').decode(
            textEncoded.slice(start, end),
          )
          newObject.objectID = `${newObject.objectID}_${i}`
          combineDocumentsSplited.push(newObject)
          start = end
        }
      }
    })
    return combineDocumentsSplited
  }
}

A Side Note#

While editing the code, I discovered that truncation of over-long records was already implemented, just with a 100 KB threshold, so the developer is evidently a paying user. Personally, I think this would be better exposed as a configurable setting (an environment variable, say) rather than hard-coded.

https://github.com/mx-space/core/blob/20a1eef/apps/core/src/modules/search/search.service.ts#L370
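Incidentally, my own script above hard-codes MAXSIZE too; reading it from an environment variable with a sensible default is a one-line change (the variable name ALGOLIA_MAX_RECORD_SIZE here is my own invention, not an mx-space or Algolia convention):

```python
import os

# fall back to just under the 10 KB free-tier budget when the variable is unset
MAXSIZE = int(os.environ.get("ALGOLIA_MAX_RECORD_SIZE", "9990"))
print(MAXSIZE)
```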

2024/12/21: the author has since made the truncation size configurable. I still prefer submitting chunked records, though, since that keeps full-text search possible.

https://github.com/mx-space/core/commit/6da1c13799174e746708844d0b149b4607e8f276
