## How to Configure Algolia Search

The mx-space documentation includes a fairly detailed setup guide; other blog frameworks should be broadly similar.
## Index Size Limit

Unfortunately, after configuring everything per the documentation, the log reported an error:

```
16:40:40 ERROR [AlgoliaSearch] algolia 推送错误
16:40:40 ERROR [Event] Record at the position 10 objectID=xxxxxxxx is too big size=12097/10000 bytes. Please have a look at
https://www.algolia.com/doc/guides/sending-and-managing-data/prepare-your-data/in-depth/index-and-records-size-and-usage-limitations/#record-size-limits
```

The cause is clear: one of my posts is too long, and Algolia's free plan allows only 10 KB per record. As someone determined to stay on the free tier, I couldn't accept that, so I immediately set out to fix it.
## Solution

### Approach

With mx-space, once an API token is configured, the JSON meant for manual submission to the Algolia index can be fetched from `/api/v2/search/algolia/import-json`.
It contains a list of posts, pages, and notes; a sample record looks like this:
```json
{
  "title": "南京大学IPv4地址范围",
  "text": "# 动机\n\n<details>\n<summary>动机来自于搭建的网页。由于校内和公网都有搭建....",
  "slug": "nju-ipv4",
  "categoryId": "abcdefg",
  "category": {
    "_id": "abcdefg",
    "name": "其他",
    "slug": "others",
    "id": "abcdefg"
  },
  "id": "1234567",
  "objectID": "1234567",
  "type": "post"
},
```
The `objectID` field is the critical one: every record submitted to Algolia must have a unique `objectID`.
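Before splitting anything, it helps to see which records actually breach the limit. Algolia measures the size of the serialized record, so a minimal check over the exported JSON (assumed saved as `data.json`; the script below does exactly that) looks like:

```python
import json

LIMIT = 10000  # Algolia free-plan cap, in bytes per serialized record

with open("data.json", encoding="utf-8") as f:
    records = json.load(f)

# report every record whose serialized size exceeds the cap
for rec in records:
    size = len(json.dumps(rec, ensure_ascii=False).encode("utf-8"))
    if size > LIMIT:
        print(f"{rec['title']}: {size} bytes")
```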
The approach I came up with was pagination: split any record whose `text` is too long into several chunks, giving each chunk its own `objectID`. Since the overhead (every field except `text`) stays constant, each chunk can carry at most `MAXSIZE` minus that overhead in text bytes; for example, with 500 bytes of overhead and 12,000 bytes of text, each chunk may hold 9990 - 500 = 9490 bytes, so two chunks suffice. Problem solved, right?! (Clearly, I had not yet realized how much trouble this would cause.)

Some of my pages also contain `<style>` and `<script>` blocks; those can simply be deleted with a regex match.

This leads to the following Python script, which edits the JSON downloaded from the endpoint above and submits it to Algolia:
```python
from algoliasearch.search.client import SearchClientSync
import requests
import json
import math
from copy import deepcopy
import re

MAXSIZE = 9990  # stay a little below Algolia's 10,000-byte record limit
APPID = "..."
APPKey = "..."
MXSPACETOKEN = "..."

# Download the index data from mx-space
url = "https://www.do1e.cn/api/v2/search/algolia/import-json"
headers = {
    "Authorization": MXSPACETOKEN,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0",
}
ret = requests.get(url, headers=headers)
ret = ret.json()
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(ret, f, ensure_ascii=False, indent=2)

to_push = []

def json_length(item):
    """Size in bytes of the serialized record, as Algolia counts it."""
    content = json.dumps(item, ensure_ascii=False).encode("utf-8")
    return len(content)

def right_text(text):
    """Check that a byte slice is still valid UTF-8."""
    try:
        text.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def cut_json(item):
    length = json_length(item)
    text_length = len(item["text"].encode("utf-8"))
    # Number of chunks: each chunk may carry at most
    # MAXSIZE - (length - text_length) bytes of text
    n = math.ceil(text_length / (MAXSIZE - length + text_length))
    start = 0
    text_content = item["text"].encode("utf-8")
    for i in range(n):
        new_item = deepcopy(item)
        new_item["objectID"] = f"{item['objectID']}_{i}"
        # the last chunk takes the remainder, so no bytes are dropped
        end = text_length if i == n - 1 else start + text_length // n
        # back up if the cut lands inside a multi-byte UTF-8 character
        # (a CJK character spans several bytes)
        while not right_text(text_content[start:end]):
            end -= 1
        new_item["text"] = text_content[start:end].decode("utf-8")
        start = end
        to_push.append(new_item)

for item in ret:
    # strip <style> and <script> tags
    item["text"] = re.sub(r"<style.*?>.*?</style>", "", item["text"], flags=re.DOTALL)
    item["text"] = re.sub(r"<script.*?>.*?</script>", "", item["text"], flags=re.DOTALL)
    if json_length(item) > MAXSIZE:  # over the limit: split it
        print(f"{item['title']} is too large, cut it")
        cut_json(item)
    else:  # under the limit: still suffix the objectID for consistency
        item["objectID"] = f"{item['objectID']}_0"
        to_push.append(item)

with open("topush.json", "w", encoding="utf-8") as f:
    json.dump(to_push, f, ensure_ascii=False, indent=2)

client = SearchClientSync(APPID, APPKey)
resp = client.replace_all_objects("mx-space", to_push)
print(resp)
```
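Before re-enabling search, it is worth a sanity check that every record now fits. A quick assertion in the same session, reusing `json_length`, `to_push`, and `MAXSIZE` from the script above:

```python
# verify that no chunk still exceeds the size budget
for item in to_push:
    size = json_length(item)
    assert size <= MAXSIZE, f"{item['objectID']} still too big: {size} bytes"
print(f"OK: {len(to_push)} records, all within {MAXSIZE} bytes")
```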
If you use a different blog framework, this is as far as you need to read; I hope it gave you some ideas.

Great. With the search index edited in Python, re-submitted to Algolia, and search enabled in the mx-space admin panel, let's try searching for that over-limit post on JPEG encoding details.

Why are there no results? And why is the backend throwing errors again?
```
17:03:46 ERROR [Catch] Cast to ObjectId failed for value "1234567_0" (type string) at path "_id" for model "posts"
    at SchemaObjectId.cast (entrypoints.js:1073:883)
    at SchemaType.applySetters (entrypoints.js:1187:226)
    at SchemaType.castForQuery (entrypoints.js:1199:338)
    at cast (entrypoints.js:159:5360)
    at Query.cast (entrypoints.js:799:583)
    at Query._castConditions (entrypoints.js:765:9879)
    at Hr.Query._findOne (entrypoints.js:768:4304)
    at Hr.Query.exec (entrypoints.js:784:5145)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 0)
```
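The failing value `1234567_0` is simply not a valid MongoDB ObjectId, which must be a 24-character hex string (or 12 raw bytes). The same cast failure can be reproduced outside Mongoose, for instance with the `bson` package bundled with PyMongo (assuming it is installed):

```python
from bson import ObjectId
from bson.errors import InvalidId

try:
    ObjectId("1234567_0")  # the suffixed ID generated by the script above
except InvalidId as err:
    print(err)  # '1234567_0' is not a valid ObjectId ...
```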
## Time to Edit the mx-space Code

The log makes it easy to see that mx-space uses the `ObjectId` as the lookup key rather than `id`, so the suffixed `objectID` values no longer cast. The fix: append `_{i}` to each chunk's `objectID` when building the index, and strip the suffix with `objectID.split('_')[0]` before calling `findById` when resolving hits. I located the relevant code and edited it as follows:
```typescript
import type { SearchResponse } from '@algolia/client-search'
import {
  BadRequestException,
  forwardRef,
  Inject,
  Injectable,
  Logger,
} from '@nestjs/common'
import { OnEvent } from '@nestjs/event-emitter'
import { CronExpression } from '@nestjs/schedule'
import { CronDescription } from '~/common/decorators/cron-description.decorator'
import { CronOnce } from '~/common/decorators/cron-once.decorator'
import { BusinessEvents } from '~/constants/business-event.constant'
import { EventBusEvents } from '~/constants/event-bus.constant'
import { isDev } from '~/global/env.global'
import type { SearchDto } from '~/modules/search/search.dto'
import { DatabaseService } from '~/processors/database/database.service'
import type { Pagination } from '~/shared/interface/paginator.interface'
import { transformDataToPaginate } from '~/transformers/paginate.transformer'
import algoliasearch from 'algoliasearch'
import removeMdCodeblock from 'remove-md-codeblock'

import { ConfigsService } from '../configs/configs.service'
import { NoteService } from '../note/note.service'
import { PageService } from '../page/page.service'
import { PostService } from '../post/post.service'

@Injectable()
export class SearchService {
  private readonly logger = new Logger(SearchService.name)

  constructor(
    @Inject(forwardRef(() => NoteService))
    private readonly noteService: NoteService,

    @Inject(forwardRef(() => PostService))
    private readonly postService: PostService,

    @Inject(forwardRef(() => PageService))
    private readonly pageService: PageService,

    private readonly configs: ConfigsService,
    private readonly databaseService: DatabaseService,
  ) {}

  async searchNote(searchOption: SearchDto, showHidden: boolean) {
    const { keyword, page, size } = searchOption
    const select = '_id title created modified nid'

    const keywordArr = keyword
      .split(/\s+/)
      .map((item) => new RegExp(String(item), 'gi'))
    return transformDataToPaginate(
      await this.noteService.model.paginate(
        {
          $or: [{ title: { $in: keywordArr } }, { text: { $in: keywordArr } }],
          $and: [
            { password: { $not: null } },
            { isPublished: { $in: showHidden ? [false, true] : [true] } },
            {
              $or: [
                { publicAt: { $not: null } },
                { publicAt: { $lte: new Date() } },
              ],
            },
          ],
        },
        {
          limit: size,
          page,
          select,
        },
      ),
    )
  }

  async searchPost(searchOption: SearchDto) {
    const { keyword, page, size } = searchOption
    const select = '_id title created modified categoryId slug'
    const keywordArr = keyword
      .split(/\s+/)
      .map((item) => new RegExp(String(item), 'gi'))
    return await this.postService.model.paginate(
      {
        $or: [{ title: { $in: keywordArr } }, { text: { $in: keywordArr } }],
      },
      {
        limit: size,
        page,
        select,
      },
    )
  }

  public async getAlgoliaSearchIndex() {
    const { algoliaSearchOptions } = await this.configs.waitForConfigReady()
    if (!algoliaSearchOptions.enable) {
      throw new BadRequestException('algolia not enable.')
    }
    if (
      !algoliaSearchOptions.appId ||
      !algoliaSearchOptions.apiKey ||
      !algoliaSearchOptions.indexName
    ) {
      throw new BadRequestException('algolia not config.')
    }
    const client = algoliasearch(
      algoliaSearchOptions.appId,
      algoliaSearchOptions.apiKey,
    )
    const index = client.initIndex(algoliaSearchOptions.indexName)

    return index
  }

  async searchAlgolia(searchOption: SearchDto): Promise<
    | SearchResponse<{
        id: string
        text: string
        title: string
        type: 'post' | 'note' | 'page'
      }>
    | (Pagination<any> & {
        raw: SearchResponse<{
          id: string
          text: string
          title: string
          type: 'post' | 'note' | 'page'
        }>
      })
  > {
    const { keyword, size, page } = searchOption
    const index = await this.getAlgoliaSearchIndex()

    const search = await index.search<{
      id: string
      text: string
      title: string
      type: 'post' | 'note' | 'page'
    }>(keyword, {
      // Algolia pages start at 0
      page: page - 1,
      hitsPerPage: size,
      attributesToRetrieve: ['*'],
      snippetEllipsisText: '...',
      responseFields: ['*'],
      facets: ['*'],
    })
    if (searchOption.rawAlgolia) {
      return search
    }
    const data: any[] = []
    const tasks = search.hits.map((hit) => {
      const { type, objectID } = hit

      const model = this.databaseService.getModelByRefType(type as 'post')
      if (!model) {
        return Promise.resolve()
      }
      // strip the `_{i}` chunk suffix to recover the real ObjectId
      return model
        .findById(objectID.split('_')[0])
        .select('_id title created modified categoryId slug nid')
        .lean({
          getters: true,
          autopopulate: true,
        })
        .then((doc) => {
          if (doc) {
            Reflect.set(doc, 'type', type)
            data.push(doc)
          }
        })
    })
    await Promise.all(tasks)

    return {
      data,
      raw: search,
      pagination: {
        currentPage: page,
        total: search.nbHits,
        hasNextPage: search.nbPages > search.page,
        hasPrevPage: search.page > 1,
        size: search.hitsPerPage,
        totalPage: search.nbPages,
      },
    }
  }

  /**
   * @description push everything to Algolia Search at midnight every day
   */
  @CronOnce(CronExpression.EVERY_DAY_AT_MIDNIGHT, {
    name: 'pushToAlgoliaSearch',
  })
  @CronDescription('推送到 Algolia Search')
  @OnEvent(EventBusEvents.PushSearch)
  @OnEvent(BusinessEvents.POST_CREATE)
  @OnEvent(BusinessEvents.POST_UPDATE)
  @OnEvent(BusinessEvents.POST_DELETE)
  @OnEvent(BusinessEvents.NOTE_CREATE)
  @OnEvent(BusinessEvents.NOTE_UPDATE)
  @OnEvent(BusinessEvents.NOTE_DELETE)
  async pushAllToAlgoliaSearch() {
    const configs = await this.configs.waitForConfigReady()
    if (!configs.algoliaSearchOptions.enable || isDev) {
      return
    }
    const index = await this.getAlgoliaSearchIndex()

    this.logger.log('--> 开始推送到 Algolia')
    const documents = await this.buildAlgoliaIndexData()
    try {
      await Promise.all([
        index.replaceAllObjects(documents, {
          autoGenerateObjectIDIfNotExist: false,
        }),
        index.setSettings({
          attributesToHighlight: ['text', 'title'],
        }),
      ])
      this.logger.log('--> 推送到 algoliasearch 成功')
    } catch (error) {
      Logger.error('algolia 推送错误', 'AlgoliaSearch')
      throw error
    }
  }

  private canBeDecoded(textEncoded: Uint8Array): boolean {
    try {
      new TextDecoder('utf-8', { fatal: true }).decode(textEncoded)
      return true
    } catch {
      return false
    }
  }

  async buildAlgoliaIndexData() {
    const combineDocuments = await Promise.all([
      this.postService.model
        .find()
        .select('title text categoryId category slug')
        .populate('category', 'name slug')
        .lean()
        .then((list) => {
          return list.map((data) => {
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            return {
              ...data,
              text: removeMdCodeblock(data.text),
              type: 'post',
            }
          })
        }),
      this.pageService.model
        .find({}, 'title text slug subtitle')
        .lean()
        .then((list) => {
          return list.map((data) => {
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            return {
              ...data,
              type: 'page',
            }
          })
        }),
      this.noteService.model
        .find(
          {
            isPublished: true,
            $or: [
              { password: undefined },
              { password: null },
              { password: { $exists: false } },
            ],
          },
          'title text nid',
        )
        .lean()
        .then((list) => {
          return list.map((data) => {
            const id = data.nid.toString()
            Reflect.set(data, 'objectID', data._id)
            Reflect.deleteProperty(data, '_id')
            Reflect.deleteProperty(data, 'nid')
            return {
              ...data,
              type: 'note',
              id,
            }
          })
        }),
    ])
    const { algoliaSearchOptions } = await this.configs.waitForConfigReady()
    const combineDocumentsSplited: any[] = []
    combineDocuments.flat().forEach((item) => {
      const objectToAdjust = JSON.parse(JSON.stringify(item))
      // strip <style> and <script> tags
      objectToAdjust.text = objectToAdjust.text.replaceAll(
        /<style[^>]*>[\s\S]*?<\/style>/gi,
        '',
      )
      objectToAdjust.text = objectToAdjust.text.replaceAll(
        /<script[^>]*>[\s\S]*?<\/script>/gi,
        '',
      )
      const encodedSize = new TextEncoder().encode(
        JSON.stringify(objectToAdjust),
      ).length
      if (encodedSize <= algoliaSearchOptions.maxTruncateSize) {
        // small enough: keep it whole, but still suffix the objectID
        objectToAdjust.objectID = `${objectToAdjust.objectID}_0`
        combineDocumentsSplited.push(objectToAdjust)
      } else {
        // too big: split the text into n chunks, each within the budget
        // of maxTruncateSize minus the size of the non-text fields
        const textEncoded = new TextEncoder().encode(objectToAdjust.text)
        const textSize = textEncoded.length
        const n = Math.ceil(
          textSize /
            (algoliaSearchOptions.maxTruncateSize - encodedSize + textSize),
        )
        let start = 0
        for (let i = 0; i < n; i++) {
          const newObject = JSON.parse(JSON.stringify(objectToAdjust))
          // the last chunk takes the remainder, so no bytes are dropped
          let end = i === n - 1 ? textSize : start + Math.floor(textSize / n)
          // back up if the cut lands inside a multi-byte UTF-8 character
          while (!this.canBeDecoded(textEncoded.slice(start, end))) {
            end--
          }
          newObject.text = new TextDecoder('utf-8').decode(
            textEncoded.slice(start, end),
          )
          newObject.objectID = `${newObject.objectID}_${i}`
          combineDocumentsSplited.push(newObject)
          start = end
        }
      }
    })
    return combineDocumentsSplited
  }
}
```
## A Side Note

While editing the code, I discovered that it already truncated records exceeding a length limit. That limit was defined as 100 KB, so the developer is apparently a paying customer. Personally, I think the limit would be better read from an environment variable than hard-coded:
https://github.com/mx-space/core/blob/20a1eef/apps/core/src/modules/search/search.service.ts#L370
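Applied to the standalone Python script above, that suggestion is a one-liner; the variable name `ALGOLIA_MAX_RECORD_SIZE` is my own invention:

```python
import os

# Read the per-record byte limit from the environment, defaulting to 9990.
# ALGOLIA_MAX_RECORD_SIZE is a hypothetical variable name.
MAXSIZE = int(os.environ.get("ALGOLIA_MAX_RECORD_SIZE", "9990"))
```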
2024/12/21: The author has since made the truncation size configurable, but I still prefer chunked submission, since it keeps full-text search intact:
https://github.com/mx-space/core/commit/6da1c13799174e746708844d0b149b4607e8f276