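I had many articles stored as md files, spread across category directories under a single parent directory, and I used docsify as the framework that renders them into a website.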

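docsify is pleasant to use. While the site lived on github.io I had no SEO requirements, but once SEO mattered, docsify's SEO support was still waiting on its next major version. I could not wait that long, so I decided to batch-convert all the md files for hugo, adding support for hugo's Front Matter, inline md, summary, and similar settings.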

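This python script is meant to assist a person, not to do everything a person should do. It has no machine-learning capability for analyzing articles locally. When I get the chance I may use machine learning to generate categories, tags, keywords, and titles automatically, although the generated information would still need to be refined by hand.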

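Script overview

The script implements a Python class with two entry points:

1. scanFiles: scans all md files in the source directory, extracts the information hugo needs for each file (Front Matter and so on), and writes the converted files under the hugo project's content/post/ directory, recreating the source directory hierarchy.

2. scanFile: takes the path of a single md file, extracts the same hugo information, and writes the converted file under content/post/, recreating the directory hierarchy.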
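
The directory mirroring that both methods perform can be sketched as follows. This is a minimal illustration rather than the script's own code: `mirror_path` and `ensure_parent_dir` are hypothetical helpers, and the sketch derives the relative path with `os.path.relpath` instead of string replacement.

```python
import os


def mirror_path(src_file, src_root, dest_root):
    """Return the path under dest_root that mirrors src_file's position under src_root."""
    # relpath yields the part of src_file that lies below src_root
    relative = os.path.relpath(src_file, src_root)
    return os.path.join(dest_root, relative)


def ensure_parent_dir(path):
    """Create the parent directory of path if it does not exist yet."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)


if __name__ == "__main__":
    src = os.path.join("docs", "php", "a.md")
    print(mirror_path(src, "docs", os.path.join("content", "post")))
```

Calling `ensure_parent_dir` on the returned path before opening it for writing would cover the "create directories by hierarchy" step.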

usage:

& python hugo-md-format.py > mdlog

Implementation

Preparation: install the jieba and enchant modules (enchant is optional for now; it is only used to check English words)
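
As a rough sketch of how segmented words become keyword candidates: the script counts the tokens produced by `jieba.cut` and skips stop words. The example below shows only the counting step with `collections.Counter`, and takes an already segmented token list as input so that it runs without jieba. `count_words` and the sample tokens are illustrative assumptions.

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    counts = Counter()
    for token in tokens:
        if token in stopwords or token.strip() == "":
            continue
        counts[token] += 1
    return counts


if __name__ == "__main__":
    # In the real script the tokens would come from jieba.cut(content)
    tokens = ["PHP", "的", "实现", " ", "PHP", "方法"]
    print(count_words(tokens, {"的"}).most_common(2))
```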

import os
import time
import datetime
# import enchant
import jieba


class HugoMarkdown:

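    # __srcDir = r'I:\src\hugo\docs'  # source article directory
    # __desDir = r'I:\src\hugo\9ong\content\post'  # destination directory

    # raw strings keep the backslashes in Windows paths literal
    __srcDir = r'I:\src\github-page\docs'  # source article directory
    __desDir = r'I:\src\hugo\9ong\content\post'  # destination directory
    __ignoreFile = ["index.md","README.md",'more.md']  # files to skip
    __ignoreParentDir = ["docs","post","content","互联网"]  # parent directories not used as categories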


    def __init__(self):
        print("···HugoMarkdown···\n")
        

    # Walk every file under the source article directory and process them in batch
    def scanFiles(self):
        print("No longer used, unless a new directory of md files needs batch conversion")
        return False

        print("Walking source article directory:",self.__srcDir,"\n")
        for root,dirs,files in os.walk(self.__srcDir):
            for file in files:

                print("\n-----Processing article:",os.path.join(root,file),"-----\n")

                if self.__isIgnoreFile(file):
                    print("Skipping",file,"\n")
                    continue


                fileInfoDict = self.__getFileInfo(root,file)

                if (fileInfoDict['fileExt'] != ".md") or (fileInfoDict['parentDir']==''):
                    print("Skipping",file,"\n")
                    continue

                # debug output
                print(fileInfoDict,"\n")

                self.__adjustFIleContent(fileInfoDict)

                # process only one file, then leave all loops
                # return

    def scanFile(self,filePath):

        # single files are converted in place, so source and destination roots are the same
        self.__srcDir = self.__desDir

        root = os.path.dirname(filePath)
        file = os.path.basename(filePath)
        # print(os.path.join(root,file))
        # return False

        print("\n-----Processing article:",os.path.join(root,file),"-----\n")
        if self.__isIgnoreFile(file):
            print("Skipping",file,"\n")
            return False


        fileInfoDict = self.__getFileInfo(root,file)

        if (fileInfoDict['fileExt'] != ".md") or (fileInfoDict['parentDir']==''):
            print("Skipping",file,"\n")
            return False

        # debug output
        print(fileInfoDict,"\n")

        self.__adjustFIleContent(fileInfoDict)
        

    def __getFileInfo(self,root,file):
        print("Collecting article info:\n")
        # full file path
        filePath = os.path.join(root,file)
        # file name and extension
        filename,fileExt = os.path.splitext(file)
        # containing directory and its parent directory
        parentDir = os.path.basename(root)
        grandpaDir = os.path.basename(os.path.dirname(root))
        if self.__isIgnoreParentDir(parentDir):
            parentDir = ""

        if self.__isIgnoreParentDir(grandpaDir):
            grandpaDir = ""

        # file timestamps
        fileCtime = self.__timeToDate(os.path.getctime(filePath),"%Y-%m-%d")
        fileMtime = self.__timeToDate(os.path.getmtime(filePath),"%Y-%m-%d")

        return {
            "filePath":filePath,
            "fileName":filename,
            "fileExt":fileExt,
            "parentDir":parentDir,
            "grandpaDir":grandpaDir,
            "fileCtime":fileCtime,
            "fileMtime":fileMtime
        }

    def __isIgnoreParentDir(self,parentDir):
        # True when the directory name should not be used as a category
        return parentDir in self.__ignoreParentDir

    # Adjust the article content: meta block, TOC, MORE marker
    def __adjustFIleContent(self,fileInfoDict):
        # read the article content and extract keywords
        print("Reading article content...\n")
        with open(fileInfoDict['filePath'],"r",encoding="utf-8") as mdFile:
            content = mdFile.read().strip()

            fileInfoDict['keywords'] = self.__getKeywords(content,fileInfoDict['fileName'])

            content = self.__getMmeta(fileInfoDict) + self.__insertMoreToContent(content)

            # write the new file
            self.__writeNewMarkdownFile(content,fileInfoDict)

    # Build the meta (Front Matter) block
    def __getMmeta(self,fileInfoDict):
        print("Preparing article meta:","\n")
        meta = ""
        metaTitle = "title: \""+fileInfoDict['fileName']+"\"\n"
        metaCJK = "isCJKLanguage: true\n"
        metaDate = "date: "+fileInfoDict['fileCtime']+"\n"
        metaCategories = "categories: \n"
        metaParentCategory = ""
        metaGrandpaCategory = ""
        metaTags = "tags: \n"
        metaTagsList = ""
        metaKeywords = "keywords: \n"
        metaKeywordsList = ""


        if fileInfoDict['grandpaDir']!='':
            metaGrandpaCategory = "- "+fileInfoDict['grandpaDir']+"\n"
        
        if fileInfoDict['parentDir']!='':
            metaParentCategory = "- "+fileInfoDict['parentDir']+"\n"
        
        if fileInfoDict['keywords']:
            for word in fileInfoDict['keywords']:
                metaTagsList += "- "+word+"\n"
                metaKeywordsList += "- "+word+"\n"

        meta = "---\n"+metaTitle+metaCJK+metaDate+metaCategories+metaGrandpaCategory+metaParentCategory+metaTags+metaTagsList+metaKeywords+metaKeywordsList+"---\n\n"
        print(meta,"\n")
        return meta

    # Insert <!--more--> into the article
    def __insertMoreToContent(self,content):
        tocFlag = '<!-- /TOC -->'
        if (content.find(tocFlag) != -1):
            print("Found",tocFlag,"\n")
            content = content.replace(tocFlag,tocFlag+"\n"+'<!--more-->'+"\n")
        else:
            print("Did not find",tocFlag,"\n")
            contents = content.splitlines()
            contentsLen = len(contents)
            # no TOC: place the marker after the fifth line
            if contentsLen>4:
                contents[4] = contents[4]+"\n"+'<!--more-->'+"\n"
                content = "\n".join(contents)

        print("Inserting <!--more-->...","\n")
        return content

    def __writeNewMarkdownFile(self,content,fileInfoDict):
        # path of the article relative to the source root
        relativeFilePath = fileInfoDict['filePath'].replace(self.__srcDir,"")

        desFilePath = self.__desDir+relativeFilePath
        print("Writing new file:",desFilePath,"\n")
        desDirPath = os.path.dirname(desFilePath)
        # print("##Final Path:"+desFilePath)
        # return
        if not os.path.exists(desDirPath):
            os.makedirs(desDirPath)
        with open(desFilePath,"w",encoding="utf-8") as nf:
            nf.write(content)

        if os.path.exists(desFilePath):
            print("----- Finished processing article:",desFilePath," -----\n")
        else:
            print("---- Failed to write new file! -----\n")

    def __isIgnoreFile(self,file):
        # True when the file name is on the skip list
        return file in self.__ignoreFile

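    # Convert a timestamp to a date string.
    # Note: the format argument is not applied; the full "%Y-%m-%d %H:%M:%S" form is always returned.
    def __timeToDate(self,timeStamp,format="%Y-%m-%d %H:%M:%S"):
        timeArray = time.localtime(timeStamp)
        return time.strftime("%Y-%m-%d %H:%M:%S", timeArray)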
    

    # Extract the article keywords
    def __getKeywords(self,content,filename):
        keywords = self.__wordStatistics(content,filename)
        # sort by count, highest first, and keep the top 50
        keywordsList = sorted(keywords.items(), key=lambda item:item[1], reverse=True)
        keywordsList = keywordsList[0:50]
        keywordsList = self.__filterKeywords(keywordsList,filename)
        print("Kept keywords:",keywordsList,"\n")
        return keywordsList

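    # Word-frequency count
    def __wordStatistics(self,content,filename):
        # stopwords.txt holds one stop word per line and is read from the working directory
        with open('stopwords.txt', 'r', encoding='utf-8') as stopFile:
            stopwords = stopFile.read().split('\n')[:-1]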
        words_dict = {}
    
        temp = jieba.cut(content)
        for t in temp:
            if t in stopwords or t == 'unknow' or t.strip() == "":
                continue
            if t in words_dict.keys():
                words_dict[t] += 1
            else:
                words_dict[t] = 1

        # filenameCuts = jieba.cut(filename)                
        # for fc in filenameCuts:
        #     if fc in stopwords or fc == 'unknow' or fc.strip() == "":
        #         continue
        #     if fc in words_dict.keys():
        #         words_dict[fc] += 100
        #     else:
        #         words_dict[fc] = 100
        return words_dict

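    # Filter the keywords again: the word must appear in the file name (the title),
    # have at least 2 characters if Chinese, at least 3 characters as a string, and not be purely numeric.
    # As written, two-character Chinese words also fall into the wordLen<3 branch and are dropped.
    def __filterKeywords(self,keywordsList,filename):
        print("Analyzing article tags/keywords...\n")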
        newKeywordsList = []
        # print(keywordsList)
        # enD = enchant.Dict("en_US")
        for word,count in keywordsList:            

            # print(word,"\t",count)            
            wordLen = len(word)
            if filename.find(word)!=-1:
                if self.__isChinese(word) and wordLen<2:
                    continue
                elif wordLen<3:
                    continue                                        
                elif word.isdigit():
                    continue
                else:
                    newKeywordsList.append(word)
            # else:
            #     if wordLen>1 and self.__isChinese(word) and count>5:
            #         newKeywordsList.append(word)                
            #     elif wordLen>2 and enD.check(word) and count>5:
            #         newKeywordsList.append(word)   
            #     else:
            #         continue

        return newKeywordsList

    def __isChinese(self,word):
        for ch in word:
            if '\u4e00' <= ch <= '\u9fff':
                return True
        return False


if __name__ == '__main__':
    hm = HugoMarkdown()
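    # scanFiles walks a directory and processes every file in batch
    # hm.scanFiles()

    # Process a single file; the original file is overwritten, so keep a backup
    theFile = input(r'Enter the absolute path of the article, e.g. I:\src\xxx\xxx\content\post\其他\xxx.md:')
    hm.scanFile(theFile)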
    # theFile = r'I:\srcxxx\xxxx\content\post\其他\xxx.md'
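
The Front Matter above is assembled by plain string concatenation, so a title that contains a double quote would produce invalid YAML. One possible hardening is sketched below, on the assumption that JSON string quoting is acceptable as a YAML double-quoted scalar for these titles. `build_front_matter` is a hypothetical helper, not part of the script.

```python
import json


def build_front_matter(title, date, categories, tags):
    """Build a YAML front matter block shaped like the one the script writes."""
    lines = ["---"]
    # json.dumps escapes embedded quotes and backslashes in the title
    lines.append("title: " + json.dumps(title, ensure_ascii=False))
    lines.append("isCJKLanguage: true")
    lines.append("date: " + date)
    lines.append("categories:")
    # empty category names (ignored directories) are skipped
    lines.extend("- " + c for c in categories if c)
    lines.append("tags:")
    lines.extend("- " + t for t in tags)
    lines.append("keywords:")
    lines.extend("- " + t for t in tags)
    lines.append("---")
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(build_front_matter("PHP技术精华合集", "2020-05-27", ["", "其他"], ["PHP"]))
```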
    

Demo log

Here we demonstrate with the scanFile method, which processes a single file:

PS 9ong> & python hugo-md-format.py
···HugoMarkdown···

Enter the absolute path of the article, e.g. I:\src\hugo\9ong\content\post\其他\xxx.md:I:\src\hugo\9ong\content\post\其他\PHP技术精华合集.md

-----Processing article: I:\src\hugo\9ong\content\post\其他\PHP技术精华合集.md -----

Collecting article info:

{'filePath': 'I:\\src\\hugo\\9ong\\content\\post\\其他\\PHP技术精华合集.md', 'fileName': 'PHP技术精华合集', 'fileExt': '.md', 'parentDir': '其他', 'grandpaDir': '', 'fileCtime': '2020-05-27 15:48:53', 'fileMtime': '2020-05-27 16:42:34'}

Reading article content...

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\jm\AppData\Local\Temp\jieba.cache
Loading model cost 0.710 seconds.
Prefix dict has been built successfully.
Analyzing article tags/keywords...

com      210
http     209
mp       209
weixin   209
qq       209
s        209
__       209
biz      209
MzIwNjQ5MDk3NA   209
mid      209
idx      209
sn       209
chksm    209
scene    209
21       209
wechat   209
redirect         209
PHP      160
实现     61
php      43
微信     23
方法     21
功能     20
—        19
技术     17
支付     15
学习     13
互联网   12
语言     11
开发     11
文件     10
代码     10
数据     9
中       9
篇       9
##       8
网站     8
程序     8
登录     8
实例     8
详解     8
问题     8
使用     8
操作     7
处理     7
技巧     7
程序员   7
四年     7
精华     7
合集     7
Kept keywords: ['PHP']

Preparing article meta:

---
title: "PHP技术精华合集"
isCJKLanguage: true
date: 2020-05-27 15:48:53
categories:
- 其他
tags:
- PHP
keywords:
- PHP
---

 

Found <! -- /TOC -- >

Inserting < !--more-- >...

Writing new file: I:\src\hugo\9ong\content\post\其他\PHP技术精华合集.md

----- Finished processing article: I:\src\hugo\9ong\content\post\其他\PHP技术精华合集.md  -----
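
In the log above only PHP survives the keyword filter, although 技术, 精华, and 合集 also appear in the title. The comment on `__filterKeywords` states the rule as: the word occurs in the title, has at least 2 characters if it is Chinese or at least 3 otherwise, and is not purely numeric. Below is a sketch of a filter that applies that rule literally. `filter_keywords` and `contains_chinese` are hypothetical names, and this is one reading of the stated rule rather than the script's actual behavior.

```python
def contains_chinese(word):
    """Return True if any character falls in the CJK Unified Ideographs block."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in word)


def filter_keywords(words, title):
    """Keep words that occur in the title and meet the minimum-length rule."""
    kept = []
    for word in words:
        if word not in title or word.isdigit():
            continue
        # Chinese words need 2+ characters, all other words need 3+
        min_len = 2 if contains_chinese(word) else 3
        if len(word) >= min_len:
            kept.append(word)
    return kept


if __name__ == "__main__":
    print(filter_keywords(["PHP", "技术", "21", "精华", "合集"], "PHP技术精华合集"))
```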

Result

---
title: "PHP技术精华合集"
isCJKLanguage: true
date: 2020-05-27 15:48:53
categories: 
- 其他
tags: 
- PHP
keywords: 
- PHP
---

< !-- TOC -- >

- [PHP](#php)
- [**一线资讯**](#一线资讯)
- [**微信技术**](#微信技术)
- [**电子商务技术**](#电子商务技术)
- [**时间操作**](#时间操作)
- [**基本实战**](#基本实战)
- [**浅谈PHP**](#浅谈php)
- [**求职就业**](#求职就业)

< !-- /TOC -- >
< !--more-- >

Because the TOC and more tags would be parsed by markdown, a space has been added inside each of them here to prevent that.
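
The placement of the more marker can be sketched as a standalone function: it goes right after the closing TOC marker when one exists, otherwise after the fifth line. `insert_more` is an illustrative name, and unlike the script this sketch extends only the first closing TOC marker it finds.

```python
TOC_END = "<!-- /TOC -->"
MORE = "<!--more-->"


def insert_more(content):
    """Insert the summary divider after the TOC, or after line 5 as a fallback."""
    if TOC_END in content:
        # only the first closing TOC marker is extended
        return content.replace(TOC_END, TOC_END + "\n" + MORE + "\n", 1)
    lines = content.splitlines()
    if len(lines) > 4:
        lines[4] = lines[4] + "\n" + MORE + "\n"
        return "\n".join(lines)
    # articles with fewer than five lines are left unchanged
    return content
```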