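URL Content Parsing

If a message sent by a user contains a link, we want to parse the content behind that link, such as the site's image, title, and other information.

URL Parsing Questions

When to Parse

On the backend, when the message is sent and before it is persisted
On the frontend, when the message is delivered to the client

The first option is the clear choice. There is only one message, but if parsing happened on the frontend, every recipient would fetch the URL. In a large group that amounts to a DDoS attack on the target site. So the backend parses the URL once, before the message is persisted, which also means the work is never repeated.

How to Extract URLs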

@Test
public void testUrl() {
    String content = "This is a long string with www.github.com in it, plus a URL www.baidu.com,, a URL with a port http://www.jd.com:80, another URL http://mallchat.cn, and a Meituan tech article https://mp.weixin.qq.com/s/hwTf4bDck9_tlFpgVDeIKg";
    // The dot after "www" is escaped so that it only matches a literal dot
    Pattern pattern = Pattern.compile("((http|https)://)?(www\\.)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?");
    List<String> matchList = ReUtil.findAll(pattern, content, 0); // ReUtil is a Hutool utility class
    System.out.println(matchList);
}

Output:

Parsing the URL

We use Jsoup, an HTML parsing library commonly used for scraping, to fetch and parse the page.

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.3</version>
</dependency>

Parsing a website:

@Test
public void testJsoup() throws IOException {
    String url = "http://www.baidu.com";
    Connection connect = Jsoup.connect(url);
    Document document = connect.get();
    String title = document.title();
    System.out.println("title = " + title);
}

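Output:

The site's icon:

However, for pages such as WeChat Official Account articles, the title and icon have to be obtained in a different way.

In other words, our code must adapt to the rules of different kinds of sites.

This naturally suggests the strategy pattern as a way to eliminate if/else chains.

URL Parsing Implementation

Overall structure

Defining the URL Parsing Interface UrlDiscover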

public interface UrlDiscover {
    @Nullable
    Map<String, UrlInfo> getUrlContentMap(String content);

    @Nullable
    UrlInfo getContent(String url);

    @Nullable
    String getTitle(Document document);

    @Nullable
    String getDescription(Document document);

    @Nullable
    String getImage(String url, Document document);
}

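Abstract Class: Asynchronous Parsing Framework

@Slf4j // Lombok annotation; supplies the log field used in getUrlDocument
public abstract class AbstractUrlDiscover implements UrlDiscover {
    // Regex used to recognize links
    private static final Pattern PATTERN = Pattern.compile("((http|https)://)?(www\\.)?([\\w_-]+(?:(?:\\.[\\w_-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?");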
    @Nullable
    @Override
    public Map<String, UrlInfo> getUrlContentMap(String content) {
        if (StrUtil.isBlank(content)) {
            return new HashMap<>();
        }
        List<String> matchList = ReUtil.findAll(PATTERN, content, 0);
        // Fetch all URLs in parallel
        List<CompletableFuture<Pair<String, UrlInfo>>> futures = matchList.stream().map(match -> CompletableFuture.supplyAsync(() -> {
            UrlInfo urlInfo = getContent(match);
            return Objects.isNull(urlInfo) ? null : Pair.of(match, urlInfo);
        })).collect(Collectors.toList());
        CompletableFuture<List<Pair<String, UrlInfo>>> future = FutureUtils.sequenceNonNull(futures);
        // Assemble the results
        return future.join().stream().collect(Collectors.toMap(Pair::getFirst, Pair::getSecond, (a, b) -> a));
    }

    @Nullable
    @Override
    public UrlInfo getContent(String url) {
        Document document = getUrlDocument(assemble(url));
        if (Objects.isNull(document)) {
            return null;
        }

        return UrlInfo.builder().title(getTitle(document)).description(getDescription(document)).image(getImage(assemble(url), document)).build();
    }

    private String assemble(String url) {
        if (!StrUtil.startWith(url, "http")) {
            return "http://" + url;
        }
        return url;
    }

    protected Document getUrlDocument(String matchUrl) {
        try {
            Connection connect = Jsoup.connect(matchUrl);
            connect.timeout(2000);
            return connect.get();
        } catch (Exception e) {
            log.error("find error:url:{}", matchUrl, e);
        }
        return null;
    }

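    /**
     * Checks whether a link is valid.
     *
     * @param href the link to check
     * @return true if the link responds with 200, 302 or 304 and is not a download link
     */
    public static boolean isConnect(String href) {
        HttpURLConnection httpURLConnection = null;
        try {
            // Request address
            URL url = new URL(href);
            httpURLConnection = (HttpURLConnection) url.openConnection();
            // Response status code
            int state = httpURLConnection.getResponseCode();
            // A Content-Disposition header is treated here as the sign of a download link
            String fileType = httpURLConnection.getHeaderField("Content-Disposition");
            // 200 (OK), 304 (cached) and 302 (moved) all count as valid, provided the link is not a download
            return (state == 200 || state == 302 || state == 304) && fileType == null;
        } catch (Exception e) {
            return false;
        } finally {
            // Release the connection on every path, including the success path
            if (httpURLConnection != null) {
                httpURLConnection.disconnect();
            }
        }
    }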
}

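This code is an abstract implementation of the UrlDiscover interface: it extracts the URLs from the given content and fetches the title, description, and image for each one. The key points are explained below.

Asynchronous URL parsing: parsing the URLs serially is slow, because the total time is the sum of all the requests. With multiple threads, and enough of them available, the total time is roughly that of the single slowest URL, which greatly improves efficiency.

In the code: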

// Fetch all URLs in parallel
List<CompletableFuture<Pair<String, UrlInfo>>> futures = matchList.stream().map(match -> CompletableFuture.supplyAsync(() -> {
    UrlInfo urlInfo = getContent(match);
    return Objects.isNull(urlInfo) ? null : Pair.of(match, urlInfo);
})).collect(Collectors.toList());
CompletableFuture<List<Pair<String, UrlInfo>>> future = FutureUtils.sequenceNonNull(futures);
// Assemble the results
return future.join().stream().collect(Collectors.toMap(Pair::getFirst, Pair::getSecond, (a, b) -> a));

We rely mainly on the FutureUtils utility class, written by the Meituan tech team, to turn the serial calls into parallel ones:

https://mp.weixin.qq.com/s/GQGidprakfticYnbVYVYGQ

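For each URL, CompletableFuture.supplyAsync calls getContent(match) asynchronously to obtain the corresponding UrlInfo object.
FutureUtils.sequenceNonNull waits for every CompletableFuture task in the futures list to complete and filters out null values.
future.join() waits for the final CompletableFuture to complete and returns the result list.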
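The FutureUtils source is not reproduced in this article, so the following is only a minimal sketch of what a sequenceNonNull-style helper might look like, built on the JDK's CompletableFuture.allOf. The class name FutureSequenceSketch is hypothetical, and treating a failed task like a null result is an assumption, not something the article confirms about the Meituan implementation.

```java
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

// Hypothetical stand-in for FutureUtils; not the Meituan implementation.
final class FutureSequenceSketch {

    private FutureSequenceSketch() {
    }

    /**
     * Combines many futures into one future that holds every non-null result,
     * in the same order as the input list.
     */
    static <T> CompletableFuture<List<T>> sequenceNonNull(List<CompletableFuture<T>> futures) {
        // Assumption: a failed task is treated like a null result,
        // so one bad URL cannot fail the whole batch.
        List<CompletableFuture<T>> safe = futures.stream()
                .map(f -> f.exceptionally(ex -> null))
                .collect(Collectors.toList());
        // allOf completes once every task in the array has completed.
        return CompletableFuture.allOf(safe.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> safe.stream()
                        .map(CompletableFuture::join) // every task is done here, so join() does not block
                        .filter(Objects::nonNull)
                        .collect(Collectors.toList()));
    }
}
```

Calling join() on the returned future, as getUrlContentMap does, then blocks for roughly as long as the slowest task rather than for the sum of all of them.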

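Handling parse timeouts:

Sites outside the Great Firewall, such as www.github.com, are likely to time out when we try to parse them. In that case we simply give up: Jsoup is configured with a maximum timeout (2000 ms in getUrlDocument above), and if the request exceeds it we stop without parsing. A link that fails to parse must not affect the delivery of the user's message.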
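Jsoup's timeout bounds the HTTP request itself. As a complementary guard, the wait on the asynchronous task can be bounded as well. The sketch below uses only the JDK; the class TimeoutSketch and the method getOrFallback are hypothetical names introduced for illustration and are not part of the article's code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long the caller waits for a parse task.
final class TimeoutSketch {

    private TimeoutSketch() {
    }

    /** Waits at most timeoutMillis for the task and returns fallback if it is slow or fails. */
    static <T> T getOrFallback(CompletableFuture<T> task, long timeoutMillis, T fallback) {
        try {
            return task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting so the message can be sent without link metadata.
            // cancel() only marks the future as cancelled; it does not interrupt the worker thread.
            task.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself failed, for example with a network error.
            return fallback;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for the caller.
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

With a null fallback, a slow or failing link is simply left out of the result, which matches the rule that a failed parse must not block the message.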

Implementing the Abstract Class

public class PrioritizedUrlDiscover extends AbstractUrlDiscover {

    private final List<UrlDiscover> urlDiscovers = new ArrayList<>(2);

    public PrioritizedUrlDiscover() {
        urlDiscovers.add(new WxUrlDiscover());
        urlDiscovers.add(new CommonUrlDiscover());
    }


    @Nullable
    @Override
    public String getTitle(Document document) {
        for (UrlDiscover urlDiscover : urlDiscovers) {
            String urlTitle = urlDiscover.getTitle(document);
            if (StrUtil.isNotBlank(urlTitle)) {
                return urlTitle;
            }
        }
        return null;
    }

    @Nullable
    @Override
    public String getDescription(Document document) {
        for (UrlDiscover urlDiscover : urlDiscovers) {
            String urlDescription = urlDiscover.getDescription(document);
            if (StrUtil.isNotBlank(urlDescription)) {
                return urlDescription;
            }
        }
        return null;
    }

    @Nullable
    @Override
    public String getImage(String url, Document document) {
        for (UrlDiscover urlDiscover : urlDiscovers) {
            String urlImage = urlDiscover.getImage(url, document);
            if (StrUtil.isNotBlank(urlImage)) {
                return urlImage;
            }
        }
        return null;
    }
}

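PrioritizedUrlDiscover extends the AbstractUrlDiscover abstract class and therefore implements the UrlDiscover interface. It obtains a URL's title, description, and image by trying parsers in priority order.

This resembles several design patterns, such as the strategy pattern and the chain of responsibility pattern.

We first try WxUrlDiscover; if it yields nothing, we fall back to CommonUrlDiscover.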
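The three methods of PrioritizedUrlDiscover all follow the same "first non-blank result wins" rule. As an illustration only, that rule can be sketched as one generic helper using the JDK alone. FirstNonBlankSketch and firstNonBlank are hypothetical names, and the lambdas in main are stand-ins for WxUrlDiscover and CommonUrlDiscover.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper illustrating the priority rule; not part of the article's code.
final class FirstNonBlankSketch {

    private FirstNonBlankSketch() {
    }

    /** Applies the extractor to each strategy in order and returns the first non-blank result, or null. */
    static <S> String firstNonBlank(List<S> strategies, Function<S, String> extractor) {
        for (S strategy : strategies) {
            String value = extractor.apply(strategy);
            if (value != null && !value.trim().isEmpty()) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-ins for the two parsers: each maps a page to a title.
        List<Function<String, String>> titleStrategies = Arrays.asList(
                page -> "",                  // the specialised parser finds nothing
                page -> "Title of " + page); // the generic parser succeeds
        String title = firstNonBlank(titleStrategies, s -> s.apply("example.com"));
        System.out.println(title); // prints "Title of example.com"
    }
}
```

Under this sketch, getTitle could be reduced to a single call such as firstNonBlank(urlDiscovers, d -> d.getTitle(document)), with getDescription and getImage following the same shape.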

Generic Extraction

package com.yunfei.chat.common.utils.discover;

import cn.hutool.core.util.StrUtil;
import org.jsoup.nodes.Document;
import org.springframework.lang.Nullable;


public class CommonUrlDiscover extends AbstractUrlDiscover {
    @Nullable
    @Override
    public String getTitle(Document document) {
        return document.title();
    }

    @Nullable
    @Override
    public String getDescription(Document document) {
        String description = document.head().select("meta[name=description]").attr("content");
        String keywords = document.head().select("meta[name=keywords]").attr("content");
        String content = StrUtil.isNotBlank(description) ? description : keywords;
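        // Keep only the first sentence of the description
        if (StrUtil.isBlank(content)) {
            return content;
        }
        int end = content.indexOf("。");
        // Without a Chinese full stop, return the whole text instead of failing on substring(0, -1)
        return end >= 0 ? content.substring(0, end) : content;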
    }

    @Nullable
    @Override
    public String getImage(String url, Document document) {
        String image = document.select("link[type=image/x-icon]").attr("href");
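        // If nothing was found, fall back to a link whose rel attribute ends with "icon"
        String href = StrUtil.isEmpty(image) ? document.select("link[rel$=icon]").attr("href") : image;
        // If the URL itself already points at the logo
        if (StrUtil.containsAny(url, "favicon")) {
            return url;
        }
        // If the icon can be reached directly, or its href already starts with http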
        if (isConnect(!StrUtil.startWith(href, "http") ? "http:" + href : href)) {
            return href;
        }

        return StrUtil.format("{}/{}", url, StrUtil.removePrefix(href, "/"));
    }


}
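getImage above builds the icon address by string concatenation. As an alternative, and not what the article's code does, the href could be resolved against the page URL with java.net.URI.resolve, which handles absolute, protocol-relative, and relative hrefs. The sketch below makes that assumption; IconUrlSketch and resolveIcon are hypothetical names.

```java
import java.net.URI;

// Hypothetical helper: resolves an icon href the way a browser would resolve a link.
final class IconUrlSketch {

    private IconUrlSketch() {
    }

    /** Resolves an icon href against the page URL; returns null if the href is blank or either value is malformed. */
    static String resolveIcon(String pageUrl, String href) {
        if (href == null || href.trim().isEmpty()) {
            return null;
        }
        try {
            URI base = URI.create(pageUrl);
            // java.net.URI mis-resolves relative paths against a base with an empty path
            // (for example "http://host"), so normalise such a base to "http://host/" first.
            if (base.getPath() == null || base.getPath().isEmpty()) {
                base = base.resolve("/");
            }
            return base.resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            // URI.create and resolve(String) reject syntactically invalid input.
            return null;
        }
    }
}
```

For example, resolveIcon("http://www.baidu.com", "/favicon.ico") returns "http://www.baidu.com/favicon.ico", and a protocol-relative href such as "//cdn.example.com/icon.png" inherits the scheme of the page.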

WeChat Extraction

public class WxUrlDiscover extends AbstractUrlDiscover {

    @Nullable
    @Override
    public String getTitle(Document document) {
        return document.getElementsByAttributeValue("property", "og:title").attr("content");
    }

    @Nullable
    @Override
    public String getDescription(Document document) {
        return document.getElementsByAttributeValue("property", "og:description").attr("content");
    }

    @Nullable
    @Override
    public String getImage(String url, Document document) {
        String href = document.getElementsByAttributeValue("property", "og:image").attr("content");
        return isConnect(href) ? href : null;
    }
}
