Zhengyi Yang's Publications

[by year] [by type]

(* Corresponding Author; # Equal Contribution)

[39] Wenqian Zhang, Zhengyi Yang*, Dong Wen, Wentao Li, Wenjie Zhang, and Xuemin Lin. Accelerating core decomposition in billion-scale hypergraphs. In Proceedings of the 2025 International Conference on Management of Data (SIGMOD). Association for Computing Machinery, 2025. [ bib ]
[38] Diya Yan, Riza Yosia Sunindijo, Cynthia C Wang, and Zhengyi Yang. Comparing the career paths of male and female managers in the construction industry: Insights from linkedin data. In C. Thomson and C. J. Neilson, editors, Proceedings 40th Annual ARCOM Conference, pages 247–256, London South Bank University, UK, September 2–4 2024. Association of Researchers in Construction Management. [ bib | .pdf ]
As one of the most male-dominated industries, the construction industry in Australia is known for the low representation of women in the workforce. Previous studies have indicated that women face significant obstacles in their career developments, progressing more slowly than their male counterparts and struggling to attain managerial positions. Building on past research, this research compares the differences between the career paths of male and female managers in construction to understand key factors contributing to gender disparities in career progression. Based on an examination of the LinkedIn profiles of 480 managerial level employees from 61 companies, this research leveraged the rich dataset provided by the world's largest professional network, LinkedIn, to examine these key factors, such as educational backgrounds, years of experience, job titles, career transitions, and network sizes. This research contributes to existing literatures on gender disparities in leadership roles within the construction industry. By utilising LinkedIn data, which reflect the lived experiences of managers, this research provides empirical evidence that can inform policymaking, talent management, and diversity initiatives aimed at narrowing the gender gap in construction leadership.
[37] Xiao Li, Yanping Wu, Xiaoyang Wang, Zhengyi Yang, Wenjie Zhang, and Ying Zhang. Keyword-based betweenness centrality maximization in attributed graphs. In Australasian Database Conference (ADC), pages 209–223, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
Betweenness centrality is a key concept in graph analysis that measures the significance of a node by counting how often it appears in the shortest paths between other nodes. The task of betweenness centrality maximization, which seeks to identify a set of nodes with the highest centrality scores, is crucial in various real-world applications. Most existing studies about betweenness centrality focus on general graphs. However, in reality, users in networks are usually associated with attributes such as preferences, which play an essential role in analyzing the properties of networks. Therefore, the traditional betweenness centrality is not applicable to the attribute graphs. Motivated by this, we propose a novel concept called Keyword-based Betweenness Centrality (KBC), which quantifies the number of times each node acts as the midpoint of shortest paths between nodes having one of the given attributes. Given an attribute graph G, a query attribute set Q, and a positive integer k, in this paper, we aim to find a node set of size no larger than k so that its KBC value based on Q is maximized. To address this problem, we propose a keyword-based hyper-edge sampler and devise an algorithm achieving the approximation guarantee of $$(1-1/e-\epsilon )$$(1-1/e-ϵ)with at least 1-$$\delta $$δprobability. Extensive experiments on four real networks demonstrate the efficiency and effectiveness of our proposed algorithms.
[36] Jiaxuan Wu, Xushuo Tang, Zhengyi Yang*, Kongzhang Hao, Longbin Lai, and Yongfei Liu. An experimental evaluation of llm on image classification. In Australasian Database Conference (ADC), pages 506–518, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
Image classification is one of the fundamental tasks in computer vision (CV) and has numerous practical applications. Traditionally, machine learning and deep learning methods such as k-Nearest Neighbors (kNN), decision trees, and Convolutional Neural Networks (CNN) have been widely used to perform this task. However, with the recent emergence of large language models (LLMs), such as Generative Pre-trained Transformers (GPT), originally designed for natural language processing, their cross-domain applications, including in CV, are now being explored. In this paper, we investigate the capabilities of GPT-4o, a variant of the GPT model, for image classification on the Fashion-MNIST dataset. By using carefully designed prompts, we evaluate GPT-4o's performance and compare it with more traditional models. Our study offers insights into the cross-domain potential of GPT models, explores how prompt engineering can enhance GPT's performance on image classification tasks, and suggests new avenues for developing more flexible and adaptable multimodal LLM systems. The code can be found at https://github.com/Tanghaha1424/gpt-fashionmnist.
[35] Yi Ding, Hualong Lin, Zhengyi Yang*, Dong Wen, Xiaoyang Wang, and Wenjie Zhang. Benchmarking rdf systems on the cloudan experimental comparison of rdf systems on cloud. In Australasian Database Conference (ADC), pages 30–43, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
With the growing use of graph-based applications across various domains, the demand for efficient and scalable RDF (Resource Description Framework) systems has intensified. As organizations increasingly deploy RDF systems on cloud infrastructures, understanding the trade-offs between performance and cost becomes critical. This paper presents a comprehensive experimental comparison of multiple RDF systems, analyzing their performance in cloud and on-premises environments. Our study evaluates key metrics such as query execution time, data ingestion speed, storage efficiency, and cost-effectiveness across several benchmarks, including LDBC SNB, WatDiv, LUBM, and DBpedia. The experiments reveal distinct advantages and limitations in RDF systems. Notably, Virtuoso demonstrated strong performance and cost-efficiency across most datasets, while systems like gStore exhibited higher variability under stress conditions. Our findings offer actionable insights for practitioners seeking to balance performance and cost when deploying RDF systems in cloud environments. We also highlight areas for improvement in system documentation, compatibility, and stability under extreme workloads. Future work will explore alternative architectures and further optimization strategies for cloud-based RDF solutions. The related code, scripts and data of this paper are available online (https://github.com/unswdb/RDF_on_Cloud).
[34] Xia Li, Zhengyi Yang, Kongzhang Hao, Xin Shu, Xin Cao, and Wenjie Zhang. Distributed hop-constrained s-t simple path enumeration in labelled graphs. In Australasian Database Conference (ADC), pages 265–278, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
Hop-constrained s-t simple path (HC-s-t path) enumeration is a core problem in graph analysis, commonly applied to unlabelled graphs without considering label constraints. However, in many practical scenarios, graphs are edge-labelled, requiring queries to satisfy specific label constraints on the paths between vertices. This introduces new computational challenges that traditional HC-s-t path algorithms are not equipped to handle, especially in distributed environments dealing with large-scale graphs. To address these challenges, we propose a distributed algorithm for labelled hop-constrained s-t path (LHC-s-t path) enumeration. Our approach introduces an online label-based index to prune unnecessary computations, reducing both redundant processing and communication overhead. This enables efficient LHC-s-t path enumeration across distributed systems. Extensive experiments on large real-world graphs demonstrate that our algorithm significantly outperforms existing methods, achieving over an order of magnitude improvement in performance and scalability.
[33] Zhuoqing Xu, Zhengyi Yang, Xiaoshuang Chen, Huangleshuai He, and Xin Cao. Efficient answering of k-reachability on temporal bipartite graphs. In Australasian Database Conference (ADC), pages 224–238, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
This paper investigates the k-reachability problem on temporal bipartite graphs, which determines whether a vertex can reach another vertex within a given time interval I and a step constraint k. The problem is motivated by real-world scenarios where the number of hops from a source vertex s to a target vertex t reflects the influence s exerts on t. Bipartite graphs are used to model relationships between two types of entities, such as people-location, author-paper, and customer-product. When modeling real-world applications like disease outbreaks, biomedical reactions, and travel planning, edges are often enriched with temporal information. Temporal bipartite graphs provide an effective tool for modeling these dynamic scenarios. Recent studies show that k-reachability on temporal bipartite graphs can expand research in the above fields. Adding a step constraint k provides more insights into the connectivity of temporal bipartite graphs. Although k-reachability has been extensively studied on unipartite graphs, it remains unexplored on temporal bipartite graphs. To fill this research gap, we propose an efficient algorithm for k-reachability problem on temporal bipartite graphs in this paper. Specifically, we introduce a 2-hop index-based approach to efficiently answer k-reachability queries. Extensive experiments on real-world datasets demonstrate that the proposed method achieves a speedup of at least three orders of magnitude in query performance.
[32] Nimish Ukey#, Guangjian Zhang#, Zhengyi Yang*, Xiaoyang Wang, Binghao Li, Serkan Saydam, and Wenjie Zhang. A cluster-based approach to knn join over batch-dynamic high-dimensional data. In International Conference Advanced Data Mining and Applications (ADMA), pages 81–96, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
The k nearest neighbors (kNN) join is a crucial operation in data mining, retrieving the kNN for each point in the query set within an answer set. This operation finds extensive applications in various domains, including recommendation systems, spatial databases, and knowledge discovery. With the surge in data volume and dimensionality, numerous approaches have emerged to enhance the efficiency of kNN join operations on static and dynamic high-dimensional data. However, we observed that research on batch-dynamic kNN join, where updates occur in batches rather than individually, remains scarce. To bridge this gap, we propose a novel cluster-based approach tailored for batch-dynamic kNN join over high-dimensional data. Our contributions include a cluster-based batch update technique, which efficiently processes similar updates in clusters, and a cluster-based pruning method using the high-dimensional R-tree (HDR-Tree) for optimised search during updates. Extensive experimental evaluations across 6 real-world datasets demonstrate the efficiency of our approach, significantly outperforming state-of-the-art methods by 19 to 55 times.
[31] Haonan Yan, Zhengyi Yang*, Tianming Zhang, Dong Wen, Qi Luo, and Nimish Ukey. Approximating temporal katz centrality with monte carlo methods. In Web and Big Data. APWeb-WAIM 2024 International Workshops. Springer, 2024. [ bib ]
[30] Jingtian Wei, Zhengyi Yang*, Qi Luo, Yu Zhang, Lu Qin, and Wenjie Zhang. High-order local clustering on hypergraphs. EAI Endorsed Transactions on Scalable Information Systems, 11(6), Nov. 2024. [ bib | DOI | http ]
[29] Qi Luo, Wenjie Zhang, Zhengyi Yang, Dong Wen, Xiaoyang Wang, Dongxiao Yu, and Xuemin Lin. Hierarchical structure construction on hypergraphs. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), CIKM '24, pages 1597–1606, New York, NY, USA, 2024. Association for Computing Machinery. [ bib | DOI | http ]
Exploring the hierarchical structure of graphs presents notable advantages for graph analysis, revealing insights ranging from individual vertex behavior to community distribution and overall graph stability. This paper studies hierarchical structures within hypergraphs, where a hyperedge can connect multiple vertices. We observed that directly extending hierarchical frameworks from pairwise graphs to hypergraphs overlooks high-order interactions and can result in either high computational complexity or sparse hierarchy structure. To address this challenge, we introduce a dual-layer hypergraph hierarchy consisting of a primary hierarchy and a secondary hierarchy, enabling the construction of a refined hypergraph hierarchy in linear time. The dual-layer hierarchy establishes a global hierarchy based on vertex cohesion, utilizing vertex-induced subhypergraphs, and a local hierarchy based on hyperedge containment, employing edge-induced subhypergraphs. The combination of global and local hierarchy mitigates the homogeneity and sparsity issues inherent in single-layer hierarchies, allowing more effective modeling of high-order interactions. Furthermore, we propose an efficient hierarchical construction algorithm by leveraging a novel hyperedge-based disjoint set to identify connected subhypergraphs. Additionally, to optimize the local hierarchy further and prevent the emergence of excessively redundant levels, we introduce a compact local hierarchy by defining a restricted subgraph metric to eliminate redundancy caused by large-sized hyperedges. Empirical studies on real-world hypergraphs demonstrate the effectiveness of our approach.
[28] Haoran Ning, Bocheng Han, Zhengyi Yang*, Kongzhang Hao, Miao Ma, Chunling Wang, Boge Liu, Xiaoshuang Cheng, Yu Hao, Yi Jin, Wanchun Zhang, and Chengwei Zhang. Exploring simple architecture of just-in-time compilation in databases. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, pages 504–514, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
Just-in-Time (JIT) compilation is an effective technique for enhancing query execution in modern relational databases, and it has gained increasing attention from academia and industry in recent years. However, the architectures of state-of-the-art JIT-based database systems are often complex, leading to challenges and limitations when adopted for commercial use. In this paper, we present an industrial view to JIT compilation for relational databases, emphasizing practicality and applicability. Our focus is on minimizing engineering effort, simplifying testing, and ensuring seamless integration with existing database ecosystems. We achieve these goals by adhering to three core principles: a simple, lightweight architecture; reuse of existing technologies and frameworks, particularly LLVM; and strong extensibility and compatibility. We demonstrate the feasibility and potential of this approach through an initial exploration using LLVM's mature JIT compilation capabilities to translate TPC-H database queries into optimized machine code. This proof-of-concept implementation shows the promise of our approach, pivoting the way for a comprehensive database system that leverages a lightweight yet powerfu JIT compilation framework for real-world applications.
[27] Yi Ding, Zhengyi Yang*, Shunyang Li, Liuyi Chen, Haoran Ning, Kongzhang Hao, and Yongfei Liu. Fgaq: Accelerating graph analytical queries using fpga. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, pages 357–361, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
Field-programmable gate arrays (FPGAs) have significant advantages in parallelism and energy efficiency over CPUs and GPUs and are widely deployed by many enterprises and cloud server providers nowadays. In this paper, we demonstrate FGAQ, an FPGA-based system for accelerating graph queries on massive graphs. FGAQ supports the two most fundamental types of graph queries, namely subgraph and path queries, and features 1) a CPU-FPGA co-designed framework, 2) a fully pipelined FPGA execution, and 3) reduced data transfer from FPGA's external memory. FGAQ provides a user-friendly interface and significantly improved performance. Performance evaluation shows that $$\textsf{FGAQ}$$FGAQoutperforms the most popular graph database, Neo4j, by up to three orders of magnitude. The demo video can be found at https://www.youtube.com/watch?v=pEkzw_DOQYE.
[26] Wenke Yang, Zihan Yang, Liuyi Chen, Ruiqing Yan, Zhengyi Yang*, Linhan Zhang, and Tang Yifu. Parallel program generation for hybrid tabular-textual question answering. In Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, pages 121–137, Singapore, 2024. Springer Nature Singapore. [ bib | DOI ]
Hybrid tabular-textual question answering (HTQA) involves tapping into a mosaic of data sources, traditionally managed through LSTM-based step-by-step reasoning, which has been vulnerable to exposure bias and subsequent error accumulation. This paper introduces an innovative parallel program generation method, ConcurGen, aiming to transform this paradigm by simultaneously formulating comprehensive program constructs that seamlessly blend operations and values. This approach not only rectifies the inherent pitfalls of sequential methodologies but also infuses efficiency into the process. When subjected to rigorous evaluation on benchmarks like the ConvFinQA and MultiHiertt datasets, our methodology showcased significant superiority over prevalent models such as FinQANet and MT2Net. This was evidenced by enhancements in various performance metrics, effectively raising the bar for what's deemed state-of-the-art. Notably, beyond setting these commendable benchmarks, our method facilitates a striking acceleration in program creation, achieving speeds nearly 21 times faster. Additionally, a salient feature of our approach becomes evident when numerical reasoning steps escalate: unlike traditional models, our system sustains its robust performance, emphasizing its adaptability and resilience in complex scenarios.
[25] Kaiyu Chen, Dong Wen, Wentao Li, Zhengyi Yang, and Wenjie Zhang. On compressing historical cliques in temporal graphs. In Database Systems for Advanced Applications (DASFAA), pages 37–53, Singapore, 2024. Springer Nature Singapore. [ bib | DOI | http ]
Maximal clique is a fundamental cohesive subgraph model that plays an important role in many practical applications such as social network analysis and bioinformatics. Many real-world graphs change over time, with edges arriving continuously, and each edge has a timestamp representing the arrival time of that edge; such graphs are also known as temporal graphs. All maximal cliques in all snapshots since all possible historical moments are called historical cliques. Querying historical cliques has not been explored in existing research efforts. In this paper, we study the problem of compressing historical cliques in temporal graphs. We design a novel trie index structure called HC-Trie to compress historical cliques by exploiting the duration of historical cliques and the overlapping relationship between different historical cliques. We also propose an algorithm for maintaining the index when the graph changes. Experiments on real-world temporal graphs demonstrate that our solution achieves much faster query efficiency than the online solution and significantly less index space than the straightforward solution.
[24] Tianming Zhang, Junkai Fang, Zhengyi Yang, Bin Cao, and Jing Fan. Tatkc: A temporal graph neural network for fast approximate temporal katz centrality ranking. In Proceedings of the ACM on Web Conference 2024 (WWW), WWW '24, pages 527–538, New York, NY, USA, 2024. Association for Computing Machinery. [ bib | DOI | http ]
Numerous real-world networks are represented as temporal graphs, which capture the dynamics of connections over time. Identifying important nodes on temporal graphs has a plethora of real-life applications, such as information propagation and influential user identification, etc. Temporal Katz centrality, a popular temporal metric, gauges the importance of nodes by taking into account both the number of temporal walks and the timespan between the interactions. The computation of traditional temporal Katz centrality is computationally expensive, especially when applied to massive temporal graphs. Therefore, in this paper, we design a temporal graph neural network to approximate temporal Katz centrality computation. To the best of our knowledge, we are the first to address temporal Katz centrality computation purely from a learning-based perspective. We propose a time-injected self-attention model that consists of two phases. In the first phase, we utilize a time-injected self-attention mechanism to acquire node representations that encompass both structural information and temporal relevance. The second phase is structured as a multi-layer perceptron (MLP) which uses the learned node representation to predict node rankings. Furthermore, normalization and neighbor sampling strategies are integrated into the model to enhance its overall performance. Extensive experiments on real-world networks demonstrate the efficiency and accuracy of TATKC.
[23] Tianming Zhang, Yunjun Gao, Jie Zhao, Lu Chen, Lu Jin, Zhengyi Yang, Bin Cao, and Jing Fan. Efficient exact and approximate betweenness centrality computation for temporal graphs. In Proceedings of the ACM on Web Conference 2024 (WWW), WWW '24, pages 2395–2406, New York, NY, USA, 2024. Association for Computing Machinery. [ bib | DOI | http ]
Betweenness centrality of a vertex in a graph evaluates how often the vertex occurs in the shortest paths. It is a widely used metric of vertex importance in graph analytics. While betweenness centrality on static graphs has been extensively investigated, many real-world graphs are time-varying and modeled as temporal graphs. Examples include social networks and telecommunication networks, where a relationship between two vertices occurs at a specific time. Hence, in this paper, we target efficient methods for temporal betweenness centrality computation. We firstly propose an exact algorithm with the new notion of time instance graph, based on which, we derive a temporal dependency accumulation theory for iterative computation. To reduce the size of the time instance graph and improve the efficiency, we propose an additional optimization, which compresses the time instance graph with equivalent vertices and edges, and extends the dependency theory to the compressed graph. Since it is theoretically complex to compute temporal betweenness centrality, we further devise a probabilistically guaranteed approximate method to handle massive temporal graphs. Extensive experimental results on real-world temporal networks demonstrate the superior performance of the proposed methods. In particular, our exact and approximate methods outperform the state-of-the-art methods by up to two and five orders of magnitude, respectively.
[22] Lingbo Li, Zhichun Li, Fusen Guo, Haoyu Yang, Jingtian Wei, and Zhengyi Yang. Prototype comparison convolutional networks for one-shot segmentation. IEEE Access, 12:54978–54990, 2024. [ bib | DOI ]
In few-shot semantic segmentation (FSS), the key challenges are efficiently tuning the interaction between the support set and the query set and distinguishing between context, background, and interfering items. To address these challenges, we propose prototype comparison networks for one-shot segmentation (OPCN) to capture the details required for FSS. Specifically, we offer the Fusion Interaction Module (FIM) to improve the segmentation performance by capturing the correlation and semantic information between the support set and query set features. Subsequently, we propose the Feature Enhancement Module (FEM), which aims to enhance the information required in the support set and query set features while increasing the focus on critical details by reducing the weight of the background regions to provide a more targeted feature representation for subsequent query image segmentation. Then, we propose the Feature Refinement Module (FRM) to filter irrelevant background information in the features and specify the target location region. Finally, the Feature Matching Module (FM) generates the final segmentation mask for the query image. Extensive experiments on the PASCAL-5i and COCO-20i datasets show that our approach achieves excellent performance in the case of the one-shot setup.
[21] Tianming Zhang, Xinwei Cai, Lu Chen, Zhengyi Yang, Yunjun Gao, Bin Cao, and Jing Fan. Towards efficient simulation-based constrained temporal graph pattern matching. World Wide Web, 27(3):22, 2024. [ bib | DOI | http ]
In the context of searching a single data graph G, graph pattern matching is to find all the occurrences of a pattern graph Q in G, specified by a matching rule. It is of paramount importance in many real applications such as social network analysis and cyber security, among others. A wide spectrum of studies target general graph pattern matching. However, to analyze time-relevant services such as studying the spread of diseases and detecting attack patterns, it is attractive to study inexact temporal graph pattern matching. Hence, in this paper, we propose a relaxed matching rule called constrained temporal dual simulation, and study simulation-based constrained temporal graph pattern matching which guarantees that the matching result (i) preserves the ancestor and descendant temporal connectivities; and (ii) implements edge-to-temporal path mapping. We devise a decomposition-based matching method, which first decomposes the data graph into Source Temporal Connected Components, and then performs matching on decomposed subgraphs. To speed up the matching, we define child/parent dependency relation tables and propose an efficient double hierarchical traverse strategy. Considering that the temporal graphs are naturally dynamic, we further propose update algorithms. An extensive empirical study over real-world and synthetic temporal graphs has demonstrated the effectiveness and efficiency of our approach.
[20] Sai Li, Peng Kou, Miao Ma, Haoyu Yang, Shuo Huang, and Zhengyi Yang. Application of semi-supervised learning in image classification: Research on fusion of labeled and unlabeled data. IEEE Access, 12:27331–27343, 2024. [ bib | DOI ]
Deep learning has attracted wide attention recently because of its excellent feature representation ability and end-to-end automatic learning method. Especially in clinical medical imaging diagnosis, the semi-supervised deep learning model is favored and widely used because it can make maximum use of a limited number of labeled data and combine it with a large number of unlabeled data to extract more information and knowledge from it. However, the scarcity of medical image data, the vast image size, and the instability of image quality directly affect the model's robustness, generalization, and image classification performance. Therefore, this paper proposes a new semi-supervised learning model, which uses quadratic neurons instead of traditional neurons, aiming to use quadratic convolution instead of the conventional convolution layer to improve the feature extraction ability of the model. In addition, we introduce two Dropout layers and two fully connected layers at the end of the model to enhance the nonlinear fitting ability of the network. Experiments on two large medical public data sets - ISIC 2019 and Retinopathy OCT - show that our method can improve the model's generalization performance and image classification accuracy.
[19] Wenqian Zhang, Zhengyi Yang*, Dong Wen, and Xiaoyang Wang. Efficient distributed core graph decomposition. In 2023 IEEE International Conference on Data Mining Workshops (ICDMW), pages 1023–1031, 2023. [ bib | DOI ]
Core decomposition is one of the most fundamental problems in graph analytics, which is associated with numerous applications, such as community detection, protein network analysis, and system structure analysis. As the sizes of graphs are becoming increasingly large, it is challenging to compute core decomposition on a single machine. In this paper, we study the problem of k-Core decomposition in the distributed environment. Specifically, we propose the distributed Filter-Array k-Core (FAkCore) algorithm, which adopts the commonly used Scatter-Gather framework. We design an auxiliary data structure of running counts for each vertex to track the statistics of its neighbors' core number. It allows us to recompute the core number of a vertex only when the value is updated. Together with an enhanced message filtering mechanism, our method significantly reduces redundant computation and communication in the existing distributed k-Core decomposition algorithm. Experiments on 10 real-world graphs show that our method outperforms the baseline algorithms by 1.4 times on average and up to 2.2 times.
[18] Liuyi Chen, Bocheng Han, Xuesong Wang, Jiazhen Zhao, Wenke Yang, and Zhengyi Yang*. Machine learning methods in weather and climate applications: A survey. Applied Sciences, 13(21), 2023. [ bib | DOI | http ]
With the rapid development of artificial intelligence, machine learning is gradually becoming popular for predictions in all walks of life. In meteorology, it is gradually competing with traditional climate predictions dominated by physical models. This survey aims to consolidate the current understanding of Machine Learning (ML) applications in weather and climate prediction—a field of growing importance across multiple sectors, including agriculture and disaster management. Building upon an exhaustive review of more than 20 methods highlighted in existing literature, this survey pinpointed eight techniques that show particular promise for improving the accuracy of both short-term weather and medium-to-long-term climate forecasts. According to the survey, while ML demonstrates significant capabilities in short-term weather prediction, its application in medium-to-long-term climate forecasting remains limited, constrained by factors such as intricate climate variables and data limitations. Current literature tends to focus narrowly on either short-term weather or medium-to-long-term climate forecasting, often neglecting the relationship between the two, as well as general neglect of modeling structure and recent advances. By providing an integrated analysis of models spanning different time scales, this survey aims to bridge these gaps, thereby serving as a meaningful guide for future interdisciplinary research in this rapidly evolving field.
[17] Miao Ma, Zhengyi Yang*, Kongzhang Hao, Liuyi Chen, Chunling Wang, and Yi Jin. An empirical analysis of just-in-time compilation in modern databases. In Australasian Database Conference (ADC), pages 227–240. Springer Nature Switzerland, 2023. [ bib | DOI | http ]
JIT (Just-in-Time) technology has garnered significant attention for improving the efficiency of database execution. It offers higher performance by eliminating interpretation overhead compared to traditional execution engines. LLVM serves as the primary JIT architecture, which was implemented in PostgreSQL since version 11. However, recent advancements in WASM-based databases, such as Mutable, present an alternative JIT approach. This approach minimizes the extensive engineering efforts associated with the execution engine and focuses on optimizing supported operators for lower latency and higher throughput. In this paper, we perform comprehensive experiments on these two representative open-source databases to gain deeper insights into the effectiveness of different JIT architectures.
[16] Nimish Ukey, Zhengyi Yang*, Wenke Yang, Binghao Li, and Runze Li. knn join for dynamic high-dimensional data: A parallel approach. In Australasian Database Conference (ADC), pages 3–16. Springer Nature Switzerland, 2023. [ bib | DOI | http ]
The k nearest neighbor (kNN) join operation is a fundamental task that combines two high-dimensional databases, enabling data points in the User dataset U to identify their k nearest neighbor points from the Item dataset I. This operation plays a crucial role in various domains, including knowledge discovery, data mining, similarity search applications, and scientific research. However, exact kNN search in high-dimensional spaces is computationally demanding, and existing sequential methods face challenges in handling large datasets. In this paper, we propose an efficient parallel solution for dynamic kNN join over high-dimensional data, leveraging the high-dimensional R tree (HDR Tree) for improved efficiency. Our solution harnesses the power of Simultaneous Multi-Threading (SMT) technologies and Single-Instruction-Multiple-Data (SIMD) instructions in modern CPUs for parallelisation. Importantly, our research is the first to introduce parallel computation for exact kNN join over high-dimensional data. Experimental results demonstrate that our proposed approach outperforms the sequential HDR Tree method by up to 1.2 times with a single thread. Moreover, our solution provides near-linear scalability as the number of threads increases.
[15] Nimish Ukey#, Guangjian Zhang#, Zhengyi Yang*, Binghao Li, Wei Li, and Wenjie Zhang. Efficient continuous knn join over dynamic high-dimensional data. World Wide Web, 2023. [ bib | DOI | http ]
Given a user dataset U and an object dataset I, a kNN join query in high-dimensional space returns the k nearest neighbors of each object in dataset U from the object dataset I. The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multi-media, machine learning, recommendation systems, and many more. In the real world, datasets frequently update dynamically as objects are added or removed. In this paper, we propose novel methods of continuous kNN join over dynamic high-dimensional data. We firstly propose the HDR+ Tree, which supports more efficient insertion, deletion, and batch update. Further observed that the existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest. It clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, our HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest, including the precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For the completeness of the work, we also present the proof of computing distances in reduced dimensions of PCA in HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.
[14] Kongzhang Hao, Long Yuan, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. Efficient and scalable distributed graph structural clustering at billion scale. In International Conference on Database Systems for Advanced Applications (DASFAA), pages 234–251, Cham, 2023. Springer Nature Switzerland. [ bib | DOI | http ]
Structural Graph Clustering (SCAN) is a fundamental problem in graph analysis and has received considerable attention recently. Existing distributed solutions either lack efficiency or suffer from high memory consumption when addressing this problem in billion-scale graphs. Motivated by these, in this paper, we aim to devise a distributed algorithm for SCAN that is both efficient and scalable. We first propose a fine-grained clustering framework tailored for SCAN. Based on the new framework, we devise a distributed SCAN algorithm, which not only keeps a low communication overhead during execution, but also effectively reduces the memory consumption at all time. We also devise an effective workload balance mechanism that is automatically triggered by the idle machines to handle skewed workloads. The experiment results demonstrate the efficiency and scalability of our proposed algorithm.
[13] Zhengyi Yang, Wenjie Zhang, Xuemin Lin*, Ying Zhang, and Shunyang Li. Hgmatch: A match-by-hyperedge approach for subgraph matching on hypergraphs. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 2063–2076, 2023. [ bib | DOI ]
Hypergraphs are generalisation of graphs in which a hyperedge can connect any number of vertices. It can describe n-ary relationships and high-order information among entities compared to conventional graphs. In this paper, we study the fundamental problem of subgraph matching on hypergraphs (i.e, subhypergraph matching). Existing methods directly extend subgraph matching algorithms to the case of hypergraphs. However, this approach delays hyperedge verification and underutilises the high-order information in hypergraphs, which leads to large search space and high enumeration cost. Furthermore, with the growing size of hypergraphs, it is becoming hard to compute subhypergraph matching sequentially. Thus, we propose an efficient and parallel subhypergraph matching system, HGMatch, to handle subhypergraph matching in massive hypergraphs. We proposes a novel match-by-hyperedge framework to utilise high-order information in hypergraphs and uses set operations for efficient candidates generation. Moreover, we develop an optimised parallel execution engine in HGMatch based on the dataflow model, which features a task-based scheduler and fine-grained dynamic work stealing to achieve bounded memory execution and better load balancing. Experimental evaluation on 10 real-world datasets shows that HGMatch outperforms the extended version of the state-of-the-art subgraph matching algorithms (CFL, DAF, CECI and RapidMatch) by orders of magnitude when using a single thread, and achieves almost linear scalability when the number of threads increases.
[12] Nimish Ukey, Zhengyi Yang*, Binghao Li, Guangjian Zhang, Yiheng Hu, and Wenjie Zhang. Survey on exact knn queries over high-dimensional data space. Sensors, 23(2), 2023. [ bib | DOI | http ]
k nearest neighbours (kNN) queries are fundamental in many applications, ranging from data mining, recommendation system and Internet of Things, to Industry 4.0 framework applications. In mining, specifically, it can be used for the classification of human activities, iterative closest point registration and pattern recognition and has also been helpful for intrusion detection systems and fault detection. Due to the importance of kNN queries, many algorithms have been proposed in the literature, for both static and dynamic data. In this paper, we focus on exact kNN queries and present a comprehensive survey of exact kNN queries. In particular, we study two fundamental types of exact kNN queries: the kNN Search queries and the kNN Join queries. Our survey focuses on exact approaches over high-dimensional data space, which covers 20 kNN Search methods and 9 kNN Join methods. To the best of our knowledge, this is the first work of a comprehensive survey of exact kNN queries over high-dimensional datasets. We specifically categorise the algorithms based on indexing strategies, data and space partitioning strategies, clustering techniques and the computing paradigm. We provide useful insights for the evolution of approaches based on the various categorisation factors, as well as the possibility of further expansion. Lastly, we discuss some open challenges and future research directions.
[11] Weixiao Xu, Lin Sun, Cheng Zhen, Bo Liu, Zhengyi Yang, and Wenke Yang. Deep learning-based image recognition of agricultural pests. Applied Sciences, 12(24), 2022. [ bib | DOI | http ]
Pests and diseases are an inevitable problem in agricultural production, causing substantial economic losses yearly. The application of convolutional neural networks to the intelligent recognition of crop pest images has become increasingly popular due to advances in deep learning methods and the rise of large-scale datasets. However, the diversity and complexity of pest samples, the size of sample images, and the number of examples all directly affect the performance of convolutional neural networks. Therefore, we designed a new target-detection framework based on Cascade RCNN (Regions with CNN features), aiming to solve the problems of large image size, many pest types, and small and unbalanced numbers of samples in pest sample datasets. Specifically, this study performed data enhancement on the original samples to solve the problem of a small and unbalanced number of examples in the dataset and developed a sliding window cropping method, which could increase the perceptual field to learn sample features more accurately and in more detail without changing the original image size. Secondly, combining the attention mechanism with the FPN (Feature Pyramid Networks) layer enabled the model to learn sample features that were more important for the current task from both channel and space aspects. Compared with the current popular target-detection frameworks, the average precision value of our model (mAP@0.5) was 84.16%, the value of (mAP@0.5:0.95) was 65.23%, the precision was 67.79%, and the F1 score was 82.34%. The experiments showed that our model solved the problem of convolutional neural networks being challenging to use because of the wide variety of pest types, the large size of sample images, and the difficulty of identifying tiny pests.
[10] Xia Li, Kongzhang Hao, Zhengyi Yang, Xin Cao, and Wenjie Zhang. Hop-constrained s-t simple path enumeration in large uncertain graphs. In Australasian Database Conference (ADC), pages 115–127. Springer International Publishing, 2022. [ bib | DOI | http ]
Uncertain graphs are graphs where each edge is assigned with a probability of existence. In this paper, we study the problem of hop-constrained s-t simple path enumeration in large uncertain graphs. To the best of our knowledge, we are the first to study this problem in the literature. Specifically, we propose a light-weight index to prune candidate paths by adopting the concept of probability-constrained distance. An efficient enumeration algorithm is designed based on the index structure. Experiment results on real-world datasets show that our proposed methods significantly outperform the baseline methods by up to 6 times.
[9] Nimish Ukey, Zhengyi Yang*, Guangjian Zhang, Boge Liu, Binghao Li, and Wenjie Zhang. Efficient knn join over dynamic high-dimensional data. In Australasian Database Conference (ADC), pages 63–75. Springer International Publishing, 2022. [ bib | DOI | http ]
Given a user dataset U and an object dataset I in high-dimensional space, a kNN join query retrieves each object in dataset U its k nearest neighbors from the dataset I. kNN join is a fundamental and essential operation in applications from many domains such as databases, computer vision, multi-media, machine learning, recommendation systems, and many more. The datasets in real world often update dynamically on insertion or deletion of objects. However, existing algorithms of dynamic kNN join lack support for deletion and batch update, which are important in real-life applications. In this paper, we propose a new method of kNN join over dynamic high-dimensional data. Specifically, our method features lazy updates, batch operations, and optimised deletions. Experiments on real-world datasets show that our method outperforms the existing algorithms of naive RkNN join and HDR Tree by up to 5 and 4 times, respectively.
[8] Xia Li, Kongzhang Hao, Zhengyi Yang, Xin Cao, Wenjie Zhang, Long Yuan, and Xuemin Lin. Hop-constrained s-t simple path enumeration in billion-scale labelled graphs. In Web Information Systems Engineering (WISE), pages 49–64. Springer International Publishing, 2022. [ bib | DOI | http ]
Hop-constrained s-t simple path (HC-s-tpath) enumeration is a fundamental problem in graph analysis. Existing solutions for this problem focus on unlabelled graphs and assume queries are issued without any label constraints. However, in many real-world applications, graphs are edge-labelled and the queries involve label constraints on the path connecting two vertices. Therefore, we study the problem of labelled hop-constrained s-t path (LHC-s-tpath) enumeration in this paper. We aim to efficiently enumerate the HC-s-tpaths using only edges with provided labels. To achieve this goal, we first demonstrate the existence of unnecessary computation specific to the label constraints in the state-of-the-art HC-s-tpath enumeration algorithm. We then devise a novel online index to identify the fruitless exploration during the enumeration. Based on the proposed index, we design an efficient LHC-s-tpath enumeration algorithm in which unnecessary computation is effectively pruned. Extensive experiments are conducted on real-world graphs with billions of edges. Experiment results show that our proposed algorithms significantly outperform the baseline methods by over one order of magnitude.
[7] Shunyang Li, Zhengyi Yang, Xianhang Zhang, Wenjie Zhang, and Xuemin Lin. Sql2cypher: Automated data and query migration from rdbms to gdbms. In 22nd International Conference on Web Information Systems Engineering (WISE), pages 510–517. Springer International Publishing, 2021. [ bib | DOI | http ]
There are many real-world application domains where data can be naturally modelled as a graph, such as social networks and computer networks. Relational Database Management Systems (RDBMS) find it hard to capture the relationships and inherent graph structure of data and are inappropriate for storing highly connected data; thus, graph databases have emerged to address the challenges of high data connectivity. As the performance of querying highly connected data in relational query statements is usually worse than that in the graph database. Transforming data from a relational database to a graph database is imperative for improving the performance of graph queries. In this paper, we demonstrate SQL2Cypher, a system for migrating data from a relational database to a graph database automatically. This system also supports translating SQL queries into Cypher queries. SQL2Cypher is open-source (https://github.com/UNSW-database/SQL2Cypher) to allow researchers and programmers to migrate data efficiently. Our demonstration video can be found here: https://www.youtube.com/watch?v=eGaeBrVTJws.
[6] Zhengyi Yang, Longbin Lai, Xuemin Lin, Kongzhang Hao, and Wenjie Zhang. Huge: An efficient and scalable subgraph enumeration system. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD), SIGMOD/PODS '21, pages 2049–2062, New York, NY, USA, 2021. Association for Computing Machinery. [ bib | DOI | http ]
Subgraph enumeration is a fundamental problem in graph analytics, which aims to find all instances of a given query graph on a large data graph. In this paper, we propose a system called HUGE to efficiently process subgraph enumeration at scale in the distributed context. HUGE features 1) an optimiser to compute an advanced execution plan without the constraints of existing works; 2) a hybrid communication layer that supports both pushing and pulling communication; 3) a novel two-stage execution mode with a lock-free and zero-copy cache design; 4) a BFS/DFS-adaptive scheduler to bound memory consumption; and 5) two-layer intra- and inter-machine load balancing. HUGE is generic such that all existing distributed subgraph enumeration algorithms can be plugged in to enjoy automatic speed up and bounded-memory execution.
[5] Xin Jin, Zhengyi Yang, Xuemin Lin, Shiyu Yang, Lu Qin, and You Peng. Fast: Fpga-based subgraph matching on massive graphs. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 1452–1463, 2021. [ bib | DOI ]
Subgraph matching is a basic operation widely used in many applications. However, due to its NP-hardness and the explosive growth of graph data, it is challenging to compute subgraph matching, especially in large graphs. In this paper, we aim at scaling up subgraph matching on a single machine using FPGAs. Specifically, we propose a CPU-FPGA co-designed framework. On the CPU side, we first develop a novel auxiliary data structure called candidate search tree (CST) which serves as a complete search space of subgraph matching. CST can be partitioned and fully loaded into FPGAs' on-chip memory. Then, a workload estimation technique is proposed to balance the load between the CPU and FPGA. On the FPGA side, we design and implement the first FPGA-based subgraph matching algorithm, called FAST. To take full advantage of the pipeline mechanism on FPGAs, task parallelism optimization and task generator separation strategy are proposed for FAST, achieving massive parallelism. Moreover, we carefully develop a BRAM-only matching process to fully utilize FPGA's on-chip memory, which avoids the expensive intermediate data transfer between FPGA's BRAM and DRAM. Comprehensive experiments show that FAST achieves up to 462.0x and 150.0x speedup compared with the state-of-the-art algorithm DAF and CECI, respectively. In addition, FAST is the only algorithm that can handle the billion-scale graph using one machine in our experiments.
[4] Ran Wang, Zhengyi Yang, Wenjie Zhang, and Xuemin Lin. An empirical study on recent graph database systems. In 13th International Conference Knowledge Science, Engineering and Management (KSEM), pages 328–340, Berlin, Heidelberg, 2020. Springer International Publishing. [ bib | DOI | http ]
Graphs are widely used to model the intricate relationships among objects in a wide range of applications. The advance in graph data has brought significant value to artificial intelligence technologies. Recently, a number of graph database systems have been developed. In this paper, we present a comprehensive overview and empirical investigation on existing property graph database systems such as Neo4j, AgensGraph, TigerGraph and LightGraph (LightGraph has recently renamed to TuGraph.). These systems support declarative graph query languages. Our empirical studies are conducted in a single-machine environment against on the LDBC social network benchmark, consisting of three different large-scale datasets and a set of benchmark queries. This is the first empirical study to compare these graph database systems by evaluating data bulk importing and processing simple and complex queries. Experimental results provide insightful observations of various graph data systems and indicate that AgensGraph works well on SQL based workload and simple update queries, TigerGraph is powerful on complex business intelligence queries, Neo4j is user-friendly and suitable for small queries, and LightGraph is a more balanced product achieving good performance on different queries. The related code, scripts and data of this paper are available online (https://github.com/UNSW-database/GraphDB-Benchmark).
[3] Kongzhang Hao, Zhengyi Yang, Longbin Lai, Zhengmin Lai, Xin Jin, and Xuemin Lin. Patmat: A distributed pattern matching engine with cypher. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), CIKM '19, pages 2921–2924, New York, NY, USA, 2019. Association for Computing Machinery. [ bib | DOI | http ]
Graph pattern matching is one of the most fundamental problems in graph database and is associated with a wide spectrum of applications. Due to its computational intensiveness, researchers have primarily devoted their efforts to improving the performance of the algorithm while constraining the graphs to have singular labels on vertices (edges) or no label. Whereas in practice graphs are typically associated with rich properties, thus the main focus in the industry is instead on powerful query languages that can express a sufficient number of pattern matching scenarios. We demo PatMat in this work to glue together the academic efforts on performance and the industrial efforts on expressiveness. To do so, we leverage the state-of-the-art join-based algorithms in the distributed contexts and Cypher query language - the most widely-adopted declarative language for graph pattern matching. The experiments demonstrate how we are capable of turning complex Cypher semantics into a distributed solution with high performance.
[2] Longbin Lai, Zhu Qing, Zhengyi Yang, Xin Jin, Zhengmin Lai, Ran Wang, Kongzhang Hao, Xuemin Lin, Lu Qin, Wenjie Zhang, Ying Zhang, Zhengping Qian, and Jingren Zhou. Distributed subgraph matching on timely dataflow. In Proc. VLDB Endow. (VLDB), volume 12, pages 1099–1112. VLDB Endowment, 2019. [ bib | DOI | http ]
Recently there emerge many distributed algorithms that aim at solving subgraph matching at scale. Existing algorithm-level comparisons failed to provide a systematic view of distributed subgraph matching mainly due to the intertwining of strategy and optimization. In this paper, we identify four strategies and three general-purpose optimizations from representative state-of-the-art algorithms. We implement the four strategies with the optimizations based on the common Timely dataflow system for systematic strategy-level comparison. Our implementation covers all representative algorithms. We conduct extensive experiments for both unlabelled matching and labelled matching to analyze the performance of distributed subgraph matching under various settings, which is finally summarized as a practical guide.
[1] Zhengmin Lai, Zhengyi Yang, and Longbin Lai. Improving distributed subgraph matching algorithm on timely dataflow. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), pages 266–273, 2019. [ bib | DOI ]
The subgraph matching problem is defined to find all subgraphs of a data graph that are isomorphic to a given query graph. Subgraph matching plays a vital role in the fields of e-commerce, social media and biological science. CliqueJoin is a distributed subgraph matching algorithm that is designed to be efficient and scalable. However, CliqueJoin is originally developed on MapReduce, thus the performance of the algorithm can be affected by the notorious I/O issue of MapReduce while processing multi-round join tasks. Meanwhile, CliqueJoin does not propose a cost evaluation strategy for labelled graphs, which limits its application in practice where most real-world graphs are labelled. Targeting the limitations of CliqueJoin, we propose CliqueJoin++ to improve CliqueJoin in two aspects. Firstly, we implement CliqueJoin++ on the Timely dataflow system instead of MapReduce to avoid considerable I/O cost. Secondly, we extend the cost evaluation function in CliqueJoin to compute optimal join plans for labelled graphs in the distributed context. Extensive experiments have been conducted to show that the proposed method is up to 10 times faster than the MapReduce version for unlabelled matching, and it achieves good performance and scalability for labelled matching.

This file was generated by bibtex2html 1.99.