Initial Codebase for Gradescope Submission
Background
In many scientific fields, moving beyond simple correlation to understand cause-and-effect relationships is a primary goal. Causal discovery is a discipline that aims to infer causal structures directly from observational data. Instead of just knowing that X and Y are related, we want to determine if X causes Y, Y causes X, or if a hidden common cause influences both.
The PC algorithm, named after its creators Peter Spirtes and Clark Glymour, is a foundational constraint-based method for causal discovery. It operates on the principle of conditional independence. The core idea is that if two variables X and Y are causally related (e.g., X -> Y), they will be statistically dependent. However, this dependence might vanish when we condition on other variables. The PC algorithm systematically uses statistical tests for conditional independence to prune a graph, revealing the underlying causal skeleton and orienting as many edges as possible.
This assignment will guide you through implementing the three main stages of the PC algorithm:
Skeleton Discovery: Identifying which variables are connected, regardless of direction.
V-Structure Identification: Finding and orienting “colliders” (
X -> Z <- Y), which form the initial set of directed edges.Edge Orientation: Propagating the directional information from V-structures throughout the graph using a set of logical rules.
Learning Objectives
By completing this assignment, you will:
Understand the principles of constraint-based causal discovery.
Implement the key stages of the PC algorithm: skeleton discovery, V-structure identification, and edge orientation.
Learn how to apply conditional independence tests to learn graphical structures from data.
Gain practical experience in translating an algorithm from theory to code.
Evaluate the performance of a causal discovery algorithm against a ground truth.
Environment Setup
Language: Python 3.9+
Required packages:
conda create -n causality python=3.10
conda activate causality
pip install numpy networkx pandas scipy scikit-learn statsmodels pydot matplotlib graphviz causal-learn
Provided Files:
framework.py– Contains the scaffold for the PC algorithm. You will fill in the## TODOsections.data_linear_10.txt– The observational dataset your algorithm will run on.
Note: The causal graph is represented by an adjacency matrix where:
graph[i, j] = graph[j, i] = -1represents an undirected edgei -- j.graph[i, j] = 1andgraph[j, i] = 0represents a directed edgei -> j.graph[i, j] = 0andgraph[j, i] = 0represents no edge betweeniandj.
Assignment Tasks
Task 1: Skeleton Discovery (40 pts)
Objective: Implement the first phase of the PC algorithm to find the undirected skeleton of the causal graph.
Description: You will complete the skeleton_discovery() function in framework.py. This function starts with a fully connected undirected graph. Your task is to iteratively remove edges between pairs of nodes (i, j) that are found to be conditionally independent. You will test for independence conditioning on sets of neighbors of increasing size (d = 0, 1, 2, ...).
For each pair of adjacent nodes
(i, j), and for each possible conditioning setSof sizeddrawn from the neighbors ofi(orj), you will perform a conditional independence test using the providedfisherzfunction.If the test’s p-value is greater than the significance level
alpha, the nodesiandjare conditionally independent. The edge between them should be removed.You must also record the separating set
Sthat madeiandjindependent in thesepsetsdictionary. This dictionary is crucial for the next task.The process stops when no more edges can be removed.
Task 2: V-Structure Identification (30 pts)
Objective: Implement the second phase of the PC algorithm to identify and orient V-structures (colliders).
Description: You will complete the identify_v_structures() function in framework.py. A V-structure is a pattern of the form X -> Z <- Y, where X and Y are not themselves adjacent. This structure is identifiable because X and Y are independent, but become dependent when conditioned on Z.
Your implementation should iterate through all triples of nodes
(i, k, j)that form a “v” shape:iis adjacent tok,jis adjacent tok, butiandjare not adjacent.For each such triple, you will check the
sepsetsdictionary from Task 1. If the middle nodekis not in the separating set for(i, j), then the structurei-k-jmust be a V-structure.You should then orient the edges as
i -> k <- jin the adjacency matrix.
Task 3: Edge Orientation (30 pts)
Objective: Apply a set of logical orientation rules (Meek’s rules) to orient as many of the remaining undirected edges as possible.
Description: You will complete the orient_edges() function in framework.py. After identifying V-structures, some directed edges are known. These directions can be propagated through the graph to orient other edges, based on two main principles: avoiding the creation of new V-structures and avoiding the creation of cycles.
You will implement a loop that repeatedly applies the orientation rules to the graph until no more edges can be oriented.
Rule 1 (Avoid New V-Structures): If you have a structure
X -> Y - ZandXandZare not adjacent, you must orient the edge asY -> Z. Orienting it asY <- Zwould create a new V-structure (X -> Y <- Z), which would have been detected in Task 2.Rule 2 (Avoid Cycles): If you have a chain
X -> Y -> Zand also an undirected edge betweenXandZ, you must orient it asX -> Z. Orienting it asX <- Zwould create a cycle (X -> Y -> Z <- X).Rule 3: Orient
x - yintox -> yif there are two pathsx-k->yandx-l->y, withkandlnon-adjacent.The hints in the
framework.pyfile provide a starting point for implementing these rules.
Task 4: Analysis & Evaluation (Ungraded)
Objective: Run your completed PC algorithm on the provided data and analyze how the choice of the significance level alpha affects the output.
Description: The analyze_causal_graph function is already written for you. Once your implementation is complete, you can run framework.py directly. The script will execute your PC algorithm with three different alpha values (0.01, 0.05, 0.2). For each value, it will:
Calculate the Structural Hamming Distance (SHD) between your algorithm’s output and the true graph. SHD is a score that counts the number of edge additions, deletions, or reversals needed to match the true graph (lower is better).
Visualize the learned causal graph.
Observe the output and think about the following questions:
How does a stricter
alpha(e.g., 0.01) affect the number of edges in the learned skeleton?Which
alphavalue gives the best SHD score?What does this tell you about the trade-off between finding true causal links and falsely identifying links that aren’t there?
Implementation Guidelines
Only modify code within the
## TODOsections inframework.py.Do not change the function signatures or import statements.
The helper functions for conditional independence testing, graph plotting, and SHD calculation are already provided for you.
Hand-in Requirements
Submission Format
- Submit only one file:
framework.py.
File Requirements
Your
framework.pymust contain all implemented functions.Do not modify the original function signatures or provided helper functions.