Lab 4: Peter-Clark (PC) Algorithm

Initial Codebase for Gradescope Submission

Background

In many scientific fields, moving beyond simple correlation to understand cause-and-effect relationships is a primary goal. Causal discovery is a discipline that aims to infer causal structures directly from observational data. Instead of just knowing that X and Y are related, we want to determine if X causes Y, Y causes X, or if a hidden common cause influences both.

The PC algorithm, named after its creators Peter Spirtes and Clark Glymour, is a foundational constraint-based method for causal discovery. It operates on the principle of conditional independence. The core idea is that if two variables X and Y are causally related (e.g., X -> Y), they will be statistically dependent. However, this dependence might vanish when we condition on other variables. The PC algorithm systematically uses statistical tests for conditional independence to prune a graph, revealing the underlying causal skeleton and orienting as many edges as possible.

This assignment will guide you through implementing the three main stages of the PC algorithm:

Skeleton Discovery: Identifying which variables are connected, regardless of direction.
V-Structure Identification: Finding and orienting “colliders” (X -> Z <- Y), which form the initial set of directed edges.
Edge Orientation: Propagating the directional information from V-structures throughout the graph using a set of logical rules.

Learning Objectives

By completing this assignment, you will:

Understand the principles of constraint-based causal discovery.
Implement the key stages of the PC algorithm: skeleton discovery, V-structure identification, and edge orientation.
Learn how to apply conditional independence tests to learn graphical structures from data.
Gain practical experience in translating an algorithm from theory to code.
Evaluate the performance of a causal discovery algorithm against a ground truth.

Environment Setup

Language: Python 3.9+
Required packages:

conda create -n causality python=3.10

conda activate causality

pip install numpy networkx pandas scipy scikit-learn statsmodels pydot matplotlib graphviz causal-learn

Provided Files:
- framework.py – Contains the scaffold for the PC algorithm. You will fill in the ## TODO sections.
- data_linear_10.txt – The observational dataset your algorithm will run on.

Note: The causal graph is represented by an adjacency matrix where:

graph[i, j] = graph[j, i] = -1 represents an undirected edge i -- j.
graph[i, j] = 1 and graph[j, i] = 0 represents a directed edge i -> j.
graph[i, j] = 0 and graph[j, i] = 0 represents no edge between i and j.

Assignment Tasks

Task 1: Skeleton Discovery (40 pts)

Objective: Implement the first phase of the PC algorithm to find the undirected skeleton of the causal graph.

Description: You will complete the skeleton_discovery() function in framework.py. This function starts with a fully connected undirected graph. Your task is to iteratively remove edges between pairs of nodes (i, j) that are found to be conditionally independent. You will test for independence conditioning on sets of neighbors of increasing size (d = 0, 1, 2, ...).

For each pair of adjacent nodes (i, j), and for each possible conditioning set S of size d drawn from the neighbors of i (or j), you will perform a conditional independence test using the provided fisherz function.
If the test’s p-value is greater than the significance level alpha, the nodes i and j are conditionally independent. The edge between them should be removed.
You must also record the separating set S that made i and j independent in the sepsets dictionary. This dictionary is crucial for the next task.
The process stops when no more edges can be removed.

Task 2: V-Structure Identification (30 pts)

Objective: Implement the second phase of the PC algorithm to identify and orient V-structures (colliders).

Description: You will complete the identify_v_structures() function in framework.py. A V-structure is a pattern of the form X -> Z <- Y, where X and Y are not themselves adjacent. This structure is identifiable because X and Y are independent, but become dependent when conditioned on Z.

Your implementation should iterate through all triples of nodes (i, k, j) that form a “v” shape: i is adjacent to k, j is adjacent to k, but i and j are not adjacent.
For each such triple, you will check the sepsets dictionary from Task 1. If the middle node k is not in the separating set for (i, j), then the structure i-k-j must be a V-structure.
You should then orient the edges as i -> k <- j in the adjacency matrix.

Task 3: Edge Orientation (30 pts)

Objective: Apply a set of logical orientation rules (Meek’s rules) to orient as many of the remaining undirected edges as possible.

Description: You will complete the orient_edges() function in framework.py. After identifying V-structures, some directed edges are known. These directions can be propagated through the graph to orient other edges, based on two main principles: avoiding the creation of new V-structures and avoiding the creation of cycles.

You will implement a loop that repeatedly applies the orientation rules to the graph until no more edges can be oriented.
Rule 1 (Avoid New V-Structures): If you have a structure X -> Y - Z and X and Z are not adjacent, you must orient the edge as Y -> Z. Orienting it as Y <- Z would create a new V-structure (X -> Y <- Z), which would have been detected in Task 2.
Rule 2 (Avoid Cycles): If you have a chain X -> Y -> Z and also an undirected edge between X and Z, you must orient it as X -> Z. Orienting it as X <- Z would create a cycle (X -> Y -> Z <- X).
Rule 3: Orient x - y into x -> y if there are two paths x-k->y and x-l->y, with k and l non-adjacent.
The hints in the framework.py file provide a starting point for implementing these rules.

Task 4: Analysis & Evaluation (Ungraded)

Objective: Run your completed PC algorithm on the provided data and analyze how the choice of the significance level alpha affects the output.

Description: The analyze_causal_graph function is already written for you. Once your implementation is complete, you can run framework.py directly. The script will execute your PC algorithm with three different alpha values (0.01, 0.05, 0.2). For each value, it will:

Calculate the Structural Hamming Distance (SHD) between your algorithm’s output and the true graph. SHD is a score that counts the number of edge additions, deletions, or reversals needed to match the true graph (lower is better).
Visualize the learned causal graph.

Observe the output and think about the following questions:

How does a stricter alpha (e.g., 0.01) affect the number of edges in the learned skeleton?
Which alpha value gives the best SHD score?
What does this tell you about the trade-off between finding true causal links and falsely identifying links that aren’t there?

Implementation Guidelines

Only modify code within the ## TODO sections in framework.py.
Do not change the function signatures or import statements.
The helper functions for conditional independence testing, graph plotting, and SHD calculation are already provided for you.