API Reference

reddwarf.implementations.polis

reddwarf.implementations.polis.run_clustering(votes, mod_out_statement_ids=[], min_user_vote_threshold=7, keep_participant_ids=[], init_centers=None, max_group_count=5, force_group_count=None, random_state=None)

An essentially feature-complete implementation of the Polis clustering algorithm.

Still missing
  • base-cluster calculations (so can't match output of conversations larger than 100 participants),
  • k-smoothing, which holds back k-value (group count) until re-calculated 3 consecutive times,
  • some advanced participant filtering that involves past state (you can use keep_participant_ids to mimic manually).
Parameters:
  • votes (list[dict]) –

    Raw list of vote dicts, with keys for "participant_id", "statement_id", "vote" and "modified"

  • mod_out_statement_ids (list[int], default: [] ) –

    List of statement IDs to moderate/zero out

  • min_user_vote_threshold (int, default: 7 ) –

    Minimum number of votes a participant must make to be included in clustering

  • keep_participant_ids (list[int], default: [] ) –

    List of participant IDs to keep in clustering algorithm, regardless of normal filters.

  • max_group_count (int, default: 5 ) –

    Max number of groups (k-values) to test using k-means and silhouette scores

  • init_centers (list[list[float]], default: None ) –

    Initial guesses of [x,y] coordinates for k-means (Length of list must match max_group_count)

  • force_group_count (int, default: None ) –

    Instead of using silhouette scores, force a specific number of groups (k value)

  • random_state (int, default: None ) –

    If set, will force determinism during k-means clustering

Returns:
  • projected_data( DataFrame ) –

    Dataframe of projected participants, with columns "x", "y", "cluster_id"

  • comps( list[list[float]] ) –

    List of principal components for each statement

  • eigenvalues( list[float] ) –

    List of eigenvalues for each principal component

  • center( list[float] ) –

    List of centers/means for each statement

  • cluster_centers( list[list[float]] ) –

    List of center xy coordinates for each cluster
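
Example (a minimal usage sketch; the vote records below are illustrative, and a real conversation needs enough votes per participant to pass the min_user_vote_threshold filter):

from reddwarf.implementations.polis import run_clustering

votes = [
    {"participant_id": 0, "statement_id": 0, "vote": 1, "modified": 1736000000},
    {"participant_id": 1, "statement_id": 0, "vote": -1, "modified": 1736000001},
    # ... more vote dicts ...
]

projected_data, comps, eigenvalues, center, cluster_centers = run_clustering(
    votes=votes,
    mod_out_statement_ids=[],
    min_user_vote_threshold=7,
    max_group_count=5,
    random_state=42,  # set for deterministic k-means between runs
)
print(projected_data[["x", "y", "cluster_id"]])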

Source code in reddwarf/implementations/polis.py
def run_clustering(
    votes: list[dict],
    mod_out_statement_ids: list[int] = [],
    min_user_vote_threshold: int = 7,
    keep_participant_ids: list[int] = [],
    init_centers: Optional[list[list[float]]] = None,
    max_group_count: int = 5,
    force_group_count: Optional[int] = None,
    random_state: Optional[int] = None,
) -> tuple[DataFrame, NDArray, NDArray, NDArray, NDArray | None]:
    """
    An essentially feature-complete implementation of the Polis clustering algorithm.

    Still missing:
        - base-cluster calculations (so can't match output of conversations larger than 100 participants),
        - k-smoothing, which holds back k-value (group count) until re-calculated 3 consecutive times,
        - some advanced participant filtering that involves past state (you can use keep_participant_ids to mimic manually).

    Args:
        votes (list[dict]): Raw list of vote dicts, with keys for "participant_id", "statement_id", "vote" and "modified"
        mod_out_statement_ids (list[int]): List of statement IDs to moderate/zero out
        min_user_vote_threshold (int): Minimum number of votes a participant must make to be included in clustering
        keep_participant_ids (list[int]): List of participant IDs to keep in clustering algorithm, regardless of normal filters.
        max_group_count (int): Max number of groups (k-values) to test using k-means and silhouette scores
        init_centers (list[list[float]]): Initial guesses of [x,y] coordinates for k-means (Length of list must match max_group_count)
        force_group_count (int): Instead of using silhouette scores, force a specific number of groups (k value)
        random_state (int): If set, will force determinism during k-means clustering

    Returns:
        projected_data (DataFrame): Dataframe of projected participants, with columns "x", "y", "cluster_id"
        comps (list[list[float]]): List of principal components for each statement
        eigenvalues (list[float]): List of eigenvalues for each principal component
        center (list[float]): List of centers/means for each statement
        cluster_centers (list[list[float]]): List of center xy coordinates for each cluster
    """
    vote_matrix = generate_raw_matrix(votes=votes)
    participant_ids_in = get_participant_ids(vote_matrix, vote_threshold=min_user_vote_threshold)
    if keep_participant_ids:
        participant_ids_in = list(set(participant_ids_in + keep_participant_ids))

    vote_matrix = simple_filter_matrix(
        vote_matrix=vote_matrix,
        mod_out_statement_ids=mod_out_statement_ids,
    )
    projected_data, comps, eigenvalues, center = run_pca(vote_matrix=vote_matrix)

    projected_data = projected_data.loc[participant_ids_in, :]

    # To match Polis output, we need to reverse signs for centers and projections
    # TODO: Investigate why this is. Perhaps related to signs being flipped on agree/disagree back in the day.
    projected_data, center = -projected_data, -center

    if force_group_count:
        cluster_labels, cluster_centers = run_kmeans(
            dataframe=projected_data,
            n_clusters=force_group_count,
            init_centers=init_centers,
            random_state=random_state,
        )
    else:
        _, _, cluster_labels, cluster_centers = find_optimal_k(
            projected_data=projected_data,
            max_group_count=max_group_count,
            init_centers=init_centers,
            random_state=random_state,
        )
    projected_data = projected_data.assign(cluster_id=cluster_labels)

    return projected_data, comps, eigenvalues, center, cluster_centers

reddwarf.implementations.agora

reddwarf.implementations.agora.run_clustering_v1(conversation, options={})

A minimal Polis-based clustering algorithm suitable for use by Agora Citizen Network.

This does the following:

  1. builds a vote matrix (includes any statement with at least 1 participant vote),
  2. filters out any participants with fewer than 7 votes,
  3. runs PCA and projects active participants into 2D coordinates,
  4. scales the projected participants out from the center when they have a low number of votes,
  5. tests 2-5 groups for the best k-means fit via silhouette scores (random state set for reproducibility),
  6. returns a list of clusters, each with a list of participant members and their projected 2D coordinates.
Warning

This will technically function without PASS votes, but scaling factors will not be effective in compensating for missing votes, and so participant projections will be bunched up closer to the origin.

Parameters:
  • conversation (Conversation) –

    A minimal conversation object with votes.

  • options (ClusteringOptions, default: {} ) –

    Configuration options to override defaults.

Returns:
  • result( ClusteringResult ) –

    Results of the clustering operation.
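
Example (a minimal usage sketch; participant/statement IDs and vote values are illustrative):

from reddwarf.implementations.agora import run_clustering_v1

conversation = {
    "votes": [
        {"participant_id": "alice", "statement_id": 1, "vote": 1},
        {"participant_id": "bob", "statement_id": 1, "vote": -1},
        # ... more votes ...
    ],
}

result = run_clustering_v1(
    conversation=conversation,
    options={"min_user_vote_threshold": 7, "max_clusters": 5},
)
for cluster in result["clusters"]:
    print(cluster["id"], [p["id"] for p in cluster["participants"]])
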
Source code in reddwarf/implementations/agora.py
def run_clustering_v1(
    conversation: Conversation,
    options: ClusteringOptions = {},
) -> ClusteringResult:
    """
    A minimal Polis-based clustering algorithm suitable for use by Agora Citizen Network.

    This does the following:

    1. builds a vote matrix (includes any statement with at least 1 participant vote),
    2. filters out any participants with fewer than 7 votes,
    3. runs PCA and projects active participants into 2D coordinates,
    4. scales the projected participants out from the center when they have a low number of votes,
    5. tests 2-5 groups for the best k-means fit via silhouette scores (random state set for reproducibility),
    6. returns a list of clusters, each with a list of participant members and their projected 2D coordinates.

    Warning:
        This will technically function without PASS votes, but scaling
        factors will not be effective in compensating for missing votes,
        and so participant projections will be bunched up closer to the
        origin.

    Args:
        conversation (Conversation): A minimal conversation object with votes.
        options (ClusteringOptions): Configuration options to override defaults.

    Returns:
        result (ClusteringResult): Results of the clustering operation.
    """
    vote_matrix = utils.generate_raw_matrix(votes=conversation["votes"])
    # Any statements with votes are included.
    all_statement_ids = vote_matrix.columns
    vote_matrix = utils.filter_matrix(
        vote_matrix=vote_matrix,
        min_user_vote_threshold=options.get("min_user_vote_threshold", DEFAULT_MIN_USER_VOTE_THRESHOLD),
        active_statement_ids=all_statement_ids,
    )
    projected_data, *_ = utils.run_pca(vote_matrix=vote_matrix)

    _, _, cluster_labels, _ = utils.find_optimal_k(
        projected_data=projected_data,
        max_group_count=options.get("max_clusters", DEFAULT_MAX_CLUSTERS),
        # Ensure reproducible kmeans calculation between runs.
        random_state=DEFAULT_KMEANS_RANDOM_STATE,
    )

    # Add cluster label column to dataframe.
    projected_data = projected_data.assign(cluster_id=cluster_labels)
    # Convert participant_id index into regular column, for ease of transformation.
    projected_data = projected_data.reset_index()

    result: ClusteringResult = {
        "clusters": [
            {
                "id": cluster_id,
                "participants": [
                    {
                        "id": row.participant_id,
                        "x": row.x,
                        "y": row.y,
                    }
                    for row in group.itertuples(index=False)
                ]
            }
            for cluster_id, group in projected_data.groupby("cluster_id")
        ]
    }

    return result

reddwarf.utils.matrix

reddwarf.utils.matrix.generate_raw_matrix(votes, cutoff=None)

Generates a raw vote matrix from a list of vote records.

See filter_votes method for details of cutoff arg.

Parameters:
  • votes (List[Dict]) –

    An unsorted list of vote records, where each record is a dictionary containing:

    • "participant_id": The ID of the voter.
    • "statement_id": The ID of the statement being voted on.
    • "vote": The recorded vote value.
    • "modified": A unix timestamp object representing when the vote was made.
  • cutoff (int, default: None ) –

    A cutoff unix timestamp (ms) or index position in date-sorted votes list.

Returns:
  • raw_matrix( DataFrame ) –

    A full raw vote matrix DataFrame with NaN values where:

    1. rows are voters,
    2. columns are statements, and
    3. values are votes.

    This includes even voters that have no votes, and statements on which no votes were placed.
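
Example (a small sketch of the expected vote-record shape and the resulting pivot, with NaN where a participant did not vote):

from reddwarf.utils.matrix import generate_raw_matrix

votes = [
    {"participant_id": 0, "statement_id": 100, "vote": 1, "modified": 1736000000},
    {"participant_id": 0, "statement_id": 101, "vote": 0, "modified": 1736000010},
    {"participant_id": 1, "statement_id": 101, "vote": -1, "modified": 1736000020},
]

raw_matrix = generate_raw_matrix(votes=votes)
# Rows are participant_ids, columns are statement_ids, values are votes (NaN if unvoted):
#   statement_id    100  101
#   participant_id
#   0               1.0  0.0
#   1               NaN -1.0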

Source code in reddwarf/utils/matrix.py
def generate_raw_matrix(
        votes: List[Dict],
        cutoff: Optional[int] = None,
) -> VoteMatrix:
    """
    Generates a raw vote matrix from a list of vote records.

    See `filter_votes` method for details of `cutoff` arg.

    Args:
        votes (List[Dict]): An unsorted list of vote records, where each record is a dictionary containing:

            - "participant_id": The ID of the voter.
            - "statement_id": The ID of the statement being voted on.
            - "vote": The recorded vote value.
            - "modified": A unix timestamp object representing when the vote was made.

        cutoff (int): A cutoff unix timestamp (ms) or index position in date-sorted votes list.

    Returns:
        raw_matrix (pd.DataFrame): A full raw vote matrix DataFrame with NaN values where:

            1. rows are voters,
            2. columns are statements, and
            3. values are votes.

            This includes even voters that have no votes, and statements on which no votes were placed.
    """
    if cutoff:
        votes = filter_votes(votes=votes, cutoff=cutoff)

    raw_matrix = pd.DataFrame.from_dict(votes)
    raw_matrix = raw_matrix.pivot(
        values="vote",
        index="participant_id",
        columns="statement_id",
    )

    return raw_matrix

reddwarf.utils.matrix.simple_filter_matrix(vote_matrix, mod_out_statement_ids=[])

The simple filter on the vote_matrix that is used by Polis prior to running PCA.

Parameters:
  • vote_matrix (VoteMatrix) –

    A raw vote_matrix (with missing values)

  • mod_out_statement_ids (list, default: [] ) –

    A list of moderated-out statement IDs to zero out.

Returns:
  • VoteMatrix( VoteMatrix ) –

    Another vote_matrix with the given statements zeroed out
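
Example (a sketch zeroing out one moderated statement column; the tiny matrix below is illustrative):

import numpy as np
import pandas as pd
from reddwarf.utils.matrix import simple_filter_matrix

vote_matrix = pd.DataFrame(
    {100: [1.0, -1.0], 101: [np.nan, 1.0]},
    index=pd.Index([0, 1], name="participant_id"),
)
vote_matrix = simple_filter_matrix(vote_matrix=vote_matrix, mod_out_statement_ids=[101])
print(vote_matrix[101].tolist())  # [0.0, 0.0], statement 101 zeroed out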

Source code in reddwarf/utils/matrix.py
def simple_filter_matrix(
        vote_matrix: VoteMatrix,
        mod_out_statement_ids: list[int] = [],
) -> VoteMatrix:
    """
    The simple filter on the vote_matrix that is used by Polis prior to running PCA.

    Args:
        vote_matrix (VoteMatrix): A raw vote_matrix (with missing values)
        mod_out_statement_ids (list): A list of moderated-out statement IDs to zero out.

    Returns:
        VoteMatrix: Another vote_matrix with the given statements zeroed out
    """
    for tid in mod_out_statement_ids:
        # Zero out column only if already exists (ie. has votes)
        if tid in vote_matrix.columns:
            vote_matrix.loc[:, tid] = 0

    return vote_matrix

reddwarf.utils.matrix.get_participant_ids(vote_matrix, vote_threshold)

Find participant IDs that meet a vote threshold in a vote_matrix.

Parameters:
  • vote_matrix (VoteMatrix) –

    A raw vote_matrix (with missing values)

  • vote_threshold (int) –

    Vote threshold that each participant must meet

Returns:
  • participant_ids( list ) –

    A list of participant IDs that meet the threshold
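
Example (a sketch using a lowered threshold of 2 votes to keep the matrix small):

import numpy as np
import pandas as pd
from reddwarf.utils.matrix import get_participant_ids

vote_matrix = pd.DataFrame(
    {100: [1.0, np.nan], 101: [-1.0, 1.0]},
    index=pd.Index([0, 1], name="participant_id"),
)
print(get_participant_ids(vote_matrix, vote_threshold=2))  # [0]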

Source code in reddwarf/utils/matrix.py
def get_participant_ids(vote_matrix: VoteMatrix, vote_threshold: int) -> list:
    """
    Find participant IDs that meet a vote threshold in a vote_matrix.

    Args:
        vote_matrix (VoteMatrix): A raw vote_matrix (with missing values)
        vote_threshold (int): Vote threshold that each participant must meet

    Returns:
        participant_ids (list): A list of participant IDs that meet the threshold
    """
    return vote_matrix[vote_matrix.count(axis="columns") >= vote_threshold].index.to_list()

reddwarf.utils.pca

reddwarf.utils.pca.run_pca(vote_matrix, n_components=2)

Process a prepared vote matrix to be imputed and return projected participant data, as well as eigenvectors and eigenvalues.

The vote matrix should not yet be imputed, as this will happen within the method.

Parameters:
  • vote_matrix (DataFrame) –

    A vote matrix of data. Non-imputed values are expected.

  • n_components (int, default: 2 ) –

    Number n of principal components to decompose the vote_matrix into.

Returns:
  • projected_data( DataFrame ) –

    A dataframe of projected xy coordinates for each vote_matrix row/participant.

  • eigenvectors( List[List[float]] ) –

    The n principal components, each with one loading per column/statement/feature.

  • eigenvalues( List[float] ) –

    Explained variance, one per principal component.

  • means( list[float] ) –

    Means/centers of columns/statements/features.
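
Example (a sketch projecting a small, non-imputed vote matrix into 2D; every statement column must have at least one vote):

import numpy as np
import pandas as pd
from reddwarf.utils.pca import run_pca

vote_matrix = pd.DataFrame(
    {100: [1.0, -1.0, 1.0], 101: [np.nan, 1.0, -1.0], 102: [1.0, 1.0, np.nan]},
    index=pd.Index([0, 1, 2], name="participant_id"),
)
projected_data, eigenvectors, eigenvalues, means = run_pca(vote_matrix=vote_matrix, n_components=2)
print(projected_data.columns.tolist())  # ["x", "y"]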

Source code in reddwarf/utils/pca.py
def run_pca(
        vote_matrix: VoteMatrix,
        n_components: int = 2,
) -> Tuple[ pd.DataFrame, np.ndarray, np.ndarray, np.ndarray ]:
    """
    Process a prepared vote matrix to be imputed and return projected participant data,
    as well as eigenvectors and eigenvalues.

    The vote matrix should not yet be imputed, as this will happen within the method.

    Args:
        vote_matrix (pd.DataFrame): A vote matrix of data. Non-imputed values are expected.
        n_components (int): Number n of principal components to decompose the `vote_matrix` into.

    Returns:
        projected_data (pd.DataFrame): A dataframe of projected xy coordinates for each `vote_matrix` row/participant.
        eigenvectors (List[List[float]]): The `n` principal components, each with one loading per column/statement/feature.
        eigenvalues (List[float]): Explained variance, one per principal component.
        means (list[float]): Means/centers of columns/statements/features.
    """
    imputed_matrix = impute_missing_votes(vote_matrix)

    pca = PCA(n_components=n_components)
    pca.fit(imputed_matrix)

    eigenvectors = pca.components_
    eigenvalues = pca.explained_variance_
    # TODO: Why does this need to be inverted to match polismath output? BUG?
    # TODO: Investigate why some numbers are a bit off here.
    #       ANSWER: Because centers are calculated on unfiltered raw matrix for some reason.
    #       means = -raw_vote_matrix.mean(axis="rows")
    means = pca.mean_

    # Project participant vote data onto 2D using eigenvectors.
    # TODO: Determine what exactly we want to be doing here.

    # (1) This is what we used to use:
    # projected_data = pca.transform(imputed_matrix)

    # (2) Perhaps we could be doing this instead: (would need cleanup to clear out NaN values)
    # projected_data = pca.transform(vote_matrix)

    # (3) This is what we'll do for now, as it reproduced Polis calculations exactly:
    # Project data from raw, non-imputed vote_matrix.
    # TODO: Figure out why signs are flipped here after custom projection, unlike pca.transform().
    # Not required for regular pca.transform(), so perhaps a polismath BUG?
    # We fix this in implementations.run_clustering().
    projected_data = [sparsity_aware_project_ptpt(ptpt_votes, pca.components_, pca.mean_) for pid, ptpt_votes in vote_matrix.iterrows()]

    projected_data = pd.DataFrame(projected_data, index=vote_matrix.index, columns=np.asarray(["x", "y"]))
    projected_data.index.name = "participant_id"

    return projected_data, eigenvectors, eigenvalues, means

reddwarf.utils.clustering

reddwarf.utils.clustering.find_optimal_k(projected_data, max_group_count=5, init_centers=None, random_state=None, debug=False)

Use silhouette scores to find the best number of clusters k to assume to fit the data.

Parameters:
  • projected_data (DataFrame) –

    A dataframe with two columns (assumed x and y).

  • max_group_count (int, default: 5 ) –

    The max K number of groups to test for. (Default: 5)

  • init_centers (List, default: None ) –

    A list of xy coordinates to use as initial center guesses.

  • random_state (int, default: None ) –

    Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

  • debug (bool, default: False ) –

    Whether to print debug output. (Default: False)

Returns:
  • optimal_k( int ) –

    Ideal number of clusters.

  • optimal_silhouette_score( float ) –

    Silhouette score for this K value.

  • optimal_cluster_labels( ndarray | None ) –

    A list of labels assigning a group to each row in projected_data.

  • optimal_cluster_centers( ndarray | None ) –

    A list of xy centers for the optimal clusters.
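
Example (a sketch on a toy set of well-separated 2D points; a real call would pass the output of run_pca):

import pandas as pd
from reddwarf.utils.clustering import find_optimal_k

points = pd.DataFrame({
    "x": [0.0, 0.1, 5.0, 5.1, -4.9, -5.0],
    "y": [0.0, 0.2, 5.0, 4.8, 5.1, 4.9],
})
optimal_k, score, labels, centers = find_optimal_k(
    projected_data=points,
    max_group_count=5,
    random_state=42,
)
print(optimal_k)  # likely 3 for these three well-separated pairs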

Source code in reddwarf/utils/clustering.py
def find_optimal_k(
        projected_data: pd.DataFrame,
        max_group_count: int = 5,
        init_centers: Optional[List] = None,
        random_state: Optional[int] = None,
        debug: bool = False,
) -> Tuple[int, float, np.ndarray | None, np.ndarray | None]:
    """
    Use silhouette scores to find the best number of clusters k to assume to fit the data.

    Args:
        projected_data (pd.DataFrame): A dataframe with two columns (assumed `x` and `y`).
        max_group_count (int): The max K number of groups to test for. (Default: 5)
        init_centers (List): A list of xy coordinates to use as initial center guesses.
        random_state (int): Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.
        debug (bool): Whether to print debug output. (Default: False)

    Returns:
        optimal_k (int): Ideal number of clusters.
        optimal_silhouette_score (float): Silhouette score for this K value.
        optimal_cluster_labels (np.ndarray | None): A list of labels assigning a group to each row in projected_data.
        optimal_cluster_centers (np.ndarray | None): A list of xy centers for the optimal clusters.
    """
    K_RANGE = range(2, max_group_count+1)
    k_best = 0 # Best K so far.
    best_silhouette_score = -np.inf
    best_cluster_labels = None
    best_cluster_centers = None

    for k_test in K_RANGE:
        cluster_labels, cluster_centers = run_kmeans(
            dataframe=projected_data,
            n_clusters=k_test,
            init_centers=init_centers,
            random_state=random_state,
        )
        this_silhouette_score = silhouette_score(projected_data, cluster_labels)
        if debug:
            print(f"{k_test=}, {this_silhouette_score=}")
        if this_silhouette_score >= best_silhouette_score:
            k_best = k_test
            best_silhouette_score = this_silhouette_score
            best_cluster_labels = cluster_labels
            best_cluster_centers = cluster_centers

    optimal_k = k_best
    optimal_silhouette = best_silhouette_score
    optimal_cluster_labels = best_cluster_labels
    optimal_cluster_centers = best_cluster_centers

    return optimal_k, optimal_silhouette, optimal_cluster_labels, optimal_cluster_centers

reddwarf.utils.clustering.run_kmeans(dataframe, n_clusters=2, init_centers=None, random_state=None)

Runs K-Means clustering on a 2D DataFrame of xy points, for a specific K, and returns labels for each row and cluster centers. Optionally accepts guesses on cluster centers, and a random_state for reproducibility.

Parameters:
  • dataframe (DataFrame) –

    A dataframe with two columns (assumed x and y).

  • n_clusters (int, default: 2 ) –

    How many clusters k to assume.

  • init_centers (List, default: None ) –

    A list of xy coordinates to use as initial center guesses.

  • random_state (int, default: None ) –

    Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

Returns:
  • cluster_labels( ndarray | None ) –

    A list of zero-indexed labels for each row in the dataframe

  • cluster_centers( ndarray ) –

    A list of center coords for clusters.
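
Example (a sketch with a fixed k on a few toy points):

import pandas as pd
from reddwarf.utils.clustering import run_kmeans

points = pd.DataFrame({"x": [0.0, 0.1, 5.0, 5.1], "y": [0.0, 0.2, 5.0, 4.8]})
labels, centers = run_kmeans(dataframe=points, n_clusters=2, random_state=42)
print(labels)         # one zero-indexed label per row, e.g. [0 0 1 1]
print(centers.shape)  # (2, 2)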

Source code in reddwarf/utils/clustering.py
def run_kmeans(
        dataframe: pd.DataFrame,
        n_clusters: int = 2,
        # TODO: Improve this type. 3d?
        init_centers: Optional[List] = None,
        random_state: Optional[int] = None,
) -> Tuple[np.ndarray | None, np.ndarray]:
    """
    Runs K-Means clustering on a 2D DataFrame of xy points, for a specific K,
    and returns labels for each row and cluster centers. Optionally accepts
    guesses on cluster centers, and a random_state for reproducibility.

    Args:
        dataframe (pd.DataFrame): A dataframe with two columns (assumed `x` and `y`).
        n_clusters (int): How many clusters k to assume.
        init_centers (List): A list of xy coordinates to use as initial center guesses.
        random_state (int): Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.

    Returns:
        cluster_labels (np.ndarray | None): A list of zero-indexed labels for each row in the dataframe
        cluster_centers (np.ndarray): A list of center coords for clusters.
    """
    if init_centers:
        # Pass an array of xy coords to seed kmeans guesses.
        init_arg = init_centers[:n_clusters]
    else:
        # Use the default strategy in sklearn.
        init_arg = "k-means++"
    # TODO: Set random_state to a value eventually, so calculation is deterministic.
    kmeans = KMeans(
        n_clusters=n_clusters,
        random_state=random_state,
        init = init_arg, # type:ignore (because sklearn thinks it's just a string)
        n_init="auto"
    ).fit(dataframe)

    return kmeans.labels_, kmeans.cluster_centers_

reddwarf.utils

(These are in the process of being either moved or deprecated.)

reddwarf.utils.filter_votes(votes, cutoff=None)

Filters a list of votes.

If a cutoff is provided, votes are filtered based on either:

  • An int representing unix timestamp (ms), keeping only votes before or at that time.
    • Any int above 13_000_000_000 is considered a timestamp.
  • Any other positive or negative int is considered an index, reflecting where to trim the time-sorted vote list.
    • positive: filters in votes that many indices from start
    • negative: filters out votes that many indices from end
Parameters:
  • votes (List[Dict]) –

    An unsorted list of vote records, where each record is a dictionary containing:

    • "participant_id": The ID of the voter.
    • "statement_id": The ID of the statement being voted on.
    • "vote": The recorded vote value.
    • "modified": A unix timestamp object representing when the vote was made.
  • cutoff (int, default: None ) –

    A cutoff unix timestamp (ms) or index position in date-sorted votes list.

Returns:
  • votes( List[Dict] ) –

    A list of vote records, sorted by modified if index-based filtering occurred.
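
Example (a sketch showing both cutoff modes; timestamps are illustrative unix-ms values):

from reddwarf.utils import filter_votes

votes = [
    {"participant_id": 0, "statement_id": 0, "vote": 1, "modified": 1_736_000_000_000},
    {"participant_id": 1, "statement_id": 0, "vote": -1, "modified": 1_736_000_060_000},
]

up_to_timestamp = filter_votes(votes=votes, cutoff=1_736_000_030_000)  # timestamp cutoff
first_vote_only = filter_votes(votes=votes, cutoff=1)                  # index cutoff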

Source code in reddwarf/utils/matrix.py
def filter_votes(
        votes: List[Dict],
        cutoff: Optional[int] = None,
) -> List[Dict]:
    """
    Filters a list of votes.

    If a `cutoff` is provided, votes are filtered based on either:

    - An `int` representing unix timestamp (ms), keeping only votes before or at that time.
        - Any int above 13_000_000_000 is considered a timestamp.
    - Any other positive or negative `int` is considered an index, reflecting where to trim the time-sorted vote list.
        - positive: filters in votes that many indices from start
        - negative: filters out votes that many indices from end

    Args:
        votes (List[Dict]): An unsorted list of vote records, where each record is a dictionary containing:

            - "participant_id": The ID of the voter.
            - "statement_id": The ID of the statement being voted on.
            - "vote": The recorded vote value.
            - "modified": A unix timestamp object representing when the vote was made.

        cutoff (int): A cutoff unix timestamp (ms) or index position in date-sorted votes list.

    Returns:
        votes (List[Dict]): A list of vote records, sorted by `modified` if index-based filtering occurred.
    """
    if cutoff:
        # TODO: Detect datetime object as arg instead.
        try:
            if cutoff > 1_300_000_000:
                cutoff_timestamp = cutoff
                votes = [v for v in votes if v['modified'] <= cutoff_timestamp]
            else:
                cutoff_index = cutoff
                votes = sorted(votes, key=lambda x: x["modified"])
                votes = votes[:cutoff_index]
        except KeyError as e:
            raise RedDwarfError("The `modified` key is missing from a vote object that must be sorted") from e

    return votes

reddwarf.utils.filter_matrix(vote_matrix, min_user_vote_threshold=7, active_statement_ids=[], keep_participant_ids=[], unvoted_filter_type='drop')

Generates a filtered vote matrix from a raw matrix and filter config.

Parameters:
  • vote_matrix (DataFrame) –

    The [raw] vote matrix.

  • min_user_vote_threshold (int, default: 7 ) –

    The number of votes a participant must make to avoid being filtered.

  • active_statement_ids (List[int], default: [] ) –

    The statement IDs that are not moderated out.

  • keep_participant_ids (List[int], default: [] ) –

    Preserve specific participants even if below threshold.

  • unvoted_filter_type (drop | zero, default: 'drop' ) –

    When a statement has no votes, it can't be imputed. This determines whether to drop the statement column or set all its values to zero/pass. (Default: drop)

Returns:
  • filtered_vote_matrix( VoteMatrix ) –

    A vote matrix with the following filtered out:

    1. statements without any votes,
    2. statements that have been moderated out,
    3. participants below the vote count threshold,
    4. except participants explicitly selected via keep_participant_ids, who circumvent the above filtering.
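
Example (a sketch with a lowered threshold and a force-kept participant; IDs are illustrative):

import numpy as np
import pandas as pd
from reddwarf.utils import filter_matrix

raw_matrix = pd.DataFrame(
    {100: [1.0, np.nan], 101: [-1.0, 1.0], 102: [np.nan, np.nan]},
    index=pd.Index([0, 1], name="participant_id"),
)
filtered_matrix = filter_matrix(
    vote_matrix=raw_matrix,
    min_user_vote_threshold=2,           # lowered from 7 just for this toy matrix
    active_statement_ids=[100, 101, 102],
    keep_participant_ids=[1],            # keep participant 1 despite being below threshold
    unvoted_filter_type="drop",          # statement 102 has no votes, so its column is dropped
)
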
Source code in reddwarf/utils/matrix.py
def filter_matrix(
        vote_matrix: VoteMatrix,
        min_user_vote_threshold: int = 7,
        active_statement_ids: List[int] = [],
        keep_participant_ids: List[int] = [],
        unvoted_filter_type: Literal["drop", "zero"] = "drop",
) -> VoteMatrix:
    """
    Generates a filtered vote matrix from a raw matrix and filter config.

    Args:
        vote_matrix (pd.DataFrame): The [raw] vote matrix.
        min_user_vote_threshold (int): The number of votes a participant must make to avoid being filtered.
        active_statement_ids (List[int]): The statement IDs that are not moderated out.
        keep_participant_ids (List[int]): Preserve specific participants even if below threshold.
        unvoted_filter_type ("drop" | "zero"): When a statement has no votes, it can't be imputed. \
            This determines whether to drop the statement column or set all its values to zero/pass. (Default: drop)

    Returns:
        filtered_vote_matrix (VoteMatrix): A vote matrix with the following filtered out:

            1. statements without any votes,
            2. statements that have been moderated out,
            3. participants below the vote count threshold,
            4. except participants explicitly selected via `keep_participant_ids`, who circumvent the above filtering.
    """
    # Filter out moderated statements.
    vote_matrix = vote_matrix.filter(active_statement_ids, axis='columns')
    # Filter out participants with less than 7 votes (keeping IDs we're forced to)
    # Ref: https://hyp.is/JbNMus5gEe-cQpfc6eVIlg/gwern.net/doc/sociology/2021-small.pdf
    participant_ids_in = get_participant_ids(vote_matrix, min_user_vote_threshold)
    # Add in some specific participant IDs for Polismath edge-cases.
    # See: https://github.com/compdemocracy/polis/pull/1893#issuecomment-2654666421
    participant_ids_in = list(set(participant_ids_in + keep_participant_ids))
    vote_matrix = (vote_matrix
        .filter(participant_ids_in, axis='rows')
        # .filter() and .drop() lost the index name, so bring it back.
        .rename_axis("participant_id")
    )

    # This is otherwise the more efficient way, but we want to keep some participant IDs
    # to troubleshoot edge-cases in upstream Polis math.
    # self.matrix = self.matrix.dropna(thresh=self.min_votes, axis='rows')

    unvoted_statement_ids = vote_matrix.pipe(get_unvoted_statement_ids)

    # TODO: What about statements with no votes? E.g., 53 in oprah. Filter out? zero?
    # Test this on a conversation where it will actually change statement count.
    if unvoted_filter_type == 'drop':
        vote_matrix = vote_matrix.drop(unvoted_statement_ids, axis='columns')
    elif unvoted_filter_type == 'zero':
        vote_matrix[unvoted_statement_ids] = 0

    return vote_matrix

reddwarf.utils.impute_missing_votes(vote_matrix)

Imputes missing votes in a voting matrix using column-wise mean. All columns must have at least one vote.

Reference

Small, C. (2021). "Polis: Scaling Deliberation by Mapping High Dimensional Opinion Spaces." Specific highlight: https://hyp.is/8zUyWM5fEe-uIO-J34vbkg/gwern.net/doc/sociology/2021-small.pdf

Parameters:
  • vote_matrix (DataFrame) –

    A vote matrix DataFrame with NaN/None values where: 1. rows are voters, 2. columns are statements, and 3. values are votes.

Returns:
  • imputed_matrix( DataFrame ) –

    The same vote matrix DataFrame imputing missing values with column mean.
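
Example (a sketch showing a missing vote replaced by its column mean):

import numpy as np
import pandas as pd
from reddwarf.utils import impute_missing_votes

vote_matrix = pd.DataFrame(
    {100: [1.0, np.nan, -1.0], 101: [0.0, 1.0, 1.0]},
    index=pd.Index([0, 1, 2], name="participant_id"),
)
imputed = impute_missing_votes(vote_matrix=vote_matrix)
print(imputed.loc[1, 100])  # 0.0, the column mean of [1, -1]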

Source code in reddwarf/utils/matrix.py
def impute_missing_votes(vote_matrix: VoteMatrix) -> VoteMatrix:
    """
    Imputes missing votes in a voting matrix using column-wise mean. All columns must have at least one vote.

    Reference:
        Small, C. (2021). "Polis: Scaling Deliberation by Mapping High Dimensional Opinion Spaces."
        Specific highlight: <https://hyp.is/8zUyWM5fEe-uIO-J34vbkg/gwern.net/doc/sociology/2021-small.pdf>

    Args:
        vote_matrix (pd.DataFrame):  A vote matrix DataFrame with `NaN`/`None` values where: \
                                        1. rows are voters, \
                                        2. columns are statements, and \
                                        3. values are votes.

    Returns:
        imputed_matrix (pd.DataFrame): The same vote matrix DataFrame imputing missing values with column mean.
    """
    if vote_matrix.isna().all(axis="rows").any():
        raise RedDwarfError("impute_missing_votes does not support vote matrices containing statement columns with no votes.")

    mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputed_matrix = pd.DataFrame(
        mean_imputer.fit_transform(vote_matrix),
        columns=vote_matrix.columns,
        index=vote_matrix.index,
    )
    return imputed_matrix

reddwarf.utils.scale_projected_data(projected_data, vote_matrix)

Scale projected participant xy points based on the vote matrix, to account for participants with a small number of votes and prevent them from bunching up near the center.

Parameters:
  • projected_data (DataFrame) –

    the projected xy coords of participants.

  • vote_matrix (VoteMatrix) –

    the processed vote matrix data frame, from which to generate scaling factors.

Returns:
  • scaled_projected_data( DataFrame ) –

    The coord data rescaled based on participant votes.
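
Example (a sketch where the sparser voter is scaled further from the origin):

import numpy as np
import pandas as pd
from reddwarf.utils import scale_projected_data

vote_matrix = pd.DataFrame(
    {100: [1.0, 1.0], 101: [np.nan, -1.0]},
    index=pd.Index([0, 1], name="participant_id"),
)
projected = pd.DataFrame({"x": [0.5, 1.0], "y": [0.5, -1.0]}, index=vote_matrix.index)
scaled = scale_projected_data(projected_data=projected, vote_matrix=vote_matrix)
# Participant 0 voted on 1 of 2 statements, so their coords are scaled by sqrt(2/1).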

Source code in reddwarf/utils/pca.py
def scale_projected_data(
        projected_data: pd.DataFrame,
        vote_matrix: VoteMatrix
) -> pd.DataFrame:
    """
    Scale projected participant xy points based on the vote matrix, to account for participants
    with a small number of votes and prevent them from bunching up near the center.

    Args:
        projected_data (pd.DataFrame): the projected xy coords of participants.
        vote_matrix (VoteMatrix): the processed vote matrix data frame, from which to generate scaling factors.

    Returns:
        scaled_projected_data (pd.DataFrame): The coord data rescaled based on participant votes.
    """
    total_active_comment_count = vote_matrix.shape[1]
    participant_vote_counts = vote_matrix.count(axis="columns")
    # Ref: https://hyp.is/x6nhItMMEe-v1KtYFgpOiA/gwern.net/doc/sociology/2021-small.pdf
    # Ref: https://github.com/compdemocracy/polis/blob/15aa65c9ca9e37ecf57e2786d7d81a4bd4ad37ef/math/src/polismath/math/pca.clj#L155-L156
    participant_scaling_coeffs = np.sqrt(total_active_comment_count / participant_vote_counts).values
    # See: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
    # Reshape scaling_coeffs list to match the shape of projected_data matrix
    participant_scaling_coeffs = np.reshape(participant_scaling_coeffs, (-1, 1))

    return projected_data * participant_scaling_coeffs

reddwarf.utils.get_unvoted_statement_ids(vote_matrix)

A method intended to be piped into a VoteMatrix DataFrame, returning list of unvoted statement IDs.

See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html

Parameters:
  • vote_matrix (DataFrame) –

    A pivot of statements (cols), participants (rows), with votes as values.

Returns:
  • unvoted_statement_ids( List[int] ) –

    list of statement IDs with no votes.

Example:

unused_statement_ids = vote_matrix.pipe(get_unvoted_statement_ids)
Source code in reddwarf/utils/matrix.py
def get_unvoted_statement_ids(vote_matrix: VoteMatrix) -> List[int]:
    """
    A method intended to be piped into a VoteMatrix DataFrame, returning list of unvoted statement IDs.

    See: <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html>

    Args:
        vote_matrix (pd.DataFrame): A pivot of statements (cols), participants (rows), with votes as values.

    Returns:
        unvoted_statement_ids (List[int]): list of statement IDs with no votes.

    Example:

        unused_statement_ids = vote_matrix.pipe(get_unvoted_statement_ids)
    """
    null_column_mask = vote_matrix.isnull().all()
    null_column_ids = vote_matrix.columns[null_column_mask].tolist()

    return null_column_ids

reddwarf.data_presenter

reddwarf.data_presenter.generate_figure(coord_dataframe, labels=None)

Generates a matplotlib scatterplot with optional bounded clusters.

The plot is drawn from a dataframe of xy values, each point labelled by index participant_id. When a list of labels is supplied (one per row), concave hulls are drawn around each cluster.

Parameters:
  • coord_dataframe (DataFrame) –

    A dataframe of coordinates with columns named x and y, indexed by participant_id.

  • labels (List[int], default: None ) –

    A list of labels, one for each row in coord_dataframe.

Returns:
  • None

    None.
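
Example (a sketch plotting a few projected participants, colored by cluster label; labels are passed as an array, e.g. the output of run_kmeans):

import numpy as np
import pandas as pd
from reddwarf.data_presenter import generate_figure

coords = pd.DataFrame(
    {"x": [0.1, 0.2, 0.0, -1.0, -1.1, -0.9], "y": [0.3, 0.1, 0.2, -0.9, -1.0, -1.1]},
    index=pd.Index(range(6), name="participant_id"),
)
generate_figure(coord_dataframe=coords, labels=np.array([0, 0, 0, 1, 1, 1]))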

Source code in reddwarf/data_presenter.py
def generate_figure(
        coord_dataframe: pd.DataFrame,
        labels: Optional[List[int]] = None,
) -> None:
    """
    Generates a matplotlib scatterplot with optional bounded clusters.

    The plot is drawn from a dataframe of xy values, each point labelled by index `participant_id`.
    When a list of labels is supplied (one per row), concave hulls are drawn around each cluster.

    Args:
        coord_dataframe (pd.DataFrame): A dataframe of coordinates with columns named `x` and `y`, indexed by `participant_id`.
        labels (List[int]): A list of labels, one for each row in `coord_dataframe`.

    Returns:
        None.
    """
    plt.figure(figsize=(7, 5), dpi=80)
    plt.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    plt.axvline(x=0, color='k', linestyle='-', linewidth=0.5)
    plt.gca().invert_yaxis()

    # Label each point with its participant_id.
    for participant_id, row in coord_dataframe.iterrows():
        plt.annotate(str(participant_id),
            (float(row["x"]), float(row["y"])),
            xytext=(2, 2),
            color="gray",
            textcoords='offset points')

    scatter_kwargs = defaultdict()
    scatter_kwargs["x"] = coord_dataframe.loc[:,"x"]
    scatter_kwargs["y"] = coord_dataframe.loc[:,"y"]
    scatter_kwargs["s"] = 10       # point size
    scatter_kwargs["alpha"] = 0.8  # point transparency
    if labels is not None:
        # Ref: https://matplotlib.org/stable/users/explain/colors/colormaps.html#qualitative
        scatter_kwargs["cmap"] = "Set1"    # color map
        scatter_kwargs["c"] = labels        # color indexes

        print("Calculating concave hulls around clusters...")
        unique_labels = set(labels)
        for label in unique_labels:
            points_df = coord_dataframe[labels == label]
            print(f"Hull {str(label)}, bounding {len(points_df)} points")
            if len(points_df) < 3:
                # TODO: Accommodate 2 points like Polis platform does.
                print("Cannot create concave hull for less than 3 points. Skipping...")
                continue
            vertex_indices = concave_hull_indexes(np.asarray(points_df.loc[:, ["x", "y"]]), concavity=4.0)
            hull_points = points_df.iloc[vertex_indices, :]
            hull_points = hull_points.loc[:, ["x", "y"]]
            polygon = patches.Polygon(
                hull_points,
                fill=True,
                color="gray",
                alpha=0.3,
                edgecolor=None,
            )
            plt.gca().add_patch(polygon)
    scatter = plt.scatter(**scatter_kwargs)

    # Add a legend if labels are provided
    if labels is not None:
        plt.colorbar(scatter, label="Cluster", ticks=labels)

    plt.show()

    return None

Types

reddwarf.types.agora.Conversation

Bases: TypedDict

Attributes:
  • votes (list[Vote]) –

    A list of votes

Source code in reddwarf/types/agora.py
class Conversation(TypedDict):
    """
    Attributes:
        votes (list[Vote]): A list of votes
    """
    votes: List[Vote]

reddwarf.types.agora.Vote

Bases: TypedDict

Attributes:
  • statement_id (Identifier) –

    Statement ID

  • participant_id (Identifier) –

    Participant ID

  • vote (VoteValueEnum) –

    Vote value

Source code in reddwarf/types/agora.py
class Vote(TypedDict):
    """
    Attributes:
        statement_id (Identifier): Statement ID
        participant_id (Identifier): Participant ID
        vote (VoteValueEnum): Vote value
    """
    statement_id: Identifier # statement.id
    participant_id: Identifier # participant.id

    vote: VoteValueEnum

reddwarf.types.agora.VoteValueEnum

Bases: IntEnum

Source code in reddwarf/types/agora.py
class VoteValueEnum(IntEnum):
    AGREE = 1
    DISAGREE = -1
    # Can withhold using "pass" at own discretion.
    PASS = 0

reddwarf.types.agora.Identifier = int | str module-attribute

reddwarf.types.agora.ClusteringOptions

Bases: TypedDict

Attributes:
  • min_user_vote_threshold (Optional[int]) –

    By default we filter out participants who've placed less than 7 votes. This overrides that.

  • max_clusters (Optional[int]) –

    By default we check kmeans for 2-5 groups. This overrides the upper bound.

Source code in reddwarf/types/agora.py
class ClusteringOptions(TypedDict):
    """
    Attributes:
        min_user_vote_threshold (Optional[int]): By default we filter out participants who've placed less than 7 votes. This overrides that.
        max_clusters (Optional[int]): By default we check kmeans for 2-5 groups. This overrides the upper bound.
    """
    min_user_vote_threshold: Optional[int]
    max_clusters: Optional[int]

reddwarf.types.agora.ClusteringResult

Bases: TypedDict

Attributes:
  • clusters (list[Cluster]) –

    List of clusters.

Source code in reddwarf/types/agora.py
class ClusteringResult(TypedDict):
    """
    Attributes:
        clusters (list[Cluster]): List of clusters.
    """
    clusters: List[Cluster]

reddwarf.types.agora.Cluster

Bases: TypedDict

Attributes:
  • id (int) –

    Cluster ID

  • participants (list[ClusteredParticipant]) –

    List of clustered participants.

Source code in reddwarf/types/agora.py
class Cluster(TypedDict):
    """
    Attributes:
        id (int): Cluster ID
        participants (list[ClusteredParticipant]): List of clustered participants.
    """
    id: int
    participants: List[ClusteredParticipant]

reddwarf.types.agora.ClusteredParticipant

Bases: TypedDict

Attributes:
  • id (Identifier) –

    Participant ID

  • x (float) –

    X coordinate

  • y (float) –

    Y coordinate

Source code in reddwarf/types/agora.py
class ClusteredParticipant(TypedDict):
    """
    Attributes:
        id (Identifier): Participant ID
        x (float): X coordinate
        y (float): Y coordinate
    """
    id: Identifier # participant.id
    x: float
    y: float