This repository has been archived by the owner on Nov 19, 2020. It is now read-only.

Confusion Matrix? #669

Closed
fdncred opened this issue Jun 30, 2017 · 35 comments

Comments

@fdncred
Collaborator

fdncred commented Jun 30, 2017

Is there a way to create a confusion matrix with items other than numbers? For example, on an OCR project I want to create a CSV with A-Z in the column header and A-Z in the row header, with the data being how many mistakes were made. I would provide data as input like this:

GroundTruth,OCROutput
A,A
B,B
B,8

If I've read the code correctly, I only see a way to create a confusion matrix with numbers.

I'd also like to have properties for precision, recall, f-measure, accuracy, failure rate. So maybe this is just a feature request, if the ability doesn't exist.

Thanks,
Darren

@cesarsouza
Member

Hi there,

Thanks for opening the issue! If you have data as strings, just use the Codification filter to transform them to numbers on the fly when you need it.

I think it could be something as simple as

var codification = new Codification("labels", your_ground_truth_as_A_B_C_etc);

int[] expected = codification.Transform(your_ground_truth_as_A_B_C_etc);
int[] actual = codification.Transform(the_outputs_of_your_classifier_as_A_B_C_etc);

Do you think this could work for you? I think that Precision, Recall, and Accuracy are all implemented in ConfusionMatrix, so you would be able to get those measures from there.

Regards,
Cesar

@fdncred
Collaborator Author

fdncred commented Jun 30, 2017

I guess I can't, because I keep getting a KeyNotFoundException on codification.Transform(actualList.ToArray()). I'm not sure what's going on.

    string[] labelsList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".ToCharArray().Select(c => c.ToString()).ToArray();

    var codification = new Acc.Filters.Codification("Labels", labelsList);
    int[] expected = codification.Transform(expectedList.ToArray());
    int[] actual = codification.Transform(actualList.ToArray());

    var confusionMtx = new Acc.Analysis.ConfusionMatrix(actual, expected);
    var mtx = confusionMtx.ToGeneralMatrix();

As you might guess, my expectedList and my actualList are [A-Z][0-9].

Perhaps my actualList or expectedList doesn't have all 36 chars. Would that cause it to bomb like this?

@fdncred
Collaborator Author

fdncred commented Jun 30, 2017

My mistake, I had a space in the actual list so it wasn't just base36. My next question though, is how do I get an actual matrix, similar to this screenshot? Having the results is good but I need the actual matrix too.

image
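
The stray space described above is the kind of problem that is easier to spot before encoding than after the exception. A minimal sketch in plain C#, assuming a dictionary-based codebook; `LabelCodebook`, `Build` and `FindUnknown` are hypothetical names for illustration and are not part of Accord.NET:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Accord.NET.
public static class LabelCodebook
{
    // Assign each distinct label a zero-based index, in first-seen order.
    public static Dictionary<string, int> Build(IEnumerable<string> labels)
    {
        var codebook = new Dictionary<string, int>();
        foreach (string label in labels)
        {
            if (!codebook.ContainsKey(label))
                codebook.Add(label, codebook.Count);
        }
        return codebook;
    }

    // Return the distinct symbols that the codebook has no entry for.
    public static string[] FindUnknown(
        IReadOnlyDictionary<string, int> codebook, IEnumerable<string> symbols)
    {
        return symbols.Where(s => !codebook.ContainsKey(s)).Distinct().ToArray();
    }
}
```

Calling `FindUnknown` on both the expected and the predicted list before encoding them would report a space, or any other character outside the assumed alphabet, instead of failing on the first missing key.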

@cesarsouza
Member

cesarsouza commented Jul 1, 2017

I think that you actually need to use the GeneralConfusionMatrix class instead of ConfusionMatrix. ConfusionMatrix was originally made for binary classification problems, and GeneralConfusionMatrix for multi-class problems. I guess it would be better if ConfusionMatrix were renamed to BinaryConfusionMatrix to avoid confusion, but at this point I didn't want to introduce yet another breaking change in the framework...

If you start creating a GeneralConfusionMatrix instead of a ConfusionMatrix, from there you will be able to access the Matrix, RowTotals and ColumnTotals properties to build the matrix as shown in your screenshot.

Regards,
Cesar
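
For readers unfamiliar with what a multi-class confusion matrix holds, here is a rough sketch in plain C# of the counting step and the two kinds of totals. It is not Accord.NET's implementation; `CountMatrix` is a hypothetical name, and the orientation (rows are expected classes, columns are predicted classes) is an assumption made for this example only:

```csharp
using System;

// Hypothetical helper, not part of Accord.NET.
public static class CountMatrix
{
    // counts[e, p] is the number of samples whose expected class is e
    // and whose predicted class is p (assumed orientation).
    public static int[,] Build(int classes, int[] expected, int[] predicted)
    {
        if (expected.Length != predicted.Length)
            throw new ArgumentException("expected and predicted must have the same length");

        var counts = new int[classes, classes];
        for (int k = 0; k < expected.Length; k++)
            counts[expected[k], predicted[k]]++;
        return counts;
    }

    // Sum of each row: how many samples were expected to be each class.
    public static int[] RowSums(int[,] m)
    {
        var sums = new int[m.GetLength(0)];
        for (int i = 0; i < m.GetLength(0); i++)
            for (int j = 0; j < m.GetLength(1); j++)
                sums[i] += m[i, j];
        return sums;
    }

    // Sum of each column: how many samples were predicted as each class.
    public static int[] ColumnSums(int[,] m)
    {
        var sums = new int[m.GetLength(1)];
        for (int i = 0; i < m.GetLength(0); i++)
            for (int j = 0; j < m.GetLength(1); j++)
                sums[j] += m[i, j];
        return sums;
    }
}
```

With this orientation the diagonal holds the correctly recognized samples, and every off-diagonal cell is one specific confusion (for example, an expected B read as 8).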

@fdncred
Collaborator Author

fdncred commented Jul 5, 2017

Thanks @cesarsouza. I have the basic grid now, although it was a pain to create. ;) I still have a question though. Why do these statements return different results? The table below is generated from the confusionMtx variable, so it seems correct to me. When I look at the cm variable, I like the stats, but they don't seem to match confusionMtx; i.e., the TP, FP, FN, and TN are way different from what I expect. Perhaps I'm just ignorant about what's really going on here. Maybe it's all confused because it's expecting a binary value like 1 or 0 and I have all kinds of other values?

    var confusionMtx = new Acc.Analysis.GeneralConfusionMatrix(labelsList.Length, expected, predicted);
    var cm = new Acc.Analysis.ConfusionMatrix(predicted, expected);

image

For anyone else trying to visualize a confusion matrix, this is how I did it.

    string[] labelsList = actualList.Concat(expectedList).Distinct().OrderBy(c=>c).ToArray();
    var codification = new Acc.Filters.Codification("Labels", labelsList);
    int[] expected = codification.Transform(expectedList.ToArray()); // ground truth data
    int[] predicted = codification.Transform(actualList.ToArray()); // predicted from OCR

    var confusionMtx = new Acc.Analysis.GeneralConfusionMatrix(labelsList.Length, expected, predicted);

    // Put the confusion matrix into a datatable
    var dt = new DataTable();
    dt.Columns.Add("Matrix");
    foreach (var col in labelsList)
    {
        dt.Columns.Add(col);
    }

    //Add the reporting columns
    dt.Columns.Add("Error");
    dt.Columns.Add("Precision");
    dt.Columns.Add("Total");

    for (int y = 0; y < confusionMtx.Classes; y++)
    {
        DataRow row = dt.NewRow();
        row[0] = labelsList[y];
        for (int x = 0; x < confusionMtx.Classes; x++)
        {
            row[x+1] = confusionMtx.Matrix[x, y];
        }
        // Add the error for this row
        row[confusionMtx.Classes + 1] = confusionMtx.ColumnTotals[y] - confusionMtx.Diagonal[y];
        // Add the precision for this row
        try { row[confusionMtx.Classes + 2] = ((float)((float)confusionMtx.Diagonal[y] / confusionMtx.ColumnTotals[y]) * 100).ToString("N2"); } catch { row[confusionMtx.Classes + 2] = 0.0f; }
        // Add the total for this row
        row[confusionMtx.Classes + 3] = confusionMtx.ColumnTotals[y];

        dt.Rows.Add(row);
    }

    DataRow errRow = dt.NewRow();
    errRow[0] = "Error";
    DataRow recallRow = dt.NewRow();
    recallRow[0] = "Recall";
    DataRow totalRow = dt.NewRow();
    totalRow[0] = "Total";

    for (int i = 0; i < confusionMtx.Classes; i++)
    {
        // Add the error for the columns
        errRow[i + 1] = confusionMtx.RowTotals[i] - confusionMtx.Diagonal[i];
        // Add the recall for the columns
        try { recallRow[i + 1] = ((float)((float)confusionMtx.Diagonal[i] / confusionMtx.RowTotals[i]) * 100).ToString("N2"); } catch { recallRow[i + 1] = 0.0f; }
        // Add the total for the columns
        totalRow[i + 1] = confusionMtx.RowTotals[i];
    }

    // Add total samples
    totalRow[confusionMtx.Classes + 3] = confusionMtx.Samples;
    // Add total Errors
    errRow[confusionMtx.Classes + 1] = confusionMtx.RowTotals.Sum() - confusionMtx.Diagonal.Sum();

    dt.Rows.Add(errRow);
    dt.Rows.Add(recallRow);
    dt.Rows.Add(totalRow);

@fdncred
Collaborator Author

fdncred commented Jul 6, 2017

I think I've finally figured all this mess out. With the addition of this code I was able to calculate the values I previously requested and create a ConfusionMatrix based on the values to easily get statistics.

    // Calculate scores
    for (int i = 0; i < confusionMtx.Classes; i++)
    {
        // Diagonal also known as TruePositive (TP)
        int TP = confusionMtx.Diagonal[i];
        // FalsePositive (FP) = Sum of column minus TP
        int FP = confusionMtx.RowTotals[i] - TP;
        // TrueNegative (TN) = Sum of everything not in your class's row or column
        int TN = confusionMtx.Samples - confusionMtx.RowTotals[i] - confusionMtx.ColumnTotals[i];
        // FalseNegative (FN) = Sum of row minus TP
        int FN = confusionMtx.ColumnTotals[i] - TP;

        var cm = new Acc.Analysis.ConfusionMatrix(TP, FN, FP, TN);
        //Console.WriteLine($"Class=[{codification.Columns[0].Mapping.FirstOrDefault(x => x.Value == i).Key}] TP=[{TP}] FP=[{FP}] FN=[{FN}] TN=[{TN}] Acc=[{cm.Accuracy}] FS=[{cm.FScore}]");
        dt.Rows[i][confusionMtx.Classes + 4] = TP;
        dt.Rows[i][confusionMtx.Classes + 5] = TN;
        dt.Rows[i][confusionMtx.Classes + 6] = FP;
        dt.Rows[i][confusionMtx.Classes + 7] = FN;
        dt.Rows[i][confusionMtx.Classes + 8] = (cm.Precision * 100).ToString("N3");
        dt.Rows[i][confusionMtx.Classes + 9] = (cm.Recall * 100).ToString("N3");
        dt.Rows[i][confusionMtx.Classes + 10] = (cm.FScore * 100).ToString("N3");
    }
    dt.WriteToCsvFile($"{msOutputFolderName}\\{msOutputFileName}.csv");

The last 7 columns are calculated in this for loop.
image

@cesarsouza
Member

cesarsouza commented Jul 6, 2017

Hi @fdncred,

This is a bit weird. Do you have a multi-class problem or a binary problem? If you have multiple classes, then you should be using only the GeneralConfusionMatrix class and not the ConfusionMatrix at all. When you have multiple classes you don't actually have true positives or false negatives for the problem because your problem is not actually organized as positive/negative samples.

What you can do, however, is to get a binary confusion matrix for each class in your multi-class decision problem. In this case, indeed you can use the ConfusionMatrix, but you need to use the constructor that accepts a positiveValue parameter where you can specify which class label should be considered the "positive", so all the others can be considered "negative". This way you will be able to create N ConfusionMatrix matrices out of a classification problem of N classes.

Maybe I could also try adding a method for that in GeneralConfusionMatrix.

Regards,
Cesar
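
The one-versus-rest reduction described above can be sketched directly on a count matrix. This is a minimal illustration in plain C#, not the framework's code; `OneVsRest.ForClass` is a hypothetical name, and the matrix is assumed to have expected classes on rows and predicted classes on columns. Note that the row total and the column total both contain the diagonal cell, so a formula of the form `Samples - RowTotals[i] - ColumnTotals[i]`, as in the loop posted earlier in the thread, removes that cell twice; the sketch adds it back once.

```csharp
using System;

// Hypothetical helper, not part of Accord.NET.
public static class OneVsRest
{
    // Treat class c as "positive" and every other class as "negative".
    // Assumed orientation: m[expected, predicted].
    public static (int TP, int FP, int FN, int TN) ForClass(int[,] m, int c)
    {
        int n = m.GetLength(0);
        int total = 0, rowSum = 0, colSum = 0;
        for (int i = 0; i < n; i++)
        {
            rowSum += m[c, i];   // samples expected to be c
            colSum += m[i, c];   // samples predicted as c
            for (int j = 0; j < n; j++)
                total += m[i, j];
        }

        int tp = m[c, c];
        int fn = rowSum - tp;                  // expected c, predicted something else
        int fp = colSum - tp;                  // predicted c, expected something else
        int tn = total - rowSum - colSum + tp; // tp sits in both sums, so add it back once
        return (tp, fp, fn, tn);
    }
}
```

Under the opposite orientation (predicted on rows, expected on columns) the roles of `fn` and `fp` swap, while `tp` and `tn` are unchanged.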

@fdncred
Collaborator Author

fdncred commented Jul 6, 2017

@cesarsouza Weird or not, it appears to work. The stats match. If you look at Precision vs Precision2 and Recall vs Recall2, they're identical and calculated differently. Precision2 and Recall2 are based on your ConfusionMatrix. Precision and Recall are based off of GeneralConfusionMatrix.

I have multiple classes, one for each OCR recognized and mis-recognized character. You can see in my code snippet above how TP and FN are calculated. Some of the maths came from here.

@cesarsouza
Member

cesarsouza commented Jul 6, 2017

Thanks a lot for sharing your code, @fdncred. I will be taking a closer look at it and possibly come up with a solution to integrate this functionality in Accord.NET. Indeed, weird or not, user needs are a top priority in the development of this framework!

@cesarsouza
Member

Hi @fdncred,

Do you have the list of expected and predicted values in the variables expectedList and actualList you used in your example, so I can double-check that my implementation matches?

Regards,
Cesar

@cesarsouza
Member

Also, I know it might be late for you since you already resolved your original problem, but I am adding functionality to allow creating the table you need in a way similar to this:

// Example for https://github.com/accord-net/framework/issues/669
string[] expectedLabels = { "A", "A", "B", "C", "A", "B", "B" };
string[] predictedLabels = { "A", "B", "C", "C", "A", "C", "B" };

// Create a codification object to translate char into symbols
var codification = new Codification("Labels", expectedLabels);
int[] expected = codification.Transform(expectedLabels);   // ground truth data
int[] predicted = codification.Transform(predictedLabels); // predicted from OCR

// Create a new confusion matrix for multi-class problems
var cm = new GeneralConfusionMatrix(expected, predicted);

// Obtain relevant measures
int[,] matrix = cm.Matrix;
int[] error = cm.PerClassMatrices.Apply(x => x.Errors);
double[] recall = cm.PerClassMatrices.Apply(x => x.Recall);
int[] total = cm.PerClassMatrices.Apply(x => x.Samples);
int[] tp = cm.PerClassMatrices.Apply(x => x.TruePositives);
int[] tn = cm.PerClassMatrices.Apply(x => x.TrueNegatives);
int[] fp = cm.PerClassMatrices.Apply(x => x.FalsePositives);
int[] fn = cm.PerClassMatrices.Apply(x => x.FalseNegatives);
double[] precision = cm.PerClassMatrices.Apply(x => x.Precision);
double[] fscore = cm.PerClassMatrices.Apply(x => x.FScore);

// Create a matrix with all measures
double[,] values = matrix.ToDouble()
    .InsertColumn(error)
    .InsertColumn(recall)
    .InsertColumn(total)
    .InsertColumn(tp)
    .InsertColumn(tn)
    .InsertColumn(tp)
    .InsertColumn(fn)
    .InsertColumn(precision)
    .InsertColumn(fscore);
            
// Name of each of the columns in order to create a data table
string[] columnNames = codification.Columns[0].Values.Concatenate(
    "Error", "Recall", "Total", "TP", "TN", "FP", "FN", "Precision", "F-Score");

// Create a table from the matrix and columns
DataTable table = values.ToTable(columnNames);

@fdncred
Collaborator Author

fdncred commented Jul 8, 2017

@cesarsouza Attached is the test sample I used. (List removed)

This is the code I used to prepare it for the confusion matrix so I could compare each character.

    string[] lines = File.ReadAllLines(msInputFileName).Skip(1).ToArray();

    //            Truth,OCR
    //3C6TR5DT2HG675819,3C5TR5DT2HG675819
    //1C6RR6JT0HS707986,1C6RR6JT0HS70798B
    //1C6RR6TT3HS707994,HC6RR6TT3HS707994

    // Create lists of chars
    List<string> expectedList = new List<string>();
    List<string> predictedList = new List<string>();
    foreach (var item in lines)
    {
        string[] tokens = item.Split(',');
        string truthStr = tokens[0];
        string ocrStr = tokens[1];

        if (truthStr.Length > 0 && truthStr.Length == ocrStr.Length)
        {
            // split to chars
            char[] truthCharArr = truthStr.ToCharArray();
            char[] ocrCharArr = ocrStr.ToCharArray();
            for (int i = 0; i < truthCharArr.Length; i++)
            {
                expectedList.Add(truthCharArr[i].ToString());
                predictedList.Add(ocrCharArr[i].ToString());
            }
        }
    }

I haven't tried your code yet but I'm not sure about one part. In my code I combined the predictedList labels and expectedList labels with this line:

    string[] labelsList = predictedList.Concat(expectedList).Distinct().OrderBy(c => c).ToArray();

You're doing it differently. I ran into problems because with OCR your expected list and predicted list don't always contain the same set of characters. For example, if you assume the attached list labels are [A-Z0-9], you'll run into problems because there are other characters besides those. I'll have to test yours and see what happens.

I like that you're adding code to get the information out as a DataTable. I'll use your implementation if it gives the same results. Thanks!
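
The preparation step above (pair the characters position by position, then take the union of both sides as the label set) can also be written with LINQ. A small sketch under the same input assumptions as the loop in this comment, one `Truth,OCR` pair per line with lines of unequal length skipped; `CharPairs` and its methods are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Accord.NET.
public static class CharPairs
{
    // Pair each truth character with the OCR character at the same position.
    // Lines whose two fields are empty or differ in length are skipped.
    public static List<(string Expected, string Predicted)> FromLines(IEnumerable<string> lines)
    {
        var pairs = new List<(string Expected, string Predicted)>();
        foreach (string line in lines)
        {
            string[] tokens = line.Split(',');
            if (tokens.Length < 2) continue;

            string truth = tokens[0], ocr = tokens[1];
            if (truth.Length == 0 || truth.Length != ocr.Length) continue;

            pairs.AddRange(truth.Zip(ocr, (t, o) => (t.ToString(), o.ToString())));
        }
        return pairs;
    }

    // Sorted, distinct labels seen on either side, so that characters the OCR
    // produced but the ground truth never contains still get an index.
    public static string[] LabelUnion(IEnumerable<(string Expected, string Predicted)> pairs)
    {
        return pairs
            .SelectMany(p => new[] { p.Expected, p.Predicted })
            .Distinct()
            .OrderBy(s => s, StringComparer.Ordinal)
            .ToArray();
    }
}
```

Building the label set from the union, rather than from the expected side alone, is what avoids a missing-key failure when the OCR emits a character that never appears in the ground truth.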

@cesarsouza
Member

cesarsouza commented Jul 8, 2017

Ah, I see. I was assuming that the expected list was coming from a training set and that the predictions were coming from an OCR model that had been created from this same training set. Under normal circumstances, the training set should contain all the labels that could be expected during testing. If the training set doesn't contain a label, then there is no hope that this label could ever be correctly identified in a testing set.

However, now I see that in the kind of testing you are doing, you are not necessarily measuring the performance of a classifier that has been created from the set of expected labels. You are measuring a general model on sets different from the ones that were used to create it.

Anyways, I will update the thread soon after I check all the outputs are correct. Thanks a lot for providing the data example!

Regards,
Cesar

cesarsouza added a commit that referenced this issue Jul 9, 2017
…ting on binary classification results;

Adding a new PerClassesMatrices property that can be used to retrieve per-class confusion matrices and related measures.

- Updates GH-669: Confusion Matrix
@fdncred
Collaborator Author

fdncred commented Jul 10, 2017

@cesarsouza I was looking at your PerClassMatrices code and noticed that you're calculating fp and fn differently than I am: the rows and columns are swapped. Is that because I was calculating it wrong?

Your code:

    int fp = colSum[i] - diagonal[i];
    int fn = rowSum[i] - diagonal[i];

My code:

    // FalsePositive (FP) = Sum of column minus TP
    int FP = confusionMtx.RowTotals[i] - TP;
    // FalseNegative (FN) = Sum of row minus TP
    int FN = confusionMtx.ColumnTotals[i] - TP;

@accord-net accord-net deleted a comment from fdncred Jul 10, 2017
@accord-net accord-net deleted a comment from fdncred Jul 10, 2017
@cesarsouza
Member

Hmm... I have to take a look again, but the way I did it made the unit tests pass (I was comparing to a ConfusionMatrix that was being created directly from the data as if the label associated with the class was a "positive" and all the rest were "negative").

@fdncred
Collaborator Author

fdncred commented Jul 10, 2017

My only other "complaint" is that I haven't figured out how to insert the Row Headers and Row Summaries of Errors, Totals, and Precision by column.

It's not clear to me, from looking at your code, whether InsertColumn really inserts a column, such as one at the very beginning, or whether it's really an AppendColumn that only adds columns to the end. Still looking.

@cesarsouza
Member

cesarsouza commented Jul 10, 2017

By default, InsertColumn inserts at the end. But if you think it looks confusing, you can also use Matrix.Concatenate(), as long as you transform the single vectors into columns using Matrix.ColumnVector(columnArray).

@cesarsouza
Member

i.e.: You could use

Matrix.Concatenate(
    matrix.ToDouble(), 
    Matrix.ColumnVector(error), 
    Matrix.ColumnVector(recall)
);

(the number of arguments is variable, you can concatenate horizontally as many matrices as you want)

@fdncred
Collaborator Author

fdncred commented Jul 10, 2017

Thanks but I'm still lost. :) VS2017 says this with your code

error CS0411: The type arguments for method 'Matrix.Concatenate<T>(T[], params T[])' cannot be inferred from the usage. Try specifying the type arguments explicitly. 

I'm also not sure if your example is supposed to give me a new column of row headers or a summary of errors and precision rows at the end of the matrix.

Sorry if I'm being obtuse.

@cesarsouza
Member

It is just because the arrays we are trying to concatenate have mixed types (some are double[] and others are int[]):

            // Obtain relevant measures
            int[,] matrix = cm.Matrix;
            int[] error = cm.PerClassMatrices.Apply(x => x.Errors);
            double[] recall = cm.PerClassMatrices.Apply(x => x.Recall);
            int[] total = cm.PerClassMatrices.Apply(x => x.Samples);
            int[] tp = cm.PerClassMatrices.Apply(x => x.TruePositives);
            int[] tn = cm.PerClassMatrices.Apply(x => x.TrueNegatives);
            int[] fp = cm.PerClassMatrices.Apply(x => x.FalsePositives);
            int[] fn = cm.PerClassMatrices.Apply(x => x.FalseNegatives);
            double[] precision = cm.PerClassMatrices.Apply(x => x.Precision);
            double[] fscore = cm.PerClassMatrices.Apply(x => x.FScore);

            double[,] values = Matrix.Concatenate<double>(
                matrix.ToDouble(),
                Matrix.ColumnVector(error).ToDouble(),
                Matrix.ColumnVector(recall),
                Matrix.ColumnVector(total).ToDouble(),
                Matrix.ColumnVector(tp).ToDouble(),
                Matrix.ColumnVector(tn).ToDouble(),
                Matrix.ColumnVector(tp).ToDouble(),
                Matrix.ColumnVector(fn).ToDouble(),
                Matrix.ColumnVector(precision),
                Matrix.ColumnVector(fscore)
            );

The end result is just a concatenation of all the columns into a single two-dimensional array. I am just processing it like this because I think it's easier than adding columns to a data table (as the framework does not offer methods for doing the same with data tables).
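
The CS0411 error is what C# reports when a single type parameter would have to be both `int` and `double` at once. A stand-alone sketch of the same situation, using a hypothetical `AppendColumns<T>` helper written for this example (it is not Accord.NET's `Matrix.Concatenate`):

```csharp
using System;

// Hypothetical helper, not part of Accord.NET.
public static class ColumnConcat
{
    // Convert an int column to double so that every column shares one element type.
    public static double[] ToDouble(int[] values)
    {
        return Array.ConvertAll(values, v => (double)v);
    }

    // Append columns to the right of a matrix. All arguments must share the
    // element type T; passing a double[,] together with an int[] column leaves
    // the compiler unable to infer T (error CS0411).
    public static T[,] AppendColumns<T>(T[,] matrix, params T[][] columns)
    {
        int rows = matrix.GetLength(0), cols = matrix.GetLength(1);
        foreach (T[] column in columns)
        {
            if (column.Length != rows)
                throw new ArgumentException("each column needs one value per matrix row");
        }

        var result = new T[rows, cols + columns.Length];
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
                result[i, j] = matrix[i, j];
            for (int k = 0; k < columns.Length; k++)
                result[i, cols + k] = columns[k][i];
        }
        return result;
    }
}
```

For example, `ColumnConcat.AppendColumns(values, ColumnConcat.ToDouble(errors), recall)` compiles once the `int[]` column has been converted, which mirrors the `.ToDouble()` calls in the snippet above.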

@fdncred
Collaborator Author

fdncred commented Jul 10, 2017

So I guess you're saying that I can't use this methodology to get Row Headers or Summaries of the Columns in the rows beneath the matrix.

It looks to me like this:

            double[,] values = Matrix.Concatenate<double>(
                matrix.ToDouble(),
                Matrix.ColumnVector(error).ToDouble(),
                Matrix.ColumnVector(recall),
                Matrix.ColumnVector(total).ToDouble(),
                Matrix.ColumnVector(tp).ToDouble(),
                Matrix.ColumnVector(tn).ToDouble(),
                Matrix.ColumnVector(tp).ToDouble(),
                Matrix.ColumnVector(fn).ToDouble(),
                Matrix.ColumnVector(precision),
                Matrix.ColumnVector(fscore)
            );

produces the same exact result as this:

                double[,] values2 = matrix.ToDouble()
                    .InsertColumn(error)
                    .InsertColumn(recall)
                    .InsertColumn(total)
                    .InsertColumn(tp)
                    .InsertColumn(tn)
                    .InsertColumn(tp)
                    .InsertColumn(fn)
                    .InsertColumn(precision)
                    .InsertColumn(fscore);

Maybe I'm not explaining myself well enough. With this matrix there is a header row containing the list of labels. I want that same list of labels going down the first column as well. And then at the end of the matrix I want column summaries in rows and row summaries in columns. I essentially want it to look like the screenshot from a few days ago.

@cesarsouza
Member

Oh sorry, now I realize what you mean. You mean adding the class labels as the first column, as well as adding the "Error", "Precision" and "Total" rows at the bottom of the table. I thought I had read your question too fast, now I am sure.

In fact, I haven't added the functionality to calculate the errors and precisions for the rows yet. I guess that for now, it could be possible to add those columns and rows after the initial data table has been created using a mixture of the method I had shown above and the initial method you were using before.

Well, sorry for the confusion in the previous answers.

If you want to have an initial array of "string" type in the beginning of your matrix as I had shown above, you can use something like:

 object[,] values = Matrix.Concatenate<object>(
                Matrix.ColumnVector(codification.Columns[0].Values),
                matrix.ToObject(),
                Matrix.ColumnVector(error).ToObject(),
                Matrix.ColumnVector(recall).ToObject(),
                Matrix.ColumnVector(total).ToObject(),
                Matrix.ColumnVector(tp).ToObject(),
                Matrix.ColumnVector(tn).ToObject(),
                Matrix.ColumnVector(tp).ToObject(),
                Matrix.ColumnVector(fn).ToObject(),
                Matrix.ColumnVector(precision).ToObject(),
                Matrix.ColumnVector(fscore).ToObject()
            );

The problem is that now it will not be possible to create a DataTable with the column names already set from this matrix definition because I've just seen that this overload is missing. You would have to name the columns manually, after:

            // Name of each of the columns in order to create a data table
            string[] columnNames = "Label".Concatenate(codification.Columns[0].Values.Concatenate(
                "Error", "Recall", "Total", "TP", "TN", "FP", "FN", "Precision", "F-Score"));

            // Create a table from the matrix and columns
            DataTable table = values.ToTable();
            for (int i = 0; i < columnNames.Length; i++)
                table.Columns[i].ColumnName = columnNames[i];

Now, for the rows that should come below, unfortunately they would have to be manually set at this time, at least until I can implement the functionality to compute them from the confusion table in the future.

@fdncred
Collaborator Author

fdncred commented Jul 10, 2017

Much better. Thanks for the progress and sorry for the confusion. I'll try to figure out the column summary rows tomorrow.

And please let me know what you figure out about the precision and recall being swapped.

@cesarsouza
Member

cesarsouza commented Jul 10, 2017

It turns out you can actually append the summary rows using InsertRow:

object[,] values = Matrix.Concatenate<object>(
    Matrix.ColumnVector(codification.Columns[0].Values),
    matrix.ToObject(),
    Matrix.ColumnVector(colErrors).ToObject(),
    Matrix.ColumnVector(colRecall).ToObject(),
    Matrix.ColumnVector(colTotal).ToObject(),
    Matrix.ColumnVector(tp).ToObject(),
    Matrix.ColumnVector(tn).ToObject(),
    Matrix.ColumnVector(tp).ToObject(),
    Matrix.ColumnVector(fn).ToObject(),
    Matrix.ColumnVector(precision2).ToObject(),
    Matrix.ColumnVector(recall2).ToObject(),
    Matrix.ColumnVector(fscore).ToObject()
);

values = values.InsertRow(Matrix.Concatenate<object>("Error", colErrors.ToObject()));
values = values.InsertRow(Matrix.Concatenate<object>("Precision", rowPrecision.ToObject()));
values = values.InsertRow(Matrix.Concatenate<object>("Total", colTotal.ToObject()));


// Name of each of the columns in order to create a data table
string[] columnNames = "Label".Concatenate(codification.Columns[0].Values.Concatenate(
    "Error", "Recall", "Total", "TP", "TN", "FP", "FN", "Precision2", "Recall2", "F-Score"));

// Create a table from the matrix and columns
DataTable table = values.ToTable(columnNames);

Now, the only problem is that you might have to compute some of the variables manually for now (e.g. colErrors, rowErrors) as I had not included them yet in the framework.

Also, note that the syntax above is a bit ugly because we are stretching the usage of InsertRow/InsertCol a bit. Normally those would be used with columns, rows, or values of matching data types. It is nice to know it works with mixed data types as well, though.

Regards,
Cesar

cesarsouza added a commit that referenced this issue Jul 10, 2017
…ionMatrix;

Updating general confusion matrix construction example to consider total rows;

 - Updates GH-669: Confusion Matrix
@fdncred
Collaborator Author

fdncred commented Jul 11, 2017

Thanks @cesarsouza for putting so much work into this!

I got your latest changes and built the libraries and tested this morning. I used the code in your unit test with my dataset. The results are super close but there are still some things that don't look right. See the highlighted items in the screenshot below.

  1. FP column is being calculated wrong.
  2. The TN also looks off when comparing to my original screenshot. One of us is right, I just haven't validated who is yet.
  3. Precision row and Recall column should be a double. I think this must be a bug because cm.Precision and cm.Recall are returning ints. Actually they're returning doubles with int values.
  4. Error column and Error row have identical values. I think this is because there's a typo in the Error row of the unit test code. It should be rowErrors. The same is true for the Total row. I think it should be rowTotal.
  5. The last thing that is really driving me crazy, mostly because of my ignorance, is that my precision is your recall and my recall is your precision. I'm not sure what it should be, if you're right or I'm right. I tend to side with you but I'd just like to figure it out. :) I think the reason behind this swapping is because your FN is my FP and vice-versa.
  6. I also just noticed that your rows are my columns and vice versa. So, even with the typo fix in point 4 above, the error row matches the values that should be in the error column. This is probably my mistake in confusing the meaning of rowTotal and colTotal.
    image
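
Point 3 (doubles that only ever hold whole numbers) is the usual signature of integer division: if both operands are `int`, C# divides as integers and only then widens the result. A minimal illustration, with hypothetical method names and no claim about what the framework code looked like at the time:

```csharp
using System;

// Hypothetical helper for illustration only.
public static class Ratios
{
    // Both operands are int, so the quotient is truncated before it is
    // converted to double: 3 / 4 yields 0.0 here.
    public static double TruncatedRatio(int correct, int total)
    {
        return correct / total;
    }

    // Casting one operand first forces floating-point division.
    // A zero total is reported as 0.0 instead of dividing by zero.
    public static double Ratio(int correct, int total)
    {
        return total == 0 ? 0.0 : (double)correct / total;
    }
}
```

Note that `TruncatedRatio` throws a `DivideByZeroException` for a zero total, whereas floating-point division by zero would produce NaN or infinity without throwing.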

@fdncred
Collaborator Author

fdncred commented Jul 11, 2017

I think I might have this figured out. This is a list of my changes.

GeneralConfusionMatrix.cs
Line 202: changed it from matrix[i, j]++ to matrix[j, i]++ <-- fixed the rows and columns being swapped
Line 311: changed the second diagonal[i] to (double)diagonal[i] so we get double results versus int <-- fixed Precision
Line 332: changed the second diagonal[i] to (double)diagonal[i] <-- fixed Recall
Lines 800-808: changed to this:

    // Diagonal also known as TruePositive (TP)
    int tp = Diagonal[i];
    // FalsePositive (FP) = Sum of column minus TP
    int fp = ColumnErrors[i];
    // TrueNegative (TN) = Sum of everything not in your class's row or column
    int tn = Samples - RowTotals[i] - ColumnTotals[i];
    // FalseNegative (FN) = Sum of row minus TP
    int fn = RowErrors[i];

    matrices[i] = new ConfusionMatrix(
        truePositives: tp, falseNegatives: fn,
        falsePositives: fp, trueNegatives: tn);

Then I changed your test code to this:

    string[] labelsList = predictedList.Concat(expectedList).Distinct().OrderBy(c => c).ToArray();
    var codification = new Codification("Labels", labelsList);
    int[] expected = codification.Transform(expectedList.ToArray());
    int[] predicted = codification.Transform(predictedList.ToArray());
    var cm = new GeneralConfusionMatrix(expected, predicted);

    int[] rowErrors = cm.RowErrors;
    int[] colErrors = cm.ColumnErrors;

    double[] rowPrecision = cm.Precision;
    double[] colRecall = cm.Recall;

    int[] colTotal = cm.ColumnTotals;
    int[] rowTotal = cm.RowTotals;

    // Obtain relevant measures
    int[,] matrix = cm.Matrix;
    int[] tp = cm.PerClassMatrices.Apply(x => x.TruePositives);
    int[] tn = cm.PerClassMatrices.Apply(x => x.TrueNegatives);
    int[] fp = cm.PerClassMatrices.Apply(x => x.FalsePositives);
    int[] fn = cm.PerClassMatrices.Apply(x => x.FalseNegatives);
    double[] precision2 = cm.PerClassMatrices.Apply(x => x.Precision);
    double[] recall2 = cm.PerClassMatrices.Apply(x => x.Recall);
    double[] fscore = cm.PerClassMatrices.Apply(x => x.FScore);

    object[,] column01 = Matrix.ColumnVector(codification.Columns[0].Values).ToObject();
    object[,] columns2_to_4 = matrix.ToObject();
    object[,] column05 = Matrix.ColumnVector(rowErrors).ToObject();
    object[,] column06 = Matrix.ColumnVector(colRecall).ToObject();
    object[,] column07 = Matrix.ColumnVector(rowTotal).ToObject();
    object[,] column08 = Matrix.ColumnVector(tp).ToObject();
    object[,] column09 = Matrix.ColumnVector(tn).ToObject();
    object[,] column10 = Matrix.ColumnVector(tp).ToObject();
    object[,] column11 = Matrix.ColumnVector(fn).ToObject();
    object[,] column12 = Matrix.ColumnVector(precision2).ToObject();
    object[,] column13 = Matrix.ColumnVector(recall2).ToObject();
    object[,] column14 = Matrix.ColumnVector(fscore).ToObject();

    object[,] values = Matrix.Concatenate(
        column01,
        columns2_to_4,
        column05,
        column06,
        column07,
        column08,
        column09,
        column10,
        column11,
        column12,
        column13,
        column14
    );

    object[] row05 = Matrix.Concatenate<object>("Error", colErrors.ToObject());
    object[] row06 = Matrix.Concatenate<object>("Precision", rowPrecision.ToObject());
    object[] row07 = Matrix.Concatenate<object>("Total", colTotal.ToObject());

    values = values.InsertRow(row05)
        .InsertRow(row06)
        .InsertRow(row07);

    // Name of each of the columns in order to create a data table
    string[] columnNames = "Label".Concatenate(codification.Columns[0].Values.Concatenate(
        "Error", "Recall", "Total", "TP", "TN", "FP", "FN", "Precision2", "Recall2", "F-Score"));

    // Create a table from the matrix and columns
    DataTable table = values.ToTable(columnNames);

And things seem to work better now. The only thing I can't figure out is how to get the 2nd to last row to appear as double. I think it may have something to do with the ToTable() method.

This is what the chart looks like with these changes.
[image: screenshot of the resulting table]

@cesarsouza
Member

cesarsouza commented Jul 11, 2017

Hi fdncred,

I can't write much right now as I am on the go, but I just wanted to say that confusion matrices (especially multi-class ones) are sometimes transposed depending on the implementation. It depends on the convention being used. As long as the metrics are correct with respect to the chosen convention, it should be fine.

The problem here (besides the many others you identified, thanks for that) is that I didn't realize your conventions were transposed wrt mine, and I based some of the formulas on the code you posted before. It was actually my fault for not checking beforehand.

@cesarsouza
Member

In the convention you need, where would you put the labels "expected" and "prediction" by the way? "prediction" on top of the table, and "expected" on the left side?

@fdncred
Collaborator Author

fdncred commented Jul 11, 2017

I can work with either orientation as long as all the numbers are correct. My problem was that the table mixed the two: some of the data followed one convention and some of it was transposed. That was just too confusing.

So maybe you could report it one way and have a helper function to transpose it if a user needed it in the other way.

In my examples the column labels are the ground truth (expected) and the row labels (predicted) are the OCR output.
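To make the convention concrete, here is a minimal sketch of such a helper in plain C#, with no Accord.NET dependency. The class and method names (`ConfusionConventions`, `Transpose`, `DiagonalOverRowSum`) are made up for illustration and are not part of the framework. With rows as predictions, diagonal divided by row sum is the per-class precision; after transposing, the same calculation gives the per-class recall.

```csharp
using System;

// Hypothetical helper, not part of Accord.NET.
public static class ConfusionConventions
{
    // Returns a transposed copy, so the row and column labels swap roles.
    public static int[,] Transpose(int[,] m)
    {
        int rows = m.GetLength(0), cols = m.GetLength(1);
        var t = new int[cols, rows];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                t[j, i] = m[i, j];
        return t;
    }

    // Diagonal entry divided by its row sum, for each class.
    // Rows = predicted  -> per-class precision.
    // Rows = expected   -> per-class recall.
    public static double[] DiagonalOverRowSum(int[,] m)
    {
        int n = m.GetLength(0);
        var result = new double[n];
        for (int i = 0; i < n; i++)
        {
            int sum = 0;
            for (int j = 0; j < m.GetLength(1); j++)
                sum += m[i, j];
            // Avoid dividing by zero for classes that never appear in a row.
            result[i] = sum == 0 ? 0.0 : (double)m[i, i] / sum;
        }
        return result;
    }
}
```

For a two-label example with rows = predicted and columns = expected, `new int[,] { { 3, 0 }, { 1, 2 } }`, calling `DiagonalOverRowSum` on the matrix as-is gives the precisions, and calling it on `Transpose(matrix)` gives the recalls.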

@fdncred
Collaborator Author

fdncred commented Jul 11, 2017

Just found another bug/typo.

    object[,] column08 = Matrix.ColumnVector(tp).ToObject();
    object[,] column09 = Matrix.ColumnVector(tn).ToObject();
    object[,] column10 = Matrix.ColumnVector(tp).ToObject();
    object[,] column11 = Matrix.ColumnVector(fn).ToObject();

This section of code uses tp twice. column10 is the "FP" column, so it should use fp instead of tp.
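As a cross-check on the four columns, here is a small sketch of how the per-class counts can be derived directly from a multi-class matrix. It assumes the convention from my examples (rows = predicted, columns = expected) and uses a hypothetical `PerClassCounts.Count` helper written in plain C#, not the framework's own API.

```csharp
using System;

// Hypothetical helper, not part of Accord.NET.
// Assumes a square matrix with rows = predicted and columns = expected.
public static class PerClassCounts
{
    public static (int TP, int FP, int FN, int TN) Count(int[,] m, int k)
    {
        int n = m.GetLength(0);
        int total = 0, rowSum = 0, colSum = 0;
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < n; j++)
            {
                total += m[i, j];
                if (i == k) rowSum += m[i, j]; // everything predicted as class k
                if (j == k) colSum += m[i, j]; // everything that truly is class k
            }
        }

        int tp = m[k, k];              // predicted k and truly k
        int fp = rowSum - tp;          // predicted k but truly something else
        int fn = colSum - tp;          // truly k but predicted as something else
        int tn = total - tp - fp - fn; // everything not involving class k
        return (tp, fp, fn, tn);
    }
}
```

The four counts for any class should always add up to the total number of samples, which makes a duplicated tp column easy to spot.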

@cesarsouza
Member

cesarsouza commented Jul 11, 2017

Hi @fdncred

Thanks a lot for finding all those typos, and also thanks for the suggestion about transposing the matrix in case the user needs it. Actually, this was a very good one. Most (maybe all?) libraries just go for a single convention, and then expect users to adapt to it or convert to / from by themselves. It would be really neat if the framework could support both cases without expecting too much from the user side.

I may try to implement a few things in this direction.

By the way, sorry again for the bugs, I had indeed not validated the code enough since it was not about to be included in an official release.

Regards,
Cesar

@cesarsouza
Member

cesarsouza commented Jul 15, 2017

Hi @fdncred,

Can I at least include the values of your confusion matrix as a unit test in the framework (under the LGPL), together with a shuffled version of the list of assignments (i.e. the integer vectors for the predicted and expected values)? It should be impossible to trace back those values to your original data, since they would be seen as just arrays of seemingly random integer numbers.

Regards,
Cesar

@fdncred
Collaborator Author

fdncred commented Jul 16, 2017

Sure. That sounds reasonable.

cesarsouza added a commit that referenced this issue Jul 16, 2017
…passed;

Fixing Matrix's ToTable method to use the most high level type possible when creating columns;

 - Updates GH-669
cesarsouza added a commit that referenced this issue Jul 16, 2017
…instead of predictions;

Adding shuffled, randomized and summarized test data for GH-669.
@fdncred
Collaborator Author

fdncred commented Jul 18, 2017

@cesarsouza FYI, I dropped in 3.6.3-alpha into my code base and it appears to have run well. My "old" results match the results with 3.6.3-alpha. So, that's good! Thanks for all the hard work!!!

@cesarsouza
Member

Added in release v3.7.0.
