TestSuite-blueprint

Latest revision as of 10:35, 31 March 2012

Inkscape has had a number of unit tests for quite some time now. This page describes a proposal (by Jasper van de Gronde) for GSoC 2008 to improve the current test suite.

In short, I will make sure these tests run on all supported platforms, write new tests to increase the unit test coverage of the codebase, create a way to perform higher-level tests and set up a system to run all these tests automatically and periodically, publishing the results on-line (in a raw form) and possibly notifying certain individuals if something goes wrong.

Unit tests

Building on other platforms

Currently the unit tests only run on Linux. The other two major platforms that Inkscape runs on are Windows and MacOS X. MacOS X uses more or less the same build system as Linux, but Windows uses a special-purpose buildtool. I will make sure all tests run on all three platforms.

I have tried building the existing unit tests on Windows and encountered the following problems:

There are both CxxTest and Utest based unit tests, each requiring a different build method. CxxTest is the newer, more full-featured framework.
For Utest an executable is created per test file. Buildtool isn't built for this, so either it has to be modified to make this easier, or a relatively large piece of XML has to be copied and pasted for each test file.
CxxTest, on the other hand, does use a single executable, but the test files themselves are only header files and a code generator (cxxtestgen) is used to generate the necessary .cpp file(s). Buildtool can reasonably easily be extended to support this.

Fortunately those were (practically) the only problems I have encountered so far. As a feasibility test I tried (and succeeded) to get all the unit tests compiling and running on Windows with just a few changes to buildtool and build.xml (see bug #208821).

To get the CxxTests running I simply implemented two new tasks for buildtool to generate the necessary files. I will keep it this way, as it is reasonably clean and works quite well. The main thing that will have to be discussed with the Inkscape community is whether or not these steps will indeed become part of the normal build process.

For Utest I modified the link task in buildtool to allow it to generate an executable for each file in its fileset instead of just one executable for the entire fileset. This is a bit more dubious, and since a lot of the tests are already in (or have been converted to) the CxxTest framework, I will convert the remaining Utest tests to CxxTest.

Test suite

An important part of this project is the creation of new unit tests. For starters I will write unit tests to fill some gaps and try to have more of the code that deals with converting SVG into (ultimately) a bitmap covered. This means I will at least write unit tests for:

display/ (selected files)
libnr/nr-compose (already partially done while working on a previous patch)
svg/svg-*
I might create some tests for livarot, but since it is going to be replaced by cairo, it will not be my first priority.

In addition I will use coverage/profiling data (I have already successfully used gcov and gprof with parts of Inkscape) and SVN logs, as well as inquiries on the developers mailing list (for example), to try and find parts of the code for which it would be interesting to create (unit) tests. I will publish these results, but I will probably not create unit tests for every potentially interesting piece of code this might identify (there will probably be a lot of it).

Apart from using coverage data to identify interesting parts of the code to test I might also use gcov to keep track of how much of the codebase is covered by (unit) tests. This kind of data could be interesting to see whether any of the existing unit tests might need some extra test cases. For example, executing the CxxTest unit tests covers only about 44% of the lines in svg-color.cpp. Looking at the coverage analysis in more detail reveals that the (static) rgb24_to_css function is completely untested.
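A per-file line coverage figure like the one above can be derived from the annotated listings that gcov writes. The following is only a minimal sketch, not the tooling used for this project: the function name `line_coverage` is made up for illustration, and it assumes the plain `count:lineno:source` layout of a `.gcov` file, in which `-` marks a line without executable code and `#####` (or `=====`) marks an executable line that was never run.

```python
def line_coverage(gcov_text):
    """Return (executed, executable) line counts for one .gcov listing.

    Assumes the usual "count:lineno:source" layout written by gcov.
    """
    executed = 0
    executable = 0
    for line in gcov_text.splitlines():
        parts = line.split(":", 2)
        # Skip anything that is not a "count:lineno:source" line.
        if len(parts) < 3 or not parts[1].strip().isdigit():
            continue
        count = parts[0].strip()
        if count == "-":
            continue  # no executable code on this line
        executable += 1
        # '#####' and '=====' mark executable lines that never ran.
        if not count.startswith(("#", "=")):
            executed += 1
    return executed, executable
```

Dividing the first number by the second gives the fraction of executable lines that were exercised by the tests.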

Coverage data (and/or the dependency data calculated for building Inkscape) could also be used to reduce the number of tests that need to be rerun, if that proves to be useful. So far I do not expect it to be, as the unit tests take very little time, but the higher-level tests described below might take long enough to make this worthwhile.

The table below shows for which files unit tests currently exist (and for which framework). Files that still have to be converted are marked in yellow, and files that still have to be created from scratch in red. The files marked in green/with an asterisk are new (at least their CxxTest versions). Files marked with S exist but only contain a stub (with perhaps one or two tests).

File	CxxTest	Utest
attributes	Y
color-profile	Y
dir-util	Y
extract-uri	Y
mod360	Y
round	Y
sp-gradient	Y
sp-style-elem	Y
style	Y (small)	Y
verbs	Y
display/bezier-utils	Y*	Y
display/curve	Y*
helper/units	Y*	Y
libnr/in-svg-plane	Y	Y
libnr/nr-compose	Y*
libnr/nr-matrix	Y	Y
libnr/nr-point-fns	Y	Y
libnr/nr-rotate-fns	Y	Y
libnr/nr-rotate	Y	Y
libnr/nr-scale	Y	Y
libnr/nr-translate	Y	Y
libnr/nr-types	Y	Y
svg/css-ostringstream	Y
svg/stringstream	Y
svg/svg-affine	Y*
svg/svg-color	Y
svg/svg-length	Y*
svg/svg-path	Y*
util/list-container	Y*	Y
xml/quote	Y*	Y
xml/repr-action	Y*	Y

High-level tests

Apart from unit tests - which focus on low-level, self-contained functionality - it is useful to test how Inkscape functions on a higher level. I will include both "rendering" tests and "verb" tests. Rendering tests simply let Inkscape render an input SVG to a bitmap and compare the result to a reference image (for an example, see SVG Test Suite Compliance). Verb tests attempt to test all sorts of UI operations, like path intersections.

Currently the results are evaluated completely manually. I will create a simple tool that lets Inkscape render a collection of files and compares the resulting images to previous result images that have been judged by humans; initially this is just the reference image (which is obviously acceptable). If it finds any images for which the result differs from all judged images it will report these, and the user can then judge for themselves whether they are acceptable. (Obviously, failures to render an image at all are always reported.)

The procedure described above is used because the result images will hardly ever resemble the reference image exactly. This way human judgment is only needed when something actually changes. This scheme *might* be improved by allowing approximate matches based on mean squared error or pdiff (Cairo uses something very similar in its test suite), or even by disregarding parts of the result images (specifically certain labels in tests where the labels themselves are not important).
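To make the idea of an approximate match concrete, here is a small sketch of a mean squared error comparison. It is an illustration only: the function names and the threshold value are hypothetical, and the images are assumed to have been decoded already into flat sequences of channel values (0 to 255) of equal size.

```python
def mean_squared_error(pixels_a, pixels_b):
    """Mean squared difference between two equally sized pixel sequences."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same dimensions")
    if not pixels_a:
        return 0.0
    total = sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b))
    return total / len(pixels_a)


def approximately_equal(pixels_a, pixels_b, threshold=1.0):
    """Treat two images as matching when their MSE stays below a threshold.

    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    return mean_squared_error(pixels_a, pixels_b) <= threshold
```

A perceptual metric such as pdiff weighs differences by how visible they are, which plain MSE does not do, so a suitable threshold would have to be found experimentally.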

After rendering all the test images the test tool would output a list of all images along with the result of the test. This could include a comment field showing to which reference image the output image was matched, or, if an error occurred, what went wrong. This *might* be complemented by further information on how well the test passed or how badly it failed (for example, if MSE is used it might also report the actual MSE). For example:

Image	Result	Comment
gradient-test.svg	Pass	Matched to gradient-test-good.png
transform-test.svg	Fail	Matched to transform-test-bad.png
font-test.svg	New	No reference file(s) yet.
font-test.svg	Error	font-test.svg not found
animation-test.svg	Error	Inkscape crashed.
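The rows of the table above could be produced by a judging step along the following lines. This is a sketch under assumptions, not the actual test tool: the function `judge`, the way judged references are passed in (a mapping from reference file name to a verdict and the file contents) and the wording of the comments are all made up for illustration, and only exact binary comparison is shown.

```python
def judge(output, references):
    """Classify one rendered image against previously judged images.

    output:     bytes of the rendered image, or None if rendering failed.
    references: dict mapping a reference file name to a tuple
                (verdict, data), where verdict is "Pass" or "Fail".
    Returns a (result, comment) pair.
    """
    if output is None:
        return ("Error", "Rendering failed.")
    if not references:
        return ("New", "No reference file(s) yet.")
    for name, (verdict, data) in references.items():
        if data == output:
            # An exact match inherits the earlier human judgment.
            return (verdict, "Matched to " + name)
    # Differs from everything judged so far: a human has to look at it.
    return ("New", "Differs from all judged references.")
```

The point of this design is that a human verdict is recorded once per distinct output image, so later runs only need attention when the output changes.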

As an example of how this might work I made a small set of test files and a test program that simply runs Inkscape on the different test files and uses first FC (Windows file compare) and then Pdiff (if necessary) to compare the output files with a set of reference files. A number of possible scenarios are tested (including a crash). Currently Win32-only (mostly because of my use of file compare), the test program (+test files) can be downloaded (source+pre-compiled binaries) from [1].

Verb tests will initially work much the same as rendering tests, except that each test consists not of a single SVG file, but rather of multiple files. Each test could be accompanied by information on what verbs should be executed and what the (initial) Inkscape configuration file should be (this can affect the behaviour of some verbs).

For judging the result files the same procedure as with rendering tests will be followed, except that in this case it might be useful to allow for more than one result file. It would also be useful to allow result files other than bitmaps. Initially, however, the judging functionality will be pretty limited (that is, not much beyond simple file comparisons), as I feel it would exceed the scope of this project to create sophisticated comparison utilities. Where possible and desirable I will of course leverage existing utilities.

Test system

My intention is to run all the tests periodically on my own hardware for the duration of the project (for all three platforms), and I will document how I accomplished this, so that others can take over this task. SourceForge does offer some ways of being notified whenever a commit has been processed, but they are not ideal for this situation, so it might be simpler (and just as useful) to poll the repository periodically (once or twice a day, for example).
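Polling could be as simple as asking Subversion for the last changed revision and comparing it with the revision that was tested most recently. The sketch below only shows the decision step and assumes the XML that `svn info --xml` prints; the function names are hypothetical, and where the last tested revision is stored is left open.

```python
import xml.etree.ElementTree as ET


def last_changed_revision(svn_info_xml):
    """Extract the last changed revision from `svn info --xml` output."""
    entry = ET.fromstring(svn_info_xml).find("entry")
    return int(entry.find("commit").get("revision"))


def needs_test_run(svn_info_xml, last_tested_revision):
    """True when the repository has changed since the last test run."""
    return last_changed_revision(svn_info_xml) > last_tested_revision
```

A scheduled job could feed this with the output of running `svn info --xml` on the repository URL and start a build and test run only when `needs_test_run` returns true.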

For security and ease of use I will probably make the tests run on virtual machines. At the moment I'm already using this approach (using QEMU with Ubuntu 7) for Linux, and it would likely be reasonably easy to do the same for Windows. For MacOS X I will initially use real hardware, but it would be nice to use a virtual machine for this as well.

Running the tests will result in a set of output files (test results, generated output, etc.). I will make sure at least the test results will be made available on-line after each test run. This way Inkscape developers can easily see the current status of tests (as well as historical data) without having to recompile/test the code themselves (useful for tracking down when something broke for example). It would also allow them to compare the results for different platforms without having access to all platforms themselves.

I might also set up the system to send a notification (to the developers/testers mailing list, for example) when it finds that a test that previously succeeded suddenly fails.
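Deciding when to send such a notification amounts to comparing the results of two consecutive runs. A minimal sketch, assuming each run is summarised as a mapping from test name to a result string such as "Pass", "Fail" or "Error" (the function name and this data layout are illustrative assumptions):

```python
def newly_failing(previous, current):
    """Names of tests that passed in the previous run but not in this one."""
    return sorted(
        name
        for name, result in current.items()
        if result != "Pass" and previous.get(name) == "Pass"
    )
```

Tests that were already failing, or that did not exist in the previous run, are deliberately left out, so that a notification only goes out when something actually broke.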

Performance

All of the above is about correctness only: whether or not the code behaves correctly. It would, however, also be interesting to look at the performance of Inkscape. This is not the primary goal of this project, but it might be possible to take a few steps in this direction.

I regularly use gprof to profile Inkscape myself, so as a first step in this direction I might simply enable profiling when building Inkscape for testing, accumulate the profiling data from all the test runs and publish the resulting profile with the test results.

Another reasonably simple way to provide some performance data would be to look at the total execution time of each test and publish that along with the test results.
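Measuring the total execution time of a test needs little more than a wall-clock timer around the process that runs it. A minimal sketch (the function name `timed_run` is an illustrative choice, and the command to run is whatever the test tool would start anyway):

```python
import subprocess
import sys
import time


def timed_run(command):
    """Run a command and return (exit code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode, time.perf_counter() - start
```

Wall-clock time includes process start-up and is sensitive to whatever else the machine is doing, so such numbers are mainly useful for spotting large changes between runs.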

Rough prioritization

The following is an outline of the tasks, roughly ordered by their priority.

Note that I'm postponing any Mac-related activity at the moment, both because I first want to get the tests working on Linux and Windows (the latter works, the former is giving me some trouble because of technical difficulties with my QEMU/Linux installation) and because I haven't been able to build Inkscape on my Mac yet.

Make the existing (CxxTest) tests build (and run) on all three platforms (I can currently make them work for Windows and Linux).
- (done) Modify buildtool to add support for generating the necessary .cpp files using cxxtestgen. (see my patch mentioned earlier) Depending on discussion with the Inkscape community this might also require changes to the way buildtool handles dependencies (that is, if the Inkscape community doesn't want compilation of the tests to be part of the usual build process).
- (done) Add the existing CxxTest unit tests to build.xml. (again see my patch)
- (not doing) Get Inkscape to compile on my Mac (or a virtual machine).
- (not doing) Check that the unit tests compile on MacOS X.
Convert the remaining Utest tests to the CxxTest framework.
- (done) Convert helper/units to CxxTest.
- (done) Convert display/bezier-utils.
- (done) Convert util/list-container.
- (done) Convert xml/*.
- (done) Make the new CxxTests build on Linux.
- (not doing) Remove the old utest tests.
Create unit tests for libnr/nr-compose and (the mentioned parts of) 2geom/.
- (done) Convert (and possibly update) my existing tests for libnr/nr-compose to CxxTest.
- (not necessary, 2geom already has this) Adapt existing (libnr) unit tests / implement new unit tests for the parts of 2geom that correspond to parts of libnr.
Implement rendering tests based on (binary and pdiff-like) comparison with reference files. See [2] for a proof of concept that uses (Windows') file compare and perceptualdiff (the latter should also run fine on Linux and MacOS X).
- (done) Create a test program that can execute command-line applications (inkscape, file compare, perceptualdiff, ...) on all three platforms and can deal with those applications crashing (on Windows this requires using SetErrorMode to prevent Windows from presenting the user with an exception dialog). It should accept a list of test files, export them to PNG using Inkscape and compare the output files to reference images (one per test file) using binary comparison.
- (partially done, it can be used, but it's not done by default, and the current version of perceptualdiff has a problem with transparency) Make it use perceptualdiff for .png's.
- (done, at least the multiple references part) Extend it to use more than one reference file per test file and/or comparison masks to ignore certain portions of the images (or other means of making the comparisons more useful).
(not doing this, but see TestingInkscape for information on running the tests unattended) Set up test system to run periodically and upload results.
- (not doing, but the tests can be run unattended) Set up test (virtual) machines to update their working copies and run all unit and rendering tests periodically.
- (not doing) Make the test systems bundle (in a zip file, a directory or a single, XML, file) and upload (using ftp) the test results per run.
Create unit tests for:
- (done, only for curve) display/
- (done) svg/
(not doing this) Implement verb tests analogously to rendering tests.
- Let the rendering test tool accept entire Inkscape command lines as well (or create a derivative tool which reuses most of the code but accepts different test definitions).
- Add some support for comparing XML.
Identify interesting areas of the code for creating further unit tests.
- (sent mail) Discuss problem areas in the code with the Inkscape community.
- (done) Use SVN logs to identify often edited, very new and very old code.
- (done) Use gcov to determine what parts of the code are executed very often and which are executed hardly ever.
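The requirement in the rendering test task above, running command-line applications and surviving their crashes, can be sketched as follows. This is not the actual test program: the function names, the timeout and the result strings are hypothetical. The `SetErrorMode` call and its two flags are part of the Win32 API; the error mode is normally inherited by child processes, which is why setting it once in the test runner is enough.

```python
import subprocess
import sys

# Win32 error mode flags (values from the Windows API headers).
SEM_FAILCRITICALERRORS = 0x0001
SEM_NOGPFAULTERRORBOX = 0x0002


def suppress_crash_dialogs():
    """On Windows, keep crashing child processes from opening a dialog."""
    if sys.platform == "win32":
        import ctypes
        ctypes.windll.kernel32.SetErrorMode(
            SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX
        )


def run_tool(command, timeout=60):
    """Run a command-line tool and report how it ended as (result, comment)."""
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except FileNotFoundError:
        return ("Error", "%s not found" % command[0])
    except subprocess.TimeoutExpired:
        return ("Error", "Timed out.")
    if completed.returncode != 0:
        # A crash shows up as a non-zero (on POSIX: negative) exit code.
        return ("Error", "Exited with code %d" % completed.returncode)
    return ("OK", "")
```

Calling `suppress_crash_dialogs()` once at start-up and then `run_tool(...)` for every invocation keeps an unattended run from blocking on a dialog. Note that an exit code alone cannot always distinguish a crash from an ordinary failure.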

Future work

Inkscape's unit tests have been expanded and a reasonably straightforward but functional rendering test system is in place. However, there is still quite a lot that can be done. First of all, here are a few things that were planned for the GSoC test suite project but were not finished in time:

Verb tests: Could be incredibly useful, especially if Inkscape would allow more complete scripting. It shouldn't take a lot of effort to add this capability to the rendering test framework if it would be enough to compare either a single PNG or a single XML output using binary comparison. In that case it would suffice to create some extension like '.verb' for verb tests and make such a file contain a list of verbs in some form (or perhaps a complete inkscape command line) and the output file (it cannot be determined automatically).
MacOS X support: It might just work, but it also might not. Due to some technical difficulties this simply wasn't tried.
Running the tests unattended: In principle the tests can be run unattended, but I haven't set up a system to do so (lack of time and hardware that I can leave on for a few months).
Performance tests
Removing the old utest tests: Mostly a janitorial task, editing some Makefiles, removing some files and making sure nothing was broken in the process.

In addition, coverage data for the unit tests and rendering tests shows that some unit-tested files still contain (sometimes quite a lot of) code that is executed by the rendering tests but not by the unit tests. This is not desirable, as rendering tests are not meant to test very low-level functionality. The following files are affected. The list is part of the output of a 'coverage.py -w test -c ...' run, comparing coverage data of just the unit tests with that of the unit and rendering tests combined. Note that lower values are better in this case:

display\bezier-utils.cpp: 1% (2/331)
verbs.cpp: 1% (6/863)
extract-uri.cpp: 5% (2/39)
libnr\nr-compose.cpp: 6% (41/704)
libnr\nr-scale.h: 6% (1/16)
svg\svg-color.cpp: 7% (14/190)
verbs.h: 9% (2/23)
sp-style-elem.cpp: 10% (15/156)
svg\svg-length.cpp: 12% (34/275)
color-profile.cpp: 13% (62/484)
libnr\nr-point-fns.h: 19% (5/26)
display\curve.cpp: 27% (58/218)
svg\svg-length.h: 100% (6/6)
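The comparison behind such a list can be thought of as a set difference between the lines executed in the two runs. The sketch below is not the coverage.py script mentioned above, whose exact definition of the percentages is not documented here. It assumes, purely for illustration, that the denominator is the number of lines executed in the combined run and that executed lines are available as sets of line numbers.

```python
def rendering_only_share(unit_lines, combined_lines):
    """Share of executed lines that only the rendering tests reach.

    unit_lines:     set of line numbers executed by the unit tests alone.
    combined_lines: set of line numbers executed by unit and rendering tests.
    Returns (count, total, percentage); the denominator is an assumption.
    """
    only_rendering = combined_lines - unit_lines
    if not combined_lines:
        return (0, 0, 0)
    percentage = round(100 * len(only_rendering) / len(combined_lines))
    return (len(only_rendering), len(combined_lines), percentage)
```

Under this reading, a value of 100% means that the unit tests do not execute any of the lines in that file that the rendering tests reach.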

Also, looking at files with code that is executed very, very often, the following (classes of) files would be interesting to write new tests for:

color.cpp
conditions.cpp
inkscape.cpp (specifically inkscape_get_repr), update: "replaced by either Inkscape::Preferences, or Inkscape::Preferences::Observer" (thanks tweenk)
sp-image.cpp
sp-*
style.*
uri.cpp
verbs.*
display/nr-arena-shape.cpp
display/nr-filter-*
display/*
libnr/nr-compose*
libnr/nr-gradient.cpp
libnr/*
xml/*

I also looked at the least executed parts of the codebase, but that's much trickier (there is much more "noise" in the form of files that contain some trivial line of code that is executed once or twice), so I decided not to draw any conclusions from that.

Something similar holds for SVN logs. I had a look at the most and least recently updated files, but there was way too much (not very discriminating) data to make a useful statement. But it might give a nice guideline if you want to prioritize creation of a predetermined set of tests. In any case, a small tool can be found in SVN (activity.py) that can read XML formatted SVN logs and output a list of touched files, sorted by their last action date.
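For readers who want to experiment with this kind of analysis, the following sketch shows the general idea of reading an XML formatted SVN log and listing touched files by their last action date. It is not the activity.py tool from the repository; the function name is made up, and it assumes the layout that `svn log --xml -v` produces (`logentry` elements with a `date` and a list of `path` elements).

```python
import xml.etree.ElementTree as ET


def last_touched(svn_log_xml):
    """Return (path, date) pairs sorted by the date each path was last touched."""
    latest = {}
    for entry in ET.fromstring(svn_log_xml).iter("logentry"):
        date = entry.findtext("date")
        for path in entry.iter("path"):
            name = path.text
            # SVN dates are fixed-format UTC timestamps, so comparing
            # them as strings orders them chronologically.
            if name not in latest or date > latest[name]:
                latest[name] = date
    return sorted(latest.items(), key=lambda item: item[1])
```

The oldest entries come first, so the head of the list shows code that has not been touched for a long time and the tail shows the most recently edited files.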

Finally, I have focussed mainly on rendering as it is relatively easy to test and quite critical, but it would be good to look at more aspects of Inkscape in the future. Verb tests could help here, but more direct UI testing would also be good, as would testing of Inkscape's export functionality, effects, etc.

Note that I have ignored livarot (and third-party libraries, including 2geom) entirely.