Please read the project booklet, which contains the answers to many questions that students have asked over the years.

- Any project from the list can get 100% marks if done sufficiently well.
- All artefacts (reports, code, etc.) MUST be saved regularly in your SVN or private GitHub repository.
- You can expect to see your supervisor each week for a twenty-minute meeting.
- An MSci project is expected to take about 200 hours of work. That is about seven weeks of full-time work.
- You are expected to successfully implement at least one extension in your MSci project.
- Pay particular attention to the prerequisites. Do not choose a project if you are not taking the required modules.

Please read the project booklet on the project Moodle page. It contains the answers to many questions that students may ask.

- In the ballot you must select five projects in order of preference.
- Pay particular attention to the prerequisites. Do not choose a project if you are not taking the required modules.
- Any project from the list can get 100% marks if done sufficiently well.
- All artefacts (reports, code, etc.) MUST be regularly backed up (e.g., saved regularly in your SVN or private GitHub repository).
- During the first month of the project you are advised to meet your supervisor at least weekly. Later, the intervals between meetings may increase.
- Not all projects are the same. There is detailed guidance on assessment in the project handbook.

- Your project meetings will last twenty minutes, so prepare well.
- A full-unit project is expected to take about 200 hours of work. That is about seven weeks of full-time work.
- All projects can be taken at either third year or MSci level. Supervisors will adjust their expectations accordingly.

- A half-unit project is expected to take about 100 hours of work. That is about three weeks of full-time work.

This list is liable to change when new titles are offered or old titles withdrawn.

(Social) web single sign-on systems, such as Facebook Login or Google Sign-in, have found widespread adoption on the web for authenticating users. These systems are typically based on the OpenID Connect 1.0 and OAuth 2.0 standards. These protocols, however, can be used incorrectly, making their usage insecure. While many of these problems are addressed by guidance and protocol improvements, web sites might not follow this advice.

In this project, we want to (semi-)automatically scan the web for web sites that use social web single sign-on and check if they are using these authentication systems in insecure ways.
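As a concrete example of a check such a scanner could perform without any knowledge of a site's internals: OAuth 2.0 (RFC 6749, §10.12) recommends the `state` parameter as the standard CSRF defence, so an authorization request that omits it is a red flag. A minimal sketch (the URLs are hypothetical):

```python
from urllib.parse import urlparse, parse_qs

def oauth_csrf_suspect(authorization_url):
    """Flag an OAuth 2.0 authorization request that omits the `state`
    parameter (RFC 6749 recommends `state` as the CSRF defence)."""
    qs = parse_qs(urlparse(authorization_url).query)
    # Heuristic: treat the request as OAuth if it carries the two
    # mandatory authorization-request parameters.
    is_oauth = "client_id" in qs and "response_type" in qs
    return is_oauth and "state" not in qs

url = ("https://idp.example/authorize?response_type=code"
       "&client_id=abc&redirect_uri=https://rp.example/cb")
print(oauth_csrf_suspect(url))  # True — no state parameter
```

A real crawler would extract such URLs from login buttons and observed redirects; this sketch only shows the per-URL check.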

- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).
- Good programming skills.

- A report describing (social) web single sign-on systems and their security problems, explaining which of these problems can be detected without knowledge of the inner workings of the web services.

- A web crawler that detects insecure usage of (social) web single sign-on based on OpenID Connect 1.0 and OAuth 2.0.
- Complete report

- https://tools.ietf.org/html/rfc6749 (OAuth 2.0)
- https://tools.ietf.org/html/rfc6750 (Bearer Token Usage)
- https://tools.ietf.org/html/rfc6819 (OAuth 2.0 Threat Model and Security Considerations)
- https://tools.ietf.org/html/draft-ietf-oauth-security-topics-21 (Draft OAuth Security BCP)
- https://tools.ietf.org/html/draft-ietf-oauth-browser-based-apps-12 (Draft Best Practices for OAuth Browser-Based Apps)
- https://openid.net/specs/openid-connect-core-1_0.html (OpenID Connect Core 1.0)
- Ronghai Yang, Guanchen Li, Wing Cheong Lau, Kehuan Zhang, and Pili Hu. 2016. Model-based Security Testing: An Empirical Study on OAuth 2.0 Implementations. ASIA CCS '16. ACM, New York, 651–662. DOI:https://doi.org/10.1145/2897845.2897874
- Shernan E., Carter H., Tian D., Traynor P., Butler K. 2015. More Guidelines Than Rules: CSRF Vulnerabilities from Noncompliant OAuth 2.0 Implementations. In: Almgren M., Gulisano V., Maggi F. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2015. Lecture Notes in Computer Science, vol 9148. Springer, Cham. https://doi.org/10.1007/978-3-319-20550-2_13
- San-Tsai Sun and Konstantin Beznosov. 2012. The devil is in the (implementation) details: an empirical analysis of OAuth SSO systems. CCS '12. ACM, New York, 378–390. DOI:https://doi.org/10.1145/2382196.2382238
- Yuchen Zhou and David Evans. 2014. SSOScan: Automated Testing of Web Applications for Single Sign-On Vulnerabilities. USENIX Security 2014. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/zhou
- Daniel Fett, Ralf Küsters, Guido Schmitz. 2016. A Comprehensive Formal Security Analysis of OAuth 2.0. In: 23rd ACM SIGSAC Conference on Computer and Communications Security (CCS 2016). https://dl.acm.org/doi/pdf/10.1145/2976749.2978385
- Daniel Fett, Ralf Küsters, Guido Schmitz. 2017. The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines. In: IEEE 30th Computer Security Foundations Symposium (CSF 2017). https://ieeexplore.ieee.org/iel7/8048777/8049639/08049720.pdf

- Good programming skills, particularly Python and/or Node.js.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing current approaches to integrate apps into MS Teams and Slack
- A report with an overview of possible security services to integrate in the chatbot (VirusTotal, etc.)
- A proof-of-concept app for either MS Teams or Slack (following the tutorials would be enough)

- A chatbot that allows users to access the functionality of a specific API via MS Teams or Slack (the bot may include a setup phase where API keys need to be provided)
- A test suite to verify that all implemented tests work as intended
- A security analysis of the chatbot app
- A user manual for the bot
- A critical comparison with related works
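The core deliverable — a bot bridging a chat platform to a security API — can be sketched as a small command router plus a response formatter. The command name, the scan-result fields, and the stub API client below are illustrative assumptions, not a real VirusTotal response shape; the `response_type`/`text` payload follows Slack's slash-command response format:

```python
def format_verdict(target, detections, engines):
    """Render a scan result as a Slack slash-command response payload."""
    verdict = "malicious" if detections > 0 else "clean"
    return {
        "response_type": "ephemeral",  # visible only to the requesting user
        "text": f"{target}: {verdict} ({detections}/{engines} engines flagged it)",
    }

def handle_command(text, scan):
    """Dispatch 'scan <url>'-style command text; `scan` is the API client."""
    parts = text.split(maxsplit=1)
    if len(parts) != 2 or parts[0] != "scan":
        return {"response_type": "ephemeral", "text": "usage: scan <url>"}
    result = scan(parts[1])
    return format_verdict(parts[1], result["detections"], result["engines"])

# Hypothetical API stub standing in for a real service (e.g. VirusTotal).
fake_api = lambda url: {"detections": 3, "engines": 70}
print(handle_command("scan http://evil.example", fake_api)["text"])
# http://evil.example: malicious (3/70 engines flagged it)
```

Keeping the router pure (no network I/O) is what makes the required test suite straightforward: real API clients are injected as the `scan` callable.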

- https://api.slack.com
- https://docs.microsoft.com/en-us/microsoftteams/platform/
- https://link.springer.com/chapter/10.1007/978-3-319-99927-2_3
- https://s3lab-rhul.slack.com/apps/A81FQ3116-1password

- Willingness to learn and engage with simulation and statistical analysis software (NS3/SimPy and MATLAB/Octave/R).
- Enthusiasm for mathematical problems of intermediate complexity.
- A general knowledge of how randomness is used in crypto-systems, and the supply-chain involved in distributing security-related hardware to critical infrastructure/end-users.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report on hardware RNG and appropriate third-party secure modules (such as TPM), including the methods used by manufacturers to test randomness (and how end users are given the opportunity to perform some tests themselves).
- A report on the current state of the art in supply chain attacks against critical security infrastructure.
- A clear workplan describing the set of tests to be performed, including any simulation, emulation or experimental systems that will be used.

- A simulation or experimental demonstration of structured information that can pass simple statistical tests of randomness.
- Appropriate remediation measures, be they existing or novel tests of randomness, with appropriate justifications (explaining why the additional time, cost or other resources are warranted and under what circumstances).
- A final report documenting the project. This report will extend to cover the experimental/simulation design, generation of biased random sequences, and how lightweight statistical tests were selected and evaluated. A clear grasp of experimental process and hypothesis testing should be demonstrated through appropriate use of tools and techniques.
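The deliverable about structured data passing simple statistical tests can be illustrated in a few lines: a fully deterministic alternating bit sequence sails through the monobit (frequency) test, because that test only checks the balance of ones and zeros. A minimal sketch (the 1.96 threshold is the usual two-sided 5% z-cutoff):

```python
import math

def monobit_pass(bits, z_threshold=1.96):
    """Simple frequency (monobit) test: z-score of the ones count
    against the expectation for a fair coin."""
    n = len(bits)
    ones = sum(bits)
    z = (2 * ones - n) / math.sqrt(n)
    return abs(z) < z_threshold

# A fully structured, trivially predictable sequence (0101...)
structured = [i % 2 for i in range(10_000)]
print(monobit_pass(structured))  # True — balanced, so the test passes

# whereas a constant sequence fails immediately
print(monobit_pass([0] * 10_000))  # False
```

This is exactly why batteries of tests (runs, serial, entropy estimators) are needed, and why the project asks which lightweight tests catch which kinds of structure.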

- https://ieeexplore.ieee.org/abstract/document/9069949/
- https://link.springer.com/chapter/10.1007/978-3-030-04762-7_8
- https://dl.acm.org/doi/abs/10.1145/3264628
- https://dl.acm.org/doi/abs/10.1145/1268776.1268777
- https://ieeexplore.ieee.org/abstract/document/6135498/

- Proof of concept program: A prototype implementation of a board game (e.g., chess).
- Proof of concept program: A prototype program that exhibits the behaviour of concurrent execution.
- Report: A survey report on game environments.
- Report: A description of the prototype implementations.

- The system must be implemented according to modern software engineering principles.
- The environment should contain a working system, with functionalities such as error reporting.
- A graphical interface should be implemented.
- The environment should allow the users to play multiple games at the same time.
- The report should provide an overview of game-playing environments.
- The report should describe the environment including its functionalities and design.
- The report should describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report should describe the testing procedures of the environment.
- The report should describe the software engineering process involved in the design and implementation.

- Advanced GUI with other useful features such as chat room.
- Advanced implementation issues such as multiple game playing based on concurrency and sharing.
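The concurrency extension above — multiple games running at the same time — can be sketched with one thread per game session and a lock protecting shared state. The game loop here is a trivial stand-in (summing "moves"); the names are illustrative:

```python
import threading

results = {}
results_lock = threading.Lock()

def play_game(game_id, moves):
    # Stand-in for a real game loop: "play" a fixed sequence of moves.
    score = sum(moves)
    with results_lock:  # concurrent games share one results table
        results[game_id] = score

threads = [threading.Thread(target=play_game, args=(g, [1, 2, 3]))
           for g in ("g1", "g2", "g3")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results.items()))  # [('g1', 6), ('g2', 6), ('g3', 6)]
```

In a real environment each thread would own a game state machine and communicate with its players; the point of the sketch is only the session-per-thread structure and the shared-state locking.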

- A text-based (non-interactive) monochrome web page
- A colourful website including images and navigation
- GUI built with buttons etc.
- Report: about 15 pages including sketches of designs.

- Design and implement one or more advanced interfaces
- Complete report
- The programs must have an object-oriented design, using modern software engineering principles.
- The report will describe the software engineering processes involved in generating your software.
- The report will include comparisons of interfaces with a discussion of their meanings.
- The report will include a User Manual.

- http://hci.rwth-aachen.de/HCIBooks
- http://www.netmagazine.com/features/top-50-books-web-designers-and-developers

The Underhanded C Contest was a programming contest to create code that is malicious, but passes a rigorous inspection, and looks like an honest mistake even if discovered.

The goal of the project is to understand whether it is possible (and to what extent) to create an AI-based system that automatically generates malicious code that looks benign, by training the system on existing examples/rules (e.g., the Underhanded C Contest samples).
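The contest targeted C, but the flavour of an "underhanded" bug — malicious behaviour disguised as an honest mistake — is easy to show in a Python analogue (this example is illustrative, not taken from the contest):

```python
def can_delete(role):
    # Deliberate backdoor disguised as a sloppy condition: the second
    # operand of `or` is a non-empty string, which is always truthy,
    # so this returns a truthy value for EVERY role. To a reviewer it
    # looks like someone merely forgot to write `role == "superuser"`.
    return role == "admin" or "superuser"

print(bool(can_delete("guest")))  # True — access control silently bypassed
```

Recognising (or automatically generating) bugs of exactly this shape — plausible-looking typos with security consequences — is what the project is about.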

- Good programming skills, particularly C.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing techniques/methods to create malicious code that looks benign
- Techniques (e.g., AI-based) that can be used to generate malicious code autonomously
- A clear workplan describing the design of the system to be implemented and set of tests to be performed to verify the generated code

- A thorough analysis of examples of automatically-generated malicious code that pass rigorous inspection
- Describe in detail the design of the system and of the tests
- Describe and analyse the (AI-based) techniques and features that have been used during the project to automatically generate malicious code
- Critically compare the findings with related works
- Suggestions for improving the inspection of code generated by the proposed tool

- https://en.wikipedia.org/wiki/Underhanded_C_Contest
- http://underhanded-c.org/

Web development is at the core of a large number of software projects. It is therefore a fundamental skill that computer scientists are expected to understand and apply, if not master outright. The main goal of the project is to develop a full-stack web (online) application/service using state-of-the-art technologies and established best practices. Example services include one of the following:

- an online store, e.g. providing the ability to authenticate, post items for sale, choose items to buy, search, perform payments, etc.
- a local exchange trading system, e.g. a system in which members of a community can exchange goods and services without the use of a conventional monetary currency.
- a social media application.
- an online tool for drawing and animating interactive diagrams (such as logic circuits, network graphs, or state machines), whose features include the creation/editing interface and the capability of exporting or embedding the diagrams easily into other web pages.

Students may propose their own Web applications/service to implement. However, a clear scope (i.e. a set of expected, well-defined use cases) should be specified and approved by their supervisor prior to commencing the project.

The application’s design must follow an N-tier architectural pattern consisting of a web browser interface, Web and Application server layers and a Database.

The basic elements to be supported must include: the application logic and user interface, implemented using one of the existing web application technologies/platforms (such as PHP, Ruby on Rails, Django, or JavaScript-based frameworks such as Node.js, React, or Angular); a relational database schema equipped with sound constraints and designed according to sound normalization rules; and the interaction mechanisms between the application logic and the database. Prior to selecting the concrete technologies, an evaluation of the state-of-the-art options should be performed along a number of comparison criteria (e.g. overall set of capabilities, support for aspects such as concurrency, security, modularity, performance, and UX practices). The student is expected to (i) adequately establish such criteria, (ii) investigate and reflect on a number of candidate technologies, and (iii) appropriately compare and justify the concrete choices made.

Advanced features may include supporting ACID transactions, comprehensive security/privacy analysis and advanced security mechanisms (such as protection against SQL injection attacks), using NoSQL data stores (such as MongoDB) to speed up user interaction while maintaining sound levels of transactional consistency, advanced UI features (such as push notifications), using emerging platforms (such as a blockchain), etc.
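The N-tier separation and the parameterized-query defence against SQL injection mentioned above can be sketched in a few lines of framework-free Python (the "online store" schema and item data are illustrative assumptions):

```python
import sqlite3

# --- data tier: hypothetical schema for an online store ---
def init_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
    db.execute("INSERT INTO items (name, price) VALUES (?, ?)", ("mug", 4.50))
    return db

# --- application tier: parameterized queries guard against SQL injection ---
def search_items(db, term):
    rows = db.execute(
        "SELECT name, price FROM items WHERE name LIKE ?", (f"%{term}%",)
    ).fetchall()
    return [{"name": n, "price": p} for n, p in rows]

# --- presentation tier: render results to HTML for the browser ---
def render(items):
    return "".join(f"<li>{i['name']}: £{i['price']:.2f}</li>" for i in items)

db = init_db()
print(render(search_items(db, "mug")))  # <li>mug: £4.50</li>
```

A real submission would replace each tier with the chosen stack (web server, ORM or SQL layer, templating/front-end framework), but the layering and the `?` placeholders carry over directly.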

- A report describing the state-of-the-art of Web development, including technologies/frameworks/platforms and their comparison as described above. The report will also explain the concrete choices made and the respective rationale/justification from a software engineering perspective (i.e. a justification merely based on a skills learning perspective will not suffice)
- A report on Web development architectural paradigms and applicable design patterns, including an initial discussion of security, privacy and key-operational aspects and considerations (e.g. cloud deployments and DevOps aspects)
- A report describing the system to be implemented, including a comprehensive set of well-defined use cases (or user stories).
- A functional prototype implementation of the system, clearly demonstrating sound design principles and implementation practices (e.g. on DB design, on UI design and interaction).

- A fully functional software application, with an appropriate end-to-end architecture, applicable design patterns, using modern Web technologies and rigorous software engineering principles.
- Complete report, also encompassing relevant content (further explained/discussed) from the early-deliverable reports.
- The report will describe the software engineering processes involved in generating your software.
- The report will include a User Manual.

Methods of prediction with expert advice allow us to combine a pool of different prediction algorithms (called experts) in the online mode. By using them we can perform nearly as well as any expert in the pool.

There is a range of prediction with expert advice methods applicable in different situations. First, we may have different requirements for the quality of predictions: the deviation between predictions and outcomes can be measured using different loss functions. The absolute loss is very different from the square loss and is usually handled by different algorithms. Secondly, one may be interested in different scenarios of using expert advice. The aggregating algorithm performs as well as any expert in the pool. The fixed share algorithm competes with "switching experts" that can switch from one original expert to another once in a while. The guarantees that can be provided in this situation are weaker but cover a much larger pool of predictors. Finally, there are recent developments such as specialist experts that can skip a turn without giving a prediction. There are ways to handle this scenario too.

On this project you will implement prediction with expert advice algorithms and use them to merge predictions by experts, which will normally be standard prediction algorithms. One may think of prediction with expert advice as an alternative to model selection: instead of picking one algorithm, we merge a group.

The project is recommended for Computational Finance students.
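The core idea can be sketched with the exponentially weighted average forecaster, a close relative of the aggregating algorithm: each expert's weight decays exponentially in its past loss, and the merged prediction is the weighted average. For square loss on [0,1] with learning rate η ≤ 1/2, the standard regret bound guarantees total loss ≤ best expert's loss + ln(N)/η. The two experts below (a constant and a lag-1 predictor) are illustrative:

```python
import math, random

def ewa(expert_preds, outcomes, eta=0.5):
    """Exponentially weighted average forecaster for square loss on [0,1].
    Guarantees total loss <= best expert's loss + ln(N)/eta for eta <= 1/2."""
    n = len(expert_preds)
    w = [1.0] * n
    total = 0.0
    for t, y in enumerate(outcomes):
        s = sum(w)
        p = sum(wi * e[t] for wi, e in zip(w, expert_preds)) / s
        total += (p - y) ** 2
        # exponential weight update: experts pay for their own loss
        w = [wi * math.exp(-eta * (e[t] - y) ** 2)
             for wi, e in zip(w, expert_preds)]
    return total

random.seed(0)
T = 200
outcomes = [random.random() for _ in range(T)]
experts = [[0.3] * T,                      # constant expert
           outcomes[:1] + outcomes[:-1]]   # lag-1 expert
loss = ewa(experts, outcomes)
best = min(sum((e[t] - outcomes[t]) ** 2 for t in range(T)) for e in experts)
print(loss <= best + math.log(2) / 0.5)  # True, by the regret bound
```

Vovk's aggregating algorithm sharpens this via a substitution function (η = 2 for square loss), and fixed share adds weight sharing between rounds; both are natural next steps from this skeleton.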

- A simple proof-of-concept implementation of the aggregating algorithm and fixed share algorithm.
- Report: An overview of the problem of prediction with expert advice and the work of the aggregating algorithm.
- Report: The performance of the aggregating algorithm on a small artificial dataset.

- The program will work with a real dataset, read the data from a file, preprocess it, and use a standard algorithm to generate experts' predictions (if necessary).
- The program will implement the aggregating algorithm, and possibly also fixed share and the aggregating algorithm for sleeping experts. The results will be visualised.
- The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report will describe computational experiments, analyse their results and draw conclusions.

- Application of sleeping experts for predicting the implied volatility of options.
- Experiments with the weak aggregating algorithm.

- Y. Kalnishkan. The AA and Laissez-Faire Investment. http://onlineprediction.net/?n=Main.TheAAAndLaissez-FaireInvestment
- Y. Kalnishkan. Prediction with expert advice for a finite number of experts: A practical introduction. Pattern Recognition, 126 (2022).
- V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2), 213–248 (2001).
- M. Herbster and M. Warmuth. Tracking the Best Expert. Machine Learning, 32, 151–178 (1998).
- Y. Kalnishkan and M. V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8), 1228–1244 (2008).

- Computation of Nash equilibria (two-player games, polymatrix games, or net coordination games).
- Dynamics, like best response and fictitious play, in games (net coordination games, networks of complements, Schelling games).
- Fair division of indivisible items.
- Fair division of divisible goods (Cake cutting algorithms, Rent division).
- Stable Matching Algorithms.
- Matching markets.
- Simulation and calculation of results in various voting-like systems.
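As a taste of the stable matching topic above, the classic Gale–Shapley deferred-acceptance algorithm fits in a few lines (the two-person instance is illustrative):

```python
def gale_shapley(men_prefs, women_prefs):
    """Deferred acceptance: men propose down their preference lists,
    women tentatively accept their best proposer so far."""
    free = list(men_prefs)            # men not yet matched
    next_choice = {m: 0 for m in men_prefs}
    engaged = {}                      # woman -> man
    rank = {w: {m: i for i, m in enumerate(p)}
            for w, p in women_prefs.items()}
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])   # she trades up; her ex is free again
            engaged[w] = m
        else:
            free.append(m)            # rejected; he will propose further down
    return {m: w for w, m in engaged.items()}

men = {"a": ["x", "y"], "b": ["y", "x"]}
women = {"x": ["a", "b"], "y": ["b", "a"]}
print(sorted(gale_shapley(men, women).items()))  # [('a', 'x'), ('b', 'y')]
```

The resulting matching is stable (no man and woman both prefer each other to their assigned partners), which is exactly the kind of property the proof-of-concept solution checker deliverable should verify.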

- Report: A general report on algorithms for economics and social choice, finishing with a determination of the project topic or topics.
- Report: A full formal description of the problem and the solution concept.
- Report: Known algorithms on the topic and how they work.
- Proof of concept: A program that generates a (small) instance of the problem(s) you are studying and checks whether a given solution is indeed correct.
- Proof of concept: Simple greedy heuristics for the topic.
- Implementation of a known algorithm on the topic that works on some toy-examples of the problem.

- A program developed with proper software engineering techniques.
- An easy-to-use GUI for the chosen topic that allows interesting experiments (the GUI and its functionalities depend on the topic).
- The program should work on at least three special cases of the problem.
- Implementation and comparison of different heuristics and standard algorithms.
- The final report should contain a section on the outcomes of the experiments done and conclusions drawn about which algorithms seem suitable in different situations.

- A more careful theory report, including notions of problem hardness, or running time for the algorithms.
- More involved algorithms for the problem.
- In depth analysis of the algorithms for specific classes of instances.
- Analysis options for your problem, including an implementation in some discrete event simulation framework.
- An interactive Web page allowing people to explore and understand your topic using examples and simulation.

- The table of contents of the Handbook of Computational Social Choice is very useful to see what kinds of project you might want to do
- This course on Algorithmic Game Theory is worth having a skim over. At least the list of titles.
- There is also a good book on Algorithmic Game Theory
- Nash equilibrium: The Nash equilibrium Wikipedia page is nice as an introduction.
- Stable Marriage: The stable marriage Wikipedia page is also nice as an introduction.
- Some fun YouTube videos about matching - worth watching even if you do not take this project: video one, or video two.
- A more technical introduction to Networks of Complements.
- A more technical introduction to Schelling games.
- A more technical introduction to Coordination Games on Graphs.
- A more technical introduction to Fair Allocation of Indivisible Goods.
- A (slightly) more technical introduction to Cake cutting algorithms.
- A more technical introduction to Matching Markets

- Computation of Nash equilibria (two-player games, polymatrix games, or net coordination games).
- Dynamics, like best response and fictitious play, in games (net coordination games, networks of complements, Schelling games).
- Fair division of indivisible items.
- Fair division of divisible goods (Cake cutting algorithms, Rent division).
- Stable Matching Algorithms.
- Matching markets.
- Simulation and calculation of results in many voting like systems

- Report: A general report on algorithms for economics social choice, finishing with a determination of the project topic or topics.
- Report: A full formal description of the problem and the solution concept.
- Report: Known algorithms on the topic. How they work?
- Proof of concept: A program that generates a (small) instance of the problem(s) you are studying and checks whether a given solution is indeed correct.
- Proof of concept: Simple greedy heuristics for the topic.
- Implementation of a known algorithm on the topic that works on some toy-examples of the problem.

- A program developped with proper software engineering techniques
- A nice and easy to use GUI for the chosen topic that allows interesting experiments (the GUI and its functionalities depend on the topic).
- The program should work on at least three special cases of the problem.
- Implementation and comparison of different heuristics and standard algorithms.
- Final report should contain a section on the outcomes of the experiments done and conclusions drawn about which algorithms seem suitable in different situations.

- A more careful theory report, including notions of problem hardness, or running time for the algorithms.
- More involved algorithms for the problem.
- In depth analysis of the algorithms for specific classes of instances.
- Nice analysis options for your problem including an implementation in some Discrete Event Simulation framework
- An interactive Web page allowing people to explore and understand your topic using examples and simulation.

- The table of contents of the Handbook of Computational Social Choice is very useful to see what kinds of project you might want to do
- This course on Algorithmic Game Theory is worth having a skim over. At least the list of titles.
- There is also a good book on Algorithmic Game Theory
- Nash equilibrium: The Nash Equilibrium Wiki page is nice as an introduction.
- Stable Marriage: The Stable Marriage Wiki page is also nice as an introduction.
- Some fun Youtube videos about matching - worth watching even if you do not take this project: video one, or video two.
- A more technical introduction to Networks of Complements.
- A more technical introduction to Schelling games.
- A more technical introduction to Coordination Games on Graphs.
- A more technical introduction to Fair Allocation of Indivisible Goods.
- A (slightly) more technical introduction to Cake cutting algorithms.
- A more technical introduction to Matching Markets.

The full technical name of the problem this project description is concerned with is uncapacitated facility location (with metric costs). Much of the below can also be done for OR problems other than facility location, but facility location is a problem for which both good benchmarks and interesting algorithms have been developed.

In this project, you will try out some powerful tools for solving such problems (in particular, you will learn to use an LP-solver), and you will implement and compare some of the algorithms which have been proposed.

- Report: A full description of the facility location problem
- Report: P vs NP, NP-completeness, NP-hardness
- Report: Strategies for solving NP-complete problems (e.g., heuristics, approximations, algorithms for special cases and non-worst-case inputs).
- Report: Linear programming, LP-relaxations and LP Software packages
- Report: Worker threads in a GUI.
- Proof of concept program: A simple greedy algorithm
- Proof of concept program: A program which solves the LP-relaxation of the problem (by using an LP-solver software package)
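To give a flavour of the LP step, the following Python sketch (assuming SciPy is available; a dedicated LP package would work equally well) solves the LP-relaxation of a tiny uncapacitated facility location instance with `scipy.optimize.linprog`. The variable layout and instance are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def solve_ufl_lp(f, c):
    """LP relaxation of uncapacitated facility location.
    f[i]   : opening cost of facility i
    c[i][j]: cost of serving client j from facility i
    Variables: y_0..y_{m-1} (openings), then x_ij in row-major order."""
    f, c = np.asarray(f, float), np.asarray(c, float)
    m, n = c.shape
    obj = np.concatenate([f, c.ravel()])
    # each client is served exactly once: sum_i x_ij = 1
    A_eq = np.zeros((n, m + m * n)); b_eq = np.ones(n)
    for j in range(n):
        for i in range(m):
            A_eq[j, m + i * n + j] = 1.0
    # a client can only use an open facility: x_ij <= y_i
    A_ub = np.zeros((m * n, m + m * n)); b_ub = np.zeros(m * n)
    for i in range(m):
        for j in range(n):
            row = i * n + j
            A_ub[row, i] = -1.0
            A_ub[row, m + row] = 1.0
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * (m + m * n))
    return res.fun
```

The LP optimum is a lower bound on the integer optimum; rounding the LP solution is exactly where the approximation algorithms from the literature come in.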

- The program should be able to read a problem description from an input file, in one of the formats used in popular benchmark libraries
- The program should have a GUI, which displays a problem instance and can be used to run several different algorithms to solve the instance
- The program should support several different algorithms to solve a problem, including at least one heuristic (e.g., a greedy algorithm) and at least one algorithm using the LP-solution as guidance. (It is recommended that one of the algorithms comes from the literature on approximation algorithms; see Extensions.)
- The program should also provide the option of showing the cost of the LP solution, and should ideally provide the option of attempting to produce an exact solution via integer programming (so-called MIP).
- Since some of the computations can be lengthy (e.g., the MIP solver), the program must allow the user a way to abort an algorithm. Therefore the program must be threaded, so that the user can interact with the GUI despite some algorithm being busy in the background.

- A more careful treatment of approximation: A report on "Approximation algorithms and approximation ratios", and an implementation of various types of approximation algorithm
- More algorithms and solvers, including meta-heuristics such as local search, simulated annealing, etc.
- Reports on problem variants and extensions (examples: capacitated, multi-level, or the k-center/k-median problem), perhaps including motivations and theoretical properties (NP-hardness, approximation properties)
- Careful experiments with running times and result quality for different algorithms and problem instances

- Report on connected components in undirected graphs and strongly connected components in directed graphs
- Report on distances in directed graphs.
- Some pseudo-code for important algorithms
- Implementation and testing of some important algorithms.
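As an example of the kind of algorithm involved, breadth-first search computes single-source distances (in edge counts) in a directed graph. A minimal Python sketch, with an adjacency-list representation chosen for illustration:

```python
from collections import deque

def bfs_distances(adj, source):
    """Single-source shortest path lengths (edge counts) in a directed graph.
    adj maps each vertex to a list of its successors."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:           # first visit = shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

Vertices absent from the returned dictionary are unreachable from the source, which already gives a simple reachability test.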

- Reports for a wider selection of graph algorithms
- Extensive, accurately commented pseudo-code
- Full implementation and testing of algorithms.

- A nice, user-friendly GUI for the algorithms.

- New biological data types
- New clustering algorithms
- New techniques for the visualisation/assessment of the clusters

- Proof of concept programs: parsers for biological databases,
- Proof of concept programs: implementation of one clustering algorithm,
- Proof of concept programs: design of an interface.
- Reports: Biological databases.
- Reports: Clustering algorithms.
- Reports: Design Patterns in Java (mainly plugin architecture).

- The software should be able to work with at least two kinds of biological data (microarray data and sequence data) and two clustering algorithms (for instance, k-means and hierarchical clustering).
- At least one visualisation/assessment technique should be added (for example, over-representation analysis).
- The report should include a biological overview of the problem,
- The report should include a detailed description of the implementation of the plugin-architecture,
- The report should include a comprehensive documentation on how to extend the program.
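As a point of reference, the core of k-means (Lloyd's algorithm) is only a few lines. Below is a deliberately simple NumPy sketch with deterministic initialisation from the first k points; a real implementation would use better seeding (e.g. k-means++) and a convergence test.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Lloyd's algorithm, initialised (naively) from the first k points."""
    points = np.asarray(points, float)
    centroids = points[:k].copy()
    for _ in range(iters):
        # assign each point to its nearest centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its cluster
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids
```

In the plugin architecture, each clustering algorithm would sit behind a common interface taking a data matrix and returning cluster labels.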

- Adding the ability to process more biological data (e.g., PPI data, CSV files, etc.)
- Adding more clustering algorithms, possibly using the Façade pattern with existing implementations (SCSP, Cluster ONE, ...).
- Developing cluster visualisation techniques: MDS visualisation, dendrograms, etc.

- Introduction to Horn clauses and their analysis in Z3.
- Exposition of the encoding of pushdown systems as Horn-clauses (based on a summary algorithm).
- Implementation of pushdown analysis by reduction to Horn-clauses.
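Before encoding anything in Z3, it helps to recall how propositional Horn clauses are solved: derivability can be computed by forward chaining to a least fixed point. A small illustrative Python sketch (the project itself would use Z3's Horn/fixedpoint engine rather than this toy):

```python
def derive(facts, rules):
    """Forward-chain propositional Horn clauses to a least fixed point.
    rules: list of (body, head) pairs, where body is a set of atoms
    and the rule means: if all atoms in body hold, then head holds."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and body <= known:
                known.add(head)
                changed = True
    return known
```

Z3's engine generalises this picture to Horn clauses over first-order theories, which is what makes the pushdown-system encoding possible.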

- Encoding of the CSS redundancy problem for the simplified tree language as Horn-clauses.
- Implementation of the above encoding.
- Benchmarking of implementation on examples derived from jQuery code.
- Automatic, implemented translation from some fragment of HTML and jQuery to the CSS redundancy problem for the simplified tree language.

- Ability to research a new topic, create/use computer systems, and design and implement software.
- Good programming skills in languages such as Python or Java, and some knowledge of Windows/Linux internals.
- Knowledge/interest in data analysis, databases.
- Knowledge/interest in machine learning [optional].

- A report describing client honeypots and their use.
- [Optional] Set up a client honeypot system and gather data.
- A report on client honeypot data: gathering, processing, and enhancing it (e.g., gathering other data from VirusTotal etc).
- Design of a visualisation, monitoring and data analysis system.

- Design and develop proof of concept for some or all of a visualisation, monitoring and data analysis system. An example might be a program for extracting features from honeypot log file traces and clustering the resulting data.
- Develop an experiment to gather data and answer a research question e.g. to determine whether phishing sites use drive-by-downloads, or to determine whether a sample of current twitter URLs are malicious, etc.
- Data analysis and report, e.g. create open source intelligence or show clustering of data and justification of the machine learning methods used and example outputs.
- Final report including critical comparison, evaluation and visualisation of results showing the capabilities of the tool using different selected features and techniques.

- [1] Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
- [2] Machine learning: a probabilistic perspective, by K. Murphy, MIT Press 2012
- [3] Virtual Honeypots: From Botnet Tracking to Intrusion Detection, by Niels Provos and Thorsten Holz, Wiley, 2007
- [4] Seifert, Christian, Ramon Steenson, Ian Welch, Peter Komisarczuk, and Barbara Endicott-Popovsky. "Capture–A behavioral analysis tool for applications and documents." digital investigation 4 (2007): 23-30. http://dfrws.org/sites/default/files/session-files/paper-capture_-_a_tool_for_behavioral_analysis_of_applications_and_documents.pdf
- [5] Dell'Aera, Angelo. "Thug: a new low-interaction honeyclient." (2012). http://www.tlc.unipr.it/veltri/sicurezza-2011-2012/Sec-4-Threats-03_Honeyclient.pdf
- [6] https://github.com/buffer/thug

- Study conformal prediction theory and the meaning of its validity and efficiency properties.
- Pick artificial or natural data examples with various levels of difference between the data segments.
- Implement conformal prediction with a few versions of conformity score functions and compare their performance on these data examples.
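To make the setting concrete, here is a minimal Python sketch of split-conformal prediction for regression, using absolute residuals as the conformity score (one of many possible scores; the function names and toy numbers are illustrative only):

```python
import math

def conformal_interval(cal_y, cal_pred, new_pred, alpha=0.1):
    """Split-conformal prediction interval for a regression model,
    using absolute residuals on a calibration set as conformity scores."""
    scores = sorted(abs(y - p) for y, p in zip(cal_y, cal_pred))
    n = len(scores)
    # the ceil((n+1)(1-alpha))-th smallest score gives 1-alpha coverage
    k = min(n, math.ceil((n + 1) * (1 - alpha))) - 1
    q = scores[k]
    return new_pred - q, new_pred + q
```

The validity guarantee (coverage at least 1-alpha under exchangeability) holds for any score; efficiency (interval width) is what distinguishes good score functions, which is the focus of this project.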

- Survey of conformity scores for CP framework based on standard machine learning algorithms.
- For one of the algorithms, develop and compare different ways to define conformity score function.
- Compare their validity and efficiency on the data examples including the transfer learning challenge.

- Try the same for other underlying algorithms, and formulate general principles behind safe conformity score functions

- Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
- A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).

- Study conformal prediction theory and the meaning of its validity and efficiency properties.
- Pick artificial or natural data examples with various levels of difference between the data segments.
- Implement conformal prediction with a few versions of conformity score functions and compare their performance on these data examples.

- Survey of conformity scores for CP framework based on standard machine learning algorithms.
- For one of the algorithms, develop and compare different ways to define conformity score function.
- Compare their validity and efficiency on the data examples including the transfer learning challenge.

- Try the same for other underlying algorithms, and formulate general principles behind safe conformity score functions

- Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
- A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).

Logic is a cornerstone of science and of human thinking at large. As such, it has been an active subject of research for more than two thousand years, since the time of Aristotle. The computing revolution of recent decades has brought logic into the realm of real-world applications. One such development is Description Logic, which underlies the modern modelling language OWL (the Web Ontology Language).

OWL ontologies are basically descriptions of certain enterprises, reminiscent of a database model of an enterprise; the advantage of ontologies over databases, though, is the ability to model not only basic individual information but also semantic connections between entities in the enterprise (e.g., that a certain class of entities is a sub-class of another class of entities). This gives the user the ability to infer *new* logical conclusions from the data (and not only to extract already existing information).

Ontologies are one of the technologies of the Semantic Web and are currently used extensively in domains such as medicine, biology, and in the general area of digital humanities including archives.

The project will offer students the opportunity to learn both the rich and fundamental theory behind ontologies, namely Description Logics (which are weaker versions of first-order logic), and the use of ontology frameworks (for example, Protege) on real data sets.

A successful project will (1) investigate and understand the basic concepts of the logical theory of Description Logic; (2) define a well-developed and sufficiently rich OWL ontology for a particular chosen domain using the Protege tool; and (3) show the applicability of the ontology for storing information and inferring new information from existing data.
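To illustrate the kind of inference involved, even plain transitive subclass reasoning already derives facts that are stated nowhere explicitly. A toy Python sketch with an invented example TBox (an OWL reasoner such as those bundled with Protege does far more than this):

```python
def subsumers(subclass_of, cls):
    """All classes that cls is (transitively) a subclass of --
    a tiny slice of the inference an OWL reasoner performs over a TBox."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        for parent in subclass_of.get(c, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

In a real Description Logic, subsumption also follows from complex class expressions (intersections, existential restrictions, etc.), which is where the interesting algorithmics and complexity questions arise.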

- Project plan with tasks and milestones.
- An overview of Description Logic that supports ontological reasoning.
- Gathering of requirements from a real user, or justifying such requirements from the perspective of a potential user.
- Initial experiments with Protege, and building a preliminary model for the chosen domain.

- An ontology of the knowledge domain associated with the data set.
- A discussion of Description Logic with explicit examples in the ontology constructed.
- Investigate inferencing algorithms in Description Logic and their run-time complexity.
- At least a partial encoding of the data set in the ontology.
- The report will include examples of usage of the system.
- The report will discuss the perceived advantages and disadvantages of using ontology frameworks for representing and querying the chosen data set

- Connect the ontology to other existing ontologies in related areas.
- User-friendly interfaces for data representation and querying.

- Semantic Web for the Working Ontologist. Dean Allemang and Jim Hendler.
- A Description Logic Primer. Markus Krötzsch, František Šimančík, Ian Horrocks.
- http://protege.stanford.edu

Broadly speaking, planning is the art of thinking before acting. Automated planning is a crucial and well-established field of artificial intelligence and is concerned with the automatic synthesis of a course of action to achieve a set of goals based on a given description of the world. A planner is an instance of a general problem solver: it accepts an abstract description of a problem and returns a solution for it. No knowledge of the domain is provided to the planner, which must be able to adapt to any description of the problem. This description consists of a model of how the world functions, the initial state of the environment and a set of objectives to achieve. The solution of a planning problem, a plan, is a sequence of actions that accomplishes the goals when executed. The choice of different ways to describe these fundamental elements (world model, initial state and objectives) has determined the development of a number of distinct planning paradigms. They range from those addressing fully observable, deterministic, static and discrete environments to those dealing with partially observable or stochastic environments. In all cases, the derivation of a course of action from a model is computationally intractable. Hence, the central challenge in planning is to achieve both generality and scalability.

Many problems can be formalised as planning tasks. Prototypical examples are single-player games ("puzzles") such as the Rubik's Cube or the Freecell card game. Real-world applications of automated planning are the generation of plans for robotic and virtual agents, logistic problems, airport ground traffic control, testing protocols for logical flaws and printer control.

Within the area of AI planning, a wide variety of projects are possible, ranging from highly theoretical to practical and implementation-oriented. Projects can focus either on modelling a problem as a planning task (e.g. search and rescue, urban traffic control, internet of things) and then solving it with existing state-of-the-art planners or on devising new algorithms for efficiently searching for plans (e.g. generating new heuristics for speeding up the search for plans).
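For intuition, a planner over a STRIPS-style state model can be prototyped as plain forward search. The following Python sketch is a toy illustration (real planners use heuristic search and a modelling language such as PDDL rather than hand-coded dictionaries); it finds a plan by breadth-first search over sets of facts:

```python
from collections import deque

def plan(init, goal, actions):
    """Breadth-first forward search over STRIPS-style states.
    actions: name -> (preconditions, add effects, delete effects),
    each given as a frozenset of ground facts."""
    start, goal = frozenset(init), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps                    # goal facts all hold
        for name, (pre, add, delete) in actions.items():
            if pre <= state:                # action applicable
                nxt = (state - delete) | add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None                             # no plan exists
```

Blind search like this scales very poorly; the gap between it and heuristic planners is precisely what algorithm-oriented projects investigate.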

- Familiarisation with the modelling languages for automated planning. Show examples of models that describe specific applications and validate them.
- Familiarisation with state-of-the-art planning systems. Show manually worked examples of their operation.
- For modelling projects, identification of a specific scenario/application to model as a planning task. In depth description of why planning is well suited to model such a scenario/application. Early prototypes of the models and their validation. For algorithm-oriented projects, identification of a specific planning algorithm to improve. In depth description of how the algorithm can be improved. Early prototypes of improved algorithm with relating experiments.

- For modelling projects, planning model for the chosen application as well as in depth report that describes and motivates the model. Experimental evidence that the problem under consideration can be solved efficiently via planning. Code developed in order to adapt existing planning algorithms to solve the problem under consideration efficiently.
- For algorithm-oriented projects, code of the algorithms implemented, experimental evidence that the algorithms work efficiently and in depth report on the theoretical approach behind the algorithms implemented.

- Artificial Intelligence: A Modern Approach (Third edition) by Stuart Russell and Peter Norvig. Chapters 10 and 11.
- A Concise Introduction to Models and Methods for Automated Planning by H. Geffner and B. Bonet. Morgan and Claypool, 2013
- Automated Planning - Theory and Practice by Malik Ghallab, Dana S. Nau, Paolo Traverso. Elsevier, 2004

- Basics of binary analysis, knowledge of Python/C, basic/advanced notions of machine learning.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing techniques/methods to extract features from binaries
- A preliminary tool to extract features from binaries (e.g., control-flow graphs)
- Creation of a reasonably large dataset composed of malware binaries coming from different authors or groups (e.g., APT)
- A preliminary set of tests with machine learning classifiers (using scikit-learn) to classify a subset of the dataset based on the extracted features
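As a first cut at feature extraction, even a normalised byte histogram can support a crude nearest-centroid attribution. The Python sketch below is illustrative only; real work would use richer features (e.g. control-flow graphs, opcode n-grams) and proper classifiers from scikit-learn:

```python
from collections import Counter

def byte_histogram(data):
    """Normalised byte-frequency feature vector (256 dims) for a binary blob."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def nearest_author(features, centroids):
    """Attribute a sample to the author whose feature centroid is closest (L1)."""
    def l1(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))
    return min(centroids, key=lambda author: l1(features, centroids[author]))
```

In the real pipeline, each author's centroid would be the mean feature vector of their known samples in the dataset.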

- A thorough analysis of techniques and methods to extract features from binaries
- Describe and analyse the techniques and features that have been used during the project to extract features from binaries and attribute them
- Describe in detail the design of the attribution system
- Test the system on the entire dataset and discuss the results
- Suggestions for improving the proposed tool

- https://www.sciencedirect.com/science/article/pii/S1742287614000176
- https://arxiv.org/abs/1512.08546
- https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/caliskan-islam
- https://dl.acm.org/doi/abs/10.1145/3292577
- https://link.springer.com/chapter/10.1007/978-3-642-23822-2_10

- Good programming skills, particularly Java, Python (for NLP), and knowledge of Android.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing previous literature in privacy policy analysis
- A report describing libraries for natural language processing
- A report describing tools for static analysis of Android apps
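A first proof of concept need not involve heavy NLP machinery; a simple keyword pass over a policy can already flag candidate statements. An illustrative Python sketch (the verb and data-type lists are invented examples; a real tool would use NLTK or spaCy for sentence splitting and parsing):

```python
import re

# illustrative vocabularies -- a real tool would use curated lists
COLLECTION_VERBS = r"(collect|share|store|process|transmit)"
DATA_TYPES = r"(location|contacts|email|device identifiers?|browsing history)"

def collection_statements(policy_text):
    """Return sentences that pair a collection verb with a data type."""
    sentences = re.split(r"(?<=[.!?])\s+", policy_text)
    pattern = re.compile(COLLECTION_VERBS + r".*" + DATA_TYPES, re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]
```

The interesting research question is then comparing what the policy *says* against what static analysis of the app *shows*.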

- A tool that uses NLP to perform analysis of Android app privacy policies
- A detailed description of the tests and evaluation performed on the tool
- A detailed account of the results obtained and an analysis of them
- A critical comparison with findings from related works

- https://www.nltk.org
- https://spacy.io
- https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/wang-ruowen
- https://www.usenix.org/conference/usenixsecurity18/presentation/harkous
- https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/zimmeck
- https://frida.re
- https://github.com/skylot/jadx

Symbolic execution makes it possible to test programs and find potentially exploitable bugs fully automatically. The key idea is that a symbolic execution engine runs a program on symbolic inputs that represent all possible input values, instead of testing just individual concrete values. "If" statements over symbolic inputs can then evaluate both ways, and the symbolic execution engine considers both branches in parallel. Not all sequences of branches can occur in reality because their conditions may contradict each other. The engine rules out such cases by querying an automated constraint solver.
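The idea can be illustrated without a real constraint solver. The toy Python sketch below records the branch outcomes (the "path condition") of a tiny invented program and brute-forces a witness input for each feasible path, playing the role that an SMT solver plays in a real engine such as KLEE:

```python
from itertools import product

def explore(program, names, domain=range(-8, 9)):
    """Toy 'symbolic execution': enumerate the paths of `program` and,
    like a constraint solver, find one input reaching each feasible path."""
    witnesses = {}
    for values in product(domain, repeat=len(names)):
        env = dict(zip(names, values))
        path = tuple(program(env))
        witnesses.setdefault(path, env)   # first input reaching this path
    return witnesses

def target(env):
    """if (x > 2) { if (y == x) bug(); } -- recorded as branch outcomes."""
    path = []
    path.append(env['x'] > 2)
    if path[-1]:
        path.append(env['y'] == env['x'])
    return path
```

A real engine does this symbolically rather than by enumeration, which is why it scales to inputs far beyond a brute-forceable range.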

There are several symbolic execution engines available for low-level languages like C, but there is still little support for interpreted languages like Ruby, Python, Perl, Bash, PHP, or JavaScript, to name a few.

As shown recently in the "Chef" system (http://dslab.epfl.ch/proj/chef/), symbolic execution engines for interpreted languages can be derived by symbolically executing the language interpreter running the target program. Chef runs the interpreter on top of a low-level engine running an entire virtual machine stack and has been successfully used to develop engines for Python and Lua.

The goal of this project will be to use a symbolic execution engine such as Chef, KLEE, or CREST to develop *another* symbolic execution engine for a new interpreted language, and to run a series of experiments to evaluate its efficiency and effectiveness. In the first phase, you will experiment with at least two engines and decide which one to build on for the remainder of the project.

- Experiments: Set up of an existing symbolic execution engine and experiments with example code
- Code: Test infrastructure for symbolically executing the target interpreter
- Report: Mechanics of symbolic execution and state of the art
- Report: Comparison of two symbolic engines and the results from running them on the interpreter
- Report: Overview of the implementation of the target interpreter

- Code: A mature testing system and performance optimisations for symbolically executing an interpreted language
- Report: An evaluation of its efficiency and effectiveness on a small set of benchmarks
- Report: Process for developing the new engine, including interpreter modifications
- Report: Testing infrastructure and libraries for the target language
- Report: Responsible disclosure of vulnerabilities
- Development and evaluation of high-level assertions of program correctness that make it possible to find concrete vulnerabilities and classes of bugs relevant to the target programs under test and/or the entire target interpreted language.

- Novel heuristics for improving performance and test coverage.
- Detailed evaluation of the effects of different combinations of optimisations.

- Bucur et al.: Prototyping Symbolic Execution Engines for Interpreted Languages. Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Salt Lake City, UT, March 2014
- Cadar and Sen: Symbolic execution for software testing: three decades later. Communications of the ACM, 56:2, February 2013

CS2821 Systems Programming

CS3470 Compilers and Code Generation or equivalent prior knowledge.

Good programming skills and ability to quickly understand a new language.

Knowledge of C and willingness to learn basic C++.

- Experience with C++
- Being able to autonomously download/compile/install libraries from various repositories (e.g., GitHub).

- A report describing possible applications of homomorphic encryption
- A report describing one homomorphic encryption scheme
- A performance comparison of two homomorphic encryption libraries implementing that scheme for a simple homomorphic evaluation, e.g. weighted average on a small dataset
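The libraries below implement lattice-based schemes, but the homomorphic principle itself can be seen in a much simpler system: Paillier encryption is additively homomorphic (multiplying ciphertexts adds plaintexts). The toy Python sketch below uses small fixed primes and a fixed random value, so it is in no way secure; it is only meant to make "computing on encrypted data" concrete:

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def paillier_keygen(p=104729, q=1299709):
    """Toy Paillier key generation with small fixed primes (NOT secure)."""
    n = p * q
    lam = lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # valid because we use g = n + 1
    return (n,), (n, lam, mu)

def encrypt(pub, m, r=12345):
    """c = (n+1)^m * r^n mod n^2; r should be random and coprime to n."""
    (n,) = pub
    n2 = n * n
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    n, lam, mu = priv
    n2 = n * n
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n
```

Multiplying two ciphertexts modulo n^2 yields an encryption of the sum of the plaintexts, which is the essence of the weighted-average evaluation suggested above; the FHE schemes in SEAL and HElib extend this to both addition and multiplication.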

- Describe commonly-used homomorphic encryption schemes and their implementations in open-source libraries
- Explain how scheme / library parameters should be set to optimise performance
- Compare performance of two libraries for a more involved evaluation, using one or more advanced techniques such as batching or bootstrapping
- Based on implementation experience, evaluate suitability of different libraries for a given application

- https://homomorphicencryption.org
- http://www.humangenomeprivacy.org/2019/
- https://github.com/Microsoft/SEAL
- https://github.com/homenc/HElib
- https://eprint.iacr.org/2016/421
- https://eprint.iacr.org/2012/144

High-performance consensus algorithms are a fundamental building block underpinning the distributed infrastructure of many Internet services. Single-shot consensus algorithms allow a set of nodes in a distributed system to agree on a single value. In practice, however, real-world applications often require multi-shot consensus, i.e. agreement on an ever-growing sequence of values. For example, in Blockchain applications, participating nodes continually seek agreement on the next block of transactions to append to the Blockchain. Well-known multi-shot consensus algorithms include (Multi-) Paxos[1][2] and Raft[3] for the case where nodes fail by crashing. There is currently also considerable interest in Byzantine-fault tolerant multi-shot consensus algorithms [4][5][6] (e.g. to support 'permissioned' crypto-currencies such as Facebook's Libra/Diem or as a building block in recent 'permissionless' crypto-currencies). In this project you will build your own Blockchain using the Elixir distributed programming language and evaluate its performance. The key challenge will be to implement a suitable multi-shot consensus algorithm. For this, you may base your design on existing algorithms from the literature, or propose your own after discussion with your supervisor. You are free to choose between existing crash-fault tolerant (e.g. Raft/Multi-Paxos), or Byzantine fault-tolerant algorithms (e.g. Streamlet, PBFT).
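Independently of the consensus layer, the "chain" part of a Blockchain is simply hash linking. A minimal Python sketch (illustrative only; the project itself is to be written in Elixir) shows why tampering with an earlier block is detectable:

```python
import hashlib, json

def block_hash(block):
    """Deterministic SHA-256 hash of a block's canonical JSON encoding."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transactions):
    """Append a block linked to its predecessor by hash."""
    prev = block_hash(chain[-1]) if chain else '0' * 64
    chain.append({'prev': prev, 'txs': transactions})
    return chain

def valid(chain):
    """Verify the hash links; editing any earlier block breaks them."""
    return all(chain[i]['prev'] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))
```

What the hash chain cannot do on its own is make distributed nodes agree on *which* block comes next; that is exactly the job of the multi-shot consensus algorithm.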

- A report giving an overview of prominent distributed multi-shot consensus algorithms.
- A report describing consensus algorithms in the context of prominent permissioned and/or permissionless Blockchains.
- A proof-of-concept program implementing a simple (possibly inefficient) multi-shot consensus algorithm

- A permissioned Blockchain that is resilient to crash and/or Byzantine failures and achieves a high transaction throughput.
- A report describing the architecture of the implemented Blockchain
- A report on the performed experiments and performance evaluation results

- Support for nodes joining and leaving the system.
- An integration with the Jepsen distributed testing framework.

- Paxos made simple, L. Lamport, ACM SIGACT News, December 2001, https://www.microsoft.com/en-us/research/publication/paxos-made-simple
- The Part-Time parliament, L. Lamport, ACM TOCS '98: https://www.microsoft.com/en-us/research/publication/part-time-parliament/
- Lecture notes on Leader-based Sequence Paxos - an understandable sequence consensus algorithm, S. Haridi et al, https://arxiv.org/pdf/2008.13456.pdf
- In search of an understandable consensus algorithm, D. Ongaro et al, USENIX ATC '14 https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
- Streamlet: Textbook Streamlined Blockchains, B. Chan and E. Shi, https://eprint.iacr.org/2020/088
- Practical Byzantine fault-tolerance, M. Castro and B. Liskov, OSDI '99: http://pmg.csail.mit.edu/papers/osdi99.pdf

Games development is one of the largest parts of the computing industry. There are many platforms for writing games and indeed, many architectures and frameworks.

In this project you will build and deploy a game.

In the first term you will research existing games and their underlying technology platforms. You will also look at game building tools such as Unity3D. You will consider game frameworks such as PyGame and also investigate APIs for game development such as Vulkan.

You will need to be familiar with the design patterns required for gaming: Flyweight, publish/subscribe (for message passing), Object Pool, Service, Dirty Flag, Factory, Visitor, Singleton, Game Loop, State, Observer, Command, and Event Queue. (See perhaps Game Programming Patterns by Robert Nystrom; also consider reading some of the material on the Game Design Patterns Wiki.)
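To show the style of these patterns, here is a minimal Python sketch combining Event Queue with publish/subscribe (the event names are illustrative; see Nystrom's book for fuller treatments):

```python
from collections import defaultdict, deque

class EventQueue:
    """Minimal publish/subscribe event queue, decoupling game systems:
    senders publish events; subscribers are notified when the queue is pumped."""
    def __init__(self):
        self.handlers = defaultdict(list)
        self.pending = deque()

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, data=None):
        self.pending.append((event_type, data))   # deferred, not immediate

    def pump(self):
        """Dispatch queued events -- typically once per game-loop tick."""
        while self.pending:
            event_type, data = self.pending.popleft()
            for handler in self.handlers[event_type]:
                handler(data)
```

Deferring dispatch to `pump()` is the point of the pattern: the audio, UI and scoring systems can all react to `enemy_died` without the combat code knowing any of them exist.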

Your target could be a tower defence game written in Python in which case you will need to understand Threading and Graphics in Python.

Perhaps you will use Blender and other tools and do some maths and produce an excellent Unity game.

Maybe you will program a platform game on an Android device, using the sensors and learning about the Android life cycle, the Android Manifest, and the mobile approach to gaming.

- A simple program displaying an animation using sprites or similar animation technology.
- A program displaying a simple, interactive user interface. You might want to explore the different means of displaying and writing user interfaces (XML, Object Oriented, Immediate Mode).
- 3D demo using an API
- A program that exhibits some 3D graphics using a relatively low-level graphics API (Vulkan, OpenGL, WebGL, JavaFX etc. This might provide a starting point for choosing technology: https://en.wikipedia.org/wiki/List_of_3D_graphics_libraries)
- A simple game using a game engine / library / framework. Think of it as an opportunity to learn about the technologies you'll use for your main game but keep it simple.
- A report on design patterns in games: an enumeration of common ones, the problems they solve, and perhaps a discussion of some specific instances of their use.
- A report on the specific technologies you've tried / intend to use for your game. You should go deep on your intended tech and show as deep an understanding as you can.
- A report on animation techniques: enumerate them, perhaps tell their history, and compare and contrast them. The goal is to show a deep understanding of how animation in games is achieved.
- A report on Threading and User Interfaces. What problems do games experience with user interfaces without threading, how does multi-threading solve these problems, and in what different ways can this parallelism or concurrency be achieved? Perhaps review how it has actually been achieved in technologies or platforms relevant to your game.
- A report on designing games with graphical user interfaces for multiple platforms.
- A report on the technologies used to write and draw graphical user interfaces.

- Game Source Code: You should use a consistent architectural style. Our suggestion is object orientation but you could opt for another with sufficient justification.
- Game Source Code: The game needs to have at least 3 'screens' or pages that are interactive using the GUI technology of choice.
- Game Source Code: You should publish your game! (Android Store, itch.io)
- Game Source Code: The game play should make use of animation.
- The Final Report: An in depth review of background materials you used to learn the concepts and technologies used to write the game, perhaps presenting a historical timeline of those concepts / technologies.
- The Final Report: A description of the more difficult concepts involved in writing graphical games, e.g. multi-threading and threads/processes, event-based systems (handlers, events), and special considerations for the environment your game will run in (memory footprint, battery usage).
- The Final Report: A description of the software engineering processes involved in writing your game.
- The Final Report: A description of interesting techniques, hacks or data structures you used (or could have used) in writing your game.

- Use of AI, machine learning on the results of the simulation to improve performance.
- Use of a camera or bluetooth or other libraries.
- High Score stored securely off-phone to enable competition
- The goal here is to push the envelope of the project, expanding the breadth of technologies and concepts you'd need to learn to complete the game.
- Perhaps: Using complex game AI, perhaps also using machine learning to improve the performance of that AI.
- Perhaps: Using peripheral devices for interaction with the game.
- Perhaps: Enabling multiplayer modes with networking, perhaps through scoreboards or other means.

- Mario Zechner. Beginning Android Games. 2011 - Springer.
- http://www.gameprogrammingpatterns.com/contents.html - Game Programming Patterns by Robert Nystrom

Methods of prediction of expert advice allow us to combine a pool of different prediction algorithms (called experts) in the on-line mode. By using them we can perform nearly as well as any expert in the pool.

There is a range of prediction with expert advice methods that are applicable in different situations. First, we may have different requirements for the quality of predictions: the deviation between predictions and outcomes can be measured using different loss functions. The absolute loss is very different from the square loss and is usually handled by different algorithms. Secondly, one may be interested in different scenarios of using expert advice. The aggregating algorithm performs as well as any expert in the pool. The fixed share algorithm competes with "switching experts" that can switch from one original expert to another once in a while. The guarantees that can be provided in this situation are weaker but cover a much larger pool of predictors. Finally, there are recent developments such as specialist experts, which can skip a turn without giving a prediction. There are ways to handle this scenario.

In this project you will implement prediction with expert advice algorithms and use them to merge predictions by experts, which will normally be standard prediction algorithms. One may think of prediction with expert advice as an alternative to model selection: instead of picking one algorithm, we merge a group.
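To make the setting concrete, here is a minimal sketch of an exponentially weighted average forecaster, a basic member of this family closely related to the aggregating algorithm under square loss. All names and the choice of learning rate are illustrative, not a prescribed design:

```python
import numpy as np

def weighted_average_forecaster(expert_preds, outcomes, eta=2.0):
    """Merge expert predictions online; expert_preds is (T, N), outcomes is (T,)."""
    T, N = expert_preds.shape
    log_w = np.zeros(N)                      # log-weights, start uniform
    preds = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())      # normalise in log-space for stability
        w /= w.sum()
        preds[t] = w @ expert_preds[t]       # weighted average prediction
        # update: each expert loses weight exponentially in its square loss
        log_w -= eta * (expert_preds[t] - outcomes[t]) ** 2
    return preds
```

With one good expert and one bad one, the merged predictor's cumulative square loss stays within a small additive term of the good expert's loss.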

The project is recommended for Computational Finance students.

- A simple proof-of-concept implementation of the aggregating algorithm and fixed share algorithm.
- Report: An overview of the problem of prediction with expert advice and the work of the aggregating algorithm.
- Report: The performance of the aggregating algorithm on a small artificial dataset.

- The program will work with a real dataset, read the data from a file, preprocess it, and use a standard algorithm to generate experts' predictions (if necessary).
- The program will implement the aggregating algorithm, plus possibly fixed share and the aggregating algorithm for sleeping experts. The results will be visualised.
- The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report will describe computational experiments, analyse their results and draw conclusions.

- Application of sleeping experts for predicting the implied volatility of options.
- Experiments with the weak aggregating algorithm.

- Y.Kalnishkan, The AA And Laissez-Faire Investment http://onlineprediction.net/?n=Main.TheAAAndLaissez-FaireInvestment
- Kalnishkan, Y. Prediction with expert advice for a finite number of experts: A practical introduction. Pattern Recognition, 126 (2022).
- Vovk, V. Competitive on‐line statistics. International Statistical Review, 69(2), 213-248 (2001).
- M.Herbster and M.Warmuth, Tracking the Best Expert, Machine Learning, 32, 151-178 (1998)
- Y.Kalnishkan and M.V.Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8): 1228--1244 (2008)

- Study conformal prediction theory and the meaning of its validity and efficiency properties.
- Pick artificial or natural data examples with various levels of difference between the data segments.
- Implement Conformal Prediction with a few versions of conformity score functions and compare their performance on these data examples.
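As a starting point, full conformal prediction for classification with the common nearest-neighbour conformity score might be sketched as follows. Names are illustrative, and an inductive variant would avoid re-scoring the whole training set for every candidate label:

```python
import numpy as np

def conformal_pvalues(X_train, y_train, x_new, labels):
    """p-value for each candidate label of x_new (full conformal prediction)."""
    def nonconformity(X, y, i):
        # distance to the nearest same-class example over distance to the
        # nearest other-class example (larger = stranger)
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        return d[y == y[i]].min() / d[y != y[i]].min()
    pvals = {}
    for lab in labels:
        X = np.vstack([X_train, x_new])      # tentatively add x_new with label lab
        y = np.append(y_train, lab)
        scores = np.array([nonconformity(X, y, i) for i in range(len(y))])
        pvals[lab] = float((scores >= scores[-1]).mean())
    return pvals
```

The label belonging to the nearby cluster receives the larger p-value, which is what the validity property guarantees on average.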

- Survey of conformity scores for CP framework based on standard machine learning algorithms.
- For one of the algorithms, develop and compare different ways to define conformity score function.
- Compare their validity and efficiency on the data examples including the transfer learning challenge.

- To try the same for other underlying algorithms, and to formulate general principles behind safe conformity score functions.

- Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
- A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).

High-performance consensus algorithms are a fundamental building block underpinning the distributed infrastructure of many Internet services. Single-shot consensus algorithms allow a set of nodes in a distributed system to agree on a single value. In practice, however, real-world applications often require multi-shot consensus, i.e. agreement on an ever-growing sequence of values. For example, in Blockchain applications, participating nodes continually seek agreement on the next block of transactions to append to the Blockchain. Well-known multi-shot consensus algorithms include (Multi-)Paxos [1][2] and Raft [3] for the case where nodes fail by crashing. There is currently also considerable interest in Byzantine-fault-tolerant multi-shot consensus algorithms [4][5][6] (e.g. to support 'permissioned' crypto-currencies such as Facebook's Libra/Diem, or as a building block in recent 'permissionless' crypto-currencies). In this project you will build your own Blockchain using the Elixir distributed programming language and evaluate its performance. The key challenge will be to implement a suitable multi-shot consensus algorithm. For this, you may base your design on existing algorithms from the literature, or propose your own after discussion with your supervisor. You are free to choose between existing crash-fault-tolerant (e.g. Raft/Multi-Paxos) or Byzantine-fault-tolerant algorithms (e.g. Streamlet, PBFT).
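The end product of multi-shot consensus in a Blockchain is an agreed, hash-linked sequence of blocks. A minimal sketch of that data structure, independent of any particular consensus algorithm (Python rather than Elixir, purely for illustration):

```python
import hashlib
import json

def make_block(prev_hash, transactions):
    """Create a block whose hash commits to its contents and its predecessor."""
    body = {"prev_hash": prev_hash, "transactions": transactions}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

def valid_chain(chain):
    """Check hash links: tampering with an earlier block invalidates the chain."""
    for prev, block in zip(chain, chain[1:]):
        body = {"prev_hash": block["prev_hash"],
                "transactions": block["transactions"]}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["prev_hash"] != prev["hash"] or block["hash"] != recomputed:
            return False
    return True
```

The hard part of the project is not this structure but getting all correct nodes to agree on the same next block, which is exactly what the multi-shot consensus algorithm provides.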

- A report giving an overview of prominent distributed multi-shot consensus algorithms.
- A report describing consensus algorithms in the context of prominent permissioned and/or permissionless Blockchains.
- A proof-of-concept program implementing a simple (possibly inefficient) multi-shot consensus algorithm.

- A permissioned Blockchain that is resilient to crash and/or Byzantine failures and achieves a high transaction throughput.
- A report describing the architecture of the implemented Blockchain
- A report on the performed experiments and performance evaluation results

- Support for nodes joining and leaving the system.
- An integration with the Jepsen distributed testing framework.

- Paxos made simple, L. Lamport, ACM SIGACT News, December 2001, https://www.microsoft.com/en-us/research/publication/paxos-made-simple
- The Part-Time parliament, L. Lamport, ACM TOCS '98: https://www.microsoft.com/en-us/research/publication/part-time-parliament/
- Lecture notes on Leader-based Sequence Paxos - an understandable sequence consensus algorithm, S. Haridi et al, https://arxiv.org/pdf/2008.13456.pdf
- In search of an understandable consensus algorithm, D. Ongaro et al, USENIX ATC '14 https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
- Streamlet: Textbook Streamlined Blockchains, B. Chan and E. Shi, https://eprint.iacr.org/2020/088
- Practical Byzantine fault-tolerance, M. Castro and B. Liskov, OSDI '99: http://pmg.csail.mit.edu/papers/osdi99.pdf

This project is to implement and compare a variety of clustering methods on some artificial and real data.

- Proof of concept program: Gaussian mixture model applied to a small artificial dataset.
- A report giving an overview of clustering and describing three different clustering methods.
- A report describing and assessing the results of applying the proof of concept program to data.
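For the proof-of-concept deliverable, EM for a one-dimensional Gaussian mixture can be sketched roughly as follows. This is a deliberately simplified version with a deterministic initialisation; all names are illustrative:

```python
import numpy as np

def gmm_em_1d(x, k=2, iters=100):
    """Fit a 1-D Gaussian mixture by EM; returns means, std devs, mixing weights."""
    mu = np.linspace(x.min(), x.max(), k)        # deterministic initialisation
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi
```

On two well-separated clusters the estimated means converge to the cluster centres; a full implementation would add log-likelihood monitoring, a convergence test, and safeguards against collapsing components.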

- The program(s) will work with a real dataset(s).
- You will generate and compare different clusterings or generative models using several different methods, such as: kernel-based semi-supervised clustering, spectral learning, factor analysis, tree-based methods, or other comparable methods.
- You will compare and visualize the clusterings produced.
- The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

- There is a very wide range of clustering and generative-modelling algorithms in standard use, and an even wider variety of methods of fitting them to data. This would be an opportunity to implement several different methods and get to understand them by using them in practice.

- Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
- Machine Learning: A Probabilistic Perspective, by K. Murphy, MIT Press 2012
- S. Basu, A. Banerjee and R. Mooney, "Semi-supervised clustering by seeding", Proceedings of the 19th International Conference on Machine Learning, pp. 19-26, 2002.

Machine learning algorithms have been successfully used in medicine, the pharmaceutical industry and many other fields to handle large sets of data in order to find important patterns as well as outliers and anomalies.

Recently developed conformal predictors can be used to narrow down the search space and to improve the accuracy of classification.

- A review and a report of the approach in the case of classification and regression.
- A design of the system and its extension for conformal predictors.
- Implementation of a pilot system and some preliminary experiments.

- Implementation of the system, experiments with various parameters.
- Experimenting with different underlying ML algorithms.
- It is desirable that the final system will have a graphical user interface.
- The final report will describe the theory, the implementation issues, computational experiments with different data sets and parameters.

- Online learning with conformal predictors.

- V.Vovk, A.Gammerman, G.Shafer. "Algorithmic Learning in a Random World", Springer, 2005.

- Find and choose a data set
- Implement a relevant version of Conformal Prediction algorithm
- Introduce some mistakes into the data, apply the algorithm, and observe the results.

- Implement Conformal Prediction with an additional error detection function.
- Compare different versions of Conformal Prediction by their efficiency for data correction.
- Analyse and report the output.

- Propose, justify theoretically, and implement a scheme of multiple error correction.
- Compare how the approach works in different (on-line, batch, cross-validation, inductive) modes.

- Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom

- Study data sets and identify one relevant for the project, describe it and the structure of its label space.
- Select a basic machine learning approach applicable to this data.
- Develop and implement a multi-class version of this approach and get some initial results on the data.

- Survey of the data and machine learning techniques.
- Develop conformity score functions for data with a structured label space.
- Implement a conformal prediction algorithm applicable to the data.
- Analysis of the output.

- Compare results with different basic machine learning approaches.
- Compare original label space structure attached to the data against its automatic structuring by means of data analysis.

- Predicting structured objects with support vector machines by T.Joachims, T.Hofmann, Y.Yue, C.N.Yu. Communications of the ACM 52 (11), 97--104

Non-speech body sound recognition has not received major attention, despite such sounds containing valuable information about a person's health condition.

This project aims at detecting the two most common infectious events, coughs and sneezes, from raw recordings.

The challenge here is the variety of coughs and sneezes amongst other non-speech body sounds. Hence, the confidence measure of conformal prediction would be useful in this scenario.

This project is recommended for CS5920 students, who have successfully implemented conformal prediction in the first assignment, and would like to further study its performance on real world problems.

- Producing a simple proof-of-concept implementation of classification-based conformal prediction with different underlying algorithms (e.g. SVM, k-NN, Random Forests, etc.) from scratch.
- A report of the empirical results of such conformal predictors on a small synthetic dataset.

- A detailed literature review of existing machine learning algorithms used for non-speech sound recognition.
- Producing a detailed report of the dataset structure, including some visual analysis (e.g. t-SNE, UMAP, etc.).
- Implementing conformal predictors on the whole dataset, optimising the underlying algorithms' parameters.
- Carefully analysing the p-values, comparing the validity, efficiency, accuracy of the implemented conformal predictors.

- Comparing different methods of extracting sound features (e.g. MFCC, CFA, ZCR) from raw recordings, and their impacts on the prediction performance.
- Using the entire dataset to train a new model, then testing such model on other real world recordings.

- Nguyen, K. A. and Luo, Z. (2018). Cover Your Cough: Detection of Respiratory Events with Confidence Using a Smartwatch. Proceedings of Machine Learning Research, 91, 1-18.
- Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 1015-1018). ACM.
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YDEPUT

An important task in Bioinformatics is identifying whether a protein sequence is part of a particular protein family. This is typically carried out using tools such as Hidden Markov Models (HMMs).

The difficulty with such models is that, while they produce a probabilistic likelihood of a hit, the probabilities are based on a number of modelled parameters. Typically, such probabilities generate incredibly small p-values, and yet some suspicion remains that these methods are not as robust as the results suggest.

Conformal predictors suggest an alternative to estimating such probabilities by moving away from the paradigm of ROC curves etc. and attempting to create probabilities based on the models themselves.

- A report on Hidden Markov Model (HMM) tools in Bioinformatics.
- A report on conformal predictors.
- A simple implementation of conformal prediction on a family of proteins from PFAM.

- A framework to evaluate the credibility and confidence of individual HMMs from PFAM.

- Using conformal predictors on HMM models in protein sequence prediction.

- The Pfam protein families database. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. doi: 10.1093/nar/gkp985.
- http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

You must have attended CS5100 and be likely to get a distinction in the module.

The student must have an excellent grasp of conformal predictors, Python and Unix systems. In the latter they must be able to build simple workflows of Bioinformatics tools. The student must speak with the advisor in question before applying for this project.

- A proof-of-concept program implementing neural nets both in the case of regression and in the case of classification (where neural networks output probability predictions).
- Building a cross-conformal predictor on top of neural nets and exploring its empirical validity.
- A preliminary empirical study of the efficiency of the student's implementation and standard implementations (such as the one in scikit-learn).

- Careful cross-validation of various parameters, including the stopping rule, learning rate, and the numbers of hidden neurons at various levels.
- Implementing regularization.
- Experimenting with different architectures, possibly including deep architectures.
- Ideally, the final product will have a graphical user interface.
- The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

- Careful study of the optimal number of folds.
- Replacing back-propagation with other optimization algorithms.
- Implementing jackknife+, a popular recent version of cross-conformal prediction.

- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Chapter 11.
- Mitchell, Machine Learning, 1997, Chapter 4.
- Research papers.

This project is to implement and assess a deep learning algorithm - typically for a feed-forward network - and to perform visualisations and experiments to analyse and improve its performance

- Proof of concept program: Implementation of a deep learning algorithm for a relatively small network on a relatively simple dataset, together with experiments determining the effect of hyperparameters on convergence
- A report giving an overview of a range of related deep learning methods.
- A report describing and assessing the performance of the initial implementation and any experiments performed
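For orientation, a deliberately tiny example of the core mechanics (forward pass and backpropagation) on the XOR problem. Real deliverables would use a framework and a real dataset; all names and hyperparameters here are illustrative:

```python
import numpy as np

def train_xor_mlp(hidden=8, lr=0.5, iters=5000, seed=0):
    """Train a one-hidden-layer sigmoid network on XOR with hand-written backprop."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0.], [1.], [1.], [0.]])
    W1 = rng.normal(0, 1, (2, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1, (hidden, 1)); b2 = np.zeros(1)
    sig = lambda z: 1 / (1 + np.exp(-z))
    for _ in range(iters):
        h = sig(X @ W1 + b1)                  # forward pass
        out = sig(h @ W2 + b2)
        d_out = out - y                       # cross-entropy gradient at the output
        d_h = (d_out @ W2.T) * h * (1 - h)    # backpropagate through the hidden layer
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)
    return out
```

Hyperparameter experiments of the kind listed above amount to varying `hidden`, `lr` and `iters` and observing the effect on convergence.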

- The program(s) will work with a real dataset(s).
- You will implement and compare multiple architectures and optimisation methods
- You will visualize and assess the performance of the algorithms produced.
- The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

- This would be an opportunity to implement several deep learning methods and get to understand them in practice by applying them to a significant dataset.

- "Deep Learning", by Goodfellow, Bengio, and Courville, MIT Press 2016
- "Deep Learning with Python" by Francois Chollet, Manning Publications 2018

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

This project will require some mathematical understanding; it is not simply programming.

- A proof-of-concept implementation of bagging, random forests, or boosting both in the case of regression and in the case of classification.
- Implementing underlying algorithms, such as decision stumps or decision trees; simple proof-of-concept implementations are sufficient.
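Bagging with decision stumps, the simplest combination mentioned above, can be sketched roughly like this (regression case, one-dimensional input; all names are illustrative):

```python
import numpy as np

def fit_stump(x, y):
    """Best single-threshold split minimising squared error on 1-D input."""
    best = None
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

def bagged_stumps(x, y, n_trees=50, seed=0):
    """Bagging: fit each stump on a bootstrap resample, average the predictions."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), len(x))   # bootstrap sample with replacement
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda q: np.mean([s(q) for s in stumps], axis=0)
```

Averaging over bootstrap resamples smooths the single hard threshold of an individual stump, which is the variance-reduction effect the final report should study empirically.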

- A final program implementing bagging, random forests, or boosting (or all three) with a careful choice of the parameters.
- Implementations of the underlying algorithms should be complemented by a careful empirical study of the effect of various parameters; producing optimized programs.
- Comparing the performance of different ensemble techniques in the problems of classification and regression.
- Using bagged and boosted decision trees as probability predictors; evaluating their performance using proper loss functions.
- Ideally, the final product will have a graphical user interface.
- The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

- Implementing the out-of-bag estimate of error probability.
- Investigating variable importance.
- Proximity plots.
- Theoretical analysis of the algorithms.

- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Section 8.7, Chapters 10 and 15-16
- James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning, 2013, Chapter 8.
- Research papers.

DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. Persistent identifiers (PIDs) are not only important to uniquely identify a publication, dataset, or person, but the metadata for these persistent identifiers can provide unambiguous linking between persistent identifiers of the same type, e.g. journal articles citing other journal articles, or of different types, e.g. linking a researcher and the datasets they produced. DataCite has constructed a PID graph that relates the different types of PID into a graph that can be queried using a language called GraphQL.

A variety of graph properties can be tested with this data. In particular, there has been interest in whether the fraction of nodes with degree k follows a particular pattern. Scale-free networks, where this fraction is a power law in k, have generated significant interest, though it is now apparent that they are less common than previously thought.

This project will require the student to examine the behaviour of the degree of the PID graph.
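As a crude first check on the degree distribution, one can fit a straight line on log-log axes. Note that Clauset and co-authors recommend maximum-likelihood fitting rather than least squares, so this is only a sanity check; the function name is illustrative:

```python
import numpy as np

def powerlaw_exponent(pmf):
    """Estimate alpha assuming p(k) ~ k^(-alpha), by least squares in log-log space.

    pmf maps each degree k to the fraction of nodes with that degree.
    """
    ks = sorted(k for k in pmf if k > 0)
    slope, _ = np.polyfit(np.log(ks), np.log([pmf[k] for k in ks]), 1)
    return -slope
```

On exactly power-law data this recovers the exponent; on real PID-graph data the interesting question is whether a power law is a good description at all.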

- A report on graphs and scale-free networks.
- A report on the PID graph.
- A report on GraphQL.

- The development of a tool to query the PID graph and visualise graphs.
- Identifying the behaviour of the degree of the PID graph.

- Exploring other graph observables, e.g. centrality.
- Exploring using a recommender system to identify papers of interest.

- Introducing the PID Graph - https://blog.datacite.org/introducing-the-pid-graph/
- The Fullstack Tutorial for GraphQL - https://www.howtographql.com/
- Learning GraphQL, E.Porcello and A. Banks O'Reilly Media (available online through Safari Books)
- Handbook of Graph Theory, J.L. Gross, J. Yellen and P. Zhang, Chapman and Hall/CRC (available online through Safari Books)
- Scale-free networks are rare, A.D. Broido and A. Clauset, Nature Comms. (2019) DOI - 10.1038/s41467-019-08746-5

Understanding of or willingness to study graph theory.

The student must have a good understanding of Python or a similar tool.

Machine recognition of Greek characters in images is a challenging task, because of the large number of character classes (more than 270) and the similarity of the diacritic marks across characters.

To date, Greek text search facilities remain far from perfect. Hence, it is interesting to apply machine learning techniques and compare their accuracy on this task.

- Proof of concept program: identify one machine learning task suitable for character recognition and implement one learning algorithm for the chosen task.
- Report: An overview of the datasets. Statistics and numbers involved.
- Report: A report on machine learning tasks for character recognition.

- Survey of machine learning techniques used for character recognition.
- The developed machine learning software for Greek characters recognition.
- The software engineering process involved in generating your software.

- Provide predictions with confidence measure.

- Gatos, Basilis, et al. GRPOLY-DB: An old Greek polytonic document image database. Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015.
- http://www.iit.demokritos.gr/~nstam/GRPOLY-DB

The problem of working out the price of a house from its characteristics has been studied intensively over the past decades. Many housing price datasets are available on the Internet and many machine learning techniques have been applied to them.

The Boston housing dataset has been a playground for statistics and machine learning methods since the 1970s. The more recent California Housing Prices dataset dates to the 1990s. In the 2010s the Ames dataset was put forward: it is based on transactions and contains dates as part of individual records.

The first task is to work out house prices in a batch setup where time is not available (or is ignored). We can compare learning methods and study the effect of parameters on the test error.

Secondly, a study of house prices in the online mode can be undertaken. Here datapoints become available one by one: knowing that a house is about to be sold, we predict the sales price, then we see the actual price in the transaction, then we make a prediction for the price in the next sale etc. We can use the data with timestamps to study inflation. Do prices grow with time? How fast?
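The online protocol just described (predict, then observe the price, then update) can be sketched with ridge regression as the learner. This is a rough illustration; names and the regularisation constant are illustrative:

```python
import numpy as np

def online_ridge(X, y, lam=1.0):
    """Predict each target before seeing it, then update (online protocol)."""
    d = X.shape[1]
    A = lam * np.eye(d)                  # regularised Gram matrix
    b = np.zeros(d)
    preds = np.empty(len(X))
    for t, (x_t, y_t) in enumerate(zip(X, y)):
        w = np.linalg.solve(A, b)        # current ridge solution
        preds[t] = x_t @ w               # predict before seeing y_t
        A += np.outer(x_t, x_t)          # then incorporate the new example
        b += y_t * x_t
    return preds
```

The cumulative loss of such a predictor over time is the natural quantity to plot; systematic drift between predictions and later prices is one way inflation would show up.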

The project is recommended for Computational Finance students.

- Proof of concept program pre-processing the data and applying two machine learning methods to Boston Housing dataset
- Report: A report describing supervised learning, the concepts 'training set' and 'test set', and the theory of the methods.

- A machine learning method (such as Ridge Regression; the choice is by discussion with the supervisor) should be implemented from scratch.
- Several different datasets should be used
- Tests comparing different kernels, parameters, etc should be performed. The results should be visualised.
- Prediction in the online mode should be implemented and the results visualised.
- The report will describe different algorithms and implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report will describe computational experiments, analyse their results and draw conclusions

- Methods of prediction with expert advice can be applied in the online mode.
- Is there inflation? A study can be undertaken.

- Boston housing dataset on Kaggle: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
- California housing dataset on Kaggle: https://www.kaggle.com/camnugent/california-housing-prices
- De Cock, Dean. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19.3 (2011).
- Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
- J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
- B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Molecular Biology is now generating colossal amounts of data. In particular, there are a variety of technologies that can scan entire genomes or transcriptomes and determine the likelihood of their association with a particular biological process. However, such results are very noisy and suffer from a "large p, small n" effect: the ratio of the number of possible degrees of freedom p to the number of independent observations n is much greater than 1, and hence the results suffer from very large false positive rates.

On the other hand, the very extensive body of biomedical literature could be used to filter out spurious genes from the data before undergoing analysis. Tools such as GOCat mine the literature to relate Gene Ontology terms to specific search terms representing Biological processes. This specific tool is no longer functional and an alternative needs to be provided.

This project will require the student to develop an equivalent of GOCat.
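Since GOCat's core method is essentially k-nearest-neighbour search over annotated documents, a stripped-down sketch of the idea follows (cosine similarity over raw term counts; a real replacement would need proper text preprocessing and the GOA training data, and all names here are illustrative):

```python
from collections import Counter
import math

def knn_predict(doc_tokens, labelled, k=3):
    """Assign the majority label among the k most similar labelled documents.

    labelled is a list of (token_list, label) pairs.
    """
    def cos(a, b):
        ca, cb = Counter(a), Counter(b)
        dot = sum(ca[t] * cb[t] for t in ca)
        na = math.sqrt(sum(v * v for v in ca.values()))
        nb = math.sqrt(sum(v * v for v in cb.values()))
        return dot / (na * nb) if na and nb else 0.0
    top = sorted(labelled, key=lambda p: cos(doc_tokens, p[0]), reverse=True)[:k]
    return Counter(lab for _, lab in top).most_common(1)[0][0]
```

The "more advanced methods beyond kNN" extension would replace this nearest-neighbour step with, for example, a supervised text classifier.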

- A report on Gene Ontologies.
- A report on the tool GOCat explaining the methods.
- A report on the data set required to build the tool.

- The development of a stand-alone tool to implement the equivalent of GOCat.
- An API to query the tool.
- Rigorous testing of the tool.

- Using more advanced methods beyond kNN for the implementation.
- An API to query the tool.

- Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Gobeill J, Pasche E, Vishnyakova D, Ruch P. DOI 10.1093/database/bat041
- A survey on annotation tools for the biomedical literature. Neves M, Leser U. DOI 10.1093/bib/bbs084
- The GOA database in 2009 - an integrated Gene Ontology Annotation resource. Barrell D, Dimmer E, Huntley R, Binns D, O'Donovan C, Apweiler R. dx.doi.org/10.1093/nar/gkn803

Understanding of machine learning methods such as kNN.

The student must have a good understanding of R and/or Python.

- Implementation of transfer learning algorithms based on pre-trained networks for any of the following tasks, (1) object detection/classification, (2) image segmentation, and (3) video classification.
- A report describing transfer learning algorithms implemented with some evaluation results.

- Implementing transfer learning, deep learning (e.g. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)), or hybrid deep networks for any of the aforementioned tasks.
- Implementing several deep learning algorithms with different architectures, where the employed architectures are either existing, or newly proposed.
- Implementing manual hyper-parameter fine-tuning of the implemented deep networks.
- A comprehensive evaluation and comparison against existing state-of-the-art studies.
- The final report will describe methodology, implementation and testing with theoretical analysis and detailed evaluation results.

- Implementing deep hybrid networks with new architectures.
- Implementing optimal hyper-parameter selection using an automatic process.
- Theoretical analysis of the algorithms.

- Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT press.
- Chollet, F., 2021. Deep learning with Python.
- Journal/conference papers.

A prediction region is a region guaranteed to include the true outcome with a given probability. A joint prediction region (JPR) is an "envelope" around the predicted future path (a sequence of predictions) guaranteeing that the true future path is included in the JPR with some probability. A similar criterion is the k-family-wise error rate (k-FWE), which bounds the probability that at least k points in the path are not included in the JPR.

In a recent paper, Wolf and Wunderli proposed a bootstrap-based method to derive JPRs for time-series prediction (see reference below), evaluated the method on autoregressive models, and compared it with different techniques for producing JPRs. In this project, you will implement the bootstrap JPR method and evaluate it on an application of your choice or one suggested by the supervisor.

From a methodological viewpoint, the key steps of the project will be:

- Learn a time-series predictor using a dataset of choice
- Devise a method to compute prediction standard errors for your model
- Implement the bootstrap algorithm of Wolf and Wunderli, using the above two models.
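Assuming the first two steps have produced point forecasts, per-step standard errors and bootstrap-simulated error paths, the sup-t style construction at the core of bootstrap JPRs reduces to a few lines. This is a sketch under those assumptions, not a full reproduction of the Wolf and Wunderli algorithm; names are illustrative:

```python
import numpy as np

def bootstrap_jpr(path_hat, se, boot_errors, alpha=0.1):
    """Joint prediction region path_hat +/- c * se, with c chosen so that the
    whole path stays inside for a (1 - alpha) fraction of bootstrap paths.

    boot_errors has shape (B, H): B bootstrap error paths over horizon H.
    """
    stat = (np.abs(boot_errors) / se).max(axis=1)   # sup-t statistic per path
    c = np.quantile(stat, 1 - alpha)
    return path_hat - c * se, path_hat + c * se
```

Because the quantile is taken of the maximum studentised error over the whole horizon, the resulting band is wider than pointwise intervals, which is precisely what gives joint rather than marginal coverage.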

- Report 1: bootstrap sampling.
- Report 2: overview of chosen case study.
- Report 3: prediction regions, JPRs and bootstrap JPRs.
- Report 4: time-series models.
- Report 5: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
- Implement and evaluate predictor and prediction standard errors.

- Implement code for bootstrap JPRs and evaluate empirical coverage and family-wise error rate for different hyper-parameters (e.g., k, horizon length, sample size, number of bootstrap samples).
- Final dissertation describing problem, methods and algorithm, implementation details, computational experiments and their analysis, and conclusions.

- 1. Compare different predictors.
- 2. Compare bootstrap JPRs with other methods mentioned in the paper (e.g., joint marginals and Scheffé)
- 3. Compare bootstrap JPRs with Monte Carlo sampling based on Bayesian neural networks and uncertainty recalibration

- (bootstrap) Chapter 6, Lecture notes of 36-402, CMU Undergraduate Advanced Data Analysis, available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
- (bootstrap JPRs) Wolf, Michael, and Dan Wunderli. "Bootstrap joint prediction regions." Journal of Time Series Analysis 36.3 (2015): 352-376.
- (time-series models) Stock Market Predictions with LSTM in Python, available at https://www.datacamp.com/community/tutorials/lstm-python-stock-market
- (time-series models) Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into deep learning, chapters 8 and 9, available at https://d2l.ai/index.html
- (extension 3) Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, ICML 2016, available at http://proceedings.mlr.press/v48/gal16.pdf
- (extension 3) Kuleshov, Volodymyr, Nathan Fenner, and Stefano Ermon. "Accurate Uncertainties for Deep Learning Using Calibrated Regression." ICML 2018, available at https://arxiv.org/pdf/1807.00263.pdf

You will start by studying background literature on various
distributed graph processing platforms and then focus on a specific
platform of your choice. You will then use this platform to implement
**3** proof-of-concept programs solving a few simple graph
analytics problems chosen from the list below. This will be followed
by focusing on a **single** more advanced problem of your choice,
and implementing and evaluating its prototype solution on the
distributed platform you have chosen.

Proof-of-Concept:

- Compute the number of vertices and edges in a graph
- Compute maximum, minimum, and average vertex degrees
- Compute and analyse the vertex degree distribution (probability mass function)
- Compute the graph diameter (longest shortest path between any pair of nodes)
- Compute and analyse the distribution (probability mass function) for the number of friends of a given degree k > 1
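For instance, the degree-distribution task amounts to two aggregations, shown here in plain Python as a serial reference; a distributed version would express the same two steps as map/reduce or vertex-centric operations:

```python
from collections import Counter

def degree_pmf(edges):
    """Empirical probability mass function of vertex degrees (undirected graph)."""
    deg = Counter()
    for u, v in edges:            # first aggregation: degree per vertex
        deg[u] += 1
        deg[v] += 1
    n = len(deg)
    # second aggregation: fraction of vertices with each degree
    return {k: c / n for k, c in sorted(Counter(deg.values()).items())}
```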

Advanced problems:

(A) Community Detection: A community in a social network refers to a collection of users whose ties to each other (in terms of, e.g., the friend/follower relation) are closer than their ties to other users. Community detection is a hot topic in social network research and industry. The communities more often than not share a large number of common traits (such as, e.g., interests in specific topics, geographical areas, social status, etc.), which makes them extremely useful for developing better recommendation systems or advertisement campaigns. There is an extensive literature on the community detection problem, and a large number of algorithms have been proposed. Your task will be to identify and implement an algorithm suitable for a distributed computing platform, and evaluate its scalability and performance on various datasets.

(B) Triangle Enumeration: Similarly to community detection, finding and listing the triangles that occur in a graph is an important problem in network analysis and useful for determining network statistics such as the clustering coefficient. In the literature, there are several algorithms for triangle enumeration, most of which are designed for a centralized model of computation. More recently, algorithms for distributed graph frameworks have been proposed.
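A minimal centralized sketch of the task (the distributed algorithms in the literature partition this same intersection work across machines): each triangle is emitted exactly once by only accepting common neighbours larger than both endpoints.

```python
def triangles(edges):
    """List each triangle (a, b, w) with a < b < w exactly once.
    Assumes an undirected graph with each edge listed once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    for u, v in edges:
        a, b = sorted((u, v))
        # only common neighbours w > b, so each triangle is reported once
        for w in adj[a] & adj[b]:
            if w > b:
                found.append((a, b, w))
    return sorted(found)
```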

(C) PageRank: When considering real-world graphs, we are often interested in determining the "important" vertices. In social networks, for example, important vertices correspond to well-connected people. To decide whether a vertex is important, it is usually not sufficient to simply count its degree, as this is not a very robust measure and can easily be skewed, for example, by connecting a vertex to many fake vertices that are created solely for this purpose and have no other connections. Instead, the PageRank of a vertex depends on the number of links from vertices that are themselves important. PageRank is commonly used by search engines to determine the importance of web pages.
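The idea can be sketched with power iteration on a toy graph (the damping factor 0.85 is the conventional choice; a distributed implementation would perform the same per-vertex updates as message passing):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank; adj maps each vertex to its list of out-links."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in adj}
        for v, links in adj.items():
            if links:
                share = d * rank[v] / len(links)
                for w in links:
                    new[w] += share
            else:  # dangling vertex: spread its rank uniformly
                for w in new:
                    new[w] += d * rank[v] / n
        rank = new
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(g)
```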

(D) Connected components: The connected components of a graph are its maximal connected subgraphs; no two of them have a vertex in common. For directed graphs, the connectivity between a pair of vertices a and b can be either strong (i.e., b is reachable from a by following a path of directed arcs) or weak (i.e., a and b are connected in the underlying undirected graph). You will need to design and implement a distributed algorithm to identify both weakly and strongly connected components in a given graph, and analyse their respective size distributions.

(E) SimRank: SimRank is an important metric measuring the similarity between two vertices in a graph through their structural context: two vertices are similar if they are referenced by vertices which are themselves similar. SimRank has found applications in a variety of contexts including recommender systems, spam filtering, sponsored search, malware detection, etc. Your task is to implement a distributed algorithm to compute the SimRank score for any pair of vertices in a given graph and to experimentally analyse its performance.

(F) Implement a prototype of an interactive browser-based graph-analytics tool that provides an interface both for specifying graph-analytics problems and their inputs and for visualising their execution on a remote distributed platform (Giraph or GraphX) in a creative way. Your tool should support at least 4 of the proof-of-concept graph-analytics algorithms above, and be extensible so that other algorithms can be integrated in the future.

Data: A large number of graph datasets of different kinds are available in various online repositories. Links to the commonly used ones can be found at https://moodle.royalholloway.ac.uk/mod/page/view.php?id=168248.

Tools: Both Giraph and GraphX are available on the department's big data cluster. Other platforms (such as GraphLab and PowerGraph) may also be available.

- Report on the comparative study of the distributed platforms for processing large-scale graphs.
- Implementation and evaluation of the proof-of-concept programs.
- Detailed plan for the final project outcomes as agreed with the advisor.

- The software you built to achieve the agreed project goals. The dissertation will report on different algorithms for the advanced problem you chose, comparing their strengths and weaknesses. It will discuss techniques and findings, and provide meaningful visualisation of the results.

*Type 1 diabetes (T1D)* is a disease in which the human pancreas is not able to produce a sufficient amount of insulin to regulate blood glucose levels, thereby inhibiting glucose uptake in muscle and fatty tissue. Many automated control techniques for T1D require some model for predicting future glucose levels to improve therapy effectiveness. Traditional models include mechanistic (aka white-box) differential equation models, which provide good accuracy and interpretability but require a potentially error-prone state estimation procedure, i.e., recovering the state of the differential equation model from past observations. In contrast, data-driven (aka black-box) machine learning models do not require state estimation, as they can be trained to directly map past observations into predicted glucose values.

In this project, you will derive models for short-term glucose prediction (up to 1 hour prediction horizon) using deep learning techniques. Here by glucose we mean the subcutaneous glucose levels as measured by a CGM sensor. First, you will review the problem of automated insulin control and existing glucose prediction models. Then, you will implement data-driven machine learning models for glucose prediction, and train them with synthetic data provided by the project supervisor. These data consist of sequences of glucose measurements, insulin dosages, and ingested meals. You will experiment with and compare the accuracy of different deep neural network architectures and different lengths for the prediction horizon. If time permits, possible extensions include: analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout; deriving optimal insulin control policies using the learned glucose prediction models; or comparison of the learned deep models with more classical autoregressive models.
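The data-processing step amounts to turning the parallel measurement sequences into supervised (window, target) pairs. A minimal sketch, assuming regularly sampled sequences (with 5-minute CGM sampling, horizon = 12 corresponds to a 1-hour prediction); the parameter names are illustrative, not the supervisor's data format:

```python
def make_windows(glucose, insulin, meals, history=6, horizon=12):
    """Build (input, target) pairs: each input is the last `history`
    (glucose, insulin, meal) triples; each target is the glucose value
    `horizon` steps after the last observation in the window."""
    X, y = [], []
    for t in range(history, len(glucose) - horizon + 1):
        past = list(zip(glucose[t - history:t],
                        insulin[t - history:t],
                        meals[t - history:t]))
        X.append(past)
        y.append(glucose[t + horizon - 1])
    return X, y
```

A neural network (e.g. an LSTM or CNN) is then trained to map each window to its target.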

- Report 1: overview of the artificial pancreas and underlying control algorithms.
- Report: review of main models for glucose prediction.
- Report: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
- Implement code for data processing and for learning neural network-based glucose predictors. Perform training for at least one NN architecture and at least one prediction horizon.

- Final dissertation describing problem, data, solution method, machine learning methods used, implementation details, computational experiments and their analysis, and conclusions.

- Analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout.
- Derive optimal insulin control policies using the learned glucose prediction models.
- Compare the learned deep models with more classical autoregressive models.

- (for report 1) Boughton, Charlotte K., and Roman Hovorka. "Advances in artificial pancreas systems." Science translational medicine 11.484 (2019). PDF available from supervisor
- (for reports 1 and 2) Lunze, Katrin, et al. "Blood glucose control algorithms for type 1 diabetic patients: A methodological review." Biomedical signal processing and control 8.2 (2013): 107-119.
- (for report 2 and for ideas on neural net architecture) Li, Kezhi, et al. "Convolutional recurrent neural networks for glucose prediction." IEEE Journal of Biomedical and Health Informatics (2019).
- (for report 2 and for example of autoregressive model) Sparacino, Giovanni, et al. "Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series." IEEE Transactions on biomedical engineering 54.5 (2007): 931-937.
- (for Monte Carlo dropout) Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning, 2016.
- (for inductive conformal prediction) Papadopoulos, Harris. "Inductive conformal prediction: Theory and application to neural networks." Tools in artificial intelligence. IntechOpen, 2008.

- Proof of concept program: identify one machine learning task suitable for the smart meter data and implement one learning algorithm for the chosen task.
- Report: An overview of the datasets used. Statistics and numbers involved.
- Report: A report on machine learning tasks for smart meter data.

- Survey of machine learning techniques used for smart meter data analysis.
- The developed machine learning software for smart meter data.
- The software engineering process involved in generating your software.
- Experiments with different smart meter data.

- Extend the learning algorithms for online setting.
- Provide predictions with confidence measure.

- Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. Sean Barker, Aditya Mishra, David Irwin, Emmanuel Cecchet, Prashant Shenoy, and Jeannie Albrecht. Proceedings of the 2012 Workshop on Data Mining Applications in Sustainability (SustKDD 2012), Beijing, China, August 2012.
- Smart* Data Set for Sustainability http://traces.cs.umass.edu/index.php/Smart/Smart
- SmartMeter Energy Consumption Data in London Households https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
- UK Domestic Appliance-Level Electricity (UK-DALE) dataset http://jack-kelly.com/data/
- Revealing Household Characteristics from Smart Meter Data. Christian Beckel, Leyna Sadamori, Thorsten Staake, Silvia Santini, Energy, Vol. 78, pp. 397-410, December 2014.
- Piotr Mirowski, Sining Chen, Tin Kam Ho, Chun-Nam Yu, Demand Forecasting in Smart Grids, Bell Labs Technical Journal, 18(4), pp.135-158, 2014.

We spend most of our time indoors, where limited or no GPS service is available. The most popular indoor positioning solution leverages the ubiquitous WiFi signal to create a radio map of the tracking zone (i.e. WiFi fingerprinting).

WiFi fingerprinting is interesting because it may be considered as both a classification and a regression problem. On the one hand, if the training labels are treated as a discrete, finite set of indoor locations, it may be addressed as a classification problem. On the other hand, if the labels are treated as real numbers (i.e. Cartesian coordinates), a regression model will be more suitable, especially when there are too many training positions.

The challenge for both approaches is the high degree of variation of the radio signal and the high dimensionality of the training examples. Hence, it is useful to compare their performance on a real-world dataset.
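The two views can be illustrated with the same nearest-neighbour machinery: the fingerprint (RSSI vector) is compared against the radio map, and the label is either a discrete location (vote) or a coordinate pair (average). The RSSI values and room names below are made up:

```python
from collections import Counter

def knn(train, query_rssi, k=3):
    """Return the k nearest fingerprints by Euclidean distance in RSSI space;
    train is a list of (rssi_vector, label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return sorted(train, key=lambda item: dist(item[0], query_rssi))[:k]

def classify(train, query_rssi, k=3):
    """Classification view: labels are discrete location identifiers."""
    votes = Counter(label for _, label in knn(train, query_rssi, k))
    return votes.most_common(1)[0][0]

def regress(train, query_rssi, k=3):
    """Regression view: labels are (x, y) coordinates; average the k nearest."""
    nearest = [label for _, label in knn(train, query_rssi, k)]
    return tuple(sum(c) / k for c in zip(*nearest))
```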

- Producing a simple proof-of-concept implementation of some classification and regression algorithms from scratch.
- A report of the empirical results of such learning algorithms on a small synthetic dataset.

- A detailed literature review of machine learning applications in WiFi fingerprinting.
- Producing a detailed report of the dataset structure, including visual analysis (e.g. t-SNE, UMAP, etc.).
- Implementing the algorithms on the entire competition dataset, and optimising their parameters.
- Carefully analysing and comparing the performance accuracy of such learning algorithms.

- Providing location estimations with some confidence measure.
- Extending the learning algorithms to an online setting, where the user provides continuous samples as she traverses the building.

- Bahl, P. and Padmanabhan, V. N. (2000). RADAR: An in-building RF-based user location and tracking system. In INFOCOM 2000. 19th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 2, pp. 775-784).
- Honkavirta, V., Perala, T., Ali-Loytty, S., and Piché, R. (2009). A comparative survey of WLAN location fingerprinting methods. In Positioning, Navigation and Communication, 2009. WPNC 2009. 6th Workshop on (pp. 243-251). IEEE.
- https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc

- Proof of concept program: tf-idf based similarity measurement
- Proof of concept program: word-embedding based similarity measurement
- (Optional) Proof of concept program: sentence-embedding based similarity measurement.
- A report describing and assessing the results of applying the proof of concept programmes on different sentence pairs.
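The tf-idf based measurement listed above can be sketched in a few lines: each sentence becomes a sparse vector of term frequencies weighted by inverse document frequency, and similarity is the cosine of the angle between vectors. The toy corpus is invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """tf-idf weight each tokenized document against the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat".split(), "the cat ran".split(), "dogs bark loudly".split()]
vecs = tfidf_vectors(docs)
```

The word-embedding based measurement replaces these sparse vectors with dense pretrained embeddings but reuses the same cosine comparison.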

- Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing with problems (i) and (ii) than other models.
- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result by S''); we can assume that S and S'' are similar sentences.

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

- Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
- (Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
- A report describing and assessing the results of applying the proof of concept programmes on data.

- Implement an LSTM-based Encoder-Decoder network and train it on a few hundred thousand real data samples
- Augment the Encoder-Decoder network with the Pointer Mechanism and the Attention Mechanism, and compare their performance against the vanilla Encoder-Decoder
- (Optional) Implement a Transformer-based Encoder-Decoder, and compare its performance against the other models
- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boosts the performance.
- Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
- A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

You will use the tools you built to analyse various meta-data attributes associated with the data (such as time, location, hashtags, etc.) as well as properties of the content (e.g., sentiment score) to establish temporal, spatial, content-related, and other correlations between the real-life event you selected and the relevant social media properties. The report should reflect upon and draw general conclusions regarding the predictive power of the social media platforms in relation to the dynamics and outcomes of real-life events.

Your project must use one or a combination of the large-scale data engineering tools you have studied in CS5234. For example, past projects have used Hadoop/MapReduce to identify tweets relevant to the studied event, extract the attributes of relevance, and assign a sentiment score to the posted statuses using various Natural Language Processing (NLP) techniques, such as [O'Connor, 2010] and/or the Stanford NLP libraries (http://nlp.stanford.edu/).

Data: You will have access to the 300 GB Twitter dataset on the department's cloud, or could use an alternative dataset of your own subject to the supervisor's approval.

Tools: The project must use one or a combination of the large-scale data engineering tools studied in CS5234.

- Precise specification of the project goals as agreed with the advisor along with data collection and analysis tools, and methodologies you plan to use to achieve the specified goals.
- Intermediate report and programs showing and reflecting upon initial findings.

- The software you built to achieve the project goals.
- Final report will discuss techniques and findings, and provide meaningful visualisation of the results.

- Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series" In Proceedings of the International AAAI Conference on Weblogs and Social Media (2010)

There are many approaches to this problem and many different algorithms based on various principles. The classical PageRank algorithm assigns ranks on the basis of links so that pages linked by many others get high ranks. The rank turns out to be the stationary distribution of a Markov chain.

Recently, ranking has been increasingly approached in the context of machine learning. The problem is called "learning to rank", and supervised and semi-supervised methods are applied to it. One possibility is to train regression or classification algorithms on relevance scores provided by human evaluators. A number of parameters called features are calculated using the content of the page and the query; then a learning algorithm is applied to the features. Different measures have been used to analyse the effectiveness of ranking, including Mean Average Precision, Discounted Cumulative Gain, Kendall's rank correlation, Yandex's pFound, etc.
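Discounted Cumulative Gain, one of the measures named above, can be computed in a few lines. This sketch uses the linear-gain variant (gain equal to the relevance grade); another common variant uses 2^rel - 1 as the gain:

```python
import math

def dcg(relevances, k=None):
    """Discounted Cumulative Gain of a ranked list of relevance grades:
    sum of rel_i / log2(i + 1) over positions i = 1..k."""
    rels = relevances[:k] if k else relevances
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(relevances, k=None):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0
```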

The project consists of implementing ranking techniques and analysing their performance on synthetic and real-world data. The efficiency issue is particularly important as the programs should work on large datasets.

Benchmark datasets:

LETOR: Learning to Rank for Information Retrieval http://research.microsoft.com/en-us/um/beijing/projects/letor/

Kaggle Learning to Rank 2019 challenge https://www.kaggle.com/c/learning-to-rank-2019/

Yandex Internet Mathematics 2009 competition dataset (archived) https://web.archive.org/web/20150317144535/http://imat2009.yandex.ru/academic/mathematic/2009/en/

- Proof of concept program: PageRank applied to a small (possibly synthetic) dataset. Its performance and efficiency should be analysed.
- Report: An overview of the problem of document ranking and main approaches to it. An overview of PageRank.

- Two or more different machine learning techniques should be applied to a standard benchmark dataset and their performance compared.
- The performance of the programs should be optimised to enable them to process large amounts of data.
- The report will describe the theory of the underlying learning algorithms.
- The report will describe computational experiments, analyse their results and draw conclusions.

- Comparison of pointwise, pairwise and listwise approaches.
- Use of real Internet dumps such as http://commoncrawl.org/

- Christiane Rousseau, How Google works: Markov chains and eigenvalues, URL http://blog.kleinproject.org/?p=280
- Chuan He, Cong Wang, Yi-xin Zhong, and Rui-Fan Li. A survey on learning to rank. Machine Learning and Cybernetics, 2008 International Conference on. DOI 10.1109/ICMLC.2008.4620685.
- Tie-Yan Liu, Learning to Rank for Information Retrieval, Berlin, Heidelberg: Springer. 2011

This project aims to study different problems related to regression. One is the problem of parameter selection. How do we ensure that Ridge Regression performs at its best? How do we select the right parameters? Or is it best to tune parameters in the on-line mode? Closely related is the problem of overfitting: it is known that in many situations sophisticated techniques that achieve better results in training later behave worse than coarse methods; how do we handle this? The project has the potential to lead to new research results of independent interest.

The project is recommended for Computational Finance students.

- Proof of concept program: Kernel Ridge Regression applied to a small artificial dataset.
- Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for Ridge Regression and defining all terms in them.
- Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.
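A minimal sketch of Kernel Ridge Regression, assuming an RBF kernel: the dual coefficients are alpha = (K + lambda*I)^(-1) y, and a prediction at x is the alpha-weighted sum of kernel values against the training points. The tiny linear solver is included only to keep the sketch self-contained:

```python
import math

def rbf(x, z, sigma=1.0):
    """RBF (Gaussian) kernel between two equal-length tuples."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def kernel_ridge_fit(X, y, lam=0.1, kernel=rbf):
    """Dual coefficients alpha = (K + lam*I)^{-1} y."""
    K = [[kernel(xi, xj) + (lam if i == j else 0.0) for j, xj in enumerate(X)]
         for i, xi in enumerate(X)]
    return solve(K, y)

def kernel_ridge_predict(X, alpha, x, kernel=rbf):
    """Prediction at x: sum_i alpha_i * k(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))
```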

- The program will work with a real dataset such as Boston housing, read the data from a file, preprocess the data (normalise, split the dataset into the test and training sets) and apply Ridge Regression using a kernel with parameters.
- The program will automatically perform tests such as comparison of different kernels, parameters, etc. The results will be visualised.
- The program will work in batch and on-line modes.
- Tests will be performed to compare plain Ridge Regression against other regression-based algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies, or gradient descent.
- The report will describe the theory of Ridge Regression and derive the formulas.
- The report will describe computational experiments, analyse their results and draw conclusions

- Application of Ridge Regression to different datasets (including those suggested by the student).
- Comparison with other learning algorithms (e.g., aggregating algorithm regression, neural networks, gradient descent-based etc).
- Combining regression with methods of prediction with expert advice.

- Y. Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
- J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004
- C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006, http://www.gaussianprocess.org/
- B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

*Satisfiability Modulo Theories* (SMT) is a kind of constraint satisfaction problem that allows solving constraints specified in some fragment of first-order logic. In other words, SMT extends propositional satisfiability (SAT) with numerical variables over a fixed arithmetic theory (in addition to the propositional variables of SAT), as well as quantifiers over such variables. State-of-the-art SMT solvers like Z3 by Microsoft provide very efficient decision procedures for such problems.

Among the numerous applications of SMT, you will focus on *bounded model checking* (BMC), i.e., the problem of verifying whether a given transition system can reach some state within a finite number of steps. In particular, you will use BMC to solve AI planning problems. A planning problem can be expressed as one of BMC, where the objective is to find a path of the AI agent leading to some goal state.

In the first part of the project, you will study the SMT problem and SMT solving techniques, and you will use the Z3 tool and its Python APIs to solve simple example instances. In the second part, you will study the BMC problem and implement code to solve it for generic transition systems using Z3+Python. In the final part, you will extend your program to encode and solve planning problems on 2-dimensional grids with obstacles. Extensions include visualization of path solutions, and planning under cost constraints.
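The question BMC asks is "can the system reach a goal state within k steps?". Purely to illustrate that question on the grid-planning setting (the project itself would encode the transition relation symbolically in Z3 rather than enumerate states), here is an explicit-state, depth-bounded search sketch; the grid layout and move set are invented for the example:

```python
def plan(grid, start, goal, bound):
    """Find a path of at most `bound` moves on a grid, or None if the goal
    is not reachable within the bound; grid[r][c] == 1 marks an obstacle.
    Mirrors the BMC question: is the goal reachable in <= bound steps?"""
    rows, cols = len(grid), len(grid[0])
    frontier, seen = [(start, [start])], {start}
    for _ in range(bound + 1):       # check depths 0..bound
        new = []
        for (r, c), path in frontier:
            if (r, c) == goal:
                return path
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not grid[nr][nc] and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    new.append(((nr, nc), path + [(nr, nc)]))
        frontier = new
    return None
```

In the SMT encoding, the loop over depths becomes a conjunction of k copies of the transition relation over state variables x_0, ..., x_k, handed to the solver in one query.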

- Report: overview of SAT, SMT and SMT solving techniques.
- Report: bounded model checking and encoding into SMT.
- Install z3 and Python APIs. Develop code for SMT solving of a simple example problem (e.g., Sudoku, or job shop scheduling).
- Develop code for BMC of generic transition systems.

- Develop code for encoding a planning problem on a 2-d grid with obstacles as a BMC problem.
- Evaluate your solution with arbitrary and randomly generated planning instances.
- Analyze SMT solver performance for different grid sizes and numbers of obstacles.
- Final dissertation. It should include background on SAT, SMT, BMC, and AI planning, as well as the description of your solution for the encoding of AI planning into SMT, some example solutions obtained with the developed planning method, and performance analysis results.

- Extend the planning problem and your solution with transition costs and cost constraints.
- Develop code to visualize path solutions on the 2d grid.

- De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: introduction and applications." Communications of the ACM 54.9 (2011): 69-77. (available at https://dl.acm.org/doi/abs/10.1145/1995376.1995394)
- De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: An appetizer." Brazilian Symposium on Formal Methods (2009): 23-36. (available at https://link.springer.com/chapter/10.1007/978-3-642-10452-7_3)
- Amla, Nina, et al. "An analysis of SAT-based model checking techniques in an industrial environment." (up to section 2.3 included). (available at https://link.springer.com/chapter/10.1007/11560548_20)
- Lectures on "Introduction to planning" and "Planning formalisms" of the "Autonomous Intelligent Systems" course
- Z3 tool page, https://github.com/Z3Prover/z3
- Programming Z3 tutorial (sections 1,2,3,5), https://theory.stanford.edu/~nikolaj/programmingz3.html
- Jupyter notebooks illustrating Z3's Python APIs (includes code for Programming Z3 tutorial), https://notebooks.azure.com/nbjorner/projects/z3examples
- Online Z3 tutorial (using SMT-lib input format), https://rise4fun.com/z3/tutorial/guide
- Dennis Yurichev, "SAT/SMT by Examples", https://yurichev.com/writings/SAT_SMT_by_example.pdf (a huge collection of maths and CS problems solved with SAT or SMT)

Key value stores are data structures that provide dictionary-style operations for data storage and retrieval. To ensure scalability and resilience to environmental changes, a key value store can be distributed among multiple nodes in a distributed system. In this project, the focus is on implementing peer-to-peer key value stores in an actor model framework as provided by Elixir/Erlang or Cloud Haskell.
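A building block shared by Chord-style stores is consistent hashing: nodes and keys are hashed onto the same identifier ring, and a key is stored on the first node that follows it clockwise. A minimal sketch (the 16-bit identifier space and node names are arbitrary choices for illustration; Chord additionally maintains finger tables for O(log n) routing):

```python
import hashlib

RING = 2 ** 16  # illustrative identifier-space size

def ring_id(name):
    """Hash a string onto the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

def responsible_node(nodes, key):
    """Chord-style lookup: the key is stored on the first node whose
    identifier follows the key's identifier clockwise on the ring."""
    ring = sorted((ring_id(n), n) for n in nodes)
    kid = ring_id(key)
    for nid, name in ring:
        if nid >= kid:
            return name
    return ring[0][1]  # wrap around to the smallest identifier
```

The appeal of this scheme is that adding or removing one node only moves the keys adjacent to it on the ring, which is what makes the store resilient to churn.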

- A report giving an overview of prominent existing key value stores such as Chord and SkipGraph.
- A report describing how existing key value stores deal with environmental changes.
- A proof-of-concept program implementing a simple (possibly inefficient) distributed key value store.

- A distributed key value store that provides standard dictionary operations and is scalable to large data sets.
- A report describing the architecture of the implemented key value store.
- A report on the performed experiments and simulation results.

- Supporting advanced dictionary operations, e.g., range queries.
- More efficient storage of data by using coding techniques.
- Simulating node churn.

- Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
- James Aspnes, Gauri Shah: Skip graphs. ACM Trans. Algorithms 3(4): 37 (2007)

The project is recommended for Computational Finance students.

- You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
- Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
- The report will describe the program (software engineering, algorithms, etc.); discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).
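The historical-simulation method named above can be sketched in a few lines: apply each observed daily return to today's portfolio value and read off the loss quantile. The price series in the test is made up, and the square-root-of-time scaling for an n-day horizon is the usual rough approximation, not the direct-estimation approach:

```python
def historical_var(prices, portfolio_value, confidence=0.99, horizon=1):
    """Historical-simulation Value at Risk: the `confidence` quantile of
    simulated one-day losses, scaled to `horizon` days by sqrt(horizon)."""
    returns = [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]
    losses = sorted(-portfolio_value * r for r in returns)
    idx = min(int(confidence * len(losses)), len(losses) - 1)
    return losses[idx] * horizon ** 0.5
```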

- The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation. Option pricing can be performed using the Black-Scholes formula, or binomial trees (for American options), or a different variety of Monte Carlo simulation (for Asian options).
- Allow an arbitrary number of stocks in the model-building approach (using eigen- or Cholesky decomposition).
- Allow an arbitrary number of stocks using historical simulation.
- Ideally, it will have a graphical user interface.
- The final report will describe the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

- Computing conditional VaR (also known as expected shortfall)
- Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
- Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
- Complement back testing by stress testing.
- Replacing the Gaussian model for stock returns by robust models.
- Computing monetary measures of risk different from VaR.
- Using proper scoring rules for quantiles for evaluating computed VaRs.
- Using methods of worst-case risk aggregation, such as the Rearrangement Algorithm, or worst-case online decision making, such as the Weak Aggregating Algorithm (currently active areas of research).

- John C. Hull. Options, futures and other derivatives, 10th ed., Pearson, Upper Saddle River, NJ, 2018.
- Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time, 3rd ed., de Gruyter, Berlin, 2011.
- Yahoo Finance as source of data: http://finance.yahoo.com/
- http://gloria-mundi.com

This project is to implement and compare a variety of clustering methods on some artificial and real data.

- Proof of concept program: Gaussian mixture model applied to a small artificial dataset.
- A report giving an overview of clustering and describing three different clustering methods.
- A report describing and assessing the results of applying the proof of concept program to data.
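The proof-of-concept Gaussian mixture model is typically fitted with the EM algorithm: alternate soft assignments of points to components (E-step) with weighted parameter re-estimation (M-step). A 1-D sketch with deterministic initialisation (means spread over the data range; the variance floor is an ad hoc safeguard against component collapse):

```python
import math

def gmm_em_1d(data, k=2, iters=50):
    """EM for a k-component 1-D Gaussian mixture; returns weights, means, stds."""
    lo, hi = min(data), max(data)
    mus = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    sigmas = [1.0] * k
    weights = [1.0 / k] * k

    def pdf(x, mu, sigma):
        return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
                / (sigma * math.sqrt(2 * math.pi)))

    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            rj = [r[j] for r in resp]
            nj = sum(rj)
            weights[j] = nj / len(data)
            mus[j] = sum(r * x for r, x in zip(rj, data)) / nj
            var = sum(r * (x - mus[j]) ** 2 for r, x in zip(rj, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
    return weights, mus, sigmas
```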

- The program(s) will work with one or more real datasets.
- You will generate and compare different clusterings or generative models using several different methods, such as: kernel-based semi-supervised clustering, spectral learning, factor analysis, tree-based methods, or other comparable methods.
- You will compare and visualize the clusterings produced.
- The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

- There is a very wide range of clustering and generative-modelling algorithms in standard use, and an even wider variety of methods of fitting them to data. This would be an opportunity to implement several different methods and get to understand them by using them in practice.

- Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
- Machine Learning: A Probabilistic Perspective, by K. Murphy, MIT Press 2012
- S. Basu, A. Banerjee and R. Mooney, "Semi-supervised clustering by seeding", Proceedings of the 19th International Conference on Machine Learning, pp. 19-26, 2002.


- You will implement simple machine learning algorithms such as nearest neighbours and decision trees.
- They will be tested using simple artificial data sets.
- Report: a description of 1-nearest neighbour and k-nearest neighbours algorithms, with different strategies for breaking the ties;
- Report: a description of decision trees using different measures of uniformity.
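The tie-breaking strategies mentioned above can be sketched as follows (Python/NumPy; the strategy names 'reduce_k' and 'nearest' and the tiny dataset are invented for this example):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, tie_break="reduce_k"):
    """k-nearest-neighbours vote with two tie-breaking strategies:
    'reduce_k' drops the furthest neighbour and re-votes until decisive;
    'nearest' falls back on the single nearest neighbour's label."""
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))
    while True:
        votes = Counter(y_train[order[:k]])
        top = votes.most_common()
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]          # a clear winner
        if tie_break == "nearest" or k == 1:
            return y_train[order[0]]  # break the tie with the nearest point
        k -= 1                        # shrink the neighbourhood and re-vote

# Four 1-D training points, two per class.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
```

A query at 1.4 with k = 2 ties one vote each; reducing k to 1 resolves it in favour of the nearer class-0 point.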

- May include nearest neighbours using kernels and multi-class support vector machines.
- The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
- For many of these data sets, judicious preprocessing (including normalisation of attributes or examples) will be essential.
- The performance of all algorithms will be explored using a hold-out test set and cross-validation.
- The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- Ideally, it will have a graphical user interface.
- The report will describe: the theory behind the algorithms.
- The report will describe: the implementation issues necessary to apply the theory.
- The report will describe: the software engineering process involved in generating your software.
- The report will describe: computational experiments with different data sets and parameters.
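The hold-out and cross-validation evaluation mentioned above can be sketched as follows (Python/NumPy, with an invented two-blob dataset; a 1-nearest-neighbour classifier stands in for whichever algorithm is being assessed):

```python
import numpy as np

def one_nn(X_tr, y_tr, X_te):
    """Predict each test point with the label of its nearest training point."""
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    return y_tr[d.argmin(axis=1)]

def k_fold_accuracy(X, y, k=5, seed=0):
    """Shuffle once, split into k folds, and hold each fold out in turn."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append(np.mean(one_nn(X[tr], y[tr], X[te]) == y[te]))
    return float(np.mean(accs))

# Two well-separated Gaussian blobs: accuracy should be near 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
acc = k_fold_accuracy(X, y)
```

Setting k = len(X) turns the same routine into the leave-one-out procedure discussed in the extensions.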

- Modifications of known algorithms.
- Dealing with machine learning problems with asymmetrical errors (such as spam detection) and ROC analysis.
- Nontrivial adaptations of known algorithms to applications in a specific area, such as medical diagnosis or option pricing.
- Exploring the cross-validation procedure, in particular the leave-one-out procedure. What is the optimal number of folds?
- Comparative study of different strategies of reducing multi-class classifiers to binary classifiers (such as one-against-one, one-against-the-rest, coding-based).

- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second edition. Springer, New York, 2009.
- Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
- Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second edition. Springer, New York, 2000.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
- Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu.
- Delve archive. http://www.cs.toronto.edu/~delve.

- You will implement simple machine learning algorithms such as nearest neighbours and decision trees.
- They will be tested using simple artificial data sets.
- Report: a description of 1-nearest neighbour and k-nearest neighbours algorithms, with different strategies for breaking the ties;
- Report: a description of decision trees using different measures of uniformity.

- May include nearest neighbours using kernels and multi-class support vector machines.
- The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
- For many of these data sets, judicious preprocessing (including normalisation of attributes or examples) will be essential.
- The performance of all algorithms will be explored using a hold-out test set and cross-validation.
- The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- Ideally, it will have a graphical user interface.
- The report will describe: the theory behind the algorithms.
- The report will describe: the implementation issues necessary to apply the theory.
- The report will describe: the software engineering process involved in generating your software.
- The report will describe: computational experiments with different data sets and parameters.

- Modifications of known algorithms.
- Dealing with machine learning problems with asymmetrical errors (such as spam detection) and ROC analysis.
- Nontrivial adaptations of known algorithms to applications in a specific area, such as medical diagnosis or option pricing.
- Exploring the cross-validation procedure, in particular the leave-one-out procedure. What is the optimal number of folds?
- Comparative study of different strategies of reducing multi-class classifiers to binary classifiers (such as one-against-one, one-against-the-rest, coding-based).

- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second edition. Springer, New York, 2009.
- Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
- Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second edition. Springer, New York, 2000.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
- Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu.
- Delve archive. http://www.cs.toronto.edu/~delve.

The Burrows-Wheeler Transform is a text transformation scheme that has found applications in various aspects of the data explosion problem, from data compression to indexing. Specifically, the transform is a permutation of an input string which often clusters occurrences of the same letter in the input into 'runs' of the letter. Hence it is useful as a preprocessing step in compression methods like run-length encoding.
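A naive version of the transform can be sketched as follows (Python, for illustration only; the '$' end-of-string marker is the usual convention, and an efficient implementation would use suffix arrays or suffix trees instead of sorting explicit rotations):

```python
def bwt(s, eos="$"):
    """Naive Burrows-Wheeler transform: sort all rotations of s plus an
    end marker and read off the last column."""
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last, eos="$"):
    """Reverse transform: repeatedly prepend the last column and re-sort;
    after len(last) rounds the table holds all rotations of the original."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eos))[:-1]
```

For example, "banana" transforms to "annb$aa", in which the three a's have been gathered into a run; the proof that the reverse transform recovers the input is one of the report deliverables below.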

A string of letters over an alphabet is a Lyndon word if it is the least in lexicographic (dictionary) order amongst the set of its rotations. For example, "data" is not a Lyndon word since it is not lexicographically least in the set {data, atad, tada, adat} whereas "adat" is a Lyndon word. Similarly, "computer" is a Lyndon word while "science" is not. An important technique for handling string computations is to split an input string into substrings, each with a common pattern. One such method is the Lyndon factorisation, where each substring is a Lyndon word. Efficiencies are achieved with the Burrows-Wheeler transform, by first factoring the input string into Lyndon words, and then applying the transform. It is also known that representing a string as a suffix tree can improve the speed of the Burrows-Wheeler transform.
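The factorisation described above can be computed in linear time with Duval's algorithm, sketched here in Python for illustration:

```python
def lyndon_factorisation(s):
    """Duval's algorithm: factor s into a lexicographically non-increasing
    sequence of Lyndon words in O(n) time."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current candidate factor while it stays Lyndon-periodic.
        while j < n and s[k] <= s[j]:
            k = k + 1 if s[k] == s[j] else i
            j += 1
        # Emit whole periods of length j - k as Lyndon factors.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For example, "banana" factors into b >= an >= an >= a, and "adat" is a single factor, consistent with it being a Lyndon word.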

In this project you will study lossy and lossless compression techniques. You will study relevant Shannon information theory such as entropy encoding, and some of the classic compression algorithms such as Huffman coding including their complexities. You will implement lossless compression algorithms applied to textual data, and preprocessing techniques.

Benchmarking will be performed in order to compare compression ratios between different encoding methods, and with and without preprocessing. You will develop a GUI suitable for importing text files; the GUI will incorporate HCI (human-computer interaction) concepts.
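As a taste of the encoders to be benchmarked, move-to-front and run-length encoding can be sketched as follows (Python, for illustration; note how a run of equal symbols becomes a run of zeros under move-to-front, which run-length encoding then compresses well):

```python
def mtf_encode(s, alphabet):
    """Move-to-front: emit each symbol's current position in the table,
    then move that symbol to the front of the table."""
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def rle_encode(xs):
    """Run-length encoding: collapse runs into [value, run length] pairs."""
    out = []
    for x in xs:
        if out and out[-1][0] == x:
            out[-1][1] += 1
        else:
            out.append([x, 1])
    return out
```

Chaining the two ("aaaabbbb" encodes to [0, 0, 0, 0, 1, 0, 0, 0], then to three short pairs) illustrates why a run-clustering preprocessing step such as the Burrows-Wheeler transform improves compression ratios.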

- Report on classic lossless algorithms, with implementations of run-length encoding, move-to-front encoding and Huffman encoding.
- Program: GUI built with buttons etc.
- Program: Efficient Suffix Tree implementation
- Program: Naive implementation of Burrows-Wheeler
- Report: How the Burrows-Wheeler transform works, with an emphasis on writing and understanding a proof that the reverse transform works.
- Report: Shannon information theory and lossy vs lossless compression; complexity theory; pseudo-code of algorithms implemented.
- Report: Complexity, NP-hardness and big-O notation.
- Report: The preprocessing phase for compression.
- Program: Implement an efficient Burrows-Wheeler Transform using Suffix Trees.
- Program: Apply the Lyndon factorisation to an input string, and apply the Burrows-Wheeler transform to each of the Lyndon factors.

- The program will have a Graphical User Interface with which to input text files and display compression ratios for compressed files.
- The Lyndon factorisation and Burrows-Wheeler transformation of a string will be displayed.
- The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- The report will describe the software engineering process involved in generating your software, interesting programming techniques and data structures used (or able to be used) on the project.
- The report will describe the complexity analysis of your computations.
- The report will include some results of benchmarking compression algorithms with a discussion of their meanings.
- The report will include discussion of applications of the Burrows-Wheeler transform and Lyndon factorisation.
- The report will include a User Manual.

- Build a mobile Burrows-Wheeler application: this will require use of Worker Threads.
- Show the output using an effective JTree with callbacks to enable user interaction.
- Animate the Burrows-Wheeler in order to help explain its working to a student unfamiliar with the algorithm.
- Adding Digital Signatures to compressed files.

- D. Adjeroh, T. Bell, A. Mukherjee, The Burrows-Wheeler transform: data compression, suffix arrays, and pattern matching, Springer (2008) 352 pp.
- M. Burrows and D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation (1994).
- M. Salson, T. Lecroq, M. Leonard and L. Mouchard, A four-stage algorithm for updating a Burrows-Wheeler transform, Theoret. Comput. Sci. 410 (43) (2009) 4350-4359 doi:10.1016/j.tcs.2009.07.016.
- http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
- M. Lothaire, Combinatorics on Words, Addison-Wesley, Reading, MA, 1983; 2nd Edition, Cambridge University Press, Cambridge, 1997.
- J. P. Duval, Factorising words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.
- http://en.wikipedia.org/wiki/Data_compression

The Burrows-Wheeler Transform is a text transformation scheme that has found applications in various aspects of the data explosion problem, from data compression to indexing. Specifically, the transform is a permutation of an input string which often clusters occurrences of the same letter in the input into 'runs' of the letter. Hence it is useful as a preprocessing step in compression methods like run-length encoding.

A string of letters over an alphabet is a Lyndon word if it is the least in lexicographic (dictionary) order amongst the set of its rotations. For example, "data" is not a Lyndon word since it is not lexicographically least in the set {data, atad, tada, adat} whereas "adat" is a Lyndon word. Similarly, "computer" is a Lyndon word while "science" is not. An important technique for handling string computations is to split an input string into substrings, each with a common pattern. One such method is the Lyndon factorisation, where each substring is a Lyndon word. Efficiencies are achieved with the Burrows-Wheeler transform, by first factoring the input string into Lyndon words, and then applying the transform. It is also known that representing a string as a suffix tree can improve the speed of the Burrows-Wheeler transform.

In this project you will study lossy and lossless compression techniques. You will study relevant Shannon information theory such as entropy encoding, and some of the classic compression algorithms such as Huffman coding including their complexities. You will implement lossless compression algorithms applied to textual data, and preprocessing techniques.

Benchmarking will be performed in order to compare compression ratios between different encoding methods, and with and without preprocessing. You will develop a GUI suitable for importing text files; the GUI will incorporate HCI (human-computer interaction) concepts.

- Report on classic lossless algorithms, with implementations of run-length encoding, move-to-front encoding and Huffman encoding.
- Program: GUI built with buttons etc.
- Program: Efficient Suffix Tree implementation
- Program: Naive implementation of Burrows-Wheeler
- Report: How the Burrows-Wheeler transform works, with an emphasis on writing and understanding a proof that the reverse transform works.
- Report: Shannon information theory and lossy vs lossless compression; complexity theory; pseudo-code of algorithms implemented.
- Report: Complexity, NP-hardness and big-O notation.
- Report: The preprocessing phase for compression.
- Program: Implement an efficient Burrows-Wheeler Transform using Suffix Trees.
- Program: Apply the Lyndon factorisation to an input string, and apply the Burrows-Wheeler transform to each of the Lyndon factors.

- The program will have a Graphical User Interface with which to input text files and display compression ratios for compressed files.
- The Lyndon factorisation and Burrows-Wheeler transformation of a string will be displayed.
- The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- The report will describe the software engineering process involved in generating your software, interesting programming techniques and data structures used (or able to be used) on the project.
- The report will describe the complexity analysis of your computations.
- The report will include some results of benchmarking compression algorithms with a discussion of their meanings.
- The report will include discussion of applications of the Burrows-Wheeler transform and Lyndon factorisation.
- The report will include a User Manual.

- Build a mobile Burrows-Wheeler application: this will require use of Worker Threads.
- Show the output using an effective JTree with callbacks to enable user interaction.
- Animate the Burrows-Wheeler in order to help explain its working to a student unfamiliar with the algorithm.
- Adding Digital Signatures to compressed files.

- D. Adjeroh, T. Bell, A. Mukherjee, The Burrows-Wheeler transform: data compression, suffix arrays, and pattern matching, Springer (2008) 352 pp.
- M. Burrows and D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation (1994).
- M. Salson, T. Lecroq, M. Leonard and L. Mouchard, A four-stage algorithm for updating a Burrows-Wheeler transform, Theoret. Comput. Sci. 410 (43) (2009) 4350-4359 doi:10.1016/j.tcs.2009.07.016.
- http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
- M. Lothaire, Combinatorics on Words, Addison-Wesley, Reading, MA, 1983; 2nd Edition, Cambridge University Press, Cambridge, 1997.
- J. P. Duval, Factorising words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.
- http://en.wikipedia.org/wiki/Data_compression

This set of projects focuses on the specification and implementation of syntax and semantics for programming language processors. Interestingly, a variety of algorithms have been developed which allow these tools to be generated automatically from appropriate specifications. These techniques are very important because nearly all of our software is written in high level languages, and nearly all high level languages are implemented using these ideas.

Technologies used by programming language processors include: the use of context free grammars to specify syntax; the use of lexical analysis and parsing algorithms to convert a user's program into an internal tree form; the use of term rewriting to modify those trees; and the use of attribute grammars and structural operational semantics to specify interpretations of that syntax, including code generation and optimisation.

To do well on these projects you need to be interested in programming languages, have good programming skills and be comfortable with the kind of mathematics that you encountered in CS1870. You will write your code in Java, and make use of tools developed within the department such as RDP and ART. In the first part of this project you will learn about fundamental topics in this area including syntax specification, lexical analysis, parsing and simple interpretation. You will then move on to a more detailed study of one area. The first term deliverables are described in the early deliverables section. In the second part of the project (mostly the second term) you will focus on the design and implementation of tooling for programming language processing. Your detailed topic must be agreed with your supervisor; suggestions include the following.

- A tool that takes a set of regular expressions and uses Thompson's construction and the subset construction to produce a lexical analyser for the set of expressions.
- A compiler for a simple Java subset language capable of executing small programs.
- An interpreter for a small language using either attribute grammars or structural operational semantics.
- A parser generator using either the LR or recursive descent algorithms.
- An implementation of a context free term rewriting system.

- Lexical analysers are typically based on finite state automata of the kind discussed in CS1870. The algorithms are well known but implementation details are important in the development of an efficient tool, and you will be expected to show how such optimisations can be used to improve performance.
- A minimal language might comprise an integer-only subset of Java supporting basic control flow. Once this has been demonstrated, it could then be extended to support further control flow constructs such as switch, break and procedure call, as well as extensions to the type system, especially arrays.
- Language interpretation can be viewed as adding actions to BNF specifications such that when a parser recognises a sub-phrase of the language, it can immediately execute those actions. Attribute grammars work by embedding actions within the grammar and providing attributes that allow information to flow up and down the trees constructed by parsers. Structural Operational Semantics take a different approach in which actions are triggered during the `reduction' of a program to some minimal form.
- A variety of parsing algorithms have been developed. Two of the most commonly used are LR parsing and Recursive Descent parsing. A parser generator takes a specification for a language L written in BNF (for which you will already have a parser from your term 1 work) and then adds actions that cause it to emit a program that when compiled, will act as a parser for L. Your task is to understand your chosen parsing algorithm, and then write a parser generator for that algorithm. Interestingly, a correctly working parser generator can be `bootstrapped', that is it can be used to process its own specification.
- A term rewriting system rewrites expressions. For instance, we know that in Boolean algebra, the logical AND of FALSE and any value is false. We can express this as FALSE AND x rewrites to FALSE. We also know that the logical AND of true and some value is just that value: TRUE AND x rewrites to x. A term rewriter takes a collection of these rewrites representing (in this case) Boolean identities, and attempts to simplify expressions by performing these substitutions. In this project, you will build a system which will explore rewrites over terms and perform useful computations.
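The Boolean rewrites described above can be prototyped in a few lines (Python, chosen here only for brevity; terms are modelled as nested tuples, and lowercase strings act as pattern variables, a representation invented for this sketch):

```python
def match(pattern, term, env):
    """Try to bind pattern variables (lowercase strings) to sub-terms."""
    if isinstance(pattern, str) and pattern.islower():
        if pattern in env:
            return env[pattern] == term   # a variable must bind consistently
        env[pattern] = term
        return True
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and len(pattern) == len(term)):
        return all(match(p, t, env) for p, t in zip(pattern, term))
    return pattern == term                # constants match themselves

def substitute(template, env):
    """Instantiate a rule's right-hand side with the bindings found."""
    if isinstance(template, str) and template.islower():
        return env[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return template

def rewrite(term, rules):
    """Rewrite bottom-up, applying the first matching rule until no rule
    applies, i.e. the term reaches a normal form."""
    if isinstance(term, tuple):
        term = tuple(rewrite(t, rules) for t in term)
    for lhs, rhs in rules:
        env = {}
        if match(lhs, term, env):
            return rewrite(substitute(rhs, env), rules)
    return term

# The two Boolean identities from the text.
RULES = [(("AND", "FALSE", "x"), "FALSE"),
         (("AND", "TRUE", "x"), "x")]
```

With these rules, the nested term TRUE AND (FALSE AND p) simplifies all the way to FALSE, while TRUE AND p simplifies to the free variable p.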

- A report on the use of context free grammars to specify syntax, along with manual procedures for generating and parsing languages.
- A report on the use of a parsing tool to produce derivations, and the use of grammar idioms to capture notions of associativity and priority in arithmetic expressions.
- The development of an interpreter for a language that models a four function calculator with memory.
- The development of a pretty printer for BNF specifications, demonstrating that programs can process their own source code.
- A technical report covering all of the above work, including a bibliography.
- A Term 2 project schedule with a list of deliverables and dates.
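The four-function calculator with memory mentioned above can be sketched as a recursive-descent interpreter (in Python rather than Java, purely for brevity; the exact grammar in the docstring is an assumption about what the deliverable requires):

```python
import re

class Calculator:
    """Recursive-descent interpreter for a four-function calculator with
    named memory cells, using the grammar
        stmt   ::= NAME '=' expr | expr
        expr   ::= term (('+'|'-') term)*
        term   ::= factor (('*'|'/') factor)*
        factor ::= NUMBER | NAME | '(' expr ')'
    """
    TOKEN = re.compile(r"\s*(\d+\.?\d*|[A-Za-z]\w*|[()+\-*/=])")

    def __init__(self):
        self.memory = {}

    def run(self, line):
        self.tokens = self.TOKEN.findall(line)
        self.pos = 0
        # An assignment statement: NAME '=' expr
        if (len(self.tokens) > 1 and self.tokens[1] == "="
                and self.tokens[0][0].isalpha()):
            name = self.tokens[0]
            self.pos = 2
            self.memory[name] = self.expr()
            return self.memory[name]
        return self.expr()

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.next() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.next() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        tok = self.next()
        if tok == "(":
            value = self.expr()
            self.next()  # consume the closing ')'
            return value
        if tok[0].isalpha():
            return self.memory[tok]  # recall a memory cell
        return float(tok)

calc = Calculator()
calc.run("m = 2 + 3 * 4")  # stores 14.0 in memory cell m
```

Each grammar rule becomes one method, so operator priority (term binds tighter than expr) and associativity fall straight out of the grammar idioms mentioned in the early deliverables.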

- A project report containing a description of all the research you have carried out and the details of your second term implementation.
- An implementation of your chosen tool; see the suggestions in the Background section.

This set of projects focuses on the specification and implementation of syntax and semantics for programming language processors. Interestingly, a variety of algorithms have been developed which allow these tools to be generated automatically from appropriate specifications. These techniques are very important because nearly all of our software is written in high level languages, and nearly all high level languages are implemented using these ideas.

Technologies used by programming language processors include: the use of context free grammars to specify syntax; the use of lexical analysis and parsing algorithms to convert a user's program into an internal tree form; the use of term rewriting to modify those trees; and the use of attribute grammars and structural operational semantics to specify interpretations of that syntax, including code generation and optimisation.

To do well on these projects you need to be interested in programming languages, have good programming skills and be comfortable with the kind of mathematics that you encountered in CS1870. You will write your code in Java, and make use of tools developed within the department such as RDP and ART. In the first part of this project you will learn about fundamental topics in this area including syntax specification, lexical analysis, parsing and simple interpretation. You will then move on to a more detailed study of one area. The first term deliverables are described in the early deliverables section. In the second part of the project (mostly the second term) you will focus on the design and implementation of tooling for programming language processing. Your detailed topic must be agreed with your supervisor; suggestions include the following.

- A tool that takes a set of regular expressions and uses Thompson's construction and the subset construction to produce a lexical analyser for the set of expressions.
- A compiler for a simple Java subset language capable of executing small programs.
- An interpreter for a small language using either attribute grammars or structural operational semantics.
- A parser generator using either the LR or recursive descent algorithms.
- An implementation of a context free term rewriting system.

- Lexical analysers are typically based on finite state automata of the kind discussed in CS1870. The algorithms are well known but implementation details are important in the development of an efficient tool, and you will be expected to show how such optimisations can be used to improve performance.
- A minimal language might comprise an integer-only subset of Java supporting basic control flow. Once this has been demonstrated, it could then be extended to support further control flow constructs such as switch, break and procedure call, as well as extensions to the type system, especially arrays.
- Language interpretation can be viewed as adding actions to BNF specifications such that when a parser recognises a sub-phrase of the language, it can immediately execute those actions. Attribute grammars work by embedding actions within the grammar and providing attributes that allow information to flow up and down the trees constructed by parsers. Structural Operational Semantics take a different approach in which actions are triggered during the `reduction' of a program to some minimal form.
- A variety of parsing algorithms have been developed. Two of the most commonly used are LR parsing and Recursive Descent parsing. A parser generator takes a specification for a language L written in BNF (for which you will already have a parser from your term 1 work) and then adds actions that cause it to emit a program that when compiled, will act as a parser for L. Your task is to understand your chosen parsing algorithm, and then write a parser generator for that algorithm. Interestingly, a correctly working parser generator can be `bootstrapped', that is it can be used to process its own specification.
- A term rewriting system rewrites expressions. For instance, we know that in Boolean algebra, the logical AND of FALSE and any value is false. We can express this as FALSE AND x rewrites to FALSE. We also know that the logical AND of true and some value is just that value: TRUE AND x rewrites to x. A term rewriter takes a collection of these rewrites representing (in this case) Boolean identities, and attempts to simplify expressions by performing these substitutions. In this project, you will build a system which will explore rewrites over terms and perform useful computations.

- A report on the use of context free grammars to specify syntax, along with manual procedures for generating and parsing languages.
- A report on the use of a parsing tool to produce derivations, and the use of grammar idioms to capture notions of associativity and priority in arithmetic expressions.
- The development of an interpreter for a language that models a four function calculator with memory.
- The development of a pretty printer for BNF specifications, demonstrating that programs can process their own source code.
- A technical report covering all of the above work, including a bibliography.
- A Term 2 project schedule with a list of deliverables and dates.

- A project report containing a description of all the research you have carried out and the details of your second term implementation.
- An implementation of your chosen tool; see the suggestions in the Background section.

- Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
- A basic background in image processing techniques or OpenCV is an advantage.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in automated camera systems.
- A clear workplan describing the set of tests to be performed, the tools to be implemented, and the classes of techniques to be proposed and studied.

- Design and implement a framework for development of new and existing physical security techniques
- Describe in detail the design of tests for the framework
- Describe and analyse the features that have been used during the project to detect/stop threats
- Critically compare the findings with related works

- OpenCV: https://opencv.org/
- TensorFlow: https://www.tensorflow.org/
- Computer Vision: A Modern Approach

Machine Learning algorithms have been successfully used in medicine, the pharmaceutical industry and many other fields to handle large sets of data in order to find important patterns as well as outliers and anomalies.

Recently developed conformal predictors can be used to narrow down the search space and to improve accuracy of classification.
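A minimal inductive conformal predictor can be sketched as follows (Python/NumPy; the 1-nearest-neighbour nonconformity score and the toy dataset are choices made purely for this illustration):

```python
import numpy as np

def icp_p_values(X_prop, y_prop, X_cal, y_cal, x_new, labels):
    """Inductive conformal prediction: an example is nonconforming if it
    is far from its own class in the proper training set; the p-value of
    a candidate label compares its score with the calibration scores."""
    def score(x, y):
        same = X_prop[y_prop == y]
        return np.linalg.norm(same - x, axis=1).min()

    cal = np.array([score(x, y) for x, y in zip(X_cal, y_cal)])
    p = {}
    for label in labels:
        a = score(x_new, label)
        p[label] = (np.sum(cal >= a) + 1) / (len(cal) + 1)
    return p

# Toy 1-D data: classes 0 and 1 live around 0-2 and 10-12 respectively.
X_prop = np.array([[0.0], [2.0], [10.0], [12.0]])
y_prop = np.array([0, 0, 1, 1])
X_cal = np.array([[1.0], [3.0], [9.0], [13.0]])
y_cal = np.array([0, 0, 1, 1])
p = icp_p_values(X_prop, y_prop, X_cal, y_cal, np.array([1.0]), [0, 1])
```

For a test point inside class 0, the p-value for label 0 is large and the p-value for label 1 is small; the set of labels whose p-value exceeds the chosen significance level forms the prediction region that narrows down the search space.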

- A review and a report of the approach in the case of classification and regression.
- A design of the system and its extension for conformal predictors.
- Implementation of a pilot system and some preliminary experiments.

- Implementation of the system, experiments with various parameters.
- Experimenting with different underlying ML algorithms.
- It is desirable that the final system will have a graphical user interface.
- The final report will describe the theory, the implementation issues, computational experiments with different data sets and parameters.

- Online learning with conformal predictors.

- V.Vovk, A.Gammerman, G.Shafer. "Algorithmic Learning in a Random World", Springer, 2005.

- Find and choose a data set
- Implement a relevant version of the Conformal Prediction algorithm
- Introduce some mistakes into the data, apply the algorithm, and observe the results.
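The mistake-detection idea behind these deliverables can be sketched as follows (Python/NumPy; the leave-one-out conformal p-value, the 1-NN nonconformity score and the toy dataset are all choices made for this illustration):

```python
import numpy as np

def label_p_values(X, y):
    """For each example, a leave-one-out conformal p-value of its *given*
    label, using distance to the nearest other example with the same label
    as the nonconformity score; mislabelled points get low p-values."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        mask = (np.arange(n) != i) & (y == y[i])
        scores[i] = np.linalg.norm(X[mask] - X[i], axis=1).min()
    return np.array([np.sum(scores >= scores[i]) / n for i in range(n)])

# Two tight clusters; the last label is flipped to simulate a mistake.
X = np.array([[0.0], [1.0], [2.0], [50.0], [51.0], [52.0]])
y = np.array([0, 0, 0, 1, 1, 0])   # the last label should be 1
p = label_p_values(X, y)
```

The flipped example is far from every other member of its stated class, so its p-value is by far the smallest, which is exactly the signal a data-correction system would act on.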

- Implement Conformal Prediction with an additional error-detection function.
- Compare different versions of Conformal Prediction by their efficiency for data correction.
- Analyse and report the output.

- Propose, justify theoretically, and implement a scheme of multiple error correction.
- Compare how the approach works in different (on-line, batch, cross-validation, inductive) modes.

- Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom

- Find and choose a data set
- Implement a relevant version of the Conformal Prediction algorithm
- Introduce some mistakes into the data, apply the algorithm, and observe the results.

- Implement Conformal Prediction with an additional error-detection function.
- Compare different versions of Conformal Prediction by their efficiency for data correction.
- Analyse and report the output.

- Propose, justify theoretically, and implement a scheme of multiple error correction.
- Compare how the approach works in different (on-line, batch, cross-validation, inductive) modes.

- Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom

- Study data sets and identify one relevant for the project, describe it and the structure of its label space.
- Select a basic machine learning approach applicable to this data.
- Develop and implement a multi-class version of this approach and get some initial results on the data.

- Survey of the data and machine learning techniques.
- Develop conformity score functions for data with a structured label space.
- Implement a conformal prediction algorithm applicable to the data.
- Analysis of the output.

- Compare results with different basic machine learning approaches.
- Compare the original label-space structure attached to the data against an automatic structuring obtained by data analysis.

- T. Joachims, T. Hofmann, Y. Yue and C.N. Yu. Predicting structured objects with support vector machines. Communications of the ACM 52 (11), 97-104.

- Study data sets and identify one relevant for the project, describe it and the structure of its label space.
- Select a basic machine learning approach applicable to this data.
- Develop and implement a multi-class version of this approach and get some initial results on the data.

- Survey of the data and machine learning techniques.
- Develop conformity score functions for data with a structured label space.
- Implement a conformal prediction algorithm applicable to the data.
- Analysis of the output.

- Compare results with different basic machine learning approaches.
- Compare the original label-space structure attached to the data against an automatic structuring obtained by data analysis.

- T. Joachims, T. Hofmann, Y. Yue and C.N. Yu. Predicting structured objects with support vector machines. Communications of the ACM 52 (11), 97-104.

- You will implement simple inductive conformal predictors and run them on small artificial data sets.
- They will be based both on underlying algorithms implemented by you (such as nearest neighbours)
- ... and on advanced machine learning algorithms included in available libraries (such as gradient boosting).
- You will produce a report containing descriptions of algorithms and basics of the theory.

- You will implement transductive conformal predictors based on the method of k nearest neighbours; transductive conformal predictors are less computationally efficient but have better accuracy.
- You will implement cross-conformal predictors, which combine the accuracy of transductive conformal predictors and the computational efficiency of inductive conformal predictors.
- The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
- For many of these data sets, judicious preprocessing (including normalisation of attributes or examples) will be essential.
- Two aspects of the performance of the algorithms will be explored: their validity and efficiency.
- The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- Ideally, it will have a graphical user interface.
- The report will describe: the theory behind the algorithms.
- The report will describe: the implementation issues necessary to apply the theory.
- The report will describe: the software engineering process involved in generating your software.
- The report will describe: computational experiments with different data sets, underlying algorithms, and parameters.
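The cross-conformal construction mentioned above can be sketched in a few lines (the distance-to-nearest-same-label nonconformity score and the fold handling below are illustrative; the counting scheme follows Vovk's cross-conformal predictors):

```python
def nearest_same(x, y, pool):
    """Nonconformity: distance to the nearest training point with label y."""
    ds = [abs(x - xi) for xi, yi in pool if yi == y]
    return min(ds) if ds else float("inf")

def cross_conformal_p(data, x_new, labels, k=5):
    """Cross-conformal p-values: counts are accumulated over k folds,
    each fold serving once as the calibration set while the remaining
    folds form the proper training set."""
    n = len(data)
    folds = [data[i::k] for i in range(k)]
    p = {}
    for y in labels:
        total = 0
        for j in range(k):
            calib = folds[j]
            train = [z for i, f in enumerate(folds) if i != j for z in f]
            a_new = nearest_same(x_new, y, train)
            total += sum(nearest_same(xc, yc, train) >= a_new
                         for xc, yc in calib)
        p[y] = (total + 1) / (n + 1)
    return p
```

Unlike the inductive version, every training example contributes to calibration exactly once, which is the source of the improved efficiency.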

- An empirical study of the performance of the method of cross-validation for different numbers of folds, extended to the method of cross-conformal prediction.
- Exploring the validity of conformal predictors conditional on the training set.
- Studying connections between probability-type inductive conformal predictors and ROC curves.

- Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371-421, 2008. http://arxiv.org/abs/0706.3188
- Alex Gammerman and Vladimir Vovk. Hedging predictions in machine learning (with discussion). The Computer Journal, 50:151-177, 2007. http://arxiv.org/abs/cs/0611011
- Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
- Vladimir Vovk. Cross-conformal predictors. http://arxiv.org/abs/1208.0806
- Vladimir Vovk. Inductive conformal predictors in the batch mode. http://arxiv.org/abs/1209.2673
- Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu
- Delve archive. http://www.cs.toronto.edu/~delve

Non-speech body sound recognition has not received major attention, despite such sounds containing valuable information about a person's health condition.

This project aims to detect the two most common infectious events, coughs and sneezes, from raw recordings.

The challenge for this objective is the variety of coughs and sneezes among other non-speech body sounds. Hence, the confidence measures provided by conformal prediction would be useful in this scenario.

This project is recommended for CS5920 students who have successfully implemented conformal prediction in the first assignment and would like to study its performance on real-world problems further.

- Producing a simple proof-of-concept implementation of classification-based conformal prediction with different underlying algorithms (e.g. SVM, k-NN, Random Forests, etc.) from scratch.
- A report of the empirical results of such conformal predictors on a small synthetic dataset.

- A detailed literature review of existing machine learning algorithms used for non-speech sound recognition.
- Producing a detailed report of the dataset structure, including some visual analysis (e.g. t-SNE, UMAP, etc.).
- Implementing conformal predictors on the whole dataset, optimising the underlying algorithms' parameters.
- Carefully analysing the p-values, comparing the validity, efficiency, and accuracy of the implemented conformal predictors.

- Comparing different methods of extracting sound features (e.g. MFCC, CFA, ZCR) from raw recordings, and their impact on prediction performance.
- Using the entire dataset to train a new model, then testing this model on other real-world recordings.
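To give a flavour of the simplest of the feature types mentioned above, a zero-crossing rate (ZCR) can be computed over fixed-size frames with plain Python (the frame length and the toy signals are arbitrary illustrations; MFCC extraction would normally use a dedicated audio library):

```python
def zero_crossing_rate(signal, frame_len=256):
    """Per-frame fraction of consecutive sample pairs whose signs differ.
    High-pitched or noisy sounds yield higher rates than low-pitched ones."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
        rates.append(crossings / (frame_len - 1))
    return rates
```

For example, a 2 kHz tone sampled at 8 kHz produces a markedly higher ZCR than a 440 Hz tone at the same sampling rate.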

- Nguyen, K. A. and Luo, Z. (2018). Cover Your Cough: Detection of Respiratory Events with Confidence Using a Smartwatch. Proceedings of Machine Learning Research, 91, 1-18.
- Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 1015-1018). ACM.
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YDEPUT

Non-speech body sound recognition has not received major attention, despite such sounds containing valuable information about a person's health condition.

This project aims to detect the two most common infectious events, coughs and sneezes, from raw recordings.

The challenge for this objective is the variety of coughs and sneezes among other non-speech body sounds. Hence, the confidence measures provided by conformal prediction would be useful in this scenario.

This project is recommended for CS5920 students who have successfully implemented conformal prediction in the first assignment and would like to study its performance on real-world problems further.

- Producing a simple proof-of-concept implementation of classification-based conformal prediction with different underlying algorithms (e.g. SVM, k-NN, Random Forests, etc.) from scratch.
- A report of the empirical results of such conformal predictors on a small synthetic dataset.

- A detailed literature review of existing machine learning algorithms used for non-speech sound recognition.
- Producing a detailed report of the dataset structure, including some visual analysis (e.g. t-SNE, UMAP, etc.).
- Implementing conformal predictors on the whole dataset, optimising the underlying algorithms' parameters.
- Carefully analysing the p-values, comparing the validity, efficiency, and accuracy of the implemented conformal predictors.

- Comparing different methods of extracting sound features (e.g. MFCC, CFA, ZCR) from raw recordings, and their impact on prediction performance.
- Using the entire dataset to train a new model, then testing this model on other real-world recordings.

- Nguyen, K. A. and Luo, Z. (2018). Cover Your Cough: Detection of Respiratory Events with Confidence Using a Smartwatch. Proceedings of Machine Learning Research, 91, 1-18.
- Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 1015-1018). ACM.
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YDEPUT

An important task in Bioinformatics is identifying whether a protein sequence is part of a particular protein family. This is typically carried out using tools such as Hidden Markov Models (HMMs).

The difficulty with such models is that, while they produce a probabilistic likelihood of a hit, the probabilities are based on a number of modelled parameters. Typically, such probabilities generate incredibly small p-values, and yet some suspicion remains that these methods are not as robust as the results suggest.

Conformal predictors suggest an alternative to estimating such probabilities, moving away from the paradigm of ROC curves and attempting to create probabilities based on the models themselves.

- A report on Hidden Markov Model (HMM) tools in Bioinformatics.
- A report on conformal predictors.
- A simple implementation of conformal prediction on a family of proteins from PFAM.
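A natural nonconformity score for this setting is the (log-)likelihood an HMM assigns to a sequence. A minimal forward-algorithm sketch for a toy discrete HMM is shown below; the two states, the symbols, and all probabilities are purely illustrative stand-ins, not a Pfam profile HMM, and real sequences would need log-space scaling:

```python
import math

def forward_log_likelihood(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm for a discrete HMM: log P(observation sequence).
    alpha[s] holds P(observations so far, current state = s)."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return math.log(sum(alpha.values()))

# Toy two-state model: "M" (match) mostly emits A, "I" (insert) mostly B.
STATES = ("M", "I")
START = {"M": 0.6, "I": 0.4}
TRANS = {"M": {"M": 0.7, "I": 0.3}, "I": {"M": 0.4, "I": 0.6}}
EMIT = {"M": {"A": 0.9, "B": 0.1}, "I": {"A": 0.2, "B": 0.8}}
```

Sequences that fit the model (here, A-rich ones) receive a higher log-likelihood, which could then be ranked against calibration sequences to form conformal p-values.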

- A framework to evaluate the credibility and confidence of individual HMMs from PFAM.

- Using conformal predictors on HMM models in protein sequence prediction.

- Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L., Eddy, S. R., and Bateman, A. The Pfam protein families database. doi: 10.1093/nar/gkp985.
- http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

Students must have attended CS5100 and be likely to get a distinction in the module.

The student must have an excellent grasp of conformal predictors, Python, and Unix systems; on the latter, they must be able to build simple workflows of Bioinformatics tools. The student must speak with the advisor in question before applying for this project.

An important task in Bioinformatics is identifying whether a protein sequence is part of a particular protein family. This is typically carried out using tools such as Hidden Markov Models (HMMs).

The difficulty with such models is that, while they produce a probabilistic likelihood of a hit, the probabilities are based on a number of modelled parameters. Typically, such probabilities generate incredibly small p-values, and yet some suspicion remains that these methods are not as robust as the results suggest.

Conformal predictors suggest an alternative to estimating such probabilities, moving away from the paradigm of ROC curves and attempting to create probabilities based on the models themselves.

- A report on Hidden Markov Model (HMM) tools in Bioinformatics.
- A report on conformal predictors.
- A simple implementation of conformal prediction on a family of proteins from PFAM.

- A framework to evaluate the credibility and confidence of individual HMMs from PFAM.

- Using conformal predictors on HMM models in protein sequence prediction.

- Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L., Eddy, S. R., and Bateman, A. The Pfam protein families database. doi: 10.1093/nar/gkp985.
- http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

Students must have attended CS5100 and be likely to get a distinction in the module.

The student must have an excellent grasp of conformal predictors, Python, and Unix systems; on the latter, they must be able to build simple workflows of Bioinformatics tools. The student must speak with the advisor in question before applying for this project.

- Familiarisation with the idea of strategies and the identification of a set of them.
- Implementation of strategies and the definition of an interface that allows agents to be selected and play against each other.
- Definition of a tournament.
- Report on the design of a tournament.
- Implementation of a tournament, where a user selects the number of agents, associates them with existing strategies, specifies the number of iterations, and deploys a tournament.
- Report on the Iterated Prisoner's Dilemma problem.
- Report on previous work on the Iterated Prisoner's Dilemma.
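The core of such a simulator is small: strategies can be modelled as functions of the match history, and a tournament as a round-robin over all pairs. The sketch below (payoff values, strategy names, and round count are illustrative choices) shows the idea:

```python
# Standard prisoner's dilemma payoffs: (my score, opponent's score).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def always_defect(my_hist, opp_hist):
    return "D"

def tit_for_tat(my_hist, opp_hist):
    """Cooperate first, then copy the opponent's previous move."""
    return opp_hist[-1] if opp_hist else "C"

def play_match(s1, s2, rounds=100):
    """Play an iterated match between two strategies; return both scores."""
    h1, h2, score1, score2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        p1, p2 = PAYOFF[(m1, m2)]
        h1.append(m1); h2.append(m2)
        score1 += p1; score2 += p2
    return score1, score2

def tournament(strategies, rounds=100):
    """Round-robin: every strategy plays every other strategy once."""
    totals = {name: 0 for name in strategies}
    names = list(strategies)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sa, sb = play_match(strategies[a], strategies[b], rounds)
            totals[a] += sa
            totals[b] += sb
    return totals
```

Against an unconditional defector, tit-for-tat loses only the first round, after which both sides defect for the rest of the match.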

- A program that allows us to simulate tournaments of the iterated prisoner's dilemma.
- A GUI that allows a user to create and manage tournaments.
- The GUI should provide summary statistics that are tournament-specific, i.e. a summary for each player/strategy and how it compares with the other players/strategies in the tournament.
- The report should describe some of the theory behind strategies in cooperative settings.
- The report should provide an analysis and an evaluation of the different strategies, including an indication of which one is the best strategy, if any.
- The report should contain a design of the program and a discussion of any software engineering issues encountered.
- The report should describe any useful and interesting programming techniques that have been employed to develop the final prototype.

- Make use of multi-agent system platform to support the tournament in a completely distributed setting.
- Use logic-based rules to specify the strategies and wrap them in the Java implementation.
- Use strategy recognition techniques so that agents can detect what strategies opponents are playing, and adapt their behaviour accordingly.
- Create a methodology for developing tournaments of this type.

- M. Nowak, R. May and K. Sigmund. The Arithmetics of Mutual Help, Scientific American, June 1995.
- R. Axelrod, W. D. Hamilton (1981), The Evolution of Cooperation, Science 211: 1390–96.
- K. Stathis, A game-based architecture for developing interactive components in computational logic, Journal of Functional and Logic Programming, 20(1), Jan 2000, MIT Press.

- A proof-of-concept program implementing neural nets both in the case of regression and in the case of classification (where neural networks output probability predictions).
- Building a cross-conformal predictor on top of neural nets and exploring its empirical validity.
- A preliminary empirical study of the efficiency of the student's implementation and standard implementations (such as the one in scikit-learn).

- Careful cross-validation of various parameters, including the stopping rule, learning rate, and the numbers of hidden neurons at various levels.
- Implementing regularization.
- Experimenting with different architectures, possibly including deep architectures.
- Ideally, the final product will have a graphical user interface.
- The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.
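To give a flavour of the proof-of-concept stage, a one-hidden-layer sigmoid network with plain back-propagation can be written from scratch in a few dozen lines. The architecture, learning rate, epoch count, and the XOR task below are arbitrary illustrations, not prescribed choices:

```python
import math, random

def train_xor_net(hidden=4, lr=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer sigmoid network on XOR with per-example
    back-propagation (squared loss); returns a predict(x1, x2) function."""
    rng = random.Random(seed)
    # w1[j] = [weight for x1, weight for x2, bias] of hidden unit j;
    # w2 = hidden->output weights plus an output bias at the end.
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            h = [sig(w[0] * x1 + w[1] * x2 + w[2]) for w in w1]
            o = sig(sum(w2[j] * h[j] for j in range(hidden)) + w2[-1])
            do = (o - t) * o * (1 - o)                       # output delta
            dh = [do * w2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):                          # gradient step
                w2[j] -= lr * do * h[j]
                w1[j][0] -= lr * dh[j] * x1
                w1[j][1] -= lr * dh[j] * x2
                w1[j][2] -= lr * dh[j]
            w2[-1] -= lr * do
    def predict(x1, x2):
        h = [sig(w[0] * x1 + w[1] * x2 + w[2]) for w in w1]
        return sig(sum(w2[j] * h[j] for j in range(hidden)) + w2[-1])
    return predict
```

Even this toy setting is enough to study the effect of the learning rate and the number of hidden units on convergence, which is the starting point for the cross-validation work above.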

- Careful study of the optimal number of folds.
- Replacing back-propagation with other optimization algorithms.
- Implementing jackknife+, a popular recent version of cross-conformal prediction.

- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Chapter 11.
- Mitchell, Machine Learning, 1997, Chapter 4.
- Research papers.

- A proof-of-concept program implementing neural nets both in the case of regression and in the case of classification (where neural networks output probability predictions).
- Building a cross-conformal predictor on top of neural nets and exploring its empirical validity.
- A preliminary empirical study of the efficiency of the student's implementation and standard implementations (such as the one in scikit-learn).

- Careful cross-validation of various parameters, including the stopping rule, learning rate, and the numbers of hidden neurons at various levels.
- Implementing regularization.
- Experimenting with different architectures, possibly including deep architectures.
- Ideally, the final product will have a graphical user interface.

- Careful study of the optimal number of folds.
- Replacing back-propagation with other optimization algorithms.
- Implementing jackknife+, a popular recent version of cross-conformal prediction.

- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Chapter 11.
- Mitchell, Machine Learning, 1997, Chapter 4.
- Research papers.

- Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
- WebGL and OpenGL are an advantage.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in security visualization.
- A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied.

- Design and implement a framework for development of new and existing visualization techniques.
- Describe in detail the design of tests for the framework.
- Describe and analyse the features that have been used during the project to detect/stop threats.
- Critically compare the findings with related works.

- WebGL fundamentals https://webgl2fundamentals.org/
- SecViz.org https://secviz.org/
- Bokeh, Vega, D3, Plotly visualization suites
- R Marty. Applied security visualization.
- H Shiravi et al. A survey of visualization systems for network security.

This project is to implement and assess a deep learning algorithm - typically for a feed-forward network - and to perform visualisations and experiments to analyse and improve its performance.

- Proof of concept program: Implementation of a deep learning algorithm for a relatively small network on a relatively simple dataset, together with experiments determining the effect of hyperparameters on convergence
- A report giving an overview of a range of related deep learning methods.
- A report describing and assessing the performance of the initial implementation and any experiments performed

- The program(s) will work with one or more real datasets.
- You will implement and compare multiple architectures and optimisation methods
- You will visualize and assess the performance of the algorithms produced.

- This would be an opportunity to implement several deep learning methods and get to understand them in practice by applying them to a significant dataset.

- "Deep Learning", by Goodfellow, Bengio, and Courville, MIT Press 2016
- "Deep Learning with Python" by Francois Chollet, Manning Publications 2018

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

This project will require some mathematical understanding; it is not simply programming.

This project is to implement and assess a deep learning algorithm - typically for a feed-forward network - and to perform visualisations and experiments to analyse and improve its performance.

- Proof of concept program: Implementation of a deep learning algorithm for a relatively small network on a relatively simple dataset, together with experiments determining the effect of hyperparameters on convergence
- A report giving an overview of a range of related deep learning methods.
- A report describing and assessing the performance of the initial implementation and any experiments performed

- The program(s) will work with one or more real datasets.
- You will implement and compare multiple architectures and optimisation methods
- You will visualize and assess the performance of the algorithms produced.

- This would be an opportunity to implement several deep learning methods and get to understand them in practice by applying them to a significant dataset.

- "Deep Learning", by Goodfellow, Bengio, and Courville, MIT Press 2016
- "Deep Learning with Python" by Francois Chollet, Manning Publications 2018

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

This project will require some mathematical understanding; it is not simply programming.

- Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in resilience.
- A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied

- Design and implement a framework for development of new and existing resilience techniques
- Describe in detail the design of tests for the framework
- Describe and analyse the features that have been used during the project to detect/stop threats
- Critically compare the findings with related works

- Julia Allen. Measures for managing operational resilience. Technical Report, 2011.
- Julia Allen, Pamela Curtis, Nader Mehravari, Andrew Moore, Kevin Partridge, Robert Stoddard, and Randy Trzeciak. Analyzing cases of resilience success and failure: a research study. Technical report, Carnegie Mellon University, the Software Engineering Institute, 2012.
- Richard A. Caralli, Julia Allen, and David W. White. CERT Resilience Management Model: A Maturity Model for Managing Operational Resilience. Addison-Wesley Professional, 2010.
- Deborah Bodeau and Richard Graubart. Cyber resilience metrics: Key observations. Technical Report, 2016.
- Ronald J. Brachman, Richard E. Fikes, and Hector J. Levesque. Krypton: A functional approach to knowledge representation. Computer, (10):67–73, 1983.
- Linkov and Trump. The science and practice of resilience. 2019

- Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in malware disguising themselves as other processes.
- A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied

- Design and implement a framework for development of new and existing related malware detection techniques
- Describe in detail the design of tests for the framework
- Describe and analyse the features that have been used during the project to detect/stop threats
- Critically compare the findings with related works

- Vaas and Happa. Detecting disguised processes using application-behavior profiling. 2017
- Geden and Happa. Classification of malware families based on runtime behaviour. 2018

- Good programming skills, particularly Android app development, and knowledge of IDEs.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing different security apps for integrating into security suite.
- A proof-of-concept implementation of the security suite
- Testing the system

- Design and develop the final security suite
- Complete report which must include a User Manual

- Conti, M et al. Context-related policy enforcement for android
- Bojjagani, S. et al. Stamba: Security testing for Android mobile banking apps.
- Vulnerability Test Suite for Android - NowSecure, https://info.nowsecure.com
- https://www.android.com/intl/en_uk/enterprise/security/

- Basic knowledge in cryptography (mainly hash functions, digital signatures, public-key infrastructures)
- Basic knowledge in UI development, web services (mainly REST API), cryptographic libraries (mainly OpenSSL)

- Implement a user-friendly signing tool that applies a digital signature to a given file
- Implement a TSA service that can be accessed via a REST API
- Design a suitable file format to accumulate digital evidence

- Extend the signing tool with capabilities to automatically interact with TSAs to generate and refresh digital evidence
- Extend the signing tool with capabilities to verify whether a digitally signed and timestamped document existed at a certain point in time
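The hash-then-timestamp flow behind these deliverables can be prototyped before any real TSA is involved. In the sketch below an HMAC with a shared demo key stands in for the TSA's signature purely for illustration; a real TSA follows RFC 3161 and returns an asymmetric signature over the digest and time, and all names here are invented:

```python
import hashlib, hmac, json, time

TSA_KEY = b"demo-tsa-secret"  # stand-in for the TSA's private signing key

def tsa_timestamp(digest_hex, now=None):
    """Toy timestamp authority: binds a document digest to a time value.
    A real TSA (RFC 3161) would return an asymmetric signature instead."""
    now = now if now is not None else int(time.time())
    payload = json.dumps({"digest": digest_hex, "time": now}, sort_keys=True)
    tag = hmac.new(TSA_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def make_evidence(document: bytes):
    """Evidence record: the document hash plus a timestamp token.
    Note that only the digest, never the document, goes to the TSA."""
    digest = hashlib.sha256(document).hexdigest()
    return {"digest": digest, "token": tsa_timestamp(digest)}

def verify_evidence(document: bytes, evidence):
    """Check that the document matches the recorded digest and that the
    timestamp token is authentic and covers that digest."""
    digest = hashlib.sha256(document).hexdigest()
    token = evidence["token"]
    expected = hmac.new(TSA_KEY, token["payload"].encode(),
                        hashlib.sha256).hexdigest()
    payload = json.loads(token["payload"])
    return (digest == evidence["digest"] == payload["digest"]
            and hmac.compare_digest(expected, token["tag"]))
```

Refreshing evidence then amounts to requesting a fresh token over the existing record before the old algorithm or key weakens.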

- Martín Vigil, Johannes Buchmann, Daniel Cabarcas, Christian Weinert, Alexander Wiesmaier: "Integrity, Authenticity, Non-Repudiation, and Proof of Existence for Long-Term Archiving: A Survey". In Computers & Security 2015.
- Christian Weinert, Denise Demirel, Martín Vigil, Matthias Geihs, Johannes Buchmann: "MoPS: A Modular Protection Scheme for Long-Term Storage". In ASIACCS 2017.

- Basic knowledge in cryptography (mainly hash functions) and complexity theory (mainly "Big O" notation)
- Knowledge of either C/C++, Go, Java, or Rust

- Algorithms for three new insertion strategies with different trade-offs (in terms of time and storage complexity)
- Implementation of at least one of the new strategies in one existing Cuckoo filter implementation
- Implementation of a simulation environment that allows measuring the success and run-time overhead of different insertion strategies
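For reference, a minimal cuckoo filter with the standard insertion strategy (Fan et al. 2014) looks as follows; the 1-byte fingerprints, bucket size 4, and SHA-256-based hashing are illustrative parameter choices, and the number of buckets must be a power of two for the XOR-based alternate index to be an involution:

```python
import hashlib, random

class CuckooFilter:
    """Minimal cuckoo filter: partial-key cuckoo hashing with eviction."""

    def __init__(self, num_buckets=128, bucket_size=4, max_kicks=500):
        self.buckets = [[] for _ in range(num_buckets)]
        self.n = num_buckets          # must be a power of two
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks

    def _fingerprint(self, item):
        fp = hashlib.sha256(item.encode()).digest()[0]
        return fp or 1                # avoid 0, reserved as "no entry"

    def _index(self, item):
        h = hashlib.sha256(b"i" + item.encode()).digest()
        return int.from_bytes(h[:4], "big") % self.n

    def _alt_index(self, index, fp):
        # XOR with a hash of the fingerprint; involutive for power-of-two n.
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big")
        return (index ^ h) % self.n

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a random entry and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                  # filter considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]
```

The new insertion strategies in this project would replace the "random eviction" loop above with smarter choices of which entry to displace.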

- Implementation of the remaining new insertion strategies
- Simulation results for all insertion strategies

- Bin Fan, David G. Andersen, Michael Kaminsky, Michael Mitzenmacher: "Cuckoo Filter: Practically Better Than Bloom". In: CoNEXT 2014.
- Daniel Kales, Christian Rechberger, Thomas Schneider, Matthias Senker, Christian Weinert: "Mobile Private Contact Discovery at Scale". In USENIX Security Symposium 2019.
- https://github.com/efficient/cuckoofilter
- https://github.com/linvon/cuckoo-filter
- https://github.com/MGunlogson/CuckooFilter4J
- https://github.com/axiomhq/rust-cuckoofilter

- A basic implementation for sending sequences of characters. The 'droplets' may be object instances with appropriate fields. The implementation should consist of two programs: an encoder that reads in a text file and writes out a sequence of 'droplets', and a decoder that reads in a sequence of droplets and which outputs a text file, which should be identical to the original text file.
- A description of how the basic encoding and decoding algorithms work, with pseudocode and a running example. It should also include a discussion of the 'degree distribution', and of how to sample random values from a discrete distribution.
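The encoder and peeling decoder described above fit in a short sketch. Here droplets are (index-set, XOR-value) pairs over integer blocks, and for brevity degrees are drawn uniformly; Luby's robust soliton distribution is the proper choice for the full deliverable:

```python
import random

def lt_encode(blocks, num_droplets, seed=0):
    """Produce droplets: each is (set of block indices, XOR of those blocks)."""
    rng = random.Random(seed)
    k = len(blocks)
    droplets = []
    for _ in range(num_droplets):
        degree = rng.randint(1, k)            # uniform degrees for brevity
        idxs = rng.sample(range(k), degree)
        value = 0
        for i in idxs:
            value ^= blocks[i]
        droplets.append((set(idxs), value))
    return droplets

def lt_decode(droplets, k):
    """Peeling decoder: repeatedly resolve droplets with one unknown block,
    XOR-ing out the blocks already recovered."""
    known = {}
    progress = True
    while progress and len(known) < k:
        progress = False
        for idxs, value in droplets:
            unknown = idxs - known.keys()
            if len(unknown) == 1:
                i = unknown.pop()
                for j in idxs & known.keys():
                    value ^= known[j]
                known[i] = value
                progress = True
    return [known.get(i) for i in range(k)]   # None for unrecovered blocks
```

Experiments on the degree distribution then reduce to swapping the `degree` line for a soliton sampler and measuring how many droplets a full decode needs.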

- An efficient program for encoding and decoding.
- A sequence of well designed and analysed experiments investigating the program's efficiency and behaviour.
- The report will contain a discussion of erasure channels, contrast with feedback channels. When are erasure channels needed? What other solutions are there?
- The report will contain a description of Digital Fountain Codes: basic encoding and decoding, with a running example. Role of degree distribution. Behaviour with special cases of degree distribution. Luby's degree distributions. Time and space efficiency of variants of decoding algorithm.
- The report will contain a section on design and implementation: What is the overall structure of the program? What are the design issues? How were these design issues resolved?
- The report will contain a complete testing schedule: what evidence is there that the program is correct (test cases)?
- The report will contain an analysis of the time and space efficiency of algorithm(s)
- The report will contain, for each experiment: aim, method, tables/graphs of results, questions raised

- The implementation should be efficient: various optimisations improve time and space efficiency. Many small extensions to the algorithm are possible. You could design and carry out experiments to assess the efficiency of your algorithm.
- It is possible to design experiments to check the efficiency of the packet degree distribution, and you could try to optimise the degree distribution in the light of your experiments.

- Start with Wikipedia on 'Luby Transform Code', and the references it gives.

- A basic implementation for sending sequences of characters. The 'droplets' may be object instances with appropriate fields. The implementation should consist of two programs: an encoder that reads in a text file and writes out a sequence of 'droplets', and a decoder that reads in a sequence of droplets and which outputs a text file, which should be identical to the original text file.
- A description of how the basic encoding and decoding algorithms work, with pseudocode and a running example. It should also include a discussion of the 'degree distribution', and of how to sample random values from a discrete distribution.

- An efficient program for encoding and decoding.
- A sequence of well designed and analysed experiments investigating the program's efficiency and behaviour.
- The report will contain a discussion of erasure channels, contrast with feedback channels. When are erasure channels needed? What other solutions are there?
- The report will contain a description of Digital Fountain Codes: basic encoding and decoding, with a running example. Role of degree distribution. Behaviour with special cases of degree distribution. Luby's degree distributions. Time and space efficiency of variants of decoding algorithm.
- The report will contain a section on design and implementation: What is the overall structure of the program? What are the design issues? How were these design issues resolved?
- The report will contain a complete testing schedule: what evidence is there that the program is correct (test cases)?
- The report will contain an analysis of the time and space efficiency of algorithm(s)
- The report will contain, for each experiment: aim, method, tables/graphs of results, questions raised

- The implementation should be efficient: various optimisations improve time and space efficiency. Many small extensions to the algorithm are possible. You could design and carry out experiments to assess the efficiency of your algorithm.
- It is possible to design experiments to check the efficiency of the packet degree distribution, and you could try to optimise the degree distribution in the light of your experiments.

- Start with Wikipedia on 'Luby Transform Code', and the references it gives.

Although cloud infrastructures, e.g., Amazon EC2, provide several benefits (maintenance, scalability, reliability), they do not provide any security guarantee: a bug in the cloud infrastructure or a malicious cloud provider could be even more dangerous than a malicious client.

Intel has recently released the Software Guard eXtensions [6, 9, 10], a set of instructions dedicated to improving the security of an application by running (parts of) it in a secure container, called an enclave. The data and computation happening in this container is protected by the CPU chip from any external attacker, be it another malicious process, the operating system, or physical access to the machine.

The goal of this project is to partition a real-world game (e.g. ioquake3 [7], a clone of the popular Quake 3: Arena video game), by integrating a subset of its components inside an enclave. As a result some computation at the server can be offloaded to the client while preventing cheating. A list of potential cheats can be found in [8].

- Familiarity with necessary background on Intel SGX.
- Familiarity with Intel SGX SDK. Show examples of its operation on toy/tutorial problems.
- Report: Intel SGX overview
- Report: Distributed Games, including architecture and potential for cheating in the game of your choice.
- Proof-of-concept: Partition a real-world game of your choice (e.g. ioquake3) so that the server runs inside an enclave
- Proof-of-concept: Offload some of the computation at the server to the client, inside an enclave

- Modify the communication protocol between the client and the server so that the integrity of the game is preserved even when a client is cheating;
- Establish a secure communication protocol, e.g., TLS, so that an attacker cannot observe the network traffic for cheating;
- Evaluate the performance and scalability benefits of your distributed game.
- Report: Detailed architecture description of final partitioned game.
- Report: Security analysis of final partitioned game.

- Using the Intel SGX functionalities, establish a framework to remotely attest/verify the authenticity of a client/server and provision it with sensitive data.

- https://support.steampowered.com/kb_article.php?ref=7849-Radz-6869
- http://www.gamesradar.com/diablo-iii-error-37-causing-great-wailing-and-gnashing-teeth-players-worldwide-fail-log-solution-wait-a-bit/
- http://www.ibtimes.com/division-servers-struggle-ubisoft-confirms-login-problems-2332172
- Raaen et al., Latency Thresholds for Usability in Games: A Survey, NIK 2014.
- Howard et al., Cascading Impact of Lag on User Experience in Cooperative Multiplayer Games, NetGames 2014.
- https://software.intel.com/en-us/sgx
- https://ioquake3.org/
- Yahyavi et al., Watchmen: Scalable cheat-resistant support for distributed multi-player online games, ICDCS 2013.
- Intel SGX Linux SDK Sample Code: https://github.com/intel/linux-sgx/tree/master/SampleCode/SampleEnclave
- Intel SGX Explained, https://eprint.iacr.org/2016/086.pdf

- Ability to research a new topic and implement designs
- Good programming skills, particularly Node JS, Java, Golang, etc., and knowledge of Windows/Linux internals.
- Knowledge of data/database modelling
- Familiarity with DLT and Blockchain technologies (e.g. HyperLedger)

- A report describing the architectural components of the IoT system and the activities that need to be audited, and addressing the Blockchain technologies available for use
- A design of the auditing data model
- A proof of concept implementation of the auditing system using DLT technologies
- Testing the system

- Design and develop the data model of the IoT-based auditing system
- Develop a prototype of Blockchain-IoT-based auditing system
- Complete report
- The report will include a User Manual
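The tamper-evidence property at the heart of the auditing data model can be prototyped without any DLT stack by hash-chaining log entries, which is also a useful baseline for the report. The field names below are invented for illustration; a Hyperledger deployment would replace this with ledger transactions:

```python
import hashlib, json

def append_event(log, device_id, action):
    """Append an audit event whose hash covers the previous entry's hash,
    so any in-place modification breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"device": device_id, "action": action, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})
    return log

def verify_chain(log):
    """Recompute every hash and check each entry links to its predecessor."""
    prev = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in ("device", "action", "prev")}
        good = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != good:
            return False
        prev = entry["hash"]
    return True
```

A DLT then adds what this sketch cannot: distributed replication and consensus, so no single party can silently rewrite the whole chain.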

- https://www.networkmanagementsoftware.com/what-is-syslog/
- https://en.wikipedia.org/wiki/Syslog
- https://medium.com/swlh/top-5-dlt-blockchain-protocols-to-consider-in-your-project-dae7fc6d381d
- https://www.ibm.com/blogs/research/2018/02/architecture-hyperledger-fabric/
- https://hyperledger-fabric.readthedocs.io/en/release-1.4/getting_started.html

- A proof-of-concept implementation of bagging, random forests, or boosting both in the case of regression and in the case of classification.
- Implementing underlying algorithms, such as decision stumps or decision trees; simple proof-of-concept implementations are sufficient.
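A minimal sketch of the proof-of-concept stage is shown below: decision stumps trained exhaustively on one feature, bagged over bootstrap resamples with majority voting. The data layout, number of models, and all names are illustrative choices:

```python
import random
from collections import Counter

def train_stump(data):
    """Exhaustive decision stump: pick the feature, threshold, and left/right
    labelling with the fewest training errors. data = [(features, label)]."""
    best = None
    for f in range(len(data[0][0])):
        for threshold in sorted({x[f] for x, _ in data}):
            for left, right in ((0, 1), (1, 0)):
                pred = (lambda x, f=f, t=threshold, l=left, r=right:
                        l if x[f] <= t else r)
                errors = sum(pred(x) != y for x, y in data)
                if best is None or errors < best[0]:
                    best = (errors, pred)
    return best[1]

def bagging(data, n_models=25, seed=0):
    """Train one stump per bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in range(len(data))]
        models.append(train_stump(sample))
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict
```

Random forests add a random feature subset at each split, and boosting replaces the uniform resampling with example reweighting; both are small structural changes to this skeleton.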

- A final program implementing bagging, random forests, or boosting (or all three) with a careful choice of the parameters.
- Implementations of the underlying algorithms should be complemented by a careful empirical study of the effect of various parameters; producing optimized programs.
- Comparing the performance of different ensemble techniques in the problems of classification and regression.
- Using bagged and boosted decision trees as probability predictors; evaluating their performance using proper loss functions.
- Ideally, the final product will have a graphical user interface.

- Implementing the out-of-bag estimate of error probability.
- Investigating variable importance.
- Proximity plots.
- Theoretical analysis of the algorithms.
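To make the first-term deliverable concrete, here is a minimal sketch of bagging with decision stumps for regression, using only the standard library. The step-function data is synthetic and the implementation is intentionally naive; a full project would support classification as well and tune the ensemble size.

```python
import random
random.seed(0)

def fit_stump(xs, ys):
    """Exhaustively pick the split threshold minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_predict(stumps, x):
    """Average the predictions of all bootstrapped stumps."""
    return sum(s(x) for s in stumps) / len(stumps)

xs = [i / 10 for i in range(50)]
ys = [1.0 if x > 2.5 else 0.0 for x in xs]   # synthetic step-function target
stumps = []
for _ in range(25):
    idx = [random.randrange(len(xs)) for _ in xs]   # bootstrap resample
    stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
```

Each stump is trained on a bootstrap resample of the data; averaging them reduces the variance of the individual stumps.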

- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Section 8.7, Chapters 10 and 15-16
- James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning, 2013, Chapter 8.
- Research papers.

DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. Persistent identifiers (PIDs) are not only important to uniquely identify a publication, dataset, or person, but the metadata for these persistent identifiers can provide unambiguous linking between persistent identifiers of the same type, e.g. journal articles citing other journal articles, or of different types, e.g. linking a researcher and the datasets they produced. DataCite has constructed a PID graph that relates the different types of PID into a graph that can be queried using a language called GraphQL.

A variety of graph properties can be tested with this data. In particular, there has been interest in how the fraction of nodes with degree k behaves. Scale-free networks, where this fraction follows a power law in k, generated significant interest, though it is now apparent that they are less common than previously thought.

This project will require the student to examine the behaviour of the degree of the PID graph.
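The quantity of interest, the fraction of nodes with degree k, can be computed from an edge list in a few lines. The edge list below is an invented toy stand-in for a PID graph extract; the real graph would be fetched via DataCite's GraphQL API.

```python
from collections import Counter

# Toy edge list standing in for a tiny extract of the PID graph
edges = [("doiA", "doiB"), ("doiA", "doiC"), ("doiA", "orcid1"),
         ("doiB", "doiC"), ("doiD", "orcid1")]

degree = Counter()
for u, v in edges:          # each edge contributes to both endpoints' degrees
    degree[u] += 1
    degree[v] += 1

# Fraction of nodes having degree k -- the quantity studied for scale-freeness
n = len(degree)
degree_counts = Counter(degree.values())
fraction = {k: c / n for k, c in sorted(degree_counts.items())}
```

Plotting `fraction` on log-log axes against k is the standard first look at whether a power law is plausible.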

- A report on graphs and scale-free networks.
- A report on the PID graph.
- A report on GraphQL.

- The development of a tool to query the PID graph and visualise graphs.
- Identifying the behaviour of the degree of the PID graph.

- Exploring other graph observables, e.g. centrality.
- Exploring using a recommender system to identify papers of interest.

- Introducing the PID Graph - https://blog.datacite.org/introducing-the-pid-graph/
- The Fullstack Introduction for GraphQL - https://www.howtographql.com/
- Learning GraphQL, E. Porcello and A. Banks, O'Reilly Media (available online through Safari Books)
- Handbook of Graph Theory, J.L. Gross, J. Yellen and P. Zhang, Chapman and Hall/CRC (available online through Safari Books)
- Scale-free networks are rare, A.D. Broido and A. Clauset, Nature Communications (2019), DOI: 10.1038/s41467-019-08746-5

Understanding of or willingness to study graph theory.

The student must have a good understanding of Python or a similar tool.

Monte Carlo Tree Search (MCTS) is an active research area and has scored some striking successes: it is the basis of the current best Go programs (Go has too high a branching factor for alpha-beta pruning to be used).

In this project, the aim is to produce and evaluate a simple implementation of MCTS for a simple game, and to do some experimental evaluations of its performance. Possible games might be Connect4 (probably a good choice) or 9x9 capture Go (requires some subtle programming).

- Proof of concept programs: bandit problem solver ('bandit problems' are particularly simple statistical games, which you should tackle first), with experiments showing how upper-confidence-bound (UCB) methods compare to naive strategies.
- Proof of concept programs: Program that enables 2 human players to play the selected game (no AI moves).
- Reports: UCB method for bandit problems.
- Reports: Rules of the game, expressed semi-formally as state and allowable transitions.
- Reports: Algorithm for MCTS.
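As a sketch of the bandit deliverable, the following implements the UCB1 rule on Bernoulli arms. The arm means are invented for illustration; a proper experiment would compare this against naive strategies such as epsilon-greedy over many runs.

```python
import math
import random
random.seed(1)

def ucb1(means, horizon):
    """UCB1: play each arm once, then the arm maximising mean + sqrt(2 ln t / n)."""
    n = [0] * len(means)      # pull counts per arm
    s = [0.0] * len(means)    # reward sums per arm
    for t in range(1, horizon + 1):
        if t <= len(means):
            a = t - 1         # initial round-robin so every count is positive
        else:
            a = max(range(len(means)),
                    key=lambda i: s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        reward = 1.0 if random.random() < means[a] else 0.0   # Bernoulli arm
        n[a] += 1
        s[a] += reward
    return n

pulls = ucb1([0.2, 0.5, 0.8], horizon=2000)
```

With a reasonable horizon, the best arm (mean 0.8) accumulates the large majority of pulls while the exploration bonus keeps the others from being abandoned prematurely.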

- The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- The program should have a GUI for game play, for the case of two human players, and one player against the computer.
- The report should include experimental assessments of effects on program performance of parameters of MCTS (depth of search, number of rollouts used, etc).
- The report will describe the software engineering process involved in generating your software.
- The report will have sections on: the theory of bandit processes, and the algorithm for MCTS; the game used, the software engineering process in development of the program; an explanation and justification of design decisions for the GUI; and an experimental analysis of the performance of MCTS.

- A possible enhancement would be a comparison of MCTS with alpha-beta pruning.
- The project could be done for a relatively simple game (such as Connect4) or for 9x9 capture Go, which adds some programming complexity.

- Proof of concept programs: First implementation of GA for constraint satisfaction/function optimisation.
- Report on Design Patterns.
- Report on Genetic Algorithms.
- Report on encoding various problems for GAs.
- Report on the Theory of coalescence and genetic drift.

- The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- The program will at least enable a user to interactively view and control GA optimisation for (at least one) optimisation problem.
- The program must have a text based control file from which it can be configured.
- The program will have a Graphical User Interface that can be used to animate simulations.
- The report will describe the software engineering process involved in generating your software.
- The report will include a survey of optimisation methods, including GAs.
- The report will include some results of experiments that compare different optimisation methods applied to the same problems.
- The report will describe interesting programming techniques and data structures used (or able to be used) on the project.

- This is an open-ended project that requires a substantial amount of programming. The basic project may be extended in many ways.
- One possibility would be to choose an optimisation problem, and then compare a range of optimisation techniques for it, including GAs.
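As one possible shape for the proof-of-concept GA, here is a minimal sketch applied to the OneMax function (maximising the number of 1-bits). All parameters are illustrative; a full implementation would read them from the required text-based control file and support real optimisation problems.

```python
import random
random.seed(0)

LENGTH, POP, GENS, MUT = 30, 40, 60, 0.02   # illustrative parameters

def fitness(bits):                  # OneMax: count of 1-bits
    return sum(bits)

def crossover(a, b):                # single-point crossover
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def mutate(bits):                   # independent bit-flip mutation
    return [1 - b if random.random() < MUT else b for b in bits]

def tournament(pop):                # pick the fitter of two random individuals
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    elite = max(pop, key=fitness)   # elitism: keep the current best unchanged
    pop = [elite] + [mutate(crossover(tournament(pop), tournament(pop)))
                     for _ in range(POP - 1)]
best = max(pop, key=fitness)
```

The same loop structure carries over to any encoding: only `fitness`, and possibly the variation operators, change per problem.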

Machine recognition of Greek characters in images is a challenging task because of the large number of character classes (more than 270) and the similarity of the diacritic marks across characters.

To this day, Greek text search facilities remain far from perfect. It is therefore interesting to apply machine learning techniques to this task and compare their accuracies.

- Proof of concept program: identify one machine learning task suitable for character recognition and implement one learning algorithm for the chosen task.
- Report: An overview of the datasets. Statistics and numbers involved.
- Report: A report on machine learning tasks for character recognition.

- Survey of machine learning techniques used for character recognition.
- The developed machine learning software for Greek character recognition.
- The software engineering process involved in generating your software.

- Provide predictions with confidence measure.

- Gatos, Basilis, et al. GRPOLY-DB: An old Greek polytonic document image database. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2015.
- http://www.iit.demokritos.gr/~nstam/GRPOLY-DB

The problem of working out the price of a house from its characteristics has been studied intensively over the past decades. Many housing price datasets are available on the Internet, and many machine learning techniques have been applied to them.

The Boston Housing dataset has been a playground for statistics and machine learning methods since the 1970s. The more recent California Housing Prices dataset dates to the 1990s. In the 2010s the Ames dataset was put forward: it is based on transactions and contains dates as part of individual records.

The first task is to work out house prices in a batch setting where the time is not available (or is ignored). We can compare learning methods and study the effect of parameters on the test error.

Secondly, a study of house prices in the online mode can be undertaken. Here datapoints become available one by one: knowing that a house is about to be sold, we predict the sales price, then we see the actual price in the transaction, then we make a prediction for the price in the next sale etc. We can use the data with timestamps to study inflation. Do prices grow with time? How fast?
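The online protocol above can be sketched in a few lines: predict before seeing each price, record the loss, then update. The exponentially weighted average used here is only an invented baseline predictor, not a recommended method, and the price stream is a toy stand-in for timestamped transactions.

```python
# Online protocol: for each sale, predict the price, observe it, record the
# squared loss, then update the predictor with the observed price.
alpha = 0.3                                     # smoothing weight (illustrative)
prediction, losses = None, []
prices = [100, 104, 101, 110, 115, 112, 120]    # stand-in transaction stream
for price in prices:
    if prediction is not None:
        losses.append((prediction - price) ** 2)          # loss before updating
    # exponentially weighted average as a simple baseline predictor
    prediction = price if prediction is None else (1 - alpha) * prediction + alpha * price
```

Any learning method can be slotted into the same loop in place of the weighted average; the cumulative loss then lets different online predictors be compared fairly.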

The project is recommended for Computational Finance students.

- Proof of concept program pre-processing the data and applying two machine learning methods to Boston Housing dataset
- Report: A report describing supervised learning, the concepts 'training set' and 'test set', and the theory of the methods.

- A machine learning method (such as Ridge Regression; the choice is by discussion with the supervisor) should be implemented from scratch.
- Several different datasets should be used
- Tests comparing different kernels, parameters, etc should be performed. The results should be visualised.
- Prediction in the online mode should be implemented and results visualised.
- The report will describe different algorithms and implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report will describe computational experiments, analyse their results and draw conclusions.
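For the from-scratch implementation, one possible starting point is closed-form ridge regression in NumPy. The synthetic data and the choice to regularise the bias term along with the weights are simplifications for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (X'X + lam I)^{-1} X'y, intercept via a bias column.
    For simplicity the bias is regularised too (negligible for small lam)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ y)

def ridge_predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Synthetic data with known coefficients [2, -1, 0.5] and intercept 3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)
w = ridge_fit(X, y, lam=0.1)
```

Replacing the closed form with a kernelised version is the natural next step when moving to the kernel methods in the references.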

- Methods of prediction with expert advice can be applied in the online mode.
- Is there inflation? A study can be undertaken.

- Boston housing dataset on Kaggle: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
- California housing dataset on Kaggle: https://www.kaggle.com/camnugent/california-housing-prices
- De Cock, Dean. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19.3 (2011).
- Y. Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
- J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004
- B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Molecular Biology is now generating colossal amounts of data. In particular, there are a variety of technologies that can scan entire genomes or transcriptomes and determine the likelihood of their association with a particular biological process. However, such results are very noisy and suffer from a "large p, small n" effect: the ratio of the number of possible degrees of freedom p to the number of independent observations n is much greater than 1, and hence the results suffer from very large false positive rates.

On the other hand, the very extensive body of biomedical literature could be used to filter out spurious genes from the data before undergoing analysis. Tools such as GOCat mine the literature to relate Gene Ontology terms to specific search terms representing Biological processes. This specific tool is no longer functional and an alternative needs to be provided.

This project will require the student to develop an equivalent of GOCat.
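A minimal sketch of the kNN idea behind GOCat: represent texts as bags of words and vote over the GO categories of the nearest neighbours. The corpus below is invented toy data (the GO identifiers shown are real terms, used only as labels); a real tool would train on GO-annotated literature with TF-IDF weighting or better.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a Counter."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Tiny labelled corpus standing in for GO-annotated abstracts (illustrative)
corpus = [
    ("dna repair pathway in response to damage", "GO:0006281"),
    ("repair of damaged dna strands", "GO:0006281"),
    ("apoptosis and programmed cell death signalling", "GO:0006915"),
    ("regulation of programmed cell death", "GO:0006915"),
]

def knn_category(query, k=3):
    """Assign the majority GO category among the k most similar documents."""
    scored = sorted(corpus, key=lambda d: cosine(bow(query), bow(d[0])),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

The fraction of neighbours voting for the winning category also yields a natural confidence score, which connects to the more advanced methods listed as extensions.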

- A report on Gene Ontologies.
- A report on the tool GOCat explaining the methods.
- A report on the data set required to build the tool.

- The development of a stand alone tool to implement the equivalent of GOCat.
- An API to query the tool.
- Rigorous testing of the tool.

- Using more advanced methods beyond kNN for the implementation.
- An API to query the tool.

- Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Gobeill J, Pasche E, Vishnyakova D, Ruch P. DOI: 10.1093/database/bat041
- A survey on annotation tools for the biomedical literature. Neves M, Leser U. DOI: 10.1093/bib/bbs084
- The GOA database in 2009 -- an integrated Gene Ontology Annotation resource. Barrell D, Dimmer E, Huntley R, Binns D, O'Donovan C, Apweiler R. DOI: 10.1093/nar/gkn803

Understanding of machine learning methods such as kNN.

The student must have a good understanding of R and/or Python.

- Implementation of transfer learning algorithms based on pre-trained networks for any of the following tasks: (1) object detection/classification, (2) image segmentation, and (3) video classification.
- A report describing transfer learning algorithms implemented with some evaluation results.

- Implementing transfer learning, deep learning (e.g. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)), or hybrid deep networks for any of the aforementioned tasks.
- Implementing several deep learning algorithms with different architectures, where the employed architectures are either existing, or newly proposed.
- Implementing manual hyper-parameter fine-tuning of the implemented deep networks.
- A comprehensive evaluation and comparison against existing state-of-the-art studies.
- The final report will describe methodology, implementation and testing with theoretical analysis and detailed evaluation results.
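The essence of transfer learning, freezing a pre-trained feature extractor and training only a new head, can be sketched without a deep learning framework. The random projection below is merely a stand-in for a real pre-trained backbone (e.g. a CNN trained on ImageNet), and the two-class data is synthetic; only the structure of the computation is the point.

```python
import numpy as np
rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained backbone: a fixed random projection + ReLU.
# In a real project this would be, e.g., a CNN with its conv layers frozen.
W_frozen = rng.normal(size=(8, 32))

def features(x):
    return np.maximum(x @ W_frozen, 0.0)   # frozen layer, never updated

# Synthetic two-class task on the raw inputs
X = rng.normal(size=(300, 8))
labels = (X[:, 0] > 0).astype(int)

# Only the new linear "head" is trained -- here by least squares on one-hot labels
F = features(X)
Y = np.eye(2)[labels]
head, *_ = np.linalg.lstsq(F, Y, rcond=None)

pred = (features(X) @ head).argmax(axis=1)
accuracy = (pred == labels).mean()
```

Swapping the random projection for real pre-trained features, and least squares for gradient-based fine-tuning, recovers the standard transfer learning recipe.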

- Implementing deep hybrid networks with new architectures.
- Implementing optimal hyper-parameter selection using an automatic process.
- Theoretical analysis of the algorithms.

- Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT press.
- Chollet, F., 2021. Deep learning with Python.
- Journal/conference papers.

- Basics of malware, knowledge of Python/C, some notions of machine learning.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub)

- A report describing techniques/methods to convert a binary into an image
- A preliminary tool to convert binaries into (grayscale) images
- Creation of a reasonably large dataset composed of different types of malware binaries
- A preliminary set of tests with neural networks (CNN, DNN) to classify a subset of the dataset based on their image representation
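The binary-to-image conversion is essentially a reshape of the raw bytes into a fixed-width grayscale array, as in the Nataraj et al. approach cited below. The width parameter here is illustrative; the literature typically chooses it as a function of the file size.

```python
import numpy as np

def binary_to_image(data: bytes, width: int = 16) -> np.ndarray:
    """Interpret raw bytes as grayscale pixels, zero-padding the last row."""
    buf = np.frombuffer(data, dtype=np.uint8)
    rows = -(-len(buf) // width)                  # ceiling division
    padded = np.zeros(rows * width, dtype=np.uint8)
    padded[:len(buf)] = buf
    return padded.reshape(rows, width)

# Toy "binary" of 40 bytes becomes a 3x16 grayscale image
img = binary_to_image(bytes(range(40)), width=16)
```

The resulting arrays can be saved as PNGs or fed directly to a CNN, since byte patterns characteristic of a malware family show up as visual texture.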

- Leveraging neural networks for classifying malware into families
- Describe in detail the design/architecture of the system
- Test the system on the entire dataset and discuss the results (efficacy, efficiency) and pros/cons of each technique
- Suggestions for improving the proposed system

- https://dl.acm.org/doi/pdf/10.1145/2016904.2016908
- https://www.intel.com/content/www/us/en/artificial-intelligence/documents/stamina-deep-learning-for-malware-protection-whitepaper.html
- https://arxiv.org/pdf/1812.07606.pdf
- https://www.blackhat.com/docs/us-15/materials/us-15-Davis-Deep-Learning-On-Disassembly.pdf
- https://arxiv.org/pdf/1903.11551.pdf
- https://www.covert.io/research-papers/deep-learning-security/Convolutional%20Neural%20Networks%20for%20Malware%20Classification.pdf
- https://dl.acm.org/doi/pdf/10.1145/2046684.2046689

- A report describing some hard problems on lattices and how they relate
- A report motivating the use of lattice-based cryptography
- Proof of concept implementation of Regev's LWE-based PKE (perhaps using mathematical software e.g. Sage)
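A toy version of Regev's scheme fits in a few lines of Python. The parameters below are far too small to be secure and are chosen only so that the subset-sum error stays below q/4, making decryption always correct; a serious implementation would follow the parameter analyses in the references.

```python
import random
random.seed(0)

q, n, m = 257, 10, 20          # toy parameters -- far too small to be secure

# Key generation: secret s, public (A, b) with b = A s + e (mod q)
s = [random.randrange(q) for _ in range(n)]
A = [[random.randrange(q) for _ in range(n)] for _ in range(m)]
e = [random.choice([-1, 0, 1]) for _ in range(m)]              # small errors
b = [(sum(A[i][j] * s[j] for j in range(n)) + e[i]) % q for i in range(m)]

def encrypt(bit):
    """Sum a random subset of LWE samples; encode the bit as 0 or q/2."""
    S = [i for i in range(m) if random.random() < 0.5]
    u = [sum(A[i][j] for i in S) % q for j in range(n)]
    v = (sum(b[i] for i in S) + bit * (q // 2)) % q
    return u, v

def decrypt(u, v):
    """Remove <u, s>; the residue is near 0 for bit 0 and near q/2 for bit 1."""
    d = (v - sum(u[j] * s[j] for j in range(n))) % q
    return 1 if q // 4 < d < 3 * q // 4 else 0
```

Since the accumulated error is at most m = 20 in absolute value, well below q/4 = 64, the decoding threshold always recovers the encrypted bit.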

- Implementation in C++ of Lindner-Peikert IND-CPA secure PKE based on LWE, Ring-LWE, or Module-LWE (instead of C++, also a different programming language can be used if approved by the project supervisor)
- Use IND-CPA secure PKE as a building block to implement IND-CCA secure KEM
- Report discussing LWE variants and justification of choice of underlying hardness assumption
- Report explaining relevant cryptographic concepts e.g. notions of security, definitions of PKE and KEM and related primitives

- https://csrc.nist.gov/Projects/post-quantum-cryptography/Post-Quantum-Cryptography-Standardization
- https://cims.nyu.edu/~regev/papers/lwesurvey.pdf
- http://portal.acm.org/citation.cfm?id=1060590.1060603
- https://www.maths.ox.ac.uk/system/files/attachments/sage-introduction.pdf
- https://web.eecs.umich.edu/~cpeikert/pubs/lwe-analysis.pdf
- https://repo.zenk-security.com/Cryptographie%20.%20Algorithmes%20.%20Steganographie/Introduction%20to%20Modern%20Cryptography.pdf

- A report describing provable security and how it affects parameter choices.
- A report motivating the use of provably secure parameters.
- Proof of concept implementation of parameterizable RSA key generation.
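A proof-of-concept parameterizable key generator can be built from Miller-Rabin primality testing. This is textbook RSA for illustration only (no padding, toy key sizes, non-cryptographic randomness), not a secure implementation; real keys need at least 2048 bits and PKCS#1-conformant padding.

```python
import math
import random
random.seed(0)

def is_probable_prime(n, rounds=20):
    """Miller-Rabin probabilistic primality test."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def gen_prime(bits):
    while True:
        p = random.getrandbits(bits) | (1 << bits - 1) | 1   # top bit set, odd
        if is_probable_prime(p):
            return p

def rsa_keygen(bits=512):
    """Parameterizable textbook RSA key generation (toy sizes)."""
    e = 65537
    while True:
        p, q = gen_prime(bits // 2), gen_prime(bits // 2)
        phi = (p - 1) * (q - 1)
        if p != q and math.gcd(e, phi) == 1:
            return p * q, e, pow(e, -1, phi)   # modulus, public, private

n, e, d = rsa_keygen(512)
```

Making `bits` (and the prime-generation strategy) configurable is what allows the benchmarking deliverable to compare provably secure and standard parameter choices.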

- An implementation of the PKCS#1 v1.5 signature scheme with provably secure and standard parameters.
- A benchmarking program to compare the scheme with provably secure parameters and standard ones.
- Report explaining the implementation of RSA based schemes.
- Report explaining relevant cryptographic concepts e.g. definitions of digital signatures, notions of security, etc.

- https://eprint.iacr.org/2018/855.pdf
- https://datatracker.ietf.org/doc/html/rfc8017
- https://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf
- https://www.openssl.org/docs/manmaster/man1/openssl-speed.html

- Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in insider threat detection.
- A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied.

- Design and implement a framework for development of new and existing insider threat detection techniques
- Describe in detail the design of tests for the framework
- Describe and analyse the features that have been used during the project to detect/stop insider threats
- Critically compare the findings with related works

- The CERT insider threat database: https://insights.sei.cmu.edu/insider-threat/2011/08/the-cert-insider-threat-database.html
- The CERT insider threat dataset: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099
- Al Tabash and Happa. Insider-threat detection using Gaussian mixture models and sensitivity profiles.
- Legg et al. Towards a conceptual model and reasoning structure for insider threat detection.
- Agrafiotis et al. Identifying attack patterns for insider threat detection.
- Rashid et al. A new take on detecting insider threats: exploring the use of hidden Markov models.
- Legg et al. Caught in the act of an insider attack: detection and assessment of insider threat.

A prediction region is a region that is guaranteed to include the true value with a given probability. A joint prediction region (JPR) is an "envelope" around the predicted future path (a sequence of predictions) guaranteeing that the true future path is included in the JPR with some probability. A similar criterion is the k-family-wise error rate (k-FWE), which bounds the probability that at least k points in the path are not included in the JPR.

In a recent paper, Wolf and Wunderli proposed a bootstrap-based method to derive JPRs for time-series prediction (see reference below), evaluated it on autoregressive models, and compared it with other techniques for producing JPRs. In this project, you will implement the bootstrap JPR method and evaluate it on an application of your choice or one suggested by the supervisor.

From a methodological viewpoint, the key steps of the project will be:

- Learn a time-series predictor using a dataset of choice
- Devise a method to compute prediction standard errors for your model
- Implement the bootstrap algorithm of Wolf and Wunderli, using the above two models.
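Under strong simplifying assumptions (an AR(1) predictor, a residual bootstrap, and calibration by the maximum standardized deviation, in the spirit of Wolf and Wunderli but not their exact algorithm), the three steps above can be sketched as follows; the data is synthetic.

```python
import numpy as np
rng = np.random.default_rng(0)

# Step 1: a crude AR(1) predictor fitted to a synthetic series
series = np.cumsum(rng.normal(size=300)) * 0.1 + rng.normal(size=300)
phi = np.polyfit(series[:-1], series[1:], 1)[0]        # AR(1) slope estimate
resid = series[1:] - phi * series[:-1]

# Step 2: bootstrap future paths by resampling residuals
H, B = 5, 500                                          # horizon, bootstrap paths
paths = np.empty((B, H))
for b in range(B):
    x = series[-1]
    for h in range(H):
        x = phi * x + rng.choice(resid)                # residual bootstrap step
        paths[b, h] = x

# Step 3: calibrate a joint envelope via the max standardized deviation
center = paths.mean(axis=0)
se = paths.std(axis=0)                                 # per-horizon standard error
kmax = np.abs((paths - center) / se).max(axis=1)       # max deviation per path
c = np.quantile(kmax, 0.95)                            # joint calibration constant
lower, upper = center - c * se, center + c * se        # the JPR envelope
```

Because `c` is a quantile of the maximum over the whole horizon, the resulting envelope controls the joint (path-wise) coverage, not merely the pointwise coverage at each step.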

- Report 1: bootstrap sampling.
- Report 2: overview of chosen case study.
- Report 3: prediction regions, JPRs and bootstrap JPRs.
- Report 4: time-series models.
- Report 5: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
- Implement and evaluate predictor and prediction standard errors.

- Implement code for bootstrap JPRs and evaluate empirical coverage and family-wise error rate for different hyper-parameters (e.g., k, horizon length, sample size, number of bootstrap samples).
- Final dissertation describing problem, methods and algorithm, implementation details, computational experiments and their analysis, and conclusions.

- 1. Compare different predictors.
- 2. Compare bootstrap JPRs with other methods mentioned in the paper (e.g., joint marginals and Scheffé)
- 3. Compare bootstrap JPRs with Monte Carlo sampling based on Bayesian neural networks and uncertainty recalibration

- (bootstrap) Chapter 6, Lecture notes of 36-402, CMU Undergraduate Advanced Data Analysis, available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
- (bootstrap JPRs) Wolf, Michael, and Dan Wunderli. "Bootstrap joint prediction regions." Journal of Time Series Analysis 36.3 (2015): 352-376.
- (time-series models) Stock Market Predictions with LSTM in Python, available at https://www.datacamp.com/community/tutorials/lstm-python-stock-market
- (time-series models) Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into deep learning, chapters 8 and 9, available at https://d2l.ai/index.html
- (extension 3) Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, ICML 2016, available at http://proceedings.mlr.press/v48/gal16.pdf
- (extension 3) Kuleshov, Volodymyr, Nathan Fenner, and Stefano Ermon. "Accurate Uncertainties for Deep Learning Using Calibrated Regression." ICML 2018, available at https://arxiv.org/pdf/1807.00263.pdf

A prediction region is a region that is guaranteed to include a prediction with given probability. A joint prediction region (JPR) is an "envelope" around the predicted future path (a sequence of predictions) guaranteeing that the true future path is included in the JPR with some probability. A similar criterion is the k-family-wise error rate (k-FWE), which bounds the probability that at least k points in path are not included in the JPR.

In a recent paper, Wolf and Wunderli proposed a bootstrap-based method to derive JPRs for time-series prediction (see reference below), evaluated it on autoregressive models, and compared it with other techniques for producing JPRs. In this project, you will implement the bootstrap JPR method and evaluate it on an application of your choice or one suggested by the supervisor.

From a methodological viewpoint, the key steps of the project will be:

- Learn a time-series predictor using a dataset of choice.
- Devise a method to compute prediction standard errors for your model.
- Implement the bootstrap algorithm of Wolf and Wunderli, using the above two components.
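
The overall shape of the computation can be sketched as follows. This is a simplified illustration of the general idea (a symmetric envelope calibrated by the bootstrap distribution of the maximum standardised error over the horizon), not the exact algorithm of Wolf and Wunderli, and the function and variable names are hypothetical:

```python
# Hypothetical sketch of a bootstrap joint prediction envelope.
# Inputs: point forecasts y_hat[t], per-step standard errors se[t], and a
# matrix of bootstrapped prediction errors errs[b][t] (B replicates, H steps).
def bootstrap_jpr(y_hat, se, errs, alpha=0.1):
    """Return (lower, upper) envelopes covering the whole path jointly."""
    B, H = len(errs), len(y_hat)
    # For each bootstrap replicate, take the largest standardised absolute
    # error over the horizon; the envelope must be wide enough to cover it.
    max_stats = sorted(max(abs(errs[b][t]) / se[t] for t in range(H))
                       for b in range(B))
    # (1 - alpha) empirical quantile of the per-replicate maxima.
    q = max_stats[min(B - 1, int((1 - alpha) * B))]
    lower = [y_hat[t] - q * se[t] for t in range(H)]
    upper = [y_hat[t] + q * se[t] for t in range(H)]
    return lower, upper
```

Calibrating on the maximum statistic widens the pointwise bands just enough for the whole path to be covered jointly with probability roughly 1 - alpha, which is exactly what distinguishes a JPR from a sequence of marginal prediction intervals.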

- Report 1: bootstrap sampling.
- Report 2: overview of chosen case study.
- Report 3: prediction regions, JPRs and bootstrap JPRs.
- Report 4: time-series models.
- Report 5: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
- Implement and evaluate predictor and prediction standard errors.

- Implement code for bootstrap JPRs and evaluate empirical coverage and family-wise error rate for different hyper-parameters (e.g., k, horizon length, sample size, number of bootstrap samples).
- Final dissertation describing problem, methods and algorithm, implementation details, computational experiments and their analysis, and conclusions.

- 1. Compare different predictors.
- 2. Compare bootstrap JPRs with other methods mentioned in the paper (e.g., joint marginals and Scheffé).
- 3. Compare bootstrap JPRs with Monte Carlo sampling based on Bayesian neural networks and uncertainty recalibration.

- (bootstrap) Chapter 6, Lecture notes of 36-402, CMU Undergraduate Advanced Data Analysis, available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
- (bootstrap JPRs) Wolf, Michael, and Dan Wunderli. "Bootstrap joint prediction regions." Journal of Time Series Analysis 36.3 (2015): 352-376.
- (time-series models) Stock Market Predictions with LSTM in Python, available at https://www.datacamp.com/community/tutorials/lstm-python-stock-market
- (time-series models) Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into deep learning, chapters 8 and 9, available at https://d2l.ai/index.html
- (extension 3) Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, ICML 2016, available at http://proceedings.mlr.press/v48/gal16.pdf
- (extension 3) Kuleshov, Volodymyr, Nathan Fenner, and Stefano Ermon. "Accurate Uncertainties for Deep Learning Using Calibrated Regression." ICML 2018, available at https://arxiv.org/pdf/1807.00263.pdf

You will start by studying background literature on various
distributed graph processing platforms and then focus on a specific
platform of your choice. You will then use this platform to implement
**3** proof-of-concept programs solving a few simple graph
analytics problems chosen from the list below. This will be followed
by focusing on a **single** more advanced problem of your choice,
and implementing and evaluating a prototype solution on the
distributed platform you have chosen.

Proof-of-Concept:

- Compute the number of vertices and edges in a graph
- Compute maximum, minimum, and average vertex degrees
- Compute and analyse the vertex degree distribution (probability mass function)
- Compute the graph diameter (longest shortest path between any pair of nodes)
- Compute and analyse the distribution (probability mass function) for the number of friends of a given degree k > 1
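
As a sanity check, these statistics can first be prototyped sequentially in plain Python before being re-expressed as vertex programs on Giraph or GraphX. The toy edge list below is illustrative:

```python
from collections import Counter

# Toy sequential versions of the proof-of-concept statistics; on the actual
# project these would be expressed as distributed vertex programs.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]          # undirected edge list

vertices = {v for e in edges for v in e}
degree = Counter(v for e in edges for v in e)      # degree of each vertex

n_vertices, n_edges = len(vertices), len(edges)
max_deg, min_deg = max(degree.values()), min(degree.values())
avg_deg = 2 * n_edges / n_vertices                 # handshake lemma

# Degree distribution as a probability mass function: P(deg = k)
deg_pmf = {k: c / n_vertices
           for k, c in Counter(degree.values()).items()}
```

Having a trusted sequential baseline makes it much easier to validate the distributed implementations on small inputs.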

Advanced problems:

(A) Community Detection: A community in a social network is a collection of users whose ties to each other (e.g., via the friend/follower relation) are closer than their ties to other users. Community detection is a hot topic in social network research and industry. Communities more often than not share a large number of common traits (e.g., interests in specific topics, geographical areas, or social status), which makes them extremely useful for developing better recommendation systems or advertisement campaigns. There is an extensive literature on the community detection problem, and a large number of algorithms have been proposed. Your task will be to identify and implement an algorithm suitable for a distributed computing platform, and to evaluate its scalability and performance on various datasets.

(B) Triangle Enumeration: Similarly to community detection, finding and listing the triangles that occur in a graph is an important problem in network analysis, useful for determining network statistics such as the clustering coefficient. The literature contains several algorithms for triangle enumeration, most of which are designed for a centralised model of computation. More recently, algorithms for distributed graph frameworks have been proposed.

(C) PageRank: When considering real world graphs, we are often interested in determining the "important" vertices. In social networks, for example, important vertices correspond to well connected people. To decide whether a vertex is important, it is usually not sufficient to simply count its degree as this is not a very robust measure and can easily be skewed, for example, by connecting a vertex to many fake vertices that are solely created for this purpose and who do not have any other connections. Instead, the PageRank of a vertex depends on the number of links from vertices that are themselves important. PageRank is commonly used by search engines to determine the importance of web pages.
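The PageRank update just described can be sketched as a sequential power iteration; on the project platform the same per-vertex update would be expressed as Pregel-style message passing. The damping factor and toy graph below are illustrative:

```python
# Minimal power-iteration PageRank sketch (sequential illustration only).
def pagerank(out_links, d=0.85, iters=50):
    """out_links: dict vertex -> list of vertices it links to."""
    verts = list(out_links)
    n = len(verts)
    rank = {v: 1.0 / n for v in verts}
    for _ in range(iters):
        contrib = {v: 0.0 for v in verts}
        for v, outs in out_links.items():
            if outs:                       # distribute rank along out-links
                share = rank[v] / len(outs)
                for w in outs:
                    contrib[w] += share
            else:                          # dangling vertex: spread uniformly
                for w in verts:
                    contrib[w] += rank[v] / n
        rank = {v: (1 - d) / n + d * contrib[v] for v in verts}
    return rank
```

Note how a vertex's rank depends on the ranks of the vertices linking to it, which is precisely why PageRank is harder to skew with fake low-rank vertices than a plain degree count.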

(D) Connected components: The connected components of a graph are its maximal connected subgraphs; no two components have a vertex in common. For directed graphs, the connectivity between a pair of vertices a and b can be either strong (i.e., b is reachable from a by following a path of directed arcs, and vice versa) or weak (i.e., a and b are connected in the underlying undirected graph). You will need to design and implement a distributed algorithm to identify both weakly and strongly connected components in a given graph, and analyse their respective size distributions.
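
For the weak case, a common Pregel-friendly formulation is label propagation: every vertex starts with its own id as a label and repeatedly adopts the smallest label among itself and its neighbours. A sequential sketch (with illustrative names) is:

```python
# Label-propagation sketch for weakly connected components: each vertex
# repeatedly adopts the smallest label seen among itself and its neighbours,
# until no label changes. Directed edges are treated as undirected.
def weakly_connected_components(vertices, edges):
    nbrs = {v: set() for v in vertices}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for v in vertices:
            best = min([label[v]] + [label[w] for w in nbrs[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label
```

On a distributed platform each "adopt the minimum" step becomes one superstep of message exchange; strongly connected components require a more involved algorithm (e.g., forward/backward reachability).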

(E) SimRank: SimRank is an important metric measuring the similarity between two vertices in a graph through their structural context: two vertices are similar if they are referenced by vertices which are themselves similar. SimRank has found applications in a variety of contexts, including recommender systems, spam filtering, sponsored search, and malware detection. Your task is to implement a distributed algorithm to compute the SimRank score for any pair of vertices in a given graph and experimentally analyse its performance.

(F) Implement a prototype of an interactive browser-based graph-analytics tool that provides an interface both for specifying graph analytics problems and their inputs and for visualising their execution on a remote distributed platform (Giraph or GraphX) in a creative way. Your tool should support at least four of the proof-of-concept graph analytics algorithms above, and be extensible to allow integrating other algorithms in the future.

Data: There is a large number of graph datasets of different nature available in various online repositories. The links to the commonly used ones can be found at https://moodle.royalholloway.ac.uk/mod/page/view.php?id=168248.

Tools: Both Giraph and GraphX are available on the department's big data cluster. Other platforms (such as GraphLab and PowerGraph) may also be available.

- Report on the comparative study of the distributed platforms for processing large-scale graphs.
- Implementation and evaluation of the proof-of-concept programs.
- Detailed plan for the final project outcomes as agreed with the advisor.

- The software you built to achieve the agreed project goals. The dissertation will report on different algorithms for the advanced problem you chose comparing their strengths and weaknesses. It will discuss techniques and findings, and provide meaningful visualisation of the results.

*Type 1 diabetes (T1D)* is a disease in which the human pancreas is unable to produce a sufficient amount of insulin to regulate blood glucose levels, thereby inhibiting glucose uptake in muscle and fatty tissue. Many automated control techniques for T1D require a model for predicting future glucose levels in order to improve therapy effectiveness. Traditional models include mechanistic (aka white-box) differential equation models, which provide good accuracy and interpretability but require a potentially error-prone state estimation procedure, i.e., inferring the state of the differential equation model from past observations. In contrast, data-driven (aka black-box) machine learning models do not require state estimation, as they can be trained to map past observations directly into predicted glucose values.

In this project, you will derive models for short-term glucose prediction (up to 1 hour prediction horizon) using deep learning techniques. Here by glucose we mean the subcutaneous glucose levels as measured by a CGM sensor. First, you will review the problem of automated insulin control and existing glucose prediction models. Then, you will implement data-driven machine learning models for glucose prediction, and train them with synthetic data provided by the project supervisor. These data consist of sequences of glucose measurements, insulin dosages, and ingested meals. You will experiment with and compare the accuracy of different deep neural network architectures and different lengths for the prediction horizon. If time permits, possible extensions include: analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout; deriving optimal insulin control policies using the learned glucose prediction models; or comparison of the learned deep models with more classical autoregressive models.
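
Whatever architecture is chosen, the data must first be cast into supervised pairs. A minimal sketch of this windowing step is below; the function name is hypothetical, and assuming the typical 5-minute CGM sampling interval, a horizon of 12 steps corresponds to the 1-hour limit mentioned above:

```python
# Turn a CGM time series into supervised training pairs: each example maps
# the last `history` measurements to the value `horizon` steps ahead.
# (In the real project each input would also carry insulin and meal data.)
def make_windows(series, history, horizon):
    X, y = [], []
    for t in range(history, len(series) - horizon + 1):
        X.append(series[t - history:t])    # past glucose values
        y.append(series[t + horizon - 1])  # target: glucose `horizon` ahead
    return X, y
```

The same pairs feed both the deep models and the autoregressive baselines, so comparisons across architectures and horizons stay fair.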

- Report 1: overview of the artificial pancreas and underlying control algorithms.
- Report 2: review of main models for glucose prediction.
- Report 3: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
- Implement code for data processing and for learning neural network-based glucose predictors. Perform training for at least one NN architecture and at least one prediction horizon.

- Final dissertation describing problem, data, solution method, machine learning methods used, implementation details, computational experiments and their analysis, and conclusions.

- Analyse prediction reliability using techniques such as inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout.
- Derive optimal insulin control policies using the learned glucose prediction models.
- Compare the learned deep models with more classical autoregressive models.

- (for report 1) Boughton, Charlotte K., and Roman Hovorka. "Advances in artificial pancreas systems." Science Translational Medicine 11.484 (2019). PDF available from supervisor.
- (for reports 1 and 2) Lunze, Katrin, et al. "Blood glucose control algorithms for type 1 diabetic patients: A methodological review." Biomedical Signal Processing and Control 8.2 (2013): 107-119.
- (for report 2 and for ideas on neural net architecture) Li, Kezhi, et al. "Convolutional recurrent neural networks for glucose prediction." IEEE Journal of Biomedical and Health Informatics (2019).
- (for report 2 and for an example of an autoregressive model) Sparacino, Giovanni, et al. "Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series." IEEE Transactions on Biomedical Engineering 54.5 (2007): 931-937.
- (for Monte Carlo dropout) Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." International Conference on Machine Learning. 2016.
- (for inductive conformal prediction) Papadopoulos, Harris. "Inductive conformal prediction: Theory and application to neural networks." Tools in Artificial Intelligence. IntechOpen, 2008.

A lexical analyser is a simplified parser which performs this initial processing. The UNIX standard tool LEX is an example of such a program.

This project involves implementing a lexical analyser generator based on finite state automata. The algorithms are well known but implementation details are important in the development of an efficient tool.

For this project you will need programming aptitude. The implementation may be produced using the parser generator tool rdp or from scratch. This will be discussed in the detailed project design phase.
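
To make the end product concrete, the sketch below shows a hand-constructed, table-driven DFA tokeniser of the kind the generator will eventually emit automatically; the states, character classes, and the two token types are purely illustrative:

```python
# A hand-constructed DFA tokeniser recognising identifiers and numbers,
# using longest match. A lexer generator would build DFA/ACCEPT tables like
# these automatically from regular expressions via Thompson's construction,
# subset construction, and minimisation.
def classify(ch):
    if ch.isalpha(): return 'letter'
    if ch.isdigit(): return 'digit'
    return 'other'

# transition table: state -> {character class -> next state}
DFA = {
    'start': {'letter': 'ident', 'digit': 'number'},
    'ident': {'letter': 'ident', 'digit': 'ident'},
    'number': {'digit': 'number'},
}
ACCEPT = {'ident': 'IDENT', 'number': 'NUMBER'}

def tokenise(text):
    """Longest-match tokenisation; skips spaces, rejects anything else."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == ' ':
            i += 1
            continue
        state, j, last = 'start', i, None
        while j < len(text) and classify(text[j]) in DFA.get(state, {}):
            state = DFA[state][classify(text[j])]
            j += 1
            if state in ACCEPT:
                last = (j, ACCEPT[state])  # remember last accepting position
        if last is None:
            raise ValueError('lexical error at position %d' % i)
        end, kind = last
        tokens.append((kind, text[i:end]))
        i = end
    return tokens
```

The generator's job is precisely to produce such tables, and efficient representations of them, from user-supplied regular expressions.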

- Write up of regular expressions, Thompson's construction, subset construction, minimisation.
- Implementation design for the lexer generator, with an analysis of the data structures needed.
- User interface design, data structure design.
- Proof of concept implementation using hand constructed automata.

- The project report will contain a description of the construction of automata from regular expressions and their combination into a tokeniser, written at a level suitable for a beginning final year student to understand.
- An implementation of a lexical analyser generator which can take as input a set of regular expressions, generate a corresponding set of finite state automata, and use these to tokenise an input string.

L-systems, or Lindenmayer systems (named after their originator, Aristid Lindenmayer), are grammar-based mechanisms for producing geometric figures. Like a conventional context-free grammar, an L-system comprises an alphabet and a set of productions. An initial seed string is given as the starting sentential form. Strings are transformed by applying the productions to all viable elements, and the resulting strings are interpreted using a graphics engine.

Recursive rules may be used to define fractal-like objects, and by attaching probabilities to individual productions we can produce images which mimic some aspects of biological systems, including plant growth.
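
The core rewriting step is small; the sketch below uses Lindenmayer's classic algae system (A -> AB, B -> A) and omits the graphical interpretation:

```python
# Minimal L-system rewriting sketch (no graphics). Unlike sequential grammar
# derivation, every symbol is rewritten simultaneously at each step.
def rewrite(axiom, rules, steps):
    s = axiom
    for _ in range(steps):
        s = ''.join(rules.get(ch, ch) for ch in s)  # parallel substitution
    return s

algae = {'A': 'AB', 'B': 'A'}  # Lindenmayer's original algae example
```

A stochastic variant replaces each rule's single right-hand side with a probability-weighted choice among several alternatives.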

To succeed in this project, you need to be interested in automata and graphics, and you must possess good programming skills, including the use of GUIs in Java.

- A set of small worked manual examples of non-stochastic L-systems.
- A Java program which interprets strings and draws sequences of line segments.
- A report surveying classes of L-systems.

- Demonstrate a working GUI-based L-system environment.
- Extend your system to support stochastic productions and show how they can mimic plant growth.
- A final report on the project activities, fully acknowledging and citing sources.

- Investigate the relationship between L-systems and Semi-Thue systems.
- Extend your system to draw 3-dimensional objects.

- The book 'The Algorithmic Beauty of Plants', available free from http://algorithmicbotany.org/papers/abop/abop.pdf.
- Many online resources: start with Wikipedia

- Proof of concept program: identify one machine learning task suitable for the smart meter data and implement one learning algorithm for the chosen task.
- Report: An overview of the datasets used. Statistics and numbers involved.
- Report: A report on machine learning tasks for smart meter data.

- Survey of machine learning techniques used for smart meter data analysis.
- The developed machine learning software for smart meter data.
- The software engineering process involved in generating your software.
- Experiments with different smart meter data.

- Extend the learning algorithms to an online setting.
- Provide predictions with a confidence measure.

- Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. Sean Barker, Aditya Mishra, David Irwin, Emmanuel Cecchet, Prashant Shenoy, and Jeannie Albrecht. Proceedings of the 2012 Workshop on Data Mining Applications in Sustainability (SustKDD 2012), Beijing, China, August 2012.
- Smart* Data Set for Sustainability http://traces.cs.umass.edu/index.php/Smart/Smart
- SmartMeter Energy Consumption Data in London Households https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
- UK Domestic Appliance-Level Electricity (UK-DALE) dataset http://jack-kelly.com/data/
- Revealing Household Characteristics from Smart Meter Data. Christian Beckel, Leyna Sadamori, Thorsten Staake, Silvia Santini, Energy, Vol. 78, pp. 397-410, December 2014.
- Piotr Mirowski, Sining Chen, Tin Kam Ho, Chun-Nam Yu, Demand Forecasting in Smart Grids, Bell Labs Technical Journal, 18(4), pp.135-158, 2014.

We spend most of our time indoors, where limited or no GPS service is available. The most popular indoor positioning solution leverages the ubiquitous WiFi signal to create a radio map of the tracking zone (i.e., WiFi fingerprinting).

WiFi fingerprinting is interesting because it may be treated as either a classification or a regression problem. If the training labels are viewed as a discrete, finite set of indoor locations, it can be addressed as a classification problem. If, instead, the labels are viewed as real-valued Cartesian coordinates, a regression model becomes more suitable, particularly when there are many training positions.

The challenge for both approaches is the high degree of variation of the radio signal, and the high dimensional training examples. Hence, it is useful to compare their performances on a real world dataset.
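
As a concrete starting point for the classification view, a k-nearest-neighbour baseline over RSSI vectors can be sketched as follows; the distance measure, the RSSI values, and k are illustrative:

```python
# k-nearest-neighbour sketch for fingerprint classification: each fingerprint
# is a vector of RSSI readings (one per access point), each label a location.
def knn_predict(train_X, train_y, query, k=3):
    # Squared Euclidean distance from the query to every stored fingerprint.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = {}
    for _, label in dists[:k]:           # majority vote among k nearest
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

The regression view would instead average the coordinates of the k nearest fingerprints rather than voting over discrete location labels.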

- Producing a simple proof-of-concept implementation of some classification and regression algorithms from scratch.
- A report of the empirical results of such learning algorithms on a small synthetic dataset.

- A detailed literature review of the machine learning application in WiFi fingerprinting.
- Producing a detailed report of the dataset structure, including visual analysis (e.g. t-SNE, UMAP, etc.).
- Implementing the algorithms on the entire competition dataset, and optimising their parameters.
- Carefully analysing and comparing the performance accuracy of such learning algorithms.

- Providing location estimations with some confidence measure.
- Extending the learning algorithms to an online setting, where the user provides continuous samples as she traverses the building.

- Bahl, P. and Padmanabhan, V. N. (2000). RADAR: An in-building RF-based user location and tracking system. In INFOCOM 2000. 19th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 2, pp. 775-784).
- Honkavirta, V., Perala, T., Ali-Loytty, S., and Piché, R. (2009). A comparative survey of WLAN location fingerprinting methods. In Positioning, Navigation and Communication, 2009. WPNC 2009. 6th Workshop on (pp. 243-251). IEEE.
- https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc

Implement Map and Reduce programs for several useful data analytics problems, and analyse their performance.

In this project, you are asked to code MapReduce programs for a number of typical analytics problems of varying degrees of difficulty, deploy them onto a Hadoop cluster, analyse their performance, and propose an optimal parameter setting.

To test your programs, you will need to generate sample data sets and/or download them from online sources. Pointers to several such sources will be provided by the advisor. The problems you are asked to solve are as follows:

- Word Count: Given a large collection of web pages, find the number of occurrences of each word. This is the "Hello World!" of MapReduce.
- Distributed Grep: Given a large collection of web pages, and a regular expression, find all strings matching the regular expression. The output must be a list of tuples each of which consists of a matching string and its location in the input (e.g., file name, and offset or line number).
- Basic statistics analysis of a simple data set: Given a large collection of numbers (e.g., representing times to failure of a software component across a large number of experiments), generate a frequency distribution of the numbers at fixed percentiles (e.g., 10%), find various order statistics such as the minimum, maximum, and median, and visualise the shape of the distribution.
- Implement an advanced data mining algorithm (e.g., K-Means, or a classifier) from [4], apply it to discover interesting features in a realistic data set, and visualise the results. The problem, algorithms, and data sets will be chosen in coordination with the advisor.
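
The word-count problem illustrates the two MapReduce primitives; the sketch below simulates the shuffle phase locally (Hadoop performs the same grouping by key at scale), and the function names are illustrative, not the Hadoop API:

```python
# Word count expressed as the two MapReduce primitives, with a tiny local
# driver standing in for Hadoop's shuffle (sort + group by key).
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    for word in document.split():
        yield (word.lower(), 1)          # emit (word, 1) per occurrence

def reduce_fn(word, counts):
    yield (word, sum(counts))            # total occurrences per word

def run_mapreduce(documents):
    pairs = sorted(kv for doc in documents for kv in map_fn(doc))
    out = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, v in reduce_fn(word, (c for _, c in group)):
            out[k] = v
    return out
```

Writing the problem in this map/shuffle/reduce shape first makes the translation to actual Hadoop Mapper and Reducer classes largely mechanical.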

- Solution of Problems 1 and 2, their performance analysis, recommendation for cluster configuration and parameter settings, and result visualisation.
- Reports: Map and Reduce algorithm description; experience with running your code on a Hadoop cluster; practical problems encountered, and their solutions; experience with performance analysis and parameter tuning; recommendations for optimal parameter settings for each of the analysed problems.

- Solution for Problems 3 and 4, their performance analysis, recommendation for cluster configuration and parameter settings, and visualisation of the results.
- The report will provide a reflective overview of the MapReduce paradigm, and its Hadoop implementation.
- The report will describe implementation of the Map and Reduce primitives for all four problems.
- The report will discuss and motivate the choice of the data sets for each of the four problems.
- If synthetic data sets were used, the report will outline the procedures used to generate the data sets.
- The report will reflect on the experience of running, analysing and tuning your implementation on a Hadoop cluster.

- Implement advanced data mining techniques. Apply them to interesting data sets.
- Talk to machine learning experts in the department to identify open massive-scale analytics problems they are interested in solving, and implement and run them in the Hadoop framework. One such problem could be identifying trends in epidemic spread using real-time data supplied by mobile devices. If you are interested in looking at this problem, you should talk to Chris Watkins and Gregory Chockler.

- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 2004, pages 137-150, 2004.
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
- Jeffrey Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques, page 1, New York, NY, USA, 2006. ACM.
- Rajaraman, Leskovec, Ullman. Mining of Massive Datasets. Cambridge University Press. Available online at http://infolab.stanford.edu/~ullman/mmds.html.
- MapReduce bibliography by A. Kamil, 2010: http://www.columbia.edu/~ak2834/mapreduce.html.
- Hadoop: http://hadoop.apache.org.

- Report on Basics of Computational Complexity.
- First report on Local Search.
- Report on Tabu Search.
- Implementation of Tabu Search for pseudo-boolean optimisation.

- Full report on Local Search.
- Report on Iterated Local Search.
- Implementation of Iterated Local Search for pseudo-boolean optimisation.
- Report on Genetic Algorithms.
- Implementation of Genetic Algorithm for pseudo-boolean optimisation.
- Report on Memetic Algorithms.
- Implementation of Memetic Algorithm for pseudo-boolean optimisation.
- Graphic User Interface for the four implementations.
- Computational comparison of the four implementations.
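
All four metaheuristics build on the same basic move, flipping a single bit of the candidate solution. A bare-bones greedy local search for pseudo-boolean maximisation can be sketched as follows (Tabu Search adds a tabu list forbidding recently flipped bits; names here are illustrative):

```python
# First-improvement local search for pseudo-boolean maximisation:
# f maps a 0/1 list to a number; flips are kept only if they improve f.
def local_search(f, x):
    x = list(x)
    best = f(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            x[i] ^= 1                      # flip bit i
            val = f(x)
            if val > best:                 # keep an improving flip
                best = val
                improved = True
            else:
                x[i] ^= 1                  # undo a non-improving flip
    return x, best
```

Iterated Local Search restarts this loop from perturbed copies of the best solution found; Genetic and Memetic Algorithms maintain a population of such bit strings instead of a single one.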

- Implementing metaheuristics for other optimisation problems.

- Good programming skills, particularly in C, C++, and Python, and knowledge of Windows/Linux internals.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing evasion techniques used by malware
- A report giving an overview of existing detection/mitigation techniques for evasion
- A clear workplan describing the set of tests to be performed, and the classes of evasion mechanisms/techniques to be analysed

- A thorough analysis of evasion techniques, e.g. against sandboxing, virtualised environments and debugging
- Describe in detail the design of the tests
- Describe and analyse the features that have been used during the project to detect/stop evasion
- Critically compare the findings with related works
- Suggestions for improving the security of sandboxes with respect to this problem

- A Survey On Automated Dynamic Malware Analysis Evasion and Counter-Evasion: PC, Mobile, and Web
- https://github.com/a0rtega/pafish
- https://github.com/LordNoteworthy/al-khaser
- https://github.com/joesecurity/pafishmacro

OWL ontologies are essentially descriptions of certain enterprises, reminiscent of a database model of an enterprise; the advantage of ontologies over databases is the ability to model not only basic individual information but also semantic connections between entities in the enterprise (e.g., that a certain class of entities is a sub-class of a different class of entities). This gives the user the ability to infer *new* information from existing information (and not only to extract information that is already present).
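
As a toy illustration of that inference step (in plain Python rather than OWL; the class and instance names are invented):

```python
# From asserted facts (subclass axioms and class memberships) we derive
# *new* facts that were never stated explicitly.
subclass_of = {                        # asserted: child -> parent
    "Employee": "Person",
    "Manager": "Employee",
}
instance_of = {"alice": "Manager"}     # asserted class membership

def superclasses(cls):
    """All transitively inferred superclasses of a class."""
    out = set()
    while cls in subclass_of:
        cls = subclass_of[cls]
        out.add(cls)
    return out

def classes_of(individual):
    """The asserted class plus everything inferred from the hierarchy."""
    c = instance_of[individual]
    return {c} | superclasses(c)

# "alice is a Person" was never asserted, but it is inferred.
print(classes_of("alice"))
```

A Description Logic reasoner (as used by Protege) performs far richer inference than this, over quantified class restrictions rather than simple hierarchies.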

Ontologies are one of the technologies of the Semantic Web and are currently used extensively in domains such as medicine, biology, and in the general area of digital humanities including archives.

The project will offer students the opportunity to learn some basic concepts behind the rich and fundamental theory of ontologies, namely Description Logics (which are restricted fragments of first-order logic), and the use of ontology frameworks (for example, Protege) on real (or artificial) data sets.

A successful project will
(1) Investigate and understand some basic concepts of the logical theory of Description Logic: syntax and semantics;
(2) Develop an OWL ontology for a particular chosen domain using the Protege tool; or work with, and possibly extend, an existing model, exemplifying sophisticated inferencing in the model;
(3) Show the applicability of the ontology for storing information and inferring new information from it.
(4) There are two options for this project that a student can take: **Option 1**:
the more logical one; **Option 2**: the more ontology-engineering one.
Each option will include some parts of the other, but with a different emphasis.

- Project plan with tasks and milestones.
- An overview of ontologies: what they are, basic technical concepts (RDF, RDFS and basic OWL).
- Option 1: An overview of Description Logic that supports ontological reasoning.
- Option 2: Gathering of requirements from a user, or justifying such requirements from the perspective of a potential user.
- Initial experiments with Protege, and building a preliminary model for the chosen domain. (The project intends to utilise Protege, but other available or developed tools for ontology engineering are certainly legitimate.)

- An ontology of the knowledge domain associated with the data set.
- Option 2: a web front-end for the data set based on the ontology.
- Option 1: A discussion of Description Logic with explicit examples in the ontology constructed.
- Option 1: Investigate inferencing algorithms in Description Logic and their run-time complexity.
- At least a partial encoding of the data set in the ontology.
- Option 1: Exemplifying sophisticated inferencing using Protege (or other selected tools): using restrictions (quantifiers).
- Option 2: The report will include examples of usage of the system.
- Option 2: The report will discuss the perceived advantages and disadvantages of using ontology frameworks for representing and querying the chosen data set.

- Option 2: Connect the ontology to other existing ontologies in related areas.
- Option 2: User-friendly interfaces for data representation and querying.
- Option 2: Investigate inferencing algorithms in Description Logic and their run-time complexity.
- Option 1: User-friendly interfaces for data representation and querying.
- Option 1: a web front-end for the data set based on the ontology.

- Option 2: Semantic Web for the Working Ontologist. Dean Allemang and Jim Hendler.
- Option 1: A Description Logic Primer. Markus Krötzsch, František Šimančík, Ian Horrocks.
- Both options: http://protege.stanford.edu

- Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in network simulations.
- A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied.

- Design and implement a framework for development of new and existing network simulation techniques
- Describe in detail the design of tests for the framework
- Describe and analyse the features that have been used during the project to simulate networks
- Critically compare the findings with related works

- GNS3: https://www.gns3.com/
- https://www.sciencedirect.com/topics/computer-science/network-simulation

Modern applications often consist of multiple parts that are connected using Web-based APIs (often following REST principles). A (cloud-based) backend provides persistence and implements business logic, while frontends, such as mobile and web apps, provide user interfaces that interact with the backend services. Often, 3rd-party providers can integrate their services as well, sometimes using the same APIs as the first-party application components.

In such a diverse and distributed setting, authentication and authorization are key to protect resources from unauthorized access. Protocols like OAuth 2.0 and OpenID Connect aim to solve this issue and have found widespread adoption in industry. However, their application is often not trivial and can be easily implemented in flawed ways.

To illustrate how such mechanisms can be applied in a good way, we want to build a mock application that consists of multiple components (such as a backend, a mobile app, and a web interface) and that allows for 3rd-party integration. This application should make use of common authentication and authorization frameworks (using appropriate libraries). Due to the nature of such applications, you will need to develop several components in different programming languages, potentially using different libraries.
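
For a flavour of what the integration involves, here is a sketch of constructing the authorization request of the OAuth 2.0 authorization code grant (RFC 6749, Section 4.1.1); the endpoint and client values are hypothetical:

```python
import secrets
from urllib.parse import urlencode

def authorization_request_url(auth_endpoint, client_id, redirect_uri, scope):
    """Build an RFC 6749 authorization-code request URL.  The `state`
    value must be stored and verified on the redirect back to the
    client, to protect the flow against CSRF."""
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",       # authorization code grant
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return auth_endpoint + "?" + urlencode(params), state

# Hypothetical authorization server and client registration:
url, state = authorization_request_url(
    "https://auth.example.com/authorize",
    "my-client-id",
    "https://app.example.com/callback",
    "profile",
)
```

In the real project this step would be handled by a vetted OAuth library rather than hand-rolled; the sketch only shows which parameters the protocol requires.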

- This is a challenging project that will require good programming skills with proficiency in multiple programming languages.
- You need to be able to identify good libraries and integrate these in your project.

- A skeleton of a mock application (in a field of your choice) with a backend, a RESTful API (for 3rd-party integrations and optionally frontend), and at least one frontend (web interface or mobile app).
- A report describing the authentication and authorization mechanisms and best practices and how you plan to integrate these in your application.

- The full application with integration of authentication and authorization mechanisms.
- Complete report

- https://tools.ietf.org/html/rfc6749 (OAuth 2.0)
- https://tools.ietf.org/html/rfc6750 (Bearer Token Usage)
- https://tools.ietf.org/html/rfc6819 (OAuth 2.0 Threat Model and Security Considerations)
- https://tools.ietf.org/html/draft-ietf-oauth-security-topics-21 (Draft OAuth Security BCP)
- https://tools.ietf.org/html/draft-ietf-oauth-browser-based-apps-12 (Draft Best Practices for OAuth Browser-Based Apps)
- https://openid.net/specs/openid-connect-core-1_0.html (OpenID Connect Core 1.0)
- https://www.oauth.com/ (OAuth 2.0 Simplified)
- https://www.oauth.net/ (Community page for OAuth 2.0)
- Daniel Fett, Ralf Küsters, Guido Schmitz. 2016. A Comprehensive Formal Security Analysis of OAuth 2.0. In: 23rd ACM SIGSAC Conference on Computer and Communications Security (CCS 2016). https://dl.acm.org/doi/pdf/10.1145/2976749.2978385
- Daniel Fett, Ralf Küsters, Guido Schmitz. 2017. The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines. In: IEEE 30th Computer Security Foundations Symposium (CSF 2017). https://ieeexplore.ieee.org/iel7/8048777/8049639/08049720.pdf

- Proof of concept program: tf-idf based similarity measurement
- Proof of concept program: word-embedding based similarity measurement
- (Optional) Proof of concept program: sentence-embedding based similarity measurement.
- A report describing and assessing the results of applying the proof of concept programs on different sentence pairs.
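
The tf-idf proof of concept can be prototyped with the standard library alone; a minimal sketch (with a toy corpus) might look like:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenised document to a {term: tf-idf weight} dict."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))       # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["the cat sat on the mat".split(),
        "the cat sat on the hat".split(),
        "neural networks learn representations".split()]
vecs = tfidf_vectors(docs)
# The first two sentences share most terms; the third shares none.
```

This baseline is exactly what the later neural models should be compared against.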

- Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing with problems (i) and (ii) than other models.
- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result as S''); we can assume that S and S'' are similar sentences.
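
The round-trip recipe above can be sketched as follows; the `translate` function here is a stub standing in for a real machine-translation service such as Google Translate:

```python
def back_translate(sentence, translate):
    """Round-trip a sentence through another language to obtain a
    paraphrase.  `translate(text, src, tgt)` is a stand-in for any
    MT API; the pairs (S, S'') become extra training examples."""
    s_prime = translate(sentence, src="en", tgt="de")    # S  -> S'
    s_double = translate(s_prime, src="de", tgt="en")    # S' -> S''
    return sentence, s_double

# Stub translator for illustration only; a real MT system would go here.
lexicon = {"the cat sits": "die Katze sitzt",
           "die Katze sitzt": "the cat is sitting"}
pair = back_translate("the cat sits", lambda t, src, tgt: lexicon[t])
```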

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

- Proof of concept program: tf-idf based similarity measurement
- Proof of concept program: word-embedding based similarity measurement
- (Optional) Proof of concept program: sentence-embedding based similarity measurement.
- A report describing and assessing the results of applying the proof of concept programs on different sentence pairs.

- Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing with problems (i) and (ii) than other models.
- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result as S''); we can assume that S and S'' are similar sentences.

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

- Proof of concept program: tf-idf based similarity measurement
- Proof of concept program: word-embedding based similarity measurement
- (Optional) Proof of concept program: sentence-embedding based similarity measurement.

- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

- Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
- (Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
- A report describing and assessing the results of applying the proof of concept programs on data.

- Implement an LSTM-based Encoder-Decoder network and train it on a few hundred thousand real data samples
- Augment the Encoder-Decoder network with Pointer Mechanism and Attention Mechanism, compare their performance against vanilla Encoder-Decoder
- (Optional) Implement Transformer-based Encoder-Decoder, compare its performance against the other models
- A report summarising all the experiments and analysing strengths and weaknesses of each model
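
To give a feel for the attention mechanism mentioned above, here is a minimal dot-product attention computation in plain Python (the vectors are toy values; real models operate on learnt, high-dimensional states):

```python
import math

def attention(query, keys, values):
    """Dot-product attention: a softmax over query-key scores gives
    weights for a weighted sum of the value vectors (the 'context')."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]     # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Toy encoder states: the query is closest to the second key,
# so the second value dominates the context vector.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
weights, context = attention([0.0, 1.0], keys, values)
```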

- Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boosts the performance.
- Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
- A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

- Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
- (Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
- A report describing and assessing the results of applying the proof of concept programs on data.

- Implement an LSTM-based Encoder-Decoder network and train it on a few hundred thousand real data samples
- Augment the Encoder-Decoder network with Pointer Mechanism and Attention Mechanism, compare their performance against vanilla Encoder-Decoder
- (Optional) Implement Transformer-based Encoder-Decoder, compare its performance against the other models
- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boosts the performance.
- Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
- A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

- Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
- (Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
- A report describing and assessing the results of applying the proof of concept programs on data.

- A report summarising all the experiments and analysing strengths and weaknesses of each model

- Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
- Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Mapping applications are hugely popular on mobile devices, and many of us would be lost without them. In most cases - such as Google Maps - the maps are represented as image tiles that must be cached for offline use. Searching or route planning requires a data connection.

An alternative representation is "vector maps" where the actual location data of the roads and buildings is stored and rendered in realtime (one such app is "osmand"). Because the actual map data is available, it can be searched and presented as the user wishes without a data connection.

This project will experiment with new HTML5 technologies to produce a vector map application.
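
At the core of any such renderer is projecting longitude/latitude into tile or pixel coordinates; the standard Web Mercator formulas used by OSM "slippy map" tiles can be sketched as:

```python
import math

def latlon_to_tile(lat, lon, zoom):
    """Longitude/latitude (degrees) -> OSM slippy-map tile x/y at a
    given zoom level, using the Web Mercator projection."""
    n = 2 ** zoom                       # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_r = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_r)) / math.pi) / 2.0 * n)
    return x, y

# Central London at zoom level 10; results can be checked against
# tile.openstreetmap.org tile URLs.
tile = latlon_to_tile(51.5074, -0.1278, 10)
```

In the project the same maths would run in JavaScript to place vector features on the HTML5 canvas.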

- Proof of concept programs: A "hello world" offline HTML5 application using Cache and Service Workers.
- Proof of concept programs: A "todo list" application using IndexedDB or Web Storage.
- Proof of concept programs: An application to draw shapes using the HTML5 canvas.
- Proof of concept programs: A web page that loads and lists Open Street Map data in raw form.
- Reports: Basic web page development in HTML5: HTML, JavaScript, CSS.
- Reports: Advanced technologies: jQuery, HTML5 canvas.
- Reports: Developing an offline HTML5 application: Web Storage, Indexed DB, Service Workers and the Cache API.
- Reports: Open Street Map data representation and vector vs. image tile maps.

- The program will work offline.
- The program will allow the user to load and display map data.
- The program will allow the user to zoom and move around the map.
- Features for improving rendering performance such as database indexing and image caching.

- Allowing the user to search for interesting features.
- Allowing the dynamic display of different kinds of data (E.g. highlight cafes on screen).
- The support of different OSM data formats.
- Downloading map data dynamically when a connection is present.

- Web Development Tutorials: http://www.w3schools.com/
- HTML5 Canvas: https://www.w3schools.com/graphics/canvas_intro.asp
- Service Workers: https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API/Using_Service_Workers
- Cache API: https://developer.mozilla.org/en-US/docs/Web/API/Cache
- IndexedDB: https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API
- Web Storage: https://developer.mozilla.org/en-US/docs/Web/API/Web_Storage_API/Using_the_Web_Storage_API
- Open Street Map: http://wiki.openstreetmap.org/wiki/Develop

- Project plan with tasks and milestones.
- An overview of the mechanisms that support ontological reasoning - description logic.
- Gathering of requirements from a real user.
- Initial experiments with an ontology language.

- An ontology of the knowledge domain associated with the data set.
- At least a partial encoding of the data set in the ontology.
- A web front-end for the data set based on the ontology.
- The report will include a justification for the chosen ontology representation based on the requirements of the user.
- The report will include examples of usage of the system.
- The report will discuss the perceived advantages and disadvantages of using ontology frameworks for representing and querying the chosen data set.

- Connect the ontology to other existing ontologies in related areas.
- User-friendly interfaces for data representation and querying.

- http://protege.stanford.edu
- http://en.wikipedia.org/wiki/Ontology_(information_science)

Many machine learning algorithms involve computations that can be run independently in parallel. Examples include: (1) the updates to different coordinates when performing gradient descent (for linear/logistic/ridge regression problems); (2) computing the distances between observations and centroids in K-Means; (3) computing the distances between different pairs of clusters in Agglomerative Hierarchical Clustering. When the datasets are huge (say millions of observations), we really want to perform these computations in parallel, in order to speed up the whole computation process. Thanks to the recent advances of GPU technology, this approach can now be implemented with thousands of cores at a relatively low cost.

In this project, you will learn GPU programming with Python and CUDA (a parallel computing platform and API). You will pick one machine learning algorithm you have learnt in our MSc program, and implement it on the GPU.

Note that in GPU programming, you need to implement many computations from first principles, because most library functions (e.g. those in numpy and sklearn) cannot be used on the GPU. You are advised to choose this topic only if you are comfortable with the assignments in CS5100 or CS5100J.

- Be familiar with the machine learning algorithm you will implement. Proof of Concept (PoC) program: implement the algorithm using the CPU. This PoC will help you identify the computations that can be parallelized on the GPU.
- Learn GPU programming with Python and CUDA. Proof of Concept programs: (i) compute the elementwise sum of two huge matrices; (ii) compute the product of two huge matrices; (iii) compute the distance between every pair of observations in a huge dataset.
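
PoC (iii) on the CPU makes the parallelism explicit: every (i, j) pair below is independent of the others, so a GPU kernel could assign one thread per pair. A plain-Python sketch:

```python
import math

def pairwise_distances(data):
    """Euclidean distance between every pair of observations.
    Each (i, j) entry depends only on rows i and j, which is what
    makes this loop nest a candidate for one GPU thread per pair."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(data[i], data[j])))
            dist[i][j] = dist[j][i] = d
    return dist

# Toy dataset: a 3-4-5 right triangle in the plane.
data = [[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]]
dist = pairwise_distances(data)
```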

- Implement the machine learning algorithm using GPU.
- A report on comparing the performance of the GPU program against the PoC CPU program.

- If you choose to implement a stochastic descent algorithm, there are different schemes for how coordinates or observations are sampled. Implement these schemes and compare them.
- If you choose to implement K-Means algorithm, consider distance measures other than Euclidean, e.g. Minkowski norm. Then for each cluster, a generalized centroid should be computed instead of the classical centroid. One method to compute the generalized centroid is gradient descent, which can be parallelized using GPU.

- Wes Armour. Course on CUDA Programming on NVIDIA GPUs. Online, first available in 2019.
- Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers 2004.
- Yun Kuen Cheung, Richard Cole and Yixin Tao. Fully Asynchronous Stochastic Coordinate Descent: A Tight Lower Bound on the Parallelism Achieving Linear Speedup. Mathematical Programming Series A 2020.

Many machine learning algorithms involve computations that can be run independently in parallel. Examples include: (1) the updates to different coordinates when performing gradient descent (for linear/logistic/ridge regression problems); (2) computing the distances between observations and centroids in K-Means; (3) computing the distances between different pairs of clusters in Agglomerative Hierarchical Clustering. When the datasets are huge (say millions of observations), we really want to perform these computations in parallel, in order to speed up the whole computation process. Thanks to the recent advances of GPU technology, this approach can now be implemented with thousands of cores at a relatively low cost.

In this project, you will learn GPU programming with Python and CUDA (a parallel computing platform and API). You will pick one machine learning algorithm you have learnt in our MSc program, and implement it on the GPU.

Note that in GPU programming, you need to implement many computations from first principles, because most library functions (e.g. those in numpy and sklearn) cannot be used on the GPU. You are advised to choose this topic only if you are comfortable with the assignments in CS5100 or CS5100J.

- Be familiar with the machine learning algorithm you will implement. Proof of Concept (PoC) program: implement the algorithm using the CPU. This PoC will help you identify the computations that can be parallelized on the GPU.
- Learn GPU programming with Python and CUDA. Proof of Concept programs: (i) compute the elementwise sum of two huge matrices; (ii) compute the product of two huge matrices; (iii) compute the distance between every pair of observations in a huge dataset.

- Implement the machine learning algorithm using GPU.
- A report on comparing the performance of the GPU program against the PoC CPU program.

- If you choose to implement a stochastic descent algorithm, there are different schemes for how coordinates or observations are sampled. Implement these schemes and compare them.
- If you choose to implement K-Means algorithm, consider distance measures other than Euclidean, e.g. Minkowski norm. Then for each cluster, a generalized centroid should be computed instead of the classical centroid. One method to compute the generalized centroid is gradient descent, which can be parallelized using GPU.

- Wes Armour. Course on CUDA Programming on NVIDIA GPUs. Online, first available in 2019.
- Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers 2004.
- Yun Kuen Cheung, Richard Cole and Yixin Tao. Fully Asynchronous Stochastic Coordinate Descent: A Tight Lower Bound on the Parallelism Achieving Linear Speedup. Mathematical Programming Series A 2020.

Software testing is one of the most difficult tasks developers and managers are faced with when developing a project. In complex projects new bugs can be introduced often, and traditional methods of testing software such as unit testing have shown limited effectiveness and carry a large development cost.

Symbolic execution is a technique for testing software by automatically running a program multiple times, each time with a different input that causes the program to follow a different control flow. This can be a valuable tool for testing as it allows the automated exploration of many distinct control flows of a program.

Prioritising the potential test cases in the queue to be tested can improve the effectiveness of the testing framework and increase the chances of exposing an issue sooner. This is especially important with symbolic execution frameworks, where the total number of tests that could be generated is often infeasibly large. Metrics such as code coverage or number of high risk features encountered can be used as a method for scoring each test case, and the selection of useful metrics is key to improving vulnerability detection.

The project will build upon an existing framework for symbolic execution and test generation for JavaScript, and extend it with advanced path selection algorithms.
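
The heart of such a prioritisation scheme is a scored queue of pending paths. A minimal sketch, using number-of-new-branches-covered as the (assumed) scoring metric:

```python
import heapq
import itertools

class PathQueue:
    """Priority queue of pending symbolic-execution paths, highest
    score first.  A path's score here is the number of branches it
    would cover that no previously selected path has covered."""
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # FIFO order among equal scores
        self.covered = set()

    def push(self, path, branches):
        score = len(set(branches) - self.covered)
        # negate the score: heapq is a min-heap, we want the max first
        heapq.heappush(self._heap,
                       (-score, next(self._tie), path, set(branches)))

    def pop(self):
        _, _, path, branches = heapq.heappop(self._heap)
        self.covered |= branches        # record coverage once selected
        return path

q = PathQueue()
q.push("path-a", {1, 2})
q.push("path-b", {1, 2, 3, 4})
q.push("path-c", {5})
# path-b would cover the most new branches, so it is explored first.
```

A fuller implementation would rescore queued paths as coverage grows and combine several metrics, which is exactly the design space this project explores.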

- Code: A simple implementation of a test prioritisation algorithm
- Code: Modifying an existing testing framework to expose code metrics for scoring paths
- Report: Effectiveness of software testing metrics, e.g., code coverage
- Report: Basic mechanics of symbolic execution
- Report: Search strategies in symbolic execution

- Code: A full test prioritisation algorithm for scoring paths based on code metrics
- Code: At least two different strategies for paths scoring, e.g., code coverage or test execution time
- Report: Implemented algorithms and search strategies
- Report: Experimental evaluation and comparison of search strategies
- Report: Responsible disclosure of vulnerabilities
- Experiments with all search strategies on real world software. Building test harnesses to exercise multiple entry points.

- Implement and evaluate search strategies based on genetic algorithms
- Implement and evaluate scoring metrics based on estimated bug probabilities

- Patrice Godefroid, Nils Klarlund, Koushik Sen: DART: directed automated random testing. PLDI 2005: 213-223
- Cristian Cadar, Daniel Dunbar, Dawson R. Engler: KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. OSDI 2008: 209-224

Advanced knowledge of JavaScript

Many puzzles and games require logical thinking to choose a best move, or to find a hidden pattern. We have moved on a long way from games of patience. What unites them is that they are not time based and are all just for one player. They can all be solved by AI.

We most often use a backtrack-based search to find the best move, or the solution. We can employ algorithms to improve backtracking, such as consistency pruning (e.g. arc-consistency), look-ahead (e.g. forward checking), efficient backtracking (like backjumping), and variable and value ordering heuristics (like Dynamic Variable Ordering).
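
The basic backtracking search can be sketched on the classic n-queens problem (one of the proof-of-concept exercises suggested below):

```python
def solve_queens(n):
    """Place n queens so that none attack each other, by backtracking:
    extend a partial placement column by column, undoing the last
    choice (backtracking) whenever no safe row exists."""
    def safe(cols, row):
        c = len(cols)                  # column we are trying to fill
        return all(r != row and abs(r - row) != c - i
                   for i, r in enumerate(cols))

    def extend(cols):
        if len(cols) == n:
            return cols                # complete placement found
        for row in range(n):
            if safe(cols, row):
                result = extend(cols + [row])
                if result:
                    return result
        return None                    # dead end: backtrack

    return extend([])

solution = solve_queens(8)             # one row index per column
```

Consistency pruning and ordering heuristics improve on this plain scheme by cutting branches before they are explored.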

If there is a pay off and we are trying to optimise then we can use branch and bound, or alpha/beta pruning, or even expectimax optimization.

In the end we can solve these puzzles quickly, and much better than people.

Some of these games have web based implementations, in which case you can extend the project to play the game automatically, as though you were typing, by using the debugging mode available on all modern browsers.

Most of these games can be read from picture files, using another form of AI, image processing.

A common and hard task for such games is to evaluate their difficulty. This is very hard to do, as the games that computers are slow at may not be those that people find hardest. Some games, like Sudoku, have extensive online, pre-coded game libraries that you can use to try to learn what makes a game difficult.

Sometimes there are online cheat pages that give hints and human algorithms for solving games. Perhaps these should be incorporated into your solver.

The following would all make good examples for this project. Constraint Satisfaction is one of the leading technologies in Artificial Intelligence for solving search problems. A constraint satisfaction problem instance can be seen as a set of related questions that need to be answered; we cannot answer any particular question without knowing the answers to those questions which affect it. You should research and report on how this technique could be used for your puzzle/game.

One goal of the project might be to allow people to play your chosen game off-line. If you are puzzle solving, then perhaps you could generate new puzzles, or evaluate and solve existing ones. Here you would stress a beautiful experience rather than extensive background theory.

This approach is applicable to more general AI problems found in machine vision, fleet scheduling, staff rostering, machine shop design, game playing, frequency assignment, vehicle routing, etc. Perhaps your solver could be generalised to solve a whole class of such problems. It is ideal for solving Sudoku: constraint solvers are relatively easy to implement in Java and can solve even hard Sudoku instances in very little time. In this project you will see that the time taken by a constraint solver (although very fast indeed) closely tracks the difficulty rating that newspapers give the puzzles. It seems that we are similar in behaviour to constraint solvers.

As you can see in the extensions, it is sensible to extend your project by choosing to write it for a mobile device; this new set of acquired skills would need groundwork done in the first term. Other things that would be big changes and would require significant extra work, but would allow very high scores, are using machine vision to enter puzzles or automatically playing the game/solving the puzzle on a public website.

- Proof of concept programs: Simple colourful GUI with an active button
- Proof of concept programs: Relevant Data structures including a priority queue.
- Proof of concept programs: Solving the eight queens problem using backtracking
- Proof of concept program which solves a simple example of your puzzle.
- Reports: Design Patterns for AI and Search.
- Report on Backtracking and Recursion.
- Reports: Constraint Satisfaction, particularly consistency techniques.
- Reports: Techniques and Algorithms used by human solvers.
- Reports: User Interface design for a solver
- Reports: Complexity, NP hardness and the big O notation. Estimating problem size and hardness of your puzzle.

- The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles
- The program will have a splash screen and at least two other user interaction screens.
- The program will have a Graphical User Interface that can be used to generate, load, save, solve and help with puzzle solving/game playing.
- The program must be able to produce puzzles that have only one correct solution, or games that have a single best strategy.
- The report will describe the software engineering process involved in generating your software.
- The report will describe interesting algorithms and programming techniques used (or able to be used) on the project.
- The report will discuss the choice of data structures and their performance impact on the solving and generation algorithms.
- The report will describe at least a backtracking and recursion solving algorithm and will compare them (including benchmark).
- The report will describe the CSP and algorithms useful for solving the CSP including generalised arc consistency, Forward Checking, DVO and Backjumping.

- Porting the program to a mobile device
- Solving and benchmarking public domain sets of puzzles
- Using XML based files to store and load puzzles
- Solving more than one kind of puzzle
- Using machine vision to enter puzzle
- Automatically playing the game/solving the puzzle on a public website
- The program gives hints for users when they are stuck, and allows users to step back.
- The program supports printing, different screen sizes, timers, etc.

- http://ktiml.mff.cuni.cz/~bartak/constraints/

You will use the tools you built to analyse various meta-data attributes associated with the data (such as time, location, hashtags, etc.) as well as properties of the content (e.g., sentiment score) to establish temporal, spatial, content-related, and other correlations between the real-life event you selected and the relevant social media properties. The report should reflect upon and draw general conclusions with regard to the predictive power of social media platforms in relation to the dynamics and outcomes of real-life events.

Your project must use one or a combination of the large-scale data engineering tools you have studied in CS5234. For example, past projects have used Hadoop/MapReduce to identify tweets relevant to the studied event, extract the attributes of relevance, and assign a sentiment score to the posted statuses using Natural Language Processing (NLP) techniques such as those of [O'Connor, 2010] and/or the Stanford NLP libraries (http://nlp.stanford.edu/).
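As a toy single-machine illustration of that map/reduce style of processing (the tweet records and sentiment lexicon below are invented for the example; a real project would run this over the full dataset with Hadoop):

```python
from collections import defaultdict

# Hypothetical tweet records as (hour, text) pairs, and a toy sentiment lexicon.
tweets = [(9, "great rally today"), (9, "terrible traffic"), (10, "great speech")]
lexicon = {"great": 1, "terrible": -1}

# Map step: emit a (hour, sentiment score) pair per tweet.
pairs = [(hour, sum(lexicon.get(w, 0) for w in text.split()))
         for hour, text in tweets]

# Reduce step: aggregate sentiment per hour, giving a temporal signal
# that can be correlated with the real-life event.
by_hour = defaultdict(int)
for hour, score in pairs:
    by_hour[hour] += score

print(dict(by_hour))  # {9: 0, 10: 1}
```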

Data: You will have access to the 300 GB Twitter dataset on the department's cloud, or could use an alternative dataset of your own subject to the supervisor's approval.

Tools: The project must use one or a combination of the large-scale data engineering tools studied in CS5234.

- Precise specification of the project goals as agreed with the advisor along with data collection and analysis tools, and methodologies you plan to use to achieve the specified goals.
- Intermediate report and programs showing and reflecting upon initial findings.

- The software you built to achieve the project goals.
- Final report will discuss techniques and findings, and provide meaningful visualisation of the results.

- Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series" In Proceedings of the International AAAI Conference on Weblogs and Social Media (2010)

Most such algorithms are based upon number theory, namely, the intractability of certain problems involving prime numbers.

- Proof of concept program: An implementation of RSA using standard data types in Java or C++ encrypting and decrypting numbers.
- Proof of concept program: A simple program generating keys.
- Report: A description of the RSA encryption/decryption procedure.
- Report: Examples worked out on paper and checked using the proof of concept program.
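A toy version of the proof-of-concept program might look like the following (textbook RSA with tiny fixed primes, for illustration only; a real implementation needs large random primes, primality testing and padding):

```python
# Textbook RSA with small fixed primes -- insecure, for illustration only.
p, q = 61, 53
n = p * q                   # modulus: 3233
phi = (p - 1) * (q - 1)     # Euler's totient: 3120
e = 17                      # public exponent, coprime with phi
d = pow(e, -1, phi)         # private exponent: modular inverse of e (2753)

def encrypt(m):
    return pow(m, e, n)     # c = m^e mod n

def decrypt(c):
    return pow(c, d, n)     # m = c^d mod n

c = encrypt(65)
print(c, decrypt(c))        # 2790 65
```

Replacing the built-in integers with an arbitrary-precision big-integer class (as in the full deliverables) changes none of the structure above.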

- The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- The program will use integers of arbitrary size and encrypt and decrypt test messages and files.
- An application with a GUI will perform a task such as a secure chat or a secure file transfer over the network.
- A combination of RSA and a symmetric encryption scheme will be implemented, where RSA will distribute the keys for the symmetric scheme.
- The report will contain an overview of modern cryptography.
- The report will describe RSA and number theory issues such as primality checking and factorisation.
- The report will describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report will describe the software engineering process involved in generating your software.
- The report will contain the data on the running time of the programs.

- An advanced symmetric encryption scheme such as DES or AES will be implemented.
- Different primality testing and factorisation methods will be implemented. Their performance will be compared and test results reported.
- Attacks on RSA and/or the symmetric scheme will be implemented and tested.

- Taking CS3760, Information Security (or an equivalent 2nd year course) is recommended. Taking CS3580, Advanced Data Communications, can be helpful.
- Good command of Java or C++.

- Experience with C++
- Being able to autonomously download/compile/install libraries from various repositories (e.g., GitHub).

- A report describing existing machine-learning applications of homomorphic encryption
- The implementation of a homomorphic evaluation of a simple function e.g. weighted average on a small dataset
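For intuition, an additively homomorphic scheme such as Paillier can be sketched in a few lines (tiny fixed primes, insecure and for illustration only; the project itself would use a library such as Microsoft SEAL):

```python
import math, random

# Toy Paillier cryptosystem with tiny primes -- insecure, illustration only.
p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)        # lambda = lcm(p-1, q-1)
mu = pow(lam, -1, n)                # with g = n+1, mu = lam^{-1} mod n

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2   # g^m * r^n mod n^2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n      # L(c^lam mod n^2) * mu mod n

# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
a, b = encrypt(20), encrypt(22)
print(decrypt(a * b % n2))          # 42
```

Sums (and, with plaintext weights, weighted averages) can thus be computed directly on encrypted data, which is exactly the flavour of the first proof-of-concept deliverable.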

- Implement the homomorphic evaluation of a function relevant to machine learning for secure genomics, inspired by recent iDash competition tasks
- Justify the scheme / library choice and steps taken to optimise performance
- Describe homomorphic encryption scheme, and machine learning task, underlying the solution

- https://homomorphicencryption.org
- http://www.humangenomeprivacy.org/2019/
- https://simons.berkeley.edu/talks/idash-competition
- https://github.com/Microsoft/SEAL

There are many approaches to this problem and many different algorithms based on various principles. The classical PageRank algorithm assigns ranks on the basis of links so that pages linked by many others get high ranks. The rank turns out to be the stationary distribution of a Markov chain.
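The stationary distribution can be computed by power iteration; here is a minimal sketch on an invented three-page link graph (pure Python, with the usual damping factor):

```python
# PageRank by power iteration on a tiny link graph (page -> pages it links to).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85                                       # damping factor
ranks = {p: 1 / len(links) for p in links}     # start from the uniform distribution

for _ in range(100):                           # iterate to the fixed point
    new = {p: (1 - d) / len(links) for p in links}
    for page, outgoing in links.items():
        for target in outgoing:
            new[target] += d * ranks[page] / len(outgoing)
    ranks = new

print({p: round(r, 3) for p, r in ranks.items()})  # {'A': 0.388, 'B': 0.215, 'C': 0.397}
```

On web-scale graphs the same iteration is run over a sparse matrix representation, which is where the efficiency concerns discussed below come in.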

Recently ranking has been increasingly approached in the context of machine learning. The problem is called "learning to rank" and supervised and semi-supervised methods are applied to it. One possibility is to train regression or classification algorithms on relevance scores provided by human evaluators. A number of parameters called features are calculated using the content of the page and the query; then a learning algorithm is applied to the features. Different measures have been used to analyse the effectiveness of ranking, including Mean Average Precision, Discounted Cumulative Gain, Kendall's rank correlation, Yandex's pFound, etc.

The project consists of implementing ranking techniques and analysing their performance on synthetic and real-world data. The efficiency issue is particularly important as the programs should work on large datasets.

Benchmark datasets:

LETOR: Learning to Rank for Information Retrieval http://research.microsoft.com/en-us/um/beijing/projects/letor/

Kaggle Learning to Rank 2019 challenge https://www.kaggle.com/c/learning-to-rank-2019/

Yandex Internet Mathematics 2009 competition dataset (archived) https://web.archive.org/web/20150317144535/http://imat2009.yandex.ru/academic/mathematic/2009/en/

- Proof of concept program: PageRank applied to a small (possibly synthetic) dataset. Its performance and efficiency should be analysed.
- Report: An overview of the problem of document ranking and main approaches to it. An overview of PageRank.

- Two or more different machine learning techniques should be applied to a standard benchmark dataset and their performance compared.
- The performance of the programs should be optimised to enable them to process large amounts of data.
- The report will describe the theory of the underlying learning algorithms.
- The report will describe computational experiments, analyse their results and draw conclusions.

- Comparison of pointwise, pairwise and listwise approaches.
- Use of real Internet dumps such as http://commoncrawl.org/

- Christiane Rousseau. How Google works: Markov chains and eigenvalues. http://blog.kleinproject.org/?p=280
- Chuan He, Cong Wang, Yi-xin Zhong, and Rui-Fan Li. A survey on learning to rank. In Proceedings of the 2008 International Conference on Machine Learning and Cybernetics. DOI 10.1109/ICMLC.2008.4620685.
- Tie-Yan Liu. Learning to Rank for Information Retrieval. Berlin, Heidelberg: Springer, 2011.

- Good programming skills, particularly C/C++, Python, and knowledge of Windows/Linux internals.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing the state of the art in privacy engineering and detection of data protection violations.
- A clear workplan describing the set of tests to be performed, tools to be implemented and classes of detection mechanisms/techniques to be proposed and studied.

- Design and implement a framework for development of new and existing techniques to detect information sharing compliance violations
- Describe in detail the design of tests
- Describe and analyse the features that have been used during the project to detect/stop violations
- Critically compare the findings with related works

- Adam Barth, Anupam Datta, John C Mitchell, and Helen Nissenbaum. 2006. Privacy and contextual integrity: Framework and applications. In Security and Privacy, 2006 IEEE Symposium on. IEEE, pp. 184–198
- David Basin, Søren Debois, and Thomas Hildebrandt. 2018. On Purpose and by Necessity: Compliance under the GDPR. Presentation.
- David Basin, Felix Klaedtke, Srdjan Marinovic, and Eugen Zalinescu. 2015. Monitoring of temporal first-order properties with aggregations. Formal methods in system design 46, 3 (2015), 262–285.
- Sean Brooks, Michael Garcia, Naomi Lefkovitz, Suzanne Lightman, and Ellen Nadeau. 2017. An Introduction to Privacy Engineering and Risk Management in Federal Systems. US Department of Commerce, National Institute of Standards and Technology.
- Ann Cavoukian et al. 2014. Privacy by Design Documentation for Software Engineers (PbD-SE), Version 1.0. Burlington, MA: Organization for the Advancement of Structured Information Standards (OASIS), work in progress.
- Anupam Datta, Jeremiah Blocki, Nicolas Christin, Henry DeYoung, Deepak Garg, Limin Jia, Dilsun Kaynar, and Arunesh Sinha. 2011. Understanding and protecting privacy: Formal semantics and principled audit mechanisms. In International Conference on Information Systems Security. Springer, 1–27.
- European Commission. 2018. General Data Protection Regulation. https://ec.europa.eu/commission/priorities/justice-and-fundamentalrights/data-protection/2018-reform-eu-data-protection-rules_en
- Information Exchange Policy 1.0. https://www.first.org/iep/.
- Privacy Management Reference Model (PMRM) https://www.oasis-open.org/committees/pmrm/
- Gina Fisk, Calvin Ardi, Neale Pickett, John Heidemann, Mike Fisk, and Christos Papadopoulos. 2015. Privacy principles for sharing cyber security data. In Security and Privacy Workshops (SPW), 2015 IEEE. IEEE, 193–197.
- Jassim Happa, Nick Moffat, Michael Goldsmith, and Sadie Creese. 2018. Run-Time Monitoring of Data-Handling Violations. In Computer Security. Springer, 213–232.
- Bert-Jaap Koops and Ronald Leenes. 2014. Privacy regulation cannot be hardcoded. A critical comment on the 'privacy by design' provision in data-protection law. International Review of Law, Computers and Technology 28, 2 (2014), 159–171.
- Lauren B Movius and Nathalie Krup. 2009. US and EU privacy policy: comparison of regulatory approaches. International Journal of Communication 3 (2009), 19.
- Stuart Shapiro, Nickyra Washington, Kristen Miller, Julie Snyder, and Julie McEwen. 2014. MITRE Privacy Engineering Framework. MITRE Corporation Technical Report.
- Valeria Soto-Mendoza, Patricia Serrano-Alvarado, Emmanuel Desmontils, and Jose Antonio Garcia-Macias. 2015. Policies composition based on data usage context. In Sixth International Workshop on Consuming Linked Data (COLD2015) at ISWC.
- UK Information Commissioner's Office. 2016. How do we apply legitimate interests in practice? https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/legitimate-interests/how-do-we-apply-legitimate-interests-in-practice/.

Regression algorithms play an important role in machine learning. They allow us to learn dependencies between multidimensional attributes and continuous outcomes. Regression algorithms can be applied to different domains.

An example is the Boston housing problem. The Boston housing database consists of records describing houses in Boston (to be precise, average houses in certain neighbourhoods). For each house the values of attributes such as the number of rooms, distance from the city centre, quality of schools etc are known and the price, as evaluated by an expert estate agent, is given. The goal of the learner is to determine the dependency between attributes and prices and to predict the price of a house using the values of the attributes. The program is first shown a number of training examples so that it can learn the dependency, and is then tested on test examples. Ridge Regression is known to perform well on this dataset.

Another example is time series data. A significant amount of real world data is temporal in nature, in that the values of the variables are sampled at multiple points over some time period, so that the data stored for each entity has an additional 'time' dimension. Such data are called time series. Examples of time series are the daily closing value of the FTSE 100 index or network traffic measurement on a router. Time series forecasting is the use of a model to predict future values based on previously observed values and regression algorithms can be used here.

Applying a learning algorithm to data is not a straightforward task. One needs to preprocess the data and choose parameters to optimise the performance. The algorithms need to be implemented in such a way that they run fast and do not suffer from numerical error. Making sense of the results and comparing different algorithms in a fair, meaningful way can be tricky. Does one algorithm outperform another consistently, or is it a mere coincidence?
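A minimal numerical sketch of the idea, for a single attribute and regression through the origin (ridge regression then has the closed form w = Σxy / (Σx² + a); the data below is invented):

```python
# One-dimensional ridge regression through the origin:
# minimise sum_i (y_i - w*x_i)^2 + a*w^2, giving w = sum(x*y) / (sum(x^2) + a).
def ridge_1d(xs, ys, a):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + a)

xs = [1.0, 2.0, 3.0, 4.0]   # attribute values (e.g. number of rooms)
ys = [2.1, 3.9, 6.2, 7.8]   # outcomes (e.g. prices), roughly y = 2x

print(ridge_1d(xs, ys, a=0.0))   # ~1.99  (ordinary least squares)
print(ridge_1d(xs, ys, a=1.0))   # ~1.926 (regularisation shrinks w toward 0)
```

The full multivariate and kernelised versions replace this scalar division with a linear system solve, which is where the choice of numerical method starts to matter.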

- Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for a regression algorithm and defining all terms in them.
- Proof of concept program: a regression algorithm applied to a small artificial dataset.
- Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.
- Loading a real-world dataset, simple pre-processing and visualisation.

- The program will work with a real-world dataset, read the data from a file, preprocess the data and apply regression using parameters entered by the user.
- The program will have a graphical user interface.
- The program will automatically perform tests such as comparison of different kernels, parameters, etc and visualise the results.
- The program will implement another learning algorithm such as nearest neighbours or neural networks, and compare regression against it.
- The report will describe the theory of regression and derive the formulas.
- The report will describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
- The report will describe the software engineering process involved in generating your software.
- The report will describe computational experiments with different kernels and parameters and draw conclusions.
- The report will describe the competitor algorithm(s), compare the performance and draw conclusions.

- Use of regression in the on-line mode with growing and sliding window.
- Application of regression to different datasets (including those found by the student).
- Implementation of regression-type algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies etc and computational experiments with them.

- Taking CS3920 (Computer Learning) in the 3rd year is recommended.
- Interest in numerical methods.
- Good command of Java, C++, or Python.

- Suitable for CS with AI students.

In this project, you will understand the basic formalisation of RL as finding an optimal policy in a Markov Decision Process. This theory is nice, and not quite as complicated as it sounds (although there is a lot to learn). You will need to understand policy optimisation by dynamic programming, and the Bellman equation. This is widely applicable computational maths (not just in RL), and is good to know.
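As a sketch of policy optimisation by dynamic programming, here is value iteration on a tiny invented two-state MDP (states, actions, transition probabilities and rewards are all made up for the illustration):

```python
# Value iteration for a tiny two-state MDP.
# T[(state, action)] is a list of (probability, next_state, reward) triples.
T = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
}
actions = {"s0": ["stay", "go"], "s1": ["stay"]}
gamma = 0.9  # discount factor

def q(s, a, V):
    # expected return of taking action a in state s under value estimate V
    return sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])

V = {s: 0.0 for s in actions}
for _ in range(200):  # Bellman update: V(s) <- max_a E[r + gamma * V(s')]
    V = {s: max(q(s, a, V) for a in actions[s]) for s in actions}

policy = {s: max(actions[s], key=lambda a: q(s, a, V)) for s in actions}
print(policy)  # {'s0': 'go', 's1': 'stay'}
```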

You will implement a demonstration of RL with a GUI, using the MVC (model-view-controller) design pattern, and you will perform some experiments to determine the efficiency of the learning.
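The Q-learning update at the heart of the demonstration is a single line; here is a minimal sketch on a four-cell corridor world (layout, reward and parameter values invented for the example):

```python
import random
random.seed(0)

# Q-learning on a corridor of states 0-1-2-3; reaching state 3 yields reward 1
# and ends the episode. Actions: 0 = left, 1 = right.
n_states, goal = 4, 3
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, eps = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(500):
    s = 0
    while s != goal:
        # epsilon-greedy action selection
        a = random.randrange(2) if random.random() < eps else Q[s].index(max(Q[s]))
        s2 = max(s - 1, 0) if a == 0 else min(s + 1, goal)
        r = 1.0 if s2 == goal else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([Q[s].index(max(Q[s])) for s in range(goal)])  # greedy policy: go right everywhere
```

Varying `eps` and the initial Q-values here is exactly the kind of exploration experiment expected in the full deliverables.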

- Proof of concept program: Implementation of exact policy optimisation by value iteration for a general MDP.
- Proof of concept program: Implementation of Q-learning for grid world.
- Proof of concept program: GUI built with buttons, etc.
- Proof of concept program: MVC program with controls.
- Report on Markov decision processes (MDPs).
- Report on notions of policy and value function; optimal policy/value function via the Bellman equation. Finding optimal policies by value iteration and policy iteration.
- Report on the concept of learning as incrementally optimising a policy in an MDP.
- Report on Q-learning.

- The program will at least demonstrate Q-learning with various exploration strategies in a simple world.
- The program must have a text based control file from which it can be configured.
- The program will have a Graphical User Interface that can be used to animate simulations.
- The report will describe the software engineering process involved in generating your software.
- The report will include a description of the theory of MDPs, optimal policies and value functions via the Bellman equation, dynamic programming methods of finding optimal policies, and Q-learning.
- The report will include some experiments that compare the effects of different parameter choices on the progress of Q-learning: over-optimistic values causing excessive exploration, and pessimistic initial values causing the development of superstitions.

- Implementing RL for a more challenging problem than a grid world, such as the acrobot.
- Experimental analysis of the performance of different learning algorithms is interesting but challenging.

- Chapter on Reinforcement Learning in 'Machine Learning' by Tom Mitchell

This project aims at studying different problems related to regression. One is the problem of parameter selection. How do we ensure that Ridge Regression performs best? How do we select the right parameters? Or is it best to tune parameters in the on-line mode? The problem of overfitting is closely related: it is known that in many situations more sophisticated techniques that achieve better results in training later behave worse on new data than coarse methods; how do we handle this? The project has the potential to lead to new research results of independent interest.

The project is recommended for Computational Finance students.

- Proof of concept program: Kernel Ridge Regression applied to a small artificial dataset.
- Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for Ridge Regression and defining all terms in them.
- Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.

- The program will work with a real dataset such as Boston housing, read the data from a file, preprocess the data (normalise, split the dataset into the test and training sets) and apply Ridge Regression using a kernel with parameters.
- The program will automatically perform tests such as comparison of different kernels, parameters, etc. The results will be visualised.
- The program will work in batch and on-line modes.
- Tests will be performed to compare plain Ridge Regression against other regression-based algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies, or gradient descent.
- The report will describe the theory of Ridge Regression and derive the formulas.
- The report will describe computational experiments, analyse their results and draw conclusions.

- Application of Ridge Regression to different datasets (including those suggested by the student).
- Comparison with other learning algorithms (e.g., aggregating algorithm regression, neural networks, gradient descent-based etc).
- Combining regression with methods of prediction with expert advice.

- Y. Kalnishkan. Kernel Methods. http://onlineprediction.net/?n=Main.KernelMethods
- C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. http://www.gaussianprocess.org/

*Satisfiability Modulo Theories* (SMT) is a kind of constraint satisfaction problem that allows solving constraints specified in some fragment of first-order logic. In other words, SMT extends propositional satisfiability (SAT) to support, besides propositional variables as in SAT, numerical variables over a fixed arithmetic theory, as well as quantifiers over such variables. State-of-the-art SMT solvers like Z3 by Microsoft provide very efficient decision procedures for such problems.

Among the numerous applications of SMT, you will focus on *bounded model checking* (BMC), i.e., the problem of verifying whether a given transition system can reach some state within a finite number of steps. In particular, you will use BMC to solve AI planning problems. A planning problem can be expressed as one of BMC, where the objective is to find a path of the AI agent leading to some goal state.
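Before looking at the SMT encoding, the BMC question itself can be phrased as bounded reachability; a minimal explicit-state sketch (the toy transition system below is invented, and the SMT approach replaces this enumeration with a symbolic unrolling of the transition relation):

```python
# Bounded model checking as explicit bounded reachability: can the system
# reach a goal state from an initial state in at most `bound` steps?
def bmc_reachable(init, goal, transitions, bound):
    frontier = {init}
    for step in range(bound + 1):
        if goal in frontier:
            return step                       # reached within `step` steps
        frontier = {s2 for s in frontier for s2 in transitions(s)}
    return None                               # not reachable within the bound

def succ(s):
    return [(s + 1) % 8]                      # toy system: a 3-bit counter

print(bmc_reachable(0, 5, succ, bound=10))    # 5
print(bmc_reachable(0, 5, succ, bound=3))     # None
```

For planning, the transition function becomes the agent's moves on the grid and the goal state encodes the target cell.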

In the first part of the project, you will study the SMT problem and SMT solving techniques, and you will use the Z3 tool and its Python APIs to solve simple example instances. In the second part, you will study the BMC problem and implement code to solve it for generic transition systems using Z3+Python. In the final part, you will extend your program to encode and solve planning problems on 2-dimensional grids with obstacles. Extensions include visualization of path solutions, and planning under cost constraints.

- Report: overview of SAT, SMT and SMT solving techniques.
- Report: bounded model checking and encoding into SMT.
- Install Z3 and its Python APIs. Develop code for SMT solving of a simple example problem (e.g., Sudoku, or job shop scheduling).
- Develop code for BMC of generic transition systems.

- Develop code for encoding a planning problem on a 2-d grid with obstacles as a BMC problem.
- Evaluate your solution with arbitrary and randomly generated planning instances.
- Analyze SMT solver performance for different grid sizes and numbers of obstacles.
- Final dissertation. It should include background on SAT, SMT, BMC, and AI planning, as well as the description of your solution for the encoding of AI planning into SMT, some example solutions obtained with the developed planning method, and performance analysis results.

- Extend the planning problem and your solution with transition costs and cost constraints.
- Develop code to visualize path solutions on the 2d grid.

- De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: introduction and applications." Communications of the ACM 54.9 (2011): 69-77. (available at https://dl.acm.org/doi/abs/10.1145/1995376.1995394)
- De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: An appetizer." Brazilian Symposium on Formal Methods (2009): 23-36. (available at https://link.springer.com/chapter/10.1007/978-3-642-10452-7_3)
- Amla, Nina, et al. "An analysis of SAT-based model checking techniques in an industrial environment." (up to section 2.3 included). (available at https://link.springer.com/chapter/10.1007/11560548_20)
- Lectures on "Introduction to planning" and "Planning formalisms" of the "Autonomous Intelligent Systems" course.
- Z3 tool page, https://github.com/Z3Prover/z3
- Programming Z3 tutorial (sections 1,2,3,5), https://theory.stanford.edu/~nikolaj/programmingz3.html
- Jupyter notebooks illustrating Z3's Python APIs (includes code for Programming Z3 tutorial), https://notebooks.azure.com/nbjorner/projects/z3examples
- Online Z3 tutorial (using SMT-lib input format), https://rise4fun.com/z3/tutorial/guide
- Dennis Yurichev, "SAT/SMT by Examples", https://yurichev.com/writings/SAT_SMT_by_example.pdf (a huge collection of maths and CS problems solved with SAT or SMT)

*Satisfiability Modulo Theories* (SMT) is a kind of constraint satisfaction problem that allows solving constraints specified in some fragment of first order logic. In other words, SMT extends propositional satisfiability (SAT) to support (besides propositional ones as in SAT) numerical variables over a fixed arithmetic theory as well as quantifiers over such variables. State-of-the-art SMT solvers like Z3 by Microsoft provide very efficient decision procedures for such problems.

Among the numerous applications of SMT, you will focus on *bounded model checking* (BMC), i.e., the problem of verifying whether a given transition system can reach some state within a finite number of steps. In particular, you will use BMC to solve AI planning problems. A planning problem can be expressed as one of BMC, where the objective is to find a path of the AI agent leading to some goal state.

In the first part of the project, you will study the SMT problem and SMT solving techniques, and you will use the Z3 tool and its Python APIs to solve simple example instances. In the second part, you will study the BMC problem and implement code to solve it for generic transition systems using Z3+Python. In the final part, you will extend your program to encode and solve planning problems on 2-dimensional grids with obstacles. Extensions include visualization of path solutions, and planning under cost constraints.

- Report: overview of SAT, SMT and SMT solving techniques.
- Report: bounded model checking and encoding into SMT.
- Install z3 and Python APIs. Develop code for SMT solving of a simple example problem (e.g., Sudoku, or job shop scheduling).
- Develop code for BMC of generic transition systems.

- Develop code for encoding a planning problem on a 2-d grid with obstacle as a BMC problem.
- Evaluate your solution with arbitrary and randomly generated planning instances.
- Analyze SMT solver performance for different grid sizes and numbers of obstacles.
- Final dissertation. It should include background on SAT, SMT, BMC, and AI planning, as well as the description of your solution for the encoding of AI planning into SMT, some example solutions obtained with the developed planning method, and performance analysis results.

- Extend the planning problem and your solution with transition costs and cost constraints.
- Develop code to visualize path solutions on the 2d grid.

- De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: introduction and applications." Communications of the ACM 54.9 (2011): 69-77. (available at https://dl.acm.org/doi/abs/10.1145/1995376.1995394)
- De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: An appetizer." Brazilian Symposium on Formal Methods (2009): 23-36. (available at https://link.springer.com/chapter/10.1007/978-3-642-10452-7_3)
- Amla, Nina, et al. "An analysis of SAT-based model checking techniques in an industrial environment." (up to section 2.3 included). (available at https://link.springer.com/chapter/10.1007/11560548_20)
- Lectures on "Introduction to planning" and "Planning formalisms" of the "Autonomous Intelligent Systems" course
- Z3 tool page, https://github.com/Z3Prover/z3
- Programming Z3 tutorial (sections 1,2,3,5), https://theory.stanford.edu/~nikolaj/programmingz3.html
- Jupyter notebooks illustrating Z3's Python APIs (includes code for Programming Z3 tutorial), https://notebooks.azure.com/nbjorner/projects/z3examples
- Online Z3 tutorial (using SMT-lib input format), https://rise4fun.com/z3/tutorial/guide
- Dennis Yurichev, "SAT/SMT by Examples", https://yurichev.com/writings/SAT_SMT_by_example.pdf (a huge collection of maths and CS problems solved with SAT or SMT)

Key value stores are data structures that provide dictionary-style operations for data storage and retrieval. To ensure scalability and resilience to environmental changes, a key value store can be distributed among multiple nodes in a distributed system. In this project, the focus is on implementing peer-to-peer key value stores in an actor model framework as provided by Elixir/Erlang or Cloud Haskell.
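For a flavour of how such stores place data, here is a minimal pure-Python sketch of Chord-style consistent hashing (node and key names are invented). A real Chord node would also maintain finger tables for O(log n) routing, and the project implementations would live inside Elixir/Erlang or Cloud Haskell actors:

```python
import hashlib
from bisect import bisect_right

def ring_hash(name, bits=16):
    # Place a name on a 2**bits identifier ring (Chord-style).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** bits)

class HashRing:
    """Sketch of Chord-style key placement: each key is owned by the
    first node clockwise from the key's position on the ring, so
    adding or removing a node only moves the keys of its neighbour."""
    def __init__(self, nodes):
        self.nodes = sorted((ring_hash(n), n) for n in nodes)

    def lookup(self, key):
        ids = [h for h, _ in self.nodes]
        i = bisect_right(ids, ring_hash(key)) % len(self.nodes)
        return self.nodes[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("alice"))  # the node responsible for this key
```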

- A report giving an overview of prominent existing key value stores such as Chord and SkipGraph.
- A report describing how existing key value stores deal with environmental changes.
- A proof-of-concept program implementing a simple (possibly inefficient) distributed key value store.

- A distributed key value store that provides standard dictionary operations and is scalable to large data sets.
- A report describing the architecture of the implemented key value store.
- A report on the performed experiments and simulation results.

- Supporting advanced dictionary operations, e.g., range queries.
- More efficient storage of data by using coding techniques.
- Simulating node churn.

- Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
- James Aspnes, Gauri Shah: Skip graphs. ACM Trans. Algorithms 3(4): 37 (2007)

The problem is computationally intractable in general. However, the modern world does not rely solely on the polynomial versus NP-hard distinction (P vs NP). Instead, we use the much more nuanced Fixed Parameter Tractability (FPT) landscape: the hope is to find some problem characteristic that allows a computationally hard problem to become tractable, subject only to limiting a particular parameter.

Recent research, led by researchers in this department, has discovered that the workflow problem, subject to access control restrictions, is indeed FPT in some cases. In this project you will research the theory of fixed parameter tractability and the workflow scheduling problem. You will implement heuristic approaches to solving the problem, and also implement the FPT algorithm for workflow scheduling (the parameter is the number of required steps in a given workflow).
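For illustration, the textbook FPT algorithm for Vertex Cover (parameter k, the cover size) fits in a few lines of Python. The WSP algorithm in the project is different, but the bounded-search-tree idea is the same; the graph below is invented:

```python
def vertex_cover_fpt(edges, k):
    """Classic O(2^k * m) bounded search tree: pick any uncovered
    edge (u, v); a cover of size <= k must contain u or v, so
    branch on both choices with budget k - 1."""
    if not edges:
        return set()          # everything is covered
    if k == 0:
        return None           # budget exhausted but edges remain
    u, v = edges[0]
    for pick in (u, v):
        rest = [e for e in edges if pick not in e]
        sub = vertex_cover_fpt(rest, k - 1)
        if sub is not None:
            return sub | {pick}
    return None

# Triangle plus a pendant edge: the minimum vertex cover has size 2.
G = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(vertex_cover_fpt(G, 1))  # None: no cover of size 1 exists
print(vertex_cover_fpt(G, 2))  # a valid cover of size at most 2
```

The exponential cost depends only on the parameter k, not on the graph size: that is what makes the algorithm fixed-parameter tractable.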

- Report on Fixed Parameter Tractability and how it is different from the P vs NP tradition, with examples.
- Report on Workflow Scheduling subject to access control: where it occurs, why it is important, and the kinds of constraints and authorisations that are useful.
- Report analysing the Fixed Parameter Tractability and NP-hardness of WSP
- Report on various other constraint satisfaction techniques used to solve industrial problems such as SAT and CSP solvers.
- Programs: A simple backtracking algorithm, an FPT algorithm for the Vertex Cover problem.

- The program should have an excellent GUI for experimentation.
- The report should include experimental assessment of the FPT algorithm, possibly in comparison with a SAT or CSP solver.
- The report will describe the software engineering process involved in generating your software.
- The report will describe and analyse your implementation of the backtracking FPT algorithm for solving WSP with user-independent constraints.

- Study of the Valued WSP, which allows solutions to be provided even when the WSP instance is unsatisfiable.

- Basic knowledge in cryptography
- Knowledge of C/C++ and web development (HTML, CSS, JavaScript, REST)

- Implementation of the first-price sealed-bid auction (FPSBA) functionality in one existing MPC framework
- Implementation of REST interfaces to submit bids to and receive results from auction servers
- Implementation of controller node to configure the auction platform (e.g., which items should be auctioned and how long the auction runs)

- User-friendly web interface
- Implementation of client-side secret sharing + communication with auction servers
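Client-side additive secret sharing (the second extension) can be sketched in a few lines. This is a conceptual illustration only: the MPC frameworks referenced below (MP-SPDZ, MOTION) ship their own sharing schemes and networking, and the modulus P here is an arbitrary choice for the sketch:

```python
import secrets

P = 2 ** 61 - 1  # public prime modulus (arbitrary choice for this sketch)

def share(bid, n=3):
    """Additive secret sharing: split `bid` into n shares that sum
    to bid mod P. Any n-1 shares are uniformly random and reveal
    nothing; each share would be sent to a different auction server."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((bid - sum(shares)) % P)
    return shares

def reconstruct(shares):
    # Only the servers jointly (or the controller, after the auction)
    # can recombine the shares to recover the bid.
    return sum(shares) % P

shares = share(4200)
print(len(shares), reconstruct(shares))  # 3 4200
```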

- Marcella Hastings, Brett Hemenway, Daniel Noble, Steve Zdancewic: "SoK: General Purpose Compilers for Secure Multi-Party Computation". In IEEE Symposium on Security and Privacy 2019.
- Marcel Keller: "MP-SPDZ: A Versatile Framework for Multi-Party Computation". In CCS 2020.
- Lennart Braun, Daniel Demmler, Thomas Schneider, Oleksandr Tkachenko: "MOTION - A Framework for Mixed-Protocol Multi-Party Computation". In ACM TOPS 2022.
- https://github.com/data61/MP-SPDZ
- https://github.com/encryptogroup/MOTION

- Good programming skills, particularly Python and/or NodeJS.
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report describing current serverless offerings from Google (G Cloud), Microsoft (Azure) and Amazon (AWS)
- A report with an overview of access control systems for G Cloud, Azure and AWS
- A proof of concept function that executes in one of these clouds

- A detailed description of the access control system of one of the cloud providers
- Examples of functions that represent each of the risks described
- A detailed description of a set of policies and recommendations so developers avoid these risks
- A critical comparison with findings from related works

- https://cloud.google.com
- https://aws.amazon.com
- https://azure.microsoft.com
- https://dl.acm.org/doi/abs/10.1145/3366423.3380173
- https://www.usenix.org/conference/hotcloud19/presentation/obetz
- https://dl.acm.org/doi/abs/10.1145/3366615.3368351

Maze games are a video game genre in which the entire playing field is a maze. Usually, the player needs to navigate the maze to collect items and also perform other actions (e.g. escape monsters and outrace an opponent) within a time limit. The most popular example is Pac-Man by Namco (1980), but many other maze games exist.

You will model a maze game as a search problem and then design and implement agents that inhabit the maze. The agents will find efficient paths through the maze world, to reach a particular location, to collect food and to combat opponents.

- Given a maze game of your interest (e.g. Pac-Man), design and implement a simple graphical visualisation for it.
- Consider an agent inhabiting the maze world. Model the agent's navigation task as a search problem. Implement graph-search versions of depth-first search and breadth-first search to reach a specific location in the maze. Run experiments with three different maze sizes: small, medium and large. Visualise the agent following the plan step-by-step.
- Now change the cost function in such a way that traversing some areas is more expensive than traversing others (e.g. in Pac-Man you can charge more for dangerous steps in areas close to the ghosts). Implement the uniform-cost graph search algorithm to reach a specific location in the maze. Run experiments with three different maze sizes: small, medium and large. Visualise the agent following the plan step-by-step.
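The graph-search pattern asked for above (a frontier plus an explored set) might be sketched like this for breadth-first search; the maze encoding, coordinates and function names are invented for illustration:

```python
from collections import deque

def bfs(maze, start, goal):
    """Graph-search BFS: the explored set prevents revisiting states.
    Returns the list of positions on a shortest path, or None."""
    rows, cols = len(maze), len(maze[0])
    frontier = deque([(start, [start])])
    explored = {start}
    while frontier:
        (x, y), path = frontier.popleft()
        if (x, y) == goal:
            return path
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < rows and 0 <= ny < cols
                    and maze[nx][ny] != "#" and (nx, ny) not in explored):
                explored.add((nx, ny))
                frontier.append(((nx, ny), path + [(nx, ny)]))
    return None

SMALL = ["..#",
         ".#.",
         "..."]
path = bfs(SMALL, (0, 0), (2, 2))
print(len(path) - 1)  # 4: the number of steps on a shortest path
```

Depth-first search is the same skeleton with the deque popped from the other end; uniform-cost search replaces the queue with a priority queue keyed on path cost.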

- Implement the A* algorithm to find the shortest path through the maze that touches all four corners. Use two different, consistent heuristic functions. Run experiments with three different maze sizes: small, medium and large. Visualise the agent following the plan step-by-step.
- Assume each cell in the maze has an item to collect. Formulate a search problem for collecting all items (e.g. in Pac-Man: eating all the food) and implement A* to do so in as few steps as possible. Also implement a greedy search and compare the results of the two algorithms on three different maze sizes: small, medium and large. Visualise the actions of the agent.
- Now assume that you have additional agents in your maze: one is the main player and the others are the opponents (e.g. Pac-Man and the ghosts). Model this new problem and implement the player as a reflex agent that tries to visit all cells while avoiding the opponents.
- Implement an adversarial Minimax search agent for each agent; start with the player and one opponent and then increase the number of opponents. You need to write an evaluation function that evaluates states. Run experiments with three different maze sizes: small, medium and large. Visualise the actions of the agents.
- Design and implement a new agent that uses Alpha-beta pruning to explore the minimax tree more efficiently. Run experiments with three different maze sizes: small, medium and large. Visualise the actions of the agents.
- Design and implement an Expectimax agent to model the probabilistic behaviour of agents that make suboptimal choices. Run experiments with three different maze sizes: small, medium and large. Visualise the actions of the agents.
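For the A* milestones, a minimal sketch using the Manhattan distance (a consistent heuristic on a 4-connected grid) and returning just the path cost; the maze encoding and function names are invented:

```python
import heapq

def manhattan(a, b):
    # Consistent (and hence admissible) on a 4-connected grid.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def astar(maze, start, goal):
    """A*: expand the frontier state minimising g + h. With a
    consistent heuristic, a state is settled the first time it is
    popped, so its g-value is the optimal path cost."""
    rows, cols = len(maze), len(maze[0])
    frontier = [(manhattan(start, goal), 0, start)]
    best_g = {start: 0}
    while frontier:
        f, g, (x, y) = heapq.heappop(frontier)
        if (x, y) == goal:
            return g
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < rows and 0 <= ny < cols and maze[nx][ny] != "#":
                ng = g + 1
                if ng < best_g.get((nx, ny), float("inf")):
                    best_g[(nx, ny)] = ng
                    heapq.heappush(
                        frontier,
                        (ng + manhattan((nx, ny), goal), ng, (nx, ny)))
    return None

MAZE = ["..#",
        ".#.",
        "..."]
print(astar(MAZE, (0, 0), (2, 2)))  # 4
```

For the four-corners and all-items milestones, the state must be extended to (position, set of remaining targets); the search skeleton stays the same.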

- Use value iteration and Q-learning to underpin the behaviour of the player.

- Stuart Russell and Peter Norvig, AI: A Modern Approach, 3rd Edition.

The recommended problems are either the graph problem Vertex Cover or the SAT problem, the original NP-complete problem. The Vertex Cover problem has a structure that is simple to understand, yet many different approaches work for solving it (see below) and there are many interesting algorithms. There are also good benchmark repositories with example problems to test your code against. The SAT problem comes from logic, and simply asks whether a given propositional formula has a satisfying assignment. For SAT there is a very powerful yet simple family of algorithms that is successfully used in industry for solving many applied problems such as program verification. As with Vertex Cover, there are good benchmark repositories with example problems to test your code against. Students are expected to choose one of two options in this project: Option 1 (heuristics and approximations) or Option 2 (SAT-solving).

These problems have the important feature that they are "puzzle-like": you can easily recognise or verify a solution to these problems if you are given one, but it can be very challenging to find a solution given an instance of such a problem, or decide whether there is no solution.

In particular, it is quite strongly believed that there are no algorithms that can efficiently find solutions (if they exist) for all instances of such a problem, in time that is "worst-case tractable" (e.g., in polynomial worst-case time).

Problems like this are ubiquitous in reality - verify that a proposed circuit design has no bugs in it; plan routes for your delivery vehicles to deliver some goods with a minimum number of delivery runs; schedule speakers at an event so that you never have two talks at the same time about the same topic... In all these cases, you may run into situations where there is no algorithm that is guaranteed to always work efficiently, but you still have to find a solution to the problem somehow. Many approaches for this have been proposed, both in theory and in practice.

Some strategies for solving NP-complete problems are: heuristics; approximation algorithms; tractable special cases; and algorithms that are slow in the worst case but fast on some 'non-worst-case' inputs.

An example of an algorithm for SAT that is known to be slow on some worst-case inputs but is actually very fast on many typical instances is the DPLL algorithm and its extensions. DPLL is the name of a very simple algorithm for SAT introduced in 1962, which basically amounts to a (partial) exhaustive search of the space of assignments for a given SAT instance. Almost all successful industrial-level SAT-solvers nowadays use variants of DPLL equipped with clever heuristics to improve their performance. Although DPLL is known in theory to require huge (namely, exponential) run-time, experience shows that current SAT-solvers are strikingly efficient on real-world instances. Hence the ubiquitous and successful use of SAT-solving in industry.
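The core of DPLL fits in a screenful of Python. The sketch below (the clause representation and names are my own choices) does unit propagation and branching only; real solvers add clause learning, watched literals and restarts:

```python
def assign(clauses, lit):
    # Set literal `lit` true: drop satisfied clauses, shrink the rest.
    return [[l for l in c if l != -lit] for c in clauses if lit not in c]

def dpll(clauses, model=()):
    """Minimal DPLL on a CNF given as lists of nonzero ints
    (DIMACS-style literals). Returns a satisfying tuple of
    literals, or None if the formula is unsatisfiable."""
    if any(c == [] for c in clauses):
        return None                                  # empty clause: conflict
    if not clauses:
        return model                                 # all clauses satisfied
    unit = next((c[0] for c in clauses if len(c) == 1), None)
    if unit is not None:                             # unit propagation
        return dpll(assign(clauses, unit), model + (unit,))
    lit = clauses[0][0]                              # branch on a literal
    for choice in (lit, -lit):
        result = dpll(assign(clauses, choice), model + (choice,))
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x2) and (not x2 or x3)
print(dpll([[1, 2], [-1, 2], [-2, 3]]))  # (1, 2, 3)
print(dpll([[1], [-1]]))                 # None: unsatisfiable
```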

- You are to choose either Option 1 (heuristics and approximations) or Option 2 (SAT-solving).
- Report: A full formal description of the problem.
- Report: P vs NP, NP-completeness, NP-hardness
- Option 1: Report: Strategies for solving NP-complete problems, such as: heuristics; approximation algorithms; tractable special cases; algorithms that are slow in the worst case but fast on some 'non-worst-case' inputs.
- Option 1: Report: Fast in theory, slow in practice. Merge sort and heap sort are sorting algorithms with excellent theoretical worst-case guarantees, but they may find themselves beaten in practice by quicksort, which has worse worst-case guarantees. Another example: linear programming is the basis of many planning systems, and there are algorithms for solving LPs in polynomial time, but these are not used in practice; instead, algorithms are used that have worst-case exponential-time behaviour. Report and comment on this phenomenon.
- Option 1: Proof of concept program: A program that generates a random instance of Vertex Cover, shows it to the user, reads a suggested solution from the user, and verifies whether this solution is correct.
- Option 1: Proof of concept program: A simple greedy heuristic
- Option 1: Proof of concept program: The simple 2-approximation algorithm
- Option 2: Encoding and data structures: write a program that reads a CNF from a file in the DIMACS format and a truth-assignment (generated randomly), and outputs true or false depending on whether the assignment satisfies the input CNF.
- Option 2: Investigate the DPLL algorithm and write a report that shows manually worked examples of its operation. The report should include the pseudocode of the DPLL algorithm, along with diagrammatic examples of its run on simple input CNF formulas. Explain the algorithm and its running time.
- Option 2: Write procedures that will serve as building blocks for the eventual SAT solver. Specifically, write a procedure that takes as input a CNF formula and a *partial* assignment to its variables, and outputs the new CNF formula resulting from applying the assignment to the original formula. This should include simplifying away (removing) satisfied clauses, if they exist, and should output False if the CNF formula becomes false under the given assignment.
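A sketch of the last milestone above; the representation (a list of integer-literal clauses and a set of literals taken to be true) is one possible choice:

```python
def apply_assignment(clauses, partial):
    """Apply a partial assignment (a set of literals assumed true)
    to a CNF given as a list of integer-literal clauses. Satisfied
    clauses are removed; returns False if some clause is falsified."""
    result = []
    for clause in clauses:
        if any(lit in partial for lit in clause):
            continue                                  # clause satisfied: drop it
        reduced = [lit for lit in clause if -lit not in partial]
        if not reduced:
            return False                              # every literal falsified
        result.append(reduced)
    return result

cnf = [[1, 2], [-1, 3], [-2, -3]]
print(apply_assignment(cnf, {1}))       # [[3], [-2, -3]]
print(apply_assignment(cnf, {-1, -2}))  # False
```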

- Option 1: A program with a GUI that can show an instance on the screen and illustrate a solution for it. Your program should be able to read a problem instance from disk, in a format taken from an existing 'benchmark library', and it should also be able to generate an instance of the problem at random with specified parameters.
- Option 1: The program should offer at least three different solution strategies, such as a heuristic, an approximation algorithm, and a backtracking search 'branching algorithm'.
- Option 1: The report should contain a section on the outcomes of experiments done between the different algorithms for instances of some benchmark library, with conclusions drawn about which algorithms seem suitable in different situations.
- Option 2: Write a program which solves SAT using the DPLL algorithm. It is advisable to write the code in a language like C or C++, but it is also possible to implement it in Java or Python, as the program is intended for educational purposes.
- Option 2: Demonstrate the run of the SAT-solver on substantial instances (taken from known benchmarks).
- Option 2: Investigate the conflict-driven clause learning (CDCL) heuristic. Show in your report manually worked examples of DPLL with the clause learning heuristic, as well as pseudocode for CDCL.
- Option 2: Extend your DPLL solver to use some kind of very simple clause learning heuristic, and explain your choice, showing how it improves the run time on some of the benchmarks.
- The final report should contain a section on the outcomes of experiments done between the different algorithms/SAT-solvers for instances of some benchmark library, with conclusions drawn about which algorithms seem suitable in different situations.

- A more careful theory report, including notions of 'problem reductions' for NP-complete problems. For Vertex Cover, this should at least cover the two problems Independent Set and Max Clique.
- A larger range of algorithms, or algorithms using more advanced concepts. One advanced approach for this problem would be the use of an LP relaxation. Another advanced approach would be to use methods from parameterized complexity (you may ask for references here).
- Careful optimisation of one particular algorithm for the problem, for example a carefully tweaked branching algorithm, to make it as fast as possible across a range of instances.
- Investigations into 'lower bounds' for your problem, to test how far off each algorithm is from an optimal solution.
- For option 2: Extend the DPLL SAT-solver to include fully functional conflict-driven clause learning (CDCL). The full specification for this could be taken from the Handbook (see references below) or from the Wikipedia page. Apply these extensions to the SAT-solver and systematically compare the run-time improvement.
- For option 2: Extend the DPLL SAT-solver to include Restarts. The full specification for this could be taken from the Handbook (see references below) or from the Wikipedia page. Apply these extensions to the SAT-solver and systematically compare the run-time improvement.
- Investigate the notion of Satisfiability Modulo a Theory (SMT). Show how it extends SAT. Apply this to your tool for a chosen theory and systematically compare the run-time improvement on relevant instances.
- Investigate the theoretical foundation of SAT solving: the resolution refutation system and its provable limitations.

- Selected parts of Handbook of Satisfiability. Editors: Biere, A., Heule, M., Van Maaren, H., Walsh, T. Feb 2009. Volume 185 of Frontiers in Artificial Intelligence and Applications.
- https://en.wikipedia.org/wiki/Conflict-Driven_Clause_Learning

A good problem choice is Feedback Vertex Set (see Wikipedia for a description). This is a problem that is easy to understand, but still allows for several interesting algorithms and solution strategies (see below). It is also possible to instead select the Vertex Cover problem as recommended for the BSc full-unit level, but in that case several of the extensions should be implemented as well.

The milestones below are written under the assumption that the problem Feedback Vertex Set is used.

These problems have the important feature that they are "puzzle-like": you can easily recognise or verify a solution to these problems if you are given one, but it can be very challenging to prove or verify that an instance of such a problem does not have a solution (or that it does not have a solution that is "good enough", e.g., that all solutions are too big, too expensive...).

In particular, it is strongly believed that there are no algorithms that can efficiently find solutions (if they exist) for all instances of such a problem, in time that is "worst-case tractable" (e.g., in polynomial worst-case time).

Problems like this are ubiquitous in reality - verify that a proposed circuit design has no bugs in it; plan routes for your delivery vehicles to deliver some goods with a minimum number of delivery runs; schedule speakers at an event so that you never have two talks at the same time about the same topic... In all these cases, you may run into situations where there is no algorithm that is guaranteed to always work efficiently, but you still have to find a solution to the problem somehow. Many approaches for this have been proposed, both in theory and in practice.

- Report: A full formal description of the problem
- Report: P vs NP, NP-completeness, NP-hardness
- Report: Strategies for solving NP-complete problems, such as: heuristics; approximation algorithms; tractable special cases; algorithms that are slow in the worst case but fast on some 'non-worst-case' inputs
- Report: Fast in theory, slow in practice. Merge sort and heap sort are sorting algorithms with excellent theoretical worst-case guarantees, but they may find themselves beaten in practice by quicksort, which has worse worst-case guarantees. Another example: linear programming is the basis of many planning systems, and there are algorithms for solving LPs in polynomial time, but these are not used in practice; instead, algorithms are used that have worst-case exponential-time behaviour. Report and comment on this phenomenon.
- Proof of concept program: A program that takes a graph and either finds a cycle in it or reports that the graph contains no cycles.
- Proof of concept program: A 'cleaning algorithm' to remove vertices of degree 1 and 2 and self-loops
- Proof of concept program: A simple greedy heuristic, with or without using the cleaning step
- Proof of concept program: An approximation algorithm, for example the 3-approximation described in the lecture notes at https://www.cs.cmu.edu/afs/cs/academic/class/15854-f05/www/scribe/lec9.pdf
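The first proof-of-concept above (find a cycle or report that there is none) might look like the following DFS sketch; the adjacency-list format is an arbitrary choice:

```python
def find_cycle(adj):
    """Iterative DFS: return the vertex list of some cycle in an
    undirected graph, or None if the graph is acyclic (a forest).
    `adj` maps each vertex to a list of its neighbours."""
    parent = {}
    for root in adj:
        if root in parent:
            continue
        parent[root] = None
        stack = [(root, iter(adj[root]))]
        while stack:
            v, neighbours = stack[-1]
            for w in neighbours:
                if w not in parent:                  # tree edge: descend
                    parent[w] = v
                    stack.append((w, iter(adj[w])))
                    break
                if w != parent[v]:                   # back edge: cycle found
                    cycle = [v]
                    while cycle[-1] != w:            # walk parents back to w
                        cycle.append(parent[cycle[-1]])
                    return cycle
            else:
                stack.pop()                          # v fully explored
    return None

TRIANGLE = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
TREE = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(find_cycle(TRIANGLE))  # a 3-cycle
print(find_cycle(TREE))      # None
```

The cleaning and greedy milestones then repeatedly delete low-degree vertices and call a routine like this to test whether any cycles remain.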

- A program with a GUI that can show an instance on the screen and illustrate a solution for it. Your program should be able to read a problem instance from disk, in a format taken from an existing 'benchmark library', and it should also be able to generate an instance of the problem at random with specified parameters.
- The program should offer at least three different solution strategies, such as a heuristic, an approximation algorithm, and a backtracking search 'branching algorithm'.
- The report should contain a section on the outcomes of experiments done between the different algorithms for instances of some benchmark library, with conclusions drawn about which algorithms seem suitable in different situations.

- A more careful theory report, including notions of 'problem reductions' for NP-complete problems. For Feedback Vertex Set, this should include at least the hardness reduction from Vertex Cover.
- A larger range of algorithms, or algorithms using more advanced concepts. One advanced approach for this problem would be the use of an LP relaxation. Another advanced approach would be to use methods from parameterized complexity (you may ask for references here).
- Careful optimisation of one particular algorithm for the problem, for example a carefully tweaked branching algorithm, to make it as fast as possible across a range of instances.
- Investigations into 'lower bounds' for your problem, to test how far off each algorithm is from an optimal solution.

- A good set of benchmark instances can be found on the following page: https://pacechallenge.wordpress.com/pace-2016/track-b-feedback-vertex-set/

In this introduction I will focus on football. You are welcome to choose any other sport you are familiar with, but you must ensure that you can acquire the relevant datasets.

Around the globe, there are hundreds of football matches every week, and an extensive amount of data on the players' performance is collected. While some of this data is behind a paywall, other data is openly available on websites including FBref, sofascore, transfermarkt and kaggle. In this project, you will be a data scientist for a sports team. By using data, you and your colleagues together aim to build a strong team.

There are many ways a data scientist can be useful in sports team management. For instance, when your team needs to bring in a player of a certain type (say a centre back with good aerial ability), you can use a clustering algorithm or principal component analysis (PCA) to identify the right players from less elite leagues. This can help the team bring in a "hidden gem" for a low transfer fee. On the other hand, you can use the players' performance data to identify their weaknesses, so the coaching staff know which training programmes to provide to the players. Remember, your colleagues are good at kicking the ball, but they may not be as good with numbers as you. To help them understand your analysis, you need to provide them with simple but informative visualisations.
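Numerically, PCA reduces to a singular value decomposition of the centred data matrix. A NumPy sketch on made-up per-90 stats (the numbers and attribute choices are invented for illustration):

```python
import numpy as np

def pca(X, k=2):
    # Project the rows of X onto the top-k principal components,
    # computed via an SVD of the centred data matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Invented per-90 stats: tackles, aerials won, passes, shots.
players = np.array([
    [3.1, 4.2, 55.0, 0.2],   # centre-back profile
    [2.9, 3.8, 60.0, 0.3],   # centre-back profile
    [0.8, 0.5, 35.0, 3.1],   # striker profile
    [0.6, 0.7, 30.0, 2.8],   # striker profile
])
coords = pca(players, k=2)
print(coords.shape)  # (4, 2): one 2-d point per player, ready to plot
```

The resulting 2-d coordinates are exactly what you would feed to ggplot2 or plotnine as a scatter plot; in practice the columns should also be standardised first so that high-variance attributes such as passes do not dominate.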

Sports data analytics is a science but also an art. You must have a good sense of the sport you analyse. If you have watched only a few matches of the sport in your whole life, you are strongly advised not to choose this topic.

- You will need to work with datasets from different sources. Familiarise yourself with the datasets, and identify the attributes which are strong indicators of good players in each position.
- Learn the powerful visualisation library ggplot2 (or plotnine if you are using Python). In particular, learn how to represent high-dimensional data using bubble plots, and/or via PCA.
- Make a plan for how you are going to use the datasets to complete a mission that helps your sports team. The proposal should indicate how the data will be pre-processed, which learning algorithms you will use, and how the results of the data analysis will be used.

- Generate simple but informative visualisations which allow you to present your data analysis results to your colleagues.
- Implement software with a user-friendly interface that allows your colleagues to explore the data themselves. The software should do the following: (i) recommend which players your team could bring in, based on the position, age and budget constraints your colleagues input; (ii) provide your colleagues with customisable visualisations; and (iii) display annotated tables of detailed data when your colleagues want to look into it.

You must have done Data Analysis (CS5100 or CS5100J). You must be comfortable with programming in Python or R. Doing well in Visualisation and Exploratory Analysis (CS5250) is preferred.

- Report - Protein Structures
- Report - The Protein Data Bank and the file formats used in accessing it.
- Proof of Concept - distributing protein structure files amongst MapReduce cluster.
- Proof of Concept - Running MapReduce using a single type of executable for analysing protein structures.

- The final framework will allow users to supply a map step and a reduce step (namely, an executable to run on an individual protein structure for the map step, and a parser for the generated log files for the reduce step) to analyse a set of protein structures.
- The developer will provide a clear set of guidelines for the map/reduce executables.
- The final data should be returnable in a variety of different formats.
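Stripped of the distribution layer, the control flow the framework must provide looks like this pure-Python sketch (names invented): the map function stands in for the user's executable run on each structure file, and the reduce function for their log-file parser:

```python
from collections import defaultdict

def run_job(structures, map_fn, reduce_fn):
    """Sketch of the map/reduce control flow: `map_fn` runs on each
    structure and emits (key, value) pairs; `reduce_fn` folds the
    values collected per key. A real cluster runs the map calls in
    parallel across the nodes holding the structure files."""
    groups = defaultdict(list)
    for s in structures:
        for key, value in map_fn(s):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}

# Toy "structures": count residue types across files.
files = [["ALA", "GLY", "ALA"], ["GLY", "SER"]]
counts = run_job(files, lambda residues: [(r, 1) for r in residues], sum)
print(counts)  # {'ALA': 2, 'GLY': 2, 'SER': 1}
```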

- The PDB is updated on a regular basis. The data set can mirror the official database while remaining distributed amongst the MapReduce cluster.
- A set of libraries so that arbitrary executables can be easily put into the MapReduce formalism.
- The PDB is comprised of many different types of structure (DNA and RNA macro-molecules are included as well as proteins). An interface to select fractions of the PDB for analysis.
- Implementation on a public MapReduce cluster, e.g. Amazon.

- The protein data bank H.M.Berman et al. Nucl. Acid. Res. (2000) 28(1): 235-242. Doi: 10.1093/nar/28.1.235

The quality of a set of unit tests (a test suite) is difficult to assess: how much testing is enough? This question is especially important for safety-critical systems. As a consequence, many coverage metrics such as statement or function coverage have been developed to give an intuition for how well a test suite exercises the program under test. A number of coverage visualisers are available, but new metrics require new tool support.

The goal of this project is to use the Soot framework for instrumenting Java class files to insert profiling code that can gather coverage information. Soot is written in Java and lets you both analyse and modify Java bytecode. You will implement different types of instrumentation for several different coverage metrics and a visualisation for the results.

In addition to standard metrics, you should also implement MC/DC (Modified Condition/Decision Coverage), which is mandated for safety-critical avionics software.
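MC/DC demands that each condition be shown to independently affect the decision's outcome. A small Python sketch of that check (the decision and test vectors are invented; the project's version would of course operate on instrumented bytecode, not Python lambdas):

```python
from itertools import combinations

def mcdc_covered(decision, tests, n_conditions):
    """For each condition i, look for an independence pair: two
    tests that differ only in condition i yet give different
    decision outcomes. Returns the set of conditions for which
    such a pair exists in `tests`."""
    shown = set()
    for a, b in combinations(tests, 2):
        diff = [i for i in range(n_conditions) if a[i] != b[i]]
        if len(diff) == 1 and decision(*a) != decision(*b):
            shown.add(diff[0])
    return shown

dec = lambda a, b, c: a and (b or c)
tests = [(True, True, False), (False, True, False), (True, False, False)]
print(mcdc_covered(dec, tests, 3))  # {0, 1}: c's effect not yet shown
```

Adding the test (True, False, True) completes the set. In general MC/DC can be achieved with as few as n+1 tests for n conditions, versus 2^n for exhaustive multiple-condition coverage.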

- Code: Instrumentation for statement and function coverage and basic visualisation
- Report: Java Bytecode and Soot's intermediate representation
- Report: Code instrumentation
- Report: Survey on coverage metrics

- Code: Branch coverage, condition coverage, and MC/DC coverage. GUI for running and visualising the analysis.
- Report: Architecture of Soot
- Report: Design and implementation of the instrumentation tool
- Report: Test requirements for safety critical systems
- Formal description and practical implementation of a custom coverage metric.

- Computing cyclomatic complexity using Soot's analysis components

- Lam et al: The Soot framework for Java program analysis: a retrospective. Cetus Users and Compiler Infrastructure Workshop (CETUS 2011)
- The Soot Wiki: https://github.com/Sable/soot/wiki

- Good programming skills, particularly in Python, C, or C++.
- A basic grasp of network security and the OSI model.
- Willingness to learn and engage with simulation and statistical analysis software (NS3/SimPy and MATLAB/Octave/R).
- Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

- A report on task allocation algorithms in current usage and the degree to which trust is implicit in their design
- A report on the current state of the art in anomaly detection (behavioural and statistical) in task allocation systems. This may extend to underlying network infrastructure (for example selfish routing attacks in Mobile Ad hoc Networks (MANETs) performed by malicious nodes in an attempt to gain intelligence about the larger mission)
- A clear workplan describing the set of tests to be performed, and a specific use case defining what type of network (MANET, traditional wireless/wired, Internet-based) and task allocation protocol will form the focus of experiments

- A simulation or experimental demonstration of the principles explored in the project, such as a simple (half a dozen nodes) decision-making system with attack and mitigation demonstrations
- A final report documenting the research, design, and implementation of the system. This report will extend to cover the experimental design, and how the choices made in the systems design phase informed experimental choices (and vice versa). A clear grasp of experimental process and hypothesis testing should be demonstrated through appropriate use of tools and techniques

- https://ieeexplore.ieee.org/abstract/document/9002704
- https://ieeexplore.ieee.org/abstract/document/7776949
- https://gala.gre.ac.uk/id/eprint/16178/
- https://ieeexplore.ieee.org/abstract/document/8758417
- https://ieeexplore.ieee.org/abstract/document/7809102

- Ability to research a new topic and implement designs
- System modelling
- Good programming skills, particularly Node JS, Java, Golang or C/C++, and knowledge of Windows/Linux internals
- Knowledge of data/database modelling
- Familiarity with DLT and Blockchain technologies (e.g. HyperLedger)
- Public key encryption and digital certificate standards

- A report describing the architectural components of a PKI system and assessing the potential of Blockchain technologies
- A design of the PKI data model
- A proof-of-concept implementation of the auditing system using a DLT technology
- Testing the Blockchain-based PKI system

- Design and develop the data model of PKI
- Develop a prototype of PKI system using Blockchain
- Complete report
- The report will have a User Manual
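To give a flavour of why a DLT suits certificate auditing, here is a minimal, illustrative Java sketch of a hash-chained audit log: each entry's hash commits to the previous entry, so tampering with any earlier certificate event changes every later hash. The plain-string event format is an assumption for the example only; a real design would log structured certificate records.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

/** Illustrative hash-chained audit log: each SHA-256 hash covers the
 *  previous hash plus the current event, so earlier entries cannot be
 *  altered without invalidating all later hashes. */
public class AuditChain {
    public static List<String> chain(List<String> events) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            List<String> hashes = new ArrayList<>();
            String prev = "";
            for (String event : events) {
                byte[] h = sha.digest((prev + event).getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder();
                for (byte b : h) hex.append(String.format("%02x", b));
                prev = hex.toString();
                hashes.add(prev);
            }
            return hashes;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

A full Blockchain platform such as HyperLedger adds consensus and access control on top of this basic tamper-evidence idea.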

- https://medium.com/swlh/top-5-dlt-blockchain-protocols-to-consider-in-your-project-dae7fc6d381d
- https://www.ibm.com/blogs/research/2018/02/architecture-hyperledger-fabric/
- https://hyperledger-fabric.readthedocs.io/en/release-1.4/getting_started.html
- https://pki-tutorial.readthedocs.io/en/latest/simple/
- www.openssl.org
- EJBCA https://www.ejbca.org
- Dierks, T., (2008). The transport layer security (TLS) protocol version 1.2, RFC-5246
- Vacca, J. R. (2004). Public Key Infrastructure. Boca Raton: C R C Press LLC.
- Cooper, D., Santesson, S., Farrell, S., Boeyen, S., Housley, R. Polk, W. (2008) Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile (RFC-5280)
- Santesson, S., Myers, M., Ankney, R., Malpani, A., Galperin, S., Adams, C., (2013) X.509 Internet Public Key Infrastructure a Certificate Status Protocol – OCSP (RFC-6960)
- Eastlake, D., (2011) Transport Layer Security (TLS) Extensions: Extension Definitions (RFC-6066)

Timetabling is a computationally hard problem. It is believed that the only way to find an optimal timetable is to construct all possible timetables and choose the best one. This exhaustive search is impractical for all but trivial problems.

Timetabling is an example of combinatorial optimisation - a problem in which a set of objects must be ordered so as to optimise some cost function. Since exhaustive search is impractical, we instead use heuristic algorithms to find solutions which are good enough, rather than guaranteed optimal. For instance, another well-known combinatorial optimisation problem is the Travelling Salesman Problem, in which a set of cities must each be visited once in an order which minimises the overall route length. One heuristic is to start at the start city, move to the nearest city that has not yet been visited, and repeat until no cities are left. This often produces quite good results, but it is hard to see what the equivalent heuristic might be for the timetabling problem, which is multi-dimensional and has a variety of different cost factors.

Simulated annealing is a general-purpose heuristic which minimises combinatorial systems using the Metropolis algorithm - a simulation of annealing behaviour in crystals as they are warmed and slowly cooled. The underlying theory is drawn from statistical thermodynamics and is likely to be quite opaque to most computer scientists, so you are not required to master it. However, the resulting algorithm is quite easy to implement and experiment with, and produces remarkably good results in many kinds of combinatorial optimisation problems.
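To make the Metropolis acceptance rule concrete, here is a minimal, illustrative Java sketch of simulated annealing applied to the Travelling Salesman Problem. The starting temperature, cooling rate, and step count are arbitrary example values, not tuned recommendations.

```java
import java.util.Random;

/** Minimal simulated-annealing sketch for the TSP: propose a random swap
 *  of two cities, always accept improvements, and accept worse tours with
 *  probability exp(-delta / temperature) while the temperature decays. */
public class AnnealTsp {
    static double tourLength(double[][] c, int[] tour) {
        double len = 0;
        for (int i = 0; i < tour.length; i++) {
            int a = tour[i], b = tour[(i + 1) % tour.length];
            len += Math.hypot(c[a][0] - c[b][0], c[a][1] - c[b][1]);
        }
        return len;
    }

    public static int[] anneal(double[][] cities, long seed) {
        Random rnd = new Random(seed);
        int n = cities.length;
        int[] tour = new int[n];
        for (int i = 0; i < n; i++) tour[i] = i;
        double temp = 10.0, cooling = 0.999; // illustrative schedule
        double cur = tourLength(cities, tour);
        for (int step = 0; step < 100_000; step++) {
            int i = rnd.nextInt(n), j = rnd.nextInt(n);
            // Propose a neighbouring tour by swapping two cities.
            int t = tour[i]; tour[i] = tour[j]; tour[j] = t;
            double cand = tourLength(cities, tour);
            // Metropolis rule: keep improvements, sometimes keep worse tours.
            if (cand <= cur || rnd.nextDouble() < Math.exp((cur - cand) / temp)) {
                cur = cand;
            } else { // undo the swap
                t = tour[i]; tour[i] = tour[j]; tour[j] = t;
            }
            temp *= cooling;
        }
        return tour;
    }
}
```

The same accept/reject loop applies to timetabling once "swap two cities" is replaced by a move such as "swap two events between timeslots" and tour length by the timetable's cost function.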

To succeed in this project, you need to be interested in computationally hard algorithms; and to possess good programming skills.

- Write a Java program which exhaustively solves the Travelling Salesman problem for small problems - say up to 10 cities.
- Write a Java program which uses the nearest city heuristic and test its performance on large problems with 100-200 cities.
- Write a report on these two algorithms which gives both asymptotic and actual runtimes for your implementations.
- Write a Java implementation of simulated annealing which solves large Travelling Salesman problems.
- Write a report describing the effects of modifying the parameters of your algorithm.
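The nearest city heuristic in the list above can be sketched as follows (illustrative Java; Euclidean distances and a fixed start at city 0 are assumptions of this example):

```java
/** Sketch of the nearest-city heuristic for the TSP: from the current
 *  city, always move to the closest city not yet visited. */
public class NearestCity {
    public static int[] route(double[][] c) {
        int n = c.length;
        boolean[] visited = new boolean[n];
        int[] route = new int[n];
        visited[0] = true; // arbitrarily start at city 0
        for (int k = 1; k < n; k++) {
            int prev = route[k - 1];
            int best = -1;
            double bestD = Double.POSITIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                if (visited[j]) continue;
                double d = Math.hypot(c[prev][0] - c[j][0], c[prev][1] - c[j][1]);
                if (d < bestD) { bestD = d; best = j; }
            }
            route[k] = best;
            visited[best] = true;
        }
        return route;
    }
}
```

Note the O(n^2) inner scan; this is one of the actual runtimes the report should compare against the exhaustive O(n!) search.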

- Write a report which gives a specification of a useful timetabling problem, and describes the software engineering approach that you will use to develop the final application.
- Demonstrate a working timetabling system.
- A final report on the project activities, fully acknowledging and citing sources.

- Add graphical visualisations of your algorithm's runtime activity and results.
- Consider the problem of rip-up and retry: taking an existing solution with new constraints and costs and attempting to produce a good modified solution that retains as much as possible of the old solution.

- The section on simulated annealing in the Numerical Recipes book, published by Cambridge University Press.
- Many online resources: start with Wikipedia

```
AG (request => AF response)
```

This means "it is always true that if a request event happens, there will always be
a point in the future where the request is responded to".
A problem with the above example is that there may be multiple request and response events. The formula cannot ensure that each request is paired with a response. In fact, the set of all traces that satisfy this property is a "context-free language".

Recent work extends CTL to allow languages such as context-free languages to be built into the logic, so that these kinds of properties can be expressed.

While many CTL model-checkers have been implemented, there are currently no implementations of model-checkers for this extended CTL. The goal of this project would be to produce the first such implementation.
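To give a flavour of what even a simple CTL model-checker involves, here is an illustrative Java sketch of one standard step: computing the set of states satisfying AF p as a least fixpoint. The integer state encoding and successor map are assumptions of this example, not a prescribed design; the transition relation is assumed total (every state has a successor).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of the CTL fixpoint for AF p: start from the states labelled p,
 *  then repeatedly add any state all of whose successors are already in
 *  the set, until nothing changes. */
public class CtlAf {
    public static Set<Integer> af(Map<Integer, List<Integer>> succ, Set<Integer> p) {
        Set<Integer> x = new HashSet<>(p); // states known to satisfy AF p
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<Integer, List<Integer>> e : succ.entrySet()) {
                // Add s when every successor of s already satisfies AF p.
                if (!x.contains(e.getKey()) && x.containsAll(e.getValue())) {
                    x.add(e.getKey());
                    changed = true;
                }
            }
        }
        return x;
    }
}
```

A full checker composes such fixpoints for each operator (EX, EU, AF, ...); the extended logic would additionally track language membership along paths.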

- Report: A survey of CTL and CTL model-checking algorithms.
- Report: A survey of extended CTL.
- Report: A survey of model-checking algorithms for context-free languages.
- Proof of concept programs: A simple CTL model-checker.
- Proof of concept programs: A simple analysis tool for context-free languages.

- An analysis tool for finite-state systems and CTL extended with finite-state and context-free languages.
- An experimental evaluation of the tool and comparison with CTL model-checkers.

- Extensions allowing infinite-state systems to be analysed.

- A report describing several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy).
- Reports: complexity, NP-hardness, and big-O notation
- Implementations of Nearest Neighbour and Repeated Nearest Neighbour.

- Description of several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy, Vertex Insertion).
- Implementation of all heuristics described in the report.
- Tables comparing quality and runtime of these heuristics on various testbeds.
- GUI for the heuristics.
- A report on TSP applications.
- A short report on complexity of algorithms.

- To study and implement Greedy+2-Opt. To include it into the tables.
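The 2-Opt local search mentioned above can be sketched as follows (illustrative Java, Euclidean distances assumed): it repeatedly reverses a segment of the tour whenever that replaces two edges by two shorter ones, until no improving move remains.

```java
/** Sketch of a 2-opt improvement pass for the TSP. Reversing the segment
 *  t[i..j] replaces edges (a,b) and (d,e) with (a,d) and (b,e); the move
 *  is applied whenever it shortens the tour. */
public class TwoOpt {
    static double dist(double[][] c, int a, int b) {
        return Math.hypot(c[a][0] - c[b][0], c[a][1] - c[b][1]);
    }

    public static double tourLength(double[][] c, int[] tour) {
        double len = 0;
        for (int i = 0; i < tour.length; i++)
            len += dist(c, tour[i], tour[(i + 1) % tour.length]);
        return len;
    }

    public static int[] improve(double[][] c, int[] tour) {
        int[] t = tour.clone();
        int n = t.length;
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 1; i < n - 1; i++) {
                for (int j = i + 1; j < n; j++) {
                    int a = t[i - 1], b = t[i], d = t[j], e = t[(j + 1) % n];
                    if (dist(c, a, d) + dist(c, b, e) + 1e-12
                            < dist(c, a, b) + dist(c, d, e)) {
                        // Reverse the segment t[i..j].
                        for (int lo = i, hi = j; lo < hi; lo++, hi--) {
                            int tmp = t[lo]; t[lo] = t[hi]; t[hi] = tmp;
                        }
                        improved = true;
                    }
                }
            }
        }
        return t;
    }
}
```

In the Greedy+2-Opt combination, the greedy heuristic supplies the starting tour and this pass removes its crossings.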

- Start with Wikipedia on TSP.

- A report describing several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy).
- Reports: complexity, NP-hardness, and big-O notation
- Implementations of Nearest Neighbour and Repeated Nearest Neighbour.

- Description of several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy, Vertex Insertion).
- To study and implement Greedy+2-Opt.
- Implementation of all heuristics described in the report.
- Tables comparing quality and runtime of these heuristics on various testbeds.
- GUI for the heuristics.
- A report on TSP applications.

- Start with Wikipedia on TSP.

- You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
- Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
- The report will describe the program (software engineering, algorithms etc.,).
- The report will also discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).
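As an illustration of one of the variance estimators mentioned above, a minimal EWMA update might look like this in Java (lambda = 0.94 is the common RiskMetrics choice, used here only as an example value; the seeding convention is also an assumption):

```java
/** Sketch of the EWMA variance estimator: each day the variance estimate
 *  is a weighted mix of yesterday's estimate and today's squared return. */
public class EwmaVariance {
    public static double estimate(double[] returns, double lambda) {
        double var = returns[0] * returns[0]; // seed with the first squared return
        for (int i = 1; i < returns.length; i++)
            var = lambda * var + (1 - lambda) * returns[i] * returns[i];
        return var;
    }
}
```

GARCH(1,1) adds a third term pulling the estimate towards a long-run variance; comparing the two estimators empirically is part of the project.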

- The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation.
- Allow an arbitrary number of stocks (using eigen- or Cholesky decomposition)
- The final program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
- Ideally, it will have a graphical user interface.
- The final report will describe: the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

- Computing conditional VaR (also known as expected shortfall)
- Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
- Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
- Complement back testing by stress testing
- Replacing the Gaussian model for stock returns by robust models
- Computing monetary measures of risk different from VaR

- John C. Hull. Options, futures and other derivatives, 7th ed., Pearson, Upper Saddle River, NJ, 2009.
- Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time, 3rd ed., de Gruyter, Berlin, 2011.
- Yahoo Finance as source of data: http://finance.yahoo.com/
- http://www.gloriamundi.org

The project is recommended for Computational Finance students.

- You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
- Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
- The report will describe the program (software engineering, algorithms, etc.); discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).
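To illustrate the historical-simulation side, here is a minimal Java sketch that reads one-day VaR off the empirical loss distribution; the ceiling-index quantile convention used here is one of several reasonable choices and should be justified in the report.

```java
import java.util.Arrays;

/** Sketch of one-day historical-simulation VaR: sort the hypothetical
 *  portfolio losses from historical scenarios and read off the chosen
 *  quantile (e.g. level = 0.95 or 0.99). */
public class HistSimVar {
    public static double var(double[] losses, double level) {
        double[] sorted = losses.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(level * sorted.length) - 1;
        return sorted[Math.min(Math.max(idx, 0), sorted.length - 1)];
    }
}
```

The model-building approach instead fits a distribution to returns and computes the quantile analytically; back-testing compares both against realised losses.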

- The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation. Option pricing can be performed using the Black-Scholes formula, or binomial trees (for American options), or a different variety of Monte Carlo simulation (for Asian options).
- Allow an arbitrary number of stocks in the model-building approach (using eigen- or Cholesky decomposition).
- Allow an arbitrary number of stocks using historical simulation.
- Ideally, it will have a graphical user interface.
- The final report will describe the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

- Computing conditional VaR (also known as expected shortfall)
- Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
- Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
- Complement back testing by stress testing.
- Replacing the Gaussian model for stock returns by robust models.
- Computing monetary measures of risk different from VaR.
- Using proper scoring rules for quantiles for evaluating computed VaRs.
- Using methods of worst-case risk aggregation, such as the Rearrangement Algorithm, or worst-case online decision making, such as the Weak Aggregating Algorithm (currently active areas of research).

- John C. Hull. Options, futures and other derivatives, 10th ed., Pearson, Upper Saddle River, NJ, 2018.
- Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time, 3rd ed., de Gruyter, Berlin, 2011.
- Yahoo Finance as source of data: http://finance.yahoo.com/
- http://gloria-mundi.com

The project is recommended for Computational Finance students.

- The report will describe the program (software engineering, algorithms, etc.); discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).

- The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation. Option pricing can be performed using the Black-Scholes formula, or binomial trees (for American options), or a different variety of Monte Carlo simulation (for Asian options).
- Allow an arbitrary number of stocks in the model-building approach (using eigen- or Cholesky decomposition).
- Allow an arbitrary number of stocks using historical simulation.
- Ideally, it will have a graphical user interface.
- The final report will describe the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

- Computing conditional VaR (also known as expected shortfall)
- Complement back testing by stress testing.
- Replacing the Gaussian model for stock returns by robust models.
- Computing monetary measures of risk different from VaR.
- Using proper scoring rules for quantiles for evaluating computed VaRs.
- Using methods of worst-case risk aggregation, such as the Rearrangement Algorithm, or worst-case online decision making, such as the Weak Aggregating Algorithm (currently active areas of research).

- John C. Hull. Options, futures and other derivatives, 10th ed., Pearson, Upper Saddle River, NJ, 2018.
- Hans Föllmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time, 3rd ed., de Gruyter, Berlin, 2011.
- Yahoo Finance as source of data: http://finance.yahoo.com/
- http://gloria-mundi.com