KWEST: A Semantically Tagged Virtual File System

Report
Pune University
Aseem Gogte* , Sahil Gupta* , Harshvardhan J. Pandit* , Rohit Sharma*
publication 🔓copies: harshp.com , zenodo
📦resources: code , documentation
A virtual filesystem that allows automated semantic organisation and suggestions

ABSTRACT

The limitation of data representation in today’s file systems is that data representation is bound only in a single way of hierarchically organising files. A semantic file system provides addressing and querying based on the content rather than storage location. Semantic tagging is a new way to organise files by using tags in place of directories. In traditional file systems, symbolic links become non-existent when file paths are changed. Assigning multiple tags to each file ensures that the file is linked to several virtual directories based on its content. By providing semantic access to information, users can organise files in a more intuitive way. In this way, the same file can be accessed through more than one virtual directory. The metadata and linkages for tagging are stored in a relational database which is invisible to the user. This allows efficient searching based on context rather than keywords. The classification of files into various ontologies can be done by the user manually or through automated rules. For certain files types, tags can be suggested by analysing the contents of files. The system would be modular in design to allow customisation while retaining a flexible and stable structure.

Keywords : virtual file system, semantics, indexing, classification, database, tagging, information access, metadata

Chapter 1

INTRODUCTION

1.1 Overview

Information organisation and management is an essential and inevitable part of everyday computer usage. Huge amount of data is produced on daily basis. With data growing in size, we are faced with the problem of locating files. Traditional file systems impose a hierarchical structure of storage on the user. Traditional file systems are mono-hierarchical and implement directory trees to categorise and store files. In such systems, directories are the only means to access particular files.

In such systems, directories are the only means to access particular files. The path of a file contains directories, which refer to its context and categorisation. As an example

“c:nphotosncollegetripnmuseum*.jpg” refers to all photos of a museum from a college trip. In this case, it is not possible to store that photo in another directory say “c:n photosnmuseum*.jpg” without copying the file. This severely limits the searching capabilities in a file system. The user is faced with the dilemma of which directory best represents the context of current file. While storing the file is identified by its file name alone which serves as its identifier. For searching a particular file, the user has to accurately remember the path and file name.

A file cannot be searched by any other information relating to its context. Creating the directory structure is based on the users organisational skills. Searching or browsing throughsomeoneelsedataistrickyastheorganisationisdifferentforeveryuser.Previous approaches to such problems provided symbolic links and aliases as an incomplete answer. Symbolic links become redundant when the target file paths are changed. Similarly, aliases may become redundant or may not function properly with certain programs. Working with such solutions requires advanced skills on the users part. Keyword based searches which extract metadata from files were brought to fore by Apple’s Spotlight[13] and Google’s Desktop Search[14] . Both function only on limited filetypesanddonotallowmanualcategorisation.Thisledtothedevelopmentofsemantic file systems, containing categorisation of files based on context. It provides access to files by using categories formed from extracting metadata. It is similar to how music files can be searched by artist, genre, album etc. However, this presents a limitation on the amount and capabilities of what metadata can be extracted from a file. Virtual directories are used to represent data from the file system. These directories do not have a permanent listing and the user has to explicitly query for data. There have been several implementations based on semantic file systems.

However, they have several limitations in usability. Most of the projects are based only on a few key points, such as limitations over file types. This project thus is to create a semantic solution to the problems and shortcomings of traditional file systems while covering the limitations of other implemented projects.

1.2 Brief Introduction

Organising and retrieving information accurately and efficiently has attracted a lot of attention. While few have been successful, a number of innovative implementations have

emerged. KWEST is a virtual file system capable of storing semantics with which it facilitates the finding of relevant information. Information is stored in tags, which are extracted from a files metadata. This information may be generated implicitly by the system or supplied explicitly by the user. Thus, the validity of information is based on the user’s level of organising things.

The current system can extract metadata from a limited set of known file types. However, the modular architecture allows for plugins to be added which can add functionality for other file types. This allows for the project to be extended and modified according to the functionality required. The level of awareness generated by the system is based on the frequency of access and input provided by the user. The amount of relevance is determined by the tags generated and their associated files. This affects how the system categorises and searches files. Thus the actual outcome of the system which is the searching capabilities is totally dependent on these relationships. The current implementation is based on the Linux kernel. Future implementations can be extended to other platforms and devices. Furthermore, as the system is a virtual one, it needs only slight modifications to be ported to other file systems and operating systems.

1.3 Problem Definition

The goal of this project is to develop a semantic file system that extracts metadata from files and allows storage and searching based on its context. Such a system should overcome the drawbacks of traditional file systems while leveraging the limitations of other such similar implementations.

1.4 Feasibility Study

1.4.1 Technical Feasibility

The team have knowledge of C and Object Oriented concepts. The Project is being implemented using loadable kernel module known as FUSE. In the current versions of some Linux based OS this module is included in the kernel itself. The query processing and programming will be done using SQLite. It is a relational database contained in a small C programming library. In contrast to other database management systems, SQLite is not a process that is accessed from the client application, but an integral part of it. This is also open source and is freely available. Thus, the cost for developing KWEST will be minimal and hence will be feasible without the need for large capital.

1.4.2 Economic Feasibility

Cost of Software: We have used open-source technologies for building our system, thus there was no software cost incurred.

Cost of Hardware: No hardware was required to be purchased, thus no cost was incurred in building the system.

1.5 Application of a Software Engineering Approach

We are using incremental model of Software Development Life Cycle. The Software development life cycle (SDLC) is a process used by a systems analyst to develop an information system, training, and user (stakeholder) ownership. The SDLC aims to

produce a high quality system that meets or exceeds customer expectations, reaches completionwithintimeandcostestimates,workseffectivelyandefficientlyinthecurrent and planned Information Technology infrastructure, and is inexpensive to maintain and cost-effective to enhance.The Software Development Life Cycle framework provides a sequence of activities for system designers and developers to follow. It consists of a set of steps or phases in which each phase of the SDLC uses the results of the previous one. A Software Development Life Cycle (SDLC) adheres to important phases that are essential for developers, such as planning, analysis, design, and implementation.

Figure 1.1: Incremental Model

1.5.1 Planning

During this stage, business opportunities and problems are identified, and information technology solutions are discussed. Multiple alternative projects may be suggested and their feasibility analysed.

Tasks proposed were

1. Analysing need of user to access related data.

2. Identifying drawbacks of existing file system.

3. The development of a project goals- identifying relations between data, file system operations, metadata storage, accessing related data.

4. The collection of project requirements.

5. The development of a project schedule.

1.5.2 System Analysis

The goal of system analysis is to determine where the problem is in an attempt to fix the system.This step involves breaking down the system in different pieces to analyse the situation,analysingprojectgoals,breakingdownwhatneedstobecreatedandattempting to engage users so that definite requirements can be defined.

Tasks proposed were

1. Interface for implementing file system.

2. Analyse various database alternatives based on size and speed of operation.

3. Analysis of libraries to be used for metadata operations.

4. Generation of files and tags suggestions.

5. Analysing various methods to provide user with suggestions.

1.5.3 System Design

In systems design the design functions and operations are described in detail, including screen layouts, business rules, process diagrams and other documentation. The output of this stage will describe the new system as a collection of modules or subsystems.

Tasks proposed were

1. Separating system implementation into following modules.

2. File System Operations.

3. Generating Tags.

4. Importing Semantics.

5. Exporting Semantics.

6. Database Consistency.

1.5.4 Coding

In coding we do actual implementation of system design, produced running software.

Tasks proposed were

1. Implementing modules of project using finalised programming language and libraries

2. Having modularity in code to allow for reusable modules.

3. Documenting the code using standard tools.

4. Testing project using various software testing approaches.

1.5.5 Testing

Testing is the process of evaluating a system or application, to check whether the application meets all requirements of the client and to detect the errors.

Tasks proposed were

1. Testing of executable code using robust testing tools.

2. Performing Regression Testing.

3. Usability Testing.

4. GUI Testing.

5. Stress Testing.

6. Integrity Testing.

1.5.6 Deployment

Deployment starts after the code is appropriately tested, approved for release, and sold or otherwise distributed into a production environment. This may involve installation, customisation (such as by setting parameters to the customer’s values), testing, and possibly an extended period of evaluation.

Tasks proposed were

1. The project will be deployed as a executable which will mount the KWEST file system.

2. Utilise available GUI tools in the form of file managers.

3. Installation of utility should follow standard Linux OS procedures.

Chapter 2

LITERATURE SURVEY

Over the years, organising and retrieving information accurately and efficiently has attracted lot of attention. While few have been successful, a number of innovative implementations [11] have emerged. The idea of using a file’s semantics as the means to categorise it has been around for quite some time. This section discusses the various implementations made in the field of semantic file system.

An efficient implementation of keyword based searching was brought to the desktop by Apple’s Spotlight[13] and Google’s Desktop Search[14] . Both allow efficient and quick file retrieval based on keywords. They support many file types and have a simple interface which attracts a large number of users. However, both of them are limited to returning search results without any way to organising contents. In addition, they do not provide any provision to the user for classification of data. This limitation prevented the user from having a personalised way to retrieve data stored by them.

Semantic systems depend on data stored inside the files rather merely relying on an file’s attributes. Most implementations use common methodologies like content recognition[2] , tagging[7] , extracting metadata, etc. to categorise files by using various algorithms.

“Semantic File System”[8] , as developed by O’Toole and Gittord in 1992, provides access to file contents and metadata by extracting the attributes using special modules called “transducers”. It was one of the very first attempts to classify files by semantics using metadata. Its biggest drawback was the need for file type specific transducers which were necessary to extract meta information and content from the file. Also, the user does not have any say in what kind of category the file is classified under. This drawback makes it an unattractive option to the general user. It was decided during designing KWEST, that it is necessary to involve the end-user in the tagging process. This allows each user to have their own personal way of classification and organisation of files.

NHFS (Non Hierarchical File System)[12] was a project developed by Robert Freund in July 2007. It allows the user to place any file into any number of directories. Likewise, any directory can be placed into as many directories as required. NHFS therefore allows one to create a non-hierarchical structure with poly-hierarchically connected files. This allows for a powerful metaphor of finding a file in any of the category (directory) it could be stored under. Therefore, we decided to retain this feature by using tags in place of actual directories. Tags are associated with files and other tags as well. Thus, a tag may be placed under multiple tags allowing a relationship to be defined between them. This analogy is much more powerful than restricting files to actual directories. Using tags prevents duplication and redundancy, making it an efficient implementation.

A more recent implementation is Tagsistant[15] , which is a semantic file system that also attempts to organise files using tags. It interacts with the Linux kernel using the FUSE module. Under Tagsistant, directories are considered to be equivalent to tags. As a consequence, creating a directory is creating a tag and putting a file inside a directory means tagging that file. After you have tagged your files, you can search all of them by using queries. Queries are just paths where each element is either a directory or logical operators. The entire system has a modular design and uses SQLite. However, it suffers from some speed issues and the lack of SQL indexes. Major flaws of this design were

high consumption of inodes on real file systems and high computational time which was required to fulfil each request. Most of the features of Tagsistant were decided to be included in KWEST. These were modular design, SQLite repository, tagged structure, etc. whichenhancethesemanticsofafilesystem. However,caremustbetakentoprevent the occurrence of similar drawbacks.

Another implementation called Tagster[16] , is a peer-to-peer tagging application for organisingdesktopdata.ItisplatformindependentandisimplementedinJAVA.Multiple files and also directories can be tagged through its interface. The selected directories are recursively examined and all files contained within them are tagged. The GUI for a Linux system consists of three main areas.

1. Tag view

It displays a list of tags.

2. Resource view

It lists resources that have the currently selected tags assigned.

3. User view

It displays a list of users that have tagged the currently selected resource with some selectedtag. ItalsoincludesGUIsupportforWindowswithsomeunresolvedissues. However, it lacks auto classification of data due to which several common tags may be generated for each user increasing the database size.

Papers referred for the developement of KWEST.

1. K. Chang, W.T. Perdana, B. Ramadhana, K. Sethuraman, T.V. Le, and N. Chachra, Knowledge File System-A principled approach to personal information management, 2010 IEEE International Conference on Data Mining Workshops, Sydney, December 2010, pp. 1037-1044[1] .

Abstract: The Knowledge File System (KFS) is a smart virtual file system that sits between the operating system and the file system. Its primary functionality is to automatically organise files in a transparent and seamless manner so as to facilitate easyretrieval.ThinkoftheKFSasapersonalassistant,whocanfileeveryoneofyou documents into multiple appropriate folders, so that when it comes time for you to retrieveafile,youcaneasilyfinditamonganyofthefoldersthatarelikelytocontain it. Technically, KFS analyses each file and hard links (which are simply pointers to a physical file on POSIX file systems) it to multiple destination directories (categories). The actual classification can be based on a combination of file content analysis,fileusageanalysis,andmanuallyconfiguredrules.SincetheKFSorganises files using the familiar file/folder metaphor, it enjoys 3 key advantages against desktopsearchbasedsolutionssuchasGoogle’sdesktopSearch,namely1)usability, 2) portability, and 3) compatibility. The KFS has been prototyped using the FUSE (File system in USErspace) framework on Linux. Apache Lucene was used to provide traditional desktop search capability in the KFS. A machine learning text classifier was used as the KFS content classifier, complimenting the customisable rule-based KFS classification framework. Lastly, an embedded database is used to log all file access to support file-usage classification.

Usefulness: This paper describes approach to personal information management through Knowledge File System. It is designed to help users organise information usingVirtualFileSystemtoreducetheproblemofmanualinformationclassification andretrieval. KFSprovidesfunctionssoastoautomaticallyclassifytheinformation based on the content similarity with respect to predefined ontologies or also give the option for manual classification of the information. The operations carried out on the KFS can also be monitored with the event logger feature. Searching of files can be carried out by keyword with the help of a text indexer. Furthermore the comparisons between Google desktop file system, beagle and KFS are given. Finally the details of the implementation of the KFS on the Linux platform using of FUSE are given.

2. O. Eck, and D. Schaefer, A semantic file system for integrated product data management, Advanced Engineering Informatics, 2011, pp. 177-184[4] .

Abstract:Weinitiallydiscussanumberofdisadvantagesofcurrentfilemanagement systems. In the body of the paper our main contribution is presented. That is, a formal mathematical model of a new semantic file system, SIL (Semantics Instead of Location), that allows engineers to access data based on semantic information rather than storage location is proposed. A major difference between our approach and previous related work is that we do not aim at yet another point solution and, instead, propose an approach that may be employed by next generation engineering

dataprocessingsystemsonalargerscale.Inaddition,acorrespondingprogramming interface along with a graphical user interface used as a file browser is presented and the benefits of utilising the proposed semantic file system for product data management in the field of integrated design of mechatronic systems are discussed.

Usefulness: This paper describes a formal mathematical model that allows engineers to access data based on semantics rather than actual storage location. The main goal of this file system is to search files based on content of data. The browsing of the system is done based on file’s metadata and attributes. Logical operator such as AND, OR, NOT are used to filter the results. The classification allows multiple tags to be created for files. Furthermore API’s are written to create views and for the automation of notification updates.

3. P. Mohan, S. Venkateswaran, Raghuraman, and A. Siromoney, Semantic File Retrieval in File Systems using Virtual Directories, Proc. Linux Expo Conference, Raleigh, NC, May 2007, pp. 141-151[5] .

Abstract: Hard Disk capacity is no longer a problem. However, increasing disk capacityhasbroughtwithitanewproblem,theproblemoflocatingfiles. Retrieving a document from a myriad of files and directories is no easy task. Industry solutions are being created to address this short coming. We propose to create an extendable UNIX based File System which will integrate searching as a basic function of the file system. The File System will provide Virtual Directories which list the results of a query. The contents of the Virtual Directory are formed at run time. Although, the Virtual Directory is used mainly to facilitate the searching of file, it can also be used by plugins to interpret other queries.

Usefulness: This paper describes the design of SemFS, which provides semantics based on the file’s meta-data and attributes. It allows the usage of logical operators to filter query results. It is implemented as a user space file system upon journaling storage. The architecture is server-clientwithsupportforAPI’s to extend functionality. Features such as file tagging and versioning are also implemented.

4. R. Agarwal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, 1993 ACM SIGMOD Conference, Washington DC, USA, May 1993, pp. 207-216[6] .

Abstract : We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.

Usefulness : For finding frequently occurring file and tags, creating association rules based on them give and providing suggestions to user.

Many tagging systems exist that allow efficient manual classification of information. Most implementations tend to be theoretical demonstrations or compleximplementations[9]existingforsomeveryspecificpurpose. Thesesuggest the possibility of using semantics[3] in operating systems in some future date. But a major problem is scalability with regard to related information. However, on a large multi-user file system, one can get tons of tags to shift through in each folder, increasing the load for users to search and maintain data. Our idea is to introduce a new concept of relating tags to overcome this situation. Implementing all the desired and necessary features from previous implementations,our design goal is to create an efficient Semantic File System which could be used by any class of users.

Chapter 3

SOFTWARE REQUIREMENT SPECIFICATION

3.1 Introduction

3.1.1 Purpose

The goal of this project is to develop a semantic file system that extracts metadata from files and allows storage and searching based on its context. The purpose of this document is to present a detailed description of KWEST. It will explain the purpose and features of the system, the interfaces of the system, what the system will do, the constraint under which it must operate and how the system will react to the users actions.

3.1.2 Intended Audience and Reading Suggestion

The intended audience of this document includes both developers and reviewers of the system.

3.1.3 Project Scope

This project is a virtual file system capable of storing semantics with which it facilitates the finding of relevant information.

1. Information is stored in tags, which are extracted from a file’s metadata. This information may be generated implicitly by the system or supplied explicitly by the user.

2. The validity of information is based on the users level of organising things.

3. The system is currently designed to extract metadata from a limited set of popular file types for audio(.mp3, .ogg, .spx, .mpc, .ape, .flac, .wv, .tta, .wma, .m4a, .wav, .aif[f]), video(.flv, .real, .riff(.avi), .mpeg, .qt, .asf), images(.jpeg, .gif, .png, .tiff) and PDF documents.

4. The modular architecture allows for plugins to be added which can add additional functionality, and recognition for more file types. This allows the project to be extended and modified according to the functionality required.

5. The level of awareness generated by the system is based on the frequency of access and input provided by the user.

6. The current implementation is based on the Linux kernel.

7. As the system is an virtual entity, it does not need extensive modifications to be ported to other file systems and operating systems.

User Classes and Characteristics

The system can be used by three types of users:

1. General User

Uses the system without any complex modifications in the system.

2. Advanced User

Understands the system and creates rules and automations based on personal needs.

3. Developer

UsestheAPIprovidedanddevelopsmodulesthatextendthesystem.Onlyadvanced users utilise the semantic nature of underlying file system to the fullest. This does not create any blocks for the general user, who can also use KWEST satisfactorily. Developers are a differentgroupofusers who can extendKWEST throughmodules. Thesemodulescanmodifyordefineadditionalbehaviourforthesystemforspecific file types.

Operating Environment

1. KWEST requires FUSE [17] minimum version 2.8.6

2. KWEST can run on any Linux installation which contains required versions of FUSE.

3. Furthermore, since kernel version 2.6.14, FUSE has been merged into the mainstreamLinuxkerneltree.AsaresultKWESTcanrunonanyLinuxdistribution having a minimum Kernel version 2.6.14.

4. KWEST is a virtual file system mounted to a folder or a loop device.

3.1.4 Design and Implementation Constraints

Implementation Constraints

1. KWEST uses FUSE to manage userspace file systems. Hence, access is limited to the executing userspace for the program.

2. Sincetheentireapplicationisexecutedinuserspaceonly,therecannotbeinteraction swith the kernel directly.

3. AlthoughKWESTimplementsavirtualfilesystemwhichisaccessibletoallentities, wehaveimplementedthesystemwithcommandlineastheprimaryinterface. Other file managers such as Nautilus can only browse but not tag files. This limitation can be addressed with plugins or additional modules built for the specific file manager.

4. It is the responsibility of the user that mounting and unmounting of the system be done with standard rules and precaution.

5. The SQLite database is an integral part of KWEST and is contained in a single file. It is vital for the system that integrity of the database is maintained.

Design Constraints

1. Currently, the auto-tagging feature has been limited to common and popular file types such as audio(.mp3, .ogg, .spx, .mpc, .ape, .flac, .wv, .tta, .wma, .m4a, .wav, .aif[f]), images(.jpeg, .gif, .png, .tiff), video(.flv, .real, .riff(.avi), .mpeg, .qt, .asf) and .pdf documents. This functionality can be extended with modules or through special tools made specifically for this purpose.

2. The amount of information visible through common file system commands (e.g. ls - list contents) is a limitation for KWEST. We cannot show tagging information through these interfaces. Alternate methods for this can be implemented keeping the end user in mind.

3.1.5 Assumptions and Dependencies

Assumptions

1. Users of this software should be aware of how semantics are used to categorise information.

2. Users should recognise or identify appropriate tags in relation to files.

3. It is assumed that the user is well versed in organisation information and uses KWEST as a tool rather than an assistant.

4. The user should have the required privileges and rights to run KWEST and all its operations.

Dependencies

1. KWEST uses several external libraries to extract metadata from supported popular formats like TagLib for audio, Libextractor for images, video and Poppler for PDF documents.Theselibrariesarerequiredatcompilationtime.Theyenablethesystem to handle metadata extraction.

3.2 System Features

3.2.1 Tags

1. Manual Tagging

Manual tagging is the basis of semantics in KWEST. The user can assign any tag to the files in KWEST. These tags are then stored internally in a database. The user can create new tags or use tags already defined by the system. Total freedom is given to the user to organise data. Multiple tags can be assigned to the to a single thus allowing its access through multiple locations without duplication of data.

2. Automatic Tagging

KWEST also features automatic tagging of files. The user can define certain rules under which files will be assigned tags. The system will implement those rules for all files satisfying the defined constraints. This would prevent repetitive tagging operations for the user.

3. Importing tags

Certain popularfile formats such as mp3,jpeg etc have metadata embedded in them. KWEST supports such popular format and uses this metadata to automatically assign tags to the files. This feature enables the user to collectively classify and store the data under these tags.

3.2.2 Database

1. Consistency

KWEST uses an internal database to store and manage data. It is vital that the database always remains consistent. KWEST uses logging mechanisms to ensure that operations on the database always reach an endpoint.

2. Access

The database files used by KWEST are not locked down or access restricted. The KWEST API provides facilities that can be used to access the database. However, this feature is made available with the understanding that the integrity of the database will be maintained always.

3.2.3 Relation with Existing Data

1. Importing semantics

Users alreadyhave certain organisationalstructures in the waytheystore data in file systems. KWEST imports these semantics by converting the storage hierarchy to tag-based hierarchy. This allows the entire file system to be imported into KWEST along with the users previous organisation structure.

2. Files are executable ready

The files can be directly executed through the virtual file system without making any modification to the files like audio, video files can be played through the virtual system, images can viewed and documents can be opened and read.

3.2.4 Exporting Semantics

1. Export file system

As the entire file system exists as a virtual entity, KWEST provides the export feature. Where the file system can be exported to another system where the data can be imported by another instance of KWEST.

2. Export tagged files

Itisalsopossiblefortheusertoexportdataundercertaintagstoanexternallocation. The semantic organisation showed by tags is converted to actual directories and files are then copied to these directories. This way the user can export KWEST semantics and data to outside locations.

3.2.5 Modularity

1. Modules As Plugins

KWEST is an extendable system. It can use external modules to increase functionality or to modify existing operations. Support for using modules is built into KWEST right from the design stage. Additional extraction libraries can be included using the plugin module.

2. Support for developers

KWEST provides support to developers by providing access to all internal features and database. The API layer allows developers to easily supplement internal operations with their modules.

3.3 External Interface Requirements

3.3.1 User Interfaces

The user can use the system through the command line. The system mounts a virtual file system which the user can use to navigate through. If the file explorer/browser supports virtual file system, the user can use that to navigate through the files.

3.3.2 Hardware Interfaces

No hardware interfaces used as the file system exists as a virtual entity.

3.3.3 Software Interfaces

• FUSE[17]

KWEST uses FUSE to run the file system code in user space without editing the kernel code.The FUSE module provides only a “bridge” to the actual kernel interfaces. Major file system operations are defined by FUSE and forwarded to KWEST for implementation.

• SQLite[18]

KWEST uses SQLite database to store data. In contrast to other database management systems, SQLite is not a separate process but an integral part of KWEST. Database is accessed and modified for most of the operations performed by KWEST.

3.3.4 Communication Interfaces

KWEST can be accessed like any other file system via Command line or file managers like Nautilus, Thunar etc.

3.4 Nonfunctional Requirements

3.4.1 Performance Requirements

• Response Time

The response time for any action on the file system or the database should be reasonable under normal operational circumstances.

• Capacity

KWEST can be used by any user with sufficient permissions to initialise the filesystem. Subsequent operations like read,write,modify etc. are restricted by user permissions similar to normal file systems.

• Scalability

KWEST provides suggestions and includes automated tagging rules for Audio, Video and Images. It also allows manual tagging of files by the user. Various modules can be further added for recognising and categorising other file types.

3.4.2 Safety Requirements

• Power Failure

The system maintains a log file for database. It prevents data corruption by committing data that has been fully written to the log. This should prevent most, if not all, data corruption.

• Data Loss

If any file is accessed which is mentioned in database but deleted from the system then it is removed from the database and appropriate message is displayed to user.

• Access Rights

The system checks if the file tagged is available to user for access. It does not allow files to be included which cannot be access by the user.

3.4.3 Security Requirements

KWEST can be used only by a single user with sufficient rights to execute the software. No other user will have permissions to modify tags and files in the system.

3.4.4 Software Quality Attributes

• Availability

The System shall be available from mounting the file system until its unmounted.

• Updatability

The system shall allow for addition or deletion of files under tags.

• Reliability

– The system shall save new tags created by active user.

– The system shall save the file path of an active user to database whenever new files are added to a tag.

– The system shall maintain a log file which records every operation on database. – The system shall modify database whenever tags or files are deleted.

– The system shall remove file entries from database whenever it is unable to access them.

• Portability

The system is implemented on Linux. It is compatible with various other Linux distributions like Ubuntu, Fedora, Red Hat, etc.

• Testability

New modules designed to be added to the system, to identify file types other than audio, video and images must be tested to check if they are compatible with input-output format of the system.

• Usability

The system does not have a large learning curve as the user deals with commands common to all file systems. Documentation provided for the system will include user manuals, developer reference and common FAQ.

3.5 Other Requirements

3.5.1 Legal Requirements

All the libraries, programs used in this project are open sourced under GPL. SQLite is a database which is free to use, distribute or modify. The FUSE kernel module is merged with the Linux Kernel, which is open sourced and freely available under GPL. There are no proprietary or closed source products, libraries or interfaces used in this program. The project KWEST and its subsequent implementations will be open sourced under the GPL upon completion.

3.6 Analysis Model

3.6.1 Data Flow diagram

Level 0

Figure 3.1: Data flow diagram - Level 0

Level 1

Figure 3.2: Data flow diagram - Level 1

Level 2

Figure 3.3: Data flow diagram - Level 2

3.6.2 Entity Relationship diagram

Figure 3.4: Entity Relationship diagram

3.7 System Implementation Plan

Phase 1: September 2012 - November 2012

Task

Start Date

End Date

Problem identification

01/09/12

07/09/12

Information gathering

08/09/12

14/09/12

Creating problem definition

15/09/12

21/09/12

Understanding underlying technology

22/09/12

05/10/12

Analysing problem

06/10/12

12/10/12

Designing solution

13/10/12

26/10/12

Refining design

27/10/12

02/11/12

Creating design report

03/11/12

09/11/12

Table 3.1: Phase 1 Implementation Plan

Phase 2: December 2012 - March 2013

Task

Start Date

End Date

Building a stub implementation

01/12/12

14/12/12

Implement file system operations

15/12/12

28/12/12

Extract Metadata from files

28/12/12

11/01/13

Implement modular plugins

12/01/13

25/01/13

Apply Apriori algorithm to database

26/01/13

08/02/13

Rigorous testing

09/02/13

22/02/13

Debugging and refining

23/02/13

08/03/13

System testing

09/03/13

15/03/13

Creating implementation report

16/03/13

31/03/13

Table 3.2: Phase 2 Implementation Plan

Chapter 4

SYSTEM DESIGN

4.1 System Architecture

Figure 4.1: System Architecture

4.2 UML Diagrams

4.2.1 Use-case diagram

Figure 4.2: Use Case diagram

4.2.2 Component diagram

Figure 4.3: Component diagram

4.2.3 Deployment diagram

Figure 4.4: Deployment diagram

Chapter 5

TECHNICAL SPECIFICATIONS

5.1 Technology Details Used in the Project

5.1.1 Development

FUSE

For developing and managing the virtual file system we have used FUSE. Minimum required: 2.8

Version used: 2.9.0

SQLite

We have used SQLite as the data repository for this project. Minimum required: 3.6

Version Used: 3.7.13

Language for Implementation

We use ANSI C for implementing our project modules. Specifically, we follow the GNU C99 standards while compiling our code. GNU C99 is an extension of the C99 providing some extra features. Note: Some features are incompatible with other standards.

GNU C99

Compiler

For compiling our project modules we use the GNU Compiler Collection (GCC). Minimum required: 4.6

Version used: 4.7.2 on Linux Mint 14

5.1.2 Operating Environment

Platform

The project is based on operating systems utilising the linux kernel. Any implementation which provides a POSIX compatible environment is sufficient.

Minimum required: 2.6.14

Version used: 3.5.0-generic on Linux Mint 14

Operating System

We have used various operating environments while creating and testing the project. Linux distributions were Ubuntu 12.04, Ubuntu 12.10, Fedora 18, Linux Mint 13, Linux Mint 14.

5.1.3 External Libraries

TagLib

Used for extracting Audio metadata from files. Minimum required: TagLib 1.7.1

Version used: TagLib 1.8

LibExtractor

GNU Libextractor is a library used to extract meta data from files. It supports extraction from Image, PDF and Video file types.

Minimum required: 1.0.0 Version used: 1.0.1

Poppler

Poppler is a PDF rendering library based on the xpdf-3.0 code base. Minimum required: 0.21

Version used: 0.22

5.1.4 Debugging

GDB: GNU Project Debugger

Used to debug program crashes and memory errors. Version used: 7.5

Valgrind

Used as a profiling tool, for memory related issues and program crashes. Version used: 3.7.0

5.2 Reference to Technology

5.2.1 Development

FUSE

FUSE is a loadable kernel module for Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code. This is achieved by running file system code in user space while the FUSE module provides only a bridge to the actual kernel interfaces.

http://fuse.sourceforge.net/

SQLite

SQLite is a relational database management system contained in a small C programming library. It implements a self-contained, zero-configuration, transactional SQL database engine which can be embedded in applications.

http://www.sqlite.org/

Language for Implementation

We use GNU99 C standard for implementing our project modules. http://gcc.gnu.org/c99status.html

Compiler

GCC is a compiler system which provides front ends for various languages including C. It provides optimisation’s, debugging and other features to help program development. http://gcc.gnu.org/

5.2.2 Operating Environment

Platform

The project is based on operating systems utilising the linux kernel. Any implementation which provides a POSIX compatible environment is sufficient.

https://www.kernel.org/

Operating System

Linux distributions were : http://www.ubuntu.com/ http://linuxmint.com/ http://fedoraproject.org/

5.2.3 External Libraries

TagLib

TagLib is a library for reading and editing the meta-data of several popular audio formats. Currently it supports both ID3v1 and ID3v2 for MP3 files, Ogg Vorbis comments and ID3 tags and Vorbis comments in FLAC,MPC,Speex,WavPack TrueAudio,WAV,AIFF, MP4 and ASF files.

http://taglib.github.com/

LibExtractor

GNU Libextractor is a library used to extract meta data from files. The goal is to provide developers of file-sharing networks, browsers or WWW-indexing bots with a universal library to obtain simple keywords and meta data to match against queries and to show to users instead of only relying on file names.

http://www.gnu.org/software/libextractor/

Poppler

Poppler is a PDF rendering library based on the xpdf-3.0 code base. http://poppler.freedesktop.org/

5.2.4 Debugging

GDB: GNU Project Debugger

GDB, the GNU Project debugger, allows you to see what is going on ‘inside’ another programwhileitexecutes–orwhatanotherprogramwasdoingatthemomentitcrashed. http://www.gnu.org/software/gdb/

Valgrind

Valgrind is a GPL licensed programming tool for memory debugging, memory leak detection,andprofiling. Valgrindwas originallydesignedto be afree memorydebugging tool for Linux on x86, but has since evolved to become a generic framework for creating dynamic analysis tools such as checkers and profilers.

http://valgrind.org/

Chapter 6

PROJECT ESTIMATION, SCHEDULE AND TEAM STRUCTURE

6.1 Project Estimate and Schedule

We use the Work Breakdown Structure (WBS) for project estimation. The WBS is organised around the primary products of the project (or planned outcomes) instead of the work needed to produce the products (planned actions). The Work Breakdown Structure presented here represents all the work required to complete this project.

1

Project Management

1 Week

1.1 1.2 1.3 1.4

Project Initiation Project Planning

Project Execution and Control Project Closeout

1 Day 2 Days 3 Days 1 Day

2

Definition

2 Weeks

2.1 2.2 2.3 2.4 2.5

Start-up and Orientation Determine Project Requirements Create future process model Reconcile Project Requirements Functional Specification

2 Days 3 Days 2 Days 2 Days 5 Days

3

System Design

3 Weeks

3.1 3.2 3.3 3.4 3.5 3.6

Technical Architecture System Standards Physical Environment Technical Specification Creating Prototypes Test Plans

5 Days 2 Days 2 Days 3 Days 6 Days 3 Days

4

System Development

5 Weeks

4.1 4.2 4.3 4.4 4.5 4.6

Develop and Test Software Modules User Training Materials

Technical Documentation

Unit, Integration and System Test Results Performance testing

Acceptance testing

10 Days 2 Days 5 Days 10 Days 4 Days 4 Days

5

System Implementation

3 Weeks

5.1 5.2 5.3 5.4

Acceptance Environment

Data Initialisation and Conversion Test Results Acceptance test results

Supporting Material

5 Days 8 Days 4 Days 4 Days

6

Transition to Ready State

1 Week

6.1 6.2 6.3

Train Users Convert Data Deploy System

2 Days 2 Days 3 Days

Table 6.1: Work Breakdown Structure

6.2 Team Structure

Member Member 1 Member 2 Member 3 Member 4

Member 1: Aseem Gogte Prior Experience:

• Well versed in C, C++

• Visual Basic

• Java

• Databases - Oracle

New Technology Learnt:

• Version Control - GIT

• FUSE

Name Aseem Gogte Sahil Gupta

Harshvardhan Pandit Rohit Sharma

• Profiling and Debugging Tools - Valgrind, GDB

Member 2: Sahil Gupta Prior Experience:

• Well versed in C, C++

• Visual Basic

• Java

• Databases - Oracle

New Technology Learnt:

• Version Control - GIT

• FUSE

• Databases - SQLite

Member 3: Harshvardhan Pandit Prior Experience:

• Well versed in C, C++

• Visual Basic, .NET

• Java

• System programming

New Technology Learnt:

• Version Control - GIT

• FUSE

• Profiling and Debugging Tools - Valgrind, GDB

Member 4: Rohit Sharma Prior Experience:

• Well versed in C, C++

• Visual Basic

• Java

• Databases - Oracle

New Technology Learnt:

• Version Control - GIT

• FUSE

• Databases - SQLite

Chapter 7

SOFTWARE IMPLEMENTATION

7.1 Introduction

The implementation ofthe software was done in a modularmannerusing the incremental approach of SDLC. ANSI C was the language used to implement the project. An SQLite database was used for creating and managing the data repository. Various external libraries were used for the extraction of the metadata. The following modules constituted our project: •

• Module 1: Creation of a virtual file system using FUSE

• Module 2: Interfacing a Data Repository using SQLite

• Module 3: Adding automated Extraction of Metadata

• Module 4: Importing Semantics in to the file system

• Module 5: Exporting Semantics from the file system

• Module 6: Association Rule learning using Apriori Algorithm

7.2 Databases

The project was to implement a semantic file system. This required a data repository to store all the information such as file name, physical location, attributes, etc. Also, the metadata extracted from the files was also stored in the database. This way the database formed a central information location for the file system. Therefore, for the file system implementation to be stable andefficient,the system requiredan in-place robustdatabase whichoffers integrity andspeed. From the various options available,the SQLite database was selected and used.

SQLite is a relational database management system contained in a small C programming library. Unlike client-server database management systems, the SQLite engine has no standalone processes with which the application program communicates. Instead,theSQLitelibraryislinkedinandthusbecomesanintegralpartoftheapplication program. SQLitestorestheentiredatabase(definitions,tables,indices,andthedataitself) as a single cross- platform file on a host machine. Features include :

1. Zero Configuration

2. Serverless

3. Single Database File

4. Stable Cross-platform database

5. Manifest typing

7.3 Important modules and algorithms

7.3.1 Creation of a Virtual File System using FUSE

The first module of implementation was to create a basic file system using FUSE. Using FUSE we created a virtual file system capable of doing all the operations that a normal file system does. The module consists of the following phases: •

• Phase 1: Implement FUSE to create a basic file system structure using ANSI C as the implementation language.

• Phase 2: Connect the file system created to the data repository created using a SQLite database. Create tags and files view by querying the database through FUSE.

• Phase 3: Implementing common file operations with respect to tags and files such as read, write, open, copy, move etc.

• Phase 4: Extraction of metadata from the files using external libraries and organisation of the file system based on metadata.

• Phase 5: Displaying suggestion based on associations derived using Apriori Algorithm.

7.3.2 Interfacing a Data Repository using SQLite

The database is an important module of the file system. All the data required to browse and navigate the file system is stored in the database. FUSE interacts with the data in the database by querying for particular data based on path accessed. It is vital for the proper functioning of the system that the database always remains consistent. Logging mechanisms ensure that operations on the database always reach an endpoint. This module is used to check, correct and maintain integrity of the database by checking for redundant entries. Also, if there are new files which have not been added to KWEST, this module can help the user add them. We implement this module in the following ways: •

• Phase 1: Create database tables for a file system.

• Phase 2: The relation tables between tags, files are stored.

• Phase 3: Store the extracted metadata in the database.

• Phase 4: The association rules for the data are derived using the Apriori algorithm.

7.3.3 Adding Automated Extraction of Metadata

Metadata (meta content) is defined as data providing information about one or more aspects of the data. Metadata can be stored either internally, in the same file as the data, or externally, in a separate file. Metadata that is embedded with content is called embedded metadata. The metadata of the file is extracted by using external libraries. The data repository stores the extracted metadata in a predetermined format. •

• Phase 1: Test external libraries to determine which of them can be used.

• Phase 2: Extract metadata using external libraries.

• Phase 3: Store the extracted metadata in the database.

• Phase 4: Form relations between metadata and files using association rules.

7.3.4 Importing Semantics in to the File System

Users already have certain organisational structures in the way they store data in file systems. This module imports semantics by converting the storage hierarchy to tag-based hierarchy. This means the directory structure present in the file system will be used to form tags and the files listed under the directory are tagged under that tag. •

• Phase 1: Parse the folder structure on local hard disk.

• Phase 2: Add entry for each file and folder to the database. • Phase 3: Remove or ignore hidden and system files.

• Phase 4: Prune the database entries on every start up.

7.3.5 Exporting Semantics from the File System

This module can export the storage hierarchy to some external location. The semantic organisation of tags is converted to actual directories and the files are then copied to these directories. This is similar to copying contents from one file system to another. •

• Phase 1: Copy virtual locations to external location. • Phase 2: Perform physical copy of files.

• Phase 3: Create folders and sub folders based on tags. • Phase 4: Copy suggestions using data repository.

7.3.6 Association Rule Learning using Apriori Algorithm

Association rules help in organising the file system data by providing suggestions while tagging files. These suggestions can be helpful when the user has either forgotten to tag the file, or is yet about to do it. This association rule learning approach uses the Apriori algorithm. •

• Phase 1: Run Apriori over the KWEST database. • Phase 2: Perform optimisation’s and prune steps. • Phase 3: Store association rules in database.

• Phase 4: Integrate with KWEST to show suggestions.

7.4 Buisness logic and architecture

7.4.1 Buisness logic

In computing, a file system (or filesystem) is a type of data store which can be used to store,retrieve andupdate a setoffiles. The term couldreferto the abstractdata structures used to define files, or to the actual software or firmware components that implement the abstract ideas.

Traditionally, file systems were always developed with performance parameters in mind. However, with the data-rich systems of today, the responsibilities of a file system have increased. The onus of organisation and retrieval of data is much more on the file system than the user. The file system structure defines the capability and capacity of searches performed on it. In such a scenario, a file system that facilitates retrieval by providing features that help automate organisation is a lucrative option.

7.4.2 Expenses and Legal ramnifications

The software has been open sourced under the Apache License. As such, there are no cost requirments that can be incurred by the adaption of KWEST. All the utilised tools, libraries, platforms and algorithms are free to use. This encourages technology adaption for other students, developers and organisations.

7.4.3 Novelty of idea

There have been no previous implementations of adapting data mining techniques to a file system. This suggests the novelty of the idea of KWEST and the possibilities for future work.

7.4.4 Buisness Architecture

Chapter 8

SOFTWARE TESTING

8.1 Introduction

8.1.1 Purpose

Software testing can be stated as the process of validating and verifying that a computer program/application/product:

• Meets the requirements that guided its design and development. • Works as expected.

• Can be implemented with the same characteristics. • Satisfies the needs of stakeholders.

Software testing, depending on the testing method employed, can be implemented at any time in the development process. Traditionally most of the test effort occurs after the requirements have been defined and the coding process has been completed, but in the Agile approaches most of the test effort is on-going. As such, the methodology of the test is governed by the chosen software development methodology.

8.1.2 Scope

The testing of the system was done manually and no testing tools were used. The test plan describes the unit, functional, performance, usability, regression tests that were performed. Only codes that were pushed as commits were considered as candidates for testing.

8.1.3 Intended Audience

The testing of this system is intended for 3 types of audiences:

1. End Users: The users who will be using the system will review the testing as a mark of stability and performance of the system.

2. Developers: Will view the testing for knowing existing limitations and bugs. 3. Reviewers: Will use these test results as a metric to evaluate the project.

8.2 Test Plan

8.2.1 Target Items

The following have been identified as targets for testing: 1. Code and associated areas

2. File system operations 3. Databases: SQLite3

4. Operating Systems

8.2.2 Outline of Tests

Tests performed

1. Performance tests

2. Functional tests

3. Data Integrity tests

4. Regression tests

5. Usability tests

8.2.3 Test Approach

Any bugs found should be reported with related information, which should include who discovered it, how, a description of the bug, and who fixed it and when. Also, re-testing of the code done to make sure that defect has been fixed and there no new bugs produced due to change in code.

1. Performance Testing:

The focus of Performance testing is checking a software program’s

• Speed : Determines whether the system responds quickly.

• Scalability : Determines maximum user load the software application can handle.

• Stability : Determines if the application is stable under varying loads.

Tools required: Software timers Success criteria:

a) Manual(user) perception does not notice any “lags”.

b) Time to perform operations is within an acceptable range.

2. Functional Testing:

The prime objective of Functional testing is checking the functionalities of the software system. It mainly concentrates on -

• Mainline functions : Testing the main functions of an application.

• Basic Usability : It involves basic usability testing of the system. It checks whetheran usercan freelynavigate throughthe screens withoutanydifficulties.

• Accessibility : Checks the accessibility of the system for the user.

• Error Conditions : Usage of testing techniques to check for error conditions. It checks whether suitable error messages are displayed.

Tools required: None(manual testing)

Success criteria: All of the following are successfully tested:

a) all key use-case scenarios. b) all key features.

3. Data Integrity Testing:

Data integrity refers to the quality of the data in databases and is the measurement by which users examine data quality, reliability and usefulness. Data integrity testing verifies that converted data is accurate andfunctions correctly within a given application. Testing data integrity involves:

• Database : Verifying that correct values are saved in databases. • Write-back : Correct data is written to disk.

• Read : Correct data is read from disk.

• File Integrity : Operations do not break existing files.

Tools required: File compare tools (manual testing)

Success criteria: All of the following are successfully tested:

a) files are same in size, byte-blocks, permissions and parameters. b) data is not changed, modified or removed unless intended.

4. Regression Testing:

Regression Testing is required when there is a :

• Change in requirements and code is modified according to the requirement • New feature is added to the software

• Defect fixing

• Performance issue fix

Tools required: None(manual testing)

Success criteria: All of the following are successfully tested:

a) all previous operations are successfully executed. b) previously solved bugs are not re-introduced.

c) operations do not suffer from unwanted performance hits.

5. Usability Testing:

Goal of this testing is to satisfy users and it mainly concentrates on the following parameters of a system:

Effectiveness of the system

• Is the system is easy to learn?

• Is the system useful and adds value to the target audience?

• Is Content, Colour, Icons, Images used are aesthetically pleasing ? Efficiency

• Navigation required to reach desired screen/web page should be very less. Scroll bars shouldn’t be used frequently.

• Uniformity in the format of screen/pages in your application/website. • Provision to search within your software application or website

Accuracy

• No outdated or incorrect data like contact information/address should be present.

• No broken links should be present. • User Friendliness

• Controls used should be self-explanatory and must not require training to operate

• Help should be provided for the users to understand the application / website • Alignment with above goals helps in effective usability testing

Tools required: None(manual testing)

Success criteria: All of the following are successfully tested:

a) operations are not changed drastically from a traditional file system. b) user can use the file system without any special tools.

8.2.4 Entry and Exit Criteria:

1. Test Plan

A Test Plan Entry Criteria: Code is complete and has been pushed to the Git repository.

B Test Plan Exit Criteria: All functional requirements have been verified.

C Suspension and Resumption Criteria: Testing will be suspended on critical design flaws that will changes in redesign of critical components. Testing will resume when the coding is complete and code is reviewed successfully.

2. Test Cycle

A Test Cycle Entry Criteria: When a module has been completed.

B Test Cycle Exit Criteria: All tests specified at the start of the testing have completed successfully.

8.2.5 Risks, Dependencies, Assumptions, Constraints

Risk

Mitigation Strategy

Contingency

FUSE API changes

Use FUSE version numbers to run a static check while compiling for required version of FUSE.

Change operation code to new version.

External Library is no longer maintained

Try to use the latest version number of library available and keep a source ready for distribution.

Change to alternate library.

Performance has degraded

Code with performance in mind, using fast algorithms and approaches.

Use profiling tools to detect memory issues and static code analysers for code checking.

Table 8.1: Risk Management

8.2.6 Problem Reporting, Escalation, and Issue Resolution

Eachbugwillbegivenapriority,whichwilldeterminewhenitisaddressedinthecurrent iteration. The bug priority may change due to other bugs, issues or re-evaluation of the bug by a peer review.

8.3 Test Cases

8.3.1 Introduction

The purpose of this Test Case document is to specify and communicate the specific conditions which need to be validated to enable an assessment of the system. Test Cases aremotivatedbymanythingsbutwillusuallyincludeasubsetofUseCases,performance

characteristics and the risks the project is concerned with. A separate test case document is prepared for each testing phase (unit, integration, integrity, etc.) identified in the test plan. The test cases should be organised into related groups that are meaningful to the project – i.e. test suites.

8.3.2 File System Operations

Testing the file system for implementations of required operations:

Operation

Status

Comment

getattr readdir access truncate destroy open release mknod rename unlink read write chmod chown mkdir rmdir symlink readlink link utimens statfs fsync setxattr getxattr listxattr

removexattr

YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES INVALID INVALID INVALID NO

NO INVALID NO

NO NO NO

checks whether the given path exits lists the contents of the given tag checks for access to specified tag closes file after operation

called on file system unmount opens for file for access releases file after access creates new file

renames files and folders removes file from system reads data from file writes data to file changes permissions changes owner

creates new directory removes directory

not required in KWEST not required in KWEST not required in KWEST not implemented

not implemented

not required in KWEST not implemented

not implemented not implemented not implemented

Table 8.2: File system operations

8.3.3 Performance of file system operations

Comparison of KWEST file system against underlying file system. Test Bench:

• Operating System: Linux Mint 14 3.5.0-25-generic

• Original file system: ext4 500GB disk with partition size 150GB

• RAM: 4GB

• swap: 8GB on disk

• CPU utilisation: average 4

Contents of Music folder imported into KWEST:

• Audio: 17 files totalling 102.9MB

• Images: 81 files totalling 196MB

• PDF: 11 files totalling 17.9MB

• Video: 4 files totalling 1GB

• Others: 7 files totalling 7MB

• Total: 120 files of size 1.3GB

Test

Time taken

Comment

all files

120sec

total file size imported was 1.3GB

videos

40sec

extracting metadata from videos is more expensive compared to other file types

images

35sec

images having metadata take longer than those without

audio

4sec

audio files are the fastest to parse and load

PDF

2sec

PDF files are parsed quickly as compared to other document types

forming associations

2sec

time is proportional to number of common files in user tags

Table 8.3: Performance tests for mounting KWEST

Test scripts

The following is a simple test script used to evaluate file system operations. The script works by calculating the difference in time before and after the execution of operations.

#store current time let DA=(‘date +%s ‘)

#perform file system operation ls -R kwest/src/mnt

#store new time

let DB=(‘date +%s‘) #calculate the difference let DC=$DB-$DA

#output the time taken echo $DC

Operation

ext4

KWEST

Comment

list directory

500ms

550ms

there is no noticeable difference

read file

700ms

850ms

some extra time is taken to read a file depending on the amount of data being read. In general, there is no noticeable difference.

write file

1200ms

2200ms

(for5MB textfile) writing takes slightlymore time, but the difference is within acceptable range.

read and write

1010ms

2400ms

(for 5MB text file) reading and writing simultaneously does not produce any performance degradation.

Table 8.4: Performance tests for using KWEST

READ operation TEST

Read operation test included reading from a 4.1MB file and outputing the contents on terminal. The file was accessed on the KWEST file system.

GNU nano 2.2.6 File: ./test.sh echo "READ test"

echo "cat ./mnt/Files/Music/t2.txt let kT=0

for i in 1 2 3 4 5 do

echo "Run#" $i #store current time let kS=(‘date +%s‘)

#perform file system operation

cat ./mnt/Files/Music/t2.txt > ./mnt/Files/Music/test.txt #store new time

let kE=(‘date +%s‘) #calculate the difference let kO=$kE-$kS

let kT=kT+kO

#output the time taken

echo "kwest operation time = " $kO "sec" done

let kT=kT/5

echo "average time taken = " $kT

Figure 8.1: Read operation average time for 4.1MB text file

WRITE operation test

Write operation test included reading from a 4.1MB file and writing the contents to another file. Both the files were accessed from the KWEST file system.

GNU nano 2.2.6 File: ./test.sh echo "READ test"

echo "cat ./mnt/Files/Music/t2.txt > ./mnt/Files/Music/ test.txt"

let kT=0

for i in 1 2 3 4 5 do

echo "Run#" $i #store current time let kS=(‘date +%s‘)

#perform file system operation

cat ./mnt/Files/Music/t2.txt > ./mnt/Files/Music/test.txt #store new time

let kE=(‘date +%s‘) #calculate the difference let kO=$kE-$kS

let kT=kT+kO

#output the time taken

echo "kwest operation time = " $kO "sec" done

let kT=kT/5

echo "average time taken = " $kT

Figure 8.2: Write operation average time on 4.1MB text file

8.3.4 Profiling Code

Code can be profiled using Manual methods, or using specific tools such as Valgrind, GDB, Splint etc. For testing KWEST, we have used the following profiling tools:

GDB

GDB can be used to debug the file system and check for memory leaks, errors and irregular operations. The sample output given below shows a clean mount and unmount of the KWEST file system.

$ gdb ./kwest

GNU gdb (GDB) 7.5-ubuntu

Copyright (C) 2012 Free Software Foundation, Inc. Reading symbols from kwest/src/kwest....

(gdb) run -s -d -f mnt

Starting program: kwest/src/kwest -s -d -f mnt [Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/x86_64-linux-gnu/ libthread_db.so.1".

KWEST - A Semantically Tagged Virtual File System ...

...

[Inferior 1 (process 20863) exited normally] (gdb) bt

No stack.

Test

Status

Possible Errors

list directory

PASS

I/O error, illegal operation, transport endpoint not connection, connection abort

read file

PASS

I/O error, illegal operation, access denied, database error

write file

PASS

I/O error,illegal operation,access denied,file system busy

mknod

PASS

Operation not permitted, I/O error, database error

unlink

PASS

Device busy, Operation not permitted, I/O error, database error

mkdir

PASS

Operation not permitted, I/O error, database error

rmdir

PASS

Device busy, Operation not permitted, I/O error, database error

chmod

PASS

Access denied, I/O error, device busy, database error

associations

PASS

memoryerror,segmentationfault,unconditionaljump,I/O error

fuse main

PASS

incompatible version

Table 8.5: GDB debugging for KWEST

Chapter 9

RESULTS

9.1 File System View after Mounting

The KWEST file system needs to be mounted by running a program and specifying a mount point. Once mounted, the file system can be used through any tool or application.

9.1.1 Terminal View

Figure 9.1: KWEST file system in Terminal after mounting

9.1.2 File Manager View

A file manager or file browser is a computer program that provides a user interface to work with file systems. Files are typically displayed in a hierarchy. This is the default and only way traditional file systems can work. For semantic file systems like KWEST, virtual entries act like folders and files. These allow the file system to specify what entries to display without adhering to any strict hierarchy.

File managers used and tested with KWEST include:

• Nautilus - The default file manager on Gnome distributions.

• Nemo - The default file manager for Cinnamon desktop.

• PCManFM - Lightweight file manager found in LXDE.

• XFE - File manager for the X Window system.

• Dolphin - The default file manager for KDE distributions.

• Thunar - Fast and extendable file manager which supports plugins.

Figure 9.2: KWEST file system in PCManFM after mounting

9.1.3 Organisation by File Types

The KWEST file system extracts metadata from files while it is being imported. Depending on the extracted metadata, files are categorised and put into folders corresponding to their types. Currently, KWEST supports the following file types:

1. Audio files

2. Images

3. PDF Documents

4. Videos

The separation of files into individual folders based on their types makes it easy to find a particular file. This is because each file type has its own metadata, which is used to categorise that file. An automated and organised view of files helps avoid clutter and provides efficient searching facilities.

Figure 9.3: Chaotic organisation in normal file system vs KWEST auto-organisation

9.2 Usage and Performance

Figure 9.4: Performance test for KWEST with normal user task loads

Although KWEST is a virtual file system, it can be used for normal daily tasks such as listening to music, reading documents, watching videos, etc. In the screenshot below, the applications each run or open a file on the KWEST file system simultaneously. No perceptible lag or performance degradation was noticed while playing music or watching the video. The file system performed sufficiently well in all four application operations.

9.3 Displaying Suggested Files

Suggestions are files which are not actually present in that tag, but can be tagged as per the system’s recommendation. There are two types of suggestions:

• Probably Related : The files probably belong to the tag, but the system is not sure about it.

• Related : The file definitely belongs to the tag, and the user can tag it.

Figure 9.5: KWEST showing suggestions for including files in the favourites tag

9.4 Exporting Files

Figure 9.6: KWEST Export function - Copy file to external location

The entries, files, tags in KWEST are all virtual entries. They do not exist anywhere on a physical disk. In order to “export” this organisation to another location outside of KWEST, there is the Export feature. Practically there is not difference between a normal copy and export. Export copies the virtual entries as real files by physically accessing the location where they are stored, and then copying them. To the end user, there is no increase in steps or time for the copying to complete.

Chapter 10

DEPLOYMENT AND MAINTAINENCE

10.1 Installation and Uninstallaion

10.1.1 Installing

• Get the latest copy of KWEST by downloading from the project hosting site: https://code.google.com/p/kwest/downloads

• After extracting the contents, open up a terminal and type in the following commands:

make kwest_libs

export LD_LIBRARY_PATH=../lib:D_LIBRARY_PATH make

10.1.2 Mounting and Unmounting

• To mount a KWEST file system, the user needs to run the mount script which will mount the file system in the specified folder.

./kwest mount-point

• TounmountaKWESTfilesystem,theuserneedstorunthefusermount-ucommand with the argument path of KWEST mount point. A successful unmount operation does not return any message.

fusermount -u mount-point

10.1.3 Uninstallation and Removal of Data

1. KWEST “installs” files in the users local directory.

2. To clean allthe installation data including the database andlog files,the usershould use the make clean or make ob commands.

3. These commands are present in the Makefile, which comes with the project source.

4. In case of manual installation, there is a KWEST folder in the .config directory.

5. Incomplete removal of files may cause haphazard execution of the program or corruption of files.

10.2 User Help

KWEST is a virtual file system. This means that the folders and files represented by it are a part of its virtual organisation. Each file in KWEST represents an actual file stored somewhere on the underlying file system. The main focus of using KWEST is organisation.

10.2.1 Importing Files and Folders

• By default KWEST imports from the users HOME folder. It recursively scans all the sub folders and imports the folder hierarchy into KWEST. Hidden files are ignored by KWEST.

• The user can also explicitly use the import tool to import files and folders into KWEST. The import tool accepts the folder to import as argument and imports all files and folders within it. It can work regardless of whether the KWEST file system is mounted or not.

• While importing, the metadata of the file is used to organise each file. For e.g.: A music file contains the song’s artist, which is used to categorise that file.

10.3 Browsing Files by Type

10.3.1 Audio

This folder contains all the files recognised by the system as being of type audio. Upon accessing the folder, each Audio file is further categorised by Album, Artist, Genre. If a particular metadata is absent for the audio file, KWEST categorises it in the Unknown folder.

For e.g. If the audio file does not have any artist associated with it, it can be found in UnknownArtist.

Figure 10.1: Audio files being categorised by /Album and /Artist/Album views

10.3.2 Image

This folder contains all the image files recognised by the system. Inside, the images are organised by ImageCreator and ImageDate. The ImageCreator sub folder is based on Creators like software - Adobe Photoshop, or hardware - Camera Models. Each ImageCreator tag is further organised by ImageDate. The ImageDate folder contains images sorted by Month-Year of creation. E.g.: A picture taken on “2nd March 1992” will appear under “1992Mar”. As with Audio, files with metadata missing will appear under appropriate Unknown sub folders.

Figure 10.2: Images being categorised by Creator and CreationDate

10.3.3 PDF

All PDF documents are tagged with the PDF tag. Each PDF document is organised by its Author,Publisher,Subject and Title. Files with metadata missing are appropriately tagged under Unknown tags.

Figure 10.3: PDF documents being separated by Author and Title

10.3.4 Video

Currently, the system organises video based only on Length, with the categories being Short, Medium and Long. A short video is anything with less than 1800s of playtime. Videos with play time equal to or greater than 5400s are considered Long, and anything between them is considered Medium. The Video tag does not contain any categories as most of the videos stored by the user do not have any metadata present.

Figure 10.4: Browsing the KWEST Video folder in a terminal

10.3.5 USER

The tag ‘USER’ refers to the username of the current user. The user can create and manage his own personal tags in this folder.

Figure 10.5: User tags displayed in KWEST

10.3.6 Using Suggestions to tag files

KWEST helps the user with organisation by providing suggestions for tagging files. These suggestions are provided as files prefixed with the word - “SUGGESTED”. The user may make use of that suggestion by tagging that file in the current tag.

e.g. in tag Zodiac, there are 3 suggestions: ARIES, GEMINI, CANCER each shown as a file with SUGGESTED prefixed in their names. To make use of the suggestion on ARIES, the user tags the file ARIES under the tag Zodiac. The file is now seen in the tag without the suggested prefix. The other suggestions are still present and may be used further or deleted.

Figure 10.6: KWEST offering suggestions for tag favourites

Chapter 11

Conclusion and Future Scope

11.1 Conclusion

Considering data organisation

A file system, considering that it stores data, is created with stability and performance in mind. An end-user is more concerned with how they can store and access their data efficiently. Using a KWEST file system, the user can organise their data efficiently. All files are stored and accessed by their content or context rather than just a bunch of string-names. This allows the user to think and remember the file in terms of what it represents, rather than a path-name which may not be related to the data it contains. This semantic approach is helpful to the user, as it becomes easier to manage for them to search for and manage their data.

Automation of Tasks

KWEST automatically uses the metadata embedded in a file to apply tags and categorise it.Thisautomationhelpstheuserbyprovidingaccesstofilesaccordingtotheirrespective contexts. By doing this, the user gets all Audio files under the Audio folder, all Image files under the Image folder and so on. Further, each folder is organised by meta-type belonging to that mime type. E.g. the Audio folder is sorted by Album, Artist, Genre.

Providing Suggestions

It may happen that the user inadvertently misses out tagging some file, which may result in an incomplete organisation. The user may later search for that particular file in the tag, but will not find it. In a KWEST file system, based on the occurrences of files in various user tags, the system provides suggestions that help the user tag a file in appropriate places. This helps avoid missing out on important files, and allows a faster method of organisation as the files to be tagged are available as suggestions.

Performance and Stability

Although a KWEST file system is created and operates as a VFS, there is no perceptible lag or performance hit on the operations the user performs. Common operations like listening to music, watching a video, writing to a document can be carried out smoothly.

Also, since KWEST is based on an always stable implementation of FUSE, the file system itself is stable. Strict coding standards and rigorous tests against memory allow the file system to remain stable even under moderate usage.

11.2 Future Scope

The project, with its novel concept of Applying association rule learning in a semantic file system; is the first implementation of a semantic file system to actively help the user categorise and organise their data.

Support for more Standard File Types

Right now the import feature can extract metadata only from a fixed set of file types. More file types can be handled which increase the feature and usefulness of the file system. Every operating system or file manager has a certain knowledge of what kind of metadata each file type can contain. Using this knowledge in the KWEST file system will allow the user to be able to browse a file system completely based on it’s semantics.

More Relations and Associations

Currently, the system creates associations based on the common occurrences of files between various tags. This is done using the Apriori algorithm. There are a lot of other interesting approaches which can be utilised. Like algorithms to create various different associations. OrchangingthewayApriorihandlesfilesandtags. Fileaccesses,frequency of usage, explicit user choices can also be utilised for forming results.

Efficiency and Performance

Although the system is both efficient and performant, it can be vastly improved to provide a high-quality file system. Operations can be threaded to reduce the wait time. Simultaneous access to files can be used to provide a fluid experience. Algorithm throughput can be raised to get more accurate results. These are just a few performance and efficiency related things we can do with KWEST. The ultimate approach is to integrate this file system at the kernel level. This will allow performance and stability similar those of traditional file systems.

Collect Data from Various Locations

The KWEST file system imports data only from the user’s HOME folder. The data stored there might not be the only location a user wants to use. In today’s world, each person has a multitude of devices ranging from laptops,PCs,tablets and phones to Cloud services like Dropbox, Google Drive. Each of them have data which the KWEST file system can utilise to form associations and display using virtual suggestions. It will result in a unified view of all the user’s data categorised and organised, which is spread and available across all of their devices.

11.3 Need for KWEST Tomorrow

With data usage almost doubling with each passing year, people are bound to focus on organising it. A tool like KWEST, with its semantic roots, automation and suggestions will be immensely helpful when a large data storage has to be properly catalogued, organised and accessed. Thus, we have tried to implement a research based project with its usefulness reflected in the problems of tomorrow.

Chapter 12

APPENDIX

A: MATHEMATICAL MODEL

The relationship between files andtags can be representedby using Settheory. Settheory is the branch of mathematics that studies sets, which are collections of objects. The following mathematical model represents the working of this filesystem.

The following dynamic and variable sets are defined as, F : Set of Files

T : Set of Tags

S : Set of Tags in query ( S T )

.1 Relation between Files (F) and Tags (T)

R=f(f;t)j f hastagt; f 2F;t 2Tg

Here R defines the relation between a file f and its tag t where R F T. This relationship is many-to-many. That is a file can have many tags, and a tag can describe many files.

.2 Association between Tags (T)

Using discovered associations, we can form various relations between tags. These tag-to-tag help in displaying related information. For any two tags there exists a distinct relation between them given by r. The function Xr(A;B) returns the relation between two tags. Associations can be broadly categorized as:

• AB : A and B are not directly related, but there may exist some indirect relation between them.

• AB : A and B are directly related, where A always has a path leading to B. This relation is similar to AB.

• A./B : A and B are not directly related, but B supplements additional information related to A.

• f : This relation states that there does not exist any relation between the two tags.

.3 Operations

g(f)= ft : f Rtg

g is an operation which takes input as a file f and returns the set of tags (t 2S) related by R to that file.

h(t)= ff : f Rtg

h is an operation which takes input as a tag t and returns the set of files (f 2F ) related by R to that tag.

.4 Storing Tags and Files

The relation R is stored as a set of ordered pairs (f;t), where RFT. The operations g and h operate on these ordered pairs and return mapped or matched elements. A relation which has to be added must be represented in the form of of ordered pair (f;t). Storage of all relations is given by FT where ordered pairs exist according to R=ff 2F;t 2T j f Rtg.

For example, we have the sets and their relations as:

F =ff1; f2; f3g;T =ft1;t2;t3g;R=ff1Rt1; f2Rt2; f3Rt1; f1Rt3g

Then we store this relation by its ordered pairs given by:

R=f(f1;t1);(f2;t2);(f3;t1);(f1;t3)g

.5 Extraction of metadata

The extraction of metadata is defined by the function XE which takes a file f(f 2F) and returns a set of tags (TE T) that form the relation (f Rt :t 2TE).

XE(f)= TE 22T

Addition of new information (metadata, tags) is done as:

if(t 2= T)then(T T [ftg)

We then store this relation as a ordered pair f(f;t)8t 2TEg. At the end of this operation TE T will hold true.

.6 Importing Semantics

The existing file-directory structure can be imported to the system and represented in the form of tags and files. We define:

FH : Set of Files on hard-disk which are not represented in system.

DH : Set of Directories on hard-disk which are not represented in system.

Then for every file stored within a directory d, the relation R is expressed as (f Rd). When importing semantics we create the ordered pair (f;d) given by the relation (f Rd;8d 2Dd). Where Dd contains the directory the file is stored in, as well as every parent directory of that directory itself.

Directories also contain sub-directories which we store in the form of tag relationships. We represent them as: 8(d1;d2)2DH, if d2 d1 (d2 is a sub-directory of d1) store the relation as

d1 !d2

.7 Apriori Algorithm

The apriori algorithm[10] is a classic algorithm for learning association rules. The algorithm is designed to operate on databases containing transactions. As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to find subsets which are common to at least a minimum number C of the itemsets. Apriori uses a ”bottom up” approach, where frequent subsets are extended one item at a time (a step known as

candidategeneration),andgroupsofcandidatesaretestedagainstthedata. Thealgorithm terminates when no further successful extensions are found.

The purpose of the apriori algorithm is to find associations between different sets of data. Each set of data has a number of items and is called a transaction. The output of Apriori is sets of rules that tell us how often items are contained in sets of data.

Itemset

A collection of one or more items. Example: A, B, C

k-itemset

An itemset that contains k items.

Support count (S)

Number of transactions containing an itemset. Example: S(A, B) = 2

Support (supp)

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. Suppose minsup is the minimum support threshold. Example: supp(A, B) = 2/5

Frequent Itemset (L)

Anitemsetsatisfiesminimumsupportiftheoccurrencefrequencyoftheitemsetisgreater or equal to a threshold. If an itemset satisfies minimum support, then it is a frequent itemset. Thusanitemsetwhosesupportisgreaterorequaltominsupisafrequentitemset.

Confidence

The confidence of a rule is defined as,

Conf(A!B)= supp(A!B)=supp(A) (12.1) Suppose minconf is the minimum confidence threshold.

Rule Generation

Given a setoftransactions T,the goalofassociation rule mining is to findallrules having

support minsupthreshold (12.2) confidenceminconf threshold (12.3)

Given a frequent itemset L, find all non-empty subsets f L such that f ! L f satisfies the minimum confidence requirement.

Example: If A;B;C is a frequent itemset, then the following candidate rules are formed AB!C; AC !B; BC !A; A!BC; B!AC; C !AB

If jLj = k, then there are (2k) 2 candidate association rules (ignoring L ! Fand F!L)

Apriori principle

The princciple sttes that if an itemset is frequent, then all of its subsets must also be frequent. Apriori principle holds due to the following property of the support measure:

8X;Y :(X Y)!s(X)s(Y) (12.4)

Support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

Algorithm Input

T - Database of transactions I - Items

L - Itemset s - support

c - confidence

Output

R - Association rules satisfying s and c

Algorithm to Generate Frequent Itemsets Apriori(T,s)

L1 Large1 itemset k 2

whileLk 1 =F

Ck =Generate(Lk 1) fortransactionst 2T

C Subset(Ck;t) for candidatesc2C

count[c] count[c]+1 Lk c2Ckjcount[c]s k k+1

return[Lk

Algorithm to Generate Association Rules GenRule

R=F;

for eachl 2L do

for eachx l suchthat x =Fand x =l do if(supp(l)=supp(x)) cthen

R=R [ (x !(l x));

B: TESTING OF DATA

Testing of the system will be done on the following points:

.2 Storing the relation R:

The working of the system depends on correctly storing the relation R. These tests check whether the relations are stored and represented correctly.

1. For any file f associated with tag t there should exist an ordered pair (f;t).

2. For a file f associated with tags S = ft1;t2;t3g, the operation g(f) should return exacty S.

3. For a tagt containing files F =ff1; f2; f3g, the operation h(t) should return exactly F.

.3 Relation between tags:

1. The relation r between two tags t1;t2 is given by function Xr(t1;t2)= c.

2. The relation r is distinct i.e. there exists only one relation between any two tags. If contradictions arise where more than one relation is present between two tags, then all those relations must be made void and r =f. The user can then explicitly specify which relation should be created between those two tags.

3. Extensive testing must be done on relations to determine which r needs to be set under certain conditions.

.4 Extraction of Metadata:

Let M be the set of all metadata tagst for file f. The function XE represents an algorithm to extract t from f. It returns a set TE such that ft 2TE j f Rtg and (TE M). To test the efficiency of the function, or the effectiveness of it, we compare the cardinality of the generated set TE with the set of Metadata M. The efficiency can be calculated by

Efficiencye= jTE j

Using e we can compare algorithms and their efficiency. Appropriate algorithms can be chosen for various file types so that efficiency of the entire system remains high. For e.g.

XE1 :fe=0:7 for audio;e=0:4 for images g XE2 :fe=0:6 for audio;e=0:6 for images g

Then we have the following options:

1. XE2 is a better choice as it yeilds a more consisten efficiency.

2. XE1 is used only for audio and XE2 is used only for image extractions.

.5 Queries and their results:

Queries are parsed into tokens of the form (t1;s;t2). We need to test whether s returns the correct results for the query. A Query Q(S) is said to be successfully executed when the expected result are shown.

• The Query q(t) for a single tag t should return a set of files (f 2 F ) through the operation h(t).

• In a query if no operation is given, Intersubsection should be performed.

• For Query Q(S) the operations should be performed from left to right unless precedence is specified by paranthesis.

Example: Let t1;t2;t3 be tags having the following files: t1 : fphoto1;photo2;doc1g

t2 : fphoto1;doc1;ppt1g t3 : fphoto1;doc3g

1. q(t) = F where f should follow the relation (f 2 F j f RT). For example, q(t1) returns F =fphoto1;photo2;doc1g.

2. Consider a query containing tags S=ft1;t2g. It generates results by performing the operations Q(S)= q(t1)sq(t2) where s is an operator.

a) Putting s = [ for Union in query Q(S) = q(t1)[q(t2); the result should be F =fphoto1;photo2;doc1;ppt1g.

b) Putting s = \ for Intersubsection in query Q(S) = q(t1)\q(t2); the result should be F =fphoto1;doc1g.

c) Puttings=nforSetDifferenceinqueryQ(S)=q(t1)nq(t2);theresultshould be F =fphoto2g,whereas the query Q(S)=q(t2)nq(t1) willreturn the result as F =fppt1g.

d) Putting s = for Symmetric Difference in query Q(S) = q(t1) q(t2); the result should be F =fphoto2;ppt1g.

3. Consider a query containing tags S = ft1;t2;t3g. It performs operations as : Q(S) = q(t1)s1q(t2)s2q(t3) The default order for processing is from left to right unless precedence is specified through parenthesis.

a) For the query Q(S) = q(t1)q(t2)q(t3) the result will be returned as files fphoto1g

b) For the query Q(S) = q(t1)[q(t2)\q(t3) the result will be returned as files fphoto1;doc1;doc3g

After verifying that individual queries return correct results, we must check whether Q(S) runs correctly as well. This is done by seperating the query into tokens, and calculating their results in turn. It must be verified that results are correct and have not been mis-intepreted through tokenization of the query.

.6 Forming Associations

Consider a database, D, consisting of 9 transactions. Suppose minimum support count required is 2 (i.e. minsup = 2=9 = 22% ). Let minimum confidence required is minconf =70%. We have to first find out the frequent itemset using apriori algorithm. Then, Association rules will be generated using minsup and minconf.

TagID

Files

T100

F1, F2, F5

T101

F1, F2, F5

T102

F2, F4

T103

F2, F3

T104

F1, F2, F4

T105

F1, F3

T106

F2, F3

T107

F1, F3

T108

F1, F2, F3, F5

T109

F1, F2, F3

Step 1: Generating initial Candidate Itemset

In the first iteration of the algorithm, each item is a member of the set of candidate.

Itemset

Support count

fF1g

6

fF2g

7

fF3g

6

fF4g

2

fF5g

2

Step 2: Generating 1-itemset Frequent Pattern

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

Itemset

Support count

fF1g

6

fF2g

7

fF3g

6

fF4g

2

fF5g

2

Step 3: Generating 2-itemset Frequent Pattern

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Itemset

Support count

fF1, F2g

4

fF1, F3g

4

fF1, F5g

2

fF2, F3g

4

fF2, F4g

2

fF2, F5g

2

Step 4: Generating 3-itemset Frequent Pattern Thegenerationofthesetofcandidate3-itemsets,C3,involvesuseoftheAprioriProperty. First, we generate C3 using L2 join L2.

C3 = ffF1, F2, F3g, fF1, F2, F5g, fF1, F3, F5g, fF2, F3, F4g, fF2, F3, F5g, fF2, F4, F5gg.

Now we will apply Apriori property to determine which candidate itemsets are frequent. The 2-item subsets of fF1, F2, F3g are fF1, F2g, fF1, F3g and fF2, F3g. Since all

2-item subsets of fF1, F2, F3g are members of L2, We will keep fF1, F2, F3g in C3. The 2-item subsets of fF2, F3, F5g are fF2, F3g, fF2, F5g and fF3, F5g. But, fF3,

F5g is not a member of L2 and hence it is violating Apriori Property. Thus we will remove fF2, F3, F5g from C3. Therefore,

C3 = ffF1, F2, F3g, fF1, F2, F5gg.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidates 3-itemsets in C3 having minimum support.

Itemset

Support count

fF1, F2, F3g

2

fF1, F2, F5g

2

Step 5: Generating 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in ffF1, F2, F3, F5gg, this itemset is removed since its subset ffF2, F3, F5gg is not frequent. Thus,C4=F , and algorithm terminates, having found all of the frequent items. This completes our apriori algorithm.

Step 6: Generating Association Rules from Frequent Itemsets

For each frequent itemset ’l’, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule s!(l s) if supp(l)=supp(s)minconf

We had L = ffF1g, fF2g, fF3g, fF4g, fF5g, fF1, F2g, fF1, F3g, fF1, F5g, fF2, F3g, fF2, F4g, fF2, F5g, fF1, F2, F3g, fF1, F2, F5gg.

Consider l = fF1, F2, F5g. Its all nonempty subsets are fF1, F2g, fF1, F5g, fF2, F5g, fF1g, fF2g, fF5g.

The association rules are shown below, each listed with its confidence.

1. R1:F1;F2!F5 Confidence=suppfF1;F2;F5g=suppfF1;F2g=2=4=50% R1 is Rejected.

2. R2:F1;F5!F2 Confidence=suppfF1;F2;F5g=suppfF1;F5g=2=2=100% R2 is Selected.

3. R3:F2;F5!F1 Confidence=suppfF1;F2;F5g=suppfF2;F5g=2=2=100% R3 is Selected.

4. R4:F1!F2;F5 Confidence=suppfF1;F2;F5g=suppfF1g=2=6=33% R4 is Rejected.

5. R5:F2!F1;F5 Confidence=suppfF1;F2;F5g=suppfF2g=2=7=29% R5 is Rejected.

6. R6:F5!F1;F2 Confidence=suppfF1;F2;F5g=suppfF5g=2=2=100% R6 is Selected.

In this way, we have found the following three strong association rules.

1. F1;F5!F2

2. F2;F5!F1

3. F5!F1;F2

C: PAPERS PUBLISHED

Sr.No. Paper Name

[1] A. Gogte, S. Gupta, H. Pandit, R. Sharma, Using association rule learninginaSemanticfilesystem,InternationalConferenceonAdvanced Computer Sciences and Information Technology, Pune, Februrary 15, 2013, pp. 29-31. [Published]

[2] A. Gogte, S. Gupta, H. Pandit, R. Sharma, KWEST - A Semantically Tagged Virtual File System, International Conference on Advanced Computer Engineering and Applications, Trivandrum, 2012. [Accepted]

D: PAPERS REFERRED

Sr.No. Paper Name

[1] K. Chang, W.T. Perdana, B. Ramadhana, K. Sethuraman, T.V. Le, and N. Chachra, Knowledge File System-A principled approach to personal information management, 2010 IEEE International Conference on Data Mining Workshops, Sydney, December 2010, pp. 1037-1044.

[2] O. Eck, and D. Schaefer, A semantic file system for integrated product data management, Advanced Engineering Informatics, 2011, pp. 177-184.

[3] P. Mohan, S. Venkateswaran, Raghuraman, and A. Siromoney, Semantic FileRetrievalinFileSystemsusingVirtualDirectories,Proc.LinuxExpo Conference, Raleigh, NC, May 2007, pp. 141-151.

[4] R. Agarwal, and R. Srikant, Fast algorithms for mining association rules in large databases, 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, September 1994, pp. 487-499.

E : CONTRIBUTION OF TEAM MEMBERS

Member 1: Aseem Gogte

• Literature Survey

• Mathematical model

• Module for Extraction of metadata from audio using TagLib

• Testing related to File operations

• Source code formatting and documentation

Member 2: Sahil Gupta

• Designing database queries

• Database design

• Module for importing and exporting semantics

• Tag associations

• Integrating database with FUSE

• Data mining, association rule learning alternatives

• Module for Apriori algorithm study and implementation

Member 3: Harshvardhan Pandit

• Design program flow

• Component design

• Module for interact with FUSE technology

• Module for extraction of metadata from images and videos using Libextrator

• Plugin architecture for metadata extraction

• API for developers, dynamic libraries

• Logging mechanism for for debugging code

• Debugging using GDB, valgrind

Member 4: Rohit Sharma

• Database design

• Module for managing database consistency

• Module for extraction of metadata from PDF using Poppler

• Integrating modules

• Testing related to database

APPENDIX F: GLOSSARY

Acronym SDLC FUSE GCC GPL

API GUI FAQ SFS VFS KFS

Definition

Software Development Life Cycle File system in Userspace

GNU Compiler Collection General Public License

Application programming interface Graphical user interface Frequently asked questions Semantic File System

Virtual File System Knowledge File System

Chapter 13

BIBLIOGRAPHY

[1] K. Chang, W.T. Perdana, B. Ramadhana, K. Sethuraman, T.V. Le, and N. Chachra, Knowledge File System-A principled approach to personal information management, 2010 IEEE International Conference on Data Mining Workshops, Sydney, December 2010, pp. 1037-1044.

[2] S. Gopal, Y. Yang, K. Salomatin, and J. Carbonell, Statistical Learning for File-Type Identification, 2011 10th International Conference on Machine Learning and Applications, May 2011, pp. 68-73.

[3] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, Semantic-Aware Metadata Organization Paradigm in Next-Generation File Systems, IEEE Transactions On Parallel And Distributed Systems, Vol. 23, No. 2, February 2012, pp. 337-344.

[4] O. Eck, and D. Schaefer, A semantic file system for integrated product data management, Advanced Engineering Informatics, 2011, pp. 177-184.

[5] P. Mohan, S. Venkateswaran, Raghuraman, and A. Siromoney, Semantic File Retrieval in File Systems using Virtual Directories, Proc. Linux Expo Conference, Raleigh, NC, May 2007, pp. 141-151.

[6] R. Agarwal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, 1993 ACM SIGMOD Conference, Washington DC, USA, May 1993, pp. 207-216.

[7] S. Bloehdorn, O. Grlitz, S. Schenk, and M. Vlkel, TagFS - Tag Semantics for Hierarchical File Systems, 6th International Conference on Knowledge Management (I-KNOW 06), Graz, Austria, September 6-8, 2006.

[8] D. Gifford, P. Jouvelot, M. Sheldon, and J.W. O’Toole, Sematic File Systems. 13th ACM Symposium on Operating Systems Principles, ACM Operating Systems Review, October 1991, pp. 16-25.

[9] A. Schroder, R. Fritzsche, S. Schmidt, A. Mitschick, and K. Meißner, Semantic Extension of a Hierarchical Storage Management System for Small and Medium-sized Enterprises, 1st International Workshop on Semantic Digital Archives, 2011.

[10] R. Agarwal, and R. Srikant, Fast algorithms for mining association rules in large databases, 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, September 1994, pp. 487-499.

[11] C. Mangold, A survey and classification of semantic search approaches, Int. J. Metadata, Semantics and Ontology, Vol. 2, No. 1, pp.23-34.

[12] R. Freund, File Systems and Usability — the Missing Link, Cognitive Science, University of Osnabruck, July 2007.

[13] Apple Spotlight, from http://developer.apple.com/macosx/spotlight.html

[14] Google Desktop Search, from http://googledesktop.blogspot.in

[15] Tagsistant, from http://www.tagsistant.net

[16] G.Olaf (2008), Tagster, from http://www.uni-koblenz.de

[17] Homepage and documentation File system in USERspace (FUSE), from http://fuse.sourceforge.net

[18] Homepage and documentation SQLite database, from http://www.sqlite.org