Archival, Backup, Data-Loss Prevention and e-Discovery¶
Different challenges could potentially be resolved by implementing a single solution, providing each of the functional aspects in an integrated fashion.
A brief overview of the functional components:
Archival
Archival is the retention of business records, in a fashion that allows them to be used as evidence.
Many archival solutions only include actual communications that pass through an SMTP server that can keep the archival solution in the loop.
Backup
Backup is the lesser part of the ability to restore, a frequently occurring, everyday event.
It is often requested that backup happen at a per-mailbox or even per-message level.
Data-Loss Prevention
e-Discovery
Maintenance of a changelog on object entries that can change state (email read/deleted), or are volatile (changes to an appointment).
Functional Requirements¶
Audit Trail
Item Changelog
A per-item changelog of who changed what, on which item, and when.
Queue ID Chasing
Chase so-called Queue IDs for messages being exchanged with the outside world, and internally between systems throughout the deployment.
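Queue ID chasing can be illustrated with a minimal sketch that groups mail log lines by the queue ID Postfix prints after the process tag. It assumes the default short, hexadecimal queue IDs; the function name and the sample log lines are illustrative only and not part of the design.

```python
import re
from collections import defaultdict

# Matches "postfix/<daemon>[<pid>]: <QUEUEID>: <rest>".
# Assumption: short queue IDs, i.e. upper-case hexadecimal strings.
QUEUE_ID = re.compile(r"postfix/[\w-]+\[\d+\]: ([0-9A-F]{6,}): (.*)$")


def group_by_queue_id(lines):
    """Collect the log fragments that belong to each queue ID."""
    trail = defaultdict(list)
    for line in lines:
        match = QUEUE_ID.search(line)
        if match:
            trail[match.group(1)].append(match.group(2))
    return dict(trail)


# Illustrative log lines, not taken from a real deployment.
SAMPLE = [
    "Oct 11 23:10:20 kolab postfix/smtpd[2801]: 3F2A51C0E4: client=localhost[::1]",
    "Oct 11 23:10:20 kolab postfix/qmgr[2819]: 3F2A51C0E4: from=<john.doe@example.org>, size=2593, nrcpt=1 (queue active)",
    "Oct 11 23:10:21 kolab postfix/lmtp[2830]: 3F2A51C0E4: to=<jane.doe@example.org>, relay=kolab.example.org, status=sent",
    "Oct 11 23:10:21 kolab postfix/smtpd[2801]: disconnect from localhost[::1]",
]
trail = group_by_queue_id(SAMPLE)
```

Lines without a queue ID, such as the final disconnect line, are skipped, so each key in the result holds the trail of one message through the mail system.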
Functional Design¶
Functional Components¶
Dealer
A dealer is a script executed once for each event notification, used to receive the initial event notification from Cyrus IMAP 2.5, and broadcast the event on to the message bus or queue.
The dealer awaits confirmation of a broker having received the event notification.
Broker
A broker retrieves the notifications from the message bus or queue, and acknowledges having received the event notification.
The event notification is put into a persistent queue, awaiting workers to become ready for handling the event notification.
Worker
The worker is where the processing happens – one can have as many workers as necessary, or as few as required.
The worker announces its presence to the broker, which subsequently assigns jobs to the worker [1].
The worker may require additional information to be obtained, such as the message payload [2].
Collector
The collector daemon is an optional component subscribing to requests for additional information that can only reliably be obtained from a Cyrus IMAP backend spool directory.
System Log Centralization
The centralization of system log files such as /var/log/maillog aids in tracing the exchange of messages as they traverse the infrastructure, and helps in associating, for example, a Login event to an IMAP frontend with the corresponding web server session [3].
Operational Requirements¶
Broker – Worker Interaction¶
When the broker starts up, it creates three listener sockets:
A dealer router, used for incoming event notifications from IMAP servers passed through the Dealer component.
A worker router, used to exchange job information and notification payload with workers.
A control router, used to exchange worker and job state information.
When the worker starts, it connects to both the control router and worker router.
Using the controller channel, the worker lets the broker know it is ready to receive a job.
The broker adds the worker to its list of workers.
The broker will continue to receive occasional messages from the worker to allow it to determine whether or not it is still available.
The broker, maintaining a queue of jobs to assign to workers, lets the worker know about a newly assigned job – again using the controller channel.
The worker internally triggers the retrieval of the job using the worker channel.
The worker is now in state BUSY and must respond within a set interval, or the broker will set the job back into PENDING state and mark the worker as unavailable.
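The bookkeeping described above can be sketched without any sockets. The following is a minimal in-memory model, not the actual broker: the class and method names, the READY and UNAVAILABLE state labels, and the 30 second timeout are assumptions made for illustration. Only PENDING and BUSY are named in the text.

```python
from collections import deque


class Broker:
    """In-memory bookkeeping only; no sockets are involved in this sketch."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout   # seconds a BUSY worker may stay silent
        self.workers = {}        # worker id -> {"state", "last_seen", "job"}
        self.pending = deque()   # job ids currently in PENDING state

    def submit(self, job_id):
        self.pending.append(job_id)

    def worker_ready(self, worker_id, now):
        """The worker announced on the control channel that it can take a job."""
        self.workers[worker_id] = {"state": "READY", "last_seen": now, "job": None}

    def heartbeat(self, worker_id, now):
        """Occasional message from a worker, proving it is still available."""
        if worker_id in self.workers:
            self.workers[worker_id]["last_seen"] = now

    def assign(self, now):
        """Hand one PENDING job to one READY worker; return (worker, job) or None."""
        if not self.pending:
            return None
        for worker_id, info in self.workers.items():
            if info["state"] == "READY":
                job_id = self.pending.popleft()
                info.update(state="BUSY", job=job_id, last_seen=now)
                return worker_id, job_id
        return None

    def expire(self, now):
        """Requeue the jobs of BUSY workers that stayed silent for too long."""
        for info in self.workers.values():
            if info["state"] == "BUSY" and now - info["last_seen"] > self.timeout:
                self.pending.appendleft(info["job"])
                info.update(state="UNAVAILABLE", job=None)


broker = Broker(timeout=30.0)
broker.submit("job-1")
broker.worker_ready("worker-a", now=0.0)
assigned = broker.assign(now=1.0)   # worker-a becomes BUSY with job-1
broker.expire(now=60.0)             # silent for more than 30 s: job-1 is PENDING again
```

Passing the current time in explicitly keeps the sketch deterministic; a real broker would read a clock and run the expiry check periodically.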
Worker Design¶
The worker is built out of plugins, that subscribe to an event type, where event types available are listed in Event Notification Types.
Each event type individually may require handling – for example, a logout event is associated with the corresponding login event.
The following components will be pluggable and configurable:
subscribing to a message bus or queue, as inputs, initially including only zmq.
event handling, as handlers, initially including only one handler per event notification type and the higher-level processors changelog and freebusy to detect changes in groupware objects.
result output, as output, initially including only elasticsearch.
storage for transactions pending or aggregated meta information, as storage, initially including only elasticsearch.
Assuming an installation path of bonnie/worker/, the following depicts its tree layout:
handlers/
`- changelog.py
`- freebusy.py
`- mailboxcreate.py
`- messageappend.py
`- ...
inputs/
`- zmq_input.py
outputs/
`- elasticsearch_output.py
storage/
`- elasticsearch_storage.py
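How handlers subscribe to event types can be sketched as a small registry. This is an illustration under assumptions, not the worker's actual plugin interface: the Dispatcher class, its method names and the handler return values are hypothetical.

```python
class Dispatcher:
    """Maps event type names to the handlers subscribed to them."""

    def __init__(self):
        self._handlers = {}   # event type -> list of callables

    def subscribe(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, notification):
        """Pass the notification to every handler subscribed to its event type."""
        handlers = self._handlers.get(notification["event"], [])
        return [handler(notification) for handler in handlers]


def mailboxcreate(notification):
    # Base handler for a single event notification type.
    return ("mailboxcreate", notification["uri"])


def changelog(notification):
    # Higher-level processor interested in several event types.
    return ("changelog", notification["event"])


dispatcher = Dispatcher()
dispatcher.subscribe("MailboxCreate", mailboxcreate)
for event_type in ("MailboxCreate", "MessageAppend"):
    dispatcher.subscribe(event_type, changelog)

results = dispatcher.dispatch({
    "event": "MailboxCreate",
    "uri": "imap://john.doe@example.org@kolab.example.org/Calendar",
})
```

Handlers run in subscription order, so the base handler for an event type sees the notification before a higher-level processor does.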
To take the changelog and freebusy handlers as an example, the following event notification types may need to be subscribed to.
MailboxCreate
A new mailbox that is an event folder may have been created.
The initial event is handled by the base handler for the event notification type.
Passing this event right through to the changelog handler would require it to obtain the /shared/vendor/kolab/folder-type and/or /private/vendor/kolab/folder-type metadata value(s) in order to determine whether the folder indeed is an event folder.
However, the setting of metadata is an event separate from the mailbox creation, and at the moment the handler receives the initial event notification, the metadata may not have been set yet.
Note
At the time of this writing, no separate event notification for setting folder-level METADATA exists.
MailboxDelete
A mailbox that was an event folder may have been deleted.
MailboxRename
A mailbox that was an event folder may have been renamed.
MessageAppend
Only applicable to event folders, this indicates that a new event, or an updated version of an existing event, has been appended.
MessageCopy
One or more events may have been copied from an event folder into another event folder.
MessageMove
One or more events may have been moved from one event folder into another event folder.
Note
Plugins that are interested in the vendor/kolab/folder-type METADATA value(s) of a folder can reply with additional commands for the collector component. This puts the current job back into the PENDING state, and sends it through the handler again once the requested information has been added to the notification payload.
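The round trip through the collector can be sketched as a handler that either asks for more information or finishes. The return convention, the GETMETADATA command name and the metadata key in the payload are hypothetical, chosen only to illustrate the two passes, as is the assumption that event folders carry a folder-type value starting with "event".

```python
FOLDER_TYPE_KEYS = (
    "/shared/vendor/kolab/folder-type",
    "/private/vendor/kolab/folder-type",
)


def changelog_handler(notification):
    """Return ("collect", command) to request metadata, else ("done", is_event_folder)."""
    metadata = notification.get("metadata")
    if metadata is None:
        # Hypothetical collector command; the job would return to PENDING meanwhile.
        return ("collect", {"command": "GETMETADATA", "uri": notification["uri"]})
    folder_types = [metadata.get(key, "") for key in FOLDER_TYPE_KEYS]
    # Assumption: event folders have a folder-type value starting with "event".
    return ("done", any(value.startswith("event") for value in folder_types))


notification = {
    "event": "MailboxCreate",
    "uri": "imap://john.doe@example.org@kolab.example.org/Calendar",
}
first_pass = changelog_handler(notification)

# The collector added the requested information to the payload.
notification["metadata"] = {"/shared/vendor/kolab/folder-type": "event"}
second_pass = changelog_handler(notification)
```

On the first pass the handler cannot decide and requests the metadata; on the second pass the payload carries it and the handler completes.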
Event Notification Types¶
Event types available include, in alphabetical order:
FlagsClear¶
This event notification type indicates one or more messages have had their flags cleared.
Flags having been cleared may include \Seen, but also \Deleted, and any other custom flag on an IMAP message.
Subscribe to this notification for:
Backup/Restore
e-Discovery
FlagsSet¶
Subscribe to this notification for:
Backup/Restore
e-Discovery
Login¶
Additional information to obtain for this event notification type:
The persistent unique attribute for the user object.
Additional LDAP object attributes.
Information storage:
This event needs to be stored until it can be associated with a Logout event notification type.
Subscribe to this notification for:
e-Discovery
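Holding a Login event until its Logout arrives can be sketched as follows. The sketch assumes both notifications carry the same session_id, and it keeps pending logins in a plain dictionary where a real worker would use its storage plugin. The class name is hypothetical.

```python
class SessionTracker:
    """Keeps Login notifications until the matching Logout is seen."""

    def __init__(self):
        self._open = {}   # session_id -> stored Login notification

    def handle(self, notification):
        """Store Login events; return a (login, logout) pair once Logout arrives."""
        session_id = notification["session_id"]
        if notification["event"] == "Login":
            self._open[session_id] = notification
            return None
        if notification["event"] == "Logout":
            login = self._open.pop(session_id, None)
            return (login, notification) if login else None
        return None


tracker = SessionTracker()
session = "kolab.example.org-2819-1413069020-1"
stored = tracker.handle({"event": "Login", "session_id": session, "user": "john.doe@example.org"})
pair = tracker.handle({"event": "Logout", "session_id": session, "user": "john.doe@example.org"})
orphan = tracker.handle({"event": "Logout", "session_id": "unknown-session"})
```

A Logout without a stored Login yields nothing here; how such orphans should be recorded is left open.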
Logout¶
Subscribe to this notification for:
e-Discovery
MailboxCreate¶
Additional information to obtain
MailboxDelete¶
MailboxRename¶
MailboxSubscribe¶
MailboxUnsubscribe¶
MessageAppend¶
MessageCopy¶
MessageExpire¶
MessageExpunge¶
MessageMove¶
MessageNew¶
MessageRead¶
MessageTrash¶
QuotaExceeded¶
QuotaWithin¶
QuotaChange¶
An Integrated Solution¶
The following aspects of an environment need to be tracked:
Logs such as /var/log/maillog, which contain the information about the exchange of messages between internal and external systems and software (Postfix/LMTP -> Cyrus IMAP).
Cyrus IMAP 2.5 events being broadcast.
In this picture, IMAP (using Cyrus IMAP 2.5) issues so-called event notifications to a message bus, that can be picked up by the appropriate subscribers.
Note that the subscribers are different components to plug in and enable, or leave out – not everyone has a need for Archival and e-Discovery capabilities.
As such, a component plugged in could announce its presence, and start working backwards as well as start collecting the relevant subsets of data in a retroactive manner.
To allow scaling, the intermediate medium is likely a message bus such as ActiveMQ, AMQP, ZeroMQ, etc.
Between Cyrus IMAP 2.5 and the message bus there must be a thin application that is capable of:
Retrieving the payload of the message(s) involved if necessary,
Submitting the remainder to a message bus.
This is because:
Cyrus IMAP 2.5, at the time of this writing, does not support submitting the event notifications to a message bus directly [4],
the size of the message payload is likely to exceed the maximum size of an event notification datagram [5].
Processing of inbound messages must happen real-time or near-time, but should also be post-processed:
e-Discovery requires post-processing to sufficiently associate the message with its context, and to build an audit trail.
Archival and Backup require payload, and may also use post-processing to facilitate Restore.
Event Notifications¶
The following events trigger notifications:
/*
 * event types defined in RFC 5423 - Internet Message Store Events
 */
enum event_type {
    EVENT_CANCELLED = (0),
    /* Message Addition and Deletion */
    EVENT_MESSAGE_APPEND = (1<<0),
    EVENT_MESSAGE_EXPIRE = (1<<1),
    EVENT_MESSAGE_EXPUNGE = (1<<2),
    EVENT_MESSAGE_NEW = (1<<3),
    EVENT_MESSAGE_COPY = (1<<4), /* additional event type to notify IMAP COPY */
    EVENT_MESSAGE_MOVE = (1<<5), /* additional event type to notify IMAP MOVE */
    EVENT_QUOTA_EXCEED = (1<<6),
    EVENT_QUOTA_WITHIN = (1<<7),
    EVENT_QUOTA_CHANGE = (1<<8),
    /* Message Flags */
    EVENT_MESSAGE_READ = (1<<9),
    EVENT_MESSAGE_TRASH = (1<<10),
    EVENT_FLAGS_SET = (1<<11),
    EVENT_FLAGS_CLEAR = (1<<12),
    /* Access Accounting */
    EVENT_LOGIN = (1<<13),
    EVENT_LOGOUT = (1<<14),
    /* Mailbox Management */
    EVENT_MAILBOX_CREATE = (1<<15),
    EVENT_MAILBOX_DELETE = (1<<16),
    EVENT_MAILBOX_RENAME = (1<<17),
    EVENT_MAILBOX_SUBSCRIBE = (1<<18),
    EVENT_MAILBOX_UNSUBSCRIBE = (1<<19)
};
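Because each event type occupies its own bit, a set of event types can be held in a single integer mask. The sketch below mirrors the values of the C enum in Python; the EventType class, the CHANGELOG_MASK selection and the is_subscribed helper are illustrative assumptions, not part of Cyrus IMAP or Bonnie.

```python
from enum import IntFlag


class EventType(IntFlag):
    """Python mirror of the bit values in the C enum above."""
    CANCELLED = 0
    MESSAGE_APPEND = 1 << 0
    MESSAGE_EXPIRE = 1 << 1
    MESSAGE_EXPUNGE = 1 << 2
    MESSAGE_NEW = 1 << 3
    MESSAGE_COPY = 1 << 4
    MESSAGE_MOVE = 1 << 5
    QUOTA_EXCEED = 1 << 6
    QUOTA_WITHIN = 1 << 7
    QUOTA_CHANGE = 1 << 8
    MESSAGE_READ = 1 << 9
    MESSAGE_TRASH = 1 << 10
    FLAGS_SET = 1 << 11
    FLAGS_CLEAR = 1 << 12
    LOGIN = 1 << 13
    LOGOUT = 1 << 14
    MAILBOX_CREATE = 1 << 15
    MAILBOX_DELETE = 1 << 16
    MAILBOX_RENAME = 1 << 17
    MAILBOX_SUBSCRIBE = 1 << 18
    MAILBOX_UNSUBSCRIBE = 1 << 19


# Hypothetical selection of event types a changelog processor might care about.
CHANGELOG_MASK = (
    EventType.MAILBOX_CREATE | EventType.MAILBOX_DELETE | EventType.MAILBOX_RENAME
    | EventType.MESSAGE_APPEND | EventType.MESSAGE_COPY | EventType.MESSAGE_MOVE
)


def is_subscribed(mask, event):
    """True if the single event type is part of the mask."""
    return bool(mask & event)
```

For example, `is_subscribed(CHANGELOG_MASK, EventType.MESSAGE_APPEND)` is true, while the same check for `EventType.LOGIN` is false.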
In addition, Kolab Groupware makes available the following event notifications:
enum event_type {
    (...)
    EVENT_MAILBOX_UNSUBSCRIBE = (1<<19),
    EVENT_ACL_CHANGE = (1<<20)
};
This means the following event notifications are lacking:
METADATA change notification
It is possible to run Cyrus IMAP 2.5 notifications in a blocking fashion, allowing the (post-)processing operation(s) to complete in full before the IMAP session is allowed to continue and the modification is confirmed.
Queries and Information Distribution¶
ZeroMQ¶
Dealer <-> Broker <-> Worker Message Exchange¶
Modelled after an article about tracking worker status at http://rfc.zeromq.org/spec:14
Dealer - Broker Concerns
The dealer queues without a high-water mark and without a local swap defined. It is only after the broker is available that this queue is flushed. This could introduce a loss of notifications.
The dealer does not await confirmation in the sense that it would replay the submission if needed, such as after the dealer has been restarted. This too could introduce a loss of notifications.
The dealer certainly does not await confirmation from any worker that the notification has been submitted to for handling.
The dealer is a sub-process of the cyrus-imapd service and, should this service be restarted, does not handle such signals to preserve state.
Broker Concerns
The broker is keeping the job queue in memory for fast updates and responses.
Note
The broker component shall periodically dump the job queue and the registered worker and collector connections into a persistent storage layer which has yet to be defined.
Storage Layout and Schema¶
Logging Event Notifications¶
Logging event notifications into the storage backend (currently elasticsearch) is inspired by logstash and writes to daily rotated indexes logstash-Y-m-d using document type logs. The basic schema of an event notification contains the following attributes:
{
    "@timestamp": "2014-10-11T23:10:20.536000Z",
    "@version": 1,
    "event": "SomeEvent",
    "client_ip": "::1",
    "folder_id": "4ed7903ebd7722d12596a2e2ed57bbdf",
    "folder_uniqueid": "f83c6305-f884-440a-b93d-eff285ada1f4",
    "service": "imap",
    "session_id": "kolab.example.org-2819-1413069020-1",
    "uri": "imap://john.doe@example.org@kolab.example.org/INBOX;UIDVALIDITY=1411487701",
    "user": "john.doe@example.org",
    "user_id": "f6c10801-1dd111b2-9d31a2a8-bebbcb98"
}
The very minimal attributes required for an event notification entry are:
@timestamp: The UTC time when the event was logged
@version: Bonnie data API version
event: The Cyrus IMAP event
service: “imap”, denoting that this logstash entry represents an IMAP event notification
session_id: The Cyrus IMAP session identifier
user: The authenticated user who triggered the event
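The minimal attribute set and the daily index naming can be sketched as two small helpers. The function names are hypothetical, and reading the logstash-Y-m-d pattern as the strftime format logstash-%Y-%m-%d is an assumption.

```python
from datetime import datetime

REQUIRED = ("@timestamp", "@version", "event", "service", "session_id", "user")


def missing_attributes(entry):
    """Return the required attribute names that the entry lacks."""
    return [key for key in REQUIRED if key not in entry]


def index_name(entry):
    """Daily index for the entry; assumes Y-m-d means year, month and day of @timestamp."""
    stamp = datetime.strptime(entry["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return stamp.strftime("logstash-%Y-%m-%d")


entry = {
    "@timestamp": "2014-10-11T23:10:20.536000Z",
    "@version": 1,
    "event": "SomeEvent",
    "service": "imap",
    "session_id": "kolab.example.org-2819-1413069020-1",
    "user": "john.doe@example.org",
}
```

An output plugin could run such a check before indexing, so that incomplete entries are rejected early rather than stored.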
Depending on the event type, additional attributes containing message IDs, message headers or payload, flag names or ACLs are added. For message or mailbox based events the uri attribute is added and refers to the mailbox/folder the operation was executed on.
From the basic attributes, some relations to metadata (see Storing Metadata) are extracted and the logstash entry is extended with identifiers referring to user and folder metadata entries:
folder_uniqueid: The globally unique folder identifier of a mailbox folder from IMAP.
folder_id: Links to a folder entry representing the current state of a mailbox folder at the time the event occurred. This includes folder name, metadata and access rights.
user_id: Unique identifier (from the LDAP nsuniqueid attribute) of the user who executed the logged operation in IMAP.
Storing Metadata¶
Metadata records are used to amend log data with more complete and persistent information about rather volatile attributes, like the username and mailbox URIs issued by Cyrus IMAP 2.5 notifications. For example, the same person (jane.gi@example.org) could change email address for any number of unrelated reasons (to jane.doe@example.org), and IMAP folders can be renamed at any given time.
Users¶
Stored in objects/user with the following schema:
{
    "@timestamp": "2014-10-11T19:30:24.330029Z",
    "dn": "uid=doe,ou=People,dc=example,dc=org",
    "user": "john.doe@example.org",
    "cn": "John Doe"
}
The nsuniqueid attribute from the LDAP is used as the primary key/id of user records.
Folders¶
Stored in objects/folder with the following schema:
{
    "@timestamp": "2014-10-11T23:10:54.055272Z",
    "@version": 1,
    "acl": {
        "anyone": "lrswiptedn",
        "f6c10801-1dd111b2-9d31a2a8-bebbcb98": "lrswipkxtecdan"
    },
    "metadata": {
        "/shared/vendor/cmu/cyrus-imapd/duplicatedeliver": "false",
        "/shared/vendor/cmu/cyrus-imapd/lastupdate": "12-Oct-2014 01:10:20 +0200",
        "/shared/vendor/cmu/cyrus-imapd/partition": "default",
        "/shared/vendor/cmu/cyrus-imapd/pop3newuidl": "true",
        "/shared/vendor/cmu/cyrus-imapd/sharedseen": "false",
        "/shared/vendor/cmu/cyrus-imapd/size": "2593",
        "/shared/vendor/cmu/cyrus-imapd/uniqueid": "f83c6305-f884-440a-b93d-eff285ada1f4",
        "/shared/vendor/kolab/folder-type": "mail"
    },
    "name": "INBOX",
    "owner": "john.doe",
    "server": "kolab.example.org",
    "type": "mail",
    "uniqueid": "f83c6305-f884-440a-b93d-eff285ada1f4",
    "uri": "imap://john.doe@example.org@kolab.example.org/INBOX"
}
The primary key/id of folder records is computed as a checksum of all attributes and metadata entries considered relevant for the “state” of a folder. This means that a new folder record is created when ACLs or folder type metadata is changed.
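A state checksum of this kind can be sketched as a hash over a canonical serialization of the relevant attributes. Which attributes count as relevant and which hash is used are not stated here; the selection below and the use of MD5 (picked only because the sample identifiers are 32 hexadecimal characters long) are assumptions, as is the function name.

```python
import copy
import hashlib
import json


def folder_state_id(folder):
    """Checksum over the attributes assumed to define the folder state."""
    state = {
        "uniqueid": folder["uniqueid"],
        "name": folder["name"],
        "acl": folder["acl"],
        "folder-type": folder["metadata"].get("/shared/vendor/kolab/folder-type"),
    }
    # Sorted keys give a stable serialization regardless of insertion order.
    serialized = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


folder = {
    "name": "INBOX",
    "uniqueid": "f83c6305-f884-440a-b93d-eff285ada1f4",
    "acl": {"anyone": "lrswiptedn"},
    "metadata": {
        "/shared/vendor/cmu/cyrus-imapd/size": "2593",
        "/shared/vendor/kolab/folder-type": "mail",
    },
}

# A size change is not part of the assumed state; an ACL change is.
resized = copy.deepcopy(folder)
resized["metadata"]["/shared/vendor/cmu/cyrus-imapd/size"] = "4096"
reshared = copy.deepcopy(folder)
reshared["acl"]["anyone"] = "lrs"
```

With this selection, `resized` keeps the identifier of `folder`, while `reshared` gets a new one and would therefore result in a new folder record.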
The keys of acl entries provided by the Collector module from IMAP data are translated into static user identifiers.
Note
In order to compute the folder identifier, the complete set of folder information like metadata and acl has to be pulled from IMAP using a collector job on every single event notification. Once Cyrus IMAP supports notifications for metadata changes (#3698), this could be skipped and the folder metadata records can be updated on specific events only.
Object Relations¶
Although elasticsearch isn’t a relational database, the Bonnie storage model implies a simple object relation model between logs and metadata.
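The implied relations can be sketched as a lookup that follows the user_id and folder_id of a log entry to the matching metadata records. Plain dictionaries stand in for the objects/user and objects/folder types here; the resolve function and the user_record and folder_record keys are hypothetical.

```python
# Metadata records keyed by their primary key/id, as described above.
users = {
    "f6c10801-1dd111b2-9d31a2a8-bebbcb98": {"user": "john.doe@example.org", "cn": "John Doe"},
}
folders = {
    "4ed7903ebd7722d12596a2e2ed57bbdf": {"name": "INBOX", "type": "mail"},
}

log_entry = {
    "event": "SomeEvent",
    "user_id": "f6c10801-1dd111b2-9d31a2a8-bebbcb98",
    "folder_id": "4ed7903ebd7722d12596a2e2ed57bbdf",
}


def resolve(entry, users, folders):
    """Return a copy of the log entry with the referenced records attached."""
    resolved = dict(entry)
    resolved["user_record"] = users.get(entry.get("user_id"))
    resolved["folder_record"] = folders.get(entry.get("folder_id"))
    return resolved


resolved = resolve(log_entry, users, folders)
```

Entries without a folder_id, such as those for login events, simply resolve to no folder record.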
Accessing the Collected Data¶
Some of the collected data, primarily changelogs of groupware objects, shall be made available to Kolab clients to display the history of a certain object, or creation and last-modification information including the corresponding usernames, which are not stored in the Kolab data format itself.
A dedicated web service provides access to the archived data through an API and thereby translates the raw information from the storage backend into more concrete groupware object related data.
See the Bonnie Client API for details.
Footnotes