Agenda

Tentative Program

June 24, 2016

8:30 am:  Keynote: Roger Barga, Amazon Web Services, US
Living on the Edge – Stream Data Processing at Scale 

1st Session: Spatial, Temporal and Smartcity Applications

9:45: Chang-Tien Lu, Virginia Tech
Social Media Mining using Real-time Twitter Stream

10:15: Vana Kalogeraki, Athens University of Economics and Business
Real-time reliable crowdsourcing for mobile human-center systems

10:45: Leonardo Querzoni, Sapienza Roma
Efficient Notification Ordering for Geo-Distributed Pub/Sub Systems

11:15 Coffee Break

2nd Session: Novel Database Applications
11:30: Chinya V. Ravishankar, UC Riverside
Inferring Insertion Times in Bloom Filters

12:00: Carlo Zaniolo, UCLA
Reengineering Negation and Aggregates in Data Stream Query Languages for Higher Expressivity

12:30: Yannis Papakonstantinou, UCSD
Streaming and View Maintenance for Semistructured, NoSQL databases

1:00-2:00 lunch

1st Session (cont’d)

2:00: Avi Gal, Technion, and Dimitris Gunopulos, University of Athens
Real-time heterogeneous stream mining for traffic management.

2nd Session (cont’d)

2:30: Gautam Das, UT Arlington
Aggregate Tracking over Dynamic Deep Web Databases

Talks Info

Reengineering Negation and Aggregates in Data Stream
Query Languages for Higher Expressivity.

Speaker: Carlo Zaniolo, CSD/UCLA

As the number and sophistication of big data stream applications
grow, so does the importance of continuous queries which, owing
to their declarative semantics, can achieve scalability through
parallel execution over multiple platforms. However, the fact that
blocking queries are disallowed severely limits their expressive
power. In fact, we know that non-monotonic operators are blocking,
and negation and aggregates are non-monotonic in traditional DB queries.
Fortunately, we can show that, for continuous queries
on data streams, the non-monotonicity curse can often be avoided
by (i) relaxing the closed-world assumption used in negation
and (ii) replacing the traditional aggregates with their monotonic
counterparts. Several continuous SQL and Datalog queries
made possible by these extensions will be given as examples.

 

Spatiotemporal Event Forecasting in Social Media

Speaker: Chang-Tien Lu, Virginia Tech

Social media has become a popular data source as a surrogate for monitoring and detecting events. Analyzing social media (e.g., tweets) to reveal event information requires sophisticated techniques. Tweets are written in unstructured language and often contain typos, non-standard acronyms, and spam. In addition to the textual content, Twitter data form a heterogeneous information network where users, tweets, and hashtags have mutual relationships. These features pose technical challenges for designing event detection and forecasting methods. In this talk, I will present the design and implementation of EMBERS, a fully automated 24×7 forecasting system for significant societal events using open source data including tweets, blog posts, and news articles. I will describe the system architecture of EMBERS, individual models that leverage specific data sources, and a fusion engine that supports trading off specific evaluation criteria. I will also demonstrate the superiority of EMBERS over base rate methods and its capability to forecast significant societal happenings.

 

Efficient Notification Ordering for Geo-Distributed Pub/Sub Systems

Speaker: Leonardo Querzoni, Sapienza Roma

A distributed event notification service is at the core of many modern messaging infrastructures providing applications with scalable and robust publish/subscribe communication primitives. Such services can route events toward subscribers using multiple paths with different lengths and latencies. As a consequence, subscribers can receive events out of order, and this may have a negative impact on any application that relies on consistent delivery ordering.
In this talk I will present a novel solution for ordered notifications on top of an existing distributed topic-based distributed notification service. This solutions guarantees that each pair of events published in the system will be notified in the same order to all their target subscribers independently from the topics they are published in. During the presentation I’ll show how our solution produces timestamps whose size is dynamically adjusted to accommodate changing subscriptions in the system. Furthermore I’ll provide hints on how timestamps can be slightly modified to also guarantee causal consistency among different topics.

 

Aggregate Tracking over Dynamic Deep Web Databases

Speaker: Gautam Das, UT Arlington

Abstract: Many Deep Web databases are “hidden” behind (i.e., accessible only through) their restrictive, form-like, search interfaces. Recent studies have shown that even though full data access is not available to external clients, it is possible to approximately estimate aggregate query answers over such hidden web databases by random sampling approaches – based on issuing a small number of carefully designed search queries through the restrictive web interface.

A problem with these existing works, however, is that they all assume the underlying database to be static, while most real-world web databases (e.g., Amazon, eBay) are frequently updated. In this talk we describe our recent  investigations on the problem of estimating/tracking aggregates over dynamic hidden web databases. Theoretical analysis and extensive real-world experiments demonstrate the effectiveness of our proposed algorithms and their superiority over baseline solutions (e.g., the repeated execution of algorithms designed for static web databases).

Streaming and View Maintenance for Semistructured, NoSQL databases

Speaker: Yannis Papakonstantinou, on joint work with Yannis Katsis

Relational database R&D has produced multiple solutions to two related problems:
Given a materialized view and a stream of changes (inserts, deletes, updates)
on the base tables of the view, output the stream of view changes that most
efficiently change the materialized view to its new state (that reflects the changes on
the base tables). In the streaming variation of the problem, the algorithm produces
the output stream without requiring that the view is actually materialized.
We show how to extend streaming and view maintenance into JSON databases,
such as the NoSQL databases and the SQL/JSON extensions.
We build our work on the SQL++ query language, which is backwards-compatible
with SQL, while it also captures the querying abilities of NoSQL databases that
use the JSON model.
We show the extension of streaming and view maintenance algorithms to SQL++.
In particular, we utilize provenance-based algorithms (introduced in SIGMOD 2015
for the case of plain SQL) that make it especially easy to associate inputs
and outputs and hence turn changes in the input to changes in the output.
The algorithms pose implementation requirements on the access methods of NoSQL,
which we discuss along with providing a flexible framework of streaming and view
maintenance, where algorithm modification can be introduced in the interest of
best fitting the access method of a particular system.
Interestingly, the study of streaming and view maintenance in the context of
semistructured data, enlightens even the plain SQL case. For example, it shows
what aspects of schema knowledge are truly needed and what aspects of past
algorithms have been tuned to row-based SQL databases, without a clear
statement of the dependencies to the storage and access model.

 

 

Inferring Insertion Times in Bloom Filters

Speaker: Chinya Ravishankar

Bloom filters are extremely useful in dealing with streaming data, but are easy to misuse. We first illustrate pitfalls in using Bloom Filters, and show how to generalize our techniques to infer insertion times. We introduce inferential filters, which integrate Bayesian priors and information latent in filters to make optimal query-specific decisions.

We illustrate our approach on Time-Decaying Bloom Filters, which are probabilistic structures to test membership on recently inserted items. As new items are inserted, memory of older items decays. Wrong query responses penalize the application using the filter. Current filters may only be tuned to static penalties, and ignore Bayesian priors and much information latent in the filter. Our methods are general, but here we focus on inferential time-decaying filters, and show how to support novel query types and sliding window queries with varying error penalties.
We have developed inferential versions of the existing Timing Bloom Filter and Generalized Bloom Filter. Our experiments on real and synthetic datasets show that when penalties are dynamic and prior probabilities are known, these filters reduce penalties for incorrect responses to sliding-window queries by up to 70%.