Scraping Web Pages with Retrofit – jspoon Library
Have you ever worked with JSON converters like GSON or Moshi? They are extremely useful when working with data from the internet. I have tried to provide a similar mechanism for web scraping and take HTML parsing to the next level.
Story
Before I started working at Droids On Roids, I was a freelancer, often creating Android versions of popular Polish sites. As these apps were unofficial, I had no API to work with, so everything was based on web scraping. I started by operating on raw HTML text: I searched for specific strings and added offsets to reach the desired value.
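To give an idea of what that raw-string approach involves, here is a minimal sketch. The `between` helper and the sample markup are invented for this illustration; they are not taken from any real app or from jspoon.

```java
// Hypothetical helper illustrating raw-string scraping; not part of jspoon.
final class RawHtmlScrape {

    // Returns the trimmed text between the first `startMarker` and the next `endMarker`,
    // or null when either marker cannot be found.
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length(); // the offset that skips past the marker itself
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String html = "<div class=\"post\"><h2> Hello jspoon </h2></div>";
        System.out.println(between(html, "<h2>", "</h2>")); // prints "Hello jspoon"
    }
}
```

This works only as long as the markup stays exactly the same. An extra attribute on the `<h2>` tag is enough to make the lookup return null, which is why a real parser is a better fit.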
Then I found jsoup, which made HTML parsing much more comfortable. More recently, I worked on commercial projects with JSON APIs. With Retrofit and the GSON/Moshi converters, it was effortless to create POJOs from internet content. I thought “it would be great to have a similar mechanism for HTML” – so here it is:
jspoon with Retrofit converter!
jspoon is a library which uses annotations with CSS selectors to create Java POJOs. It uses jsoup as an HTML parser and caches reflection data for better performance. It is also Java 7 compatible, so it works on Android too.
You can check the details on GitHub.
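To illustrate the general idea of annotation-driven mapping with cached reflection, here is a small sketch using only the standard library. It is not jspoon's actual code: the `@Pick` annotation, the `map` method and the `Map` that stands in for a parsed document are all hypothetical, and only `String` fields are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class AnnotationMappingSketch {

    // Hypothetical stand-in for a selector annotation; the value is just a lookup key here.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Pick {
        String value();
    }

    static class Post {
        @Pick(".title") public String title;
        @Pick(".excerpt") public String excerpt;
        public String ignored; // not annotated, so the mapper leaves it untouched
    }

    // Annotated fields per class, so each type is scanned with reflection only once.
    private static final Map<Class<?>, List<Field>> FIELD_CACHE = new ConcurrentHashMap<>();

    private static List<Field> annotatedFields(Class<?> type) {
        return FIELD_CACHE.computeIfAbsent(type, t -> {
            List<Field> fields = new ArrayList<>();
            for (Field field : t.getDeclaredFields()) {
                if (field.isAnnotationPresent(Pick.class)) {
                    field.setAccessible(true);
                    fields.add(field);
                }
            }
            return fields;
        });
    }

    // `document` maps a selector string to the text it would match;
    // a real HTML parser would perform that lookup against the page.
    static <T> T map(Class<T> type, Map<String, String> document) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            for (Field field : annotatedFields(type)) {
                String selector = field.getAnnotation(Pick.class).value();
                field.set(instance, document.get(selector));
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> document = new HashMap<>();
        document.put(".title", "Hello");
        document.put(".excerpt", "World");
        Post post = map(Post.class, document);
        System.out.println(post.title + " / " + post.excerpt); // prints "Hello / World"
    }
}
```

The cache matters when many objects of the same type are created, for example one per post on a page: the class is scanned for annotated fields once, and later calls reuse the result.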
jspoon can be used when you don’t have access to an API, for example if it isn’t ready yet. Another possible case is when the web page is yours but you don’t have full access to the database (or you are just lazy), as you can skip the API and just scrape the page. Moreover, when you are dealing with third-party web pages in your app and you need some data from them, like meta tags, this library is for you.
In this post, I’m going to parse the Droids On Roids /blog page using jspoon, Retrofit, and RxJava2.
Installation
We will need the following dependencies (using Gradle; on Gradle 7 or newer, write `implementation` in place of `compile`, which has been removed):
dependencies {
    compile 'pl.droidsonroids:jspoon:1.0.0'
    compile 'pl.droidsonroids.retrofit2:converter-jspoon:1.0.0'
    compile 'com.squareup.retrofit2:retrofit:2.3.0'
    compile 'com.squareup.retrofit2:adapter-rxjava2:2.3.0'
}
Setting up
First of all, our page needs a mapping from HTML to a Java POJO class. This is what jspoon does. We create the BlogPage class with a list of Post objects:
public class BlogPage {
    @Selector(".post") public List<Post> posts;
}

public class Post {
    @Selector(".post-content > h2 > a") public String title;
    @Selector(".excerpt") public String excerpt;
    @Selector(value = ".post-featured-image > a > img", attr = "data-lazy-src") public String imageUrl;
    @Selector(".post-category > a") public List<String> tags;
}
Then, we need to configure Retrofit, following the instructions on the Retrofit website. We create an API interface:
public interface BlogService {
    @GET("blog")
    Single<BlogPage> getBlogPage(@Query("page") int page);
}
We also write methods for building the BlogService instance used for the calls. At this point, we add the jspoon converter and the RxJava2 call adapter:
private static Retrofit createRetrofit() {
    return new Retrofit.Builder()
            .baseUrl("https://www.thedroidsonroids.com/")
            .addConverterFactory(JspoonConverterFactory.create())
            .addCallAdapterFactory(RxJava2CallAdapterFactory.create())
            .build();
}

private static BlogService createBlogService() {
    return createRetrofit()
            .create(BlogService.class);
}
Let’s scrape!
That’s all! Everything is set up and we are ready to start scraping:
public class Example {

    public static void main(String[] args) {
        BlogService blogService = createBlogService();
        blogService.getBlogPage(1)
                .subscribe(Example::printBlogPage);
    }

    private static void printBlogPage(BlogPage blogPage) {
        blogPage.posts.forEach(Example::printPost);
    }

    private static void printPost(Post post) {
        System.out.println(post.title);
        System.out.println(post.excerpt);
        System.out.println(post.imageUrl);
        System.out.println(String.join(", ", post.tags));
        System.out.println();
    }

    //.....
}
Et voilà! We get the posts in our console:
How to Connect Physical Devices to Bitrise.io
Bitrise.io is a cloud CI/CD service. It can build, test and deploy your apps. In this article, we will focus on testing, namely on Android apps UI and instrumented unit tests.
https://www.thedroidsonroids.com/wp-content/uploads/2017/07/tuscany-grape-field-nature-51947-360x240.jpeg
Android, Blog

How to Create a Measuring App With ARKit In iOS 11
Learn how to create a basic measuring app using ARKit – a new framework in iOS 11 announced at WWDC 2017. Let’s develop a simple demo app!
https://www.thedroidsonroids.com/wp-content/uploads/2017/07/miarka-3-360x240.jpg
Blog, iOS

------13 more------
You can check the full source of this example in Java and Kotlin here.
Conclusion
Scraping HTML will never beat a professional JSON API, but I think that jspoon can make it much simpler and closer to modern JSON parsing. If you find any bugs or missing functionality, feel free to contribute on GitHub.
Hello, I modified the function for Android, but I have problems compiling it. My function looks like this. Can you help me, please?
public void createRetrofit(){
Retrofit retrofit = new Retrofit.Builder()
.baseUrl("https://www.thedroidsonroids.com/")
.addConverterFactory(JspoonConverterFactory.create())
.addCallAdapterFactory(RxJava2CallAdapterFactory.create())
.build();
retrofit.create(BlogService.class)
.getBlogPage(1)
.subscribe ( blog -> cicle( blog.posts ) );
}
private void cicle(List post){
for (int i = 0; i < post.size(); i++)
{
Log.d("Title", post.get(i).title);
}
}